freebsd-nq

Author	SHA1	Message	Date
Bjoern A. Zeeb	ae69ad884d	After inpcb route caching was put back in place there is no need for flowtable anymore (as flowtable was never considered to be useful in the forwarding path). Reviewed by: np Differential Revision: https://reviews.freebsd.org/D11448	2017-07-27 13:03:36 +00:00
Ed Maste	1edbb54fe9	cc_cubic: restore braces around if-condition block r307901 was reverted in r321480, restoring an incorrect block delimitation bug present in the original cc_cubic commit. Restore only the bugfix (brace addition) from r307901. CID: 1090182 Approved by: sbruno	2017-07-26 21:23:09 +00:00
Sean Bruno	43053c125a	Revert r307901 - Inform CC modules about loss events. This was discussed between various transport@ members and it was requested to be reverted and discussed. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> Reported by: lawrence Reviewed by: hiren Sponsored by: Limelight Networks	2017-07-25 15:08:52 +00:00
Sean Bruno	5d53981a18	Revert r308180 - Set slow start threshold more accurately on loss ... This was discussed between various transport@ members and it was requested to be reverted and discussed. Submitted by: kevin Reported by: lawrence Reviewed by: hiren	2017-07-25 15:03:05 +00:00
Michael Tuexen	e5a9c519bc	Remove duplicate statement.	2017-07-25 11:05:53 +00:00
Michael Tuexen	9dd6ca9602	Deal with listening socket correctly.	2017-07-20 14:50:13 +00:00
Michael Tuexen	bbc9dfbc08	Fix the explicit EOR mode. If the final message is not complete, send an ABORT. Joint work with rrs@ MFC after: 1 week	2017-07-20 11:09:33 +00:00
Michael Tuexen	1f76872c36	Avoid shadowed variables. MFC after: 1 week	2017-07-19 15:12:23 +00:00
Michael Tuexen	5ba7f91f9d	Use memset/memcpy instead of bzero/bcopy. Just use one variant instead of both. Use the memset/memcpy ones since they cause fewer problems in cross-platform deployment. MFC after: 1 week	2017-07-19 14:28:58 +00:00
Michael Tuexen	28cd0699b6	Fix the accounting and add code to detect errors in accounting. Joint work with rrs@ MFC after: 1 week	2017-07-19 12:27:40 +00:00
Michael Tuexen	d32ed2c735	Fix the handling of Explicit EOR mode. While there, appropriately handle the overhead depending on the usage of DATA or I-DATA chunks. Take the overhead only into account, when required. Joint work with rrs@ MFC after: 1 week	2017-07-15 19:54:03 +00:00
Konstantin Belousov	5cead59181	Correct sysent flags for dynamically loaded syscalls. Using the https://github.com/google/capsicum-test/ suite, the PosixMqueue.CapModeForked test was failing due to an ECAPMODE after calling kmq_notify(). On further inspection, the dynamically loaded syscall entry was initialized with sy_flags zeroed out, since SYSCALL_INIT_HELPER() left sysent.sy_flags with the default value. Add a new helper SYSCALL{,32}_INIT_HELPER_F() which takes an additional argument to specify the sy_flags value. Submitted by: Siva Mahadevan <smahadevan@freebsdfoundation.org> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D11576	2017-07-14 09:34:44 +00:00
Jonathan T. Looney	cb503ae22d	Don't overpromote values when calculating len in tcp_output(). sbavail() returns u_int and sendwin is a uint32_t. Therefore, min() (which operates on two u_int values) is able to correctly calculate the minimum of these two arguments. Reported by: rrs MFC after: 1 week Sponsored by: Netflix	2017-07-05 16:10:30 +00:00
Michael Tuexen	1698cbd919	Move to open state after plausibility checks. When doing this too early, the MIB counters go wrong. MFC after: 1 week	2017-07-04 18:24:50 +00:00
Michael Tuexen	afffa1a9ad	Don't hold a refcount on an stcb when it is not needed. This improves the consistency with other parts of the code.	2017-07-04 18:04:44 +00:00
Sean Bruno	ac952dd274	Add a sysctl to toggle the use of the sockets LOWAT when calculating auto window growth Submitted by: j@nitrology.com (Jason Wolfe) Reviewed by: gnn hiren Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11016	2017-07-03 19:39:58 +00:00
Michael Tuexen	f4358911bf	Handle sctp_get_next_param() in a consistent way. This addresses an issue found by Felix Weinrank using libfuzz. While there, also use consistent naming. MFC after: 3 days	2017-06-23 21:01:57 +00:00
Michael Tuexen	d44b45df2c	Check the length of a COOKIE chunk before accessing fields in it. Thanks to Felix Weinrank for reporting the issue he found by using libFuzzer. MFC after: 3 days	2017-06-23 10:09:49 +00:00
Michael Tuexen	1a7abbb3be	Use a longer buffer for messages in ERROR chunks. This allows them to be sent in a non-truncated way and addresses a warning given by newer versions of gcc. Thanks to Anselm Jonas Scholl for reporting it and providing a patch.	2017-06-23 09:27:31 +00:00
Michael Tuexen	94f66d603a	Honor the backlog field.	2017-06-23 08:35:54 +00:00
Michael Tuexen	3017b21bb6	Improve compilation on platforms different from FreeBSD.	2017-06-23 08:34:01 +00:00
Gleb Smirnoff	779f106aa1	Listening sockets improvements. o Separate fields of struct socket that belong to listening from fields that belong to normal dataflow, and unionize them. This shrinks the structure a bit. - Take out selinfo's from the socket buffers into the socket. The first reason is to support braindamaged scenario when a socket is added to kevent(2) and then listen(2) is cast on it. The second reason is that there is future plan to make socket buffers pluggable, so that for a dataflow socket a socket buffer can be changed, and in this case we also want to keep same selinfos through the lifetime of a socket. - Remove struct so_accf. Since now listening stuff no longer affects struct socket size, just move its fields into listening part of the union. - Provide sol_upcall field and enforce that so_upcall_set() may be called only on a dataflow socket, which has buffers, and for listening sockets provide solisten_upcall_set(). o Remove ACCEPT_LOCK() global. - Add a mutex to socket, to be used instead of socket buffer lock to lock fields of struct socket that don't belong to a socket buffer. - Allow to acquire two socket locks, but the first one must belong to a listening socket. - Make soref()/sorele() to use atomic(9). This allows in some situations to do soref() without owning socket lock. There is place for improvement here, it is possible to make sorele() also to lock optionally. - Most protocols aren't touched by this change, except UNIX local sockets. See below for more information. o Reduce copy-and-paste in kernel modules that accept connections from listening sockets: provide function solisten_dequeue(), and use it in the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4), infiniband, rpc. o UNIX local sockets. - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX local sockets. Most races exist around spawning a new socket, when we are connecting to a local listening socket. To cover them, we need to hold locks on both PCBs when spawning a third one. This means holding them across sonewconn(). This creates a LOR between pcb locks and unp_list_lock. - To fix the new LOR, abandon the global unp_list_lock in favor of global unp_link_lock. Indeed, separating these two locks didn't provide us any extra parallelism in the UNIX sockets. - Now call into uipc_attach() may happen with unp_link_lock held if we are accepting, or without unp_link_lock if we are just creating a socket. - Another problem in UNIX sockets is that uipc_close() basically did nothing for a listening socket. The vnode remained opened for connections. This is fixed by removing vnode in uipc_close(). Maybe the right way would be to do it for all sockets (not only listening), simply move the vnode teardown from uipc_detach() to uipc_close()? Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9770	2017-06-08 21:30:34 +00:00
Jonathan T. Looney	dc6a41b936	Add the infrastructure to support loading multiple versions of TCP stack modules. It adds support for mangling symbols exported by a module by prepending a string to them. (This avoids overlapping symbols in the kernel linker.) It allows the use of a macro as the module name in the DECLARE_MACRO() and MACRO_VERSION() macros. It allows the code to register stack aliases (e.g. both a generic name ["default"] and version-specific name ["default_10_3p1"]). With these changes, it is trivial to compile TCP stack modules with the name defined in the Makefile and to load multiple versions of the same stack simultaneously. This functionality can be used to enable side-by-side testing of an old and new version of the same TCP stack. It also could support upgrading the TCP stack without a reboot. Reviewed by: gnn, sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D11086	2017-06-08 20:41:28 +00:00
Gleb Smirnoff	3acfe1e1b0	This code was missing socket unlock and socket buffer lock, but it worked since right now these two locks are the same.	2017-06-08 06:37:11 +00:00
Gleb Smirnoff	12d8a8e7a3	The desired lock here is socket buffer, not socket. Right now they match, but won't in future.	2017-06-08 06:34:09 +00:00
Michael Tuexen	8cb5a8e90a	Fix the ICMP6 handling for TCP. The ICMP6 packets might not be contained in a single mbuf. So don't assume this. Keep the IPv4 and IPv6 code in sync and make explicit that the syncache code only need the TCP sequence number, not the complete TCP header. MFC after: 3 days Sponsored by: Netflix, Inc.	2017-06-03 21:53:58 +00:00
Michael Tuexen	98732609d5	Improve comments to describe what the code does. Reported by: jtl Sponsored by: Netflix, Inc.	2017-06-01 15:11:18 +00:00
Jonathan T. Looney	382a6bbcf1	Enforce the limit on ICMP messages before doing work to formulate the response. Delete an unneeded rate limit for UDP under IPv6. Because ICMP6 messages have their own rate limit, it is unnecessary to apply a second rate limit to UDP messages. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D10387	2017-05-30 14:32:44 +00:00
Michael Tuexen	5d08768a2b	Use the SCTP_PCB_FLAGS_ACCEPTING flags to check for listeners. While there, use a macro for checking the listen state to allow for easier changes if required. This is done to help glebius@ with his listen changes.	2017-05-26 16:29:00 +00:00
Gleb Smirnoff	6a6cefac3d	o Rearrange struct inpcb fields to optimize the TCP output code path considering cache line hits and misses. Put the lock and hash list glue into the first cache line, put inp_refcount inp_flags inp_socket into the second cache line. o On allocation zero out entire structure except the lock and list entries, including inp_route inp_lle inp_gencnt. When inp_route and inp_lle were introduced, they were added below inp_zero_size, resulting in them not being cleared after free/alloc. This definitely was a source of bugs with route caching. Could be that r315956 has just fixed one of them. The inp_gencnt is reinitialized on every alloc, so it is safe to clear it. This has been proved to improve TCP performance at Netflix. Obtained from: rrs Differential Revision: D10686	2017-05-24 17:47:16 +00:00
Michael Tuexen	5dba6ada91	The connect() system call should return -1 and set errno to EAFNOSUPPORT if it is called on a TCP socket * with an IPv6 address and the socket is bound to an IPv4-mapped IPv6 address. * with an IPv4-mapped IPv6 address and the socket is bound to an IPv6 address. Thanks to Jonathan T. Leighton for reporting this issue. Reviewed by: bz gnn MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D9163	2017-05-22 15:29:10 +00:00
Andrey V. Elsukov	38cc96a887	Set M_BCAST and M_MCAST flags on mbuf sent via divert socket. r290383 has changed how mbufs sent by divert socket are handled. Previously they were always handled by slow path processing in ip_input(). Now ip_tryforward() is invoked from ip_input() before the in_broadcast() check. Since a diverted packet loses all mbuf flags, it passes the broadcast check in ip_tryforward() due to the missing M_BCAST flag. As a result, the broadcast packet is forwarded to the wire instead of being consumed by the network stack. Add an in_broadcast() check to the div_output() function and restore the M_BCAST flag if the destination address is broadcast for the given network interface. PR: 209491 MFC after: 1 week	2017-05-17 09:04:09 +00:00
Ed Maste	3e85b721d6	Remove register keyword from sys/ and ANSIfy prototypes A long long time ago the register keyword told the compiler to store the corresponding variable in a CPU register, but it is not relevant for any compiler used in the FreeBSD world today. ANSIfy related prototypes while here. Reviewed by: cem, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10193	2017-05-17 00:34:34 +00:00
Gleb Smirnoff	cc487c1697	Reduce in_pcbinfo_init() by two params. No users supply any flags to this function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away. The zone_fini parameter also goes away. Previously no protocols (except divert) supplied zone_fini function, so inpcb locks were leaked with slabs. This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now this is a leak. Fix that by supplying inpcb_fini() function as fini method for all inpcb zones.	2017-05-15 21:58:36 +00:00
Enji Cooper	bd7459366e	Add missing braces around MCAST_EXCLUDE check when KTR support is compiled into the kernel This ensures that .iss_asm (the number of ASM listeners) isn't incorrectly decremented for MLD-layer source datagrams when inspecting im*s_st[1] (the second state in the structure). MFC after: 2 months PR: 217509 [1] Reported by: Coverity (Isilon) Reviewed by: ae ("This patch looks correct to me." [1]) Submitted by: Miles Ohlrich <miles.ohlrich@isilon.com> Sponsored by: Dell EMC Isilon	2017-05-13 18:41:24 +00:00
Gleb Smirnoff	7637c57ee1	There is no good reason for TCP reassembly zone to be UMA_ZONE_NOFREE. It has a strong locking model and doesn't have any timers associated with entries. The entries themselves are referenced only from the tcpcb zone, which itself is a normal zone, without the UMA_ZONE_NOFREE flag.	2017-05-10 23:32:31 +00:00
Eugene Grosbein	1a356b8b90	ipfw nat and natd support multiple aliasing instances with "nat global" feature that chooses right alias_address for outgoing packets that already have corresponding state in one of aliasing instances. This feature works just fine for ICMP, UDP, TCP and SCTP packets but not for others. For example, outgoing PPtP/GRE packets always get alias_address of latest configured instance no matter whether such packets have corresponding state or not. This change unbreaks translation of transit PPtP/GRE connections for "nat global" case fixing a bug in static ProtoAliasOut() function that ignores its "create" argument and performs translation regardless of its value. This static function is called only by LibAliasOutLocked() function and only for packets other than ICMP, UDP, TCP and SCTP. LibAliasOutLocked() passes its "create" argument unmodified. We have only two consumers of LibAliasOutLocked() in the source tree calling it with "create" unequal to 1: "ipfw nat global" code and similar natd code having same problem. All other consumers of LibAliasOutLocked() call it with create = 1 and the patch is "no-op" for such cases. PR: 218968 Approved by: ae, vsevolod (mentor) MFC after: 1 week	2017-05-10 19:41:52 +00:00
Michael Tuexen	10e0318afa	Allow SCTP to use the hostcache. This patch allows the MTU stored in the hostcache to be used as an initial value for SCTP paths. When an ICMP PTB message is received, store the MTU in the hostcache. MFC after: 1 week	2017-04-29 19:20:50 +00:00
Michael Tuexen	4f43a14a85	Don't set the DF-bit on timer based retransmissions. MFC after: 1 week	2017-04-29 09:57:27 +00:00
Michael Tuexen	b6ecf43450	Set the DF bit for responses to out-of-the-blue packets. MFC after: 1 week	2017-04-28 15:38:34 +00:00
Michael Tuexen	d274bcc661	Fix an issue with MTU calculation if an ICMP message is received for an SCTP/UDP packet. MFC after: 1 week	2017-04-26 20:21:05 +00:00
Michael Tuexen	6ebfa5ee14	Use consistently uint32_t for mtu values. This does not change functionality, but this cleanup is needed for further improvements of ICMP handling. MFC after: 1 week	2017-04-26 19:26:40 +00:00
Michael Tuexen	ebfd753408	When a SYN-ACK is received in SYN-SENT state, RFC 793 requires the validation of SEG.ACK as the first step. If the ACK is not acceptable, a RST segment should be sent and the segment should be dropped. Up to now, the segment was partially processed. This patch moves the check for the SEG.ACK validation up to the front as required. Reviewed by: hiren, gnn MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D10424	2017-04-26 06:20:58 +00:00
Navdeep Parhar	f8acc03ef1	Flush the LRO ctrl as soon as lro_mbufs fills up. There is no need to wait for the next enqueue from the driver. Reviewed by: gnn@, hselasky@, gallatin@ MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D10432	2017-04-24 22:35:00 +00:00
Navdeep Parhar	ea9a92f112	Frames that are not considered for LRO should not be counted in LRO statistics. Reviewed by: gnn@, hselasky@, gallatin@ MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D10430	2017-04-24 22:31:56 +00:00
Brooks Davis	a7dc31283a	Remove the NATM framework including the en(4), fatm(4), hatm(4), and patm(4) devices. Maintaining an address family and framework has real costs when we make infrastructure improvements. In the case of NATM we support no devices manufactured in the last 20 years and some will not even work in modern motherboards (some newer devices that patm(4) could be updated to support apparently exist, but we do not currently have support). With this change, support remains for some netgraph modules that don't require NATM support code. It is unclear if all these should remain, though ng_atmllc certainly stands alone. Note well: FreeBSD 11 supports NATM and will continue to do so until at least September 30, 2021. Improvements to the code in FreeBSD 11 are certainly welcome. Reviewed by: philip Approved by: harti	2017-04-24 21:21:49 +00:00
Michael Tuexen	75e7a91649	Represent "a syncache overflow hasn't happened yet" by using -(SYNCOOKIE_LIFETIME + 1) instead of INT64_MIN, since it is good enough and works when time_t is int32 or int64. This fixes the issue reported by cy@ on i386. Reported by: cy MFC after: 1 week Sponsored by: Netflix, Inc.	2017-04-21 06:05:34 +00:00
Michael Tuexen	190d9abce7	Syncookies can be used in combination with the syncache. If the cache overflows, syncookies are used. This patch restricts the usage of syncookies in this case: accept syncookies only if there was an overflow of the syncache recently. This mitigates a problem reported in PR217637, where a syncookie was accepted without any recent drops. Thanks to glebius@ for suggesting an improvement. PR: 217637 Reviewed by: gnn, glebius MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D10272	2017-04-20 19:19:33 +00:00
Navdeep Parhar	b0ca71f0a0	Free lro_hash unconditionally, just like lro_mbuf_data a few lines later. Fix whitespace nit while here.	2017-04-19 23:06:07 +00:00
Navdeep Parhar	a3927369fa	Do not leak lro_hash on failure to allocate lro_mbuf_data. MFC after: 1 week	2017-04-19 22:27:26 +00:00
Navdeep Parhar	3d24e03800	Remove redundant assignment.	2017-04-19 22:20:41 +00:00
Andrey V. Elsukov	c33a231337	Rework r316770 to make it protocol independent and general, like we do for streaming sockets. And do more cleanup in the sbappendaddr_locked_internal() to prevent leaking information from an existing mbuf to one that may be created later by netgraph. Suggested by: glebius Tested by: Irina Liakh <spell at itl ua> MFC after: 1 week	2017-04-14 09:00:48 +00:00
Andrey V. Elsukov	8428914909	Clear h/w csum flags on mbuf handled by UDP. When checksums of received IP and UDP header already checked, UDP uses sbappendaddr_locked() to pass received data to the socket. sbappendaddr_locked() uses given mbuf as is, and if NIC supports checksum offloading, mbuf contains csum_data and csum_flags that were calculated for already stripped headers. Some NICs support only limited checksums offloading and do not use CSUM_PSEUDO_HDR flag, and csum_data contains some value that UDP/TCP should use for pseudo header checksum calculation. When L2TP is used for tunneling with mpd5, ng_ksocket receives mbuf with filled csum_flags and csum_data, that were calculated for outer headers. When L2TP header is stripped, a packet that was tunneled goes to the IP layer and due to presence of csum_flags (without CSUM_PSEUDO_HDR) and csum_data, the UDP/TCP checksum check fails for this packet. Reported by: Irina Liakh <spell at itl ua> Tested by: Irina Liakh <spell at itl ua> MFC after: 1 week	2017-04-13 17:03:57 +00:00
Michael Tuexen	013f4df643	The sysctl variable net.inet.tcp.drop_synfin is not honored in all states, for example not in SYN-SENT. This patch adds code to check the sysctl variable in other states than LISTEN. Thanks to ae and gnn for providing comments. Reviewed by: gnn MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D9894	2017-04-12 20:27:15 +00:00
Andrey V. Elsukov	7faa0d213b	Make sysctl identifiers for direct netisr queue unique. Introduce IPCTL_INTRDQMAXLEN and IPCTL_INTRDQDROPS macros for this purpose. Reviewed by: gnn MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10358	2017-04-11 19:20:20 +00:00
Steven Hartland	e44c1887fd	Use estimated RTT for receive buffer auto resizing instead of timestamps Switched from using timestamps to RTT estimates when performing TCP receive buffer auto resizing, as not all hosts support / enable TCP timestamps. Disabled reset of receive buffer auto scaling when not in bulk receive mode, which gives an extra 20% performance increase. Also extracted auto resizing to a common method shared between standard and fastpath modules. With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from ~3MB/s to ~100MB/s using the default settings. Reviewed by: lstewart, gnn MFC after: 2 weeks Relnotes: Yes Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D9668	2017-04-10 08:19:35 +00:00
Ryan Stone	4af540d197	Revert the optimization from r304436 r304436 attempted to optimize the handling of incoming UDP packet by only making an expensive call to in_broadcast() if the mbuf was marked as a broadcast packet. Unfortunately, this cannot work in the case of point-to-point L2 protocols like PPP, which have no notion of "broadcast". The optimization has been disabled for several months now with no progress towards fixing it, so it needs to go.	2017-04-05 16:57:13 +00:00
Andrey V. Elsukov	11c56650f0	Add O_EXTERNAL_DATA opcode support. This opcode can be used to attach some data to an external action opcode. And unlike the O_EXTERNAL_INSTANCE opcode, this opcode does not require creating a named instance to pass configuration arguments to the external action handler. The data comes immediately after the O_EXTERNAL_ACTION opcode. The userlevel part currently supports formatting for opcodes with ipfw_insn size; by default it expects a u16 numeric value in arg1. Obtained from: Yandex LLC MFC after: 2 weeks Sponsored by: Yandex LLC	2017-04-03 02:44:40 +00:00
Steven Hartland	6ebc1b7b7d	Allow explicitly assigned IPv4 loopback address to be used in jails If a jail has an explicitly assigned loopback address then allow it to be used instead of remapping requests for the loopback address to the first IPv4 address assigned to the jail. This fixes issues where applications attempt to detect their bound port where they requested a loopback address, which was available, but instead the kernel remapped it to the jail's first address. An example of this is binding nginx to 127.0.0.1 and then running "service nginx upgrade" which before this change would cause nginx to fail. Also: * Correct the description of prison_check_ip4_locked to match the code. MFC after: 2 weeks Relnotes: Yes Sponsored by: Multiplay	2017-03-31 00:41:54 +00:00
Mike Karels	4a5c6c6ab0	Enable route and LLE (ndp) caching in TCP/IPv6 tcp_output.c was using a route on the stack for IPv6, which does not allow route caching or LLE/ndp caching. Switch to using the route (v6 flavor) in the in_pcb, which was already present, which caches both L3 and L2 lookups. Reviewed by: gnn hiren MFC after: 2 weeks	2017-03-27 23:48:36 +00:00
Mike Karels	8c1960d506	Fix reference count leak with L2 caching. ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache entry because they used their own routes on the stack, not in_pcb routes. The original model for route caching was callers that provided a route structure to ip{,6}input() would keep the route, and this model was used for L2 caching as well. Instead, change L2 caching to be done by default only when using a route structure in the in_pcb; the pcb deallocation code frees L2 as well as L3 caches. A separate change will add route caching to TCP/IPv6. Another suggestion was to have the transport protocols indicate willingness to use L2 caching, but this approach keeps the changes in the network level. Reviewed by: ae gnn MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D10059	2017-03-25 15:06:28 +00:00
Gleb Smirnoff	3ae4b0e7e8	Force same alignment on struct xinpgen as we have on struct xinpcb. This fixes 32-bit builds.	2017-03-21 16:23:44 +00:00
Gleb Smirnoff	cc65eb4e79	Hide struct inpcb, struct tcpcb from the userland. This is a painful change, but it is needed. On the one hand, we avoid modifying them, and this slows down some ideas, on the other hand we still eventually modify them and tools like netstat(1) never work on next version of FreeBSD. We maintain a ton of spares in them, and we already got some ifdef hell at the end of tcpcb. Details: - Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO. - Make struct xinpcb, struct xtcpcb pure API structures, not including kernel structures inpcb and tcpcb inside. Export into these structures the fields from inpcb and tcpcb that are known to be used, and put there a ton of spare space. - Make kernel and userland utilities compilable after these changes. - Bump __FreeBSD_version. Reviewed by: rrs, gnn Differential Revision: D10018	2017-03-21 06:39:49 +00:00
Eric van Gyzen	40769242ed	Add some ntohl() love to r315277 inet_ntoa() and inet_ntoa_r() take the address in network byte-order. When I removed those calls, I should have replaced them with ntohl() to make the hex addresses slightly less unreadable. Here they are. See r315277 regarding classic blunders. vangyzen: you're deep in "no good deed" territory, it seems --badger Reported by: ian MFC after: 3 days MFC when: I finally get it right Sponsored by: Dell EMC	2017-03-14 20:57:54 +00:00
Eric van Gyzen	47d803ea71	KTR: log IPv4 addresses in hex rather than dotted-quad When I made the changes in r313821, I fell victim to one of the classic blunders, the most famous of which is: never get involved in a land war in Asia. But only slightly less well known is this: Keep your brain turned on and engaged when making a tedious, sweeping, mechanical change. KTR can correctly log the immediate integral values passed to it, as well as constant strings, but not non-constant strings, since they might change by the time ktrdump retrieves them. Reported by: glebius MFC after: 3 days Sponsored by: Dell EMC	2017-03-14 18:27:48 +00:00
Conrad Meyer	fdb727f4f2	alias_proxy.c: Fix accidental error quashing This was introduced on accident in r165243, when return sites were unified to add a lock around LibAliasProxyRule(). PR: 217749 Submitted by: Svyatoslav <razmyslov at viva64.com> Sponsored by: Viva64 (PVS-Studio)	2017-03-13 18:05:31 +00:00
Andrey V. Elsukov	719498102c	Fix the L2 address printed in the "arp: %s moved from %*D" message. In r292978 struct llentry was changed and the ll_addr field became a pointer. PR: 217667 MFC after: 1 week	2017-03-11 04:57:52 +00:00
Gleb Smirnoff	c75e266608	Make inp_lock_assert() depend on INVARIANT_SUPPORT, not INVARIANTS. This will make INVARIANT-enabled modules, that use this function to load successfully on a kernel that has INVARIANT_SUPPORT only.	2017-03-09 00:55:19 +00:00
Ermal Luçi	dce33a45c9	The patch provides the same socket option as Linux IP_ORIGDSTADDR. Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD. The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application. This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets. Reviewed by: adrian, aw Approved by: ae (mentor) Sponsored by: rsync.net Differential Revision: D9235	2017-03-06 04:01:58 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Michael Tuexen	8d62aae8df	TCP window updates are only sent if the window can be increased by at least 2 * MSS. However, if the receive buffer size is small, this might be impossible. Add back a criterion to send a TCP window update if the window can be increased by at least half of the receive buffer size. This condition was removed in r242252. This patch simply brings it back. PR: 211003 Reviewed by: gnn MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D9475	2017-02-23 18:14:36 +00:00
Eric van Gyzen	922193e7ff	Remove inet_ntoa() from the kernel inet_ntoa() cannot be used safely in a multithreaded environment because it uses a static local buffer. Remove it from the kernel. Suggested by: glebius, emaste Reviewed by: gnn MFC after: never Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9625	2017-02-16 20:50:01 +00:00
Eric van Gyzen	8144690af4	Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel inet_ntoa() cannot be used safely in a multithreaded environment because it uses a static local buffer. Instead, use inet_ntoa_r() with a buffer on the caller's stack. Suggested by: glebius, emaste Reviewed by: gnn MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D9625	2017-02-16 20:47:41 +00:00
Andrey V. Elsukov	7a60a91011	Add missing check to fix the build with IPSEC_SUPPORT and without MAC. Submitted by: netchild	2017-02-14 21:33:10 +00:00
Andrey V. Elsukov	627c036f65	Remove IPsec related PCB code from SCTP. The inpcb structure has inp_sp pointer that is initialized by ipsec_init_pcbpolicy() function. This pointer keeps storage for IPsec security policies associated with a specific socket. An application can use IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket options to configure these security policies. Then ip[6]_output() uses inpcb pointer to specify that an outgoing packet is associated with some socket. And IPSEC_OUTPUT() method can use a security policy stored in the inp_sp. For inbound packet the protocol-specific input routine uses IPSEC_CHECK_POLICY() method to check that a packet conforms to inbound security policy configured in the inpcb. SCTP protocol doesn't specify inpcb for ip[6]_output() when it sends packets. Thus IPSEC_OUTPUT() method does not consider such packets as associated with some socket and can not apply security policies from inpcb, even if they are configured. Since IPSEC_CHECK_POLICY() method is called from protocol-specific input routine, it can specify inpcb pointer and associated with socket inbound policy will be checked. But there are two problems: 1. Such check is asymmetric, because we can not apply security policy from inpcb for outgoing packet. 2. IPSEC_CHECK_POLICY() expects that caller holds INPCB lock and access to inp_sp is protected. But for SCTP this is not correct, because SCTP uses own locks to protect inpcb. To fix these problems remove IPsec related PCB code from SCTP. This implies that IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket options will not be applicable to SCTP sockets. To be able to correctly check inbound security policies for SCTP, mark its protocol header with the PR_LASTHDR flag. Reported by: tuexen Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D9538	2017-02-13 11:37:52 +00:00
Ermal Luçi	c10c5b1eba	Committed without approval from mentor. Reported by: gnn	2017-02-12 06:56:33 +00:00
Ryan Stone	5ede40dcf2	Don't zero out srtt after excess retransmits If the TCP stack has retransmitted more than 1/4 of the total number of retransmits before a connection drop, it decides that its current RTT estimate is hopelessly out of date and decides to recalculate it from scratch starting with the next ACK. Unfortunately, it implements this by zeroing out the current RTT estimate. Drop this hack entirely, as it makes it significantly more difficult to debug connection issues. Instead check for excessive retransmits at the point where srtt is updated from an ACK being received. If we've exceeded 1/4 of the maximum retransmits, discard the previous srtt estimate and replace it with the latest rtt measurement. Differential Revision: https://reviews.freebsd.org/D9519 Reviewed by: gnn Sponsored by: Dell EMC Isilon	2017-02-11 17:05:08 +00:00
Gleb Smirnoff	cfff3743cd	Move tcp_fields_to_net() static inline into tcp_var.h, just below its friend tcp_fields_to_host(). There is third party code that also uses this inline. Reviewed by: ae	2017-02-10 17:46:26 +00:00
Ermal Luçi	e97b60264d	Fix build after r313524 Reported-by: ohartmann@walstatt.org	2017-02-10 06:01:47 +00:00
Ermal Luçi	4616026faf	Revert r313527 Heh svn is not git	2017-02-10 05:58:16 +00:00
Ermal Luçi	c0fadfdbbf	Correct missed variable name. Reported-by: ohartmann@walstatt.org	2017-02-10 05:51:39 +00:00
Ermal Luçi	ed55edceef	The patch provides the same socket option as Linux IP_ORIGDSTADDR. Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD. The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application. This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets. Sponsored-by: rsync.net Differential Revision: D9235 Reviewed-by: adrian	2017-02-10 05:16:14 +00:00
Eric van Gyzen	edf0313b70	Fix garbage IP addresses in UDP log_in_vain messages If multiple threads emit a UDP log_in_vain message concurrently, the IP addresses could be garbage due to concurrent usage of a single string buffer inside inet_ntoa(). Use inet_ntoa_r() with two stack buffers instead. Reported by: Mark Martinec <Mark.Martinec+freebsd@ijs.si> MFC after: 3 days Relnotes: yes Sponsored by: Dell EMC	2017-02-07 18:57:57 +00:00
Andrey V. Elsukov	fcf596178b	Merge projects/ipsec into head/. Small summary ------------- o Almost all IPsec related code was moved into sys/netipsec. o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel option IPSEC_SUPPORT added. It enables support for loading and unloading of ipsec.ko and tcpmd5.ko kernel modules. o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type support was removed. Added TCP/UDP checksum handling for inbound packets that were decapsulated by transport mode SAs. setkey(8) modified to show run-time NAT-T configuration of SA. o New network pseudo interface if_ipsec(4) added. For now it is built as part of ipsec.ko module (or with IPSEC kernel). It implements IPsec virtual tunnels to create route-based VPNs. o The network stack now invokes IPsec functions using special methods. The only one header file <netipsec/ipsec_support.h> should be included to declare all the needed things to work with IPsec. o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed. Now these protocols are handled directly via IPsec methods. o TCP_SIGNATURE support was reworked to be more close to RFC. o PF_KEY SADB was reworked: - now all security associations stored in the single SPI namespace, and all SAs MUST have unique SPI. - several hash tables added to speed up lookups in SADB. - SADB now uses rmlock to protect access, and concurrent threads can do SA lookups at the same time. - many PF_KEY message handlers were reworked to reflect changes in SADB. - SADB_UPDATE message was extended to support new PF_KEY headers: SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They can be used by IKE daemon to change SA addresses. o ipsecrequest and secpolicy structures were cardinally changed to avoid locking protection for ipsecrequest. Now we support only limited number (4) of bundled SAs, but they are supported for both INET and INET6. 
o INPCB security policy cache was introduced. Each PCB now caches used security policies to avoid SP lookup for each packet. o For inbound security policies added the mode, when the kernel does check for full history of applied IPsec transforms. o References counting rules for security policies and security associations were changed. The proper SA locking added into xform code. o xform code was also changed. Now it is possible to unregister xforms. tdb_xxx structures were changed and renamed to reflect changes in SADB/SPDB, and changed rules for locking and refcounting. Reviewed by: gnn, wblock Obtained from: Yandex LLC Relnotes: yes Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D9352	2017-02-06 08:49:57 +00:00
Patrick Kelsey	ec93ed8d95	Fix VIMAGE-related bugs in TFO. The autokey callout vnet context was not being initialized, and the per-vnet fastopen context was only being initialized for the default vnet. PR: 216613 Reported by: Alex Deiter <alex dot deiter at gmail dot com> MFC after: 1 week	2017-02-03 17:02:57 +00:00
George V. Neville-Neil	82988b50a1	Add an mbuf to ipinfo_t translator to finish cleanup of mbuf passing to TCP probes. Reviewed by: markj MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9401	2017-02-01 19:33:00 +00:00
Michael Tuexen	c03627fd06	Ensure that the variable bail is always initialized before used. MFC after: 1 week	2017-02-01 00:10:29 +00:00
Michael Tuexen	2aa116007c	Take the SCTP common header into account when computing the space available for chunks. This unbreaks the handling of ICMPV6 packets indicating "packet too big". It just worked for IPv4 since we are overbooking for IPv4. MFC after: 1 week	2017-01-31 23:36:31 +00:00
Michael Tuexen	7858d7cb8e	Remove a duplicate debug statement. MFC after: 1 week	2017-01-31 23:34:02 +00:00
Cy Schubert	3df96ee68e	Correct comment grammar and make it easier to understand. MFC after: 1 week	2017-01-30 04:51:18 +00:00
Hiren Panchasara	6134aabe38	Add a knob to change default behavior of inheriting listen socket's tcp stack regardless of what the default stack for the system is set to. With current/default behavior, after changing the default tcp stack, the application needs to be restarted to pick up that change. Setting this new knob net.inet.tcp.functions_inherit_listen_socket_stack to '0' would change that behavior and make any new connection use the newly selected default tcp stack. Reviewed by: rrs MFC after: 2 weeks Sponsored by: Limelight Networks	2017-01-27 23:10:46 +00:00
Luiz Otavio O Souza	338e227ac0	After the in_control() changes in r257692, an existing address is (intentionally) deleted first and then completely added again (so all the events, announces and hooks are given a chance to run). This causes an issue with CARP where the existing CARP data structure is removed together with the last address for a given VHID, which will cause a subsequent failure when the address is later re-added. This change fixes this issue by adding a new flag to keep the CARP data structure when an address is not being removed. There was an additional issue with IPv6 CARP addresses, where the CARP data structure would never be removed after a change and lead to VHIDs which cannot be destroyed. Reviewed by: glebius Obtained from: pfSense MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC (Netgate)	2017-01-25 19:04:08 +00:00
Michael Tuexen	bd60638c98	Fix a bug where the overhead of the I-DATA chunk was not considered. MFC after: 1 week	2017-01-24 21:30:31 +00:00
Hans Petter Selasky	f3e7afe2d7	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API supports rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. 
The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon the next ip_output() call, allocation of a new "snd_tag" will be attempted. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months	2017-01-18 13:31:17 +00:00
Maxim Sobolev	339efd75a4	Add a new socket option SO_TS_CLOCK to pick from several different clock sources to return timestamps when SO_TIMESTAMP is enabled. Two additional clock sources are: o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME); o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC). In addition to this, this option provides unified interface to get bintime (equivalent of using SO_BINTIME), except it is also supported with IPv6 where SO_BINTIME has never been supported. The long term plan is to deprecate SO_BINTIME and move everything to using SO_TS_CLOCK. Idea for this enhancement has been briefly discussed on the Net session during dev summit in Ottawa last June and the general input was positive. This change is believed to benefit network benchmarks/profiling as well as other scenarios where precise time of arrival measurement is necessary. There are two regression test cases as part of this commit: one extends unix domain test code (unix_cmsg) to test new SCM_XXX types and another one implements totally new test case which exchanges UDP packets between two processes using both conventional methods (i.e. calling clock_gettime(2) before recv(2) and after send(2)), as well as using setsockopt()+recv() in receive path. The resulting delays are checked for sanity for all supported clock types. Reviewed by: adrian, gnn Differential Revision: https://reviews.freebsd.org/D9171	2017-01-16 17:46:38 +00:00
Conrad Meyer	1d64db52f3	Fix a variety of cosmetic typos and misspellings No functional change. PR: 216096, 216097, 216098, 216101, 216102, 216106, 216109, 216110 Reported by: Bulat <bltsrc at mail.ru> Sponsored by: Dell EMC Isilon	2017-01-15 18:00:45 +00:00
Gleb Smirnoff	0f7ddf91e9	Use getsock_cap() instead of deprecated fgetsock(). Reviewed by: tuexen	2017-01-13 16:54:44 +00:00
Michael Tuexen	24209f0122	Ensure that the buffer length and the length provided in the IPv4 header match when using a raw socket to send IPv4 packets and providing the header. If they don't match, let send return -1 and set errno to EINVAL. Before this patch it was only enforced that the length in the header is not larger than the buffer length. PR: 212283 Reviewed by: ae, gnn MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D9161	2017-01-13 10:55:26 +00:00
Maxim Sobolev	5e946c03c7	Fix slight type mismatch between so_options defined in sys/socketvar.h and tw_so_options defined here which is supposed to be a copy of the former (short vs u_short respectively). Switch tw_so_options to be "signed short" to match the type of the field it's inherited from.	2017-01-12 10:14:54 +00:00
Hiren Panchasara	b8a2fb91f6	sysctl net.inet.tcp.hostcache.list in a jail can see connections from other jails and the host. This commit fixes it. PR: 200361 Submitted by: bz (original version), hiren (minor corrections) Reported by: Marcus Reid <marcus at blazingdot dot com> Reviewed by: bz, gnn Tested by: Lohith Bellad <lohithbsd at gmail dot com> MFC after: 1 week Sponsored by: Limelight Networks (minor corrections)	2017-01-05 17:22:09 +00:00
George V. Neville-Neil	fad073dd44	Followup to mtod removal in main stack (r311225). Continued removal of mtod() calls from TCP_PROBE macros. MFC after: 1 week Sponsored by: Limelight Networks	2017-01-04 04:00:28 +00:00
George V. Neville-Neil	2b9c998413	Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and dangerous. Those wanting data from an mbuf should use DTrace itself to get the data. PR: 203409 Reviewed by: hiren MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D9035	2017-01-04 02:19:13 +00:00
Enji Cooper	cfff8d3dbd	Unbreak ip_carp with WITHOUT_INET6 enabled by conditionalizing all IPv6 structs under the INET6 #ifdef. Similarly (even though it doesn't seem to affect the build), conditionalize all IPv4 structs under the INET #ifdef This also unbreaks the LINT-NOINET6 tinderbox target on amd64; I have not verified other MACHINE/TARGET pairs (e.g. armv6/arm). MFC after: 2 weeks X-MFC with: r310847 Pointyhat to: jpaetzel Reported by: O. Hartmann <o.hartmann@walstatt.org>	2016-12-30 21:33:01 +00:00
Josh Paetzel	8151740c88	Harden CARP against network loops. If there is a loop in the network a CARP that is in MASTER state will see its own broadcasts, which will then cause it to assume BACKUP state. When it assumes BACKUP it will stop sending advertisements. In that state it will no longer see advertisements and will assume MASTER... We can't catch all the cases where we are seeing our own CARP broadcast, but we can catch the obvious case. Submitted by: torek Obtained from: FreeNAS MFC after: 2 weeks Sponsored by: iXsystems	2016-12-30 18:46:21 +00:00
Andrey V. Elsukov	2e77d270c1	When we are sending IP fragments, update ip pointers in IP_PROBE() for each fragment. MFC after: 1 week	2016-12-29 19:57:46 +00:00
Michael Tuexen	2048d80aa3	Consistent handling of errors reported from the lower layer. MFC after: 3 days	2016-12-27 22:14:41 +00:00
Michael Tuexen	b7b84c0e02	Whitespace changes. The toolchain for processing the sources has been updated. No functional change. MFC after: 3 days	2016-12-26 11:06:41 +00:00
Michael Tuexen	d6194c562f	Remove a KASSERT which is not always true. In case of the empty queue tp->snd_holes and tcp_sackhole_insert() failing due to memory shortage, tp->snd_holes will be empty. This problem was hit when stress tests were performed by pho. PR: 215513 Reported by: pho Tested by: pho Sponsored by: Netflix, Inc.	2016-12-25 17:37:18 +00:00
Gleb Smirnoff	030b9c2f69	Remove assigned only variable.	2016-12-21 22:47:10 +00:00
Andrey V. Elsukov	ad9f4d6ab6	ip[6]_tryforward does inbound and outbound packet firewall processing. This can lead to change of mbuf pointer (packet filter could do m_pullup(), NAT, etc). Also in case of change of destination address, tryforward can decide that packet should be handled by local system. In this case modified mbuf can be returned to the ip[6]_input(). To handle this correctly, check M_FASTFWD_OURS flag after return from ip[6]_tryforward. And if it is present, update variables that depend on the mbuf pointer and skip another inbound firewall processing. No objection from: #network MFC after: 3 weeks Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D8764	2016-12-19 11:02:49 +00:00
Michael Tuexen	3d6fe5d84c	Fix the handling of buffered messages in stream reset deferred handling. Thanks to Eugen-Andrei Gavriloaie for reporting the issue and providing substantial help in nailing down the issue. MFC after: 1 week	2016-12-17 22:31:30 +00:00
Hiren Panchasara	b6ff672460	We currently don't do TSO if ip options are present. In case of IPv6, we look at in6p_options to check that. That is incorrect as we carry ip options in in6p_outputopts. Also, just checking for in6p_outputopts being NULL won't suffice as we combine ip options and ip header fields both in that one field. The commit fixes this by using ip6_optlen() which correctly calculates length of only ip options for IPv6. Reviewed by: ae, bz MFC after: 3 weeks Sponsored by: Limelight Networks	2016-12-11 23:14:47 +00:00
Michael Tuexen	8b9c95f4a9	Ensure that the reported ppid and tsn are taken from the first fragment. This fixes a bug where the wrong ppid was reported, if * I-DATA was used and the first fragment was not received first * DATA was used and different ppids were used. Thanks to Julian Cordes for making me aware of the issue. MFC after: 1 week	2016-12-11 13:26:35 +00:00
Gleb Smirnoff	8c70a35334	Fix build for 32-bit machines. Submitted by: tuexen	2016-12-09 20:50:35 +00:00
Gleb Smirnoff	3cbee8caa1	Use counter_ratecheck() in the ICMP rate limiting. Together with: rrs, jtl	2016-12-09 17:59:15 +00:00
Michael Tuexen	ebecdad811	Don't bundle a SACK chunk with a SHUTDOWN chunk if it is not required. MFC after: 1 week	2016-12-09 17:58:07 +00:00
Michael Tuexen	8d0a31e19c	Don't send multiple SHUTDOWN chunks in a single packet. Thanks to Felix Weinrank for making me aware of this issue. MFC after: 1 week	2016-12-09 17:57:17 +00:00
Michael Tuexen	b594081bdf	Silence a warning produced by newer versions of gcc. MFC after: 1 week	2016-12-07 22:01:09 +00:00
Michael Tuexen	49656eefc8	Cleanup the names of SSN, SID, TSN, FSN, PPID and MID. This made a couple of bugs visible in handling SSN wrap-arounds when using DATA chunks. Now bulk transfer seems to work fine... This fixes the issue reported in https://github.com/sctplab/usrsctp/issues/111 MFC after: 1 week	2016-12-07 19:30:59 +00:00
Michael Tuexen	5b495f17a5	Whitespace changes. The tools used to generate the sources have been updated and produce different whitespace. Commit this separately to avoid intermixing these with real code changes. MFC after: 3 days	2016-12-06 10:21:25 +00:00
Michael Tuexen	4ddd5aadea	Fix the handling of TCP FIN-segments in the CLOSED state When a TCP segment with the FIN bit set was received in the CLOSED state, a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the FIN counts as one byte. This accounting was missing and is fixed by this patch. Reviewed by: hiren MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://svn.freebsd.org/base/head	2016-12-02 08:02:31 +00:00
Andrey V. Elsukov	dc9d21f8b0	Rework ip_tryforward() to use FIB4 KPI. Tested by: olivier Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D8526	2016-11-28 17:55:32 +00:00
Hiren Panchasara	2806b2933b	For RTT calculations mid-session, we explicitly ignore ACKs with tsecr of 0 as many broken middle-boxes tend to do that. But during 3whs, in syncache_expand(), we don't do that which causes us to send a RST to such a client. Relax this constraint by only using tsecr to compare against timestamp that we sent when it is not 0. As a result, we'd now accept the final ACK of 3whs with tsecr of 0. Reviewed by: jtl, gnn Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8552	2016-11-21 20:53:11 +00:00
Michael Tuexen	35dfb8cb68	Ensure that TCP state changes to state-closing are reported via dtrace. This does not cover state changes from TIME-WAIT. Reviewed by: gnn MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8443	2016-11-19 14:45:08 +00:00
Michael Tuexen	6779a1a101	Notify the user via setting errno when a TCP RST segment is received either in the CLOSING or LAST-ACK state. Reviewed by: hiren MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8371	2016-11-17 08:15:02 +00:00
Andrey V. Elsukov	8432fa5fd9	Initialize ip6 pointer before use. PR: 214169 MFC after: 1 week	2016-11-06 02:33:04 +00:00
Hiren Panchasara	e04310d59b	Set slow start threshold more accurately on loss to be flightsize/2 instead of cwnd/2 as recommended by RFC5681. (spotted by mmacy at nextbsd dot org) Restore pre-r307901 behavior of aligning ssthresh/cwnd on mss boundary. (spotted by slawa at zxy dot spb dot ru) Tested by: dim, Slawa <slawa at zxy dot spb dot ru> MFC after: 1 month X-MFC with: r307901 Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8349	2016-11-01 21:08:37 +00:00
Julien Charbon	f1ee30ccd6	Remove an extraneous call to soisconnected() in syncache_socket(), introduced with r261242. The useful and expected soisconnected() call is done in tcp_do_segment(). Has been found as part of unrelated PR:212920 investigation. Improve slightly (~2%) the maximum number of TCP accept per second. Tested by: kevin.bowling_kev009.com, jch Approved by: gnn, hiren MFC after: 1 week Sponsored by: Verisign, Inc Differential Revision: https://reviews.freebsd.org/D8072	2016-10-26 15:19:18 +00:00
Hiren Panchasara	4e7f755377	FreeBSD tcp stack used to inform respective congestion control module about the loss event but not use or obey the recommendations i.e. values set by it in some cases. Here is an attempt to solve that confusion by following relevant RFCs/drafts. Stack only sets congestion window/slow start threshold values when there is no CC module available to take that action. All CC modules are inspected and updated when needed to take appropriate action on loss. tcp_stacks/fastpath module has been updated to adapt these changes. Note: Probably, the most significant change would be to not bring congestion window down to 1MSS on a loss signaled by 3-duplicate acks and letting respective CC decide that value. In collaboration with: Matt Macy <mmacy at nextbsd dot org> Discussed on: transport@ mailing list Reviewed by: jtl MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8225	2016-10-25 05:45:47 +00:00
Hiren Panchasara	dd13b7d387	Undo r307899. It needs a bit more work and proper commit log.	2016-10-25 05:07:51 +00:00
Hiren Panchasara	95d8236011	In Collaboration with: Matt Macy <mmacy at nextbsd dot com> Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8225	2016-10-25 05:03:33 +00:00
Ryan Stone	6c1bd55875	Fix ip_output() on point-to-point links In r304435, ip_output() was changed to use the result of the route lookup to decide whether the outgoing packet was a broadcast or not. This introduced a regression on interfaces where IFF_BROADCAST was not set (e.g. point-to-point links), as the algorithm could incorrectly treat the destination address as a broadcast address, and ip_output() would subsequently drop the packet as broadcasting on a non-IFF_BROADCAST interface is not allowed. Differential Revision: https://reviews.freebsd.org/D8303 Reviewed by: jtl Reported by: ambrisko MFC after: 2 weeks X-MFC-With: r304435 Sponsored by: Dell EMC Isilon	2016-10-24 22:11:33 +00:00
Michael Tuexen	38d3251c3d	No functional changes, mostly getting the whitespace changes resulting from an updated formatting tool chain. MFC after: 1 month	2016-10-22 17:21:21 +00:00
Michael Tuexen	3e1465754f	Make ICMPv6 hard error handling for TCP consistent with the ICMPv4 handling. Ensure that: * Protocol unreachable errors are handled by indicating ECONNREFUSED to the TCP user for both IPv4 and IPv6. These were ignored for IPv6. * Communication prohibited errors are handled by indicating ECONNREFUSED to the TCP user for both IPv4 and IPv6. These were ignored for IPv6. * Hop Limit exceeded errors are handled by indicating EHOSTUNREACH to the TCP user for both IPv4 and IPv6. For IPv6 the TCP connection was dropped but errno wasn't set. Reviewed by: gallatin, rrs MFC after: 1 month Sponsored by: Netflix Differential Revision: 7904	2016-10-21 10:32:57 +00:00
Julien Charbon	f5cf1e5f5a	Fix a double-free when an inp transitions to INP_TIMEWAIT state after having been dropped. This fix enforces in_pcbdrop() logic in tcp_input(): "in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet delivery or event notification when a socket remains open but TCP has closed." PR: 203175 Reported by: Palle Girgensohn, Slawa Olhovchenkov Tested by: Slawa Olhovchenkov Reviewed by: Slawa Olhovchenkov Approved by: gnn, Slawa Olhovchenkov Differential Revision: https://reviews.freebsd.org/D8211 MFC after: 1 week Sponsored by: Verisign, inc	2016-10-18 07:16:49 +00:00
Hiren Panchasara	784ce8fad2	Make sure tcp_mss() has the same check as tcp_mss_update() to have t_maxseg set to at least 64. This is still just a coverup to avoid kernel panic and not an actual fix. PR: 213232 Reviewed by: glebius MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8272	2016-10-18 02:40:25 +00:00
Patrick Kelsey	09c305eb65	Fix cases where the TFO pending counter would leak references, and eventually, memory. Also renamed some tfo labels and added/reworked comments for clarity. Based on an initial patch from jtl. PR: 213424 Reviewed by: jtl MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D8235	2016-10-15 01:41:28 +00:00
Jonathan T. Looney	82676a28eb	r307082 added the TCP_HHOOK kernel option and made some existing code only compile when that option is configured. In tcp_destroy(), the error variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block. This broke the build for VNET images. Enclose the error variable itself in an #ifdef block. Submitted by: Shawn Webb <shawn.webb at hardenedbsd.org> Reported by: Shawn Webb <shawn.webb at hardenedbsd.org> PointyHat to: jtl	2016-10-15 00:29:15 +00:00
Jonathan T. Looney	6d172f58a2	The code currently resets the keepalive timer each time a packet is received on a TCP session that has entered the ESTABLISHED state. This results in a lot of calls to reset the keepalive timer. This patch changes the behavior so we set the keepalive timer for the keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will first check to see if the session has been idle for TP_KEEPIDLE ticks. If not, it will reschedule the keepalive timer for the time the session will have been idle for TP_KEEPIDLE ticks. For a session with regular communication, the keepalive timer should fire approximately once every TP_KEEPIDLE ticks. For sessions with irregular communication, the keepalive timer might fire more often. But, the disruption from a periodic keepalive timer should be less than the regular cost of resetting the keepalive timer on every packet. (FWIW, this change saved approximately 1.73% of the busy CPU cycles on a particular test system with a heavy TCP output load. Of course, the actual impact is very specific to the particular hardware and workload.) Reviewed by: gallatin, rrs MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8243	2016-10-14 14:57:43 +00:00
Gleb Smirnoff	cc94f0c2d7	- Revert r300854, r303657 which tried to fix regression from r297225. - Fix the regression proper way using RO_RTFREE(). Submitted by: ae	2016-10-13 20:15:47 +00:00
Gleb Smirnoff	ec7bbf1f79	Fix build without TCP_HHOOK and with INVARIANTS. Before, mutex.h came via sys/hhook.h -> sys/rmlock.h -> sys/mutex.h.	2016-10-13 18:02:29 +00:00
Michael Tuexen	859422cc12	Mark the socket as un-writable when it is 1-to-1 and the SCTP association is freed. MFC after: 1 month	2016-10-13 13:53:01 +00:00
Michael Tuexen	4c7fb0cf6e	Whitespace changes. MFC after: 1 month	2016-10-13 13:38:14 +00:00
Jonathan T. Looney	68bd7ed102	The TFO server-side code contains some changes that are not conditioned on the TCP_RFC7413 kernel option. This change removes those few instructions from the packet processing path. While not strictly necessary, for the sake of consistency, I applied the new IS_FASTOPEN macro to all places in the packet processing path that used the (t_flags & TF_FASTOPEN) check. Reviewed by: hiren Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8219	2016-10-12 19:06:50 +00:00
Jonathan T. Looney	4527476029	Currently, when tcp_input() receives a packet on a session that matches a TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or not the socket is a listening socket. However, this causes the code to access a different cacheline. If we first check if the socket is in the LISTEN state, we can avoid accessing so->so_options when processing packets received for ESTABLISHED sessions. If INVARIANTS is defined, the code still needs to access both variables to check that so->so_options is consistent with the state. Reviewed by: gallatin MFC after: 1 week Sponsored by: Netflix	2016-10-12 02:30:33 +00:00
Jonathan T. Looney	bd79708dbf	In the TCP stack, the hhook(9) framework provides hooks for kernel modules to add actions that run when a TCP frame is sent or received on a TCP session in the ESTABLISHED state. In the base tree, this functionality is only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd, and cc_vegas congestion control modules. Presently, we incur overhead to check for hooks each time a TCP frame is sent or received on an ESTABLISHED TCP session. This change adds a new compile-time option (TCP_HHOOK) to determine whether to include the hhook(9) framework for TCP. To retain backwards compatibility, I added the TCP_HHOOK option to every configuration file that already defined "options INET". (Therefore, this patch introduces no functional change. In order to see a functional difference, you need to compile a custom kernel without the TCP_HHOOK option.) This change will allow users to easily exclude this functionality from their kernel, should they wish to do so. Note that any users who use a custom kernel configuration and use one of the congestion control modules listed above will need to add the TCP_HHOOK option to their kernel configuration. Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8185	2016-10-12 02:16:42 +00:00
Mark Johnston	d748f7efcd	Lock the ND prefix list and add refcounting for prefixes. This change extends the nd6 lock to protect the ND prefix list as well as the list of advertising routers associated with each prefix. To handle cases where the nd6 lock must be dropped while iterating over either the prefix or default router lists, a generation counter is used to track modifications to the lists. Additionally, a new mutex is used to serialize prefix on-link/off-link transitions. This mutex must be acquired before the nd6 lock and is held while updating the routing table in nd6_prefix_onlink() and nd6_prefix_offlink(). Reviewed by: ae, tuexen (SCTP bits) Tested by: Jason Wolfe <jason@llnw.com>, Larry Rosenman <ler@lerctr.org> MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D8125	2016-10-07 21:10:53 +00:00
Jonathan T. Looney	3ac125068a	Remove "long" variables from the TCP stack (not including the modular congestion control framework). Reviewed by: gnn, lstewart (partial) Sponsored by: Juniper Networks, Netflix Differential Revision: (multiple) Tested by: Limelight, Netflix	2016-10-06 16:28:34 +00:00
Jonathan T. Looney	0dda76b82b	If the new window size is less than the old window size, skip the calculations to check if we should advertise a larger window. Reviewed by: gnn MFC after: 2 weeks Sponsored by: Juniper Networks, Netflix Differential Revision: https://reviews.freebsd.org/D7076 Tested by: Limelight, Netflix	2016-10-06 16:09:45 +00:00
Jonathan T. Looney	15c825712e	Correctly calculate snd_max in persist case. In the persist case, take the SYN and FIN flags into account when updating the sequence space sent. Reviewed by: gnn MFC after: 2 weeks Sponsored by: Juniper Networks, Netflix Differential Revision: https://reviews.freebsd.org/D7075 Tested by: Limelight, Netflix	2016-10-06 16:00:48 +00:00
Jonathan T. Looney	55a429a6dc	Remove declaration of un-defined function tcp_seq_subtract(). Reviewed by: gnn MFC after: 1 week Sponsored by: Juniper Networks, Netflix Differential Revision: https://reviews.freebsd.org/D7055	2016-10-06 15:57:15 +00:00
Kevin Lo	c2b5ba7661	Remove an alias if_list, use if_link consistently. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D8075	2016-10-06 00:51:27 +00:00
Eric van Gyzen	2d9db0bc63	Add GARP retransmit capability A single gratuitous ARP (GARP) is always transmitted when an IPv4 address is added to an interface, and that is usually sufficient. However, in some circumstances, such as when a shared address is passed between cluster nodes, this single GARP may occasionally be dropped or lost. This can lead to neighbors on the network link working with a stale ARP cache and sending packets destined for that address to the node that previously owned the address, which may not respond. To avoid this situation, GARP retransmissions can be enabled by setting the net.link.ether.inet.garp_rexmit_count sysctl to a value greater than zero. The setting represents the maximum number of retransmissions. The interval between retransmissions is calculated using an exponential backoff algorithm, doubling each time, so the retransmission intervals are: {1, 2, 4, 8, 16, ...} (seconds). Due to the exponential backoff algorithm used for the interval between GARP retransmissions, the maximum number of retransmissions is limited to 16 for sanity. This limit corresponds to a maximum interval between retransmissions of 2^16 seconds ~= 18 hours. Increasing this limit is possible, but sending out GARPs spaced days apart would be of little use. Submitted by: David A. Bright <david.a.bright@dell.com> MFC after: 1 month Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D7695	2016-10-02 01:42:45 +00:00
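The doubling schedule described in the GARP retransmit commit above can be sketched as follows. This is a minimal illustration, not the committed kernel code: the names `GARP_REXMIT_MAX`, `garp_rexmit_interval()` and `garp_rexmit_total()` are hypothetical, and the 0-based indexing of retransmissions is an assumption.

```c
#include <stdint.h>

/* Hypothetical cap mirroring the limit of 16 described in the commit message. */
#define GARP_REXMIT_MAX 16

/*
 * Delay in seconds before retransmission number n (0-based, an assumption).
 * The delay doubles each time: 1, 2, 4, 8, 16, ...
 * Returns 0 when n is outside [0, GARP_REXMIT_MAX).
 */
static uint32_t
garp_rexmit_interval(int n)
{
	if (n < 0 || n >= GARP_REXMIT_MAX)
		return (0);
	return ((uint32_t)1 << n);
}

/* Total seconds elapsed once `count` retransmissions have been sent. */
static uint32_t
garp_rexmit_total(int count)
{
	uint32_t total = 0;
	int i;

	if (count > GARP_REXMIT_MAX)
		count = GARP_REXMIT_MAX;
	for (i = 0; i < count; i++)
		total += garp_rexmit_interval(i);
	return (total);
}
```

Under these assumptions, a garp_rexmit_count of 5 would spread the retransmissions over 1 + 2 + 4 + 8 + 16 = 31 seconds.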
Rick Macklem	00b460ffc5	r297225 broke udp_output() for the case where the "addr" argument is NULL and the function jumps to the "release:" label. For this case, the "inp" was write locked, but the code attempted to read unlock it. This patch fixes the problem. This case could occur for NFS over UDP mounts, where the server was down for a few minutes under certain circumstances. Reported by: bde Tested by: bde Reviewed by: gnn MFC after: 2 weeks	2016-10-01 19:39:09 +00:00
Hiren Panchasara	8a56c64533	This adds a sysctl which allows you to disable the TCP hostcache. This is handy during testing of network-related changes where cached entries may pollute your results, or during known congestion events where you don't want to unfairly penalize hosts. Prior to r232346 this would have meant you would break any connection with a sub-1500 MTU, as the hostcache was authoritative. All entries as they stand today should simply be used to pre-populate values for efficiency. Submitted by: Jason Wolfe (j at nitrology dot com) Reviewed by: rwatson, sbruno, rrs, bz (earlier version) MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D6198	2016-09-30 00:10:57 +00:00
Kurt Lidl	1d7ee746e6	Properly preserve ip_tos bits for IPv4 packets Restructure code slightly to save ip_tos bits earlier. Fix the bug where the ip_tos field is zeroed out before assigning to the iptos variable. Restore the ip_tos and ip_ver fields only if they have been zeroed during the pseudo-header checksum calculation. Reviewed by: cem, gnn, hiren MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D8053	2016-09-29 19:45:24 +00:00
Julien Charbon	c1b19923a3	Fix an issue with accept_filter introduced with r261242: As a side effect of r261242 when using accept_filter the first call to soisconnected() is done earlier in tcp_input() instead of tcp_do_segment() context. Restore the expected behaviour. Note: This call to soisconnected() seems to be extraneous in all cases (with or without accept_filter). Will be addressed in a separate commit. PR: 212920 Reported by: Alexey Tested by: Alexey, jch Sponsored by: Verisign, Inc. MFC after: 1 week	2016-09-29 11:18:48 +00:00
Kevin Lo	c7641cd18d	Remove ifa_list, use ifa_link (structure field) instead. While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming for the interface address list in sctp_bsd_addr.c Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D8051	2016-09-28 13:29:11 +00:00
Mariusz Zaborski	85b0f9de11	capsicum: propagate rights on accept(2) The descriptor returned by accept(2) should inherit capability rights from the listening socket. PR: 201052 Reviewed by: emaste, jonathan Discussed with: many Differential Revision: https://reviews.freebsd.org/D7724	2016-09-22 09:58:46 +00:00
Michael Tuexen	5cb9165556	Fix the handling of unordered fragmented user messages using DATA chunks. There were two bugs: * There was an accounting bug resulting in reporting a too small a_rwnd. * There was a bug when abandoning messages in the reassembly queue. MFC after: 4 weeks	2016-09-21 08:28:18 +00:00
Kevin Lo	c3bef61e58	Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead. Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D7878	2016-09-15 07:41:48 +00:00
Michael Tuexen	5a17b6ad98	Ensure that the IPPROTO_TCP level socket options * TCP_KEEPINIT * TCP_KEEPINTVL * TCP_KEEPIDLE * TCP_KEEPCNT always report the values currently used when getsockopt() is used. This wasn't the case when the sysctl-inherited default values were used. Ensure that the IPPROTO_TCP level socket option TCP_INFO has the TCPI_OPT_ECN flag set in the tcpi_options field when ECN support has been negotiated successfully. Reviewed by: rrs, jtl, hiren MFC after: 1 month Differential Revision: 7833	2016-09-14 14:48:00 +00:00
Dimitry Andric	6c01c0e0c6	With clang 3.9.0, compiling sys/netinet/igmp.c results in the following warning: sys/netinet/igmp.c:546:21: error: implicit conversion from 'int' to 'char' changes value from 148 to -108 [-Werror,-Wconstant-conversion] p->ipopt_list[0] = IPOPT_RA; /* Router Alert Option */ ~ ^~~~~~~~ sys/netinet/ip.h:153:19: note: expanded from macro 'IPOPT_RA' #define IPOPT_RA 148 /* router alert */ ^~~ This is because ipopt_list is an array of char, so IPOPT_RA is wrapped to a negative value. It would be nice to change ipopt_list to an array of u_char, but it changes the signature of the public struct ipoption, so add an explicit cast to suppress the warning. Reviewed by: imp MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D7777	2016-09-04 17:23:10 +00:00
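The wrap-around that clang warns about in the commit above can be reproduced in isolation. This is a small sketch rather than the actual igmp.c change: `store_option()` and `option_octet()` are hypothetical helpers, and the -108 result assumes the usual two's-complement, 8-bit char found on platforms where plain char is signed.

```c
#include <limits.h>

#define IPOPT_RA 148	/* router alert, value as quoted in the warning */

/*
 * Store an option code into a plain char slot with an explicit cast,
 * which is the kind of cast the commit adds to silence the warning.
 */
static void
store_option(char *slot, int code)
{
	*slot = (char)code;
}

/* Read the slot back as an octet; the bit pattern is unchanged by the cast. */
static unsigned char
option_octet(const char *slot)
{
	return ((unsigned char)*slot);
}

/* Nonzero when the code cannot be represented in a signed 8-bit char. */
static int
exceeds_signed_char(int code)
{
	return (code > SCHAR_MAX);
}
```

Whatever the signedness of plain char, reading the slot back through `unsigned char` yields the original octet, which is why the cast is safe for a value that is only ever used as a byte on the wire.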
Hiren Panchasara	06b99bd826	Adjust TCP module fastpath after r304803's cc_ack_received() changes. Reported by: hiren, bz, np Reviewed by: rrs Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D7664	2016-08-26 19:23:17 +00:00
Hiren Panchasara	e7106d6be2	Update TCPS_HAVERCVDFIN() macro to correctly include all states a connection can be in after receiving a FIN. FWIW, NetBSD has had this change for quite some time. This has been tested at Netflix and Limelight in production traffic. Reported by: Sam Kumar <samkumar99 at gmail.com> on transport@ Reviewed by: rrs MFC after: 4 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D7475	2016-08-26 17:48:54 +00:00
Michael Tuexen	91843cf34e	Fix a bug, where no SACK is sent when receiving a FORWARD-TSN or I-FORWARD-TSN chunk before any DATA or I-DATA chunk. Thanks to Julian Cordes for finding this problem and providing packetdrill scripts to reproduce the issue. MFC after: 3 days	2016-08-26 07:49:23 +00:00
Lawrence Stewart	4b7b743c16	Pass the number of segments coalesced by LRO up the stack by repurposing the tso_segsz pkthdr field during RX processing, and use the information in TCP for more correct accounting and as a congestion control input. This is only a start, and an audit of other uses for the data is left as future work. Reviewed by: gallatin, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D7564	2016-08-25 13:33:32 +00:00
Michael Tuexen	884d8c53e6	When aborting an association, send the ABORT before notifying the upper layer. For the kernel this doesn't matter, for the userland stack, it does. While there, silence a clang warning when compiling it in userland.	2016-08-24 06:22:53 +00:00
Ryan Stone	23424a2021	Temporarily disable the optimization from r304436 r304436 attempted to optimize the handling of incoming UDP packet by only making an expensive call to in_broadcast() if the mbuf was marked as a broadcast packet. Unfortunately, this cannot work in the case of point-to-point L2 protocols like PPP, which have no notion of "broadcast". Discussions on how to properly fix r304436 are ongoing, but in the meantime disable the optimization to ensure that no existing network setups are broken. Reported by: bms	2016-08-22 15:27:37 +00:00
Michael Tuexen	7fcbd928f8	Improve the locking when sending user messages. First, keep a ref count on the stcb after looking it up, as done in the other lookup cases. Second, before looking again at sp, ensure that it is not freed, because the assoc is about to be freed. MFC after: 3 days	2016-08-22 01:45:29 +00:00
Michael Tuexen	26a5d52f03	Remove duplicate code, which is not protected by the appropriate locks. MFC after: 3 days	2016-08-22 00:40:45 +00:00
Bjoern A. Zeeb	77ecef378a	Remove the kernel option for IPSEC_FILTERTUNNEL, which was deprecated more than 7 years ago in favour of a sysctl in r192648.	2016-08-21 18:55:30 +00:00
Marko Zec	9da85a912d	Permit disabling net.inet.udp.require_l2_bcast in VIMAGE kernels. The default value of the tunable introduced in r304436 couldn't be effectively overridden on VIMAGE kernels, because instead of being accessed via the appropriate VNET() accessor macro, it was accessed via the VNET_NAME() macro, which resolves to the (should-be) read-only master template of initial values of per-VNET data. Hence, while the value of udp_require_l2_bcast could be altered on per-VNET basis, the code in udp_input() would ignore it as it would always read the default value (one) from the VNET master template. Silence from: rstone	2016-08-20 22:12:26 +00:00
Michael Tuexen	e19497672b	Unbreak sctp_connectx(). MFC after: 3 days	2016-08-20 20:15:36 +00:00
Ryan Stone	11f2a7cd67	Fix unlocked access to ifnet address list in_broadcast() was iterating over the ifnet address list without first taking an IF_ADDR_RLOCK. This could cause a panic if a concurrent operation modified the list. Reviewed by: bz MFC after: 2 months Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7227	2016-08-18 22:59:10 +00:00
Ryan Stone	41029db13f	Don't check for broadcast IPs on non-bcast pkts in_broadcast() can be quite expensive, so skip calling it if the incoming mbuf wasn't sent to a broadcast L2 address in the first place. Reviewed by: gnn MFC after: 2 months Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7309	2016-08-18 22:59:05 +00:00
Ryan Stone	90cc51a1ab	Don't iterate over the ifnet addr list in ip_output() For almost every packet that is transmitted through ip_output(), a call to in_broadcast() was made to decide if the destination IP was a broadcast address. in_broadcast() iterates over the ifnet's address to find a source IP matching the subnet of the destination IP, and then checks if the IP is a broadcast in that subnet. This is completely redundant as we have already performed the route lookup, so the source IP is already known. Just use that address to directly check whether the destination IP is a broadcast address or not. MFC after: 2 months Sponsored By: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7266	2016-08-18 22:59:00 +00:00
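The direct check described in the commit above, which tests the destination against one already-known interface address instead of walking the address list, can be sketched roughly as follows. This is a simplified illustration and not the kernel's implementation: `is_broadcast_for()` is a hypothetical helper, all addresses are assumed to be in host byte order, and treating /31 and /32 networks as having no directed broadcast is an assumption of this sketch.

```c
#include <stdint.h>

/*
 * Decide whether dst is a broadcast address with respect to a single
 * interface address and its netmask (all values in host byte order).
 */
static int
is_broadcast_for(uint32_t dst, uint32_t ifaddr, uint32_t netmask)
{
	/* Limited broadcast, 255.255.255.255. */
	if (dst == 0xffffffffu)
		return (1);
	/* Assumption: /31 and /32 networks have no directed broadcast. */
	if (netmask >= 0xfffffffeu)
		return (0);
	/* Directed broadcast: network bits of ifaddr, all host bits set. */
	return (dst == (ifaddr | ~netmask));
}
```

With 192.168.1.10/24 as the source address, 192.168.1.255 is reported as a broadcast destination and 192.168.1.20 is not, without consulting any other address on the interface.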
Randall Stewart	eadd00f81a	A few more wording tweaks as suggested (with some modifications as well) by Ravi Pokala. Thanks for the comments :-) Sponsored by: Netflix Inc.	2016-08-16 15:17:36 +00:00
Randall Stewart	587d67c008	Here we update the modular tcp to be able to switch to an alternate TCP stack in other than the closed state (pre-listen/connect). The idea is that if that is supported by the alternate stack, it is asked if it's ok to switch. If it approves the "handoff" then we allow the switch to happen. Also the fini() function now gets a flag to tell if you are switching away or the tcb is destroyed. The init() call into the alternate stack is moved to the end so the tcb is more fully formed before the init transpires. Sponsored by: Netflix Inc. Differential Revision: D6790	2016-08-16 15:11:46 +00:00
Randall Stewart	0fa047b98c	Comments describing how to properly use the new lock_add functions and its respective companion. Sponsored by: Netflix Inc.	2016-08-16 13:08:03 +00:00
Randall Stewart	b07fef500b	This cleans up the timer code in TCP and also makes it so we do not take the INFO lock unless we are really going to delete the TCB. Differential Revision: D7136	2016-08-16 12:40:56 +00:00
Sepherosa Ziehau	8452c1b345	tcp/lro: Make # of LRO entries tunable Reviewed by: hps, gallatin Obtained from: rrs, gallatin MFC after: 2 weeks Sponsored by: Netflix (rrs, gallatin), Microsoft (sephe) Differential Revision: https://reviews.freebsd.org/D7499	2016-08-16 06:40:27 +00:00
Michael Tuexen	dcb436c936	Ensure that sctp_it_ctl.cur_it does not point to a free object (during a small time window). Thanks to Byron Campen for reporting the issue and suggesting a fix. MFC after: 3 days	2016-08-15 10:16:08 +00:00
Andrey V. Elsukov	57fb3b7a78	Add `stats reset` command implementation to NPTv6 module to be able to reset statistics counters. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2016-08-13 16:45:14 +00:00
Andrey V. Elsukov	d8caf56e9e	Add ipfw_nat64 module that implements stateless and stateful NAT64. The module works together with ipfw(4) and is implemented as its external action module. Stateless NAT64 registers external action with name nat64stl. This keyword should be used to create NAT64 instance and to address this instance in rules. Stateless NAT64 uses two lookup tables with mapped IPv4->IPv6 and IPv6->IPv4 addresses to perform translation. A configuration of instance should look like this: 1. Create lookup tables: # ipfw table T46 create type addr valtype ipv6 # ipfw table T64 create type addr valtype ipv4 2. Fill T46 and T64 tables. 3. Add rule to allow neighbor solicitation and advertisement: # ipfw add allow icmp6 from any to any icmp6types 135,136 4. Create NAT64 instance: # ipfw nat64stl NAT create table4 T46 table6 T64 5. Add rules that match the traffic: # ipfw add nat64stl NAT ip from any to table(T46) # ipfw add nat64stl NAT ip from table(T64) to 64:ff9b::/96 6. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96 via NAT64 host. Stateful NAT64 registers external action with name nat64lsn. The only option required to create a nat64lsn instance is prefix4. It defines the pool of IPv4 addresses used for translation. A configuration of instance should look like this: 1. Add rule to allow neighbor solicitation and advertisement: # ipfw add allow icmp6 from any to any icmp6types 135,136 2. Create NAT64 instance: # ipfw nat64lsn NAT create prefix4 A.B.C.D/28 3. Add rules that match the traffic: # ipfw add nat64lsn NAT ip from any to A.B.C.D/28 # ipfw add nat64lsn NAT ip6 from any to 64:ff9b::/96 4. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96 via NAT64 host. Obtained from: Yandex LLC Relnotes: yes Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D6434	2016-08-13 16:09:49 +00:00
Mike Karels	bca0155f64	Fix kernel build with TCP_RFC7413 option The current in_pcb.h includes route.h, which includes sockaddr structures. Including <sys/socketvar.h> should require <sys/socket.h>; add it in the appropriate place. PR: 211385 Submitted by: Sergey Kandaurov and iron at mail.ua Reviewed by: gnn Approved by: gnn (mentor) MFC after: 1 day	2016-08-11 23:52:24 +00:00
Andrey V. Elsukov	d6eb9b0249	Restore "nat global" support. Now that a zero value of arg1 is used to specify "tablearg", use the old "tablearg" value for "nat global". Introduce new macro IP_FW_NAT44_GLOBAL to replace hardcoded magic number to specify "nat global". Also replace 65535 magic number with corresponding macro. Fix typo in comments. PR: 211256 Tested by: Victor Chernov MFC after: 3 days	2016-08-11 10:10:10 +00:00
Michael Tuexen	be46a7c54d	Improve a consistency check to not detect valid cases for unordered user messages using DATA chunks as invalid ones. While there, ensure that error causes are provided when sending ABORT chunks in case of reassembly problems detected. Thanks to Taylor Brandstetter for making me aware of this problem. MFC after: 3 days	2016-08-10 17:19:33 +00:00
Stephen J. Kiernan	0ce1624d0e	Move IPv4-specific jail functions to new file netinet/in_jail.c _prison_check_ip4 renamed to prison_check_ip4_locked Move IPv6-specific jail functions to new file netinet6/in6_jail.c _prison_check_ip6 renamed to prison_check_ip6_locked Add appropriate prototypes to sys/sys/jail.h Adjust kern_jail.c to call prison_check_ip4_locked and prison_check_ip6_locked accordingly. Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that need to be built when INET and INET6, respectively, are configured in the kernel configuration file. Reviewed by: jtl Approved by: sjg (mentor) Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D6799	2016-08-09 02:16:21 +00:00
Michael Tuexen	d6e73fa13d	Fix the sending of FORWARD-TSN and I-FORWARD-TSN chunks. The last SID/SSN pair wasn't filled in. Thanks to Julian Cordes for providing a packetdrill script triggering the issue and making me aware of the bug. MFC after: 3 days	2016-08-08 13:52:18 +00:00
Michael Tuexen	9c5ca6f247	Fix a locking issue found by stress testing with tsctp. The inp read lock needs to be held when considering control->do_not_ref_stcb. MFC after: 3 days	2016-08-08 08:20:10 +00:00
Michael Tuexen	124d851acf	Consistently check for unsent data on the stream queues. MFC after: 3 days	2016-08-07 23:04:46 +00:00
Michael Tuexen	4d58b0c3a9	Remove stream queue entry consistently from wheel. While there, improve the handling of drain. MFC after: 3 days	2016-08-07 12:51:13 +00:00
Michael Tuexen	cf46cace5c	Don't modify a structure without holding a reference count on it. MFC after: 3 days	2016-08-06 15:29:46 +00:00
Michael Tuexen	bfe7e9328c	Mark an unused parameter as such. MFC after: 3 days	2016-08-06 12:51:07 +00:00
Michael Tuexen	d1ea5fa9c2	Fix various bugs in relation to the I-DATA chunk support This is joint work with rrs. MFC after: 3 days	2016-08-06 12:33:15 +00:00
Sepherosa Ziehau	b9ec6f0b02	tcp/lro: If timestamps mismatch or it's a FIN, force flush. This keeps the segments/ACK/FIN delivery order. Before this patch, it was observed: if A sent FIN immediately after an ACK, B would deliver FIN first to the TCP stack, then the ACK. This out-of-order delivery causes one unnecessary ACK sent from B. Reviewed by: gallatin, hps Obtained from: rrs, gallatin Sponsored by: Netflix (rrs, gallatin), Microsoft (sephe) Differential Revision: https://reviews.freebsd.org/D7415	2016-08-05 09:08:00 +00:00
Sepherosa Ziehau	05cde7efa6	tcp/lro: Implement hash table for LRO entries. This significantly improves HTTP workload performance and reduces HTTP workload latency. Reviewed by: rrs, gallatin, hps Obtained from: rrs, gallatin Sponsored by: Netflix (rrs, gallatin), Microsoft (sephe) Differential Revision: https://reviews.freebsd.org/D6689	2016-08-02 06:36:47 +00:00
Andrew Gallatin	d4c22202e6	Rework IPV6 TCP path MTU discovery to match IPv4 - Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput() - Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output() to send TCP packets without looking at the tcp host cache for every single transmit. - Make the icmp6 code mimic the IPv4 code & avoid returning PRC_HOSTDEAD because it is so expensive. Without these changes in place, every TCP6 pmtu discovery or host unreachable ICMP resulted in a call to in6_pcbnotify() which walks the tcbinfo table with the write lock held. Because the tcbinfo table is shared between IPv4 and IPv6, this causes huge scalability issues on servers with lots of (~100K) TCP connections, to the point where even a small percent of IPv6 traffic had a disproportionate impact on overall throughput. Reviewed by: bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D7272	2016-08-01 17:02:21 +00:00
Andrew Gallatin	0e3b891988	Call tcp_notify() directly to shoot down routes, rather than calling in_pcbnotifyall(). This avoids lock contention on tcbinfo due to in_pcbnotifyall() holding the tcbinfo write lock while walking all connections. Reviewed by: rrs, karels MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D7251	2016-07-28 19:32:25 +00:00

5932 Commits