freebsd-dev

Author	SHA1	Message	Date
Gleb Smirnoff	9a8cf950b2	carp: fix send error demotion recovery The problem is that carp(4) would clear the error counter on first successful send, and stop counting successes after that. Fix this logic and document it in human language. PR: 260499 Differential revision: https://reviews.freebsd.org/D33536	2021-12-18 17:19:26 -08:00
Gleb Smirnoff	75add59a8e	tcp: allocate statistics in the main tcp_init() No reason to have a separate SYSINIT.	2021-12-17 10:50:56 -08:00
Gleb Smirnoff	d8b45c8e14	inpcb: don't leak the port zone in in_pcbinfo_destroy()	2021-12-16 15:15:02 -08:00
Mark Johnston	014f98b119	udp: Fix a use-after-free in udp_multi_input() "ip" is a pointer into the input mbuf chain, so we shouldn't access it after the chain is freed. Fix style at the call site while here. Reported by: syzbot+7c8258509722af1b6145@syzkaller.appspotmail.com Reviewed by: tuexen, glebius Fixes: `de2d47842e` ("SMR protection for inpcbs") Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33473	2021-12-16 09:17:05 -05:00
Randall Stewart	9b60296531	tcp: Rack in a rare case we can get stuck sending a very small amount. If a tlp sending new data fails, and then the peer starts talking to us again, we can be in a situation where the tlp_new_data count is set, we are not in recovery and we always send one packet every RTT. The failure has to occur when we send the TLP initially from the ip_output() which is rare. But if it occurs you are basically stuck. This fixes it so we use the new_data count and clear it so we know it will be cleared. If a failure occurs the tlp timer will regenerate a new amount anyway so it is un-needed to carry the value on. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D33325	2021-12-15 09:41:33 -05:00
Gleb Smirnoff	185e659c40	inpcb: use locked variant of prison_check_ip*() The pcb lookup always happens in the network epoch and in SMR section. We can't block on a mutex due to the latter. Right now this patch opens up a race. But soon that will be addressed by D33339. Reviewed by: markj, jamie Differential revision: https://reviews.freebsd.org/D33340 Fixes: `de2d47842e`	2021-12-14 09:38:52 -08:00
Gleb Smirnoff	d74b7baeb0	ifnet_byindex() actually requires network epoch Sweep over potentially unsafe calls to ifnet_byindex() and wrap them in epoch. Most of the code touched remains unsafe, as the returned pointer is being used after epoch exit. Mark that with a comment. Validate the index argument inside the function, reducing argument validation requirement from the callers and making V_if_index private to if.c. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D33263	2021-12-06 09:32:31 -08:00
Randall Stewart	dadbc04250	tcp: rack fails to send out a TLP after an MTU change When rack sends out a TLP it sets up various state to make sure it avoids the cwnd (it's been more than 1 RTT since our last send) and it may at times send new data. If an MTU change has occurred and our cwnd has collapsed we can have a situation where the must_retran flag is set and we obey the cwnd, thus never sending the TLP and then sitting stuck. This one-line fix addresses that problem. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D33231	2021-12-06 09:56:09 -05:00
Gleb Smirnoff	eb93b99d69	in_pcb: delay crfree() down into UMA dtor inpcb lookups, which check inp_cred, work with pcbs that potentially went through in_pcbfree(). So inp_cred should stay valid until SMR guarantees its invisibility to lookups. While here, put the whole inpcb destruction sequence of in_pcbfree(), inpcb_dtor() and inpcb_fini() sequentially. Submitted by: markj Differential revision: https://reviews.freebsd.org/D33273	2021-12-05 10:46:37 -08:00
Michael Tuexen	54912d47b6	sctp: unbreak NOINET6 builds. PR: 260119 Reported by: kostikbel MFC after: 1 week	2021-12-04 19:16:18 +01:00
Michael Tuexen	d79676fb13	sctp: inherit IP level socket options from listening socket Ensure that TTL and TOS values set on a listener are inherited by the accepted sockets. PR: 260119 MFC after: 1 week	2021-12-03 22:44:01 +01:00
Gleb Smirnoff	36f42c5ebf	tcp_ccalgounload(): initialize the inpcb iterator when curvnet is set Pointy hat to: glebius Fixes: `de2d47842e`	2021-12-03 12:39:56 -08:00
Peter Lei	4c018b5aed	in_pcb: limit the effect of wraparound in TCP random port allocation check The check to see if TCP port allocation should change from random to sequential port allocation mode may incorrectly cause a false positive due to negative wraparound. Example: V_ipport_tcpallocs = 2147483585 (0x7fffffc1) V_ipport_tcplastcount = 2147483553 (0x7fffffa1) V_ipport_randomcps = 100 The original code would compare (2147483585 <= -2147483643) and thus incorrectly move to sequential allocation mode. Compute the delta first before comparing against the desired limit to limit the wraparound effect (since tcplastcount is always a snapshot of a previous tcpallocs).	2021-12-03 12:38:12 -08:00
Michael Tuexen	f32357be53	sctp: use the correct traffic class when sending SCTP/IPv6 packets When sending packets the stcb was used to access the inp and then access the endpoint specific IPv6 level options. This fails when there exists an inp, but no stcb yet. This is the case for sending an INIT-ACK in response to an INIT when no association already exists. Fix this by just providing the inp instead of the stcb. PR: 260120 MFC after: 1 week	2021-12-03 21:36:44 +01:00
Peter Lei	13e3f3349f	in_pcb: fix TCP local ephemeral port accounting Fix logic error causing UDP(-Lite) local ephemeral port bindings to count against the TCP allocation counter, potentially causing TCP to go from random to sequential port allocation mode prematurely.	2021-12-03 12:30:21 -08:00
Gleb Smirnoff	12ae3476f3	tcp_drain(): initialize the inpcb iterator when curvnet is set Reported by: cy Pointy hat to: glebius Fixes: `de2d47842e`	2021-12-02 21:08:30 -08:00
Gleb Smirnoff	651a545143	udp_detach(): fix set but not used warning	2021-12-02 20:12:40 -08:00
Gleb Smirnoff	bd1d085045	udp_multi_input(): the UDP header is only needed for probes Reported by: kib Fixes: `de2d47842e`	2021-12-02 20:12:40 -08:00
Cy Schubert	db0ac6ded6	Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit `266f97b5e9`, reversing changes made to `a10253cffe`. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.	2021-12-02 14:45:04 -08:00
Cy Schubert	266f97b5e9	wpa: Import wpa_supplicant/hostapd commit 14ab4a816 This is the November update to vendor/wpa committed upstream 2021-11-26. MFC after: 1 month	2021-12-02 13:35:14 -08:00
Gleb Smirnoff	3cce6164ab	ip_input: remove pointless check in INP_RECVIF handling An mbuf rcvif pointer is supposed to be valid and doesn't need extra checks. The code appeared in `d314ad7b73`.	2021-12-02 11:15:04 -08:00
Gleb Smirnoff	2e27230ff9	tcp_hpts: rewrite inpcb synchronization Just trust the pcb database, that if we did in_pcbref(), no way an inpcb can go away. And if we never put a dropped inpcb on our queue, and tcp_discardcb() always removes an inpcb to be dropped from the queue, then any inpcb on the queue is valid. Now, to solve LOR between inpcb lock and HPTS queue lock do the following trick. When we are about to process a certain time slot, take the full queue of the head list into on stack list, drop the HPTS lock and work on our queue. This of course opens a race when an inpcb is being removed from the on stack queue, which was already mentioned in comments. To address this race introduce generation count into queues. If we want to remove an inpcb with generation count mismatch, we can't do that, we can only mark it with desired new time slot or -1 for remove. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33026	2021-12-02 10:48:49 -08:00
Gleb Smirnoff	f971e79139	tcp_hpts: rename input queue to drop queue and trim dead code The HPTS input queue is in reality used only for "delayed drops". When a TCP stack decides to drop a connection on the output path it can't do that due to locking protocol between main tcp_output() and stacks. So, rack/bbr utilize HPTS to drop the connection in a different context. In the past the queue could also process input packets in context of HPTS thread, but now no stack uses this, so remove this functionality. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33025	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	b0a7c008cb	tcp_hpts: make struct tcp_hpts_entry private to the module. Also, make some of the functions also private to the module. Remove unused functions discovered after that. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33024	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	50f081ecb7	tcp_hpts: provide tcp_in_hpts(). It will hide some internal HPTS knowledge from the consumers. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33023	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	de2d47842e	SMR protection for inpcbs With introduction of epoch(9) synchronization to network stack the inpcb database became protected by the network epoch together with static network data (interfaces, addresses, etc). However, inpcb aren't static in nature, they are created and destroyed all the time, which creates some traffic on the epoch(9) garbage collector. Fairly new feature of uma(9) - Safe Memory Reclamation allows to safely free memory in page-sized batches, with virtually zero overhead compared to uma_zfree(). However, unlike epoch(9), it puts stricter requirement on the access to the protected memory, needing the critical(9) section to access it. Details: - The database is already built on CK lists, thanks to epoch(9). - For write access nothing is changed. - For a lookup in the database SMR section is now required. Once the desired inpcb is found we need to transition from SMR section to r/w lock on the inpcb itself, with a check that inpcb isn't yet freed. This requires some complexity, since SMR section itself is a critical(9) section. The complexity is hidden from KPI users in inp_smr_lock(). - For an inpcb list traversal (a pcblist sysctl, or broadcast notification) also a new KPI is provided, that hides internals of the database - inp_next(struct inp_iterator *). Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33022	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	565655f4e3	inpcb: reduce some aliased functions after removal of PCBGROUP. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33021	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	93c67567e0	Remove "options PCBGROUP" With upcoming changes to the inpcb synchronisation it is going to be broken. Even its current status after the move of PCB synchronization to the network epoch is very questionable. This experimental feature was sponsored by Juniper but ended never to be used in Juniper and doesn't exist in their source tree [sjg@, stevek@, jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix [gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@]. I'm up to resurrecting it back if there is any interest from anybody. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33020	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	1cec1c5831	Allow to compile RSS without PCBGROUP. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33019	2021-12-02 10:48:48 -08:00
Randall Stewart	dcf2dfed26	tcp: unloading a module that is set to default should error. I just discovered that the return of the EBUSY error was incorrectly rigged so that you could unload a CC module that was set to default. It's supposed to be an EBUSY error. Make it so. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D33229	2021-12-02 06:12:16 -05:00
Michael Tuexen	13c196a41e	sctp: improve handling of assoc ids in socket options For socket options related to local and remote addresses providing generic association ids does not make sense. Report EINVAL in this case. MFC after: 1 week	2021-12-01 14:54:55 +01:00
Michael Tuexen	a01b8859cb	sctp: cleanup, no functional change intended. MFC after: 1 week	2021-12-01 09:19:40 +01:00
Gordon Bergling	1dadeab367	netinet: Fix a common typo in source code comments - s/segement/segment/ MFC after: 3 days	2021-11-30 10:37:20 +01:00
Gordon Bergling	27c4abc7cd	inet(3): Fix two typos in sysctl descriptions - s/sequental/sequential/ MFC after: 3 days	2021-11-30 10:21:47 +01:00
Gordon Bergling	b4aa9cb217	tcp(4): Fix a typo in a sysctl description - s/entires/entries/ MFC after: 3 days	2021-11-30 07:17:30 +01:00
Michael Tuexen	147bf5e930	tcp: Don't try to upgrade a read lock just for logging Reviewed by: glebius, lstewart, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D33098	2021-11-29 13:48:40 +01:00
Michael Tuexen	3c1ba6f394	sctp: improve consistency, no functional change intended	2021-11-26 12:53:43 +01:00
Michael Tuexen	0906362646	sctp: add some asserts, no functional changes intended This might help in narrowing down https://syzkaller.appspot.com/bug?id=fbd79abaec55f5aede63937182f4247006ea883b	2021-11-26 12:19:33 +01:00
Mark Johnston	44775b163b	netinet: Remove unneeded mb_unmapped_to_ext() calls in_cksum_skip() now handles unmapped mbufs on platforms where they're permitted. Reviewed by: glebius, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33097	2021-11-24 13:31:16 -05:00
Mark Johnston	0d9c3423f5	netinet: Implement in_cksum_skip() using m_apply() This allows it to work with unmapped mbufs. In particular, in_cksum_skip() calls no longer need to be preceded by calls to mb_unmapped_to_ext() to avoid a page fault. PR: 259645 Reviewed by: gallatin, glebius, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33096	2021-11-24 13:31:16 -05:00
Mark Johnston	ecbbe83144	netinet: Deduplicate most in_cksum() implementations in_cksum() and related routines are implemented separately for each platform, but only i386 and arm have optimized versions. Other platforms' copies of in_cksum.c are identical except for style differences and support for big-endian CPUs. Deduplicate the implementations for the rest of the platforms. This will make it easier to implement in_cksum() for unmapped mbufs. On arm and i386, define HAVE_MD_IN_CKSUM to mean that the MI implementation is not to be compiled. No functional change intended. Reviewed by: kp, glebius MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33095	2021-11-24 13:31:16 -05:00
Mark Johnston	5195bcc212	netinet: Remove in_cksum.c It does not get compiled into the kernel. No functional change intended. Reviewed by: kp, glebius, cy MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33094	2021-11-24 13:31:16 -05:00
Gordon Bergling	b4fbc855a5	cc_newreno(4): Fix a typo in a source code comment - s/conditons/conditions/ MFC after: 3 days	2021-11-19 19:16:02 +01:00
Gleb Smirnoff	ff94500855	Add tcp_freecb() - single place to free tcpcb. Until this change there were two places where we would free tcpcb - tcp_discardcb() in case if all timers are drained and tcp_timer_discard() otherwise. They were pretty much copy-n-paste, except that in the default case we would run tcp_hc_update(). Merge this into single function tcp_freecb() and move new short version of tcp_timer_discard() to tcp_timer.c and make it static. Reviewed by: rrs, hselasky Differential revision: https://reviews.freebsd.org/D32965	2021-11-18 20:27:45 -08:00
Gleb Smirnoff	fb8588d2cb	tcp_timewait: use on stack struct tcptw as last resort In case we failed to uma_zalloc() and also failed to reuse with tcp_tw_2msl_scan(), then just use on stack tcptw. This will allow to run through tcp_twrespond() and standard tcpcb discard routine. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D32965	2021-11-18 20:27:45 -08:00
Randall Stewart	97e28f0f58	tcp: Rack ack war with a mis-behaving firewall or nat with resets. Previously we added ack-war prevention for misbehaving firewalls. This is where the f/w or nat messes up its sequence numbers and causes an ack-war. There is yet another type of ack war that we have found in the wild that is like unto this. Basically the f/w or nat gets an ack (keep-alive probe or such) and instead of turning the ack/seq around and adding a TH_RST it does something real stupid and sends a new packet with seq=0. This of course triggers the challenge ack in the reset processing which then sends in a challenge ack (if the seq=0 is within the range of possible sequence numbers allowed by the challenge) and then we rinse-repeat. This will add the needed tweaks (similar to the last ack-war prevention using the same sysctls and counters) to prevent it and allow say 5 per second by default. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32938	2021-11-17 09:45:51 -05:00
Mark Johnston	756bb50b6a	sctp: Remove now-unneeded mb_unmapped_to_ext() calls sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply(). No functional change intended. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32942	2021-11-16 13:38:09 -05:00
Mark Johnston	b4d758a0cc	sctp: Use m_apply() to calculate a checksum for an mbuf chain m_apply() works on unmapped mbufs, so this will let us elide mb_unmapped_to_ext() calls preceding sctp_calculate_cksum() calls in the network stack. Modify sctp_calculate_cksum() to assume it's passed an mbuf header. This assumption appears to be true in practice, and we need to know the full length of the chain. No functional change intended. Reviewed by: tuexen, jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32941	2021-11-16 13:36:30 -05:00
Mike Karels	2f35e7d9fa	kernel: partially revert e9efb1125a15, default inet mask When no mask is supplied to the ioctl adding an Internet interface address, revert to using the historical class mask rather than a single default. Similarly for the NFS bootp code. MFC after: 3 weeks Reviewed by: melifaro glebius Differential Revision: https://reviews.freebsd.org/D32951	2021-11-14 14:12:25 -06:00
Michael Tuexen	2f62f92e37	tcp: Fix a locking issue related to logging tcp_respond() is sometimes called with only a read lock. The logging however, requires a write lock. So either try to upgrade the lock if needed, or don't log the packet. Reported by: syzbot+8151ef969c170f76706b@syzkaller.appspotmail.com Reported by: syzbot+eb679adb3304c511c1e4@syzkaller.appspotmail.com Reviewed by: markj, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D32983	2021-11-14 15:04:27 +01:00
Gleb Smirnoff	ef396441ce	tcp_usr_detach: revert debugging piece from `f5cf1e5f5a`. The code was probably useful during the problem being chased down, but for brevity makes sense just to return to the original KASSERT. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D32968	2021-11-13 08:33:32 -08:00
Gleb Smirnoff	9a06a82455	tcp_timers: check for (INP_TIMEWAIT | INP_DROPPED) only once All timers keep inpcb locked through their execution. We need to check these flags only once. Checking for INP_TIMEWAIT earlier is also safer, since such inpcbs point into tcptw rather than tcpcb, and any dereferences of inp_ppcb as tcpcb are erroneous. Reviewed by: rrs, hselasky Differential revision: https://reviews.freebsd.org/D32967	2021-11-13 08:32:06 -08:00
Michael Tuexen	df07bfda67	tcp: Fix a locking issue INP_WLOCK_RECHECK_CLEANUP() and INP_WLOCK_RECHECK() might return from the function, so any locks held must be released. Reported by: syzbot+b1a888df08efaa7b4bf1@syzkaller.appspotmail.com Reviewed by: markj Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D32975	2021-11-12 22:13:50 +01:00
Mark Johnston	034a924009	tcp: Ensure that vnets have an initialized V_default_cc_ptr This causes new vnets to inherit the cc algorithm from vnet0. This is a temporary patch to fix vnet jail creation. With encouragement from: glebius Fixes: `b8d60729de` ("tcp: Congestion control cleanup.") Differential Revision: https://reviews.freebsd.org/D32970	2021-11-12 12:18:12 -07:00
Warner Losh	7e3c9ec906	tcp: better congestion control defaults Define CC_NEWRENO in all the appropriate DEFAULTS and std.* config files. It's the default congestion control algorithm. Add code to cc.c so that CC_DEFAULT is "newreno" if it's not overridden in the config file. Sponsored by: Netflix Fixes: `b8d60729de` ("tcp: Congestion control cleanup.") Reviewed by: manu, hselasky, jhb, glebius, tuexen Differential Revision: https://reviews.freebsd.org/D32964	2021-11-12 12:16:11 -07:00
Gleb Smirnoff	2ce85919bb	Add net.inet.ip.source_address_validation Drop packets arriving from the network that have our source IP address. If maliciously crafted they can create evil effects like an RST exchange between two of our listening TCP ports. Such packets just can't be legitimate. Enable the tunable by default. Long time due for a modern Internet host. Reviewed by: donner, melifaro Differential revision: https://reviews.freebsd.org/D32914	2021-11-12 09:00:33 -08:00
Gleb Smirnoff	9c89392f12	Add in_localip_fib(), in6_localip_fib(). Check if given address/FIB exists locally. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D32913	2021-11-12 08:59:42 -08:00
Gleb Smirnoff	81674f121e	ip_input: packet filters shall not modify m_pkthdr.rcvif Quick review confirms that they do not, also IPv6 doesn't expect such a change in mbuf. In IPv4 this appeared in `0aade26e6d`, which doesn't seem to have a valid explanation why. Reviewed by: donner, kp, melifaro Differential revision: https://reviews.freebsd.org/D32913	2021-11-12 08:58:27 -08:00
Gleb Smirnoff	94df3271d6	Rename net.inet.ip.check_interface to rfc1122_strong_es and document it. This very questionable feature was enabled in FreeBSD for a very short time. It was disabled very soon upon merging to RELENG_4 - `23d7f14119`. And in HEAD was also disabled pretty soon - `4bc37f9836`. The tunable has very vague name. Check interface for what? Given that it was never documented and almost never enabled, I think it is fine to rename it together with documenting it. Also, count packets dropped by this tunable as ips_badaddr, otherwise they fall down to ips_cantforward counter, which is misleading, as packet was not supposed to be forwarded, it was destined locally. Reviewed by: donner, kp Differential revision: https://reviews.freebsd.org/D32912	2021-11-12 08:57:06 -08:00
Mateusz Guzik	0359e7a5e4	net: sprinkle __predict_false in ip_input on error conditions While here rearrange the RSVP check to inspect proto first and avoid evaluating V_rsvp in the common case to begin with (most notably avoid the expensive read). Reviewed by: glebius Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D32929	2021-11-12 15:40:28 +00:00
Randall Stewart	26cbd0028c	tcp: Rack may still calculate long RTT on persists probes. When a persists probe is lost, we will end up calculating a long RTT based on the initial probe and when the response comes from the second probe (or third etc). This means we have a minimum of a confidence level of 3 on an incorrect probe. This commit will change it so that we have one of two options a) Just not count RTT of probes where we had a loss <or> b) Count them still but degrade the confidence to 0. I have set in this the default being to just not measure them, but I am open to having the default be otherwise. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32897	2021-11-11 06:35:51 -05:00
Randall Stewart	b8d60729de	tcp: Congestion control cleanup. NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!! This patch does a bit of cleanup on TCP congestion control modules. There were some rather interesting surprises that one could get i.e. where you use a socket option to change from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get a memory failure and end up on cc_newreno. This is not what one would expect. The new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK and pass in to the init function preallocated memory. The CC init is expected in this case not to fail but if it does and a module does break the "no fail with memory given" contract we do fall back to the CC that was in place at the time. This also fixes up a set of common newreno utilities that can be shared amongst other CC modules instead of the other CC modules reaching into newreno and executing what they think is a "common and understood" function. Let's put these functions in cc.c and that way we have a common place that is easily findable by future developers or bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE and HYSTART++ without having to dance through hoops for other CC modules, instead both newreno and the other modules just call into the common functions if they desire that behavior or roll their own if that makes more sense. Note: This commit changes the kernel configuration!! If you are not using GENERIC in some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC, CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined as well if you desire. Note that if you create a kernel configuration that does not define a congestion control module and includes INET or INET6 the kernel compile will break. Also you need to define a default, GENERIC adds 'options CC_DEFAULT=\"newreno\"' but you can specify any string that represents the name of the CC module (same names that show up in the CC module list under net.inet.tcp.cc). If you fail to add the options CC_DEFAULT in your kernel configuration the kernel build will also break. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. RELNOTES:YES Differential Revision: https://reviews.freebsd.org/D32693	2021-11-11 06:28:18 -05:00
John Baldwin	e3ba94d4f3	Don't require the socket lock for sorele(). Previously, sorele() always required the socket lock and dropped the lock if the released reference was not the last reference. Many callers locked the socket lock just before calling sorele() resulting in a wasted lock/unlock when not dropping the last reference. Move the previous implementation of sorele() into a new sorele_locked() function and use it instead of sorele() for various places in uipc_socket.c that called sorele() while already holding the socket lock. The sorele() macro now uses refcount_release_if_not_last() try to drop the socket reference without locking the socket. If that shortcut fails, it locks the socket and calls sorele_locked(). Reviewed by: kib, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D32741	2021-11-09 10:50:12 -08:00
Mike Karels	20d5940396	kernel: deprecate Internet Class A/B/C Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined; define it for user level. Define IN_MULTICAST separately from IN_CLASSD, and use it in pf instead of IN_CLASSD. Stop using class for setting default masks when not specified; instead, define new default mask (24 bits). Warn when an Internet address is set without a mask. MFC after: 1 month Reviewed by: cy Differential Revision: https://reviews.freebsd.org/D32708	2021-11-09 09:32:38 -06:00
Randall Stewart	477aeb3dd4	tcp: Printf should be removed. There is a printf when a socket option down to the CC module fails, this really should not be a printf. In fact this whole option needs to be re-thought in coordination with some other changes in the CC modules (it's just not right but it's ok what it does here if it fails since it will just use the ECN beta). Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32894	2021-11-08 11:49:34 -05:00
Hans Petter Selasky	10a62eb109	Use layer five checksum flags in the mbuf packet header to pass on crypto state. The mbuf protocol flags get cleared between layers, and also it was discovered that M_DECRYPTED conflicts with M_HASFCS when receiving ethernet packets. Add the proper CSUM_TLS_MASK and CSUM_TLS_DECRYPTED defines, and start using these instead of M_DECRYPTED inside the TCP LRO code. This change is needed by coming TLS RX hardware offload support patches. Suggested by: kib@ Reviewed by: jhb@ MFC after: 1 week Sponsored by: NVIDIA Networking	2021-11-04 18:52:06 +01:00
Allan Jude	34d8fffff3	SIFTR: Fix compilation with -DSIFTR_IPV6 A few pieces of the SIFTR code that are behind #ifdef SIFTR_IPV6 have not been updated as APIs have changed, etc. Reported by: Alexander Sideropoulos <Alexander.Sideropoulos@netapp.com> Reviewed by: rscheff, lstewart Sponsored by: NetApp Sponsored by: Klara Inc. Differential Revision: https://reviews.freebsd.org/D32698	2021-11-04 00:32:17 +00:00
Gleb Smirnoff	3ea9a7cf7b	blackhole(4): disable for locally originated TCP/UDP packets In most cases blackholing for locally originated packets is undesired, leads to different kind of lags and delays. Provide sysctls to enforce it, e.g. for debugging purposes. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D32718	2021-11-03 13:02:44 -07:00
Gordon Bergling	c28e39c3d6	Fix a common typo in sysctl descriptions - s/maxiumum/maximum/ MFC after: 3 days	2021-11-03 20:49:24 +01:00
Gleb Smirnoff	3358df2973	udp_input: remove a BSD stack relict I should have removed it 9 years ago in `8ad458a471`. That commit left save_ip as a write-only variable. With save_ip removed we got one case when IP header can be modified: the calculation of IP checksum with zeroed out header. This place already has had a header saver char b[9]. However, the b[9] saver didn't cover the ip_sum field, which we explicitly overwrite aliased as (struct ipovly *)->ih_len. This was fine in `cb34210012`, since checksum doesn't need to be restored if packet is consumed. Now we need to extend up to ip_sum field. In collaboration with: ae Differential revision: https://reviews.freebsd.org/D32719	2021-11-03 10:39:34 -07:00
Gordon Bergling	bb91496a85	netinet: Fix a common typo in source code comments - s/writting/writing/ MFC after: 3 days	2021-11-03 16:21:49 +01:00
Andrey V. Elsukov	4a9e95286c	ip_divert: calculate delayed checksum for IPv6 address family Before passing an IPv6 packet to application apply delayed checksum calculation. Mbuf flags will be lost when divert listener will return a packet back, so we will not be able to do delayed checksum calculation later. Also an application will get a packet with correct checksum. Reviewed by: donner MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D32807	2021-11-03 15:20:51 +03:00
Mateusz Guzik	8e27968786	inet: remove tcp_debug from netinet/tcp_debug.h It was a hack only needed for trpt, which can just define it locally. This makes it possible to fix up systat which also includes the file. Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-11-01 23:10:30 +00:00
Marius Halden	1019354b54	carp: deal with negative net.inet.carp.demotion Given nodes 1 and 2, where node 1 has an advskew of 0 and node 2 has an advskew of 100, making them master and backup respectively. If net.inet.carp.demotion is set to a negative value on node 1, node 2 might become master while node 1 still retains its master status. Whether or not node 2 becomes master seems to depend on the nodes' advskew and what the demotion sysctl was set to on node 1. The reason for node 2 becoming master seems to be that the calculated advskew taking demotion into account is truncated to a single unsigned byte when copied into the carp header for sending, and node 1 stays master since it uses the whole non-truncated calculated advskew when deciding whether to stay master. PR: 259528 Reviewed by: donner, glebius MFC after: 3 weeks Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D32759	2021-11-01 17:08:23 +01:00
Randall Stewart	141a53cd58	tcp: Rack might retransmit forever. If we get a Sacked peer with an MTU change we can retransmit forever if the last bytes are sacked and the client goes away (think power off). Then we never see the end condition and continually retransmit. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32671	2021-10-29 17:37:49 -04:00
Randall Stewart	aeda852782	tcp: Rack at times can miscalculate the RTT from what it thinks is a persists probe response. Turns out that if a peer sends in a window update right after rack fires off a persists probe, we can mis-interpret the window update and calculate a bogus RTT (very short). We still process the window update and send the data but we incorrectly generate an RTT. We should be only doing the RTT stuff if the rwnd is still small and has not changed. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32717	2021-10-29 03:17:43 -04:00
Gleb Smirnoff	92b3e07229	Enable net.inet.tcp.nolocaltimewait. This feature has been used for many years at large sites and didn't show any pitfalls.	2021-10-28 15:34:00 -07:00
Wojciech Macek	8a727c3df8	mroute: add missing WUNLOCK Add missing WUNLOCK as in all other error cases. Reported by: Stormshield Obtained from: Semihalf	2021-10-28 07:12:23 +02:00
Wojciech Macek	fb3854845f	mroute: fix memory leak Add MFC to linked list to store incoming packets before MCAST JOIN was captured. Sponsored by: Stormshield Obtained from: Semihalf MFC after: 2 weeks	2021-10-28 07:12:16 +02:00
Gleb Smirnoff	5d3bf5b1d2	rack: Update the fast send block on setsockopt(2) Rack caches TCP/IP header for fast send, so it doesn't call tcpip_fillheaders(). After certain socket option changes, namely IPV6_TCLASS, IP_TOS and IP_TTL it needs to update its fast block to be in sync with the inpcb. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655	2021-10-27 08:22:00 -07:00
Gleb Smirnoff	f581a26e46	Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP. Pass control for IP/IP6 level options from generic tcp_ctloutput_set() down to per-stack ctloutput. Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput(). Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655	2021-10-27 08:22:00 -07:00
Gleb Smirnoff	de156263a5	Several IP level socket options may affect TCP. After handling them in IP level ctloutput, pass them down to TCP ctloutput. We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place for now, but comment out how it should be handled. For IPv4 we are interested in IP_TOS and IP_TTL. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655	2021-10-27 08:21:59 -07:00
Gleb Smirnoff	fc4d53cc2e	Split tcp_ctloutput() into set/get parts. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655	2021-10-27 08:21:59 -07:00
Peter Lei	e28330832b	tcp: socket option to get stack alias name TCP stack sysctl nodes are currently inserted using the stack name alias. Allow the user to get the current stack's alias to allow for programmatic sysctl access. Obtained from: Netflix	2021-10-27 08:21:59 -07:00
Randall Stewart	12752978d3	tcp: The rack stack can incorrectly have an overflow when calculating a burst delay. If the congestion window is very large the fact that we multiply it by 1000 (for microseconds) can cause the uint32_t to overflow and we incorrectly calculate a very small divisor. This will then cause the burst timer to be very large when it should be 0. Instead let's make the three variables uint64_t and avoid the issue. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32668	2021-10-26 13:17:58 -04:00
Michael Tuexen	b15b053596	tcp: allow new reno functions to be called from other CC modules Some new reno functions use the internal data, but are also called from functions of other CC modules. Ensure that in this case, the internal data is not accessed. Reported by: syzbot+1d219ea351caa5109d4b@syzkaller.appspotmail.com Reported by: syzbot+b08144f8cad9c67258c5@syzkaller.appspotmail.com Reviewed by: rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D32649	2021-10-25 22:53:49 +02:00
Gleb Smirnoff	f2d266f3b0	Don't run ip_ctloutput() for divert socket. It was here since divert(4) was introduced, probably just came with a protocol definition boilerplate. There is no useful socket option that can be set or get for a divert socket. Reviewed by: donner Differential Revision: https://reviews.freebsd.org/D32608	2021-10-25 11:16:59 -07:00
Gleb Smirnoff	d89c820b0d	Remove div_ctlinput(). This function does nothing since `97d8d152c2`. It was introduced in `252f24a2cf` with a sidenote "may not be needed". Reviewed by: donner Differential Revision: https://reviews.freebsd.org/D32608	2021-10-25 11:16:49 -07:00
Gleb Smirnoff	c8ee75f231	Use network epoch to protect local IPv4 addresses hash. The modifications to the hash are already naturally locked by in_control_sx. Convert the hash lists to CK lists. Remove the in_ifaddr_rmlock. Assert the network epoch where necessary. Most cases when the hash lookup is done the epoch is already entered. Cover a few cases, that need entering the epoch, which mostly is initial configuration of tunnel interfaces and multicast addresses. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D32584	2021-10-22 14:40:53 -07:00
Randall Stewart	4e4c84f8d1	tcp: Add hystart-plus to cc_newreno and rack. TCP Hystart draft version -03 (https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-hystartplusplus) is a new version of hystart that allows one to carefully exit slow start if the RTT spikes too much. The newer version has a slower slow start, so to speak, that then kicks in for five round trips to see if you exited too early; if not, it goes into congestion avoidance. This commit adds that feature to our newreno CC and adds the needed bits in rack to be able to enable it. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32373	2021-10-22 07:10:28 -04:00
Roy Marples	5c5340108e	net: Allow binding of unspecified address without address existence Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind did the same for empty V_in6_ifaddrhead (no IPv6 addresses). An equivalent test has existed since 4.4-Lite. It was presumably done to avoid extra work (assuming the address isn't going to be found later). In normal system operation *_ifaddrhead will not be empty: they will at least have the loopback address(es). In practice no work will be avoided. Further, this case caused net/dhcpd to fail when run early in boot before assignment of any addresses. It should be possible to bind the unspecified address even if no addresses have been configured yet, so just remove the tests. The now-removed "XXX broken" comments were added in `59562606b9`, which converted the ifaddr lists to TAILQs. As far as I (emaste) can tell the brokenness is the issue described above, not some aspect of the TAILQ conversion. PR: 253166 Reviewed by: ae, bz, donner, emaste, glebius MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D32563	2021-10-20 19:25:51 -04:00
Gleb Smirnoff	9b7501e797	in_mcast: garbage collect inp_gcmoptions() It is used only once, merge it into inp_freemoptions().	2021-10-18 11:36:07 -07:00
Gleb Smirnoff	0f617ae48a	Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c.	2021-10-18 10:19:57 -07:00
Gleb Smirnoff	744a64bd92	in_pcb: garbage collect in_pcbrele()	2021-10-18 10:07:16 -07:00
Gleb Smirnoff	5a78df20ce	in_pcb: garbage collect unused structure in_pcblist	2021-10-18 10:06:39 -07:00
Maxim Sobolev	461e6f23db	Fix fragmented UDP packets handling since rev.360967. Consider IP_MF flag when checking length of the UDP packet to match the declared value. Sponsored by: Sippy Software, Inc. Differential Revision: https://reviews.freebsd.org/D32363 MFC after: 2 weeks	2021-10-15 16:48:12 -07:00
Gleb Smirnoff	2144431c11	Remove in_ifaddr_lock acquisition to access in_ifaddrhead. An IPv4 address is embedded into an ifaddr which is freed via epoch. And the in_ifaddrhead is already a CK list. Use the network epoch to protect against use after free. Next step would be to CK-ify the in_addr hash and get rid of the... Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D32434	2021-10-13 10:04:46 -07:00
Marko Zec	bc8b8e106b	[fib_algo][dxr] Retire counters which are no longer used The number of chunks can still be tracked via vmstat -z | fgrep dxr. MFC after: 3 days	2021-10-09 13:47:10 +02:00
Marko Zec	1549575f22	[fib_algo][dxr] Improve incremental updating strategy Tracking the number of unused holes in the trie and the range table was a bad metric based on which full trie and / or range rebuilds were triggered, which would happen in vain by far too frequently, particularly with live BGP feeds. Instead, track the total unused space inside the trie and range table structures, and trigger rebuilds if the percentage of unused space exceeds a sysctl-tunable threshold. MFC after: 3 days PR: 257965	2021-10-09 13:22:27 +02:00
Michael Tuexen	bd19202c92	sctp: improve KASSERT messages MFC after: 1 week	2021-10-08 11:33:56 +02:00
Michael Tuexen	3ff3733991	sctp: don't keep being locked on a stream which is removed Reported by: syzbot+f5f551e8a3a0302a4914@syzkaller.appspotmail.com MFC after: 1 week	2021-10-02 00:48:01 +02:00
Randall Stewart	a36230f75e	tcp: Make dsack stats available in netstat and also make sure it's aware of TLPs. DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics on DSACKs however are very useful in figuring out how many bad retransmissions you are doing. This is further complicated, however, by stacks that do TLP. A TLP when discovering a lost ack in the reverse path will cause the generation of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well as the more traditional dsack-bytes and dsack-packets. These will now all display in netstat -p tcp -s. This also updates all stacks that are currently built to keep track of these stats. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32158	2021-10-01 10:36:27 -04:00
Michael Tuexen	28ea947078	sctp: provide a specific stream scheduler function for FCFS A KASSERT in the generic routine does not apply and triggers incorrectly. Reported by: syzbot+8435af157238c6a11430@syzkaller.appspotmail.com MFC after: 1 week	2021-09-29 02:08:37 +02:00
Michael Tuexen	fa947a3687	sctp: cleanup and adding KASSERT()s, no functional change MFC after: 1 week	2021-09-28 20:31:12 +02:00
Michael Tuexen	5b53e749a9	sctp: fix usage of stream scheduler functions sctp_ss_scheduled() should only be called for streams that are scheduled. So call sctp_ss_remove_from_stream() before it. This bug was uncovered by the earlier cleanup. Reported by: syzbot+bbf739922346659df4b2@syzkaller.appspotmail.com Reported by: syzbot+0a0857458f4a7b0507c8@syzkaller.appspotmail.com Reported by: syzbot+a0b62c6107b34a04e54d@syzkaller.appspotmail.com Reported by: syzbot+0aa0d676429ebcd53299@syzkaller.appspotmail.com Reported by: syzbot+104cc0c1d3ccf2921c1d@syzkaller.appspotmail.com MFC after: 1 week	2021-09-28 05:25:58 +02:00
Michael Tuexen	171633765c	sctp: avoid locking an already locked mutex Reported by: syzbot+f048680690f2e8d7ddad@syzkaller.appspotmail.com Reported by: syzbot+0725c712ba89d123c2e9@syzkaller.appspotmail.com MFC after: 1 week	2021-09-28 05:17:03 +02:00
Gordon Bergling	d2e616147d	sctp: Fix a typo in a comment - s/assue/assume/ MFC after: 3 days	2021-09-26 15:15:39 +02:00
Marko Zec	43880c511c	[fib_algo][dxr] Split unused range chunk list in multiple buckets Traversing a single list of unused range chunks in search for a block of optimal size was suboptimal. The experience with real-world BGP workloads has shown that on average unused range chunks are tiny, mostly in length from 1 to 4 or 5, when DXR is configured with K = 20 which is the current default (D16X4R). Therefore, introduce a limited amount of buckets to accommodate descriptors of empty blocks of fixed (small) size, so that those can be found in O(1) time. If no empty chunks of the requested size can be found in fixed-size buckets, the search continues in an unsorted list of empty chunks of variable lengths, which should only happen infrequently. This change should permit us to manage significantly more empty range chunks without sacrificing the speed of incremental range table updating. MFC after: 3 days	2021-09-25 06:29:48 +02:00
Randall Stewart	1ca931a540	tcp: Rack compressed ack path updates the recv window too easily The compressed ack path of rack is not following proper procedures in updating the peer's window. It should be checking the seq and ack values before updating and instead it is blindly updating the values. This could in theory get the wrong window in the connection for some length of time. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32082	2021-09-23 11:43:29 -04:00
Randall Stewart	fd69939e79	tcp: Two bugs in rack one of which can lead to a panic. In extensive testing in NF we have found two issues inside the rack stack. 1) An incorrect offset is being generated by the fast send path when a fast send is initiated on the end of the socket buffer and before the fast send runs, the sb_compress macro adds data to the trailing socket. This fools the fast send code into thinking the sb offset changed and it miscalculates an "updated offset". It should only do that when the mbuf in question got smaller, i.e. an ack was processed. This can lead to a panic deref'ing a NULL mbuf if that packet is ever retransmitted. In the best case it leads to invalid data being sent to the client which usually terminates the connection. The fix is to have the proper logic (that is in the rsm fast path) to make sure we only update the offset when the mbuf shrinks. 2) The other issue is more bothersome. The timestamp check in rack needs to use the msec timestamp when comparing the timestamp echo to now. It was using a microsecond timestamp which ends up giving error prone results but causes only small harm in trying to identify which send to use in RTT calculations if it's a retransmit. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32062	2021-09-23 10:54:23 -04:00
Michael Tuexen	414499b3f9	sctp: Cleanup stream schedulers. No functional change intended. MFC after: 1 week	2021-09-23 14:16:56 +02:00
Michael Tuexen	762ae0ec8d	sctp: Simplify stream scheduler usage Callers are getting the stcb send lock, so just KASSERT that. No need to signal this when calling stream scheduler functions. No functional change intended. MFC after: 1 week	2021-09-21 17:13:57 +02:00
Michael Tuexen	0b79a76f84	sctp: improve consistency when calling stream scheduler Always hold the stcb send lock when calling sctp_ss_init() and sctp_ss_remove_from_stream(). MFC after: 1 week	2021-09-21 00:54:13 +02:00
Michael Tuexen	34b1efcea1	sctp: use a valid outstream when adding it to the scheduler Without holding the stcb send lock, the outstreams might get reallocated if the number of streams is increased. Reported by: syzbot+4a5431d7caa666f2c19c@syzkaller.appspotmail.com Reported by: syzbot+aa2e3b013a48870e193d@syzkaller.appspotmail.com Reported by: syzbot+e4368c3bde07cd2fb29f@syzkaller.appspotmail.com Reported by: syzbot+fe2f110e34811ea91690@syzkaller.appspotmail.com Reported by: syzbot+ed6e8de942351d0309f4@syzkaller.appspotmail.com MFC after: 1 week	2021-09-20 15:52:10 +02:00
Marko Zec	2ac039f7be	[fib_algo][dxr] Merge adjacent empty range table chunks. MFC after: 3 days	2021-09-20 06:30:45 +02:00
Michael Tuexen	e19d93b19d	sctp: fix FCFS stream scheduler Reported by: syzbot+c6793f0f0ce698bce230@syzkaller.appspotmail.com MFC after: 1 week	2021-09-19 11:56:26 +02:00
Mark Johnston	bf25678226	ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so errors are effectively hidden. Fix this by using a separate output parameter. Convert to the new socket buffer locking macros while here. Note that the socket buffer lock is not needed to synchronize the SOLISTENING check here, we can rely on the PCB lock. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31977	2021-09-17 14:19:05 -04:00
Mike Karels	fd0765933c	Change lowest address on subnet (host 0) not to broadcast by default. The address with a host part of all zeros was used as a broadcast long ago, but the default has been all ones since 4.3BSD and RFC1122. Until now, we would broadcast the host zero address as well as the configured address. Change to not broadcasting that address by default, but add a sysctl (net.inet.ip.broadcast_lowest) to re-enable it. Note that the correct way to use the zero address for broadcast would be to configure it as the broadcast address for the network. See https://datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/ and the discussion in https://reviews.freebsd.org/D19316. Note, Linux now implements this. Reviewed by: rgrimes, tuexen; melifaro (previous version) MFC after: 1 month Relnotes: yes Differential Revision: https://reviews.freebsd.org/D31861	2021-09-16 19:42:20 -05:00
Marko Zec	eb3148cc4d	[fib algo][dxr] Fix division by zero. A division by zero would occur if DXR would be activated on a vnet with no IP addresses configured on any interfaces. PR: 257965 MFC after: 3 days Reported by: Raul Munoz	2021-09-16 16:34:05 +02:00
Marko Zec	b51f8bae57	[fib algo][dxr] Optimize trie updating. Don't rebuild in vain trie parts unaffected by accumulated incremental RIB updates. PR: 257965 Tested by: Konrad Kreciwilk MFC after: 3 days	2021-09-15 22:42:49 +02:00
Marko Zec	442c8a245e	[fib algo][dxr] Fix undefined behavior. The result of shifting uint32_t by 32 (or more) is undefined: fix it.	2021-09-15 22:42:48 +02:00
Hans Petter Selasky	e3e7d95332	tcp: Avoid division by zero when KERN_TLS is enabled in tcp_account_for_send(). If the "len" variable is non-zero, we can assume that the sum of "tp->t_snd_rxt_bytes + tp->t_sndbytes" is also non-zero. It is also assumed that the 64-bit byte counters will never wrap around. Differential Revision: https://reviews.freebsd.org/D31959 Reviewed by: gallatin, rrs and tuexen Found by: "I told you so", also called hselasky MFC after: 1 week Sponsored by: NVIDIA Networking	2021-09-15 18:05:31 +02:00
Michael Tuexen	4542164685	sctp: cleanup, no functional change intended MFC after: 1 week	2021-09-15 10:18:11 +02:00
John Baldwin	c782ea8bb5	Add a switch structure for send tags. Move the type and function pointers for operations on existing send tags (modify, query, next, free) out of 'struct ifnet' and into a new 'struct if_snd_tag_sw'. A pointer to this structure is added to the generic part of send tags and is initialized by m_snd_tag_init() (which now accepts a switch structure as a new argument in place of the type). Previously, device driver ifnet methods switched on the type to call type-specific functions. Now, those type-specific functions are saved in the switch structure and invoked directly. In addition, this more gracefully permits multiple implementations of the same tag within a driver. In particular, NIC TLS for future Chelsio adapters will use a different implementation than the existing NIC TLS support for T6 adapters. Reviewed by: gallatin, hselasky, kib (older version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D31572	2021-09-14 11:43:41 -07:00
Mark Johnston	e6c19aa94d	sctp: Allow blocking on I/O locks even with non-blocking sockets There are two flags to request a non-blocking receive on a socket: MSG_NBIO and MSG_DONTWAIT. They are handled a bit differently in that soreceive_generic() and soreceive_stream() will block on the socket I/O lock when MSG_NBIO is set, but not if MSG_DONTWAIT is set. In general, MSG_NBIO seems to mean, "don't block if there is no data to receive" and MSG_DONTWAIT means "don't go to sleep for any reason". SCTP's soreceive implementation did not allow blocking on the I/O lock if either flag is set, but this violates an assumption in aio_process_sb(), which specifies MSG_NBIO but nonetheless expects to make progress if data is available to read. Change sctp_sorecvmsg() to block on the I/O lock only if MSG_DONTWAIT is not set. Reported by: syzbot+c7d22dbbb9aef509421d@syzkaller.appspotmail.com Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31915	2021-09-14 09:02:05 -04:00
Michael Tuexen	29545986bd	sctp: avoid LOR Don't lock the inp-info lock while holding an stcb lock. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D31921	2021-09-12 21:11:14 +02:00
Michael Tuexen	4181fa2a20	sctp: minor cleanup, no functional change MFC after: 1 week	2021-09-12 19:21:15 +02:00
Mark Johnston	2d5c48eccd	sctp: Tighten up locking around sctp_aloc_assoc() All callers of sctp_aloc_assoc() mark the PCB as connected after a successful call (for one-to-one-style sockets). In all cases this is done without the PCB lock, so the PCB's flags can be corrupted. We also do not atomically check whether a one-to-one-style socket is a listening socket, which violates various assumptions in solisten_proto(). We need to hold the PCB lock across all of sctp_aloc_assoc() to fix this. In order to do that without introducing lock order reversals, we have to hold the global info lock as well. So: - Convert sctp_aloc_assoc() so that the inp and info locks are consistently held. It returns with the association lock held, as before. - Fix an apparent bug where we failed to remove an association from a global hash if sctp_add_remote_addr() fails. - sctp_select_a_tag() is called when initializing an association, and it acquires the global info lock. To avoid lock recursion, push locking into its callers. - Introduce sctp_aloc_assoc_connected(), which atomically checks for a listening socket and sets SCTP_PCB_FLAGS_CONNECTED. There is still one edge case in sctp_process_cookie_new() where we do not update PCB/socket state correctly. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31908	2021-09-11 10:15:21 -04:00
orange30	f5777c123a	net: Fix memory leaks upon arp_fillheader() failures Free memory before return from arprequest_internal(). In in_arpinput(), if arp_fillheader() fails, it should use goto drop. Reviewed by: melifaro, imp, markj MFC after: 1 week Pull Request: https://github.com/freebsd/freebsd-src/pull/534	2021-09-10 09:45:26 -04:00
Michael Tuexen	3ea2cdd45e	sctp: add explicit cast, no functional change intended MFC after: 3 days	2021-09-09 19:13:47 +02:00
Michael Tuexen	0c1a20beb4	sctp: use appropriate argument when freeing association Reported by: syzbot+7fe26e26911344e7211d@syzkaller.appspotmail.com MFC after: 3 days	2021-09-09 18:01:35 +02:00
Mark Johnston	4250aa1188	sctp: Clear assoc socket references when freeing a PCB This restores behaviour present in the first import of SCTP. Commit `ceaad40ae7` commented this out and commit `62fb761ff2` removed it. However, once sctp_inpcb_free() returns, the socket reference is gone no matter what, so we need to clear it. Reported by: syzbot+30dd69297fcbc5f0e10a@syzkaller.appspotmail.com Reported by: syzbot+7b2f9d4bcac1c9569291@syzkaller.appspotmail.com Reported by: syzbot+ed3e651f7d040af480a6@syzkaller.appspotmail.com Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31886	2021-09-09 08:33:26 -04:00
Michael Tuexen	58a7bf124c	sctp: cleanup timewait handling for vtags MFC after: 1 week	2021-09-09 01:18:58 +02:00
Mark Johnston	ee4731179c	sctp: Fix a lock order reversal in sctp_swap_inpcb_for_listen() When port reuse is enabled in a one-to-one-style socket, sctp_listen() may call sctp_swap_inpcb_for_listen() to move the PCB out of the "TCP pool". In so doing it will drop the PCB lock, yielding an LOR since we now hold several socket locks. Reorder sctp_listen() so that it performs this operation before beginning the conversion to a listening socket. Also modify sctp_swap_inpcb_for_listen() to return with PCB write-locked, since that's what sctp_listen() expects now. Reviewed by: tuexen Fixes: `bd4a39cc93` ("socket: Properly interlock when transitioning to a listening socket") MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31879	2021-09-08 11:41:19 -04:00
Mark Johnston	6e3af6321b	sctp: Fix lock recursion in sctp_swap_inpcb_for_listen() After commit `bd4a39cc93` we now hold the global inp info lock across the call to sctp_swap_inpcb_for_listen(), which attempts to acquire it again. Since sctp_swap_inpcb_for_listen()'s sole caller is sctp_listen(), we can simply change it to not try to acquire the lock. Reported by: syzbot+a76b19ea2f8e1190c451@syzkaller.appspotmail.com Reported by: syzbot+a1b6cef257ad145b7187@syzkaller.appspotmail.com Reviewed by: tuexen Fixes: `bd4a39cc93` ("socket: Properly interlock when transitioning to a listening socket") MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31878	2021-09-08 11:41:18 -04:00
Michael Tuexen	aab1d593b2	sctp: minor cleanups, no functional change intended	2021-09-08 15:13:49 +02:00
Alexander V. Chernikov	4b631fc832	routing: fix source address selection rules for IPv4 over IPv6. Current logic always selects an IFA of the same family from the outgoing interfaces. In IPv4 over IPv6 setup there can be just single non-127.0.0.1 ifa, attached to the loopback interface. Create a separate rt_getifa_family() to handle entire ifa selection for the IPv4 over IPv6. Differential Revision: https://reviews.freebsd.org/D31868 MFC after: 1 week	2021-09-07 21:41:05 +00:00
Mark Johnston	c4b44adcf0	sctp: Remove special handling for a listen(2) backlog of 0 ... when applied to one-to-one-style sockets. sctp_listen() cannot be used to toggle the listening state of such a socket. See RFC 6458's description of expected listen(2) semantics for one-to-one- and one-to-many-style sockets. Reviewed by: tuexen MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31774	2021-09-07 17:12:09 -04:00
Mark Johnston	bd4a39cc93	socket: Properly interlock when transitioning to a listening socket Currently, most protocols implement pru_listen with something like the following: SOCK_LOCK(so); error = solisten_proto_check(so); if (error) { SOCK_UNLOCK(so); return (error); } solisten_proto(so); SOCK_UNLOCK(so); solisten_proto_check() fails if the socket is connected or connecting. However, the socket lock is not used during I/O, so this pattern is racy. The change modifies solisten_proto_check() to additionally acquire socket buffer locks, and the calling thread holds them until solisten_proto() or solisten_proto_abort() is called. Now that the socket buffer locks are preserved across a listen(2), this change allows socket I/O paths to properly interlock with listen(2). This fixes a large number of syzbot reports, only one is listed below and the rest will be dup'ed to it. Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31659	2021-09-07 17:11:43 -04:00
Mark Johnston	f94acf52a4	socket: Rename sb(un)lock() and interlock with listen(2) In preparation for moving sockbuf locks into the containing socket, provide alternative macros for the sockbuf I/O locks: SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a socket rather than a socket buffer. Note that these locks are used only to prevent concurrent readers and writers from interleaving I/O. When locking for I/O, return an error if the socket is a listening socket. Currently the check is racy since the sockbuf sx locks are destroyed during the transition to a listening socket, but that will no longer be true after some follow-up changes. Modify a few places to check for errors from sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In particular, add checks to sendfile() and sorflush(). Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31657	2021-09-07 15:06:48 -04:00
Mark Johnston	173a7a4ee4	sctp: Fix iterator synchronization in sctp_sendall() - The SCTP_PCB_FLAGS_SND_ITERATOR_UP check was racy, since two threads could observe that the flag is not set and then both set it. I'm not sure if this is actually a problem in practice, i.e., maybe there's no problem having multiple sends for a single PCB in the iterator list? - sctp_sendall() was modifying sctp_flags without the inp lock held. The change simply acquires the PCB write lock before toggling the flag, fixing both problems. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31813	2021-09-07 11:19:29 -04:00
Mark Johnston	e8e23ec127	sctp: Remove an unused sctp_inpcb field This appears to be unused in usrsctp as well. No functional change intended. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31812	2021-09-07 11:19:29 -04:00
Mark Johnston	c17b531bed	sctp: Fix races around sctp_inpcb_free() sctp_close() and sctp_abort() disassociate the PCB from its socket. As a part of this, they attempt to free the PCB, which may end up lingering. Fix some bugs in this area: - For some reason, sctp_close() and sctp_abort() set SCTP_PCB_FLAGS_SOCKET_GONE using an atomic compare-and-set without the PCB lock held. This is racy since sctp_flags is normally updated without atomics, using the PCB lock to synchronize. So, the update can be lost, which can cause all sorts of races with other SCTP components which look for the _GONE flag. Fix the problem simply by acquiring the PCB lock in order to set the flag. Note that we have to drop and re-acquire the lock again in sctp_inpcb_free(), but I don't see a good way around that for now. If it's a real problem, the _GONE flag could be split out of sctp_flags and into a dedicated sctp_inpcb field. - In sctp_inpcb_free(), load sctp_socket after acquiring the PCB lock, to avoid possible races with parallel sctp_inpcb_free() calls. - Add an assertion to sctp_inpcb_free() to verify that _ALLGONE is not set. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31811	2021-09-07 11:19:29 -04:00
Alexander V. Chernikov	936f4a42fa	lltable: do not require prefix lookup when checking lle allocation rules. With the new FIB_ALGO infrastructure, nearly all subsystems use fib[46]_lookup() functions, which provide lockless lookups. A number of places remain that use old-style lookup functions, which still require the RIB read lock to return the result. One such place is the arp processing code. FIB_ALGO implementation makes some tradeoffs, resulting in (relatively) prolonged periods of holding RIB_WLOCK. If the lock is held and datapath competes for it, the RX ring may get blocked, ending in traffic delays and losses. As currently arp processing is performed directly in the interrupt handler, handling ARP replies triggers the problem described above when the amount of ARP replies is high. To be more specific, prior to creating new ARP entry, routing lookup for the entry address in interface fib is executed. The following conditions are then verified: 1. If lookup returns an empty result, or the resulting prefix is non-directly-reachable, failure is returned. The only exception are host routes w/ gateway==address. 2. If the routing lookup returns different interface and non-host route, we want to support the use case of having multiple interfaces with the same prefix. In fact, the current code just checks if the returned prefix covers target address (always true) and effectively allows allocating ARP entries for any directly-reachable prefix, regardless of its interface. Change the code to perform the following: 1) use fib4_lookup() to get the nexthop, instead of requesting exact prefix. 2) Rewrite first condition check using nexthop flags (1:1 match) 3) Rewrite second condition to check for interface addresses matching target address on the input interface. Differential Revision: https://reviews.freebsd.org/D31824 Reviewed by: ae MFC after: 1 week PR: 257965	2021-09-06 21:03:22 +00:00
Gordon Bergling	631504fb34	Fix a common typo in source code comments - s/existant/existent/ MFC after: 3 days	2021-09-04 12:56:57 +02:00
Mark Johnston	c98bf2a45e	sctp: Always check for a vanishing inpcb when processing COOKIE-ECHO We previously did this only in the normal case where no association exists yet. However, it is not safe to process COOKIE-ECHO even if an association exists, as sctp_process_cookie_existing() may dereference the socket pointer. See also commit `0c7dc84076`. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31755	2021-09-01 10:28:17 -04:00
Mark Johnston	d35be50f57	sctp: Hold association locks across socket wakeups when freeing At this point we do not hold the inpcb lock, so the only thing holding the socket reference live is the TCB lock, which needs to be acquired by sctp_inpcb_free() in order to destroy associations. Defer the unlock until after we dereference the socket reference. Reported by: syzbot+1d0f2c4675de76a4cf1e@syzkaller.appspotmail.com Reported by: syzbot+fabee77954fe69d3a5ad@syzkaller.appspotmail.com Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31754	2021-09-01 10:27:51 -04:00
Mark Johnston	65f30a39e1	sctp: Release the socket reference when detaching an association Later in sctp_free_assoc(), when we clean up chunk lists, sctp_free_spbufspace() is used to reset the byte count in the socket send buffer. However, if the PCB is going away, the socket may already have been detached from the PCB, in which case this becomes a use-after-free. Clear the socket reference from the association before detaching it from the PCB, if the PCB has already lost its socket reference. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31753	2021-09-01 10:27:31 -04:00
Mark Johnston	457abbb857	sctp: Implement sctp_inpcb_bind_locked() This will be used by sctp_listen() to avoid dropping locks when performing an implicit bind. No functional change intended. Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31757	2021-09-01 10:06:18 -04:00
Mark Johnston	be8ee77e9e	sctp: Add macros to assert on inp info lock state Reviewed by: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31756	2021-09-01 10:06:18 -04:00

7278 Commits