There is small window, when encap_detach() can free matched entry
directly after we release encapmtx. Instead of use pointer to the
matched entry, save pointers to needed variables from this entry
and use them after release mutex.
Pass argument stored in the encaptab entry to encap_fillarg(), instead
of pointer to matched entry. Also do not allocate new mbuf tag, when
argument that we plan to save in this tag is NULL.
Also make encaptab variable static.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
to be transmitted but the arp cache entry expired, which triggers an arp request
to be sent, the bpf code might want to sleep but crash the system due
to a non sleep lock held from the arp entry not released properly.
Release the lock before calling the arp request code to solve the issue
as is done on all the other code paths.
PR: 200323
Approved by: ae, gnn(mentor)
MFC after: 1 week
Sponsored by: Netgate
Differential Revision: https://reviews.freebsd.org/D2828
continue sending on the same net.
This fixes a bug where an invalid mbuf chain was constructed, if a
full size frame of control chunks should be sent and there is a
output error.
Based on a discussion with rrs@, change move to the next net. This fixes
the bug and improves the behaviour.
Thanks to Irene Ruengeler for spending a lot of time in narrowing this
problem down.
MFC after: 3 days
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return values have also to be checked.
Differential Revision: https://reviews.freebsd.org/D2763
Reviewed by: hiren
Approved by: hiren
MFC after: 1 day
Sponsored by: Verisign, Inc.
the scope.
This fixes a problem when a client with a global address
connects to a server with a private address.
Thanks to Irene Ruengeler in helping me to find the issue.
MFC after: 3 days
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
gif(4) interface. Add new option "ignore_source" for gif(4) interface.
When it is enabled, gif's encapcheck function requires match only for
packet's destination address.
Differential Revision: https://reviews.freebsd.org/D2004
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
Although this is not important to the rest of the TCP processing
it is a conveneint way to make the DTrace state-transition probe
catch this important state change.
MFC after: 1 week
Currently we have tables identified by their names in userland
with internal kernel-assigned indices. This works the following way:
When userland wishes to communicate with kernel to add or change rule(s),
it makes indexed sorted array of table names
(internally ipfw_obj_ntlv entries), and refer to indices in that
array in rule manipulation.
Prior to committing new rule to the ruleset kernel
a) finds all referenced tables, bump their refcounts and change
values inside the opcodes to be real kernel indices
b) auto-creates all referenced but not existing tables and then
do a) for them.
Kernel does almost the same when exporting rules to userland:
prepares array of used tables in all rules in range, and
prepends it before the actual ruleset retaining actual in-kernel
indexes for that.
There is also special translation layer for legacy clients which is
able to provide 'real' indices for table names (basically doing atoi()).
While it is arguable that every subsystem really needs names instead of
numbers, there are several things that should be noted:
1) every non-singleton subsystem needs to store its runtime state
somewhere inside ipfw chain (and be able to get it fast)
2) we can't assume object numbers provided by humans will be dense.
Existing nat implementation (O(n) access and LIST inside chain) is a
good example.
Hence the following:
* Convert table-centric rewrite code to be more generic, callback-based
* Move most of the code from ip_fw_table.c to ip_fw_sockopt.c
* Provide abstract API to permit subsystems convert their objects
between userland string identifier and in-kernel index.
(See struct opcode_obj_rewrite) for more details
* Create another per-chain index (in next commit) shared among all subsystems
* Convert current NAT44 implementation to use new API, O(1) lookups,
shared index and names instead of numbers (in next commit).
Sponsored by: Yandex LLC
When we are passing mbuf to IPSec processing via ipsec[46]_process_packet(),
we hold one reference to security policy and release it just after return
from this function. But IPSec processing can be deffered and when we release
reference to security policy after ipsec[46]_process_packet(), user can
delete this security policy from SPDB. And when IPSec processing will be
done, xform's callback function will do access to already freed memory.
To fix this move KEY_FREESP() into callback function. Now IPSec code will
release reference to SP after processing will be finished.
Differential Revision: https://reviews.freebsd.org/D2324
No objections from: #network
Sponsored by: Yandex LLC
- Use the carp_sx to serialize not only CARP ioctls, but also carp_attach()
and carp_detach().
- Use cif_mtx to lock only access to those the linked list.
- These locking changes allow us to do some memory allocations with M_WAITOK
and also properly call callout_drain() in carp_destroy().
- In carp_attach() assert that ifaddr isn't attached. We always come here
with a pristine address from in[6]_control().
Reviewed by: oleg
Sponsored by: Nginx, Inc.
TCP timers:
- Add a reference from tcpcb to its inpcb
- Defer tcpcb deletion until TCP timers have finished
Differential Revision: https://reviews.freebsd.org/D2079
Submitted by: jch, Marc De La Gueronniere <mdelagueronniere@verisign.com>
Reviewed by: imp, rrs, adrian, jhb, bz
Approved by: jhb
Sponsored by: Verisign, Inc.
sequential IP ID case (e.g. ping -f), distribution fell into 8-10 buckets
out of 64. With Jenkins hash, distribution is even.
o Add random seed to the hash.
Sponsored by: Nginx, Inc.
function names have changed and comments are reformatted or added, but
there is no functional change.
Claim copyright for me and Adrian.
Sponsored by: Nginx, Inc.
of allocations in V_nipq is racy. To fix that, we would simply stop doing
book-keeping ourselves, and rely on UMA doing that. There could be a
slight overcommit due to caches, but that isn't a big deal.
o V_nipq and V_maxnipq go away.
o net.inet.ip.fragpackets is now just SYSCTL_UMA_CUR()
o net.inet.ip.maxfragpackets could have been just SYSCTL_UMA_MAX(), but
historically it has special semantics about values of 0 and -1, so
provide sysctl_maxfragpackets() to handle these special cases.
o If zone limit lowers either due to net.inet.ip.maxfragpackets or due to
kern.ipc.nmbclusters, then new function ipq_drain_tomax() goes over
buckets and frees the oldest packets until we are in the limit.
The code that (incorrectly) did that in ip_slowtimo() is removed.
o ip_reass() doesn't check any limits and calls uma_zalloc(M_NOWAIT).
If it fails, a new function ipq_reuse() is called. This function will
find the oldest packet in the currently locked bucket, and if there is
none, it will search in other buckets until success.
Sponsored by: Nginx, Inc.
free a fragment, provide two inline functions that do that for us:
ipq_drop() and ipq_timeout().
o Rename ip_free_f() to ipq_free() to match the name scheme of IP reassembly.
o Remove assertion from ipq_free(), since it requires extra argument to be
passed, but locking scheme is simple enough and function is static.
Sponsored by: Nginx, Inc.
This significantly improves performance on multi-core servers where there
is any kind of IPv4 reassembly going on.
glebius@ would like to see the locking moved to be attached to the reassembly
bucket, which would make it per-bucket + per-VNET, instead of being global.
I decided to keep it global for now as it's the minimal useful change;
if people agree / wish to migrate it to be per-bucket / per-VNET then please
do feel free to do so. I won't complain.
Thanks to Norse Corp for giving me access to much larger servers
to test this at across the 4 core boxes I have at home.
Differential Revision: https://reviews.freebsd.org/D2095
Reviewed by: glebius (initial comments incorporated into this patch)
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc (hardware)
header and not only partial flags and fields. Firewalls can attach
classification tags to the outgoing mbufs which should be copied to
all the new fragments. Else only the first fragment will be let
through by the firewall. This can easily be tested by sending a large
ping packet through a firewall. It was also discovered that VLAN
related flags and fields should be copied for packets traversing
through VLANs. This is all handled by "m_dup_pkthdr()".
Regarding the MAC policy check in ip_fragment(), the tag provided by
the originating mbuf is copied instead of using the default one
provided by m_gethdr().
Tested by: Karim Fodil-Lemelin <fodillemlinkarim at gmail.com>
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
PR: 7802
where we want to create a new IP datagram.
o Add support for RFC6864, which allows to set IP ID for atomic IP
datagrams to any value, to improve performance. The behaviour is
controlled by net.inet.ip.rfc6864 sysctl knob, which is enabled by
default.
o In case if we generate IP ID, use counter(9) to improve performance.
o Gather all code related to IP ID into ip_id.c.
Differential Revision: https://reviews.freebsd.org/D2177
Reviewed by: adrian, cy, rpaulo
Tested by: Emeric POUPON <emeric.poupon stormshield.eu>
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Relnotes: yes
to obtain IPv4 next hop address in tablearg case.
Add `fwd tablearg' support for IPv6. ipfw(8) uses INADDR_ANY as next hop
address in O_FORWARD_IP opcode for specifying tablearg case. For IPv6 we
still use this opcode, but when packet identified as IPv6 packet, we
obtain next hop address from dedicated field nh6 in struct table_value.
Replace hopstore field in struct ip_fw_args with anonymous union and add
hopstore6 field. Use this field to copy tablearg value for IPv6.
Replace spare1 field in struct table_value with zoneid. Use it to keep
scope zone id for link-local IPv6 addresses. Since spare1 was used
internally, replace spare0 array with two variables spare0 and spare1.
Use getaddrinfo(3)/getnameinfo(3) functions for parsing and formatting
IPv6 addresses in table_value. Use zoneid field in struct table_value
to store sin6_scope_id value.
Since the kernel still uses embedded scope zone id to represent
link-local addresses, convert next_hop6 address into this form before
return from pfil processing. This also fixes in6_localip() check
for link-local addresses.
Differential Revision: https://reviews.freebsd.org/D2015
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
draft-ietf-6man-enhanced-dad-13.
This basically adds a random nonce option (RFC 3971) to NS messages
for DAD probe to detect a looped back packet. This looped back packet
prevented DAD on some pseudo-interfaces which aggregates multiple L2 links
such as lagg(4).
The length of the nonce is set to 6 bytes. This algorithm can be disabled by
setting net.inet6.ip6.dad_enhanced sysctl to 0 in a per-vnet basis.
Reported by: hiren
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D1835
of packets. When the data payload length excluding any headers, of an
outgoing IPv4 packet exceeds PAGE_SIZE bytes, a special case in
ip_fragment() can kick in to optimise the outgoing payload(s). The
code which was added in r98849 as part of zero copy socket support
assumes that the beginning of any MTU sized payload is aligned to
where a MBUF's "m_data" pointer points. This is not always the case
and can sometimes cause large IPv4 packets, as part of ping replies,
to be split more than needed.
Instead of iterating the MBUFs to figure out how much data is in the
current chain, use the value already in the "m_pkthdr.len" field of
the first MBUF in the chain.
Reviewed by: ken @
Differential Revision: https://reviews.freebsd.org/D1893
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
Previous __alignment(4) allowed compiler to assume that operations are
performed on aligned region. On ARM processor, this led to alignment fault
as shown below:
trapframe: 0xda9e5b10
FSR=00000001, FAR=a67b680e, spsr=60000113
r0 =00000000, r1 =00000068, r2 =0000007c, r3 =00000000
r4 =a67b6826, r5 =a67b680e, r6 =00000014, r7 =00000068
r8 =00000068, r9 =da9e5bd0, r10=00000011, r11=da9e5c10
r12=da9e5be0, ssp=da9e5b60, slr=a054f164, pc =a054f2cc
<...>
udp_input+0x264: ldmia r5, {r0-r3, r6}
udp_input+0x268: stmia r12, {r0-r3, r6}
This was due to instructions which do not support unaligned access,
whereas for __alignment(2) compiler replaced ldmia/stmia with some
logically equivalent memcpy operations.
In fact, the assumption that 'struct ip' is always 4-byte aligned
is definitely false, as we have no impact on data alignment of packet
stream received.
Another possible solution would be to explicitely perform memcpy()
on objects of 'struct ip' type, which, however, would suffer from
performance drop, and be merely a problem hiding.
Please, note that this has nothing to do with
ARM32_DISABLE_ALIGNMENT_FAULTS option, but is related strictly to
compiler behaviour.
Submitted by: Wojciech Macek <wma@semihalf.com>
Reviewed by: glebius, ian
Obtained from: Semihalf
represents a context.
- Preserve name 'struct igmp_ifinfo' for a new structure, that will be stable
API between userland and kernel.
- Make sysctl_igmp_ifinfo() return the new 'struct igmp_ifinfo', instead of
old one, which had a bunch of internal kernel structures in it.
- Move all above declarations from in_var.h to igmp_var.h, since they are
private to IGMP code.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
and arp were being used. They basically would pass in the
mutex to the callout_init. Because they used this method
to the callout system, it was possible to "stop" the callout.
When flushing the table and you stopped the running callout, the
callout_stop code would return 1 indicating that it was going
to stop the callout (that was about to run on the callout_wheel blocked
by the function calling the stop). Now when 1 was returned, it would
lower the reference count one extra time for the stopped timer, then
a few lines later delete the memory. Of course the callout_wheel was
stuck in the lock code and would then crash since it was accessing
freed memory. By using callout_init(c, 1) we always get a 0 back
and the reference counting bug does not rear its head. We do have
to make a few adjustments to the callouts themselves though to make
sure it does the proper thing if rescheduled as well as gets the lock.
Commented upon by hiren and sbruno
See Phabricator D1777 for more details.
Commented upon by hiren and sbruno
Reviewed by: adrian, jhb and bz
Sponsored by: Netflix Inc.
when fragmenting IP packets to preserve the order of the packets in a
stream. Else the resulting fragments can be sent out of order when the
hardware supports multiple transmit rings.
Reviewed by: glebius @
MFC after: 1 week
Sponsored by: Mellanox Technologies
This fixes what seems like a simple oversight when the function was added in
r253210.
Reported by: Daniel Borkmann <dborkman@redhat.com>
Florian Westphal <fw@strlen.de>
Differential Revision: https://reviews.freebsd.org/D1628
Reviewed by: gnn
MFC after: 1 month
Sponsored by: Limelight Networks
We would like to acknowledge Gerasimos Dimitriadis who reported
the issue and Michael Tuexen who analyzed and provided the
fix.
Security: FreeBSD-SA-15:03.sctp
Security: CVE-2014-8613
Submitted by: tuexen
We would like to acknowledge Clement LECIGNE from Google Security Team and
Francisco Falcon from Core Security Technologies who discovered the issue
independently and reported to the FreeBSD Security Team.
Security: FreeBSD-SA-15:02.kmem
Security: CVE-2014-8612
Submitted by: tuexen
sys/netinet/ip_carp.c:
Add a "reason" string parameter to carp_set_state() and
carp_master_down_locked() allowing more specific logging
information to be passed into these apis.
Refactor existing state transition logging into a single
log call in carp_set_state().
Update all calls to carp_set_state() and
carp_master_down_locked() to pass an appropriate reason
string. For state transitions that were previously logged,
the output should be unchanged.
Submitted by: gibbs (original), asomers (updated)
MFC after: 1 week
Sponsored by: Spectra Logic
MFSpectraBSD: 1039697 on 2014/02/11 (original)
1049992 on 2014/03/21 (updated)
bits.
The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.
* net/rss_config.[ch] takes care of the generic bits for doing
configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.
This should have no functional impact on anyone currently using
the RSS support.
Differential Revision: D1383
Reviewed by: gnn, jfv (intel driver bits)
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.
Reviewed by: tuexen
Sponsored by: Nginx, Inc.
a) assumed that ifqueue length is measured in bytes, instead of packets
b) assumed that any interface has working ifqueue
c) incremented global counter instead of ifi_oqdrops
Sponsored by: Nginx, Inc.
DCTCP congestion control algorithm aims to maximise throughput and minimise
latency in data center networks by utilising the proportion of Explicit
Congestion Notification (ECN) marked packets received from capable hardware as a
congestion signal.
Highlights:
Implemented as a mod_cc(4) module.
ECN (Explicit congestion notification) processing is done differently from
RFC3168.
Takes one-sided DCTCP into consideration where only one of the sides is using
DCTCP and other is using standard ECN.
IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf
Submitted by: Midori Kato <katoon@sfc.wide.ad.jp> and
Lars Eggert <lars@netapp.com>
with help and modifications from
hiren
Differential Revision: https://reviews.freebsd.org/D604
Reviewed by: gnn
handle it in arc_output() instead of nd6_storelladdr().
* Remove IFT_ARCNET check from arpresolve() since arc_output() does not
use arpresolve() to handle broadcast/multicast. This check was there
since r84931. It looks like it was not used since r89099 (initial
import of Arcnet support where multicast is handled separately).
* Remove IFT_IEEE1394 case from nd6_storelladdr() since firewire_output()
calles nd6_storelladdr() for unicast addresses only.
* Remove IFT_ARCNET case from nd6_storelladdr() since arc_output() now
handles multicast by itself.
As a result, we have the following pattern: all non-ethernet-style
media have their own multicast map handling inside their appropriate
routines. On the other hand, arpresolve() (and nd6_storelladdr()) which
meant to be 'generic' ones de-facto handles ethernet-only multicast maps.
MFC after: 3 weeks
CARP devices are created with advskew set to '0' and once you set it to
any other value in the valid range (0..254) you can't set it back to zero.
The code in question is also used to prevent that zeroed values overwrite
the CARP defaults when a new CARP device is created. Since advskew already
defaults to '0' for newly created devices and the new value is guaranteed
to be within the valid range, it is safe to overwrite it here.
PR: 194672
Reported by: cmb@pfsense.org
In collaboration with: garga
Tested by: garga
MFC after: 2 weeks
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:
- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().
This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.
Differential Revision: https://reviews.freebsd.org/D1436
Reviewed by: glebius, trasz, gnn, bz
Sponsored by: EMC / Isilon Storage Division
* Make most of lltable_* methods 'normal' functions instead of inline
* Add lltable_get_<af|ifp>() functions to access given lltable fields
* Temporarily resurrect nd6_lookup() function
rather than passing them in by value.
The eventual aim is to do incremental hash construction rather than
all of the memcpy()'ing into a contiguous buffer for the hash
function, which does show up as taking quite a bit of CPU during
profiling.
Tested:
* a variety of laptops/desktop setups I have, with v6 connectivity
Differential Revision: D1404
Reviewed by: bz, rpaulo
(UTC) rather than the archaic (GMT) in comments. Except where the
comments are making fun of people doing this (and pedants who insist
on the new terms).
ipsec_getpolicybyaddr()
ipsec4_checkpolicy()
ip_ipsec_output()
ip6_ipsec_output()
The only flag used here was IP_FORWARDING.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
and make its prototype similar to ipsec6_process_packet.
The flags argument isn't used here, tunalready is always zero.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Remove check for presence PACKET_TAG_IPSEC_IN_DONE mbuf tag from
ip_ipsec_fwd(). PACKET_TAG_IPSEC_IN_DONE tag means that packet is
already handled by IPSEC code. This means that before IPSEC processing
it was destined to our address and security policy was checked in
the ip_ipsec_input(). After IPSEC processing packet has new IP
addresses and destination address isn't our own. So, anyway we can't
check security policy from the mbuf tag, because it corresponds
to different addresses.
We should check security policy that corresponds to packet
attributes in both cases - when it has a mbuf tag and when it has not.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
security policy. The changed block of code in ip*_ipsec_input() is
called when packet has ESP/AH header. Presence of
PACKET_TAG_IPSEC_IN_DONE mbuf tag in the same time means that
packet was already handled by IPSEC and reinjected in the netisr,
and it has another ESP/AH headers (encrypted twice?).
Since it was already processed by IPSEC code, the AH/ESP headers
was already stripped (and probably outer IP header was stripped too)
and security policy from the tdb_ident was applied to those headers.
It is incorrect to apply this security policy to current headers.
Also make ip_ipsec_input() prototype similar to ip6_ipsec_input().
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED mbuf tags. They aren't used in FreeBSD.
Instead check presence of PACKET_TAG_IPSEC_OUT_DONE mbuf tag. If it
is found, bypass security policy lookup as described in the comment.
PACKET_TAG_IPSEC_OUT_DONE tag added to mbuf when IPSEC code finishes
ESP/AH processing. Since it was already finished, this means the security
policy placed in the tdb_ident was already checked. And there is no reason
to check it again here.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
use llt_fill_sa_entry() llt method to store lle address in sa.
* Eliminate L3_ADDR macro and either reference IPv4/IPv6 address
directly from lle or use newly-created llt_fill_sa_entry().
* Do not store sockaddr inside arp/ndp lle anymore.
and explicit calls to RTENTRY_FREE_LOCKED()
* Use lltable_prefix_free() in arp_ifscrub to be consistent with nd6.
* Rename <lltable_|llt>_delete function to _delete_addr() to note that
this function is used to external callers. Make this function maintain
its own locking.
* Use lookup/unlink/clear call chain from internal callers instead of
delete_addr.
* Fix LLE_DELETED flag handling
cleanup including unlinking/freeing
* Relax locking in lltable_prefix_free_af/lltable_free
* Do not pass @llt to lle free callback: it is always NULL now.
* Unify arptimer/nd6_llinfo_timer: explicitly unlock lle avoiding
unlock/lock sequinces
* Do not pass unlocked lle to nd6_ns_output(): add nd6_llinfo_get_holdsrc()
to retrieve preferred source address from lle hold queue and pass it
instead of lle.
* Finally, make nd6_create() create and return unlocked lle
* Separate defrtr handling code from nd6_free():
use nd6_check_del_defrtr() to check if we need to keep entry instead of
performing GC,
use nd6_check_recalc_defrtr() to perform actual recalc on lle removal.
* Move isRouter handling from nd6_cache_lladdr() to separate
nd6_check_router()
* Add initial code to maintain lle runtime flags in sync.
does actual new lle creation without extensive locking and existing
lle search.
Move lle updating code from gigantic in_arpinput() to arp_update_llle()
and some other functions.
IPv6 changes to follow.
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
- Provide pru_ready function for TCP.
- Don't call tcp_output() from tcp_usr_send() if no ready data was put
into the socket buffer.
- In case of dropped connection don't try to m_freem() not ready data.
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
o Introduce a notion of "not ready" mbufs in socket buffers. These
mbufs are now being populated by some I/O in background and are
referenced outside. This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
freed.
o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
The sb_ccc stands for ""claimed character count", or "committed
character count". And the sb_acc is "available character count".
Consumers of socket buffer API shouldn't already access them directly,
but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
search.
o New function sbready() is provided to activate certain amount of mbufs
in a socket buffer.
A special note on SCTP:
SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs. Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them. The new
notion of "not ready" data isn't supported by SCTP. Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does. This was discussed with rrs@.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
use rwlock accessible via external functions
(IF_AFDATA_CFG_* -> if_afdata_cfg_*()) for all control plane tasks
use rmlock (IF_AFDATA_RUN_*) for fast-path lookups.
use rwlock accessible via external functions
(IN_IFADDR_CFG_* -> in_ifaddr_cfg_*()) for all control plane tasks
use rmlock (IN_IFADDR_RUN_*) for fast-path lookups.
Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.
Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.
New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.
route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.
PR: 194238
MFC after: 1 month
CR: D1125
* struct llentry is now basically split into 2 pieces:
all fields within 64 bytes (amd64) are now protected by both
ifdata lock AND lle lock, e.g. you require both locks to be held
exclusively for modification. All data necessary for fast path
operations is kept here. Some fields were added:
- r_l3addr - makes lookup key liev within first 64 bytes.
- r_flags - flags, containing pre-compiled decision whether given
lle contains usable data or not. Current the only flag is RLLE_VALID.
- r_len - prepend data len, currently unused
- r_kick - used to provide feedback to control plane (see below).
All other fields are protected by lle lock.
* Add simple state machine for ARP to handle "about to expire" case:
Current model (for the fast path) is the following:
- rlock afdata
- find / rlock rte
- runlock afdata
- see if "expire time" is approaching
(time_uptime + la->la_preempt > la->la_expire)
- if true, call arprequest() and decrease la_preempt
- store MAC and runlock rte
New model (data plane):
- rlock afdata
- find rte
- check if it can be used using r_* fields only
- if true, store MAC
- if r_kick field != 0 set it to 0.
- runlock afdata
New mode (control plane):
- schedule arptimer to be called in (V_arpt_keep - V_arp_maxtries)
seconds instead of V_arpt_keep.
- on first timer invocation change state from ARP_LLINFO_REACHABLE
to ARP_LLINFO_VERIFY, sets r_kick to 1 and shedules next call in
V_arpt_rexmit (default to 1 sec).
- on subsequent timer invocations in ARP_LLINFO_VERIFY state, checks
for r_kick value: reschedule if not changed, and send arprequest()
if set to zero (e.g. entry was used).
* Convert IPv4 path to use new single-lock approach. IPv6 bits to follow.
* Slow down in_arpinput(): now valid reply will (in most cases) require
acquiring afdata WLOCK twice. This is requirement for storing changed
lle data. This change will be slightly optimized in future.
* Provide explicit hash link/unlink functions for both ipv4/ipv6 code.
This will probably be moved to generic lle code once we have per-AF
hashing callback inside lltable.
* Perform lle unlink on deletion immediately instead of delaying it to
the timer routine.
* Make r244183 more explicit: use new LLE_CALLOUTREF flag to indicate the
presence of lle reference used for safe callout calls.
lla_lookup(LLE_CREATE) -> lla_create
lla_lookup(LLE_DELETE) -> lla_delete
Assume lla_create to return LLE_EXCLUSIVE lock for lle.
* Rework lla_rt_output to perform all lle changes under afdata WLOCK.
* change arp_ifscrub() ackquire afdata WLOCK, the same as arp_ifinit().
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Initially in_matrote() in_clsroute() in their current state was introduced by
r4105 20 years ago. Instead of deleting inactive routes immediately, we kept them
in route table, setting RTPRF_OURS flag and some expire time. After that, either
GC came or RTPRF_OURS got removed on first-packet. It was a good solution
in that days (and probably another decade after that) to keep TCP metrics.
However, after moving metrics to TCP hostcache in r122922, most of in_rmx
functionality became unused. It might had been used for flushing icmp-originated
routes before rte mutexes/refcounting, but I'm not sure about that.
So it looks like this is nearly impossible to make GC do its work nowadays:
in_rtkill() ignores non-RTPRF_OURS routes.
route can only become RTPRF_OURS after dropping last reference via rtfree()
which calls in_clsroute(), which, it turn, ignores UP and non-RTF_DYNAMIC routes.
Dynamic routes can still be installed via received redirect, but they
have default lifetime (no specific rt_expire) and no one has another trie walker
to call RTFREE() on them.
So, the changelist:
* remove custom rnh_match / rnh_close matching function.
* remove all GC functions
* partially revert r256695 (proto3 is no more used inside kernel,
it is not possible to use rt_expire from user point of view, proto3 support
is not complete)
* Finish r241884 (similar to this commit) and remove remaining IPv6 parts
MFC after: 1 month
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.
No objections from: net@
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.
Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.
Sponsored by: Yandex LLC
by r6399 to enhance expiring large number of host cache routes.
Since we don't have route cloning anymore and no one altered
V_rtq_toomany default (which is 128) in nearly 20 years, I assume
this can be safely deleted.
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.
gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
work at system startup. But this also removes support for having tunnels with
the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
Use our standard ioctls for tunnels.
me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);
PR: 164475
Differential Revision: D1023
No objections from: net@
Relnotes: yes
Sponsored by: Yandex LLC
Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.
We currrently have 3 places where MTU is altered:
1) route addition.
In this case domain overrides radix _addroute callback (in[6]_addroute)
and all necessary checks/fixes are/can be done there.
2) route change (especially, GW change).
In this case, there are no explicit per-domain calls, but one can
override rte by setting ifa_rtrequest hook to domain handler
(inet6 does this).
3) ifconfig ifaceX mtu YYYY
In this case, we have no callbacks, but ip[6]_output performes runtime
checks and decreases rt_mtu if necessary.
Generally, the goals are to be able to handle all MTU changes in
control plane, not in runtime part, and properly deal with increased
interface MTU.
This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)
* The latter is done by using new dom_ifmtu callback since
if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
for some cases, so we need an abstract way to know maximum MTU size
for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
use this functions on new non-inserted rte.
More changes will follow soon.
MFC after: 1 month
Sponsored by: Yandex LLC
According to IANA RPC uaddr registry, there are no AFs
except IPv4 and IPv6, so it's not worth being too abstract here.
Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries.
Use own initialization without relying on domattach code.
While I admit that this was one of the rare places in kernel
networking code which really was capable of doing multi-AF
without any AF-depended code, it is not possible anymore to
rely on dom* code.
While here, change terrifying "Invalid radix node head, rn:" message,
to different non-understandable "netcred already exists for given addr/mask",
but less terrifying. Since we know that rn_addaddr() returns NULL if
the same record already exists, we should provide more friendly error.
MFC after: 1 month
* Remove &ifpp from ip6_output() in favor of ri->ri_nh_info
* Provide different wrappers to in6_selectsrc:
Currently it is used by 2 differenct type of customers:
- socket-based one, which all are unsure about provided
address scope and
- in-kernel ones (ND code mostly), which don't have
any sockets, options, crededentials, etc.
So, we provide two different wrappers to in6_selectsrc()
returning select source.
* Make different versions of selectroute():
Currenly selectroute() is used in two scenarios:
- SAS, via in6_selecsrc() -> in6_selectif() -> selectroute()
- output, via in6_output -> wrapper -> selectroute()
Provide different versions for each customer:
- fib6_lookup_nh_basic()-based in6_selectif() which is
capable of returning interface only, without MTU/NHOP/L2
calculations
- full-blown fib6_selectroute() with cached route/multipath/
MTU/L2
* Stop using routing table for link-local address lookups
* Add in6_ifawithifp_lla() to make for-us check faster for link-local
* Add in6_splitscope / in6_setllascope for faster embed/deembed scopes
The check was recommened in the draft-ietf-ngtrans-mech-05.txt. But it isn't
clear, should it compare the source with all direct broadcast addresses in the
system or not.
RFC 4213 says it is enough to verify that the source address is the address
of the encapsulator, as configured on the decapsulator. And this verification
can be extended by administrator with any other forms of IPv4 ingress filtering.
Discussed with: glebius, melifaro
Sponsored by: Yandex LLC
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.
Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The kernel tracks syscall users so that modules can safely unregister them.
But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.
Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.
Reviewed by: kib (previous version)
MFC after: 2 weeks
* Use NHF_ namespace for all nhop flags
* Rename nhop_data -> nhop_prepend
* Rename fib4_lookup_nh_extended -> fib4_lookup_nh_ext
* Add "flags" argument to fib4_lookup_nh_ext() to specify whether we want
returned nh_ext structure to be refcounted or not.
that sctp_med_chunk_output() always initialized the reason_code
instead of relying on the caller.
The variable is only used for debugging purpose.
This issue was reported by Peter Bostroem from Google.
MFC after: 3 days
structure without doinf L2 resolve. It also requires freeing
references by calling fib4_free_nh_ext().
Convert in_pcbladdr() to use it.
Convert tcp_maxmtu() to use it.
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The goals of the new API is to provide consumers with minimal
needed information, but as fast as possible. So we provide
full nexthop info copied into alighed on-cache structure
instead of rte/ia pointers, their refcounts and locks.
This does not provide solution for protecting from egress
ifp destruction, but does not make it any worse.
Current changes:
nhops:
Add fib4_lookup_prepend() function which stores either full
L2+L3 prepend info (e.g. MAC header in case of plain IPv4) or
L3 info with NH_FLAGS_L2_INCOMPLETE flag indicating that no valid L2
info exists and we have to take "slow" path.
ip_output:
Currently ip[ 46]_output consumers use 'struct route' for
the following purposes:
1) double lookup avoidance(route caching)
2) plain route caching
3) get path MTU to be able to notify source.
The former pattern is mostly used by various tunnels
(gif, gre, stf). (Actually, gre is the only remaining,
others were already converted. Their locking model did
not scale good enogh to benefit from such caching, so
we have (temporarily) removed it without any performance
loss).
Plain route caching used by SCTP is simply wrong and should be removed.
Temporary break it for now just to be able to compile.
Optimize path mtu reporting by providing it in new 'route_info' stucture.
Minimize games with @ia locking/refcounting for route lookup:
add special nhop[46]_extended structure to store more route attributes.
Pointer to given structure can be passed to fib4_lookup_prepend() to indicate
we want this info (we actually needs it for UDP and raw IP).
ether_output:
Provide light-weight ether_output2() call to deal with
transmitting L2 frame (e.g. properly handle broadcast/simloop/bridge/
other L2 hooks before actually transmitting frame by if_transmit()).
Add a hack based on new RT_NHOP ro_flag to distinguish which version should
we call. Better way is probably to add a new "if_output_frame" driver
callbacks.
Next steps:
* Convert ip_fastfwd part
* Implement auto-growing array for per-radix nexthops
* Implement LLE tracking for nexthop calculations to be able to
immediately provide all necessary info in single route lookup
for gateway routes
* Switch radix locking scheme to runtime/cfg lock
* Implement multipath support for rtsock
* Implement "tracked nexthops" for tunnels (e.g. _proper_
nexthop caching)
* Add IPv6 support for remaining parts (postponed not to
interfere with user/ae/inet6 branch)
* Consider adding "if_output_frame" driver call to
ease logical frame pushing.
sent incoming stream reset request was responded with failed
or denied.
Thanks to Peter Bostroem from Google for reporting the issue.
MFC after: 3 days
o convert to if_transmit;
o use rmlock to protect access to gif_softc;
o use sx lock to protect from concurrent ioctls;
o remove a lot of unneeded and duplicated code;
o remove cached route support (it won't work with concurrent io);
o style fixes.
Reviewed by: melifaro
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
received any RST packet. Do not set error to ECONNRESET in this case.
Differential Revision: https://reviews.freebsd.org/D879
Reviewed by: rpaulo, adrian
Approved by: jhb (mentor)
Sponsored by: Verisign, Inc.
data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT;
the latter both unnecessarily exposes mbuf-allocator internals in the
protocol stack and is also insufficient to catch all cases of
non-writability.
(NB: m_pullup() does not actually guarantee that a writable mbuf is
returned, so further refinement of all of these code paths continues to
be required.)
Reviewed by: bz
MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900
These are needed for the forthcoming vxlan implementation. The context
pointer means we do not have to use a spare pointer field in the inpcb,
and the source address is required to populate vxlan's forwarding table.
While I highly doubt there is an out of tree consumer of the UDP
tunneling callback, this change may be a difficult to eventually MFC.
Phabricator: https://reviews.freebsd.org/D383
Reviewed by: gnn
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.
The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
sys_sctp_peeloff
sys_sctp_generic_sendmsg
sys_sctp_generic_sendmsg_iov
sys_sctp_generic_recvmsg
The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.
As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file. A proper
prototype has been added to the sys/socketvar.h header file.
API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)
Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.