function. The change is mostly mechanical with the following exception:
Last piece of nd6_resolve_slow() was refactored: ND6_LLINFO_PERMANENT
condition was removed as always-true, explicit ND6_LLINFO_NOSTATE ->
ND6_LLINFO_INCOMPLETE state transition was removed as duplicate.
Reviewed by: ae
Sponsored by: Yandex LLC
* prepare gateway before insertion
* use RTM_CHANGE instead of explicit find/change route
* Remove fib argument from ifa_switch_loopback_route added in r264887:
if old ifp fib differes from new one, that the caller
is doing something wrong
* Make ifa_*_loopback_route call single ifa_maintain_loopback_route().
without holding afdata wlock
* convert per-af delete_address callback to global lltable_delete_entry() and
more low-level "delete this lle" per-af callback
* fix some bugs/inconsistencies in IPv4/IPv6 ifscrub procedures
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3573
To make driver programming easier the TSO limits are changed to
reflect the values used in the BUSDMA tag a network adapter driver is
using. The TCP/IP network stack will subtract space for all linklevel
and protocol level headers and ensure that the full mbuf chain passed
to the network adapter fits within the given limits.
Implementation notes:
If a network adapter driver needs to fixup the first mbuf in order to
support VLAN tag insertion, the size of the VLAN tag should be
subtracted from the TSO limit. Else not.
Network adapters which typically inline the complete header mbuf could
technically transmit one more segment. This patch does not implement a
mechanism to recover the last segment for data transmission. It is
believed when sufficiently large mbuf clusters are used, the segment
limit will not be reached and recovering the last segment will not
have any effect.
The current TSO algorithm tries to send MTU-sized packets, where the
MTU typically is 1500 bytes, which gives 1448 bytes of TCP data
payload per packet for IPv4. That means if the TSO length limitiation
is set to 65536 bytes, there will be a data payload remainder of
(65536 - 1500) mod 1448 bytes which is equal to 324 bytes. Trying to
recover total TSO length due to inlining mbuf header data will not
have any effect, because adding or removing the ETH/IP/TCP headers
to or from 324 bytes will not cause more or less TCP payload to be
TSO'ed.
Existing network adapter limits will be updated separately.
Differential Revision: https://reviews.freebsd.org/D3458
Reviewed by: rmacklem
MFC after: 2 weeks
to provide the TCPDEBUG functionality with pure DTrace.
Reviewed by: rwatson
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: D3530
in the routine, which queues an ERROR chunk, instead on relyinh
on the callers to do so. Since one caller missed this, this actially
fixes a bug.
MFC after: 1 week
because the RSS hash may need to be recalculated.
Submitted by: Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision: https://reviews.freebsd.org/D3564
o Unlike xor, in Jenkins hash every bit of input affects virtually
every bit of output, thus salting the hash actually works. With
xor salting only provides a false sense of security, since if
hash(x) collides with hash(y), then of course, hash(x) ^ salt
would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
ideal.
TCP connection setup/teardown benchmark has shown a 10% increase
with default hash size, and with bigger hashes that still provide
possibility for collisions. With enormous hash size, when dataset is
by an order of magnitude smaller than hash size, the benchmark has
shown 4% decrease in performance decrease, which is expected and
acceptable.
Noticed by: Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by: jch
Reviewed by: jch, pkelsey, delphij
Security: strengthens protection against hash collision DoS
Sponsored by: Nginx, Inc.
workaround for a callout(9) issue, it turns out it is instead the right
way to use callout in mpsafe mode without using callout_drain().
r284245 commit message:
Fix a callout race condition introduced in TCP timers callouts with r281599.
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return values have also to be checked.
Differential Revision: https://reviews.freebsd.org/D2763
do not do what one would expect by name. Prefix them with "udp_"
to at least obviously limit the scope.
This is a non-functional change.
Reviewed by: gnn, rwatson
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3505
timers callouts with r281599."
r281599 fixed a TCP timer race condition, but due a callout(9) bug
it also introduced another race condition workaround-ed with r284245.
The callout(9) bug being fixed with r286880, we can now revert the
workaround (r284245).
Differential Revision: https://reviews.freebsd.org/D2079 (Initial change)
Differential Revision: https://reviews.freebsd.org/D2763 (Workaround)
Differential Revision: https://reviews.freebsd.org/D3078 (Fix)
Sponsored by: Verisign, Inc.
MFC after: 2 weeks
Before that, the logic besides lle_create() was the following:
return existing if found, create if not. This behaviour was error-prone
since we had to deal with 'sudden' static<>dynamic lle changes.
This commit fixes bunch of different issues like:
- refcount leak when lle is converted to static.
Simple check case:
console 1:
while true;
do for i in `arp -an|awk '$4~/incomp/{print$2}'|tr -d '()'`;
do arp -s $i 00:22:44:66:88:00 ; arp -d $i;
done;
done
console 2:
ping -f any-dead-host-in-L2
console 3:
# watch for memory consumption:
vmstat -m | awk '$1~/lltable/{print$2}'
- possible problems in arptimer() / nd6_timer() when dropping/reacquiring
lock.
New logic explicitly handles use-or-create cases in every lla_create
user. Basically, most of the changes are purely mechanical. However,
we explicitly avoid using existing lle's for interface/static LLE records.
* While here, call lle_event handlers on all real table lle change.
* Create lltable_free_entry() calling existing per-lltable
lle_free_t callback for entry deletion
This change isolates the most common case (e.g. successful lookup)
from more complicates scenarios. It also (tries to) make code
more simple by avoiding retry: cycle.
The actual goal is to prepare code to the upcoming change that will
allow LL address retrieval without acquiring LLE lock at all.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D3383
separate bunch of functions. The goal is to isolate actual lle
updates to permit more fine-grained locking.
Do all lle link-level update under AFDATA wlock.
Sponsored by: Yandex LLC
This permits us having all (not fully true yet) all the info
needed in lookup process in first 64 bytes of 'struct llentry'.
struct llentry layout:
BEFORE:
[rwlock .. state .. state .. MAC ] (lle+1) [sockaddr_in[6]]
AFTER
[ in[6]_addr MAC .. state .. rwlock ]
Currently, address part of struct llentry has only 16 bytes for the key.
However, lltable does not restrict any custom lltable consumers with long
keys use the previous approach (store key at (lle+1)).
Sponsored by: Yandex LLC
* Split lltable_init() into lltable_allocate_htbl() (alloc
hash table with default callbacks) and lltable_link() (
links any lltable to the list).
* Switch from LLTBL_HASHTBL_SIZE to per-lltable hash size field.
* Move lltable setup to separate functions in in[6]_domifattach.
differences between projects/routing and HEAD.
This commit tries to keep code logic the same while changing underlying
code to use unified callbacks.
* Add llt_foreach_entry method to traverse all entries in given llt
* Add llt_dump_entry method to export particular lle entry in sysctl/rtsock
format (code is not indented properly to minimize diff). Will be fixed
in the next commits.
* Add llt_link_entry/llt_unlink_entry methods to link/unlink particular lle.
* Add llt_fill_sa_entry method to export address in the lle to sockaddr
format.
* Add llt_hash method to use in generic hash table support code.
* Add llt_free_entry method which is used in llt_prefix_free code.
* Prepare for fine-grained locking by separating lle unlink and deletion in
lltable_free() and lltable_prefix_free().
* Provide lltable_get<ifp|af>() functions to reduce direct 'struct lltable'
access by external callers.
* Remove @llt agrument from lle_free() lle callback since it was unused.
* Temporarily add L3_CADDR() macro for 'const' sockaddr typecasting.
* Switch to per-af hashing code.
* Rename LLE_FREE_LOCKED() callback from in[6]_lltable_free() to
in_[6]lltable_destroy() to avoid clashing with llt_free_entry() method.
Update description from these functions.
* Use unified lltable_free_entry() function instead of per-af one.
Reviewed by: ae
This fixes a panic during 'sysctl -a' on VIMAGE kernels.
The tcp_reass_zone variable is not VNET_DEFINE() so we can not mark it as a VNET
variable (with CTLFLAG_VNET).
* Move interface route cleanup to route.c:rt_flushifroutes()
* Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users
to use new rt_foreach_fib() instead of hand-rolling cycles.
* Move lle creation/deletion from lla_lookup to separate functions:
lla_lookup(LLE_CREATE) -> lla_create
lla_lookup(LLE_DELETE) -> lla_delete
lla_create now returns with LLE_EXCLUSIVE lock for lle.
* Provide typedefs for new/existing lltable callbacks.
Reviewed by: ae
Do not pass 'dst' sockaddr to ip[6]_mloopback:
- We have explicit check for AF_INET in ip_output()
- We assume ip header inside passed mbuf in ip_mloopback
- We assume ip6 header inside passed mbuf in ip6_mloopback
Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
tcp_notify() being called from in6_pcbnotify().
Reported by: Larry Rosenman <ler@lerctr.org>
Submitted by: markj, jch
- The existing TCP INP_INFO lock continues to protect the global inpcb list
stability during full list traversal (e.g. tcp_pcblist()).
- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
and free) and inpcb global counters.
It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.
PR: 183659
Differential Revision: https://reviews.freebsd.org/D2599
Reviewed by: jhb, adrian
Tested by: adrian, nitroboost-gmail.com
Sponsored by: Verisign, Inc.
When firewalls force a reloop of packets and the caller supplied a route the reference to the route might be reduced twice creating issues.
This is especially the scenario when a packet is looped because of operation in the firewall but the new route lookup gives a down route.
Differential Revision: https://reviews.freebsd.org/D3037
Reviewed by: gnn
Approved by: gnn(mentor)
ip_output has a big chunk of code used to handle special cases with pfil consumers which also forces a reloop on it.
Gather all this code together to make it readable and properly handle the reloop cases.
Some of the issues identified:
M_IP_NEXTHOP is not handled properly in existing code.
route reference leaking is possible with in FIB number change
route flags checking is not consistent in the function
Differential Revision: https://reviews.freebsd.org/D3022
Reviewed by: gnn
Approved by: gnn(mentor)
MFC after: 4 weeks
non-inline urgent data and introduce an mbuf exhaustion attack vector
similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs.
Address the issue described in FreeBSD-SA-15:15.tcp.
Reviewed by: glebius
Approved by: so
Approved by: jmallett (mentor)
Security: FreeBSD-SA-15:15.tcp
Sponsored by: Norse Corp, Inc.
ip_encap already has inspected mbuf's data, at least an IP header.
And it is safe to use mtod() and do direct access to needed fields.
Add M_ASSERTPKTHDR() to gif_encapcheck(), since the code expects that
mbuf has a packet header.
Move the code from gif_validate[46] into in[6]_gif_encapcheck(), also
remove "martian filters" checks. According to RFC 4213 it is enough to
verify that the source address is the address of the encapsulator, as
configured on the decapsulator.
Reviewed by: melifaro
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Both are used to protect access to IP addresses lists and they can be
acquired for reading several times per packet. To reduce lock contention
it is better to use rmlock here.
Reviewed by: gnn (previous version)
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D3149
1) We were not handling (or sending) the IN_PROGRESS case if
the other side (or our side) was not able to reset (awaiting more data).
2) We would improperly send a stream-reset when we should not. Not
waiting until the TSN had been assigned when data was inqueue.
Reviewed by: tuexen
lock on the INP before calling the tunnel protocol, else a LOR
may occur (it does with SCTP for sure). Instead we must acquire a
ref count and release the lock, taking care to allow for the case
where the UDP socket has gone away and *not* unlocking since the
refcnt decrement on the inp will do the unlock in that case.
Reviewed by: tuexen
MFC after: 3 weeks
a set of differentiated services, set IPTOS_PREC_* macros using
IPTOS_DSCP_* macro definitions.
While here, add IPTOS_DSCP_VA macro according to RFC 5865.
Differential Revision: https://reviews.freebsd.org/D3119
Reviewed by: gnn
scaling code does not use an uninitialized timestamp echo reply value
from the stack when timestamps are not enabled.
Differential Revision: https://reviews.freebsd.org/D3060
Reviewed by: hiren
Approved by: jmallett (mentor)
MFC after: 3 days
Sponsored by: Norse Corp, Inc.
apparently neither clang nor gcc complain about this.
But clang intis the var to NULL correctly while gcc on at least mips does not.
Correct the undefined behavior by initializing the variable properly.
PR: 201371
Differential Revision: https://reviews.freebsd.org/D3036
Reviewed by: gnn
Approved by: gnn(mentor)
ip_forward() does a route lookup for testing this packet can be sent to a known destination,
it also can do another route lookup if it detects that an ICMP redirect is needed,
it forgets all of this and handovers to ip_output() to do the same lookup yet again.
This optimisation just does one route lookup during the forwarding path and handovers that to be considered by ip_output().
Differential Revision: https://reviews.freebsd.org/D2964
Approved by: ae, gnn(mentor)
MFC after: 1 week
condition.
If you send a 0-length packet, but there is data is the socket buffer, and
neither the rexmt or persist timer is already set, then activate the persist
timer.
PR: 192599
Differential Revision: D2946
Submitted by: jlott at averesystems dot com
Reviewed by: jhb, jch, gnn, hiren
Tested by: jlott at averesystems dot com, jch
MFC after: 2 weeks
There is small window, when encap_detach() can free matched entry
directly after we release encapmtx. Instead of use pointer to the
matched entry, save pointers to needed variables from this entry
and use them after release mutex.
Pass argument stored in the encaptab entry to encap_fillarg(), instead
of pointer to matched entry. Also do not allocate new mbuf tag, when
argument that we plan to save in this tag is NULL.
Also make encaptab variable static.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
to be transmitted but the arp cache entry expired, which triggers an arp request
to be sent, the bpf code might want to sleep but crash the system due
to a non sleep lock held from the arp entry not released properly.
Release the lock before calling the arp request code to solve the issue
as is done on all the other code paths.
PR: 200323
Approved by: ae, gnn(mentor)
MFC after: 1 week
Sponsored by: Netgate
Differential Revision: https://reviews.freebsd.org/D2828
continue sending on the same net.
This fixes a bug where an invalid mbuf chain was constructed, if a
full size frame of control chunks should be sent and there is a
output error.
Based on a discussion with rrs@, change move to the next net. This fixes
the bug and improves the behaviour.
Thanks to Irene Ruengeler for spending a lot of time in narrowing this
problem down.
MFC after: 3 days
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return values have also to be checked.
Differential Revision: https://reviews.freebsd.org/D2763
Reviewed by: hiren
Approved by: hiren
MFC after: 1 day
Sponsored by: Verisign, Inc.
the scope.
This fixes a problem when a client with a global address
connects to a server with a private address.
Thanks to Irene Ruengeler in helping me to find the issue.
MFC after: 3 days
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
gif(4) interface. Add new option "ignore_source" for gif(4) interface.
When it is enabled, gif's encapcheck function requires match only for
packet's destination address.
Differential Revision: https://reviews.freebsd.org/D2004
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
Although this is not important to the rest of the TCP processing
it is a conveneint way to make the DTrace state-transition probe
catch this important state change.
MFC after: 1 week
Currently we have tables identified by their names in userland
with internal kernel-assigned indices. This works the following way:
When userland wishes to communicate with kernel to add or change rule(s),
it makes indexed sorted array of table names
(internally ipfw_obj_ntlv entries), and refer to indices in that
array in rule manipulation.
Prior to committing new rule to the ruleset kernel
a) finds all referenced tables, bump their refcounts and change
values inside the opcodes to be real kernel indices
b) auto-creates all referenced but not existing tables and then
do a) for them.
Kernel does almost the same when exporting rules to userland:
prepares array of used tables in all rules in range, and
prepends it before the actual ruleset retaining actual in-kernel
indexes for that.
There is also special translation layer for legacy clients which is
able to provide 'real' indices for table names (basically doing atoi()).
While it is arguable that every subsystem really needs names instead of
numbers, there are several things that should be noted:
1) every non-singleton subsystem needs to store its runtime state
somewhere inside ipfw chain (and be able to get it fast)
2) we can't assume object numbers provided by humans will be dense.
Existing nat implementation (O(n) access and LIST inside chain) is a
good example.
Hence the following:
* Convert table-centric rewrite code to be more generic, callback-based
* Move most of the code from ip_fw_table.c to ip_fw_sockopt.c
* Provide abstract API to permit subsystems convert their objects
between userland string identifier and in-kernel index.
(See struct opcode_obj_rewrite) for more details
* Create another per-chain index (in next commit) shared among all subsystems
* Convert current NAT44 implementation to use new API, O(1) lookups,
shared index and names instead of numbers (in next commit).
Sponsored by: Yandex LLC
When we are passing mbuf to IPSec processing via ipsec[46]_process_packet(),
we hold one reference to security policy and release it just after return
from this function. But IPSec processing can be deffered and when we release
reference to security policy after ipsec[46]_process_packet(), user can
delete this security policy from SPDB. And when IPSec processing will be
done, xform's callback function will do access to already freed memory.
To fix this move KEY_FREESP() into callback function. Now IPSec code will
release reference to SP after processing will be finished.
Differential Revision: https://reviews.freebsd.org/D2324
No objections from: #network
Sponsored by: Yandex LLC
- Use the carp_sx to serialize not only CARP ioctls, but also carp_attach()
and carp_detach().
- Use cif_mtx to lock only access to those the linked list.
- These locking changes allow us to do some memory allocations with M_WAITOK
and also properly call callout_drain() in carp_destroy().
- In carp_attach() assert that ifaddr isn't attached. We always come here
with a pristine address from in[6]_control().
Reviewed by: oleg
Sponsored by: Nginx, Inc.
TCP timers:
- Add a reference from tcpcb to its inpcb
- Defer tcpcb deletion until TCP timers have finished
Differential Revision: https://reviews.freebsd.org/D2079
Submitted by: jch, Marc De La Gueronniere <mdelagueronniere@verisign.com>
Reviewed by: imp, rrs, adrian, jhb, bz
Approved by: jhb
Sponsored by: Verisign, Inc.
sequential IP ID case (e.g. ping -f), distribution fell into 8-10 buckets
out of 64. With Jenkins hash, distribution is even.
o Add random seed to the hash.
Sponsored by: Nginx, Inc.
function names have changed and comments are reformatted or added, but
there is no functional change.
Claim copyright for me and Adrian.
Sponsored by: Nginx, Inc.
of allocations in V_nipq is racy. To fix that, we would simply stop doing
book-keeping ourselves, and rely on UMA doing that. There could be a
slight overcommit due to caches, but that isn't a big deal.
o V_nipq and V_maxnipq go away.
o net.inet.ip.fragpackets is now just SYSCTL_UMA_CUR()
o net.inet.ip.maxfragpackets could have been just SYSCTL_UMA_MAX(), but
historically it has special semantics about values of 0 and -1, so
provide sysctl_maxfragpackets() to handle these special cases.
o If zone limit lowers either due to net.inet.ip.maxfragpackets or due to
kern.ipc.nmbclusters, then new function ipq_drain_tomax() goes over
buckets and frees the oldest packets until we are in the limit.
The code that (incorrectly) did that in ip_slowtimo() is removed.
o ip_reass() doesn't check any limits and calls uma_zalloc(M_NOWAIT).
If it fails, a new function ipq_reuse() is called. This function will
find the oldest packet in the currently locked bucket, and if there is
none, it will search in other buckets until success.
Sponsored by: Nginx, Inc.
free a fragment, provide two inline functions that do that for us:
ipq_drop() and ipq_timeout().
o Rename ip_free_f() to ipq_free() to match the name scheme of IP reassembly.
o Remove assertion from ipq_free(), since it requires extra argument to be
passed, but locking scheme is simple enough and function is static.
Sponsored by: Nginx, Inc.
This significantly improves performance on multi-core servers where there
is any kind of IPv4 reassembly going on.
glebius@ would like to see the locking moved to be attached to the reassembly
bucket, which would make it per-bucket + per-VNET, instead of being global.
I decided to keep it global for now as it's the minimal useful change;
if people agree / wish to migrate it to be per-bucket / per-VNET then please
do feel free to do so. I won't complain.
Thanks to Norse Corp for giving me access to much larger servers
to test this at across the 4 core boxes I have at home.
Differential Revision: https://reviews.freebsd.org/D2095
Reviewed by: glebius (initial comments incorporated into this patch)
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc (hardware)
header and not only partial flags and fields. Firewalls can attach
classification tags to the outgoing mbufs which should be copied to
all the new fragments. Else only the first fragment will be let
through by the firewall. This can easily be tested by sending a large
ping packet through a firewall. It was also discovered that VLAN
related flags and fields should be copied for packets traversing
through VLANs. This is all handled by "m_dup_pkthdr()".
Regarding the MAC policy check in ip_fragment(), the tag provided by
the originating mbuf is copied instead of using the default one
provided by m_gethdr().
Tested by: Karim Fodil-Lemelin <fodillemlinkarim at gmail.com>
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
PR: 7802
where we want to create a new IP datagram.
o Add support for RFC6864, which allows to set IP ID for atomic IP
datagrams to any value, to improve performance. The behaviour is
controlled by net.inet.ip.rfc6864 sysctl knob, which is enabled by
default.
o In case if we generate IP ID, use counter(9) to improve performance.
o Gather all code related to IP ID into ip_id.c.
Differential Revision: https://reviews.freebsd.org/D2177
Reviewed by: adrian, cy, rpaulo
Tested by: Emeric POUPON <emeric.poupon stormshield.eu>
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Relnotes: yes
to obtain IPv4 next hop address in tablearg case.
Add `fwd tablearg' support for IPv6. ipfw(8) uses INADDR_ANY as next hop
address in O_FORWARD_IP opcode for specifying tablearg case. For IPv6 we
still use this opcode, but when packet identified as IPv6 packet, we
obtain next hop address from dedicated field nh6 in struct table_value.
Replace hopstore field in struct ip_fw_args with anonymous union and add
hopstore6 field. Use this field to copy tablearg value for IPv6.
Replace spare1 field in struct table_value with zoneid. Use it to keep
scope zone id for link-local IPv6 addresses. Since spare1 was used
internally, replace spare0 array with two variables spare0 and spare1.
Use getaddrinfo(3)/getnameinfo(3) functions for parsing and formatting
IPv6 addresses in table_value. Use zoneid field in struct table_value
to store sin6_scope_id value.
Since the kernel still uses embedded scope zone id to represent
link-local addresses, convert next_hop6 address into this form before
return from pfil processing. This also fixes in6_localip() check
for link-local addresses.
Differential Revision: https://reviews.freebsd.org/D2015
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
draft-ietf-6man-enhanced-dad-13.
This basically adds a random nonce option (RFC 3971) to NS messages
for DAD probe to detect a looped back packet. This looped back packet
prevented DAD on some pseudo-interfaces which aggregates multiple L2 links
such as lagg(4).
The length of the nonce is set to 6 bytes. This algorithm can be disabled by
setting net.inet6.ip6.dad_enhanced sysctl to 0 in a per-vnet basis.
Reported by: hiren
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D1835
of packets. When the data payload length excluding any headers, of an
outgoing IPv4 packet exceeds PAGE_SIZE bytes, a special case in
ip_fragment() can kick in to optimise the outgoing payload(s). The
code which was added in r98849 as part of zero copy socket support
assumes that the beginning of any MTU sized payload is aligned to
where a MBUF's "m_data" pointer points. This is not always the case
and can sometimes cause large IPv4 packets, as part of ping replies,
to be split more than needed.
Instead of iterating the MBUFs to figure out how much data is in the
current chain, use the value already in the "m_pkthdr.len" field of
the first MBUF in the chain.
Reviewed by: ken @
Differential Revision: https://reviews.freebsd.org/D1893
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
Previous __alignment(4) allowed compiler to assume that operations are
performed on aligned region. On ARM processor, this led to alignment fault
as shown below:
trapframe: 0xda9e5b10
FSR=00000001, FAR=a67b680e, spsr=60000113
r0 =00000000, r1 =00000068, r2 =0000007c, r3 =00000000
r4 =a67b6826, r5 =a67b680e, r6 =00000014, r7 =00000068
r8 =00000068, r9 =da9e5bd0, r10=00000011, r11=da9e5c10
r12=da9e5be0, ssp=da9e5b60, slr=a054f164, pc =a054f2cc
<...>
udp_input+0x264: ldmia r5, {r0-r3, r6}
udp_input+0x268: stmia r12, {r0-r3, r6}
This was due to instructions which do not support unaligned access,
whereas for __alignment(2) compiler replaced ldmia/stmia with some
logically equivalent memcpy operations.
In fact, the assumption that 'struct ip' is always 4-byte aligned
is definitely false, as we have no impact on data alignment of packet
stream received.
Another possible solution would be to explicitely perform memcpy()
on objects of 'struct ip' type, which, however, would suffer from
performance drop, and be merely a problem hiding.
Please, note that this has nothing to do with
ARM32_DISABLE_ALIGNMENT_FAULTS option, but is related strictly to
compiler behaviour.
Submitted by: Wojciech Macek <wma@semihalf.com>
Reviewed by: glebius, ian
Obtained from: Semihalf
represents a context.
- Preserve name 'struct igmp_ifinfo' for a new structure, that will be stable
API between userland and kernel.
- Make sysctl_igmp_ifinfo() return the new 'struct igmp_ifinfo', instead of
old one, which had a bunch of internal kernel structures in it.
- Move all above declarations from in_var.h to igmp_var.h, since they are
private to IGMP code.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
and arp were being used. They basically would pass in the
mutex to the callout_init. Because they used this method
to the callout system, it was possible to "stop" the callout.
When flushing the table and you stopped the running callout, the
callout_stop code would return 1 indicating that it was going
to stop the callout (that was about to run on the callout_wheel blocked
by the function calling the stop). Now when 1 was returned, it would
lower the reference count one extra time for the stopped timer, then
a few lines later delete the memory. Of course the callout_wheel was
stuck in the lock code and would then crash since it was accessing
freed memory. By using callout_init(c, 1) we always get a 0 back
and the reference counting bug does not rear its head. We do have
to make a few adjustments to the callouts themselves though to make
sure it does the proper thing if rescheduled as well as gets the lock.
Commented upon by hiren and sbruno
See Phabricator D1777 for more details.
Commented upon by hiren and sbruno
Reviewed by: adrian, jhb and bz
Sponsored by: Netflix Inc.
when fragmenting IP packets to preserve the order of the packets in a
stream. Else the resulting fragments can be sent out of order when the
hardware supports multiple transmit rings.
Reviewed by: glebius @
MFC after: 1 week
Sponsored by: Mellanox Technologies
This fixes what seems like a simple oversight when the function was added in
r253210.
Reported by: Daniel Borkmann <dborkman@redhat.com>
Florian Westphal <fw@strlen.de>
Differential Revision: https://reviews.freebsd.org/D1628
Reviewed by: gnn
MFC after: 1 month
Sponsored by: Limelight Networks
We would like to acknowledge Gerasimos Dimitriadis who reported
the issue and Michael Tuexen who analyzed and provided the
fix.
Security: FreeBSD-SA-15:03.sctp
Security: CVE-2014-8613
Submitted by: tuexen
We would like to acknowledge Clement LECIGNE from Google Security Team and
Francisco Falcon from Core Security Technologies who discovered the issue
independently and reported to the FreeBSD Security Team.
Security: FreeBSD-SA-15:02.kmem
Security: CVE-2014-8612
Submitted by: tuexen
sys/netinet/ip_carp.c:
Add a "reason" string parameter to carp_set_state() and
carp_master_down_locked() allowing more specific logging
information to be passed into these apis.
Refactor existing state transition logging into a single
log call in carp_set_state().
Update all calls to carp_set_state() and
carp_master_down_locked() to pass an appropriate reason
string. For state transitions that were previously logged,
the output should be unchanged.
Submitted by: gibbs (original), asomers (updated)
MFC after: 1 week
Sponsored by: Spectra Logic
MFSpectraBSD: 1039697 on 2014/02/11 (original)
1049992 on 2014/03/21 (updated)
bits.
The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.
* net/rss_config.[ch] takes care of the generic bits for doing
configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.
This should have no functional impact on anyone currently using
the RSS support.
Differential Revision: D1383
Reviewed by: gnn, jfv (intel driver bits)
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.
Reviewed by: tuexen
Sponsored by: Nginx, Inc.