packets on tunnel interfaces. Add support of these ioctls to gre(4),
gif(4) and me(4) interfaces. For incoming packets M_SETFIB() should use
if_fib value from ifnet structure, use proper value in gre(4) and me(4).
Differential Revision: https://reviews.freebsd.org/D2462
No objection from: #network
MFC after: 2 weeks
Sponsored by: Yandex LLC
since types became a enum C type. Some software uses such ifdefs to
determine whether an operating systems supports certain interface type.
Of course, such check is bogus. E.g. FreeBSD defines about 250 interface
types, but supports only around 20.
However, we need not upset such software so provide a set of defines. The
current set was taken to suffice the dhcpd.
Reported & tested by: Guy Yur <guyyur gmail.com>
to specify exact number of cpus in loader.conf which get annoying when you have
mix of machines which don't have equal number of total cpus. I propose "-1" as
that value. When loader.conf has net.isr.maxthreads = -1, netisr will use all
available cpus.
In collaboration with: davide
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D2318
MFC after: 2 weeks
Sponsored by: Limelight Networks
quite unexpected result of toggling capabilities on the neighbour vlan(4)
interfaces.
Reviewed by: melifaro, np
Differential Revision: https://reviews.freebsd.org/D2310
Sponsored by: Nginx, Inc.
It is not network-specific code and would
be better as part of libkern instead.
Move zlib.h and zutil.h from net/ to sys/
Update includes to use sys/zlib.h and sys/zutil.h instead of net/
Submitted by: Steve Kiernan stevek@juniper.net
Obtained from: Juniper Networks, Inc.
GitHub Pull Request: https://github.com/freebsd/freebsd/pull/28
Relnotes: yes
A couple of fields are still exposed via struct bpf_if_ext so that
bpf_peers_present() can be inlined into its callers. However, this change
eliminates some type duplication in the resulting CTF container, since
otherwise ctfmerge(1) propagates the duplication through all types that
contain a struct bpf_if.
Differential Revision: https://reviews.freebsd.org/D2319
Reviewed by: melifaro, rpaulo
dhclient opens bpf as write-only to send packets. It never reads received
packets from that descriptor, but processing them in kernel takes time.
Especially much time takes packet timestamping on systems with expensive
timecounter, such as bhyve guest, where network speed dropped in half.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
discontinued by its initial authors. In FreeBSD the code was already
slightly edited during the pf(4) SMP project. It is about to be edited
more in the projects/ifnet. Moving out of contrib also allows to remove
several hacks to the make glue.
Reviewed by: net@
device polling rather than having it only be controlled by the compile
time option.
Summary: Rubicon Communications (Netgate)
Reviewers: #network, hiren
Reviewed By: #network, hiren
Subscribers: hiren
Differential Revision: https://reviews.freebsd.org/D2258
- Extend the number of available subtypes for Ethernet media by using some
of the ifmedia word's option bits to help denote subtypes. As a result, the
number of possible Ethernet subtype values increases from 31 to 511.
- Use some of those new values to define new media types.
- lacp_compose_key() recgonizes the new Ethernet media types added.
(Change made as required by a comment in if_media.h)
- New ioctl, SIOGIFXMEDIA, to handle getting the new extended media types.
SIOCGIFMEDIA is retained for backwards compatibility.
- Changes to ifconfig to allow it to handle the new extended media types.
Submitted by: mike@karels.net (original), hselasky
Reviewed by: jfvogel, gnn, hselasky
Approved by: jfvogel (mentor), gnn (mentor)
Differential Revision: http://reviews.freebsd.org/D1965
where counter was incremented on parent, instead of vlan(4) interface.
The second is more complicated. Historically, in our stack the incoming
packets are accounted in drivers, while incoming bytes for Ethernet
drivers are accounted in ether_input_internal(). Thus, it should be
removed from vlan(4) driver.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
- Use ifunit() instead of going through the interface list ourselves.
- Remove unused parameter.
- Move the most important comment above the function.
Sponsored by: Nginx, Inc.
initialization, when no input method specified before if_attach().
This prevents panics when if_input() method called directly e.g.
from bpf(4) code.
PR: 192426
Reviewed by: glebius
MFC after: 1 week
drivers can use it. This avoids some code duplication. Add missing
default case to all switch statements while at it. Also move the
hashing of the IPv6 flow field to layer 4 because the IPv6 flow field
is constant on a per L4 connection basis and not on a per L3 network.
Differential Revision: https://reviews.freebsd.org/D1987
Sponsored by: Mellanox Technologies
MFC after: 1 month
in kern_gzio.c. The old gzio interface was somewhat inflexible and has not
worked properly since r272535: currently, the gzio functions are called with
a range lock held on the output vnode, but kern_gzio.c does not pass the
IO_RANGELOCKED flag to vn_rdwr() calls, resulting in deadlock when vn_rdwr()
attempts to reacquire the range lock. Moreover, the new gzio interface can
be used to implement kernel core compression.
This change also modifies the kernel configuration options needed to enable
userland core dump compression support: gzio is now an option rather than a
device, and the COMPRESS_USER_CORES option is removed. Core dump compression
is enabled using the kern.compress_user_cores sysctl/tunable.
Differential Revision: https://reviews.freebsd.org/D1832
Reviewed by: rpaulo
Discussed with: kib
if_vmove().
In if_vmove(), if_detach_internal() and if_attach_internal() were
called in series to detach and reattach the interface. When
detaching, if_delgroup() was called and the interface leaves all of
the group membership. And then upon attachment, if_addgroup(ifp,
IFG_ALL) was called and it joined only "all" group again.
This had a problem. Normally, a cloned interface automatically joins
a group whose name is ifc_name of the cloner in addition to "all"
upon creation. However, if_vmove() removed the membership and did
not restore upon attachment.
Differential Revision: https://reviews.freebsd.org/D1859
allows adding an vlan interface into a bridge.
Thanks for William Katsak <wkatsak cs rutgers edu> for testing and fixing
an issue in my previous patch draft.
MFC after: 2 weeks
There was a race that bridge_ifdetach() could be called via
ifnet_departure event handler after vnet_bridge_uninit().
PR: 195859
Reported by: Danilo Egea Gondolfo
I edited the original change to retain the use of arc4random() as a seed for
the hashing as a very basic defense against intentional lagg port selection.
The author's original commit message (edited slightly):
sys/net/ieee8023ad_lacp.c
sys/net/if_lagg.c
In lagg_hashmbuf, use the FNV hash instead of the old
hash32_buf. The hash32 family of functions operate one octet
at a time, and when run on a string s of length n, their output
is equivalent to :
----- i=n-1
\
n \ (n-i-1) 32
( seed^ + / 33^ * s[i] ) % 2^
/
----- i=0
The problem is that the last five bytes of input don't get
multiplied by sufficiently many powers of 33 to rollover 2^32.
That means that changing the last few bytes (but obviously not
the very last) of input will always change the value of the
hash by a multiple of 33. In the case of lagg_hashmbuf() with
ipv4 input, the last four bytes are the TCP or UDP port
numbers. Since the output of lagg_hashmbuf is always taken
modulo the port count, and 3 is a common port count for a lagg,
that's bad. It means that the UDP or TCP source port will
never affect which lagg member is selected on a 3-port lagg.
At 10Gbps, I was not able to measure any difference in CPU
consumption between the old and new hash.
Submitted by: asomers (original commit)
Reviewed by: emaste, glebius
MFC after: 1 week
Sponsored by: Spectra Logic
MFSpectraBSD: 1001723 on 2013/08/28 (original)
1114258 on 2015/01/22 (edit)
very questionable, since it makes vimages more dependent on each other. But
the reason for the backout is that it screwed up shutting down the pf purge
threads, and now kernel immedially panics on pf module unload. Although module
unloading isn't an advertised feature of pf, it is very important for
development process.
I'd like to not backout r276746, since in general it is good. But since it
has introduced numerous build breakages, that later were addressed in
r276841, r276756, r276747, I need to back it out as well. Better replay it
in clean fashion from scratch.
bits.
The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.
* net/rss_config.[ch] takes care of the generic bits for doing
configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.
This should have no functional impact on anyone currently using
the RSS support.
Differential Revision: D1383
Reviewed by: gnn, jfv (intel driver bits)
Quoting 19 years bpf.4 manual from bpf-1.2a1:
"
(SIOCGIFADDR is obsolete under BSD systems. SIOCGIFCONF should be
used to query link-level addresses.)
"
* SIOCGIFADDR was not imported in NetBSD (bpf.c 1.36) and OpenBSD.
* Last bits (e.g. manpage claiming SIOCGIFADDR exists) was cleaned
from NetBSD via kern/21513 5 years ago,
from OpenBSD via documentation/6352 5 years ago.
r273087. Move all checks from gif_output() into gif_transmit(). Previously
they were checked always, because if_start always called gif_output.
Now gif_transmit() can be called directly from if_bridge() code and we need
do checks here.
PR: 196646
MFC after: 1 week
handle it in arc_output() instead of nd6_storelladdr().
* Remove IFT_ARCNET check from arpresolve() since arc_output() does not
use arpresolve() to handle broadcast/multicast. This check was there
since r84931. It looks like it was not used since r89099 (initial
import of Arcnet support where multicast is handled separately).
* Remove IFT_IEEE1394 case from nd6_storelladdr() since firewire_output()
calles nd6_storelladdr() for unicast addresses only.
* Remove IFT_ARCNET case from nd6_storelladdr() since arc_output() now
handles multicast by itself.
As a result, we have the following pattern: all non-ethernet-style
media have their own multicast map handling inside their appropriate
routines. On the other hand, arpresolve() (and nd6_storelladdr()) which
meant to be 'generic' ones de-facto handles ethernet-only multicast maps.
MFC after: 3 weeks
Split functions that initialize various pf parts into their
vimage parts and global parts.
Since global parts appeared to be only mutex initializations, just
abandon them and use MTX_SYSINIT() instead.
Kill my incorrect VNET_FOREACH() iterator and instead use correct
approach with VNET_SYSINIT().
PR: 194515
Differential Revision: D1309
Submitted by: glebius, Nikos Vassiliadis <nvass@gmx.com>
Reviewed by: trociny, zec, gnn
* Make most of lltable_* methods 'normal' functions instead of inline
* Add lltable_get_<af|ifp>() functions to access given lltable fields
* Temporarily resurrect nd6_lookup() function
use llt_fill_sa_entry() llt method to store lle address in sa.
* Eliminate L3_ADDR macro and either reference IPv4/IPv6 address
directly from lle or use newly-created llt_fill_sa_entry().
* Do not store sockaddr inside arp/ndp lle anymore.
and explicit calls to RTENTRY_FREE_LOCKED()
* Use lltable_prefix_free() in arp_ifscrub to be consistent with nd6.
* Rename <lltable_|llt>_delete function to _delete_addr() to note that
this function is used to external callers. Make this function maintain
its own locking.
* Use lookup/unlink/clear call chain from internal callers instead of
delete_addr.
* Fix LLE_DELETED flag handling
cleanup including unlinking/freeing
* Relax locking in lltable_prefix_free_af/lltable_free
* Do not pass @llt to lle free callback: it is always NULL now.
* Unify arptimer/nd6_llinfo_timer: explicitly unlock lle avoiding
unlock/lock sequinces
* Do not pass unlocked lle to nd6_ns_output(): add nd6_llinfo_get_holdsrc()
to retrieve preferred source address from lle hold queue and pass it
instead of lle.
* Finally, make nd6_create() create and return unlocked lle
* Separate defrtr handling code from nd6_free():
use nd6_check_del_defrtr() to check if we need to keep entry instead of
performing GC,
use nd6_check_recalc_defrtr() to perform actual recalc on lle removal.
* Move isRouter handling from nd6_cache_lladdr() to separate
nd6_check_router()
* Add initial code to maintain lle runtime flags in sync.
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
use rwlock accessible via external functions
(IF_AFDATA_CFG_* -> if_afdata_cfg_*()) for all control plane tasks
use rmlock (IF_AFDATA_RUN_*) for fast-path lookups.
Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.
Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.
New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.
route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.
PR: 194238
MFC after: 1 month
CR: D1125
* struct llentry is now basically split into 2 pieces:
all fields within 64 bytes (amd64) are now protected by both
ifdata lock AND lle lock, e.g. you require both locks to be held
exclusively for modification. All data necessary for fast path
operations is kept here. Some fields were added:
- r_l3addr - makes lookup key liev within first 64 bytes.
- r_flags - flags, containing pre-compiled decision whether given
lle contains usable data or not. Current the only flag is RLLE_VALID.
- r_len - prepend data len, currently unused
- r_kick - used to provide feedback to control plane (see below).
All other fields are protected by lle lock.
* Add simple state machine for ARP to handle "about to expire" case:
Current model (for the fast path) is the following:
- rlock afdata
- find / rlock rte
- runlock afdata
- see if "expire time" is approaching
(time_uptime + la->la_preempt > la->la_expire)
- if true, call arprequest() and decrease la_preempt
- store MAC and runlock rte
New model (data plane):
- rlock afdata
- find rte
- check if it can be used using r_* fields only
- if true, store MAC
- if r_kick field != 0 set it to 0.
- runlock afdata
New mode (control plane):
- schedule arptimer to be called in (V_arpt_keep - V_arp_maxtries)
seconds instead of V_arpt_keep.
- on first timer invocation change state from ARP_LLINFO_REACHABLE
to ARP_LLINFO_VERIFY, sets r_kick to 1 and shedules next call in
V_arpt_rexmit (default to 1 sec).
- on subsequent timer invocations in ARP_LLINFO_VERIFY state, checks
for r_kick value: reschedule if not changed, and send arprequest()
if set to zero (e.g. entry was used).
* Convert IPv4 path to use new single-lock approach. IPv6 bits to follow.
* Slow down in_arpinput(): now valid reply will (in most cases) require
acquiring afdata WLOCK twice. This is requirement for storing changed
lle data. This change will be slightly optimized in future.
* Provide explicit hash link/unlink functions for both ipv4/ipv6 code.
This will probably be moved to generic lle code once we have per-AF
hashing callback inside lltable.
* Perform lle unlink on deletion immediately instead of delaying it to
the timer routine.
* Make r244183 more explicit: use new LLE_CALLOUTREF flag to indicate the
presence of lle reference used for safe callout calls.
lla_lookup(LLE_CREATE) -> lla_create
lla_lookup(LLE_DELETE) -> lla_delete
Assume lla_create to return LLE_EXCLUSIVE lock for lle.
* Rework lla_rt_output to perform all lle changes under afdata WLOCK.
* change arp_ifscrub() ackquire afdata WLOCK, the same as arp_ifinit().
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.
No objections from: net@
gcc requires variables to be initialised in two places. One of them
is correctly used only under the same conditional though.
For module builds properly check if the kernel supports INET or INET6,
as otherwise various mips kernels without IPv6 support would fail to build.
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.
gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
work at system startup. But this also removes support for having tunnels with
the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
Use our standard ioctls for tunnels.
me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);
PR: 164475
Differential Revision: D1023
No objections from: net@
Relnotes: yes
Sponsored by: Yandex LLC
it, except Ethernet, where it carried ng_ether(4) pointer.
For now carry the pointer in if_l2com directly.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.
We currrently have 3 places where MTU is altered:
1) route addition.
In this case domain overrides radix _addroute callback (in[6]_addroute)
and all necessary checks/fixes are/can be done there.
2) route change (especially, GW change).
In this case, there are no explicit per-domain calls, but one can
override rte by setting ifa_rtrequest hook to domain handler
(inet6 does this).
3) ifconfig ifaceX mtu YYYY
In this case, we have no callbacks, but ip[6]_output performes runtime
checks and decreases rt_mtu if necessary.
Generally, the goals are to be able to handle all MTU changes in
control plane, not in runtime part, and properly deal with increased
interface MTU.
This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)
* The latter is done by using new dom_ifmtu callback since
if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
for some cases, so we need an abstract way to know maximum MTU size
for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
use this functions on new non-inserted rte.
More changes will follow soon.
MFC after: 1 month
Sponsored by: Yandex LLC
* Remove &ifpp from ip6_output() in favor of ri->ri_nh_info
* Provide different wrappers to in6_selectsrc:
Currently it is used by 2 differenct type of customers:
- socket-based one, which all are unsure about provided
address scope and
- in-kernel ones (ND code mostly), which don't have
any sockets, options, crededentials, etc.
So, we provide two different wrappers to in6_selectsrc()
returning select source.
* Make different versions of selectroute():
Currenly selectroute() is used in two scenarios:
- SAS, via in6_selecsrc() -> in6_selectif() -> selectroute()
- output, via in6_output -> wrapper -> selectroute()
Provide different versions for each customer:
- fib6_lookup_nh_basic()-based in6_selectif() which is
capable of returning interface only, without MTU/NHOP/L2
calculations
- full-blown fib6_selectroute() with cached route/multipath/
MTU/L2
* Stop using routing table for link-local address lookups
* Add in6_ifawithifp_lla() to make for-us check faster for link-local
* Add in6_splitscope / in6_setllascope for faster embed/deembed scopes
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.
The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.
The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.
Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.
My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.
My Nomex pants are on. Let the feedback commence!
Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by: so(des)
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.
Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.
MFC after: 3 days
Sponsored by: Mellanox Technologies
* Use NHF_ namespace for all nhop flags
* Rename nhop_data -> nhop_prepend
* Rename fib4_lookup_nh_extended -> fib4_lookup_nh_ext
* Add "flags" argument to fib4_lookup_nh_ext() to specify whether we want
returned nh_ext structure to be refcounted or not.
structure without doinf L2 resolve. It also requires freeing
references by calling fib4_free_nh_ext().
Convert in_pcbladdr() to use it.
Convert tcp_maxmtu() to use it.
vxlan creates a virtual LAN by encapsulating the inner Ethernet frame in
a UDP packet. This implementation is based on RFC7348.
Currently, the IPv6 support is not fully compliant with the specification:
we should be able to receive UPDv6 packets with a zero checksum, but we
need to support RFC6935 first. Patches for this should come soon.
Encapsulation protocols such as vxlan emphasize the need for the FreeBSD
network stack to support batching, GRO, and GSO. Each frame has to make
two trips through the network stack, and each frame will be at most MTU
sized. Performance suffers accordingly.
Some latest generation NICs have begun to support vxlan HW offloads that
we should also take advantage of. VIMAGE support should also be added soon.
Differential Revision: https://reviews.freebsd.org/D384
Reviewed by: gnn
Relnotes: yes
The goals of the new API is to provide consumers with minimal
needed information, but as fast as possible. So we provide
full nexthop info copied into alighed on-cache structure
instead of rte/ia pointers, their refcounts and locks.
This does not provide solution for protecting from egress
ifp destruction, but does not make it any worse.
Current changes:
nhops:
Add fib4_lookup_prepend() function which stores either full
L2+L3 prepend info (e.g. MAC header in case of plain IPv4) or
L3 info with NH_FLAGS_L2_INCOMPLETE flag indicating that no valid L2
info exists and we have to take "slow" path.
ip_output:
Currently ip[ 46]_output consumers use 'struct route' for
the following purposes:
1) double lookup avoidance(route caching)
2) plain route caching
3) get path MTU to be able to notify source.
The former pattern is mostly used by various tunnels
(gif, gre, stf). (Actually, gre is the only remaining,
others were already converted. Their locking model did
not scale good enogh to benefit from such caching, so
we have (temporarily) removed it without any performance
loss).
Plain route caching used by SCTP is simply wrong and should be removed.
Temporary break it for now just to be able to compile.
Optimize path mtu reporting by providing it in new 'route_info' stucture.
Minimize games with @ia locking/refcounting for route lookup:
add special nhop[46]_extended structure to store more route attributes.
Pointer to given structure can be passed to fib4_lookup_prepend() to indicate
we want this info (we actually needs it for UDP and raw IP).
ether_output:
Provide light-weight ether_output2() call to deal with
transmitting L2 frame (e.g. properly handle broadcast/simloop/bridge/
other L2 hooks before actually transmitting frame by if_transmit()).
Add a hack based on new RT_NHOP ro_flag to distinguish which version should
we call. Better way is probably to add a new "if_output_frame" driver
callbacks.
Next steps:
* Convert ip_fastfwd part
* Implement auto-growing array for per-radix nexthops
* Implement LLE tracking for nexthop calculations to be able to
immediately provide all necessary info in single route lookup
for gateway routes
* Switch radix locking scheme to runtime/cfg lock
* Implement multipath support for rtsock
* Implement "tracked nexthops" for tunnels (e.g. _proper_
nexthop caching)
* Add IPv6 support for remaining parts (postponed not to
interfere with user/ae/inet6 branch)
* Consider adding "if_output_frame" driver call to
ease logical frame pushing.
o convert to if_transmit;
o use rmlock to protect access to gif_softc;
o use sx lock to protect from concurrent ioctls;
o remove a lot of unneeded and duplicated code;
o remove cached route support (it won't work with concurrent io);
o style fixes.
Reviewed by: melifaro
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
is added to return EEXIST when only "b" interface exists---this can happen
when epair<N>b is moved to a vnet jail and then "ifconfig epair<N> create"
is invoked there.
outer header, consider it as new packet and clear the protocols flags.
This fixes problems when IPSEC traffic goes through various tunnels and router
doesn't send ICMP/ICMPv6 errors.
PR: 174602
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
LOR of softc rmlock in iflladdr_event handlers.
- Call if_delmulti_ifma() after LACP_UNLOCK(). This fixes another LOR.
- Fix a panic in lacp_transit_expire().
- Fix a panic in lagg_input() upon shutting down a port.
if_lagg(4) interfaces which were cloned in a vnet jail.
Sysctl nodes which are dynamically generated for each cloned interface
(net.link.lagg.N.*) have been removed, and use_flowid and flowid_shift
ifconfig(8) parameters have been added instead. Flags and per-interface
statistics counters are displayed in "ifconfig -v".
CR: D842
This is temporary commit to be merged to 10.
Other approach (like hash table) should be used
to store different masks.
PR: 194078
Submitted by: Rumen Telbizov
MFC after: 3 days
This seems to allow us to pass a universe with either clang or gcc
after r272244 (and r272260) and probably makes it easier to untabgle
these chained #includes in the future.
if_var.h has the expected on and if_var.h include ifq.h and thus we get
duplicates. It seems only one cavium ethernet file actually includes ifq.h
directly which might be another cleanup to be done but need to test first.
then drop lock, run the attach routines, and then set it to specific
proto. This removes tons of WITNESS warnings.
- Make lagg protocol attach handlers not failing and allocate memory
with M_WAITOK.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week
struct ifnet if_oqdrops.
Some netgraph modules used ifqueue w/o ifnet. Accounting of queue drops
is simply removed from them. There were no API to read this statistic.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
imporant moments that we discussed with Marcel and Anuranjan was that
a converted driver should return false for 'grep ifnet if_driver.c' :)
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
The compat counters will go away, but the function will remain in its place,
and in all places where it is going to be called.
Discussed with: melifaro
and receives frames on any port of the lagg(4).
Phabric: D549
Reviewed by: glebius, thompsa
Approved by: glebius
Obtained from: OpenBSD
Sponsored by: QNAP Systems Inc.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
MFC after: 1 week
Sponsored by: Mellanox Technologies
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.
sys/sys/param.h
Increment __FreeBSD_version
sys/net/if.c
sys/net/if_var.h
sys/net/route.c
Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.
sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
Fixup calls of modified functions.
share/man/man9/ifnet.9
Document changed API.
CR: https://reviews.freebsd.org/D458
MFC after: Never
Sponsored by: Spectra Logic
with no RSS hash.
When doing RSS:
* Create a new IPv4 netisr which expects the frames to have been verified;
it just directly dispatches to the IPv4 input path.
* Once IPv4 reassembly is done, re-calculate the RSS hash with the new
IP and L3 header; then reinject it as appropriate.
* Update the IPv4 netisr to be a CPU affinity netisr with the RSS hash
function (rss_soft_m2cpuid) - this will do a software hash if the
hardware doesn't provide one.
NICs that don't implement hardware RSS hashing will now benefit from RSS
distribution - it'll inject into the correct destination netisr.
Note: the netisr distribution doesn't work out of the box - netisr doesn't
query RSS for how many CPUs and the affinity setup. Yes, netisr likely
shouldn't really be doing CPU stuff anymore and should be "some kind of
'thing' that is a workqueue that may or may not have any CPU affinity";
that's for a later commit.
Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan
and keep both converted to drvapi and non-converted drivers
compilable.
o Make if_t typedef to struct ifnet *.
o Remove shim functions.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
for route(4) socket and ifmib(4) sysctl.
o Move fields from if_data to ifnet, but keep all statistic counters
separate, since they should disappear later.
o Provide function if_data_copy() to fill if_data, utilize it in routing
socket and ifmib handler.
o Provide overridable ifnet(9) method to fetch counters. If no provided,
if_get_counters_compat() would be used, that returns old counters.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
via net.link.generic.IFMIB_IFDATA.*.IFDATA_GENERAL sysctl. Reasons
for removal are:
- No code in tree uses this possibility.
- The documentation ifmib(4) doesn't say that such possibility
exist. The example provided in manual page only reads data.
- On many interfaces the feature simply doesn't work, since they
do accounting in hardware, and overwrite if_data on tick.
Sponsored by: Nginx, Inc.
* Convert ixgbe to use this ioctl
* Convert ifconfig to use generic i2c handler for "ix" interfaces.
Approved by: Eric Joyner (ixgbe part)
MFC after: 2 weeks
Sponsored by: Yandex LLC
QSFP+ data via i2c inteface. These constants has been taken
from SFF-8436 "QSFP+ 10 Gbs 4X PLUGGABLE TRANSCEIVER" standard
rev 4.8.
* Add support for printing QSFP+ information from 40G NICs
such as Chelsio T5.
This commit does not contain ioctl changes necessary for this
functionality work, there will be another commit soon.
Example:
cxl1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=ec07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,.....>
ether 00:07:43:28:ad:08
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
media: Ethernet 40Gbase-LR4 <full-duplex>
status: active
plugged: QSFP+ 40GBASE-LR4 (MPO Parallel Optic)
vendor: OEM PN: OP-QSFP-40G-LR4 SN: 20140318001 DATE: 2014-03-18
module temperature: 64.06 C voltage: 3.26 Volts
lane 1: RX: 0.47 mW (-3.21 dBm) TX: 2.78 mW (4.46 dBm)
lane 2: RX: 0.20 mW (-6.94 dBm) TX: 2.80 mW (4.47 dBm)
lane 3: RX: 0.18 mW (-7.38 dBm) TX: 2.79 mW (4.47 dBm)
lane 4: RX: 0.90 mW (-0.45 dBm) TX: 2.80 mW (4.48 dBm)
Tested on: Chelsio T5
Tested on: Mellanox/Huawei passive/active cables/transceivers.
MFC after: 2 weeks
Sponsored by: Yandex LLC
Mostly bugfixes or features developed in the past 6 months,
so this is a 10.1 candidate.
Basically no user API changes (some bugfixes in sys/net/netmap_user.h).
In detail:
1. netmap support for virtio-net, including in netmap mode.
Under bhyve and with a netmap backend [2] we reach over 1Mpps
with standard APIs (e.g. libpcap), and 5-8 Mpps in netmap mode.
2. (kernel) add support for multiple memory allocators, so we can
better partition physical and virtual interfaces giving access
to separate users. The most visible effect is one additional
argument to the various kernel functions to compute buffer
addresses. All netmap-supported drivers are affected, but changes
are mechanical and trivial
3. (kernel) simplify the prototype for *txsync() and *rxsync()
driver methods. All netmap drivers affected, changes mostly mechanical.
4. add support for netmap-monitor ports. Think of it as a mirroring
port on a physical switch: a netmap monitor port replicates traffic
present on the main port. Restrictions apply. Drive carefully.
5. if_lem.c: support for various paravirtualization features,
experimental and disabled by default.
Most of these are described in our ANCS'13 paper [1].
Paravirtualized support in netmap mode is new, and beats the
numbers in the paper by a large factor (under qemu-kvm,
we measured gues-host throughput up to 10-12 Mpps).
A lot of refactoring and additional documentation in the files
in sys/dev/netmap, but apart from #2 and #3 above, almost nothing
of this stuff is visible to other kernel parts.
Example programs in tools/tools/netmap have been updated with bugfixes
and to support more of the existing features.
This is meant to go into 10.1 so we plan an MFC before the Aug.22 deadline.
A lot of this code has been contributed by my colleagues at UNIPI,
including Giuseppe Lettieri, Vincenzo Maffione, Stefano Garzarella.
MFC after: 3 days.
This is needed to prevent having interfaces with ifp->if_addr == NULL
on bridge interfaces. Moving the notification event handlers up makes
sure the interfaces are removed before doing any more cleanup.
Sponsored by: Citrix Systems R&D
Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D598
net/if.c
- Move interface removal notification up in if_detach_internal.
- Do not count global number of states and of src_nodes,
use uma_zone_get_cur() to obtain values.
- Struct pf_status becomes merely an ioctl API structure,
and moves to netpfil/pf/pf.h with its constants.
- V_pf_status is now of type struct pf_kstatus.
Submitted by: Kajetan Staszkiewicz <vegeta tuxpowered.net>
Sponsored by: InnoGames GmbH
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
Direct bpf(4) consumers should now work fine with this tunable turned on.
In fact, the only case when optimized_writers can change program
behavior is direct bpf(4) consumer setting its read filter to
catch-all one.
MFC after: 2 weeks
Sponsored by: Yandex LLC
- stdio.h is needed for fprint()
- make memsize uint32_t to avoid errors due to overflow
- honor the *XPOLL flagg in NIOCREGIF requests
- mmap fails wit MAP_FAILED, not NULL.
MFC after: 3 days
interface allows the ifnet structure to be defined as an opaque
type in NIC drivers. This then allows the ifnet structure to be
changed without a need to change or recompile NIC drivers.
Put differently, NIC drivers can be written and compiled once and
be used with different network stack implementations, provided of
course that those network stack implementations have an API and
ABI compatible interface.
This commit introduces the 'if_t' type to replace 'struct ifnet *'
as the type of a network interface. The 'if_t' type is defined as
'void *' to enable the compiler to perform type conversion to
'struct ifnet *' and vice versa where needed and without warnings.
The functions that implement the API are the only functions that
need to have an explicit cast.
The MII code has been converted to use the driver API to avoid
unnecessary code churn. Code churn comes from having to work with
both converted and unconverted drivers in correlation with having
callback functions that take an interface. By converting the MII
code first, the callback functions can be defined so that the
compiler will perform the typecasts automatically.
As soon as all drivers have been converted, the if_t type can be
redefined as needed and the API functions can be fix to not need
an explicit cast.
The immediate benefactors of this change are:
1. Juniper Networks - The network stack implementation in Junos
is entirely different from FreeBSD's one and this change
allows Juniper to build "stock" NIC drivers that can be used
in combination with both the FreeBSD and Junos stacks.
2. FreeBSD - This change opens the door towards changing ifnet
and implementing new features and optimizations in the network
stack without it requiring a change in the many NIC drivers
FreeBSD has.
Submitted by: Anuranjan Shukla <anshukla@juniper.net>
Reviewed by: glebius@
Obtained from: Juniper Networks, Inc.
ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.
sys/net/if_var.h
sys/net/if.c
Add legacy-compatible functions as described above. Ensure legacy
behavior when RT_ALL_FIBS is passed as fibnum.
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
Call with _fib() functions if we must use a specific fib, or the
legacy functions otherwise.
tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
Improve the udp_dontroute test. The bug that this test exercises is
that ifa_ifwithnet() will return the wrong address, if multiple
interfaces have addresses on the same subnet but with different
fibs. The previous version of the test only considered one possible
failure mode: that ifa_ifwithnet_fib() might fail to find any
suitable address at all. The new version also checks whether
ifa_ifwithnet_fib() finds the correct address by checking where the
ARP request goes.
Reported by: bz, hrs
Reviewed by: hrs
MFC after: 1 week
X-MFC-with: 264905
Sponsored by: Spectra Logic
birthday-paradox style address collisions when
bhyve VMs are connected to the same broadcoast
domain and are using pseudo-random allocations.
Reviewed by: gnn
MFC after: 1 week
Since radix has been ignoring sa_family in passed sockaddrs,
no one ever has bothered filling valid sa_family in netmasks.
Additionally, radix adjusts sa_len field in every netmask not to
compare zero bytes at all.
This leads us to rt_mask with sa_family of AF_UNSPEC (-1) and
arbitrary sa_len field (0 for default route, for example).
However, rtsock have been passing that rt_mask intact for ages,
requiring all rtsock consumers to make ther own local hacks.
We even have unfixed on in base:
do `route -n monitor` in one window and issue `route -n get addr`
for some directly-connected address. You will probably see the following:
got message of size 304 on Thu May 8 15:06:06 2014
RTM_GET: Report Metrics: len 304, pid: 30493, seq 1, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
10.0.0.0 link#1 (255) ffff ffff ff em0:8.0.27.c5.29.d4 10.0.0.92
_________________^^^^^^^^^^^^^^^^^^
after the change:
got message of size 312 on Thu May 8 15:44:07 2014
RTM_GET: Report Metrics: len 312, pid: 2895, seq 1, errno 0, flags:<UP,DONE,PINNED>
locks: inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
10.0.0.0 link#1 255.255.255.0 em0:8.0.27.c5.29.d4 10.0.0.92
_________________^^^^^^^^^^^^^^^^^^
Sponsored by: Yandex LLC
MFC after: 1 month
It looks like current consumers are either unaware
of MRT (and uses RT_DEFAULT_FIB implicitly) or
know what thay are doing, In latter case they
will be either hit by KASSERT or ESCRH will be returned
due to NULL rnh.
The thread that is destroying the lagg has already set sc->sc_psc=NULL when
the "ifconfig -am" thread gets to lacp_req(). It tries to dereference
sc->sc_psc and panics. The solution is for lacp_req() to check the value of
sc->sc_psc. If NULL, harmlessly return an lacp_opreq structure full of
zeros. Full details in GNATS.
PR: kern/189003
Reviewed by: timeout on freebsd-net@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
All rtsock-initiated rte creation/modification are now
performed in route.c holding radix tree write lock.
This reduces the need for per-rte mutex.
Sponsored by: Yandex LLC
MFC after: 1 month
* memory is now allocated as early as possible, without holding locks.
* sysctl users are now guaranteed to get a response (M_WAITOK buffer prealloc).
* socket users are more likely to use on-stack buffer for replies.
* standard kernel malloc/free functions are now used instead of radix wrappers.
rt_msg2() has been renamed to rtsock_msg_buffer().
MFC after: 1 month
RTM_CHANGE is now handled inside route.c:rtrequest1_fib() as it should be.
Note change change handler is a separate function rtrequest1_fib_change().
MFC after: 1 month