Commit Graph

3370 Commits

Author SHA1 Message Date
glebius
b786a57a34 Move struct ether_vlan_header to ethernet.h, out of if_vlan_var.h,
since this structure is protocol definition, not part of implementation.
2014-11-11 10:22:33 +00:00
luigi
02cd0a8cd6 return kernel-supplied error if available.
Also fix field names in a comment.
2014-11-10 08:31:56 +00:00
melifaro
b5d711d3a6 Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@
2014-11-09 21:33:01 +00:00
glebius
037bd5af57 Remove remnants of if_ef(4). 2014-11-09 11:13:15 +00:00
glebius
83e84205ec Use standard mtx(9), rwlock(9), sx(9) system initialization macros
instead of doing initialization manually.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-09 11:11:08 +00:00
bz
20dab50bef After r274246 make the tree compile again.
gcc requires variables to be initialised in two places.  One of them
is correctly  used only under the same conditional though.

For module builds properly check if the kernel supports INET or INET6,
as otherwise various mips kernels without IPv6 support would fail to build.
2014-11-08 14:41:32 +00:00
glebius
959c68aefc ifindex_alloc_locked() never fails and doesn't have no-lock version,
so change the prototype.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-08 07:23:01 +00:00
ae
7144dc8bc2 Overhaul if_gre(4).
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.

gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
  protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
  for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
  work at system startup. But this also removes support for having tunnels with
  the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
  Use our standard ioctls for tunnels.

me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);

PR:		164475
Differential Revision:	D1023
No objections from:	net@
Relnotes:	yes
Sponsored by:	Yandex LLC
2014-11-07 19:13:19 +00:00
glebius
6306f79560 Remove struct arpcom. It is unused by most interface types, that allocate
it, except Ethernet, where it carried ng_ether(4) pointer.
For now carry the pointer in if_l2com directly.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-07 15:14:10 +00:00
glebius
99f4ec50e8 Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.
Sponsored by:	Nginx, Inc.
2014-11-07 09:39:05 +00:00
glebius
ac67ed6c19 Remove useless structure ifindex_entry.
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-07 09:15:39 +00:00
melifaro
dfd9a3788d Fix build.
Pointy hat to:	melifaro
2014-11-06 17:50:35 +00:00
melifaro
5c54c0c246 Finish r274118: remove useless fields from struct domain.
Sponsored by:	Yandex LLC
2014-11-06 14:39:04 +00:00
melifaro
11af63037f Make checks for rt_mtu generic:
Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.

We currrently have 3 places where MTU is altered:

1) route addition.
 In this case domain overrides radix _addroute callback (in[6]_addroute)
 and all necessary checks/fixes are/can be done there.

2) route change (especially, GW change).
 In this case, there are no explicit per-domain calls, but one can
 override rte by setting ifa_rtrequest hook to domain handler
 (inet6 does this).

3) ifconfig ifaceX mtu YYYY
 In this case, we have no callbacks, but ip[6]_output performes runtime
 checks and decreases rt_mtu if necessary.

Generally, the goals are to be able to handle all MTU changes in
 control plane, not in runtime part, and properly deal with increased
 interface MTU.

This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)

* The latter is done by using new dom_ifmtu callback since
 if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
 However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
 for some cases, so we need an abstract way to know maximum MTU size
 for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
  user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
  use this functions on new non-inserted rte.

More changes will follow soon.

MFC after:	1 month
Sponsored by:	Yandex LLC
2014-11-06 13:13:09 +00:00
hselasky
a8147b2f48 Clarify TSO segment limit comment and remove two TABs to make lines a
bit shorter.

Sponsored by:	Mellanox Technologies
2014-11-03 13:02:58 +00:00
markm
fce6747f55 This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random.
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.

The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.

The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.

Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.

My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.

My Nomex pants are on. Let the feedback commence!

Reviewed by:	trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by:	so(des)
2014-10-30 21:21:53 +00:00
kib
ad7bf17db7 Replace some calls to fuword() by fueword() with proper error checking.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	3 weeks
2014-10-28 15:28:20 +00:00
hselasky
a0b8ff0c54 The SYSCTL data pointers can come from userspace and must not be
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.

Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-28 12:00:39 +00:00
ae
4a180510c8 Remove redundant check and m_pullup() call. 2014-10-24 13:34:22 +00:00
ae
2921b89f9b Move if_get_counter initialization from if_attach into if_alloc.
Also, initialize all counters before ifnet will become available in the system.
This fixes possible access to uninitialized ifned fields.

PR:		194550
2014-10-23 14:29:52 +00:00
luigi
0db2375e22 since we cast a pointer, use the correct signedness
(this was already in, and got lost in a recent update).
2014-10-22 18:55:36 +00:00
bryanv
d0cefd6466 Use the size of the Ethernet address, not the entire header, when
copying into forwarding entry.

Reported by:	Coverity
CID:		1248849
2014-10-21 05:45:57 +00:00
bryanv
783bd6e089 Add vxlan interface
vxlan creates a virtual LAN by encapsulating the inner Ethernet frame in
a UDP packet. This implementation is based on RFC7348.

Currently, the IPv6 support is not fully compliant with the specification:
we should be able to receive UPDv6 packets with a zero checksum, but we
need to support RFC6935 first. Patches for this should come soon.

Encapsulation protocols such as vxlan emphasize the need for the FreeBSD
network stack to support batching, GRO, and GSO. Each frame has to make
two trips through the network stack, and each frame will be at most MTU
sized. Performance suffers accordingly.

Some latest generation NICs have begun to support vxlan HW offloads that
we should also take advantage of. VIMAGE support should also be added soon.

Differential Revision:	https://reviews.freebsd.org/D384
Reviewed by:	gnn
Relnotes:	yes
2014-10-20 14:42:42 +00:00
melifaro
80a1a77bec * Remove route caching in if_stf.
* Copy necessary in6_ifa on stack instead of playing with refcounts.
2014-10-17 15:07:04 +00:00
hrs
95ad85717f - Fix lladdr configuration which could prevent LACP mode from working.
- Fix LORs when a laggport interface has an IPv6 LLA.

PR:	194321
2014-10-17 09:08:44 +00:00
ae
e8b631abdf Add more ifdefs. SIOC*_IN6 are defined only with INET6.
MFC after:	1 month
Reported  by:	bz
2014-10-14 14:51:27 +00:00
ae
b92a2b74c4 Move memset under ifdef INET6.
MFH:		1 month
Reported by:	bz
2014-10-14 14:41:06 +00:00
ae
88b7be7ff6 Overhaul if_gif(4):
o convert to if_transmit;
 o use rmlock to protect access to gif_softc;
 o use sx lock to protect from concurrent ioctls;
 o remove a lot of unneeded and duplicated code;
 o remove cached route support (it won't work with concurrent io);
 o style fixes.

Reviewed by:	melifaro
Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2014-10-14 13:31:47 +00:00
hrs
1805c9c20a Virtualize if_epair(4). An if_xname check for both "a" and "b" interfaces
is added to return EEXIST when only "b" interface exists---this can happen
when epair<N>b is moved to a vnet jail and then "ifconfig epair<N> create"
is invoked there.
2014-10-10 06:45:13 +00:00
ae
d5044f2234 When tunneling interface is going to insert mbuf into netisr queue after stripping
outer header, consider it as new packet and clear the protocols flags.

This fixes problems when IPSEC traffic goes through various tunnels and router
doesn't send ICMP/ICMPv6 errors.

PR:		174602
Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-10-08 21:23:34 +00:00
ae
ea6fd7eaaf Our packet filters use mbuf's rcvif pointer to determine incoming interface.
Change mbuf's rcvif to enc0 and restore it after pfil processing.

PR:		110959
Sponsored by:	Yandex LLC
2014-10-07 13:31:04 +00:00
hrs
3297c817fa Virtualize if_edsc(4). 2014-10-05 21:27:26 +00:00
hrs
70e14abeb8 Virtualize if_disc(4) cloner. 2014-10-05 19:46:52 +00:00
hrs
0826f2b25d Virtualize if_bridge(4) cloner. 2014-10-05 19:43:37 +00:00
hrs
d278f7c187 Use printb() for boolean flags in ro_opts and actor_state for LACP. 2014-10-05 02:37:01 +00:00
hrs
2edb8f7e2f - Move L2 addr configuration for the primary port to a taskqueue. This fixes
LOR of softc rmlock in iflladdr_event handlers.

- Call if_delmulti_ifma() after LACP_UNLOCK().  This fixes another LOR.

- Fix a panic in lacp_transit_expire().

- Fix a panic in lagg_input() upon shutting down a port.
2014-10-05 02:34:21 +00:00
hrs
db53b4f174 Separate option handling from SIOC[SG]LAGG to SIOC[SG]LAGGOPTS for
backward compatibility with old ifconfig(8).
2014-10-02 20:01:13 +00:00
hrs
d30b551ba7 Virtualize net.link.vlan.soft_pad. 2014-10-02 05:56:17 +00:00
hrs
667a2b7369 Virtualize lagg(4) cloner. This change fixes a panic when tearing down
if_lagg(4) interfaces which were cloned in a vnet jail.

Sysctl nodes which are dynamically generated for each cloned interface
(net.link.lagg.N.*) have been removed, and use_flowid and flowid_shift
ifconfig(8) parameters have been added instead.  Flags and per-interface
statistics counters are displayed in "ifconfig -v".

CR:	D842
2014-10-01 21:37:32 +00:00
melifaro
e6ca9a3b21 Free radix mask entries on main radix destroy.
This is temporary commit to be merged to 10.
Other approach (like hash table) should be used
to store different masks.

PR:		194078
Submitted by:	Rumen Telbizov
MFC after:	3 days
2014-10-01 21:24:58 +00:00
melifaro
d8b683d70f Remove lock init from radix.c.
Radix has never managed its locking itself.
The only consumer using radix with embeded rwlock
is system routing table. Move per-AF lock inits there.
2014-10-01 14:39:06 +00:00
glebius
57bac09d3f Fix off by one in lagg_port_destroy().
Reported by:	"Max N. Boyarov" <zotrix bsd.by>
2014-10-01 11:23:54 +00:00
bz
aab771d812 Move the unconditional #include of net/ifq.h to the very end of file.
This seems to allow us to pass a universe with either clang or gcc
after r272244 (and r272260) and probably makes it easier to untabgle
these chained #includes in the future.
2014-09-28 17:09:40 +00:00
bz
abef5517f6 Remove duplicate declaraton of the if_inc_counter() function after r272244.
if_var.h has the expected on and if_var.h include ifq.h and thus we get
duplicates.  It seems only one cavium ethernet file actually includes ifq.h
directly which might be another cleanup to be done but need to test first.
2014-09-28 15:38:21 +00:00
glebius
0f9d61b26b - Remove empty wrappers ether_poll_[de]register_drv(). [1]
- Move polling(9) declarations out of ifq.h back to if_var.h
  they are absolutely unrelated to queues.

Submitted by:	Mikhail <mp lenta.ru> [1]
2014-09-28 14:05:18 +00:00
glebius
2cb6078939 Finally, convert counters in struct ifnet to counter(9).
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-28 08:57:07 +00:00
glebius
53d7fba29f Convert to if_inc_counter() last remnantes of bare access to struct ifnet
counters.
2014-09-28 07:43:38 +00:00
melifaro
7d70b89c51 Use underlying ports counters to get lagg statistics instead of
per-packet accounting.
This introduce user-visible changes like aggregating error counters.

Reviewed by:	asomers (prev.version), glebius
CR:		D781
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-09-27 13:57:48 +00:00
glebius
58a4ee184a Remove macros that hide access to struct ifnet fields. 2014-09-26 13:02:29 +00:00
glebius
f564c3e730 Make all lagg protocol methods live in lagg_protos, not in softc. All
interfaces of a same protocol, use the same methods.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-26 12:54:24 +00:00
ae
530a56d2e5 Keep list of lagg ports sorted by if_index.
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-09-26 12:42:06 +00:00
glebius
7f6197c96b - Whitespace.
- Remove caddr_t.
2014-09-26 12:35:58 +00:00
glebius
62993359de - Provide lagg_proto_attach(), lagg_proto_detach().
- Make detach a protocol method in lagg_protos.
- Simplify code to lookup protocols.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-26 11:01:04 +00:00
glebius
680ed8e05c - When reconfiguring protocol on a lagg, first set it to LAGG_PROTO_NONE,
then drop lock, run the attach routines, and then set it to specific
  proto. This removes tons of WITNESS warnings.
- Make lagg protocol attach handlers not failing and allocate memory
  with M_WAITOK.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-26 08:42:32 +00:00
glebius
ee9b35f736 Make lagg protos a enum. 2014-09-26 08:12:12 +00:00
glebius
e30ec249f1 Make lagg protocols detach methods returning void.
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-26 07:12:40 +00:00
hselasky
bdacf9ba4d Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by:	adrian, rmacklem
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2014-09-22 08:27:27 +00:00
hrs
3eeeb7c9a3 Fix build. 2014-09-21 07:16:51 +00:00
hrs
ffad09823e - Virtualize interface cloner for gre(4). This fixes a panic when destroying
a vnet jail which has a gre(4) interface.

- Make net.link.gre.max_nesting vnet-local.
2014-09-21 03:56:06 +00:00
hrs
05fa00b397 Virtualize interface cloner for gif(4). This fixes a panic when destroying
a vnet jail which has a gif(4) interface.
2014-09-21 03:55:04 +00:00
hrs
5e2751fde9 Make net.add_addr_allfibs vnet-local. 2014-09-21 03:48:20 +00:00
glebius
f2cafe032f Mechanically convert to if_inc_counter(). 2014-09-19 10:39:58 +00:00
glebius
72f04611ec Remove ifq_drops from struct ifqueue. Now queue drops are accounted in
struct ifnet if_oqdrops.

Some netgraph modules used ifqueue w/o ifnet. Accounting of queue drops
is simply removed from them. There were no API to read this statistic.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-19 09:01:19 +00:00
glebius
b010d64973 Increase errors, not queue drops, in cases the module is supplied
with a bad packet or if mbuf allocation failes.
2014-09-19 05:43:38 +00:00
glebius
de25153d59 Remove a bunch of methods that are superseded by if_inc_counter().
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-18 16:17:20 +00:00
glebius
c2d27a81fe While not too late rename 'ifnet_counter' to 'ift_counter'. One of the
imporant moments that we discussed with Marcel and Anuranjan was that
a converted driver should return false for 'grep ifnet if_driver.c' :)

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-18 14:47:13 +00:00
glebius
0917a065ca Add a function to set if_get_counter method for an ifnet. To be used
in the drivers that are already converted to "Juniper drvapi". This
can be revisited in future.
2014-09-18 14:38:28 +00:00
glebius
bf71125f67 While not too late rename if_get_counter_compat() to if_get_counter_default().
The compat counters will go away, but the function will remain in its place,
and in all places where it is going to be called.

Discussed with:	melifaro
2014-09-18 10:01:56 +00:00
glebius
f76e492f6d Add if_inc_counter(), a generic method to update ifnet(9) counter
w/o dereferencing the struct.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-18 09:54:57 +00:00
araujo
36c415243f Revert r271735. The comment is absolutely correct, we do not support 802.1p priority tagging. I got confused with the packet tagged and packet to be tagged.
Spotted by:	glebius
2014-09-18 05:43:19 +00:00
araujo
80e51d9d11 Remove old comment, we already do 802.1q tagging.
Phabric:	D797
Reviewed by:	kevlo
Approved by:	kevlo
Sponsored by:	QNAP Systems Inc.
2014-09-18 03:09:34 +00:00
araujo
d7a9c633d7 Add laggproto broadcast, it allows sends frames to all ports of the lagg(4) group
and receives frames on any port of the lagg(4).

Phabric:	D549
Reviewed by:	glebius, thompsa
Approved by:	glebius
Obtained from:	OpenBSD
Sponsored by:	QNAP Systems Inc.
2014-09-18 02:12:48 +00:00
melifaro
f0ab9ab876 * Fix if_omcast handling
* Convert if_oerrors to pcpu.

Suggested by:	glebius
MFC after:	2 weeks
2014-09-16 21:48:48 +00:00
hselasky
727760a4e4 Revert r271504. A new patch to solve this issue will be made.
Suggested by:	adrian @
2014-09-13 20:52:01 +00:00
melifaro
bf4280c0a8 Switch if_vlan(4) to rmlock.
MFC after:	2 weeks
2014-09-13 18:41:24 +00:00
melifaro
a31da764ba Switch if_vlan(4) to use counter(9) using new
if_get_counter api.
2014-09-13 18:13:08 +00:00
hselasky
3d04a989df Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2014-09-13 08:26:09 +00:00
asomers
081aa8a15c Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
	Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
	Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
	versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
	Fixup calls of modified functions.

share/man/man9/ifnet.9
	Document changed API.

CR:		https://reviews.freebsd.org/D458
MFC after:	Never
Sponsored by:	Spectra Logic
2014-09-11 20:21:03 +00:00
adrian
b73995e058 Update the IPv4 input path to handle reassembled frames and incoming frames
with no RSS hash.

When doing RSS:

* Create a new IPv4 netisr which expects the frames to have been verified;
  it just directly dispatches to the IPv4 input path.
* Once IPv4 reassembly is done, re-calculate the RSS hash with the new
  IP and L3 header; then reinject it as appropriate.
* Update the IPv4 netisr to be a CPU affinity netisr with the RSS hash
  function (rss_soft_m2cpuid) - this will do a software hash if the
  hardware doesn't provide one.

NICs that don't implement hardware RSS hashing will now benefit from RSS
distribution - it'll inject into the correct destination netisr.

Note: the netisr distribution doesn't work out of the box - netisr doesn't
query RSS for how many CPUs and the affinity setup.  Yes, netisr likely
shouldn't really be doing CPU stuff anymore and should be "some kind of
'thing' that is a workqueue that may or may not have any CPU affinity";
that's for a later commit.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 04:18:20 +00:00
glebius
2e01608625 Clean up unused CSUM_FRAGMENT.
Sponsored by:	Nginx, Inc.
2014-09-03 08:30:18 +00:00
glebius
9dfcf3eeb1 Toss fields so that no padding field is required to achieve alignment. 2014-08-31 13:30:54 +00:00
glebius
833eb3c331 It is actually possible to have if_t a typedef to non-void type,
and keep both converted to drvapi and non-converted drivers
compilable.

o Make if_t typedef to struct ifnet *.
o Remove shim functions.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-31 12:48:13 +00:00
glebius
3b5ede57e9 Provide pointer from struct ifnet to struct netmap_adapter,
instead of abusing spare field.
2014-08-31 11:33:19 +00:00
glebius
70b7c46209 o Remove struct if_data from struct ifnet. Now it is merely API structure
for route(4) socket and ifmib(4) sysctl.
o Move fields from if_data to ifnet, but keep all statistic counters
  separate, since they should disappear later.
o Provide function if_data_copy() to fill if_data, utilize it in routing
  socket and ifmib handler.
o Provide overridable ifnet(9) method to fetch counters. If no provided,
  if_get_counters_compat() would be used, that returns old counters.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-31 06:46:21 +00:00
glebius
73b170d619 Remove ability to write to struct if_data residing in struct ifnet
via net.link.generic.IFMIB_IFDATA.*.IFDATA_GENERAL sysctl. Reasons
for removal are:
- No code in tree uses this possibility.
- The documentation ifmib(4) doesn't say that such possibility
  exist. The example provided in manual page only reads data.
- On many interfaces the feature simply doesn't work, since they
  do accounting in hardware, and overwrite if_data on tick.

Sponsored by:	Nginx, Inc.
2014-08-31 06:23:54 +00:00
melifaro
69a7dea554 * Add SIOCGI2C driver ioctl used to retrieve i2c info.
* Convert ixgbe to use this ioctl
* Convert ifconfig to use generic i2c handler for  "ix" interfaces.

Approved by:	Eric Joyner (ixgbe part)
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-08-29 18:02:58 +00:00
melifaro
81b97334ef * Add new net/sff8436.h containing constants used to access
QSFP+ data via i2c inteface. These constants has been taken
  from SFF-8436 "QSFP+ 10 Gbs 4X PLUGGABLE TRANSCEIVER" standard
  rev 4.8.
* Add support for printing QSFP+ information from 40G NICs
  such as Chelsio T5.

This commit does not contain ioctl changes necessary for this
functionality work, there will be another commit soon.

Example:
cxl1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=ec07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,.....>
        ether 00:07:43:28:ad:08
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet 40Gbase-LR4 <full-duplex>
        status: active
        plugged: QSFP+ 40GBASE-LR4 (MPO Parallel Optic)
        vendor: OEM PN: OP-QSFP-40G-LR4 SN: 20140318001 DATE: 2014-03-18
        module temperature: 64.06 C voltage: 3.26 Volts
        lane 1: RX: 0.47 mW (-3.21 dBm) TX: 2.78 mW (4.46 dBm)
        lane 2: RX: 0.20 mW (-6.94 dBm) TX: 2.80 mW (4.47 dBm)
        lane 3: RX: 0.18 mW (-7.38 dBm) TX: 2.79 mW (4.47 dBm)
        lane 4: RX: 0.90 mW (-0.45 dBm) TX: 2.80 mW (4.48 dBm)

Tested on:	Chelsio T5
Tested on:	Mellanox/Huawei passive/active cables/transceivers.
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-08-21 17:54:42 +00:00
melifaro
3142dab2d4 * Use standard net/sff8472.h header for sff bits and offsets.
* Convert sff_8472_id to 'const char *' to please clang.

Pointed by:	np
2014-08-16 21:53:44 +00:00
luigi
3ab69a246b Update to the current version of netmap.
Mostly bugfixes or features developed in the past 6 months,
so this is a 10.1 candidate.

Basically no user API changes (some bugfixes in sys/net/netmap_user.h).

In detail:

1. netmap support for virtio-net, including in netmap mode.
  Under bhyve and with a netmap backend [2] we reach over 1Mpps
  with standard APIs (e.g. libpcap), and 5-8 Mpps in netmap mode.

2. (kernel) add support for multiple memory allocators, so we can
  better partition physical and virtual interfaces giving access
  to separate users. The most visible effect is one additional
  argument to the various kernel functions to compute buffer
  addresses. All netmap-supported drivers are affected, but changes
  are mechanical and trivial

3. (kernel) simplify the prototype for *txsync() and *rxsync()
  driver methods. All netmap drivers affected, changes mostly mechanical.

4. add support for netmap-monitor ports. Think of it as a mirroring
  port on a physical switch: a netmap monitor port replicates traffic
  present on the main port. Restrictions apply. Drive carefully.

5. if_lem.c: support for various paravirtualization features,
  experimental and disabled by default.
  Most of these are described in our ANCS'13 paper [1].
  Paravirtualized support in netmap mode is new, and beats the
  numbers in the paper by a large factor (under qemu-kvm,
  we measured gues-host throughput up to 10-12 Mpps).

A lot of refactoring and additional documentation in the files
in sys/dev/netmap, but apart from #2 and #3 above, almost nothing
of this stuff is visible to other kernel parts.

Example programs in tools/tools/netmap have been updated with bugfixes
and to support more of the existing features.

This is meant to go into 10.1 so we plan an MFC before the Aug.22 deadline.

A lot of this code has been contributed by my colleagues at UNIPI,
including Giuseppe Lettieri, Vincenzo Maffione, Stefano Garzarella.

MFC after:	3 days.
2014-08-16 15:00:01 +00:00
royger
6a2fcceb9c net: move interface removal notification up in if_detach_internal
This is needed to prevent having interfaces with ifp->if_addr == NULL
on bridge interfaces. Moving the notification event handlers up makes
sure the interfaces are removed before doing any more cleanup.

Sponsored by: Citrix Systems R&D
Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D598

net/if.c
 - Move interface removal notification up in if_detach_internal.
2014-08-16 10:47:24 +00:00
kevlo
dd40fa7e62 Change pr_output's prototype to avoid the need for explicit casts.
This is a follow up to r269699.

Phabric:	D564
Reviewed by:	jhb
2014-08-15 02:43:02 +00:00
glebius
7d0b571895 - Count global pf(4) statistics in counter(9).
- Do not count global number of states and of src_nodes,
  use uma_zone_get_cur() to obtain values.
- Struct pf_status becomes merely an ioctl API structure,
  and moves to netpfil/pf/pf.h with its constants.
- V_pf_status is now of type struct pf_kstatus.

Submitted by:	Kajetan Staszkiewicz <vegeta tuxpowered.net>
Sponsored by:	InnoGames GmbH
2014-08-14 18:57:46 +00:00
araujo
9abce0e567 - Remove unneeded include.
Phabric:	D563
Reviewed by:	kevlo
Approved by:	kevlo
2014-08-11 03:04:16 +00:00
kevlo
7727a3c215 Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric:	D476
Reviewed by:	jhb
2014-08-08 01:57:15 +00:00
mav
1d4e2a0972 Improve locking of multicast addresses in VLAN and LAGG interfaces.
This fixes several scenarios of reproducible panics, cause by races
between multicast address changes and interface destruction.

MFC after:	2 weeks
2014-08-04 00:58:12 +00:00
glebius
d32e428cc3 Garbage collect couple of unused fields from struct ifaddr:
- ifa_claim_addr() unused since removal of NetAtalk
- ifa_metric seems to be never utilized, always a copy of if_metric
2014-07-29 15:01:29 +00:00
kevlo
940bebdcf2 Deprecate m_act. Use m_nextpkt always. 2014-07-17 05:21:16 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
attilio
2802c525ad - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
melifaro
e9f6263cd3 Improve logic besides net.bpf.optimize_writers.
Direct bpf(4) consumers should now work fine with this tunable turned on.
In fact, the only case when optimized_writers can change program
behavior is direct bpf(4) consumer setting its read filter to
catch-all one.

MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-06-11 11:27:44 +00:00
luigi
55c24573f4 misc bugfixes:
- stdio.h is needed for fprint()
- make memsize uint32_t to avoid errors due to overflow
- honor the *XPOLL flagg in NIOCREGIF requests
- mmap fails wit MAP_FAILED, not NULL.

MFC after:	3 days
2014-06-06 15:17:19 +00:00
luigi
d33a7e50b0 whitespace change: fix one comment, remove a stale one. 2014-06-06 15:15:27 +00:00
luigi
359dad8e6c whitespace change: remove trailing whitespace 2014-06-05 21:12:41 +00:00
marcel
916c7006f5 Introduce a procedural interface to the ifnet structure. The new
interface allows the ifnet structure to be defined as an opaque
type in NIC drivers.  This then allows the ifnet structure to be
changed without a need to change or recompile NIC drivers.

Put differently, NIC drivers can be written and compiled once and
be used with different network stack implementations, provided of
course that those network stack implementations have an API and
ABI compatible interface.

This commit introduces the 'if_t' type to replace 'struct ifnet *'
as the type of a network interface. The 'if_t' type is defined as
'void *' to enable the compiler to perform type conversion to
'struct ifnet *' and vice versa where needed and without warnings.
The functions that implement the API are the only functions that
need to have an explicit cast.

The MII code has been converted to use the driver API to avoid
unnecessary code churn. Code churn comes from having to work with
both converted and unconverted drivers in correlation with having
callback functions that take an interface. By converting the MII
code first, the callback functions can be defined so that the
compiler will perform the typecasts automatically.

As soon as all drivers have been converted, the if_t type can be
redefined as needed and the API functions can be fix to not need
an explicit cast.

The immediate benefactors of this change are:
1.  Juniper Networks - The network stack implementation in Junos
    is entirely different from FreeBSD's one and this change
    allows Juniper to build "stock" NIC drivers that can be used
    in combination with both the FreeBSD and Junos stacks.
2.  FreeBSD - This change opens the door towards changing ifnet
    and implementing new features and optimizations in the network
    stack without it requiring a change in the many NIC drivers
    FreeBSD has.

Submitted by:	Anuranjan Shukla <anshukla@juniper.net>
Reviewed by:	glebius@
Obtained from:	Juniper Networks, Inc.
2014-06-02 17:54:39 +00:00
asomers
7ca8bf0f2c Fix unintended KBI change from r264905. Add _fib versions of
ifa_ifwithnet() and ifa_ifwithdstaddr()  The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.

sys/net/if_var.h
sys/net/if.c
	Add legacy-compatible functions as described above.  Ensure legacy
	behavior when RT_ALL_FIBS is passed as fibnum.

sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
	Call with _fib() functions if we must use a specific fib, or the
	legacy functions otherwise.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
	Improve the udp_dontroute test.  The bug that this test exercises is
	that ifa_ifwithnet() will return the wrong address, if multiple
	interfaces have addresses on the same subnet but with different
	fibs.  The previous version of the test only considered one possible
	failure mode: that ifa_ifwithnet_fib() might fail to find any
	suitable address at all.  The new version also checks whether
	ifa_ifwithnet_fib() finds the correct address by checking where the
	ARP request goes.

Reported by:	bz, hrs
Reviewed by:	hrs
MFC after:	1 week
X-MFC-with:	264905
Sponsored by:	Spectra Logic
2014-05-29 21:03:49 +00:00
grehan
db7a034f36 Bump bhyve allocation up to 20 bits to avoid
birthday-paradox style address collisions when
bhyve VMs are connected to the same broadcoast
domain and are using pseudo-random allocations.

Reviewed by:	gnn
MFC after:	1 week
2014-05-20 02:59:13 +00:00
melifaro
f13915719f Rename rt_msg1() to more handy rtsock_msg_mbuf().
(Just for history purposes: rt_msg2() was renamed
 to rtsock_msg_buffer() in r265019).

Sponsored by:	Yandex LLC
MFC after:	1 month
2014-05-08 13:54:57 +00:00
melifaro
4a170f05aa Fix incorrect netmasks being passed via rtsock.
Since radix has been ignoring sa_family in passed sockaddrs,
no one ever has bothered filling valid sa_family in netmasks.
Additionally, radix adjusts sa_len field in every netmask not to
compare zero bytes at all.

This leads us to rt_mask with sa_family of AF_UNSPEC (-1) and
arbitrary sa_len field (0 for default route, for example).

However, rtsock have been passing that rt_mask intact for ages,
requiring all rtsock consumers to make ther own local hacks.
We even have unfixed on in base:

do `route -n monitor` in one window and issue `route -n get addr`
for some directly-connected address. You will probably see the following:

got message of size 304 on Thu May  8 15:06:06 2014
RTM_GET: Report Metrics: len 304, pid: 30493, seq 1, errno 0, flags:<UP,DONE,PINNED>
locks:  inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
 10.0.0.0 link#1 (255) ffff ffff ff em0:8.0.27.c5.29.d4 10.0.0.92
_________________^^^^^^^^^^^^^^^^^^

after the change:

got message of size 312 on Thu May  8 15:44:07 2014
RTM_GET: Report Metrics: len 312, pid: 2895, seq 1, errno 0, flags:<UP,DONE,PINNED>
locks:  inits:
sockaddrs: <DST,GATEWAY,NETMASK,IFP,IFA>
 10.0.0.0 link#1 255.255.255.0 em0:8.0.27.c5.29.d4 10.0.0.92
_________________^^^^^^^^^^^^^^^^^^

Sponsored by:	Yandex LLC
MFC after:	1 month
2014-05-08 11:56:06 +00:00
melifaro
1f938512ab Fix sysctl_ifmalist() broken in r265019.
Reported by:	Olivier Cochard-Labbé
MFC with:	r265019
2014-05-03 17:57:06 +00:00
melifaro
bb33a54f34 Remove additional fib checks from rtalloc1_fib.
It looks like current consumers are either unaware
of MRT (and uses RT_DEFAULT_FIB implicitly) or
know what thay are doing, In latter case they
will be either hit by KASSERT or ESCRH will be returned
due to NULL rnh.
2014-05-03 16:38:05 +00:00
melifaro
a4407f98c0 Pass radix head ptr along with rte to rtexpunge().
Rename rtexpunge to rt_expunge().
2014-05-03 16:28:54 +00:00
asomers
cf37d83a59 Fix a panic caused by doing "ifconfig -am" while a lagg is being destroyed.
The thread that is destroying the lagg has already set sc->sc_psc=NULL when
the "ifconfig -am" thread gets to lacp_req().  It tries to dereference
sc->sc_psc and panics.  The solution is for lacp_req() to check the value of
sc->sc_psc.  If NULL, harmlessly return an lacp_opreq structure full of
zeros.  Full details in GNATS.

PR:		kern/189003
Reviewed by:	timeout on freebsd-net@
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-05-02 16:24:09 +00:00
melifaro
53cfe851be Fix rnh_walktree_from() function (patch from kern/174959).
Require valid netmask to be passed since host route is always a leaf.

PR:		kern/174959
Submitted by:	Keith Sklower
MFC after:	2 weeks
2014-05-01 15:04:32 +00:00
melifaro
e75a4a90b5 Partially revert r265019 - allocating 512 bytes on stack
can be too much for architectures like ARM. Always use rounded
malloc instead.

Discussed with:	jmallett
MFC after:	4 weeks
2014-04-29 19:48:11 +00:00
melifaro
1883ddc524 Move rt_setmetrics() from rtsock.c to route.c.
All rtsock-initiated rte creation/modification are now
performed in route.c holding radix tree write lock.
This reduces the need for per-rte mutex.

Sponsored by:	Yandex LLC
MFC after:	1 month
2014-04-29 19:14:42 +00:00
melifaro
b1337c7d4c Do not use senderr() in rtrequest1_fib_change().
Suggested by:	glebius
MFC after:	4 weeks
2014-04-29 12:52:36 +00:00
melifaro
03224963a1 Fix build
Found by:	ian
Pointyhat to:	me
2014-04-27 21:17:54 +00:00
melifaro
a153bd7770 Improve memory allocation model for rt_msg2() rtsock messages:
* memory is now allocated as early as possible, without holding locks.
 * sysctl users are now guaranteed to get a response (M_WAITOK buffer prealloc).
 * socket users are more likely to use on-stack buffer for replies.
 * standard kernel malloc/free functions are now used instead of radix wrappers.
rt_msg2() has been renamed to rtsock_msg_buffer().

MFC after:	1 month
2014-04-27 17:41:18 +00:00
melifaro
c5479b1c51 Remove useless zeroing of RTAX_DST on error.
Cleanup a bit.

MFC after:	1 month
2014-04-27 10:43:48 +00:00
melifaro
29b944e3ac Cleanup route_output() a bit.
MFC after:	1 month
2014-04-27 10:20:37 +00:00
melifaro
bf1b5f8b0c Do not delay freeing rtm. Bandaid added in r227061 is not needed since r227061,
MFC after:	1 month
2014-04-27 09:49:35 +00:00
melifaro
f51d6fcb64 Move up fibnum to ensure it is always defined.
Found by:	ian
MFC with:	r264987
2014-04-27 02:20:09 +00:00
melifaro
5416196308 Remove useless `register' declarations.
MFC after:	1 month
2014-04-26 22:42:21 +00:00
melifaro
78166405b1 Determine fibnum once in the beginning of route_output().
MFC after:	1 month
2014-04-26 22:32:04 +00:00
melifaro
e815654815 Decouple RTM_CHANGE from RTM_GET handling in rtsock.c:route_output().
RTM_CHANGE is now handled inside route.c:rtrequest1_fib() as it should be.
Note change change handler is a separate function rtrequest1_fib_change().

MFC after:	1 month
2014-04-26 21:03:41 +00:00
melifaro
7b860c446e Unify sa_equal() macro usage.
MFC after:	2 weeks
2014-04-26 14:52:03 +00:00
asomers
f8a34b6f49 Fix subnet and default routes on different FIBs on the same subnet.
These two bugs are closely related.  The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
	Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr.  Those
	functions will only return an address whose interface fib equals the
	argument.

sys/net/route.c
	Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
	arguments.

sys/netinet/in.c
	Update in_addprefix to consider the interface fib when adding
	prefixes.  This will prevent it from not adding a subnet route when
	one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
	Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
	In some cases it there wasn't a clear specific fib number to use.
	In others, I was unable to test those functions so I chose
	RT_DEFAULT_FIB to minimize divergence from current behavior.  I will
	fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
	Revert r263738.  The udp_dontroute test was right all along.
	However, bugs kern/187550 and kern/187553 cancelled each other out
	when it came to this test.  Because of kern/187553, ifa_ifwithnet
	searched the default fib instead of the requested one, but because
	of kern/187550, there was an applicable subnet route on the default
	fib.  The new test added in r263738 doesn't work right, however.  I
	can verify with dtrace that ifa_ifwithnet returned the wrong address
	before I applied this commit, but route(8) miraculously found the
	correct interface to use anyway.  I don't know how.

	Clear expected failure messages for kern/187550 and kern/187552.

PR:		kern/187550
PR:		kern/187552
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic
2014-04-24 23:56:56 +00:00
asomers
6e7494c7e1 Fix host and network routes for new interfaces when net.add_addr_allfibs=0
sys/net/route.c
	In rtinit1, use the interface fib instead of the process fib.  The
	latter wasn't very useful because ifconfig(8) is usually invoked
	with the default process fib.  Changing ifconfig(8) to use setfib(2)
	would be redundant, because it already sets the interface fib.

tests/sys/netinet/fibs_test.sh
	Clear the expected ATF failure

sys/net/if.c
	Pass the interface fib in calls to rtrequest1_fib and rtalloc1_fib

sys/netinet/in.c
sys/net/if_var.h
	Add a fibnum argument to ifa_switch_loopback_route, a subroutine of
	in_scrubprefix.  Pass it the interface fib.

PR:		kern/187549
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-04-24 17:23:16 +00:00
mm
532d55ab5f Backport from projects/pf r263908:
De-virtualize UMA zone pf_mtag_z and move to global initialization part.

The m_tag struct does not know about vnet context and the pf_mtag_free()
callback is called unaware of current vnet. This causes a panic.

MFC after:	1 week
2014-04-20 09:17:48 +00:00
jmg
072b24d95e garbage collect something that hasn't been triggered in almost 5 years...
the last consumer was removed a couple years ago...
2014-04-19 19:08:08 +00:00
rmacklem
512a8b24e7 For NFS mounts using rsize,wsize=65536 over TSO enabled
network interfaces limited to 32 transmit segments, there
are two known issues.
The more serious one is that for an I/O of slightly less than 64K,
the net device driver prepends an ethernet header, resulting in a
TSO segment slightly larger than 64K. Since m_defrag() copies this
into 33 mbuf clusters, the transmit fails with EFBIG.
A tester indicated observing a similar failure using iSCSI.

The second less critical problem is that the network
device driver must copy the mbuf chain via m_defrag()
(m_collapse() is not sufficient), resulting in measurable overhead.

This patch reduces the default size of if_hw_tsomax
slightly, so that the first issue is avoided.
Fixing the second issue will require a way for the
network device driver to inform tcp_output() that it
is limited to 32 transmit segments.

Reported and tested by:	csforgeron@gmail.com, markus.gebert@hostpoint.ch
MFC after:	2 weeks
2014-04-17 23:31:50 +00:00
rmacklem
6067137dbd Vlan did not set the value of if_hw_tsomax, so when vlan
was stacked on top of a network interface that set if_hw_tsomax,
tcp_output() would see the default value instead of the value
set by the network interface. This patch modifies vlan so that
it sets if_hw_tsomax to the value of the parent interface.

Reviewed by:	glebius
MFC after:	2 weeks
2014-04-15 21:48:35 +00:00
rmacklem
dc2495c46c Fix build for non-INET that was broken by r264469.
MFC after:	2 weeks
2014-04-15 13:28:54 +00:00
rmacklem
ff97df6be2 Lagg did not set the value of if_hw_tsomax, so when lagg
was stacked on top of network interfaces that set if_hw_tsomax,
tcp_output() would see the default value instead of the value
set by the network interface(s). This patch modifies lagg so that
it sets if_hw_tsomax to the minimum of the value(s) for the
underlying network interfaces.

Reviewed by:	glebius
MFC after:	2 weeks
2014-04-14 20:34:48 +00:00
bms
f923a8498a In if_freemulti(), relax the paranoid KASSERT() on ifma->ifma_protospec.
This KASSERT() existed as a sanity check that upper layers in the network
stack (e.g. inet, inet6) had released their reference to the underlying
driver's multicast memberships (ifmultiaddr{}). However it assumes the
lifecycle of the driver membership corresponds to the lifecycle of the
network layer membership.

In the submitter's case, ieee80211_ioctl_updatemulti() attempts to
reprogram the (parent, physical) ifnet{} memberships in response
to a change in membership on the (child, virtual) VAP ifnet, using
a batched update mechanism. These updates happen independently from
the network layer, causing a "false negative" assertion failure.

There are possibly other use cases where this KASSERT() may be triggered
by other networking stack activity (e.g. where a nesting relationship
exists between multiple ifnet{} instances). This suggests that further
review of FreeBSD's approach to nested ifnet relationships is needed.

MFC after:	6 weeks
Submitted by:	adrian@
2014-04-10 18:43:02 +00:00
tuexen
90c4737aa0 Call sctp_addr_change() from rt_addrmsg() instead of rt_newaddrmsg_fib(),
since rt_addrmsg() gets also called from other functions.

MFC after: 3 days
2014-04-07 21:28:21 +00:00
mm
c4f653f608 Merge from projects/pf r251993 (glebius@):
De-vnet hash sizes and hash masks.

Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny

MFC after:	1 month
2014-03-25 06:55:53 +00:00
np
26117c8d46 Add a shorter alias for if_data.ifi_oqdrops. 2014-03-20 02:23:52 +00:00
jmmv
ca228204dc Include strings.h so that bpf_filter.c can be built in userland.
This is to bring in a definition for bzero(3), which in turn allows the
tests in tools/regression/bpf/ to build again.
2014-03-19 13:10:25 +00:00
glebius
6c64d03c91 When exporting ifnet via sysctl, add ifqueue(9) drop count to the
ifi_oqdrops. This is a temporary workaround until ifqueue(9) vanishes.

While here, remove the pointless ifi_vhid assignment. It has
sense only when we are exporting ifaddrs, not ifnets.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-19 06:08:03 +00:00
glebius
8293a6c1cc Garbage collect long time obsoleted (or never used) stuff from routing API. 2014-03-15 06:49:32 +00:00
rwatson
f411704afc Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/ configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid()to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
glebius
80e85e32a5 Remove AppleTalk support.
AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 06:29:43 +00:00
glebius
d494babace Remove IPX support.
IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 02:58:48 +00:00
glebius
b38edcd355 Since 32-bit if_baudrate isn't enough to describe a baud rate of a 10 Gbit
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.

o Remove the if_baudrate_pf crutch.

o Make all fields of struct if_data fixed machine independent size. The
  notion of data (packet counters, etc) are by no means MD. And it is a
  bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
  which at modern speeds overflow within a second.

  This also removes quite a lot of COMPAT_FREEBSD32 code.

o Give 16 bit for the ifi_datalen field. This field was provided to
  make future changes to if_data less ABI breaking. Unfortunately the
  8 bit size of it had effectively limited sizeof if_data to 256 bytes.

o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.

__FreeBSD_version bumped.

Discussed with:	emax
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-13 03:42:24 +00:00
glebius
dddd70a112 The route code used to mtx_destroy() a locked mutex before rtentry free. Now,
after r262763 it started to return locked mutexes to UMA. To fix that,
conditionally unlock the mutex in the destructor.

Tested by:	"Sergey V. Dyatko" <sergey.dyatko@gmail.com>
2014-03-05 21:16:46 +00:00
glebius
86b259a9f4 Pacify gcc. 2014-03-05 02:35:15 +00:00
glebius
2d3e25388b Hide struct rtentry from userland. 2014-03-05 01:47:08 +00:00