Commit Graph

3202 Commits

Author SHA1 Message Date
Alan Somers
0489b8916e Fix host and network routes for new interfaces when net.add_addr_allfibs=0
sys/net/route.c
	In rtinit1, use the interface fib instead of the process fib.  The
	latter wasn't very useful because ifconfig(8) is usually invoked
	with the default process fib.  Changing ifconfig(8) to use setfib(2)
	would be redundant, because it already sets the interface fib.

tests/sys/netinet/fibs_test.sh
	Clear the expected ATF failure

sys/net/if.c
	Pass the interface fib in calls to rtrequest1_fib and rtalloc1_fib

sys/netinet/in.c
sys/net/if_var.h
	Add a fibnum argument to ifa_switch_loopback_route, a subroutine of
	in_scrubprefix.  Pass it the interface fib.

PR:		kern/187549
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-04-24 17:23:16 +00:00
Martin Matuska
ecb47cf9c5 Backport from projects/pf r263908:
De-virtualize UMA zone pf_mtag_z and move to global initialization part.

The m_tag struct does not know about vnet context and the pf_mtag_free()
callback is called unaware of current vnet. This causes a panic.

MFC after:	1 week
2014-04-20 09:17:48 +00:00
John-Mark Gurney
de51fbd695 garbage collect something that hasn't been triggered in almost 5 years...
the last consumer was removed a couple years ago...
2014-04-19 19:08:08 +00:00
Rick Macklem
209579aeac For NFS mounts using rsize,wsize=65536 over TSO enabled
network interfaces limited to 32 transmit segments, there
are two known issues.
The more serious one is that for an I/O of slightly less than 64K,
the net device driver prepends an ethernet header, resulting in a
TSO segment slightly larger than 64K. Since m_defrag() copies this
into 33 mbuf clusters, the transmit fails with EFBIG.
A tester indicated observing a similar failure using iSCSI.

The second less critical problem is that the network
device driver must copy the mbuf chain via m_defrag()
(m_collapse() is not sufficient), resulting in measurable overhead.

This patch reduces the default size of if_hw_tsomax
slightly, so that the first issue is avoided.
Fixing the second issue will require a way for the
network device driver to inform tcp_output() that it
is limited to 32 transmit segments.

Reported and tested by:	csforgeron@gmail.com, markus.gebert@hostpoint.ch
MFC after:	2 weeks
2014-04-17 23:31:50 +00:00
Rick Macklem
bd9fd667c4 Vlan did not set the value of if_hw_tsomax, so when vlan
was stacked on top of a network interface that set if_hw_tsomax,
tcp_output() would see the default value instead of the value
set by the network interface. This patch modifies vlan so that
it sets if_hw_tsomax to the value of the parent interface.

Reviewed by:	glebius
MFC after:	2 weeks
2014-04-15 21:48:35 +00:00
Rick Macklem
d092e11c6a Fix build for non-INET that was broken by r264469.
MFC after:	2 weeks
2014-04-15 13:28:54 +00:00
Rick Macklem
9387570f8a Lagg did not set the value of if_hw_tsomax, so when lagg
was stacked on top of network interfaces that set if_hw_tsomax,
tcp_output() would see the default value instead of the value
set by the network interface(s). This patch modifies lagg so that
it sets if_hw_tsomax to the minimum of the value(s) for the
underlying network interfaces.

Reviewed by:	glebius
MFC after:	2 weeks
2014-04-14 20:34:48 +00:00
Bruce M Simpson
d565e5f206 In if_freemulti(), relax the paranoid KASSERT() on ifma->ifma_protospec.
This KASSERT() existed as a sanity check that upper layers in the network
stack (e.g. inet, inet6) had released their reference to the underlying
driver's multicast memberships (ifmultiaddr{}). However it assumes the
lifecycle of the driver membership corresponds to the lifecycle of the
network layer membership.

In the submitter's case, ieee80211_ioctl_updatemulti() attempts to
reprogram the (parent, physical) ifnet{} memberships in response
to a change in membership on the (child, virtual) VAP ifnet, using
a batched update mechanism. These updates happen independently from
the network layer, causing a "false negative" assertion failure.

There are possibly other use cases where this KASSERT() may be triggered
by other networking stack activity (e.g. where a nesting relationship
exists between multiple ifnet{} instances). This suggests that further
review of FreeBSD's approach to nested ifnet relationships is needed.

MFC after:	6 weeks
Submitted by:	adrian@
2014-04-10 18:43:02 +00:00
Michael Tuexen
7f946da063 Call sctp_addr_change() from rt_addrmsg() instead of rt_newaddrmsg_fib(),
since rt_addrmsg() gets also called from other functions.

MFC after: 3 days
2014-04-07 21:28:21 +00:00
Martin Matuska
7e92ce7380 De-virtualize UMA zone pf_mtag_z and move to global initialization part.
The m_tag struct does not know about vnet context and the pf_mtag_free()
callback is called unaware of current vnet. This causes a panic.

Reviewed by:	Nikos Vassiliadis, trociny@
2014-03-29 09:05:25 +00:00
Martin Matuska
1709ccf9d3 Merge head up to r263906. 2014-03-29 08:39:53 +00:00
Martin Matuska
d318d97fb5 Merge from projects/pf r251993 (glebius@):
De-vnet hash sizes and hash masks.

Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny

MFC after:	1 month
2014-03-25 06:55:53 +00:00
Navdeep Parhar
61b9f217c6 Add a shorter alias for if_data.ifi_oqdrops. 2014-03-20 02:23:52 +00:00
Julio Merino
b6d88aa39b Include strings.h so that bpf_filter.c can be built in userland.
This is to bring in a definition for bzero(3), which in turn allows the
tests in tools/regression/bpf/ to build again.
2014-03-19 13:10:25 +00:00
Gleb Smirnoff
2d70c0ded1 When exporting ifnet via sysctl, add ifqueue(9) drop count to the
ifi_oqdrops. This is a temporary workaround until ifqueue(9) vanishes.

While here, remove the pointless ifi_vhid assignment. It has
sense only when we are exporting ifaddrs, not ifnets.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-19 06:08:03 +00:00
Gleb Smirnoff
66dcee729c Garbage collect long time obsoleted (or never used) stuff from routing API. 2014-03-15 06:49:32 +00:00
Robert Watson
7527624efa Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/ configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid()to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
Gleb Smirnoff
45c203fce2 Remove AppleTalk support.
AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 06:29:43 +00:00
Gleb Smirnoff
2c284d9395 Remove IPX support.
IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 02:58:48 +00:00
Gleb Smirnoff
b245f96c44 Since 32-bit if_baudrate isn't enough to describe a baud rate of a 10 Gbit
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.

o Remove the if_baudrate_pf crutch.

o Make all fields of struct if_data fixed machine independent size. The
  notion of data (packet counters, etc) are by no means MD. And it is a
  bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
  which at modern speeds overflow within a second.

  This also removes quite a lot of COMPAT_FREEBSD32 code.

o Give 16 bit for the ifi_datalen field. This field was provided to
  make future changes to if_data less ABI breaking. Unfortunately the
  8 bit size of it had effectively limited sizeof if_data to 256 bytes.

o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.

__FreeBSD_version bumped.

Discussed with:	emax
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-13 03:42:24 +00:00
Gleb Smirnoff
256ea2aba9 The route code used to mtx_destroy() a locked mutex before rtentry free. Now,
after r262763 it started to return locked mutexes to UMA. To fix that,
conditionally unlock the mutex in the destructor.

Tested by:	"Sergey V. Dyatko" <sergey.dyatko@gmail.com>
2014-03-05 21:16:46 +00:00
Gleb Smirnoff
f788ab5f0b Pacify gcc. 2014-03-05 02:35:15 +00:00
Gleb Smirnoff
5274e55eb3 Hide struct rtentry from userland. 2014-03-05 01:47:08 +00:00
Gleb Smirnoff
e3a7aa6f56 - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
  mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
Gleb Smirnoff
fb3541ad15 Instead of playing games with casts simply add 3 more members to the
structure pf_rule, that are used when the structure is passed via
ioctl().

PR:		187074
2014-03-05 00:40:03 +00:00
George V. Neville-Neil
9b5f5ede41 Revert previous commit (262727) and bounce patch back to the
submitter.

Pointed out by: jhb
2014-03-04 23:55:04 +00:00
George V. Neville-Neil
596031c052 Naming consistency fix. The routing code defines
RADIX_NODE_HEAD_LOCK as grabbing the write lock,
but RADIX_NODE_HEAD_LOCK_ASSERT as checking the read lock.

Submitted by:	Vijay Singh <vijju.singh at gmail.com>
MFC after:	1 month
2014-03-04 05:09:46 +00:00
John Baldwin
5b26ea5df3 Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295.  A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR:		ports/184525 (exp-run)
2014-02-25 18:44:33 +00:00
Martin Matuska
5748b897da Merge head up to r262222 (last merge was incomplete). 2014-02-19 22:02:15 +00:00
Marko Zec
c5d4eab644 V_irtualize rtsock refcounting, which reduces the chances for panics
on teardown of vnets without active routing sockets while at least
one routing socket is active elsewhere.

Tested by:	Vijay Singh
MFC after:	3 days
2014-02-19 08:29:07 +00:00
Gleb Smirnoff
ec0ad11ed2 Fix incorrect assertions. 2014-02-18 14:21:26 +00:00
Gleb Smirnoff
044ae9dbef Add my copyright to flowtable. 2014-02-17 12:07:17 +00:00
Gleb Smirnoff
47c8550576 Whitespace. 2014-02-17 12:02:44 +00:00
Gleb Smirnoff
5ec03c2bff Bring copyright notice to standard style. 2014-02-17 12:01:50 +00:00
Gleb Smirnoff
0ff96b4f55 o Remove at compile time the HASH_ALL code, that was never
tested and is unfinished. However, I've tested my version,
  it works okay. As before it is unfinished: timeout aren't
  driven by TCP session state. To enable the HASH_ALL mode,
  one needs in kernel config:

	options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
  the FLOWTABLE_HASH_ALL option, twice less memory would
  be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
  flowtable_lookup_ipv6(). Stop copying data to on stack
  sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
  on insertion into flowtable_insert().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-17 11:50:56 +00:00
Adrian Chadd
a2c5809961 Make sure that the flowtable flowid is only set to m_flowid if there
isn't one already supplied.

The previous flowtable code also did this.

Reviewed by:	glebius
Sponsored by:	Netflix, Inc.
2014-02-15 07:57:01 +00:00
Luigi Rizzo
f0ea3689a9 This new version of netmap brings you the following:
- netmap pipes, providing bidirectional blocking I/O while moving
  100+ Mpps between processes using shared memory channels
  (no mistake: over one hundred million. But mind you, i said
  *moving* not *processing*);

- kqueue support (BHyVe needs it);

- improved user library. Just the interface name lets you select a NIC,
  host port, VALE switch port, netmap pipe, and individual queues.
  The upcoming netmap-enabled libpcap will use this feature.

- optional extra buffers associated to netmap ports, for applications
  that need to buffer data yet don't want to make copies.

- segmentation offloading for the VALE switch, useful between VMs.

and a number of bug fixes and performance improvements.

My colleagues Giuseppe Lettieri and Vincenzo Maffione did a substantial
amount of work on these features so we owe them a big thanks.

There are some external repositories that can be of interest:

    https://code.google.com/p/netmap
        our public repository for netmap/VALE code, including
        linux versions and other stuff that does not belong here,
        such as python bindings.

    https://code.google.com/p/netmap-libpcap
        a clone of the libpcap repository with netmap support.
	With this any libpcap client has access to most netmap
	feature with no recompilation. E.g. tcpdump can filter
	packets at 10-15 Mpps.

    https://code.google.com/p/netmap-ipfw
        a userspace version of ipfw+dummynet which uses netmap
        to send/receive packets. Speed is up in the 7-10 Mpps
        range per core for simple rulesets.

Both netmap-libpcap and netmap-ipfw will be merged upstream at some
point, but while this happens it is useful to have access to them.

And yes, this code will be merged soon. It is infinitely better
than the version currently in 10 and 9.

MFC after:	3 days
2014-02-15 04:53:04 +00:00
Gleb Smirnoff
f0e49f6631 Whenever flowtable lookup fails, we do route lookup and then try to
insert flow entry. During the route lookup the critical section is
exited. It may happen, that after route lookup we will be executed
on an other CPU that already has such flowentry. Before this change
we simply freed the flowentry and returned to ip_output() with
failure.

Actually there is nothing wrong with using previously allocated
flow entry, updating it properly. Thus, make flowentry_insert()
return the new either old fle, and make use of it.

Count reuses as "collisions" and real inserts as "inserts".

Reviewed by:	adrian
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-14 10:56:26 +00:00
Gleb Smirnoff
48278b8846 Once pf became not covered by a single mutex, many counters in it became
race prone. Some just gather statistics, but some are later used in
different calculations.

A real problem was the race provoked underflow of the states_cur counter
on a rule. Once it goes below zero, it wraps to UINT32_MAX. Later this
value is used in pf_state_expires() and any state created by this rule
is immediately expired.

Thus, make fields states_cur, states_tot and src_nodes of struct
pf_rule be counter(9)s.

Thanks to Dennis for providing me shell access to problematic box and
his help with reproducing, debugging and investigating the problem.

Thanks to:		Dennis Yusupoff <dyr smartspb.net>
Also reported by:	dumbbell, pgj, Rambler
Sponsored by:		Nginx, Inc.
2014-02-14 10:05:21 +00:00
Adrian Chadd
0e778c88c9 Don't insert a flowtable entry if the lle isn't yet valid.
Some of the collisions that are occuring are due to flowtable lookups
that succeed but have an invalid lle - typically because the L2 adjacency
lookup hasn't completed.  This would lead to a follow-up insert which
would then fail (ie, collision) and the code would fall through to doing
a slow-path L2/L3 lookup in the netinet/netinet6 code.

This patch simply aborts storing a new flowtable entry if the lle isn't
yet valid.

Whilst I'm here, add a new pcpu counter for the item so the number of
failures can be tracked separately from generic "collisions."

Reviewed by:	glebius
MFC after:	10 days
Sponsored by:	Netflix, Inc.
2014-02-14 00:05:09 +00:00
Gleb Smirnoff
25c03b9100 Remove unused FL_NOAUTO. 2014-02-13 05:19:09 +00:00
Gleb Smirnoff
4343b5fa49 o Axe non-pcpu flowtable implementation. It wasn't enabled or used,
and probably is a leftover from first prototyping by Kip. The
  non-pcpu implementation used mutexes, so it doubtfully worked
  better than simple routing lookup.
o Use UMA_ZONE_PCPU zone for pointers instead of [MAXCPU] arrays,
  use zpcpu_get() to access data in there.
o Substitute own single list implementation with SLIST(). This
  has two functional side effects:
  - new flows go into head of a list, before they went to tail.
  - a bug when incorrect flow was deleted in flow cleaner is
    fixed.
o Due to cache line alignment, there is no reason to keep
  different zones for IPv4 and IPv6 flows. Both consume one
  cache line, real size of allocation is equal.
o Rely on that f_hash, f_rt, f_lle are stable during fle
  lifetime, remove useless volatile quilifiers.
o More INET/INET6 splitting.

Reviewed by:	adrian
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-13 04:59:18 +00:00
Mikolaj Golub
db2f5a2461 Fixup for r261590 (vnet sysctl handlers cleanup).
Reviewed by:	glebius
2014-02-09 08:13:17 +00:00
Gleb Smirnoff
0d3009b37c Remove ft_rtalloc and choose rtalloc function at compile time. 2014-02-08 22:12:00 +00:00
Gleb Smirnoff
2333893af3 Spacing. 2014-02-08 22:10:53 +00:00
Gleb Smirnoff
07d9bc0740 Revert accidentially leaked changes in r261627. 2014-02-08 09:57:52 +00:00
Gleb Smirnoff
603819bc74 Remove never set flag FL_OVERWRITE. The only place where
it was checked led to lock/critnest leak.
2014-02-08 09:56:26 +00:00
Gleb Smirnoff
d60a1d1ef4 Fix comment. 2014-02-07 22:30:42 +00:00
Gleb Smirnoff
f83f97fcbc Remove unused defines. 2014-02-07 21:56:16 +00:00
Gleb Smirnoff
5d6d7e756b o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
    passing mbuf and address family. That's the only code under
    #ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
  - Remove hand made pcpu stats, and utilize counter(9).
  - Snapshot of statistics is available via 'netstat -rs'.
  - All sysctls are moved into net.flowtable namespace, since
    spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
  - Remove chain of multiple flowtables. We simply have one for
    IPv4 and one for IPv6.
  - Flowtables are allocated in flowtable.c, symbols are static.
  - With proper argument to SYSINIT() we no longer need flowtable_ready.
  - Hash salt doesn't need to be per-VNET.
  - Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-07 15:18:23 +00:00
Gleb Smirnoff
b5c32cf481 Remove identical vnet sysctl handlers, and handle CTLFLAG_VNET
in the sysctl_root().

Note: SYSCTL_VNET_* macros can be removed as well. All is
  needed to virtualize a sysctl oid is set CTLFLAG_VNET on it.
  But for now keep macros in place to avoid large code churn.

Sponsored by:	Nginx, Inc.
2014-02-07 13:47:33 +00:00
Gleb Smirnoff
7cf4986ba9 Spacing. 2014-02-07 10:05:12 +00:00
Alexander V. Chernikov
95fbe4d0cc Simplify filling sockaddr_dl structure for if_resolvemulti()
callback providers. link_init_sdl() function can be used to
fill most of the parameters. Use caller stack instead of
allocation / freing memory for each request. Do not drop support
for extra-long (probably non-existing) link-layer protocols by
introducing link_alloc_sdl() (used by if_resolvemulti() callback)
and link_free_sdl() (used by caller).
Since this change breaks KBI, MFC requires slightly different approach
(link_init_sdl() auto-allocating buffer if necessary to handle cases
 with unmodified if_resolvemulti() callers).

MFC after:	2 weeks
2014-01-18 23:24:51 +00:00
Luigi Rizzo
6601501905 forgot to update this file in 2607000 2014-01-17 04:38:58 +00:00
Luigi Rizzo
b82b221181 use explicit casts with void* to compile when included by C++ code 2014-01-11 00:00:11 +00:00
Alexander V. Chernikov
d375edc9b5 Simplify inet alias handling code: if we're adding/removing alias which
has the same prefix as some other alias on the same interface, use
newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg().

This eliminates the following rtsock messages:

Pinned RTM_ADD for prefix (for alias addition).
Pinned RTM_DELETE for prefix (for alias withdrawal).

Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24):

before commit, addition:

  got message of size 116 on Fri Jan 10 14:13:15 2014
  RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

  got message of size 192 on Fri Jan 10 14:13:15 2014
  RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
  locks:  inits:
  sockaddrs: <DST,GATEWAY,NETMASK>
   10.0.0.0 10.0.0.2 (255) ffff ffff ff

after commit, addition:

  got message of size 116 on Fri Jan 10 13:56:26 2014
  RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255

before commit, wihdrawal:

  got message of size 192 on Fri Jan 10 13:58:59 2014
  RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
  locks:  inits:
  sockaddrs: <DST,GATEWAY,NETMASK>
   10.0.0.0 10.0.0.2 (255) ffff ffff ff

  got message of size 116 on Fri Jan 10 13:58:59 2014
  RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

adter commit, withdrawal:

  got message of size 116 on Fri Jan 10 14:14:11 2014
  RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong
(and requires some hacks to keep prefix in route table on RTM_DELETE).

I've tested this change with quagga (no change) and bird (*).

bird alias handling is already broken in *BSD sysdep code, so nothing
changes here, too.

I'm going to MFC this change if there will be no complains about behavior
change.

While here, fix some style(9) bugs introduced by r260488
(pointed by glebius and bde).

Sponsored by:	Yandex LLC
MFC after:	4 weeks
2014-01-10 12:13:55 +00:00
Alexander V. Chernikov
4cbac30b29 Split rt_newaddrmsg_fib() into two different functions.
Adding/deleting interface addresses involves access to 3 different subsystems,
int different parts of code. Each call can fail, so reporting successful
operation by rtsock in the middle of the process error-prone.

Further split routing notification API and actual rtsock calls via creating
public-available rt_addrmsg() / rt_routemsg() functions with "private"
rtsock_* backend.

MFC after:	2 weeks
2014-01-09 18:13:25 +00:00
Alexander V. Chernikov
7d9b6df18b Constanly use RT_ALL_FIBS everywhere instead of -1.
MFC after:	2 weeks
2014-01-08 23:09:02 +00:00
Alexander V. Chernikov
955a2deb52 Remove dead code.
Reported by:	Coverity
Coverity CID:	1018057
MFC after:	2 weeks
2014-01-07 19:00:40 +00:00
Alexander V. Chernikov
50da3e886d Teach every SIOCGIFSTATUS provider to fill in ifs->ascii anyway.
Remove old bits of data concat for 'ascii' field.
Remove special SIOCGIFSTATUS handling from if.c (which Coverity yells at).

Reported by:	Coverity
Coverity CID:	1147174
MFC after:	2 weeks
2014-01-07 15:59:33 +00:00
Alexander V. Chernikov
034c09ff10 Partially fix IPv4 interface routes deletion in RADIX_MPATH.
Noticed by:	Nikolay Denev <ndenev at gmail.com>
MFC after:	1 month
2014-01-06 22:36:20 +00:00
Luigi Rizzo
17885a7bfd It is 2014 and we have a new version of netmap.
Most relevant features:

- netmap emulation on any NIC, even those without native netmap support.

  On the ixgbe we have measured about 4Mpps/core/queue in this mode,
  which is still a lot more than with sockets/bpf.

- seamless interconnection of VALE switch, NICs and host stack.

  If you disable accelerations on your NIC (say em0)

        ifconfig em0 -txcsum -txcsum

  you can use the VALE switch to connect the NIC and the host stack:

        vale-ctl -h valeXX:em0

  allowing sharing the NIC with other netmap clients.

- THE USER API HAS SLIGHTLY CHANGED (head/cur/tail pointers
  instead of pointers/count as before). This was unavoidable to support,
  in the future, multiple threads operating on the same rings.
  Netmap clients require very small source code changes to compile again.
      On the plus side, the new API should be easier to understand
  and the internals are a lot simpler.

The manual page has been updated extensively to reflect the current
features and give some examples.

This is the result of work of several people including Giuseppe Lettieri,
Vincenzo Maffione, Michio Honda and myself, and has been financially
supported by EU projects CHANGE and OPENLAB, from NetApp University
Research Fund, NEC, and of course the Universita` di Pisa.
2014-01-06 12:53:15 +00:00
Alexander V. Chernikov
5a2f4cbd92 Change semantics for rnh_lookup() function: now
it performs exact match search, regardless of netmask existance.
This simplifies most of rnh_lookup() consumers.

Fix panic triggered by deleting non-existent host route.

PR:		kern/185092
Submitted by:	Nikolay Denev <ndenev at gmail.com>
MFC after:	1 month
2014-01-04 22:25:26 +00:00
Alexander V. Chernikov
868f984c05 Remove useless register variable modifiers.
Do some more style(9).

MFC after:	2 weeks
2014-01-03 14:33:25 +00:00
George V. Neville-Neil
d9168b014f Convert #defines to enums so that the values are visible in the debugger.
Requested by:	gibbs
MFC after:	2 weeks
2014-01-02 21:30:59 +00:00
Scott Long
1a8959dac6 Multi-queue NIC drivers and multi-port lagg tend to use the same lower
bits of the flowid as each other, resulting in a poor distribution of
packets among queues in certain cases.  Work around this by adding a
set of sysctls for controlling a bit-shift on the flowid when doing
multi-port aggrigation in lagg and lacp.  By default, lagg/lacp will
now use bits 16 and higher instead of 0 and higher.

Reviewed by:	max
Obtained from:	Netflix
MFC after:	3 days
2013-12-30 01:32:17 +00:00
Alexander V. Chernikov
78aed5e800 Simplify contiguous mask checking.
Suggested by:	glebius
MFC after:	2 weeks
2013-12-17 22:16:27 +00:00
Luigi Rizzo
f9790aeb88 split netmap code according to functions:
- netmap.c		base code
- netmap_freebsd.c	FreeBSD-specific code
- netmap_generic.c	emulate netmap over standard drivers
- netmap_mbq.c		simple mbuf tailq
- netmap_mem2.c		memory management
- netmap_vale.c		VALE switch

simplify devce-specific code
2013-12-15 08:37:24 +00:00
George V. Neville-Neil
4fe3b90bd3 Add constants for use in interrogating various fiber and copper connectors
most often used with network interfaces.

The SFF-8472 standard defines the information that can be retrieved
from an optic or a copper cable plugged into a NIC, most often
referred to as SFP+.  Examples of values that can be read
include the cable vendor's name, part number, date of manufacture
as well as running data such as temperature, voltage and tx
and rx power.

Copious comments on how to use these values with an I2C interface
are given in the header file itself.

MFC after:	2 weeks
2013-11-27 20:20:02 +00:00
Gleb Smirnoff
4678c74014 Fix build. 2013-11-27 07:21:25 +00:00
Sergey Kandaurov
da162ca88f Fix macro name in comment. 2013-11-26 15:23:56 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Craig Rodrigues
6274ce3e2b In vnet_route_uninit(), free some memory that is allocated in vnet_route_init().
To reproduce the problem:
  (1)  Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS,
       INVARIANTS.
  (2)  Run this command in a loop:
       jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo

       see: http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html
            http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021291.html

This doesn't eliminate all the "Freed UMA keg was not empty" warning messages
on the console, but it helps.
2013-11-25 20:33:33 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Gleb Smirnoff
d77c1b3269 To support upcoming changes change internal API for source node handling:
- Removed pf_remove_src_node().
- Introduce pf_unlink_src_node() and pf_unlink_src_node_locked().
  These function do not proceed with freeing of a node, just disconnect
  it from storage.
- New function pf_free_src_nodes() works on a list of previously
  disconnected nodes and frees them.
- Utilize new API in pf_purge_expired_src_nodes().

In collaboration with:	Kajetan Staszkiewicz <kajetan.staszkiewicz innogames.de>

Sponsored by:	InnoGames GmbH
Sponsored by:	Nginx, Inc.
2013-11-22 19:16:34 +00:00
Gleb Smirnoff
3260ae00be Add missing 'extern'. 2013-11-22 19:02:22 +00:00
Gleb Smirnoff
654957c2c8 Merge head up to r258343. 2013-11-19 12:21:47 +00:00
George V. Neville-Neil
4857f5fbbc Allow ethernet drivers to pass in packets connected via the nextpkt pointer.
Handling packets in this way allows drivers to amortize work during packet reception.

Submitted by:	Vijay Singh
Sponsored by:	NetApp
2013-11-18 22:58:14 +00:00
Gleb Smirnoff
f053058cee - Split functions that initialize various pf parts into their vimage
parts and global parts.
- Since global parts appeared to be only mutex initializations, just
  abandon them and use MTX_SYSINIT() instead.
- Kill my incorrect VNET_FOREACH() iterator and instead use correct
  approach with VNET_SYSINIT().

Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny
2013-11-18 22:18:07 +00:00
George V. Neville-Neil
1350c361f6 Clean up the macros to avoid using casts.
Suggested by: bde and jhb
2013-11-15 16:03:32 +00:00
Andrey V. Elsukov
c72a5d5d89 ANSIfy function defintions. 2013-11-15 12:12:50 +00:00
George V. Neville-Neil
d4aff8d11e Put in the correct bit shifting and add a type to prevent clang from complaining.
While here fix up a grammar nit.

Pointed out by: Sergey Kandaurov and bz@ respectively.
2013-11-14 21:57:37 +00:00
George V. Neville-Neil
868eef3239 Shift our OUI correctly.
Pointed out by: emaste
2013-11-14 20:07:17 +00:00
George V. Neville-Neil
62f1e1b2bc The FreeBSD Project now has its own, Ogranizationally Unique Identifier,
assigned by the IEEE.  This file includes documentation on how developers
must carve up the space as well as an initial allocation for bhyve.

Sponsored by:	The FreeBSD Foundation
2013-11-14 19:53:35 +00:00
Gleb Smirnoff
50d3286d9d Merge head r232040 through r258006. 2013-11-11 20:33:25 +00:00
Gleb Smirnoff
555036b5f6 Remove never used ioctls that originate from KAME. The proof
of their zero usage was exp-run from misc/183538.
2013-11-11 05:39:42 +00:00
Gleb Smirnoff
77b89ad837 Provide compat layer for OSIOCAIFADDR. 2013-11-06 19:46:20 +00:00
Gleb Smirnoff
af50ea380f Axe IFF_SMART. Fortunately this layering violating flag was never used,
it was just declared.
2013-11-05 12:52:56 +00:00
Gleb Smirnoff
5fb009bda7 Drop support for historic ioctls and also undefine them, so that code
that checks their presence via ifdef, won't use them.

Bump __FreeBSD_version as safety measure.
2013-11-05 10:29:47 +00:00
Gleb Smirnoff
9a6356bc31 In complemence to ifa_add_loopback_route() and ifa_del_loopback_route()
provide function ifa_switch_loopback_route() that will be used in case when
an interface address used for a loopback route goes away, but we have another
interface address with same address value and want to preserve loopback
route.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-11-05 07:36:17 +00:00
Gleb Smirnoff
b1b9dcae46 Remove net.link.ether.inet.useloopback sysctl tunable. It was always on by
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable have been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages comparing to normal behavior.
2013-11-05 07:32:09 +00:00
Adrian Chadd
dd50b3107e Restore the entropy gathering from the m_data pointer value, not the
m_data payload.

After talking with markm/bde, this is what markm actually intended.
2013-11-02 15:13:02 +00:00
Luigi Rizzo
ce3ee1e7c4 update to the latest netmap snapshot.
This includes the following:
- use separate memory regions for VALE ports
- locking fixes
- some simplifications in the NIC-specific routines
- performance improvements for the VALE switch
- some new features in the pkt-gen test program
- documentation updates

There are small API changes that require programs to be recompiled
(NETMAP_API has been bumped so you will detect old binaries at runtime).

In particular:
- struct netmap_slot now is 16 bytes to support an extra pointer,
  which may save one data copy when using VALE ports or VMs;
- the struct netmap_if has two extra fields;

MFC after:	3 days
2013-11-01 21:21:14 +00:00
Adrian Chadd
a09968c479 Convert the random entropy harvesting code to use a const void * pointer
rather than just void *.

Then, as part of this, convert a couple of mbuf m->m_data accesses
to mtod(m, const void *).

Reviewed by:	markm
Approved by:	security-officer (delphij)
Sponsored by:	Netflix, Inc.
2013-11-01 20:53:49 +00:00
Gleb Smirnoff
f9b2a21c9e Merge head r232040 through r257457.
M    usr.sbin/portsnap/portsnap/portsnap.8
M    usr.sbin/portsnap/portsnap/portsnap.sh
M    usr.sbin/tcpdump/tcpdump/Makefile
2013-10-31 17:33:29 +00:00
Andre Oppermann
5b74cfe42f Make struct ifnet readable and comprehensible again by grouping
and ordering related variables, fields and locks next to each
other.  Add more comments to variables.

Over time 'ifnet' has accumlated a lot of additional pointers and
functionality in an unstructured way making it quite hard to read
and understand while obfuscating relationships between fields and
variables.

Quantify the structure size and how bloated it has become.

This is only a mechanical change in preparation for upcoming
work to make ifnet opaque to drivers and to separate out the
interface queuing.

Sponsored by:	The FreeBSD Foundation
2013-10-31 15:46:10 +00:00
Andre Oppermann
ded7d20fc5 Move all interface queue related structures, macros and definitions
from net/if_var to it own new net/ifq.h.

For now net/ifq.h is unconditionally included through net/if_var.h.

This is a mechanical change in preparation to make struct ifnet and
the individual interface queue mechanisms opaque.

Discussed with:	glebius
Sponsored by:	The FreeBSD Foundation
2013-10-29 17:48:08 +00:00
Gleb Smirnoff
eaeb0c139a Style: s/SYS_EVENTHANDLER_H/_SYS_EVENTHANDLER_H_/g
Submitted by:	bde
2013-10-28 20:32:05 +00:00
Gleb Smirnoff
c29e1ad930 - Make the prophecy from 1997 happen and remove if_var.h inclusion
from if.h.
- Remove unnecessary includes and declarations from if.h
- Remove unnecessary includes and declarations from if_var.h [1]
- Mark some declarations that are about to be removed in near
  future with comments, explaning why this declaration is still
  necessary.
- Protect eventhandler declarations with #ifdef SYS_EVENTHANDLER_H.

Obtained from:	bdeBSD [1]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 08:03:40 +00:00
Gleb Smirnoff
7ced9c2f66 Instead of putting ifnet declaration into eventhandler.h, move
bpf(4) and vlan(4) related event declarations to bpf.h and
if_vlan_var.h. To avoid dependency on eventhandler.h, protect
these declarations with ifdef SYS_EVENTHANDLER_H.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:45:03 +00:00