Commit Graph

3349 Commits

Author SHA1 Message Date
Rick Macklem
209579aeac For NFS mounts using rsize,wsize=65536 over TSO enabled
network interfaces limited to 32 transmit segments, there
are two known issues.
The more serious one is that for an I/O of slightly less than 64K,
the net device driver prepends an ethernet header, resulting in a
TSO segment slightly larger than 64K. Since m_defrag() copies this
into 33 mbuf clusters, the transmit fails with EFBIG.
A tester indicated observing a similar failure using iSCSI.

The second less critical problem is that the network
device driver must copy the mbuf chain via m_defrag()
(m_collapse() is not sufficient), resulting in measurable overhead.

This patch reduces the default size of if_hw_tsomax
slightly, so that the first issue is avoided.
Fixing the second issue will require a way for the
network device driver to inform tcp_output() that it
is limited to 32 transmit segments.

Reported and tested by:	csforgeron@gmail.com, markus.gebert@hostpoint.ch
MFC after:	2 weeks
2014-04-17 23:31:50 +00:00
Rick Macklem
bd9fd667c4 Vlan did not set the value of if_hw_tsomax, so when vlan
was stacked on top of a network interface that set if_hw_tsomax,
tcp_output() would see the default value instead of the value
set by the network interface. This patch modifies vlan so that
it sets if_hw_tsomax to the value of the parent interface.

Reviewed by:	glebius
MFC after:	2 weeks
2014-04-15 21:48:35 +00:00
Rick Macklem
d092e11c6a Fix build for non-INET that was broken by r264469.
MFC after:	2 weeks
2014-04-15 13:28:54 +00:00
Rick Macklem
9387570f8a Lagg did not set the value of if_hw_tsomax, so when lagg
was stacked on top of network interfaces that set if_hw_tsomax,
tcp_output() would see the default value instead of the value
set by the network interface(s). This patch modifies lagg so that
it sets if_hw_tsomax to the minimum of the value(s) for the
underlying network interfaces.

Reviewed by:	glebius
MFC after:	2 weeks
2014-04-14 20:34:48 +00:00
Bruce M Simpson
d565e5f206 In if_freemulti(), relax the paranoid KASSERT() on ifma->ifma_protospec.
This KASSERT() existed as a sanity check that upper layers in the network
stack (e.g. inet, inet6) had released their reference to the underlying
driver's multicast memberships (ifmultiaddr{}). However it assumes the
lifecycle of the driver membership corresponds to the lifecycle of the
network layer membership.

In the submitter's case, ieee80211_ioctl_updatemulti() attempts to
reprogram the (parent, physical) ifnet{} memberships in response
to a change in membership on the (child, virtual) VAP ifnet, using
a batched update mechanism. These updates happen independently from
the network layer, causing a "false negative" assertion failure.

There are possibly other use cases where this KASSERT() may be triggered
by other networking stack activity (e.g. where a nesting relationship
exists between multiple ifnet{} instances). This suggests that further
review of FreeBSD's approach to nested ifnet relationships is needed.

MFC after:	6 weeks
Submitted by:	adrian@
2014-04-10 18:43:02 +00:00
Michael Tuexen
7f946da063 Call sctp_addr_change() from rt_addrmsg() instead of rt_newaddrmsg_fib(),
since rt_addrmsg() gets also called from other functions.

MFC after: 3 days
2014-04-07 21:28:21 +00:00
Martin Matuska
7e92ce7380 De-virtualize UMA zone pf_mtag_z and move to global initialization part.
The m_tag struct does not know about vnet context and the pf_mtag_free()
callback is called unaware of current vnet. This causes a panic.

Reviewed by:	Nikos Vassiliadis, trociny@
2014-03-29 09:05:25 +00:00
Martin Matuska
1709ccf9d3 Merge head up to r263906. 2014-03-29 08:39:53 +00:00
Martin Matuska
d318d97fb5 Merge from projects/pf r251993 (glebius@):
De-vnet hash sizes and hash masks.

Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny

MFC after:	1 month
2014-03-25 06:55:53 +00:00
Navdeep Parhar
61b9f217c6 Add a shorter alias for if_data.ifi_oqdrops. 2014-03-20 02:23:52 +00:00
Julio Merino
b6d88aa39b Include strings.h so that bpf_filter.c can be built in userland.
This is to bring in a definition for bzero(3), which in turn allows the
tests in tools/regression/bpf/ to build again.
2014-03-19 13:10:25 +00:00
Gleb Smirnoff
2d70c0ded1 When exporting ifnet via sysctl, add ifqueue(9) drop count to the
ifi_oqdrops. This is a temporary workaround until ifqueue(9) vanishes.

While here, remove the pointless ifi_vhid assignment. It has
sense only when we are exporting ifaddrs, not ifnets.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-19 06:08:03 +00:00
Gleb Smirnoff
66dcee729c Garbage collect long time obsoleted (or never used) stuff from routing API. 2014-03-15 06:49:32 +00:00
Robert Watson
7527624efa Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/ configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid()to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
Gleb Smirnoff
45c203fce2 Remove AppleTalk support.
AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 06:29:43 +00:00
Gleb Smirnoff
2c284d9395 Remove IPX support.
IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 02:58:48 +00:00
Gleb Smirnoff
b245f96c44 Since 32-bit if_baudrate isn't enough to describe a baud rate of a 10 Gbit
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.

o Remove the if_baudrate_pf crutch.

o Make all fields of struct if_data fixed machine independent size. The
  notion of data (packet counters, etc) are by no means MD. And it is a
  bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
  which at modern speeds overflow within a second.

  This also removes quite a lot of COMPAT_FREEBSD32 code.

o Give 16 bit for the ifi_datalen field. This field was provided to
  make future changes to if_data less ABI breaking. Unfortunately the
  8 bit size of it had effectively limited sizeof if_data to 256 bytes.

o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.

__FreeBSD_version bumped.

Discussed with:	emax
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-13 03:42:24 +00:00
Gleb Smirnoff
256ea2aba9 The route code used to mtx_destroy() a locked mutex before rtentry free. Now,
after r262763 it started to return locked mutexes to UMA. To fix that,
conditionally unlock the mutex in the destructor.

Tested by:	"Sergey V. Dyatko" <sergey.dyatko@gmail.com>
2014-03-05 21:16:46 +00:00
Gleb Smirnoff
f788ab5f0b Pacify gcc. 2014-03-05 02:35:15 +00:00
Gleb Smirnoff
5274e55eb3 Hide struct rtentry from userland. 2014-03-05 01:47:08 +00:00
Gleb Smirnoff
e3a7aa6f56 - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
  mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
Gleb Smirnoff
fb3541ad15 Instead of playing games with casts simply add 3 more members to the
structure pf_rule, that are used when the structure is passed via
ioctl().

PR:		187074
2014-03-05 00:40:03 +00:00
George V. Neville-Neil
9b5f5ede41 Revert previous commit (262727) and bounce patch back to the
submitter.

Pointed out by: jhb
2014-03-04 23:55:04 +00:00
George V. Neville-Neil
596031c052 Naming consistency fix. The routing code defines
RADIX_NODE_HEAD_LOCK as grabbing the write lock,
but RADIX_NODE_HEAD_LOCK_ASSERT as checking the read lock.

Submitted by:	Vijay Singh <vijju.singh at gmail.com>
MFC after:	1 month
2014-03-04 05:09:46 +00:00
John Baldwin
5b26ea5df3 Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295.  A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR:		ports/184525 (exp-run)
2014-02-25 18:44:33 +00:00
Martin Matuska
5748b897da Merge head up to r262222 (last merge was incomplete). 2014-02-19 22:02:15 +00:00
Marko Zec
c5d4eab644 V_irtualize rtsock refcounting, which reduces the chances for panics
on teardown of vnets without active routing sockets while at least
one routing socket is active elsewhere.

Tested by:	Vijay Singh
MFC after:	3 days
2014-02-19 08:29:07 +00:00
Gleb Smirnoff
ec0ad11ed2 Fix incorrect assertions. 2014-02-18 14:21:26 +00:00
Gleb Smirnoff
044ae9dbef Add my copyright to flowtable. 2014-02-17 12:07:17 +00:00
Gleb Smirnoff
47c8550576 Whitespace. 2014-02-17 12:02:44 +00:00
Gleb Smirnoff
5ec03c2bff Bring copyright notice to standard style. 2014-02-17 12:01:50 +00:00
Gleb Smirnoff
0ff96b4f55 o Remove at compile time the HASH_ALL code, that was never
tested and is unfinished. However, I've tested my version,
  it works okay. As before it is unfinished: timeout aren't
  driven by TCP session state. To enable the HASH_ALL mode,
  one needs in kernel config:

	options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
  the FLOWTABLE_HASH_ALL option, twice less memory would
  be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
  flowtable_lookup_ipv6(). Stop copying data to on stack
  sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
  on insertion into flowtable_insert().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-17 11:50:56 +00:00
Adrian Chadd
a2c5809961 Make sure that the flowtable flowid is only set to m_flowid if there
isn't one already supplied.

The previous flowtable code also did this.

Reviewed by:	glebius
Sponsored by:	Netflix, Inc.
2014-02-15 07:57:01 +00:00
Luigi Rizzo
f0ea3689a9 This new version of netmap brings you the following:
- netmap pipes, providing bidirectional blocking I/O while moving
  100+ Mpps between processes using shared memory channels
  (no mistake: over one hundred million. But mind you, i said
  *moving* not *processing*);

- kqueue support (BHyVe needs it);

- improved user library. Just the interface name lets you select a NIC,
  host port, VALE switch port, netmap pipe, and individual queues.
  The upcoming netmap-enabled libpcap will use this feature.

- optional extra buffers associated to netmap ports, for applications
  that need to buffer data yet don't want to make copies.

- segmentation offloading for the VALE switch, useful between VMs.

and a number of bug fixes and performance improvements.

My colleagues Giuseppe Lettieri and Vincenzo Maffione did a substantial
amount of work on these features so we owe them a big thanks.

There are some external repositories that can be of interest:

    https://code.google.com/p/netmap
        our public repository for netmap/VALE code, including
        linux versions and other stuff that does not belong here,
        such as python bindings.

    https://code.google.com/p/netmap-libpcap
        a clone of the libpcap repository with netmap support.
	With this any libpcap client has access to most netmap
	feature with no recompilation. E.g. tcpdump can filter
	packets at 10-15 Mpps.

    https://code.google.com/p/netmap-ipfw
        a userspace version of ipfw+dummynet which uses netmap
        to send/receive packets. Speed is up in the 7-10 Mpps
        range per core for simple rulesets.

Both netmap-libpcap and netmap-ipfw will be merged upstream at some
point, but while this happens it is useful to have access to them.

And yes, this code will be merged soon. It is infinitely better
than the version currently in 10 and 9.

MFC after:	3 days
2014-02-15 04:53:04 +00:00
Gleb Smirnoff
f0e49f6631 Whenever flowtable lookup fails, we do route lookup and then try to
insert flow entry. During the route lookup the critical section is
exited. It may happen, that after route lookup we will be executed
on an other CPU that already has such flowentry. Before this change
we simply freed the flowentry and returned to ip_output() with
failure.

Actually there is nothing wrong with using previously allocated
flow entry, updating it properly. Thus, make flowentry_insert()
return the new either old fle, and make use of it.

Count reuses as "collisions" and real inserts as "inserts".

Reviewed by:	adrian
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-14 10:56:26 +00:00
Gleb Smirnoff
48278b8846 Once pf became not covered by a single mutex, many counters in it became
race prone. Some just gather statistics, but some are later used in
different calculations.

A real problem was the race provoked underflow of the states_cur counter
on a rule. Once it goes below zero, it wraps to UINT32_MAX. Later this
value is used in pf_state_expires() and any state created by this rule
is immediately expired.

Thus, make fields states_cur, states_tot and src_nodes of struct
pf_rule be counter(9)s.

Thanks to Dennis for providing me shell access to problematic box and
his help with reproducing, debugging and investigating the problem.

Thanks to:		Dennis Yusupoff <dyr smartspb.net>
Also reported by:	dumbbell, pgj, Rambler
Sponsored by:		Nginx, Inc.
2014-02-14 10:05:21 +00:00
Adrian Chadd
0e778c88c9 Don't insert a flowtable entry if the lle isn't yet valid.
Some of the collisions that are occuring are due to flowtable lookups
that succeed but have an invalid lle - typically because the L2 adjacency
lookup hasn't completed.  This would lead to a follow-up insert which
would then fail (ie, collision) and the code would fall through to doing
a slow-path L2/L3 lookup in the netinet/netinet6 code.

This patch simply aborts storing a new flowtable entry if the lle isn't
yet valid.

Whilst I'm here, add a new pcpu counter for the item so the number of
failures can be tracked separately from generic "collisions."

Reviewed by:	glebius
MFC after:	10 days
Sponsored by:	Netflix, Inc.
2014-02-14 00:05:09 +00:00
Gleb Smirnoff
25c03b9100 Remove unused FL_NOAUTO. 2014-02-13 05:19:09 +00:00
Gleb Smirnoff
4343b5fa49 o Axe non-pcpu flowtable implementation. It wasn't enabled or used,
and probably is a leftover from first prototyping by Kip. The
  non-pcpu implementation used mutexes, so it doubtfully worked
  better than simple routing lookup.
o Use UMA_ZONE_PCPU zone for pointers instead of [MAXCPU] arrays,
  use zpcpu_get() to access data in there.
o Substitute own single list implementation with SLIST(). This
  has two functional side effects:
  - new flows go into head of a list, before they went to tail.
  - a bug when incorrect flow was deleted in flow cleaner is
    fixed.
o Due to cache line alignment, there is no reason to keep
  different zones for IPv4 and IPv6 flows. Both consume one
  cache line, real size of allocation is equal.
o Rely on that f_hash, f_rt, f_lle are stable during fle
  lifetime, remove useless volatile quilifiers.
o More INET/INET6 splitting.

Reviewed by:	adrian
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-13 04:59:18 +00:00
Mikolaj Golub
db2f5a2461 Fixup for r261590 (vnet sysctl handlers cleanup).
Reviewed by:	glebius
2014-02-09 08:13:17 +00:00
Gleb Smirnoff
0d3009b37c Remove ft_rtalloc and choose rtalloc function at compile time. 2014-02-08 22:12:00 +00:00
Gleb Smirnoff
2333893af3 Spacing. 2014-02-08 22:10:53 +00:00
Gleb Smirnoff
07d9bc0740 Revert accidentially leaked changes in r261627. 2014-02-08 09:57:52 +00:00
Gleb Smirnoff
603819bc74 Remove never set flag FL_OVERWRITE. The only place where
it was checked led to lock/critnest leak.
2014-02-08 09:56:26 +00:00
Gleb Smirnoff
d60a1d1ef4 Fix comment. 2014-02-07 22:30:42 +00:00
Gleb Smirnoff
f83f97fcbc Remove unused defines. 2014-02-07 21:56:16 +00:00
Gleb Smirnoff
5d6d7e756b o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
    passing mbuf and address family. That's the only code under
    #ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
  - Remove hand made pcpu stats, and utilize counter(9).
  - Snapshot of statistics is available via 'netstat -rs'.
  - All sysctls are moved into net.flowtable namespace, since
    spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
  - Remove chain of multiple flowtables. We simply have one for
    IPv4 and one for IPv6.
  - Flowtables are allocated in flowtable.c, symbols are static.
  - With proper argument to SYSINIT() we no longer need flowtable_ready.
  - Hash salt doesn't need to be per-VNET.
  - Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-07 15:18:23 +00:00
Gleb Smirnoff
b5c32cf481 Remove identical vnet sysctl handlers, and handle CTLFLAG_VNET
in the sysctl_root().

Note: SYSCTL_VNET_* macros can be removed as well. All is
  needed to virtualize a sysctl oid is set CTLFLAG_VNET on it.
  But for now keep macros in place to avoid large code churn.

Sponsored by:	Nginx, Inc.
2014-02-07 13:47:33 +00:00
Gleb Smirnoff
7cf4986ba9 Spacing. 2014-02-07 10:05:12 +00:00
Alexander V. Chernikov
95fbe4d0cc Simplify filling sockaddr_dl structure for if_resolvemulti()
callback providers. link_init_sdl() function can be used to
fill most of the parameters. Use caller stack instead of
allocation / freing memory for each request. Do not drop support
for extra-long (probably non-existing) link-layer protocols by
introducing link_alloc_sdl() (used by if_resolvemulti() callback)
and link_free_sdl() (used by caller).
Since this change breaks KBI, MFC requires slightly different approach
(link_init_sdl() auto-allocating buffer if necessary to handle cases
 with unmodified if_resolvemulti() callers).

MFC after:	2 weeks
2014-01-18 23:24:51 +00:00
Luigi Rizzo
6601501905 forgot to update this file in 2607000 2014-01-17 04:38:58 +00:00
Luigi Rizzo
b82b221181 use explicit casts with void* to compile when included by C++ code 2014-01-11 00:00:11 +00:00
Alexander V. Chernikov
d375edc9b5 Simplify inet alias handling code: if we're adding/removing alias which
has the same prefix as some other alias on the same interface, use
newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg().

This eliminates the following rtsock messages:

Pinned RTM_ADD for prefix (for alias addition).
Pinned RTM_DELETE for prefix (for alias withdrawal).

Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24):

before commit, addition:

  got message of size 116 on Fri Jan 10 14:13:15 2014
  RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

  got message of size 192 on Fri Jan 10 14:13:15 2014
  RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
  locks:  inits:
  sockaddrs: <DST,GATEWAY,NETMASK>
   10.0.0.0 10.0.0.2 (255) ffff ffff ff

after commit, addition:

  got message of size 116 on Fri Jan 10 13:56:26 2014
  RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255

before commit, wihdrawal:

  got message of size 192 on Fri Jan 10 13:58:59 2014
  RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
  locks:  inits:
  sockaddrs: <DST,GATEWAY,NETMASK>
   10.0.0.0 10.0.0.2 (255) ffff ffff ff

  got message of size 116 on Fri Jan 10 13:58:59 2014
  RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

adter commit, withdrawal:

  got message of size 116 on Fri Jan 10 14:14:11 2014
  RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong
(and requires some hacks to keep prefix in route table on RTM_DELETE).

I've tested this change with quagga (no change) and bird (*).

bird alias handling is already broken in *BSD sysdep code, so nothing
changes here, too.

I'm going to MFC this change if there will be no complains about behavior
change.

While here, fix some style(9) bugs introduced by r260488
(pointed by glebius and bde).

Sponsored by:	Yandex LLC
MFC after:	4 weeks
2014-01-10 12:13:55 +00:00
Alexander V. Chernikov
4cbac30b29 Split rt_newaddrmsg_fib() into two different functions.
Adding/deleting interface addresses involves access to 3 different subsystems,
int different parts of code. Each call can fail, so reporting successful
operation by rtsock in the middle of the process error-prone.

Further split routing notification API and actual rtsock calls via creating
public-available rt_addrmsg() / rt_routemsg() functions with "private"
rtsock_* backend.

MFC after:	2 weeks
2014-01-09 18:13:25 +00:00
Alexander V. Chernikov
7d9b6df18b Constanly use RT_ALL_FIBS everywhere instead of -1.
MFC after:	2 weeks
2014-01-08 23:09:02 +00:00
Alexander V. Chernikov
955a2deb52 Remove dead code.
Reported by:	Coverity
Coverity CID:	1018057
MFC after:	2 weeks
2014-01-07 19:00:40 +00:00
Alexander V. Chernikov
50da3e886d Teach every SIOCGIFSTATUS provider to fill in ifs->ascii anyway.
Remove old bits of data concat for 'ascii' field.
Remove special SIOCGIFSTATUS handling from if.c (which Coverity yells at).

Reported by:	Coverity
Coverity CID:	1147174
MFC after:	2 weeks
2014-01-07 15:59:33 +00:00
Alexander V. Chernikov
034c09ff10 Partially fix IPv4 interface routes deletion in RADIX_MPATH.
Noticed by:	Nikolay Denev <ndenev at gmail.com>
MFC after:	1 month
2014-01-06 22:36:20 +00:00
Luigi Rizzo
17885a7bfd It is 2014 and we have a new version of netmap.
Most relevant features:

- netmap emulation on any NIC, even those without native netmap support.

  On the ixgbe we have measured about 4Mpps/core/queue in this mode,
  which is still a lot more than with sockets/bpf.

- seamless interconnection of VALE switch, NICs and host stack.

  If you disable accelerations on your NIC (say em0)

        ifconfig em0 -txcsum -txcsum

  you can use the VALE switch to connect the NIC and the host stack:

        vale-ctl -h valeXX:em0

  allowing sharing the NIC with other netmap clients.

- THE USER API HAS SLIGHTLY CHANGED (head/cur/tail pointers
  instead of pointers/count as before). This was unavoidable to support,
  in the future, multiple threads operating on the same rings.
  Netmap clients require very small source code changes to compile again.
      On the plus side, the new API should be easier to understand
  and the internals are a lot simpler.

The manual page has been updated extensively to reflect the current
features and give some examples.

This is the result of work of several people including Giuseppe Lettieri,
Vincenzo Maffione, Michio Honda and myself, and has been financially
supported by EU projects CHANGE and OPENLAB, from NetApp University
Research Fund, NEC, and of course the Universita` di Pisa.
2014-01-06 12:53:15 +00:00
Alexander V. Chernikov
5a2f4cbd92 Change semantics for rnh_lookup() function: now
it performs exact match search, regardless of netmask existance.
This simplifies most of rnh_lookup() consumers.

Fix panic triggered by deleting non-existent host route.

PR:		kern/185092
Submitted by:	Nikolay Denev <ndenev at gmail.com>
MFC after:	1 month
2014-01-04 22:25:26 +00:00
Alexander V. Chernikov
868f984c05 Remove useless register variable modifiers.
Do some more style(9).

MFC after:	2 weeks
2014-01-03 14:33:25 +00:00
George V. Neville-Neil
d9168b014f Convert #defines to enums so that the values are visible in the debugger.
Requested by:	gibbs
MFC after:	2 weeks
2014-01-02 21:30:59 +00:00
Scott Long
1a8959dac6 Multi-queue NIC drivers and multi-port lagg tend to use the same lower
bits of the flowid as each other, resulting in a poor distribution of
packets among queues in certain cases.  Work around this by adding a
set of sysctls for controlling a bit-shift on the flowid when doing
multi-port aggrigation in lagg and lacp.  By default, lagg/lacp will
now use bits 16 and higher instead of 0 and higher.

Reviewed by:	max
Obtained from:	Netflix
MFC after:	3 days
2013-12-30 01:32:17 +00:00
Alexander V. Chernikov
78aed5e800 Simplify contiguous mask checking.
Suggested by:	glebius
MFC after:	2 weeks
2013-12-17 22:16:27 +00:00
Luigi Rizzo
f9790aeb88 split netmap code according to functions:
- netmap.c		base code
- netmap_freebsd.c	FreeBSD-specific code
- netmap_generic.c	emulate netmap over standard drivers
- netmap_mbq.c		simple mbuf tailq
- netmap_mem2.c		memory management
- netmap_vale.c		VALE switch

simplify devce-specific code
2013-12-15 08:37:24 +00:00
George V. Neville-Neil
4fe3b90bd3 Add constants for use in interrogating various fiber and copper connectors
most often used with network interfaces.

The SFF-8472 standard defines the information that can be retrieved
from an optic or a copper cable plugged into a NIC, most often
referred to as SFP+.  Examples of values that can be read
include the cable vendor's name, part number, date of manufacture
as well as running data such as temperature, voltage and tx
and rx power.

Copious comments on how to use these values with an I2C interface
are given in the header file itself.

MFC after:	2 weeks
2013-11-27 20:20:02 +00:00
Gleb Smirnoff
4678c74014 Fix build. 2013-11-27 07:21:25 +00:00
Sergey Kandaurov
da162ca88f Fix macro name in comment. 2013-11-26 15:23:56 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Craig Rodrigues
6274ce3e2b In vnet_route_uninit(), free some memory that is allocated in vnet_route_init().
To reproduce the problem:
  (1)  Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS,
       INVARIANTS.
  (2)  Run this command in a loop:
       jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo

       see: http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html
            http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021291.html

This doesn't eliminate all the "Freed UMA keg was not empty" warning messages
on the console, but it helps.
2013-11-25 20:33:33 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Gleb Smirnoff
d77c1b3269 To support upcoming changes change internal API for source node handling:
- Removed pf_remove_src_node().
- Introduce pf_unlink_src_node() and pf_unlink_src_node_locked().
  These function do not proceed with freeing of a node, just disconnect
  it from storage.
- New function pf_free_src_nodes() works on a list of previously
  disconnected nodes and frees them.
- Utilize new API in pf_purge_expired_src_nodes().

In collaboration with:	Kajetan Staszkiewicz <kajetan.staszkiewicz innogames.de>

Sponsored by:	InnoGames GmbH
Sponsored by:	Nginx, Inc.
2013-11-22 19:16:34 +00:00
Gleb Smirnoff
3260ae00be Add missing 'extern'. 2013-11-22 19:02:22 +00:00
Gleb Smirnoff
654957c2c8 Merge head up to r258343. 2013-11-19 12:21:47 +00:00
George V. Neville-Neil
4857f5fbbc Allow ethernet drivers to pass in packets connected via the nextpkt pointer.
Handling packets in this way allows drivers to amortize work during packet reception.

Submitted by:	Vijay Singh
Sponsored by:	NetApp
2013-11-18 22:58:14 +00:00
Gleb Smirnoff
f053058cee - Split functions that initialize various pf parts into their vimage
parts and global parts.
- Since global parts appeared to be only mutex initializations, just
  abandon them and use MTX_SYSINIT() instead.
- Kill my incorrect VNET_FOREACH() iterator and instead use correct
  approach with VNET_SYSINIT().

Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny
2013-11-18 22:18:07 +00:00
George V. Neville-Neil
1350c361f6 Clean up the macros to avoid using casts.
Suggested by: bde and jhb
2013-11-15 16:03:32 +00:00
Andrey V. Elsukov
c72a5d5d89 ANSIfy function defintions. 2013-11-15 12:12:50 +00:00
George V. Neville-Neil
d4aff8d11e Put in the correct bit shifting and add a type to prevent clang from complaining.
While here fix up a grammar nit.

Pointed out by: Sergey Kandaurov and bz@ respectively.
2013-11-14 21:57:37 +00:00
George V. Neville-Neil
868eef3239 Shift our OUI correctly.
Pointed out by: emaste
2013-11-14 20:07:17 +00:00
George V. Neville-Neil
62f1e1b2bc The FreeBSD Project now has its own, Ogranizationally Unique Identifier,
assigned by the IEEE.  This file includes documentation on how developers
must carve up the space as well as an initial allocation for bhyve.

Sponsored by:	The FreeBSD Foundation
2013-11-14 19:53:35 +00:00
Gleb Smirnoff
50d3286d9d Merge head r232040 through r258006. 2013-11-11 20:33:25 +00:00
Gleb Smirnoff
555036b5f6 Remove never used ioctls that originate from KAME. The proof
of their zero usage was exp-run from misc/183538.
2013-11-11 05:39:42 +00:00
Gleb Smirnoff
77b89ad837 Provide compat layer for OSIOCAIFADDR. 2013-11-06 19:46:20 +00:00
Gleb Smirnoff
af50ea380f Axe IFF_SMART. Fortunately this layering violating flag was never used,
it was just declared.
2013-11-05 12:52:56 +00:00
Gleb Smirnoff
5fb009bda7 Drop support for historic ioctls and also undefine them, so that code
that checks their presence via ifdef, won't use them.

Bump __FreeBSD_version as safety measure.
2013-11-05 10:29:47 +00:00
Gleb Smirnoff
9a6356bc31 In complemence to ifa_add_loopback_route() and ifa_del_loopback_route()
provide function ifa_switch_loopback_route() that will be used in case when
an interface address used for a loopback route goes away, but we have another
interface address with same address value and want to preserve loopback
route.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-11-05 07:36:17 +00:00
Gleb Smirnoff
b1b9dcae46 Remove net.link.ether.inet.useloopback sysctl tunable. It was always on by
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable have been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages comparing to normal behavior.
2013-11-05 07:32:09 +00:00
Adrian Chadd
dd50b3107e Restore the entropy gathering from the m_data pointer value, not the
m_data payload.

After talking with markm/bde, this is what markm actually intended.
2013-11-02 15:13:02 +00:00
Luigi Rizzo
ce3ee1e7c4 update to the latest netmap snapshot.
This includes the following:
- use separate memory regions for VALE ports
- locking fixes
- some simplifications in the NIC-specific routines
- performance improvements for the VALE switch
- some new features in the pkt-gen test program
- documentation updates

There are small API changes that require programs to be recompiled
(NETMAP_API has been bumped so you will detect old binaries at runtime).

In particular:
- struct netmap_slot now is 16 bytes to support an extra pointer,
  which may save one data copy when using VALE ports or VMs;
- the struct netmap_if has two extra fields;

MFC after:	3 days
2013-11-01 21:21:14 +00:00
Adrian Chadd
a09968c479 Convert the random entropy harvesting code to use a const void * pointer
rather than just void *.

Then, as part of this, convert a couple of mbuf m->m_data accesses
to mtod(m, const void *).

Reviewed by:	markm
Approved by:	security-officer (delphij)
Sponsored by:	Netflix, Inc.
2013-11-01 20:53:49 +00:00
Gleb Smirnoff
f9b2a21c9e Merge head r232040 through r257457.
M    usr.sbin/portsnap/portsnap/portsnap.8
M    usr.sbin/portsnap/portsnap/portsnap.sh
M    usr.sbin/tcpdump/tcpdump/Makefile
2013-10-31 17:33:29 +00:00
Andre Oppermann
5b74cfe42f Make struct ifnet readable and comprehensible again by grouping
and ordering related variables, fields and locks next to each
other.  Add more comments to variables.

Over time 'ifnet' has accumlated a lot of additional pointers and
functionality in an unstructured way making it quite hard to read
and understand while obfuscating relationships between fields and
variables.

Quantify the structure size and how bloated it has become.

This is only a mechanical change in preparation for upcoming
work to make ifnet opaque to drivers and to separate out the
interface queuing.

Sponsored by:	The FreeBSD Foundation
2013-10-31 15:46:10 +00:00
Andre Oppermann
ded7d20fc5 Move all interface queue related structures, macros and definitions
from net/if_var to it own new net/ifq.h.

For now net/ifq.h is unconditionally included through net/if_var.h.

This is a mechanical change in preparation to make struct ifnet and
the individual interface queue mechanisms opaque.

Discussed with:	glebius
Sponsored by:	The FreeBSD Foundation
2013-10-29 17:48:08 +00:00
Gleb Smirnoff
eaeb0c139a Style: s/SYS_EVENTHANDLER_H/_SYS_EVENTHANDLER_H_/g
Submitted by:	bde
2013-10-28 20:32:05 +00:00
Gleb Smirnoff
c29e1ad930 - Make the prophecy from 1997 happen and remove if_var.h inclusion
from if.h.
- Remove unnecessary includes and declarations from if.h
- Remove unnecessary includes and declarations from if_var.h [1]
- Mark some declarations that are about to be removed in near
  future with comments, explaning why this declaration is still
  necessary.
- Protect eventhandler declarations with #ifdef SYS_EVENTHANDLER_H.

Obtained from:	bdeBSD [1]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 08:03:40 +00:00
Gleb Smirnoff
7ced9c2f66 Instead of putting ifnet declaration into eventhandler.h, move
bpf(4) and vlan(4) related event declarations to bpf.h and
if_vlan_var.h. To avoid dependency on eventhandler.h, protect
these declarations with ifdef SYS_EVENTHANDLER_H.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:45:03 +00:00
Gleb Smirnoff
c3322cb91c Include necessary headers that now are available due to pollution
via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:29:16 +00:00
Gleb Smirnoff
628c030f77 Provide forward declaration for struct ifnet. Consumers
of this header don't need contents of struct.
2013-10-27 17:27:06 +00:00
Gleb Smirnoff
47bb65deb8 Almost all if_clone consumers do not care about if_clone_event.
Do not force them to include sys/eventhandler.h. Those who
utilize EVENTHANDLER(9), will see the declaration.
2013-10-27 17:14:33 +00:00
Gleb Smirnoff
75bf2db380 Move new pf includes to the pf directory. The pfvar.h remain
in net, to avoid compatibility breakage for no sake.

The future plan is to split most of non-kernel parts of
pfvar.h into pf.h, and then make pfvar.h a kernel only
include breaking compatibility.

Discussed with:		bz
2013-10-27 16:25:57 +00:00
Gleb Smirnoff
9dae57e134 Start splitting pfvar.h into internal and external parts.
- Provide pf_altq.h that has only stuff needed for ALTQ.
- Start pf.h, that would have all constant values and
  eventually non-kernel structures.
- Build ALTQ w/o pfvar.h, include if_var.h, that before
  came via pollution.
- Build tcpdump w/o pfvar.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:59:58 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Gleb Smirnoff
b9f19cb397 vnet.h needs to be included before raw_cb.h. Now it compiles due to
pollution via if_var.h.
2013-10-25 19:49:03 +00:00
Peter Grehan
4083db7d5d Fix panic in the tap driver when a tap and vmnet interface were
created after each other e.g.

 ifconfig tap0
 ifconfig vmnet0
 <panic>

Appears to be a cut'n'paste error from the tap code to the vmnet
code where the name string wasn't updated in the call to make_dev().

Reviewed by:	glebius
MFC after:	3 days
2013-10-24 22:21:31 +00:00
Andrey V. Elsukov
22dc101fac Add a note that lacp_compose_key() should be updated, when new media
types will be added.

Submitted by:	melifaro
X-MFC after:	r256689
2013-10-21 07:49:36 +00:00
Gleb Smirnoff
0bfd163f52 Merge head r233826 through r256722. 2013-10-18 09:32:02 +00:00
Andrey V. Elsukov
d5773da82a Use the same actor key for media types of the same speed.
PR:		176097
MFC after:	2 weeks
2013-10-17 15:14:58 +00:00
Alexander V. Chernikov
65a17d744e Fix long-standing issue with incorrect radix mask calculation.
Usual symptoms are messages like
rn_delete: inconsistent annotation
rn_addmask: mask impossibly already in tree
or inability to flush/delete particular prefix in ipfw table.

Changes:
* Assume 32 bytes as maximum radix key length
* Remove rn_init()
* Statically allocate rn_ones/rn_zeroes
* Make separate mask tree for each "normal" tree instead of system global one
* Remove "optimization" on masks reusage and key zeroying
* Change rn_addmask() arguments to accept tree pointer (no users in base)

PR:		kern/182851, kern/169206, kern/135476, kern/134531
Found by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after:	2 weeks
Reviewed by:	glebius
Sponsored by:	Yandex LLC
2013-10-16 12:18:44 +00:00
Alexander V. Chernikov
fa27a1fa71 Remove unused fields from radix_node_head.
Sponsored by:	Yandex LLC
2013-10-16 10:33:20 +00:00
Gleb Smirnoff
994409375b Rename Free() macro to R_Free(). This matches R_Malloc() and has much lower
probability to clash with other headers.

Submitted by:	Eric van Gyzen <eric_van_gyzen dell.com>
2013-10-16 04:59:59 +00:00
Maksim Yevmenkin
9ca42425c6 In the flowtable scanner, restart the scan at the last found position,
not at position 0.  Changes the scanner from O(N^2) to O(N).

Submitted by:	scottl
Obtained from:	Netflix, Inc
MFC after:	3 weeks
2013-10-15 21:28:51 +00:00
Gleb Smirnoff
7caf4ab7ac - Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
  between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
  right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
  for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
  rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
  took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by:	melifaro [1], pluknet[2]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 11:37:57 +00:00
Gleb Smirnoff
3fffa8c8ff Push some defines under _KERNEL, improve styling and comments. 2013-10-15 10:43:26 +00:00
Gleb Smirnoff
67420bda02 Remove ifa_mtx. It was used only in one place in kernel, and ifnet's
ifaddr lock can substitute it there.

Discussed with:	melifaro, ae
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:41:22 +00:00
Gleb Smirnoff
4675896098 Remove ifa_init() and provide ifa_alloc() that will allocate and setup
struct ifaddr internally.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:31:42 +00:00
Gleb Smirnoff
6ed910fabe Hide 'struct ifaddr' definition from userland. Two tools left that use it,
namely ipftest(1) and ifmcstat(1). These sniff structure definition using
_WANT_IFADDR define.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:19:24 +00:00
Mark Murray
72acff0f07 MFC - tracking commit. 2013-10-09 21:03:34 +00:00
Gleb Smirnoff
4cdc1f5421 There are some high performance NICs that count statistics in hardware,
and there are ifnets, that do that via counter(9). Provide a flag that
would skip cache line trashing '+=' operation in ether_input().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Reviewed by:	melifaro, adrian
Approved by:	re (marius)
2013-10-09 19:04:40 +00:00
Mark Murray
ad1f331196 Debug run. This now works, except that the "live" sources haven't
been tested. With all sources turned on, this unlocks itself in
a couple of seconds! That is no my box, and there is no guarantee
that this will be the case everywhere.

* Cut debug prints.

* Use the same locks/mutexes all the way through.

* Be a tad more conservative about entropy estimates.
2013-10-06 12:40:32 +00:00
Mark Murray
f02e47dc1e Snapshot. This passes the build test, but has not yet been finished or debugged.
Contains:

* Refactor the hardware RNG CPU instruction sources to feed into
the software mixer. This is unfinished. The actual harvesting needs
to be sorted out. Modified by me (see below).

* Remove 'frac' parameter from random_harvest(). This was never
used and adds extra code for no good reason.

* Remove device write entropy harvesting. This provided a weak
attack vector, was not very good at bootstrapping the device. To
follow will be a replacement explicit reseed knob.

* Separate out all the RANDOM_PURE sources into separate harvest
entities. This adds some secuity in the case where more than one
is present.

* Review all the code and fix anything obviously messy or inconsistent.
Address som review concerns while I'm here, like rename the pseudo-rng
to 'dummy'.

Submitted by:	Arthur Mesh <arthurmesh@gmail.com> (the first item)
2013-10-04 06:55:06 +00:00
Gleb Smirnoff
c7063c15b0 Clear knlist before destroying it in tap(4) and tun(4). This fixes later
crash, when a kqueue descriptor tries to dereference appropriate knotes.

Approved by:	re (kib)
2013-10-02 20:44:36 +00:00
Gleb Smirnoff
bdad3190a2 Fix a fallout from r241610. One enc interface must be created on startup.
Pointy hat to:	glebius
Reported by:	gavin
Approved by:	re (gjb)
2013-09-28 14:14:23 +00:00
Gleb Smirnoff
540b1a7238 Clean up SIOCSIFDSTADDR usage from ifnet drivers. The ioctl itself is
extremely outdated, and I doubt that it was ever used for ifnet drivers.
It was used for AF_INET sockets in pre-FreeBSD time.

Approved by:	re (hrs)
Sponsored by:	Nginx, Inc.
2013-09-11 09:19:44 +00:00
Dag-Erling Smørgrav
1a05c762b9 Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory.  [13:11]

In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks.  [SA-13:12]

Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem.  [SA-13:13]

Security:	CVE-2013-5666
Security:	FreeBSD-SA-13:11.sendfile
Security:	CVE-2013-5691
Security:	FreeBSD-SA-13:12.ifioctl
Security:	CVE-2013-5710
Security:	FreeBSD-SA-13:13.nullfs
Approved by:	re
2013-09-10 10:05:59 +00:00
Mark Murray
a40c2646a4 Bring in some behind-the-scenes development, mainly By Arthur Mesh,
the rest by me.

o Namespace cleanup; the Yarrow name is now restricted to where it
  really applies; this is in anticipation of being augmented or
  replaced by Fortuna in the future. Fortuna is mentioned, but behind
  #if logic, and is ignorable for now.

o The harvest queue is pulled out into its own modules.

o Entropy harvesting is emproved, both by being made more conservative,
  and by separating (a bit!) the sources. Available entropy crumbs are
  marginally improved.

o Selection of sources is made clearer. With recent revelations,
  this will receive more work in the weeks and months to come.

Submitted by:	 Arthur Mesh (partly) <arthurmesh@gmail.com>
2013-09-07 14:15:13 +00:00
Davide Italiano
ab97ad0806 Don't clear the unused SI_CHEAPCLONE flag in tap_create()/tuncreate().
Reviewed by:	kib
2013-09-07 13:50:13 +00:00
Mark Murray
9d32fc31c7 MFC 2013-09-07 07:58:29 +00:00
Davide Italiano
933e681d93 Retire netisr.netisr_direct and netisr.netisr_direct_force sysctls.
These were used to control/export dispatch policy but they're not anymore.
This commit cannot be MFC'ed to 9 because old netstat(9) binary relies
on such sysctl to work. On the other hand, there's no real reason to
keep'em around in 10.
2013-09-06 21:02:43 +00:00
Mark Murray
f27c28dc6e MFC 2013-08-30 11:38:34 +00:00
Adrian Chadd
310915a45a Convert the if_lagg rwlock to an rmlock.
We've been seeing lots of cache line contention (but not lock contention!)
in our workloads between the various TX and RX threads going on.

The write lock is only grabbed when configuration changes are made - which
are infrequent.

With this patch, the contention and cycles spent waiting for updates
disappear.

Sponsored by:	Netflix, Inc.
2013-08-29 19:35:14 +00:00
Alfred Perlstein
29c463d633 Remove include opt_ofed.h since OFED is unifdef'd.
Pointed out by: glebius
2013-08-27 16:45:00 +00:00
Mark Murray
c495c93567 Snapshot; Do some running repairs on entropy harvesting. More needs to follow. 2013-08-26 18:35:21 +00:00
John Baldwin
fd77bbb967 Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by:	bde
Tested by:	exp-run by bdrewery
2013-08-26 18:16:05 +00:00
Andre Oppermann
edb159e1ea Remove unnecessary setup of the m->pkthdr.header pointer.
Sponsored by:	The FreeBSD Foundation
2013-08-25 09:41:37 +00:00
Alfred Perlstein
250053bc41 Remove the #ifdef OFED from the 20 byte mac in struct llentry.
With this change it is now possible to build the entire infiniband
stack as modules and load it dynamically including IP over IB.
2013-08-25 01:55:14 +00:00
Andre Oppermann
1b4381afbb Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
Andre Oppermann
804c784c13 Whitespace, style cleanups, and improved comments. 2013-08-24 12:03:24 +00:00
Andre Oppermann
737003b366 ename PFIL_LIST_[UN]LOCK() to PFIL_HEADLIST_[UN]LOCK() to avoid
confusion with the pfil_head chain locking macros.
2013-08-24 11:24:15 +00:00
Andre Oppermann
8da0139975 Resolve the confusion between the head_list and the hook list.
The linked list of pfil hooks is changed to "chain" and this term
is applied consistently.  The head_list remains with "list" term.

Add KASSERT to vnet_pfil_uninit().

Update and extend comments.

Reviewed by:	eri (previous version)
2013-08-24 11:17:25 +00:00
Andre Oppermann
887c60fc86 Internalize pfil_hook_get(). There are no outside consumers of
this API, it is only safe for internal use and even the pfil(9)
man page says so in the BUGS section.

Reviewed by:	eri
2013-08-24 10:36:33 +00:00
Andre Oppermann
f13e611f7c Convert one instance of pfil hook callback missed in r254769. 2013-08-24 10:30:20 +00:00
Andre Oppermann
25da5060a4 Introduce typedef for pfil hook callback function and replace all
spelled out occurrences with it.

Reviewed by:	eri
2013-08-24 10:13:59 +00:00
Bjoern A. Zeeb
413e45bf81 After r241616 properly export ifi_baudrate_pf in the 32bit compat case.
MFC after:	3 days
2013-08-20 14:35:17 +00:00
Andre Oppermann
86bd049144 Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with:	trociny, glebius
2013-08-19 13:27:32 +00:00
Mark Johnston
5bc4f6b3ab Add a missing module version declaration to if_tun(4).
PR:		181078
Submitted by:	Brandon Gooch <jamesbrandongooch@gmail.com>
MFC after:	1 week
2013-08-07 01:32:08 +00:00
Hiroki Sato
12bdf23a3a sin6 should be assigned before the loop. 2013-07-28 20:02:41 +00:00
Hiroki Sato
9fcd8e9ebd - Relax the restriction on the member interfaces with LLAs. Two or more
LLAs on the member interfaces are actually harmless when the parent
  interface does not have a LLA.

- Add net.link.bridge.allow_llz_overlap.  This is a knob to allow LLAs on
  a bridge and the member interfaces at the same time.  The default is 0.

Pointed out by:	ume
MFC after:	3 days
2013-07-28 19:49:39 +00:00
Adrian Chadd
49de4f2214 Break out the static, global LACP debug options into a per-lagg unit
sysctl tree.

* Create a net.link.lagg.X.lacp node
* Add a debug node under that for tx_test and rx_test
* Add lacp_strict_mode, defaulting to 1

tx_test and rx_test are still a bitmap of unit numbers for now.
At some point it would be nice to create child nodes of the lagg bundle
for each sub-interface, and then populate those with various knobs
and statistics.

Sponsored by:	Netflix
2013-07-26 19:41:13 +00:00
Adrian Chadd
387e754ae5 Fix typo.
Sponsored by:	Netflix
2013-07-25 19:10:23 +00:00
Marcel Moolenaar
ef1f916971 Decouple the UUID generator from network interfaces by having MAC
addresses added to the UUID generator using uuid_ether_add(). The
UUID generator keeps an arbitrary number of MAC addresses, under
the assumption that they are rarely removed (= uuid_ether_del()).
This achieves the following:
1.  It brings up closer to having the network stack as a loadable
    module.
2.  It allows the UUID generator to filter MAC addresses for best
    results (= highest chance of uniqeness).
3.  MAC addresses can come from anywhere, irrespactive of whether
    it's used for an interface or not.

A side-effect of the change is that when no MAC addresses have been
added, a random multicast MAC address is created once and re-used if
needed. Previusly, when a random MAC address was needed, it was
created for every call. Thus, a change in behaviour is introduced
for when no MAC addresses exist.

Obtained from:	Juniper Networks, Inc.
2013-07-24 04:24:21 +00:00
Craig Rodrigues
719fb72517 PR: 168520 170096
Submitted by: adrian, zec

Fix multiple kernel panics when VIMAGE is enabled in the kernel.
These fixes are based on patches submitted by Adrian Chadd and Marko Zec.

(1)  Set curthread->td_vnet to vnet0 in device_probe_and_attach() just before calling
     device_attach().  This fixes multiple VIMAGE related kernel panics
     when trying to attach Bluetooth or USB Ethernet devices because
     curthread->td_vnet is NULL.

(2)  Set curthread->td_vnet in if_detach().  This fixes kernel panics when detaching networking
     interfaces, especially USB Ethernet devices.

(3)  Use VNET_DOMAIN_SET() in ng_btsocket.c

(4)  In ng_unref_node() set curthread->td_vnet.  This fixes kernel panics
     when detaching Netgraph nodes.
2013-07-15 01:32:55 +00:00
Adrian Chadd
31402c27b8 Bring over some link aggregation / LACP protocol improvements and debugging
additions.

* Add some new tracing events to aid in debugging.
* Add in a debugging mode to drop transmit and received frames, specifically
  to test whether seeing or hearing heartbeats correctly cause LACP to
  drop the port.
* Add in (and make default) a strict LACP mode, which requires the
  heartbeat on a port to be heard before it's used.  Sometimes vendor ports
  will hang but the link layer stays up, resulting in hung traffic.
* Add logging the number of link status flaps, again to aid in debugging
  badly behaving switch ports.
* Calculate the lagg interface port speed as the multiple of the
  configured ports, rather than the largest.

Obtained from:	Netflix
MFC after:	2 weeks
2013-07-13 04:25:03 +00:00
Hiroki Sato
4825b1e098 Add a leaf node CTL_NET.PF_ROUTE.0.AF.NET_RT_DUMP.0.FIB. This returns
routing table with the specified FIB number, not td->td_proc->p_fibnum.
2013-07-12 12:36:12 +00:00
Hiroki Sato
e9f947e27c - Drop GIF_ACCEPT_REVETHIP flag by default.
- Add IFF_MONITOR support.
2013-07-12 12:18:07 +00:00
Andrey V. Elsukov
9bea6fd6c6 Correct CTASSERT condition. 2013-07-09 15:10:27 +00:00
Andrey V. Elsukov
5b7cb97c2b Migrate structs arpstat, icmpstat, mrtstat, pimstat and udpstat to PCPU
counters.
2013-07-09 09:50:15 +00:00
Andrey V. Elsukov
7daad711df Add several macros to help migrate statistics structures to PCPU counters. 2013-07-09 09:37:21 +00:00
Andrey V. Elsukov
c80211e3cf Prepare network statistics structures for migration to PCPU counters.
Use uint64_t as type for all fields of structures.

Changed structures: ahstat, arpstat, espstat, icmp6_ifstat, icmp6stat,
in6_ifstat, ip6stat, ipcompstat, ipipstat, ipsecstat, mrt6stat, mrtstat,
pfkeystat, pim6stat, pimstat, rip6stat, udpstat.

Discussed with:	arch@
2013-07-09 09:32:06 +00:00
Colin Percival
d36ed80a7b Fix typo: minmum -> minimum.
Submitted by:	@z3ndrag0n
2013-07-05 23:40:08 +00:00
Hiroki Sato
6facd7a6b8 Fix a compiler warning.
MFC after:	1 week
2013-07-03 07:31:07 +00:00
Hiroki Sato
af8056441e - Allow ND6_IFF_AUTO_LINKLOCAL for IFT_BRIDGE. An interface with IFT_BRIDGE
is initialized with !ND6_IFF_AUTO_LINKLOCAL && !ND6_IFF_ACCEPT_RTADV
  regardless of net.inet6.ip6.accept_rtadv and net.inet6.ip6.auto_linklocal.
  To configure an autoconfigured link-local address (RFC 4862), the
  following rc.conf(5) configuration can be used:

   ifconfig_bridge0_ipv6="inet6 auto_linklocal"

- if_bridge(4) now removes IPv6 addresses on a member interface to be
  added when the parent interface or one of the existing member
  interfaces has an IPv6 address.  if_bridge(4) merges each link-local
  scope zone which the member interfaces form respectively, so it causes
  address scope violation.  Removal of the IPv6 addresses prevents it.

- if_lagg(4) now removes IPv6 addresses on a member interfaces
  unconditionally.

- Set reasonable flags to non-IPv6-capable interfaces. [*]

Submitted by:	rpaulo [*]
MFC after:	1 week
2013-07-02 16:58:15 +00:00
Qing Li
f672f56f21 Due to the routing related networking kernel redesign work
in FBSD 8.0, interface routes have been returened to the
applications without the RTF_GATEWAY bit. This incompatibility
has caused some issues with Zebra, Qugga and the like.
This patch provides the RTF_GATEWAY flag bit in returned interface
routes so to behave similarly to pre 8.0 systems.

Reviewed by:	    hrs
Verified by:	    mackn at opendns dot com
2013-06-25 00:10:49 +00:00
Gleb Smirnoff
6828cc99e1 De-vnet hash sizes and hash masks.
Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny
2013-06-19 13:37:29 +00:00
Xin LI
eda6cf02b8 Return ENETDOWN instead of ENOENT when all lagg(4) links are
inactive when upper layer tries to transmit packet.  This
gives better feedback and meaningful errors for applications.

MFC after:	2 weeks
Reviewed by:	thompsa
2013-06-17 19:31:03 +00:00
Hiroki Sato
8b20f6cf8a Return ENETDOWN when the parent interface is down.
MFC after:	1 week
2013-06-16 04:40:02 +00:00
Mikolaj Golub
f8afe33795 Properly set curvnet context in lagg_port_setlladdr() task handler.
Reported by:	Nikos Vassiliadis <nvass gmx.com>
Submitted by:	zec
Tested by:	Nikos Vassiliadis <nvass gmx.com>
MFC after:	1 week
2013-06-07 10:27:50 +00:00
John Baldwin
8aa9937318 Fix build with both INET and INET6 disabled. 2013-06-04 20:40:16 +00:00
Andre Oppermann
3c914c547e Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore.  The upper limit is still at
IP_MAXPACKET (65536 bytes).  Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found.  When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by:	cperciva (earlier version)
Reviewed by:	cperciva
Tested by:	cperciva
MFC after:	1 week (using spare struct members to preserve ABI)
2013-06-03 12:55:13 +00:00
Luigi Rizzo
f18be5766f Bring in a number of new features, mostly implemented by Michio Honda:
- the VALE switch now support up to 254 destinations per switch,
  unicast or broadcast (multicast goes to all ports).

- we can attach hw interfaces and the host stack to a VALE switch,
  which means we will be able to use it more or less as a native bridge
  (minor tweaks still necessary).
  A 'vale-ctl' program is supplied in tools/tools/netmap
  to attach/detach ports the switch, and list current configuration.

- the lookup function in the VALE switch can be reassigned to
  something else, similar to the pf hooks. This will enable
  attaching the firewall, or other processing functions (e.g. in-kernel
  openvswitch) directly on the netmap port.

The internal API used by device drivers does not change.

Userspace applications should be recompiled because we
bump NETMAP_API as we now use some fields in the struct nmreq
that were previously ignored -- otherwise, data structures
are the same.

Manpages will be committed separately.
2013-05-30 14:07:14 +00:00
Luigi Rizzo
27892e02fb clarify usage of NETMAP_BUF 2013-05-30 13:41:19 +00:00
Guy Helmer
d013d9022a While waiting for the bpf hold buffer to become idle, check
the return value from mtx_sleep() and exit bpfread() on
errors such as EINTR.

Reviewed by:	jhb
2013-05-23 21:33:10 +00:00
Ed Schouten
6ed0f50f78 Allow certain headers to be included more easily.
Spotted by:	http://hacks.owlfolio.org/header-survey/
2013-05-21 21:20:10 +00:00
Alexander V. Chernikov
22f8ce4335 Use separate function to update mbuf checksum flags instead of
duplicating the same code in different places.

MFC after:	2 weeks
2013-05-18 08:14:21 +00:00
Alexander V. Chernikov
d54455b0c9 Fix rte leak introduced in r248070.
MFC after:	2 weeks
2013-05-18 07:10:22 +00:00
Julian Elischer
4871fc4ab5 Finally change the mbuf to have its own fib field instead of stealing
4 flag bits. This was supposed to happen in 8.0, and again in 2012..

MFC after:	never
2013-05-16 16:20:17 +00:00
Hiroki Sato
b8992a6792 Add IFF_MONITOR support to gre(4).
Tested by:	Chip Marshall
MFC after:	1 week
2013-05-11 19:05:38 +00:00
Andre Oppermann
f89d4c3acf Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
Eitan Adler
578acad37e Correct a few sizeof()s
Submitted by:	swildner@DragonFlyBSD.org
Reviewed by:	alfred
2013-05-01 04:37:34 +00:00
Luigi Rizzo
c10b5796c0 remove $Id$ (whitespace change) 2013-04-30 16:00:21 +00:00
Gleb Smirnoff
47e8d432d5 Add const qualifier to the dst parameter of the ifnet if_output method. 2013-04-26 12:50:32 +00:00
Oleg Bulyzhin
2c5b403e2d Recover missing arp_ifinit() call.
MFC after:	2 weeks
2013-04-18 20:13:33 +00:00
Gleb Smirnoff
b64478a137 Switch lagg(4) statistics to counter(9).
The lagg(4) is often used to bond high speed links, so basic per-packet +=
on statistics cause cache misses and statistics loss.

Perfect solution would be to convert ifnet(9) to counters(9), but this
requires much more work, and unfortunately ABI change, so temporarily
patch lagg(4) manually.

We store counters in the softc, and once per second push their values
to legacy ifnet counters.

Sponsored by:	Nginx, Inc.
2013-04-15 13:00:42 +00:00
Gleb Smirnoff
18ba072a22 Fix build. 2013-04-10 08:09:25 +00:00
Andre Oppermann
e8b3186b6a Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
Andrey V. Elsukov
9cb8d207af Use IP6STAT_INC/IP6STAT_DEC macros to update ip6 stats.
MFC after:	1 week
2013-04-09 07:11:22 +00:00
Mark Johnston
83a3ff21a8 Ignore interface renames instead of removing the interface from the bridge
group.

Reviewed by:	rstone
Approved by:	rstone (co-mentor)
Sponsored by:	Sandvine Incorporated
MFC after:	1 week
2013-03-28 20:37:07 +00:00
Gleb Smirnoff
209dddb90e Remove __FreeBSD_version ifdefs. 2013-03-22 20:44:16 +00:00
Andrey V. Elsukov
5474386bd3 Fix style and comments. 2013-03-19 05:51:47 +00:00
Gleb Smirnoff
dc4ad05ecd Use m_get/m_gethdr instead of compat macros.
Sponsored by:	Nginx, Inc.
2013-03-15 12:55:30 +00:00
Gleb Smirnoff
c69f77c339 - Use m_getcl() instead of hand allocating.
- Convert panic() to KASSERT.
- Remove superfluous cleaning of mbuf fields after allocation.
- Add comment on possible use of m_get2() here.

Sponsored by:	Nginx, Inc.
2013-03-15 12:52:59 +00:00
Gleb Smirnoff
41a7572b26 Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.
2013-03-12 13:42:47 +00:00
Gleb Smirnoff
129004c56f Reinitialize eh after pfil(9) processing.
PR:		176764
Submitted by:	adri
2013-03-11 12:06:57 +00:00
Alexander V. Chernikov
3034f43f2f Fix long-standing issue with interface routes being unprotected:
Use RTM_PINNED flag to mark route as immutable.
Forbid deleting immutable routes without special rtrequest1_fib() flag.
Adding interface address with prefix already in route table is handled
by atomically deleting old prefix and adding interface one.

Discussed with:	andre, eri
MFC after:	3 weeks
2013-03-08 20:33:50 +00:00
Alexander V. Chernikov
14126522cf Write lock is not required for find&compare operation.
MFC after:	2 weeks
2013-03-05 13:38:45 +00:00
Gleb Smirnoff
e2a55a0021 Finish the r244185. This fixes ever growing counter of pfsync bad
length packets, which was actually harmless.

Note that peers with different version of head/ may grow this
counter, but it is harmless - all pfsync data is processed.

Reported & tested by:	Anton Yuzhaninov <citrin citrin.ru>
Sponsored by:		Nginx, Inc
2013-02-15 09:03:56 +00:00
Gleb Smirnoff
24421c1c32 Resolve source address selection in presense of CARP. Add a couple
of helper functions:

- carp_master()   - boolean function which is true if an address
		    is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
		    and is aware of CARP.

  Utilize ifa_preferred() in ifa_ifwithnet().

  The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. May be we will approach this problem again later.

Reported & tested by:	Anton Yuzhaninov <citrin citrin.ru>
Sponsored by:		Nginx, Inc
2013-02-11 10:58:22 +00:00
Randall Stewart
ded5ea6a25 This fixes a out-of-order problem with several
of the newer drivers. The basic problem was
that the driver was pulling the mbuf off the
drbr ring and then when sending with xmit(), encounting
a full transmit ring. Thus the lower layer
xmit() function would return an error, and the
drivers would then append the data back on to the ring.
For TCP this is a horrible scenario sure to bring
on a fast-retransmit.

The fix is to use drbr_peek() to pull the data pointer
but not remove it from the ring. If it fails then
we either call the new drbr_putback or drbr_advance
method. Advance moves it forward (we do this sometimes
when the xmit() function frees the mbuf). When
we succeed we always call advance. The
putback will always copy the mbuf back to the top
of the ring. Note that the putback *cannot* be used
with a drbr_dequeue() only with drbr_peek(). We most
of the time, in putback, would not need to copy it
back since most likey the mbuf is still the same, but
sometimes xmit() functions will change the mbuf via
a pullup or other call. So the optimial case for
the single consumer is to always copy it back. If
we ever do a multiple_consumer (for lagg?) we
will  need a test and atomic in the put back possibly
a seperate putback_mc() in the ring buf.

Reviewed by:	jhb@freebsd.org, jlv@freebsd.org
2013-02-07 15:20:54 +00:00
Gleb Smirnoff
9711a168b9 Retire struct sockaddr_inarp.
Since ARP and routing are separated, "proxy only" entries
don't have any meaning, thus we don't need additional field
in sockaddr to pass SIN_PROXY flag.

New kernel is binary compatible with old tools, since sizes
of sockaddr_inarp and sockaddr_in match, and sa_family are
filled with same value.

The structure declaration is left for compatibility with
third party software, but in tree code no longer use it.

Reviewed by:	ru, andre, net@
2013-01-31 08:55:21 +00:00
Gleb Smirnoff
1910bfcba2 route_output() always supplies info with RTAX_GATEWAY member that
points to a sockaddr of AF_LINK family. Assert this instead of
checking.
2013-01-29 21:44:22 +00:00
Navdeep Parhar
4364ec0852 Move lle_event to if_llatbl.h
lle_event replaced arp_update_event after the ARP rewrite and ended up
in if_ether.h simply because arp_update_event used to be there too.
IPv6 neighbor discovery is going to grow lle_event support and this is a
good time to move it to if_llatbl.h.

The two in-tree consumers of this event - OFED and toecore - are not
affected.

Reviewed by:	bz@
2013-01-25 23:58:21 +00:00
Gleb Smirnoff
ed63043b21 - Utilize m_get2(), accidentially fixing some signedness bugs.
- Return EMSGSIZE in both cases if uio_resid is oversized or undersized.
- No need to clear rcvif.
2013-01-24 14:29:31 +00:00
Luigi Rizzo
01c039a19c leftover from r245579... flags for semi transparent mode and direct
forwarding through a VALE switch
2013-01-23 03:49:48 +00:00
Gleb Smirnoff
1d9797f128 If lagg(4) can't forward a packet due to underlying port problems,
return much more meaningful ENETDOWN to the stack, instead of EBUSY.
2013-01-21 08:59:31 +00:00
Gleb Smirnoff
f6eef2c2d6 - Add dashes before copyright notices.
- Add $FreeBSD$.
- Remove unused define.
2013-01-07 19:36:11 +00:00
Peter Wemm
a116ec4b5e Juggle some internal symbols from our antique zlib (that originally came
in from kernel-pppd which is long gone) so that ZFS and DTRACE play nice.

This is a horrible hack to get freefall to compile, and is in dire need
of reconciliation.  This antique zlib-1.04 code needs to go away.
2013-01-06 14:59:59 +00:00
Andrey V. Elsukov
e37e7917f3 Add an ability to set net.link.stf.permit_rfc1918 from the loader.
MFC after:	2 weeks
2012-12-27 21:26:08 +00:00
Andrey V. Elsukov
51743c5f73 Add net.link.stf.permit_rfc1918 sysctl variable. It can be used to allow
the use of private IPv4 addresses with stf(4).

MFC after:	2 weeks
2012-12-27 20:59:22 +00:00
Kevin Lo
c7dada99bb Fix typo in comment.
Reviewed by:	thompsa
2012-12-18 06:37:23 +00:00
Gleb Smirnoff
b1ec2940af Fix problem in r238990. The LLE_LINKED flag should be tested prior to
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.

Reported by:	Ian FREISLICH <ianf clue.co.za>
2012-12-13 11:11:15 +00:00
Guy Helmer
3b3b91e736 Changes to resolve races in bpfread() and catchpacket() that, at worst,
cause kernel panics.

Add a flag to the bpf descriptor to indicate whether the hold buffer
is in use. In bpfread(), set the "hold buffer in use" flag before
dropping the descriptor lock during the call to bpf_uiomove().
Everywhere else the hold buffer is used or changed, wait while
the hold buffer is in use by bpfread(). Add a KASSERT in bpfread()
after re-acquiring the descriptor lock to assist uncovering any
additional hold buffer races.
2012-12-10 16:14:44 +00:00
Hiroki Sato
0bebb5448b - Move definition of V_deembed_scopeid to scope6_var.h.
- Deembed scope id in L3 address in in6_lltable_dump().
- Simplify scope id recovery in rtsock routines.
- Remove embedded scope id handling in ndp(8) and route(8) completely.
2012-12-05 19:45:24 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Hiroki Sato
5c9fa630f6 - Fix LOR in sa6_recoverscope() in rt_msg2()[1].
- Check V_deembed_scopeid before checking if sa_family == AF_INET6.
- Fix scope id handing in route(8)[2] and ifconfig(8).

Reported by:	rpaulo[1], Mateusz Guzik[1], peter[2]
2012-12-04 17:12:23 +00:00
Alexander V. Chernikov
f079a0fa8c Fix bpf_if structure leak introduced in r235745.
Move all such structures to delayed-free lists and
delete all matching on interface departure event.

MFC after:	1 week
2012-12-02 21:43:37 +00:00
Pawel Jakub Dawidek
5ad9520341 - Use more appropriate loop (do { } while()) when generating ethernet address
for bridge interface.
- If we found a collision we can break the loop - only one collision is
  possible and one is exactly enough to need to renegerate.

Obtained from:	WHEEL Systems
MFC after:	1 week
2012-11-29 08:06:23 +00:00
Andre Oppermann
da2299c5c7 Remove unused and unnecessary CSUM_IP_FRAGS checksumming capability.
Checksumming the IP header of fragments is no different from doing
normal IP headers.

Discussed with:	yongari
MFC after:	1 week
2012-11-27 19:31:49 +00:00
David Xu
ba60525b3f Pass allocated unit number to make_dev, otherwise kernel panics later while
cloning second tap.

Reviewed by: kevlo,ed
2012-11-27 12:23:57 +00:00
Gleb Smirnoff
5e9a54290d Better safe than sorry: reinitialize eh after ng_ether(4) and
if_bridge(4) processing, since mbuf may be modified there.

Submitted by:	youngari
2012-11-27 06:35:26 +00:00
Gleb Smirnoff
97cce87f78 Re-initialize eh pointer after m_adj()
Submitted by:	Kohji Okuno <okuno.kohji jp.panasonic.com>
Reviewed by:	yongari
2012-11-26 19:45:01 +00:00
Adrian Chadd
d60ec817ea Fix up a compile time warning if INET6 isn't defined. 2012-11-18 04:51:46 +00:00
Hiroki Sato
6bbfef9004 Fill sin6_scope_id in sockaddr_in6 before passing it from the kernel to
userland via routing socket or sysctl.  This eliminates the following
KAME-specific sin6_scope_id handling routine from each userland utility:

 sin6.sin6_scope_id = ntohs(*(u_int16_t *)&sin6.sin6_addr.s6_addr[2]);

This behavior can be controlled by net.inet6.ip6.deembed_scopeid.  This is
set to 1 by default (sin6_scope_id will be filled in the kernel).

Reviewed by:	bz
2012-11-17 20:19:00 +00:00
Guy Helmer
0e8a1cb3c9 Work around a race in bpfread() by validating the hold buffer pointer
before freeing it. Otherwise, we can lose a buffer and cause a panic
in catchpacket().
2012-11-06 21:07:04 +00:00
Andrey V. Elsukov
ffdbf9da3b Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
Gleb Smirnoff
078468ede4 o Remove last argument to ip_fragment(), and obtain all needed information
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
  hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
  although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
  After this change CSUM_DELAY_IP vanishes from the stack.

Submitted by:	Sebastian Kuzminsky <seb lineratesystems.com>
2012-10-26 21:06:33 +00:00
Andrey V. Elsukov
c1de64a495 Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
Gleb Smirnoff
da1fc67f8a Fix fallout from r240071. If destination interface lookup fails,
we should broadcast a packet, not try to deliver it to NULL.

Reported by:	rpaulo
2012-10-24 18:33:44 +00:00
Gleb Smirnoff
8f134647ca Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
Alexander V. Chernikov
4dab1a18a3 Make PFIL use per-VNET lock instead of per-AF lock. Since most used packet
filters (ipfw and PF) use the same ruleset with the same lock for both
AF_INET and AF_INET6 there is no need in more fine-grade locking.
However, it is possible to request personal lock by specifying
PFIL_FLAG_PRIVATE_LOCK flag in pfil_head structure (see pfil.9 for
more details).

Export PFIL lock via rw_lock(9)/rm_lock(9)-like API permitting pfil consumers
to use this lock instead of own lock. This help reducing locks on main
traffic path.

pfil_assert() is currently not implemented due to absense of rm_assert().
Waiting for some kind of r234648 to be merged in HEAD.

This change is part of bigger patch reducing routing locking.

Sponsored by:	Yandex LLC
Reviewed by:	glebius, ae
OK'd by:	silence on net@
MFC after:	3 weeks
2012-10-22 14:10:17 +00:00
Andre Oppermann
813ee737dd Update to previous r241688 to use __func__ instead of spelled out function
name in log(9) message.

Suggested by:	glebius
2012-10-19 10:07:55 +00:00
Andre Oppermann
15134be81b Use LOG_WARNING level in in_attachdomain1() instead of printf().
Submitted by:	vijju.singh-at-gmail.com
2012-10-18 14:08:26 +00:00
Andre Oppermann
c9b652e3e8 Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.
2012-10-18 13:57:24 +00:00
Gleb Smirnoff
44e1d89044 Utilize new macro to initialize if_baudrate(). 2012-10-18 09:57:56 +00:00
Gleb Smirnoff
3aaf0159dd Fix VIMAGE build.
Reported by:	Nikolai Lifanov <lifanov mail.lifanov.com>
Pointy hat to:	glebius
2012-10-17 21:19:27 +00:00
Maksim Yevmenkin
608ae712d3 provide helper if_initbaudrate() to set if_baudrate_pf and if_baudrate_pf.
again, use ixgbe(4) as an example of how to use new helper function.

Reviewed by:	jhb
MFC after:	1 week
2012-10-17 19:24:13 +00:00
Xin LI
6684e46988 Fix build. 2012-10-17 08:19:08 +00:00
Maksim Yevmenkin
e9bbb44e09 report total number of ports for each lagg(4) interface
via net.link.lagg.X.count sysctl

MFC after:	1  week
2012-10-16 22:43:14 +00:00
Maksim Yevmenkin
0fef97fea3 introduce concept of ifi_baudrate power factor. the idea is to work
around the problem where high speed interfaces (such as ixgbe(4))
are not able to report real ifi_baudrate. bascially, take a spare
byte from struct if_data and use it to store ifi_baudrate power
factor. in other words,

real ifi_baudrate = ifi_baudrate * 10 ^ ifi_baudrate power factor

this should be backwards compatible with old binaries. use ixgbe(4)
as an example on how drivers would set ifi_baudrate power factor

Discussed with:	kib, scottl, glebius
MFC after:	1 week
2012-10-16 20:18:15 +00:00
Gleb Smirnoff
42a58907c3 Make the "struct if_clone" opaque to users of the cloning API. Users
now use function calls:

  if_clone_simple()
  if_clone_advanced()

to initialize a cloner, instead of macros that initialize if_clone
structure.

Discussed with:		brooks, bz, 1 year ago
2012-10-16 13:37:54 +00:00
Kevin Lo
9823d52705 Revert previous commit...
Pointyhat to:	kevlo (myself)
2012-10-10 08:36:38 +00:00
Kevin Lo
a10cee30c9 Prefer NULL over 0 for pointers 2012-10-09 08:27:40 +00:00
Gleb Smirnoff
21d172a3f1 A step in resolving mess with byte ordering for AF_INET. After this change:
- All packets in NETISR_IP queue are in net byte order.
  - ip_input() is entered in net byte order and converts packet
    to host byte order right _after_ processing pfil(9) hooks.
  - ip_output() is entered in host byte order and converts packet
    to net byte order right _before_ processing pfil(9) hooks.
  - ip_fragment() accepts and emits packet in net byte order.
  - ip_forward(), ip_mloopback() use host byte order (untouched actually).
  - ip_fastforward() no longer modifies packet at all (except ip_ttl).
  - Swapping of byte order there and back removed from the following modules:
    pf(4), ipfw(4), enc(4), if_bridge(4).
  - Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
  - __FreeBSD_version bumped.
  - pfil(9) manual page updated.

Reviewed by:	ray, luigi, eri, melifaro
Tested by:	glebius (LE), ray (BE)
2012-10-06 10:02:11 +00:00
Xin LI
15752fa858 MFV: libpcap 1.3.0.
MFC after:	4 weeks
2012-10-05 18:42:50 +00:00
Andrew Thompson
3e92ee8a53 Remove the M_NOWAIT from bridge_rtable_init as it isn't needed. The function
return value is not even checked and could lead to a panic on a null sc_rthash.

MFC after:	2 weeks
2012-10-04 07:40:55 +00:00
Ed Maste
104d9fc776 Cast through void * to silence compiler warning
The base netmap pointer and offsets involved are provided by the kernel
side of the netmap interface and will have appropriate alignment.

Sponsored by: ADARA Networks
MFC After: 2 weeks
2012-10-03 21:41:20 +00:00
John Baldwin
b3aa419331 Rename the module for 'device enc' to "if_enc" to avoid conflicting with
the CAM "enc" peripheral (part of ses(4)).  Previously the two modules
used the same name, so only one was included in a linked kernel causing
enc0 to not be created if you added IPSEC to GENERIC.  The new module
name follows the pattern of other network interfaces (e.g. "if_loop").

MFC after:	1 week
2012-10-02 12:25:30 +00:00
Gleb Smirnoff
063efed28c The drbr(9) API appeared to be so unclear, that most drivers in
tree used it incorrectly, which lead to inaccurate overrated
if_obytes accounting. The drbr(9) used to update ifnet stats on
drbr_enqueue(), which is not accurate since enqueuing doesn't
imply successful processing by driver. Dequeuing neither mean
that. Most drivers also called drbr_stats_update() which did
accounting again, leading to doubled if_obytes statistics. And
in case of severe transmitting, when a packet could be several
times enqueued and dequeued it could have been accounted several
times.

o Thus, make drbr(9) API thinner. Now drbr(9) merely chooses between
  ALTQ queueing or buf_ring(9) queueing.
  - It doesn't touch the buf_ring stats any more.
  - It doesn't touch ifnet stats anymore.
  - drbr_stats_update() no longer exists.

o buf_ring(9) handles its stats itself:
  - It handles br_drops itself.
  - br_prod_bytes stats are dropped. Rationale: no one ever
    reads them but update of a common counter on every packet
    negatively affects performance due to excessive cache
    invalidation.
  - buf_ring_enqueue_bytes() reduced to buf_ring_enqueue(), since
    we no longer account bytes.

o Drivers handle their stats theirselves: if_obytes, if_omcasts.

o mlx4(4), igb(4), em(4), vxge(4), oce(4) and  ixv(4) no longer
  use drbr_stats_update(), and update ifnet stats theirselves.

o bxe(4) was the most correct driver, it didn't call
  drbr_stats_update(), thus it was the only driver accurate under
  moderate load. Now it also maintains stats itself.

o ixgbe(4) had already taken stats from hardware, so just
  - drop software stats updating.
  - take multicast packet count from hardware as well.

o mxge(4) just no longer needs NO_SLOW_STATS define.

o cxgb(4), cxgbe(4) need no change, since they obtain stats
  from hardware.

Reviewed by:	jfv, gnn
2012-09-28 18:28:27 +00:00
Gleb Smirnoff
80cd7c7596 - In the bridge_enqueue() do success/error accounting for
each fragment, not only once.
- In the GRAB_OUR_PACKETS() macro do increase if_ibytes.
2012-09-26 20:09:48 +00:00
Ed Maste
66d3579a1e Correct misspelling in debug output. 2012-09-26 01:09:19 +00:00
Ed Maste
c11038e252 Revert part of an earlier patch attempt that snuck in with r240938. 2012-09-25 23:41:45 +00:00