Commit Graph

3229 Commits

Author SHA1 Message Date
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Gleb Smirnoff
d77c1b3269 To support upcoming changes change internal API for source node handling:
- Removed pf_remove_src_node().
- Introduce pf_unlink_src_node() and pf_unlink_src_node_locked().
  These function do not proceed with freeing of a node, just disconnect
  it from storage.
- New function pf_free_src_nodes() works on a list of previously
  disconnected nodes and frees them.
- Utilize new API in pf_purge_expired_src_nodes().

In collaboration with:	Kajetan Staszkiewicz <kajetan.staszkiewicz innogames.de>

Sponsored by:	InnoGames GmbH
Sponsored by:	Nginx, Inc.
2013-11-22 19:16:34 +00:00
Gleb Smirnoff
3260ae00be Add missing 'extern'. 2013-11-22 19:02:22 +00:00
Gleb Smirnoff
654957c2c8 Merge head up to r258343. 2013-11-19 12:21:47 +00:00
George V. Neville-Neil
4857f5fbbc Allow ethernet drivers to pass in packets connected via the nextpkt pointer.
Handling packets in this way allows drivers to amortize work during packet reception.

Submitted by:	Vijay Singh
Sponsored by:	NetApp
2013-11-18 22:58:14 +00:00
Gleb Smirnoff
f053058cee - Split functions that initialize various pf parts into their vimage
parts and global parts.
- Since global parts appeared to be only mutex initializations, just
  abandon them and use MTX_SYSINIT() instead.
- Kill my incorrect VNET_FOREACH() iterator and instead use correct
  approach with VNET_SYSINIT().

Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny
2013-11-18 22:18:07 +00:00
George V. Neville-Neil
1350c361f6 Clean up the macros to avoid using casts.
Suggested by: bde and jhb
2013-11-15 16:03:32 +00:00
Andrey V. Elsukov
c72a5d5d89 ANSIfy function defintions. 2013-11-15 12:12:50 +00:00
George V. Neville-Neil
d4aff8d11e Put in the correct bit shifting and add a type to prevent clang from complaining.
While here fix up a grammar nit.

Pointed out by: Sergey Kandaurov and bz@ respectively.
2013-11-14 21:57:37 +00:00
George V. Neville-Neil
868eef3239 Shift our OUI correctly.
Pointed out by: emaste
2013-11-14 20:07:17 +00:00
George V. Neville-Neil
62f1e1b2bc The FreeBSD Project now has its own, Ogranizationally Unique Identifier,
assigned by the IEEE.  This file includes documentation on how developers
must carve up the space as well as an initial allocation for bhyve.

Sponsored by:	The FreeBSD Foundation
2013-11-14 19:53:35 +00:00
Gleb Smirnoff
50d3286d9d Merge head r232040 through r258006. 2013-11-11 20:33:25 +00:00
Gleb Smirnoff
555036b5f6 Remove never used ioctls that originate from KAME. The proof
of their zero usage was exp-run from misc/183538.
2013-11-11 05:39:42 +00:00
Gleb Smirnoff
77b89ad837 Provide compat layer for OSIOCAIFADDR. 2013-11-06 19:46:20 +00:00
Gleb Smirnoff
af50ea380f Axe IFF_SMART. Fortunately this layering violating flag was never used,
it was just declared.
2013-11-05 12:52:56 +00:00
Gleb Smirnoff
5fb009bda7 Drop support for historic ioctls and also undefine them, so that code
that checks their presence via ifdef, won't use them.

Bump __FreeBSD_version as safety measure.
2013-11-05 10:29:47 +00:00
Gleb Smirnoff
9a6356bc31 In complemence to ifa_add_loopback_route() and ifa_del_loopback_route()
provide function ifa_switch_loopback_route() that will be used in case when
an interface address used for a loopback route goes away, but we have another
interface address with same address value and want to preserve loopback
route.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-11-05 07:36:17 +00:00
Gleb Smirnoff
b1b9dcae46 Remove net.link.ether.inet.useloopback sysctl tunable. It was always on by
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable have been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages comparing to normal behavior.
2013-11-05 07:32:09 +00:00
Adrian Chadd
dd50b3107e Restore the entropy gathering from the m_data pointer value, not the
m_data payload.

After talking with markm/bde, this is what markm actually intended.
2013-11-02 15:13:02 +00:00
Luigi Rizzo
ce3ee1e7c4 update to the latest netmap snapshot.
This includes the following:
- use separate memory regions for VALE ports
- locking fixes
- some simplifications in the NIC-specific routines
- performance improvements for the VALE switch
- some new features in the pkt-gen test program
- documentation updates

There are small API changes that require programs to be recompiled
(NETMAP_API has been bumped so you will detect old binaries at runtime).

In particular:
- struct netmap_slot now is 16 bytes to support an extra pointer,
  which may save one data copy when using VALE ports or VMs;
- the struct netmap_if has two extra fields;

MFC after:	3 days
2013-11-01 21:21:14 +00:00
Adrian Chadd
a09968c479 Convert the random entropy harvesting code to use a const void * pointer
rather than just void *.

Then, as part of this, convert a couple of mbuf m->m_data accesses
to mtod(m, const void *).

Reviewed by:	markm
Approved by:	security-officer (delphij)
Sponsored by:	Netflix, Inc.
2013-11-01 20:53:49 +00:00
Gleb Smirnoff
f9b2a21c9e Merge head r232040 through r257457.
M    usr.sbin/portsnap/portsnap/portsnap.8
M    usr.sbin/portsnap/portsnap/portsnap.sh
M    usr.sbin/tcpdump/tcpdump/Makefile
2013-10-31 17:33:29 +00:00
Andre Oppermann
5b74cfe42f Make struct ifnet readable and comprehensible again by grouping
and ordering related variables, fields and locks next to each
other.  Add more comments to variables.

Over time 'ifnet' has accumlated a lot of additional pointers and
functionality in an unstructured way making it quite hard to read
and understand while obfuscating relationships between fields and
variables.

Quantify the structure size and how bloated it has become.

This is only a mechanical change in preparation for upcoming
work to make ifnet opaque to drivers and to separate out the
interface queuing.

Sponsored by:	The FreeBSD Foundation
2013-10-31 15:46:10 +00:00
Andre Oppermann
ded7d20fc5 Move all interface queue related structures, macros and definitions
from net/if_var to it own new net/ifq.h.

For now net/ifq.h is unconditionally included through net/if_var.h.

This is a mechanical change in preparation to make struct ifnet and
the individual interface queue mechanisms opaque.

Discussed with:	glebius
Sponsored by:	The FreeBSD Foundation
2013-10-29 17:48:08 +00:00
Gleb Smirnoff
eaeb0c139a Style: s/SYS_EVENTHANDLER_H/_SYS_EVENTHANDLER_H_/g
Submitted by:	bde
2013-10-28 20:32:05 +00:00
Gleb Smirnoff
c29e1ad930 - Make the prophecy from 1997 happen and remove if_var.h inclusion
from if.h.
- Remove unnecessary includes and declarations from if.h
- Remove unnecessary includes and declarations from if_var.h [1]
- Mark some declarations that are about to be removed in near
  future with comments, explaning why this declaration is still
  necessary.
- Protect eventhandler declarations with #ifdef SYS_EVENTHANDLER_H.

Obtained from:	bdeBSD [1]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 08:03:40 +00:00
Gleb Smirnoff
7ced9c2f66 Instead of putting ifnet declaration into eventhandler.h, move
bpf(4) and vlan(4) related event declarations to bpf.h and
if_vlan_var.h. To avoid dependency on eventhandler.h, protect
these declarations with ifdef SYS_EVENTHANDLER_H.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:45:03 +00:00
Gleb Smirnoff
c3322cb91c Include necessary headers that now are available due to pollution
via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:29:16 +00:00
Gleb Smirnoff
628c030f77 Provide forward declaration for struct ifnet. Consumers
of this header don't need contents of struct.
2013-10-27 17:27:06 +00:00
Gleb Smirnoff
47bb65deb8 Almost all if_clone consumers do not care about if_clone_event.
Do not force them to include sys/eventhandler.h. Those who
utilize EVENTHANDLER(9), will see the declaration.
2013-10-27 17:14:33 +00:00
Gleb Smirnoff
75bf2db380 Move new pf includes to the pf directory. The pfvar.h remain
in net, to avoid compatibility breakage for no sake.

The future plan is to split most of non-kernel parts of
pfvar.h into pf.h, and then make pfvar.h a kernel only
include breaking compatibility.

Discussed with:		bz
2013-10-27 16:25:57 +00:00
Gleb Smirnoff
9dae57e134 Start splitting pfvar.h into internal and external parts.
- Provide pf_altq.h that has only stuff needed for ALTQ.
- Start pf.h, that would have all constant values and
  eventually non-kernel structures.
- Build ALTQ w/o pfvar.h, include if_var.h, that before
  came via pollution.
- Build tcpdump w/o pfvar.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:59:58 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Gleb Smirnoff
b9f19cb397 vnet.h needs to be included before raw_cb.h. Now it compiles due to
pollution via if_var.h.
2013-10-25 19:49:03 +00:00
Peter Grehan
4083db7d5d Fix panic in the tap driver when a tap and vmnet interface were
created after each other e.g.

 ifconfig tap0
 ifconfig vmnet0
 <panic>

Appears to be a cut'n'paste error from the tap code to the vmnet
code where the name string wasn't updated in the call to make_dev().

Reviewed by:	glebius
MFC after:	3 days
2013-10-24 22:21:31 +00:00
Andrey V. Elsukov
22dc101fac Add a note that lacp_compose_key() should be updated, when new media
types will be added.

Submitted by:	melifaro
X-MFC after:	r256689
2013-10-21 07:49:36 +00:00
Gleb Smirnoff
0bfd163f52 Merge head r233826 through r256722. 2013-10-18 09:32:02 +00:00
Andrey V. Elsukov
d5773da82a Use the same actor key for media types of the same speed.
PR:		176097
MFC after:	2 weeks
2013-10-17 15:14:58 +00:00
Alexander V. Chernikov
65a17d744e Fix long-standing issue with incorrect radix mask calculation.
Usual symptoms are messages like
rn_delete: inconsistent annotation
rn_addmask: mask impossibly already in tree
or inability to flush/delete particular prefix in ipfw table.

Changes:
* Assume 32 bytes as maximum radix key length
* Remove rn_init()
* Statically allocate rn_ones/rn_zeroes
* Make separate mask tree for each "normal" tree instead of system global one
* Remove "optimization" on masks reusage and key zeroying
* Change rn_addmask() arguments to accept tree pointer (no users in base)

PR:		kern/182851, kern/169206, kern/135476, kern/134531
Found by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after:	2 weeks
Reviewed by:	glebius
Sponsored by:	Yandex LLC
2013-10-16 12:18:44 +00:00
Alexander V. Chernikov
fa27a1fa71 Remove unused fields from radix_node_head.
Sponsored by:	Yandex LLC
2013-10-16 10:33:20 +00:00
Gleb Smirnoff
994409375b Rename Free() macro to R_Free(). This matches R_Malloc() and has much lower
probability to clash with other headers.

Submitted by:	Eric van Gyzen <eric_van_gyzen dell.com>
2013-10-16 04:59:59 +00:00
Maksim Yevmenkin
9ca42425c6 In the flowtable scanner, restart the scan at the last found position,
not at position 0.  Changes the scanner from O(N^2) to O(N).

Submitted by:	scottl
Obtained from:	Netflix, Inc
MFC after:	3 weeks
2013-10-15 21:28:51 +00:00
Gleb Smirnoff
7caf4ab7ac - Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
  between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
  right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
  for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
  rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
  took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by:	melifaro [1], pluknet[2]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 11:37:57 +00:00
Gleb Smirnoff
3fffa8c8ff Push some defines under _KERNEL, improve styling and comments. 2013-10-15 10:43:26 +00:00
Gleb Smirnoff
67420bda02 Remove ifa_mtx. It was used only in one place in kernel, and ifnet's
ifaddr lock can substitute it there.

Discussed with:	melifaro, ae
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:41:22 +00:00
Gleb Smirnoff
4675896098 Remove ifa_init() and provide ifa_alloc() that will allocate and setup
struct ifaddr internally.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:31:42 +00:00
Gleb Smirnoff
6ed910fabe Hide 'struct ifaddr' definition from userland. Two tools left that use it,
namely ipftest(1) and ifmcstat(1). These sniff structure definition using
_WANT_IFADDR define.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:19:24 +00:00
Mark Murray
72acff0f07 MFC - tracking commit. 2013-10-09 21:03:34 +00:00
Gleb Smirnoff
4cdc1f5421 There are some high performance NICs that count statistics in hardware,
and there are ifnets, that do that via counter(9). Provide a flag that
would skip cache line trashing '+=' operation in ether_input().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Reviewed by:	melifaro, adrian
Approved by:	re (marius)
2013-10-09 19:04:40 +00:00
Mark Murray
ad1f331196 Debug run. This now works, except that the "live" sources haven't
been tested. With all sources turned on, this unlocks itself in
a couple of seconds! That is no my box, and there is no guarantee
that this will be the case everywhere.

* Cut debug prints.

* Use the same locks/mutexes all the way through.

* Be a tad more conservative about entropy estimates.
2013-10-06 12:40:32 +00:00
Mark Murray
f02e47dc1e Snapshot. This passes the build test, but has not yet been finished or debugged.
Contains:

* Refactor the hardware RNG CPU instruction sources to feed into
the software mixer. This is unfinished. The actual harvesting needs
to be sorted out. Modified by me (see below).

* Remove 'frac' parameter from random_harvest(). This was never
used and adds extra code for no good reason.

* Remove device write entropy harvesting. This provided a weak
attack vector, was not very good at bootstrapping the device. To
follow will be a replacement explicit reseed knob.

* Separate out all the RANDOM_PURE sources into separate harvest
entities. This adds some secuity in the case where more than one
is present.

* Review all the code and fix anything obviously messy or inconsistent.
Address som review concerns while I'm here, like rename the pseudo-rng
to 'dummy'.

Submitted by:	Arthur Mesh <arthurmesh@gmail.com> (the first item)
2013-10-04 06:55:06 +00:00
Gleb Smirnoff
c7063c15b0 Clear knlist before destroying it in tap(4) and tun(4). This fixes later
crash, when a kqueue descriptor tries to dereference appropriate knotes.

Approved by:	re (kib)
2013-10-02 20:44:36 +00:00
Gleb Smirnoff
bdad3190a2 Fix a fallout from r241610. One enc interface must be created on startup.
Pointy hat to:	glebius
Reported by:	gavin
Approved by:	re (gjb)
2013-09-28 14:14:23 +00:00
Gleb Smirnoff
540b1a7238 Clean up SIOCSIFDSTADDR usage from ifnet drivers. The ioctl itself is
extremely outdated, and I doubt that it was ever used for ifnet drivers.
It was used for AF_INET sockets in pre-FreeBSD time.

Approved by:	re (hrs)
Sponsored by:	Nginx, Inc.
2013-09-11 09:19:44 +00:00
Dag-Erling Smørgrav
1a05c762b9 Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory.  [13:11]

In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks.  [SA-13:12]

Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem.  [SA-13:13]

Security:	CVE-2013-5666
Security:	FreeBSD-SA-13:11.sendfile
Security:	CVE-2013-5691
Security:	FreeBSD-SA-13:12.ifioctl
Security:	CVE-2013-5710
Security:	FreeBSD-SA-13:13.nullfs
Approved by:	re
2013-09-10 10:05:59 +00:00
Mark Murray
a40c2646a4 Bring in some behind-the-scenes development, mainly By Arthur Mesh,
the rest by me.

o Namespace cleanup; the Yarrow name is now restricted to where it
  really applies; this is in anticipation of being augmented or
  replaced by Fortuna in the future. Fortuna is mentioned, but behind
  #if logic, and is ignorable for now.

o The harvest queue is pulled out into its own modules.

o Entropy harvesting is emproved, both by being made more conservative,
  and by separating (a bit!) the sources. Available entropy crumbs are
  marginally improved.

o Selection of sources is made clearer. With recent revelations,
  this will receive more work in the weeks and months to come.

Submitted by:	 Arthur Mesh (partly) <arthurmesh@gmail.com>
2013-09-07 14:15:13 +00:00
Davide Italiano
ab97ad0806 Don't clear the unused SI_CHEAPCLONE flag in tap_create()/tuncreate().
Reviewed by:	kib
2013-09-07 13:50:13 +00:00
Mark Murray
9d32fc31c7 MFC 2013-09-07 07:58:29 +00:00
Davide Italiano
933e681d93 Retire netisr.netisr_direct and netisr.netisr_direct_force sysctls.
These were used to control/export dispatch policy but they're not anymore.
This commit cannot be MFC'ed to 9 because old netstat(9) binary relies
on such sysctl to work. On the other hand, there's no real reason to
keep'em around in 10.
2013-09-06 21:02:43 +00:00
Mark Murray
f27c28dc6e MFC 2013-08-30 11:38:34 +00:00
Adrian Chadd
310915a45a Convert the if_lagg rwlock to an rmlock.
We've been seeing lots of cache line contention (but not lock contention!)
in our workloads between the various TX and RX threads going on.

The write lock is only grabbed when configuration changes are made - which
are infrequent.

With this patch, the contention and cycles spent waiting for updates
disappear.

Sponsored by:	Netflix, Inc.
2013-08-29 19:35:14 +00:00
Alfred Perlstein
29c463d633 Remove include opt_ofed.h since OFED is unifdef'd.
Pointed out by: glebius
2013-08-27 16:45:00 +00:00
Mark Murray
c495c93567 Snapshot; Do some running repairs on entropy harvesting. More needs to follow. 2013-08-26 18:35:21 +00:00
John Baldwin
fd77bbb967 Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by:	bde
Tested by:	exp-run by bdrewery
2013-08-26 18:16:05 +00:00
Andre Oppermann
edb159e1ea Remove unnecessary setup of the m->pkthdr.header pointer.
Sponsored by:	The FreeBSD Foundation
2013-08-25 09:41:37 +00:00
Alfred Perlstein
250053bc41 Remove the #ifdef OFED from the 20 byte mac in struct llentry.
With this change it is now possible to build the entire infiniband
stack as modules and load it dynamically including IP over IB.
2013-08-25 01:55:14 +00:00
Andre Oppermann
1b4381afbb Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
Andre Oppermann
804c784c13 Whitespace, style cleanups, and improved comments. 2013-08-24 12:03:24 +00:00
Andre Oppermann
737003b366 ename PFIL_LIST_[UN]LOCK() to PFIL_HEADLIST_[UN]LOCK() to avoid
confusion with the pfil_head chain locking macros.
2013-08-24 11:24:15 +00:00
Andre Oppermann
8da0139975 Resolve the confusion between the head_list and the hook list.
The linked list of pfil hooks is changed to "chain" and this term
is applied consistently.  The head_list remains with "list" term.

Add KASSERT to vnet_pfil_uninit().

Update and extend comments.

Reviewed by:	eri (previous version)
2013-08-24 11:17:25 +00:00
Andre Oppermann
887c60fc86 Internalize pfil_hook_get(). There are no outside consumers of
this API, it is only safe for internal use and even the pfil(9)
man page says so in the BUGS section.

Reviewed by:	eri
2013-08-24 10:36:33 +00:00
Andre Oppermann
f13e611f7c Convert one instance of pfil hook callback missed in r254769. 2013-08-24 10:30:20 +00:00
Andre Oppermann
25da5060a4 Introduce typedef for pfil hook callback function and replace all
spelled out occurrences with it.

Reviewed by:	eri
2013-08-24 10:13:59 +00:00
Bjoern A. Zeeb
413e45bf81 After r241616 properly export ifi_baudrate_pf in the 32bit compat case.
MFC after:	3 days
2013-08-20 14:35:17 +00:00
Andre Oppermann
86bd049144 Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with:	trociny, glebius
2013-08-19 13:27:32 +00:00
Mark Johnston
5bc4f6b3ab Add a missing module version declaration to if_tun(4).
PR:		181078
Submitted by:	Brandon Gooch <jamesbrandongooch@gmail.com>
MFC after:	1 week
2013-08-07 01:32:08 +00:00
Hiroki Sato
12bdf23a3a sin6 should be assigned before the loop. 2013-07-28 20:02:41 +00:00
Hiroki Sato
9fcd8e9ebd - Relax the restriction on the member interfaces with LLAs. Two or more
LLAs on the member interfaces are actually harmless when the parent
  interface does not have a LLA.

- Add net.link.bridge.allow_llz_overlap.  This is a knob to allow LLAs on
  a bridge and the member interfaces at the same time.  The default is 0.

Pointed out by:	ume
MFC after:	3 days
2013-07-28 19:49:39 +00:00
Adrian Chadd
49de4f2214 Break out the static, global LACP debug options into a per-lagg unit
sysctl tree.

* Create a net.link.lagg.X.lacp node
* Add a debug node under that for tx_test and rx_test
* Add lacp_strict_mode, defaulting to 1

tx_test and rx_test are still a bitmap of unit numbers for now.
At some point it would be nice to create child nodes of the lagg bundle
for each sub-interface, and then populate those with various knobs
and statistics.

Sponsored by:	Netflix
2013-07-26 19:41:13 +00:00
Adrian Chadd
387e754ae5 Fix typo.
Sponsored by:	Netflix
2013-07-25 19:10:23 +00:00
Marcel Moolenaar
ef1f916971 Decouple the UUID generator from network interfaces by having MAC
addresses added to the UUID generator using uuid_ether_add(). The
UUID generator keeps an arbitrary number of MAC addresses, under
the assumption that they are rarely removed (= uuid_ether_del()).
This achieves the following:
1.  It brings up closer to having the network stack as a loadable
    module.
2.  It allows the UUID generator to filter MAC addresses for best
    results (= highest chance of uniqeness).
3.  MAC addresses can come from anywhere, irrespactive of whether
    it's used for an interface or not.

A side-effect of the change is that when no MAC addresses have been
added, a random multicast MAC address is created once and re-used if
needed. Previusly, when a random MAC address was needed, it was
created for every call. Thus, a change in behaviour is introduced
for when no MAC addresses exist.

Obtained from:	Juniper Networks, Inc.
2013-07-24 04:24:21 +00:00
Craig Rodrigues
719fb72517 PR: 168520 170096
Submitted by: adrian, zec

Fix multiple kernel panics when VIMAGE is enabled in the kernel.
These fixes are based on patches submitted by Adrian Chadd and Marko Zec.

(1)  Set curthread->td_vnet to vnet0 in device_probe_and_attach() just before calling
     device_attach().  This fixes multiple VIMAGE related kernel panics
     when trying to attach Bluetooth or USB Ethernet devices because
     curthread->td_vnet is NULL.

(2)  Set curthread->td_vnet in if_detach().  This fixes kernel panics when detaching networking
     interfaces, especially USB Ethernet devices.

(3)  Use VNET_DOMAIN_SET() in ng_btsocket.c

(4)  In ng_unref_node() set curthread->td_vnet.  This fixes kernel panics
     when detaching Netgraph nodes.
2013-07-15 01:32:55 +00:00
Adrian Chadd
31402c27b8 Bring over some link aggregation / LACP protocol improvements and debugging
additions.

* Add some new tracing events to aid in debugging.
* Add in a debugging mode to drop transmit and received frames, specifically
  to test whether seeing or hearing heartbeats correctly cause LACP to
  drop the port.
* Add in (and make default) a strict LACP mode, which requires the
  heartbeat on a port to be heard before it's used.  Sometimes vendor ports
  will hang but the link layer stays up, resulting in hung traffic.
* Add logging the number of link status flaps, again to aid in debugging
  badly behaving switch ports.
* Calculate the lagg interface port speed as the multiple of the
  configured ports, rather than the largest.

Obtained from:	Netflix
MFC after:	2 weeks
2013-07-13 04:25:03 +00:00
Hiroki Sato
4825b1e098 Add a leaf node CTL_NET.PF_ROUTE.0.AF.NET_RT_DUMP.0.FIB. This returns
routing table with the specified FIB number, not td->td_proc->p_fibnum.
2013-07-12 12:36:12 +00:00
Hiroki Sato
e9f947e27c - Drop GIF_ACCEPT_REVETHIP flag by default.
- Add IFF_MONITOR support.
2013-07-12 12:18:07 +00:00
Andrey V. Elsukov
9bea6fd6c6 Correct CTASSERT condition. 2013-07-09 15:10:27 +00:00
Andrey V. Elsukov
5b7cb97c2b Migrate structs arpstat, icmpstat, mrtstat, pimstat and udpstat to PCPU
counters.
2013-07-09 09:50:15 +00:00
Andrey V. Elsukov
7daad711df Add several macros to help migrate statistics structures to PCPU counters. 2013-07-09 09:37:21 +00:00
Andrey V. Elsukov
c80211e3cf Prepare network statistics structures for migration to PCPU counters.
Use uint64_t as type for all fields of structures.

Changed structures: ahstat, arpstat, espstat, icmp6_ifstat, icmp6stat,
in6_ifstat, ip6stat, ipcompstat, ipipstat, ipsecstat, mrt6stat, mrtstat,
pfkeystat, pim6stat, pimstat, rip6stat, udpstat.

Discussed with:	arch@
2013-07-09 09:32:06 +00:00
Colin Percival
d36ed80a7b Fix typo: minmum -> minimum.
Submitted by:	@z3ndrag0n
2013-07-05 23:40:08 +00:00
Hiroki Sato
6facd7a6b8 Fix a compiler warning.
MFC after:	1 week
2013-07-03 07:31:07 +00:00
Hiroki Sato
af8056441e - Allow ND6_IFF_AUTO_LINKLOCAL for IFT_BRIDGE. An interface with IFT_BRIDGE
is initialized with !ND6_IFF_AUTO_LINKLOCAL && !ND6_IFF_ACCEPT_RTADV
  regardless of net.inet6.ip6.accept_rtadv and net.inet6.ip6.auto_linklocal.
  To configure an autoconfigured link-local address (RFC 4862), the
  following rc.conf(5) configuration can be used:

   ifconfig_bridge0_ipv6="inet6 auto_linklocal"

- if_bridge(4) now removes IPv6 addresses on a member interface to be
  added when the parent interface or one of the existing member
  interfaces has an IPv6 address.  if_bridge(4) merges each link-local
  scope zone which the member interfaces form respectively, so it causes
  address scope violation.  Removal of the IPv6 addresses prevents it.

- if_lagg(4) now removes IPv6 addresses on a member interfaces
  unconditionally.

- Set reasonable flags to non-IPv6-capable interfaces. [*]

Submitted by:	rpaulo [*]
MFC after:	1 week
2013-07-02 16:58:15 +00:00
Qing Li
f672f56f21 Due to the routing related networking kernel redesign work
in FBSD 8.0, interface routes have been returened to the
applications without the RTF_GATEWAY bit. This incompatibility
has caused some issues with Zebra, Qugga and the like.
This patch provides the RTF_GATEWAY flag bit in returned interface
routes so to behave similarly to pre 8.0 systems.

Reviewed by:	    hrs
Verified by:	    mackn at opendns dot com
2013-06-25 00:10:49 +00:00
Gleb Smirnoff
6828cc99e1 De-vnet hash sizes and hash masks.
Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny
2013-06-19 13:37:29 +00:00
Xin LI
eda6cf02b8 Return ENETDOWN instead of ENOENT when all lagg(4) links are
inactive when upper layer tries to transmit packet.  This
gives better feedback and meaningful errors for applications.

MFC after:	2 weeks
Reviewed by:	thompsa
2013-06-17 19:31:03 +00:00
Hiroki Sato
8b20f6cf8a Return ENETDOWN when the parent interface is down.
MFC after:	1 week
2013-06-16 04:40:02 +00:00
Mikolaj Golub
f8afe33795 Properly set curvnet context in lagg_port_setlladdr() task handler.
Reported by:	Nikos Vassiliadis <nvass gmx.com>
Submitted by:	zec
Tested by:	Nikos Vassiliadis <nvass gmx.com>
MFC after:	1 week
2013-06-07 10:27:50 +00:00
John Baldwin
8aa9937318 Fix build with both INET and INET6 disabled. 2013-06-04 20:40:16 +00:00
Andre Oppermann
3c914c547e Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore.  The upper limit is still at
IP_MAXPACKET (65536 bytes).  Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found.  When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by:	cperciva (earlier version)
Reviewed by:	cperciva
Tested by:	cperciva
MFC after:	1 week (using spare struct members to preserve ABI)
2013-06-03 12:55:13 +00:00
Luigi Rizzo
f18be5766f Bring in a number of new features, mostly implemented by Michio Honda:
- the VALE switch now support up to 254 destinations per switch,
  unicast or broadcast (multicast goes to all ports).

- we can attach hw interfaces and the host stack to a VALE switch,
  which means we will be able to use it more or less as a native bridge
  (minor tweaks still necessary).
  A 'vale-ctl' program is supplied in tools/tools/netmap
  to attach/detach ports the switch, and list current configuration.

- the lookup function in the VALE switch can be reassigned to
  something else, similar to the pf hooks. This will enable
  attaching the firewall, or other processing functions (e.g. in-kernel
  openvswitch) directly on the netmap port.

The internal API used by device drivers does not change.

Userspace applications should be recompiled because we
bump NETMAP_API as we now use some fields in the struct nmreq
that were previously ignored -- otherwise, data structures
are the same.

Manpages will be committed separately.
2013-05-30 14:07:14 +00:00
Luigi Rizzo
27892e02fb clarify usage of NETMAP_BUF 2013-05-30 13:41:19 +00:00
Guy Helmer
d013d9022a While waiting for the bpf hold buffer to become idle, check
the return value from mtx_sleep() and exit bpfread() on
errors such as EINTR.

Reviewed by:	jhb
2013-05-23 21:33:10 +00:00
Ed Schouten
6ed0f50f78 Allow certain headers to be included more easily.
Spotted by:	http://hacks.owlfolio.org/header-survey/
2013-05-21 21:20:10 +00:00
Alexander V. Chernikov
22f8ce4335 Use separate function to update mbuf checksum flags instead of
duplicating the same code in different places.

MFC after:	2 weeks
2013-05-18 08:14:21 +00:00
Alexander V. Chernikov
d54455b0c9 Fix rte leak introduced in r248070.
MFC after:	2 weeks
2013-05-18 07:10:22 +00:00
Julian Elischer
4871fc4ab5 Finally change the mbuf to have its own fib field instead of stealing
4 flag bits. This was supposed to happen in 8.0, and again in 2012..

MFC after:	never
2013-05-16 16:20:17 +00:00
Hiroki Sato
b8992a6792 Add IFF_MONITOR support to gre(4).
Tested by:	Chip Marshall
MFC after:	1 week
2013-05-11 19:05:38 +00:00
Andre Oppermann
f89d4c3acf Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
Eitan Adler
578acad37e Correct a few sizeof()s
Submitted by:	swildner@DragonFlyBSD.org
Reviewed by:	alfred
2013-05-01 04:37:34 +00:00
Luigi Rizzo
c10b5796c0 remove $Id$ (whitespace change) 2013-04-30 16:00:21 +00:00
Gleb Smirnoff
47e8d432d5 Add const qualifier to the dst parameter of the ifnet if_output method. 2013-04-26 12:50:32 +00:00
Oleg Bulyzhin
2c5b403e2d Recover missing arp_ifinit() call.
MFC after:	2 weeks
2013-04-18 20:13:33 +00:00
Gleb Smirnoff
b64478a137 Switch lagg(4) statistics to counter(9).
The lagg(4) is often used to bond high speed links, so basic per-packet +=
on statistics cause cache misses and statistics loss.

Perfect solution would be to convert ifnet(9) to counters(9), but this
requires much more work, and unfortunately ABI change, so temporarily
patch lagg(4) manually.

We store counters in the softc, and once per second push their values
to legacy ifnet counters.

Sponsored by:	Nginx, Inc.
2013-04-15 13:00:42 +00:00
Gleb Smirnoff
18ba072a22 Fix build. 2013-04-10 08:09:25 +00:00
Andre Oppermann
e8b3186b6a Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
Andrey V. Elsukov
9cb8d207af Use IP6STAT_INC/IP6STAT_DEC macros to update ip6 stats.
MFC after:	1 week
2013-04-09 07:11:22 +00:00
Mark Johnston
83a3ff21a8 Ignore interface renames instead of removing the interface from the bridge
group.

Reviewed by:	rstone
Approved by:	rstone (co-mentor)
Sponsored by:	Sandvine Incorporated
MFC after:	1 week
2013-03-28 20:37:07 +00:00
Gleb Smirnoff
209dddb90e Remove __FreeBSD_version ifdefs. 2013-03-22 20:44:16 +00:00
Andrey V. Elsukov
5474386bd3 Fix style and comments. 2013-03-19 05:51:47 +00:00
Gleb Smirnoff
dc4ad05ecd Use m_get/m_gethdr instead of compat macros.
Sponsored by:	Nginx, Inc.
2013-03-15 12:55:30 +00:00
Gleb Smirnoff
c69f77c339 - Use m_getcl() instead of hand allocating.
- Convert panic() to KASSERT.
- Remove superfluous cleaning of mbuf fields after allocation.
- Add comment on possible use of m_get2() here.

Sponsored by:	Nginx, Inc.
2013-03-15 12:52:59 +00:00
Gleb Smirnoff
41a7572b26 Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.
2013-03-12 13:42:47 +00:00
Gleb Smirnoff
129004c56f Reinitialize eh after pfil(9) processing.
PR:		176764
Submitted by:	adri
2013-03-11 12:06:57 +00:00
Alexander V. Chernikov
3034f43f2f Fix long-standing issue with interface routes being unprotected:
Use RTM_PINNED flag to mark route as immutable.
Forbid deleting immutable routes without special rtrequest1_fib() flag.
Adding interface address with prefix already in route table is handled
by atomically deleting old prefix and adding interface one.

Discussed with:	andre, eri
MFC after:	3 weeks
2013-03-08 20:33:50 +00:00
Alexander V. Chernikov
14126522cf Write lock is not required for find&compare operation.
MFC after:	2 weeks
2013-03-05 13:38:45 +00:00
Gleb Smirnoff
e2a55a0021 Finish the r244185. This fixes ever growing counter of pfsync bad
length packets, which was actually harmless.

Note that peers with different version of head/ may grow this
counter, but it is harmless - all pfsync data is processed.

Reported & tested by:	Anton Yuzhaninov <citrin citrin.ru>
Sponsored by:		Nginx, Inc
2013-02-15 09:03:56 +00:00
Gleb Smirnoff
24421c1c32 Resolve source address selection in presense of CARP. Add a couple
of helper functions:

- carp_master()   - boolean function which is true if an address
		    is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
		    and is aware of CARP.

  Utilize ifa_preferred() in ifa_ifwithnet().

  The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. May be we will approach this problem again later.

Reported & tested by:	Anton Yuzhaninov <citrin citrin.ru>
Sponsored by:		Nginx, Inc
2013-02-11 10:58:22 +00:00
Randall Stewart
ded5ea6a25 This fixes a out-of-order problem with several
of the newer drivers. The basic problem was
that the driver was pulling the mbuf off the
drbr ring and then when sending with xmit(), encounting
a full transmit ring. Thus the lower layer
xmit() function would return an error, and the
drivers would then append the data back on to the ring.
For TCP this is a horrible scenario sure to bring
on a fast-retransmit.

The fix is to use drbr_peek() to pull the data pointer
but not remove it from the ring. If it fails then
we either call the new drbr_putback or drbr_advance
method. Advance moves it forward (we do this sometimes
when the xmit() function frees the mbuf). When
we succeed we always call advance. The
putback will always copy the mbuf back to the top
of the ring. Note that the putback *cannot* be used
with a drbr_dequeue() only with drbr_peek(). We most
of the time, in putback, would not need to copy it
back since most likey the mbuf is still the same, but
sometimes xmit() functions will change the mbuf via
a pullup or other call. So the optimial case for
the single consumer is to always copy it back. If
we ever do a multiple_consumer (for lagg?) we
will  need a test and atomic in the put back possibly
a seperate putback_mc() in the ring buf.

Reviewed by:	jhb@freebsd.org, jlv@freebsd.org
2013-02-07 15:20:54 +00:00
Gleb Smirnoff
9711a168b9 Retire struct sockaddr_inarp.
Since ARP and routing are separated, "proxy only" entries
don't have any meaning, thus we don't need additional field
in sockaddr to pass SIN_PROXY flag.

New kernel is binary compatible with old tools, since sizes
of sockaddr_inarp and sockaddr_in match, and sa_family are
filled with same value.

The structure declaration is left for compatibility with
third party software, but in tree code no longer use it.

Reviewed by:	ru, andre, net@
2013-01-31 08:55:21 +00:00
Gleb Smirnoff
1910bfcba2 route_output() always supplies info with RTAX_GATEWAY member that
points to a sockaddr of AF_LINK family. Assert this instead of
checking.
2013-01-29 21:44:22 +00:00
Navdeep Parhar
4364ec0852 Move lle_event to if_llatbl.h
lle_event replaced arp_update_event after the ARP rewrite and ended up
in if_ether.h simply because arp_update_event used to be there too.
IPv6 neighbor discovery is going to grow lle_event support and this is a
good time to move it to if_llatbl.h.

The two in-tree consumers of this event - OFED and toecore - are not
affected.

Reviewed by:	bz@
2013-01-25 23:58:21 +00:00
Gleb Smirnoff
ed63043b21 - Utilize m_get2(), accidentially fixing some signedness bugs.
- Return EMSGSIZE in both cases if uio_resid is oversized or undersized.
- No need to clear rcvif.
2013-01-24 14:29:31 +00:00
Luigi Rizzo
01c039a19c leftover from r245579... flags for semi transparent mode and direct
forwarding through a VALE switch
2013-01-23 03:49:48 +00:00
Gleb Smirnoff
1d9797f128 If lagg(4) can't forward a packet due to underlying port problems,
return much more meaningful ENETDOWN to the stack, instead of EBUSY.
2013-01-21 08:59:31 +00:00
Gleb Smirnoff
f6eef2c2d6 - Add dashes before copyright notices.
- Add $FreeBSD$.
- Remove unused define.
2013-01-07 19:36:11 +00:00
Peter Wemm
a116ec4b5e Juggle some internal symbols from our antique zlib (that originally came
in from kernel-pppd which is long gone) so that ZFS and DTRACE play nice.

This is a horrible hack to get freefall to compile, and is in dire need
of reconciliation.  This antique zlib-1.04 code needs to go away.
2013-01-06 14:59:59 +00:00
Andrey V. Elsukov
e37e7917f3 Add an ability to set net.link.stf.permit_rfc1918 from the loader.
MFC after:	2 weeks
2012-12-27 21:26:08 +00:00
Andrey V. Elsukov
51743c5f73 Add net.link.stf.permit_rfc1918 sysctl variable. It can be used to allow
the use of private IPv4 addresses with stf(4).

MFC after:	2 weeks
2012-12-27 20:59:22 +00:00
Kevin Lo
c7dada99bb Fix typo in comment.
Reviewed by:	thompsa
2012-12-18 06:37:23 +00:00
Gleb Smirnoff
b1ec2940af Fix problem in r238990. The LLE_LINKED flag should be tested prior to
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.

Reported by:	Ian FREISLICH <ianf clue.co.za>
2012-12-13 11:11:15 +00:00
Guy Helmer
3b3b91e736 Changes to resolve races in bpfread() and catchpacket() that, at worst,
cause kernel panics.

Add a flag to the bpf descriptor to indicate whether the hold buffer
is in use. In bpfread(), set the "hold buffer in use" flag before
dropping the descriptor lock during the call to bpf_uiomove().
Everywhere else the hold buffer is used or changed, wait while
the hold buffer is in use by bpfread(). Add a KASSERT in bpfread()
after re-acquiring the descriptor lock to assist uncovering any
additional hold buffer races.
2012-12-10 16:14:44 +00:00
Hiroki Sato
0bebb5448b - Move definition of V_deembed_scopeid to scope6_var.h.
- Deembed scope id in L3 address in in6_lltable_dump().
- Simplify scope id recovery in rtsock routines.
- Remove embedded scope id handling in ndp(8) and route(8) completely.
2012-12-05 19:45:24 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Hiroki Sato
5c9fa630f6 - Fix LOR in sa6_recoverscope() in rt_msg2()[1].
- Check V_deembed_scopeid before checking if sa_family == AF_INET6.
- Fix scope id handing in route(8)[2] and ifconfig(8).

Reported by:	rpaulo[1], Mateusz Guzik[1], peter[2]
2012-12-04 17:12:23 +00:00
Alexander V. Chernikov
f079a0fa8c Fix bpf_if structure leak introduced in r235745.
Move all such structures to delayed-free lists and
delete all matching on interface departure event.

MFC after:	1 week
2012-12-02 21:43:37 +00:00
Pawel Jakub Dawidek
5ad9520341 - Use more appropriate loop (do { } while()) when generating ethernet address
for bridge interface.
- If we found a collision we can break the loop - only one collision is
  possible and one is exactly enough to need to renegerate.

Obtained from:	WHEEL Systems
MFC after:	1 week
2012-11-29 08:06:23 +00:00
Andre Oppermann
da2299c5c7 Remove unused and unnecessary CSUM_IP_FRAGS checksumming capability.
Checksumming the IP header of fragments is no different from doing
normal IP headers.

Discussed with:	yongari
MFC after:	1 week
2012-11-27 19:31:49 +00:00
David Xu
ba60525b3f Pass allocated unit number to make_dev, otherwise kernel panics later while
cloning second tap.

Reviewed by: kevlo,ed
2012-11-27 12:23:57 +00:00
Gleb Smirnoff
5e9a54290d Better safe than sorry: reinitialize eh after ng_ether(4) and
if_bridge(4) processing, since mbuf may be modified there.

Submitted by:	youngari
2012-11-27 06:35:26 +00:00
Gleb Smirnoff
97cce87f78 Re-initialize eh pointer after m_adj()
Submitted by:	Kohji Okuno <okuno.kohji jp.panasonic.com>
Reviewed by:	yongari
2012-11-26 19:45:01 +00:00
Adrian Chadd
d60ec817ea Fix up a compile time warning if INET6 isn't defined. 2012-11-18 04:51:46 +00:00
Hiroki Sato
6bbfef9004 Fill sin6_scope_id in sockaddr_in6 before passing it from the kernel to
userland via routing socket or sysctl.  This eliminates the following
KAME-specific sin6_scope_id handling routine from each userland utility:

 sin6.sin6_scope_id = ntohs(*(u_int16_t *)&sin6.sin6_addr.s6_addr[2]);

This behavior can be controlled by net.inet6.ip6.deembed_scopeid.  This is
set to 1 by default (sin6_scope_id will be filled in the kernel).

Reviewed by:	bz
2012-11-17 20:19:00 +00:00
Guy Helmer
0e8a1cb3c9 Work around a race in bpfread() by validating the hold buffer pointer
before freeing it. Otherwise, we can lose a buffer and cause a panic
in catchpacket().
2012-11-06 21:07:04 +00:00
Andrey V. Elsukov
ffdbf9da3b Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
Gleb Smirnoff
078468ede4 o Remove last argument to ip_fragment(), and obtain all needed information
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
  hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
  although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
  After this change CSUM_DELAY_IP vanishes from the stack.

Submitted by:	Sebastian Kuzminsky <seb lineratesystems.com>
2012-10-26 21:06:33 +00:00
Andrey V. Elsukov
c1de64a495 Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
Gleb Smirnoff
da1fc67f8a Fix fallout from r240071. If destination interface lookup fails,
we should broadcast a packet, not try to deliver it to NULL.

Reported by:	rpaulo
2012-10-24 18:33:44 +00:00
Gleb Smirnoff
8f134647ca Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
Alexander V. Chernikov
4dab1a18a3 Make PFIL use per-VNET lock instead of per-AF lock. Since most used packet
filters (ipfw and PF) use the same ruleset with the same lock for both
AF_INET and AF_INET6 there is no need in more fine-grade locking.
However, it is possible to request personal lock by specifying
PFIL_FLAG_PRIVATE_LOCK flag in pfil_head structure (see pfil.9 for
more details).

Export PFIL lock via rw_lock(9)/rm_lock(9)-like API permitting pfil consumers
to use this lock instead of own lock. This help reducing locks on main
traffic path.

pfil_assert() is currently not implemented due to absense of rm_assert().
Waiting for some kind of r234648 to be merged in HEAD.

This change is part of bigger patch reducing routing locking.

Sponsored by:	Yandex LLC
Reviewed by:	glebius, ae
OK'd by:	silence on net@
MFC after:	3 weeks
2012-10-22 14:10:17 +00:00
Andre Oppermann
813ee737dd Update to previous r241688 to use __func__ instead of spelled out function
name in log(9) message.

Suggested by:	glebius
2012-10-19 10:07:55 +00:00
Andre Oppermann
15134be81b Use LOG_WARNING level in in_attachdomain1() instead of printf().
Submitted by:	vijju.singh-at-gmail.com
2012-10-18 14:08:26 +00:00
Andre Oppermann
c9b652e3e8 Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.
2012-10-18 13:57:24 +00:00
Gleb Smirnoff
44e1d89044 Utilize new macro to initialize if_baudrate(). 2012-10-18 09:57:56 +00:00
Gleb Smirnoff
3aaf0159dd Fix VIMAGE build.
Reported by:	Nikolai Lifanov <lifanov mail.lifanov.com>
Pointy hat to:	glebius
2012-10-17 21:19:27 +00:00
Maksim Yevmenkin
608ae712d3 provide helper if_initbaudrate() to set if_baudrate_pf and if_baudrate_pf.
again, use ixgbe(4) as an example of how to use new helper function.

Reviewed by:	jhb
MFC after:	1 week
2012-10-17 19:24:13 +00:00
Xin LI
6684e46988 Fix build. 2012-10-17 08:19:08 +00:00
Maksim Yevmenkin
e9bbb44e09 report total number of ports for each lagg(4) interface
via net.link.lagg.X.count sysctl

MFC after:	1  week
2012-10-16 22:43:14 +00:00
Maksim Yevmenkin
0fef97fea3 introduce concept of ifi_baudrate power factor. the idea is to work
around the problem where high speed interfaces (such as ixgbe(4))
are not able to report real ifi_baudrate. bascially, take a spare
byte from struct if_data and use it to store ifi_baudrate power
factor. in other words,

real ifi_baudrate = ifi_baudrate * 10 ^ ifi_baudrate power factor

this should be backwards compatible with old binaries. use ixgbe(4)
as an example on how drivers would set ifi_baudrate power factor

Discussed with:	kib, scottl, glebius
MFC after:	1 week
2012-10-16 20:18:15 +00:00
Gleb Smirnoff
42a58907c3 Make the "struct if_clone" opaque to users of the cloning API. Users
now use function calls:

  if_clone_simple()
  if_clone_advanced()

to initialize a cloner, instead of macros that initialize if_clone
structure.

Discussed with:		brooks, bz, 1 year ago
2012-10-16 13:37:54 +00:00
Kevin Lo
9823d52705 Revert previous commit...
Pointyhat to:	kevlo (myself)
2012-10-10 08:36:38 +00:00
Kevin Lo
a10cee30c9 Prefer NULL over 0 for pointers 2012-10-09 08:27:40 +00:00
Gleb Smirnoff
21d172a3f1 A step in resolving mess with byte ordering for AF_INET. After this change:
- All packets in NETISR_IP queue are in net byte order.
  - ip_input() is entered in net byte order and converts packet
    to host byte order right _after_ processing pfil(9) hooks.
  - ip_output() is entered in host byte order and converts packet
    to net byte order right _before_ processing pfil(9) hooks.
  - ip_fragment() accepts and emits packet in net byte order.
  - ip_forward(), ip_mloopback() use host byte order (untouched actually).
  - ip_fastforward() no longer modifies packet at all (except ip_ttl).
  - Swapping of byte order there and back removed from the following modules:
    pf(4), ipfw(4), enc(4), if_bridge(4).
  - Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
  - __FreeBSD_version bumped.
  - pfil(9) manual page updated.

Reviewed by:	ray, luigi, eri, melifaro
Tested by:	glebius (LE), ray (BE)
2012-10-06 10:02:11 +00:00
Xin LI
15752fa858 MFV: libpcap 1.3.0.
MFC after:	4 weeks
2012-10-05 18:42:50 +00:00
Andrew Thompson
3e92ee8a53 Remove the M_NOWAIT from bridge_rtable_init as it isn't needed. The function
return value is not even checked and could lead to a panic on a null sc_rthash.

MFC after:	2 weeks
2012-10-04 07:40:55 +00:00
Ed Maste
104d9fc776 Cast through void * to silence compiler warning
The base netmap pointer and offsets involved are provided by the kernel
side of the netmap interface and will have appropriate alignment.

Sponsored by: ADARA Networks
MFC After: 2 weeks
2012-10-03 21:41:20 +00:00
John Baldwin
b3aa419331 Rename the module for 'device enc' to "if_enc" to avoid conflicting with
the CAM "enc" peripheral (part of ses(4)).  Previously the two modules
used the same name, so only one was included in a linked kernel causing
enc0 to not be created if you added IPSEC to GENERIC.  The new module
name follows the pattern of other network interfaces (e.g. "if_loop").

MFC after:	1 week
2012-10-02 12:25:30 +00:00
Gleb Smirnoff
063efed28c The drbr(9) API appeared to be so unclear, that most drivers in
tree used it incorrectly, which lead to inaccurate overrated
if_obytes accounting. The drbr(9) used to update ifnet stats on
drbr_enqueue(), which is not accurate since enqueuing doesn't
imply successful processing by driver. Dequeuing neither mean
that. Most drivers also called drbr_stats_update() which did
accounting again, leading to doubled if_obytes statistics. And
in case of severe transmitting, when a packet could be several
times enqueued and dequeued it could have been accounted several
times.

o Thus, make drbr(9) API thinner. Now drbr(9) merely chooses between
  ALTQ queueing or buf_ring(9) queueing.
  - It doesn't touch the buf_ring stats any more.
  - It doesn't touch ifnet stats anymore.
  - drbr_stats_update() no longer exists.

o buf_ring(9) handles its stats itself:
  - It handles br_drops itself.
  - br_prod_bytes stats are dropped. Rationale: no one ever
    reads them but update of a common counter on every packet
    negatively affects performance due to excessive cache
    invalidation.
  - buf_ring_enqueue_bytes() reduced to buf_ring_enqueue(), since
    we no longer account bytes.

o Drivers handle their stats theirselves: if_obytes, if_omcasts.

o mlx4(4), igb(4), em(4), vxge(4), oce(4) and  ixv(4) no longer
  use drbr_stats_update(), and update ifnet stats theirselves.

o bxe(4) was the most correct driver, it didn't call
  drbr_stats_update(), thus it was the only driver accurate under
  moderate load. Now it also maintains stats itself.

o ixgbe(4) had already taken stats from hardware, so just
  - drop software stats updating.
  - take multicast packet count from hardware as well.

o mxge(4) just no longer needs NO_SLOW_STATS define.

o cxgb(4), cxgbe(4) need no change, since they obtain stats
  from hardware.

Reviewed by:	jfv, gnn
2012-09-28 18:28:27 +00:00
Gleb Smirnoff
80cd7c7596 - In the bridge_enqueue() do success/error accounting for
each fragment, not only once.
- In the GRAB_OUR_PACKETS() macro do increase if_ibytes.
2012-09-26 20:09:48 +00:00
Ed Maste
66d3579a1e Correct misspelling in debug output. 2012-09-26 01:09:19 +00:00
Ed Maste
c11038e252 Revert part of an earlier patch attempt that snuck in with r240938. 2012-09-25 23:41:45 +00:00
Ed Maste
5a71e42350 Avoid INVARIANTS panic destroying an in-use tap(4)
The requirement (implied by the KASSERT in tap_destroy) that the tap is
closed isn't valid; destroy_dev will block in devdrn while other threads
are in d_* functions.

Note: if_tun had the same issue, addressed in SVN revisions r186391,
r186483 and r186497.  The use of the condvar there appears to be
redundant with the functionality provided by destroy_dev.

Sponsored by:	ADARA Networks
Reviewed by:	dwhite
MFC after:	2 weeks
2012-09-25 22:10:14 +00:00
Ed Maste
cf8f32025f Remove an incorrect comment 2012-09-25 21:19:17 +00:00
Gleb Smirnoff
3b7d677b8f Convert lagg(4) to use if_transmit instead of if_start.
In collaboration with:	thompsa, sbruno, fabient
2012-09-20 10:05:10 +00:00
Gleb Smirnoff
22c914789e Utilize Jenkins hash with random seed for source nodes storage. 2012-09-20 06:52:05 +00:00
Gleb Smirnoff
7b11548469 Add missing break.
Pointy hat to:	glebius
2012-09-20 03:09:58 +00:00
Gleb Smirnoff
9ed8bbbdbe Fix build, pass the pointy hat please. 2012-09-18 12:21:32 +00:00
Gleb Smirnoff
1d6139c0e4 Make ruleset anchors in pf(4) reentrant. We've got two problems here:
1) Ruleset parser uses a global variable for anchor stack.
2) When processing a wildcard anchor, matching anchors are marked.

To fix the first one:

o Allocate anchor processing stack on stack. To make this allocation
  as small as possible, following measures taken:
  - Maximum stack size reduced from 64 to 32.
  - The struct pf_anchor_stackframe trimmed by one pointer - parent.
    We can always obtain the parent via the rule pointer.
  - When pf_test_rule() calls pf_get_translation(), the former lends
    its stack to the latter, to avoid recursive allocation 32 entries.

The second one appeared more tricky. The code, that marks anchors was
added in OpenBSD rev. 1.516 of pf.c. According to commit log, the idea
is to enable the "quick" keyword on an anchor rule. The feature isn't
documented anywhere. The most obscure part of the 1.516 was that code
examines the "match" mark on a just processed child, which couldn't be
put here by current frame. Since this wasn't documented even in the
commit message and functionality of this is not clear to me, I decided
to drop this examination for now. The rest of 1.516 is redone in a
thread safe manner - the mark isn't put on the anchor itself, but on
current stack frame. To avoid growing stack frame, we utilize LSB
from the rule pointer, relying on kernel malloc(9) returning pointer
aligned addresses.

Discussed with:		dhartmei
2012-09-18 10:54:56 +00:00
Gleb Smirnoff
9e8c4accee - Add $FreeBSD$ to allow modifications to this file.
- Move $OpenBSD$ to a more standard place.
2012-09-18 10:52:46 +00:00
Gleb Smirnoff
3b3a8eb937 o Create directory sys/netpfil, where all packet filters should
reside, and move there ipfw(4) and pf(4).

o Move most modified parts of pf out of contrib.

Actual movements:

sys/contrib/pf/net/*.c		-> sys/netpfil/pf/
sys/contrib/pf/net/*.h		-> sys/net/
contrib/pf/pfctl/*.c		-> sbin/pfctl
contrib/pf/pfctl/*.h		-> sbin/pfctl
contrib/pf/pfctl/pfctl.8	-> sbin/pfctl
contrib/pf/pfctl/*.4		-> share/man/man4
contrib/pf/pfctl/*.5		-> share/man/man5

sys/netinet/ipfw		-> sys/netpfil/ipfw

The arguable movement is pf/net/*.h -> sys/net. There are
future plans to refactor pf includes, so I decided not to
break things twice.

Not modified bits of pf left in contrib: authpf, ftp-proxy,
tftp-proxy, pflogd.

The ipfw(4) movement is planned to be merged to stable/9,
to make head and stable match.

Discussed with:		bz, luigi
2012-09-14 11:51:49 +00:00
Gleb Smirnoff
d6d3f01e0a Merge the projects/pf/head branch, that was worked on for last six months,
into head. The most significant achievements in the new code:

 o Fine grained locking, thus much better performance.
 o Fixes to many problems in pf, that were specific to FreeBSD port.

New code doesn't have that many ifdefs and much less OpenBSDisms, thus
is more attractive to our developers.

  Those interested in details, can browse through SVN log of the
projects/pf/head branch. And for reference, here is exact list of
revisions merged:

r232043, r232044, r232062, r232148, r232149, r232150, r232298, r232330,
r232332, r232340, r232386, r232390, r232391, r232605, r232655, r232656,
r232661, r232662, r232663, r232664, r232673, r232691, r233309, r233782,
r233829, r233830, r233834, r233835, r233836, r233865, r233866, r233868,
r233873, r234056, r234096, r234100, r234108, r234175, r234187, r234223,
r234271, r234272, r234282, r234307, r234309, r234382, r234384, r234456,
r234486, r234606, r234640, r234641, r234642, r234644, r234651, r235505,
r235506, r235535, r235605, r235606, r235826, r235991, r235993, r236168,
r236173, r236179, r236180, r236181, r236186, r236223, r236227, r236230,
r236252, r236254, r236298, r236299, r236300, r236301, r236397, r236398,
r236399, r236499, r236512, r236513, r236525, r236526, r236545, r236548,
r236553, r236554, r236556, r236557, r236561, r236570, r236630, r236672,
r236673, r236679, r236706, r236710, r236718, r237154, r237155, r237169,
r237314, r237363, r237364, r237368, r237369, r237376, r237440, r237442,
r237751, r237783, r237784, r237785, r237788, r237791, r238421, r238522,
r238523, r238524, r238525, r239173, r239186, r239644, r239652, r239661,
r239773, r240125, r240130, r240131, r240136, r240186, r240196, r240212.

I'd like to thank people who participated in early testing:

Tested by:	Florian Smeets <flo freebsd.org>
Tested by:	Chekaluk Vitaly <artemrts ukr.net>
Tested by:	Ben Wilber <ben desync.com>
Tested by:	Ian FREISLICH <ianf cloudseed.co.za>
2012-09-08 06:41:54 +00:00
Alexander V. Chernikov
73c23f3ba1 Fix the build broken by r240099.
Hide link_pfil_hook under _KERNEL macro.

MFC after:    3 weeks
2012-09-04 22:17:33 +00:00
Alexander V. Chernikov
7d4317bd40 Introduce new link-layer PFIL hook V_link_pfil_hook.
Merge ether_ipfw_chk() and part of bridge_pfil() into
unified ipfw_check_frame() function called by PFIL.
This change was suggested by rwatson? @ DevSummit.

Remove ipfw headers from ether/bridge code since they are unneeded now.

Note this thange introduce some (temporary) performance penalty since
PFIL read lock has to be acquired for every link-level packet.

MFC after:     3 weeks
2012-09-04 19:43:26 +00:00
Gleb Smirnoff
62208ca5d2 - Move jenkins.h to jenkins_hash.c
- Provide missing function that can do hashing of arbitrary sized buffer.
- Refetch lookup3.c and do only minimal edits to it, so that diff between
  our jenkins_hash.c and lookup3.c is minimal.
- Add declarations for jenkins_hash(), jenkins_hash32() to sys/hash.h.
- Document these functions in hash(9)

Obtained from:	http://burtleburtle.net/bob/c/lookup3.c
2012-09-04 12:07:33 +00:00
Gleb Smirnoff
3582a9f6c6 Change bridge(4) to use if_transmit for forwarding packets to underlying
interfaces instead of queueing.

Tested by:	ray
2012-09-03 10:08:20 +00:00
Gleb Smirnoff
3932d76033 In ifc_alloc_unit():
- In the !wildcard case, return ENOSPC instead of confusing EEXIST
  in case if ifc->ifc_maxunit reached.
- Fix unit leak, that I've introduced in previous revision.

Submitted by:	Daan Vreeken <Daan vitsch.nl>
2012-08-30 12:18:45 +00:00
John Baldwin
b90dde2fbc Fix a silly grammar bogon.
Submitted by:	Stephen McKay
2012-08-21 19:07:28 +00:00
John Baldwin
28cc4d37e6 Refine the changes made in r208212 to avoid bogus failures from
if_delmulti() when clearing the configuration for a subinterface when
the parent interface is being detached.  The current code was still
triggering an assertion in if_delmulti() due to the parent interface being
partially detached.  Fix this by not calling if_delmulti() at all if the
parent interface is being detached.  Warn if if_delmulti() fails when the
parent is not being detached (but similar to 208212, still proceed with
tearing down the vlan state).

Tested by:	ae@
MFC after:	1 month
2012-08-20 16:00:33 +00:00
John Baldwin
2541fcd953 Unexpand a couple of TAILQ_FOREACH()s. 2012-08-17 16:01:24 +00:00
Konstantin Belousov
1c771f9222 After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed.  Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by:	alc
MFC after:    2 weeks
2012-08-05 14:11:42 +00:00
Gleb Smirnoff
ea53792942 Fix races between in_lltable_prefix_free(), lla_lookup(),
llentry_free() and arptimer():

o Use callout_init_rw() for lle timeout, this allows us safely
  disestablish them.
  - This allows us to simplify the arptimer() and make it
    race safe.
o Consistently use ifp->if_afdata_lock to lock access to
  linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
  is attached to the hash.
  - Use LLE_LINKED to avoid double unlinking via consequent
    calls to llentry_free().
  - Mark lle with LLE_DELETED via |= operation istead of =,
    so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
  consistent and provide more informative KASSERTs.

The patch is a collaborative work of all submitters and myself.

PR:		kern/165863
Submitted by:	Andrey Zonov <andrey zonov.org>
Submitted by:	Ryan Stone <rysto32 gmail.com>
Submitted by:	Eric van Gyzen <eric_van_gyzen dell.com>
2012-08-02 13:57:49 +00:00