Commit Graph

416 Commits

Author SHA1 Message Date
glebius
9d7706f9f4 o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
    passing mbuf and address family. That's the only code under
    #ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
  - Remove hand made pcpu stats, and utilize counter(9).
  - Snapshot of statistics is available via 'netstat -rs'.
  - All sysctls are moved into net.flowtable namespace, since
    spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
  - Remove chain of multiple flowtables. We simply have one for
    IPv4 and one for IPv6.
  - Flowtables are allocated in flowtable.c, symbols are static.
  - With proper argument to SYSINIT() we no longer need flowtable_ready.
  - Hash salt doesn't need to be per-VNET.
  - Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-07 15:18:23 +00:00
attilio
7ee4e910ce - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
glebius
790225cfbc - Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
  between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
  right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
  for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
  rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
  took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by:	melifaro [1], pluknet[2]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 11:37:57 +00:00
trociny
8439b55778 Unregister inet/inet6 pfil hooks on vnet destroy.
Discussed with:	andre
Approved by:	re (rodrigc)
2013-09-13 18:45:10 +00:00
markj
29e4661920 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
andre
e3737c33e7 Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
andre
1b3e4b45e5 Move ip_reassemble()'s use of the global M_FRAG mbuf flag to a protocol layer
specific flag instead.  The flag is only relevant while the packet stays in
the IP reassembly queue.

Discussed with:	trociny, glebius
2013-08-19 10:34:10 +00:00
ae
1b1e1de2b2 Use new macros to implement ipstat and tcpstat using PCPU counters.
Change interface of kread_counters() similar ot kread() in the netstat(1).
2013-07-09 09:43:03 +00:00
glebius
68c4131bea Fix kmod_*stat_inc() after r249276. The incorrect code actually
increased the pointer, not the memory it points to.

In collaboration with:	kib
Reported & tested by:	Ian FREISLICH <ianf clue.co.za>
Sponsored by:		Nginx, Inc.
2013-06-21 06:36:26 +00:00
andre
cc8c6e4d01 Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
andre
306fddaf78 Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
glebius
2a0fbb38ed Merge from projects/counters: TCP/IP stats.
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).

  This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.

Sponsored by:	Nginx, Inc.
2013-04-08 19:57:21 +00:00
glebius
b37af62b9e Use m_get/m_gethdr instead of compat macros.
Sponsored by:	Nginx, Inc.
2013-03-15 12:55:30 +00:00
pluknet
5be772ca10 ip_savecontrol() style fixes. No functional changes.
- fix indentation
- put the operator at the end of the line for long statements
- remove spaces between the type and the variable in a cast
- remove excessive parentheses

Tested by:	md5
2013-02-20 15:44:40 +00:00
glebius
8e20fa5ae9 Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
ae
4354018055 Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
ae
71112b5a8e Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
glebius
3d11eb1465 After r241923 the updated ip_len no longer needed. 2012-10-25 09:02:21 +00:00
glebius
a5c4b7118d Fix error in r241913 that had broken fragment reassembly. 2012-10-25 09:00:57 +00:00
glebius
fea857f2a8 Do not reduce ip_len by size of IP header in the ip_input()
before passing a packet to protocol input routines.
  For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.

  Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.
2012-10-23 08:33:13 +00:00
glebius
5cc3ac5902 Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
glebius
f3a0231bff A step in resolving mess with byte ordering for AF_INET. After this change:
- All packets in NETISR_IP queue are in net byte order.
  - ip_input() is entered in net byte order and converts packet
    to host byte order right _after_ processing pfil(9) hooks.
  - ip_output() is entered in host byte order and converts packet
    to net byte order right _before_ processing pfil(9) hooks.
  - ip_fragment() accepts and emits packet in net byte order.
  - ip_forward(), ip_mloopback() use host byte order (untouched actually).
  - ip_fastforward() no longer modifies packet at all (except ip_ttl).
  - Swapping of byte order there and back removed from the following modules:
    pf(4), ipfw(4), enc(4), if_bridge(4).
  - Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
  - __FreeBSD_version bumped.
  - pfil(9) manual page updated.

Reviewed by:	ray, luigi, eri, melifaro
Tested by:	glebius (LE), ray (BE)
2012-10-06 10:02:11 +00:00
glebius
418a04b467 When ip_output()/ip6_output() is supplied a struct route *ro argument,
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.

The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.

- Introduce new flag for struct route/route_in6, that marks route
  not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
  depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
  but empty route should use RO_RTFREE() to free results of
  lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
  ro->ro_rt == NULL.

Tested by:	tuexen (SCTP part)
2012-07-04 07:37:53 +00:00
tuexen
32041f44ed Add a IP_RECVTOS socket option to receive for received UDP/IPv4
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.

MFC after: 3 days
2012-06-12 14:02:38 +00:00
jhb
4ef366671a Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by:	bz
MFC after:	2 weeks
2012-01-05 19:00:36 +00:00
glebius
2eb352ca19 Remove last remnants of classful addressing:
- Remove ia_net, ia_netmask, ia_netbroadcast from struct in_ifaddr.
- Remove net.inet.ip.subnetsarelocal, I bet no one need it in 2011.
- fix bug when we were not forwarding to a host which matches classful
  net address. For example router having 192.168.x.y/16 network attached,
  would not forward traffic to 192.168.*.0, which are legal IPs in
  CIDR world.
- For compatibility, leave autoguessing of mask based on class.

Reviewed by:	andre, bz, rwatson
2011-10-15 16:28:06 +00:00
bz
b4b3d062cd Correct comments and debug logging in ipsec to better match reality.
MFC after:	3 days
2011-06-08 03:02:11 +00:00
bz
d6e97eaa7b MfP4 CH=192004:
Move ip_defttl to raw_ip.c where it is actually used.  In an IPv6
only world we do not want to compile ip_input.c in for that and
it is a shared default with INET6.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-27 19:32:27 +00:00
bz
25cccaf492 MFp4 CH=191470:
Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c.  The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-20 08:00:29 +00:00
bz
d10762bec7 MFp4 CH=191466:
Move fw_one_pass to where it belongs: it is a property of ipfw,
not of ip_input.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	3 days
2011-04-20 07:55:33 +00:00
brucec
6d9b42b486 Fix typos - remove duplicate "the".
PR:	bin/154928
Submitted by:	Eitan Adler <lists at eitanadler.com>
MFC after: 	3 days
2011-02-21 09:01:34 +00:00
dim
fb307d7d1d After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files.  A better long-term solution is
still being considered.  This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
2010-11-22 19:32:54 +00:00
dim
fda4020a88 Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
bz
18b40a43ee MFp4 CH=183052 183053 183258:
In protosw we define pr_protocol as short, while on the wire
  it is an uint8_t.  That way we can have "internal" protocols
  like DIVERT, SEND or gaps for modules (PROTO_SPACER).
  Switch ipproto_{un,}register to accept a short protocol number(*)
  and do an upfront check for valid boundries. With this we
  also consistently report EPROTONOSUPPORT for out of bounds
  protocols, as we did for proto == 0.  This allows a caller
  to not error for this case, which is especially important
  if we want to automatically call these from domain handling.

  (*) the functions have been without any in-tree consumer
  since the initial introducation, so this is considered save.

  Implement ip6proto_{un,}register() similarly to their legacy IP
  counter parts to allow modules to hook up dynamically.

Reviewed by:	philip, will
MFC after:	1 week
2010-09-02 17:43:44 +00:00
will
aa4e762c4a Allow carp(4) to be loaded as a kernel module. Follow precedent set by
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.

Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default.  Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.

This commit requires IP6PROTOSPACER, introduced in r211115.

Reviewed by:	bz, simon
Approved by:	ken (mentor)
MFC after:	2 weeks
2010-08-11 00:51:50 +00:00
bz
0a90ef1728 MFP4: @176978-176982, 176984, 176990-176994, 177441
"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by:	jhb
Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	6 days
2010-04-29 11:52:42 +00:00
bz
b883f7a391 Avoid memory access after free. Use the (shortend) copy for the
ipsec mtu lookup as well.

PR:		kern/145736
Submitted by:	Peter Molnar (peter molnar.cc)
MFC after:	3 days
2010-04-21 10:21:34 +00:00
kmacy
01cb21605b - boot-time size the ipv4 flowtable and the maximum number of flows
- increase flow cleaning frequency and decrease flow caching time
  when near the flow limit
- stop allocating new flows when within 3% of maxflows don't start
  allocating again until below 12.5%

MFC after:	7 days
2010-03-22 23:04:12 +00:00
kmacy
128542c758 - restructure flowtable to support ipv6
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
  (e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate

MFC after:	7 days
2010-03-12 05:03:26 +00:00
bz
29381991cf Split up ip_drain() into an outer lock and iterator part and
a "locked" version that will only handle a single network stack
instance. The latter is called directly from ip_destroy().

Hook up an ip_destroy() function to release resources from the
legacy IP network layer upon virtual network stack teardown.

Sponsored by:	ISPsystem
Reviewed by:	rwatson
MFC After:	5 days
2010-02-20 19:59:52 +00:00
rwatson
a5668d2042 Correct spelling typo in ip_input comment.
Pointed out by:	N.J. Mann <njm at njm.me.uk>,
		John Nielsen <john at jnielsen.net>, julian (!), lstewart
MFC after:	2 days
2009-10-24 09:18:26 +00:00
rwatson
552ae71ff3 Improve grammar in ip_input comment while attempting to maintain what
might be its meaning.

MFC after:	3 days
2009-10-23 13:35:00 +00:00
rwatson
ec5eebfd83 Rewrap ip_input() comment so that it prints more nicely.
MFC after:	3 days
2009-10-18 11:23:56 +00:00
julian
79c1f884ef Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.

Sitting aroung in tree waiting to commit for: 2 months
MFC after:	2 months
2009-10-11 05:59:43 +00:00
rwatson
5c6699ad3d Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem.  These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored.  This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

      if_bridge
      if_cxgb
      if_gif
      ip_mroute
      ipdivert
      pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack.  However, that change is too agressive for
this point in the release cycle.

Reviewed by:	bz
Approved by:	re (kib)
2009-08-02 19:43:32 +00:00
rwatson
fb9ffed650 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
rwatson
90830ed53d Back out the moving in r195782 of V_ip_id's initialization from the top
back to the bottom of ip_init() as found in 7.x.  I missed the fact that
the bottom half of the init routine only runs in the !VNET case.

Submitted by:	zec
Approved by:	re (vimage blanket)
2009-07-20 19:40:09 +00:00
rwatson
d32e9e9901 Garbage collect vnet module registrations that have neither constructors
nor destructors, as there's no actual work to do.

In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.

Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 13:55:33 +00:00
rwatson
6955067932 Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
rwatson
88f8de4d40 Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00