Commit Graph

428 Commits

Author SHA1 Message Date
hselasky
12fec3618b Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
melifaro
f8d64c469a Finish r274175: do control plane MTU tracking.
Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.

PR:		194238
MFC after:	1 month
CR:		D1125
2014-11-17 01:05:29 +00:00
melifaro
b5d711d3a6 Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@
2014-11-09 21:33:01 +00:00
rwatson
9532a34956 When deciding whether to call m_pullup() even though there is adequate
data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT;
the latter both unnecessarily exposes mbuf-allocator internals in the
protocol stack and is also insufficient to catch all cases of
non-writability.

(NB: m_pullup() does not actually guarantee that a writable mbuf is
returned, so further refinement of all of these code paths continues to
be required.)

Reviewed by:	bz
MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900
2014-10-12 15:49:52 +00:00
hrs
43ddffd06b Add an additional routing table lookup when m->m_pkthdr.fibnum is changed
at a PFIL hook in ip{,6}_output().  IPFW setfib rule did not perform
a routing table lookup when the destination address was not changed.

CR:	D805
2014-10-02 00:25:57 +00:00
glebius
2bb30b3418 Remove disabled code, that is very unlikely to be ever enabled again,
as well as the comment that explains why is it disabled.
2014-09-19 05:23:47 +00:00
asomers
c1b1b15f41 Fix source address selection on unbound sockets in the presence of multiple
fibs. Use the mbuf's or the socket's fib instead of RT_ALL_FIBS. Fixes PR
187553. Also fixes netperf's UDP_STREAM test on a nondefault fib.

sys/netinet/ip_output.c
	In ip_output, lookup the source address using the mbuf's fib instead
	of RT_ALL_FIBS.

sys/netinet/in_pcb.c
	in in_pcbladdr, lookup the source address using the socket's fib,
	because we don't seem to have the mbuf fib. They should be the same,
	though.

tests/sys/net/fibs_test.sh
	Clear the expected failure on udp_dontroute.

PR:		187553
CR:		https://reviews.freebsd.org/D772
MFC after:	3 weeks
Sponsored by:	Spectra Logic
2014-09-16 15:28:19 +00:00
asomers
081aa8a15c Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
	Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
	Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
	versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
	Fixup calls of modified functions.

share/man/man9/ifnet.9
	Document changed API.

CR:		https://reviews.freebsd.org/D458
MFC after:	Never
Sponsored by:	Spectra Logic
2014-09-11 20:21:03 +00:00
adrian
e623d51cd5 Add support for receiving and setting flowtype, flowid and RSS bucket
information as part of recvmsg().

This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.

Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 01:45:39 +00:00
adrian
aab402c146 Add a flag to ip_output() - IP_NODEFAULTFLOWID - which prevents it from
overriding an existing flowid/flowtype field in the outbound mbuf with
the inp_flowid/inp_flowtype details.

The upcoming RSS UDP support calculates a valid RSS value for outbound
mbufs and since it may change per send, it doesn't cache it in the inpcb.
So overriding it here would be wrong.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 00:19:02 +00:00
glebius
2e01608625 Clean up unused CSUM_FRAGMENT.
Sponsored by:	Nginx, Inc.
2014-09-03 08:30:18 +00:00
adrian
627c6869c3 Implement the first stage of multi-bind listen sockets and RSS socket
awareness.

* Introduce IP_BINDMULTI - indicating that it's okay to bind multiple
  sockets on the same bind details.

  Although the PCB code has been taught about this (see below) this patch
  doesn't introduce the rest of the PCB changes necessary to distribute
  lookups among multiple PCB entries in the global wildcard table.

* Introduce IP_RSS_LISTEN_BUCKET - placing an listen socket into the
  given RSS bucket (and thus a single PCBGROUP hash.)

* Modify the PCB add path to be aware of IP_BINDMULTI:
  + Only allow further PCB entries to be added if the owner credentials
    and IP_BINDMULTI has been specified.  Ie, only allow further
    IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI.

* Teach the PCBGROUP code about IP_RSS_LISTE_BUCKET marked PCB entries.
  Instead of using the wildcard logic and hashing, these sockets are
  simply placed into the PCBGROUP and _not_ in the wildcard hash.

* When doing a PCBGROUP lookup, also do a wildcard match as well.
  This allows for an RSS bucket PCB entry to appear in a PCBGROUP
  rather than having to exist in the wildcard list.

Tested:

* TCP IPv4 server testing with igb(4)
* TCP IPv4 server testing with ix(4)

TODO:

* The pcbgroup lookup code duplicated the wildcard and wildcard-PCB
  logic.  This could be refactored into a single function.

* This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't
  yet know about this); nor does it yet fully work for UDP.
2014-07-10 03:10:56 +00:00
glebius
ab0c569422 In several cases in ip_output() we obtain reference on ifa. Do not
leak it.

Together with:	asomers, np
Sponsored by:	Nginx, Inc.
2014-07-09 07:48:05 +00:00
adrian
dc6db53254 Remove old reference to IP_RSSCPUID.
Submitted by:	Eggert, Lars <lars@netapp.com>
2014-07-01 17:27:48 +00:00
adrian
9fbdf6a6f5 Add missing variable declarations when using RSS.
Reported by: bryanv@
2014-06-27 19:07:00 +00:00
adrian
d4fe515519 Retire IP_RSSCPUID ; the right thing to do is query the RSS bucket;
map the bucket to an RSS queue, then map the queue to a CPU ID.
This way the bucket->queue and queue->CPU mapping can change
over time.

Introduce IP_RSSBUCKETID - which instead looks up the RSS bucket.
User applications can then map the RSS bucket to a CPU.
2014-06-26 04:12:41 +00:00
asomers
7ca8bf0f2c Fix unintended KBI change from r264905. Add _fib versions of
ifa_ifwithnet() and ifa_ifwithdstaddr()  The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.

sys/net/if_var.h
sys/net/if.c
	Add legacy-compatible functions as described above.  Ensure legacy
	behavior when RT_ALL_FIBS is passed as fibnum.

sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
	Call with _fib() functions if we must use a specific fib, or the
	legacy functions otherwise.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
	Improve the udp_dontroute test.  The bug that this test exercises is
	that ifa_ifwithnet() will return the wrong address, if multiple
	interfaces have addresses on the same subnet but with different
	fibs.  The previous version of the test only considered one possible
	failure mode: that ifa_ifwithnet_fib() might fail to find any
	suitable address at all.  The new version also checks whether
	ifa_ifwithnet_fib() finds the correct address by checking where the
	ARP request goes.

Reported by:	bz, hrs
Reviewed by:	hrs
MFC after:	1 week
X-MFC-with:	264905
Sponsored by:	Spectra Logic
2014-05-29 21:03:49 +00:00
adrian
429607ed6b * When copying the flowid from inp -> outbound mbuf, also assign the
hashtype to to the outbound mbuf as well as the flowid.

* Add in socket options to fetch the hashid, the hashtype and RSS CPU
  ID for a given socket.
2014-05-18 22:37:31 +00:00
tuexen
b6b80cf495 Use KASSERTs as suggested by glebius@
MFC after: 3 days
X-MFC with: 265691
2014-05-08 20:47:54 +00:00
tuexen
c296f5d2af For some UDP packets (for example with 200 byte payload) and IP options,
the IP header and the UDP header are not in the same mbuf.
Add code to in_delayed_cksum() to deal with this case.

MFC after: 3 days
2014-05-08 17:27:46 +00:00
asomers
f8a34b6f49 Fix subnet and default routes on different FIBs on the same subnet.
These two bugs are closely related.  The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
	Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr.  Those
	functions will only return an address whose interface fib equals the
	argument.

sys/net/route.c
	Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
	arguments.

sys/netinet/in.c
	Update in_addprefix to consider the interface fib when adding
	prefixes.  This will prevent it from not adding a subnet route when
	one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
	Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
	In some cases it there wasn't a clear specific fib number to use.
	In others, I was unable to test those functions so I chose
	RT_DEFAULT_FIB to minimize divergence from current behavior.  I will
	fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
	Revert r263738.  The udp_dontroute test was right all along.
	However, bugs kern/187550 and kern/187553 cancelled each other out
	when it came to this test.  Because of kern/187553, ifa_ifwithnet
	searched the default fib instead of the requested one, but because
	of kern/187550, there was an applicable subnet route on the default
	fib.  The new test added in r263738 doesn't work right, however.  I
	can verify with dtrace that ifa_ifwithnet returned the wrong address
	before I applied this commit, but route(8) miraculously found the
	correct interface to use anyway.  I don't know how.

	Clear expected failure messages for kern/187550 and kern/187552.

PR:		kern/187550
PR:		kern/187552
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic
2014-04-24 23:56:56 +00:00
glebius
8a3e4bbebb - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
  mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
glebius
2c50988dca Remove ifa_ref()/ifa_free(), which are atomic(9), from ip_output().
The ifaddr is already referenced by the rtentry, and we are holding
reference on the rtentry throughout the function execution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-04 19:49:41 +00:00
glebius
f62415c467 o Remove at compile time the HASH_ALL code, that was never
tested and is unfinished. However, I've tested my version,
  it works okay. As before it is unfinished: timeout aren't
  driven by TCP session state. To enable the HASH_ALL mode,
  one needs in kernel config:

	options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
  the FLOWTABLE_HASH_ALL option, twice less memory would
  be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
  flowtable_lookup_ipv6(). Stop copying data to on stack
  sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
  on insertion into flowtable_insert().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-17 11:50:56 +00:00
glebius
9d7706f9f4 o Revamp API between flowtable and netinet, netinet6.
- ip_output() and ip_output6() simply call flowtable_lookup(),
    passing mbuf and address family. That's the only code under
    #ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
  - Remove hand made pcpu stats, and utilize counter(9).
  - Snapshot of statistics is available via 'netstat -rs'.
  - All sysctls are moved into net.flowtable namespace, since
    spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
  - Remove chain of multiple flowtables. We simply have one for
    IPv4 and one for IPv6.
  - Flowtables are allocated in flowtable.c, symbols are static.
  - With proper argument to SYSINIT() we no longer need flowtable_ready.
  - Hash salt doesn't need to be per-VNET.
  - Removed rudimentary debugging, which use quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-07 15:18:23 +00:00
glebius
abd19c8039 Cleanup comments and whitespace. No functional changes. 2014-01-16 12:58:03 +00:00
melifaro
479797d59e Fix ipfw fwd for IPv4 traffic broken by r249894.
Problem case:
Original lookup returns route with GW set, so gw points to
rte->rt_gateway.
After that we're changing dst and performing lookup another time.
Since fwd host is most probably directly reachable, resulting
rte does not contain rt_gateway, so gw is not set. Finally, we
end with packet transmitted to proper interface but wrong
link-layer address.

Found by:	lstewart
Discussed with:	ae,lstewart
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-01-16 11:50:00 +00:00
glebius
f3c61c8731 Fix regression from r249894. Now we pass "gw" as argument to if_output
method, thus for multicast case we need it to point at "dst".

PR:		185395
Submitted by:	ae
2014-01-02 10:18:39 +00:00
adrian
778876aacb Disable the now unpredicably bogus check for whether we have
eneough queue space before queuing a bunch of IP fragments.

As the comment in the committed change says, in the post-if_transmit(),
post-SMP, post-preemption world, there's just too much overlapping
concurrent code paths and different approaches to driver transmit
queue management to have this code even remotely be effective.

The only specific place it could be useful is if ALTQ is enabled
but again it doesn't at all promise that all the fragments will be
transmitted anyway.

The main reason for committing this change is to disable a parallel
place where the drops counter is incremented.  This is a side effect
of an upcoming change to ixgbe/cxgbe to handle the queue drops
counter slightly better.

Sponsored by:	Netflix, Inc.
2013-12-20 07:41:03 +00:00
attilio
7ee4e910ce - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
glebius
ff6e113f1b The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
glebius
790225cfbc - Utilize counter(9) to accumulate statistics on interface addresses. Add
four counters to struct ifaddr. This kills '+=' on a variables shared
  between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
  right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
  for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
  rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
  took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by:	melifaro [1], pluknet[2]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 11:37:57 +00:00
markj
29e4661920 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
andre
7cc6cc696c Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with:	trociny, glebius
2013-08-19 13:27:32 +00:00
andre
b1d7216313 Remove unused M_FRAG, M_FIRSTFRAG and M_LASTFRAG tagging from ip_fragment().
There wasn't any real driver (and hardware) support for it.  Modern hardware
does full fragmentation/segmentation offload instead.
2013-08-19 10:30:15 +00:00
trociny
9b554dcd02 In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option.  It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR:		179901
Submitted by:	Michael Gmelin freebsd grem.de (initial version)
Reviewed by:	rwatson
MFC after:	1 week
2013-07-04 18:38:00 +00:00
glebius
b4bc270e8f Add const qualifier to the dst parameter of the ifnet if_output method. 2013-04-26 12:50:32 +00:00
glebius
dd30d87e8d Introduce a pointer to const variable gw, which points either at the
same place as dst, or to the sockaddr in the routing table.

The const constraint of gw makes us safe from modifing routing table
accidentially. And "onstantness" of dst allows us to remove several
bandaids, when we switched it back at &ro->ro_dst, now it always
points there.

Reviewed by:	rrs
2013-04-25 12:42:09 +00:00
rrs
d90354b019 This fixes the issue with the "randomly changing" default
route. What it was is there are two places in ip_output.c
where we do a goto again. One place was fine, it
copies out the new address and then resets dst = ro->rt_dst;
But the other place does *not* do that, which means earlier
when we found the gateway, we have dst pointing there
aka dst = ro->rt_gateway is done.. then we do a
goto again.. bam now we clobber the default route.

The fix is just to move the again so we are always
doing dst = &ro->rt_dst; in the again loop.

PR:	 174749,157796
MFC after:	1 week
2013-04-24 18:30:32 +00:00
glebius
b37af62b9e Use m_get/m_gethdr instead of compat macros.
Sponsored by:	Nginx, Inc.
2013-03-15 12:55:30 +00:00
glebius
8e20fa5ae9 Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
andre
cd840ea089 Remove unused and unnecessary CSUM_IP_FRAGS checksumming capability.
Checksumming the IP header of fragments is no different from doing
normal IP headers.

Discussed with:	yongari
MFC after:	1 week
2012-11-27 19:31:49 +00:00
ae
4354018055 Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
glebius
f79061ff05 o Remove last argument to ip_fragment(), and obtain all needed information
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
  hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
  although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
  After this change CSUM_DELAY_IP vanishes from the stack.

Submitted by:	Sebastian Kuzminsky <seb lineratesystems.com>
2012-10-26 21:06:33 +00:00
ae
71112b5a8e Remove the IPFIREWALL_FORWARD kernel option and make possible to turn
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by:	Yandex LLC
Discussed with:	net@
MFC after:	2 weeks
2012-10-25 09:39:14 +00:00
glebius
5cc3ac5902 Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
glebius
b6c0be02b6 Fix a miss from r241344: in ip_mloopback() we need to go to
net byte order prior to calling in_delayed_cksum().

Reported by:	 Olivier Cochard-Labbe <olivier cochard.me>
2012-10-14 15:08:07 +00:00
glebius
9086143e8c After r241245 it appeared that in_delayed_cksum(), which still expects
host byte order, was sometimes called with net byte order. Since we are
moving towards net byte order throughout the stack, the function was
converted to expect net byte order, and its consumers fixed appropriately:
  - ip_output(), ipfilter(4) not changed, since already call
    in_delayed_cksum() with header in net byte order.
  - divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order
    there and back.
  - mrouting code and IPv6 ipsec now need to switch byte order there and
    back, but I hope, this is temporary solution.
  - In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
  - pf_route() catches up on r241245 changes to ip_output().
2012-10-08 08:03:58 +00:00
glebius
f3a0231bff A step in resolving mess with byte ordering for AF_INET. After this change:
- All packets in NETISR_IP queue are in net byte order.
  - ip_input() is entered in net byte order and converts packet
    to host byte order right _after_ processing pfil(9) hooks.
  - ip_output() is entered in host byte order and converts packet
    to net byte order right _before_ processing pfil(9) hooks.
  - ip_fragment() accepts and emits packet in net byte order.
  - ip_forward(), ip_mloopback() use host byte order (untouched actually).
  - ip_fastforward() no longer modifies packet at all (except ip_ttl).
  - Swapping of byte order there and back removed from the following modules:
    pf(4), ipfw(4), enc(4), if_bridge(4).
  - Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
  - __FreeBSD_version bumped.
  - pfil(9) manual page updated.

Reviewed by:	ray, luigi, eri, melifaro
Tested by:	glebius (LE), ray (BE)
2012-10-06 10:02:11 +00:00
glebius
3b4ff3bafb Plug a reference leak: before doing 'goto again' we need to unref
ia->ia_ifa if there is any.

Submitted by:	Andrey Zonov <andrey zonov.org>
2012-07-18 08:58:30 +00:00
glebius
418a04b467 When ip_output()/ip6_output() is supplied a struct route *ro argument,
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.

The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.

- Introduce new flag for struct route/route_in6, that marks route
  not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
  depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
  but empty route should use RO_RTFREE() to free results of
  lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
  ro->ro_rt == NULL.

Tested by:	tuexen (SCTP part)
2012-07-04 07:37:53 +00:00
tuexen
32041f44ed Add a IP_RECVTOS socket option to receive for received UDP/IPv4
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.

MFC after: 3 days
2012-06-12 14:02:38 +00:00
trociny
f9135967f2 Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.

This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.

Reported by:	dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by:	rwatson
Glanced by:	rwatson
Tested by:	dave jones <s.dave.jones@gmail.com>
2011-11-06 10:47:20 +00:00
bz
c3c7ca25e7 The mbuf_frag_size always was and is file local and not queried from base
user space tools via kvm.  Mark it static.

MFC after:	3 days
2011-04-14 09:47:09 +00:00
bz
969f8f9046 Try to catch a possible divide-by-zero as early as possible if "mtu" is 0
(also test for negative MTUs if checking it anyway).
An MTU of 0 is arguably a bug elsewhere, but this at least gives us some
more debugging hints.

Sponsored by:	ISPsystem (Early 2010)
MFC after:	1 week
2010-12-31 21:47:11 +00:00
attilio
d658ddc7c7 IP_BINDANY is not correctly handled in getsockopt() case.
Fix it by specifying the correct bits.

Sponsored by:	Sandvine Incorporated
Reviewed by:	bz, emaste, rstone
Obtained from:	Sandvine Incorporated
MFC after:	10 days
2010-09-24 14:38:54 +00:00
qingli
f6ab4a6810 This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after:	3 days
2010-05-25 20:42:35 +00:00
rrs
5db64758fc The proper fix for the delayed SCTP checksum is to
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
PR:		144529
MFC after:	2 weeks
2010-03-12 22:58:52 +00:00
kmacy
128542c758 - restructure flowtable to support ipv6
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
  (e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate

MFC after:	7 days
2010-03-12 05:03:26 +00:00
qingli
93013817b0 One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.

MFC after:	5 days
2010-03-09 01:11:45 +00:00
bz
7eddc3a63a Make the compiler happy after r201125:
- + remove two unnecessary initializations in ip_output;
+ + remove one unnecessary initializations in ip_output;
2009-12-28 21:14:18 +00:00
luigi
1a1b4d40fb introduce a local variable rte acting as a cache of ro->ro_rt
within ip_output, achieving (in random order of importance):
- a reduction of the number of 'r's in the source code;
- improved legibility;
- a reduction of 64 bytes in the .text
2009-12-28 14:48:32 +00:00
luigi
9c18067568 + remove an unused #define print_ip;
+ remove two unnecessary initializations in ip_output;
+ localize 'len';
+ introduce a temporary variable n to count the number of fragments,
  the compiler seems unable to identify a common subexpression
  (written 3 times, used twice);
+ document some assumptions on ip_len and ip_hl
2009-12-28 14:09:46 +00:00
trasz
ab44b532fc Remove ifdefed out part of code, which seems to have originated a decade ago
in OpenBSD.  As it is now, there is no way for this to be useful, since IPsec
is free to forward packets via whatever interface it wants, so checking
capabilities of the interface passed from ip_output (fetched from the routing
table) serves no purpose.

Discussed with:	sam@
2009-11-09 19:53:34 +00:00
julian
79c1f884ef Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.

Sitting aroung in tree waiting to commit for: 2 months
MFC after:	2 months
2009-10-11 05:59:43 +00:00
qingli
eb767da59e Do not try to free the rt_lle entry of the cached route in
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.

Reviewed by:	kmacy, rwatson
MFC after:	immediately
2009-08-28 05:37:31 +00:00
kmacy
bbd03fa206 - change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
 - check that a cached flow corresponds to the same fib index as the
   packet for which we are doing the lookup
 - at interface detach time flush any flows referencing stale rtentrys
   associated with the interface that is going away (fixes reported
   panics)
 - reduce the time between cleans in case the cleaner is running at
   the time the eventhandler is called and the wakeup is missed less
   time will elapse before the eventhandler returns
 - separate per-vnet initialization from global initialization
   (pointed out by jeli@)

Reviewed by:	sam@
Approved by:	re@
2009-08-18 20:28:58 +00:00
qingli
5d81358a8d In function ip_output(), the cached route is flushed when there is a
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address being used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.

Reviewed by:	kmacy, rwatson
Approved by:	re
2009-08-14 23:44:59 +00:00
rwatson
fb9ffed650 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
rwatson
57ca4583e7 Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
rwatson
c9ef486fe1 Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references.  The following routines now return references:

  ifaddr_byindex
  ifa_ifwithaddr
  ifa_ifwithbroadaddr
  ifa_ifwithdstaddr
  ifa_ifwithnet
  ifaof_ifpforaddr
  ifa_ifwithroute
  ifa_ifwithroute_fib
  rt_getifa
  rt_getifa_fib
  IFP_TO_IA
  ip_rtaddr
  in6_ifawithifp
  in6ifa_ifpforlinklocal
  in6ifa_ifpwithaddr
  in6_ifadd
  carp_iamatch6
  ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

  IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc).  In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit.  Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by:	bz
Obtained from:	Apple, Inc. (portions)
MFC after:	6 weeks (portions)
2009-06-23 20:19:09 +00:00
zec
f9dfbad808 V_irtualize flowtable state.
This change should make options VIMAGE kernel builds usable again,
to some extent at least.

Note that the size of struct vnet_inet has changed, though in
accordance with one-bump-per-day policy we didn't update the
__FreeBSD_version number, given that it has already been touched
by r194640 a few hours ago.
Reviewed by:	bz
Approved by:	julian (mentor)
2009-06-22 21:19:24 +00:00
bz
64d1a84073 Move the kernel option FLOWTABLE chacking from the header file to the
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.

Reviewed by:	rwatson
2009-06-12 20:46:36 +00:00
pjd
80ee56b38b Only four out of nine arguments for ip_ipsec_output() are actually used.
Kill unused arguments except for 'ifp' as it might be used in the future
for detecting IPsec-capable interfaces.
2009-06-05 23:53:17 +00:00
rwatson
f4934662e5 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
pjd
5243d2d206 - Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
  as there is no more space for more SOL_SOCKET options, but this option also
  fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
  functionality (currently only unjail root can use it).

Discussed with:	julian, adrian, jhb, rwatson, kmacy
2009-06-01 10:30:00 +00:00
rwatson
5389d28839 Consolidate and clean up the first section of ip_output.c in light of the
last year or two's work on routing:

- Combine iproute initialization and flowtable lookup blocks, eliminating
  unnecessary tests for known-zero'd iproute fields.

- Add a comment indicating (a) why the route entry returned by the
  flowtable is considered stable and (b) that the flowtable lookup must
  occur after the setup of the mbuf flow ID.

- Assert the inpcb lock before any use of inpcb fields.

Reviewed by:	kmacy
2009-05-21 09:45:47 +00:00
trasz
76f48d741e Don't require packet to match a route (any route; this information wasn't
used anyway, so a typical workaround was to add a dummy route) if it's going
to be sent through IPSec tunnel.

Reviewed by:	bz
2009-04-28 11:10:33 +00:00
kmacy
96ce69a748 - Allocate a small flowtable in ip_input.c (changeable by tuneable)
- Use for accelerating ip_output
2009-04-19 04:44:05 +00:00
kmacy
24b38efdce Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by:	rwatson
2009-04-16 20:30:28 +00:00
rwatson
692f8aa2fa Update stats in struct ipstat using four new macros, IPSTAT_ADD(),
IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly
manipulating the fields across the kernel.  This will make it easier
to change the implementation of these statistics, such as using
per-CPU versions of the data structures.

MFC after:	3 days
2009-04-11 23:35:20 +00:00
kmacy
fbd3646842 Import "flowid" support for serializing flows across transmit queues
Reviewed by:	rwatson and jeli
2009-04-10 06:16:14 +00:00
bms
6370f947cc In ip_output(), do not acquire the IN_MULTI_LOCK(),
and do not attempt to perform a group lookup.
This is a socket layer lock, and the bottom half of IP
really has no business taking it.

Use the value of the in_mcast_loop sysctl to determine
if we should loop back by default, in the absence of
any multicast socket options. Because the check on
group membership is now deferred to the input path,
an m_copym() is now required.

This should increase multicast send performance where the
source has not requested loopback, although this has not been
benchmarked or measured.

It is also a necessary change for IN_MULTI_LOCK to become
non-recursive, which is required in order to implement IGMPv3
in a thread-safe way.
2009-03-04 03:45:34 +00:00
bz
df2be82cec For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
2009-02-27 14:12:05 +00:00
bz
8d30abae87 Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by:	sam
Discussed with:	rwatson (renaming to *_internal)
MFC after:	26 days
X-MFC:		keep wrapper functions for public symbols?
2009-02-08 09:27:07 +00:00
rrs
2688c69180 Adds support for SCTP checksum offload. This means
we, like TCP and UDP, move the checksum calculation
into the IP routines when there is no hardware support
we call into the normal SCTP checksum routine.

The next round of SCTP updates will use
this functionality. Of course the IGB driver needs
a few updates to support the new intel controller set
that actually does SCTP csum offload too.

Reviewed by:	gnn, rwatson, kmacy
2009-02-03 11:00:43 +00:00
adrian
2b9fd92371 Fix indentation; add FALLTHROUGH.
Thanks Max!
2009-01-09 17:21:22 +00:00
adrian
e2eee65f21 Implement a new IP option (not compiled/enabled by default) to allow
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.

This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.

This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealth with some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent
commmits.

Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.

Obtained from:	Julian Elischer and others
MFC after:	2 weeks
2009-01-09 16:02:19 +00:00
rwatson
b56f7e98e0 Allow the IP_MINTTL socket option to be set to 0 so that it can be
disabled entirely, which is its default state before set to a
non-zero value.

PR:		128790
Submitted by:	Nick Hilliard <nick at foobar dot org>
MFC after:	3 weeks
2009-01-03 11:35:31 +00:00
qingli
ec826ad5c7 This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
   possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
  the last piece of the puzzle, Kip has also been conducting
  active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
  provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
  me maintaining that branch before the svn conversion
2008-12-15 06:10:57 +00:00
zec
7b573d1496 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
bz
604d89458a Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
zec
95a15f5c84 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
julian
cf07f793f2 Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from:	Ironport
MFC after:	3 days
2008-11-19 19:19:30 +00:00
zec
815d52c5df Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
zec
8797d4caec Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
gnn
995cb3f198 Fix a bug whereby multicast packets that are looped back locally
wind up with the incorrect checksum on the wire when transmitted via
devices that do checksum offloading.

PR:		kern/119635
Reviewed by:	rwatson
MFC after:	5 days
2008-08-29 20:42:58 +00:00
rwatson
80f65f3279 Remove comments and #ifdef notyet'd code relating to directly dispatching
the IP multicast input code from the output path; we don't allow
reentrance of the input path from the IP output path, it must use the
netisr due to potential lock recursion.

MFC after:	3 days
2008-08-21 17:24:49 +00:00
bz
1021d43b56 Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
julian
1dfc5c98a4 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00