Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.
Suggested by: andre
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
hardware. Some driver may not announce CSUM_IP in theur if_hwassist,
although try to do checksums if CSUM_IP set on mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
After this change CSUM_DELAY_IP vanishes from the stack.
Submitted by: Sebastian Kuzminsky <seb lineratesystems.com>
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.
Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.
After this change a packet processed by the stack isn't
modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.
[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
host byte order, was sometimes called with net byte order. Since we are
moving towards net byte order throughout the stack, the function was
converted to expect net byte order, and its consumers fixed appropriately:
- ip_output(), ipfilter(4) not changed, since already call
in_delayed_cksum() with header in net byte order.
- divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order
there and back.
- mrouting code and IPv6 ipsec now need to switch byte order there and
back, but I hope, this is temporary solution.
- In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
- pf_route() catches up on r241245 changes to ip_output().
- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet
to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet
to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules:
pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.
Reviewed by: ray, luigi, eri, melifaro
Tested by: glebius (LE), ray (BE)
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.
The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.
- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
Tested by: tuexen (SCTP part)
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.
MFC after: 3 days
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.
This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.
Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>
(also test for negative MTUs if checking it anyway).
An MTU of 0 is arguably a bug elsewhere, but this at least gives us some
more debugging hints.
Sponsored by: ISPsystem (Early 2010)
MFC after: 1 week
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
PR: 144529
MFC after: 2 weeks
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate
MFC after: 7 days
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.
The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.
Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.
MFC after: 5 days
within ip_output, achieving (in random order of importance):
- a reduction of the number of 'r's in the source code;
- improved legibility;
- a reduction of 64 bytes in the .text
+ remove two unnecessary initializations in ip_output;
+ localize 'len';
+ introduce a temporary variable n to count the number of fragments,
the compiler seems unable to identify a common subexpression
(written 3 times, used twice);
+ document some assumptions on ip_len and ip_hl
in OpenBSD. As it is now, there is no way for this to be useful, since IPsec
is free to forward packets via whatever interface it wants, so checking
capabilities of the interface passed from ip_output (fetched from the routing
table) serves no purpose.
Discussed with: sam@
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.
Sitting aroung in tree waiting to commit for: 2 months
MFC after: 2 months
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.
Reviewed by: kmacy, rwatson
MFC after: immediately
the mbuf for obtaining the fib index
- check that a cached flow corresponds to the same fib index as the
packet for which we are doing the lookup
- at interface detach time flush any flows referencing stale rtentrys
associated with the interface that is going away (fixes reported
panics)
- reduce the time between cleans in case the cleaner is running at
the time the eventhandler is called and the wakeup is missed less
time will elapse before the eventhandler returns
- separate per-vnet initialization from global initialization
(pointed out by jeli@)
Reviewed by: sam@
Approved by: re@
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address being used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.
Reviewed by: kmacy, rwatson
Approved by: re
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:
ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr
Remove unused macro which didn't have required referencing:
IFP_TO_IA6
This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.
Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.
Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)
This change should make options VIMAGE kernel builds usable again,
to some extent at least.
Note that the size of struct vnet_inet has changed, though in
accordance with one-bump-per-day policy we didn't update the
__FreeBSD_version number, given that it has already been touched
by r194640 a few hours ago.
Reviewed by: bz
Approved by: julian (mentor)
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.
Reviewed by: rwatson
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
functionality (currently only unjail root can use it).
Discussed with: julian, adrian, jhb, rwatson, kmacy
last year or two's work on routing:
- Combine iproute initialization and flowtable lookup blocks, eliminating
unnecessary tests for known-zero'd iproute fields.
- Add a comment indicating (a) why the route entry returned by the
flowtable is considered stable and (b) that the flowtable lookup must
occur after the setup of the mbuf flow ID.
- Assert the inpcb lock before any use of inpcb fields.
Reviewed by: kmacy
IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly
manipulating the fields across the kernel. This will make it easier
to change the implementation of these statistics, such as using
per-CPU versions of the data structures.
MFC after: 3 days
and do not attempt to perform a group lookup.
This is a socket layer lock, and the bottom half of IP
really has no business taking it.
Use the value of the in_mcast_loop sysctl to determine
if we should loop back by default, in the absence of
any multicast socket options. Because the check on
group membership is now deferred to the input path,
an m_copym() is now required.
This should increase multicast send performance where the
source has not requested loopback, although this has not been
benchmarked or measured.
It is also a necessary change for IN_MULTI_LOCK to become
non-recursive, which is required in order to implement IGMPv3
in a thread-safe way.
net/route.h.
Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.
We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.
This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.
The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.
Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.
Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.
Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.
Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().
Reviewed by: sam
Discussed with: rwatson (renaming to *_internal)
MFC after: 26 days
X-MFC: keep wrapper functions for public symbols?
we, like TCP and UDP, move the checksum calculation
into the IP routines when there is no hardware support
we call into the normal SCTP checksum routine.
The next round of SCTP updates will use
this functionality. Of course the IGB driver needs
a few updates to support the new intel controller set
that actually does SCTP csum offload too.
Reviewed by: gnn, rwatson, kmacy
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.
This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.
This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealth with some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent
commmits.
Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.
Obtained from: Julian Elischer and others
MFC after: 2 weeks
disabled entirely, which is its default state before set to a
non-zero value.
PR: 128790
Submitted by: Nick Hilliard <nick at foobar dot org>
MFC after: 3 weeks
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,
The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.
Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:
- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion