Commit Graph

3459 Commits

Author SHA1 Message Date
Bjoern A. Zeeb
c2c2a7c11e Convert the two dimensional array to be malloced and introduce
an accessor function to get the correct rnh pointer back.

Update netstat to get the correct pointer using kvm_read()
as well.

This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.

Reviewed by:	julian, rwatson, zec
X-MFC:		not possible
2009-06-01 15:49:42 +00:00
Bruce M Simpson
b01c90a31a Merge fixes from p4:
* Tighten v1 query input processing.
 * Borrow changes from MLDv2 for how general queries are processed.
   * Do address field validation upfront before accepting input.
   * Do NOT switch protocol version if old querier present timer active.
 * Always clear IGMPv3 state in igmp_v3_cancel_link_timers().
 * Update comments.

Tested by:	deeptech71 at gmail dot com
2009-06-01 15:30:18 +00:00
Robert Watson
d4b5cae49b Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processings its own
  workstream, or set of per-protocol queues.  Threads may be bound
  to specific CPUs, or allowed to migrate, based on a global policy.

  In the future it would be desirable to support topology-centric
  policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
  currently be one of:

  NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
    an implicit or explicit source (such as an interface or socket).

  NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
    as well as allowing protocols to provide a flow generation function
    for mbufs without flow identifers (m2flow).  Falls back on
    NETISR_POLICY_SOURCE if now flow ID is available.

  NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
    each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
  being used, as well as a mapping function from workstream to CPU ID,
  which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
  get and clear drop counters, which query data or apply changes across
  all workstreams.

- Add a more extensible netisr registration interface, in which
  protocols declare 'struct netisr_handler' structures for each
  registered NETISR_ type.  These include name, handler function,
  optional mbuf to flow ID function, optional mbuf to CPU ID function,
  queue limit, and ordering policy.  Padding is present to allow these
  to be expanded in the future.  If no queue limit is declared, then
  a default is used.

- Queue limits are now per-workstream, and raised from the previous
  IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
  with the exception of netnatm, default queue limits.  Most protocols
  register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
  NETISR_POLICY_FLOW, and will therefore take advantage of driver-
  generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
  the netisr, rather than having polling pretend to be two protocols.
  Provide two explicit hooks in the netisr worker for start and end
  events for runs: netisr_poll() and netisr_pollmore(), as well as a
  function, netisr_sched_poll(), to allow the polling code to schedule
  netisr execution.  DEVICE_POLLING still embeds single-netisr
  assumptions in its implementation, so for now if it is compiled into
  the kernel, a single and un-bound netisr thread is enforced
  regardless of tunable configuration.

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking.  It will be enabled once further rmlock
optimization has taken place.  However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by:	bz
2009-06-01 10:41:38 +00:00
Pawel Jakub Dawidek
f44270e764 - Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
  as there is no more space for more SOL_SOCKET options, but this option also
  fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
  functionality (currently only unjail root can use it).

Discussed with:	julian, adrian, jhb, rwatson, kmacy
2009-06-01 10:30:00 +00:00
Randall Stewart
a16ccdcead Adds missing sysctl to manage the vtag_time_wait time. This will
even allow disabling time-wait all together if you set the value
to 0 (not advisable actually). The default remains the same
i.e. 60 seconds.
2009-05-30 11:14:41 +00:00
Randall Stewart
bf1be57101 Fix a small memory leak from the nr-sack code - the mapping array
was not being freed at term of association. Also get rid of
the MICHAELS_EXP code.
2009-05-30 10:56:27 +00:00
Randall Stewart
667d2d3cbb Make sctp_uio user to kernel structure match the
socket-api draft. Two fields were uint32_t when they
should have been uint16_t.

Reported by Jonathan Leighton at U-del.
2009-05-30 10:50:40 +00:00
Zachary Loafman
81ad7eb017 Correct handling of SYN packets that are to the left of the current window of an ESTABLISHED connection.
Reviewed by:        net@, gnn
Approved by:        dfr (mentor)
2009-05-27 17:02:10 +00:00
Jamie Gritton
0304c73163 Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails.  Child jails may be restricted more than their parents,
but never less.  Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system.  Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings.  The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by:	bz (mentor)
2009-05-27 14:11:23 +00:00
Edward Tomasz Napierala
efbad25934 Don't discard packets with 'Destination Unreachable' at the beginning
of ip_forward(), if the IPSEC is compiled in.  It is possible that there
is an SPD that this packets will go through, even if there is no matching
route.  If not, ICMP will be sent anyway, after ip_output().

This is somewhat similar in purpose to r191621, except that one was
for the packets sent from the host, while this one is for packets
being forwarded by the host.

Reviewed by:	bz@
Sponsored by:	Wheel Sp. z o.o. (http://www.wheel.pl)
2009-05-27 12:44:36 +00:00
John Baldwin
256d7a8a16 Correct the sense of a test so that this filter always waits for the full
request to arrive.  Previously it would end up returning as soon as the
request length stored in the first two bytes had arrived.

Reviewed by:	dwmalone
MFC after:	1 week
2009-05-26 20:00:30 +00:00
Robert Watson
cce6e15556 Remove comment about moving tcp_reass() to its own file named tcp_reass.c,
that happened a while ago.

MFC after:	3 days
2009-05-25 14:51:47 +00:00
Bjoern A. Zeeb
6d453b1001 For UDP with introducing the UDP control block, the uma zone had to
be named "udp_inpcb" to avoid a naming conflict with tcp[1].
For consistency rename the uma zone for TCP from "inpcb" to "tcp_inpcb".

Found by:	rwatson [1]
Discussed with:	rwatson
2009-05-23 17:02:30 +00:00
Bjoern A. Zeeb
6a9148fe92 Implement UDP control block support.
So far the udp_tun_func_t had been (ab)using inp_ppcb for udp in kernel
tunneling callbacks.  Move that into the udpcb and add a field for flags
there to be used by upcoming changes instead of sticking udp only flags
into in_pcb flags2.

Bump __FreeBSD_version for ports to detect it and because of vnet* struct
size changes.

Submitted by:	jhb (7.x version)
Reviewed by:	rwatson
2009-05-23 16:51:13 +00:00
Bjoern A. Zeeb
db2e47925e Add sysctls to toggle the behaviour of the (former) IPSEC_FILTERTUNNEL
kernel option.
This also permits tuning of the option per virtual network stack, as
well as separately per inet, inet6.

The kernel option is left for a transition period, marked deprecated,
and will be removed soon.

Initially requested by:	phk (1 year 1 day ago)
MFC after:		4 weeks
2009-05-23 16:42:38 +00:00
Bjoern A. Zeeb
f81a8a320c If including vnet.h one has to include opt_route.h as well. This is
because struct vnet_net holds the rt_tables[][] for MRT and array size
is compile time dependent.  If you had ROUTETABLES set to >1 after
r192011 V_loif was pointing into nonsense leading to strange results
or even panics for some people.

Reviewed by:	mz
2009-05-22 23:03:15 +00:00
Robert Watson
62e1ba833c Consolidate and clean up the first section of ip_output.c in light of the
last year or two's work on routing:

- Combine iproute initialization and flowtable lookup blocks, eliminating
  unnecessary tests for known-zero'd iproute fields.

- Add a comment indicating (a) why the route entry returned by the
  flowtable is considered stable and (b) that the flowtable lookup must
  occur after the setup of the mbuf flow ID.

- Assert the inpcb lock before any use of inpcb fields.

Reviewed by:	kmacy
2009-05-21 09:45:47 +00:00
Qing Li
c9d763bf41 When an interface address is removed and the last prefix
route is also being deleted, the link-layer address table
(arp or nd6) will flush those L2 llinfo entries that match
the removed prefix.

Reviewed by:	kmacy
2009-05-20 21:07:15 +00:00
Bjoern A. Zeeb
75ac4f3d32 Revert the logical change of r192341.
net.inet.ip.fw.one_pass is a classic ip_input.c variable and is used in
the pfil and bridge code as well. As ipfw is loadable we need to always
provide it.  That is the reason why it lives in struct vnet_inet and
not in struct vnet_ipfw.
2009-05-18 22:34:44 +00:00
John Baldwin
10f5c8be92 - Fix typo in description of 'net.inet.ip.fw.autoinc_step'.
- Use 'vnet_ipfw' instead of 'vnet_inet' for 'net.inet.ip.fw.one_pass'.
2009-05-18 21:46:46 +00:00
Bjoern A. Zeeb
1600c117d8 Unbreak options VIMAGE builds, in a followup to r192011 which did not
introduce INIT_VNET_NET() initializers necessary for accessing V_loif.

Submitted by:	zec
Reviewed by:	julian
2009-05-17 20:53:10 +00:00
Robert Watson
6d888973c8 Staticize two functions not used outside of in_pcb.c: in_pcbremlists() and
db_print_inpcb().

MFC after:	1 month
2009-05-14 20:59:36 +00:00
Qing Li
92fac99477 Ignore the INADDR_ANY address inserted/deleted by DHCP when installing a loopback route
to the interface address.
2009-05-14 05:27:09 +00:00
Qing Li
ebc90701ac This patch adds a host route to an interface address (that is assigned
to a non loopback/ppp link types) through the loopback interface. Prior
to the new L2/L3 rewrite, this host route is implicitly added by the L2
code during RTM_RESOLVE of that interface address. This host route is
deleted when that interface is removed.

Reviewed by:	kmacy
2009-05-12 07:41:20 +00:00
Warner Losh
573a04c930 Remove bogus comment. 2009-05-09 18:50:01 +00:00
John Baldwin
5f17ebf94d Convert IPFW_DEFAULT_TO_ACCEPT into a loader tunable
'net.inet.ip.fw.default_to_accept'.  The current value can also be queried
via a read-only sysctl of the same name.

Requested by:	plosher
MFC after:	1 week
2009-05-09 05:07:36 +00:00
Marko Zec
2114e063f0 A NOP change: style / whitespace cleanup of the noise that slipped
into r191816.

Spotted by:	bz
Approved by:	julian (mentor) (an earlier version of the diff)
2009-05-08 14:34:25 +00:00
Marko Zec
ddd50c3439 Remove a bogus check that unintentionally slipped in r191816.
This change has no functional impact on nooptions VIMAGE builds.
Submitted by:	bz
2009-05-08 14:28:06 +00:00
Randall Stewart
096ed42dad repository sync to multi-OS repo ... spaceing change 2009-05-07 16:43:49 +00:00
Randall Stewart
892f1c7141 ABI expansions to hopefully future-proof our MIB/netstat code for 8.0 2009-05-07 16:42:45 +00:00
Marko Zec
94e9f5a1c2 Remove unnecessary CURVNET_SET() calls where curvnet context is
(i.e. seems to be) already set.

This should reduce console noise due to curvnet recursion reports.

This change has no impact on nooptions VIMAGE builds.
Approved by:	julian (mentor)
2009-05-06 13:30:46 +00:00
Marko Zec
743da3bcdb Unbreak options VIMAGE kernel builds.
Approved by:	julian (mentor)
2009-05-06 08:49:39 +00:00
Marko Zec
21ca7b57bd Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one.  The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE().  Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc.  Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by:	julian (mentor)
2009-05-05 10:56:12 +00:00
Marko Zec
5f416f8e84 Make indentation more uniform accross vnet container structs.
This is a purely cosmetic / NOP change.

Reviewed by:	bz
Approved by:	julian (mentor)
Verified by:	svn diff -x -w producing no output
2009-05-02 08:16:26 +00:00
Marko Zec
d7fcc52895 Unbreak options VIMAGE + nooptions INVARIANTS kernel builds.
Submitted by:	julian
Approved by:	julian (mentor)
2009-05-02 05:02:28 +00:00
Marko Zec
f6dfe47a14 Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)
2009-04-30 13:36:26 +00:00
Bruce M Simpson
33cde13046 Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev.  Summary of changes:

 * Connect netinet6/in6_mcast.c to build.
   The legacy KAME KPIs are mostly preserved.
 * Eliminate now dead code from ip6_output.c.
   Don't do mbuf bingo, we are not going to do RFC 2292 style
   CMSG tricks for multicast options as they are not required
   by any current IPv6 normative reference.
 * Refactor transports (UDP, raw_ip6) to do own mcast filtering.
   SCTP, TCP unaffected by this change.
 * Add ip6_msource, in6_msource structs to in6_var.h.
 * Hookup mld_ifinfo state to in6_ifextra, allocate from
   domifattach path.
 * Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
   Kernel consumers which need this should use in6m_lookup().
 * Refactor IPv6 socket group memberships to use a vector (like IPv4).
 * Update ifmcstat(8) for IPv6 SSM.
 * Add witness lock order for IN6_MULTI_LOCK.
 * Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
 * Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
 * Update carp(4) for new IPv6 SSM KPIs.
 * Virtualize ip6_mrouter socket.
   Changes mostly localized to IPv6 MROUTING.
 * Don't do a local group lookup in MROUTING.
 * Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
 * Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
 * Bump __FreeBSD_version to 800084.
 * Update UPDATING.

NOTE WELL:
 * This code hasn't been tested against real MLDv2 queriers
   (yet), although the on-wire protocol has been verified in Wireshark.
 * There are a few unresolved issues in the socket layer APIs to
   do with scope ID propagation.
 * There is a LOR present in ip6_output()'s use of
   in6_setscope() which needs to be resolved. See comments in mld6.c.
   This is believed to be benign and can't be avoided for the moment
   without re-introducing an indirect netisr.

This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
2009-04-29 19:19:13 +00:00
Bruce M Simpson
9efc1a1bbf Add MLDv2 prototypes and defines. 2009-04-29 10:20:17 +00:00
Bruce M Simpson
5cf93e5d2c Use KTR_INET for MROUTING CTRs. 2009-04-29 10:17:08 +00:00
Bruce M Simpson
c566b47669 Cut over to KTR_INET for CTR.
For clarity, put pointer incremement/size decrement on own line
when copying out in-mode source filters to userland.
2009-04-29 10:14:16 +00:00
Bruce M Simpson
1096332a4a Do not assume that ip6_moptions is always set, it is
a lazy-allocated structure.
2009-04-29 10:13:22 +00:00
Bruce M Simpson
31a3e65dc2 Fix a problem whereby enqueued IGMPv3 filter list changes would be
incorrectly output, if the RB-tree enumeration happened to reuse the
same chain for a mode switch: that is, both ALLOW and BLOCK records
were appended for the same group, in the same mbuf packet chain.

This was introduced during an mbuf chain layout bug fix involving
m_getptr(), which obviously cannot count from offset 0 on the
second pass through the RB-tree when serializing the IGMPv3
group records into the pending mbuf chain.

Cut over to KTR_INET for IGMPv3 CTR usage.
2009-04-29 10:12:01 +00:00
Edward Tomasz Napierala
1a4998162e Don't require packet to match a route (any route; this information wasn't
used anyway, so a typical workaround was to add a dummy route) if it's going
to be sent through IPSec tunnel.

Reviewed by:	bz
2009-04-28 11:10:33 +00:00
Oleg Bulyzhin
a3a981b7f9 Optimize packet flow: if net.inet.ip.fw.one_pass != 0 and packet was
processed by ipfw once - avoid second ipfw_chk() call.
This saves us from unnecessary IPFW_RLOCK(), m_tag_find() calls and
ip/tcp/udp header parsing.

MFC after:	2 month
2009-04-27 17:37:36 +00:00
Marko Zec
093f25f8c8 In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by:	bz (an older version of the patch)
Approved by:	julian (mentor)
2009-04-26 22:06:42 +00:00
Robert Watson
db091502fb Acquire IF_ADDR_LOCK() around most iterations over ifp->if_addrhead
(colloquially known as if_addrlist).  Currently not acquired around
interface address loops that call out to the routing code due to
potential lock order issues.

MFC after:	3 weeks
2009-04-26 19:05:40 +00:00
Robert Watson
588885f2f5 Expand coverage of IF_ADDR_LOCK() in in_control() from point of initial
lookup of 'ia' from if_addrhead through most use.  Note that we
currently have to drop it prematurely in some cases due to calls out to
the routing and interface code while using 'ia', but this closes many
races.  Annotate several potential races that persist after this change.
Move to using M_NOWAIT for allocating new interface addresses due to
lock(s) being held.

MFC after:	3 weeks
2009-04-25 23:02:57 +00:00
Robert Watson
07cde5e92c In in_purgemaddrs(), remove the inm being freed from the address list
before freeing it, rather than vice version, to avoid potential use
after free.

Reviewed by:	bms
2009-04-24 22:11:53 +00:00
Robert Watson
cf7b18f15e Relocate permissions checking code in in_control() to before the body
of the implementation of ioctls.  This makes the mapping of ioctls to
specific privileges more explicit, and also simplifies the
implementation by reducing the use of FALLTHROUGH handling in switch.

While this is not intended to be a functional change, it does mean
that certain privilege checks are now performed earlier, so EPERM
might be returned in preference to EADDRNOTAVAIL for management
ioctls that could have failed for both reasons.

MFC after:	3 weeks
2009-04-24 09:54:46 +00:00
Robert Watson
bbb3fb6194 Reorganize in_control() so that invariants are more obvious, and so
that it is easier to lock:

- Handle the unsupported ioctl case at the beginning of in_control(),
  handing off to ifp->if_ioctl, rather than looking up interfaces and
  addresses unnecessarily in this case.

- Make it an invariant that ifp is always non-NULL when running
  in_control()-implemented ioctls, simplifying the code structure.

MFC after:	3 weeks
2009-04-23 21:41:37 +00:00