Commit Graph

3630 Commits

Author SHA1 Message Date
Michael Tuexen
4af6c75c39 Use appropriate locking when using interface list.
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:55:12 +00:00
Michael Tuexen
30c3a8430c Fix the disabling of sctp_drain().
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:18:42 +00:00
Michael Tuexen
8518270e20 Get SCTP working in combination with VIMAGE.
Contains code from bz.
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:02:16 +00:00
Bruce M Simpson
99bf30cf01 Return ENOBUFS consistently if user attempts to exceed
in_mcast_maxsocksrc resource limit.

Submitted by:	syrinx
MFC after:	3 days
2009-09-18 15:12:31 +00:00
Randall Stewart
482444b4a5 Support for VNET in SCTP (hopefully) 2009-09-17 15:11:12 +00:00
Michael Tuexen
d830c305ea Fix a bug reported by Daniel Mentz:
When authenticating DATA chunks some DATA chunks
might get stuck when the MTU gets decreased via
an ICMP message.

Approved by: rrs (mentor)
MFC after: immediately
2009-09-16 14:23:31 +00:00
Mike Silbersack
b8614722ff Add the ability to see TCP timers via netstat -x. This can be a useful
feature when you have a seemingly stuck socket and want to figure
out why it has not been closed yet.

No plans to MFC this, as it changes the netstat sysctl ABI.

Reviewed by:	andre, rwatson, Eric Van Gyzen
2009-09-16 05:33:15 +00:00
Andre Oppermann
11c99a6d7b -Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by:	brooks

Once compiled in make it easily switchable for testers by using a tuneable
 net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by:	rwatson

MFC after:	2 days
2009-09-15 22:23:45 +00:00
Qing Li
9bb7d0f47a Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.

Reviewed by:	bz
MFC after:	immediately
2009-09-15 19:18:34 +00:00
Qing Li
cd29a7797d This patch enables the node to respond to ARP requests for
configured proxy ARP entries.

Reviewed by:	bz
MFC after:	immediately
2009-09-15 18:39:27 +00:00
Qing Li
96ed1732bb The bootp code installs an interface address and the nfs client
module tries to install the same address again. This extra code
is removed, which was discovered by the removal of a call to
in_ifscrub() in r196714. This call to in_ifscrub is put back here
because the SIOCAIFADDR command can be used to change the prefix
length of an existing alias.

Reviewed by:    kmacy
2009-09-15 01:01:03 +00:00
Qing Li
f0bb05fca5 Previously local end of point-to-point interface is not reachable
within the system that owns the interface. Packets destined to
the local end point leak to the wire towards the default gateway
if one exists. This behavior is changed as part of the L2/L3
rewrite efforts. The local end point is now reachable within the
system. The inpcb code needs to consider this fact during the
address selection process.

Reviewed by:	bz
MFC after:	immediately
2009-09-14 22:19:47 +00:00
Randall Stewart
f3d06a3c68 Fixes two bugs:
1) A lock issue, if we ever had to try again
   we would double lock the INP lock.
2) We were allowing (at wrap) associd 0... which really
   we cannot allow since 0 normally means in most socket
   API calls that we are wishing to effect something on
   the INP not TCB.

MFC after:	1 week
2009-09-13 17:45:31 +00:00
Bruce M Simpson
6cbbe26f98 In expire_mfc(), add an assert on the multicast forwarding cache mutex.
PR:		138666
2009-09-13 01:00:24 +00:00
Bruce M Simpson
fa2eebfce6 Comment some flawed assumptions in inp_join_group() about
mixing SSM full-state and delta-based APIs.

ENOTIME to fix right now.  No functional changes.

MFC after:	5 days
2009-09-12 20:37:44 +00:00
Bruce M Simpson
0eebc0d7b4 Don't allow joins w/o source on an existing group.
This is almost always pilot error.

We don't need to check for group filter UNDEFINED state at t1,
because we only ever allocate filters with their groups, so we
unconditionally reject such calls with EINVAL.
Trying to change the active filter mode w/o going through IP_MSFILTER
is also disallowed.

Deals with the case described in PR 137164 upfront, cumulative
with the fix in svn rev 197132 which only calls imo_match_source()
if the source address family was not unspecified.

PR:		137164
MFC after:	5 days
2009-09-12 20:18:23 +00:00
Bruce M Simpson
1fc39d5424 Tighten input checking in inp_join_group():
* Don't try to use the source address, when its family is unspecified.
 * If we get a join without a source, on an existing inclusive
   mode group, this is an error, as it would change the filter mode.

Fix a problem with the handling of in_mfilter for new memberships:
 * Do not rely on imf being NULL; it is explicitly initialized to a
   non-NULL pointer when constructing a membership.
 * Explicitly initialize *imf to EX mode when the source address
   is unspecified.

This fixes a problem with in_mfilter slot recycling in the join path.

PR:		138690
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 19:45:55 +00:00
Bruce M Simpson
cc5776b24d Fix an obvious logic error in the IPv4 multicast leave processing,
where the filter mode vector was not updated correctly after the leave.

PR:		138691
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 19:07:03 +00:00
Bruce M Simpson
67e89408e5 Fix an API issue in leave processing for IPv4 multicast groups.
* Do not assume that the group lookup performed by imo_match_group()
   is valid when ifp is NULL in this case.
 * Instead, return EADDRNOTAVAIL if the ifp cannot be resolved for the
   membership we are being asked to leave.

Caveat user:
 * The way IPv4 multicast memberships are implemented in the inpcb layer
   at the moment, has the side-effect that struct ip_moptions will
   still hold the membership, under the old ifp, until ip_freemoptions()
   is called for the parent inpcb.
 * The underlying issue is: the inpcb layer does not get notification
   of ifp being detached going away in a thread-safe manner.
   This is non-trivial to fix.

But hey, at least the kernel should't panic when you unplug a card.

PR:		138689
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 18:55:15 +00:00
Navdeep Parhar
9a31144537 Add arp_update_event. This replaces route_arp_update_event, which
has not worked since the arp-v2 rewrite.

The event handler will be called with the llentry write-locked and
can examine la_flags to determine whether the entry is being added
or removed.

Reviewed by:	gnn, kmacy
Approved by:	gnn (mentor)
MFC after:	1 month
2009-09-08 21:17:17 +00:00
Poul-Henning Kamp
2ac047d1fe Move the duplicate definition of struct sockaddr_storage to its own
include file, and include this where the previous duplicate definitions were.

Static program checkers like FlexeLint rightfully take a dim view of
duplicate definitions, even if they currently are identical.
2009-09-08 10:39:38 +00:00
Shteryana Shopova
e72ae6eafd When joining a multicast group, the inp_lookup_mcast_ifp call
does a KASSERT that the group address is multicast, so the
check if this is indeed true and eventually return a EINVAL if not,
should be done before calling inp_lookup_mcast_ifp. This fixes a kernel
crash when calling setsockopt (sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,...)
with invalid group address.

Reviewed by:	bms
Approved by:	bms

MFC after:	3 days
2009-09-07 16:00:33 +00:00
Pawel Jakub Dawidek
360488410f Correct comment. 2009-09-06 07:29:22 +00:00
George V. Neville-Neil
54fc657d59 Add ARP statistics to the kernel and netstat.
New counters now exist for:
requests sent
replies sent
requests received
replies received
packets received
total packets dropped due to no ARP entry
entrys timed out
Duplicate IPs seen

The new statistics are seen in the netstat command
when it is given the -s command line switch.

MFC after:	2 weeks
In collaboration with: bz
2009-09-03 21:10:57 +00:00
Bjoern A. Zeeb
cc7e9d4325 In case an upper layer protocol tries to send a packet but the
L2 code does not have the ethernet address for the destination
within the broadcast domain in the table, we remember the
original mbuf in `la_hold' in arpresolve() and send out a
different packet with an arp request.
In case there will be more upper layer packets to send we will
free an earlier one held in `la_hold' and queue the new one.

Once we get a packet in, with which we can perfect our arp table
entry we send out the original 'on hold' packet, should there
be any.
Rather than continuing to process the packet that we received,
we returned without freeing the packet that came in, which
basically means that we leaked an mbuf for every arp request
we sent.

Rather than freeing the received packet and returning, continue
to process the incoming arp packet as well.
This should (a) improve some setups, also proxy-arp, in case it was an
incoming arp request and (b) resembles the behaviour FreeBSD had
from day 1, which alignes with RFC826 "Packet reception" (merge case).

Rename 'm0' to 'hold' to make the code more understandable as
well as diffable to earlier versions more easily.

Handle the link-layer entry 'la' lock comepletely in the block
where needed and release it as early as possible, rather than
holding it longer, down to the end of the function.

Found by:			pointyhat, ns1
Bug hunting session with:	erwin, simon, rwatson
Tested by:			simon on cluster machines
Reviewed by:			ratson, kmacy, julian
MFC after:			3 days
2009-09-01 17:53:01 +00:00
Qing Li
1bf38b1292 This patch fixes the following issues:
- Routing messages are not generated when adding and removing
  interface address aliases.
- Loopback route installed for an interface address alias is
  not deleted from the routing table when that address alias
  is removed from the associated interface.
- Function in_ifscrub() is called extraneously.

Reviewed by:	gnn, kmacy, sam
MFC after:	3 days
2009-08-31 21:02:48 +00:00
Michael Tuexen
2b77dd0181 Fix a bug where vlan interfaces are not supported by SCTP.
Approved by: rrs (mentor)
MFC after: 3 days
2009-08-28 08:41:59 +00:00
Qing Li
0437a93339 Do not try to free the rt_lle entry of the cached route in
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.

Reviewed by:	kmacy, rwatson
MFC after:	immediately
2009-08-28 05:37:31 +00:00
Robert Watson
dc56e98f0d Use locks specific to the lltable code, rather than borrow the ifnet
list/index locks, to protect link layer address tables.  This avoids
lock order issues during interface teardown, but maintains the bug that
sysctl copy routines may be called while a non-sleepable lock is held.

Reviewed by:	bz, kmacy
MFC after:	3 days
2009-08-25 09:52:38 +00:00
Michael Tuexen
24ae5c4a73 This fixes a bug where the value set by SCTP_PARTIAL_DELIVERY_POINT
was not honored, if the socket buffer size was not 4 times that large.

Approved by: rrs (mentor)
MFC after: 3 days.
2009-08-24 11:46:40 +00:00
Randall Stewart
0fa753b3fb This fixes two bugs in the NR-Sack code:
1) When calculating the table offset for sliding the sack
    array, the two byte values must be "ored" together in order
    for us to do the correct sliding of the arrays.
 2) We were NOT properly doing CC and other changes to things only
    NR-Sacked. The solution here is to make a separate function that
    will actually do both CC/updates and free things if its NR sack'd.
    This actually shrinks out common code from three places (much better).

MFC after:	3 days
2009-08-24 11:13:32 +00:00
Marko Zec
2b73aacaf9 Introduce a div_destroy() function which takes over per-vnet cleanup tasks
from the existing modevent / MOD_UNLOAD handler, and register div_destroy()
in protosw as per-vnet .pr_destroy() handler for options VIMAGE builds.  In
nooptions VIMAGE builds, div_destroy() will be invoked from the modevent
handler, resulting in effectively identical operation as it was prior this
change.  div_destroy() also tears down hashtables used by ipdivert, which
were previously left behind on ipdivert kldunloads.

For options VIMAGE builds only, temporarily disable kldunloading of ipdivert,
because without introducing additional locking logic it is impossible to
atomically check whether all ipdivert instances in all vnets are idle, and
proceed with cleanup without opening a race window for a vnet to open an
ipdivert socket while ipdivert tear-down is in progress.

While here, staticize div_init(), because it is not used outside of
ip_divert.c.

In cooperation with:	julian
Approved by:	re (rwatson), julian (mentor)
MFC after:	3 days
2009-08-24 10:06:02 +00:00
Robert Watson
77dfcdc445 Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock.  Either can be held to stablize the lists and indexes, but both
are required to write.  This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions.  As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by:	bz, julian
MFC after:	3 days
2009-08-23 20:40:19 +00:00
Julian Elischer
d3cef1d91e Fix another typo right next to the previous one, that amazingly, I did
not see before.

MFC after:	 1 week
2009-08-23 08:49:32 +00:00
Julian Elischer
8f26c03fe6 Fix typo in comment that has been bugging me for days.
MFC after:	1 week
2009-08-23 07:59:28 +00:00
Julian Elischer
c4b21cbe4a Fix ipfw's initialization functions to get the correct order of evaluation
to allow vnet and non vnet operation. Move some functions from ip_fw_pfil.c
to ip_fw2.c and mode to mostly using the SYSINIT and VNET_SYSINIT handlers
instead of the modevent handler. Correct some spelling errors in comments
in the affected code. Note this bug fixes a crash in NON VIMAGE kernels when
ipfw is unloaded.

This patch is a minimal patch for 8.0
I have a much larger patch that actually fixes the underlying problems
that will be applied after 8.0

Reviewed by:	zec@, rwatson@, bz@(earlier version)
Approved by:	re (rwatson)
MFC after:	Immediatly
2009-08-21 11:20:10 +00:00
Peter Wemm
b4e7e7a065 Fix signed comparison bug when ticks goes negative after 24 days of
uptime.  This causes the tcp time_wait state code to fail to expire
sockets in timewait state.

Approved by:	re (kensmith)
2009-08-20 22:53:28 +00:00
Will Andrews
52e12426d1 Fix CARP memory leaks on carp_if's malloc'd using M_CARP. This occurs when
CARP tries to free them using M_IFADDR after the last address for a virtual
host is removed and when detaching from the parent interface.

Reviewed by:	mlaier
Approved by:	re (kib), ken (mentor)
2009-08-20 02:33:12 +00:00
Michael Tuexen
2f99457b0c Fix a bug in the handling of unreliable messages
which results in stalled associations.

Approved by: re, rrs (mentor)
MFC after: immediately
2009-08-19 12:02:28 +00:00
Kip Macy
3ee42584f9 - change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
 - check that a cached flow corresponds to the same fib index as the
   packet for which we are doing the lookup
 - at interface detach time flush any flows referencing stale rtentrys
   associated with the interface that is going away (fixes reported
   panics)
 - reduce the time between cleans in case the cleaner is running at
   the time the eventhandler is called and the wakeup is missed less
   time will elapse before the eventhandler returns
 - separate per-vnet initialization from global initialization
   (pointed out by jeli@)

Reviewed by:	sam@
Approved by:	re@
2009-08-18 20:28:58 +00:00
Michael Tuexen
627dfd6df9 Fix a crash when using one-to-one stlye socket in non-blocking
mode and there is no listening server.
PR: 137795
Approved by: re, rrs (mentor)
MFC after:immediately.
2009-08-18 19:58:49 +00:00
Michael Tuexen
810ec53688 * Fix a bug where PR-SCTP settings are ignore when using implicit
association setup.
* Fix a bug where message with illegal stream ids are not deleted.
* Fix a crash when reporting back unsent messages from the send_queue.
* Fix a bug related to INIT retransmission when the socket is already
  closed.
* Fix a bug where associations were stalled when partial delivery API
  was enabled.
* Fix a bug where the receive buffer size was smaller than the
  partial_delivery_point.

Approved by: re, rrs (mentor)
MFC after: One day.
2009-08-15 21:10:52 +00:00
Qing Li
3ef5e21d01 In function ip_output(), the cached route is flushed when there is a
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address being used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.

Reviewed by:	kmacy, rwatson
Approved by:	re
2009-08-14 23:44:59 +00:00
Marko Zec
f92ae4d706 SCTP is not yet compatible with options VIMAGE kernels although it compiles
with VIMAGE defined, so explicitly disallow building such kernels.

Reviewed by:	rrs
Approved by:	re (rwatson), julian (mentor)
2009-08-14 22:43:25 +00:00
Julian Elischer
72034f5548 Fix ipfw crash on uid or gid check.
Receiving any ip packet for which there is no existing socket will
crash if ipfw has a uid or gid test rule, as the uid/gid
of the non existent owner of said non existent socket is tested.
Brooks introduced this error as part of his >16 gids patch.
It appears to be a cut-n-paste error from similar code a few lines
before. The old code used the 'pcb' variable here, but in the
new code that switched the 'inp' variable, which is often NULL
and what is tested in the code further up. The rest of the multi-gid
patch for ipfw seems solid (and cleaner than previous code).

Reviewed by:	brooks
Approved by:	re (rwatson)
2009-08-14 10:09:45 +00:00
Robert Watson
9d2eb78bcb Add padding to struct inpcb, missed during our padding sweep earlier in
the release cycle.

Approved by:	re (kensmith)
2009-08-02 22:47:08 +00:00
Robert Watson
315e3e38fa Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem.  These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored.  This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

      if_bridge
      if_cxgb
      if_gif
      ip_mroute
      ipdivert
      pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack.  However, that change is too agressive for
this point in the release cycle.

Reviewed by:	bz
Approved by:	re (kib)
2009-08-02 19:43:32 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Xin LI
16e324fc56 Show interface name which received short CARP packet (e.g. a VRRP packet),
in order to match other codepaths nearby.  This makes troubleshooting
easier.

Approved by:	re (kib)
MFC after:	1 month
2009-07-30 17:40:47 +00:00
Julian Elischer
3d1001cb11 Startup the vnet part of initialization a bit after the global part.
Fixes crash on boot if ipfw compiled in.

Submitted by:	tegge@
Reviewed by:	tegge@
Approved by:	re (kib)
2009-07-28 19:58:07 +00:00
Julian Elischer
7973fba3a4 Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by:	ambrisko
Approved by:	re (kib)
MFC after:	3 days
2009-07-28 19:43:27 +00:00
Michael Tuexen
bf3d517756 Fix a bug where wrong initialization value
in used for an SCTP specific sysctl variable.

Approved by: re, rrs(mentor).
MFC after: 2 weeks.
2009-07-28 15:07:41 +00:00
Randall Stewart
cfde3ff70b Turns out that when a receiver forwards through its TNS's the
processing code holds the read lock (when processing a
FWD-TSN for pr-sctp). If it finds stranded data that
can be given to the application, it calls sctp_add_to_readq().
The readq function also grabs this lock. So if INVAR is on
we get a double recurse on a non-recursive lock and panic.

This fix will change it so that readq() function gets a
flag to tell if the lock is held, if so then it does not
get the lock.

Approved by:	re@freebsd.org (Kostik Belousov)
MFC after:	1 week
2009-07-28 14:09:06 +00:00
Qing Li
df813b7ea2 This patch does the following:
- Allow loopback route to be installed for address assigned to
      interface of IFF_POINTOPOINT type.
    - Install loopback route for an IPv4 interface addreess when the
      "useloopback" sysctl variable is enabled. Similarly, install
      loopback route for an IPv6 interface address when the sysctl variable
      "nd6_useloopback" is enabled. Deleting loopback routes for interface
      addresses is unconditional in case these sysctl variables were
      disabled after an interface address has been assigned.

Reviewed by:	bz
Approved by:	re
2009-07-27 17:08:06 +00:00
Michael Tuexen
8e71b6947a Fix the handling of unordered messages when using
PR-SCTP.

Approved by: re, rrs (mentor)
MFC after: 3 weeks.
2009-07-27 13:41:45 +00:00
Michael Tuexen
4420d9a062 Get rid of unused field. This will also be deleted
in the official speciication of the SCTP socket API.

Approved by:re, rrs (mentor)
2009-07-27 12:09:32 +00:00
Michael Tuexen
47a490cbbc Add a missing unlock for the inp lock when
returning early from sctp_add_to_readq().

Approved by: re, rrs (mentor)
MFC after: 2 weeks.
2009-07-26 15:06:59 +00:00
Julian Elischer
9d85f50ad5 Catch ipfw up to the rest of the vimage code.
It got left behind when it moved to its new location.

Approved by:	re (kensmith)
2009-07-25 06:42:42 +00:00
Robert Watson
d0728d7174 Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
  occur each time a network stack is instantiated and destroyed.  In the
  !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
  For the VIMAGE case, we instead use SYSINIT's to track their order and
  properties on registration, using them for each vnet when created/
  destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
  previously, as well as its dependency scheme: we now just use the
  SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
  they want init functions to be called for each virtual network stack
  rather than just once at boot, compiling down to DOMAIN_SET() in the
  non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
  of modinfo or DOMAIN_SET() for init/uninit events.  In some cases,
  convert modular components from using modevent to using sysinit (where
  appropriate).  In some cases, do minor rejuggling of SYSINIT ordering
  to make room for or better manage events.

Portions submitted by:	jhb (VNET_SYSINIT), bz (cleanup)
Discussed with:		jhb, bz, julian, zec
Reviewed by:		bz
Approved by:		re (VIMAGE blanket)
2009-07-23 20:46:49 +00:00
Bjoern A. Zeeb
a08362ce46 sysctl_msec_to_ticks is used with both virtualized and
non-vrtiualized sysctls so we cannot used one common function.

Add a macro to convert the arg1 in the virtualized case to
vnet.h to not expose the maths to all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two over places to use the new macro.

Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-21 21:58:55 +00:00
Robert Watson
a511354af4 Back out the moving in r195782 of V_ip_id's initialization from the top
back to the bottom of ip_init() as found in 7.x.  I missed the fact that
the bottom half of the init routine only runs in the !VNET case.

Submitted by:	zec
Approved by:	re (vimage blanket)
2009-07-20 19:40:09 +00:00
Robert Watson
0a4747d4d0 Garbage collect vnet module registrations that have neither constructors
nor destructors, as there's no actual work to do.

In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.

Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 13:55:33 +00:00
Robert Watson
5ee847d3ac Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
Robert Watson
1e77c1056a Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Lawrence Stewart
91a5ebde45 Fix a race in the manipulation of the V_tcp_sack_globalholes global variable,
which is currently not protected by any type of lock. When triggered, the bug
would sometimes cause a panic when the TCP activity to an affected machine
eventually slowed during a lull. The panic only occurs if INVARIANTS is compiled
into the kernel, and has laid dormant for some time as a result of INVARIANTS
being off by default except in FreeBSD-CURRENT.

Switch to atomic operations in the locations where the variable is changed.
Reads have not been updated to be protected by atomics, so there is a
possibility of accounting errors in any given calculation where the variable is
read. This is considered unlikely to occur in the wild, and will not cause
serious harm on rare occasions where it does.

Thanks to Robert Watson for debugging help.

Reported by:	Kamigishi Rei <spambox at haruhiism dot net>
Tested by:	Kamigishi Rei <spambox at haruhiism dot net>
Reviewed by:	silby
Approved by:	re (rwatson), kensmith (mentor temporarily unavailable)
2009-07-13 11:59:38 +00:00
Lawrence Stewart
237fbe0a1c Replace struct tcpopt with a proxy toeopt struct in the TOE driver interface to
the TCP syncache. This returns struct tcpopt to being private within the TCP
implementation, thus allowing it to be modified without ABI concerns.

The patch breaks the ABI. Bump __FreeBSD_version to 800103 accordingly. The cxgb
driver is the only TOE consumer affected by this change, and needs to be
recompiled along with the kernel.

Suggested by:	rwatson
Reviewed by:	rwatson, kmacy
Approved by:	re (kensmith), kensmith (mentor temporarily unavailable)
2009-07-13 11:51:02 +00:00
Lawrence Stewart
962ebef8c0 Pad the following TCP related structs to allow MFCs of upcoming features/fixes
back to the 8 branch:

tcp_var.h
- struct sackhint
- struct tcpcb
- struct tcpstat

The patch breaks the ABI. Bump __FreeBSD_version to 800102 accordingly. User
space tools that rely on the size of any of these structs (e.g. sockstat) need
to be recompiled.

Reviewed by:	rpaulo, sam, andre, rwatson
Approved by:	re & mentor (gnn)
2009-07-12 09:14:28 +00:00
Robert Watson
6c8615603b Update various IPFW-related modules to use if_addr_rlock()/
if_addr_runlock() rather than IF_ADDR_LOCK()/IF_ADDR_UNLOCK().

MFC after:	6 weeks
2009-06-26 00:46:50 +00:00
Robert Watson
d1da0a0672 Add address list locking for in6_ifaddrhead/ia_link: as with locking
for in_ifaddrhead, we stick with an rwlock for the time being, which
we will revisit in the future with a possible move to rmlocks.

Some pieces of code require significant further reworking to be
safe from all classes of writer-writer races.

Reviewed by:	bz
MFC after:	6 weeks
2009-06-25 16:35:28 +00:00
Robert Watson
64aeca7b42 Initialize in_ifaddr_lock using RW_SYSINIT() instead of in ip_init(),
so that it doesn't run multiple times if VIMAGE is being used.

Discussed with:	bz
MFC after:	6 weeks
2009-06-25 14:44:00 +00:00
Robert Watson
2d9cfabad4 Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the
in_ifaddrhead and INADDR_HASH address lists.

Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).

For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support.  As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done.  This means that one
class of reader-writer races still exists.

MFC after:      6 weeks
Reviewed by:    bz
2009-06-25 11:52:33 +00:00
Oleg Bulyzhin
6882bf4d92 - fix dummynet 'fast' mode for WF2Q case.
- fix printing of pipe profile data.
- introduce new pipe parameter: 'burst' - how much data can be sent through
  pipe bypassing bandwidth limit.
2009-06-24 22:57:07 +00:00
Robert Watson
32187eb6d9 Fix CARP build.
Reported by:	bz
2009-06-24 21:34:38 +00:00
Robert Watson
80af0152f3 Convert netinet6 to using queue(9) rather than hand-crafted linked lists
for the global IPv6 address list (in6_ifaddr -> in6_ifaddrhead).  Adopt
the code styles and conventions present in netinet where possible.

Reviewed by:	gnn, bz
MFC after:	6 weeks (possibly not MFCable?)
2009-06-24 21:00:25 +00:00
Robert Watson
f8574c7a22 Add missing unlock of if_addr_mtx when an unmatched ARP packet is received.
Reported by:	lstewart
MFC after:	6 weeks
2009-06-24 14:49:26 +00:00
Robert Watson
19e5b0a797 Clear 'ia' after iterating if_addrhead for unicast address matching: since
'ifa' was used as the TAILQ_FOREACH() iterator argument, and 'ia' was just
derived form it, it could be left non-NULL which confused later
conditional freeing code.  This could cause kernel panics if multicast IP
packets were received.  [1]

Call 'struct in_ifaddr *' in ip_rtaddr() 'ia', not 'ifa' in keeping with
normal conventions.

When 'ipstealth' is enabled returns from ip_input early, properly release
the 'ia' reference.

Reported by:	lstewart, sam [1]
MFC after:	6 weeks
2009-06-24 14:29:40 +00:00
Robert Watson
09d547787f In ARP input, more consistently acquire and release ifaddr references.
MFC after:	6 weeks
2009-06-24 10:33:35 +00:00
Bjoern A. Zeeb
88d166bf19 Make callers to in6_selectsrc() and in6_pcbladdr() pass in memory
to save the selected source address rather than returning an
unreferenced copy to a pointer that might long be gone by the
time we use the pointer for anything meaningful.

Asked for by:	rwatson
Reviewed by:	rwatson
2009-06-23 22:08:55 +00:00
Robert Watson
8c0fec805f Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references.  The following routines now return references:

  ifaddr_byindex
  ifa_ifwithaddr
  ifa_ifwithbroadaddr
  ifa_ifwithdstaddr
  ifa_ifwithnet
  ifaof_ifpforaddr
  ifa_ifwithroute
  ifa_ifwithroute_fib
  rt_getifa
  rt_getifa_fib
  IFP_TO_IA
  ip_rtaddr
  in6_ifawithifp
  in6ifa_ifpforlinklocal
  in6ifa_ifpwithaddr
  in6_ifadd
  carp_iamatch6
  ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

  IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc).  In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit.  Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by:	bz
Obtained from:	Apple, Inc. (portions)
MFC after:	6 weeks (portions)
2009-06-23 20:19:09 +00:00
Bjoern A. Zeeb
5736e6fb9d After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.
2009-06-23 17:03:45 +00:00
Andre Oppermann
ef760e6ad2 Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number stream specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
  the length of data to be moved into the uio compared to
  soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
  cases is removed.
o much improved reability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams.  Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets.  Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4:	r164817 //depot/user/andre/soreceive_stream/
2009-06-22 23:08:05 +00:00
Marko Zec
fa057b15bd V_irtualize flowtable state.
This change should make options VIMAGE kernel builds usable again,
to some extent at least.

Note that the size of struct vnet_inet has changed, though in
accordance with one-bump-per-day policy we didn't update the
__FreeBSD_version number, given that it has already been touched
by r194640 a few hours ago.
Reviewed by:	bz
Approved by:	julian (mentor)
2009-06-22 21:19:24 +00:00
Robert Watson
8896f83a58 Add a new function, ifa_ifwithaddr_check(), which rather than returning
a pointer to an ifaddr matching the passed socket address, returns a
boolean indicating whether one was present.  In the (near) future,
ifa_ifwithaddr() will return a referenced ifaddr rather than a raw
ifaddr pointer, and the new wrapper will allow callers that care only
about the boolean condition to avoid having to free that reference.

MFC after:	3 weeks
2009-06-22 10:59:34 +00:00
Bjoern A. Zeeb
173de0f9cc Remove a hack from r186086 so that IPsec via loopback routes continued
working. It was targeted for stable/7 compatibility and actually never
did anything in HEAD.

Reminded by:	rwatson
X-MFC after:	never
2009-06-22 09:24:46 +00:00
Robert Watson
1099f828b3 Clean up common ifaddr management:
- Unify reference count and lock initialization in a single function,
  ifa_init().
- Move tear-down from a macro (IFAFREE) to a function ifa_free().
- Move reference count bump from a macro (IFAREF) to a function ifa_ref().
- Instead of using a u_int protected by a mutex to refcount(9) for
  reference count management.

The ifa_mtx is now used for exactly one ioctl, and possibly should be
removed.

MFC after:	3 weeks
2009-06-21 19:30:33 +00:00
Roman Divacky
e40bae9a45 Switch cmd argument to u_long. This matches what if_ethersubr.c does and
allows the code to compile cleanly on amd64 with clang.

Reviewed by:	rwatson
Approved by:	ed (mentor)
2009-06-21 10:29:31 +00:00
Brooks Davis
838d985825 Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively.  (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer.  Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively.  Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary.  In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups.  When feasible, truncate
the group list rather than generating an error.

Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since
    they are immutable once referenced by more than one entity.

Submitted by:	Isilon Systems (initial implementation)
X-MFC after:	never
PR:		bin/113398 kern/133867
2009-06-19 17:10:35 +00:00
Bjoern A. Zeeb
ebd8672cc3 Add explicit includes for jail.h to the files that need them and
remove the "hidden" one from vimage.h.
2009-06-17 15:01:01 +00:00
Bjoern A. Zeeb
7654a365db Add the explicit include of vimage.h to another five .c files still
missing it.

Remove the "hidden" kernel only include of vimage.h from ip_var.h added
with the very first Vimage commit r181803 to avoid further kernel poisoning.
2009-06-17 12:44:11 +00:00
Randall Stewart
d50c1d79d0 Changes to the NR-Sack code so that:
1) All bit disappears
2) The two sets of gaps (nr and non-nr) are
   disjointed, you don't have gaps struck in
   both places.

This adjusts us to coorespond to the new draft. Still
to-do, cleanup the code so that there are only one set
of sack routines (original NR-Sack done by E cloned all
sack code).
2009-06-17 12:34:56 +00:00
John Baldwin
6b0c5521b5 Trim extra sets of ()'s.
Requested by:	bde
2009-06-16 19:00:48 +00:00
John Baldwin
6dfb8b316c Fix edge cases with ticks wrapping from INT_MAX to INT_MIN in the handling
of the per-tcpcb t_badtrxtwin.

Submitted by:	bde
2009-06-16 19:00:12 +00:00
John Baldwin
9f78a87a06 - Change members of tcpcb that cache values of ticks from int to u_int:
t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age,
  t_badrxtwin.
- Change t_recent in struct timewait from u_long to u_int32_t to match
  the type of the field it shadows from tcpcb: ts_recent.
- Change t_starttime in struct timewait from u_long to u_int to match
  the t_starttime field in tcpcb.

Requested by:	bde (1, 3)
2009-06-16 18:58:50 +00:00
Jamie Gritton
9ed47d01eb Get vnets from creds instead of threads where they're available, and from
passed threads instead of curthread.

Reviewed by:	zec, julian
Approved by:	bz (mentor)
2009-06-15 19:01:53 +00:00
Oleg Bulyzhin
1917ef996d Since dn_pipe.numbytes is int64_t now - remove unnecessary overflow detection
code in ready_event_wfq().
2009-06-15 17:14:47 +00:00
Bjoern A. Zeeb
53be8fca00 Move the kernel option FLOWTABLE chacking from the header file to the
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.

Reviewed by:	rwatson
2009-06-12 20:46:36 +00:00
VANHULLEBUS Yvan
7b495c4494 Added support for NAT-Traversal (RFC 3948) in IPsec stack.
Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...

X-MFC: never

Reviewed by:	bz
Approved by:	gnn(mentor)
Obtained from:	NETASQ
2009-06-12 15:44:35 +00:00
John Baldwin
a13c655c64 Correct printf format type mismatches. 2009-06-11 14:37:18 +00:00
John Baldwin
1a0e7cfc42 Trim extra ()'s.
Submitted by:	bde
2009-06-11 14:36:13 +00:00