Commit Graph

3608 Commits

Author SHA1 Message Date
Bruce M Simpson
1975dc405a Comment IGMP_PIM as being very historic, as in, don't use. 2009-03-19 01:15:26 +00:00
Bruce M Simpson
56663a40eb Deal with the case where ifma_protospec may be NULL, during
any IPv4 multicast operations which reference it.

There is a potential race because ifma_protospec is set to NULL
when we discover the underlying ifnet has gone away. This write
is not covered by the IF_ADDR_LOCK, and it's difficult to widen
its scope without making it a recursive lock. It isn't clear why
this manifests more quickly with 802.11 interfaces, but does not
seem to manifest at all with wired interfaces.

With this change, the 802.11 related panics reported by sam@
and cokane@ should go away. It is not the right fix, that requires
more thought before 8.0.

Idea from:	sam
Tested by:	cokane
2009-03-17 14:41:54 +00:00
Robert Watson
e5adda3d51 Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free.  This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.

Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile.  They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:

        if_ar
        if_axe
        if_aue
        if_cdce
        if_cue
        if_kue
        if_ray
        if_rue
        if_rum
        if_sr
        if_udav
        if_ural
        if_zyd

Drivers that were already disabled because of tty changes:

        if_ppp
        if_sl

Discussed on:	arch@
2009-03-15 14:21:05 +00:00
Robert Watson
ad71fe3c35 Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz
2009-03-15 09:58:31 +00:00
Randall Stewart
49633f4b36 Opps.. I missed a file on the commit :-) 2009-03-14 23:13:16 +00:00
David Schultz
b3c11b5b91 Namespace: Defining htonl() and friends here instead of arpa/inet.h is
a BSD extension.
2009-03-14 20:16:54 +00:00
Randall Stewart
0c0982b80c Fixes several PR-SCTP releated bugs.
- When sending large PR-SCTP messages over a
   lossy link we would incorrectly calculate the fwd-tsn
 - When receiving large multipart pr-sctp packets we would
   incorrectly send back a SACK that would renege improperly
   on already received packets thus causing unneeded retransmissions.
2009-03-14 13:42:13 +00:00
Robert Watson
111d57a69c Add INP_INHASHLIST flag for inpcb->inp_flags to indicate whether
or not the inpcb is currenty on various hash lookup lists, rather
than using (lport != 0) to detect this.  This means that the full
4-tuple of a connection can be retained after close, which should
lead to more sensible netstat output in the window between TCP
close and socket close.

MFC after:	2 weeks
2009-03-11 00:29:22 +00:00
Robert Watson
4cf172fd65 Remove unused v6 macro aliases for inpcb fields:
in6p_ip6_nxt
        in6p_vflag
        in6p_flags
        in6p_socket
        in6p_lport
        in6p_fport
        in6p_ppcb

Remove unused v6 macro aliases for inpcb flags:

        IN6P_HIGHPORT
        IN6P_LOWPORT
        IN6P_ANONPORT
        IN6P_RECVIF
        IN6P_MTUDISC
        IN6P_FAITH
        IN6P_CONTROLOPTS

References to in6p_lport and in6_fport in sockstat are also replaced with
normal inp_lport and inp_fport references.

MFC after:	3 days
Reviewed by:	bz
2009-03-10 17:57:41 +00:00
Bruce M Simpson
30e239fe64 Don't print inm_print() chatter when KTR_IGMPV3 is not enabled
in the KTR_COMPILE mask.

Found by:	gnn
2009-03-10 17:48:49 +00:00
Robert Watson
b9bbb597b1 Remove now-unused INP_UNMAPPABLEOPTS.
MFC after:	3 days
Discussed with:	bz
2009-03-10 11:04:19 +00:00
Bruce M Simpson
c75aa3548f Fix uninitialized use of ifp for ii.
Found by:	Peter Holm
2009-03-09 22:54:17 +00:00
Bruce M Simpson
d10910e6ce Merge IGMPv3 and Source-Specific Multicast (SSM) to the FreeBSD
IPv4 stack.

Diffs are minimized against p4.
PCS has been used for some protocol verification, more widespread
testing of recorded sources in Group-and-Source queries is needed.
sizeof(struct igmpstat) has changed.

__FreeBSD_version is bumped to 800070.
2009-03-09 17:53:05 +00:00
Marius Strobl
c89c8a1029 On architectures with strict alignment requirements compensate
the misalignment of the IP header that prepending the EtherIP
header might have caused.

PR:		131921
MFC after:	1 week
2009-03-07 19:08:58 +00:00
Randall Stewart
5171328bd6 Fixes for window probes:
1) WP should never be marked unless flight size is 0
 2) When recovering from wp if the peer ack's it we don't mark for retran
 3) When recovering, we must assure a timer is still running.
2009-03-06 11:03:52 +00:00
Randall Stewart
dfb11ef895 - PR-SCTP bug, where the CUM-ACK was not being updated
into the advance_peer_ack point so we would incorrectly
  send a wrong value in the FWD-TSN
- PR-SCTP bug, where an PR packet is used for a window
  probe which could incorrectly get the packet moved
  back into the send_queue, which will cause major issues and
  should not happen.
- Fix a trace to use the proper macro.
2009-03-04 20:54:42 +00:00
Bruce M Simpson
8b889dbb9e In ip_output(), do not acquire the IN_MULTI_LOCK(),
and do not attempt to perform a group lookup.
This is a socket layer lock, and the bottom half of IP
really has no business taking it.

Use the value of the in_mcast_loop sysctl to determine
if we should loop back by default, in the absence of
any multicast socket options. Because the check on
group membership is now deferred to the input path,
an m_copym() is now required.

This should increase multicast send performance where the
source has not requested loopback, although this has not been
benchmarked or measured.

It is also a necessary change for IN_MULTI_LOCK to become
non-recursive, which is required in order to implement IGMPv3
in a thread-safe way.
2009-03-04 03:45:34 +00:00
Bruce M Simpson
dd7fd7c07c Add sysctl net.inet.ip.mcast.loop. This controls whether or not
IPv4 multicast sends are looped back to senders by default
on a stack-wide basis, rather than relying on the socket option.
Note that the sysctl only applies to newly created multicast sockets.
2009-03-04 03:40:02 +00:00
Bruce M Simpson
346e3178ea Merge header file definitions used by the new IGMPv3 implementation.
This is a partial merge. Compatibility defines are retained for
the existing IGMPv2 implementation.
2009-03-04 03:22:03 +00:00
Bruce M Simpson
b554b6ca91 Add various defines/macros required by IGMPv3:
* MCAST_UNDEFINED state.
 * in_allhosts() macro (group is 224.0.0.1).
   This uses a const endian comparison.
 * IP_MAX_GROUP_SRC_FILTER, IP_MAX_SOCK_SRC_FILTER
   default resource limits.
2009-03-04 03:01:05 +00:00
Bruce M Simpson
f0dcb78326 Add function ip_checkrouteralert(), which will be used
by IGMPv3 to check for the IPv4 Router Alert [RFC2113]
option in a pulled-up IP mbuf chain.
2009-03-04 02:51:22 +00:00
Bjoern A. Zeeb
1263305f0c Start removing IPv6 Type 0 Routing header code.
RH0 was deprecated by RFC 5095.

While most of the code had been disabled by #if 0 already, leave a
bit of infrastructure for possible RH2 code and a log message under
BURN_BRIDGES in case a user still tries to send RH0 packets.

Reviewed by:	gnn (a bit back, earlier version)
2009-03-03 13:12:12 +00:00
Luigi Rizzo
ac6bb60e0a curr_time is a 64 bit variable so SYSCTL_LONG is not appropriate
as a handler.
The variable was exported only for debugging, but there is little reason
to do it now that the timekeeping is supported by various other variables.
For the time being just comment out the sysctl, but I think this
should go away.
2009-03-02 22:16:50 +00:00
Luigi Rizzo
0906f40fd8 fw_debug has been unused for ages, so remove it from the list
of sysctl_variables.
I would also remove it from the VNET record but I am unsure if
there is any ABI issue -- so for the time being just mark it as
unused in ip_fw.h, and then we will collect the garbage at some
appropriate time in the future.

MFC after:	3 days
2009-03-02 22:11:48 +00:00
Bjoern A. Zeeb
2bebb49117 Add size-guards evaluated at compile-time to the main struct vnet_*
which are not in a module of their own like gif.

Single kernel compiles and universe will fail if the size of the struct
changes. Th expected values are given in sys/vimage.h.
See the comments where how to handle this.

Requested by:	peter
2009-03-01 11:01:00 +00:00
Robert Watson
8e5057ed20 Remove unreachable code for generating RST segments from tcp_twcheck();
this code became stale when T/TCP support was removed.

Discussed with:	bz, sam
MFC after:	1 month
2009-02-28 22:58:52 +00:00
Randall Stewart
8aae94933f Fix the add stream feature of strm-reset to really work:
- Fix the copy, we can't do a blind copy but must transfer
   the data from the old to the new.
 - Fix the ACK processing so we properly stop retransmitting
   the thing.
 - Fix it so if we get a retran we will properly reply with
   the saved response without doing anything.

MFC after:	1 month
2009-02-27 20:54:45 +00:00
Bjoern A. Zeeb
33553d6e99 For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
2009-02-27 14:12:05 +00:00
Roman Divacky
af83f5d77c Change the functions to ANSI in those cases where it breaks promotion
to int rule. See ISO C Standard: SS6.7.5.3:15.

Approved by:	kib (mentor)
Reviewed by:	warner
Tested by:	silence on -current
2009-02-24 18:09:31 +00:00
Robert Watson
ce2ae9ab4b In tcp_usr_shutdown() and tcp_usr_send(), I missed converting NULL
checks for the tcpcb, previously used to detect complete disconnection,
with INP_DROPPED checks.  Correct that, preventing shutdown() from
improperly generating a TCP segment with destination IP and port of
0.0.0.0:0.

PR:		kern/132050
Reported by:	david gueluy <david.gueluy at netasq.com>
MFC after:	3 weeks
2009-02-24 11:17:50 +00:00
Robert Watson
63d0295c2f In in_rtqkill(), assert the radix head lock, and pass RTF_RNH_LOCKED
to in_rtrequest(); the radix head lock is already acquired before
rnh_walktree is called in in_rtqtimo_one().  This avoids a recursive
acquisition that is no longer permitted in 8.x due to use of an rwlock
for the radix head lock.

Reported by:	dikshie <dikshie at gmail.com>
MFC after:	3 days
2009-02-23 22:57:55 +00:00
Randall Stewart
ea44232b3a Add the add-stream capability. Still needs more
testing..

MFC after:	1 month
2009-02-20 15:03:54 +00:00
Randall Stewart
186414058a Fix a bug. The sending was being restricted improperly by
the max_burst. It should only be gated by cwnd in the
lower level send.

Obtained from:	Michael Tuexen
MFC after:	1 week.
2009-02-20 14:33:45 +00:00
Luigi Rizzo
d8d42f3f4e correct some #include 2009-02-16 15:10:51 +00:00
Luigi Rizzo
35b78b7520 remove dependency on eventhandler.h, we only need a forward declaration 2009-02-16 15:08:41 +00:00
Luigi Rizzo
281c8daea2 remove dependency on net/if.h of this header 2009-02-16 15:07:40 +00:00
Luigi Rizzo
2eef235973 use a const format string in the log message so we can check the
arguments (if/when we enable those checks)
2009-02-16 12:09:52 +00:00
Luigi Rizzo
ada55ca0b7 remove unnecessary #include from vnet.h and vinet.h
Approved by:	Marko Zec
2009-02-15 00:28:28 +00:00
Randall Stewart
eef9e53e55 This commit fixes the issue with alias_sctp.c. No
longer do we require SCTP to be in the kernel for the
lib to be able to handle SCTP. We do this by moving
the CRC32c checksum into libkern/crc32.c and then adjusting
all routines to use the common methods. Note that this
will improve the performance of iSCSI since they were
using the old single 256 bit table lookup versus the
slicing 8 algorithm (which gives a 4x speed up in
CRC32c calculation :-D)

Reviewed by:rwatson, gnn, scottl, paolo
MFC after:	4 week? (assuming we MFC the alias_sctp changes)
2009-02-14 11:34:57 +00:00
Randall Stewart
c3b8c73cf1 Have the jail code use the error returned to pass not constant
errors.
Obtained from:	jamie@freebsd.org
2009-02-13 18:44:30 +00:00
Luigi Rizzo
8f2f943e8f remove unnecessary #include, and document some of the others 2009-02-13 15:37:14 +00:00
Luigi Rizzo
d685b6ee05 Use uint32_t instead of n_long and n_time, and uint16_t instead of n_short.
Add a note next to fields in network format.

The n_* types are not enough for compiler checks on endianness, and their
use often requires an otherwise unnecessary #include <netinet/in_systm.h>

The typedef in in_systm.h are still there.
2009-02-13 15:14:43 +00:00
Randall Stewart
4f6b49338e Move the new rwnd field down to the very end
of the xsctp structure. This is where all new
fields belong (not that we will be ABI compatiable
with 7.x anyway.. sigh).
2009-02-13 14:43:46 +00:00
Randall Stewart
11b14db397 Add padding to then end of the xsctp_xxx structures to
allow future changes to be able to maintain ABI compatibility
2009-02-09 17:37:17 +00:00
Randall Stewart
74246b2734 Fix minor spacing problem found by s9indent from last
commit.
2009-02-09 11:42:23 +00:00
Randall Stewart
a1f2f7a5a0 Fix INET only build breakage with SCTP - pointy hat to me :-) 2009-02-09 11:41:54 +00:00
Bjoern A. Zeeb
97aa4a517a Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by:	sam
Discussed with:	rwatson (renaming to *_internal)
MFC after:	26 days
X-MFC:		keep wrapper functions for public symbols?
2009-02-08 09:27:07 +00:00
Paolo Pisati
e13710afbd Silent LINT: add 2 stubs (update_crc32 and sctp_finalize_crc32) to fix LIBALIAS + SCTP_NO_CSUM case. 2009-02-08 03:03:55 +00:00
Paolo Pisati
37ce2656ec Add SCTP NAT support.
Submitted by: CAIA (http://caia.swin.edu.au)
2009-02-07 18:49:42 +00:00
Jamie Gritton
7c2f3cb964 Remove redundant calls of prison_local_ip4 in in_pcbbind_setup, and of
prison_local_ip6 in in6_pcbbind.

Approved by:	bz (mentor)
2009-02-05 14:25:53 +00:00
Jamie Gritton
b89e82dd87 Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise.  The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family.  For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes).  Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by:	bz (mentor)
2009-02-05 14:06:09 +00:00
Randall Stewart
be27fdd0c4 LOR fix - Lock only when calling the actual code that
is messing with the UDP tunnel. This means
          that if two users actually tried to change the
          tunnel port at the same time interesting things COULD
          result, but its probably very unlikely to happen :-)
2009-02-03 20:33:28 +00:00
Randall Stewart
a99b67833a - Cleanup checksum code.
- Prepare for CRC offloading, add MIB counters (RS/MT).
- Bugfix: Disable CRC computation for IPv6 addresses with local scope (MT).
- Bugfix: Handle close() with SO_LINGER correctly when notifications
          are generated during the close() call(MT).
- Bugfix: Generate DRY event when sender is dry during subscription.
          Only for 1-to-1 style sockets (RS/MT)
- Bugfix: Put vtags for the correct amount of time into time-wait (MT).
- Bugfix: Clear vtag entries correctly on expiration (MT).
- Bugfix: shutdown() indicates ENOTCONN when called for unconnected
          1-to-1 style sockets (MT).
- Bugfix: In sctp Auth code (PL).
- Add support for devices that support SCTP csum offload (igb).
- Add missing sctp_associd to mib sysctl xsctp_tcb structure (RS)
Obtained from:	With help from Peter Lei and Michael Tuexen
2009-02-03 11:04:03 +00:00
Randall Stewart
2f4afd2125 Adds support for SCTP checksum offload. This means
we, like TCP and UDP, move the checksum calculation
into the IP routines when there is no hardware support
we call into the normal SCTP checksum routine.

The next round of SCTP updates will use
this functionality. Of course the IGB driver needs
a few updates to support the new intel controller set
that actually does SCTP csum offload too.

Reviewed by:	gnn, rwatson, kmacy
2009-02-03 11:00:43 +00:00
Luigi Rizzo
6e152a7539 initialize a couple of variables, gcc 4.2.4-4 (linux) reports
some possible uninitialized uses and the warning does make sense.
2009-01-28 13:39:01 +00:00
Luigi Rizzo
36cb0db476 For some reason (probably dating ages ago) an #ifdef SYSCTL_NODE / #endif
section included a lot of stuff that did not belong there.
So split the block in multiple components each around the relevant stuff.

This said, I wonder if building a kernel where SYSCTL_NODE is not
defined is supported at all.

Submitted by:	Marta Carbone
2009-01-28 13:11:22 +00:00
Bjoern A. Zeeb
1cecba0fcd For consistency with prison_{local,remote,check}_ipN rename
prison_getipN to prison_get_ipN.

Submitted by:	jamie (as part of a larger patch)
MFC after:	1 week
2009-01-25 10:11:58 +00:00
Bjoern A. Zeeb
de4fbddd5b Add externs to fix build with VIMAGE_GLOBALS after r187289. 2009-01-22 10:29:09 +00:00
Sam Leffler
cbd1844537 remove too noisy DIAGNOSTIC code
Reviewed by:	qingli
2009-01-18 07:20:02 +00:00
Paolo Pisati
dd14bc5dca Silent userland warnings about missing prototypes.
Submitted by:	Roman Divacky <rdivacky@freebsd.org>
2009-01-15 19:35:23 +00:00
Lawrence Stewart
24cb0f2232 Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.
The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by:	rpaulo, gnn
Approved by:	gnn, kmacy (mentors)
Sponsored by:	FreeBSD Foundation
2009-01-15 06:44:22 +00:00
Robert Watson
87e0451806 Since we allow conditional allocation of labels on syncache entries,
remove historic assertion that labels are always present.
2009-01-11 20:01:43 +00:00
Bjoern A. Zeeb
813dd6ae5e Restrict arp, ndp and theoretically the FIB listing (if not
read with libkvm) to the addresses of a prison, when inside a
jail. [1]
As the patch from the PR was pre-'new-arp', add checks to the
llt_dump handlers as well.

While touching RTM_GET in route_output(), consistently use
curthread credentials rather than the creds from the socket
there. [2]

PR:		kern/68189
Submitted by:	Mark Delany <sxcg2-fuwxj@qmda.emu.st> [1]
Discussed with:	rwatson [2]
Reviewed by:	rwatson
MFC after:	4 weeks
2009-01-09 21:57:49 +00:00
Adrian Chadd
8696873dae Fix fat-fingered comment.
Noticed-by: julian
2009-01-09 18:38:57 +00:00
Adrian Chadd
cef2729493 Fix indentation; add FALLTHROUGH.
Thanks Max!
2009-01-09 17:21:22 +00:00
Adrian Chadd
4f2e6bfdd8 Better comment what the socket option does. Thanks to Sam Leffler
for suggesting this.
2009-01-09 17:18:17 +00:00
Adrian Chadd
4209e01ad7 Comment some potentially confusing logic.
Nitpicking by: mlaier

MFC after:	2 weeks
2009-01-09 17:16:18 +00:00
Adrian Chadd
be9347e3fe Implement a new IP option (not compiled/enabled by default) to allow
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.

This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.

This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealth with some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent
commmits.

Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.

Obtained from:	Julian Elischer and others
MFC after:	2 weeks
2009-01-09 16:02:19 +00:00
Bjoern A. Zeeb
5ce0eb7f08 Make SIOCGIFADDR and related, as well as SIOCGIFADDR_IN6 and related
jail-aware. Up to now we returned the first address of the interface
for SIOCGIFADDR w/o an ifr_addr in the query. This caused problems for
programs querying for an address but running inside a jail, as the
address returned usually did not belong to the jail.
Like for v6, if there was an ifr_addr given on v4, you could probe
for more addresses on the interfaces that you were not allowed to see
from inside a jail. Return an error (EADDRNOTAVAIL) in that case
now unless the address is on the given interface and valid for the
jail.

PR:		kern/114325
Reviewed by:	rwatson
MFC after:	4 weeks
2009-01-09 13:06:56 +00:00
Hartmut Brandt
c0e9a8a154 Set a minimum of information in the routing message (like version and type)
so that generic routing message parsing code can parse the messages for
L2 info that are retrieved via the sysctl interface.
2009-01-09 10:58:59 +00:00
Randall Stewart
bbb0e3d9d5 Addresses Roberts comments on comments. Also adds
the KASSERT and checks suggested.

Reviewed by:	The udp tunneling was discussed on net@ under the
                thread entitled "Heads up -- Thinking about UDP and tunneling"
2009-01-06 13:27:56 +00:00
Randall Stewart
c7c7ea4b5a Add the ability of an alternate transport protocol
to easily tunnel over udp by providing a hook
function that will be called instead of appending
to the socket buffer.
2009-01-06 12:13:40 +00:00
Robert Watson
a603c811f8 Allow the IP_MINTTL socket option to be set to 0 so that it can be
disabled entirely, which is its default state before set to a
non-zero value.

PR:		128790
Submitted by:	Nick Hilliard <nick at foobar dot org>
MFC after:	3 weeks
2009-01-03 11:35:31 +00:00
Qing Li
dc49549713 Some modules such as SCTP supplies a valid route entry as an input argument
to ip_output(). The destionation is represented in a sockaddr{} object
that may contain other pieces of information, e.g., port number. This
same destination sockaddr{} object may be passed into L2 code, which
could be used to create a L2 entry. Since there exists a L2 table per
address family, the L2 lookup function can make address family specific
comparison instead of the generic bcmp() operation over the entire
sockaddr{} structure.

Note in the IPv6 case the sin6_scope_id is not compared because the
address is currently stored in the embedded form inside the kernel.
The in6_lltable_lookup() has to account for the scope-id if this
storage format were to change in the future.
2009-01-03 00:27:28 +00:00
Bjoern A. Zeeb
42d866dd69 For consistency use LLE_IS_VALID() in this 4th place that is actually
interested in the (void *)-1 return value hack.
This way we can easily identify those special parts of the code.
2008-12-28 21:18:01 +00:00
Qing Li
8eca593c5a This checkin addresses a couple of issues:
1. The "route" command allows route insertion through the interface-direct
   option "-iface". During if_attach(), an sockaddr_dl{} entry is created
   for the interface and is part of the interface address list. This
   sockaddr_dl{} entry describes the interface in detail. The "route"
   command selects this entry as the "gateway" object when the "-iface"
   option is present. The "arp" and "ndp" commands also interact with the
   kernel through the routing socket when adding and removing static L2
   entries. The static L2 information is also provided through the
   "gateway" object with an AF_LINK family type, similar to what is
   provided by the "route" command. In order to differentiate between
   these two types of operations, a RTF_LLDATA flag is introduced. This
   flag is set by the "arp" and "ndp" commands when issuing the add and
   delete commands. This flag is also set in each L2 entry returned by the
   kernel. The "arp" and "ndp" command follows a convention where a RTM_GET
   is issued first followed by a RTM_ADD/DELETE. This RTM_GET request fills
   in the fields for a "rtm" object, which is reinjected into the kernel by
   a subsequent RTM_ADD/DELETE command. The entry returend from RTM_GET
   is a prefix route, so the RTF_LLDATA flag must be specified when issuing
   the RTM_ADD/DELETE messages.

2. Enforce the convention that NET_RT_FLAGS with a 0 w_arg is the
   specification for retrieving L2 information. Also optimized the
   code logic.

Reviewed by:   julian
2008-12-26 19:45:24 +00:00
Kip Macy
5e96c0a13e Fix missed unlock and reference drop of lle
Found by: pho
2008-12-24 05:31:26 +00:00
Bjoern A. Zeeb
f3b28b6bfb Remove long unused netinet/ipprotosw.h (basically since r82884).
Discussed with:		rwatson
MFC after:		4 weeks
2008-12-23 16:52:03 +00:00
Qing Li
ce9122fd3e Don't create a bogus ARP entry for 0.0.0.0. 2008-12-23 03:33:32 +00:00
Qing Li
897d75c98e The proxy-arp code was broken and responds to ARP
requests for addresses that are not proxied locally.
2008-12-19 11:07:34 +00:00
Bjoern A. Zeeb
97590249ad Another step assimilating IPv[46] PCB code:
normalize IN6P_* compat flags usage to their equialent
INP_* counterpart.

Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 13:00:18 +00:00
Bjoern A. Zeeb
dcdb4371ca Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by:	rwatson during review [1]
Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 12:52:34 +00:00
Kip Macy
00a46b3122 default to doing lla_lookup with shared afdata lock and returning a
shared lock on the lle - thus restoring parallel performance to
pre-arpv2 level
2008-12-17 00:14:28 +00:00
Robert Watson
ec313afa3f IPFW's pfil hook/unhook code ignores the return values of pfil_add_hook()
and pfil_remove_hook(), so cast them to (void).

MFC after:	pretty soon
2008-12-16 15:05:35 +00:00
Kip Macy
848552f31f ipfw doesn't use the radix node head lock to protect the radix tree - remove acquisition 2008-12-16 11:06:30 +00:00
Kip Macy
3bb87a6c70 check pointer against NULL
add new line after declaration for style
2008-12-16 03:18:59 +00:00
Kip Macy
86cd829d64 don't unlock lle if it is NULL 2008-12-16 02:48:12 +00:00
Kip Macy
fbc2ca1bef unlock and destroy an llentry's lock before freeing
Found by: sam
2008-12-16 00:20:49 +00:00
Bjoern A. Zeeb
fc384fa5d6 Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with:	rwatson
Reviewed by:	rwatson (version before review requested changes)
MFC after:	4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
Qing Li
6e6b3f7cbc This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
   possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
  the last piece of the puzzle, Kip has also been conducting
  active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
  provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
  me maintaining that branch before the svn conversion
2008-12-15 06:10:57 +00:00
Bjoern A. Zeeb
03d8b6fd1b Add a check, that is currently under discussion for 8 but that we need
to keep for 7-STABLE when MFCing in_pcbladdr() to not change the
behaviour there.

With this a destination route via a loopback interface is treated as
a valid and reachable thing for IPv4 source address selection, even
though nothing of that network is ever directly reachable, but it is
more like a blackhole route.
With this the source address will be selected and IPsec can grab the
packets before we would discard them at a later point, encapsulate them
and send them out from a different tunnel endpoint IP.

Discussed on:	net
Reported by:	Frank Behrens <frank@harz.behrens.de>
Tested by:	Frank Behrens <frank@harz.behrens.de>
MFC after:	4 weeks (just so that I get the mail)
2008-12-14 17:47:33 +00:00
Bjoern A. Zeeb
bccd413962 De-virtualize the MD5 context for TCP initial seq number generation
and make it a function local variable like we do almost everywhere
inside the kernel.

Discussed with:	rwatson, silby
MFC after:	4 weeks
2008-12-13 21:59:18 +00:00
Kip Macy
cdacee3468 version that will compile 2008-12-13 20:34:41 +00:00
Kip Macy
fe6320b468 radix node head lock needs to be held when calling rnh_addaddr 2008-12-13 20:18:05 +00:00
Kip Macy
979245af95 don't acquire lock recursively 2008-12-13 20:16:03 +00:00
Bjoern A. Zeeb
1b193af610 Second round of putting global variables, which were virtualized
but formerly missed under VIMAGE_GLOBAL.

Put the extern declarations of the  virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.

Sponsored by:	The FreeBSD Foundation
2008-12-13 19:13:03 +00:00
Bjoern A. Zeeb
86413abf5f Put a global variables, which were virtualized but formerly
missed under VIMAGE_GLOBAL.

Start putting the extern declarations of the  virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.

While there garbage collect a few dead externs from ip6_var.h.

Sponsored by:	The FreeBSD Foundation
2008-12-11 16:26:38 +00:00
Bjoern A. Zeeb
0750c2ed96 Use the correct INIT_VNET_INET() as the virtualized variable here
are in vinet.h not in vinet6.h

Sponsored by:	The FreeBSD Foundation
2008-12-11 16:05:07 +00:00
Marko Zec
385195c062 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
Robert Watson
cd416355a8 Remove inconsistent white space from in_pcballoc().
MFC after:	pretty soon
2008-12-10 13:24:38 +00:00
Robert Watson
5d04565101 Move syncache flag definitions below data structure, compress some vertical
whitespace.

MFC after:	pretty soon
2008-12-10 11:11:43 +00:00
Robert Watson
c3ce7a790c Move flag definitions for t_flags and t_oobflags below the definition of
struct tcpcb so that the structure definition is a bit more vertically
compact.  Can't yet fit it on one printed page, though.

MFC after:	pretty soon
2008-12-10 11:03:16 +00:00
Kip Macy
65954fda79 unlock when done 2008-12-10 08:23:47 +00:00
Kip Macy
e08ab8576d don't reference if_addr_mtx directly 2008-12-10 08:22:51 +00:00
Robert Watson
0ca989b376 Update comment on INP_TIMEWAIT to say what it's about, as we caution
regarding the misplacement of flags in inp_vflag in an earlier comment.

MFC after:	pretty soon
2008-12-09 23:57:09 +00:00
Robert Watson
d15fb96522 Enhance one comment relating to recent TCP locking changes, and fix a
typo in another.

MFC after:	6 weeks
2008-12-09 15:49:02 +00:00
Robert Watson
a5654bb2ae Move macros defining flags and shortcus to nested structure fields in
inpcbinfo below the structure definition in order to make inpcbinfo
fit on a single printed page; related style tweaks.

MFC after:	pretty soon
2008-12-09 10:21:38 +00:00
Robert Watson
252ca42863 Move from solely write-locking the global tcbinfo in tcp_input()
to read-locking in the TCP input path, allowing greater TCP
input parallelism where multiple ithreads or ithread and netisr
are able to run in parallel.  Previously, most TCP input paths
held a write lock on the global tcbinfo lock, effectively
serializing TCP input.

Before looking up the connection, acquire a write lock if a
potentially state-changing flag is set on the TCP segment header
(FIN, RST, SYN), and otherwise a read lock.  We may later have
to upgrade to a write lock in certain cases (ACKs received by the
syncache or during TIMEWAIT) in order to support global state
transitions, but this is never required for steady-state packets.

Upgrading from a write lock to a read lock must be done as a
trylock operation to avoid deadlocks, and actually violates the
lock order as the tcbinfo lock preceeds the inpcb lock held at
the time of upgrade.  If the trylock fails, we bump the refcount
on the inpcb, drop both locks, and re-acquire in-order.  If
another thread has freed the connection while the locks are
dropped, we free the inpcb and repeat the lookup (this should
hardly ever or never happen in practice).

For now, maintain a number of new counters measuring how many
times various cases execute, and in particular whether various
optimistic assumptions about when read locks can be used, whether
upgrades are done using the fast path, and whether connections
close in practice in the above-described race, actually occur.

MFC after:	6 weeks
Discussed with:	kmacy
Reviewed by:	bz, gnn, kmacy
Tested by:	kmacy
2008-12-08 20:27:00 +00:00
Robert Watson
28696211d6 Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or inpcbrele().  Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released.  This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.

Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfee() is now a simple wrapper around in_pcbrele().

MFC after:	1 month
Discussed with:	bz, kmacy
Reviewed by:	bz, gnn, kmacy
Tested by:	kmacy
2008-12-08 20:18:50 +00:00
Christian S.J. Peron
4e57bc3338 in_rtalloc1(9) returns a locked route, so make sure that we use
RTFREE_LOCKED() here.  This macro makes sure the reference count
on the route is being managed properly.  This elimates another
case which results in the following message being printed to the
console:

rtfree: 0xc841ee88 has 1 refs

Reviewed by:	bz
MFC after:	2 weeks
2008-12-06 19:09:38 +00:00
Randall Stewart
830d754d52 Code from the hack-session known as the IETF (and a
bit of debugging afterwards):
- Fix protection code for notification generation.
- Decouple associd from vtag
- Allow vtags to have less strigent requirements in non-uniqueness.
   o don't pre-hash them when you issue one in a cookie.
   o Allow duplicates and use addresses and ports to
     discriminate amongst the duplicates during lookup.
- Add support for the NAT draft draft-ietf-behave-sctpnat-00, this
  is still experimental and needs more extensive testing with the
  Jason Butt ipfw changes.
- Support for the SENDER_DRY event to get DTLS in OpenSSL working
  with a set of patches from Michael Tuexen (hopefully heading to OpenSSL soon).
- Update the support of SCTP-AUTH by Peter Lei.
- Use macros for refcounting.
- Fix MTU for UDP encapsulation.
- Fix reporting back of unsent data.
- Update assoc send counter handling to be consistent with endpoint sent counter.
- Fix a bug in PR-SCTP.
- Fix so we only send another FWD-TSN when a SACK arrives IF and only
  if the adv-peer-ack point progressed. However we still make sure
  a timer is running if we do have an adv_peer_ack point.
- Fix PR-SCTP bug where chunks were retransmitted if they are sent
  unreliable but not abandoned yet.

With the help of:	Michael Teuxen and Peter Lei :-)
MFC after:	 4 weeks
2008-12-06 13:19:54 +00:00
Gleb Smirnoff
0b476f1cce In a case of CARP status change run through the if_link_state_change()
routine, so that devd(8) and others are notified about link state change.
2008-12-05 14:37:14 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Bjoern A. Zeeb
413628a7e3 MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
  and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
  help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
  suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
  on cluster machines as well as all the testers and people
  who provided feedback the last months on freebsd-jail and
  other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by:	(see above)
MFC after:	3 months (this is just so that I get the mail)
X-MFC Before:   7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
Marko Zec
5c890d3c4f Add an essential .h file that skipped from the last commit (r185419).
Pointy hat #1 on...

Pointed out by:	bz
2008-11-28 23:39:25 +00:00
Marko Zec
f02493cbbd Unhide declarations of network stack virtualization structs from
underneath #ifdef VIMAGE blocks.

This change introduces some churn in #include ordering and nesting
throughout the network stack and drivers but is not expected to cause
any additional issues.

In the next step this will allow us to instantiate the virtualization
container structures and switch from using global variables to their
"containerized" counterparts.

Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-28 23:30:51 +00:00
Dag-Erling Smørgrav
3b6fe5fcd9 missing V_ 2008-11-28 13:13:44 +00:00
Bjoern A. Zeeb
5cd54324ee Replace most INP_CHECK_SOCKAF() uses checking if it is an
IPv6 socket by comparing a constant inp vflag.
This is expected to help to reduce extra locking.

Suggested by:	rwatson
Reviewed by:	rwatson
MFC after:	6 weeks
2008-11-27 13:19:42 +00:00
Bjoern A. Zeeb
6aee2fc550 Merge in6_pcbfree() into in_pcbfree() which after the previous
IPsec change in r185366 only differed in two additonal IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.

Reviewed by:	rwatson (as part of a larger changset)
MFC after:	6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
2008-11-27 12:04:35 +00:00
Bjoern A. Zeeb
6974bd9e75 Unify ipsec[46]_delete_pcbpolicy in ipsec_delete_pcbpolicy.
Ignoring different names because of macros (in6pcb, in6p_sp) and
inp vs. in6p variable name both functions were entirely identical.

Reviewed by:	rwatson (as part of a larger changeset)
MFC after:	6 weeks (*)
(*) possibly need to leave a stub wrappers in 7 to keep the symbols.
2008-11-27 10:43:08 +00:00
Marko Zec
97021c2464 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
Bjoern A. Zeeb
0206cdb846 Remove in6_pcbdetach() as it is exactly the same function
as in_pcbdetach() and we don't need the code twice.

Reviewed by:	rwatson
MFC after:	6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
2008-11-26 20:52:26 +00:00
Bjoern A. Zeeb
a7df09e8c9 Unify the v4 and v6 versions of pcbdetach and pcbfree as good
as possible so that they are easily diffable.

No functional changes.

Reviewed by:	rwatson
MFC after:	6 weeks
2008-11-26 12:54:31 +00:00
Julian Elischer
bc97ba5100 Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from:	Ironport
MFC after:	3 days
2008-11-19 19:19:30 +00:00
Marko Zec
44e33a0758 Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
Randall Stewart
a1e132720b -Improvement: Add '\n' on debug output in sctp_lower_sosend().
-Improvement: panic() on INVARIANTS kernels if memory allocation
 fails for a tagblock in sctp_add_vtag_to_timewait().
-Bugfix: Protect code in sctp_is_in_timewait() by
 SCTP_INP_INFO_WLOCK/SCTP_INP_INFO_WUNLOCK.
-Cleanup: Get rid of unused variable now in sctp_init_asoc().
-Bugfix: Reuse the correct vtag in sctp_add_vtag_to_timewait().
-Cleanup: Get rid of unused constant SCTP_TIME_WAIT_SHORT
 in sctp_constants.h.
-Improvement: Use all hash buckets of the vtag hash table.
-Cleanup: Get rid of then unused constant SCTP_STACK_VTAG_HASH_SIZE_A.
-Bugfix: Handle SHUTDOWN;SACK packet correctly.
-Bugfix: Last TSN in a gap ack block was not being "ack'd"
         in the internal scoreboard.
Obtained from:	(with help from Michael Tuexen)
2008-11-12 14:16:39 +00:00
Bjoern A. Zeeb
687a9b4738 For consistency work on the local object passed into the function for the
lock operation instead using the global name.

Submitted by:	ganbold
MFC after:	2 months
2008-11-09 14:06:44 +00:00
Bjoern A. Zeeb
8e5c87f4b6 Fix typo and while here another one.
Reviewed by:	keramida
Reported by:	keramida
MFC after:	2 months (with r184720)
2008-11-06 16:30:20 +00:00
Bjoern A. Zeeb
91d6cfa6b1 Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.

Reported by:	kmacy
MFC after:	2 months (along with r182851)
2008-11-06 13:25:59 +00:00
Bjoern A. Zeeb
4b3f4d3818 Adopt the comment for tcp_maxmtu(); we are returning a number
not a pointer. While here update the rest of the comment to
better match what we have these days.

MFC after:	2 months
2008-11-06 12:59:00 +00:00
Bjoern A. Zeeb
6f01cac68a Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

In case we return early and got a metricptr to pass the hostcache
info back to the caller we need to initialize the data to a defined
state (zero it) as tcp_hc_get() would do if there was no hit.
Without that the caller would check on random stack garbage which
could lead to undefined results.

This only affected tcp_mss() if there was no routing entry for the peer,
tcp_mtudisc() was not affected.

MFC after:	2 months (along with r182851)
2008-11-06 12:33:33 +00:00
Oleg Bulyzhin
02d09f7901 Type of q_time (start of queue idle time) has changed: uint32_t -> uint64_t.
This should fix q_time overflow, which happens after 2^32/(86400*hz) days of
uptime (~50days for hz = 1000).
q_time overflow cause following:
- traffic shaping may not work in 'fast' mode (not enabled by default).
- incorrect average queue length calculation in RED/GRED algorithm.

NB: due to ABI change this change is not applicable to stable.

PR:		kern/128401
2008-10-28 14:14:57 +00:00
Randall Stewart
73adc48f49 More issues with pre-blocking:
a) Need for EEOR mode to take the min of the socket buffer size and the
    add more threshold, otherwise if you are so silly as to set a send
    buf size less than the add-more you could block forever in eeor mode.

 b) We were incorrectly using the sysctl vs the calculated value. This
    causes us to block forever if the addmore theshold is larger than
    then the socket buffer size.
2008-10-27 14:49:12 +00:00
Randall Stewart
35e4161b1f Two inter-related bugs.
- If we send EXACTLY the size left in the send buffer
    and then send again, we end up with exactly 0 bytes and
    don't hit the pre-block code to wait for more space.
  - If we fall into the loop with our max_len == 0 (the bug
    above) we then call in to copy out the data, setup the length
    of the waiting to transmit data to 0 and call the mbuf copy routine
    which 0 indicates copy all the data to the mbuf chain.. which it
    does. This then leaves a "stuck" message on the stream queue with
    its size exactly 0 bytes but all the data there and thus nothing
    left in the uio structure. We then reach a stuck forever state
    never being able to send data.
2008-10-27 14:01:23 +00:00
Randall Stewart
a4c651183e Get rid of ifdef for vimage on version 8 comparison. Now the
scrubbing program properly takes care of this.
2008-10-27 13:54:54 +00:00
Randall Stewart
83416c885d Invariants changes that make more sense. 2008-10-27 13:53:31 +00:00
Robert Watson
dd8ac7f990 In both dropwithreset paths in tcp_input.c, drop the tcbinfo lock
sooner to decomplicate locking and eliminate the need for a rather
chatty comment about why we have to handle the global lock in a
special way for the benefit of ipfw and pf cred rules.

MFC after:	3 days
2008-10-26 22:03:52 +00:00
Robert Watson
4c95fd23d6 Remove endearing but syntactically unnecessary "return;" statements
directly before the final closeing brackets of some TCP functions.

MFC after:	3 days
2008-10-26 19:33:22 +00:00
Bjoern A. Zeeb
460473a071 Style changes only:
- Consistently add parentheses to return statements.
 - Use NULL instead of 0 when comparing pointers, also avoiding
   unnecessary casts.
 - Do not use pointers as booleans.

Reviewed by:	rwatson (earlier version)
MFC after:	2 months
2008-10-26 19:17:25 +00:00
Dag-Erling Smørgrav
e11e3f187d Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.
2008-10-23 20:26:15 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Bjoern A. Zeeb
7e1bc2729c Update a comment which to my reading had been misplaced in rev. 1.12
already (but probably had been way above as the code was there twice)
and describe what was last changed in rev. 1.199 there (which now is
in sync with in6_src.c r184096).

Pointed at by:	mlaier
MFC after:	2 mmonths
2008-10-20 18:56:00 +00:00
Bjoern A. Zeeb
dc3c09c89f Bring over the change switching from using sequential to random
ephemeral port allocation as implemented in netinet/in_pcb.c rev. 1.143
(initially from OpenBSD) and follow-up commits during the last four and
a half years including rev. 1.157, 1.162 and 1.199.
This now is relying on the same infrastructure as has been implemented
in in_pcb.c since rev. 1.199.

Reviewed by:	silby, rpaulo, mlaier
MFC after:	2 months
2008-10-20 18:43:59 +00:00
Randall Stewart
1b9f62a044 The flags value was not always being copied out in the recv routine like it
should be.
Obtained from:	Michael Tuexen
2008-10-18 15:56:52 +00:00
Randall Stewart
ac29704161 New sockets (accepted) were not inheriting the proper snd/rcv buffer value.
Obtained from:	 Michael Tuexen
2008-10-18 15:56:12 +00:00
Randall Stewart
1862b24533 - Peers rwnd is now available for the MIB.
Obtained from:	Michael Tuexen
2008-10-18 15:55:15 +00:00
Randall Stewart
fc69c30240 - Adapt layer indication was always being given (it should only
be given when the user has enabled it). (Michael Tuexen)
- Sack Immediately was not being set properly on the actual chunk, it
  was only put in the rcvd_flags which is incorrect. (Michael Tuexen)
- added an ifndef userspace to one of the already present macro's for
  inet (Brad Penoff)
Obtained from:	Michael Tuexen and Brad Penoff
MFC after:	4 weeks
2008-10-18 15:54:25 +00:00
Randall Stewart
fcea7c2ed3 Reported by Yehuda Weinraub (yehudasa@gamil.com) - CRC32C algorithm
uses incorrect init_bytes value. It SHOULD have the number
of bytes to get to a 4 byte boundary.

PR:	128134
MFC after:	4 weeks
2008-10-18 15:53:31 +00:00
Bjoern A. Zeeb
f08ef6c595 Add cr_canseeinpcb() doing checks using the cached socket
credentials from inp_cred which is also available after the
socket is gone.
Switch cr_canseesocket consumers to cr_canseeinpcb.
This removes an extra acquisition of the socket lock.

Reviewed by:	rwatson
MFC after:	3 months (set timer; decide then)
2008-10-17 16:26:16 +00:00
Marko Zec
3ff0b2135b Remove a useless global static variable.
Approved by:	bz (ad-hoc mentor)
2008-10-16 12:31:03 +00:00
Maxim Konovalov
0279bb29a0 o Remove unnecessary parentheses and restore identation.
Prodded by:	mlaier
2008-10-14 17:47:29 +00:00
Maxim Konovalov
8e6c0f8cfd o Reformat ipfw nat get|setsockopt code to look it more
style(9) compliant.  No functional changes.
2008-10-14 12:26:55 +00:00
Robert Watson
1f6ef666b5 Fix content and spelling of comment on _ipfw_insn.len -- a count of
32-bit words, not 32-byte words.

MFC after:	3 days
2008-10-10 14:33:47 +00:00
Robert Watson
6c8286e42d Don't pass curthread to sbreserve_locked() in tcp_do_segment(), as the
netisr or ithread's socket buffer size limit is not the right limit to
use.  Instead, pass NULL as the other two calls to sbreserve_locked()
in the TCP input path (tcp_mss()) do.

In practice, this is a no-op, as ithreads and the netisr run without a
process limit on socket buffer use, and a NULL thread pointer leads to
not using the process's limit, if any.  However, if tcp_input() is
called in other contexts that do have limits, this may prevent the
incorrect limit from being used.

MFC after:	3 days
2008-10-07 09:41:07 +00:00
Bjoern A. Zeeb
c6ddb94cf2 Remove an INP_RUNLOCK() missed in SVN r183606, cvs rev. 1.195 raw_ip.c
when transitioning from so_cred to inp_cred.

MFC after:	6 weeks
2008-10-04 16:48:09 +00:00
Bjoern A. Zeeb
86d02c5c63 Cache so_cred as inp_cred in the inpcb.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.

Suggested by:	rwatson
Reviewed by:	rwatson
MFC after:	6 weeks
X-MFC Note:	use a inp_pspare for inp_cred
2008-10-04 15:06:34 +00:00
Bjoern A. Zeeb
0895aec30c Implement IPv4 source address selection for unbound sockets.
For the jail case we are already looping over the interface addresses
before falling back to the only IP address of a jail in case of no
match. This is in preparation for the upcoming multi-IPv4/v6/no-IP
jail patch this change was developed with initially.

This also changes the semantics of selecting the IP for processes within
a jail as it now uses the same logic as outside the jail (with additional
checks) but no longer is on a mutually exclusive code path.

Benchmarks had shown no difference at 95.0% confidence for neither the
plain nor the jail case (even with the additional overhead).  See:
http://lists.freebsd.org/pipermail/freebsd-net/2008-September/019531.html

Inpsired by a patch from:	Yahoo! (partially)
Tested by:			latest multi-IP jail patch users (implictly)
Discussed with:			rwatson (general things around this)
Reviewed by:			mostly silence (feedback from bms)
Help with benchmarking from:	kris
MFC after:			2 months
2008-10-03 12:21:21 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
Robert Watson
c0a211c51f Expand comments relating various detach/free/drop inpcb routines.
MFC after:	3 days
2008-09-29 13:50:17 +00:00
Robert Watson
fc18af966f Fix typo in comment.
MFC after:	3 days
2008-09-29 13:48:48 +00:00
Robert Watson
47505890d6 When an inpcb doesn't have a socket but the inpcb is passed to ipfw
in the transmit path, such as TCPS_TIMEWAIT, fail the credential
extraction immediately rather than acquiring locks and looking up
the inpcb on the global lists in order to reach the conclusion that
the credential extraction has failed.

This is more efficient, but more importantly, it avoids lock
recursion on the inpcbinfo, which is no longer allowed with rwlocks.
This appears to have been responsible for at least two reported
panics.

MFC after:	3 days
Reported by:	ganbold
2008-09-27 19:28:28 +00:00
Robert Watson
d83412e791 Rather than shadowing global variable 'lookup' in check_uidgid(), rename
it to ugid_lookupp.  This should make debugging issues with ipfw uid
rules easier.

MFC after:	3 days
2008-09-27 10:14:02 +00:00
Ed Maste
d2035ffb7a Move CTASSERT from header file to source file, per implementation note now
in the CTASSERT man page.

Submitted by:	Ryan Stone
2008-09-26 18:30:11 +00:00
Robert Watson
014ea782b1 As a follow-on to r183323, correct another case where ip_output() was
called without an inpcb pointer despite holding the tcbinfo global
lock, which lead to a deadlock or panic when ipfw tried to further
acquire it recursively.

Reported by:    Stefan Ehmann <shoesoft at gmx dot net>
MFC after:      3 days
2008-09-25 17:26:54 +00:00
Robert Watson
a0ca087183 When dropping a packet and issuing a reset during TCP segment handling,
unconditionally drop the tcbinfo lock (after all, we assert it lines
before), but call tcp_dropwithreset() under both inpcb and inpcbinfo
locks only if we pass in an tcpcb.  Otherwise, if the pointer is NULL,
firewall code may later recurse the global tcbinfo lock trying to look
up an inpcb.

This is an instance where a layering violation leads not only
potentially to code reentrace and recursion, but also to lock
recursion, and was revealed by the conversion to rwlocks because
acquiring a read lock on an rwlock already held with a write lock is
forbidden.  When these locks were mutexes, they simply recursed.

Reported by:	Stefan Ehmann <shoesoft at gmx dot net>
MFC after:	3 days
2008-09-24 11:07:03 +00:00
Roman Kurakin
f7b5554eb7 Export IPFW_TABLES_MAX value for compiled in defaults. 2008-09-21 20:42:42 +00:00
Roman Kurakin
6b057f1b5e Export IPFW_TABLES_MAX via sysctl. Part of PR: 127058.
PR:		127058
2008-09-14 09:24:12 +00:00
Julian Elischer
de34ad3f4b oops commit the version that compiles 2008-09-14 08:24:45 +00:00
Julian Elischer
93fcb5a28d Revert a part of the MRT commit that proved un-needed.
rt_check() in its original form proved to be sufficient and
rt_check_fib() can go away (as can its evil twin in_rt_check()).

I believe this does NOT address the crashes people have been seeing
in rt_check.

MFC after:	1 week
2008-09-14 08:19:48 +00:00
Roman Kurakin
eb29d14ccb Make the commet for the default rule number more clear.
Submitted by:	yar@
2008-09-14 06:14:06 +00:00
Bjoern A. Zeeb
3418daf2f1 Implement IPv6 support for TCP MD5 Signature Option (RFC 2385)
the same way it has been implemented for IPv4.

Reviewed by:	bms (skimmed)
Tested by:	Nick Hilliard (nick netability.ie) (with more changes)
MFC after:	2 months
2008-09-13 17:26:46 +00:00
Bjoern A. Zeeb
c10eb6d10a Work around an integer division resulting in 0 and thus the
congestion window not being incremented, if cwnd > maxseg^2.
As suggested in RFC2581 increment the cwnd by 1 in this case.

See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf
for more details.

Submitted by:	Alana Huebner, Lawrence Stewart,
		Grenville Armitage (caia.swin.edu.au)
Reviewed by:	dwmalone, gnn, rpaulo
MFC After:	3 days
2008-09-09 07:35:21 +00:00
Bjoern A. Zeeb
00db174bc2 To my reading there are no real consumers of ip6_plen (IPv6
Payload Length) as set in tcpip_fillheaders().
ip6_output() will calculate it based of the length from the
mbuf packet header itself.
So initialize the value in tcpip_fillheaders() in correct
(network) byte order.

With the above change, to my reading, all places calling tcp_trace()
pass in the ip6 header via ipgen as serialized in the mbuf and with
ip6_plen in network byte order.
Thus convert the IPv6 payload length to host byte order before printing.

MFC after:	2 months
2008-09-07 20:44:45 +00:00
Bjoern A. Zeeb
3cee92e074 Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR:		kern/118455
Reviewed by:	silby (back in March)
MFC after:	2 months
2008-09-07 18:50:25 +00:00
Bjoern A. Zeeb
ebe5426934 V_irtualize SVN r182846 tcp_mssdflt/tcp_v6mssdflt procedure based
sysctl implementations for VIMAGE the same way we did elsewhere:
update the implementation but leave the globals and the SYSCTL
statement untouched.
2008-09-07 15:20:21 +00:00
Bjoern A. Zeeb
4cdf3bedf3 Convert SYSCTL_INTs for tcp_mssdflt and tcp_v6mssdflt to
SYSCTL_PROCs and check that the default mss for neither v4 nor
v6 goes below the minimum MSS constant (216).

This prevents people from shooting themselves in the foot.

PR:		kern/118455 (remotely related)
Reviewed by:	silby (as part of a larger patch in March)
MFC after:	2 months
2008-09-07 14:44:55 +00:00
Bjoern A. Zeeb
c4982fae59 Add a second KASSERT checking for len >= 0 in the tcp output path.
This is different to the first one (as len gets updated between those
two) and would have caught various edge cases (read bugs) at a well
defined place I had been debugging the last months instead of
triggering (random) panics further down the call graph.

MFC after:	2 months
2008-09-07 11:38:30 +00:00
Roman Kurakin
8191aa7c0b Export the IPFW_DEFAULT_RULE outside ip_fw2.c. This number in not only
the default rule number but also the maximum rule number.  User space
software such as ipfw and natd should be aware of its value.  The
software that already includes ip_fw.h should use the defined value.  All
other a expected to use sysctl (as discussed on net@).

MFC after: 5 days.
Discussed on: net@
2008-09-06 16:47:07 +00:00
Giorgos Keramidas
57a5a46e00 Slightly reword comment and remove typos. 2008-09-05 01:36:30 +00:00
Julian Elischer
9c6b07a695 whitespace nit 2008-09-03 18:09:15 +00:00
Brooks Davis
0eb7cf4d5f Wrap an 81 column SYSCTL_NODE decleration.
Obtained from:	//depot/projects/vimage-commit2/...
2008-09-01 19:25:27 +00:00
Kip Macy
7c80e4f37f Don't check if an interface can do tcp offload if there are no offload devices registered on the system.
Suggested by: rwatson
MFC after:	3 days
2008-09-01 05:30:22 +00:00
Julian Elischer
22b55ba9a0 fix tiny nti in comment 2008-08-31 18:54:35 +00:00
Christian S.J. Peron
8751c5bac8 Improve the entropy of the source port randomization for network address
translation.  It turns out this is useful for applications which require
source port randomization for security (i.e. dns servers).

Discussed with:	secteam
Requested by:	mlaier
MFC after:	2 weeks
2008-08-30 20:58:34 +00:00
George V. Neville-Neil
e4762f75c3 Fix a bug whereby multicast packets that are looped back locally
wind up with the incorrect checksum on the wire when transmitted via
devices that do checksum offloading.

PR:		kern/119635
Reviewed by:	rwatson
MFC after:	5 days
2008-08-29 20:42:58 +00:00
Rui Paulo
003c7e36b2 Fix typo in comment. 2008-08-28 21:55:40 +00:00
Randall Stewart
df4ad1fd93 ok, non static the function and put in the .h so
when we do INVARANT compile the compiler will not
dis the function that is not used. Hmm maybe I should have
made it ifndef INVARIANTs..
2008-08-28 20:31:24 +00:00
Randall Stewart
e1bfc4d739 Fixes compile error when INVARIANTs is on. Adds an
empty goto to keep the compiler happy.
2008-08-28 20:14:07 +00:00
Randall Stewart
df6e0cc37d - Make strict-sacks be the default.
- Change it so that without INVARIANTs there are
  no panics in SCTP.
- sctp_timer changes so that we have a recovery mechanism
  when the sent list is out of order.
2008-08-28 09:44:07 +00:00
Christian S.J. Peron
f440aeea83 Fix a panic in MAC kernels that was a result of un-initialized label
storage.  We can safely remove the label copying operations since
M_MOVE_PKTHDR will move the mbuf tags (which contain MAC labels) to
the destination mbuf.

MFC after:	1 week
Discussed with:	rwatson
2008-08-27 23:52:03 +00:00
Randall Stewart
4a16c2c883 - When we close a socket with pending assoc's that are still
shutting down, NULL out the socket pointer so we won't
  ever refer to a dead socket.

Obtained from: Neil Wilson
2008-08-27 13:13:35 +00:00
Julian Elischer
2c0d658fca Another missed V_ instance 2008-08-25 05:57:56 +00:00
Julian Elischer
b53c8130e5 Another V_ forgotten 2008-08-25 05:49:16 +00:00
Julian Elischer
576c43c844 We left out V_static_len from ip_fw2.c
(also a whitespace diff that i'd rahter fix her ethan break in the
vimage branch.)
2008-08-25 05:38:18 +00:00
Julian Elischer
e0306e8be7 Move some struct defs around. This is a prep step for Vimage.A
No real effect of this at this time.
2008-08-25 00:33:30 +00:00
Bjoern A. Zeeb
ad27dca959 Make the kernel compile with SCTP and SCTP_DEBUG but
no INET6 defined.
2008-08-24 18:29:22 +00:00
Kip Macy
4570959392 Don't calculate checksum if it has already been validated
Obtained from:	Chelsio Inc.
MFC after:	3 days
2008-08-24 02:31:09 +00:00
Bjoern A. Zeeb
c06f087ccb Cache the cred locally in _syncache_add() while holding the locks, so
we can be sure that it's valid.
In case we abort early free it again else put it into the syncache.

We need the cred in the syncache to be able to restrict what will be
exportet by the sysctl helper function syncache_pcblist() (to netstat)
within jails.

PR:		kern/126493
Reviewed by:	rwatson (earlier versions)
MFC after:	3 days
2008-08-23 14:22:12 +00:00
Bjoern A. Zeeb
bb580846dc Add an explicit comment why we NULLify the two variables.
Reviewed by:	rwatson
MFC after:	3 days
2008-08-23 12:27:18 +00:00
Robert Watson
5060346d0b Remove comments and #ifdef notyet'd code relating to directly dispatching
the IP multicast input code from the output path; we don't allow
reentrance of the input path from the IP output path, it must use the
netisr due to potential lock recursion.

MFC after:	3 days
2008-08-21 17:24:49 +00:00
Julian Elischer
5ed3800e41 Fix some of the formatting fixes.. It's amazing how some thing stand out
in a commit message.
2008-08-20 01:24:55 +00:00
Julian Elischer
ac957cd271 A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
Philip Paeps
80b11ee46a Fix ARP in bridging scenarios where the bridge shares its
MAC address with one of its members (see my r180140).

Pointy hat to:	philip
Submitted by:	Eygene Ryabinkin <rea-fbsd@codelabs.ru>
MFC after:	3 days
2008-08-18 09:06:11 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Bjoern A. Zeeb
48d48eb980 Fix a regression introduced in r179289 splitting up ip6_savecontrol()
into v4-only vs. v6-only inp_flags processing.
When ip6_savecontrol_v4() is called from ip6_savecontrol() we
were not passing back the **mp thus the information will be missing
in userland.
Istead of going with a *** as suggested in the PR we are returning
**mp now and passing in the v4only flag as a pointer argument.

PR:		kern/126349
Reviewed by:	rwatson, dwmalone
2008-08-16 06:39:18 +00:00
Dag-Erling Smørgrav
c3a7b734ad Nit 2008-08-09 11:28:57 +00:00
Robert Watson
5cb2685a59 Minor white space tweaks.
MFC after:	1 week
2008-08-07 09:06:04 +00:00
Robert Watson
72bed08287 Correct comment typo.
MFC after:	1 week (after inpcb rwlocking)
2008-08-07 09:03:51 +00:00
John Baldwin
aa91bee2dc Minor style tweaks. 2008-08-05 21:59:20 +00:00
Julian Elischer
711ca7efbb The IPFW code accepts the use of the tablearg keyword along with the skipto
keyword. But it doesn't work. Two options.. make it no longer accept it,
or actually make it work.. I chose the 2nd..

Allow the tablearg to be used to specify a skipto destination.

This is actually a very powerful construct if used correctly, or a sink
of cpu cycles if used badly.

changes t teh man page will follow.
2008-08-01 22:21:03 +00:00
Rui Paulo
f2512ba12a MFp4 (//depot/projects/tcpecn/):
TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
  TCP ECN is defined in RFC 3168.

Partly reviewed by:	dwmalone, silby
Obtained from:		NetBSD
2008-07-31 15:10:09 +00:00
Randall Stewart
6d9e8f2b3a Adds support for the SCTP_PORT_REUSE option
Fixes a refcount bug found in the process

Obtained from:	With the help of Michael Tuexen
2008-07-31 11:08:30 +00:00
Randall Stewart
52baa64a19 Fix build breakage - kthread_exit() in 8 now has no arguments
MFC after:	1 week
2008-07-29 09:30:50 +00:00
Randall Stewart
d6af161a34 - Out with some printfs.
- Fix a initialization of last_tsn_used
- Fix handling of mapped IPv4 addresses
Obtained from:	Michael Tuexen and I :-)
MFC after:	1 week
2008-07-29 09:06:35 +00:00
Alexander Motin
18f401c664 Some style and assertion fixes to the previous commits hinted by rwatson.
There is no functional changes.
2008-07-28 06:57:28 +00:00
Alexander Motin
d185578a78 According to in_pcb.h protocol binding information has double locking.
It allows access it while list travercing holding only global pcbinfo lock.
2008-07-27 20:48:22 +00:00
Alexander Motin
e2ed8f3514 Increase UDBHASHSIZE from 16 to 128 items.
Previous value was chosen 10 years ago and not very effective now.
This change gives several percents speedup on 1000 L2TP mpd links.
2008-07-26 23:07:34 +00:00
Alexander Motin
0ca3b0967b According to in_pcb.h protocol binding information has double locking.
It allows access it while list travercing holding only global pcbinfo lock.
This relaxed locking noticably increses receive socket lookup performance.
2008-07-26 21:12:00 +00:00
Alexander Motin
9ed324c9a5 Add hash table lookup for a fully connected raw sockets.
This gives significant performance improvements when many raw sockets used.
Benchmarks of mpd handeling 1000 simultaneous PPTP connections show up to 50%
performance boost. With higher number of connections benefit becomes even
bigger. PopTop snd others should also get some benefits.
2008-07-26 17:32:15 +00:00
Tai-hwa Liang
df9cf830d1 Trying to fix compilation bustage:
- removing 'const' qualifier from an input parameter to conform to the type
  required by rw_assert();
- using in_addr->s_addr to retrive 32 bits address value.

Observed by:	tinderbox
2008-07-22 04:23:57 +00:00
Kip Macy
9d29c635da make new accessor functions consistent with existing style 2008-07-21 22:11:39 +00:00
Kip Macy
84330faa64 - Switch to INP_WLOCK macro from inp_wlock
- calling sodisconnect after tcp_twstart is both gratuitous and unsafe - remove

Submitted by:	rwatson
2008-07-21 21:22:56 +00:00
Kip Macy
b1f8bd6464 Add versions of tcp_twstart, tcp_close, and tcp_drop that hide the acquisition the tcbinfo lock.
MFC after:	1 week
2008-07-21 02:23:02 +00:00
Kip Macy
409d8ba5c7 add interface for external consumers to syncache_expand - rename syncache_add in a manner consistent with other bits intended for offload 2008-07-21 02:11:06 +00:00
Kip Macy
dd0e6c383a Add accessor functions for socket fields.
MFC after:	1 week
2008-07-21 00:49:34 +00:00
Kip Macy
9378e4377f add inpcb accessor functions for fields needed by TOE devices 2008-07-21 00:08:34 +00:00
Tom Rhodes
41698ebf5b Document a few sysctls.
Reviewed by:	rwatson
2008-07-20 15:29:58 +00:00
Bjoern A. Zeeb
8699ea087e ia is a pointer thus use NULL rather then 0 for initialization and
in comparisons to make this more obvious.

MFC after:	5 days
2008-07-20 12:31:36 +00:00
Kip Macy
b1bc0b2a86 remove unused toedev functions and add comments for rest 2008-07-20 02:02:50 +00:00
David Malone
744eaff7e6 Add an accept filter for TCP based DNS requests. It waits until the
whole first request is present before returning from accept.
2008-07-18 14:44:51 +00:00
Robert Watson
3b19fa3597 Eliminate use of the global ripsrc which was being used to pass address
information from rip_input() to rip_append().  Instead, pass the source
address for an IP datagram to rip_append() using a stack-allocated
sockaddr_in, similar to udp_input() and udp_append().

Prior to the move to rwlocks for inpcbinfo, this was not a problem, as
use of the global was synchronized using the ripcbinfo mutex, but with
read-locking there is the potential for a race during concurrent
receive.

This problem is not present in the IPv6 raw IP socket code, which
already used a stack variable for the address.

Spotted by:	mav
MFC after:	1 week (before inpcbinfo rwlock changes)
2008-07-18 10:47:07 +00:00
Robert Watson
ca528788b8 Fix error in comment.
MFC after:	3 weeks
2008-07-16 10:55:50 +00:00
Robert Watson
43cc0bc1df Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:

- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an incpcb read lock in udp_output(), which stablizes the
  local inpcb address and port bindings in order to determine what further
  locking is required:
  - If the inpcb is currently not bound (at all) and are implicitly
    connecting, we require inpcbinfo and inpcb write locks, so drop the
    read lock and re-acquire.
  - If the inpcb is bound for at least one of the port or address, but an
    explicit source or destination is requested, trylock the inpcbinfo
    lock, and if that fails, drop the inpcb lock, lock the global lock,
    and relock the inpcb lock.
  - Otherwise, no further locking is required (common case).
- Update comments.

In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack.  This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.

The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a weeks in the tree.

Tested by:	ps, kris (older revision), bde
MFC after:	3 weeks
2008-07-15 15:38:47 +00:00
Rui Paulo
b27227029b Fix commment in typo.
M    tcp_output.c
2008-07-15 10:32:35 +00:00
Ermal Luçi
7972c979c5 Fix carp(4) panics that can occur during carp interface configuration.
Approved by:	mlaier (mentor)
Reported by:	Scott Ullrich
MFC after:	1 week
2008-07-14 20:11:51 +00:00
Robert Watson
3144b7d3d3 Slightly rearrange validation of UDP arguments and jail processing in
udp_output() so that argument validation occurs before jail processing.

Add additional comments explaining what's going on when we process
addresses and binding during udp_output().

MFC after:	3 weeks
2008-07-10 16:20:18 +00:00
Bjoern A. Zeeb
078b704233 Pass the ucred along into in{,6}_pcblookup_local for upcoming
prison checks.

Reviewed by:	rwatson
2008-07-10 13:31:11 +00:00
Bjoern A. Zeeb
cdcb11b92c For consistency take lport as u_short in in{,6}_pcblookup_local.
All callers either pass in an u_short or u_int16_t.

Reviewed by:	rwatson
2008-07-10 13:23:22 +00:00
Robert Watson
1175d9d56d Apply the MAC label to an outgoing UDP packet when other inpcb properties are
processed, meaning that we avoid the cost of MAC label assignment if we're
going to drop the packet due to mbuf exhaustion, etc.

MFC after:	3 weeks
2008-07-10 09:45:28 +00:00
Bjoern A. Zeeb
e5cf427baf For consistency with the rest of the function use the locally cached
pointer pcbinfo rather than inp->inp_pcbinfo.

MFC after:	3 weeks
2008-07-09 19:03:06 +00:00
Randall Stewart
fc14de76f4 1) Adds the rest of the VIMAGE change macros
2) Adds some __UserSpace__ on some of the common defines that
   the user space code needs
3) Fixes a bug when we send up data to a user that failed. We
   need to a) trim off the data chunk headers, if present, and
   b) make sure the frag bit is communicated properly for the
   msgs coming off the stream queues... i.e. we see if some
   of the msg has been taken.

Obtained from:	jeli contributed the VIMAGE changes on this pass Thanks Julain!
2008-07-09 16:45:30 +00:00
Robert Watson
7b709f8ad4 Provide some initial chicken-scratching annotations of locking for
struct inpcb.

Prodded by:	bz
MFC after:	3 days
2008-07-08 17:22:59 +00:00
Robert Watson
ac9ae27991 Allow udp_notify() to accept read, as well as write, locks on the passed
inpcb.  When directly invoking udp_notify() from udp_ctlinput(), acquire
only a read lock; we may still see write locks in udp_notify() as the
in_pcbnotifyall() routine is shared with TCP and always uses a write lock
on the inpcb being notified.

MFC after:	1 month
2008-07-07 12:27:55 +00:00
Robert Watson
c4d585aefe Add additional udbinfo and inpcb locking assertions to udp_output(); for
some code paths, global or inpcb write locks are required, but for other
code paths, read locks or no locking at all are sufficient for the data
structures.

MFC after:	1 month
2008-07-07 12:14:10 +00:00
Robert Watson
948d0fc926 First step towards parallel transmit in UDP: if neither a specific
source or a specific destination address is requested as part of a send
on a UDP socket, read lock the inpcb rather than write lock it.  This
will allow fully parallel transmit down to the IP layer when sending
simultaneously from multiple threads on a connected UDP socket.

Parallel transmit for more complex cases, such as when sendto(2) is
invoked with an address and there's already a local binding, will
follow.

MFC after:	1 month
2008-07-07 10:56:55 +00:00
Robert Watson
10cc62b7a6 Drop read lock on udbinfo earlier during delivery to the last matching
UDP socket for a datagram; the inpcb read lock is sufficient to provide
inpcb stability during udp_append().

MFC after:	1 month
2008-07-07 09:26:52 +00:00
Robert Watson
cec9ffee22 Rename raw_append() to rip_append(): the raw_ prefix is generally used
for functions in the generic raw socket library (raw_cb.c, raw_usrreq.c),
and they are not used for IPv4 raw sockets.

MFC after:	3 days
2008-07-05 18:55:03 +00:00
Robert Watson
0ae76120da Improve approximation of style(9) in raw socket code. 2008-07-05 18:03:39 +00:00
Oleksandr Tymoshenko
06a37c4203 Enqueue de-capsulated packet instead of performing direct dispatch. It's
possible to exhaust and garble stack with a packet that contains a couple
of hundreds nested encapsulation levels.

Submitted by:   Ming Fu <fming@borderware.com>
Reviewed by:    rwatson
PR:             kern/85320
2008-07-04 21:01:30 +00:00
Robert Watson
59dd72d040 Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred).  Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.

Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired.  This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acqusition of Giant and any calling
context.

It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.

Reviewed by:	bz
MFC after:	1 month
X-MFC note:	We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
		but the rest can go back.
2008-07-04 00:21:38 +00:00
Bjoern A. Zeeb
62ee136457 Remove a bogusly introduced rtalloc_ign() in rev. 1.335/SVN 178029,
generating an RTM_MISS for every IP packet forwarded making user space
routing daemons unhappy.

PR:		kern/123621, kern/124540, kern/122338
Reported by:	Paul <paul gtcomm.net>, Mike Tancsa <mike sentex.net> on net@
Tested by:	Paul and Mike
Reviewed by:	andre
MFC after:	3 days
2008-07-03 12:44:36 +00:00
Robert Watson
5df3e83946 Add soreceive_dgram(9), an optimized socket receive function for use by
datagram-only protocols, such as UDP.  This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.

This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.

Tested by:	kris, ps
2008-07-02 23:23:27 +00:00
Robert Watson
119d85f6e0 In udp_append() and udp_input(), make use of read locking on incpbs
rather than write locking: while we need to maintain a valid reference
to the inpcb and fix its state, no protocol layer state is modified
during an IPv4 UDP receive -- there are only changes at the socket
layer, which is separately protected by socket locking.

While parallel concurrent receive on a single UDP socket is currently
relatively unusual, introducing read locking in the transmit path,
allowing concurrent receive and transmit, will significantly improve
performance for loads such as BIND, memcached, etc.

MFC after:	2 months
Tested by:	gnn, kris, ps
2008-06-30 18:26:43 +00:00
Oleksandr Tymoshenko
cf77b84879 In case of interface initialization failure remove struct in_ifaddr* from
in_ifaddrhashtbl in in_ifinit because error handler in in_control removes
entries only for AF_INET addresses. If in_ifinit is called for the cloned
inteface that has just been created its address family is not AF_INET and
therefor LIST_REMOVE is not called for respective LIST_INSERT_HEAD and
freed entries remain in in_ifaddrhashtbl and lead to memory corruption.

PR:	kern/124384
2008-06-24 13:58:28 +00:00
Alexander Motin
48ca67bea6 Partially revert previous commit. DeleteLink() does not deletes permanent
links so we should be aware of it and try to delete every link only once
or we will loop forever.
2008-06-22 11:39:42 +00:00
Alexander Motin
ea29dd9241 Implement UDP transparent proxy support.
PR:		bin/54274
Submitted by:	Nicolai Petri <nicolai@petri.cc>
2008-06-21 20:18:57 +00:00
Alexander Motin
b46d3e21bb Add support for PORT/EPRT FTP commands in lowercase.
Use strncasecmp() instead of huge local implementation to reduce code size.
Check space presence after command/code.

PR:		kern/73034
2008-06-21 16:22:56 +00:00
Stephan Uphoff
606a2669cf Change incorrect stale cookie detection in syncookie_lookup() that prematurely
declared a cookie as expired.

Reviewed by:	andre@, silby@
Reported by:    Yahoo!
2008-06-16 20:08:22 +00:00
Stephan Uphoff
104ac85378 Fix a check in SYN cache expansion (syncache_expand()) to accept packets that arrive in the receive window instead of just on the left edge of the receive window.
This is needed for correct behavior when packets are lost or reordered.

PR:	kern/123950
Reviewed by:	andre@, silby@
Reported by:	Yahoo!, Wang Jin
MFC after:	1 week
2008-06-16 19:56:59 +00:00
Randall Stewart
97a7b90ff3 More prep for Vimage:
- only one functino to destroy an SCTP stack sctp_finish()
 - Make it so this function also arranges for any threads
   created by the image to do a kthread_exit()
2008-06-15 12:31:23 +00:00
Randall Stewart
9b02321796 - Fixes foobar on my part. Some missing virtualization macros from
specific logging cases.
2008-06-14 13:24:49 +00:00
Randall Stewart
b3f1ea41fd - Macro-izes the packed declaration in all headers.
- Vimage prep - these are major restructures to move
  all global variables to be accessed via a macro or two.
  The variables all go into a single structure.
- Asconf address addition tweaks (add_or_del Interfaces)
- Fix rwnd calcualtion to be more conservative.
- Support SACK_IMMEDIATE flag to skip delayed sack
  by demand of peer.
- Comment updates in the sack mapping calculations
- Invarients panic added.
- Pre-support for UDP tunneling (we can do this on
  MAC but will need added support from UDP to
  get a "pipe" of UDP packets in.
- clear trace buffer sysctl added when local tracing on.

Note the majority of this huge patch is all the vimage prep stuff :-)
2008-06-14 07:58:05 +00:00
Jack F Vogel
6c5087a818 Add generic TCP LOR into netinet 2008-06-11 22:12:50 +00:00
Max Laier
1ead26d4e1 Sort IP addresses before hashing them for the signature. Otherwise carp is
sensitive to address configuration order.

PR:		kern/121574
Reported by:	Douglas K. Rand, Wouter de Jong
Obtained from:	OpenBSD (rev 1.114 + fixes)
MFC after:	2 weeks
2008-06-02 18:58:07 +00:00
Robert Watson
53640b0e3a When allocating temporary storage to hold a TCP/IP packet header
template, use an M_TEMP malloc(9) allocation rather than an mbuf
with mtod(9) and dtom(9).  This eliminates the last use of
dtom(9) in TCP.

MFC after:	3 weeks
2008-06-02 14:20:26 +00:00
Alexander Motin
ef30318ee9 Increase LINK_TABLE_OUT_SIZE from 101 to 4001 like LINK_TABLE_IN_SIZE
to reduce performance degradation under heavy outgoing scan/flood.
Scalability is now much more important then several kilobytes of RAM.

Remove unneded TCP-specific expiration handeling. Before this connected
TCP sessions could never expire. Now connected TCP sessions will expire
after 24hours of inactivity.

Simplify HouseKeeping() to avoid several mul/div-s per packet. Taking into
account increased LINK_TABLE_OUT_SIZE, precision is still much more then
required.
2008-06-01 18:34:58 +00:00
Alexander Motin
efc66711f9 Make m_megapullup() more intelligent:
- to increase performance do not reallocate mbuf when possible,
 - to support up to 16K packets (was 2K max) use mbuf cluster of proper size.
This change depends on recent ng_nat and ip_fw_nat changes.
2008-06-01 17:52:40 +00:00
Alexander Motin
1913488d10 PKT_ALIAS_FOUND_HEADER_FRAGMENT result is not an error, so pass that packet.
This fixes packet fragmentation handeling.

Pass really available buffer size to libalias instead of MCLBYTES constant.
MCLBYTES constant were used with believe that m_megapullup() always moves
date into a fresh cluster that sometimes may become not so.
2008-06-01 12:29:23 +00:00
Alexander Motin
aac54f0a70 Fix packet fragmentation support broken by copy/paste error in rev.1.60.
ip_id should be u_short, but not u_char.
2008-06-01 11:47:04 +00:00
Robert Watson
c28cb4d82f Read lock rather than write lock TCP inpcbs in monitoring sysctls. In
some cases, add explicit inpcb locking rather than relying on the global
lock, as we dereference inp_socket, but also allowing us to drop the
global lock more quickly.

MFC after:	1 week
2008-05-29 14:28:26 +00:00
Robert Watson
9622e84fcf Employ read locks on UDP inpcbs, rather than write locks, when
monitoring UDP connections using sysctls.  In some cases, add
previously missing locking of inpcbs, as inp_socket is followed,
which also allows us to drop global locks more quickly.

MFC after:	1 week
2008-05-29 08:27:14 +00:00
Bjoern A. Zeeb
9a38ba8101 Factor out the v4-only vs. the v6-only inp_flags processing in
ip6_savecontrol in preparation for udp_append() to no longer
need an WLOCK as we will no longer be modifying socket options.

Requested by:		rwatson
Reviewed by:		gnn
MFC after:		10 days
2008-05-24 15:20:48 +00:00
Robert Watson
22c82719cf Consistently check IPFW and DUMMYNET privileges in the configuration
routines for those modules, rather than in the raw socket code.  This
each privilege check to occur in exactly once place and avoids
duplicate checks across layers.

MFC after:	3 weeks
Sponsored by:	nCircle Network Security, Inc.
2008-05-22 08:10:31 +00:00
Randall Stewart
d61374e183 - sctputil.c - If debug is on, the INPKILL timer can deref a freed value.
Change so that we save off a type field for display and
               NULL inp just for good measure.

- sctp_output.c - Fix it so in sending to the loopback we use the
                  src address of the inbound INIT. We don't want
                  to do this for non local addresses since otherwise
                  we might be ingressed filtered so we need to use
                  the best src address and list the address sent to.

Obtained from:	time bug - Neil Wilson
MFC after:	1 week
2008-05-21 16:51:21 +00:00
Randall Stewart
c54a18d26b - Adds support for the multi-asconf (From Kozuka-san)
- Adds some prepwork (Not all yet) for vimage in particular
  support the delete the sctppcbinfo.xx structs. There is
  still a leak in here if it were to be called plus we stil
  need the regrouping (From Me and Michael Tuexen)
- Adds support for UDP tunneling. For BSD there is no
  socket yet setup so its disabled, but major argument
  changes are in here to emcompass the passing of the port
  number (zero when you don't have a udp tunnel, the default
  for BSD). Will add some hooks in UDP here shortly (discussed
  with Robert) that will allow easy tunneling. (Mainly from
  Peter Lei and Michael Tuexen with some BSD work from me :-D)
- Some ease for windows, evidently leave is reserved by their
  compile move label leave: -> out:

MFC after:	1 week
2008-05-20 13:47:46 +00:00
Randall Stewart
bfefd19036 - Define changes in sctp.h
- Bug in CA that does not get us incrementing the PBA properly which
  made us more conservative.
- comment updated in sctp_input.c
- memsets added before we log
- added arg to hmac id's
MFC after:	2 weeks
2008-05-20 09:51:36 +00:00
George V. Neville-Neil
fff0ededf8 Fix the loopback interface. Cleaning up some code with new macros
was a tad too aggressive.

PR:		kern/123568
Submitted by:	Vladimir Ermakov <samflanker at gmail dot com>
Obtained from:	antoine
2008-05-12 02:44:53 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
John Baldwin
790fce68dd Always bump tcpstat.tcps_badrst if we get a RST for a connection in the
syncache that has an invalid SEQ instead of only doing it when we suceed
in mallocing space for the log message.

MFC after:	1 week
Reviewed by:	sam, bz
2008-05-08 22:21:09 +00:00
Kip Macy
8ab7ce7c61 replace spaces added in last change with tabs 2008-05-05 23:13:27 +00:00
Kip Macy
535fbad68f add rcv_nxt, snd_nxt, and toe offload id to FreeBSD-specific
extension fields for tcp_info
2008-05-05 20:13:31 +00:00
Dmitry Morozovsky
03bc210eb9 Fix build, together with a bit of style breakage. 2008-05-02 18:54:36 +00:00
Robert Watson
bcf5b9fa38 Fix a comment typo.
MFC after:	3 days
2008-04-29 21:21:15 +00:00
Robert Watson
9ad11dd8a4 With IPv4 raw sockets, read lock rather than write lock the inpcb when
receiving or transmitting.

With IPv6 raw sockets, read lock rather than write lock the inpcb when
receiving.  Unfortunately, IPv6 source address selection appears to
require a write lock on the inpcb for the time being.

MFC after:	3 months
2008-04-21 12:06:41 +00:00
Robert Watson
3656a4fe2e Read lock, rather than write lock, the inpcb when transmitting with or
delivering to an IP divert socket.

MFC after:	3 months
2008-04-21 12:03:59 +00:00
Bjoern A. Zeeb
032fae41d4 Revert to rev. 1.161 - switch back to optimized TCP options ordering.
A lot of testing has shown that the problem people were seeing was due
to invalid padding after the end of option list option, which was corrected
in tcp_output.c rev. 1.146.

Thanks to:		anders@, s3raphi, Matt Reimer
Thanks to:		Doug Hardie and Randy Rose, John Mayer, Susan Guzzardi
Special thanks to:	dwhite@ and BitGravity
Discussed with:		silby
MFC after:		1 day
2008-04-20 18:36:59 +00:00
Robert Watson
fdd9b0723e Teach pf and ipfw to use read locks in inpcbs write than write locks
when reading credential data from sockets.

Teach pf to unlock the pcbinfo more quickly once it has acquired an
inpcb lock, as the inpcb lock is sufficient to protect the reference.

Assert locks, rather than read locks or write locks, on inpcbs in
subroutines--this is necessary as the inpcb may be passed down with a
write lock from the protocol, or may be passed down with a read lock
from the firewall lookup routine, and either is sufficient.

MFC after:	3 months
2008-04-20 00:21:54 +00:00
Robert Watson
baa45840d7 In ip_output(), allow a read lock as well as a write lock when asserting
a lock on the passed inpcb.

MFC after:	3 months
2008-04-19 14:35:17 +00:00
Robert Watson
a69042a5be When querying the local or foreign address from an IP socket, acquire
only a read lock on the inpcb.

When an external module requests a read lock, acquire only a read lock.

MFC after:	3 months
2008-04-19 14:34:38 +00:00
Kip Macy
73a0d5896e move tcbinfo lock acquisition in to syncache 2008-04-19 03:39:17 +00:00
Kip Macy
46b0a854cc move cxgb_lt2.[ch] from NIC to TOE
move most offload functionality from NIC to TOE
factor out all socket and inpcb direct access
factor out access to locking in incpb, pcbinfo, and sockbuf
2008-04-19 03:22:43 +00:00
George V. Neville-Neil
0327aeb9e3 Add in check for loopback as well, which was missing from the original patch.
PR: 120958
Submitted by: James Snow <snow at teardrop.org>
MFC after: 2 weeks
2008-04-17 23:24:58 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
George V. Neville-Neil
6b9ff6b7a7 Clean up the code that checks the types of address so that it is
done by understandable macros.

Fix the bug that prevented the system from responding on interfaces with
link local addresses assigned.

PR: 120958
Submitted by: James Snow <snow at teardrop.org>
MFC after: 2 weeks
2008-04-17 12:50:42 +00:00
Randall Stewart
5e2c2d872b Allow SCTP to compile without INET6.
PR:		116816
Obtained from	tuexen@fh-muenster.de:
MFC after:	2 weeks
2008-04-16 17:24:18 +00:00
Randall Stewart
eadccaccf0 Use the pru_flush infrastructure to avoid a panic
PR:		122710
MFC after:	1 week
2008-04-14 18:13:33 +00:00
Randall Stewart
c40e9cf2c1 Protection against errant sender sending a stream
seq number out of order with no missing TSN's (a
cisco box has this problem which will make a ssn
be held forever).
MFC after:	1 week
2008-04-14 14:34:29 +00:00
Randall Stewart
2a3eb019db New logging values. 2008-04-14 14:33:07 +00:00
Randall Stewart
45ccc1a635 1) adds some additional logging
2) changes to use a inqueue_bytes calculated value in max_len calc's.
MFC after:	1 week
2008-04-14 14:32:32 +00:00
Qing Li
e440aed958 This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,

	route add -net 192.103.54.0/24 10.9.44.1
	route add -net 192.103.54.0/24 10.9.44.2

The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"

Multiple default routes can also be inserted. Here is the netstat
output:

default		10.2.5.1	UGS	0	3074	bge0 =>
default		10.2.5.2	UGS	0	0	bge0

When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,

	route delete default

would fail and trigger the following error message:

"route: writing to routing socket: No such process"
"delete net default: not in table"

On the other hand,

	route delete default 10.2.5.2

would be successful: "delete net default: gateway 10.2.5.2"

One does not have to specify a gateway if there is only a single
route for a particular destination.

I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options  RADIX_MPATH" in the kernel configuration
to enable this feature.

Reviewed by:	robert, sam, gnn, julian, kmacy
2008-04-13 05:45:14 +00:00