3859 Commits

Author SHA1 Message Date
Luigi Rizzo
6aada3117b improve compatibility with RELENG_7.2 2010-03-04 16:52:26 +00:00
Luigi Rizzo
cc4d3c30ea Bring in the most recent version of ipfw and dummynet, developed
and tested over the past two months in the ipfw3-head branch.  This
also happens to be the same code available in the Linux and Windows
ports of ipfw and dummynet.

The major enhancement is a completely restructured version of
dummynet, with support for different packet scheduling algorithms
(loadable at runtime), faster queue/pipe lookup, and a much cleaner
internal architecture and kernel/userland ABI which simplifies
future extensions.

In addition to the existing schedulers (FIFO and WF2Q+), we include
a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new,
very fast version of WF2Q+ called QFQ.

Some test code is also present (in sys/netinet/ipfw/test) that
lets you build and test schedulers in userland.

Also, we have added a compatibility layer that understands requests
from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries,
and replies correctly (at least, it does its best; sometimes you
just cannot tell who sent the request and how to answer).
The compatibility layer should make it possible to MFC this code in a
relatively short time.

Some minor glitches (e.g. handling of ipfw set enable/disable,
and a workaround for a bug in RELENG_7's /sbin/ipfw) will be
fixed with separate commits.

CREDITS:
This work has been partly supported by the ONELAB2 project, and
mostly developed by Riccardo Panicucci and myself.
The code for the qfq scheduler is mostly from Fabio Checconi,
and Marta Carbone and Francesco Magno have helped with testing,
debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
Joel Dahl
7df6f59359 The NetBSD Foundation has granted permission to remove clause 3 and 4 from
their software.

Obtained from:	NetBSD
2010-03-01 17:05:46 +00:00
Bjoern A. Zeeb
aa3f803697 Upon virtual network stack teardown properly release the TCP syncache
resources.

Sponsored by:	ISPsystem
Reviewed by:	rwatson
MFC After:	5 days
2010-02-20 21:45:04 +00:00
Michael Tuexen
7b470fc31c Fix handling of SHUTDOWN-ACK chunk in COOKIE_WAIT and COOKIE_ECHOED.
MFC after: 1 week
2010-02-20 20:30:40 +00:00
Bjoern A. Zeeb
9802380e41 Split up ip_drain() into an outer lock and iterator part and
a "locked" version that will only handle a single network stack
instance. The latter is called directly from ip_destroy().

Hook up an ip_destroy() function to release resources from the
legacy IP network layer upon virtual network stack teardown.

Sponsored by:	ISPsystem
Reviewed by:	rwatson
MFC After:	5 days
2010-02-20 19:59:52 +00:00
Michael Tuexen
7291848a0b * Fix another u_long -> uint32_t issue.
* Remove an unused global variable.
* Fix an issue reported by Bruce Cran related to reusing SCTP sockets which
  were connected.

MFC after: 1 week
2010-02-19 18:00:38 +00:00
Pawel Jakub Dawidek
957d68dd91 No need to include security/mac/mac_framework.h here. 2010-02-18 22:26:01 +00:00
Michael Tuexen
63eda93d1a Use uint32_t instead of u_long.
MFC after: 1 week
2010-02-18 13:46:54 +00:00
Luigi Rizzo
27c9c97a3e remove recursive lock/unlock calls, we do them already before entering
the switch.

Reported by: Marta Carbone
2010-02-17 13:06:06 +00:00
Michael Tuexen
8d9d061323 Add missing SCTP_PACKED. Spotted by Irene Ruengeler.
MFC after: 1 week
2010-02-13 21:38:15 +00:00
Bjoern A. Zeeb
fffb9f1d9c Properly free resources when destroying the TCP hostcache while
tearing down a network stack (in the VIMAGE jail+vnet case).

For that break out the logic from tcp_hc_purge() into an internal
function we can call from both, the sysctl handler and the
tcp_hc_destroy().

Sponsored by:	ISPsystem
Reviewed by:	silby, lstewart
MFC After:	8 days
2010-02-09 21:31:53 +00:00
Michael Tuexen
f1150dc0a5 Restore the checksum received before processing the packet.
MFC after: 1 week
2010-02-04 21:02:29 +00:00
Qing Li
d577d18a00 Some of the existing ppp and vpn related scripts create and set
the IP addresses of the tunnel end points to the same value. In
these cases the loopback route is not installed for the local
end.

Verified by:	avg
MFC after:	5 days
2010-02-02 20:38:30 +00:00
Luigi Rizzo
dc5fd2595c use u_char instead of u_int for short bitfields.
For our compiler the two constructs are completely equivalent, but
some compilers (including MSC and tcc) use the base type for alignment,
which in the cases touched here results in aligning the bitfields
to 32 bit instead of the 8 bit that is meant here.

Note that almost all other headers where small bitfields
are used have u_int8_t instead of u_int.

MFC after:	3 days
2010-02-01 14:13:44 +00:00
Michael Tuexen
663fdad84b Use [] instead of [0] for flexible arrays.
Obtained from: Bruce Cran
MFC after: 1 week
2010-01-22 07:53:41 +00:00
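As an illustration of the `[]` versus `[0]` change above, here is a minimal sketch of a C99 flexible array member. The `struct chunk` layout and the `chunk_new()` helper are hypothetical and are not taken from the SCTP sources; they only show the idiom.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable-length record, not an actual SCTP structure. */
struct chunk {
	uint16_t length;	/* number of payload bytes that follow */
	uint8_t data[];		/* C99 flexible array member; "data[0]" is a GNU extension */
};

/* Allocate a chunk with room for len payload bytes and copy them in. */
static struct chunk *
chunk_new(const void *payload, uint16_t len)
{
	struct chunk *c = malloc(sizeof(*c) + len);

	if (c == NULL)
		return (NULL);
	c->length = len;
	if (len > 0)
		memcpy(c->data, payload, len);
	return (c);
}
```

The caller owns the returned pointer and releases it with `free()`.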
Michael Tuexen
cd55430963 Get rid of a lot of duplicated code for NR-SACK handling.
Generalize the SACK code to also handle NR-SACKs.
2010-01-17 21:00:28 +00:00
Randall Stewart
e34b217f91 Bug fix: If the allocation of a socket failed and we
freed the inpcb, it was possible to not set the
proper flags on the pcb (i.e. the socket is not there).
This is HIGHLY unlikely since no one else should be
able to find the socket.. but for consistency we
do the proper loop thing to make sure that we
mark the socket as gone on the PCB.
2010-01-17 19:47:59 +00:00
Randall Stewart
0812a4d5e6 Pulls out another leaked windows ifdef that somehow
made its way through the scrubber.
2010-01-17 19:40:21 +00:00
Randall Stewart
a10c3242c7 This change syncs up the socketAPI stream-reset
values to match those in linux and the I-D
just released to the IETF.
2010-01-17 19:35:38 +00:00
Randall Stewart
92cf719944 More leaked ifdefs for APPLE and its mobility stuff. 2010-01-17 19:24:30 +00:00
Randall Stewart
33141385fc Remove another set of "leaked" ifdefs that somehow found
their way into FreeBSD.
2010-01-17 19:21:50 +00:00
Randall Stewart
58ac2d97b7 Remove strange APPLE define that leaked
through the scrubber scripts. Scripts are
now fixed so this won't happen again.
2010-01-17 19:17:16 +00:00
Bjoern A. Zeeb
4dcc55a363 Garbage collect references to the no longer implemented tcp_fasttimo().
Discussed with:	rwatson
MFC after:	5 days
2010-01-17 13:07:52 +00:00
Bjoern A. Zeeb
592bcae802 Add ip4.saddrsel/ip4.nosaddrsel (and equivalent for ip6) to control
whether to use source address selection (default) or the primary
jail address for unbound outgoing connections.

This is intended to be used by people upgrading from single-IP
jails to multi-IP jails but not having to change firewall rules,
application ACLs, ... but to force their connections (unless
otherwise changed) to the primary jail IP they had been using for
years, as well as for people preferring to implement similar policies.

Note that for IPv6, if configured incorrectly, this might lead to
scope violations, which single-IPv6 jails could as well, as by the
design of jails. [1]

Reviewed by:	jamie, hrs (ipv6 part)
Pointed out by:	hrs [1]
MFC After:	2 weeks
Asked for by:	Jase Thew (bazerka beardz.net)
2010-01-17 12:57:11 +00:00
Hajimu UMEMOTO
416458131a Change 'me' to match any IPv6 address configured on an interface in
the system as well as any IPv4 address.

Reviewed by:	David Horn <dhorn2000__at__gmail.com>, luigi, qingli
MFC after:	2 weeks
2010-01-17 08:39:48 +00:00
Michael Tuexen
5661a9ed70 Get rid of support of an old version of the SCTP-AUTH draft.
Get rid of unused MD5 code.

MFC after: 1 week
2010-01-16 20:04:17 +00:00
Qing Li
646c800540 Ensure an address is removed from the interface address
list when the installation of that address fails.

PR:		139559
2010-01-08 17:49:24 +00:00
Ruslan Ermilov
acc0fee071 Complete the swap of carp(4) log levels and document the change.
MFC after:	3 days
2010-01-08 16:14:41 +00:00
Martin Blapp
c2ede4b379 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
Luigi Rizzo
5afa29b41a we don't use dummynet_drain! 2010-01-07 13:53:47 +00:00
Luigi Rizzo
59a613b14d check that we have an ipv4 packet before swapping ip_len and ip_off.
This should fix the handling of ipv6 packets which i broke when i
made ipfw operate on packets in network format.

Reported by: Hajimu UMEMOTO
2010-01-07 12:00:54 +00:00
Luigi Rizzo
b2019e1789 Following up on a request from Ermal Luci to make
ip_divert work as a client of pf(4),
make ip_divert not depend on ipfw.

This is achieved by moving to ip_var.h the struct ipfw_rule_ref
(which is part of the mtag for all reinjected packets) and other
declarations of global variables, and moving to raw_ip.c global
variables for filter and divert hooks.

Note that names and locations could be made more generic
(ipfw_rule_ref is really a generic reference robust to reconfigurations;
the packet filter is not necessarily ipfw; filters and their clients
are not necessarily limited to ipv4), but _right now_ most
of this stuff works on ipfw and ipv4, so i don't feel like
doing a gratuitous renaming, at least for the time being.
2010-01-07 10:39:15 +00:00
Luigi Rizzo
62081e0f8d some header shuffling to help decoupling ip_divert from ipfw 2010-01-07 10:08:05 +00:00
Luigi Rizzo
eb6842e2a9 put ip_len in correct order for ip_output().
This prevents a panic when ipfw generates packets on its own
(such as reject or keepalives for dynamic rules).

Reported by: Chagin Dmitry
2010-01-07 09:28:17 +00:00
Luigi Rizzo
c95477dfa1 this file does not require ip_dummynet.h 2010-01-05 11:00:31 +00:00
Qing Li
ee8a75d320 An existing incomplete ARP entry would expire a subsequent
statically configured entry of the same host. This bug occurred
because the expiration timer was not cancelled when installing
the static entry. Since there exists a potential race condition
with respect to timer cancellation, simply check for the
LLE_STATIC bit inside the expiration function instead of
cancelling the active timer.

MFC after:	5 days
2010-01-05 00:35:46 +00:00
Luigi Rizzo
7173b6e554 Various cleanup done in ipfw3-head branch including:
- use a uniform mtag format for all packets that exit and re-enter
  the firewall in the middle of a rulechain. On reentry, all tags
  containing reinject info are renamed to MTAG_IPFW_RULE so the
  processing is simpler.

- make ipfw and dummynet use ip_len and ip_off in network format
  everywhere. Conversion is done only once instead of tracking
  the format in every place.

- use a macro FREE_PKT to dispose of mbufs. This eases portability.

In passing I also removed a few typos, staticised or localised variables,
removed useless declarations and made other minor changes.

Overall the code shrinks a bit and is hopefully more readable.

I have tested functionality for all but ng_ipfw and if_bridge/if_ethersubr.
For ng_ipfw i am actually waiting for feedback from glebius@ because
we might have some small changes to make.
For if_bridge and if_ethersubr feedback would be welcome
(there are still some redundant parts in these two modules that
I would like to remove, but first i need to check functionality).
2010-01-04 19:01:22 +00:00
Michael Tuexen
f5366806c6 Correct usage of parentheses.
PR:	kern/142066
Approved by: rrs (mentor)
Obtained from: Henning Petersen, Bruce Cran.
MFC after: 2 weeks
2010-01-04 18:25:38 +00:00
Navdeep Parhar
567145993f Avoid NULL dereference in arpresolve. 2010-01-03 06:43:13 +00:00
Qing Li
ccbb9c359d Consolidate the route message generation code for when address
aliases were added or deleted. The announced route entry for
an address alias is no longer empty because this empty route
entry was causing some route daemon to fail and exit abnormally.

MFC after:	5 days
2009-12-30 22:13:01 +00:00
Qing Li
c7ab66020f The proxy arp entries could not be added into the system over the
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.

MFC after:	5 days
2009-12-30 21:35:34 +00:00
Shteryana Shopova
7c90b0258f Make sure the multicast forwarding cache entry's stall queue is properly
initialized before trying to insert an entry into it.

PR:		kern/142052
Reviewed by:	bms
MFC after:	now
2009-12-30 08:52:13 +00:00
Luigi Rizzo
bcd3b68dd2 we really need htonl() here, see the comment a few lines above in the code. 2009-12-29 00:02:57 +00:00
Antoine Brodin
13e403fdea (S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR:		137213
Submitted by:	Eygene Ryabinkin (initial version)
MFC after:	1 month
2009-12-28 22:56:30 +00:00
Bjoern A. Zeeb
fc74d005d9 Make the compiler happy after r201125:
- + remove two unnecessary initializations in ip_output;
+ + remove one unnecessary initializations in ip_output;
2009-12-28 21:14:18 +00:00
Luigi Rizzo
ec396e61ed introduce a local variable rte acting as a cache of ro->ro_rt
within ip_output, achieving (in random order of importance):
- a reduction of the number of 'r's in the source code;
- improved legibility;
- a reduction of 64 bytes in the .text
2009-12-28 14:48:32 +00:00
Luigi Rizzo
ca8b83b0fa + remove an unused #define print_ip;
+ remove two unnecessary initializations in ip_output;
+ localize 'len';
+ introduce a temporary variable n to count the number of fragments,
  the compiler seems unable to identify a common subexpression
  (written 3 times, used twice);
+ document some assumptions on ip_len and ip_hl
2009-12-28 14:09:46 +00:00
Luigi Rizzo
e59084e086 bring the NGM_IPFW_COOKIE back into ng_ipfw.h, libnetgraph expects
to find it there. Unfortunately this reintroduces the dependency
on ip_fw_pfil.c
2009-12-28 12:29:13 +00:00
Luigi Rizzo
830c6e2b97 bring in several cleanups tested in ipfw3-head branch, namely:
r201011
- move most of ng_ipfw.h into ip_fw_private.h, as this code is
  ipfw-specific. This removes a dependency on ng_ipfw.h from some files.

- move many equivalent definitions of direction (IN, OUT) for
  reinjected packets into ip_fw_private.h

- document the structure of the packet tags used for dummynet
  and netgraph;

r201049
- merge some common code to attach/detach hooks into
  a single function.

r201055
- remove some duplicated code in ip_fw_pfil. The input
  and output processing uses almost exactly the same code so
  there is no need to use two separate hooks.
  ip_fw_pfil.o goes from 2096 to 1382 bytes of .text

r201057 (see the svn log for full details)
- macros to make the conversion of ip_len and ip_off
  between host and network format more explicit

r201113 (the remaining parts)
- readability fixes -- put braces around some large for() blocks,
  localize variables so the compiler does not think they are uninitialized,
  do not insist on precise allocation size if we have more than we need.

r201119
- when doing a lookup, keys must be in big endian format because
  this is what the radix code expects (this fixes a bug in the
  recently-introduced 'lookup' option)

No ABI changes in this commit.

MFC after:	1 week
2009-12-28 10:47:04 +00:00
Luigi Rizzo
6cc7b9f5d9 readability fixes -- add braces on large blocks, remove unnecessary
initializations
2009-12-28 10:19:53 +00:00
Luigi Rizzo
6730dcaec7 explain details of operation of table lookups, and improve portability 2009-12-28 10:12:35 +00:00
Luigi Rizzo
2082ecd966 diverted packet must re-enter _after_ the matching rule,
or we create loops.
The divert cookie (that can be set from userland too)
contains the matching rule nr, so we must start from nr+1.

Reported by: Joe Marcus Clarke
2009-12-27 10:19:10 +00:00
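The re-entry rule described above can be sketched as follows. This is a simplified illustration, not the ipfw code: the rule chain is reduced to a sorted array of rule numbers, and `first_rule_after()` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first rule whose number is strictly greater
 * than nr (the rule number carried in the divert cookie), or n if there
 * is none. rulenums[] is assumed to be sorted in ascending order.
 * Resuming at the matching rule itself would divert the packet again
 * and create a loop.
 */
static size_t
first_rule_after(const uint16_t *rulenums, size_t n, uint16_t nr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (rulenums[i] > nr)
			break;
	return (i);
}
```

Comparing with `> nr` instead of computing `nr + 1` avoids wrap-around when `nr` is already the largest rule number.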
Luigi Rizzo
4a3c1bd27f fix poor indentation resulting from a merge 2009-12-24 17:35:28 +00:00
Luigi Rizzo
84918f5bc8 mostly style changes, such as removal of trailing whitespace,
reformatting to avoid unnecessary line breaks, small block
restructuring to avoid unnecessary nesting, replace macros
with function calls, etc.

As a side effect of code restructuring, this commit fixes one bug:
previously, if a realloc() failed, memory was leaked. Now, the
realloc is not there anymore, as we first count how much memory
we need and then do a single malloc.
2009-12-23 18:53:11 +00:00
Luigi Rizzo
3ae19c3ba3 fix build with the new fast lookup structure.
Also remove some unnecessary headers
2009-12-23 12:15:21 +00:00
Luigi Rizzo
6aab896346 fix build on 64-bit architectures.
Also fix the indentation on a few lines.
2009-12-23 12:00:50 +00:00
Luigi Rizzo
de240d1013 merge code from ipfw3-head to reduce contention on the ipfw lock
and remove all O(N) sequences from kernel critical sections in ipfw.

In detail:

  1. introduce an IPFW_UH_LOCK to arbitrate requests from
     the upper half of the kernel. Some things, such as 'ipfw show',
     can be done holding this lock in read mode, whereas insert and
     delete require IPFW_UH_WLOCK.

  2. introduce a mapping structure to keep rules together. This replaces
     the 'next' chain currently used in ipfw rules. At the moment
     the map is a simple array (sorted by rule number and then rule_id),
     so we can find a rule quickly instead of having to scan the list.
     This reduces many expensive lookups from O(N) to O(log N).

  3. when an expensive operation (such as insert or delete) is done
     by userland, we grab IPFW_UH_WLOCK, create a new copy of the map
     without blocking the bottom half of the kernel, then acquire
     IPFW_WLOCK and quickly update pointers to the map and related info.
     After dropping IPFW_LOCK we can then continue the cleanup protected
     by IPFW_UH_LOCK. So userland still costs O(N) but the kernel side
     is only blocked for O(1).

  4. do not pass pointers to rules through dummynet, netgraph, divert etc,
     but rather pass a <slot, chain_id, rulenum, rule_id> tuple.
     We validate the slot index (in the array of #2) with chain_id,
     and if successful do a O(1) dereference; otherwise, we can find
     the rule in O(log N) through <rulenum, rule_id>

All the above does not change the userland/kernel ABI, though there
are some disgusting casts between pointers and uint32_t

Operation costs now are as follows:

  Function				Old	Now	  Planned
-------------------------------------------------------------------
  + skipto X, non cached		O(N)	O(log N)
  + skipto X, cached			O(1)	O(1)
XXX dynamic rule lookup			O(1)	O(log N)  O(1)
  + skipto tablearg			O(N)	O(1)
  + reinject, non cached		O(N)	O(log N)
  + reinject, cached			O(1)	O(1)
  + kernel blocked during setsockopt()	O(N)	O(1)
-------------------------------------------------------------------

The only (very small) regression is on dynamic rule lookup and this will
be fixed in a day or two, without changing the userland/kernel ABI

Supported by: Valeria Paoli
MFC after:	1 month
2009-12-22 19:01:47 +00:00
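Points 2 and 4 of the commit above can be sketched roughly as below, assuming a heavily simplified rule record. The struct and function names (`struct rule`, `struct chain`, `struct rule_ref`, `find_rule()`, `resolve_ref()`) are hypothetical and do not reproduce the kernel definitions; only the lookup strategy follows the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified rule record; the real rule has many more fields. */
struct rule {
	uint16_t rulenum;
	uint32_t rule_id;
};

/* The map: an array sorted by rulenum, then rule_id, tagged with a chain id. */
struct chain {
	const struct rule *map;
	size_t n;
	uint32_t id;		/* assumed to change whenever the map is replaced */
};

/* The <slot, chain_id, rulenum, rule_id> tuple passed instead of a pointer. */
struct rule_ref {
	uint32_t slot;
	uint32_t chain_id;
	uint16_t rulenum;
	uint32_t rule_id;
};

/* Binary search: first index whose key is >= (rulenum, rule_id), or n. */
static size_t
find_rule(const struct rule *map, size_t n, uint16_t rulenum, uint32_t rule_id)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct rule *r = &map[mid];

		if (r->rulenum < rulenum ||
		    (r->rulenum == rulenum && r->rule_id < rule_id))
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo);
}

/* O(1) if the cached slot is still valid for this chain, else O(log N). */
static size_t
resolve_ref(const struct chain *ch, const struct rule_ref *ref)
{
	if (ref->chain_id == ch->id && ref->slot < ch->n)
		return (ref->slot);
	return (find_rule(ch->map, ch->n, ref->rulenum, ref->rule_id));
}
```

A stale reference (one whose `chain_id` no longer matches) still resolves correctly, only more slowly, which is why no pointer to a possibly freed rule ever needs to travel with the packet.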
John Baldwin
43d9473499 - Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove
the leading underscores since they are now implemented.
- Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info
  structure.

Reviewed by:	rwatson
MFC after:	2 weeks
2009-12-22 15:47:40 +00:00
Luigi Rizzo
46fdc2bf60 some mostly cosmetic changes in preparation for upcoming work:
+ in many places, replace &V_layer3_chain with a local
  variable chain;
+ bring the counter of rules and static_len within ip_fw_chain
  replacing static variables;
+ remove some spurious comments and extern declaration;
+ document which lock protects certain data structures
2009-12-22 13:53:34 +00:00
Ruslan Ermilov
bec5f27f73 Added proper attribution.
Requested by:	luigi
2009-12-18 17:22:21 +00:00
Luigi Rizzo
1328a38b96 Add some experimental code to log traffic with tcpdump,
similar to pflog(4).
To use the feature, just put the 'log' options on rules
you are interested in, e.g.

	ipfw add 5000 count log ....

and run
	tcpdump -ni ipfw0 ...

net.inet.ip.fw.verbose=0 enables logging to ipfw0,
net.inet.ip.fw.verbose=1 sends logging to syslog as before.

More features can be added, similar to pflog(), to store in
the MAC header metadata such as rule numbers and actions.
Manpage to come once features are settled.
2009-12-17 23:11:16 +00:00
Luigi Rizzo
60ab046a41 simplify and document lookup_next_rule() 2009-12-17 17:27:12 +00:00
Luigi Rizzo
59cd9f65f9 simplify the code that finds the next rule after reinjections
MFC after:	1 week
2009-12-17 12:27:54 +00:00
Luigi Rizzo
53638988bc remove a duplicate sysctl entry 2009-12-16 18:03:35 +00:00
Luigi Rizzo
1b5691c61e bring back a couple of #include that are supplied by nesting,
and explain why they are used.
2009-12-16 13:00:37 +00:00
Luigi Rizzo
97219abf05 Various cosmetic cleanup of the files:
- move global variables around to reduce the scope and make them
  static if possible;
- add an ipfw_ prefix to all public functions to prevent conflicts
  (the same should be done for variables);
- try to pack variable declarations in a uniform way across files;
- clarify some comments;
- remove some misspelling of names (#define V_foo VNET(bar)) that
  slipped in due to cut&paste
- remove duplicate static variables in different files;

MFC after:	1 month
2009-12-16 10:48:40 +00:00
Warner Losh
26bbc1fc5a Quick fix to make this compile:
Remove redundant extern declarations.
If the maintainer has a better fix, then feel free to back this out.
2009-12-16 03:26:37 +00:00
Luigi Rizzo
22f123afad more splitting of ip_fw2.c, now extract the 'table' routines
and the sockopt routines (the upper half of the kernel).

Whoever is the author of the 'table' code (Ruslan/glebius/oleg ?)
please change the attribution in ip_fw_table.c. I have copied
the copyright line from ip_fw2.c but it carries my name and I have
neither written nor designed the feature so I don't deserve
the credit.

MFC after:	1 month
2009-12-15 21:24:12 +00:00
Luigi Rizzo
70228fb346 Start splitting ip_fw2.c and ip_fw.h into smaller components.
At this time we pull out from ip_fw2.c the logging functions, and
support for dynamic rules, and move kernel-only stuff into
netinet/ipfw/ip_fw_private.h

No ABI change involved in this commit, unless I made some mistake.
ip_fw.h has changed, though not in the userland-visible part.

Files touched by this commit:

conf/files
	now references the two new source files

netinet/ip_fw.h
	remove kernel-only definitions gone into netinet/ipfw/ip_fw_private.h.

netinet/ipfw/ip_fw_private.h
	new file with kernel-specific ipfw definitions

netinet/ipfw/ip_fw_log.c
	ipfw_log and related functions

netinet/ipfw/ip_fw_dynamic.c
	code related to dynamic rules

netinet/ipfw/ip_fw2.c
	removed the pieces that go in the new files

netinet/ipfw/ip_fw_nat.c
	minor rearrangement to remove LOOKUP_NAT from the
	main headers. This requires a new function pointer.

A bunch of other kernel files that included netinet/ip_fw.h now
require netinet/ipfw/ip_fw_private.h as well.
Not 100% sure i caught all of them.

MFC after:	1 month
2009-12-15 16:15:14 +00:00
Luigi Rizzo
472099c4b0 implement a new match option,
lookup {dst-ip|src-ip|dst-port|src-port|uid|jail} N

which searches the specified field in table N and sets tablearg
accordingly.
With dst-ip or src-ip the option replicates two existing options.
When used with other arguments, the option can be useful to
quickly dispatch traffic based on other fields.

Work supported by the Onelab project.

MFC after:	1 week
2009-12-15 09:46:27 +00:00
Bjoern A. Zeeb
de0bd6f76b Throughout the network stack we have a few places of
if (jailed(cred))
left.  If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.

Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.

We cannot change the classic jailed() call to do that,  as it is used
outside the network stack as well.

Discussed with:	julian, zec, jamie, rwatson (back in Sept)
MFC after:	5 days
2009-12-13 13:57:32 +00:00
Luigi Rizzo
b2089673e5 use div64 when converting back the burst value for userland 2009-12-10 18:37:14 +00:00
Luigi Rizzo
89717f91ef when draining a flowset free the entire chain, not just one packet. 2009-12-10 18:34:07 +00:00
Luigi Rizzo
478cae8a97 centralize the code to free a packet (or a chain) while in dummynet.
Remove an old macro and its stale comment.
2009-12-10 15:17:34 +00:00
Oleg Bulyzhin
22746035ec Fix burst processing for WF2Q pipes - do not increase available burst size
unless pipe is idle. This should fix the following issues:
- 'dummynet: OUCH! pipe should have been idle!' log messages.
- exceeding configured pipe bandwidth.

MFC after:	1 week
2009-12-05 23:27:21 +00:00
Luigi Rizzo
f573a0a634 adjust comment in previous commit after Julian's explanation 2009-12-05 11:51:32 +00:00
Luigi Rizzo
bc0d5982e2 remove a dead block of code, document how the ipfw clients are
hooked and the difference in handling the 'enable' variable
for layer2 and layer3. The latter needs fixing once i figure out
how it worked pre-vnet.

MFC after:	7 days
2009-12-05 09:13:06 +00:00
Luigi Rizzo
e99816f1eb fix build with VNET enabled
Reported by: David Wolfskill
2009-12-05 08:32:12 +00:00
Hajimu UMEMOTO
2ea64e8ef9 Use INET_ADDRSTRLEN and INET6_ADDRSTRLEN rather than hard
coded number.

Spotted by:	bz
2009-12-04 15:39:37 +00:00
Luigi Rizzo
4f60c0b97d preparation work to replace the monster switch in ipfw_chk() with
table of functions.

This commit (which is heavily based on work done by Marta Carbone
in this year's GSOC project), removes the goto's and explicit
return from the inner switch(), so we will have an easier time when
putting the blocks into individual functions.

MFC after:	3 weeks
2009-12-03 14:22:15 +00:00
Hajimu UMEMOTO
a22e82b87b Teach the debug prints about IPv6. 2009-12-03 11:16:53 +00:00
Luigi Rizzo
3c95089ef4 - initialize src_ip in the main loop to prevent a compiler warning
(gcc 4.x under linux, not sure how real is the complaint).
- rename a macro argument to prevent name clashes.
-  add the macro name on a couple of #endif
- add a blank line for readability.

MFC after:	3 days
2009-12-02 17:50:52 +00:00
Luigi Rizzo
3429911d4d Dispatch sockopt calls to ipfw and dummynet
using the new option numbers, IP_FW3 and IP_DUMMYNET3.
Right now the modules return an error if called with those arguments
so there is no danger of unwanted behaviour.

MFC after:	3 days
2009-12-02 15:50:43 +00:00
Luigi Rizzo
0a13f6b1b3 small changes for portability and diff reduction wrt/ FreeBSD 7.
No functional differences.

- use the div64() macro to wrap 64 bit divisions
  (which almost always are 64 / 32 bits) so they are easier
  to handle with compilers or OS that do not have native
  support for 64bit divisions;

- use a local variable for p_numbytes even if not strictly
  necessary on HEAD, as it reduces diffs with FreeBSD7

- in dummynet_send() check that a tag is present before
  dereferencing the pointer.

- add a couple of blank lines for readability near the end of a function

MFC after:	3 days
2009-12-02 15:20:31 +00:00
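The idea of wrapping 64-bit divisions can be sketched as below. The macro body shown here is an assumption for illustration (a plain cast and divide); the commit message does not give the actual definition, and `bytes_per_tick()` is a hypothetical caller.

```c
#include <stdint.h>

/*
 * Assumed definition for a platform with native 64-bit division.
 * On a compiler or OS without it, only this one macro would need to be
 * replaced by a call to a helper routine.
 */
#define div64(a, b)	((int64_t)(a) / (int64_t)(b))

/* Hypothetical caller: a 64-bit byte count divided by a 32-bit divisor. */
static int64_t
bytes_per_tick(int64_t numbytes, int32_t ticks)
{
	return (div64(numbytes, ticks));
}
```

Funnelling every 64 / 32 bit division through one name keeps the port-specific change in a single place.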
Hajimu UMEMOTO
fd63c04193 Teach send_pkt() and ipfw_tick() about IPv6.
This fixes the issue where keep-alives did not work for IPv6.

PR:		kern/117234
Submitted by:	mlaier, Joost Bekkers <joost__at__jodocus.org>
MFC after:	1 month
2009-12-02 14:32:01 +00:00
Gleb Smirnoff
e81ab87652 Until this moment carp(4) used a strange logging priority. It used debug
priority for such important information as MASTER/BACKUP state change,
and used a normal logging priority for such innocent messages as receiving
short packet (which is a normal VRRP packet between some other routers) or
receiving a CARP packet on non-carp interface (someone else running CARP).

This commit shifts message logging priorities to a more sane default.
2009-12-02 13:24:21 +00:00
Luigi Rizzo
de9fc6bcd4 Add new sockopt names for ipfw and dummynet.
This commit is just grabbing entries for the new names
that will be used in the future, so you don't need to
rebuild anything now.

MFC after:	3 days
2009-12-02 10:36:41 +00:00
Luigi Rizzo
9565806f16 change the type of the opcode from enum *:8 to u_int8_t
so the size and alignment of the ipfw_insn is not compiler dependent.
No changes in the code generated by gcc.

There was only one instance of this kind in our entire source tree,
so i suspect the old definition was a poor choice (which i made).

MFC after:	3 days
2009-12-02 08:52:06 +00:00
Michael Tuexen
dec7fa27c6 Use the default stack size for the iterator thread.
This fixes a crash reported by Irene Ruengeler.

Approved by: rrs (mentor)
MFC after: 1 month
2009-11-27 17:25:19 +00:00
Bruce M Simpson
a8cf681de2 Correct a comment.
MFC after:	1 day
2009-11-19 13:21:37 +00:00
Michael Tuexen
7e6206af12 Fix a bug where the system panics when a SHUTDOWN is received with an
illegal TSN.

Approved by: rrs (mentor)
MFC after: ASAP
2009-11-18 12:17:06 +00:00
Michael Tuexen
0e891bcdc1 Get rid of the field addr_over, which is never really used,
only copied around.

Approved by: rrs (mentor)
2009-11-17 23:03:38 +00:00
Michael Tuexen
83fc1165c5 Always use LIST_EMPTY instead of sometimes SCTP_LIST_EMPTY,
which is defined as LIST_EMPTY.

Approved by: rrs (mentor)
MFC after: 1 month
2009-11-17 20:56:14 +00:00
Michael Tuexen
2ab6846a23 Fix a bug where queued ASCONF messages are not sent out.
Approved by: rrs (mentor)
Obtained from:	Irene Ruengeler
MFC after: 1 month
2009-11-17 13:36:21 +00:00
Michael Tuexen
b6c5780299 Fix a memory leak when destroying an SCTP stack.
Clean up sctp_pcb_finish().
Approved by: rrs (mentor)
MFC after: 1 month
2009-11-17 13:13:58 +00:00
Michael Tuexen
87b4fcd323 Do not start the iterator when there are no associations.
This fixes a bug found by Irene Ruengeler.

Approved by: rrs (mentor)
MFC after: 1 month
2009-11-17 13:11:23 +00:00
Michael Tuexen
1e01164145 Disable (temporarily) the thread-based iterator. It does not work with vnet.
Approved by: rrs (mentor)
2009-11-17 13:09:50 +00:00
Michael Tuexen
cf458c646d Allow the UMA to free data. This resolves the UMA related bug reported
by Julian.

Approved by: rrs (mentor)
MFC after: 1 month
2009-11-17 13:08:15 +00:00
Michael Tuexen
7a9b5b2040 Do not hold the lock longer than necessary.
Approved by: rrs (mentor)
MFC after: 1 month
2009-11-17 13:05:51 +00:00
Bruce M Simpson
793c70425a Fix a functional regression in multicast.
Userland daemons need to see IGMP traffic regardless of the group;
omit the imo filter check if the proto is IGMP. The kernel part
of IGMP will have already filtered appropriately at this point.

MFC after:      ASAP
Submitted by:   Franz Struwig
Reported by:    Ivor Prebeg, Franz Struwig
2009-11-15 11:07:22 +00:00
Attilio Rao
758801232c Move inet_aton() (the counterpart of inet_ntoa(), already present in libkern)
into libkern in order to make it usable by modules other than alias_proxy.

Obtained from:	Sandvine Incorporated
Sponsored by:	Sandvine Incorporated
MFC:		1 week
2009-11-12 00:46:28 +00:00
Edward Tomasz Napierala
4f7418a09f Remove ifdefed out part of code, which seems to have originated a decade ago
in OpenBSD.  As it is now, there is no way for this to be useful, since IPsec
is free to forward packets via whatever interface it wants, so checking
capabilities of the interface passed from ip_output (fetched from the routing
table) serves no purpose.

Discussed with:	sam@
2009-11-09 19:53:34 +00:00
Oleg Bulyzhin
57edc1bbf3 style(9): add missing parentheses 2009-11-09 09:12:45 +00:00
John Baldwin
c6d9480519 Several years ago a feature was added to TCP that caused soreceive() to
send an ACK right away if data was drained from a TCP socket that had
previously advertised a zero-sized window.  The current code requires the
receive window to be exactly zero for this to kick in.  If window scaling is
enabled and the window is smaller than the scale, then the effective window
that is advertised is zero.  However, in that case the zero-sized window
handling is not enabled because the window is not exactly zero.  The fix
changes the code to check the raw window value against zero.

Reviewed by:	bz
MFC after:	1 week
2009-11-06 16:55:05 +00:00
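Note on c6d9480519: the mismatch described in that message can be sketched in a few lines of userland C. This is an illustration only, not the kernel code; the function names and the scale value used in the usage note are assumptions. The commit text does not say which quantity it calls the raw window, so the sketch only shows how a nonzero byte count can still go out on the wire as zero.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Value placed in the 16-bit TCP window field: the receive window in
 * bytes shifted right by the negotiated scale (the shift is capped at 14).
 */
uint16_t
advertised_field(uint32_t rcv_wnd, unsigned scale)
{
	uint32_t v;

	assert(scale <= 14);
	v = rcv_wnd >> scale;
	return (v > UINT16_MAX ? UINT16_MAX : (uint16_t)v);
}

/* Test on the byte count only: true just when the window is exactly 0. */
int
zero_window_exact(uint32_t rcv_wnd)
{
	return (rcv_wnd == 0);
}

/* Test on what the peer actually saw: the shifted value. */
int
zero_window_as_advertised(uint32_t rcv_wnd, unsigned scale)
{
	return (advertised_field(rcv_wnd, scale) == 0);
}
```

With a hypothetical scale of 7, a 100-byte window is advertised as 0, yet zero_window_exact(100) is false, which is the case in which the immediate ACK was being skipped.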
Oleg Bulyzhin
5661377e37 Fix two issues that can lead to exceeding configured pipe bandwidth:
- do not expire queues which are not ready to be expired.
- properly calculate available burst size.

MFC after:	3 days
2009-11-03 08:41:14 +00:00
Michael Tuexen
08abf6399a Improve round robin stream scheduler and cleanup some code.
Approved by: rrs (mentor)
MFC after: 3 days
2009-10-29 17:40:33 +00:00
Christian Brueffer
621882f0bc Close a stream file descriptor leak.
PR:		138130
Submitted by:	Patroklos Argyroudis <argp@census-labs.com>
MFC after:	1 week
2009-10-28 12:10:29 +00:00
Michael Tuexen
d18f7e0a98 Bugfix: Use formula from section 7.2.3 of RFC 4960. Reported by Martin Becke.
Approved by: rrs (mentor)
MFC after: 3 days
2009-10-27 18:17:07 +00:00
Michael Tuexen
ac9bce0f3b Improve the round robin stream scheduler.
Approved by: rrs (mentor)
MFC after: 3 days
2009-10-26 19:23:34 +00:00
Robert Watson
99b96cf934 Correct spelling typo in ip_input comment.
Pointed out by:	N.J. Mann <njm at njm.me.uk>,
		John Nielsen <john at jnielsen.net>, julian (!), lstewart
MFC after:	2 days
2009-10-24 09:18:26 +00:00
Qing Li
6cb2b4e7a8 Use the correct option name in the preprocessor command to enable
or disable diagnostic messages.

Reviewed by:	ru
MFC after:	3 days
2009-10-23 18:27:34 +00:00
Robert Watson
0d3d0d74ea Improve grammar in ip_input comment while attempting to maintain what
might be its meaning.

MFC after:	3 days
2009-10-23 13:35:00 +00:00
Qing Li
fc02323563 In the ARP callout timer expiration function, the current time_second
is compared against the entry expiration time value (that was set based
on time_second) to check if the current time is larger than the set
expiration time. Due to the +/- timer granularity value, the comparison
returns false, causing the alternative code to be executed. The
alternative code path freed the memory without removing that entry
from the table list, causing a use-after-free bug.

Reviewed by:	discussed with kmacy
MFC after:	immediately
Verified by:	rnoland, yongari
2009-10-20 17:55:42 +00:00
Robert Watson
6426657e9f Rewrap ip_input() comment so that it prints more nicely.
MFC after:	3 days
2009-10-18 11:23:56 +00:00
Qing Li
93704ac5d7 This patch fixes the following issues in the ARP operation:
1. There is a regression issue in the ARP code. The incomplete
   ARP entry was timing out too quickly (1 second timeout), as
   such, a new entry is created each time arpresolve() is called.
   Therefore the maximum attempts made is always 1. Consequently
   the error code returned to the application is always 0.
2. Set the expiration of each incomplete entry to a 20-second
   lifetime.
3. Return "incomplete" entries to the application.

Reviewed by:	kmacy
MFC after:	3 days
2009-10-15 06:12:04 +00:00
Bjoern A. Zeeb
852da713c3 Compare pointer to NULL rather than 0.
MFC after:	1 month
2009-10-13 20:29:14 +00:00
Michael Tuexen
f71e78a1d9 Fix a race condition where a mutex was destroyed while sleeping on it.
Found while analyzing a report from julian. It might fix his bug.
Approved by: rrs (mentor)
MFC after: 3 days
2009-10-11 12:23:56 +00:00
Julian Elischer
0b4b0b0fee Virtualize the pfil hooks so that different jails may choose different
packet filters. Also allows ipfw to be enabled on one jail and disabled
on another. In 8.0 it's a global setting.

Sitting around in tree waiting to commit for: 2 months
MFC after:	2 months
2009-10-11 05:59:43 +00:00
Michael Tuexen
45623593fb Correct include order as indicated by bz.
Approved by: re (mentor)
MFC after: 3 days
2009-10-10 13:59:18 +00:00
Michael Tuexen
3b1de911e0 Do not include vnet.h twice.
Approved by: rrs (mentor)
MFC after: 3 days
2009-10-09 19:30:23 +00:00
Michael Tuexen
9dd512290c Use correct arguments when calling SCTP_RTALLOC().
Approved by: rrs (mentor)
MFC after: 0 days
2009-10-08 20:33:12 +00:00
Randall Stewart
806a5b8414 Fix so that round robin stream scheduling works as advertised
MFC after:	0 days
2009-10-08 11:36:06 +00:00
Robert Watson
f681a5fdd4 Remove tcp_input lock statistics; these are intended for debugging only
and are not intended to ship in 8.0 as they dirty additional cache
lines in a performance-critical per-packet path.

MFC after:	3 days
2009-10-06 20:35:41 +00:00
Robert Watson
883e9bc41d In tcp_input(), we acquire a global write lock at first only if a
segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN).
If we later have to upgrade the lock, we acquire an inpcb reference
and drop both global/inpcb locks before reacquiring in-order.  In
that gap, the connection may transition into TIMEWAIT, so we need
to loop back and reevaluate the inpcb after relocking.

MFC after:	3 days
Reported by:	Kamigishi Rei <spambox at haruhiism.net>
Reviewed by:	bz
2009-10-05 22:24:13 +00:00
Qing Li
b4a22c365c Remove a log message from production code. This log message can be
triggered by a misconfigured host that is sending out gratuitous ARPs.
This log message can also be triggered during a network renumbering
event when multiple prefixes co-exist on a single network segment.

MFC after:	immediately
2009-10-02 01:45:11 +00:00
Qing Li
fa3cfd39ff Previously, if an address alias is configured on an interface, and
this address alias has a prefix matching that of another address
configured on the same interface, then the ARP entry for the alias
is not deleted from the ARP table when that address alias is removed.
This patch fixes the aforementioned issue.

PR:		kern/139113
MFC after:	3 days
2009-10-02 01:34:55 +00:00
Michael Tuexen
4b6492f5ab Fix handling of sctp_drain().
Approved by: rrs (mentor)
MFC after: 2 month
2009-09-20 11:33:39 +00:00
Michael Tuexen
2c19e7fa86 Fix errnos.
Approved by: rrs(mentor)
MFC after: 3 days.
2009-09-20 11:32:22 +00:00
Michael Tuexen
4af6c75c39 Use appropriate locking when using interface list.
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:55:12 +00:00
Michael Tuexen
30c3a8430c Fix the disabling of sctp_drain().
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:18:42 +00:00
Michael Tuexen
8518270e20 Get SCTP working in combination with VIMAGE.
Contains code from bz.
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:02:16 +00:00
Bruce M Simpson
99bf30cf01 Return ENOBUFS consistently if user attempts to exceed
in_mcast_maxsocksrc resource limit.

Submitted by:	syrinx
MFC after:	3 days
2009-09-18 15:12:31 +00:00
Randall Stewart
482444b4a5 Support for VNET in SCTP (hopefully) 2009-09-17 15:11:12 +00:00
Michael Tuexen
d830c305ea Fix a bug reported by Daniel Mentz:
When authenticating DATA chunks some DATA chunks
might get stuck when the MTU gets decreased via
an ICMP message.

Approved by: rrs (mentor)
MFC after: immediately
2009-09-16 14:23:31 +00:00
Mike Silbersack
b8614722ff Add the ability to see TCP timers via netstat -x. This can be a useful
feature when you have a seemingly stuck socket and want to figure
out why it has not been closed yet.

No plans to MFC this, as it changes the netstat sysctl ABI.

Reviewed by:	andre, rwatson, Eric Van Gyzen
2009-09-16 05:33:15 +00:00
Andre Oppermann
11c99a6d7b -Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by:	brooks

Once compiled in make it easily switchable for testers by using a tuneable
 net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by:	rwatson

MFC after:	2 days
2009-09-15 22:23:45 +00:00
Qing Li
9bb7d0f47a Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.

Reviewed by:	bz
MFC after:	immediately
2009-09-15 19:18:34 +00:00
Qing Li
cd29a7797d This patch enables the node to respond to ARP requests for
configured proxy ARP entries.

Reviewed by:	bz
MFC after:	immediately
2009-09-15 18:39:27 +00:00
Qing Li
96ed1732bb The bootp code installs an interface address and the nfs client
module tries to install the same address again. This extra code
is removed, which was discovered by the removal of a call to
in_ifscrub() in r196714. This call to in_ifscrub is put back here
because the SIOCAIFADDR command can be used to change the prefix
length of an existing alias.

Reviewed by:    kmacy
2009-09-15 01:01:03 +00:00
Qing Li
f0bb05fca5 Previously, the local end of a point-to-point interface was not
reachable within the system that owns the interface. Packets destined to
the local end point leaked to the wire towards the default gateway
if one existed. This behavior is changed as part of the L2/L3
rewrite efforts. The local end point is now reachable within the
system. The inpcb code needs to consider this fact during the
address selection process.

Reviewed by:	bz
MFC after:	immediately
2009-09-14 22:19:47 +00:00
Randall Stewart
f3d06a3c68 Fixes two bugs:
1) A lock issue, if we ever had to try again
   we would double lock the INP lock.
2) We were allowing (at wrap) associd 0... which really
   we cannot allow since 0 normally means in most socket
   API calls that we are wishing to effect something on
   the INP not TCB.

MFC after:	1 week
2009-09-13 17:45:31 +00:00
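Note on f3d06a3c68, item 2: a minimal sketch of an id counter that never hands out 0 when it wraps. The struct and function names are hypothetical, and a real allocator would also have to check that the id is not already in use, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocator state: the last association id handed out. */
struct id_alloc {
	uint32_t last;
};

/*
 * Return the next id. When the counter wraps to 0, skip to 1, because
 * 0 is reserved to mean "the endpoint itself" rather than an association.
 */
uint32_t
next_assoc_id(struct id_alloc *a)
{
	assert(a != NULL);
	a->last++;
	if (a->last == 0)
		a->last = 1;
	return (a->last);
}
```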
Bruce M Simpson
6cbbe26f98 In expire_mfc(), add an assert on the multicast forwarding cache mutex.
PR:		138666
2009-09-13 01:00:24 +00:00
Bruce M Simpson
fa2eebfce6 Comment some flawed assumptions in inp_join_group() about
mixing SSM full-state and delta-based APIs.

ENOTIME to fix right now.  No functional changes.

MFC after:	5 days
2009-09-12 20:37:44 +00:00
Bruce M Simpson
0eebc0d7b4 Don't allow joins w/o source on an existing group.
This is almost always pilot error.

We don't need to check for group filter UNDEFINED state at t1,
because we only ever allocate filters with their groups, so we
unconditionally reject such calls with EINVAL.
Trying to change the active filter mode w/o going through IP_MSFILTER
is also disallowed.

Deals with the case described in PR 137164 upfront, cumulative
with the fix in svn rev 197132 which only calls imo_match_source()
if the source address family was not unspecified.

PR:		137164
MFC after:	5 days
2009-09-12 20:18:23 +00:00
Bruce M Simpson
1fc39d5424 Tighten input checking in inp_join_group():
* Don't try to use the source address, when its family is unspecified.
 * If we get a join without a source, on an existing inclusive
   mode group, this is an error, as it would change the filter mode.

Fix a problem with the handling of in_mfilter for new memberships:
 * Do not rely on imf being NULL; it is explicitly initialized to a
   non-NULL pointer when constructing a membership.
 * Explicitly initialize *imf to EX mode when the source address
   is unspecified.

This fixes a problem with in_mfilter slot recycling in the join path.

PR:		138690
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 19:45:55 +00:00
Bruce M Simpson
cc5776b24d Fix an obvious logic error in the IPv4 multicast leave processing,
where the filter mode vector was not updated correctly after the leave.

PR:		138691
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 19:07:03 +00:00
Bruce M Simpson
67e89408e5 Fix an API issue in leave processing for IPv4 multicast groups.
* Do not assume that the group lookup performed by imo_match_group()
   is valid when ifp is NULL in this case.
 * Instead, return EADDRNOTAVAIL if the ifp cannot be resolved for the
   membership we are being asked to leave.

Caveat user:
 * The way IPv4 multicast memberships are implemented in the inpcb layer
   at the moment, has the side-effect that struct ip_moptions will
   still hold the membership, under the old ifp, until ip_freemoptions()
   is called for the parent inpcb.
 * The underlying issue is: the inpcb layer does not get notification
   of ifp being detached going away in a thread-safe manner.
   This is non-trivial to fix.

But hey, at least the kernel shouldn't panic when you unplug a card.

PR:		138689
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 18:55:15 +00:00
Navdeep Parhar
9a31144537 Add arp_update_event. This replaces route_arp_update_event, which
has not worked since the arp-v2 rewrite.

The event handler will be called with the llentry write-locked and
can examine la_flags to determine whether the entry is being added
or removed.

Reviewed by:	gnn, kmacy
Approved by:	gnn (mentor)
MFC after:	1 month
2009-09-08 21:17:17 +00:00
Poul-Henning Kamp
2ac047d1fe Move the duplicate definition of struct sockaddr_storage to its own
include file, and include this where the previous duplicate definitions were.

Static program checkers like FlexeLint rightfully take a dim view of
duplicate definitions, even if they currently are identical.
2009-09-08 10:39:38 +00:00
Shteryana Shopova
e72ae6eafd When joining a multicast group, the inp_lookup_mcast_ifp call
does a KASSERT that the group address is multicast, so the
check that this is indeed true, returning EINVAL if not,
should be done before calling inp_lookup_mcast_ifp. This fixes a kernel
crash when calling setsockopt (sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,...)
with an invalid group address.

Reviewed by:	bms
Approved by:	bms

MFC after:	3 days
2009-09-07 16:00:33 +00:00
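Note on e72ae6eafd: the ordering it describes (validate user input before calling a function that asserts on it) can be sketched in userland C. The stub below merely stands in for inp_lookup_mcast_ifp(); all names and signatures here are illustrative assumptions, and addresses are taken in host byte order.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Class D test on a host-byte-order IPv4 address: 224.0.0.0/4. */
int
is_multicast(uint32_t addr)
{
	return ((addr & 0xf0000000u) == 0xe0000000u);
}

/* Stand-in for a lookup that asserts its precondition. */
int
lookup_ifp_stub(uint32_t group)
{
	assert(is_multicast(group));
	return (0);
}

/*
 * Check the caller-supplied group first, so that bad input produces
 * EINVAL instead of tripping the assertion inside the lookup.
 */
int
join_group(uint32_t group)
{
	if (!is_multicast(group))
		return (EINVAL);
	return (lookup_ifp_stub(group));
}
```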
Pawel Jakub Dawidek
360488410f Correct comment. 2009-09-06 07:29:22 +00:00
George V. Neville-Neil
54fc657d59 Add ARP statistics to the kernel and netstat.
New counters now exist for:
requests sent
replies sent
requests received
replies received
packets received
total packets dropped due to no ARP entry
entries timed out
Duplicate IPs seen

The new statistics are seen in the netstat command
when it is given the -s command line switch.

MFC after:	2 weeks
In collaboration with: bz
2009-09-03 21:10:57 +00:00
Bjoern A. Zeeb
cc7e9d4325 In case an upper layer protocol tries to send a packet but the
L2 code does not have the ethernet address for the destination
within the broadcast domain in the table, we remember the
original mbuf in `la_hold' in arpresolve() and send out a
different packet with an arp request.
In case there will be more upper layer packets to send we will
free an earlier one held in `la_hold' and queue the new one.

Once we get a packet in, with which we can perfect our arp table
entry we send out the original 'on hold' packet, should there
be any.
Rather than continuing to process the packet that we received,
we returned without freeing the packet that came in, which
basically means that we leaked an mbuf for every arp request
we sent.

Rather than freeing the received packet and returning, continue
to process the incoming arp packet as well.
This should (a) improve some setups, also proxy-arp, in case it was an
incoming arp request and (b) resembles the behaviour FreeBSD had
from day 1, which aligns with RFC826 "Packet reception" (merge case).

Rename 'm0' to 'hold' to make the code more understandable as
well as diffable to earlier versions more easily.

Handle the link-layer entry 'la' lock completely in the block
where needed and release it as early as possible, rather than
holding it longer, down to the end of the function.

Found by:			pointyhat, ns1
Bug hunting session with:	erwin, simon, rwatson
Tested by:			simon on cluster machines
Reviewed by:			ratson, kmacy, julian
MFC after:			3 days
2009-09-01 17:53:01 +00:00
Qing Li
1bf38b1292 This patch fixes the following issues:
- Routing messages are not generated when adding and removing
  interface address aliases.
- Loopback route installed for an interface address alias is
  not deleted from the routing table when that address alias
  is removed from the associated interface.
- Function in_ifscrub() is called extraneously.

Reviewed by:	gnn, kmacy, sam
MFC after:	3 days
2009-08-31 21:02:48 +00:00
Michael Tuexen
2b77dd0181 Fix a bug where vlan interfaces are not supported by SCTP.
Approved by: rrs (mentor)
MFC after: 3 days
2009-08-28 08:41:59 +00:00
Qing Li
0437a93339 Do not try to free the rt_lle entry of the cached route in
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.

Reviewed by:	kmacy, rwatson
MFC after:	immediately
2009-08-28 05:37:31 +00:00
Robert Watson
dc56e98f0d Use locks specific to the lltable code, rather than borrow the ifnet
list/index locks, to protect link layer address tables.  This avoids
lock order issues during interface teardown, but maintains the bug that
sysctl copy routines may be called while a non-sleepable lock is held.

Reviewed by:	bz, kmacy
MFC after:	3 days
2009-08-25 09:52:38 +00:00
Michael Tuexen
24ae5c4a73 This fixes a bug where the value set by SCTP_PARTIAL_DELIVERY_POINT
was not honored, if the socket buffer size was not 4 times that large.

Approved by: rrs (mentor)
MFC after: 3 days.
2009-08-24 11:46:40 +00:00
Randall Stewart
0fa753b3fb This fixes two bugs in the NR-Sack code:
1) When calculating the table offset for sliding the sack
    array, the two byte values must be "ored" together in order
    for us to do the correct sliding of the arrays.
 2) We were NOT properly doing CC and other changes to things only
    NR-Sacked. The solution here is to make a separate function that
    will actually do both CC/updates and free things if its NR sack'd.
    This actually shrinks out common code from three places (much better).

MFC after:	3 days
2009-08-24 11:13:32 +00:00
Marko Zec
2b73aacaf9 Introduce a div_destroy() function which takes over per-vnet cleanup tasks
from the existing modevent / MOD_UNLOAD handler, and register div_destroy()
in protosw as per-vnet .pr_destroy() handler for options VIMAGE builds.  In
nooptions VIMAGE builds, div_destroy() will be invoked from the modevent
handler, resulting in effectively identical operation as it was prior to this
change.  div_destroy() also tears down hashtables used by ipdivert, which
were previously left behind on ipdivert kldunloads.

For options VIMAGE builds only, temporarily disable kldunloading of ipdivert,
because without introducing additional locking logic it is impossible to
atomically check whether all ipdivert instances in all vnets are idle, and
proceed with cleanup without opening a race window for a vnet to open an
ipdivert socket while ipdivert tear-down is in progress.

While here, staticize div_init(), because it is not used outside of
ip_divert.c.

In cooperation with:	julian
Approved by:	re (rwatson), julian (mentor)
MFC after:	3 days
2009-08-24 10:06:02 +00:00
Robert Watson
77dfcdc445 Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock.  Either can be held to stabilize the lists and indexes, but both
are required to write.  This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions.  As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by:	bz, julian
MFC after:	3 days
2009-08-23 20:40:19 +00:00
Julian Elischer
d3cef1d91e Fix another typo right next to the previous one, that amazingly, I did
not see before.

MFC after:	 1 week
2009-08-23 08:49:32 +00:00
Julian Elischer
8f26c03fe6 Fix typo in comment that has been bugging me for days.
MFC after:	1 week
2009-08-23 07:59:28 +00:00
Julian Elischer
c4b21cbe4a Fix ipfw's initialization functions to get the correct order of evaluation
to allow vnet and non vnet operation. Move some functions from ip_fw_pfil.c
to ip_fw2.c and move to mostly using the SYSINIT and VNET_SYSINIT handlers
instead of the modevent handler. Correct some spelling errors in comments
in the affected code. Note this bug fixes a crash in NON VIMAGE kernels when
ipfw is unloaded.

This patch is a minimal patch for 8.0
I have a much larger patch that actually fixes the underlying problems
that will be applied after 8.0

Reviewed by:	zec@, rwatson@, bz@(earlier version)
Approved by:	re (rwatson)
MFC after:	Immediately
2009-08-21 11:20:10 +00:00
Peter Wemm
b4e7e7a065 Fix signed comparison bug when ticks goes negative after 24 days of
uptime.  This causes the tcp time_wait state code to fail to expire
sockets in timewait state.

Approved by:	re (kensmith)
2009-08-20 22:53:28 +00:00
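Note on b4e7e7a065: a minimal sketch of why a plain signed comparison of tick values breaks at the wrap, and the usual wrap-safe alternative. It assumes a 32-bit signed tick counter and hz = 1000, under which 2^31 ticks is roughly 24.9 days. The function names are hypothetical and this is not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Naive check: wrong once the counter wraps from INT32_MAX to negative. */
int
expired_naive(int32_t now, int32_t deadline)
{
	return (now >= deadline);
}

/*
 * Wrap-safe check: subtract as unsigned (well defined in C), then read
 * the difference as signed. Correct while the two values are less than
 * 2^31 ticks apart. The unsigned-to-signed conversion is implementation
 * defined before C23, but wraps on all common compilers.
 */
int
expired_wrapsafe(int32_t now, int32_t deadline)
{
	return ((int32_t)((uint32_t)now - (uint32_t)deadline) >= 0);
}

/* Usage: a deadline set just before the wrap, checked just after it. */
void
tick_wrap_demo(void)
{
	int32_t deadline = INT32_MAX - 5;
	int32_t now = INT32_MIN + 5;	/* 11 ticks past the deadline */

	assert(!expired_naive(now, deadline));	/* looks pending */
	assert(expired_wrapsafe(now, deadline));	/* has expired */
}
```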
Will Andrews
52e12426d1 Fix CARP memory leaks on carp_if's malloc'd using M_CARP. This occurs when
CARP tries to free them using M_IFADDR after the last address for a virtual
host is removed and when detaching from the parent interface.

Reviewed by:	mlaier
Approved by:	re (kib), ken (mentor)
2009-08-20 02:33:12 +00:00
Michael Tuexen
2f99457b0c Fix a bug in the handling of unreliable messages
which results in stalled associations.

Approved by: re, rrs (mentor)
MFC after: immediately
2009-08-19 12:02:28 +00:00
Kip Macy
3ee42584f9 - change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
 - check that a cached flow corresponds to the same fib index as the
   packet for which we are doing the lookup
 - at interface detach time flush any flows referencing stale rtentrys
   associated with the interface that is going away (fixes reported
   panics)
 - reduce the time between cleans, so that if the cleaner is running at
   the time the eventhandler is called and the wakeup is missed, less
   time will elapse before the eventhandler returns
 - separate per-vnet initialization from global initialization
   (pointed out by jeli@)

Reviewed by:	sam@
Approved by:	re@
2009-08-18 20:28:58 +00:00
Michael Tuexen
627dfd6df9 Fix a crash when using one-to-one style socket in non-blocking
mode and there is no listening server.
PR: 137795
Approved by: re, rrs (mentor)
MFC after:immediately.
2009-08-18 19:58:49 +00:00
Michael Tuexen
810ec53688 * Fix a bug where PR-SCTP settings are ignored when using implicit
association setup.
* Fix a bug where messages with illegal stream ids are not deleted.
* Fix a crash when reporting back unsent messages from the send_queue.
* Fix a bug related to INIT retransmission when the socket is already
  closed.
* Fix a bug where associations were stalled when partial delivery API
  was enabled.
* Fix a bug where the receive buffer size was smaller than the
  partial_delivery_point.

Approved by: re, rrs (mentor)
MFC after: One day.
2009-08-15 21:10:52 +00:00
Qing Li
3ef5e21d01 In function ip_output(), the cached route is flushed when there is a
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address to be used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.

Reviewed by:	kmacy, rwatson
Approved by:	re
2009-08-14 23:44:59 +00:00
Marko Zec
f92ae4d706 SCTP is not yet compatible with options VIMAGE kernels although it compiles
with VIMAGE defined, so explicitly disallow building such kernels.

Reviewed by:	rrs
Approved by:	re (rwatson), julian (mentor)
2009-08-14 22:43:25 +00:00
Julian Elischer
72034f5548 Fix ipfw crash on uid or gid check.
Receiving any ip packet for which there is no existing socket will
crash if ipfw has a uid or gid test rule, as the uid/gid
of the non existent owner of said non existent socket is tested.
Brooks introduced this error as part of his >16 gids patch.
It appears to be a cut-n-paste error from similar code a few lines
before. The old code used the 'pcb' variable here, but in the
new code that switched to the 'inp' variable, which is often NULL
and what is tested in the code further up. The rest of the multi-gid
patch for ipfw seems solid (and cleaner than previous code).

Reviewed by:	brooks
Approved by:	re (rwatson)
2009-08-14 10:09:45 +00:00
Robert Watson
9d2eb78bcb Add padding to struct inpcb, missed during our padding sweep earlier in
the release cycle.

Approved by:	re (kensmith)
2009-08-02 22:47:08 +00:00
Robert Watson
315e3e38fa Many network stack subsystems use a single global data structure to hold
all pertinent statistics for the subsystem.  These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored.  This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

      if_bridge
      if_cxgb
      if_gif
      ip_mroute
      ipdivert
      pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack.  However, that change is too aggressive for
this point in the release cycle.

Reviewed by:	bz
Approved by:	re (kib)
2009-08-02 19:43:32 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Xin LI
16e324fc56 Show interface name which received short CARP packet (e.g. a VRRP packet),
in order to match other codepaths nearby.  This makes troubleshooting
easier.

Approved by:	re (kib)
MFC after:	1 month
2009-07-30 17:40:47 +00:00
Julian Elischer
3d1001cb11 Startup the vnet part of initialization a bit after the global part.
Fixes crash on boot if ipfw compiled in.

Submitted by:	tegge@
Reviewed by:	tegge@
Approved by:	re (kib)
2009-07-28 19:58:07 +00:00
Julian Elischer
7973fba3a4 Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by:	ambrisko
Approved by:	re (kib)
MFC after:	3 days
2009-07-28 19:43:27 +00:00
Michael Tuexen
bf3d517756 Fix a bug where the wrong initialization value
is used for an SCTP specific sysctl variable.

Approved by: re, rrs(mentor).
MFC after: 2 weeks.
2009-07-28 15:07:41 +00:00
Randall Stewart
cfde3ff70b Turns out that when a receiver forwards through its TSNs the
processing code holds the read lock (when processing a
FWD-TSN for pr-sctp). If it finds stranded data that
can be given to the application, it calls sctp_add_to_readq().
The readq function also grabs this lock. So if INVAR is on
we get a double recurse on a non-recursive lock and panic.

This fix will change it so that readq() function gets a
flag to tell if the lock is held, if so then it does not
get the lock.

Approved by:	re@freebsd.org (Kostik Belousov)
MFC after:	1 week
2009-07-28 14:09:06 +00:00
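Note on cfde3ff70b: the "tell the callee whether the lock is already held" pattern can be sketched with a toy non-recursive lock. Every name below is hypothetical; the assertion in toy_lock_acquire() stands in for the panic an invariants-enabled kernel raises on recursive acquisition.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: acquiring it twice trips an assertion. */
struct toy_lock {
	bool held;
};

void
toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

void
toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct readq {
	struct toy_lock lock;
	int entries;
};

/*
 * Append an entry. The caller states whether it already holds the lock;
 * the lock is taken and dropped here only when it does not.
 */
void
add_to_readq(struct readq *q, bool lock_held)
{
	if (!lock_held)
		toy_lock_acquire(&q->lock);
	q->entries++;
	if (!lock_held)
		toy_lock_release(&q->lock);
}
```

A caller that already holds the lock passes true and keeps ownership afterwards; passing false from such a caller would recurse on the lock, which is the failure the commit describes.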
Qing Li
df813b7ea2 This patch does the following:
- Allow loopback route to be installed for address assigned to
  interface of IFF_POINTOPOINT type.
- Install loopback route for an IPv4 interface address when the
  "useloopback" sysctl variable is enabled. Similarly, install
  loopback route for an IPv6 interface address when the sysctl variable
  "nd6_useloopback" is enabled. Deleting loopback routes for interface
  addresses is unconditional in case these sysctl variables were
  disabled after an interface address has been assigned.

Reviewed by:	bz
Approved by:	re
2009-07-27 17:08:06 +00:00
Michael Tuexen
8e71b6947a Fix the handling of unordered messages when using
PR-SCTP.

Approved by: re, rrs (mentor)
MFC after: 3 weeks.
2009-07-27 13:41:45 +00:00
Michael Tuexen
4420d9a062 Get rid of unused field. This will also be deleted
in the official specification of the SCTP socket API.

Approved by:re, rrs (mentor)
2009-07-27 12:09:32 +00:00
Michael Tuexen
47a490cbbc Add a missing unlock for the inp lock when
returning early from sctp_add_to_readq().

Approved by: re, rrs (mentor)
MFC after: 2 weeks.
2009-07-26 15:06:59 +00:00
Julian Elischer
9d85f50ad5 Catch ipfw up to the rest of the vimage code.
It got left behind when it moved to its new location.

Approved by:	re (kensmith)
2009-07-25 06:42:42 +00:00
Robert Watson
d0728d7174 Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
  occur each time a network stack is instantiated and destroyed.  In the
  !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
  For the VIMAGE case, we instead use SYSINIT's to track their order and
  properties on registration, using them for each vnet when created/
  destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
  previously, as well as its dependency scheme: we now just use the
  SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
  they want init functions to be called for each virtual network stack
  rather than just once at boot, compiling down to DOMAIN_SET() in the
  non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
  of modinfo or DOMAIN_SET() for init/uninit events.  In some cases,
  convert modular components from using modevent to using sysinit (where
  appropriate).  In some cases, do minor rejuggling of SYSINIT ordering
  to make room for or better manage events.

Portions submitted by:	jhb (VNET_SYSINIT), bz (cleanup)
Discussed with:		jhb, bz, julian, zec
Reviewed by:		bz
Approved by:		re (VIMAGE blanket)
2009-07-23 20:46:49 +00:00
Bjoern A. Zeeb
a08362ce46 sysctl_msec_to_ticks is used with both virtualized and
non-virtualized sysctls so we cannot use one common function.

Add a macro to vnet.h to convert the arg1 in the virtualized case,
so as not to expose the maths all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two other places to use the new macro.

Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-21 21:58:55 +00:00
Robert Watson
a511354af4 Back out the moving in r195782 of V_ip_id's initialization from the top
back to the bottom of ip_init() as found in 7.x.  I missed the fact that
the bottom half of the init routine only runs in the !VNET case.

Submitted by:	zec
Approved by:	re (vimage blanket)
2009-07-20 19:40:09 +00:00
Robert Watson
0a4747d4d0 Garbage collect vnet module registrations that have neither constructors
nor destructors, as there's no actual work to do.

In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.

Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 13:55:33 +00:00
Robert Watson
5ee847d3ac Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep lock with two locks: an
rwlock (for non-sleeping use) and an sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare whether or not they may perform unbounded sleeps when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
Robert Watson
1e77c1056a Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
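
As a rough userland illustration of the access path described above, the
sketch below keeps a reference copy of two variables, gives each stack
instance its own copy initialized from it, and reaches a variable through
an accessor macro that adds the variable's offset to the instance's base
address.  It is not the kernel implementation: an ordinary struct stands
in for the linker-set region purely to keep the example portable, and
every name (struct ref_region, struct stack_instance, stack_alloc(),
STACK_VAR()) is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the contiguous region holding the reference
 * copies.  The kernel builds this region from a linker set; a struct is
 * used here only so that offsets can be computed portably.
 */
struct ref_region {
	int ttl;
	int sack_enabled;
};

/* Reference copy: holds the static initial values. */
static const struct ref_region ref_copy = { .ttl = 64, .sack_enabled = 1 };

/* Hypothetical per-instance handle: just the base of its private copy. */
struct stack_instance {
	void *data;
};

/* Create an instance whose variables start out equal to the reference copy. */
struct stack_instance *
stack_alloc(void)
{
	struct stack_instance *s = malloc(sizeof(*s));

	if (s == NULL)
		return (NULL);
	s->data = malloc(sizeof(ref_copy));
	if (s->data == NULL) {
		free(s);
		return (NULL);
	}
	memcpy(s->data, &ref_copy, sizeof(ref_copy));
	return (s);
}

void
stack_free(struct stack_instance *s)
{
	if (s != NULL) {
		free(s->data);
		free(s);
	}
}

/* Accessor: instance base + offset of the requested variable. */
#define STACK_VAR(stack, type, field)					\
	(*(type *)((char *)(stack)->data +				\
	    offsetof(struct ref_region, field)))
```

With two instances a and b, writing STACK_VAR(a, int, ttl) leaves
STACK_VAR(b, int, ttl) at its statically initialized value, which is the
property the per-vnet allocator provides for real virtualized globals.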

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Lawrence Stewart
91a5ebde45 Fix a race in the manipulation of the V_tcp_sack_globalholes global variable,
which is currently not protected by any type of lock. When triggered, the bug
would sometimes cause a panic when the TCP activity to an affected machine
eventually slowed during a lull. The panic only occurs if INVARIANTS is compiled
into the kernel, and has lain dormant for some time as a result of INVARIANTS
being off by default except in FreeBSD-CURRENT.

Switch to atomic operations in the locations where the variable is changed.
Reads have not been updated to be protected by atomics, so there is a
possibility of accounting errors in any given calculation where the variable is
read. This is considered unlikely to occur in the wild, and will not cause
serious harm on rare occasions where it does.
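
The shape of the fix can be sketched in portable C, assuming C11
<stdatomic.h> as a stand-in for the kernel's atomic operations.  The
names holes_in_use, hole_account_alloc(), hole_account_free() and
holes_snapshot() are hypothetical and do not appear in the TCP code.
Updates are atomic read-modify-write operations, so concurrent
increments and decrements cannot lose each other.  Note one difference
from the commit: a C11 load of an atomic object is always atomic, so
this sketch cannot reproduce the unprotected reads described above; the
relaxed load only signals that the reader accepts a slightly stale value.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the shared, otherwise unlocked counter. */
static atomic_int holes_in_use;

/* Account for a newly allocated hole: atomic increment. */
void
hole_account_alloc(void)
{
	atomic_fetch_add(&holes_in_use, 1);
}

/* Account for a freed hole: atomic decrement. */
void
hole_account_free(void)
{
	atomic_fetch_sub(&holes_in_use, 1);
}

/*
 * Read the counter for a limit check.  The value may already be stale
 * by the time the caller acts on it, which is tolerable for accounting.
 */
int
holes_snapshot(void)
{
	return (atomic_load_explicit(&holes_in_use, memory_order_relaxed));
}
```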

Thanks to Robert Watson for debugging help.

Reported by:	Kamigishi Rei <spambox at haruhiism dot net>
Tested by:	Kamigishi Rei <spambox at haruhiism dot net>
Reviewed by:	silby
Approved by:	re (rwatson), kensmith (mentor temporarily unavailable)
2009-07-13 11:59:38 +00:00
Lawrence Stewart
237fbe0a1c Replace struct tcpopt with a proxy toeopt struct in the TOE driver interface to
the TCP syncache. This returns struct tcpopt to being private within the TCP
implementation, thus allowing it to be modified without ABI concerns.

The patch breaks the ABI. Bump __FreeBSD_version to 800103 accordingly. The cxgb
driver is the only TOE consumer affected by this change, and needs to be
recompiled along with the kernel.

Suggested by:	rwatson
Reviewed by:	rwatson, kmacy
Approved by:	re (kensmith), kensmith (mentor temporarily unavailable)
2009-07-13 11:51:02 +00:00
Lawrence Stewart
962ebef8c0 Pad the following TCP related structs to allow MFCs of upcoming features/fixes
back to the 8 branch:

tcp_var.h
- struct sackhint
- struct tcpcb
- struct tcpstat

The patch breaks the ABI. Bump __FreeBSD_version to 800102 accordingly. User
space tools that rely on the size of any of these structs (e.g. sockstat) need
to be recompiled.
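
Why padding helps later merges can be shown with a small sketch.  The
structs below are hypothetical and are not the real sackhint, tcpcb or
tcpstat layouts: the first version reserves spare slots at the end, and a
later version claims one of them for a new field without changing the
struct's size or the offsets of existing members, so consumers built
against the earlier layout keep working.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout as first shipped, with reserved space at the end. */
struct conn_stats_v1 {
	uint64_t segs_sent;
	uint64_t segs_rcvd;
	uint64_t _pad[6];	/* spare slots for future fields */
};

/* Hypothetical later layout: one spare slot is turned into a real field. */
struct conn_stats_v2 {
	uint64_t segs_sent;
	uint64_t segs_rcvd;
	uint64_t ecn_marks;	/* new field, takes the first spare slot */
	uint64_t _pad[5];
};

/* The size and the offsets of the existing members must not move. */
_Static_assert(sizeof(struct conn_stats_v1) == sizeof(struct conn_stats_v2),
    "adding a field must not change the struct size");
_Static_assert(offsetof(struct conn_stats_v1, segs_rcvd) ==
    offsetof(struct conn_stats_v2, segs_rcvd),
    "existing members must keep their offsets");
```

Adding the padding in the first place is itself the ABI break noted
above, since it changes the struct sizes once; the payoff is that later
additions no longer do.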

Reviewed by:	rpaulo, sam, andre, rwatson
Approved by:	re & mentor (gnn)
2009-07-12 09:14:28 +00:00
Robert Watson
6c8615603b Update various IPFW-related modules to use if_addr_rlock()/
if_addr_runlock() rather than IF_ADDR_LOCK()/IF_ADDR_UNLOCK().

MFC after:	6 weeks
2009-06-26 00:46:50 +00:00
Robert Watson
d1da0a0672 Add address list locking for in6_ifaddrhead/ia_link: as with locking
for in_ifaddrhead, we stick with an rwlock for the time being, which
we will revisit in the future with a possible move to rmlocks.

Some pieces of code require significant further reworking to be
safe from all classes of writer-writer races.

Reviewed by:	bz
MFC after:	6 weeks
2009-06-25 16:35:28 +00:00
Robert Watson
64aeca7b42 Initialize in_ifaddr_lock using RW_SYSINIT() instead of in ip_init(),
so that it doesn't run multiple times if VIMAGE is being used.

Discussed with:	bz
MFC after:	6 weeks
2009-06-25 14:44:00 +00:00