Commit Graph

2351 Commits

Author SHA1 Message Date
Mike Silbersack
5f311da2cc Port randomization leads to extremely fast port reuse at high
connection rates, which is causing problems for some users.

To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.

Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds.  These thresholds may be tuned via sysctl.

Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.

MFC After:	20 seconds
2005-01-02 01:50:57 +00:00
Robert Watson
74d4630b71 Remove an errant blank line apparently introduced in
ip_output.c:1.194.
2004-12-25 22:59:42 +00:00
Robert Watson
42cf3289c3 In the dropafterack case of tcp_input(), it's OK to release the TCP
pcbinfo lock before calling tcp_output(), as holding just the inpcb
lock is sufficient to prevent garbage collection.
2004-12-25 22:26:13 +00:00
Robert Watson
e0bef1cb35 Revert parts of tcp_input.c:1.255 associated with the header predicted
cases for tcp_input():

While it is true that the pcbinfo lock provides a pseudo-reference to
inpcbs, both the inpcb and pcbinfo locks are required to free an
un-referenced inpcb.  As such, we can release the pcbinfo lock as
long as the inpcb remains locked with the confidence that it will not
be garbage-collected.  This leads to a less conservative locking
strategy that should reduce contention on the TCP pcbinfo lock.

Discussed with: sam
2004-12-25 22:23:13 +00:00
Robert Watson
452d9f5b1c Attempt to consistently use () around return values in calls to
return() in newer code (sysctl, ISN, timewait).
2004-12-23 01:34:26 +00:00
Robert Watson
06da46b241 Remove an XXXRW comment relating to whether or not the TCP timers are
MPSAFE: they are now believed to be.

Correct a typo in a second comment.

MFC after:	2 weeks
2004-12-23 01:27:13 +00:00
Robert Watson
db0aae38b6 Remove the now unused tcp_canceltimers() function. tcpcb timers are
now stopped as part of tcp_discardcb().

MFC after:	2 weeks
2004-12-23 01:25:59 +00:00
Robert Watson
950ab1e470 Remove an annotation of a minor race relating to the update of
multiple MIB entries using sysctl in short order, which might
result in unexpected values for tcp_maxidle being generated by
tcp_slowtimo.  In practice, this will not happen, or at least,
doesn't require an explicit comment.

MFC after:	2 weeks
2004-12-23 01:21:54 +00:00
Gleb Smirnoff
5e5da86597 In certain cases ip_output() can free our route, so check
for its presence before RTFREE().

Noticed by:	ru
2004-12-10 07:51:14 +00:00
Gleb Smirnoff
d2a09f901a Revert last change.
Andre:
  First lets get major new features into the kernel in a clean and nice way,
  and then start optimizing. In this case we don't have any obfusication that
  makes later profiling and/or optimizing difficult in any way.

Requested by:	csjp, sam
2004-12-10 07:47:17 +00:00
Christian S.J. Peron
fbf2edb6e4 This commit adds a shared locking mechanism very similar to the
mechanism used by pfil.  This shared locking mechanism will remove
a nasty lock order reversal which occurs when ucred based rules
are used which results in hard locks while mpsafenet=1.

So this removes the debug.mpsafenet=0 requirement when using
ucred based rules with IPFW.

It should be noted that this locking mechanism does not guarantee
fairness between read and write locks, and that it will favor
firewall chain readers over writers. This seemed acceptable since
write operations to firewall chains protected by this lock tend to
be less frequent than reads.

Reviewed by:	andre, rwatson
Tested by:	myself, seanc
Silence on:	ipfw@
MFC after:	1 month
2004-12-10 02:17:18 +00:00
Gleb Smirnoff
f5a19d3909 Check that DUMMYNET_LOADED before seeking dummynet m_tag.
Reviewed by:	andre
MFC after:	1 week
2004-12-09 16:41:47 +00:00
Max Laier
067a8bab8a More fixing of multiple addresses in the same prefix. This time do not try
to arp resolve "secondary" local addresses.

Found and submitted by:	ru
With additions from:	OpenBSD (rev. 1.47)
Reviewed by:		ru
2004-12-09 00:12:41 +00:00
Ruslan Ermilov
5cae05ad33 Time out routes created by redirect. 2004-12-06 22:27:22 +00:00
Gleb Smirnoff
98335aa976 - Make route cacheing optional, configurable via IFF_LINK0 flag.
- Turn it off by default.

Requested by:	many
Reviewed by:	andre
Approved by:	julian (mentor)
MFC after:	3 days
2004-12-06 19:02:43 +00:00
Robert Watson
79a9e59c89 Assert the tcptw inpcb lock in tcp_timer_2msl_reset(), as fields in
the tcptw undergo non-atomic read-modify-writes.

MFC after:	2 weeks
2004-12-05 22:47:29 +00:00
Robert Watson
b9155d92b2 Assert inpcb lock in:
tcpip_fillheaders()
  tcp_discardcb()
  tcp_close()
  tcp_notify()
  tcp_new_isn()
  tcp_xmit_bandwidth_limit()

Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).

MFC after:	2 weeks
2004-12-05 22:27:53 +00:00
Robert Watson
6fbed4af22 Minor grammer fix in comment. 2004-12-05 22:20:59 +00:00
Robert Watson
89924e5865 Pass the inpcb reference into ip_getmoptions() rather than just the
inp->inp_moptions pointer, so that ip_getmoptions() can perform
necessary locking when doing non-atomic reads.

Lock the inpcb by default to copy any data to local variables, then
unlock before performing sooptcopyout().

MFC after:	2 weeks
2004-12-05 22:08:37 +00:00
Robert Watson
92c71ab30b Define INP_UNLOCK_ASSERT() to assert that an inpcb is unlocked.
MFC after:	2 weeks
2004-12-05 22:07:14 +00:00
Robert Watson
5c918b56d8 Push the inpcb argument into ip_setmoptions() when setting IP multicast
socket options, so that it is available for locking.
2004-12-05 21:38:33 +00:00
Robert Watson
993d9505d4 Start working through inpcb locking for ip_ctloutput() by cleaning up
modifications to the inpcb IP options mbuf:

- Lock the inpcb before passing it into ip_pcbopts() in order to prevent
  simulatenous reads and read-modify-writes that could result in races.
- Pass the inpcb reference into ip_pcbopts() instead of the option chain
  pointer in the inpcb.
- Assert the inpcb lock in ip_pcbots.
- Convert one or two uses of a pointer as a boolean or an integer
  comparison to a comparison with NULL for readability.
2004-12-05 19:11:09 +00:00
Paul Saab
7d5ed1ceea Fixes a bug in SACK causing us to send data beyond the receive window.
Found by: Pawel Worach and Daniel Hartmeier
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
2004-11-29 18:47:27 +00:00
Robert Watson
2be3bf2244 Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify-
write of various time/rtt-related fields in the tcpcb.
2004-11-28 11:06:22 +00:00
Robert Watson
18ad5842c5 Expand coverage of the receive socket buffer lock when handling urgent
pointer updates: test available space while holding the socket buffer
mutex, and continue to hold until until the pointer update has been
performed.

MFC after:	2 weeks
2004-11-28 11:01:31 +00:00
Robert Watson
c8443a1dc0 Do export the advertised receive window via the tcpi_rcv_space field of
struct tcp_info.
2004-11-27 20:20:11 +00:00
Robert Watson
b8af5dfa81 Implement parts of the TCP_INFO socket option as found in Linux 2.6.
This socket option allows processes query a TCP socket for some low
level transmission details, such as the current send, bandwidth, and
congestion windows.  Linux provides a 'struct tcpinfo' structure
containing various variables, rather than separate socket options;
this makes the API somewhat fragile as it makes it dificult to add
new entries of interest as requirements and implementation evolve.
As such, I've included a large pad at the end of the structure.
Right now, relatively few of the Linux API fields are filled in, and
some contain no logical equivilent on FreeBSD.  I've include __'d
entries in the structure to make it easier to figure ou what is and
isn't omitted.  This API/ABI should be considered unstable for the
time being.
2004-11-26 18:58:46 +00:00
Mike Silbersack
6a220ed80a Fix a problem where our TCP stack would ignore RST packets if the receive
window was 0 bytes in size.  This may have been the cause of unsolved
"connection not closing" reports over the years.

Thanks to Michiel Boland for providing the fix and providing a concise
test program for the problem.

Submitted by:	Michiel Boland
MFC after:	2 weeks
2004-11-25 19:04:20 +00:00
Robert Watson
de30ea131f In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the
contents of the tcpcb are read and modified in volume.

In tcp_input(), replace th comparison with 0 with a comparison with
NULL.

At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'.  Try to improve consistency between
various assertions regarding headlocked to be more informative.

MFC after:	2 weeks
2004-11-23 23:41:20 +00:00
Robert Watson
cce83ffb5a tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it.  Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after:	2 weeks
2004-11-23 17:21:30 +00:00
Robert Watson
b42ff86e73 De-spl tcp_slowtimo; tcp_maxidle assignment is subject to possible
but unlikely races that could be corrected by having tcp_keepcnt
and tcp_keepintvl modifications go through handler functions via
sysctl, but probably is not worth doing.  Updates to multiple
sysctls within evaluation of a single addition are unlikely.

Annotate that tcp_canceltimers() is currently unused.

De-spl tcp_timer_delack().

De-spl tcp_timer_2msl().

MFC after:	2 weeks
2004-11-23 16:45:07 +00:00
Robert Watson
7258e91f0f Assert the inpcb lock in tcp_twstart(), which does both read-modify-write
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().

Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller.  This is a bug.

Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.

Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.

MFC after:	2 weeks
2004-11-23 16:23:13 +00:00
Robert Watson
8263bab34d Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(),
and tcp_drop(), due to read-modify-write of TCP state variables.

MFC after:	2 weeks
2004-11-23 16:06:15 +00:00
Robert Watson
8438db0f59 Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock
protects access to the ISN state variables.

Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.

Staticize internal ISN variables since they're not used outside of
tcp_subr.c.

MFC after:	2 weeks
2004-11-23 15:59:43 +00:00
Robert Watson
ca127a3e80 Remove "Unlocked read" annotations associated with previously unlocked
use of socket buffer fields in the TCP input code.  These references
are now protected by use of the receive socket buffer lock.

MFC after:	1 week
2004-11-22 13:16:27 +00:00
Robert Watson
98734750b4 s/send/sent/ in comment describing TCPS_SYN_RECEIVED. 2004-11-21 14:38:04 +00:00
Gleb Smirnoff
c1384b5ae2 - Since divert protocol is not connection oriented, remove SS_ISCONNECTED flag
from divert sockets.
- Remove div_disconnect() method, since it shouldn't be called now.
- Remove div_abort() method. It was never called directly, since protocol
  doesn't have listen queue. It was called only from div_disconnect(),
  which is removed now.

Reviewed by:	rwatson, maxim
Approved by:	julian (mentor)
MT5 after:	1 week
MT4 after:	1 month
2004-11-18 13:49:18 +00:00
Max Laier
9a6a6eeba2 Fix host route addition for more than one address to a loopback interface
after allowing more than one address with the same prefix.

Reported by:	Vladimir Grebenschikov <vova NO fbsd SPAM ru>
Submitted by:	ru (also NetBSD rev. 1.83)
Pointyhat to:	mlaier
2004-11-17 23:14:03 +00:00
Max Laier
81d96ce8a4 Merge copyright notices.
Requested by:	njl
2004-11-13 17:05:40 +00:00
Gleb Smirnoff
ea0bd57615 Fix ng_ksocket(4) operation as a divert socket, which is pretty useful
and has been broken twice:

- in the beginning of div_output() replace KASSERT with assignment, as
  it was in rev. 1.83. [1] [to be MFCed]
- refactor changes introduced in rev. 1.100: do not prepend a new tag
  unconditionally. Before doing this check whether we have one. [2]

A small note for all hacking in this area:
when divert socket is not a real userland, but ng_ksocket(4), we receive
_the same_ mbufs, that we transmitted to socket. These mbufs have rcvif,
the tags we've put on them. And we should treat them correctly.

Discussed with:	mlaier [1]
Silence from:	green [2]
Reviewed by:	maxim
Approved by:	julian (mentor)
MFC after:	1 week
2004-11-12 22:17:42 +00:00
Max Laier
48321abefe Change the way we automatically add prefix routes when adding a new address.
This makes it possible to have more than one address with the same prefix.
The first address added is used for the route. On deletion of an address
with IFA_ROUTE set, we try to find a "fallback" address and hand over the
route if possible.
I plan to MFC this in 4 weeks, hence I keep the - now obsolete - argument to
in_ifscrub as it must be considered KAPI as it is not static in in.c. I will
clean this after the MFC.

Discussed on:	arch, net
Tested by:	many testers of the CARP patches
Nits from:	ru, Andrea Campi <andrea+freebsd_arch webcom it>
Obtained from:	WIDE via OpenBSD
MFC after:	1 month
2004-11-12 20:53:51 +00:00
Poul-Henning Kamp
e21e4c19c9 Add missing '='
Spotted by:	obrien
2004-11-11 19:02:01 +00:00
Andre Oppermann
5e7b233055 Fix a double-free in the 'hlen > m->m_len' sanity check.
Bug report by:	<james@towardex.com>
MFC after:	2 weeks
2004-11-09 09:40:32 +00:00
SUZUKI Shinsuke
3d54848fc2 support TCP-MD5(IPv4) in KAME-IPSEC, too.
MFC after: 3 week
2004-11-08 18:49:51 +00:00
Poul-Henning Kamp
756d52a195 Initialize struct pr_userreqs in new/sparse style and fill in common
default elements in net_init_domain().

This makes it possible to grep these structures and see any bogosities.
2004-11-08 14:44:54 +00:00
Robert Watson
d6915262af Do some re-sorting of TCP pcbinfo locking and assertions: make sure to
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb
from potentially being recycled.  Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed.  Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.

MFC after:	3 weeks
2004-11-07 19:19:35 +00:00
Andre Oppermann
e9a4cd2426 Fix a double-free in the 'm->m_len < sizeof (struct ip)' sanity check.
Bug report by:	<james@towardex.com>
MFC after:	2 weeks
2004-11-06 10:47:36 +00:00
Poul-Henning Kamp
c83c1318f5 Hide udp_in6 behind #ifdef INET6 2004-11-04 07:14:03 +00:00
Bruce M Simpson
38f061057b When performing IP fast forwarding, immediately drop traffic which is
destined for a blackhole route.

This also means that blackhole routes do not need to be bound to lo(4)
or disc(4) interfaces for the net.inet.ip.fastforwarding=1 case.

Submitted by:	james at towardex dot com
Sponsored by:	eXtensible Open Router Project <URL:http://www.xorp.org/>
MFC after:	3 weeks
2004-11-04 02:14:38 +00:00
Robert Watson
d4b509bd7f Until this change, the UDP input code used global variables udp_in,
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and soappendaddr_locked().  While file in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port.  To correct this and
related races:

- Eliminate udp_ip6, which is believed to be generated but then never
  used.  Eliminate ip_2_ip6_hdr() as it is now unneeded.

- Eliminate setting, testing, and existence of 'init' status fields
  for the IPv6 structures.  While with multiple UDP delivery this
  could lead to amortization of IPv4 -> IPv6 conversion when
  delivering an IPv4 UDP packet to an IPv6 socket, it added
  substantial complexity and side effects.

- Move global structures into the stack, declaring udp_in in
  udp_input(), and udp_in6 in udp_append() to be used if a conversion
  is required.  Pass &udp_in into udp_append().

- Re-annotate comments to reflect updates.

With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism.  This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.

Discovered by:	kris (Bug Magnet)
MFC after:	3 weeks
2004-11-04 01:25:23 +00:00
Andre Oppermann
c94c54e4df Remove RFC1644 T/TCP support from the TCP side of the network stack.
A complete rationale and discussion is given in this message
and the resulting discussion:

 http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel.  Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols)  and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on:	-arch
2004-11-02 22:22:22 +00:00
Robert Watson
ab5c14d828 Correct a bug in TCP SACK that could result in wedging of the TCP stack
under high load: only set function state to loop and continuing sending
if there is no data left to send.

RELENG_5_3 candidate.

Feet provided:	Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by:	Aniel Hartmeier <daniel at benzedrine dot cx>
Submitted by:	mohan <mohans at yahoo-inc dot com>
2004-10-30 12:02:50 +00:00
Robert Watson
c427483381 Add a matching tunable for net.inet.tcp.sack.enable sysctl. 2004-10-26 08:59:09 +00:00
Bruce M Simpson
d6fa5d2806 Check that rt_mask(rt) is non-NULL before dereferencing it, in the
RTM_ADD case, thus avoiding a panic.

Submitted by:	Iasen Kostov
2004-10-26 03:31:58 +00:00
Andre Oppermann
84bb6a2e75 IPDIVERT is a module now and tell the other parts of the kernel about it.
IPDIVERT depends on IPFIREWALL being loaded or compiled into the kernel.
2004-10-25 20:02:34 +00:00
Ruslan Ermilov
a35d88931c For variables that are only checked with defined(), don't provide
any fake value.
2004-10-24 15:33:08 +00:00
Andre Oppermann
cd109b0d82 Shave 40 unused bytes from struct tcpcb. 2004-10-22 19:55:04 +00:00
Andre Oppermann
21dcc96f4a When printing the initialization string and IPDIVERT is not compiled into the
kernel refer to it as "loadable" instead of "disabled".
2004-10-22 19:18:06 +00:00
Andre Oppermann
24fc79b0a4 Refuse to unload the ipdivert module unless the 'force' flag is given to kldunload.
Reflect the fact that IPDIVERT is a loadable module in the divert(4) and ipfw(8)
man pages.
2004-10-22 19:12:01 +00:00
Andre Oppermann
57bbe2e1ab Destroy the UMA zone on unload. 2004-10-19 22:51:20 +00:00
Andre Oppermann
2de1a9eb6e Slightly extend the locking during unload to fully cover the protocol
deregistration.  This does not entirely close the race but narrows the
even previously extremely small chance of a race some more.
2004-10-19 22:08:13 +00:00
Robert Watson
279128e295 Annotate a newly introduced race present due to the unloading of
protocols: it is possible for sockets to be created and attached
to the divert protocol between the test for sockets present and
successful unload of the registration handler.  We will need to
explore more mature APIs for unregistering the protocol and then
draining consumers, or an atomic test-and-unregister mechanism.
2004-10-19 21:35:42 +00:00
Andre Oppermann
72584fd2c0 Convert IPDIVERT into a loadable module. This makes use of the dynamic loadability
of protocols.  The call to divert_packet() is done through a function pointer.  All
semantics of IPDIVERT remain intact.  If IPDIVERT is not loaded ipfw will refuse to
install divert rules and  natd will complain about 'protocol not supported'.  Once
it is loaded both will work and accept rules and open the divert socket.  The module
can only be unloaded if no divert sockets are open.  It does not close any divert
sockets when an unload is requested but will return EBUSY instead.
2004-10-19 21:14:57 +00:00
Andre Oppermann
969bb53e80 Properly declare the "net.inet" sysctl subtree. 2004-10-19 21:06:14 +00:00
Andre Oppermann
539be79a9d Pre-emptively define IPPROTO_SPACER to 32767, the same value as PROTO_SPACER
to document that this value is globally assigned for a special purpose and
may not be reused within the IPPROTO number space.
2004-10-19 20:59:01 +00:00
Andre Oppermann
dff3237ee5 Make use of the PROTO_SPACER functionality for dynamically loadable
protocols in inetsw[] and define initially eight spacer slots.

Remove conflicting declaration 'struct pr_usrreqs nousrreqs'.  It is
now declared and initialized in kern/uipc_domain.c.
2004-10-19 15:58:22 +00:00
Andre Oppermann
de38924dc0 Support for dynamically loadable and unloadable IP protocols in the ipmux.
With pr_proto_register() it has become possible to dynamically load protocols
within the PF_INET domain.  However the PF_INET domain has a second important
structure called ip_protox[] that is derived from the 'struct protosw inetsw[]'
and takes care of the de-multiplexing of the various protocols that ride on
top of IP packets.

The functions ipproto_[un]register() allow to dynamically adjust the ip_protox[]
array mux in a consistent and easy way.  To register a protocol within
ip_protox[] the existence of a corresponding and matching protocol definition
in inetsw[] is required.  The function does not allow to overwrite an already
registered protocol.  The unregister function simply replaces the mux slot with
the default index pointer to IPPROTO_RAW as it was previously.
2004-10-19 15:45:57 +00:00
Andre Oppermann
1cf15713ed Add a macro for the destruction of INP_INFO_LOCK's used by loadable modules. 2004-10-19 14:34:13 +00:00
Andre Oppermann
de1c2ac4bf Make comments more clear. Change the order of one if() statement to check the
more likely variable first.
2004-10-19 14:31:56 +00:00
Robert Watson
81158452be Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
  mutex, avoiding sofree() having to drop the socket mutex and re-order,
  which could lead to races permitting more than one thread to enter
  sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
  the protocol to the socket, preventing races in clearing and
  evaluation of the reference such that sofree() might be called more
  than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket.  The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets.  The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after:	3 days
Reviewed by:	dwhite
Discussed with:	gnn, dwhite, green
Reported by:	Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by:	Vlad <marchenko at gmail dot com>
2004-10-18 22:19:43 +00:00
Robert Watson
6b8e5a9862 Don't release the udbinfo lock until after the last use of UDP inpcb
in udp_input(), since the udbinfo lock is used to prevent removal of
the inpcb while in use (i.e., as a form of reference count) in the
in-bound path.

RELENG_5 candidate.
2004-10-12 20:03:56 +00:00
Robert Watson
00fcf9d12d Modify the thrilling "%D is using my IP address %s!" message so that
it isn't printed if the IP address in question is '0.0.0.0', which is
used by nodes performing DHCP lookup, and so constitute a false
positive as a report of misconfiguration.
2004-10-12 17:10:40 +00:00
Robert Watson
6c67b8b695 When the access control on creating raw sockets was modified so that
processes in jail could create raw sockets, additional access control
checks were added to raw IP sockets to limit the ways in which those
sockets could be used.  Specifically, only the socket option IP_HDRINCL
was permitted in rip_ctloutput().  Other socket options were protected
by a call to suser().  This change was required to prevent processes
in a Jail from modifying system properties such as multicast routing
and firewall rule sets.

However, it also introduced a regression: processes that create a raw
socket with root privilege, but then downgraded credential (i.e., a
daemon giving up root, or a setuid process switching back to the real
uid) could no longer issue other unprivileged generic IP socket option
operations, such as IP_TOS, IP_TTL, and the multicast group membership
options, which prevented multicast routing daemons (and some other
tools) from operating correctly.

This change pushes the access control decision down to the granularity
of individual socket options, rather than all socket options, on raw
IP sockets.  When rip_ctloutput() doesn't implement an option, it will
now pass the request directly to in_control() without an access
control check.  This should restore the functionality of the generic
IP socket options for raw sockets in the above-described scenarios,
which may be confirmed with the ipsockopt regression test.

RELENG_5 candidate.

Reviewed by:	csjp
2004-10-12 16:47:25 +00:00
Robert Watson
cf2942b67c Acquire the send socket buffer lock around tcp_output() activities
reaching into the socket buffer.  This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it).  Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although  haven't
seen it reported).

RELENG_5 candidate.
2004-10-09 16:48:51 +00:00
Robert Watson
fcf4e3a168 When running with debug.mpsafenet=0, initialize IP multicast routing
callouts as non-CALLOUT_MPSAFE.  Otherwise, they may trigger an
assertion regarding Giant if they enter other parts of the stack from
the callout.

MFC after:	3 days
Reported by:	Dikshie < dikshie at ppk dot itb dot ac dot id >
2004-10-07 14:13:35 +00:00
Paul Saab
a55db2b6e6 - Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
  retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
  number of sack retransmissions done upon initiation of sack recovery.

Submitted by:	Mohan Srinivasan <mohans@yahoo-inc.com>
2004-10-05 18:36:24 +00:00
Brian Feldman
c99ee9e042 Add support to IPFW for matching by TCP data length. 2004-10-03 00:47:15 +00:00
Brian Feldman
6daf7ebd28 Add support to IPFW for classification based on "diverted" status
(that is, input via a divert socket).
2004-10-03 00:26:35 +00:00
Brian Feldman
974dfe3084 Add to IPFW the ability to do ALTQ classification/tagging. 2004-10-03 00:17:46 +00:00
Brian Feldman
88ef2880c1 Validate the action pointer to be within the rule size, so that trying to
add corrupt ipfw rules would not potentially panic the system or worse.
2004-09-30 17:42:00 +00:00
Max Laier
d6a8d58875 Add an additional struct inpcb * argument to pfil(9) in order to enable
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.

This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.

Suggested by:		rwatson
A lot of work by:	csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by:		rwatson, csjp
Tested by:		-pf, -ipfw, LINT, csjp and myself
MFC after:		3 days

LOR IDs:		14 - 17 (not fixed yet)
2004-09-29 04:54:33 +00:00
Robert Watson
48ac555d83 Assign so_pcb to NULL rather than 0 as it's a pointer.
Spotted by:	dwhite
2004-09-29 04:01:13 +00:00
Maxim Konovalov
4bc37f9836 o Turn net.inet.ip.check_interface sysctl off by default.
When net.inet.ip.check_interface was MFCed to RELENG_4 3+ years ago in
rev. 1.130.2.17 ip_input.c it was 1 by default but shortly changed to
0 (accidently?) in rev. 1.130.2.20 in RELENG_4 only.  Among with the
fact this knob is not documented it breaks POLA especially in bridge
environment.

OK'ed by:	andre
Reviewed by:	-current
2004-09-24 12:18:40 +00:00
Andre Oppermann
db09bef308 Fix an out of bounds write during the initialization of the PF_INET protocol
family to the ip_protox[] array.  The protocol number of IPPROTO_DIVERT is
larger than IPPROTO_MAX and was initializing memory beyond the array.
Catch all these kinds of errors by ignoring protocols that are higher than
IPPROTO_MAX or 0 (zero).

Add more comments ip_init().
2004-09-16 18:33:39 +00:00
Andre Oppermann
76ff6dcf46 Clarify some comments for the M_FASTFWD_OURS case in ip_input(). 2004-09-15 20:17:03 +00:00
Andre Oppermann
e098266191 Remove the last two global variables that are used to store packet state while
it travels through the IP stack.  This wasn't much of a problem because IP
source routing is disabled by default but when enabled together with SMP and
preemption it would have very likely cross-corrupted the IP options in transit.

The IP source route options of a packet are now stored in a mtag instead of the
global variable.
2004-09-15 20:13:26 +00:00
Andre Oppermann
bda337d05e Do not allow 'ipfw fwd' command when IPFIREWALL_FORWARD is not compiled into
the kernel.  Return EINVAL instead.
2004-09-13 19:27:23 +00:00
Andre Oppermann
f91248c1ad If we have to 'ipfw fwd'-tag a packet the second time in ipfw_pfil_out() don't
prepend an already existing tag again.  Instead unlink it and prepend it again
to have it as the first tag in the chain.

PR:		kern/71380
2004-09-13 19:20:14 +00:00
Andre Oppermann
f4fca2d8d3 Make comments more clear for the packet changed cases after pfil hooks. 2004-09-13 17:09:06 +00:00
Andre Oppermann
eedc0a7535 Fix ip_input() fallback for the destination modified cases (from the packet
filters).  After the ipfw to pfil move ip_input() expects M_FASTFWD_OURS
tagged packets to have ip_len and ip_off in host byte order instead of
network byte order.

PR:		kern/71652
Submitted by:	mlaier (patch)
2004-09-13 17:01:53 +00:00
Andre Oppermann
7c0102f575 Make 'ipfw tee' behave as inteded and designed. A tee'd packet is copied
and sent to the DIVERT socket while the original packet continues with the
next rule.  Unlike a normally diverted packet no IP reassembly attemts are
made on tee'd packets and they are passed upwards totally unmodified.

Note: This will not be MFC'd to 4.x because of major infrastucture changes.

PR:		kern/64240 (and many others collapsed into that one)
2004-09-13 16:46:05 +00:00
Gleb Smirnoff
324398687f Check flag do_bridge always, even if kernel was compiled without
BRIDGE support. This makes dynamic bridge.ko working.

Reviewed by:	sam
Approved by:	julian (mentor)
MFC after:	1 week
2004-09-09 12:34:07 +00:00
John-Mark Gurney
cb459254a2 revert comment from rev1.158 now that rev1.225 backed it out..
MFC after:	3 days
2004-09-06 15:48:38 +00:00
Gleb Smirnoff
f46a6aac29 Recover normal behavior: return EINVAL to attempt to add a divert rule
when module is built without IPDIVERT.

Silence from:	andre
Approved by:	julian (mentor)
2004-09-05 20:06:50 +00:00
John-Mark Gurney
b5d47ff592 fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...
2004-09-05 02:34:12 +00:00
Andre Oppermann
3161f583ca Apply error and success logic consistently to the function netisr_queue() and
its users.

netisr_queue() now returns (0) on success and ERRNO on failure.  At the
moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full)
are supported.

Previously it would return (1) on success but the return value of IF_HANDOFF()
was interpreted wrongly and (0) was actually returned on success.  Due to this
schednetisr() was never called to kick the scheduling of the isr.  However this
was masked by other normal packets coming through netisr_dispatch() causing the
dequeueing of waiting packets.

PR:		kern/70988
Found by:	MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after:	3 days
2004-08-27 18:33:08 +00:00
Andre Oppermann
a9c92b54a9 In the case the destination of a packet was changed by the packet filter
to point to a local IP address; and the packet was sourced from this host
we fill in the m_pkthdr.rcvif with a pointer to the loopback interface.

Before the function ifunit("lo0") was used to obtain the ifp.  However
this is sub-optimal from a performance point of view and might be dangerous
if the loopback interface has been renamed.  Use the global variable 'loif'
instead which always points to the loopback interface.

Submitted by:	brooks
2004-08-27 15:39:34 +00:00
Andre Oppermann
319c4c256a Remove a junk line left over from the recent IPFW to PFIL_HOOKS conversion. 2004-08-27 15:32:28 +00:00
Andre Oppermann
c21fd23260 Always compile PFIL_HOOKS into the kernel and remove the associated kernel
compile option.  All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.

If no hooks are connected the entire packet filter hooks section and related
activities are jumped over.  This removes any performance impact if no hooks
are active.

Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.
2004-08-27 15:16:24 +00:00
Ruslan Ermilov
9bfe6d472a Revert the last change to sys/modules/ipfw/Makefile and fix a
standalone module build in a better way.

Silence from:	andre
MFC after:	3 days
2004-08-26 14:18:30 +00:00
Pawel Jakub Dawidek
a7f3feff1b Allocate memory when dumping pipes with M_WAITOK flag.
On a system with huge number of pipes, M_NOWAIT failes almost always,
because of memory fragmentation.
My fix is different than the patch proposed by Pawel Malachowski,
because in FreeBSD 5.x we cannot sleep while holding dummynet mutex
(in 4.x there is no such lock).
My fix is also ugly, but there is no easy way to prepare nice and clean fix.

PR:		kern/46557
Submitted by:	Eugene Grosbein <eugen@grosbein.pp.ru>
Reviewed by:	mlaier
2004-08-25 09:31:30 +00:00
Max Laier
ca7a789aa6 Allow early drop for non-ALTQ enabled queues in an ALTQ-enabled kernel.
Previously the early drop was disabled unconditionally for ALTQ-enabled
kernels.

This should give some benefit for the normal gateway + LAN-server case with
a busy LAN leg and an ALTQ managed uplink.

Reviewed and style help from:	cperciva, pjd
2004-08-22 16:42:28 +00:00
Robert Watson
392e840716 When sliding the m_data pointer forward, update m_pktrhdr.len as well
as m_len, or the pkthdr length will be inconsistent with the actual
length of data in the mbuf chain.  The symptom of this occuring was
"out of data" warnings from in_cksum_skip() on large UDP packets sent
via the loopback interface.

Foot shot:	green
2004-08-22 01:32:48 +00:00
Christian S.J. Peron
5090559b7f When a prison is given the ability to create raw sockets (when the
security.jail.allow_raw_sockets sysctl MIB is set to 1) where privileged
access to jails is given out, it is possible for prison root to manipulate
various network parameters which effect the host environment. This commit
plugs a number of security holes associated with the use of raw sockets
and prisons.

This commit makes the following changes:

- Add a comment to rtioctl warning developers that if they add
  any ioctl commands, they should use super-user checks where necessary,
  as it is possible for PRISON root to make it this far in execution.
- Add super-user checks for the execution of the SIOCGETVIFCNT
  and SIOCGETSGCNT IP multicast ioctl commands.
- Add a super-user check to rip_ctloutput(). If the calling cred
  is PRISON root, make sure the socket option name is IP_HDRINCL,
  otherwise deny the request.

Although this patch corrects a number of security problems associated
with raw sockets and prisons, the warning in jail(8) should still
apply, and by default we should keep the default value of
security.jail.allow_raw_sockets MIB to 0 (or disabled) until
we are certain that we have tracked down all the problems.

Looking forward, we will probably want to eliminate the
references to curthread.

This may be a MFC candidate for RELENG_5.

Reviewed by:	rwatson
Approved by:	bmilekic (mentor)
2004-08-21 17:38:57 +00:00
Robert Watson
e6ccd70936 When prepending space onto outgoing UDP datagram payloads to hold the
UDP/IP header, make sure that space is also allocated for the link
layer header.  If an mbuf must be allocated to hold the UDP/IP header
(very likely), then this will avoid an additional mbuf allocation at
the link layer.  This trick is also used by TCP and other protocols to
avoid extra calls to the mbuf allocator in the ethernet (and related)
output routines.
2004-08-21 16:14:04 +00:00
Andre Oppermann
ce63226177 Fix a stupid typo which prevented an ipfw KLD unload from successfully cleaning
up its remains.  Do not terminate 'if' lines with ';'.

Spotted by:	claudio@OpenBSD.ORG (sitting 3m from my desk)
Pointy hat to:	andre
2004-08-20 00:36:55 +00:00
Andre Oppermann
70222723f3 When unloading ipfw module use callout_drain() to make absolutely sure that
all callouts are stopped and finished.  Move it before IPFW_LOCK() to avoid
deadlocking when draining callouts.
2004-08-19 23:31:40 +00:00
Andre Oppermann
6f2d4ea6f8 For IPv6 access pointer to tcpcb only after we have checked it is valid.
Found by:	Coverity's automated analysis (via Ted Unangst)
2004-08-19 20:16:17 +00:00
Andre Oppermann
50ab727669 Give a useful error message if someone tries to compile IPFIREWALL into the
kernel without specifying PFIL_HOOKS as well.
2004-08-19 18:38:23 +00:00
Andre Oppermann
9108601915 Do not unconditionally ignore IPDIVERT and IPFIREWALL_FORWARD when building
the ipfw KLD.

 For IPFIREWALL_FORWARD this does not have any side effects.  If the module
 has it but not the kernel it just doesn't do anything.

 For IPDIVERT the KLD will be unloadable if the kernel doesn't have IPDIVERT
 compiled in too.  However this is the least disturbing behaviour.  The user
 can just recompile either module or the kernel to match the other one.  The
 access to the machine is not denied if ipfw refuses to load.
2004-08-19 17:59:26 +00:00
Andre Oppermann
e4c97eff8e Bring back the sysctl 'net.inet.ip.fw.enable' to unbreak the startup scripts
and to be able to disable ipfw if it was compiled directly into the kernel.
2004-08-19 17:38:47 +00:00
Robert Watson
5c32ea6517 Push down pcbinfo and inpcb locking from udp_send() into udp_output().
This provides greater context for the locking and allows us to avoid
locking the pcbinfo structure if not binding operations will take
place (i.e., already bound, connected, and no expliti sendto()
address).
2004-08-19 01:13:10 +00:00
Robert Watson
4c2bb15a89 In in_pcbrehash(), do assert the inpcb lock as well as the pcbinfo lock. 2004-08-19 01:11:17 +00:00
Robert Watson
0f48e25b63 Fix build of ip_input.c with "options IPSEC" -- the "pass:" label
is used with both FAST_IPSEC and IPSEC, but was defined for only
FAST_IPSEC.
2004-08-18 03:11:04 +00:00
Peter Wemm
1e5cc10dc2 Make the kernel compile again if you are not using PFIL_HOOKS 2004-08-18 00:37:46 +00:00
Andre Oppermann
9b932e9e04 Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland
and preserves the ipfw ABI.  The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.

However there are many changes how ipfw is and its add-on's are handled:

 In general ipfw is now called through the PFIL_HOOKS and most associated
 magic, that was in ip_input() or ip_output() previously, is now done in
 ipfw_check_[in|out]() in the ipfw PFIL handler.

 IPDIVERT is entirely handled within the ipfw PFIL handlers.  A packet to
 be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
 reassembly.  If not, or all fragments arrived and the packet is complete,
 divert_packet is called directly.  For 'tee' no reassembly attempt is made
 and a copy of the packet is sent to the divert socket unmodified.  The
 original packet continues its way through ip_input/output().

 ipfw 'forward' is done via m_tag's.  The ipfw PFIL handlers tag the packet
 with the new destination sockaddr_in.  A check if the new destination is a
 local IP address is made and the m_flags are set appropriately.  ip_input()
 and ip_output() have some more work to do here.  For ip_input() the m_flags
 are checked and a packet for us is directly sent to the 'ours' section for
 further processing.  Destination changes on the input path are only tagged
 and the 'srcrt' flag to ip_forward() is set to disable destination checks
 and ICMP replies at this stage.  The tag is going to be handled on output.
 ip_output() again checks for m_flags and the 'ours' tag.  If found, the
 packet will be dropped back to the IP netisr where it is going to be picked
 up by ip_input() again and the directly sent to the 'ours' section.  When
 only the destination changes, the route's 'dst' is overwritten with the
 new destination from the forward m_tag.  Then it jumps back at the route
 lookup again and skips the firewall check because it has been marked with
 M_SKIP_FIREWALL.  ipfw 'forward' has to be compiled into the kernel with
 'option IPFIREWALL_FORWARD' to enable it.

 DUMMYNET is entirely handled within the ipfw PFIL handlers.  A packet for
 a dummynet pipe or queue is directly sent to dummynet_io().  Dummynet will
 then inject it back into ip_input/ip_output() after it has served its time.
 Dummynet packets are tagged and will continue from the next rule when they
 hit the ipfw PFIL handlers again after re-injection.

 BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
 they did before.  Later this will be changed to dedicated ETHER PFIL_HOOKS.

More detailed changes to the code:

 conf/files
	Add netinet/ip_fw_pfil.c.

 conf/options
	Add IPFIREWALL_FORWARD option.

 modules/ipfw/Makefile
	Add ip_fw_pfil.c.

 net/bridge.c
	Disable PFIL_HOOKS if ipfw for bridging is active.  Bridging ipfw
	is still directly invoked to handle layer2 headers and packets would
	get a double ipfw when run through PFIL_HOOKS as well.

 netinet/ip_divert.c
	Removed divert_clone() function.  It is no longer used.

 netinet/ip_dummynet.[ch]
	Neither the route 'ro' nor the destination 'dst' need to be stored
	while in dummynet transit.  Structure members and associated macros
	are removed.

 netinet/ip_fastfwd.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.

 netinet/ip_fw.h
	Removed 'ro' and 'dst' from struct ip_fw_args.

 netinet/ip_fw2.c
	(Re)moved some global variables and the module handling.

 netinet/ip_fw_pfil.c
	New file containing the ipfw PFIL handlers and module initialization.

 netinet/ip_input.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.  ip_forward() does not longer require
	the 'next_hop' struct sockaddr_in argument.  Disable early checks
	if 'srcrt' is set.

 netinet/ip_output.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.

 netinet/ip_var.h
	Add ip_reass() as general function.  (Used from ipfw PFIL handlers
	for IPDIVERT.)

 netinet/raw_ip.c
	Directly check if ipfw and dummynet control pointers are active.

 netinet/tcp_input.c
	Rework the 'ipfw forward' to local code to work with the new way of
	forward tags.

 netinet/tcp_sack.c
	Remove include 'opt_ipfw.h' which is not needed here.

 sys/mbuf.h
	Remove m_claim_next() macro which was exclusively for ipfw 'forward'
	and is no longer needed.

Approved by:	re (scottl)
2004-08-17 22:05:54 +00:00
Robert Watson
a4f757cd5d White space cleanup for netinet before branch:
- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by:	re (scottl)
Submitted by:	Xin LI <delphij@frontfree.net>
2004-08-16 18:32:07 +00:00
David E. O'Brien
5af87d0ea1 Put the 'antispoof' opcode in the proper place in the opcode list such
that it doesn't break the ipfw2 ABI.
2004-08-16 12:05:19 +00:00
David Malone
1f44b0a1b5 Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

        1) introduce a ip_newid() static inline function that checks
        the sysctl and then decides if it should return a sequential
        or random IP ID.

        2) named the sysctl net.inet.ip.random_id

        3) IPv6 flow IDs and fragment IDs are now always random.
        Flow IDs and frag IDs are significantly less common in the
        IPv6 world (ie. rarely generated per-packet), so there should
        be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by:	andre, silby, mlaier, ume
Based on:	NetBSD
MFC after:	2 months
2004-08-14 15:32:40 +00:00
Poul-Henning Kamp
e7581f0fc2 Fix outgoing ICMP on global instance. 2004-08-14 14:21:09 +00:00
Christian S.J. Peron
31c88a3043 Add the ability to associate ipfw rules with a specific prison ID.
Since the only thing truly unique about a prison is it's ID, I figured
this would be the most granular way of handling this.

This commit makes the following changes:

- Adds tokenizing and parsing for the ``jail'' command line option
  to the ipfw(8) userspace utility.
- Append the ipfw opcode list with O_JAIL.
- While Iam here, add a comment informing others that if they
  want to add additional opcodes, they should append them to the end
  of the list to avoid ABI breakage.
- Add ``fw_prid'' to the ipfw ucred cache structure.
- When initializing ucred cache, if the process is jailed,
  set fw_prid to the prison ID, otherwise set it to -1.
- Update man page to reflect these changes.

This change was a strong motivator behind the ucred caching
mechanism in ipfw.

A sample usage of this new functionality could be:

    ipfw add count ip from any to any jail 2

It should be noted that because ucred based constraints
are only implemented for TCP and UDP packets, the same
applies for jail associations.

Conceptual head nod by:	pjd
Reviewed by:	rwatson
Approved by:	bmilekic (mentor)
2004-08-12 22:06:55 +00:00
David Malone
849112666a In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach
so that the locks held are the same as the IPv4 case.

Reviewed by:	rwatson
2004-08-12 18:19:36 +00:00
Andre Oppermann
9d804f818c Fix two cases of incorrect IPQ_UNLOCK'ing in the merged ip_reass() function.
The first one was going to 'dropfrag', which unlocks the IPQ, before the lock
was aquired; The second one doing a unlock and then a 'goto dropfrag' which
led to a double-unlock.

Tripped over by:	des
2004-08-12 08:37:42 +00:00
Robert Watson
c19c5239a6 When udp_send() fails, make sure to free the control mbufs as well as
the data mbuf.  This was done in most error cases, but not the case
where the inpcb pointer is surprisingly NULL.
2004-08-12 01:34:27 +00:00
Andre Oppermann
420a281164 Backout removal of UMA_ZONE_NOFREE flag for all zones which are established
for structures with timers in them.  It might be that a timer might fire
even when the associated structure has already been free'd.  Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.

Discussed with:	bosko, tegge, rwatson
2004-08-11 20:30:08 +00:00
Andre Oppermann
4efb805c0c Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and
TCP code.  This flag would have prevented giving back excessive free slabs
to the global pool after a transient peak usage.
2004-08-11 17:08:31 +00:00
Andre Oppermann
67d0b24ed1 Make use of in_localip() function and replace previous direct LIST_FOREACH
loops over INADDR_HASH.
2004-08-11 12:32:10 +00:00
Andre Oppermann
2eccc90b61 Add the function in_localip() which returns 1 if an internet address is for
the local host and configured on one of its interfaces.
2004-08-11 11:49:48 +00:00
Andre Oppermann
6e234ede37 Only invoke verify_path() for verrevpath and versrcreach when we have an IP packet. 2004-08-11 11:41:11 +00:00
Andre Oppermann
767981878c Only check for local broadcast addresses if the mbuf is flagged with M_BCAST. 2004-08-11 10:49:56 +00:00
Andre Oppermann
0b17fba7bc Consistently use NULL for pointer comparisons. 2004-08-11 10:46:15 +00:00
Andre Oppermann
de2e5d1e20 Make IP fastforwarding ALTQ-aware by adding the input traffic conditioner
check and disabling the early output interface queue length check.
2004-08-11 10:42:59 +00:00
Andre Oppermann
2f6e6e9b4c Correct the displayed bandwidth calculation for a readout via sysctl. The
saved value does not have to be scaled with HZ; it is already in bytes per
second.  Only the multiply by eight remains to show bits per second (bps).
2004-08-11 10:12:16 +00:00
Robert Watson
27f74fd0ed Assert the locks of inpcbinfo's and inpcb's passed into in_pcbconnect()
and in_pcbconnect_setup(), since these functions frob the port and
address state of inpcbs.
2004-08-11 04:35:20 +00:00
Andre Oppermann
bb7c5b3055 Make a comment that IP source routing is not SMP and PREEMPTION safe. 2004-08-09 16:17:37 +00:00
Andre Oppermann
a5053398d4 Make a comment that "ipfw forward" is not SMP and PREEMPTION safe. 2004-08-09 16:16:10 +00:00
Andre Oppermann
5f9541ecbd New ipfw option "antispoof":
For incoming packets, the packet's source address is checked if it
 belongs to a directly connected network.  If the network is directly
 connected, then the interface the packet came on in is compared to
 the interface the network is connected to.  When incoming interface
 and directly connected interface are not the same, the packet does
 not match.

Usage example:

 ipfw add deny ip from any to any not antispoof in

Manpage education by:	ru
2004-08-09 16:12:10 +00:00
Robert Watson
f31f65a708 Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events.  This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.

Reported by:	kuriyama
Tested by:	kuriyama
2004-08-06 03:45:45 +00:00
Robert Watson
9c1df6951f When iterating the UDP inpcb list processing an inbound broadcast
or multicast packet, we don't need to acquire the inpcb mutex
unless we are actually using inpcb fields other than the bound port
and address.  Since we hold the pcbinfo lock already, these can't
change.  Defer acquiring the inpcb mutex until we have a high
chance of a match.  This avoids about 120 mutex operations per UDP
broadcast packet received on one of my work systems.

Reviewed by:	sam
2004-08-06 02:08:31 +00:00
Robert Watson
98aed8ca56 Now that IPv6 performs basic in6pcb and inpcb locking, enable inpcb
lock assertions even if IPv6 is compiled into the kernel.  Previously,
inclusion of IPv6 and locking assertions would result in a rapid
assertion failure as IPv6 was not properly locking inpcbs.
2004-08-04 18:27:55 +00:00
Joe Marcus Clarke
5c7e7e80cc Fix Skinny and PPTP NAT'ing after the introduction of the {ip,tcp,udp}_next
functions.  Basically, the ip_next() function was used to get the PPTP and
Skinny headers when tcp_next() should have been used instead.  Symptoms of
this included a segfault in natd when trying to process a PPTP or Skinny
packet.

Approved by:	des
2004-08-04 15:17:08 +00:00
Andre Oppermann
81007fd4eb o Delayed checksums are now calculated in divert_packet() for diverted packets
Remove the XXX-escaped code that did it in ip_output()'s IPHACK section.
2004-08-03 14:13:36 +00:00
Andre Oppermann
24a098ea9b o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be
more consistent with the other sysctls around it.
2004-08-03 13:54:11 +00:00
Andre Oppermann
f0cada84b1 o Move all parts of the IP reassembly process into the function ip_reass() to
make it fully self-contained.
o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len
  including the IP header.
o Computation of the delayed checksum is moved into divert_packet().

Reviewed by:	silby
2004-08-03 12:31:38 +00:00
Jeffrey Hsu
2ff39e1543 Fix bug with tracking the previous element in a list.
Found by:	edrt@citiz.net
Submitted by:	pavlin@icir.org
2004-08-03 02:01:44 +00:00
Yaroslav Tykhiy
a4eb4405e3 Disallow a particular kind of port theft described by the following scenario:
Alice is too lazy to write a server application in PF-independent
	manner.  Therefore she knocks up the server using PF_INET6 only
	and allows the IPv6 socket to accept mapped IPv4 as well.  An evil
	hacker known on IRC as cheshire_cat has an account in the same
	system.  He starts a process listening on the same port as used
	by Alice's server, but in PF_INET.  As a consequence, cheshire_cat
	will distract all IPv4 traffic supposed to go to Alice's server.

Such sort of port theft was initially enabled by copying the code that
implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for
the implied case of the same owner for both connections.  After this
change, the above scenario will be impossible.  In the same setting,
the user who attempts to start his server last will get EADDRINUSE.

Of course, using IPv4 mapped to IPv6 leads to security complications
in the first place, but there is no reason to make it even more unsafe.

This change doesn't apply to KAME since it affects a FreeBSD-specific
part of the code.  It doesn't modify the out-of-box behaviour of the
TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default.

MFC after:	1 month
2004-07-28 13:03:07 +00:00
Jayanth Vijayaraghavan
5d3b1b7556 Fix a bug in the sack code that was causing data to be retransmitted
with the FIN bit set for all segments, if a FIN has already been sent before.
The fix will allow the FIN bit to be set for only the last segment, in case
it has to be retransmitted.

Fix another bug that would have caused snd_nxt to be pulled by len if
there was an error from ip_output. snd_nxt should not be touched
during sack retransmissions.
2004-07-28 02:15:14 +00:00
Jayanth Vijayaraghavan
e9f2f80e09 Fix for a SACK bug where the very last segment retransmitted
from the SACK scoreboard could result in the next (untransmitted)
segment to be skipped.
2004-07-26 23:41:12 +00:00
John-Mark Gurney
0aa8ce5012 compare pointer against NULL, not 0
when inpcb is NULL, this is no longer invalid since jlemon added the
tcp_twstart function... this prevents close "failing" w/ EINVAL when it
really was successful...

Reviewed by:	jeremy (NetBSD)
2004-07-26 21:29:56 +00:00
Colin Percival
56f21b9d74 Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with:	rwatson, scottl
Requested by:	jhb
2004-07-26 07:24:04 +00:00
Andre Oppermann
55db762b76 Extend versrcreach by checking against the rt_flags for RTF_REJECT and
RTF_BLACKHOLE as well.

To quote the submitter:

 The uRPF loose-check implementation by the industry vendors, at least on Cisco
 and possibly Juniper, will fail the check if the route of the source address
 is pointed to Null0 (on Juniper, discard or reject route). What this means is,
 even if uRPF Loose-check finds the route, if the route is pointed to blackhole,
 uRPF loose-check must fail. This allows people to utilize uRPF loose-check mode
 as a pseudo-packet-firewall without using any manual filtering configuration --
 one can simply inject a IGP or BGP prefix with next-hop set to a static route
 that directs to null/discard facility. This results in uRPF Loose-check failing
 on all packets with source addresses that are within the range of the nullroute.

Submitted by:	James Jun <james@towardex.com>
2004-07-21 19:55:14 +00:00
Robert Watson
2d01d331c6 M_PREPEND() the IP header on to the front of an outgoing raw IP packet
using M_DONTWAIT rather than M_WAITOK to avoid sleeping on memory
while holding a mutex.
2004-07-20 20:52:30 +00:00
Jayanth Vijayaraghavan
04f0d9a0ea Let IN_FASTREOCOVERY macro decide if we are in recovery mode.
Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.
2004-07-19 22:37:33 +00:00
Jayanth Vijayaraghavan
f787edd847 Fix a potential panic in the SACK code that was causing
1) data to be sent to the right of snd_recover.
2) send more data then whats in the send buffer.

The fix is to postpone sack retransmit to a subsequent recovery episode
if the current retransmit pointer is beyond snd_recover.

Thanks to Mohan Srinivasan for helping fix the bug.

Submitted by:Daniel Lang
2004-07-19 22:06:01 +00:00
David Malone
932312d60b Fix the !INET6 build.
Reported by:	alc
2004-07-17 21:40:14 +00:00
David Malone
969860f3ed The tcp syncache code was leaving the IPv6 flowlabel uninitialised
for the SYN|ACK packet and then letting in6_pcbconnect set the
flowlabel later. Arange for the syncache/syncookie code to set and
recall the flow label so that the flowlabel used for the SYN|ACK
is consistent. This is done by using some of the cookie (when tcp
cookies are enabeled) and by stashing the flowlabel in syncache.

Tested and Discovered by:	Orla McGann <orly@cnri.dit.ie>
Approved by:			ume, silby
MFC after:			1 month
2004-07-17 19:44:13 +00:00
Max Laier
c550f2206d Define semantic of M_SKIP_FIREWALL more precisely, i.e. also pass associated
icmp_error() packets. While here retire PACKET_TAG_PF_GENERATED (which
served the same purpose) and use M_SKIP_FIREWALL in pf as well. This should
speed up things a bit as we get rid of the tag allocations.

Discussed with:	juli
2004-07-17 05:10:06 +00:00
Juli Mallett
765d141c78 Make M_SKIP_FIREWALL a global (and semantic) flag, preventing anything from
using M_PROTO6 and possibly shooting someone's foot, as well as allowing the
firewall to be used in multiple passes, or with a packet classifier frontend,
that may need to explicitly allow a certain packet.  Presently this is handled
in the ipfw_chk code as before, though I have run with it moved to upper
layers, and possibly it should apply to ipfilter and pf as well, though this
has not been investigated.

Discussed with:	luigi, rwatson
2004-07-17 02:40:13 +00:00
Hajimu UMEMOTO
8a59da300c when IN6P_AUTOFLOWLABEL is set, the flowlabel is not set on
outgoing tcp connections.

Reported by:	Orla McGann <orly@cnri.dit.ie>
Reviewed by:	Orla McGann <orly@cnri.dit.ie>
Obtained from:	KAME
2004-07-16 18:08:13 +00:00
Poul-Henning Kamp
3e019deaed Do a pass over all modules in the kernel and make them return EOPNOTSUPP
for unknown events.

A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
2004-07-15 08:26:07 +00:00
Stefan Farfeleder
439dfb0c35 Remove erroneous semicolons. 2004-07-13 16:06:19 +00:00
Robert Watson
7cfc690440 After each label in tcp_input(), assert the inpcbinfo and inpcb lock
state that we expect.
2004-07-12 19:28:07 +00:00
Brian Somers
0ac4013324 Change the following environment variables to kernel options:
bootp -> BOOTP
    bootp.nfsroot -> BOOTP_NFSROOT
    bootp.nfsv3 -> BOOTP_NFSV3
    bootp.compat -> BOOTP_COMPAT
    bootp.wired_to -> BOOTP_WIRED_TO

- i.e. back out the previous commit.  It's already possible to
pxeboot(8) with a GENERIC kernel.

Pointed out by: dwmalone
2004-07-08 22:35:36 +00:00
Brian Somers
59e1ebc9b5 Change the following kernel options to environment variables:
BOOTP -> bootp
    BOOTP_NFSROOT -> bootp.nfsroot
    BOOTP_NFSV3 -> bootp.nfsv3
    BOOTP_COMPAT -> bootp.compat
    BOOTP_WIRED_TO -> bootp.wired_to

This lets you PXE boot with a GENERIC kernel by putting this sort of thing
in loader.conf:

    bootp="YES"
    bootp.nfsroot="YES"
    bootp.nfsv3="YES"
    bootp.wired_to="bge1"

or even setting the variables manually from the OK prompt.
2004-07-08 13:40:33 +00:00
Dag-Erling Smørgrav
de47739e71 Push WARNS back up to 6, but define NO_WERROR; I want the warts out in the
open where people can see them and hopefully fix them.
2004-07-06 12:15:24 +00:00
Dag-Erling Smørgrav
9fa0fd2682 Introduce inline {ip,udp,tcp}_next() functions which take a pointer to an
{ip,udp,tcp} header and return a void * pointing to the payload (i.e. the
first byte past the end of the header and any required padding).  Use them
consistently throughout libalias to a) reduce code duplication, b) improve
code legibility, c) get rid of a bunch of alignment warnings.
2004-07-06 12:13:28 +00:00
Dag-Erling Smørgrav
e3e2c21639 Rewrite twowords() to access its argument through a char pointer and not
a short pointer.  The previous implementation seems to be in a gray zone
of the C standard, and GCC generates incorrect code for it at -O2 or
higher on some platforms.
2004-07-06 09:22:18 +00:00
Dag-Erling Smørgrav
95347a8ee0 Temporarily lower WARNS to 3 while I figure out the alignment issues on
alpha.
2004-07-06 08:44:41 +00:00
Dag-Erling Smørgrav
ed01a58215 Make libalias WARNS?=6-clean. This mostly involves renaming variables
named link, foo_link or link_foo to lnk, foo_lnk or lnk_foo, fixing
signed / unsigned comparisons, and shoving unused function arguments
under the carpet.

I was hoping WARNS?=6 might reveal more serious problems, and perhaps
the source of the -O2 breakage, but found no smoking gun.
2004-07-05 11:10:57 +00:00
Dag-Erling Smørgrav
ffcb611a9d Parenthesize return values. 2004-07-05 10:55:23 +00:00
Dag-Erling Smørgrav
f311ebb4ec Mechanical whitespace cleanup. 2004-07-05 10:53:28 +00:00
Poul-Henning Kamp
e6bbb69149 Add LibAliasOutTry() which checks a packet for a hit in the tables, but
does not create a new entry if none is found.
2004-07-04 12:53:07 +00:00
Ruslan Ermilov
1a0a934547 Mechanically kill hard sentence breaks. 2004-07-02 23:52:20 +00:00
Jayanth Vijayaraghavan
a0445c2e2c On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly.
Fix this problem by separating out the SACK and the newreno cases. Also, check
if we are in FASTRECOVERY for the sack case and if so, turn off dupacks.

Fix an issue where the congestion window was not being incremented by ssthresh.

Thanks to Mohan Srinivasan for finding this problem.
2004-07-01 23:34:06 +00:00
Ruslan Ermilov
c9a246418d Bumped document date.
Fixed markup.
Fixed examples to match the new API.
2004-07-01 17:51:48 +00:00
Poul-Henning Kamp
e3e244bff6 Rwatson, write 100 times for tomorrow:
First unlock, then assign NULL to pointer.
2004-06-27 21:54:34 +00:00
Pawel Jakub Dawidek
0a44517d3a Those are unneeded too. 2004-06-27 09:06:10 +00:00
Pawel Jakub Dawidek
46e3b1cbe7 Add two missing includes and remove two uneeded.
This is quite serious fix, because even with MAC framework compiled in,
MAC entry points in those two files were simply ignored.
2004-06-27 09:03:22 +00:00
Robert Watson
1e4d7da707 Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
  acquire the socket buffer lock and use the _locked() variants of both
  calls.  Note that the _locked() sowakeup() versions unlock the mutex on
  return.  This is done in uipc_send(), divert_packet(), mroute
  socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
  explicit unlock and use the _locked() sowakeup() variant.  This is done
  in soisdisconnecting(), soisdisconnected() when setting the can't send/
  receive flags and dropping data, and in uipc_rcvd() which adjusting
  back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.
2004-06-26 19:10:39 +00:00
Robert Watson
3f9d1ef905 Remove spl's from TCP protocol entry points. While not all locking
is merged here yet, this will ease the merge process by bringing the
locked and unlocked versions into sync.
2004-06-26 17:50:50 +00:00
Paul Saab
652178a12a White space & spelling fixes
Submitted by:	Xin LI <delphij@frontfree.net>
2004-06-25 04:11:26 +00:00
Bruce M Simpson
37332f049f Whitespace. 2004-06-25 02:29:58 +00:00
Robert Watson
5905999b2f Broaden scope of the socket buffer lock when processing an ACK so that
the read and write of sb_cc are atomic.  Call sbdrop_locked() instead
of sbdrop() since we already hold the socket buffer lock.
2004-06-24 03:07:27 +00:00
Robert Watson
927c5cea3f Protect so_oobmark with with SOCKBUF_LOCK(&so->so_rcv), and broaden
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.

Update socket locking annotations
2004-06-24 02:57:12 +00:00
Robert Watson
a138d21769 In ip_ctloutput(), acquire the inpcb lock around some of the basic
inpcb flag and status updates.
2004-06-24 02:05:47 +00:00
Robert Watson
d67ec3dd48 When asserting non-Giant locks in the network stack, also assert
Giant if debug.mpsafenet=0, as any points that require synchronization
in the SMPng world also required it in the Giant-world:

- inpcb locks (including IPv6)
- inpcbinfo locks (including IPv6)
- dummynet subsystem lock
- ipfw2 subsystem lock
2004-06-24 02:01:48 +00:00
Robert Watson
3f11a2f374 Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted.  sbreserve() now acquires
the lock before calling sbreserve_locked().  In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order.  In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.
2004-06-24 01:37:04 +00:00
Paul Saab
76947e3222 Move the sack sysctl's under net.inet.tcp.sack
net.inet.tcp.do_sack -> net.inet.tcp.sack.enable
net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit

Requested by:	wollman
2004-06-23 21:34:07 +00:00
Paul Saab
6d90faf3d8 Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by:	gnn
Obtained from:	Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
2004-06-23 21:04:37 +00:00
Robert Watson
bb7479a613 Acquire socket lock around frobbing of socket state in divert sockets. 2004-06-22 04:00:51 +00:00
Robert Watson
ffcbc0e4c5 Prefer use of the inpcb as a MAC label source for outgoing packets sent
via divert sockets, when available.
2004-06-22 03:58:50 +00:00
Robert Watson
d330008e3b If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE. 2004-06-20 21:44:50 +00:00
Robert Watson
1f82efb3b7 Assert the inpcb lock before letting MAC check whether we can deliver
to the inpcb in tcp_input().
2004-06-20 20:17:29 +00:00
Robert Watson
1b83216eda IP multicast code no longer needs to acquire Giant before appending
an mbuf onto a socket buffer.  This is left over from debug.mpsafenet
affecting the forwarding/bridging plane only.
2004-06-20 20:10:05 +00:00
Robert Watson
4e397bc524 In tcp_ctloutput(), don't hold the inpcb lock over a call to
ip_ctloutput(), as it may need to perform blocking memory allocations.
This also improves consistency with locking relative to other points
that call into ip_ctloutput().

Bumped into by:	Grover Lines <grover@ceribus.net>
2004-06-18 20:22:21 +00:00
Bruce M Simpson
4f450ff9a5 Check that m->m_pkthdr.rcvif is not NULL before checking if a packet
was received on a broadcast address on the input path. Under certain
circumstances this could result in a panic, notably for locally-generated
packets which do not have m_pkthdr.rcvif set.

This is a similar situation to that which is solved by
src/sys/netinet/ip_icmp.c rev 1.66.

PR:		kern/52935
2004-06-18 12:58:45 +00:00
Bruce M Simpson
f3e0b7ef7f Appease GCC. 2004-06-18 09:53:58 +00:00
Bruce M Simpson
5214cb3f59 If SO_DEBUG is enabled for a TCP socket, and a received segment is
encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer
when registering trace records.  'ipov' is specific to IPv4, and
will therefore be uninitialized.

[This fandango is only necessary in the first place because of our
host-byte-order IP field pessimization.]

PR:		kern/60856
Submitted by:	Galois Zheng
2004-06-18 03:31:07 +00:00
Bruce M Simpson
da181cc144 Don't set FIN on a retransmitted segment after a FIN has been sent,
unless the segment really contains the last of the data for the stream.

PR:		kern/34619
Obtained from:	OpenBSD (tcp_output.c rev 1.47)
Noticed by:	Joseph Ishac
Reviewed by:	George Neville-Neil
2004-06-18 02:47:59 +00:00
Bruce M Simpson
27de0135ce Ensure that dst is bzeroed before calling rtalloc_ign(), to avoid possible
routing table corruption.

PR:		kern/40563, freebsd4/432 (KAME)
Obtained from:	NetBSD (in_gif.c rev 1.26.10.1)
Requested by:	Jean-Luc Richier
2004-06-18 02:04:07 +00:00
Max Laier
7c1fe95333 Commit pf version 3.5 and link additional files to the kernel build.
Version 3.5 brings:
 - Atomic commits of ruleset changes (reduce the chance of ending up in an
   inconsistent state).
 - A 30% reduction in the size of state table entries.
 - Source-tracking (limit number of clients and states per client).
 - Sticky-address (the flexibility of round-robin with the benefits of
   source-hash).
 - Significant improvements to interface handling.
 - and many more ...
2004-06-16 23:24:02 +00:00
Max Laier
a306c902b8 Prepare for pf 3.5 import:
- Remove pflog and pfsync modules. Things will change in such a fashion
   that there will be one module with pf+pflog that can be loaded into
   GENERIC without problems (which is what most people want). pfsync is no
   longer possible as a module.
 - Add multicast address for in-kernel multicast pfsync protocol. Protocol
   glue will follow once the import is done.
 - Add one more mbuf tag
2004-06-16 22:59:06 +00:00
Maxim Konovalov
ef14c36965 o connect(2): if there is no a route to the destination
do not pick up the first local ip address for the source
ip address, return ENETUNREACH instead.

Submitted by:	Gleb Smirnoff
Reviewed by:	-current (silence)
2004-06-16 10:02:36 +00:00
Bruce M Simpson
d420fcda27 Fix build for IPSEC && !INET6
PR:		kern/66125
Submitted by:	Cyrille Lefevre
2004-06-16 09:35:07 +00:00
Bruce M Simpson
49b19bfc47 Reverse a patch which has no effect on -CURRENT and should probably be
applied directly to -STABLE.

Noticed by:	iedowse
Pointy hat to:	bms
2004-06-16 08:50:14 +00:00
Bruce M Simpson
57ab3660ff In ip_forward(), when calculating the MTU in effect for an IPSEC transport
mode tunnel, take the per-route MTU into account, *if* and *only if* it
is non-zero (as found in struct rt_metrics/rt_metrics_lite).

PR:		kern/42727
Obtained from:	NetBSD (ip_input.c rev 1.151)
2004-06-16 08:33:09 +00:00
Bruce M Simpson
e6b0a57025 In ip_forward(), set m->m_pkthdr.len correctly such that the mbuf chain
is sane, and ipsec4_getpolicybyaddr() will therefore complete.

PR:		kern/42727
Obtained from:	KAME (kame/freebsd4/sys/netinet/ip_input.c rev 1.42)
2004-06-16 08:28:54 +00:00
Bruce M Simpson
34e3ccb34b Disconnect a temporarily-connected UDP socket in out-of-mbufs case. This
fixes the problem of UDP sockets getting wedged in a connected state (and
bound to their destination) under heavy load.
Temporary bind/connect should probably be deleted in future
as an optimization, as described in "A Faster UDP" [Partridge/Pink 1993].

Notes:
 - INP_LOCK() is already held in udp_output(). The connection is in effect
   happening at a layer lower than the socket layer, therefore in theory
   socket locking should not be needed.
 - Inlining the in_pcbdisconnect() operation buys us nothing (in the case
   of the current state of the code), as laddr is not part of the
   inpcb hash or the udbinfo hash. Therefore there should be no need
   to rehash after restoring laddr in the error case (this was a
   concern of the original author of the patch).

PR:		kern/41765
Requested by:	gnn
Submitted by:	Jinmei Tatuya (with cleanups)
Tested by:	spray(8)
2004-06-16 05:41:00 +00:00
Robert Watson
a97719a4c5 Convert GIANT_REQUIRED to NET_ASSERT_GIANT for socket access. 2004-06-16 03:36:06 +00:00
Robert Watson
7721f5d760 Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field.  This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.
2004-06-15 03:51:44 +00:00
Robert Watson
c0b99ffa02 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
Max Laier
02b199f158 Link ALTQ to the build and break with ABI for struct ifnet. Please recompile
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
seperate commit. Same with the driver changes, which need case by case
evaluation.

__FreeBSD_version bump will follow.

Tested-by:	(i386)LINT
2004-06-13 17:29:10 +00:00
Doug Rabson
b8b3323469 Add a new driver to support IP over firewire. This driver is intended to
conform to the rfc2734 and rfc3146 standard for IP over firewire and
should eventually supercede the fwe driver. Right now the broadcast
channel number is hardwired and we don't support MCAP for multicast
channel allocation - more infrastructure is required in the firewire
code itself to fix these problems.
2004-06-13 10:54:36 +00:00
Robert Watson
310e7ceb94 Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
  manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
  copy while holding the socket lock, then release the socket lock
  to externalize to userspace.
2004-06-13 02:50:07 +00:00
Robert Watson
395a08c904 Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
Christian S.J. Peron
d316f2cf4f Modify ip fw so that whenever UID or GID constraints exist in a
ruleset, the pcb is looked up once per ipfw_chk() activation.

This is done by extracting the required information out of the PCB
and caching it to the ipfw_chk() stack. This should greatly reduce
PCB looking contention and speed up the processing of UID/GID based
firewall rules (especially with large UID/GID rulesets).

Some very basic benchmarks were taken which compares the number
of in_pcblookup_hash(9) activations to the number of firewall
rules containing UID/GID based contraints before and after this patch.

The results can be viewed here:
o http://people.freebsd.org/~csjp/ip_fw_pcb.png

Reviewed by:	andre, luigi, rwatson
Approved by:	bmilekic (mentor)
2004-06-11 22:17:14 +00:00
Robert Watson
c1d587c848 Remove unneeded Giant acquisition in divert_packet(), which is
left over from debug.mpsafenet affecting only the forwarding
plane.  Giant is now acquired in the ithread/netisr or in the
system call code.
2004-06-11 04:06:51 +00:00
Robert Watson
c14800e6ff Lock down parallel router_info list for tracking multicast IGMP
versions of various routers seen:

- Introduce igmp_mtx.
- Protect global variable 'router_info_head' and list fields
  in struct router_info with this mutex, as well as
  igmp_timers_are_running.
- find_rti() asserts that the caller acquires igmp_mtx.
- Annotate a failure to check the return value of
  MALLOC(..., M_NOWAIT).
2004-06-11 03:42:37 +00:00
Ruslan Ermilov
dd4d62c7d8 init_tables() must be run after sys/net/route.c:route_init(). 2004-06-10 20:20:37 +00:00
Ruslan Ermilov
cd8b5ae0ae Introduce a new feature to IPFW2: lookup tables. These are useful
for handling large sparse address sets.  Initial implementation by
Vsevolod Lobko <seva@ip.net.ua>, refined by me.

MFC after:	1 week
2004-06-09 20:10:38 +00:00
Hajimu UMEMOTO
cad1917d48 do not send icmp response if the original packet is encrypted.
Obtained from:	KAME
MFC after:	1 week
2004-06-07 09:56:59 +00:00
Bosko Milekic
ac830b58d1 Move the locking of the pcb into raw_output(). Organize code so
that m_prepend() is not called with possibility to wait while the
pcb lock is held.  What still needs revisiting is whether the
ripcbinfo lock is really required here.

Discussed with: rwatson
2004-06-03 03:15:29 +00:00
Poul-Henning Kamp
5dba30f15a add missing #include <sys/module.h> 2004-05-30 20:27:19 +00:00
Poul-Henning Kamp
41ee9f1c69 Add some missing <sys/module.h> includes which are masked by the
one on death-row in <sys/kernel.h>
2004-05-30 17:57:46 +00:00
Christian S.J. Peron
b5ef991561 Add a super-user check to ipfw_ctl() to make sure that the calling
process is a non-prison root. The security.jail.allow_raw_sockets
sysctl variable is disabled by default, however if the user enables
raw sockets in prisons, prison-root should not be able to interact
with firewall rule sets.

Approved by:	rwatson, bmilekic (mentor)
2004-05-25 15:02:12 +00:00
Yaroslav Tykhiy
4658dc8325 When checking for possible port theft, skip over a TCP inpcb
unless it's in the closed or listening state (remote address
== INADDR_ANY).

If a TCP inpcb is in any other state, it's impossible to steal
its local port or use it for port theft.  And if there are
both closed/listening and connected TCP inpcbs on the same
localIP:port couple, the call to in_pcblookup_local() will
find the former due to the design of that function.

No objections raised in:	-net, -arch
MFC after:			1 month
2004-05-20 06:35:02 +00:00
Maxim Konovalov
a49b21371a o Calculate a number of bytes to copy (cnt) correctly:
+----+-+-+-+-+----+----+- - - - - - - - - - - -  -+----+
  |    | |C| | |    |    |                          |    |
  | IP |N|O|L|P|    | IP |                          | IP |
  | #1 |O|D|E|T|    | #2 |                          | #n |
  |    |P|E|N|R|    |    |                          |    |
  +----+-+-+-+-+----+----+- - - - - - - - - - - -  -+----+
               ^    ^<---- cnt - (IPOPT_MINOFF - 1) ---->|
               |    |
src            |    +-- cp[IPOPT_OFF + 1] + sizeof(struct in_addr)
               |
dst            +-- cp[IPOPT_OFF + 1]

PR:		kern/66386
Submitted by:	Andrei Iltchenko
MFC after:	3 weeks
2004-05-11 19:14:44 +00:00
Maxim Konovalov
d0946241ac o IFNAMSIZ does include the trailing \0.
Approved by:	andre

o Document net.inet.icmp.reply_src.
2004-05-07 01:24:53 +00:00
Andre Oppermann
2bde81acd6 Provide the sysctl net.inet.ip.process_options to control the processing
of IP options.

 net.inet.ip.process_options=0  Ignore IP options and pass packets unmodified.
 net.inet.ip.process_options=1  Process all IP options (default).
 net.inet.ip.process_options=2  Reject all packets with IP options with ICMP
  filter prohibited message.

This sysctl affects packets destined for the local host as well as those
only transiting through the host (routing).

IP options do not have any legitimate purpose anymore and are only used
to circumvent firewalls or to exploit certain behaviours or bugs in TCP/IP
stacks.

Reviewed by:	sam (mentor)
2004-05-06 18:46:03 +00:00
Robert Watson
c18b97c630 Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4.  This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed.  In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 02:11:47 +00:00
Robert Watson
87f2bb8caf Assert inpcb lock in udp_append().
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 01:08:15 +00:00
Robert Watson
cbe42d48bd Assert the inpcb lock on 'last' in udp_append(), since it's always
called with it, and also requires it.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 00:10:16 +00:00
Maxim Konovalov
1a0c4873ed o Fix misindentation in the previous commit. 2004-05-03 17:15:34 +00:00
Andre Oppermann
7652802b06 Back out a change that slipped into the previous commit for which other
supporting parts have not yet been committed.

Remove pre-mature IP options ignoring option.
2004-05-03 16:07:13 +00:00
Andre Oppermann
06bb56f43c Optimize IP fastforwarding some more:
o New function ip_findroute() to reduce code duplication for the
  route lookup cases. (luigi)

o Store ip_len in host byte order on the stack instead of using
  it via indirection from the mbuf.  This allows to defer the host
  byte conversion to a later point and makes a quicker fallback to
  normal ip_input() processing. (luigi)

o Check if route is dampned with RTF_REJECT flag and drop packet
  already here when ARP is unable to resolve destination address.
  An ICMP unreachable is sent to inform the sender.

o Check if interface output queue is full and drop packet already
  here.  No ICMP notification is sent because signalling source quench
  is depreciated.

o Check if media_state is down (used for ethernet type interfaces)
  and drop the packet already here.  An ICMP unreachable is sent to
  inform the sender.

o Do not account sent packets to the interface address counters.  They
  are only for packets with that 'ia' as source address.

o Update and clarify some comments.

Submitted by:	luigi (most of it)
2004-05-03 13:52:47 +00:00
Darren Reed
2f3f1e6773 Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier. 2004-05-02 15:10:17 +00:00
Darren Reed
7fbb130049 oops, I forgot this file in a prior commit (change was still sitting here,
uncommitted):

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h
alongside all of the other macros that work ok mbuf's and tag's.
2004-05-02 15:07:37 +00:00
Darren Reed
ab884d993e Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside
all of the other macros that work ok mbuf's and tag's.
2004-05-02 06:36:30 +00:00
Bosko Milekic
5a59cefcd1 Give jail(8) the feature to allow raw sockets from within a
jail, which is less restrictive but allows for more flexible
jail usage (for those who are willing to make the sacrifice).
The default is off, but allowing raw sockets within jails can
now be accomplished by tuning security.jail.allow_raw_sockets
to 1.

Turning this on will allow you to use things like ping(8)
or traceroute(8) from within a jail.

The patch being committed is not identical to the patch
in the PR.  The committed version is more friendly to
APIs which pjd is working on, so it should integrate
into his work quite nicely.  This change has also been
presented and addressed on the freebsd-hackers mailing
list.

Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
PR: kern/65800
2004-04-26 19:46:52 +00:00
Mike Silbersack
80dd2a81fb Tighten up reset handling in order to make reset attacks as difficult as
possible while maintaining compatibility with the widest range of TCP stacks.

The algorithm is as follows:

---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.

For connections in all other states, a reset anywhere in the window
will cause the connection to be reset.  All other segments will be
silently dropped.
---

The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.

Idea by:        Darren Reed
Reviewed by:    truckman, jlemon, others(?)
2004-04-26 02:56:31 +00:00
Luigi Rizzo
b2a8ac7ca5 Another small set of changes to reduce diffs with the new arp code. 2004-04-25 15:00:17 +00:00
Luigi Rizzo
491522eade remove a stale comment on the behaviour of arpresolve 2004-04-25 14:06:23 +00:00
Luigi Rizzo
cfff63f1b8 Start the arp timer at init time.
It runs so rarely that it makes no sense to wait until the first request.
2004-04-25 12:50:14 +00:00
Luigi Rizzo
cd46a114fc This commit does two things:
1. rt_check() cleanup:
    rt_check() is only necessary for some address families to gain access
    to the corresponding arp entry, so call it only in/near the *resolve()
    routines where it is actually used -- at the moment this is
    arpresolve(), nd6_storelladdr() (the call is embedded here),
    and atmresolve() (the call is just before atmresolve to reduce
    the number of changes).
    This change will make it a lot easier to decouple the arp table
    from the routing table.

    There is an extra call to rt_check() in if_iso88025subr.c to
    determine the routing info length. I have left it alone for
    the time being.

    The interface of arpresolve() and nd6_storelladdr() now changes slightly:
     + the 'rtentry' parameter (really a hint from the upper level layer)
       is now passed unchanged from *_output(), so it becomes the route
       to the final destination and not to the gateway.
     + the routines will return 0 if resolution is possible, non-zero
       otherwise.
     + arpresolve() returns EWOULDBLOCK in case the mbuf is being held
       waiting for an arp reply -- in this case the error code is masked
       in the caller so the upper layer protocol will not see a failure.

2. arpcom untangling
    Where possible, use 'struct ifnet' instead of 'struct arpcom' variables,
    and use the IFP2AC macro to access arpcom fields.
    This mostly affects the netatalk code.

=== Detailed changes: ===
net/if_arcsubr.c
   rt_check() cleanup, remove a useless variable

net/if_atmsubr.c
   rt_check() cleanup

net/if_ethersubr.c
   rt_check() cleanup, arpcom untangling

net/if_fddisubr.c
   rt_check() cleanup, arpcom untangling

net/if_iso88025subr.c
   rt_check() cleanup

netatalk/aarp.c
   arpcom untangling, remove a block of duplicated code

netatalk/at_extern.h
   arpcom untangling

netinet/if_ether.c
   rt_check() cleanup (change arpresolve)

netinet6/nd6.c
   rt_check() cleanup (change nd6_storelladdr)
2004-04-25 09:24:52 +00:00
Mike Silbersack
6b2fc10b64 Wrap two long lines in the previous commit. 2004-04-23 23:29:49 +00:00
Andre Oppermann
2d166c0202 Correct an edge case in tcp_mss() where the cached path MTU
from tcp_hostcache would have overridden a (now) lower MTU of
an interface or route that changed since first PMTU discovery.
The bug would have caused TCP to redo the PMTU discovery when
not strictly necessary.

Make a comment about already pre-initialized default values
more clear.

Reviewed by:	sam
2004-04-23 22:44:59 +00:00
Andre Oppermann
22b5770b99 Add the option versrcreach to verify that a valid route to the
source address of a packet exists in the routing table.  The
default route is ignored because it would match everything and
render the check pointless.

This option is very useful for routers with a complete view of
the Internet (BGP) in the routing table to reject packets with
spoofed or unrouteable source addresses.

Example:

 ipfw add 1000 deny ip from any to any not versrcreach

also known in Cisco-speak as:

  ip verify unicast source reachable-via any

Reviewed by:	luigi
2004-04-23 14:28:38 +00:00
Andre Oppermann
b62dccc7e5 Fix a potential race when purging expired hostcache entries.
Spotted by:	luigi
2004-04-23 13:54:28 +00:00
Mike Silbersack
174624e01d Take out an unneeded variable I forgot to remove in the last commit,
and make two small whitespace fixes so that diffs vs rev 1.142 are minimal.
2004-04-22 08:34:55 +00:00
Mike Silbersack
6ac48b7409 Simplify random port allocation, and add net.inet.ip.portrange.randomized,
which can be used to turn off randomized port allocation if so desired.

Requested by:	alfred
2004-04-22 08:32:14 +00:00