Commit Graph

2427 Commits

Author SHA1 Message Date
rwatson
6917b2b1d9 In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the
contents of the tcpcb are read and modified in volume.

In tcp_input(), replace th comparison with 0 with a comparison with
NULL.

At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'.  Try to improve consistency between
various assertions regarding headlocked to be more informative.

MFC after:	2 weeks
2004-11-23 23:41:20 +00:00
rwatson
75d5a09a05 tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it.  Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after:	2 weeks
2004-11-23 17:21:30 +00:00
rwatson
53e97a895b De-spl tcp_slowtimo; tcp_maxidle assignment is subject to possible
but unlikely races that could be corrected by having tcp_keepcnt
and tcp_keepintvl modifications go through handler functions via
sysctl, but probably is not worth doing.  Updates to multiple
sysctls within evaluation of a single addition are unlikely.

Annotate that tcp_canceltimers() is currently unused.

De-spl tcp_timer_delack().

De-spl tcp_timer_2msl().

MFC after:	2 weeks
2004-11-23 16:45:07 +00:00
rwatson
93fe353ec5 Assert the inpcb lock in tcp_twstart(), which does both read-modify-write
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().

Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller.  This is a bug.

Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.

Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.

MFC after:	2 weeks
2004-11-23 16:23:13 +00:00
rwatson
32947f494f Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(),
and tcp_drop(), due to read-modify-write of TCP state variables.

MFC after:	2 weeks
2004-11-23 16:06:15 +00:00
rwatson
37654f9d78 Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock
protects access to the ISN state variables.

Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.

Staticize internal ISN variables since they're not used outside of
tcp_subr.c.

MFC after:	2 weeks
2004-11-23 15:59:43 +00:00
rwatson
ec333e6577 Remove "Unlocked read" annotations associated with previously unlocked
use of socket buffer fields in the TCP input code.  These references
are now protected by use of the receive socket buffer lock.

MFC after:	1 week
2004-11-22 13:16:27 +00:00
rwatson
69595c71c3 s/send/sent/ in comment describing TCPS_SYN_RECEIVED. 2004-11-21 14:38:04 +00:00
glebius
1ad65ec555 - Since divert protocol is not connection oriented, remove SS_ISCONNECTED flag
from divert sockets.
- Remove div_disconnect() method, since it shouldn't be called now.
- Remove div_abort() method. It was never called directly, since protocol
  doesn't have listen queue. It was called only from div_disconnect(),
  which is removed now.

Reviewed by:	rwatson, maxim
Approved by:	julian (mentor)
MT5 after:	1 week
MT4 after:	1 month
2004-11-18 13:49:18 +00:00
mlaier
4603a76576 Fix host route addition for more than one address to a loopback interface
after allowing more than one address with the same prefix.

Reported by:	Vladimir Grebenschikov <vova NO fbsd SPAM ru>
Submitted by:	ru (also NetBSD rev. 1.83)
Pointyhat to:	mlaier
2004-11-17 23:14:03 +00:00
mlaier
5780422cd7 Merge copyright notices.
Requested by:	njl
2004-11-13 17:05:40 +00:00
glebius
a4a6b8f0c4 Fix ng_ksocket(4) operation as a divert socket, which is pretty useful
and has been broken twice:

- in the beginning of div_output() replace KASSERT with assignment, as
  it was in rev. 1.83. [1] [to be MFCed]
- refactor changes introduced in rev. 1.100: do not prepend a new tag
  unconditionally. Before doing this check whether we have one. [2]

A small note for all hacking in this area:
when divert socket is not a real userland, but ng_ksocket(4), we receive
_the same_ mbufs, that we transmitted to socket. These mbufs have rcvif,
the tags we've put on them. And we should treat them correctly.

Discussed with:	mlaier [1]
Silence from:	green [2]
Reviewed by:	maxim
Approved by:	julian (mentor)
MFC after:	1 week
2004-11-12 22:17:42 +00:00
mlaier
583a3d8244 Change the way we automatically add prefix routes when adding a new address.
This makes it possible to have more than one address with the same prefix.
The first address added is used for the route. On deletion of an address
with IFA_ROUTE set, we try to find a "fallback" address and hand over the
route if possible.
I plan to MFC this in 4 weeks, hence I keep the - now obsolete - argument to
in_ifscrub as it must be considered KAPI as it is not static in in.c. I will
clean this after the MFC.

Discussed on:	arch, net
Tested by:	many testers of the CARP patches
Nits from:	ru, Andrea Campi <andrea+freebsd_arch webcom it>
Obtained from:	WIDE via OpenBSD
MFC after:	1 month
2004-11-12 20:53:51 +00:00
phk
530b64583e Add missing '='
Spotted by:	obrien
2004-11-11 19:02:01 +00:00
andre
173ef4db97 Fix a double-free in the 'hlen > m->m_len' sanity check.
Bug report by:	<james@towardex.com>
MFC after:	2 weeks
2004-11-09 09:40:32 +00:00
suz
30108058ef support TCP-MD5(IPv4) in KAME-IPSEC, too.
MFC after: 3 week
2004-11-08 18:49:51 +00:00
phk
027fce30f5 Initialize struct pr_userreqs in new/sparse style and fill in common
default elements in net_init_domain().

This makes it possible to grep these structures and see any bogosities.
2004-11-08 14:44:54 +00:00
rwatson
185ec80b05 Do some re-sorting of TCP pcbinfo locking and assertions: make sure to
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseuo-reference to prevent the pcb
from potentially being recycled.  Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed.  Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.

MFC after:	3 weeks
2004-11-07 19:19:35 +00:00
andre
becc212fd3 Fix a double-free in the 'm->m_len < sizeof (struct ip)' sanity check.
Bug report by:	<james@towardex.com>
MFC after:	2 weeks
2004-11-06 10:47:36 +00:00
phk
f4e34013c8 Hide udp_in6 behind #ifdef INET6 2004-11-04 07:14:03 +00:00
bms
ade2a04c45 When performing IP fast forwarding, immediately drop traffic which is
destined for a blackhole route.

This also means that blackhole routes do not need to be bound to lo(4)
or disc(4) interfaces for the net.inet.ip.fastforwarding=1 case.

Submitted by:	james at towardex dot com
Sponsored by:	eXtensible Open Router Project <URL:http://www.xorp.org/>
MFC after:	3 weeks
2004-11-04 02:14:38 +00:00
rwatson
f00509ea8d Until this change, the UDP input code used global variables udp_in,
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and soappendaddr_locked().  While file in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port.  To correct this and
related races:

- Eliminate udp_ip6, which is believed to be generated but then never
  used.  Eliminate ip_2_ip6_hdr() as it is now unneeded.

- Eliminate setting, testing, and existence of 'init' status fields
  for the IPv6 structures.  While with multiple UDP delivery this
  could lead to amortization of IPv4 -> IPv6 conversion when
  delivering an IPv4 UDP packet to an IPv6 socket, it added
  substantial complexity and side effects.

- Move global structures into the stack, declaring udp_in in
  udp_input(), and udp_in6 in udp_append() to be used if a conversion
  is required.  Pass &udp_in into udp_append().

- Re-annotate comments to reflect updates.

With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism.  This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.

Discovered by:	kris (Bug Magnet)
MFC after:	3 weeks
2004-11-04 01:25:23 +00:00
andre
d06f3bef4e Remove RFC1644 T/TCP support from the TCP side of the network stack.
A complete rationale and discussion is given in this message
and the resulting discussion:

 http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel.  Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols)  and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on:	-arch
2004-11-02 22:22:22 +00:00
rwatson
f7da0c44ca Correct a bug in TCP SACK that could result in wedging of the TCP stack
under high load: only set function state to loop and continuing sending
if there is no data left to send.

RELENG_5_3 candidate.

Feet provided:	Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by:	Aniel Hartmeier <daniel at benzedrine dot cx>
Submitted by:	mohan <mohans at yahoo-inc dot com>
2004-10-30 12:02:50 +00:00
rwatson
70db0bbc92 Add a matching tunable for net.inet.tcp.sack.enable sysctl. 2004-10-26 08:59:09 +00:00
bms
53c873427a Check that rt_mask(rt) is non-NULL before dereferencing it, in the
RTM_ADD case, thus avoiding a panic.

Submitted by:	Iasen Kostov
2004-10-26 03:31:58 +00:00
andre
4bbe5a2c0f IPDIVERT is a module now and tell the other parts of the kernel about it.
IPDIVERT depends on IPFIREWALL being loaded or compiled into the kernel.
2004-10-25 20:02:34 +00:00
ru
5db2b9d5b3 For variables that are only checked with defined(), don't provide
any fake value.
2004-10-24 15:33:08 +00:00
andre
a8987888fb Shave 40 unused bytes from struct tcpcb. 2004-10-22 19:55:04 +00:00
andre
42e8443fa1 When printing the initialization string and IPDIVERT is not compiled into the
kernel refer to it as "loadable" instead of "disabled".
2004-10-22 19:18:06 +00:00
andre
7c8480e7f1 Refuse to unload the ipdivert module unless the 'force' flag is given to kldunload.
Reflect the fact that IPDIVERT is a loadable module in the divert(4) and ipfw(8)
man pages.
2004-10-22 19:12:01 +00:00
andre
513a1a5fd1 Destroy the UMA zone on unload. 2004-10-19 22:51:20 +00:00
andre
952f796999 Slightly extend the locking during unload to fully cover the protocol
deregistration.  This does not entirely close the race but narrows the
even previously extremely small chance of a race some more.
2004-10-19 22:08:13 +00:00
rwatson
009582eb77 Annotate a newly introduced race present due to the unloading of
protocols: it is possible for sockets to be created and attached
to the divert protocol between the test for sockets present and
successful unload of the registration handler.  We will need to
explore more mature APIs for unregistering the protocol and then
draining consumers, or an atomic test-and-unregister mechanism.
2004-10-19 21:35:42 +00:00
andre
9f43dad9fc Convert IPDIVERT into a loadable module. This makes use of the dynamic loadability
of protocols.  The call to divert_packet() is done through a function pointer.  All
semantics of IPDIVERT remain intact.  If IPDIVERT is not loaded ipfw will refuse to
install divert rules and  natd will complain about 'protocol not supported'.  Once
it is loaded both will work and accept rules and open the divert socket.  The module
can only be unloaded if no divert sockets are open.  It does not close any divert
sockets when an unload is requested but will return EBUSY instead.
2004-10-19 21:14:57 +00:00
andre
40693cc7d9 Properly declare the "net.inet" sysctl subtree. 2004-10-19 21:06:14 +00:00
andre
11ab41ab2f Pre-emptively define IPPROTO_SPACER to 32767, the same value as PROTO_SPACER
to document that this value is globally assigned for a special purpose and
may not be reused within the IPPROTO number space.
2004-10-19 20:59:01 +00:00
andre
cf99677e64 Make use of the PROTO_SPACER functionality for dynamically loadable
protocols in inetsw[] and define initially eight spacer slots.

Remove conflicting declaration 'struct pr_usrreqs nousrreqs'.  It is
now declared and initialized in kern/uipc_domain.c.
2004-10-19 15:58:22 +00:00
andre
00d43f4bd8 Support for dynamically loadable and unloadable IP protocols in the ipmux.
With pr_proto_register() it has become possible to dynamically load protocols
within the PF_INET domain.  However the PF_INET domain has a second important
structure called ip_protox[] that is derived from the 'struct protosw inetsw[]'
and takes care of the de-multiplexing of the various protocols that ride on
top of IP packets.

The functions ipproto_[un]register() allow to dynamically adjust the ip_protox[]
array mux in a consistent and easy way.  To register a protocol within
ip_protox[] the existence of a corresponding and matching protocol definition
in inetsw[] is required.  The function does not allow to overwrite an already
registered protocol.  The unregister function simply replaces the mux slot with
the default index pointer to IPPROTO_RAW as it was previously.
2004-10-19 15:45:57 +00:00
andre
23afa2eef1 Add a macro for the destruction of INP_INFO_LOCK's used by loadable modules. 2004-10-19 14:34:13 +00:00
andre
dcb4801af4 Make comments more clear. Change the order of one if() statement to check the
more likely variable first.
2004-10-19 14:31:56 +00:00
rwatson
4b81ce6dd2 Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
  mutex, avoiding sofree() having to drop the socket mutex and re-order,
  which could lead to races permitting more than one thread to enter
  sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
  the protocol to the socket, preventing races in clearing and
  evaluation of the reference such that sofree() might be called more
  than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket.  The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets.  The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after:	3 days
Reviewed by:	dwhite
Discussed with:	gnn, dwhite, green
Reported by:	Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by:	Vlad <marchenko at gmail dot com>
2004-10-18 22:19:43 +00:00
rwatson
cf6eacbc1e Don't release the udbinfo lock until after the last use of UDP inpcb
in udp_input(), since the udbinfo lock is used to prevent removal of
the inpcb while in use (i.e., as a form of reference count) in the
in-bound path.

RELENG_5 candidate.
2004-10-12 20:03:56 +00:00
rwatson
338d307146 Modify the thrilling "%D is using my IP address %s!" message so that
it isn't printed if the IP address in question is '0.0.0.0', which is
used by nodes performing DHCP lookup, and so constitute a false
positive as a report of misconfiguration.
2004-10-12 17:10:40 +00:00
rwatson
84c4649f53 When the access control on creating raw sockets was modified so that
processes in jail could create raw sockets, additional access control
checks were added to raw IP sockets to limit the ways in which those
sockets could be used.  Specifically, only the socket option IP_HDRINCL
was permitted in rip_ctloutput().  Other socket options were protected
by a call to suser().  This change was required to prevent processes
in a Jail from modifying system properties such as multicast routing
and firewall rule sets.

However, it also introduced a regression: processes that create a raw
socket with root privilege, but then downgraded credential (i.e., a
daemon giving up root, or a setuid process switching back to the real
uid) could no longer issue other unprivileged generic IP socket option
operations, such as IP_TOS, IP_TTL, and the multicast group membership
options, which prevented multicast routing daemons (and some other
tools) from operating correctly.

This change pushes the access control decision down to the granularity
of individual socket options, rather than all socket options, on raw
IP sockets.  When rip_ctloutput() doesn't implement an option, it will
now pass the request directly to in_control() without an access
control check.  This should restore the functionality of the generic
IP socket options for raw sockets in the above-described scenarios,
which may be confirmed with the ipsockopt regression test.

RELENG_5 candidate.

Reviewed by:	csjp
2004-10-12 16:47:25 +00:00
rwatson
a475461b84 Acquire the send socket buffer lock around tcp_output() activities
reaching into the socket buffer.  This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it).  Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although  haven't
seen it reported).

RELENG_5 candidate.
2004-10-09 16:48:51 +00:00
rwatson
ccb2845f23 When running with debug.mpsafenet=0, initialize IP multicast routing
callouts as non-CALLOUT_MPSAFE.  Otherwise, they may trigger an
assertion regarding Giant if they enter other parts of the stack from
the callout.

MFC after:	3 days
Reported by:	Dikshie < dikshie at ppk dot itb dot ac dot id >
2004-10-07 14:13:35 +00:00
ps
c8e4aa1cd5 - Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
  retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
  number of sack retransmissions done upon initiation of sack recovery.

Submitted by:	Mohan Srinivasan <mohans@yahoo-inc.com>
2004-10-05 18:36:24 +00:00
green
cb606898b9 Add support to IPFW for matching by TCP data length. 2004-10-03 00:47:15 +00:00
green
4f70622005 Add support to IPFW for classification based on "diverted" status
(that is, input via a divert socket).
2004-10-03 00:26:35 +00:00
green
a1ab5f0c7d Add to IPFW the ability to do ALTQ classification/tagging. 2004-10-03 00:17:46 +00:00
green
3f01e230b4 Validate the action pointer to be within the rule size, so that trying to
add corrupt ipfw rules would not potentially panic the system or worse.
2004-09-30 17:42:00 +00:00
mlaier
b65eae4c19 Add an additional struct inpcb * argument to pfil(9) in order to enable
passing along socket information. This is required to work around a LOR with
the socket code which results in an easy reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a seperate (later) commit.

This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
forseeable future.

Suggested by:		rwatson
A lot of work by:	csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by:		rwatson, csjp
Tested by:		-pf, -ipfw, LINT, csjp and myself
MFC after:		3 days

LOR IDs:		14 - 17 (not fixed yet)
2004-09-29 04:54:33 +00:00
rwatson
e455dd69f8 Assign so_pcb to NULL rather than 0 as it's a pointer.
Spotted by:	dwhite
2004-09-29 04:01:13 +00:00
maxim
72a6bed376 o Turn net.inet.ip.check_interface sysctl off by default.
When net.inet.ip.check_interface was MFCed to RELENG_4 3+ years ago in
rev. 1.130.2.17 ip_input.c it was 1 by default but shortly changed to
0 (accidently?) in rev. 1.130.2.20 in RELENG_4 only.  Among with the
fact this knob is not documented it breaks POLA especially in bridge
environment.

OK'ed by:	andre
Reviewed by:	-current
2004-09-24 12:18:40 +00:00
andre
d4e3412583 Fix an out of bounds write during the initialization of the PF_INET protocol
family to the ip_protox[] array.  The protocol number of IPPROTO_DIVERT is
larger than IPPROTO_MAX and was initializing memory beyond the array.
Catch all these kinds of errors by ignoring protocols that are higher than
IPPROTO_MAX or 0 (zero).

Add more comments ip_init().
2004-09-16 18:33:39 +00:00
andre
44b7e0a719 Clarify some comments for the M_FASTFWD_OURS case in ip_input(). 2004-09-15 20:17:03 +00:00
andre
5b67b5c1f3 Remove the last two global variables that are used to store packet state while
it travels through the IP stack.  This wasn't much of a problem because IP
source routing is disabled by default but when enabled together with SMP and
preemption it would have very likely cross-corrupted the IP options in transit.

The IP source route options of a packet are now stored in a mtag instead of the
global variable.
2004-09-15 20:13:26 +00:00
andre
3767c4cf7b Do not allow 'ipfw fwd' command when IPFIREWALL_FORWARD is not compiled into
the kernel.  Return EINVAL instead.
2004-09-13 19:27:23 +00:00
andre
2c213c186f If we have to 'ipfw fwd'-tag a packet the second time in ipfw_pfil_out() don't
prepend an already existing tag again.  Instead unlink it and prepend it again
to have it as the first tag in the chain.

PR:		kern/71380
2004-09-13 19:20:14 +00:00
andre
e837c32545 Make comments more clear for the packet changed cases after pfil hooks. 2004-09-13 17:09:06 +00:00
andre
532ad0416b Fix ip_input() fallback for the destination modified cases (from the packet
filters).  After the ipfw to pfil move ip_input() expects M_FASTFWD_OURS
tagged packets to have ip_len and ip_off in host byte order instead of
network byte order.

PR:		kern/71652
Submitted by:	mlaier (patch)
2004-09-13 17:01:53 +00:00
andre
eba7c4085c Make 'ipfw tee' behave as inteded and designed. A tee'd packet is copied
and sent to the DIVERT socket while the original packet continues with the
next rule.  Unlike a normally diverted packet no IP reassembly attemts are
made on tee'd packets and they are passed upwards totally unmodified.

Note: This will not be MFC'd to 4.x because of major infrastucture changes.

PR:		kern/64240 (and many others collapsed into that one)
2004-09-13 16:46:05 +00:00
glebius
c887bf2814 Check flag do_bridge always, even if kernel was compiled without
BRIDGE support. This makes dynamic bridge.ko working.

Reviewed by:	sam
Approved by:	julian (mentor)
MFC after:	1 week
2004-09-09 12:34:07 +00:00
jmg
af7268d8eb revert comment from rev1.158 now that rev1.225 backed it out..
MFC after:	3 days
2004-09-06 15:48:38 +00:00
glebius
2087edb938 Recover normal behavior: return EINVAL to attempt to add a divert rule
when module is built without IPDIVERT.

Silence from:	andre
Approved by:	julian (mentor)
2004-09-05 20:06:50 +00:00
jmg
8e8293b765 fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...
2004-09-05 02:34:12 +00:00
andre
2126402238 Apply error and success logic consistently to the function netisr_queue() and
its users.

netisr_queue() now returns (0) on success and ERRNO on failure.  At the
moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full)
are supported.

Previously it would return (1) on success but the return value of IF_HANDOFF()
was interpreted wrongly and (0) was actually returned on success.  Due to this
schednetisr() was never called to kick the scheduling of the isr.  However this
was masked by other normal packets coming through netisr_dispatch() causing the
dequeueing of waiting packets.

PR:		kern/70988
Found by:	MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after:	3 days
2004-08-27 18:33:08 +00:00
andre
c3936067e7 In the case the destination of a packet was changed by the packet filter
to point to a local IP address; and the packet was sourced from this host
we fill in the m_pkthdr.rcvif with a pointer to the loopback interface.

Before the function ifunit("lo0") was used to obtain the ifp.  However
this is sub-optimal from a performance point of view and might be dangerous
if the loopback interface has been renamed.  Use the global variable 'loif'
instead which always points to the loopback interface.

Submitted by:	brooks
2004-08-27 15:39:34 +00:00
andre
b32a1f9e40 Remove a junk line left over from the recent IPFW to PFIL_HOOKS conversion. 2004-08-27 15:32:28 +00:00
andre
d243747d92 Always compile PFIL_HOOKS into the kernel and remove the associated kernel
compile option.  All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.

If no hooks are connected the entire packet filter hooks section and related
activities are jumped over.  This removes any performance impact if no hooks
are active.

Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.
2004-08-27 15:16:24 +00:00
ru
fdbaf5c59c Revert the last change to sys/modules/ipfw/Makefile and fix a
standalone module build in a better way.

Silence from:	andre
MFC after:	3 days
2004-08-26 14:18:30 +00:00
pjd
8c34a02cfb Allocate memory when dumping pipes with M_WAITOK flag.
On a system with huge number of pipes, M_NOWAIT failes almost always,
because of memory fragmentation.
My fix is different than the patch proposed by Pawel Malachowski,
because in FreeBSD 5.x we cannot sleep while holding dummynet mutex
(in 4.x there is no such lock).
My fix is also ugly, but there is no easy way to prepare nice and clean fix.

PR:		kern/46557
Submitted by:	Eugene Grosbein <eugen@grosbein.pp.ru>
Reviewed by:	mlaier
2004-08-25 09:31:30 +00:00
mlaier
252cbf1c2a Allow early drop for non-ALTQ enabled queues in an ALTQ-enabled kernel.
Previously the early drop was disabled unconditionally for ALTQ-enabled
kernels.

This should give some benefit for the normal gateway + LAN-server case with
a busy LAN leg and an ALTQ managed uplink.

Reviewed and style help from:	cperciva, pjd
2004-08-22 16:42:28 +00:00
rwatson
2989f4181e When sliding the m_data pointer forward, update m_pktrhdr.len as well
as m_len, or the pkthdr length will be inconsistent with the actual
length of data in the mbuf chain.  The symptom of this occuring was
"out of data" warnings from in_cksum_skip() on large UDP packets sent
via the loopback interface.

Foot shot:	green
2004-08-22 01:32:48 +00:00
csjp
657b6f650c When a prison is given the ability to create raw sockets (when the
security.jail.allow_raw_sockets sysctl MIB is set to 1) where privileged
access to jails is given out, it is possible for prison root to manipulate
various network parameters which effect the host environment. This commit
plugs a number of security holes associated with the use of raw sockets
and prisons.

This commit makes the following changes:

- Add a comment to rtioctl warning developers that if they add
  any ioctl commands, they should use super-user checks where necessary,
  as it is possible for PRISON root to make it this far in execution.
- Add super-user checks for the execution of the SIOCGETVIFCNT
  and SIOCGETSGCNT IP multicast ioctl commands.
- Add a super-user check to rip_ctloutput(). If the calling cred
  is PRISON root, make sure the socket option name is IP_HDRINCL,
  otherwise deny the request.

Although this patch corrects a number of security problems associated
with raw sockets and prisons, the warning in jail(8) should still
apply, and by default we should keep the default value of
security.jail.allow_raw_sockets MIB to 0 (or disabled) until
we are certain that we have tracked down all the problems.

Looking forward, we will probably want to eliminate the
references to curthread.

This may be a MFC candidate for RELENG_5.

Reviewed by:	rwatson
Approved by:	bmilekic (mentor)
2004-08-21 17:38:57 +00:00
rwatson
51b320a56b When prepending space onto outgoing UDP datagram payloads to hold the
UDP/IP header, make sure that space is also allocated for the link
layer header.  If an mbuf must be allocated to hold the UDP/IP header
(very likely), then this will avoid an additional mbuf allocation at
the link layer.  This trick is also used by TCP and other protocols to
avoid extra calls to the mbuf allocator in the ethernet (and related)
output routines.
2004-08-21 16:14:04 +00:00
andre
80ff6433dd Fix a stupid typo which prevented an ipfw KLD unload from successfully cleaning
up its remains.  Do not terminate 'if' lines with ';'.

Spotted by:	claudio@OpenBSD.ORG (sitting 3m from my desk)
Pointy hat to:	andre
2004-08-20 00:36:55 +00:00
andre
5947fa055f When unloading ipfw module use callout_drain() to make absolutely sure that
all callouts are stopped and finished.  Move it before IPFW_LOCK() to avoid
deadlocking when draining callouts.
2004-08-19 23:31:40 +00:00
andre
93c5d20c77 For IPv6 access pointer to tcpcb only after we have checked it is valid.
Found by:	Coverity's automated analysis (via Ted Unangst)
2004-08-19 20:16:17 +00:00
andre
5f83f24499 Give a useful error message if someone tries to compile IPFIREWALL into the
kernel without specifying PFIL_HOOKS as well.
2004-08-19 18:38:23 +00:00
andre
efb29124a7 Do not unconditionally ignore IPDIVERT and IPFIREWALL_FORWARD when building
the ipfw KLD.

 For IPFIREWALL_FORWARD this does not have any side effects.  If the module
 has it but not the kernel it just doesn't do anything.

 For IPDIVERT the KLD will be unloadable if the kernel doesn't have IPDIVERT
 compiled in too.  However this is the least disturbing behaviour.  The user
 can just recompile either module or the kernel to match the other one.  The
 access to the machine is not denied if ipfw refuses to load.
2004-08-19 17:59:26 +00:00
andre
14296889b2 Bring back the sysctl 'net.inet.ip.fw.enable' to unbreak the startup scripts
and to be able to disable ipfw if it was compiled directly into the kernel.
2004-08-19 17:38:47 +00:00
rwatson
eb3ee278bc Push down pcbinfo and inpcb locking from udp_send() into udp_output().
This provides greater context for the locking and allows us to avoid
locking the pcbinfo structure if not binding operations will take
place (i.e., already bound, connected, and no expliti sendto()
address).
2004-08-19 01:13:10 +00:00
rwatson
224ba75d82 In in_pcbrehash(), do assert the inpcb lock as well as the pcbinfo lock. 2004-08-19 01:11:17 +00:00
rwatson
b790600a00 Fix build of ip_input.c with "options IPSEC" -- the "pass:" label
is used with both FAST_IPSEC and IPSEC, but was defined for only
FAST_IPSEC.
2004-08-18 03:11:04 +00:00
peter
79dd918c6c Make the kernel compile again if you are not using PFIL_HOOKS 2004-08-18 00:37:46 +00:00
andre
e4a34b65ad Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland
and preserves the ipfw ABI.  The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.

However there are many changes how ipfw is and its add-on's are handled:

 In general ipfw is now called through the PFIL_HOOKS and most associated
 magic, that was in ip_input() or ip_output() previously, is now done in
 ipfw_check_[in|out]() in the ipfw PFIL handler.

 IPDIVERT is entirely handled within the ipfw PFIL handlers.  A packet to
 be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
 reassembly.  If not, or all fragments arrived and the packet is complete,
 divert_packet is called directly.  For 'tee' no reassembly attempt is made
 and a copy of the packet is sent to the divert socket unmodified.  The
 original packet continues its way through ip_input/output().

 ipfw 'forward' is done via m_tag's.  The ipfw PFIL handlers tag the packet
 with the new destination sockaddr_in.  A check if the new destination is a
 local IP address is made and the m_flags are set appropriately.  ip_input()
 and ip_output() have some more work to do here.  For ip_input() the m_flags
 are checked and a packet for us is directly sent to the 'ours' section for
 further processing.  Destination changes on the input path are only tagged
 and the 'srcrt' flag to ip_forward() is set to disable destination checks
 and ICMP replies at this stage.  The tag is going to be handled on output.
 ip_output() again checks for m_flags and the 'ours' tag.  If found, the
 packet will be dropped back to the IP netisr where it is going to be picked
 up by ip_input() again and the directly sent to the 'ours' section.  When
 only the destination changes, the route's 'dst' is overwritten with the
 new destination from the forward m_tag.  Then it jumps back at the route
 lookup again and skips the firewall check because it has been marked with
 M_SKIP_FIREWALL.  ipfw 'forward' has to be compiled into the kernel with
 'option IPFIREWALL_FORWARD' to enable it.

 DUMMYNET is entirely handled within the ipfw PFIL handlers.  A packet for
 a dummynet pipe or queue is directly sent to dummynet_io().  Dummynet will
 then inject it back into ip_input/ip_output() after it has served its time.
 Dummynet packets are tagged and will continue from the next rule when they
 hit the ipfw PFIL handlers again after re-injection.

 BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
 they did before.  Later this will be changed to dedicated ETHER PFIL_HOOKS.

More detailed changes to the code:

 conf/files
	Add netinet/ip_fw_pfil.c.

 conf/options
	Add IPFIREWALL_FORWARD option.

 modules/ipfw/Makefile
	Add ip_fw_pfil.c.

 net/bridge.c
	Disable PFIL_HOOKS if ipfw for bridging is active.  Bridging ipfw
	is still directly invoked to handle layer2 headers and packets would
	get a double ipfw when run through PFIL_HOOKS as well.

 netinet/ip_divert.c
	Removed divert_clone() function.  It is no longer used.

 netinet/ip_dummynet.[ch]
	Neither the route 'ro' nor the destination 'dst' need to be stored
	while in dummynet transit.  Structure members and associated macros
	are removed.

 netinet/ip_fastfwd.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.

 netinet/ip_fw.h
	Removed 'ro' and 'dst' from struct ip_fw_args.

 netinet/ip_fw2.c
	(Re)moved some global variables and the module handling.

 netinet/ip_fw_pfil.c
	New file containing the ipfw PFIL handlers and module initialization.

 netinet/ip_input.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.  ip_forward() does not longer require
	the 'next_hop' struct sockaddr_in argument.  Disable early checks
	if 'srcrt' is set.

 netinet/ip_output.c
	Removed all direct ipfw handling code and replace it with the new
	'ipfw forward' handling code.

 netinet/ip_var.h
	Add ip_reass() as general function.  (Used from ipfw PFIL handlers
	for IPDIVERT.)

 netinet/raw_ip.c
	Directly check if ipfw and dummynet control pointers are active.

 netinet/tcp_input.c
	Rework the 'ipfw forward' to local code to work with the new way of
	forward tags.

 netinet/tcp_sack.c
	Remove include 'opt_ipfw.h' which is not needed here.

 sys/mbuf.h
	Remove m_claim_next() macro which was exclusively for ipfw 'forward'
	and is no longer needed.

Approved by:	re (scottl)
2004-08-17 22:05:54 +00:00
rwatson
87aa99bbbb White space cleanup for netinet before branch:
- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by:	re (scottl)
Submitted by:	Xin LI <delphij@frontfree.net>
2004-08-16 18:32:07 +00:00
obrien
3558849b92 Put the 'antispoof' opcode in the proper place in the opcode list such
that it doesn't break the ipfw2 ABI.
2004-08-16 12:05:19 +00:00
dwmalone
5df13d37b2 Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

        1) introduce a ip_newid() static inline function that checks
        the sysctl and then decides if it should return a sequential
        or random IP ID.

        2) named the sysctl net.inet.ip.random_id

        3) IPv6 flow IDs and fragment IDs are now always random.
        Flow IDs and frag IDs are significantly less common in the
        IPv6 world (ie. rarely generated per-packet), so there should
        be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by:	andre, silby, mlaier, ume
Based on:	NetBSD
MFC after:	2 months
2004-08-14 15:32:40 +00:00
phk
271672aa9c Fix outgoing ICMP on global instance. 2004-08-14 14:21:09 +00:00
csjp
6661aed38d Add the ability to associate ipfw rules with a specific prison ID.
Since the only thing truly unique about a prison is it's ID, I figured
this would be the most granular way of handling this.

This commit makes the following changes:

- Adds tokenizing and parsing for the ``jail'' command line option
  to the ipfw(8) userspace utility.
- Append the ipfw opcode list with O_JAIL.
- While Iam here, add a comment informing others that if they
  want to add additional opcodes, they should append them to the end
  of the list to avoid ABI breakage.
- Add ``fw_prid'' to the ipfw ucred cache structure.
- When initializing ucred cache, if the process is jailed,
  set fw_prid to the prison ID, otherwise set it to -1.
- Update man page to reflect these changes.

This change was a strong motivator behind the ucred caching
mechanism in ipfw.

A sample usage of this new functionality could be:

    ipfw add count ip from any to any jail 2

It should be noted that because ucred based constraints
are only implemented for TCP and UDP packets, the same
applies for jail associations.

Conceptual head nod by:	pjd
Reviewed by:	rwatson
Approved by:	bmilekic (mentor)
2004-08-12 22:06:55 +00:00
dwmalone
874318f896 In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach
so that the locks held are the same as the IPv4 case.

Reviewed by:	rwatson
2004-08-12 18:19:36 +00:00
andre
6822e5677f Fix two cases of incorrect IPQ_UNLOCK'ing in the merged ip_reass() function.
The first one was going to 'dropfrag', which unlocks the IPQ, before the lock
was aquired; The second one doing a unlock and then a 'goto dropfrag' which
led to a double-unlock.

Tripped over by:	des
2004-08-12 08:37:42 +00:00
rwatson
c1da641947 When udp_send() fails, make sure to free the control mbufs as well as
the data mbuf.  This was done in most error cases, but not the case
where the inpcb pointer is surprisingly NULL.
2004-08-12 01:34:27 +00:00
andre
d87fe3ee1e Backout removal of UMA_ZONE_NOFREE flag for all zones which are established
for structures with timers in them.  It might be that a timer might fire
even when the associated structure has already been free'd.  Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.

Discussed with:	bosko, tegge, rwatson
2004-08-11 20:30:08 +00:00
andre
a6a5e26503 Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and
TCP code.  This flag would have prevented giving back excessive free slabs
to the global pool after a transient peak usage.
2004-08-11 17:08:31 +00:00
andre
2dba36f65b Make use of in_localip() function and replace previous direct LIST_FOREACH
loops over INADDR_HASH.
2004-08-11 12:32:10 +00:00
andre
a93503bce5 Add the function in_localip() which returns 1 if an internet address is for
the local host and configured on one of its interfaces.
2004-08-11 11:49:48 +00:00
andre
957506e985 Only invoke verify_path() for verrevpath and versrcreach when we have an IP packet. 2004-08-11 11:41:11 +00:00
andre
3d16b5d93e Only check for local broadcast addresses if the mbuf is flagged with M_BCAST. 2004-08-11 10:49:56 +00:00
andre
47aa08bf94 Consistently use NULL for pointer comparisons. 2004-08-11 10:46:15 +00:00
andre
d03ce8b4a3 Make IP fastforwarding ALTQ-aware by adding the input traffic conditioner
check and disabling the early output interface queue length check.
2004-08-11 10:42:59 +00:00
andre
0ad10fdbdf Correct the displayed bandwidth calculation for a readout via sysctl. The
saved value does not have to be scaled with HZ; it is already in bytes per
second.  Only the multiply by eight remains to show bits per second (bps).
2004-08-11 10:12:16 +00:00
rwatson
4bd194b32a Assert the locks of inpcbinfo's and inpcb's passed into in_pcbconnect()
and in_pcbconnect_setup(), since these functions frob the port and
address state of inpcbs.
2004-08-11 04:35:20 +00:00
andre
e7784143d4 Make a comment that IP source routing is not SMP and PREEMPTION safe. 2004-08-09 16:17:37 +00:00
andre
1b887d9d08 Make a comment that "ipfw forward" is not SMP and PREEMPTION safe. 2004-08-09 16:16:10 +00:00
andre
649b4336f4 New ipfw option "antispoof":
For incoming packets, the packet's source address is checked if it
 belongs to a directly connected network.  If the network is directly
 connected, then the interface the packet came on in is compared to
 the interface the network is connected to.  When incoming interface
 and directly connected interface are not the same, the packet does
 not match.

Usage example:

 ipfw add deny ip from any to any not antispoof in

Manpage education by:	ru
2004-08-09 16:12:10 +00:00
rwatson
11a2e8ce3c Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events.  This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.

Reported by:	kuriyama
Tested by:	kuriyama
2004-08-06 03:45:45 +00:00
rwatson
2ce46f099e When iterating the UDP inpcb list processing an inbound broadcast
or multicast packet, we don't need to acquire the inpcb mutex
unless we are actually using inpcb fields other than the bound port
and address.  Since we hold the pcbinfo lock already, these can't
change.  Defer acquiring the inpcb mutex until we have a high
chance of a match.  This avoids about 120 mutex operations per UDP
broadcast packet received on one of my work systems.

Reviewed by:	sam
2004-08-06 02:08:31 +00:00
rwatson
bc0c491205 Now that IPv6 performs basic in6pcb and inpcb locking, enable inpcb
lock assertions even if IPv6 is compiled into the kernel.  Previously,
inclusion of IPv6 and locking assertions would result in a rapid
assertion failure as IPv6 was not properly locking inpcbs.
2004-08-04 18:27:55 +00:00
marcus
c8262f39d1 Fix Skinny and PPTP NAT'ing after the introduction of the {ip,tcp,udp}_next
functions.  Basically, the ip_next() function was used to get the PPTP and
Skinny headers when tcp_next() should have been used instead.  Symptoms of
this included a segfault in natd when trying to process a PPTP or Skinny
packet.

Approved by:	des
2004-08-04 15:17:08 +00:00
andre
3a7a087025 o Delayed checksums are now calculated in divert_packet() for diverted packets
Remove the XXX-escaped code that did it in ip_output()'s IPHACK section.
2004-08-03 14:13:36 +00:00
andre
f1492e3a5a o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be
more consistent with the other sysctls around it.
2004-08-03 13:54:11 +00:00
andre
fd96163246 o Move all parts of the IP reassembly process into the function ip_reass() to
make it fully self-contained.
o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len
  including the IP header.
o Computation of the delayed checksum is moved into divert_packet().

Reviewed by:	silby
2004-08-03 12:31:38 +00:00
hsu
4b4df6655a Fix bug with tracking the previous element in a list.
Found by:	edrt@citiz.net
Submitted by:	pavlin@icir.org
2004-08-03 02:01:44 +00:00
yar
1d71ae12e0 Disallow a particular kind of port theft described by the following scenario:
Alice is too lazy to write a server application in PF-independent
	manner.  Therefore she knocks up the server using PF_INET6 only
	and allows the IPv6 socket to accept mapped IPv4 as well.  An evil
	hacker known on IRC as cheshire_cat has an account in the same
	system.  He starts a process listening on the same port as used
	by Alice's server, but in PF_INET.  As a consequence, cheshire_cat
	will distract all IPv4 traffic supposed to go to Alice's server.

Such sort of port theft was initially enabled by copying the code that
implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for
the implied case of the same owner for both connections.  After this
change, the above scenario will be impossible.  In the same setting,
the user who attempts to start his server last will get EADDRINUSE.

Of course, using IPv4 mapped to IPv6 leads to security complications
in the first place, but there is no reason to make it even more unsafe.

This change doesn't apply to KAME since it affects a FreeBSD-specific
part of the code.  It doesn't modify the out-of-box behaviour of the
TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default.

MFC after:	1 month
2004-07-28 13:03:07 +00:00
jayanth
b754d213ab Fix a bug in the sack code that was causing data to be retransmitted
with the FIN bit set for all segments, if a FIN has already been sent before.
The fix will allow the FIN bit to be set for only the last segment, in case
it has to be retransmitted.

Fix another bug that would have caused snd_nxt to be pulled by len if
there was an error from ip_output. snd_nxt should not be touched
during sack retransmissions.
2004-07-28 02:15:14 +00:00
jayanth
ef090090cb Fix for a SACK bug where the very last segment retransmitted
from the SACK scoreboard could result in the next (untransmitted)
segment to be skipped.
2004-07-26 23:41:12 +00:00
jmg
b138ddd4e5 compare pointer against NULL, not 0
when inpcb is NULL, this is no longer invalid since jlemon added the
tcp_twstart function... this prevents close "failing" w/ EINVAL when it
really was successful...

Reviewed by:	jeremy (NetBSD)
2004-07-26 21:29:56 +00:00
cperciva
d9fecc83c8 Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with:	rwatson, scottl
Requested by:	jhb
2004-07-26 07:24:04 +00:00
andre
695543e4da Extend versrcreach by checking against the rt_flags for RTF_REJECT and
RTF_BLACKHOLE as well.

To quote the submitter:

 The uRPF loose-check implementation by the industry vendors, at least on Cisco
 and possibly Juniper, will fail the check if the route of the source address
 is pointed to Null0 (on Juniper, discard or reject route). What this means is,
 even if uRPF Loose-check finds the route, if the route is pointed to blackhole,
 uRPF loose-check must fail. This allows people to utilize uRPF loose-check mode
 as a pseudo-packet-firewall without using any manual filtering configuration --
 one can simply inject a IGP or BGP prefix with next-hop set to a static route
 that directs to null/discard facility. This results in uRPF Loose-check failing
 on all packets with source addresses that are within the range of the nullroute.

Submitted by:	James Jun <james@towardex.com>
2004-07-21 19:55:14 +00:00
rwatson
4557c47e2f M_PREPEND() the IP header on to the front of an outgoing raw IP packet
using M_DONTWAIT rather than M_WAITOK to avoid sleeping on memory
while holding a mutex.
2004-07-20 20:52:30 +00:00
jayanth
3781ade946 Let IN_FASTREOCOVERY macro decide if we are in recovery mode.
Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.
2004-07-19 22:37:33 +00:00
jayanth
48943ed977 Fix a potential panic in the SACK code that was causing
1) data to be sent to the right of snd_recover.
2) send more data then whats in the send buffer.

The fix is to postpone sack retransmit to a subsequent recovery episode
if the current retransmit pointer is beyond snd_recover.

Thanks to Mohan Srinivasan for helping fix the bug.

Submitted by:Daniel Lang
2004-07-19 22:06:01 +00:00
dwmalone
ccfd16b40a Fix the !INET6 build.
Reported by:	alc
2004-07-17 21:40:14 +00:00
dwmalone
71eccf2cf5 The tcp syncache code was leaving the IPv6 flowlabel uninitialised
for the SYN|ACK packet and then letting in6_pcbconnect set the
flowlabel later. Arange for the syncache/syncookie code to set and
recall the flow label so that the flowlabel used for the SYN|ACK
is consistent. This is done by using some of the cookie (when tcp
cookies are enabeled) and by stashing the flowlabel in syncache.

Tested and Discovered by:	Orla McGann <orly@cnri.dit.ie>
Approved by:			ume, silby
MFC after:			1 month
2004-07-17 19:44:13 +00:00
mlaier
512e25ff0c Define semantic of M_SKIP_FIREWALL more precisely, i.e. also pass associated
icmp_error() packets. While here retire PACKET_TAG_PF_GENERATED (which
served the same purpose) and use M_SKIP_FIREWALL in pf as well. This should
speed up things a bit as we get rid of the tag allocations.

Discussed with:	juli
2004-07-17 05:10:06 +00:00
jmallett
111d2dd115 Make M_SKIP_FIREWALL a global (and semantic) flag, preventing anything from
using M_PROTO6 and possibly shooting someone's foot, as well as allowing the
firewall to be used in multiple passes, or with a packet classifier frontend,
that may need to explicitly allow a certain packet.  Presently this is handled
in the ipfw_chk code as before, though I have run with it moved to upper
layers, and possibly it should apply to ipfilter and pf as well, though this
has not been investigated.

Discussed with:	luigi, rwatson
2004-07-17 02:40:13 +00:00
ume
6418d70e35 when IN6P_AUTOFLOWLABEL is set, the flowlabel is not set on
outgoing tcp connections.

Reported by:	Orla McGann <orly@cnri.dit.ie>
Reviewed by:	Orla McGann <orly@cnri.dit.ie>
Obtained from:	KAME
2004-07-16 18:08:13 +00:00
phk
5c95d686a1 Do a pass over all modules in the kernel and make them return EOPNOTSUPP
for unknown events.

A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
2004-07-15 08:26:07 +00:00
stefanf
355a8ec494 Remove erroneous semicolons. 2004-07-13 16:06:19 +00:00
rwatson
9d5e898163 After each label in tcp_input(), assert the inpcbinfo and inpcb lock
state that we expect.
2004-07-12 19:28:07 +00:00
brian
aae31dbf32 Change the following environment variables to kernel options:
bootp -> BOOTP
    bootp.nfsroot -> BOOTP_NFSROOT
    bootp.nfsv3 -> BOOTP_NFSV3
    bootp.compat -> BOOTP_COMPAT
    bootp.wired_to -> BOOTP_WIRED_TO

- i.e. back out the previous commit.  It's already possible to
pxeboot(8) with a GENERIC kernel.

Pointed out by: dwmalone
2004-07-08 22:35:36 +00:00
brian
2821a50eaa Change the following kernel options to environment variables:
BOOTP -> bootp
    BOOTP_NFSROOT -> bootp.nfsroot
    BOOTP_NFSV3 -> bootp.nfsv3
    BOOTP_COMPAT -> bootp.compat
    BOOTP_WIRED_TO -> bootp.wired_to

This lets you PXE boot with a GENERIC kernel by putting this sort of thing
in loader.conf:

    bootp="YES"
    bootp.nfsroot="YES"
    bootp.nfsv3="YES"
    bootp.wired_to="bge1"

or even setting the variables manually from the OK prompt.
2004-07-08 13:40:33 +00:00
des
9d07523073 Push WARNS back up to 6, but define NO_WERROR; I want the warts out in the
open where people can see them and hopefully fix them.
2004-07-06 12:15:24 +00:00
des
93180ebf2d Introduce inline {ip,udp,tcp}_next() functions which take a pointer to an
{ip,udp,tcp} header and return a void * pointing to the payload (i.e. the
first byte past the end of the header and any required padding).  Use them
consistently throughout libalias to a) reduce code duplication, b) improve
code legibility, c) get rid of a bunch of alignment warnings.
2004-07-06 12:13:28 +00:00
des
c05f2ebe92 Rewrite twowords() to access its argument through a char pointer and not
a short pointer.  The previous implementation seems to be in a gray zone
of the C standard, and GCC generates incorrect code for it at -O2 or
higher on some platforms.
2004-07-06 09:22:18 +00:00
des
4e760d4fc8 Temporarily lower WARNS to 3 while I figure out the alignment issues on
alpha.
2004-07-06 08:44:41 +00:00
des
75b8ca2286 Make libalias WARNS?=6-clean. This mostly involves renaming variables
named link, foo_link or link_foo to lnk, foo_lnk or lnk_foo, fixing
signed / unsigned comparisons, and shoving unused function arguments
under the carpet.

I was hoping WARNS?=6 might reveal more serious problems, and perhaps
the source of the -O2 breakage, but found no smoking gun.
2004-07-05 11:10:57 +00:00
des
831b8f89db Parenthesize return values. 2004-07-05 10:55:23 +00:00
des
0518dc3818 Mechanical whitespace cleanup. 2004-07-05 10:53:28 +00:00
phk
112a83894d Add LibAliasOutTry() which checks a packet for a hit in the tables, but
does not create a new entry if none is found.
2004-07-04 12:53:07 +00:00
ru
01548ace15 Mechanically kill hard sentence breaks. 2004-07-02 23:52:20 +00:00
jayanth
657c0f9155 On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly.
Fix this problem by separating out the SACK and the newreno cases. Also, check
if we are in FASTRECOVERY for the sack case and if so, turn off dupacks.

Fix an issue where the congestion window was not being incremented by ssthresh.

Thanks to Mohan Srinivasan for finding this problem.
2004-07-01 23:34:06 +00:00
ru
a0dce18ba8 Bumped document date.
Fixed markup.
Fixed examples to match the new API.
2004-07-01 17:51:48 +00:00
phk
a56b28be2a Rwatson, write 100 times for tomorrow:
First unlock, then assign NULL to pointer.
2004-06-27 21:54:34 +00:00
pjd
537ad587c5 Those are unneeded too. 2004-06-27 09:06:10 +00:00
pjd
5055061c5d Add two missing includes and remove two uneeded.
This is quite serious fix, because even with MAC framework compiled in,
MAC entry points in those two files were simply ignored.
2004-06-27 09:03:22 +00:00
rwatson
758f90deb8 Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
  acquire the socket buffer lock and use the _locked() variants of both
  calls.  Note that the _locked() sowakeup() versions unlock the mutex on
  return.  This is done in uipc_send(), divert_packet(), mroute
  socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
  explicit unlock and use the _locked() sowakeup() variant.  This is done
  in soisdisconnecting(), soisdisconnected() when setting the can't send/
  receive flags and dropping data, and in uipc_rcvd() which adjusting
  back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.
2004-06-26 19:10:39 +00:00
rwatson
f342eda242 Remove spl's from TCP protocol entry points. While not all locking
is merged here yet, this will ease the merge process by bringing the
locked and unlocked versions into sync.
2004-06-26 17:50:50 +00:00
ps
08f94c1d0b White space & spelling fixes
Submitted by:	Xin LI <delphij@frontfree.net>
2004-06-25 04:11:26 +00:00
bms
af8a541101 Whitespace. 2004-06-25 02:29:58 +00:00
rwatson
72e8ca6e16 Broaden scope of the socket buffer lock when processing an ACK so that
the read and write of sb_cc are atomic.  Call sbdrop_locked() instead
of sbdrop() since we already hold the socket buffer lock.
2004-06-24 03:07:27 +00:00
rwatson
60a4c150d3 Protect so_oobmark with with SOCKBUF_LOCK(&so->so_rcv), and broaden
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.

Update socket locking annotations
2004-06-24 02:57:12 +00:00
rwatson
99412c4858 In ip_ctloutput(), acquire the inpcb lock around some of the basic
inpcb flag and status updates.
2004-06-24 02:05:47 +00:00
rwatson
93baf0b01a When asserting non-Giant locks in the network stack, also assert
Giant if debug.mpsafenet=0, as any points that require synchronization
in the SMPng world also required it in the Giant-world:

- inpcb locks (including IPv6)
- inpcbinfo locks (including IPv6)
- dummynet subsystem lock
- ipfw2 subsystem lock
2004-06-24 02:01:48 +00:00
rwatson
caac080ec9 Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted.  sbreserve() now acquires
the lock before calling sbreserve_locked().  In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order.  In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.
2004-06-24 01:37:04 +00:00
ps
a6b0bc7ed0 Move the sack sysctl's under net.inet.tcp.sack
net.inet.tcp.do_sack -> net.inet.tcp.sack.enable
net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit

Requested by:	wollman
2004-06-23 21:34:07 +00:00
ps
f5f3e8600b Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by:	gnn
Obtained from:	Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
2004-06-23 21:04:37 +00:00
rwatson
77bf8d2108 Acquire socket lock around frobbing of socket state in divert sockets. 2004-06-22 04:00:51 +00:00
rwatson
6381db6e63 Prefer use of the inpcb as a MAC label source for outgoing packets sent
via divert sockets, when available.
2004-06-22 03:58:50 +00:00
rwatson
8a0f58ccf0 If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE. 2004-06-20 21:44:50 +00:00
rwatson
6da2beab7f Assert the inpcb lock before letting MAC check whether we can deliver
to the inpcb in tcp_input().
2004-06-20 20:17:29 +00:00
rwatson
c2888023eb IP multicast code no longer needs to acquire Giant before appending
an mbuf onto a socket buffer.  This is left over from debug.mpsafenet
affecting the forwarding/bridging plane only.
2004-06-20 20:10:05 +00:00
rwatson
15ddd25f67 In tcp_ctloutput(), don't hold the inpcb lock over a call to
ip_ctloutput(), as it may need to perform blocking memory allocations.
This also improves consistency with locking relative to other points
that call into ip_ctloutput().

Bumped into by:	Grover Lines <grover@ceribus.net>
2004-06-18 20:22:21 +00:00
bms
66acd3ba6e Check that m->m_pkthdr.rcvif is not NULL before checking if a packet
was received on a broadcast address on the input path. Under certain
circumstances this could result in a panic, notably for locally-generated
packets which do not have m_pkthdr.rcvif set.

This is a similar situation to that which is solved by
src/sys/netinet/ip_icmp.c rev 1.66.

PR:		kern/52935
2004-06-18 12:58:45 +00:00
bms
18926c1c61 Appease GCC. 2004-06-18 09:53:58 +00:00
bms
3163bfb503 If SO_DEBUG is enabled for a TCP socket, and a received segment is
encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer
when registering trace records.  'ipov' is specific to IPv4, and
will therefore be uninitialized.

[This fandango is only necessary in the first place because of our
host-byte-order IP field pessimization.]

PR:		kern/60856
Submitted by:	Galois Zheng
2004-06-18 03:31:07 +00:00
bms
48317d5cbf Don't set FIN on a retransmitted segment after a FIN has been sent,
unless the segment really contains the last of the data for the stream.

PR:		kern/34619
Obtained from:	OpenBSD (tcp_output.c rev 1.47)
Noticed by:	Joseph Ishac
Reviewed by:	George Neville-Neil
2004-06-18 02:47:59 +00:00
bms
f0aeb408c2 Ensure that dst is bzeroed before calling rtalloc_ign(), to avoid possible
routing table corruption.

PR:		kern/40563, freebsd4/432 (KAME)
Obtained from:	NetBSD (in_gif.c rev 1.26.10.1)
Requested by:	Jean-Luc Richier
2004-06-18 02:04:07 +00:00
mlaier
5eba798674 Commit pf version 3.5 and link additional files to the kernel build.
Version 3.5 brings:
 - Atomic commits of ruleset changes (reduce the chance of ending up in an
   inconsistent state).
 - A 30% reduction in the size of state table entries.
 - Source-tracking (limit number of clients and states per client).
 - Sticky-address (the flexibility of round-robin with the benefits of
   source-hash).
 - Significant improvements to interface handling.
 - and many more ...
2004-06-16 23:24:02 +00:00
mlaier
18ff360027 Prepare for pf 3.5 import:
- Remove pflog and pfsync modules. Things will change in such a fashion
   that there will be one module with pf+pflog that can be loaded into
   GENERIC without problems (which is what most people want). pfsync is no
   longer possible as a module.
 - Add multicast address for in-kernel multicast pfsync protocol. Protocol
   glue will follow once the import is done.
 - Add one more mbuf tag
2004-06-16 22:59:06 +00:00
maxim
32bf9d060d o connect(2): if there is no a route to the destination
do not pick up the first local ip address for the source
ip address, return ENETUNREACH instead.

Submitted by:	Gleb Smirnoff
Reviewed by:	-current (silence)
2004-06-16 10:02:36 +00:00
bms
cafb94bcea Fix build for IPSEC && !INET6
PR:		kern/66125
Submitted by:	Cyrille Lefevre
2004-06-16 09:35:07 +00:00
bms
ff08157f93 Reverse a patch which has no effect on -CURRENT and should probably be
applied directly to -STABLE.

Noticed by:	iedowse
Pointy hat to:	bms
2004-06-16 08:50:14 +00:00
bms
075deb0c4c In ip_forward(), when calculating the MTU in effect for an IPSEC transport
mode tunnel, take the per-route MTU into account, *if* and *only if* it
is non-zero (as found in struct rt_metrics/rt_metrics_lite).

PR:		kern/42727
Obtained from:	NetBSD (ip_input.c rev 1.151)
2004-06-16 08:33:09 +00:00
bms
e8e6ce5e86 In ip_forward(), set m->m_pkthdr.len correctly such that the mbuf chain
is sane, and ipsec4_getpolicybyaddr() will therefore complete.

PR:		kern/42727
Obtained from:	KAME (kame/freebsd4/sys/netinet/ip_input.c rev 1.42)
2004-06-16 08:28:54 +00:00
bms
deb499d51d Disconnect a temporarily-connected UDP socket in out-of-mbufs case. This
fixes the problem of UDP sockets getting wedged in a connected state (and
bound to their destination) under heavy load.
Temporary bind/connect should probably be deleted in future
as an optimization, as described in "A Faster UDP" [Partridge/Pink 1993].

Notes:
 - INP_LOCK() is already held in udp_output(). The connection is in effect
   happening at a layer lower than the socket layer, therefore in theory
   socket locking should not be needed.
 - Inlining the in_pcbdisconnect() operation buys us nothing (in the case
   of the current state of the code), as laddr is not part of the
   inpcb hash or the udbinfo hash. Therefore there should be no need
   to rehash after restoring laddr in the error case (this was a
   concern of the original author of the patch).

PR:		kern/41765
Requested by:	gnn
Submitted by:	Jinmei Tatuya (with cleanups)
Tested by:	spray(8)
2004-06-16 05:41:00 +00:00
rwatson
1e2bc9e8f6 Convert GIANT_REQUIRED to NET_ASSERT_GIANT for socket access. 2004-06-16 03:36:06 +00:00
rwatson
029226f3a8 Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field.  This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.
2004-06-15 03:51:44 +00:00
rwatson
f2c0db1521 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
mlaier
977d97b004 Link ALTQ to the build and break with ABI for struct ifnet. Please recompile
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
seperate commit. Same with the driver changes, which need case by case
evaluation.

__FreeBSD_version bump will follow.

Tested-by:	(i386)LINT
2004-06-13 17:29:10 +00:00
dfr
a1fa8042f5 Add a new driver to support IP over firewire. This driver is intended to
conform to the rfc2734 and rfc3146 standard for IP over firewire and
should eventually supercede the fwe driver. Right now the broadcast
channel number is hardwired and we don't support MCAP for multicast
channel allocation - more infrastructure is required in the firewire
code itself to fix these problems.
2004-06-13 10:54:36 +00:00
rwatson
f1bc833e95 Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
  manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
  copy while holding the socket lock, then release the socket lock
  to externalize to userspace.
2004-06-13 02:50:07 +00:00
rwatson
82295697cd Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
csjp
5931514a19 Modify ip fw so that whenever UID or GID constraints exist in a
ruleset, the pcb is looked up once per ipfw_chk() activation.

This is done by extracting the required information out of the PCB
and caching it to the ipfw_chk() stack. This should greatly reduce
PCB looking contention and speed up the processing of UID/GID based
firewall rules (especially with large UID/GID rulesets).

Some very basic benchmarks were taken which compares the number
of in_pcblookup_hash(9) activations to the number of firewall
rules containing UID/GID based contraints before and after this patch.

The results can be viewed here:
o http://people.freebsd.org/~csjp/ip_fw_pcb.png

Reviewed by:	andre, luigi, rwatson
Approved by:	bmilekic (mentor)
2004-06-11 22:17:14 +00:00
rwatson
e7bc2202da Remove unneeded Giant acquisition in divert_packet(), which is
left over from debug.mpsafenet affecting only the forwarding
plane.  Giant is now acquired in the ithread/netisr or in the
system call code.
2004-06-11 04:06:51 +00:00
rwatson
71dd84f2b8 Lock down parallel router_info list for tracking multicast IGMP
versions of various routers seen:

- Introduce igmp_mtx.
- Protect global variable 'router_info_head' and list fields
  in struct router_info with this mutex, as well as
  igmp_timers_are_running.
- find_rti() asserts that the caller acquires igmp_mtx.
- Annotate a failure to check the return value of
  MALLOC(..., M_NOWAIT).
2004-06-11 03:42:37 +00:00
ru
30a30620ee init_tables() must be run after sys/net/route.c:route_init(). 2004-06-10 20:20:37 +00:00
ru
27bed143c8 Introduce a new feature to IPFW2: lookup tables. These are useful
for handling large sparse address sets.  Initial implementation by
Vsevolod Lobko <seva@ip.net.ua>, refined by me.

MFC after:	1 week
2004-06-09 20:10:38 +00:00
ume
4ef088056e do not send icmp response if the original packet is encrypted.
Obtained from:	KAME
MFC after:	1 week
2004-06-07 09:56:59 +00:00
bmilekic
dcf58b3319 Move the locking of the pcb into raw_output(). Organize code so
that m_prepend() is not called with possibility to wait while the
pcb lock is held.  What still needs revisiting is whether the
ripcbinfo lock is really required here.

Discussed with: rwatson
2004-06-03 03:15:29 +00:00
phk
f43aa0c4bc add missing #include <sys/module.h> 2004-05-30 20:27:19 +00:00
phk
d6f7d2bde6 Add some missing <sys/module.h> includes which are masked by the
one on death-row in <sys/kernel.h>
2004-05-30 17:57:46 +00:00
csjp
3006bc70da Add a super-user check to ipfw_ctl() to make sure that the calling
process is a non-prison root. The security.jail.allow_raw_sockets
sysctl variable is disabled by default, however if the user enables
raw sockets in prisons, prison-root should not be able to interact
with firewall rule sets.

Approved by:	rwatson, bmilekic (mentor)
2004-05-25 15:02:12 +00:00
yar
45f0ba1547 When checking for possible port theft, skip over a TCP inpcb
unless it's in the closed or listening state (remote address
== INADDR_ANY).

If a TCP inpcb is in any other state, it's impossible to steal
its local port or use it for port theft.  And if there are
both closed/listening and connected TCP inpcbs on the same
localIP:port couple, the call to in_pcblookup_local() will
find the former due to the design of that function.

No objections raised in:	-net, -arch
MFC after:			1 month
2004-05-20 06:35:02 +00:00
maxim
a4dd24e359 o Calculate a number of bytes to copy (cnt) correctly:
+----+-+-+-+-+----+----+- - - - - - - - - - - -  -+----+
  |    | |C| | |    |    |                          |    |
  | IP |N|O|L|P|    | IP |                          | IP |
  | #1 |O|D|E|T|    | #2 |                          | #n |
  |    |P|E|N|R|    |    |                          |    |
  +----+-+-+-+-+----+----+- - - - - - - - - - - -  -+----+
               ^    ^<---- cnt - (IPOPT_MINOFF - 1) ---->|
               |    |
src            |    +-- cp[IPOPT_OFF + 1] + sizeof(struct in_addr)
               |
dst            +-- cp[IPOPT_OFF + 1]

PR:		kern/66386
Submitted by:	Andrei Iltchenko
MFC after:	3 weeks
2004-05-11 19:14:44 +00:00
maxim
5839c11830 o IFNAMSIZ does include the trailing \0.
Approved by:	andre

o Document net.inet.icmp.reply_src.
2004-05-07 01:24:53 +00:00
andre
832d1bd181 Provide the sysctl net.inet.ip.process_options to control the processing
of IP options.

 net.inet.ip.process_options=0  Ignore IP options and pass packets unmodified.
 net.inet.ip.process_options=1  Process all IP options (default).
 net.inet.ip.process_options=2  Reject all packets with IP options with ICMP
  filter prohibited message.

This sysctl affects packets destined for the local host as well as those
only transiting through the host (routing).

IP options do not have any legitimate purpose anymore and are only used
to circumvent firewalls or to exploit certain behaviours or bugs in TCP/IP
stacks.

Reviewed by:	sam (mentor)
2004-05-06 18:46:03 +00:00
rwatson
ff404935e2 Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4.  This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed.  In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 02:11:47 +00:00
rwatson
e15e5d4977 Assert inpcb lock in udp_append().
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 01:08:15 +00:00
rwatson
2f2ce5b406 Assert the inpcb lock on 'last' in udp_append(), since it's always
called with it, and also requires it.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 00:10:16 +00:00
maxim
96efdc3250 o Fix misindentation in the previous commit. 2004-05-03 17:15:34 +00:00
andre
d18bda0b0d Back out a change that slipped into the previous commit for which other
supporting parts have not yet been committed.

Remove pre-mature IP options ignoring option.
2004-05-03 16:07:13 +00:00
andre
7e338f7bd0 Optimize IP fastforwarding some more:
o New function ip_findroute() to reduce code duplication for the
  route lookup cases. (luigi)

o Store ip_len in host byte order on the stack instead of using
  it via indirection from the mbuf.  This allows to defer the host
  byte conversion to a later point and makes a quicker fallback to
  normal ip_input() processing. (luigi)

o Check if route is dampned with RTF_REJECT flag and drop packet
  already here when ARP is unable to resolve destination address.
  An ICMP unreachable is sent to inform the sender.

o Check if interface output queue is full and drop packet already
  here.  No ICMP notification is sent because signalling source quench
  is depreciated.

o Check if media_state is down (used for ethernet type interfaces)
  and drop the packet already here.  An ICMP unreachable is sent to
  inform the sender.

o Do not account sent packets to the interface address counters.  They
  are only for packets with that 'ia' as source address.

o Update and clarify some comments.

Submitted by:	luigi (most of it)
2004-05-03 13:52:47 +00:00
darrenr
a50f040209 Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier. 2004-05-02 15:10:17 +00:00
darrenr
4e8ce3156b oops, I forgot this file in a prior commit (change was still sitting here,
uncommitted):

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h
alongside all of the other macros that work ok mbuf's and tag's.
2004-05-02 15:07:37 +00:00
darrenr
8f62dbebe1 Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside
all of the other macros that work ok mbuf's and tag's.
2004-05-02 06:36:30 +00:00
bmilekic
6bbcc9da29 Give jail(8) the feature to allow raw sockets from within a
jail, which is less restrictive but allows for more flexible
jail usage (for those who are willing to make the sacrifice).
The default is off, but allowing raw sockets within jails can
now be accomplished by tuning security.jail.allow_raw_sockets
to 1.

Turning this on will allow you to use things like ping(8)
or traceroute(8) from within a jail.

The patch being committed is not identical to the patch
in the PR.  The committed version is more friendly to
APIs which pjd is working on, so it should integrate
into his work quite nicely.  This change has also been
presented and addressed on the freebsd-hackers mailing
list.

Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
PR: kern/65800
2004-04-26 19:46:52 +00:00
silby
051b00be73 Tighten up reset handling in order to make reset attacks as difficult as
possible while maintaining compatibility with the widest range of TCP stacks.

The algorithm is as follows:

---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.

For connections in all other states, a reset anywhere in the window
will cause the connection to be reset.  All other segments will be
silently dropped.
---

The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.

Idea by:        Darren Reed
Reviewed by:    truckman, jlemon, others(?)
2004-04-26 02:56:31 +00:00
luigi
53bd42643d Another small set of changes to reduce diffs with the new arp code. 2004-04-25 15:00:17 +00:00
luigi
93066eb95b remove a stale comment on the behaviour of arpresolve 2004-04-25 14:06:23 +00:00
luigi
131ad9c351 Start the arp timer at init time.
It runs so rarely that it makes no sense to wait until the first request.
2004-04-25 12:50:14 +00:00
luigi
59063f7a08 This commit does two things:
1. rt_check() cleanup:
    rt_check() is only necessary for some address families to gain access
    to the corresponding arp entry, so call it only in/near the *resolve()
    routines where it is actually used -- at the moment this is
    arpresolve(), nd6_storelladdr() (the call is embedded here),
    and atmresolve() (the call is just before atmresolve to reduce
    the number of changes).
    This change will make it a lot easier to decouple the arp table
    from the routing table.

    There is an extra call to rt_check() in if_iso88025subr.c to
    determine the routing info length. I have left it alone for
    the time being.

    The interface of arpresolve() and nd6_storelladdr() now changes slightly:
     + the 'rtentry' parameter (really a hint from the upper level layer)
       is now passed unchanged from *_output(), so it becomes the route
       to the final destination and not to the gateway.
     + the routines will return 0 if resolution is possible, non-zero
       otherwise.
     + arpresolve() returns EWOULDBLOCK in case the mbuf is being held
       waiting for an arp reply -- in this case the error code is masked
       in the caller so the upper layer protocol will not see a failure.

2. arpcom untangling
    Where possible, use 'struct ifnet' instead of 'struct arpcom' variables,
    and use the IFP2AC macro to access arpcom fields.
    This mostly affects the netatalk code.

=== Detailed changes: ===
net/if_arcsubr.c
   rt_check() cleanup, remove a useless variable

net/if_atmsubr.c
   rt_check() cleanup

net/if_ethersubr.c
   rt_check() cleanup, arpcom untangling

net/if_fddisubr.c
   rt_check() cleanup, arpcom untangling

net/if_iso88025subr.c
   rt_check() cleanup

netatalk/aarp.c
   arpcom untangling, remove a block of duplicated code

netatalk/at_extern.h
   arpcom untangling

netinet/if_ether.c
   rt_check() cleanup (change arpresolve)

netinet6/nd6.c
   rt_check() cleanup (change nd6_storelladdr)
2004-04-25 09:24:52 +00:00
silby
67d79f0cfb Wrap two long lines in the previous commit. 2004-04-23 23:29:49 +00:00
andre
7357a88fdb Correct an edge case in tcp_mss() where the cached path MTU
from tcp_hostcache would have overridden a (now) lower MTU of
an interface or route that changed since first PMTU discovery.
The bug would have caused TCP to redo the PMTU discovery when
not strictly necessary.

Make a comment about already pre-initialized default values
more clear.

Reviewed by:	sam
2004-04-23 22:44:59 +00:00
andre
d4f49f008f Add the option versrcreach to verify that a valid route to the
source address of a packet exists in the routing table.  The
default route is ignored because it would match everything and
render the check pointless.

This option is very useful for routers with a complete view of
the Internet (BGP) in the routing table to reject packets with
spoofed or unrouteable source addresses.

Example:

 ipfw add 1000 deny ip from any to any not versrcreach

also known in Cisco-speak as:

  ip verify unicast source reachable-via any

Reviewed by:	luigi
2004-04-23 14:28:38 +00:00
andre
e8723e5528 Fix a potential race when purging expired hostcache entries.
Spotted by:	luigi
2004-04-23 13:54:28 +00:00
silby
b4c0a798a2 Take out an unneeded variable I forgot to remove in the last commit,
and make two small whitespace fixes so that diffs vs rev 1.142 are minimal.
2004-04-22 08:34:55 +00:00
silby
760a7deec6 Simplify random port allocation, and add net.inet.ip.portrange.randomized,
which can be used to turn off randomized port allocation if so desired.

Requested by:	alfred
2004-04-22 08:32:14 +00:00
bms
8fb0962eb0 Fix a typo in a comment. 2004-04-20 19:04:24 +00:00
silby
f0d28bbf0c Switch from using sequential to random ephemeral port allocation,
implementation taken directly from OpenBSD.

I've resisted committing this for quite some time because of concern over
TIME_WAIT recycling breakage (sequential allocation ensures that there is a
long time before ports are recycled), but recent testing has shown me that
my fears were unwarranted.
2004-04-20 06:45:10 +00:00
silby
743d110741 Enhance our RFC1948 implementation to perform better in some pathlogical
TIME_WAIT recycling cases I was able to generate with http testing tools.

In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number.  As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.

I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.

Except in such contrived benchmarking situations, this problem should never
come up in normal usage...  until networks get faster.

No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.
2004-04-20 06:33:39 +00:00
luigi
eb6b02962a Replace Bcopy with 'the real thing' as in the rest of the file. 2004-04-18 11:45:49 +00:00
luigi
94049d0810 In an effort to simplify the routing code, try to deprecate rtalloc()
in favour of rtalloc_ign(), which is what would end up being called
anyways.

There are 25 more instances of rtalloc() in net*/ and
about 10 instances of rtalloc_ign()
2004-04-14 01:13:14 +00:00
imp
b49b7fe799 Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson
2004-04-07 20:46:16 +00:00
ru
a6980b04fc Fixed a bug in previous revision: compute the payload checksum before
we convert ip_len into a network byte order; in_delayed_cksum() still
expects it in host byte order.

The symtom was the ``in_cksum_skip: out of data by %d'' complaints
from the kernel.

To add to the previous commit log.  These fixes make tcpdump(1) happy
by not complaining about UDP/TCP checksum being bad for looped back
IP multicast when multicast router is deactivated.

Reported by:	Vsevolod Lobko
2004-04-07 10:01:39 +00:00
bde
014c3b4511 Fixed misspelling of IPPORT_MAX as USHRT_MAX. Don't include <sys/limits.h>
to implement this mistake.

Fixed some nearby style bugs (initialization in declaration, misformatting
of this initialization, missing blank line after the declaration, and
comparision of the non-boolean result of the initialization with 0 using
"!".  In KNF, "!" is not even used to compare booleans with 0).
2004-04-06 10:59:11 +00:00
rwatson
e391576e63 Two missed in previous commit -- compare pointer with NULL rather than
using it as a boolean.
2004-04-05 00:52:05 +00:00
rwatson
bdbf43dbba Prefer NULL to 0 when checking pointer values as integers or booleans. 2004-04-05 00:49:07 +00:00
pjd
2e6142d4d9 Fix a panic possibility caused by returning without releasing locks.
It was fixed by moving problemetic checks, as well as checks that
doesn't need locking before locks are acquired.

Submitted by:		Ryan Sommers <ryans@gamersimpact.com>
In co-operation with:	cperciva, maxim, mlaier, sam
Tested by:		submitter (previous patch), me (current patch)
Reviewed by:		cperciva, mlaier (previous patch), sam (current patch)
Approved by:		sam
Dedicated to:		enough!
2004-04-04 20:14:55 +00:00
luigi
c54de1f76f + arpresolve(): remove an unused argument
+ struct ifnet: remove unused fields, move ipv6-related field close
  to each other, add a pointer to l3<->l2 translation tables (arp,nd6,
  etc.) for future use.

+ struct route: remove an unused field, move close to each
  other some fields that might likely go away in the future
2004-04-04 06:14:55 +00:00
deischen
a883ccc958 Unbreak natd.
Reported and submitted by:	Sean McNeil (sean at mcneil.com)
2004-04-02 17:57:57 +00:00
des
38842c29ce Raise WARNS level to 2. 2004-03-31 21:33:55 +00:00
des
2209468b0e Deal with aliasing warnings.
Reviewed by:	ru
Approved by:	silence on the lists
2004-03-31 21:32:58 +00:00
rwatson
8525af93ba Invert the logic of NET_LOCK_GIANT(), and remove the one reference to it.
Previously, Giant would be grabbed at entry to the IP local delivery code
when debug.mpsafenet was set to true, as that implied Giant wouldn't be
grabbed in the driver path.  Now, we will use this primitive to
conditionally grab Giant in the event the entire network stack isn't
running MPSAFE (debug.mpsafenet == 0).
2004-03-28 23:12:19 +00:00
pjd
c83007d58a Remove unused argument. 2004-03-28 15:48:00 +00:00
pjd
49554d1bd8 Reduce 'td' argument to 'cred' (struct ucred) argument in those functions:
- in_pcbbind(),
	- in_pcbbind_setup(),
	- in_pcbconnect(),
	- in_pcbconnect_setup(),
	- in6_pcbbind(),
	- in6_pcbconnect(),
	- in6_pcbsetport().
"It should simplify/clarify things a great deal." --rwatson

Requested by:	rwatson
Reviewed by:	rwatson, ume
2004-03-27 21:05:46 +00:00
pjd
02bc133779 Remove unused argument.
Reviewed by:	ume
2004-03-27 20:41:32 +00:00
ume
11f479f519 Validate IPv6 socket options more carefully to avoid a panic.
PR:		kern/61513
Reviewed by:	cperciva, nectar
2004-03-26 19:52:18 +00:00
pjd
89f5b6c374 Remove unused function.
It was used in FreeBSD 4.x, but now we're using cr_canseesocket().
2004-03-25 15:12:12 +00:00
ru
869b51c6d0 Untangle IP multicast routing interaction with delayed payload checksums.
Compute the payload checksum for a locally originated IP multicast where
God intended, in ip_mloopback(), rather than doing it in ip_output() and
only when multicast router is active.  This is more correct as we do not
fool ip_input() that the packet has the correct payload checksum when in
fact it does not (when multicast router is inactive).  This is also more
efficient if we don't join the multicast group we send to, thus allowing
the hardware to checksum the payload.
2004-03-25 08:46:27 +00:00
rwatson
aaf338640e Lock down global variables in if_gre:
- Add gre_mtx to protect global softc list.
- Hold gre_mtx over various list operations (insert, delete).
- Centralize if_gre interface teardown in gre_destroy(), and call this
  from modevent unload and gre_clone_destroy().
- Export gre_mtx to ip_gre.c, which walks the gre list to look up gre
  interfaces during encapsulation.  Add a wonking comment on how we need
  some sort of drain/reference count mechanism to keep gre references
  alive while in use and simultaneous destroy.

This commit does not lockdown softc data, which follows in a future
commit.
2004-03-22 16:04:43 +00:00
mdodd
97af430ce1 - Fix indentation lost by 'diff -b'.
- Un-wrap short line.
2004-03-21 18:51:26 +00:00
mdodd
3092080960 Remove interface type specific code from arprequest(), and in_arpinput().
The AF_ARP case in the (*if_output)() routine will handle the interface type
specific bits.

Obtained from:	NetBSD
2004-03-21 06:36:05 +00:00
des
3cb81148d8 Run through indent(1) so I can read the code without getting a headache.
The result isn't quite knf, but it's knfer than the original, and far
more consistent.
2004-03-16 21:30:41 +00:00
mdodd
90eb7f448e De-register. 2004-03-14 00:44:11 +00:00
rwatson
fea945fb73 Lock down IP-layer encapsulation library:
- Add encapmtx to protect ip_encap.c global variables (encapsulation
   list).
 - Unifdef #ifdef 0 pieces of encap_init() which was (and now really
   is) basically a no-op.
 - Lock encapmtx when walking encaptab, modifying it, comparing
   entries, etc.
 - Remove spl's.

Note that currently there's no facilite to make sure outstanding
use of encapsulation methods on a table entry have drained bfore
we allow a table entry to be removed.  As such, it's currently the
caller's responsibility to make sure that draining takes place.

Reviewed by:	mlaier
2004-03-10 02:48:50 +00:00
rwatson
7866f6e355 Scrub unused variable zeroin_addr. 2004-03-10 01:01:04 +00:00
hsu
01e90a548b To comply with the spec, do not copy the TOS from the outer IP
header to the inner IP header of the PIM Register if this is a PIM
Null-Register message.

Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2004-03-08 07:47:27 +00:00
hsu
ef9b350961 Include <sys/types.h> for autoconf/automake detection.
Submitted by:	Pavlin Radoslavov <pavlin@icir.org>
2004-03-08 07:45:32 +00:00
mlaier
206b5dcf52 Add some missing DUMMYNET_UNLOCK() in config_pipe().
Noticed by:	Simon Coggins
Approved by:	bms(mentor)
2004-03-03 01:33:22 +00:00
mlaier
a32329ae33 Two minor follow-ups on the MT_TAG removal:
ifp is now passed explicitly to ether_demux; no need to look it up again.
Make mtag a global var in ip_input.

Noticed by:	rwatson
Approved by:	bms(mentor)
2004-03-02 14:37:23 +00:00
rwatson
8da3c338df Rename NET_PICKUP_GIANT() to NET_LOCK_GIANT(), and NET_DROP_GIANT()
to NET_UNLOCK_GIANT().  While they are used in similar ways, the
semantics are quite different -- NET_LOCK_GIANT() and NET_UNLOCK_GIANT()
directly wrap mutex lock and unlock operations, whereas drop/pickup
special case the handling of Giant recursion.  Add a comment saying
as much.

Add NET_ASSERT_GIANT(), which conditionally asserts Giant based
on the value of debug_mpsafenet.
2004-03-01 22:37:01 +00:00
ume
178eaf92b9 fix -O0 compilation without INET6.
Pointed out by:	ru
2004-03-01 19:10:31 +00:00
rwatson
2336a6f7fe Remove unneeded {} originally used to hold local variables for dummynet
in a code block, as the variable is now gone.

Submitted by:	sam
2004-02-28 19:50:43 +00:00
rwatson
4bcda84b60 Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by:	sam
2004-02-28 15:12:20 +00:00
mlaier
d937176b34 Bring eventhandler callbacks for pf.
This enables pf to track dynamic address changes on interfaces (dailup) with
the "on (<ifname>)"-syntax. This also brings hooks in anticipation of
tracking cloned interfaces, which will be in future versions of pf.

Approved by: bms(mentor)
2004-02-26 04:27:55 +00:00
mlaier
428f1c9a0f Tweak existing header and other build infrastructure to be able to build
pf/pflog/pfsync as modules. Do not list them in NOTES or modules/Makefile
(i.e. do not connect it to any (automatic) builds - yet).

Approved by: bms(mentor)
2004-02-26 03:53:54 +00:00
truckman
1de257deb3 Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire().  Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails.  Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by:	bms
2004-02-26 00:27:04 +00:00
mlaier
1504165dce Re-remove MT_TAGs. The problems with dummynet have been fixed now.
Tested by: -current, bms(mentor), me
Approved by: bms(mentor), sam
2004-02-25 19:55:29 +00:00
bde
232b17fc86 Fixed namespace pollution in rev.1.74. Implementation of the syncache
increased <netinet/tcp_var>'s already large set of prerequisites, and
this was handled badly.  Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.

Approved by:	jlemon (years ago, for a more invasive fix)
2004-02-25 13:03:01 +00:00
bde
50325f4c96 Don't use the negatively-opaque type uma_zone_t or be chummy with
<vm/uma.h>'s idempotency indentifier or its misspelling.
2004-02-25 11:53:19 +00:00
hsu
bb0d027044 Relax a KASSERT condition to allow for a valid corner case where
the FIN on the last segment consumes an extra sequence number.

Spurious panic reported by Mike Silbersack <silby@silby.com>.
2004-02-25 08:53:17 +00:00
andre
5ef70fe223 Convert the tcp segment reassembly queue to UMA and limit the maximum
amount of segments it will hold.

The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:

 net.inet.tcp.reass.maxsegments (loader tuneable)
  specifies the maximum number of segments all tcp reassemly queues can
  hold (defaults to 1/16 of nmbclusters).

 net.inet.tcp.reass.maxqlen
  specifies the maximum number of segments any individual tcp session queue
  can hold (defaults to 48).

 net.inet.tcp.reass.cursegments (readonly)
  counts the number of segments currently in all reassembly queues.

 net.inet.tcp.reass.overflows (readonly)
  counts how often either the global or local queue limit has been reached.

Tested by:	bms, silby
Reviewed by:	bms, silby
2004-02-24 15:27:41 +00:00
pjd
806650a364 Fixed ucred structure leak.
Approved by:	scottl (mentor)
PR:		54163
MFC after:	3 days
2004-02-19 14:13:21 +00:00
mlaier
60723c3260 Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is
not working properly with the patch in place.

Approved by: bms(mentor)
2004-02-18 00:04:52 +00:00
ume
92aaace604 IPSEC and FAST_IPSEC have the same internal API now;
so merge these (IPSEC has an extra ipsecstat)

Submitted by:	"Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>
2004-02-17 14:02:37 +00:00
bms
2b958c2272 Shorten the name of the socket option used to enable TCP-MD5 packet
treatment.

Submitted by:	Vincent Jardin
2004-02-16 22:21:16 +00:00
ume
a9d87abe7e don't update outgoing ifp, if ipsec tunnel mode encapsulation
was not made.

Obtained from:	KAME
2004-02-16 17:05:06 +00:00
bms
5770e21ae7 Spell types consistently throughout this file. Do not use the __packed attribute, as we are often #include'd from userland without <sys/cdefs.h> in front of us, and it is not strictly necessary.
Noticed by:	Sascha Blank
2004-02-16 14:40:56 +00:00
bms
1b4c430a4d Final brucification pass. Spell types consistently (u_int). Remove bogus
casts. Remove unnecessary parenthesis.

Submitted by:	bde
2004-02-14 21:49:48 +00:00
mlaier
0f7d176710 Do not expose ip_dn_find_rule inline function to userland and unbreak world.
----------------------------------------------------------------------
2004-02-13 22:26:36 +00:00
mlaier
9edab9a14a Do not check receive interface when pfil(9) hook changed address.
Approved by: bms(mentor)
2004-02-13 19:20:43 +00:00
mlaier
da4d773b12 This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).

This is (mostly) work from: sam

Silence from: -arch
Approved by: bms(mentor), sam, rwatson
2004-02-13 19:14:16 +00:00
bms
09ad0862e6 Brucification.
Submitted by:	bde
2004-02-13 18:21:45 +00:00
ume
f35565e63f supported IPV6_RECVPATHMTU socket option.
Obtained from:	KAME
2004-02-13 14:50:01 +00:00
bms
afe7ed20e4 Update the prototype for tcpsignature_apply() to reflect the spelling of
the types used by m_apply()'s callback function, f, as documented in mbuf(9).

Noticed by:	njl
2004-02-12 20:16:09 +00:00
bms
7c4d7ecee0 style(9) pass; whitespace and comments.
Submitted by:	njl
2004-02-12 20:12:48 +00:00
bms
5212ec56c9 Remove an unnecessary initialization that crept in from the code which
verifies TCP-MD5 digests.

Noticed by:	njl
2004-02-12 20:08:28 +00:00
bms
d2298f1b5c Fix a typo; left out preprocessor conditional for sigoff variable, which
is only used by TCP_SIGNATURE code.

Noticed by:	Roop Nanuwa
2004-02-11 09:46:54 +00:00
bms
903cdeea1a Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by:	sentex.net
2004-02-11 04:26:04 +00:00
ume
de3407d028 pass pcb rather than so. it is expected that per socket policy
works again.
2004-02-03 18:20:55 +00:00
andre
92b93ba391 Add sysctl net.inet.icmp.reply_src to specify the interface name
used for the ICMP reply source in reponse to packets which are not
directly addressed to us.  By default continue with with normal
source selection.

Reviewed by:	bms
2004-02-02 22:53:16 +00:00
andre
b302b73b2f More verbose description of the source ip address selection for ICMP replies.
Reviewed by:	bms
2004-02-02 22:17:09 +00:00
phk
35592de77b Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead.  Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
2004-01-31 10:40:25 +00:00
sobomax
2b5008cb00 Remove NetBSD'isms (add FreeBSD'isms?), which makes gre(4) working again. 2004-01-30 09:03:01 +00:00
ru
a50969358f Correct the descriptions of the net.inet.{udp,raw}.recvspace sysctls. 2004-01-27 22:17:39 +00:00
sobomax
7029665c48 Add support for WCCPv2. It should be enablem manually using link2
ifconfig(8) flag since header for version 2 is the same but IP payload
is prepended with additional 4-bytes field.

Inspired by:	Roman Synyuk <roman@univ.kiev.ua>
MFC after:	2 weeks
2004-01-26 12:33:56 +00:00
sobomax
6d77c2d2d2 (whilespace-only)
Kill trailing spaces.
2004-01-26 12:21:59 +00:00
andre
df5be91d31 Remove leftover FREE() from changes in rev 1.50.
Noticed by:	Jun Kuriyama <kuriyama@imgsrc.co.jp>
2004-01-23 01:39:12 +00:00
andre
b9c5f2c5c5 Split the overloaded variable 'win' into two for their specific purposes:
recwin and sendwin.  This removes a big source of confusion and makes
following the code much easier.

Reviewed by:	sam (mentor)
Obtained from:	DragonFlyBSD rev 1.6 (hsu)
2004-01-22 23:22:14 +00:00
andre
b6aedfab99 Move the reduction by one of the syncache limit after the zone has been
allocated.

Reviewed by:    sam (mentor)
Obtained from:  DragonFlyBSD rev 1.6 (hsu)
2004-01-22 23:14:48 +00:00
andre
e2c27942df Remove an unused variable and put the sockaddr_in6 onto the stack instead
of malloc'ing it.

Reviewed by:	sam (mentor)
Obtained from:	DragonFlyBSD rev 1.6 (hsu)
2004-01-22 23:10:11 +00:00
hsu
00b5b7f8ec Merge from DragonFlyBSD rev 1.10:
date: 2003/09/02 10:04:47;  author: hsu;  state: Exp;  lines: +5 -6
Account for when Limited Transmit is not congestion window limited.

Obtained from:	DragonFlyBSD
2004-01-20 21:40:25 +00:00
phk
7948e91c15 Mostly mechanical rework of libalias:
Makes it possible to have multiple packet aliasing instances in a
single process by moving all static and global variables into an
instance structure called "struct libalias".

Redefine a new API based on s/PacketAlias/LibAlias/g

Add new "instance" argument to all functions in the new API.

Implement old API in terms of the new API.
2004-01-17 10:52:21 +00:00
ume
703de5ccfd do not deref freed pointer
Submitted by:	"Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>
Reviewed by:	itojun
2004-01-13 09:51:47 +00:00
andre
dc7ce45d31 Disable the minmssoverload connection drop by default until the detection
logic is refined.
2004-01-12 15:46:04 +00:00
truckman
3481adf426 Check that sa_len is the appropriate value in tcp_usr_bind(),
tcp6_usr_bind(), tcp_usr_connect(), and tcp6_usr_connect() before checking
to see whether the address is multicast so that the proper errno value
will be returned if sa_len is incorrect.  The checks are identical to the
ones in in_pcbbind_setup(), in6_pcbbind(), and in6_pcbladdr(), which are
called after the multicast address check passes.

MFC after:	30 days
2004-01-10 08:53:00 +00:00
andre
3dbc1a9d87 Reduce TCP_MINMSS default to 216. The AX.25 protocol (packet radio)
is frequently used with an MTU of 256 because of slow speeds and a
high packet loss rate.
2004-01-09 14:14:10 +00:00
andre
491421126e Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU.  This is done
dynamically based on feedback from the remote host and network
components along the packet path.  This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

 o during tcp connection setup the advertized local MSS is
   exchanged between the endpoints.  The remote endpoint can
   set this arbitrarily low (except for a minimum MTU of 64
   octets enforced in the BSD code).  When the local host is
   sending data it is forced to send many small IP packets
   instead of a large one.

   For example instead of the normal TCP payload size of 1448
   it forces TCP payload size of 12 (MTU 64) and thus we have
   a 120 times increase in workload and packets. On fast links
   this quickly saturates the local CPU and may also hit pps
   processing limites of network components along the path.

   This type of attack is particularly effective for servers
   where the attacker can download large files (WWW and FTP).

   We mitigate it by enforcing a minimum MTU settable by sysctl
   net.inet.tcp.minmss defaulting to 256 octets.

 o the local host is reveiving data on a TCP connection from
   the remote host.  The local host has no control over the
   packet size the remote host is sending.  The remote host
   may chose to do what is described in the first attack and
   send the data in packets with an TCP payload of at least
   one byte.  For each packet the tcp_input() function will
   be entered, the packet is processed and a sowakeup() is
   signalled to the connected process.

   For example an attack with 2 Mbit/s gives 4716 packets per
   second and the same amount of sowakeup()s to the process
   (and context switches).

   This type of attack is particularly effective for servers
   where the attacker can upload large amounts of data.
   Normally this is the case with WWW server where large POSTs
   can be made.

   We mitigate this by calculating the average MSS payload per
   second.  If it goes below 'net.inet.tcp.minmss' and the pps
   rate is above 'net.inet.tcp.minmssoverload' defaulting to
   1000 this particular TCP connection is resetted and dropped.

MITRE CVE:	CAN-2004-0002
Reviewed by:	sam (mentor)
MFC after:	1 day
2004-01-08 17:40:07 +00:00
andre
09dcc2c21c If path mtu discovery is enabled set the DF bit in all cases we
send packets on a tcp connection.

PR:		kern/60889
Tested by:	Richard Wendland <richard@wendland.org.uk>
Approved by:	re (scottl)
2004-01-08 11:17:11 +00:00
andre
e694e9332e Do not set the ip_id to zero when DF is set on packet and
restore the general pre-randomid behaviour.

Setting the ip_id to zero causes several problems with
packet reassembly when a device along the path removes
the DF bit for some reason.

Other BSD and Linux have found and fixed the same issues.

PR:		kern/60889
Tested by:	Richard Wendland <richard@wendland.org.uk>
Approved by:	re (scottl)
2004-01-08 11:13:40 +00:00
andre
f6253c9b05 Enable the following TCP options by default to give it more exposure:
rfc3042  Limited retransmit
 rfc3390  Increasing TCP's initial congestion Window
 inflight TCP inflight bandwidth limiting

All my production server have it enabled and there have been no
issues.  I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by:	sam (mentor)
2004-01-06 23:29:46 +00:00
andre
f14c2fc588 According to RFC1812 we have to ignore ICMP redirects when we
are acting as router (ipforwarding enabled).

This doesn't fix the problem that host routes from ICMP redirects
are never removed from the kernel routing table but removes the
problem for machines doing packet forwarding.

Reviewed by:	sam (mentor)
2004-01-06 23:20:07 +00:00
ru
8dbbb519f5 Document the net.inet.ip.subnets_are_local sysctl. 2003-12-30 16:05:03 +00:00
sobomax
96159ed234 Sync with NetBSD:
if_gre.c rev.1.41-1.49

 o Spell output with two ts.
 o Remove assigned-to but not used variable.
 o fix grammatical error in a diagnostic message.
 o u_short -> u_int16_t.
 o gi_len is ip_len, so it has to be network byteorder.

if_gre.h rev.1.11-1.13

 o prototype must not have variable name.
 o u_short -> u_int16_t.
 o Spell address with two d's.

ip_gre.c rev.1.22-1.29

 o KNF - return is not a function.
 o The "osrc" variable in gre_mobile_input() is only ever set but not
   referenced; remove it.
 o correct (false) assumptions on mbuf chain.  not sure if it really helps, but
   anyways, it is necessary to perform m_pullup.
 o correct arg to m_pullup (need to count IP header size as well).
 o remove redundant adjustment of m->m_pkthdr.len.
 o clear m_flags just for safety.
 o tabify.
 o u_short -> u_int16_t.

MFC after:	2 weeks
2003-12-30 11:41:43 +00:00
sam
c165a87f8d o eliminate widespread on-stack mbuf use for bpf by introducing
a new bpf_mtap2 routine that does the right thing for an mbuf
  and a variable-length chunk of data that should be prepended.
o while we're sweeping the drivers, use u_int32_t uniformly when
  when prepending the address family (several places were assuming
  sizeof(int) was 4)
o return M_ASSERTVALID to BPF_MTAP* now that all stack-allocated
  mbufs have been eliminated; this may better be moved to the bpf
  routines

Reviewed by:	arch@ and several others
2003-12-28 03:56:00 +00:00
maxim
33621bd9ec o Fix a comment: softticks lives in sys/kern/kern_timeout.c.
PR:		kern/60613
Submitted by:	Gleb Smirnoff
MFC after:	3 days
2003-12-27 14:08:53 +00:00
ume
ad7968509b NULL is not 0.
Submitted by:	"Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>
2003-12-24 18:22:04 +00:00
ru
fc90128bf7 I didn't notice it right away, but check the right length too. 2003-12-23 14:08:50 +00:00
ru
0b3fdbfa52 Fix a problem introduced in revision 1.84: m_pullup() does not
necessarily return the same mbuf chain so we need to recompute
mtod() consumers after pulling up.
2003-12-23 13:33:23 +00:00
peter
7cc77e03ae Catch a few places where NULL (pointer) was used where 0 (integer) was
expected.
2003-12-23 02:36:43 +00:00
sam
2a76f36902 o move mutex init/destroy logic to the module load/unload hooks;
otherwise they are initialized twice when the code is statically
  configured in the kernel because the module load method gets
  invoked before the user application calls ip_mrouter_init
o add a mutex to synchronize the module init/done operations; this
  sort of was done using the value of ip_mroute but X_ip_mrouter_done
  sets it to NULL very early on which can lead to a race against
  ip_mrouter_init--using the additional mutex means this is safe now
o don't call ip_mrouter_reset from ip_mrouter_init; this now happens
  once at module load and X_ip_mrouter_done does the appropriate
  cleanup work to insure the data structures are in a consistent
  state so that a subsequent init operation inherits good state

Reviewed by:	juli
2003-12-20 18:32:48 +00:00
jhb
bfeab27f15 Fix some becuase -> because typos.
Reported by:	Marco Wertejuk <wertejuk@mwcis.com>
2003-12-17 16:12:01 +00:00
rwatson
8315bb904d Switch TCP over to using the inpcb label when responding in timed
wait, rather than the socket label.  This avoids reaching up to
the socket layer during connection close, which requires locking
changes.  To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer().  Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf.  Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-12-17 14:55:11 +00:00
maxim
90bd792be8 o IN_MULTICAST wants an address in host byte order.
PR:		kern/60304
Submitted by:	demon
MFC after:	1 week
2003-12-16 18:21:47 +00:00
emax
9d07822abc Do not panic when flushing dummynet firewall rules
Reviewed by: andre
Approved by: re (scottl)
2003-12-06 09:01:25 +00:00
andre
d0ffd0a240 Swap destination and source arguments of two bcopy() calls.
Before committing the initial tcp_hostcache I changed them from memcpy()
to conform with FreeBSD style without realizing the difference in argument
definition.

This fixes hostcache operation for IPv6 (in general and explicitly IPv6
path mtu discovery) and T/TCP (RFC1644).

Submitted by:	Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Approved by:	re (rwatson)
2003-12-02 21:25:12 +00:00
sam
69806d879e Include opt_ipsec.h so IPSEC/FAST_IPSEC is defined and the appropriate
code is compiled in to support the O_IPSEC operator.  Previously no
support was included and ipsec rules were always matching.  Note that
we do not return an error when an ipsec rule is added and the kernel
does not have IPsec support compiled in; this is done intentionally
but we may want to revisit this (document this in the man page).

PR:		58899
Submitted by:	Bjoern A. Zeeb
Approved by:	re (rwatson)
2003-12-02 00:23:45 +00:00
andre
c06e5a954d Fix an optimization where I made an ifdef'd out section to broad.
When the hostcache bucket limit is reached the last bucket wasn't
removed from the bucket row but inserted a few lines later at the
bucket row head again.  This leads to infinite loop when the same
bucket row is accessed the next time for a lookup/insert or purge
action.

Tested by:	imp, Matt Smith
Approved by:	re (rwatson)
2003-11-28 16:33:03 +00:00
andre
f3b2f134c7 Fix verify_rev_path() function. The author of this function tried to
cut corners which completely broke down when the routing table locking
was introduced.

Reviewed by:	sam (mentor)
Approved by:	re (rwatson)
2003-11-27 09:40:13 +00:00
andre
4a037b3dd4 Make sure all uses of stack allocated struct route's are properly
zeroed.  Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.

Reviewed by:	sam (mentor)
Approved by:	re  (scottl)
2003-11-26 20:31:13 +00:00
sam
960b35f03a Split the "inp" mutex class into separate classes for each of divert,
raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness
complaints.

Reviewed by:	rwatson
Approved by:	re (rwatson)
2003-11-26 01:40:44 +00:00
andre
f3a38bd433 Restructure a too broad ifdef which was disabling the setting of the
tcp flightsize sysctl value for local networks in the !INET6 case.

Approved by:	re (scottl)
2003-11-25 20:58:59 +00:00
sam
fa98e6052a Correct a problem where ipfw-generated packets were being returned
for ipfw processing w/o an indication the packets were generated
by ipfw--and so should not be processed (this manifested itself
as a LOR.)  The flag bit in the mbuf that was used to mark the
packets was not listed in M_COPYFLAGS so if a packet had a header
prepended (as done by IPsec) the flag was lost.  Correct this by
defining a new M_PROTO6 flag and use it to mark packets that need
this processing.

Reviewed by:	bms
Approved by:	re (rwatson)
MFC after:	2 weeks
2003-11-24 03:57:03 +00:00
sam
083aa028f3 Use MPSAFE callouts only when debug.mpsafenet is 1. Both timer routines
potentially transmit packets that may enter KAME IPsec w/o Giant if the
callouts are marked MPSAFE.

Reviewed by:	ume
Approved by:	re (rwatson)
2003-11-23 18:13:41 +00:00
tmm
5bae25b0ad bzero() the the sockaddr used for the destination address for
rtalloc_ign() in in_pcbconnect_setup() before it is filled out.
Otherwise, stack junk would be left in sin_zero, which could
cause host routes to be ignored because they failed the comparison
in rn_match().
This should fix the wrong source address selection for connect() to
127.0.0.1, among other things.

Reviewed by:	sam
Approved by:	re (rwatson)
2003-11-23 03:02:00 +00:00
andre
6164d7c280 Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table.  Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination.  Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by:	sam (mentor), bms
Reviewed by:	-net, -current, core@kame.net (IPv6 parts)
Approved by:	re (scottl)
2003-11-20 20:07:39 +00:00
andre
6dca20de07 Remove RTF_PRCLONING from routing table and adjust users of it
accordingly.  The define is left intact for ABI compatibility
with userland.

This is a pre-step for the introduction of tcp_hostcache.  The
network stack remains fully useable with this change.

Reviewed by:	sam (mentor), bms
Reviewed by:	-net, -current, core@kame.net (IPv6 parts)
Approved by:	re (scottl)
2003-11-20 19:47:31 +00:00
maxim
18a4a482e1 Fix an arguments order in check_uidgid() call.
PR:		kern/59314
Submitted by:	Andrey V. Shytov
Approved by:	re (rwatson, jhb)
2003-11-20 10:28:33 +00:00
rwatson
9c969b771a Introduce a MAC label reference in 'struct inpcb', which caches
the   MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols.  This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks.  Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by:	sam, bms
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
cognet
b60bc6ed20 In rip_abort(), unlock the inpcb if we didn't detach it, or we may
recurse on the lock before destroying the mutex.

Submitted by:	sam
2003-11-17 19:21:53 +00:00
green
fc779a7573 Fix a few cases where MT_TAG-type "fake mbufs" are created on the stack, but
do not have mh_nextpkt initialized.  Somtimes what's there is "1", and the
ip_input() code pukes trying to m_free() it, rendering divert sockets and
such broken.
This really underscores the need to get rid of MT_TAG.

Reviewed by:	rwatson
2003-11-17 03:17:49 +00:00
andre
5c67a85f68 Make two casts correct for all types of 64bit platforms.
Explained by:	bde
2003-11-16 12:50:33 +00:00
andre
ccae1a80af Correct a cast to make it compile on 64bit platforms (noticed by tinderbox)
and remove two unneccessary variable initializations.
Make the introduction comment more clear with regard which parts of
the packet are touched.

Requested by:	luigi
2003-11-15 17:03:37 +00:00
andre
8ac3f681eb Make ipstealth global as we need it in ip_fastforward too. 2003-11-15 01:45:56 +00:00
andre
30ed90673d Remove the global one-level rtcache variable and associated
complex locking and rework ip_rtaddr() to do its own rtlookup.
Adopt all its callers to this and make ip_output() callable
with NULL rt pointer.

Reviewed by:	sam (mentor)
2003-11-14 21:48:57 +00:00
andre
de48630dfb Introduce ip_fastforward and remove ip_flow.
Short description of ip_fastforward:

 o adds full direct process-to-completion IPv4 forwarding code
 o handles ip fragmentation incl. hw support (ip_flow did not)
 o sends icmp needfrag to source if DF is set (ip_flow did not)
 o supports ipfw and ipfilter (ip_flow did not)
 o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
 o returns anything it can't handle back to normal ip_input

Enable with sysctl -w net.inet.ip.fastforwarding=1

Reviewed by:	sam (mentor)
2003-11-14 21:02:22 +00:00
sam
64b81ef0ad add missing inpcb lock before call to tcp_twclose (which reclaims the inpcb)
Supported by:	FreeBSD Foundation
2003-11-13 05:18:23 +00:00
sam
135a94b34b o reorder some locking asserts to reflect the order of the locks
o correct a read-lock assert in in_pcblookup_local that should be
  a write-lock assert (since time wait close cleanups may alter state)

Supported by:	FreeBSD Foundation
2003-11-13 05:16:56 +00:00
andre
c864ff5792 Move global variables for icmp_input() to its stack. With SMP or
preemption two CPUs can be in the same function at the same time
and clobber each others variables.  Remove register declaration
from local variables.

Reviewed by:	sam (mentor)
2003-11-13 00:32:13 +00:00
andre
39849f80f8 Do not fragment a packet with hardware assistance if it has the DF
bit set.

Reviewed by:	sam (mentor)
2003-11-12 23:35:40 +00:00
bms
3f57e25aeb Add a new sysctl knob, net.inet.udp.strict_mcast_mship, to the udp_input path.
This switch toggles between strict multicast delivery, and traditional
multicast delivery.

The traditional (default) behaviour is to deliver multicast datagrams to all
sockets which are members of that group, regardless of the network interface
where the datagrams were received.

The strict behaviour is to deliver multicast datagrams received on a
particular interface only to sockets whose membership is bound to that
interface.

Note that as a matter of course, multicast consumers specifying INADDR_ANY
for their interface get joined on the interface where the default route
happens to be bound. This switch has no effect if the interface which the
consumer specifies for IP_ADD_MEMBERSHIP is not UP and RUNNING.

The original patch has been cleaned up somewhat from that submitted. It has
been tested on a multihomed machine with multiple QuickTime RTP streams
running over the local switch, which doesn't do IGMP snooping.

PR:		kern/58359
Submitted by:	William A. Carrel
Reviewed by:	rwatson
MFC after:	1 week
2003-11-12 20:17:11 +00:00
andre
6c57f18a59 dropwithreset is not needed in this case as tcp_drop() is already notifying
the other side. Before we were sending two RST packets.
2003-11-12 19:38:01 +00:00
rwatson
77ed6e2d1c Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c).  This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE.  This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time.  This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change.  Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from:	bmilekic
Obtained from:		TrustedBSD Project
Sponsored by:		DARPA, Network Associates Laboratories
2003-11-12 03:14:31 +00:00
sam
cff9b0702d correct typos
Pointed out by:	Mike Silbersack
2003-11-11 18:16:54 +00:00
sam
f6943f86fd o add missing inpcb locking in tcp_respond
o replace spl's with lock assertions

Supported by:	FreeBSD Foundation
2003-11-11 17:54:47 +00:00