Commit Graph

249 Commits

Author SHA1 Message Date
rwatson
87aa99bbbb White space cleanup for netinet before branch:
- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by:	re (scottl)
Submitted by:	Xin LI <delphij@frontfree.net>
2004-08-16 18:32:07 +00:00
dwmalone
874318f896 In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach
so that the locks held are the same as the IPv4 case.

Reviewed by:	rwatson
2004-08-12 18:19:36 +00:00
andre
d87fe3ee1e Backout removal of UMA_ZONE_NOFREE flag for all zones which are established
for structures with timers in them.  It might be that a timer might fire
even when the associated structure has already been free'd.  Having type-
stable storage in this case is beneficial for graceful failure handling and
debugging.

Discussed with:	bosko, tegge, rwatson
2004-08-11 20:30:08 +00:00
andre
a6a5e26503 Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and
TCP code.  This flag would have prevented giving back excessive free slabs
to the global pool after a transient peak usage.
2004-08-11 17:08:31 +00:00
rwatson
11a2e8ce3c Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events.  This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.

Reported by:	kuriyama
Tested by:	kuriyama
2004-08-06 03:45:45 +00:00
andre
f1492e3a5a o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be
more consistent with the other sysctls around it.
2004-08-03 13:54:11 +00:00
cperciva
d9fecc83c8 Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with:	rwatson, scottl
Requested by:	jhb
2004-07-26 07:24:04 +00:00
jayanth
3781ade946 Let IN_FASTREOCOVERY macro decide if we are in recovery mode.
Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.
2004-07-19 22:37:33 +00:00
ps
a6b0bc7ed0 Move the sack sysctl's under net.inet.tcp.sack
net.inet.tcp.do_sack -> net.inet.tcp.sack.enable
net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit

Requested by:	wollman
2004-06-23 21:34:07 +00:00
ps
f5f3e8600b Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by:	gnn
Obtained from:	Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
2004-06-23 21:04:37 +00:00
rwatson
8a0f58ccf0 If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE. 2004-06-20 21:44:50 +00:00
rwatson
82295697cd Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
rwatson
ff404935e2 Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4.  This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed.  In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, McAfee Research
2004-05-04 02:11:47 +00:00
silby
743d110741 Enhance our RFC1948 implementation to perform better in some pathlogical
TIME_WAIT recycling cases I was able to generate with http testing tools.

In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number.  As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.

I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.

Except in such contrived benchmarking situations, this problem should never
come up in normal usage...  until networks get faster.

No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.
2004-04-20 06:33:39 +00:00
imp
b49b7fe799 Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson
2004-04-07 20:46:16 +00:00
rwatson
e391576e63 Two missed in previous commit -- compare pointer with NULL rather than
using it as a boolean.
2004-04-05 00:52:05 +00:00
rwatson
bdbf43dbba Prefer NULL to 0 when checking pointer values as integers or booleans. 2004-04-05 00:49:07 +00:00
rwatson
4bcda84b60 Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by:	sam
2004-02-28 15:12:20 +00:00
truckman
1de257deb3 Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire().  Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails.  Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by:	bms
2004-02-26 00:27:04 +00:00
andre
5ef70fe223 Convert the tcp segment reassembly queue to UMA and limit the maximum
amount of segments it will hold.

The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:

 net.inet.tcp.reass.maxsegments (loader tuneable)
  specifies the maximum number of segments all tcp reassemly queues can
  hold (defaults to 1/16 of nmbclusters).

 net.inet.tcp.reass.maxqlen
  specifies the maximum number of segments any individual tcp session queue
  can hold (defaults to 48).

 net.inet.tcp.reass.cursegments (readonly)
  counts the number of segments currently in all reassembly queues.

 net.inet.tcp.reass.overflows (readonly)
  counts how often either the global or local queue limit has been reached.

Tested by:	bms, silby
Reviewed by:	bms, silby
2004-02-24 15:27:41 +00:00
pjd
806650a364 Fixed ucred structure leak.
Approved by:	scottl (mentor)
PR:		54163
MFC after:	3 days
2004-02-19 14:13:21 +00:00
bms
1b4c430a4d Final brucification pass. Spell types consistently (u_int). Remove bogus
casts. Remove unnecessary parenthesis.

Submitted by:	bde
2004-02-14 21:49:48 +00:00
bms
09ad0862e6 Brucification.
Submitted by:	bde
2004-02-13 18:21:45 +00:00
ume
f35565e63f supported IPV6_RECVPATHMTU socket option.
Obtained from:	KAME
2004-02-13 14:50:01 +00:00
bms
afe7ed20e4 Update the prototype for tcpsignature_apply() to reflect the spelling of
the types used by m_apply()'s callback function, f, as documented in mbuf(9).

Noticed by:	njl
2004-02-12 20:16:09 +00:00
bms
7c4d7ecee0 style(9) pass; whitespace and comments.
Submitted by:	njl
2004-02-12 20:12:48 +00:00
bms
903cdeea1a Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by:	sentex.net
2004-02-11 04:26:04 +00:00
andre
491421126e Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU.  This is done
dynamically based on feedback from the remote host and network
components along the packet path.  This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

 o during tcp connection setup the advertized local MSS is
   exchanged between the endpoints.  The remote endpoint can
   set this arbitrarily low (except for a minimum MTU of 64
   octets enforced in the BSD code).  When the local host is
   sending data it is forced to send many small IP packets
   instead of a large one.

   For example instead of the normal TCP payload size of 1448
   it forces TCP payload size of 12 (MTU 64) and thus we have
   a 120 times increase in workload and packets. On fast links
   this quickly saturates the local CPU and may also hit pps
   processing limites of network components along the path.

   This type of attack is particularly effective for servers
   where the attacker can download large files (WWW and FTP).

   We mitigate it by enforcing a minimum MTU settable by sysctl
   net.inet.tcp.minmss defaulting to 256 octets.

 o the local host is reveiving data on a TCP connection from
   the remote host.  The local host has no control over the
   packet size the remote host is sending.  The remote host
   may chose to do what is described in the first attack and
   send the data in packets with an TCP payload of at least
   one byte.  For each packet the tcp_input() function will
   be entered, the packet is processed and a sowakeup() is
   signalled to the connected process.

   For example an attack with 2 Mbit/s gives 4716 packets per
   second and the same amount of sowakeup()s to the process
   (and context switches).

   This type of attack is particularly effective for servers
   where the attacker can upload large amounts of data.
   Normally this is the case with WWW server where large POSTs
   can be made.

   We mitigate this by calculating the average MSS payload per
   second.  If it goes below 'net.inet.tcp.minmss' and the pps
   rate is above 'net.inet.tcp.minmssoverload' defaulting to
   1000 this particular TCP connection is resetted and dropped.

MITRE CVE:	CAN-2004-0002
Reviewed by:	sam (mentor)
MFC after:	1 day
2004-01-08 17:40:07 +00:00
andre
09dcc2c21c If path mtu discovery is enabled set the DF bit in all cases we
send packets on a tcp connection.

PR:		kern/60889
Tested by:	Richard Wendland <richard@wendland.org.uk>
Approved by:	re (scottl)
2004-01-08 11:17:11 +00:00
andre
f6253c9b05 Enable the following TCP options by default to give it more exposure:
rfc3042  Limited retransmit
 rfc3390  Increasing TCP's initial congestion Window
 inflight TCP inflight bandwidth limiting

All my production server have it enabled and there have been no
issues.  I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by:	sam (mentor)
2004-01-06 23:29:46 +00:00
jhb
bfeab27f15 Fix some becuase -> because typos.
Reported by:	Marco Wertejuk <wertejuk@mwcis.com>
2003-12-17 16:12:01 +00:00
rwatson
8315bb904d Switch TCP over to using the inpcb label when responding in timed
wait, rather than the socket label.  This avoids reaching up to
the socket layer during connection close, which requires locking
changes.  To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer().  Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf.  Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-12-17 14:55:11 +00:00
andre
4a037b3dd4 Make sure all uses of stack allocated struct route's are properly
zeroed.  Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.

Reviewed by:	sam (mentor)
Approved by:	re  (scottl)
2003-11-26 20:31:13 +00:00
andre
6164d7c280 Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table.  Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination.  Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by:	sam (mentor), bms
Reviewed by:	-net, -current, core@kame.net (IPv6 parts)
Approved by:	re (scottl)
2003-11-20 20:07:39 +00:00
sam
a946bc4854 o correct locking problem: the inpcb must be held across tcp_respond
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond

Supported by:	FreeBSD Foundation
2003-11-08 22:59:22 +00:00
silby
1bddc356b1 Add an additional check to the tcp_twrecycleable function; I had
previously only considered the send sequence space.  Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.

The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.
2003-11-02 07:47:03 +00:00
silby
34fc0661fd - Add a new function tcp_twrecycleable, which tells us if the ISN which
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.

- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.

This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.
2003-11-01 07:30:08 +00:00
silby
4390f05459 Reduce the number of tcp time_wait structs to maxsockets / 5; this ensures
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.

No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.

This should reduce the need for sysadmins to lower the default msl on
busy servers.
2003-10-24 05:44:14 +00:00
silby
f0e686a675 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00
ru
27c75c8f21 Fix a bunch of off-by-one errors in the range checking code. 2003-09-11 21:40:21 +00:00
rwatson
dd0b6b7e28 Introduce two new MAC Framework and MAC policy entry points:
mac_reflect_mbuf_icmp()
  mac_reflect_mbuf_tcp()

These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket.  For example, in respond to a ping or a RST
packet to a SYN on a closed port.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-08-21 18:39:16 +00:00
rwatson
f332b50228 Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.

Approved by:	re (scottl)
Reviewed by:	hsu
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-05-07 05:26:27 +00:00
rwatson
c879a35dfe Remove a potential panic condition introduced by reduced TCP wait
state.  Those changed attempted to work around the changed invariant
that inp->in_socket was sometimes now NULL, but the logic wasn't
quite right, meaning that inp->in_socket would be dereferenced by
cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC
were in use.  Attempt to clarify and correct the logic.

Note: the work-around originally introduced with the reduced TCP
wait state handling to use cr_cansee() instead of cr_canseesocket()
in this case isn't really right, although it "Does the right thing"
for most of the cases in the base system.  We'll need to address
this at some point in the future.

Pointed out by:	dcs
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-04-10 20:33:10 +00:00
jlemon
03b8ace489 Remove a panic(); if the zone allocator can't provide more timewait
structures, reuse the oldest one.  Also move the expiry timer from
a per-structure callout to the tcp slow timer.

Sponsored by: DARPA, NAI Labs
2003-03-08 22:06:20 +00:00
des
2756b6c964 More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9). 2003-03-02 16:54:40 +00:00
rwatson
97f9a00e9e When generating a TCP response to a connection, not only test if the
tcpcb is NULL, but also its connected inpcb, since we now allow
elements of a TCP connection to hang around after other state, such
as the socket, has been recycled.

Tested by:	dcs
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-02-25 14:08:41 +00:00
phk
f679a32416 - m = m_gethdr(M_NOWAIT, MT_HEADER);
+       m = m_gethdr(M_DONTWAIT, MT_HEADER);

'nuff said.
2003-02-21 23:17:12 +00:00
jlemon
83eadc04a8 Unbreak non-IPV6 compilation.
Caught by: phk
Sponsored by: DARPA, NAI Labs
2003-02-19 23:43:04 +00:00
jlemon
a8bc02dcb2 Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block.  Allow the socket and tcpcb structures to be freed
earlier than inpcb.  Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs
2003-02-19 22:32:43 +00:00
jlemon
79a1ebfa6f Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the
routine does not require a tcpcb to operate.  Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.

Sponsored by: DARPA, NAI Labs
2003-02-19 22:18:06 +00:00
imp
cf874b345d Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
hsu
b9cd8d8951 Take advantage of pre-existing lock-free synchronization and type stable memory
to avoid acquiring SMP locks during expensive copyout process.
2003-02-15 02:37:57 +00:00
alfred
bf8e8a6e8f Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
hsu
449c7cf2ec Validate inp to prevent an use after free. 2002-12-24 21:00:31 +00:00
dillon
524e713804 Change tcp.inflight_min from 1024 to a production default of 6144. Create
a sysctl for the stabilization value for the bandwidth delay product (inflight)
algorithm and document it.

MFC after:	3 days
2002-12-14 21:00:17 +00:00
phk
419afbc0d3 Fix two instances of variant struct definitions in sys/netinet:
Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.

Add a CTASSERT protecting sizeof(struct ip) == 20.

Don't let the size of struct ipq depend on the IPDIVERT option.

This is a functional no-op commit.

Approved by:	re
2002-10-20 22:52:07 +00:00
sam
0ef6c52bbc Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas.  Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by:	KAME, rwatson
Approved by:	silence
Supported by:	Vernier Networks
2002-10-16 02:25:05 +00:00
sam
2a86be217a Replace aux mbufs with packet tags:
o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
  ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
  use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
  inpcb parameter to ip_output and ip6_output to allow the IPsec code to
  locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by:	julian, luigi (silent), -arch, -net, darren
Approved by:	julian, silence from everyone else
Obtained from:	openbsd (mostly)
MFC after:	1 month
2002-10-16 01:54:46 +00:00
dillon
4ede9e7449 turn off debugging by default if bandwidth delay product limiting is
turned on (it is already off in -stable).
2002-10-10 21:41:30 +00:00
dillon
25a0b20721 Correct bug in t_bw_rtttime rollover, #undef USERTT 2002-08-24 17:22:44 +00:00
dillon
bb806c49bf Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas.  Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable	enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug	debugging (defaults to 1=on)
net.inet.tcp.inflight_min	minimum window limit
net.inet.tcp.inflight_max	maximum window limit

MFC after:	1 week
2002-08-17 18:26:02 +00:00
rwatson
00b9d17af8 Document the undocumented assumption that at least one of the PCB
pointer and incoming mbuf pointer will be non-NULL in tcp_respond().
This is relied on by the MAC code for correctness, as well as
existing code.

Obtained from:	TrustedBSD PRoject
Sponsored by:	DARPA, NAI Labs
2002-08-01 03:54:43 +00:00
rwatson
a034d0cd3c Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket.  Assign labels
to newly accepted connections when the syncache/cookie code has done
its business.  Also set peer labels as convenient.  Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation.  Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 19:06:49 +00:00
truckman
b1555a2743 Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held.  This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.
2002-07-28 19:59:31 +00:00
dillon
66e9d99f6c Introduce two new sysctl's:
net.inet.tcp.rexmit_min (default 3 ticks equiv)

    This sysctl is the retransmit timer RTO minimum,
    specified in milliseconds.  This value is
    designed for algorithmic stability only.

net.inet.tcp.rexmit_slop (default 200ms)

    This sysctl is the retransmit timer RTO slop
    which is added to every retransmit timeout and
    is designed to handle protocol stack overheads
    and delayed ack issues.

Note that the *original* code applied a 1-second
RTO minimum but never applied real slop to the RTO
calculation, so any RTO calculation over one second
would have no slop and thus not account for
protocol stack overheads (TCP timestamps are not
a measure of protocol turnaround!).  Essentially,
the original code made the RTO calculation almost
completely irrelevant.

Please note that the 200ms slop is debateable.
This commit is not meant to be a line in the sand,
and if the community winds up deciding that increasing
it is the correct solution then it's easy to do.
Note that larger values will destroy performance
on lossy networks while smaller values may result in
a greater number of unnecessary retransmits.
2002-07-18 19:06:12 +00:00
truckman
5d8999f18a Defer calling SYSCTL_OUT() until after the locks have been released. 2002-07-11 23:18:43 +00:00
truckman
fe3b828f84 Reduce the nesting level of a code block that doesn't need to be in
an else clause.
2002-07-11 23:13:31 +00:00
jesper
3a003d229a Extend the effect of the sysctl net.inet.tcp.icmp_may_rst
so that, if we recieve a ICMP "time to live exceeded in transit",
(type 11, code 0) for a TCP connection on SYN-SENT state, close
the connection.

MFC after:	2 weeks
2002-06-30 20:07:21 +00:00
hsu
80cef86a8d TCP notify functions can change the pcb list. 2002-06-21 22:52:48 +00:00
hsu
abda76de0b Notify functions can destroy the pcb, so they have to return an
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.

Submitted by:	Jennifer Yang (yangjihui@yahoo.com)
Bug reported by:  Sid Carter (sidcarter@symonds.net)
2002-06-14 08:35:21 +00:00
hsu
b67cb93fe3 Fix logic which resulted in missing a call to INP_UNLOCK(). 2002-06-12 03:11:06 +00:00
hsu
cd25d4648f Lock up inpcb.
Submitted by:	Jennifer Yang <yangjihui@yahoo.com>
2002-06-10 20:05:46 +00:00
tanimura
e6fa9b9e92 Back out my lats commit of locking down a socket, it conflicts with hsu's work.
Requested by:	hsu
2002-05-31 11:52:35 +00:00
tanimura
92d8381dd5 Lock down a socket, milestone 1.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
  socket buffer. The mutex in the receive buffer also protects the data
  in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

  - so_count
  - so_options
  - so_linger
  - so_state

o Remove *_locked() socket APIs.  Make the following socket APIs
  touching the members above now require a locked socket:

 - sodisconnect()
 - soisconnected()
 - soisconnecting()
 - soisdisconnected()
 - soisdisconnecting()
 - sofree()
 - soref()
 - sorele()
 - sorwakeup()
 - sotryfree()
 - sowakeup()
 - sowwakeup()

Reviewed by:	alfred
2002-05-20 05:41:09 +00:00
silby
c7389be7ba Remove some ISN generation code which has been unused since the
syncache went in.

MFC after:	3 days
2002-04-10 22:12:01 +00:00
jhb
dc2e474f79 Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API.  The entire API now consists of two functions
similar to the pre-KSE API.  The suser() function takes a thread pointer
as its only argument.  The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0.  The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on:	smp@
2002-04-01 21:31:13 +00:00
rwatson
afe2b1f929 Merge from TrustedBSD MAC branch:
Move the network code from using cr_cansee() to check whether a
    socket is visible to a requesting credential to using a new
    function, cr_canseesocket(), which accepts a subject credential
    and object socket.  Implement cr_canseesocket() so that it does a
    prison check, a uid check, and add a comment where shortly a MAC
    hook will go.  This will allow MAC policies to seperately
    instrument the visibility of sockets from the visibility of
    processes.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-03-22 19:57:41 +00:00
jeff
0a59f1223c Switch vm_zone.h with uma.h. Change over to uma interfaces. 2002-03-20 05:48:55 +00:00
alfred
357e37e023 Remove __P. 2002-03-19 21:25:46 +00:00
jhb
3706cd3509 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
alfred
943268c4b5 More IPV6 const fixes. 2002-02-27 05:11:50 +00:00
dd
c8a6bd9922 Introduce a version field to `struct xucred' in place of one of the
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being).  Accordingly, change users of
xucred to set and check this field as appropriate.  In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former.  This also has the pleasant sideaffect of removing some
duplicate code.

Reviewed by:	rwatson
2002-02-27 04:45:37 +00:00
ume
0c5d24bcd2 In tcp_respond(), correctly reset returned IPv6 header. This is essential
when the original packet contains an IPv6 extension header.

Obtained from:	KAME
MFC after:	1 week
2002-02-04 17:37:06 +00:00
jlemon
a3c1c9fdb4 Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby  (in a previous iteration)
Sponsored by: DARPA, NAI Labs
2001-11-22 04:50:44 +00:00
rwatson
8cf42b482a o Replace reference to 'struct proc' with 'struct thread' in 'struct
sysctl_req', which describes in-progress sysctl requests.  This permits
  sysctl handlers to have access to the current thread, permitting work
  on implementing td->td_ucred, migration of suser() to using struct
  thread to derive the appropriate ucred, and allowing struct thread to be
  passed down to other code, such as network code where td is not currently
  available (and curproc is used).

o Note: netncp and netsmb are not updated to reflect this change, as they
  are not currently KSE-adapted.

Reviewed by:		julian
Obtained from:	TrustedBSD Project
2001-11-08 02:13:18 +00:00
rwatson
f51eaee62f - Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, describes as:
  "Unprivileged processes may see subjects/objects with different real uid"
  NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
  an API change.  kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
  the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
  domain socket credential instead of comparing root vnodes for the
  UDS and the process.  This allows multiple jails to share the same
  chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions.  This also better-supports
the introduction of additional MAC models.

Reviewed by:	ps, billf
Obtained from:	TrustedBSD Project
2001-10-09 21:40:30 +00:00
ps
38383190d5 Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by:	billf (with modifications by me)
Inspired by:	Dave McKay (aka pm aka Packet Magnet)
Reviewed by:	peter
MFC after:	2 weeks
2001-10-05 07:06:32 +00:00
rwatson
7a4775391d o Rename u_cansee() to cr_cansee(), making the name more comprehensible
in the face of a rename of ucred to cred, and possibly generally.

Obtained from:	TrustedBSD Project
2001-09-20 21:45:31 +00:00
silby
58e247fcc4 Much delayed but now present: RFC 1948 style sequence numbers
In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented.  Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable.  In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper
2001-08-22 00:58:16 +00:00
peter
3ed3578ff8 Fix a warning. 2001-07-27 00:04:39 +00:00
peter
b8da0cdbc4 Patch up some style(9) stuff in tcp_new_isn() 2001-07-27 00:03:49 +00:00
peter
3feb3ed786 s/OpemBSD/OpenBSD/ 2001-07-27 00:01:48 +00:00
silby
2be73222cb Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme.  Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme.  0 = random positive increments,
1 = the OpenBSD algorithm.  1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net
2001-07-08 02:20:47 +00:00
dwmalone
db54f212f8 Allow getcred sysctl to work in jailed root processes. Processes can
only do getcred calls for sockets which were created in the same jail.
This should allow the ident to work in a reasonable way within jails.

PR:		28107
Approved by:	des, rwatson
2001-06-24 12:18:27 +00:00
jlemon
e071c16669 Replace bzero() of struct ip with explicit zeroing of structure members,
which is faster.
2001-06-23 17:44:27 +00:00
silby
f41767543e Eliminate the allocation of a tcp template structure for each
connection.  The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection.  On large systems, this change should
free up a large number of mbufs.

Reviewed by:	bmilekic, jlemon, ru
MFC after: 2 weeks
2001-06-23 03:21:46 +00:00
ume
7ffe6c47e5 made sure to use the correct sa_len for rtalloc().
sizeof(ro_dst) is not necessarily the correct one.
this change would also fix the recent path MTU discovery problem for the
destination of an incoming TCP connection.

Submitted by:	JINMEI Tatuya <jinmei@kame.net>
Obtained from:	KAME
MFC after:	2 weeks
2001-06-20 12:32:48 +00:00
ume
832f8d2249 Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
  - The definitions of SADB_* in sys/net/pfkeyv2.h are still different
    from RFC2407/IANA assignment because of binary compatibility
    issue.  It should be fixed under 5-CURRENT.
  - ip6po_m member of struct ip6_pktopts is no longer used.  But, it
    is still there because of binary compatibility issue.  It should
    be removed under 5-CURRENT.

Reviewed by:	itojun
Obtained from:	KAME
MFC after:	3 weeks
2001-06-11 12:39:29 +00:00
peter
4b91e2ecf0 "Fix" the previous initial attempt at fixing TUNABLE_INT(). This time
around, use a common function for looking up and extracting the tunables
from the kernel environment.  This saves duplicating the same function
over and over again.  This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.
2001-06-08 05:24:21 +00:00
peter
c1df44ae51 Back out part of my previous commit. This was a last minute change
and I botched testing.  This is a perfect example of how NOT to do
this sort of thing. :-(
2001-06-07 03:17:26 +00:00