Commit Graph

373 Commits

Author SHA1 Message Date
Bruce M Simpson
1cfd4b5326 Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by:	sentex.net
2004-02-11 04:26:04 +00:00
Andre Oppermann
53369ac9bb Limiters and sanity checks for TCP MSS (maximum segement size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU.  This is done
dynamically based on feedback from the remote host and network
components along the packet path.  This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

 o during tcp connection setup the advertized local MSS is
   exchanged between the endpoints.  The remote endpoint can
   set this arbitrarily low (except for a minimum MTU of 64
   octets enforced in the BSD code).  When the local host is
   sending data it is forced to send many small IP packets
   instead of a large one.

   For example instead of the normal TCP payload size of 1448
   it forces TCP payload size of 12 (MTU 64) and thus we have
   a 120 times increase in workload and packets. On fast links
   this quickly saturates the local CPU and may also hit pps
   processing limites of network components along the path.

   This type of attack is particularly effective for servers
   where the attacker can download large files (WWW and FTP).

   We mitigate it by enforcing a minimum MTU settable by sysctl
   net.inet.tcp.minmss defaulting to 256 octets.

 o the local host is reveiving data on a TCP connection from
   the remote host.  The local host has no control over the
   packet size the remote host is sending.  The remote host
   may chose to do what is described in the first attack and
   send the data in packets with an TCP payload of at least
   one byte.  For each packet the tcp_input() function will
   be entered, the packet is processed and a sowakeup() is
   signalled to the connected process.

   For example an attack with 2 Mbit/s gives 4716 packets per
   second and the same amount of sowakeup()s to the process
   (and context switches).

   This type of attack is particularly effective for servers
   where the attacker can upload large amounts of data.
   Normally this is the case with WWW server where large POSTs
   can be made.

   We mitigate this by calculating the average MSS payload per
   second.  If it goes below 'net.inet.tcp.minmss' and the pps
   rate is above 'net.inet.tcp.minmssoverload' defaulting to
   1000 this particular TCP connection is resetted and dropped.

MITRE CVE:	CAN-2004-0002
Reviewed by:	sam (mentor)
MFC after:	1 day
2004-01-08 17:40:07 +00:00
Andre Oppermann
bf87c82ebb If path mtu discovery is enabled set the DF bit in all cases we
send packets on a tcp connection.

PR:		kern/60889
Tested by:	Richard Wendland <richard@wendland.org.uk>
Approved by:	re (scottl)
2004-01-08 11:17:11 +00:00
Andre Oppermann
dba7bc6a65 Enable the following TCP options by default to give it more exposure:
rfc3042  Limited retransmit
 rfc3390  Increasing TCP's initial congestion Window
 inflight TCP inflight bandwidth limiting

All my production server have it enabled and there have been no
issues.  I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by:	sam (mentor)
2004-01-06 23:29:46 +00:00
John Baldwin
a5b061f9d2 Fix some becuase -> because typos.
Reported by:	Marco Wertejuk <wertejuk@mwcis.com>
2003-12-17 16:12:01 +00:00
Robert Watson
2d92ec9858 Switch TCP over to using the inpcb label when responding in timed
wait, rather than the socket label.  This avoids reaching up to
the socket layer during connection close, which requires locking
changes.  To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer().  Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf.  Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-12-17 14:55:11 +00:00
Andre Oppermann
0cfbbe3bde Make sure all uses of stack allocated struct route's are properly
zeroed.  Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.

Reviewed by:	sam (mentor)
Approved by:	re  (scottl)
2003-11-26 20:31:13 +00:00
Andre Oppermann
97d8d152c2 Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table.  Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination.  Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by:	sam (mentor), bms
Reviewed by:	-net, -current, core@kame.net (IPv6 parts)
Approved by:	re (scottl)
2003-11-20 20:07:39 +00:00
Sam Leffler
c29afad673 o correct locking problem: the inpcb must be held across tcp_respond
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond

Supported by:	FreeBSD Foundation
2003-11-08 22:59:22 +00:00
Mike Silbersack
4bd4fa3fe6 Add an additional check to the tcp_twrecycleable function; I had
previously only considered the send sequence space.  Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.

The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.
2003-11-02 07:47:03 +00:00
Mike Silbersack
96af9ea52b - Add a new function tcp_twrecycleable, which tells us if the ISN which
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.

- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.

This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.
2003-11-01 07:30:08 +00:00
Mike Silbersack
0709c23335 Reduce the number of tcp time_wait structs to maxsockets / 5; this ensures
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.

No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.

This should reduce the need for sysadmins to lower the default msl on
busy servers.
2003-10-24 05:44:14 +00:00
Mike Silbersack
184dcdc7c8 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00
Ruslan Ermilov
78f94aa951 Fix a bunch of off-by-one errors in the range checking code. 2003-09-11 21:40:21 +00:00
Robert Watson
baee0c3e66 Introduce two new MAC Framework and MAC policy entry points:
mac_reflect_mbuf_icmp()
  mac_reflect_mbuf_tcp()

These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket.  For example, in respond to a ping or a RST
packet to a SYN on a closed port.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-08-21 18:39:16 +00:00
Robert Watson
430c635447 Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.

Approved by:	re (scottl)
Reviewed by:	hsu
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-05-07 05:26:27 +00:00
Robert Watson
cacd79e2c9 Remove a potential panic condition introduced by reduced TCP wait
state.  Those changed attempted to work around the changed invariant
that inp->in_socket was sometimes now NULL, but the logic wasn't
quite right, meaning that inp->in_socket would be dereferenced by
cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC
were in use.  Attempt to clarify and correct the logic.

Note: the work-around originally introduced with the reduced TCP
wait state handling to use cr_cansee() instead of cr_canseesocket()
in this case isn't really right, although it "Does the right thing"
for most of the cases in the base system.  We'll need to address
this at some point in the future.

Pointed out by:	dcs
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-04-10 20:33:10 +00:00
Jonathan Lemon
607b0b0cc9 Remove a panic(); if the zone allocator can't provide more timewait
structures, reuse the oldest one.  Also move the expiry timer from
a per-structure callout to the tcp slow timer.

Sponsored by: DARPA, NAI Labs
2003-03-08 22:06:20 +00:00
Dag-Erling Smørgrav
521f364b80 More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9). 2003-03-02 16:54:40 +00:00
Robert Watson
9327ee33bf When generating a TCP response to a connection, not only test if the
tcpcb is NULL, but also its connected inpcb, since we now allow
elements of a TCP connection to hang around after other state, such
as the socket, has been recycled.

Tested by:	dcs
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-02-25 14:08:41 +00:00
Poul-Henning Kamp
d25ecb917b - m = m_gethdr(M_NOWAIT, MT_HEADER);
+       m = m_gethdr(M_DONTWAIT, MT_HEADER);

'nuff said.
2003-02-21 23:17:12 +00:00
Jonathan Lemon
ffae8c5a7e Unbreak non-IPV6 compilation.
Caught by: phk
Sponsored by: DARPA, NAI Labs
2003-02-19 23:43:04 +00:00
Jonathan Lemon
340c35de6a Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block.  Allow the socket and tcpcb structures to be freed
earlier than inpcb.  Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs
2003-02-19 22:32:43 +00:00
Jonathan Lemon
7990938421 Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the
routine does not require a tcpcb to operate.  Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.

Sponsored by: DARPA, NAI Labs
2003-02-19 22:18:06 +00:00
Warner Losh
a163d034fa Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
Jeffrey Hsu
4b40c56c28 Take advantage of pre-existing lock-free synchronization and type stable memory
to avoid acquiring SMP locks during expensive copyout process.
2003-02-15 02:37:57 +00:00
Alfred Perlstein
44956c9863 Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
Jeffrey Hsu
abe239cfe2 Validate inp to prevent an use after free. 2002-12-24 21:00:31 +00:00
Matthew Dillon
d7ff8ef62a Change tcp.inflight_min from 1024 to a production default of 6144. Create
a sysctl for the stabilization value for the bandwidth delay product (inflight)
algorithm and document it.

MFC after:	3 days
2002-12-14 21:00:17 +00:00
Poul-Henning Kamp
53be11f680 Fix two instances of variant struct definitions in sys/netinet:
Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.

Add a CTASSERT protecting sizeof(struct ip) == 20.

Don't let the size of struct ipq depend on the IPDIVERT option.

This is a functional no-op commit.

Approved by:	re
2002-10-20 22:52:07 +00:00
Sam Leffler
b9234fafa0 Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas.  Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implmentation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by:	KAME, rwatson
Approved by:	silence
Supported by:	Vernier Networks
2002-10-16 02:25:05 +00:00
Sam Leffler
5d84645305 Replace aux mbufs with packet tags:
o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
  ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
  use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
  inpcb parameter to ip_output and ip6_output to allow the IPsec code to
  locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by:	julian, luigi (silent), -arch, -net, darren
Approved by:	julian, silence from everyone else
Obtained from:	openbsd (mostly)
MFC after:	1 month
2002-10-16 01:54:46 +00:00
Matthew Dillon
c8d50f2414 turn off debugging by default if bandwidth delay product limiting is
turned on (it is already off in -stable).
2002-10-10 21:41:30 +00:00
Matthew Dillon
4f1e1f32b6 Correct bug in t_bw_rtttime rollover, #undef USERTT 2002-08-24 17:22:44 +00:00
Matthew Dillon
1fcc99b5de Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas.  Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable	enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug	debugging (defaults to 1=on)
net.inet.tcp.inflight_min	minimum window limit
net.inet.tcp.inflight_max	maximum window limit

MFC after:	1 week
2002-08-17 18:26:02 +00:00
Robert Watson
d00e44fb4a Document the undocumented assumption that at least one of the PCB
pointer and incoming mbuf pointer will be non-NULL in tcp_respond().
This is relied on by the MAC code for correctness, as well as
existing code.

Obtained from:	TrustedBSD PRoject
Sponsored by:	DARPA, NAI Labs
2002-08-01 03:54:43 +00:00
Robert Watson
c488362e1a Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket.  Assign labels
to newly accepted connections when the syncache/cookie code has done
its business.  Also set peer labels as convenient.  Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation.  Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 19:06:49 +00:00
Don Lewis
5c38b6dbce Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held.  This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.
2002-07-28 19:59:31 +00:00
Matthew Dillon
701bec5a38 Introduce two new sysctl's:
net.inet.tcp.rexmit_min (default 3 ticks equiv)

    This sysctl is the retransmit timer RTO minimum,
    specified in milliseconds.  This value is
    designed for algorithmic stability only.

net.inet.tcp.rexmit_slop (default 200ms)

    This sysctl is the retransmit timer RTO slop
    which is added to every retransmit timeout and
    is designed to handle protocol stack overheads
    and delayed ack issues.

Note that the *original* code applied a 1-second
RTO minimum but never applied real slop to the RTO
calculation, so any RTO calculation over one second
would have no slop and thus not account for
protocol stack overheads (TCP timestamps are not
a measure of protocol turnaround!).  Essentially,
the original code made the RTO calculation almost
completely irrelevant.

Please note that the 200ms slop is debateable.
This commit is not meant to be a line in the sand,
and if the community winds up deciding that increasing
it is the correct solution then it's easy to do.
Note that larger values will destroy performance
on lossy networks while smaller values may result in
a greater number of unnecessary retransmits.
2002-07-18 19:06:12 +00:00
Don Lewis
0e1eebb846 Defer calling SYSCTL_OUT() until after the locks have been released. 2002-07-11 23:18:43 +00:00
Don Lewis
142b2bd644 Reduce the nesting level of a code block that doesn't need to be in
an else clause.
2002-07-11 23:13:31 +00:00
Jesper Skriver
eb538bfd64 Extend the effect of the sysctl net.inet.tcp.icmp_may_rst
so that, if we recieve a ICMP "time to live exceeded in transit",
(type 11, code 0) for a TCP connection on SYN-SENT state, close
the connection.

MFC after:	2 weeks
2002-06-30 20:07:21 +00:00
Jeffrey Hsu
2d40081d1f TCP notify functions can change the pcb list. 2002-06-21 22:52:48 +00:00
Jeffrey Hsu
3ce144ea88 Notify functions can destroy the pcb, so they have to return an
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.

Submitted by:	Jennifer Yang (yangjihui@yahoo.com)
Bug reported by:  Sid Carter (sidcarter@symonds.net)
2002-06-14 08:35:21 +00:00
Jeffrey Hsu
73dca2078d Fix logic which resulted in missing a call to INP_UNLOCK(). 2002-06-12 03:11:06 +00:00
Jeffrey Hsu
f76fcf6d4c Lock up inpcb.
Submitted by:	Jennifer Yang <yangjihui@yahoo.com>
2002-06-10 20:05:46 +00:00
Seigo Tanimura
4cc20ab1f0 Back out my lats commit of locking down a socket, it conflicts with hsu's work.
Requested by:	hsu
2002-05-31 11:52:35 +00:00
Seigo Tanimura
243917fe3b Lock down a socket, milestone 1.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
  socket buffer. The mutex in the receive buffer also protects the data
  in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

  - so_count
  - so_options
  - so_linger
  - so_state

o Remove *_locked() socket APIs.  Make the following socket APIs
  touching the members above now require a locked socket:

 - sodisconnect()
 - soisconnected()
 - soisconnecting()
 - soisdisconnected()
 - soisdisconnecting()
 - sofree()
 - soref()
 - sorele()
 - sorwakeup()
 - sotryfree()
 - sowakeup()
 - sowwakeup()

Reviewed by:	alfred
2002-05-20 05:41:09 +00:00
Mike Silbersack
898568d8ab Remove some ISN generation code which has been unused since the
syncache went in.

MFC after:	3 days
2002-04-10 22:12:01 +00:00
John Baldwin
44731cab3b Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API.  The entire API now consists of two functions
similar to the pre-KSE API.  The suser() function takes a thread pointer
as its only argument.  The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0.  The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on:	smp@
2002-04-01 21:31:13 +00:00
Robert Watson
29dc1288b0 Merge from TrustedBSD MAC branch:
Move the network code from using cr_cansee() to check whether a
    socket is visible to a requesting credential to using a new
    function, cr_canseesocket(), which accepts a subject credential
    and object socket.  Implement cr_canseesocket() so that it does a
    prison check, a uid check, and add a comment where shortly a MAC
    hook will go.  This will allow MAC policies to seperately
    instrument the visibility of sockets from the visibility of
    processes.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-03-22 19:57:41 +00:00
Jeff Roberson
69c2d429c1 Switch vm_zone.h with uma.h. Change over to uma interfaces. 2002-03-20 05:48:55 +00:00
Alfred Perlstein
4d77a549fe Remove __P. 2002-03-19 21:25:46 +00:00
John Baldwin
a854ed9893 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
Alfred Perlstein
60d9b32e44 More IPV6 const fixes. 2002-02-27 05:11:50 +00:00
Dima Dorfman
76183f3453 Introduce a version field to `struct xucred' in place of one of the
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being).  Accordingly, change users of
xucred to set and check this field as appropriate.  In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former.  This also has the pleasant sideaffect of removing some
duplicate code.

Reviewed by:	rwatson
2002-02-27 04:45:37 +00:00
Hajimu UMEMOTO
c5993889d8 In tcp_respond(), correctly reset returned IPv6 header. This is essential
when the original packet contains an IPv6 extension header.

Obtained from:	KAME
MFC after:	1 week
2002-02-04 17:37:06 +00:00
Jonathan Lemon
be2ac88c59 Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby  (in a previous iteration)
Sponsored by: DARPA, NAI Labs
2001-11-22 04:50:44 +00:00
Robert Watson
ce17880650 o Replace reference to 'struct proc' with 'struct thread' in 'struct
sysctl_req', which describes in-progress sysctl requests.  This permits
  sysctl handlers to have access to the current thread, permitting work
  on implementing td->td_ucred, migration of suser() to using struct
  thread to derive the appropriate ucred, and allowing struct thread to be
  passed down to other code, such as network code where td is not currently
  available (and curproc is used).

o Note: netncp and netsmb are not updated to reflect this change, as they
  are not currently KSE-adapted.

Reviewed by:		julian
Obtained from:	TrustedBSD Project
2001-11-08 02:13:18 +00:00
Robert Watson
8a7d8cc675 - Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, describes as:
  "Unprivileged processes may see subjects/objects with different real uid"
  NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
  an API change.  kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
  the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
  domain socket credential instead of comparing root vnodes for the
  UDS and the process.  This allows multiple jails to share the same
  chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions.  This also better-supports
the introduction of additional MAC models.

Reviewed by:	ps, billf
Obtained from:	TrustedBSD Project
2001-10-09 21:40:30 +00:00
Paul Saab
4787fd37af Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by:	billf (with modifications by me)
Inspired by:	Dave McKay (aka pm aka Packet Magnet)
Reviewed by:	peter
MFC after:	2 weeks
2001-10-05 07:06:32 +00:00
Robert Watson
94088977c9 o Rename u_cansee() to cr_cansee(), making the name more comprehensible
in the face of a rename of ucred to cred, and possibly generally.

Obtained from:	TrustedBSD Project
2001-09-20 21:45:31 +00:00
Mike Silbersack
b0e3ad758b Much delayed but now present: RFC 1948 style sequence numbers
In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented.  Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable.  In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper
2001-08-22 00:58:16 +00:00
Peter Wemm
57e119f6f2 Fix a warning. 2001-07-27 00:04:39 +00:00
Peter Wemm
016517247f Patch up some style(9) stuff in tcp_new_isn() 2001-07-27 00:03:49 +00:00
Peter Wemm
92971bd3f1 s/OpemBSD/OpenBSD/ 2001-07-27 00:01:48 +00:00
Mike Silbersack
2d610a5028 Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme.  Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme.  0 = random positive increments,
1 = the OpenBSD algorithm.  1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net
2001-07-08 02:20:47 +00:00
David Malone
7ce87f1205 Allow getcred sysctl to work in jailed root processes. Processes can
only do getcred calls for sockets which were created in the same jail.
This should allow the ident to work in a reasonable way within jails.

PR:		28107
Approved by:	des, rwatson
2001-06-24 12:18:27 +00:00
Jonathan Lemon
f962cba5c3 Replace bzero() of struct ip with explicit zeroing of structure members,
which is faster.
2001-06-23 17:44:27 +00:00
Mike Silbersack
08517d530e Eliminate the allocation of a tcp template structure for each
connection.  The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection.  On large systems, this change should
free up a large number of mbufs.

Reviewed by:	bmilekic, jlemon, ru
MFC after: 2 weeks
2001-06-23 03:21:46 +00:00
Hajimu UMEMOTO
ff2428299f made sure to use the correct sa_len for rtalloc().
sizeof(ro_dst) is not necessarily the correct one.
this change would also fix the recent path MTU discovery problem for the
destination of an incoming TCP connection.

Submitted by:	JINMEI Tatuya <jinmei@kame.net>
Obtained from:	KAME
MFC after:	2 weeks
2001-06-20 12:32:48 +00:00
Hajimu UMEMOTO
3384154590 Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
  - The definitions of SADB_* in sys/net/pfkeyv2.h are still different
    from RFC2407/IANA assignment because of binary compatibility
    issue.  It should be fixed under 5-CURRENT.
  - ip6po_m member of struct ip6_pktopts is no longer used.  But, it
    is still there because of binary compatibility issue.  It should
    be removed under 5-CURRENT.

Reviewed by:	itojun
Obtained from:	KAME
MFC after:	3 weeks
2001-06-11 12:39:29 +00:00
Peter Wemm
0978669829 "Fix" the previous initial attempt at fixing TUNABLE_INT(). This time
around, use a common function for looking up and extracting the tunables
from the kernel environment.  This saves duplicating the same function
over and over again.  This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.
2001-06-08 05:24:21 +00:00
Peter Wemm
4422746fdf Back out part of my previous commit. This was a last minute change
and I botched testing.  This is a perfect example of how NOT to do
this sort of thing. :-(
2001-06-07 03:17:26 +00:00
Peter Wemm
81930014ef Make the TUNABLE_*() macros look and behave more consistantly like the
SYSCTL_*() macros.  TUNABLE_INT_DECL() was an odd name because it didn't
actually declare the int, which is what the name suggests it would do.
2001-06-06 22:17:08 +00:00
Jesper Skriver
d1745f454d Say goodbye to TCP_COMPAT_42
Reviewed by:	wollman
Requested by:	wollman
2001-04-20 11:58:56 +00:00
Kris Kennaway
f0a04f3f51 Randomize the TCP initial sequence numbers more thoroughly.
Obtained from:	OpenBSD
Reviewed by:	jesper, peter, -developers
2001-04-17 18:08:01 +00:00
Jesper Skriver
b77d155dd3 MFC candidate.
Change code from PRC_UNREACH_ADMIN_PROHIB to PRC_UNREACH_PORT for
ICMP_UNREACH_PROTOCOL and ICMP_UNREACH_PORT

And let TCP treat PRC_UNREACH_PORT like PRC_UNREACH_ADMIN_PROHIB

This should fix the case where port unreachables for udp returned
ENETRESET instead of ECONNREFUSED

Problem found by:	Bill Fenner <fenner@research.att.com>
Reviewed by:		jlemon
2001-03-28 14:13:19 +00:00
Poul-Henning Kamp
462b86fe91 <sys/queue.h> makeover. 2001-03-16 20:00:53 +00:00
Jonathan Lemon
c693a045de Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly.
For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.

Clean up some layering violations (access to tcp internals from in_pcb)
2001-02-26 21:19:47 +00:00
Jonathan Lemon
c484d1a38c When converting soft error into a hard error, drop the connection. The
error will be passed up to the user, who will close the connection, so
it does not appear to make a sense to leave the connection open.

This also fixes a bug with kqueue, where the filter does not set EOF
on the connection, because the connection is still open.

Also remove calls to so{rw}wakeup, as we aren't doing anything with
them at the moment anyway.

Reviewed by: alfred, jesper
2001-02-23 21:07:06 +00:00
Jonathan Lemon
e4bb5b0572 Allow ICMP unreachables which map into PRC_UNREACH_ADMIN_PROHIB to
reset TCP connections which are in the SYN_SENT state, if the sequence
number in the echoed ICMP reply is correct.  This behavior can be
controlled by the sysctl net.inet.tcp.icmp_may_rst.

Currently, only subtypes 2,3,10,11,12 are treated as such
(port, protocol and administrative unreachables).

Assocaiate an error code with these resets which is reported to the
user application: ENETRESET.

Disallow resetting TCP sessions which are not in a SYN_SENT state.

Reviewed by: jesper, -net
2001-02-23 20:51:46 +00:00
Jesper Skriver
d1c54148b7 Redo the security update done in rev 1.54 of src/sys/netinet/tcp_subr.c
and 1.84 of src/sys/netinet/udp_usrreq.c

The changes broken down:

- remove 0 as a wildcard for addresses and port numbers in
  src/sys/netinet/in_pcb.c:in_pcbnotify()
- add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify
  all sessions with the specific remote address.
- change
  - src/sys/netinet/udp_usrreq.c:udp_ctlinput()
  - src/sys/netinet/tcp_subr.c:tcp_ctlinput()
  to use in_pcbnotifyall() to notify multiple sessions, instead of
  using in_pcbnotify() with 0 as src address and as port numbers.
- remove check for src port == 0 in
  - src/sys/netinet/tcp_subr.c:tcp_ctlinput()
  - src/sys/netinet/udp_usrreq.c:udp_ctlinput()
  as they are no longer needed.
- move handling of redirects and host dead from in_pcbnotify() to
  udp_ctlinput() and tcp_ctlinput(), so they will call
  in_pcbnotifyall() to notify all sessions with the specific
  remote address.

Approved by:	jlemon
Inspired by:    NetBSD
2001-02-22 21:23:45 +00:00
Jesper Skriver
58e9b41722 Only call in_pcbnotify if the src port number != 0, as we
treat 0 as a wildcard in src/sys/in_pbc.c:in_pcbnotify()

It's sufficient to check for src|local port, as we'll have no
sessions with src|local port == 0

Without this a attacker sending ICMP messages, where the attached
IP header (+ 8 bytes) has the address and port numbers == 0, would
have the ICMP message applied to all sessions.

PR:		kern/25195
Submitted by:	originally by jesper, reimplimented by jlemon's advice
Reviewed by:	jlemon
Approved by:	jlemon
2001-02-20 23:25:04 +00:00
Brian Feldman
c0511d3b58 Switch to using a struct xucred instead of a struct xucred when not
actually in the kernel.  This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there.  As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size.  This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by:	bde
2001-02-18 13:30:20 +00:00
Poul-Henning Kamp
90fcbbd635 Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify
Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h

Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input

In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB
or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG

Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst
to reflect the fact that we also react on ICMP unreachables that
are not administrative prohibited.  Also update the comments to
reflect this.

In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat
PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different.

PR:		23986
Submitted by:	Jesper Skriver <jesper@skriver.dk>
2001-02-18 09:34:55 +00:00
Poul-Henning Kamp
fc2ffbe604 Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)
2001-02-04 13:13:25 +00:00
Poul-Henning Kamp
442fad6798 Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and
to be configurable with respect to acting only in SYN or in all TCP states.

PR:		23665
Submitted by:	Jesper Skriver <jesper@skriver.dk>
2000-12-24 10:57:21 +00:00
Poul-Henning Kamp
b11d7a4a2f We currently does not react to ICMP administratively prohibited
messages send by routers when they deny our traffic, this causes
a timeout when trying to connect to TCP ports/services on a remote
host, which is blocked by routers or firewalls.

rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually
requi re that we treat such a message for a TCP session, that we
treat it like if we had recieved a RST.

quote begin.

            A Destination Unreachable message that is received MUST be
            reported to the transport layer.  The transport layer SHOULD
            use the information appropriately; for example, see Sections
            4.1.3.3, 4.2.3.9, and 4.2.4 below.  A transport protocol
            that has its own mechanism for notifying the sender that a
            port is unreachable (e.g., TCP, which sends RST segments)
            MUST nevertheless accept an ICMP Port Unreachable for the
            same purpose.

quote end.

I've written a small extension that implement this, it also create
a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if
this new behaviour is activated.

When it's activated (set to 1) we'll treat a ICMP administratively
prohibited message (icmp type 3 code 9, 10 and 13) for a TCP
sessions, as if we recived a TCP RST, but only if the TCP session
is in SYN_SENT state.

The reason for only reacting when in SYN_SENT state, is that this
will solve the problem, and at the same time minimize the risk of
this being abused.

I suggest that we enable this new behaviour by default, but it
would be a change of current behaviour, so if people prefer to
leave it disabled by default, at least for now, this would be ok
for me, the attached diff actually have the sysctl set to 0 by
default.

PR:		23086
Submitted by:	Jesper Skriver <jesper@skriver.dk>
2000-12-16 19:42:06 +00:00
Jonathan Lemon
e82ac18e52 Revert the last commit to the callout interface, and add a flag to
callout_init() indicating whether the callout is safe or not.  Update
the callers of callout_init() to reflect the new interface.

Okayed by: Jake
2000-11-25 06:22:16 +00:00
Poul-Henning Kamp
46aa3347cb Convert all users of fldoff() to offsetof(). fldoff() is bad
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.

Define __offsetof() in <machine/ansi.h>

Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>

Remove myriad of local offsetof() definitions.

Remove includes of <stddef.h> in kernel code.

NB: Kernelcode should *never* include from /usr/include !

Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.

Deprecate <struct.h> with a warning.  The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.

Paritials reviews by:   various.
Significant brucifications by:  bde
2000-10-27 11:45:49 +00:00
Jun-ichiro itojun Hagino
d31944e6ec be careful on mbuf overrun on ctlinput.
short icmp6 packet may be able to panic the kernel.
sync with kame.
2000-10-23 07:11:01 +00:00
Kris Kennaway
be515d91ad Use stronger random number generation for TCP_ISSINCR and tcp_iss.
Reviewed by:	peter, jlemon
2000-09-29 01:37:19 +00:00
Bosko Milekic
9d8c8a672c Finally make do_tcpdrain sysctl live under correct parent, _net_inet_tcp,
as opposed to _debug. Like before, default value remains 1.
2000-09-25 23:40:22 +00:00
Jayanth Vijayaraghavan
e7f3269307 When a connection is being dropped due to a listen queue overflow,
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.

Reviewed by: Peter Wemm,asmodai
2000-07-21 23:26:37 +00:00
Jun-ichiro itojun Hagino
686cdd19b1 sync with kame tree as of july00. tons of bug fixes/improvements.
API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
  (also syntax change)
2000-07-04 16:35:15 +00:00
Poul-Henning Kamp
77978ab8bc Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.
Pointed out by:	bde
2000-07-04 11:25:35 +00:00
Poul-Henning Kamp
82d9ae4e32 Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:
Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

        -sysctl_vm_zone SYSCTL_HANDLER_ARGS
        +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
2000-07-03 09:35:31 +00:00
Yoshinobu Inoue
34e3e010b6 Let initialize th_sum before in6_cksum(), again.
Without this fix, all IPv6 TCP RST packet has wrong cksum value,
so IPv6 connect() trial to 5.0 machine won't fail until tcp connect timeout,
when they should fail soon.

Thanks to haro@tk.kubota.co.jp (Munehiro Matsuda) for his much debugging
help and detailed info.
2000-04-19 15:05:00 +00:00
Jonathan Lemon
db4f9cc703 Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.
2000-03-27 19:14:27 +00:00
Paul Saab
f885f63606 Limit the maximum permissible TCP window size to 65535 octets if
window scaling is disabled.

PR:		kern/16914
Submitted by:	Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
Reviewed by:	wollman
Approved by:	jkh
2000-02-28 21:18:21 +00:00
Yoshinobu Inoue
e2de10abe4 Fix the bug that IPv4 ttl is not initialized when AF_INET6 socket is used
for IPv4 communication.(IPv4 mapped IPv6 addr.)
Also removed IPv6 hoplimit initialization because it is alway done at
tcp_output.

Confirmed by: Bernd Walter <ticso@cicely5.cicely.de>
2000-01-25 01:05:18 +00:00
Yoshinobu Inoue
3a2a9f7976 Fixed the problem that IPsec connection hangs when bigger data is sent.
-opt_ipsec.h was missing on some tcp files (sorry for basic mistake)
  -made buildable as above fix
  -also added some missing IPv4 mapped IPv6 addr consideration into
   ipsec4_getpolicybysock
2000-01-15 14:56:38 +00:00
Yoshinobu Inoue
0567ae029a Added missing 'else' for 'if (isipv6)' at IPv6 length setting in tcp_respond().
By this bug, IPv6 reset was not sent.
(I checked around same kind of bug, but no other found.)
2000-01-15 14:34:56 +00:00
Yoshinobu Inoue
595ad28111 Removed wrong(unnecessary) & operators for pointer, in ipsec_hdrsiz_tcp().
This must be one of the reason why connections over IPsec hangs for
bigger packets.(which was reported on freebsd-current@freebsd.org)

But there still seems to be another bug and the problem is not yet fixed.
2000-01-15 05:39:28 +00:00
Yoshinobu Inoue
21ab895ff5 Clear rt after RTFREE. This might have sometime caused kernel panic at rtfree()
on INET6 enabled environment.
2000-01-13 14:21:30 +00:00
Yoshinobu Inoue
72e174286c removed incorrect ip6 length setting for IPv6 tcp reset packet. 2000-01-13 05:18:30 +00:00
Yoshinobu Inoue
fb59c426ff tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
2000-01-09 19:17:30 +00:00
Mike Smith
052a6aad2c Make tcp_drain() actually do something. When invoked (usually as a
desperation measure in low-memory situations), walk the tcpbs and
flush the reassembly queues.

This behaviour is currently controlled by the debug.do_tcpdrain sysctl
(defaults to on).

Submitted by:	Bosko Milekic <bmilekic@dsuper.net>
Reviewed by:	wollman
1999-12-28 23:18:33 +00:00
Yoshinobu Inoue
cfa1ca9dfa udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translater daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
1999-12-07 17:39:16 +00:00
Yoshinobu Inoue
82cd038d51 KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
1999-11-22 02:45:11 +00:00
Yoshinobu Inoue
76429de41a KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project
1999-11-05 14:41:39 +00:00
Brian Feldman
2f9a21326c Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.
1999-09-19 02:17:02 +00:00
Jonathan Lemon
9b8b58e033 Restructure TCP timeout handling:
- eliminate the fast/slow timeout lists for TCP and instead use a
    callout entry for each timer.
  - increase the TCP timer granularity to HZ
  - implement "bad retransmit" recovery, as presented in
    "On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by:	jlemon, wollmann
1999-08-30 21:17:07 +00:00
Peter Wemm
c3aac50f28 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
Jonathan Lemon
6da3d6578b Add readonly OID ``net.inet.tcp.tcbhashsize'' so it is possible to
discover the size of the TCB hashtable on a running system.
1999-08-26 19:52:17 +00:00
Brian Feldman
490d50b60a Two new sysctls: net.inet.tcp.getcred and net.inet.udp.getcred. These take
a sockaddr_in[2] (local, then remote) and return a struct ucred. Example
code for these is at:
	http://www.FreeBSD.org/~green/inetd_ident.patch
	http://www.FreeBSD.org/~green/freebsd4.c (for pidentd)

Reviewed by:	bde
1999-07-11 18:32:46 +00:00
Mike Smith
35ec852af5 Use the new tunable macros for the net.inet.tcp.tcbhashsize tunable. 1999-07-05 08:46:55 +00:00
Tor Egge
23fc6cddce Close a race window where a tcp socket is closed while tcp_pcblist is
copying out tcp socket info, causing a NULL pointer to be dereferenced.
1999-06-16 19:05:17 +00:00
Bill Fumerola
3d177f465a Add sysctl descriptions to many SYSCTL_XXXs
PR:		kern/11197
Submitted by:	Adrian Chadd <adrian@FreeBSD.org>
Reviewed by:	billf(spelling/style/minor nits)
Looked at by:	bde(style)
1999-05-03 23:57:32 +00:00
Poul-Henning Kamp
75c1354190 This Implements the mumbled about "Jail" feature.
This is a seriously beefed up chroot kind of thing.  The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact:  "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

   I have no scripts for setting up a jail, don't ask me for them.

   The IP number should be an alias on one of the interfaces.

   mount a /proc in each jail, it will make ps more useable.

   /proc/<pid>/status tells the hostname of the prison for
   jailed processes.

   Quotas are only sensible if you have a mountpoint per prison.

   There are no privisions for stopping resource-hogging.

   Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by:   http://www.rndassociates.com/
Run for almost a year by:       http://www.servetheweb.com/
1999-04-28 11:38:52 +00:00
Mike Smith
ea95d6917a Nuke all the stupid ffs() stuff and use powerof2() instead.
Submitted by:	Bruce Evans <bde@zeta.org.au>
1999-02-04 03:27:43 +00:00
Mike Smith
609b6256fd Fix power-of-2 check for the TCB hash size.
Submitted by:	Brian Feldman <green@unixhelp.org>
1999-02-04 03:02:56 +00:00
Mike Smith
aa1458495d Make TCBHASHSIZE a boot-time tunable as well, taking its value from the
variable net.inet.tcp.tcbhashsize.

Requested by:	David Filo <filo@yahoo-inc.com>
1999-02-03 08:59:30 +00:00
Archie Cobbs
f1d19042b0 The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.
1998-12-07 21:58:50 +00:00
Guido van Rooij
d285db5598 The below patch helps to reduce the leakage of internal socket information
when a TCP "stealth" scan is directed at a *BSD box by ensuring the window
is 0 for all RST packets generated through tcp_respond()
Reviewed by:	Don Lewis <Don.Lewis@tsc.tdk.com>
Obtained from:	Bugtraq (from: Darren Reed <avalon@COOMBS.ANU.EDU.AU>)
1998-11-15 21:35:09 +00:00
Poul-Henning Kamp
19ddafa3b3 RFC 1644 has the status "Experimental Protocol", which means:
4.1.4.  Experimental Protocol

      A system should not implement an experimental protocol unless it
      is participating in the experiment and has coordinated its use of
      the protocol with the developer of the protocol.

Pointed out by:	Steinar Haug <sthaug@nethelp.no>
1998-09-06 08:17:35 +00:00
Doug Rabson
6effc71332 Re-implement tcp and ip fragment reassembly to not store pointers in the
ip header which can't work on alpha since pointers are too big.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
1998-08-24 07:47:39 +00:00
Garrett Wollman
98271db4d5 Convert socket structures to be type-stable and add a version number.
Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time.  This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
1998-05-15 20:11:40 +00:00
Bruce Evans
8781d8e928 Fixed style bugs (mostly) in previous commit. 1998-03-28 10:18:26 +00:00
Garrett Wollman
3d4d47f398 Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation.  (Some hackery ensures that the tcpcb is
reasonably aligned.)  Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
use.
1998-03-24 18:06:34 +00:00
David Greenman
c3229e05a3 Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
   to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
   hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
   be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
   the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
         recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
Eivind Eklund
92252381f3 Make TCP_COMPAT_42 a new style option. 1998-01-25 04:23:33 +00:00
Julian Elischer
45d6875df6 Fix an incredibly horrible bug in the ipfw code
where if you are using the "reset tcp" firewall command,
the kernel would write ethernet headers onto random kernel stack locations.

Fought to the death by: terry, julian, archie.
fix valid for 2.2 series as well.
1997-12-19 03:36:15 +00:00
Bruce Evans
55b211e3af Removed unused #includes. 1997-10-28 15:59:26 +00:00
Joerg Wunsch
0cc12cc57e Make TCPDEBUG a new-style option. 1997-09-16 18:36:06 +00:00
Bruce Evans
514ede0953 Fixed gratuitous ANSIisms. 1997-09-16 11:44:05 +00:00
David Greenman
ca98b82c8d Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman
1997-04-03 05:14:45 +00:00
David Greenman
ddd79a9790 Improved performance of hash algorithm while (hopefully) not reducing
the quality of the hash distribution. This does not fix a problem dealing
with poor distribution when using lots of IP aliases and listening
on the same port on every one of them...some other day perhaps; fixing
that requires significant code changes.
The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>
1997-03-03 09:23:37 +00:00
Peter Wemm
6875d25465 Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.
1997-02-22 09:48:43 +00:00
Garrett Wollman
d0390e0570 Fix the mechanism for choosing wehether to save the slow-start threshold
in the route.  This allows us to remove the unconditional setting of the
pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF
should actually work again.  While we're at it:

- Convert udp_usrreq from `mondo switch statement from Hell' to new-style.
- Delete old TCP mondo switch statement from Hell, which had previously
  been diked out.
1997-02-14 18:15:53 +00:00
Jordan K. Hubbard
1130b656e5 Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore.  This update would have been
insane otherwise.
1997-01-14 07:20:47 +00:00
Garrett Wollman
5e2d069649 Eliminate some more references to separate ip_v and ip_hl fields. 1996-07-24 18:46:19 +00:00
Garrett Wollman
51fb392203 Better selection of initial retransmit timeout when no cached
RTT information is available.

Submitted by: kbracey@art.acorn.co.uk (Kevin Bracey)
(slightly modified by me)
1996-06-14 17:17:32 +00:00
Garrett Wollman
6da5712b60 Correct formula for TCP RTO calculation. Also try to do a better job in
filling in a new PCB's rttvar (but this is not the last word on the subject).
And get rid of `#ifdef RTV_RTT', it's been true for four years now...
1996-06-05 16:57:38 +00:00
Garrett Wollman
ebcae94e4f In tcp_respond(), check that ro->ro_rt is non-null before RTFREEing
it.
1996-03-27 18:23:16 +00:00
Garrett Wollman
9512fd2ec6 Make sure tcp_respond() always calls ip_output() with a valid
route pointer.  This has no effect in the current ip_output(),
but my version requires that ip_output() always be passed a route.
1996-03-22 18:11:25 +00:00
David Greenman
2ee45d7d28 Move or add #include <queue.h> in preparation for upcoming struct socket
changes.
1996-03-11 15:13:58 +00:00
Garrett Wollman
bda4c85ae6 Fix a nagging divide-by-zero error resulting from the MTU discovery code
getting triggered at a bad time.
1995-12-20 17:42:28 +00:00
Bruce Evans
b62d102cbb Uniformized pr_ctlinput protosw functions. The third arg is now `void
*' instead of caddr_t and it isn't optional (it never was).  Most of the
netipx (and netns) pr_ctlinput functions abuse the second arg instead of
using the third arg but fixing this is beyond the scope of this round
of changes.
1995-12-16 02:14:44 +00:00
Garrett Wollman
b7a44e3486 Path MTU Discovery is now standard. 1995-12-05 17:46:50 +00:00
Poul-Henning Kamp
0312fbe97d New style sysctl & staticize alot of stuff. 1995-11-14 20:34:56 +00:00
Poul-Henning Kamp
98163b98dc Start adding new style sysctl here too. 1995-11-09 20:23:09 +00:00
Garrett Wollman
3d1f141b23 The ability to administratively change the MTU of an interface presents
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle.  ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.
1995-10-16 18:21:26 +00:00
Garrett Wollman
3abc79d2ee The additional checks involving sequence numbers in MTU discovery resends
turned out not to be necessary; simply watching for MTU decreases (which
we already did) automagically eliminates all the cases we were trying to
protect against.
1995-10-12 17:37:25 +00:00
Garrett Wollman
143d7a5499 More MTU discovery: avoid over-retransmission if route changes in the
middle of a fully-open window.  Also, keep track of how many retransmits
we do as a result of MTU discovery.  This may actually do more work than
necessary, but it's an unusual condition...

Suggested by: Janey Hoe <janey@lcs.mit.edu>
1995-10-10 17:45:43 +00:00
Garrett Wollman
e79adb8ed6 Finish 4.4-Lite-2 merge: randomize TCP initial sequence numbers
to make ISS-guessing spoofing attacks harder.
1995-10-03 16:54:17 +00:00
Garrett Wollman
f001bbb882 Correct spelling error in MTUDISC code. 1995-09-22 17:43:37 +00:00
Garrett Wollman
f138387af1 Add support in TCP for Path MTU discovery. This is highly experimental
and gated on `options MTUDISC' in the source.  It is also practically
untested becausse (sniff!) I don't have easy access to a network with
an MTU of less than an Ethernet.  If you have a small MTU network,
please try it and tell me if it works!
1995-09-20 21:00:59 +00:00
Garrett Wollman
5cbf3e086c Initial back-end support for IP MTU discovery, gated on MTUDISC. The support
for TCP has yet to be written.
1995-09-18 15:51:40 +00:00
Garrett Wollman
fc97827135 Keep track of the number of samples through the srtt filter so that we
know better when to cache values in the route, rather than relying on a
heuristic involving sequence numbers that broke when tcp_sendspace
was increased to 16k.
1995-06-29 18:11:24 +00:00
Garrett Wollman
9167720192 Now that we've gone to all sorts of effort to allow TCP to cache some of
its connection parameters, we want to keep statistics on how often this
actually happens to see whether there is any work that needs to be done in
TCP itself.

Suggested by: John Wroclawski <jtw@lcs.mit.edu>
1995-06-19 16:45:33 +00:00
Rodney W. Grimes
9b2e535452 Remove trailing whitespace. 1995-05-30 08:16:23 +00:00
David Greenman
15bd2b4385 Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.
1995-04-09 01:29:31 +00:00
Bruce Evans
b5e8ce9f12 Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'.  Fix all the bugs found.  There were no serious
ones.
1995-03-16 18:17:34 +00:00
Nate Williams
9617d8b1f6 Removed unnecessary define for TCPOUTFLAGS since they are not used. 1995-03-06 02:49:24 +00:00
Garrett Wollman
41f82abe5a Transaction TCP support now standard. Hack away! 1995-02-16 00:55:44 +00:00
Garrett Wollman
a0292f2375 Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated.  It is not clear
what the correct solution to this problem is, if any.  If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).
1995-02-09 23:13:27 +00:00
Poul-Henning Kamp
ac0776aed7 Cosmetics: silences gcc -Wall. 1994-10-08 22:39:58 +00:00
Poul-Henning Kamp
623ae52e4e GCC cleanup.
Reviewed by:
Submitted by:
Obtained from:
1994-10-02 17:48:58 +00:00
David Greenman
3c4dd3568f Added $Id$ 1994-08-02 07:55:43 +00:00
Rodney W. Grimes
26f9a76710 The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.
Reviewed by:	Rodney W. Grimes
Submitted by:	John Dyson and David Greenman
1994-05-25 09:21:21 +00:00
Rodney W. Grimes
df8bae1de4 BSD 4.4 Lite Kernel Sources 1994-05-24 10:09:53 +00:00