181 Commits

Author SHA1 Message Date
glebius
d907e70991 o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely
bad under high load. For example with 40k sockets and 25k tcptw
  entries, connect() syscall can run for seconds. Debugging showed
  that it iterates the cycle millions times and purges thousands of
  tcptw entries at a time.
  Besides practical unusability this change is architecturally
  wrong. First, in_pcblookup_local() is used in connect() and bind()
  syscalls. No stale entries purging shouldn't be done here. Second,
  it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
  that was removed in rev. 1.78 by rwatson. The commit log of this
  revision tells nothing about the reason cycle was removed. Now
  we need this cycle, since major cleaner of stale tcptw structures
  is removed.
o Disable probably necessary, but now unused
  tcp_twrecycleable() function.

Reviewed by:	ru
2006-09-06 13:56:35 +00:00
ups
ee0a5eb928 Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by:	rwatson@, mohans@
MFC after:	1 week
2006-07-18 22:34:27 +00:00
bz
ed6ddd5a31 Use INPLOOKUP_WILDCARD instead of just 1 more consistently.
OKed by: rwatson (some weeks ago)
2006-06-29 10:49:49 +00:00
pjd
7f09680f0c - Use suser_cred(9) instead of directly checking cr_uid.
- Change the order of conditions to first verify that we actually need
  to check for privileges and then eventually check them.

Reviewed by:	rwatson
2006-06-27 11:35:53 +00:00
rwatson
88f1a971b9 Minor restyling and cleanup around ipport_tick().
MFC after:	1 month
2006-06-02 08:18:27 +00:00
marcel
4ede1798db In in_pcbdrop(), fix !INVARIANTS build. 2006-04-25 23:23:13 +00:00
rwatson
5d598011b5 Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP,
into in_pcbdrop().  Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.

MFC after:	3 months
2006-04-25 11:17:35 +00:00
rwatson
935d16472d Assert the inpcb lock when rehashing an inpcb.
Improve consistency of style around some current assertions.

MFC after:	3 months
2006-04-22 19:15:20 +00:00
rwatson
2a5f091dc3 Remove pcbinfo locking from in_setsockaddr() and in_setpeeraddr();
holding the inpcb lock is sufficient to prevent races in reading
the address and port, as both the inpcb lock and pcbinfo lock are
required to change the address/port.

Improve consistency of spelling in assertions about inp != NULL.

MFC after:	3 months
2006-04-22 19:10:02 +00:00
rwatson
2e3d21db7b Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL.  We currently do allow this to happen, but may want to remove that
possibility in the future.  This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled.  This will
be cleaned up in the future.

Found by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after:	3 months
2006-04-04 12:26:07 +00:00
rwatson
d67aff8ec4 Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed.  Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after:	3 months
2006-04-03 13:33:55 +00:00
rwatson
71cc03392b Break out in_pcbdetach() into two functions:
- in_pcbdetach(), which removes the link between an inpcb and its
  socket.

- in_pcbfree(), which frees a detached pcb.

Unlike the previous in_pcbdetach(), neither of these functions will
attempt to conditionally free the socket, as they are responsible only
for managing in_pcb memory.  Mirror these changes into in6_pcbdetach()
by breaking it into in6_pcbdetach() and in6_pcbfree().

While here, eliminate undesired checks for NULL inpcb pointers in
sockets, as we will now have as an invariant that sockets will always
have valid so_pcb pointers.

MFC after:	3 months
2006-04-01 16:04:42 +00:00
andre
5c75fa42eb In in_pcbconnect_setup() reduce code duplication and use ip_rtaddr()
to find the outgoing interface for this connection.

Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	2 weeks
2006-02-16 15:45:28 +00:00
ume
4185f9e81f Never select the PCB that has INP_IPV6 flag and is bound to :: if
we have another PCB which is bound to 0.0.0.0.  If a PCB has the
INP_IPV6 flag, then we set its cost higher than IPv4 only PCBs.

Submitted by:	Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from:	KAME
MFC after:	1 week
2006-02-04 07:59:17 +00:00
rwatson
695863d37b Convert remaining functions to ANSI C function declarations; remove
'register' where present.

MFC after:	1 week
2006-01-22 01:16:25 +00:00
rwatson
d49765d3fd Remove no-op spl references in in_pcb.c, since in_pcb locking has been
basically complete for several years now.  Update one spl comment to
reference the locking strategy.

MFC after:	3 days
2005-07-19 12:24:27 +00:00
rwatson
be143d8ea5 Commit correct version of previous commit (in_pcb.c:1.164). Use the
local variables as currently named.

MFC after:	7 days
2005-06-01 11:43:39 +00:00
rwatson
ad803f0089 Assert pcbinfo lock in in_pcbdisconnect() and in_pcbdetach(), as the
global pcb lists are modified.

MFC after:	7 days
2005-06-01 11:39:42 +00:00
maxim
58adac10e7 o Tweak the comment a bit. 2005-04-08 08:43:21 +00:00
maxim
a31bda3d3c o Disable random port allocation when ip.portrange.first ==
ip.portrange.last and there is the only port for that because:
a) it is not wise; b) it leads to a panic in the random ip port
allocation code.  In general we need to disable ip port allocation
randomization if the last - first delta is ridiculous small.

PR:		kern/79342
Spotted by:	Anjali Kulkarni
Glanced at by:	silby
MFC after:	2 weeks
2005-04-08 08:42:10 +00:00
maxim
56ed6f8b75 o Document net.inet.ip.portrange.random* sysctls.
o Correct a comment about random port allocation threshold
implementation.

Reviewed by:	silby, ru
MFC after:	3 days
2005-03-23 09:26:38 +00:00
glebius
606d160676 We can make code simplier after last change.
Noticed by:	Andrew Thompson
2005-02-22 08:35:24 +00:00
glebius
5f0d747b30 In in_pcbconnect_setup() remove a check that route points at
loopback interface. Nobody have explained me sense of this check.
It breaks connect() system call to a destination address which is
loopback routed (e.g. blackholed).

Reviewed by:	silence on net@
MFC after:	2 weeks
2005-02-22 07:39:15 +00:00
imp
a50ffc2912 /* -> /*- for license, minor formatting changes 2005-01-07 01:45:51 +00:00
silby
c79cd91efc Port randomization leads to extremely fast port reuse at high
connection rates, which is causing problems for some users.

To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.

Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds.  These thresholds may be tuned via sysctl.

Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.

MFC After:	20 seconds
2005-01-02 01:50:57 +00:00
rwatson
4b81ce6dd2 Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
  mutex, avoiding sofree() having to drop the socket mutex and re-order,
  which could lead to races permitting more than one thread to enter
  sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
  the protocol to the socket, preventing races in clearing and
  evaluation of the reference such that sofree() might be called more
  than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket.  The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets.  The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after:	3 days
Reviewed by:	dwhite
Discussed with:	gnn, dwhite, green
Reported by:	Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by:	Vlad <marchenko at gmail dot com>
2004-10-18 22:19:43 +00:00
rwatson
e455dd69f8 Assign so_pcb to NULL rather than 0 as it's a pointer.
Spotted by:	dwhite
2004-09-29 04:01:13 +00:00
rwatson
224ba75d82 In in_pcbrehash(), do assert the inpcb lock as well as the pcbinfo lock. 2004-08-19 01:11:17 +00:00
rwatson
4bd194b32a Assert the locks of inpcbinfo's and inpcb's passed into in_pcbconnect()
and in_pcbconnect_setup(), since these functions frob the port and
address state of inpcbs.
2004-08-11 04:35:20 +00:00
yar
1d71ae12e0 Disallow a particular kind of port theft described by the following scenario:
Alice is too lazy to write a server application in PF-independent
	manner.  Therefore she knocks up the server using PF_INET6 only
	and allows the IPv6 socket to accept mapped IPv4 as well.  An evil
	hacker known on IRC as cheshire_cat has an account in the same
	system.  He starts a process listening on the same port as used
	by Alice's server, but in PF_INET.  As a consequence, cheshire_cat
	will distract all IPv4 traffic supposed to go to Alice's server.

Such sort of port theft was initially enabled by copying the code that
implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for
the implied case of the same owner for both connections.  After this
change, the above scenario will be impossible.  In the same setting,
the user who attempts to start his server last will get EADDRINUSE.

Of course, using IPv4 mapped to IPv6 leads to security complications
in the first place, but there is no reason to make it even more unsafe.

This change doesn't apply to KAME since it affects a FreeBSD-specific
part of the code.  It doesn't modify the out-of-box behaviour of the
TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default.

MFC after:	1 month
2004-07-28 13:03:07 +00:00
cperciva
d9fecc83c8 Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with:	rwatson, scottl
Requested by:	jhb
2004-07-26 07:24:04 +00:00
maxim
32bf9d060d o connect(2): if there is no a route to the destination
do not pick up the first local ip address for the source
ip address, return ENETUNREACH instead.

Submitted by:	Gleb Smirnoff
Reviewed by:	-current (silence)
2004-06-16 10:02:36 +00:00
rwatson
f1bc833e95 Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
  manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
  copy while holding the socket lock, then release the socket lock
  to externalize to userspace.
2004-06-13 02:50:07 +00:00
rwatson
82295697cd Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
yar
45f0ba1547 When checking for possible port theft, skip over a TCP inpcb
unless it's in the closed or listening state (remote address
== INADDR_ANY).

If a TCP inpcb is in any other state, it's impossible to steal
its local port or use it for port theft.  And if there are
both closed/listening and connected TCP inpcbs on the same
localIP:port couple, the call to in_pcblookup_local() will
find the former due to the design of that function.

No objections raised in:	-net, -arch
MFC after:			1 month
2004-05-20 06:35:02 +00:00
silby
67d79f0cfb Wrap two long lines in the previous commit. 2004-04-23 23:29:49 +00:00
silby
b4c0a798a2 Take out an unneeded variable I forgot to remove in the last commit,
and make two small whitespace fixes so that diffs vs rev 1.142 are minimal.
2004-04-22 08:34:55 +00:00
silby
760a7deec6 Simplify random port allocation, and add net.inet.ip.portrange.randomized,
which can be used to turn off randomized port allocation if so desired.

Requested by:	alfred
2004-04-22 08:32:14 +00:00
silby
f0d28bbf0c Switch from using sequential to random ephemeral port allocation,
implementation taken directly from OpenBSD.

I've resisted committing this for quite some time because of concern over
TIME_WAIT recycling breakage (sequential allocation ensures that there is a
long time before ports are recycled), but recent testing has shown me that
my fears were unwarranted.
2004-04-20 06:45:10 +00:00
imp
b49b7fe799 Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson
2004-04-07 20:46:16 +00:00
bde
014c3b4511 Fixed misspelling of IPPORT_MAX as USHRT_MAX. Don't include <sys/limits.h>
to implement this mistake.

Fixed some nearby style bugs (initialization in declaration, misformatting
of this initialization, missing blank line after the declaration, and
comparision of the non-boolean result of the initialization with 0 using
"!".  In KNF, "!" is not even used to compare booleans with 0).
2004-04-06 10:59:11 +00:00
pjd
49554d1bd8 Reduce 'td' argument to 'cred' (struct ucred) argument in those functions:
- in_pcbbind(),
	- in_pcbbind_setup(),
	- in_pcbconnect(),
	- in_pcbconnect_setup(),
	- in6_pcbbind(),
	- in6_pcbconnect(),
	- in6_pcbsetport().
"It should simplify/clarify things a great deal." --rwatson

Requested by:	rwatson
Reviewed by:	rwatson, ume
2004-03-27 21:05:46 +00:00
pjd
02bc133779 Remove unused argument.
Reviewed by:	ume
2004-03-27 20:41:32 +00:00
pjd
89f5b6c374 Remove unused function.
It was used in FreeBSD 4.x, but now we're using cr_canseesocket().
2004-03-25 15:12:12 +00:00
rwatson
7866f6e355 Scrub unused variable zeroin_addr. 2004-03-10 01:01:04 +00:00
ume
703de5ccfd do not deref freed pointer
Submitted by:	"Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>
Reviewed by:	itojun
2004-01-13 09:51:47 +00:00
andre
4a037b3dd4 Make sure all uses of stack allocated struct route's are properly
zeroed.  Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.

Reviewed by:	sam (mentor)
Approved by:	re  (scottl)
2003-11-26 20:31:13 +00:00
sam
960b35f03a Split the "inp" mutex class into separate classes for each of divert,
raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness
complaints.

Reviewed by:	rwatson
Approved by:	re (rwatson)
2003-11-26 01:40:44 +00:00
tmm
5bae25b0ad bzero() the the sockaddr used for the destination address for
rtalloc_ign() in in_pcbconnect_setup() before it is filled out.
Otherwise, stack junk would be left in sin_zero, which could
cause host routes to be ignored because they failed the comparison
in rn_match().
This should fix the wrong source address selection for connect() to
127.0.0.1, among other things.

Reviewed by:	sam
Approved by:	re (rwatson)
2003-11-23 03:02:00 +00:00
andre
6164d7c280 Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table.  Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination.  Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by:	sam (mentor), bms
Reviewed by:	-net, -current, core@kame.net (IPv6 parts)
Approved by:	re (scottl)
2003-11-20 20:07:39 +00:00