Commit Graph

386 Commits

Author SHA1 Message Date
Robert Watson
56856fbfb4 Remove an additional commented out reference to a possible future sx
lock.
2005-03-11 19:16:02 +00:00
Robert Watson
2b37548a71 When setting up a socket in socreate(), there's no need to lock the
socket lock around knlist_init(), so don't.

Hard code the setting of the socket reference count to 1 rather than
using soref() to avoid asserting the socket lock, since we've not yet
exposed the socket to other threads.

This removes two mutex operations from each socket allocation.
2005-03-11 16:30:02 +00:00
Robert Watson
5fab68b19e Remove suggestive sx_init() comment in soalloc(). We will have something
like this at some point, but for now it clutters the source.
2005-03-11 16:26:33 +00:00
Robert Watson
0daccb9c94 In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections.  As part of this state transition, solisten() calls into the
protocol to update protocol-layer state.  There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.

This change does the following:

- Pushes the socket state transition from the socket layer solisten() to
  to socket "library" routines called from the protocol.  This permits
  the socket routines to be called while holding the protocol mutexes,
  preventing a race exposing the incomplete socket state transition to TCP
  after the TCP state transition has completed.  The check for a socket
  layer state transition is performed by solisten_proto_check(), and the
  actual transition is performed by solisten_proto().

- Holds the socket lock for the duration of the socket state test and set,
  and over the protocol layer state transition, which is now possible as
  the socket lock is acquired by the protocol layer, rather than vice
  versa.  This prevents additional state related races in the socket
  layer.

This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another.  Similar changes are likely require
elsewhere in the socket/protocol code.

Reported by:		Peter Holm <peter@holm.cc>
Review and fixes from:	emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod:	gnn
2005-02-21 21:58:17 +00:00
Robert Watson
a00428ef92 In soreceive(), when considering delivery to a socket in SS_ISCONFIRMING,
only call the protocol's pru_rcvd() if the protocol has the flag
PR_WANTRCVD set.  This brings that instance of pru_rcvd() into line with
the rest, which do check the flag.

MFC after:	3 days
2005-02-20 15:54:44 +00:00
Robert Watson
a7ae36bc45 Correct a typo in the comment describing soreceive_rcvoob().
MFC after:	3 days
2005-02-18 19:15:22 +00:00
Robert Watson
1b5c4b15b4 In soconnect(), when resetting so->so_error, the socket lock is not
required due to a straight integer write in which minor races are not
a problem.
2005-02-18 19:13:51 +00:00
Robert Watson
78e436448f Move do_setopt_accept_filter() from uipc_socket.c to uipc_accf.c, where
the rest of the accept filter code currently lives.

MFC after:	3 days
2005-02-18 18:54:42 +00:00
Robert Watson
627de7fa2c Re-order checks in socheckuid() so that we check all deny cases before
returning accept.

MFC after:	3 days
2005-02-18 18:43:33 +00:00
Robert Watson
0d89301c51 In solisten(), unconditionally set the SO_ACCEPTCONN option in
so->so_options when solisten() will succeed, rather than setting it
conditionally based on there not being queued sockets in the completed
socket queue.  Otherwise, if the protocol exposes new sockets via the
completed queue before solisten() completes, the listen() system call
will succeed, but the socket and protocol state will be out of sync.
For TCP, this didn't happen in practice, as the TCP code will panic if
a new connection comes in after the tcpcb has been transitioned to a
listening state but the socket doesn't have SO_ACCEPTCONN set.

This is historical behavior resulting from bitrot since 4.3BSD, in which
that line of code was associated with the conditional NULL'ing of the
connection queue pointers (one-time initialization to be performed
during the transition to a listening socket), which are now initialized
separately.

Discussed with:	fenner, gnn
MFC after:	3 days
2005-02-18 00:52:17 +00:00
Gleb Smirnoff
90d52f2f21 - Convert so_qlen, so_incqlen, so_qlimit fields of struct socket from
short to unsigned short.
- Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX.

Before this change setting somaxconn to smth above 32767 and calling
listen(fd, -1) lead to a socket, which doesn't accept connections at all.

Reviewed by:	rwatson
Reported by:	Igor Sysoev
2005-01-24 12:20:21 +00:00
Maxim Sobolev
fdf84ec4c6 When re-connecting already connected datagram socket ensure to clean
up its pending error state, which may be set in some rare conditions resulting
in connect() syscall returning that bogus error and making application believe
that attempt to change association has failed, while it has not in fact.

There is sockets/reconnect regression test which excersises this bug.

MFC after:	2 weeks
2005-01-12 10:15:23 +00:00
Warner Losh
9454b2d864 /* -> /*- for copyright notices, minor format tweaks as necessary 2005-01-06 23:35:40 +00:00
Robert Watson
ba65391172 Remove an XXXRW indicating atomic operations might be used as a
substitute for a global mutex protecting the socket count and
generation number.

The observation that soreceive_rcvoob() can't return an mbuf
chain is a property, not a bug, so remove the XXXRW.

In sorflush, s/existing/previous/ for code when describing prior
behavior.

For SO_LINGER socket option retrieval, remove an XXXRW about why
we hold the mutex: this is correct and not dubious.

MFC after:	2 weeks
2004-12-23 01:07:12 +00:00
Robert Watson
81b5dbecd4 In soalloc(), simplify the mac_init_socket() handling to remove
unnecessary use of a global variable and simplify the return case.
While here, use ()'s around return values.

In sodealloc(), remove a comment about why we bump the gencnt and
decrement the socket count separately.  It doesn't add
substantially to the reading, and clutters the function.

MFC after:	2 weeks
2004-12-23 00:59:43 +00:00
Alan Cox
c73e3e9223 Remove unneeded code from the zero-copy receive path.
Discussed with: gallatin@
Tested by: ken@
2004-12-10 04:49:13 +00:00
Alan Cox
1c4dbedac4 Tidy up the zero-copy receive path: Remove an unneeded argument to
uiomoveco() and userspaceco().
2004-12-08 05:25:08 +00:00
Paul Saab
d297f70246 If soreceive() is called from a socket callback, there's no reason
to do a window update to the peer (thru an ACK) from soreceive()
itself. TCP will do that upon return from the socket callback.
Sending a window update from soreceive() results in a lock reversal.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by:	rwatson
2004-11-29 23:10:59 +00:00
Paul Saab
85d11adf25 Make soreceive(MSG_DONTWAIT) nonblocking. If MSG_DONTWAIT is passed into
soreceive(), then pass in M_DONTWAIT to m_copym(). Also fix up error
handling for the case where m_copym() returns failure.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by:	rwatson
2004-11-29 23:09:07 +00:00
Gleb Smirnoff
1449a2f547 Since sb_timeo type was increased to int, use INT_MAX instead of SHRT_MAX.
This also gives us ability to close PR.

PR:		kern/42352
Approved by:	julian (mentor)
MFC after:	1 week
2004-11-09 18:35:26 +00:00
Robert Watson
aae2782bff Acquire the accept mutex in soabort() before calling sotryfree(), as
that is now required.

RELENG_5_3 candidate.

Foot provided by:	Dikshie <dikshie at ppk dot itb dot ac dot id>
2004-11-02 17:15:13 +00:00
Andre Oppermann
3a82a5451c socreate() does an early abort if either the protocol cannot be found,
or pru_attach is NULL.  With loadable protocols the SPACER dummy protocols
have valid function pointers for all methods to functions returning just
EOPNOTSUPP.  Thus the early abort check would not detect immediately that
attach is not supported for this protocol.  Instead it would correctly
get the EOPNOTSUPP error later on when it calls the protocol specific
attach function.

Add testing against the pru_attach_notsupp() function pointer to the
early abort check as well.
2004-10-23 19:06:43 +00:00
Robert Watson
81158452be Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
  mutex, avoiding sofree() having to drop the socket mutex and re-order,
  which could lead to races permitting more than one thread to enter
  sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
  the protocol to the socket, preventing races in clearing and
  evaluation of the reference such that sofree() might be called more
  than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket.  The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets.  The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after:	3 days
Reviewed by:	dwhite
Discussed with:	gnn, dwhite, green
Reported by:	Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by:	Vlad <marchenko at gmail dot com>
2004-10-18 22:19:43 +00:00
Robert Watson
35b260cd69 Rework sofree() logic to take into account a possible race with accept().
Sockets in the listen queues have reference counts of 0, so if the
protocol decides to disconnect the pcb and try to free the socket, this
triggered a race with accept() wherein accept() would bump the reference
count before sofree() had removed the socket from the listen queues,
resulting in a panic in sofree() when it discovered it was freeing a
referenced socket.  This might happen if a RST came in prior to accept()
on a TCP connection.

The fix is two-fold: to expand the coverage of the accept mutex earlier
in sofree() to prevent accept() from grabbing the socket after the "is it
really safe to free" tests, and to expand the logic of the "is it really
safe to free" tests to check that the refcount is still 0 (i.e., we
didn't race).

RELENG_5 candidate.

Much discussion with and work by:	green
Reported by:	Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by:	Vlad <marchenko at gmail dot com>
2004-10-11 08:11:26 +00:00
Robert Watson
76f6939888 Expand the scope of the socket buffer locks in sopoll() to include the
state test as well as set, or we risk a race between a socket wakeup
and registering for select() or poll() on the socket.  This does
increase the cost of the poll operation, but can probably be optimized
some in the future.

This appears to correct poll() "wedges" experienced with X11 on SMP
systems with highly interactive applications, and might affect a plethora
of other select() driven applications.

RELENG_5 candidate.

Problem reported by:	Maxim Maximov <mcsi at mcsi dot pp dot ru>
Debugged with help of:	dwhite
2004-09-05 14:33:21 +00:00
Robert Watson
fe0f2d4e11 Conditional acquisition of socket buffer mutexes when testing socket
buffers with kqueue filters is no longer required: the kqueue framework
will guarantee that the mutex is held on entering the filter, either
due to a call from the socket code already holding the mutex, or by
explicitly acquiring it.  This removes the last of the conditional
socket locking.
2004-08-24 05:28:18 +00:00
Robert Watson
7b38f0d3c3 Back out uipc_socket.c:1.208, as it incorrectly assumes that all
sockets are connection-oriented for the purposes of kqueue
registration.  Since UDP sockets aren't connection-oriented, this
appeared to break a great many things, such as RPC-based
applications and services (i.e., NFS).  Since jmg isn't around I'm
backing this out before too many more feet are shot, but intend to
investigate the right solution with him once he's available.

Apologies to:	jmg
Discussed with:	imp, scottl
2004-08-20 16:24:23 +00:00
John-Mark Gurney
5d6dd4685a make sure that the socket is either accepting connections or is connected
when attaching a knote to it...  otherwise return EINVAL...

Pointed out by:	benno
2004-08-20 04:15:30 +00:00
John-Mark Gurney
ad3b9257c2 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
Robert Watson
217a4b6e4e Replace a reference to splnet() with a reference to locking in a comment. 2004-08-11 03:43:10 +00:00
Robert Watson
99901d0afb Do some initial locking on accept filter registration and attach. While
here, close some races that existed in the pre-locking world during low
memory conditions.  This locking isn't perfect, but it's closer than
before.
2004-07-25 23:29:47 +00:00
David Malone
cdb71f7526 The recent changes to control message passing broke some things
that get certain types of control messages (ping6 and rtsol are
examples). This gets the new code closer to working:

	1) Collect control mbufs for processing in the controlp ==
	NULL case, so that they can be freed by externalize.

	2) Loop over the list of control mbufs, as the externalize
	function may not know how to deal with chains.

	3) In the case where there is no externalize function,
	remember to add the control mbuf to the controlp list so
	that it will be returned.

	4) After adding stuff to the controlp list, walk to the
	end of the list of stuff that was added, incase we added
	a chain.

This code can be further improved, but this is enough to get most
things working again.

Reviewed by:	rwatson
2004-07-18 19:10:36 +00:00
Robert Watson
dad7b41a9b When entering soclose(), assert that SS_NOFDREF is not already set. 2004-07-16 00:37:34 +00:00
David Malone
dcee93dcf9 Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a
a better name. I have a kern_[sg]etsockopt which I plan to commit
shortly, but the arguments to these function will be quite different
from so_setsockopt.

Approved by:	alfred
2004-07-12 21:42:33 +00:00
Alfred Perlstein
d58d3648dd Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts.
Tune the timeout from 5 seconds to 12 seconds.
Provide a sysctl to show how many reconnects the NFS client has done.

Seems to fix IPv6 from: kuriyama
2004-07-12 06:22:42 +00:00
Robert Watson
a294c3664f Use sockbuf_pushsync() to synchronize stack and socket buffer state
in soreceive() after removing an MT_SONAME mbuf from the head of the
socket buffer.

When processing MT_CONTROL mbufs in soreceive(), first remove all of
the MT_CONTROL mbufs from the head of the socket buffer to a local
mbuf chain, then feed them into dom_externalize() as a set, which
both avoids thrashing the socket buffer lock when handling multiple
control mbufs, and also avoids races with other threads acting on
the socket buffer when the socket buffer mutex is released to enter
the externalize code.  Existing races that might occur if the protocol
externalize method blocked during processing have also been closed.

Now that we synchronize socket buffer and stack state following
modifications to the socket buffer, turn the manual synchronization
that previously followed control mbuf processing with a set of
assertions.  This can eventually be removed.

The soreceive() code is now substantially more MPSAFE.
2004-07-11 23:13:14 +00:00
Robert Watson
b7562e178c Add sockbuf_pushsync(), an inline function that, following a change to
the head of the mbuf chains in a socket buffer, re-synchronizes the
cache pointers used to optimize socket buffer appends.  This will be
used by soreceive() before dropping socket buffer mutexes to make sure
a consistent version of the socket buffer is visible to other threads.

While here, update copyright to account for substantial rewrite of much
socket code required for fine-grained locking.
2004-07-11 22:59:32 +00:00
Robert Watson
d861372b14 Add additional annotations to soreceive(), documenting the effects of
locking on 'nextrecord' and concerns regarding potentially inconsistent
or stale use of socket buffer or stack fields if they aren't carefully
synchronized whenever the socket buffer mutex is released.  Document
that the high-level sblock() prevents races against other readers on
the socket.

Also document the 'type' logic as to how soreceive() guarantees that
it will only return one of normal data or inline out-of-band data.
2004-07-11 18:29:47 +00:00
Robert Watson
0014b343e0 In the 'dontblock' section of soreceive(), assert that the mbuf on hand
('m') is in fact the first mbuf in the receive socket buffer.
2004-07-11 01:44:12 +00:00
Robert Watson
5e44d93ffc Break out non-inline out-of-band data receive code from soreceive()
and put it in its own helper function soreceive_rcvoob().
2004-07-11 01:34:34 +00:00
Robert Watson
a04b09398c Assign pointers values of NULL rather than 0 in soreceive(). 2004-07-11 01:22:40 +00:00
Robert Watson
7e17bc9f26 When the MT_SONAME mbuf is popped off of a receive socket buffer
associated with a PR_ADDR protocol, make sure to update the m_nextpkt
pointer of the new head mbuf on the chain to point to the next record.
Otherwise, when we release the socket buffer mutex, the socket buffer
mbuf chain may be in an inconsistent state.
2004-07-10 21:43:35 +00:00
Robert Watson
5c2b7a2273 Now socket buffer locks are being asserted at higher code blocks in
soreceive(), remove some leaf assertions that are redundant.
2004-07-10 04:38:06 +00:00
Robert Watson
32775a01da Assert socket buffer lock at strategic points between sections of code
in soreceive() to confirm we've moved from block to block properly
maintaining locking invariants.
2004-07-10 03:47:15 +00:00
Robert Watson
6a72b225b7 Drop the socket buffer lock around a call to m_copym() with M_TRYWAIT.
A subset of locking changes to soreceive() in the queue for merging.

Bumped into by:	Willem Jan Withagen <wjw@withagen.nl>
2004-07-05 19:29:33 +00:00
Robert Watson
a290574663 Add a new global mutex, so_global_mtx, which protects the global variables
so_gencnt, numopensockets, and the per-socket field so_gencnt.  Annotate
this this might be better done with atomic operations.

Annotate what accept_mtx protects.
2004-06-27 03:22:15 +00:00
Robert Watson
11c40a39b6 Replace comment on spl state when calling soabort() with a comment on
locking state.  No socket locks should be held when calling soabort()
as it will call into protocol code that may acquire socket locks.
2004-06-26 17:12:29 +00:00
Robert Watson
c6b93bf29a Lock socket buffers when processing setting socket options SO_SNDLOWAT
or SO_RCVLOWAT for read-modify-write.
2004-06-24 04:28:30 +00:00
Robert Watson
adb4cf0fbc Slide socket buffer lock earlier in sopoll() to cover the call into
selrecord(), setting up select and flagging the socker buffers as SB_SEL
and setting up select under the lock.
2004-06-24 00:54:26 +00:00
Robert Watson
fea24c0a71 Remove spl's from uipc_socket to ease in merging. 2004-06-22 03:49:22 +00:00
Robert Watson
a34b704666 Merge next step in socket buffer locking:
- sowakeup() now asserts the socket buffer lock on entry.  Move
  the call to KNOTE higher in sowakeup() so that it is made with
  the socket buffer lock held for consistency with other calls.
  Release the socket buffer lock prior to calling into pgsigio(),
  so_upcall(), or aio_swake().  Locking for this event management
  will need revisiting in the future, but this model avoids lock
  order reversals when upcalls into other subsystems result in
  socket/socket buffer operations.  Assert that the socket buffer
  lock is not held at the end of the function.

- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
  have _locked versions which assert the socket buffer lock on
  entry.  If a wakeup is required by sb_notify(), invoke
  sowakeup(); otherwise, unconditionally release the socket buffer
  lock.  This results in the socket buffer lock being released
  whether a wakeup is required or not.

- Break out socantsendmore() into socantsendmore_locked() that
  asserts the socket buffer lock.  socantsendmore()
  unconditionally locks the socket buffer before calling
  socantsendmore_locked().  Note that both functions return with
  the socket buffer unlocked as socantsendmore_locked() calls
  sowwakeup_locked() which has the same properties.  Assert that
  the socket buffer is unlocked on return.

- Break out socantrcvmore() into socantrcvmore_locked() that
  asserts the socket buffer lock.  socantrcvmore() unconditionally
  locks the socket buffer before calling socantrcvmore_locked().
  Note that both functions return with the socket buffer unlocked
  as socantrcvmore_locked() calls sorwakeup_locked() which has
  similar properties.  Assert that the socket buffer is unlocked
  on return.

- Break out sbrelease() into a sbrelease_locked() that asserts the
  socket buffer lock.  sbrelease() unconditionally locks the
  socket buffer before calling sbrelease_locked().
  sbrelease_locked() now invokes sbflush_locked() instead of
  sbflush().

- Assert the socket buffer lock in socket buffer sanity check
  functions sblastrecordchk(), sblastmbufchk().

- Assert the socket buffer lock in SBLINKRECORD().

- Break out various sbappend() functions into sbappend_locked()
  (and variations on that name) that assert the socket buffer
  lock.  The !_locked() variations unconditionally lock the socket
  buffer before calling their _locked counterparts.  Internally,
  make sure to call _locked() support routines, etc, if already
  holding the socket buffer lock.

- Break out sbinsertoob() into sbinsertoob_locked() that asserts
  the socket buffer lock.  sbinsertoob() unconditionally locks the
  socket buffer before calling sbinsertoob_locked().

- Break out sbflush() into sbflush_locked() that asserts the
  socket buffer lock.  sbflush() unconditionally locks the socket
  buffer before calling sbflush_locked().  Update panic strings
  for new function names.

- Break out sbdrop() into sbdrop_locked() that asserts the socket
  buffer lock.  sbdrop() unconditionally locks the socket buffer
  before calling sbdrop_locked().

- Break out sbdroprecord() into sbdroprecord_locked() that asserts
  the socket buffer lock.  sbdroprecord() unconditionally locks
  the socket buffer before calling sbdroprecord_locked().

- sofree() now calls socantsendmore_locked() and re-acquires the
  socket buffer lock on return.  It also now calls
  sbrelease_locked().

- sorflush() now calls socantrcvmore_locked() and re-acquires the
  socket buffer lock on return.  Clean up/mess up other behavior
  in sorflush() relating to the temporary stack copy of the socket
  buffer used with dom_dispose by more properly initializing the
  temporary copy, and selectively bzeroing/copying more carefully
  to prevent WITNESS from getting confused by improperly
  initialized mutexes.  Annotate why that's necessary, or at
  least, needed.

- soisconnected() now calls sbdrop_locked() before unlocking the
  socket buffer to avoid locking overhead.

Some parts of this change were:

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-21 00:20:43 +00:00
Robert Watson
fa8368a8fe When retrieving the SO_LINGER socket option for user space, hold the
socket lock over pulling so_options and so_linger out of the socket
structure in order to retrieve a consistent snapshot.  This may be
overkill if user space doesn't require a consistent snapshot.
2004-06-20 17:50:42 +00:00
Robert Watson
6f4b1b5578 Convert an if->panic in soclose() into a call to KASSERT(). 2004-06-20 17:47:51 +00:00
Robert Watson
ed2f7766b0 Annotate some ordering-related issues in solisten() which are not yet
resolved by socket locking: in particular, that we test the connection
state at the socket layer without locking, request that the protocol
begin listening, and then set the listen state on the socket
non-atomically, resulting in a non-atomic cross-layer test-and-set.
2004-06-20 17:38:19 +00:00
Robert Watson
31f555a1c5 Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.  Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation.  Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy.  Assert the socket buffer mutex
strategically after code sections or at the beginning of loops.  In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive().  This commit only merges basic socket locking in this
code; follow-up commits will close additional races.  As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by:	juli, tjr
2004-06-19 03:23:14 +00:00
Robert Watson
7b574f2e45 Hold SOCK_LOCK(so) while frobbing so_options. Note that while the
local race is corrected, there's still a global race in sosend()
relating to so_options and the SO_DONTROUTE flag.
2004-06-18 04:02:56 +00:00
Robert Watson
c012260726 Merge some additional leaf node socket buffer locking from
rwatson_netperf:

Introduce conditional locking of the socket buffer in fifofs kqueue
filters; KNOTE() will be called holding the socket buffer locks in
fifofs, but sometimes the kqueue() system call will poll using the
same entry point without holding the socket buffer lock.

Introduce conditional locking of the socket buffer in the socket
kqueue filters; KNOTE() will be called holding the socket buffer
locks in the socket code, but sometimes the kqueue() system call
will poll using the same entry points without holding the socket
buffer lock.

Simplify the logic in sodisconnect() since we no longer need spls.

NOTE: To remove conditional locking in the kqueue filters, it would
make sense to use a separate kqueue API entry into the socket/fifo
code when calling from the kqueue() system call.
2004-06-18 02:57:55 +00:00
Robert Watson
9535efc00d Merge additional socket buffer locking from rwatson_netperf:
- Lock down low hanging fruit use of sb_flags with socket buffer
  lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
  socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
  grab the receive socket buffer lock, which are currently actually
  the same lock.  Depending on how we want to play our cards, we
  may want to coallesce these lock uses to reduce overhead.

- Convert a if()->panic() into a KASSERT relating to so_state in
  soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.
2004-06-17 22:48:11 +00:00
Robert Watson
c0b99ffa02 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
Robert Watson
395a08c904 Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
Robert Watson
f6c0cce6d9 Introduce a mutex into struct sockbuf, sb_mtx, which will be used to
protect fields in the socket buffer.  Add accessor macros to use the
mutex (SOCKBUF_*()).  Initialize the mutex in soalloc(), and destroy
it in sodealloc().  Add addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.

Submitted by:	sam
Sponosored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 16:08:41 +00:00
Stefan Farfeleder
1a5ff9285a Avoid assignments to cast expressions.
Reviewed by:	md5
Approved by:	das (mentor)
2004-06-08 13:08:19 +00:00
Robert Watson
2658b3bb8e Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

          so_qlen          so_incqlen         so_qstate
          so_comp          so_incomp          so_list
          so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and  address
lock order concerns.  In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
  always allocate a file descriptor with falloc() so that if we do
  find a socket, we don't have to encounter the "Oh, there wasn't
  a socket" race that can occur if falloc() sleeps in the current
  code, which broke inbound accept() ordering, not to mention
  requiring backing out socket state changes in a way that raced
  with the protocol level.  We may want to add a lockless read of
  the queue state if polling of empty queues proves to be important
  to optimize.

- In accept1(), soref() the socket while holding the accept lock
  so that the socket cannot be free'd in a race with the protocol
  layer.  Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
  insert our new socket once we've committed to inserting it, or
  races can occur that cause the incomplete socket queue to
  overfill.  In the previously implementation, it was sufficient
  to simply tested once since calling soabort() didn't release
  synchronization permitting another thread to insert a socket as
  we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
  caller to remove a socket from the incomplete connection queue
  before calling soabort(), which prevents soabort() from having
  to walk into the accept socket to release the socket from its
  queue, and avoids races when releasing the accept mutex to enter
  soabort(), permitting soabort() to avoid lock ordering issues
  with the caller.

- Generally cluster accept queue related operations together
  throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.
2004-06-02 04:15:39 +00:00
Robert Watson
36568179e3 The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket.  This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.
2004-06-01 02:42:56 +00:00
Don Lewis
866046f5a6 Add MSG_NBIO flag option to soreceive() and sosend() that causes
them to behave the same as if the SS_NBIO socket flag had been set
for this call.  The SS_NBIO flag for ordinary sockets is set by
fcntl(fd, F_SETFL, O_NONBLOCK).

Pass the MSG_NBIO flag to the soreceive() and sosend() calls in
fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag
on the underlying socket for each I/O operation.  The O_NONBLOCK
flag is a property of the descriptor, and unlike ordinary sockets,
fifos may be referenced by multiple descriptors.
2004-06-01 01:18:51 +00:00
Bosko Milekic
099a0e588c Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00
Robert Watson
123f024b24 Compare pointers with NULL rather than using pointers are booleans in
if/for statements.  Assign pointers to NULL rather than typecast 0.
Compare pointers with NULL rather than 0.
2004-04-09 13:23:51 +00:00
Warner Losh
7f8a436ff2 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-05 21:03:37 +00:00
Robert Watson
8e44a7ec13 In sofree(), avoid nested declaration and initialization in
declaration.  Observe that initialization in declaration is
frequently incompatible with locking, not just a bad idea
due to style(9).

Submitted by:	bde
2004-03-31 03:48:35 +00:00
Robert Watson
181e65db5b Use a common return path for filt_soread() and filt_sowrite() to
simplify the impact of locking on these functions.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
2004-03-29 18:06:15 +00:00
Robert Watson
71c90a2944 In sofree(), moving caching of 'head' from 'so->so_head' to later in
the function once it has been determined to be non-NULL to simplify
locking on an earlier return.
2004-03-29 17:57:43 +00:00
Robert Watson
746e5bf09b Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by:	sam
2004-03-01 03:14:23 +00:00
Scott Long
740d9ba692 Convert the other use of flags to mflags in soalloc(). 2004-03-01 01:14:28 +00:00
Robert Watson
2bc87dcfbe Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument.  Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1.  This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by:	sam
2004-02-29 17:54:05 +00:00
Brian Feldman
f662a93197 Always socantsendmore() before deallocating a socket. This, in turn,
calls selwakeup() if necessary (which it is, if you don't want freed
memory hanging around on your td->td_selq).

Props to:	alfred
2004-02-12 01:48:40 +00:00
Poul-Henning Kamp
be8a62e821 Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead.  Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
2004-01-31 10:40:25 +00:00
Ruslan Ermilov
0541040c46 Since "m" is not part of the "mp" chain, need to free() it.
Reported by:	Stanford Metacompilation research group
2004-01-18 14:02:53 +00:00
Robert Watson
9e71dd0feb Reduce gratuitous redundancy and length in function names:
mac_setsockopt_label_set() -> mac_setsockopt_label()
  mac_getsockopt_label_get() -> mac_getsockopt_label()
  mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel()

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 18:25:20 +00:00
Robert Watson
12cbb9dc56 When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make
sure to sooptcopyin() the (struct mac) so that the MAC Framework
knows which label types are being requested.  This fixes process
queries of socket labels.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 03:53:36 +00:00
Seigo Tanimura
512824f8f7 - Implement selwakeuppri() which allows raising the priority of a
thread being waken up.  The thread waken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
Sam Leffler
395bb18680 speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by:	ps/jayanth
Obtained from:	netbsd (jason thorpe)
2003-10-28 05:47:40 +00:00
Mike Silbersack
184dcdc7c8 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00
Jeffrey Hsu
cc3426866c Make the second argument to sooptcopyout() constant in order to
simplify the upcoming PIM patches.

Submitted by:   Pavlin Radoslavov <pavlin@icir.org>
2003-08-05 00:27:54 +00:00
Robert Drehmel
4e19fe1081 To avoid a kernel panic provoked by a NULL pointer dereference,
do not clear the `sb_sel' member of the sockbuf structure
while invalidating the receive sockbuf in sorflush(), called
from soshutdown().

The panic was reproduceable from user land by attaching a knote
with EVFILT_READ filters to a socket, disabling further reads
from it using shutdown(2), and then closing it.  knote_remove()
was called to remove all knotes from the socket file descriptor
by detaching each using its associated filterops' detach call-
back function, sordetach() in this case, which tried to remove
itself from the invalidated sockbuf's klist (sb_sel.si_note).

PR:	kern/54331
2003-07-17 23:49:10 +00:00
Jeffrey Hsu
330841c763 Rev 1.121 meant to pass the value 1 to soalloc() to indicate waitok.
Reported by:	arr
2003-07-14 20:39:22 +00:00
David E. O'Brien
677b542ea2 Use __FBSDID(). 2003-06-11 00:56:59 +00:00
Alexander Kabaev
104a9b7e3e Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on:	standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
2003-04-29 13:36:06 +00:00
Olivier Houchard
695d74f337 Use while (*controlp != NULL) instead of do ... while (*control != NULL)
There are valid cases where *controlp will be NULL at this point.

Discussed with:	dwmalone
2003-04-14 14:44:36 +00:00
Dag-Erling Smørgrav
8994a245e0 Clean up whitespace, s/register //, refrain from strong urge to ANSIfy. 2003-03-02 15:56:49 +00:00
Dag-Erling Smørgrav
c952458814 uiomove-related caddr_t -> void * (just the low-hanging fruit) 2003-03-02 15:50:23 +00:00
Olivier Houchard
d6bf23783f Remove duplicate includes.
Submitted by:	Cyril Nguyen-Huu <cyril@ci0.org>
2003-02-20 03:26:11 +00:00
Warner Losh
a163d034fa Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
Alfred Perlstein
44956c9863 Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
Thomas Moestl
6f7cab9301 Disallow listen() on sockets which are in the SS_ISCONNECTED or
SS_ISCONNECTING state, returning EINVAL (which is what POSIX mandates
in this case).
listen() on connected or connecting sockets would cause them to enter
a bad state; in the TCP case, this could cause sockets to go
catatonic or panics, depending on how the socket was connected.

Reviewed by:	-net
MFC after:	2 weeks
2003-01-17 19:20:00 +00:00
Matthew Dillon
48e3128b34 Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.
2003-01-13 00:33:17 +00:00
Matthew Dillon
cd72f2180b Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary.  There are no operational changes in this
commit.
2003-01-12 01:37:13 +00:00
Alfred Perlstein
a09de2f7cd In sodealloc(), if there is an accept filter present on the socket
then call do_setopt_accept_filter(so, NULL) which will free the filter
instead of duplicating the code in do_setopt_accept_filter().

Pointed out by: Hiten Pandya <hiten@angelica.unixdaemons.com>
2003-01-05 11:14:04 +00:00
Poul-Henning Kamp
6ce9c72c30 s/sokqfilter/soo_kqfilter/ for consistency with the naming of all
other socket/file operations.
2002-12-23 21:37:28 +00:00
Maxim Konovalov
8819f45b51 Small SO_RCVTIMEO and SO_SNDTIMEO values are mistakenly taken to be zero.
PR:		kern/32827
Submitted by:	Hartmut Brandt <brandt@fokus.gmd.de>
Approved by:	re (jhb)
MFC after:	2 weeks
2002-11-27 13:34:04 +00:00
Alfred Perlstein
29f194457c Fix instances of macros with improperly parenthasized arguments.
Verified by: md5
2002-11-09 12:55:07 +00:00
Kelly Yancey
247a32f22a Fix filt_soread() to properly flag a kevent when a 0-byte datagram is
received.

Verified by:	dougb, Manfred Antar <null@pozo.com>
Sponsored by:	NTT Multimedia Communications Labs
2002-11-05 18:48:46 +00:00
Alan Cox
5ee0a409fc Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing
a panic because the socket's state isn't as expected by sofree().

Discussed with: dillon, fenner
2002-11-02 05:14:31 +00:00
Kelly Yancey
e0f640e82d Track the number of non-data chararacters stored in socket buffers so that
the data value returned by kevent()'s EVFILT_READ filter on non-TCP
sockets accurately reflects the amount of data that can be read from the
sockets by applications.

PR:		30634
Reviewed by:	-net, -arch
Sponsored by:	NTT Multimedia Communications Labs
MFC after:	2 weeks
2002-11-01 21:27:59 +00:00
Robert Watson
6151efaa54 Trim extraneous #else and #endif MAC comments per style(9). 2002-10-28 21:17:53 +00:00
Robert Watson
83985c267e Modify label allocation semantics for sockets: pass in soalloc's malloc
flags so that we can call malloc with M_NOWAIT if necessary, avoiding
potential sleeps while holding mutexes in the TCP syncache code.
Similar to the existing support for mbuf label allocation: if we can't
allocate all the necessary label store in each policy, we back out
the label allocation and fail the socket creation.  Sync from MAC tree.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-05 21:23:47 +00:00
Robert Watson
ea6027a8e1 Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential.  Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential.  Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument.  This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
  MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
  than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
  to vn_stat().  Pass active_cred to MAC and fp->f_cred to VOP_POLL()
  to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
  and consumers so that this distinction is maintained at the VFS
  as well as 'struct file' layer.  Pass active_cred instead of
  td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
  credential is properly initialized and can be used in the socket
  code if desired.  Pass ap->a_td->td_ucred as the active
  credential to soo_poll().  If we teach the vnop interface about
  the distinction between file and active credentials, we would use
  the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained.  It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-08-16 12:52:03 +00:00
Robert Watson
5c5384fe80 Use the credential authorizing the socket creation operation to perform
the jail check and the MAC socket labeling in socreate().  This handles
socket creation using a cached credential better (such as in the NFS
client code when rebuilding a socket following a disconnect: the new
socket should be created using the nfsmount cached cred, not the cred
of the thread causing the socket to be rebuilt).

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-08-12 16:49:03 +00:00
Robert Watson
f9d0d52459 Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by:	bde
2002-08-01 17:47:56 +00:00
Robert Watson
b827919594 Introduce support for Mandatory Access Control and extensible
kernel access control.

Implement two IOCTLs at the socket level to retrieve the primary
and peer labels from a socket.  Note that this user process interface
will be changing to improve multi-policy support.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-08-01 03:45:40 +00:00
Robert Watson
335654d73e Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on sockets.
In particular, invoke entry points during socket allocation and
destruction, as well as creation by a process or during an
accept-scenario (sonewconn).  For UNIX domain sockets, also assign
a peer label.  As the socket code isn't locked down yet, locking
interactions are not yet clear.  Various protocol stack socket
operations (such as peer label assignment for IPv4) will follow.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 03:03:22 +00:00
Mike Barcroft
5f0de71223 Catch up to rev 1.87 of sys/sys/socketvar.h (sb_cc changed from u_long
to u_int).

Noticed by:	sparc64 tinderbox
2002-07-24 14:21:41 +00:00
Alfred Perlstein
802082390b More caddr_t removal.
Change struct knote's kn_hook from caddr_t to void *.
2002-06-29 00:29:12 +00:00
Kenneth D. Merry
98cb733c67 At long last, commit the zero copy sockets code.
MAKEDEV:	Add MAKEDEV glue for the ti(4) device nodes.

ti.4:		Update the ti(4) man page to include information on the
		TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
		and also include information about the new character
		device interface and the associated ioctls.

man9/Makefile:	Add jumbo.9 and zero_copy.9 man pages and associated
		links.

jumbo.9:	New man page describing the jumbo buffer allocator
		interface and operation.

zero_copy.9:	New man page describing the general characteristics of
		the zero copy send and receive code, and what an
		application author should do to take advantage of the
		zero copy functionality.

NOTES:		Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
		TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files:	Add uipc_jumbo.c and uipc_cow.c.

conf/options:	Add the 5 options mentioned above.

kern_subr.c:	Receive side zero copy implementation.  This takes
		"disposable" pages attached to an mbuf, gives them to
		a user process, and then recycles the user's page.
		This is only active when ZERO_COPY_SOCKETS is turned on
		and the kern.ipc.zero_copy.receive sysctl variable is
		set to 1.

uipc_cow.c:	Send side zero copy functions.  Takes a page written
		by the user and maps it copy on write and assigns it
		kernel virtual address space.  Removes copy on write
		mapping once the buffer has been freed by the network
		stack.

uipc_jumbo.c:	Jumbo disposable page allocator code.  This allocates
		(optionally) disposable pages for network drivers that
		want to give the user the option of doing zero copy
		receive.

uipc_socket.c:	Add kern.ipc.zero_copy.{send,receive} sysctls that are
		enabled if ZERO_COPY_SOCKETS is turned on.

		Add zero copy send support to sosend() -- pages get
		mapped into the kernel instead of getting copied if
		they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
		can be used elsewhere.  (uipc_cow.c)

if_media.c:	In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
		calling malloc() with M_WAITOK.  Return an error if
		the M_NOWAIT malloc fails.

		The ti(4) driver and the wi(4) driver, at least, call
		this with a mutex held.  This causes witness warnings
		for 'ifconfig -a' with a wi(4) or ti(4) board in the
		system.  (I've only verified for ti(4)).

ip_output.c:	Fragment large datagrams so that each segment contains
		a multiple of PAGE_SIZE amount of data plus headers.
		This allows the receiver to potentially do page
		flipping on receives.

if_ti.c:	Add zero copy receive support to the ti(4) driver.  If
		TI_PRIVATE_JUMBOS is not defined, it now uses the
		jumbo(9) buffer allocator for jumbo receive buffers.

		Add a new character device interface for the ti(4)
		driver for the new debugging interface.  This allows
		(a patched version of) gdb to talk to the Tigon board
		and debug the firmware.  There are also a few additional
		debugging ioctls available through this interface.

		Add header splitting support to the ti(4) driver.

		Tweak some of the default interrupt coalescing
		parameters to more useful defaults.

		Add hooks for supporting transmit flow control, but
		leave it turned off with a comment describing why it
		is turned off.

if_tireg.h:	Change the firmware rev to 12.4.11, since we're really
		at 12.4.11 plus fixes from 12.4.13.

		Add defines needed for debugging.

		Remove the ti_stats structure, it is now defined in
		sys/tiio.h.

ti_fw.h:	12.4.11 firmware.

ti_fw2.h:	12.4.11 firmware, plus selected fixes from 12.4.13,
		and my header splitting patches.  Revision 12.4.13
		doesn't handle 10/100 negotiation properly.  (This
		firmware is the same as what was in the tree previously,
		with the addition of header splitting support.)

sys/jumbo.h:	Jumbo buffer allocator interface.

sys/mbuf.h:	Add a new external mbuf type, EXT_DISPOSABLE, to
		indicate that the payload buffer can be thrown away /
		flipped to a userland process.

socketvar.h:	Add prototype for socow_setup.

tiio.h:		ioctl interface to the character portion of the ti(4)
		driver, plus associated structure/type definitions.

uio.h:		Change prototype for uiomoveco() so that we'll know
		whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c:	In vm_fault(), check to see whether we need to do a page
		based copy on write fault.

vm_object.c:	Add a new function, vm_object_allocate_wait().  This
		does the same thing that vm_object allocate does, except
		that it gives the caller the opportunity to specify whether
		it should wait on the uma_zalloc() of the object structre.

		This allows vm objects to be allocated while holding a
		mutex.  (Without generating WITNESS warnings.)

		vm_object_allocate() is implemented as a call to
		vm_object_allocate_wait() with the malloc flag set to
		M_WAITOK.

vm_object.h:	Add prototype for vm_object_allocate_wait().

vm_page.c:	Add page-based copy on write setup, clear and fault
		routines.

vm_page.h:	Add page based COW function prototypes and variable in
		the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
Alfred Perlstein
c33c825169 Implement SO_NOSIGPIPE option for sockets. This allows one to request that
an EPIPE error return not generate SIGPIPE on sockets.

Submitted by: lioux
Inspired by: Darwin
2002-06-20 18:52:54 +00:00
Seigo Tanimura
4cc20ab1f0 Back out my lats commit of locking down a socket, it conflicts with hsu's work.
Requested by:	hsu
2002-05-31 11:52:35 +00:00
Andrew R. Reiter
ec41816009 - td will never be NULL, so the call to soalloc() in socreate() will always
be passed a 1; we can, however, use M_NOWAIT to indicate this.
- Check so against NULL since it's a pointer to a structure.
2002-05-21 21:30:44 +00:00
Andrew R. Reiter
1515cd22e1 - OR the flag variable with M_ZERO so that the uma_zalloc() handles the
zero'ing out of the allocated memory.  Also removed the logical bzero
  that followed.
2002-05-21 21:18:41 +00:00
Seigo Tanimura
243917fe3b Lock down a socket, milestone 1.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
  socket buffer. The mutex in the receive buffer also protects the data
  in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

  - so_count
  - so_options
  - so_linger
  - so_state

o Remove *_locked() socket APIs.  Make the following socket APIs
  touching the members above now require a locked socket:

 - sodisconnect()
 - soisconnected()
 - soisconnecting()
 - soisdisconnected()
 - soisdisconnecting()
 - sofree()
 - soref()
 - sorele()
 - sorwakeup()
 - sotryfree()
 - sowakeup()
 - sowwakeup()

Reviewed by:	alfred
2002-05-20 05:41:09 +00:00
Alfred Perlstein
e649887b1e Make funsetown() take a 'struct sigio **' so that the locking can
be done internally.

Ensure that no one can fsetown() to a dying process/pgrp.  We need
to check the process for P_WEXIT to see if it's exiting.  Process
groups are already safe because there is no such thing as a pgrp
zombie, therefore the proctree lock completely protects the pgrp
from having sigio structures associated with it after it runs
funsetownlst.

Add sigio lock to witness list under proctree and allproc, but over
proc and pgrp.

Seigo Tanimura helped with this.
2002-05-06 19:31:28 +00:00
Alfred Perlstein
f132072368 Redo the sigio locking.
Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.
2002-05-01 20:44:46 +00:00
Mike Silbersack
e1f1827f98 Make sure that sockets undergoing accept filtering are aborted in a
LRU fashion when the listen queue fills up.  Previously, there was
no mechanism to kick out old sockets, leading to an easy DoS of
daemons using accept filtering.

Reviewed by:	alfred
MFC after:	3 days
2002-04-26 02:07:46 +00:00
Jeffrey Hsu
20504246d8 There's only one socket zone so we don't need to remember it
in every socket structure.
2002-04-08 03:04:22 +00:00
Jeff Roberson
59295dba57 UMA permited us to utilize the 'waitok' flag to soalloc. 2002-03-20 21:23:26 +00:00
Jeff Roberson
c897b81311 Remove references to vm_zone.h and switch over to the new uma API.
Also, remove maxsockets.  If you look carefully you'll notice that the old
zone allocator never honored this anyway.
2002-03-20 04:09:59 +00:00
Jeff Roberson
8355f576a9 This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by:	arch@
2002-03-19 09:11:49 +00:00
Ian Dowse
167b8d0334 In sosend(), enforce the socket buffer limits regardless of whether
the data was supplied as a uio or an mbuf. Previously the limit was
ignored for mbuf data, and NFS could run the kernel out of mbufs
when an ipfw rule blocked retransmissions.
2002-02-28 11:22:40 +00:00
John Baldwin
a854ed9893 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
Matthew Dillon
ecde8f7c29 Get rid of the twisted MFREE() macro entirely.
Reviewed by:	dg, bmilekic
MFC after:	3 days
2002-02-05 02:00:56 +00:00
Alfred Perlstein
468485b8d2 Fix select on fifos.
Backout revision 1.56 and 1.57 of fifo_vnops.c.

Introduce a new poll op "POLLINIGNEOF" that can be used to ignore
EOF on a fifo, POLLIN/POLLRDNORM is converted to POLLINIGNEOF within
the FIFO implementation to effect the correct behavior.

This should allow one to view a fifo pretty much as a data source
rather than worry about connections coming and going.

Reviewed by: bde
2002-01-14 22:03:48 +00:00
Robert Watson
9c4d63da6d o Make the credential used by socreate() an explicit argument to
socreate(), rather than getting it implicitly from the thread
  argument.

o Make NFS cache the credential provided at mount-time, and use
  the cached credential (nfsmount->nm_cred) when making calls to
  socreate() on initially connecting, or reconnecting the socket.

This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well
as bugs involving NFS and mandatory access control implementations.

Reviewed by:	freebsd-arch
2001-12-31 17:45:16 +00:00
Matthew Dillon
b1e4abd246 Give struct socket structures a ref counting interface similar to
vnodes.  This will hopefully serve as a base from which we can
expand the MP code.  We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
2001-11-17 03:07:11 +00:00
Giorgos Keramidas
7377f0d190 Remove EOL whitespace.
Reviewed by:	alfred
2001-11-12 20:51:40 +00:00
Giorgos Keramidas
074df01866 Make KASSERT's print the values that triggered a panic.
Reviewed by:	alfred
2001-11-12 20:50:06 +00:00
John Baldwin
bd78cece5d Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
  another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
  refcount is > 1 and false (0) otherwise.
2001-10-11 23:38:17 +00:00
Robert Watson
8a7d8cc675 - Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, describes as:
  "Unprivileged processes may see subjects/objects with different real uid"
  NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
  an API change.  kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
  the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
  domain socket credential instead of comparing root vnodes for the
  UDS and the process.  This allows multiple jails to share the same
  chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions.  This also better-supports
the introduction of additional MAC models.

Reviewed by:	ps, billf
Obtained from:	TrustedBSD Project
2001-10-09 21:40:30 +00:00
Paul Saab
4787fd37af Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by:	billf (with modifications by me)
Inspired by:	Dave McKay (aka pm aka Packet Magnet)
Reviewed by:	peter
MFC after:	2 weeks
2001-10-05 07:06:32 +00:00
David Malone
2bc21ed985 Hopefully improve control message passing over Unix domain sockets.
1) Allow the sending of more than one control message at a time
over a unix domain socket. This should cover the PR 29499.

2) This requires that unp_{ex,in}ternalize and unp_scan understand
mbufs with more than one control message at a time.

3) Internalize and externalize used to work on the mbuf in-place.
This made life quite complicated and the code for sizeof(int) <
sizeof(file *) could end up doing the wrong thing. The patch always
create a new mbuf/cluster now. This resulted in the change of the
prototype for the domain externalise function.

4) You can now send SCM_TIMESTAMP messages.

5) Always use CMSG_DATA(cm) to determine the start where the data
in unp_{ex,in}ternalize. It was using ((struct cmsghdr *)cm + 1)
in some places, which gives the wrong alignment on the alpha.
(NetBSD made this fix some time ago).

This results in an ABI change for discriptor passing and creds
passing on the alpha. (Probably on the IA64 and Spare ports too).

6) Fix userland programs to use CMSG_* macros too.

7) Be more careful about freeing mbufs containing (file *)s.
This is made possible by the prototype change of externalise.

PR:		29499
MFC after:	6 weeks
2001-10-04 13:11:48 +00:00
John Baldwin
ed01445d8f Use the passed in thread to selrecord() instead of curthread. 2001-09-21 22:46:54 +00:00
Julian Elischer
b40ce4165d KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
Mark Murray
fb919e4d5a Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by:	bde (with reservations)
2001-05-01 08:13:21 +00:00
Alfred Perlstein
3abedb4e01 Actually show the values that tripped the assertion "receive 1" 2001-04-27 13:42:50 +00:00
Jonathan Lemon
4d286823c5 When doing a recv(.. MSG_WAITALL) for a message which is larger than
the socket buffer size, the receive is done in sections.  After completing
a read, call pru_rcvd on the underlying protocol before blocking again.

This allows the the protocol to take appropriate action, such as
sending a TCP window update to the peer, if the window happened to
close because the socket buffer was filled.  If the protocol is not
notified, a TCP transfer may stall until the remote end sends a window
probe.
2001-03-16 22:37:06 +00:00
Jonathan Lemon
c0647e0d07 Push the test for a disconnected socket when accept()ing down to the
protocol layer.  Not all protocols behave identically.  This fixes the
brokenness observed with unix-domain sockets (and postfix)
2001-03-09 08:16:40 +00:00
Ruslan Ermilov
8ac6dca795 In soshutdown(), use SHUT_{RD,WR,RDWR} instead of FREAD and FWRITE.
Also, return EINVAL if `how' is invalid, as required by POSIX spec.
2001-02-27 13:48:07 +00:00
Jonathan Lemon
da403b9df8 Introduce a NOTE_LOWAT flag for use with the read/write filters, which
allow the watermark to be passed in via the data field during the EV_ADD
operation.

Hook this up to the socket read/write filters; if specified, it overrides
the so_{rcv|snd}.sb_lowat values in the filter.

Inspired by: "Ronald F. Guilmette" <rfg@monkeys.com>
2001-02-24 01:41:31 +00:00
Jonathan Lemon
b07540c837 When returning EV_EOF for the socket read/write filters, also return
the current socket error in fflags.  This may be useful for determining
why a connect() request fails.

Inspired by:  "Jonathan Graehl" <jonathan@graehl.org>
2001-02-24 01:33:12 +00:00
Robert Watson
91421ba234 o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
  pr_free(), invoked by the similarly named credential reference
  management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
  of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
  rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
  flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
  mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
  credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
  required to protect the reference count plus some fields in the
  structure.

Reviewed by:	freebsd-arch
Obtained from:	TrustedBSD Project
2001-02-21 06:39:57 +00:00
Jonathan Lemon
608a3ce62a Extend kqueue down to the device layer.
Backwards compatible approach suggested by: peter
2001-02-15 16:34:11 +00:00
Jonathan Lemon
2fd7d53d36 Return ECONNABORTED from accept if connection is closed while on the
listen queue, as well as the current behavior of a zero-length sockaddr.

Obtained from: KAME
Reviewed by: -net
2001-02-14 02:09:11 +00:00
Dag-Erling Smørgrav
a3ea6d41b9 First step towards an MP-safe zone allocator:
- have zalloc() and zfree() always lock the vm_zone.
 - remove zalloci() and zfreei(), which are now redundant.

Reviewed by:	bmilekic, jasone
2001-01-21 22:23:11 +00:00
Bosko Milekic
2a0c503e7a * Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
  forever when nothing is available for allocation, and may end up
  returning NULL. Hopefully we now communicate more of the right thing
  to developers and make it very clear that it's necessary to check whether
  calls with M_(TRY)WAIT also resulted in a failed allocation.
  M_TRYWAIT basically means "try harder, block if necessary, but don't
  necessarily wait forever." The time spent blocking is tunable with
  the kern.ipc.mbuf_wait sysctl.
  M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
  malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
  value of the M_WAIT flag, this could have became a big problem.
2000-12-21 21:44:31 +00:00
David Malone
7cc0979fd6 Convert more malloc+bzero to malloc+M_ZERO.
Submitted by:	josh@zipperup.org
Submitted by:	Robert Drehmel <robd@gmx.net>
2000-12-08 21:51:06 +00:00
Alfred Perlstein
830fedd28f Accept filters broke kernels compiled without options INET.
Make accept filters conditional on INET support to fix.

Pointed out by: bde
Tested and assisted by: Stephen J. Kiernan <sab@vegamuse.org>
2000-11-20 01:35:25 +00:00
Jonathan Lemon
d5aa12349f Check so_error in filt_so{read|write} in order to detect UDP errors.
PR: 21601
2000-09-28 04:41:22 +00:00
Don Lewis
f535380cb6 Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize().  Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent.  Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access.  Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid.  Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c.  Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().
2000-09-05 22:11:13 +00:00
Brian Feldman
6aef685fbb Remove any possibility of hiwat-related race conditions by changing
the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and
a u_long target to set it to.  The whole thing is splnet().

This fixes a problem that jdp has been able to provoke.
2000-08-29 11:28:06 +00:00
Jonathan Lemon
a114459191 Make the kqueue socket read filter honor the SO_RCVLOWAT value.
Spotted by:  "Steve M." <stevem@redlinenetworks.com>
2000-08-07 17:52:08 +00:00
Alfred Perlstein
f408896444 only allow accept filter modifications on listening sockets
Submitted by: ps
2000-07-20 12:17:17 +00:00
Alfred Perlstein
c636255150 fix races in the uidinfo subsystem, several problems existed:
1) while allocating a uidinfo struct malloc is called with M_WAITOK,
   it's possible that while asleep another process by the same user
   could have woken up earlier and inserted an entry into the uid
   hash table.  Having redundant entries causes inconsistancies that
   we can't handle.

   fix: do a non-waiting malloc, and if that fails then do a blocking
   malloc, after waking up check that no one else has inserted an entry
   for us already.

2) Because many checks for sbsize were done as "test then set" in a non
   atomic manner it was possible to exceed the limits put up via races.

   fix: instead of querying the count then setting, we just attempt to
   set the count and leave it up to the function to return success or
   failure.

3) The uidinfo code was inlining and repeating, lookups and insertions
   and deletions needed to be in their own functions for clarity.

Reviewed by: green
2000-06-22 22:27:16 +00:00
Alfred Perlstein
a79b71281c return of the accept filter part II
accept filters are now loadable as well as able to be compiled into
the kernel.

two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)

Reviewed by: jmg
2000-06-20 01:09:23 +00:00
Alfred Perlstein
a72fda7154 backout accept optimizations.
Requested by: jmg, dcs, jdp, nate
2000-06-18 08:49:13 +00:00
Alfred Perlstein
8f4e4aa5f1 add socketoptions DELAYACCEPT and HTTPACCEPT which will not allow an accept()
until the incoming connection has either data waiting or what looks like a
HTTP request header already in the socketbuffer.  This ought to reduce
the context switch time and overhead for processing requests.

The initial idea and code for HTTPACCEPT came from Yahoo engineers and has
been cleaned up and a more lightweight DELAYACCEPT for non-http servers
has been added

Reviewed by: silence on hackers.
2000-06-15 18:18:43 +00:00
Jeroen Ruigrok van der Werven
3b43fd626a Fix panic by moving the prp == 0 check up the order of sanity checks.
Submitted by:	Bart Thate <freebsd@1st.dudi.org> on -current
Approved by:	rwatson
2000-06-13 15:44:04 +00:00
Robert Watson
7cadc2663e o Modify jail to limit creation of sockets to UNIX domain sockets,
TCP/IP (v4) sockets, and routing sockets.  Previously, interaction
  with IPv6 was not well-defined, and might be inappropriate for some
  environments.  Similarly, sysctl MIB entries providing interface
  information also give out only addresses from those protocol domains.

  For the time being, this functionality is enabled by default, and
  toggleable using the sysctl variable jail.socket_unixiproute_only.
  In the future, protocol domains will be able to determine whether or
  not they are ``jail aware''.

o Further limitations on process use of getpriority() and setpriority()
  by jailed processes.  Addresses problem described in kern/17878.

Reviewed by:	phk, jmg
2000-06-04 04:28:31 +00:00
Jake Burkholder
e39756439c Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by:		msmith and others
2000-05-26 02:09:24 +00:00
Jake Burkholder
740a1973a6 Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by:	phk
Reviewed by:	phk
Approved by:	mdodd
2000-05-23 20:41:01 +00:00
Jonathan Lemon
cb679c385e Introduce kqueue() and kevent(), a kernel event notification facility. 2000-04-16 18:53:38 +00:00
Bill Fenner
95b2b777b5 Make sure to free the socket in soabort() if the protocol couldn't
free it (this could happen if the protocol already freed its part
 and we just kept the socket around to make sure accept(2) didn't block)
2000-03-18 08:56:56 +00:00
Jason Evans
bfbbc4aa44 Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR:		kern/12053
Submitted by:	Chris Sedore <cmsedore@maxwell.syr.edu>
2000-01-14 02:53:29 +00:00
Brian Feldman
c2696359ab Correct an uninitialized variable use, which, unlike most times, is
actually a bug this time.

Submitted by:	bde
Reviewed by:	bde
1999-12-27 06:31:53 +00:00
Brian Feldman
f48b807fc0 This is Bosko Milekic's mbuf allocation waiting code. Basically, this
means that running out of mbuf space isn't a panic anymore, and code
which runs out of network memory will sleep to wait for it.

Submitted by:	Bosko Milekic <bmilekic@dsuper.net>
Reviewed by:	green, wollman
1999-12-12 05:52:51 +00:00
Yoshinobu Inoue
82cd038d51 KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
1999-11-22 02:45:11 +00:00
Poul-Henning Kamp
2e3c8fcbd0 This is a partial commit of the patch from PR 14914:
Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
   structures for list operations.  This patch makes all list operations
   in sys/kern use the queue(3) macros, rather than directly accessing the
   *Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by:    phk
Submitted by:   Jake Burkholder <jake@checker.org>
PR:     14914
1999-11-16 10:56:05 +00:00
Brian Feldman
ecf723083f Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.
1999-10-09 20:42:17 +00:00
Brian Feldman
2f9a21326c Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.
1999-09-19 02:17:02 +00:00
Peter Wemm
c3aac50f28 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
Brian Feldman
f29be02190 Reviewed by: the cast of thousands
This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.
1999-06-17 23:54:50 +00:00
Peter Wemm
9c9906e912 Plug a mbuf leak in tcp_usr_send(). pru_send() routines are expected
to either enqueue or free their mbuf chains, but tcp_usr_send() was
dropping them on the floor if the tcpcb/inpcb has been torn down in the
middle of a send/write attempt.  This has been responsible for a wide
variety of mbuf leak patterns, ranging from slow gradual leakage to rather
rapid exhaustion.  This has been a problem since before 2.2 was branched
and appears to have been fixed in rev 1.16 and lost in 1.23/1.28.

Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking
(extensively) into this on a live production 2.2.x system and that it
was the actual cause of the leak and looks like it fixes it.  The machine
in question was loosing (from memory) about 150 mbufs per hour under
load and a change similar to this stopped it.  (Don't blame Jayanth
for this patch though)

An alternative approach to this would be to recheck SS_CANTSENDMORE etc
inside the splnet() right before calling pru_send() after all the potential
sleeps, interrupts and delays have happened.  However, this would mean
exposing knowledge of the tcp stack's reset handling and removal of the
pcb to the generic code.  There are other things that call pru_send()
directly though.

Problem originally noted by:  John Plevyak <jplevyak@inktomi.com>
1999-06-04 02:27:06 +00:00
Andrey A. Chernov
925fa5c3f5 Realy fix overflow on SO_*TIMEO
Submitted by: bde
1999-05-21 15:54:40 +00:00
Bill Fumerola
3d177f465a Add sysctl descriptions to many SYSCTL_XXXs
PR:		kern/11197
Submitted by:	Adrian Chadd <adrian@FreeBSD.org>
Reviewed by:	billf(spelling/style/minor nits)
Looked at by:	bde(style)
1999-05-03 23:57:32 +00:00
Andrey A. Chernov
02a3d5261d Lite2 bugfixes merge:
so_linger is in seconds, not in 1/HZ
range checking in SO_*TIMEO was wrong

PR: 11252
1999-04-24 18:22:34 +00:00
Doug Rabson
ce02431ffa * Change sysctl from using linker_set to construct its tree using SLISTs.
This makes it possible to change the sysctl tree at runtime.

* Change KLD to find and register any sysctl nodes contained in the loaded
  file and to unregister them when the file is unloaded.

Reviewed by: Archie Cobbs <archie@whistle.com>,
	Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
1999-02-16 10:49:55 +00:00
Bill Fenner
8f70ac3e02 Fix the port of the NetBSD 19990120-accept fix. I misread a piece of
code when examining their fix, which caused my code (in rev 1.52) to:
- panic("soaccept: !NOFDREF")
- fatal trap 12, with tracebacks going thru soclose and soaccept
1999-02-02 07:23:28 +00:00
Matthew Dillon
d254af07a1 Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile
1999-01-27 21:50:00 +00:00
Bill Fenner
527b7a14a5 Port NetBSD's 19990120-accept bug fix. This works around the race condition
where select(2) can return that a listening socket has a connected socket
queued, the connection is broken, and the user calls accept(2), which then
blocks because there are no connections queued.

Reviewed by:	wollman
Obtained from:	NetBSD
(ftp://ftp.NetBSD.ORG/pub/NetBSD/misc/security/patches/19990120-accept)
1999-01-25 16:58:56 +00:00
Bill Fenner
7b1777101c Also consider the space left in the socket buffer when deciding whether
to set PRUS_MORETOCOME.
1999-01-20 17:45:22 +00:00
Bill Fenner
b0acefa8d4 Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.

Based on:	"Justin C. Walker" <justin@apple.com>'s description of Apple's
		fix for this problem.
1999-01-20 17:32:01 +00:00
Eivind Eklund
219cbf59f2 KNFize, by bde. 1999-01-10 01:58:29 +00:00
Eivind Eklund
5526d2d920 Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by:	msmith
1999-01-08 17:31:30 +00:00
Archie Cobbs
f1d19042b0 The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.
1998-12-07 21:58:50 +00:00
Don Lewis
831d27a9f5 Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices.  For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR:		kern/7899
Reviewed by:	bde, elvind
1998-11-11 10:04:13 +00:00
Garrett Wollman
9898afa1f1 Bow to tradition and correctly implement the bogus-but-hallowed semantics
of getsockopt never telling how much it might have copied if only the
buffer were big enough.
1998-08-31 18:07:23 +00:00
Garrett Wollman
d224dbc106 Correctly set the return length regardless of the relative size of the
user's buffer.  Simplify the logic a bit.  (Can we have a version of
min() for size_t?)
1998-08-31 15:34:55 +00:00
Garrett Wollman
cfe8b629f1 Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process.  Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.
1998-08-23 03:07:17 +00:00
Bill Fenner
0c495036b4 Undo rev 1.41 until we get more details about why it makes some systems
fail.
1998-07-18 18:48:45 +00:00
Bill Fenner
dece5b6a43 Introduce (fairly hacky) workaround for odd TCP behavior with application
writes of size (100,208]+N*MCLBYTES.

The bug:
 sosend() hands each mbuf off to the protocol output routine as soon as it
 has copied it, in the hopes of increasing parallelism (see
  http://www.kohala.com/~rstevens/vanj.88jul20.txt ). This works well for
 TCP as long as the first mbuf handed off is at least the MSS.  However,
 when doing small writes (between MHLEN and MINCLSIZE), the transaction is
 split into 2 small MBUF's and each is individually handed off to TCP.
 TCP assumes that the first small mbuf is the whole transaction, so sends
 a small packet.  When the second small mbuf arrives, Nagle prevents TCP
 from sending it so it must wait for a (potentially delayed) ACK.  This
 sends throughput down the toilet.

The workaround:
 Set the "atomic" flag when we're doing small writes.  The "atomic" flag
 has two meanings:
 1. Copy all of the data into a chain of mbufs before handing off to the
    protocol.
 2. Leave room for a datagram header in said mbuf chain.
 TCP wants the first but doesn't want the second.  However, the second
 simply results in some memory wastage (but is why the workaround is a
 hack and not a fix).

The real fix:
 The real fix for this problem is to introduce something like a "requested
 transfer size" variable in the socket->protocol interface.  sosend()
 would then accumulate an mbuf chain until it exceeded the "requested
 transfer size".  TCP could set it to the TCP MSS (note that the
 current interface causes strange TCP behaviors when the MSS > MCLBYTES;
 nobody notices because MCLBYTES > ethernet's MTU).
1998-07-06 19:27:14 +00:00
Garrett Wollman
98271db4d5 Convert socket structures to be type-stable and add a version number.
Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time.  This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
1998-05-15 20:11:40 +00:00
Bruce Evans
08637435f2 Moved some #includes from <sys/param.h> nearer to where they are actually
used.
1998-03-28 10:33:27 +00:00
Guido van Rooij
4049a04253 Make sure that you can only bind a more specific address when it is
done by the same uid.
Obtained from: OpenBSD
1998-03-01 19:39:29 +00:00
Bill Fenner
92f57d003c Revert sosend() to its behavior from 4.3-Tahoe and before: if
so_error is set, clear it before returning it.  The behavior
introduced in 4.3-Reno (to not clear so_error) causes potentially
transient errors (e.g.  ECONNREFUSED if the other end hasn't opened
its socket yet) to be permanent on connected datagram sockets that
are only used for writing.

(soreceive() clears so_error before returning it, as does
getsockopt(...,SO_ERROR,...).)

Submitted by:	Van Jacobson <van@ee.lbl.gov>, via a comment in the vat sources.
1998-02-19 19:38:20 +00:00