178 Commits

Author SHA1 Message Date
Robert Watson
c0b99ffa02 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
Robert Watson
395a08c904 Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
Robert Watson
f6c0cce6d9 Introduce a mutex into struct sockbuf, sb_mtx, which will be used to
protect fields in the socket buffer.  Add accessor macros to use the
mutex (SOCKBUF_*()).  Initialize the mutex in soalloc(), and destroy
it in sodealloc().  Add addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.

Submitted by:	sam
Sponosored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 16:08:41 +00:00
Stefan Farfeleder
1a5ff9285a Avoid assignments to cast expressions.
Reviewed by:	md5
Approved by:	das (mentor)
2004-06-08 13:08:19 +00:00
Robert Watson
2658b3bb8e Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

          so_qlen          so_incqlen         so_qstate
          so_comp          so_incomp          so_list
          so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and  address
lock order concerns.  In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
  always allocate a file descriptor with falloc() so that if we do
  find a socket, we don't have to encounter the "Oh, there wasn't
  a socket" race that can occur if falloc() sleeps in the current
  code, which broke inbound accept() ordering, not to mention
  requiring backing out socket state changes in a way that raced
  with the protocol level.  We may want to add a lockless read of
  the queue state if polling of empty queues proves to be important
  to optimize.

- In accept1(), soref() the socket while holding the accept lock
  so that the socket cannot be free'd in a race with the protocol
  layer.  Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
  insert our new socket once we've committed to inserting it, or
  races can occur that cause the incomplete socket queue to
  overfill.  In the previously implementation, it was sufficient
  to simply tested once since calling soabort() didn't release
  synchronization permitting another thread to insert a socket as
  we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
  caller to remove a socket from the incomplete connection queue
  before calling soabort(), which prevents soabort() from having
  to walk into the accept socket to release the socket from its
  queue, and avoids races when releasing the accept mutex to enter
  soabort(), permitting soabort() to avoid lock ordering issues
  with the caller.

- Generally cluster accept queue related operations together
  throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.
2004-06-02 04:15:39 +00:00
Robert Watson
36568179e3 The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket.  This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.
2004-06-01 02:42:56 +00:00
Don Lewis
866046f5a6 Add MSG_NBIO flag option to soreceive() and sosend() that causes
them to behave the same as if the SS_NBIO socket flag had been set
for this call.  The SS_NBIO flag for ordinary sockets is set by
fcntl(fd, F_SETFL, O_NONBLOCK).

Pass the MSG_NBIO flag to the soreceive() and sosend() calls in
fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag
on the underlying socket for each I/O operation.  The O_NONBLOCK
flag is a property of the descriptor, and unlike ordinary sockets,
fifos may be referenced by multiple descriptors.
2004-06-01 01:18:51 +00:00
Bosko Milekic
099a0e588c Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00
Robert Watson
123f024b24 Compare pointers with NULL rather than using pointers are booleans in
if/for statements.  Assign pointers to NULL rather than typecast 0.
Compare pointers with NULL rather than 0.
2004-04-09 13:23:51 +00:00
Warner Losh
7f8a436ff2 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-05 21:03:37 +00:00
Robert Watson
8e44a7ec13 In sofree(), avoid nested declaration and initialization in
declaration.  Observe that initialization in declaration is
frequently incompatible with locking, not just a bad idea
due to style(9).

Submitted by:	bde
2004-03-31 03:48:35 +00:00
Robert Watson
181e65db5b Use a common return path for filt_soread() and filt_sowrite() to
simplify the impact of locking on these functions.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
2004-03-29 18:06:15 +00:00
Robert Watson
71c90a2944 In sofree(), moving caching of 'head' from 'so->so_head' to later in
the function once it has been determined to be non-NULL to simplify
locking on an earlier return.
2004-03-29 17:57:43 +00:00
Robert Watson
746e5bf09b Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by:	sam
2004-03-01 03:14:23 +00:00
Scott Long
740d9ba692 Convert the other use of flags to mflags in soalloc(). 2004-03-01 01:14:28 +00:00
Robert Watson
2bc87dcfbe Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument.  Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1.  This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by:	sam
2004-02-29 17:54:05 +00:00
Brian Feldman
f662a93197 Always socantsendmore() before deallocating a socket. This, in turn,
calls selwakeup() if necessary (which it is, if you don't want freed
memory hanging around on your td->td_selq).

Props to:	alfred
2004-02-12 01:48:40 +00:00
Poul-Henning Kamp
be8a62e821 Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead.  Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
2004-01-31 10:40:25 +00:00
Ruslan Ermilov
0541040c46 Since "m" is not part of the "mp" chain, need to free() it.
Reported by:	Stanford Metacompilation research group
2004-01-18 14:02:53 +00:00
Robert Watson
9e71dd0feb Reduce gratuitous redundancy and length in function names:
mac_setsockopt_label_set() -> mac_setsockopt_label()
  mac_getsockopt_label_get() -> mac_getsockopt_label()
  mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel()

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 18:25:20 +00:00
Robert Watson
12cbb9dc56 When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make
sure to sooptcopyin() the (struct mac) so that the MAC Framework
knows which label types are being requested.  This fixes process
queries of socket labels.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 03:53:36 +00:00
Seigo Tanimura
512824f8f7 - Implement selwakeuppri() which allows raising the priority of a
thread being waken up.  The thread waken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
Sam Leffler
395bb18680 speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by:	ps/jayanth
Obtained from:	netbsd (jason thorpe)
2003-10-28 05:47:40 +00:00
Mike Silbersack
184dcdc7c8 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00
Jeffrey Hsu
cc3426866c Make the second argument to sooptcopyout() constant in order to
simplify the upcoming PIM patches.

Submitted by:   Pavlin Radoslavov <pavlin@icir.org>
2003-08-05 00:27:54 +00:00
Robert Drehmel
4e19fe1081 To avoid a kernel panic provoked by a NULL pointer dereference,
do not clear the `sb_sel' member of the sockbuf structure
while invalidating the receive sockbuf in sorflush(), called
from soshutdown().

The panic was reproduceable from user land by attaching a knote
with EVFILT_READ filters to a socket, disabling further reads
from it using shutdown(2), and then closing it.  knote_remove()
was called to remove all knotes from the socket file descriptor
by detaching each using its associated filterops' detach call-
back function, sordetach() in this case, which tried to remove
itself from the invalidated sockbuf's klist (sb_sel.si_note).

PR:	kern/54331
2003-07-17 23:49:10 +00:00
Jeffrey Hsu
330841c763 Rev 1.121 meant to pass the value 1 to soalloc() to indicate waitok.
Reported by:	arr
2003-07-14 20:39:22 +00:00
David E. O'Brien
677b542ea2 Use __FBSDID(). 2003-06-11 00:56:59 +00:00
Alexander Kabaev
104a9b7e3e Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on:	standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
2003-04-29 13:36:06 +00:00
Olivier Houchard
695d74f337 Use while (*controlp != NULL) instead of do ... while (*control != NULL)
There are valid cases where *controlp will be NULL at this point.

Discussed with:	dwmalone
2003-04-14 14:44:36 +00:00
Dag-Erling Smørgrav
8994a245e0 Clean up whitespace, s/register //, refrain from strong urge to ANSIfy. 2003-03-02 15:56:49 +00:00
Dag-Erling Smørgrav
c952458814 uiomove-related caddr_t -> void * (just the low-hanging fruit) 2003-03-02 15:50:23 +00:00
Olivier Houchard
d6bf23783f Remove duplicate includes.
Submitted by:	Cyril Nguyen-Huu <cyril@ci0.org>
2003-02-20 03:26:11 +00:00
Warner Losh
a163d034fa Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
Alfred Perlstein
44956c9863 Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
Thomas Moestl
6f7cab9301 Disallow listen() on sockets which are in the SS_ISCONNECTED or
SS_ISCONNECTING state, returning EINVAL (which is what POSIX mandates
in this case).
listen() on connected or connecting sockets would cause them to enter
a bad state; in the TCP case, this could cause sockets to go
catatonic or panics, depending on how the socket was connected.

Reviewed by:	-net
MFC after:	2 weeks
2003-01-17 19:20:00 +00:00
Matthew Dillon
48e3128b34 Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.
2003-01-13 00:33:17 +00:00
Matthew Dillon
cd72f2180b Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary.  There are no operational changes in this
commit.
2003-01-12 01:37:13 +00:00
Alfred Perlstein
a09de2f7cd In sodealloc(), if there is an accept filter present on the socket
then call do_setopt_accept_filter(so, NULL) which will free the filter
instead of duplicating the code in do_setopt_accept_filter().

Pointed out by: Hiten Pandya <hiten@angelica.unixdaemons.com>
2003-01-05 11:14:04 +00:00
Poul-Henning Kamp
6ce9c72c30 s/sokqfilter/soo_kqfilter/ for consistency with the naming of all
other socket/file operations.
2002-12-23 21:37:28 +00:00
Maxim Konovalov
8819f45b51 Small SO_RCVTIMEO and SO_SNDTIMEO values are mistakenly taken to be zero.
PR:		kern/32827
Submitted by:	Hartmut Brandt <brandt@fokus.gmd.de>
Approved by:	re (jhb)
MFC after:	2 weeks
2002-11-27 13:34:04 +00:00
Alfred Perlstein
29f194457c Fix instances of macros with improperly parenthasized arguments.
Verified by: md5
2002-11-09 12:55:07 +00:00
Kelly Yancey
247a32f22a Fix filt_soread() to properly flag a kevent when a 0-byte datagram is
received.

Verified by:	dougb, Manfred Antar <null@pozo.com>
Sponsored by:	NTT Multimedia Communications Labs
2002-11-05 18:48:46 +00:00
Alan Cox
5ee0a409fc Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing
a panic because the socket's state isn't as expected by sofree().

Discussed with: dillon, fenner
2002-11-02 05:14:31 +00:00
Kelly Yancey
e0f640e82d Track the number of non-data chararacters stored in socket buffers so that
the data value returned by kevent()'s EVFILT_READ filter on non-TCP
sockets accurately reflects the amount of data that can be read from the
sockets by applications.

PR:		30634
Reviewed by:	-net, -arch
Sponsored by:	NTT Multimedia Communications Labs
MFC after:	2 weeks
2002-11-01 21:27:59 +00:00
Robert Watson
6151efaa54 Trim extraneous #else and #endif MAC comments per style(9). 2002-10-28 21:17:53 +00:00
Robert Watson
83985c267e Modify label allocation semantics for sockets: pass in soalloc's malloc
flags so that we can call malloc with M_NOWAIT if necessary, avoiding
potential sleeps while holding mutexes in the TCP syncache code.
Similar to the existing support for mbuf label allocation: if we can't
allocate all the necessary label store in each policy, we back out
the label allocation and fail the socket creation.  Sync from MAC tree.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-05 21:23:47 +00:00
Robert Watson
ea6027a8e1 Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential.  Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential.  Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument.  This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
  MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
  than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
  to vn_stat().  Pass active_cred to MAC and fp->f_cred to VOP_POLL()
  to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
  and consumers so that this distinction is maintained at the VFS
  as well as 'struct file' layer.  Pass active_cred instead of
  td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
  credential is properly initialized and can be used in the socket
  code if desired.  Pass ap->a_td->td_ucred as the active
  credential to soo_poll().  If we teach the vnop interface about
  the distinction between file and active credentials, we would use
  the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained.  It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-08-16 12:52:03 +00:00
Robert Watson
5c5384fe80 Use the credential authorizing the socket creation operation to perform
the jail check and the MAC socket labeling in socreate().  This handles
socket creation using a cached credential better (such as in the NFS
client code when rebuilding a socket following a disconnect: the new
socket should be created using the nfsmount cached cred, not the cred
of the thread causing the socket to be rebuilt).

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-08-12 16:49:03 +00:00
Robert Watson
f9d0d52459 Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by:	bde
2002-08-01 17:47:56 +00:00