Commit Graph

185 Commits

Author SHA1 Message Date
Robert Watson
fa8368a8fe When retrieving the SO_LINGER socket option for user space, hold the
socket lock over pulling so_options and so_linger out of the socket
structure in order to retrieve a consistent snapshot.  This may be
overkill if user space doesn't require a consistent snapshot.
2004-06-20 17:50:42 +00:00
Robert Watson
6f4b1b5578 Convert an if->panic in soclose() into a call to KASSERT(). 2004-06-20 17:47:51 +00:00
Robert Watson
ed2f7766b0 Annotate some ordering-related issues in solisten() which are not yet
resolved by socket locking: in particular, that we test the connection
state at the socket layer without locking, request that the protocol
begin listening, and then set the listen state on the socket
non-atomically, resulting in a non-atomic cross-layer test-and-set.
2004-06-20 17:38:19 +00:00
Robert Watson
31f555a1c5 Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.  Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation.  Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy.  Assert the socket buffer mutex
strategically after code sections or at the beginning of loops.  In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive().  This commit only merges basic socket locking in this
code; follow-up commits will close additional races.  As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by:	juli, tjr
2004-06-19 03:23:14 +00:00
Robert Watson
7b574f2e45 Hold SOCK_LOCK(so) while frobbing so_options. Note that while the
local race is corrected, there's still a global race in sosend()
relating to so_options and the SO_DONTROUTE flag.
2004-06-18 04:02:56 +00:00
Robert Watson
c012260726 Merge some additional leaf node socket buffer locking from
rwatson_netperf:

Introduce conditional locking of the socket buffer in fifofs kqueue
filters; KNOTE() will be called holding the socket buffer locks in
fifofs, but sometimes the kqueue() system call will poll using the
same entry point without holding the socket buffer lock.

Introduce conditional locking of the socket buffer in the socket
kqueue filters; KNOTE() will be called holding the socket buffer
locks in the socket code, but sometimes the kqueue() system call
will poll using the same entry points without holding the socket
buffer lock.

Simplify the logic in sodisconnect() since we no longer need spls.

NOTE: To remove conditional locking in the kqueue filters, it would
make sense to use a separate kqueue API entry into the socket/fifo
code when calling from the kqueue() system call.
2004-06-18 02:57:55 +00:00
Robert Watson
9535efc00d Merge additional socket buffer locking from rwatson_netperf:
- Lock down low hanging fruit use of sb_flags with socket buffer
  lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
  socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
  grab the receive socket buffer lock, which are currently actually
  the same lock.  Depending on how we want to play our cards, we
  may want to coallesce these lock uses to reduce overhead.

- Convert a if()->panic() into a KASSERT relating to so_state in
  soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.
2004-06-17 22:48:11 +00:00
Robert Watson
c0b99ffa02 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
Robert Watson
395a08c904 Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
Robert Watson
f6c0cce6d9 Introduce a mutex into struct sockbuf, sb_mtx, which will be used to
protect fields in the socket buffer.  Add accessor macros to use the
mutex (SOCKBUF_*()).  Initialize the mutex in soalloc(), and destroy
it in sodealloc().  Add addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.

Submitted by:	sam
Sponosored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 16:08:41 +00:00
Stefan Farfeleder
1a5ff9285a Avoid assignments to cast expressions.
Reviewed by:	md5
Approved by:	das (mentor)
2004-06-08 13:08:19 +00:00
Robert Watson
2658b3bb8e Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

          so_qlen          so_incqlen         so_qstate
          so_comp          so_incomp          so_list
          so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and  address
lock order concerns.  In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
  always allocate a file descriptor with falloc() so that if we do
  find a socket, we don't have to encounter the "Oh, there wasn't
  a socket" race that can occur if falloc() sleeps in the current
  code, which broke inbound accept() ordering, not to mention
  requiring backing out socket state changes in a way that raced
  with the protocol level.  We may want to add a lockless read of
  the queue state if polling of empty queues proves to be important
  to optimize.

- In accept1(), soref() the socket while holding the accept lock
  so that the socket cannot be free'd in a race with the protocol
  layer.  Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
  insert our new socket once we've committed to inserting it, or
  races can occur that cause the incomplete socket queue to
  overfill.  In the previously implementation, it was sufficient
  to simply tested once since calling soabort() didn't release
  synchronization permitting another thread to insert a socket as
  we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
  caller to remove a socket from the incomplete connection queue
  before calling soabort(), which prevents soabort() from having
  to walk into the accept socket to release the socket from its
  queue, and avoids races when releasing the accept mutex to enter
  soabort(), permitting soabort() to avoid lock ordering issues
  with the caller.

- Generally cluster accept queue related operations together
  throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.
2004-06-02 04:15:39 +00:00
Robert Watson
36568179e3 The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket.  This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.
2004-06-01 02:42:56 +00:00
Don Lewis
866046f5a6 Add MSG_NBIO flag option to soreceive() and sosend() that causes
them to behave the same as if the SS_NBIO socket flag had been set
for this call.  The SS_NBIO flag for ordinary sockets is set by
fcntl(fd, F_SETFL, O_NONBLOCK).

Pass the MSG_NBIO flag to the soreceive() and sosend() calls in
fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag
on the underlying socket for each I/O operation.  The O_NONBLOCK
flag is a property of the descriptor, and unlike ordinary sockets,
fifos may be referenced by multiple descriptors.
2004-06-01 01:18:51 +00:00
Bosko Milekic
099a0e588c Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00
Robert Watson
123f024b24 Compare pointers with NULL rather than using pointers are booleans in
if/for statements.  Assign pointers to NULL rather than typecast 0.
Compare pointers with NULL rather than 0.
2004-04-09 13:23:51 +00:00
Warner Losh
7f8a436ff2 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-05 21:03:37 +00:00
Robert Watson
8e44a7ec13 In sofree(), avoid nested declaration and initialization in
declaration.  Observe that initialization in declaration is
frequently incompatible with locking, not just a bad idea
due to style(9).

Submitted by:	bde
2004-03-31 03:48:35 +00:00
Robert Watson
181e65db5b Use a common return path for filt_soread() and filt_sowrite() to
simplify the impact of locking on these functions.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
2004-03-29 18:06:15 +00:00
Robert Watson
71c90a2944 In sofree(), moving caching of 'head' from 'so->so_head' to later in
the function once it has been determined to be non-NULL to simplify
locking on an earlier return.
2004-03-29 17:57:43 +00:00
Robert Watson
746e5bf09b Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by:	sam
2004-03-01 03:14:23 +00:00
Scott Long
740d9ba692 Convert the other use of flags to mflags in soalloc(). 2004-03-01 01:14:28 +00:00
Robert Watson
2bc87dcfbe Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument.  Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1.  This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by:	sam
2004-02-29 17:54:05 +00:00
Brian Feldman
f662a93197 Always socantsendmore() before deallocating a socket. This, in turn,
calls selwakeup() if necessary (which it is, if you don't want freed
memory hanging around on your td->td_selq).

Props to:	alfred
2004-02-12 01:48:40 +00:00
Poul-Henning Kamp
be8a62e821 Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead.  Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
2004-01-31 10:40:25 +00:00
Ruslan Ermilov
0541040c46 Since "m" is not part of the "mp" chain, need to free() it.
Reported by:	Stanford Metacompilation research group
2004-01-18 14:02:53 +00:00
Robert Watson
9e71dd0feb Reduce gratuitous redundancy and length in function names:
mac_setsockopt_label_set() -> mac_setsockopt_label()
  mac_getsockopt_label_get() -> mac_getsockopt_label()
  mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel()

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 18:25:20 +00:00
Robert Watson
12cbb9dc56 When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make
sure to sooptcopyin() the (struct mac) so that the MAC Framework
knows which label types are being requested.  This fixes process
queries of socket labels.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-16 03:53:36 +00:00
Seigo Tanimura
512824f8f7 - Implement selwakeuppri() which allows raising the priority of a
thread being waken up.  The thread waken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
Sam Leffler
395bb18680 speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by:	ps/jayanth
Obtained from:	netbsd (jason thorpe)
2003-10-28 05:47:40 +00:00
Mike Silbersack
184dcdc7c8 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00
Jeffrey Hsu
cc3426866c Make the second argument to sooptcopyout() constant in order to
simplify the upcoming PIM patches.

Submitted by:   Pavlin Radoslavov <pavlin@icir.org>
2003-08-05 00:27:54 +00:00
Robert Drehmel
4e19fe1081 To avoid a kernel panic provoked by a NULL pointer dereference,
do not clear the `sb_sel' member of the sockbuf structure
while invalidating the receive sockbuf in sorflush(), called
from soshutdown().

The panic was reproduceable from user land by attaching a knote
with EVFILT_READ filters to a socket, disabling further reads
from it using shutdown(2), and then closing it.  knote_remove()
was called to remove all knotes from the socket file descriptor
by detaching each using its associated filterops' detach call-
back function, sordetach() in this case, which tried to remove
itself from the invalidated sockbuf's klist (sb_sel.si_note).

PR:	kern/54331
2003-07-17 23:49:10 +00:00
Jeffrey Hsu
330841c763 Rev 1.121 meant to pass the value 1 to soalloc() to indicate waitok.
Reported by:	arr
2003-07-14 20:39:22 +00:00
David E. O'Brien
677b542ea2 Use __FBSDID(). 2003-06-11 00:56:59 +00:00
Alexander Kabaev
104a9b7e3e Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on:	standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
2003-04-29 13:36:06 +00:00
Olivier Houchard
695d74f337 Use while (*controlp != NULL) instead of do ... while (*control != NULL)
There are valid cases where *controlp will be NULL at this point.

Discussed with:	dwmalone
2003-04-14 14:44:36 +00:00
Dag-Erling Smørgrav
8994a245e0 Clean up whitespace, s/register //, refrain from strong urge to ANSIfy. 2003-03-02 15:56:49 +00:00
Dag-Erling Smørgrav
c952458814 uiomove-related caddr_t -> void * (just the low-hanging fruit) 2003-03-02 15:50:23 +00:00
Olivier Houchard
d6bf23783f Remove duplicate includes.
Submitted by:	Cyril Nguyen-Huu <cyril@ci0.org>
2003-02-20 03:26:11 +00:00
Warner Losh
a163d034fa Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
Alfred Perlstein
44956c9863 Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
Thomas Moestl
6f7cab9301 Disallow listen() on sockets which are in the SS_ISCONNECTED or
SS_ISCONNECTING state, returning EINVAL (which is what POSIX mandates
in this case).
listen() on connected or connecting sockets would cause them to enter
a bad state; in the TCP case, this could cause sockets to go
catatonic or panics, depending on how the socket was connected.

Reviewed by:	-net
MFC after:	2 weeks
2003-01-17 19:20:00 +00:00
Matthew Dillon
48e3128b34 Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.
2003-01-13 00:33:17 +00:00
Matthew Dillon
cd72f2180b Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary.  There are no operational changes in this
commit.
2003-01-12 01:37:13 +00:00
Alfred Perlstein
a09de2f7cd In sodealloc(), if there is an accept filter present on the socket
then call do_setopt_accept_filter(so, NULL) which will free the filter
instead of duplicating the code in do_setopt_accept_filter().

Pointed out by: Hiten Pandya <hiten@angelica.unixdaemons.com>
2003-01-05 11:14:04 +00:00
Poul-Henning Kamp
6ce9c72c30 s/sokqfilter/soo_kqfilter/ for consistency with the naming of all
other socket/file operations.
2002-12-23 21:37:28 +00:00
Maxim Konovalov
8819f45b51 Small SO_RCVTIMEO and SO_SNDTIMEO values are mistakenly taken to be zero.
PR:		kern/32827
Submitted by:	Hartmut Brandt <brandt@fokus.gmd.de>
Approved by:	re (jhb)
MFC after:	2 weeks
2002-11-27 13:34:04 +00:00
Alfred Perlstein
29f194457c Fix instances of macros with improperly parenthasized arguments.
Verified by: md5
2002-11-09 12:55:07 +00:00
Kelly Yancey
247a32f22a Fix filt_soread() to properly flag a kevent when a 0-byte datagram is
received.

Verified by:	dougb, Manfred Antar <null@pozo.com>
Sponsored by:	NTT Multimedia Communications Labs
2002-11-05 18:48:46 +00:00