Commit Graph

164 Commits

Author SHA1 Message Date
jhb
27f742341b Use sysctl_handle_long() instead of duplicating it's logic for
kern.ipc.maxsockbuf so that this sysctl works for 32-bit binaries running
on amd64 via compat/freebsd32.

MFC after:	3 days
2006-09-06 21:59:36 +00:00
rwatson
8db4b3b586 Remove 'register'.
Use ANSI C prototypes/function headers.
More deterministically line wrap comments.
2006-08-02 13:01:58 +00:00
rwatson
cedd19512d Reimplement socket buffer tear-down in sofree(): as the socket is no
longer referenced by other threads (hence our freeing it), we don't need
to set the can't send and can't receive flags, wake up the consumers,
perform two levels of locking, etc.  Implement a fast-path teardown,
sbdestroy(), which flushes and releases each socket buffer.  A manual
dom_dispose of the receive buffer is still required explicitly to GC
any in-flight file descriptors, etc, before flushing the buffer.

This results in a 9% UP performance improvement and 16% SMP performance
improvement on a tight loop of socket();close(); in micro-benchmarking,
but will likely also affect CPU-bound macro-benchmark performance.
2006-08-01 10:30:26 +00:00
rwatson
c5a16c08ba Remove non-socket buffer routines from uipc_sockbuf.c, and socket buffer
specific routines from uipc_socket2.c following repo-copy.  We might
rethink the location of one or two at some point, but the division was
relatively clean.  uipc_sockbuf.c is now the home of routines that
manipulate socket buffers.
2006-07-24 16:21:31 +00:00
rwatson
2137fc50e6 Several protocol switch functions (pru_abort, pru_detach, pru_sosetlabel)
return void, so don't implement no-op versions of these functions.
Instead, consistently check if those switch pointers are NULL before
invoking them.
2006-07-11 23:18:28 +00:00
rwatson
fba25d1a75 Remove now unneeded opt_mac.h and mac.h includes. 2006-07-06 13:25:51 +00:00
rwatson
a1677cd654 Remove sbinsertoob(), sbinsertoob_locked(). They violate (and have
basically always violated) invariannts of soreceive(), which assume
that the first mbuf pointer in a receive socket buffer can't change
while the SB_LOCK sleepable lock is held on the socket buffer,
which is precisely what these functions do.  No current protocols
invoke these functions, and removing them will help discourage them
from ever being used.  I should have removed them years ago, but
lost track of it.

MFC after:	1 week
Prodded almost by accident by:	peter
2006-06-17 22:48:34 +00:00
rwatson
120490c1a5 Move some functions and definitions from uipc_socket2.c to uipc_socket.c:
- Move sonewconn(), which creates new sockets for incoming connections on
  listen sockets, so that all socket allocate code is together in
  uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
  socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
  to sysctl.h and remove lots of scattered implementations in various
  IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
  reasons.  Statisticize soalloc() and sodealloc() as they are now
  required only in uipc_socket.c, and are internal to the socket
  implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after:	1 month
2006-06-10 14:34:07 +00:00
ps
10b2fe8dea Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.
2006-04-21 09:25:40 +00:00
rwatson
5479e5d692 Chance protocol switch method pru_detach() so that it returns void
rather than an error.  Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF.  so_pcb is now entirely owned and
managed by the protocol code.  Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit.  In their current state they may leak
memory or panic.

MFC after:	3 months
2006-04-01 15:42:02 +00:00
rwatson
8622e776f9 Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful.  Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit.  This will be corrected shortly in followup
commits to these components.

MFC after:      3 months
2006-04-01 15:15:05 +00:00
rwatson
ebefd09411 Add a sysctl, regression.sonewconn_earlytest, which when options
REGRESSION is enabled, allows user space to dictate that sonewconn()
should skip it's "skip the hard work" check to see if the listen
queue is full, and instead proceed with allocation of a socket and
trimming of the overflowed queue.  This makes it easier to test the
queue overflow logic.

MFC after:	1 month
2006-03-26 22:44:37 +00:00
rwatson
053507bd40 Change soabort() from returning int to returning void, since all
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.
2006-03-16 07:03:14 +00:00
jdp
88e469fc50 Fix a bug in the loop in sonewconn that makes room on the incomplete
connection queue for a new connection.  It was removing connections
from the wrong list.

Submitted by:	Paul Mikesell
Sponsored by:	Isilon Systems
MFC after:	1 week
2005-11-22 01:55:29 +00:00
andre
0df84f5a83 Retire MT_HEADER mbuf type and change its users to use MT_DATA.
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it.  It only adds a layer of confusion.  The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit.  For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-02 13:46:32 +00:00
rwatson
49831ed8da Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN.  This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit.  This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface.  This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
2005-10-30 19:44:40 +00:00
rwatson
1f48076149 Re-comment sbcompress() to explain what it is it does; it took me
quite a bit of reading to figure it out, and I want to avoid figuring
it out again.

Convert an if (foo) else printf("this is almost a panic") into a
KASSERT.

MFC after:	3 days
2005-09-18 10:30:10 +00:00
ssouhlal
efe31cd3da Fix the recent panics/LORs/hangs created by my kqueue commit by:
- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by:	jmg
Approved by:	re (scottl), scottl (mentor override)
Pointyhat to:	ssouhlal
Will be happy:	everyone
2005-07-01 16:28:32 +00:00
rwatson
ac1a365e2d In the current world order, each socket has two mutexes: a mutex
that protects socket and receive socket buffer state, and a second
mutex to protect send socket buffer state.  In some places, the
mutex shared between the socket and receive socket buffer will be
acquired twice, once by each layer, resulting in some
inconsistency, but providing the abstraction benefit of being able
to more easily separate the two mutexes in the future if desired.

When transitioning a socket to the SS_ISDISCONNECTING or
SS_ISDISCONNECTED states, grab the socket/receive socket buffer lock
once rather than grabbing it as the socket lock, modifying socket
state, then grabbing a second time as the receive lock in order to
modify the socket buffer state to indicate no further data can be
read.  This change is believed to close a race between the change in
socket state and the change in socket buffer state, which for a
remotely initiated close on a UNIX domain socket, resulted in
soreceive() returning ENOTCONN rather than an EOF condition.

A similar race still exists in the case of send, however, and is
harder to fix as the socket and send socket buffer mutexes are not
the same, and we would like to avoid holding combinations of socket
mutexes over sb_upcall until we've finished clarifying the locking
protocol for upcalls.

This change has the side affect of reducing the number of mutex
operations to initiate disconnect or perform disconnect on a
socket by two.

PR:		78824
Rerported by:	Marc Olzheim <marcolz@stack.nl>
MFC after:	2 weeks
2005-05-27 17:16:43 +00:00
rwatson
9237eab769 Extend the coverage of the accept and socket mutexes in soisconnected()
so that the socket lock is held over the test-and-set removal of the
accept filter option during connect, and the two socket mutex regions
(transition to connected, perform accept filter) are combined.
2005-03-12 13:39:39 +00:00
rwatson
d3722ef740 When upcalling from a socket in soisconnected() for an accept filter,
call with flag M_DONTWAIT rather than M_TRYWAIT, as we don't want to
do blocking memory allocation (etc) in the netisr.

MFC after:	3 days
2005-03-07 13:50:16 +00:00
rwatson
c9c16ea8f4 Prefer NULL to returning 0 cast to a pointer type.
MFC after:	3 days
2005-02-20 15:56:13 +00:00
rwatson
630d43c2be In sonewconn(), set the new socket's state to show the protocol-provided
connection status before inserting the new socket into the listen
socket's accept queue, or there might be a race in which another thread
wakes up when the accept lock is released, and sees the socket before its
state is set correctly.  The wakeup still occurs after the accept lock is
released.  There have been no diagnoses of this bug in real-world systems
(as yet).

MFC after:	3 days
2005-02-17 12:53:45 +00:00
imp
20280f1431 /* -> /*- for copyright notices, minor format tweaks as necessary 2005-01-06 23:35:40 +00:00
rwatson
f1732152a7 In sonewconn(), the s/if/while/ change to wait for room at the tail of
the accept queue is a feature, not a bug/issue, so remove the XXXRW
from the comment.
2004-12-23 01:16:21 +00:00
maxim
2425dcbcaa Fix a typo in a comparison appeared in rev. 1.125.
Submitted by:	JINMEI Tatuya
2004-10-27 05:37:58 +00:00
andre
504a86a63b Support for dynamically loadable and unloadable protocols within existing protocol
families.

The protosw[] array of any particular protocol family ("domain") is of fixed size
defined at compile time.  This made it impossible to dynamically add or remove any
protocols to or from it.  We work around this by introducing so called SPACER's
which are embedded into the protosw[] array at compile time.  The SPACER's have
a special protocol number (32767) to indicate the fact that they are SPACER's but
are otherwise NULL.  Only as many protocols can be dynamically loaded as SPACER's
are provided in the protosw[] structure.

The pr_usrreqs structure is treated more special and contains pointers to dummy
functions only returning EOPNOTSUPP.  This is needed because the use of those
functions pointers is usually not checked within the kernel because until now it
was assumed to be a valid function pointer.  Instead of fixing all potential
callers we just return a proper error code.

Two new functions provide a clean API to register and unregister a protocol.  The
register function expects a pointer to a valid and complete struct protosw including
a pointer to struct pru_usrreqs provided by the caller.  Upon successful registration
the pr_init() function will be called to finish initialization of the protocol.  The
unregister function restores the SPACER in place of the protocol again.  It is the
responseability of the caller to ensure proper closing of all sockets and freeing
of memory allocation by the unloading protocol.

 sys/protosw.h

  o Define generic PROTO_SPACER to be 32767
  o Prototypes for all pru_*_notsupp() functions
  o Prototypes for pf_proto_[un]register() functions

 kern/uipc_domain.c

  o Global struct pr_usrreqs nousrreqs containing valid pointers to the
    pru_*_notsupp() functions
  o New functions pf_proto_[un]register()

 kern/uipc_socket2.c

  o New functions bodies for all pru_*_notsupp() functions
2004-10-19 15:13:30 +00:00
jmg
bc1805c6e8 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
rwatson
758f90deb8 Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
  acquire the socket buffer lock and use the _locked() variants of both
  calls.  Note that the _locked() sowakeup() versions unlock the mutex on
  return.  This is done in uipc_send(), divert_packet(), mroute
  socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
  explicit unlock and use the _locked() sowakeup() variant.  This is done
  in soisdisconnecting(), soisdisconnected() when setting the can't send/
  receive flags and dropping data, and in uipc_rcvd() which adjusting
  back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.
2004-06-26 19:10:39 +00:00
rwatson
caac080ec9 Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted.  sbreserve() now acquires
the lock before calling sbreserve_locked().  In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order.  In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.
2004-06-24 01:37:04 +00:00
rwatson
21164a78ac Merge next step in socket buffer locking:
- sowakeup() now asserts the socket buffer lock on entry.  Move
  the call to KNOTE higher in sowakeup() so that it is made with
  the socket buffer lock held for consistency with other calls.
  Release the socket buffer lock prior to calling into pgsigio(),
  so_upcall(), or aio_swake().  Locking for this event management
  will need revisiting in the future, but this model avoids lock
  order reversals when upcalls into other subsystems result in
  socket/socket buffer operations.  Assert that the socket buffer
  lock is not held at the end of the function.

- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
  have _locked versions which assert the socket buffer lock on
  entry.  If a wakeup is required by sb_notify(), invoke
  sowakeup(); otherwise, unconditionally release the socket buffer
  lock.  This results in the socket buffer lock being released
  whether a wakeup is required or not.

- Break out socantsendmore() into socantsendmore_locked() that
  asserts the socket buffer lock.  socantsendmore()
  unconditionally locks the socket buffer before calling
  socantsendmore_locked().  Note that both functions return with
  the socket buffer unlocked as socantsendmore_locked() calls
  sowwakeup_locked() which has the same properties.  Assert that
  the socket buffer is unlocked on return.

- Break out socantrcvmore() into socantrcvmore_locked() that
  asserts the socket buffer lock.  socantrcvmore() unconditionally
  locks the socket buffer before calling socantrcvmore_locked().
  Note that both functions return with the socket buffer unlocked
  as socantrcvmore_locked() calls sorwakeup_locked() which has
  similar properties.  Assert that the socket buffer is unlocked
  on return.

- Break out sbrelease() into a sbrelease_locked() that asserts the
  socket buffer lock.  sbrelease() unconditionally locks the
  socket buffer before calling sbrelease_locked().
  sbrelease_locked() now invokes sbflush_locked() instead of
  sbflush().

- Assert the socket buffer lock in socket buffer sanity check
  functions sblastrecordchk(), sblastmbufchk().

- Assert the socket buffer lock in SBLINKRECORD().

- Break out various sbappend() functions into sbappend_locked()
  (and variations on that name) that assert the socket buffer
  lock.  The !_locked() variations unconditionally lock the socket
  buffer before calling their _locked counterparts.  Internally,
  make sure to call _locked() support routines, etc, if already
  holding the socket buffer lock.

- Break out sbinsertoob() into sbinsertoob_locked() that asserts
  the socket buffer lock.  sbinsertoob() unconditionally locks the
  socket buffer before calling sbinsertoob_locked().

- Break out sbflush() into sbflush_locked() that asserts the
  socket buffer lock.  sbflush() unconditionally locks the socket
  buffer before calling sbflush_locked().  Update panic strings
  for new function names.

- Break out sbdrop() into sbdrop_locked() that asserts the socket
  buffer lock.  sbdrop() unconditionally locks the socket buffer
  before calling sbdrop_locked().

- Break out sbdroprecord() into sbdroprecord_locked() that asserts
  the socket buffer lock.  sbdroprecord() unconditionally locks
  the socket buffer before calling sbdroprecord_locked().

- sofree() now calls socantsendmore_locked() and re-acquires the
  socket buffer lock on return.  It also now calls
  sbrelease_locked().

- sorflush() now calls socantrcvmore_locked() and re-acquires the
  socket buffer lock on return.  Clean up/mess up other behavior
  in sorflush() relating to the temporary stack copy of the socket
  buffer used with dom_dispose by more properly initializing the
  temporary copy, and selectively bzeroing/copying more carefully
  to prevent WITNESS from getting confused by improperly
  initialized mutexes.  Annotate why that's necessary, or at
  least, needed.

- soisconnected() now calls sbdrop_locked() before unlocking the
  socket buffer to avoid locking overhead.

Some parts of this change were:

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-21 00:20:43 +00:00
rwatson
e5f4cab982 Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.  Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation.  Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy.  Assert the socket buffer mutex
strategically after code sections or at the beginning of loops.  In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive().  This commit only merges basic socket locking in this
code; follow-up commits will close additional races.  As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by:	juli, tjr
2004-06-19 03:23:14 +00:00
rwatson
855c4bb01f Merge additional socket buffer locking from rwatson_netperf:
- Lock down low hanging fruit use of sb_flags with socket buffer
  lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
  socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
  grab the receive socket buffer lock, which are currently actually
  the same lock.  Depending on how we want to play our cards, we
  may want to coallesce these lock uses to reduce overhead.

- Convert a if()->panic() into a KASSERT relating to so_state in
  soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.
2004-06-17 22:48:11 +00:00
rwatson
029226f3a8 Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field.  This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.
2004-06-15 03:51:44 +00:00
rwatson
f2c0db1521 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
rwatson
f1bc833e95 Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
  manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
  copy while holding the socket lock, then release the socket lock
  to externalize to userspace.
2004-06-13 02:50:07 +00:00
rwatson
87449e4f90 Mark sun_noname as const since it's immutable. Update definitions
of functions that potentially accept &sun_noname (sbappendaddr(),
et al) to accept a const sockaddr pointer.
2004-06-04 04:07:08 +00:00
rwatson
576b26bafd Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

          so_qlen          so_incqlen         so_qstate
          so_comp          so_incomp          so_list
          so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and  address
lock order concerns.  In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
  always allocate a file descriptor with falloc() so that if we do
  find a socket, we don't have to encounter the "Oh, there wasn't
  a socket" race that can occur if falloc() sleeps in the current
  code, which broke inbound accept() ordering, not to mention
  requiring backing out socket state changes in a way that raced
  with the protocol level.  We may want to add a lockless read of
  the queue state if polling of empty queues proves to be important
  to optimize.

- In accept1(), soref() the socket while holding the accept lock
  so that the socket cannot be free'd in a race with the protocol
  layer.  Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
  insert our new socket once we've committed to inserting it, or
  races can occur that cause the incomplete socket queue to
  overfill.  In the previously implementation, it was sufficient
  to simply tested once since calling soabort() didn't release
  synchronization permitting another thread to insert a socket as
  we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
  caller to remove a socket from the incomplete connection queue
  before calling soabort(), which prevents soabort() from having
  to walk into the accept socket to release the socket from its
  queue, and avoids races when releasing the accept mutex to enter
  soabort(), permitting soabort() to avoid lock ordering issues
  with the caller.

- Generally cluster accept queue related operations together
  throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.
2004-06-02 04:15:39 +00:00
rwatson
bddadcf71a The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket.  This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.
2004-06-01 02:42:56 +00:00
bmilekic
f7574a2276 Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00
ps
b36520446e syncache broke rev 1.23 which was done to fix the "thundering herd"
problem in Apache.  Fix it.

Reviewed by:	peter
2004-05-19 00:22:10 +00:00
imp
74cf37bd00 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-05 21:03:37 +00:00
ps
e230d44ca7 Remove some netbsd debug code that crept into rev 1.116 2004-03-22 10:17:40 +00:00
rwatson
b0b5f961bd Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by:	sam
2004-03-01 03:14:23 +00:00
rwatson
94d29f7426 Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument.  Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1.  This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by:	sam
2004-02-29 17:54:05 +00:00
jhb
279b2b8278 Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count.  The plimit
  structure is treated similarly to struct ucred in that is is always copy
  on write, so having a reference to a structure is sufficient to read from
  it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
  limits from a process to keep the limit structure from changing out from
  under you while reading from it.
- Various global limits that are ints are not protected by a lock since
  int writes are atomic on all the archs we support and thus a lock
  wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
  behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
  either an rlimit, or the current or max individual limit of the specified
  resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
  other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
  (it didn't used the stackgap when it should have) but uses lim_rlimit()
  and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
  but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits.  It
  also no longer uses the stackgap for accessing sysctl's for the
  ibcs2_sysconf() syscall but uses kernel_sysctl() instead.  As a result,
  ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by:	mtm (mostly, I only did a few cleanups and catchups)
Tested on:	i386
Compiled on:	alpha, amd64
2004-02-04 21:52:57 +00:00
rwatson
9c969b771a Introduce a MAC label reference in 'struct inpcb', which caches
the   MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols.  This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks.  Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by:	sam, bms
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
tanimura
7eade05dfa - Implement selwakeuppri() which allows raising the priority of a
thread being waken up.  The thread waken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
sam
39ba2e1c90 speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by:	ps/jayanth
Obtained from:	netbsd (jason thorpe)
2003-10-28 05:47:40 +00:00
silby
f0e686a675 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00