Commit Graph

255 Commits

Author SHA1 Message Date
phk
cad3685bce Use fget_locked() instead of homerolled 2004-11-07 16:09:56 +00:00
alc
25b80a64b9 The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy().  Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set.  Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition.  Typically,
this is when it is undergoing I/O.
2004-11-03 20:17:31 +00:00
rwatson
d961169e94 Move from using the socket reference count to the file reference
count to prevent sockets from being garbage collected during
socket-specific system calls.  This is the same approach used in
most VFS-specific system calls, as well as generic file descriptor
system calls such as read() and write().

To do this, add a utility function getsock(), which is logically
identical to getvnode() used for the same purpose in VFS.  Unlike
fgetsock(), it returns with the file reference count elevated, but
no bump of the socket reference count.  Replace matching calls to
fputsock() with fdrop().

This change is made to all socket system calls other than
sendfile() and accept(), but the approach should be applicable to
those system calls also.

This shaves about four mutex operations off of each of these
system calls, including send() and recv() variants, adding about
1% to pps on minimal UDP packets for UP using netblast, and 4% on
SMP.

Reviewed by:	pjd
2004-10-24 23:45:01 +00:00
alc
e24e0aa793 Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup(). 2004-10-24 20:09:59 +00:00
alc
d66bfa760a Modify the vm object locking in do_sendfile() so that the containing object
is locked when vm_page_io_finish() is called on a page.  This is to satisfy
a new, post-RELENG_5 assertion in vm_page_io_finish().  (I am in the
process of transitioning the responsibility for synchronizing access to
various fields/flags on the page from the global page queues lock to the
per-object lock.)

Tripped over by: obrien@
2004-10-20 17:44:40 +00:00
alc
ad2a4ca3e0 Add a SOCKBUF_LOCK() to a rarely executed path in do_sendfile(). 2004-10-02 05:37:47 +00:00
jmg
bc1805c6e8 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
dwmalone
c8c1b8f415 Add a kern_setsockopt and kern_getsockopt which can read the option
values from either user land or from the kernel. Use them for
[gs]etsockopt and to clean up some calls to [gs]etsockopt in the
Linux emulation code that uses the stackgap.
2004-07-17 21:06:36 +00:00
phk
b9f13e4266 Clean up and wash struct iovec and struct uio handling.
Add copyiniov() which copies a struct iovec array in from userland into
a malloc'ed struct iovec.  Caller frees.

Change uiofromiov() to malloc the uio (caller frees) and name it
copyinuio() which is more appropriate.

Add cloneuio() which returns a malloc'ed copy.  Caller frees.

Use them throughout.
2004-07-10 15:42:16 +00:00
rwatson
fb654efba8 Remove spl()'s from do_sendfile(). 2004-07-09 01:46:03 +00:00
rwatson
deac06df05 Acquire socket lock in the "waiting for connection" loop in
kern_connect(), replacing tsleep() with msleep() with the socket
mutex.
2004-06-24 01:43:23 +00:00
bms
00a26380d4 Fix an inconsistency in socket option propagation on accept(). Propagate
the SS_NBIO flag from the parent socket to the child socket during an
accept() operation.

The file descriptor O_NONBLOCK flag would have been propagated already
by the fflag assignment, and therefore would have been inconsistent
with the underlying socket's so_state member.

This makes accept() more closely adhere to the API contract we effectively
outline in the manual page. Note also that Linux continues to differ here;
O_NONBLOCK is not propagated. The other BSDs do propagate the flag, as
does Solaris. The Single UNIX Specification does not offer specific
advice on this issue.

PR:		kern/45733
Requested by:	Jayanth Vijayaraghavan
Reviewed by:	rwatson
2004-06-22 23:58:09 +00:00
rwatson
e5f4cab982 Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.  Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation.  Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy.  Assert the socket buffer mutex
strategically after code sections or at the beginning of loops.  In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive().  This commit only merges basic socket locking in this
code; follow-up commits will close additional races.  As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by:	juli, tjr
2004-06-19 03:23:14 +00:00
rwatson
f2c0db1521 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
rwatson
f1bc833e95 Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
  manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
  copy while holding the socket lock, then release the socket lock
  to externalize to userspace.
2004-06-13 02:50:07 +00:00
rwatson
7c0b73a950 Correct whitespace errors in merge from rwatson_netperf: tabs instead of
spaces, no trailing tab at the end of line.

Pointed out by:	csjp
2004-06-12 23:36:59 +00:00
rwatson
82295697cd Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
  soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
  so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
  various contexts in the stack, both at the socket and protocol
  layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
  this could result in frobbing of a non-present socket if
  sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
  they don't free the socket.

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-12 20:47:32 +00:00
phk
86602fc06c Deorbit COMPAT_SUNOS.
We inherited this from the sparc32 port of BSD4.4-Lite1.  We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.
2004-06-11 11:16:26 +00:00
rwatson
8555f72de8 Correct a resource leak introduced in recent accept locking changes:
when I reordered events in accept1() to allocate a file descriptor
earlier, I didn't properly update use of goto on exit to unwind for
cases where the file descriptor is now held, but wasn't previously.
The result was that, in the event of accept() on a non-blocking socket,
or in the event of a socket error, a file descriptor would be leaked.

This ended up being non-fatal in many cases, as the file descriptor
would be properly GC'd on process exit, so only showed up for processes
that do a lot of non-blocking accept() calls, and also live for a long
time (such as qmail).

This change updates the use of goto targets to do additional unwinding.

Eyes provided by:	Brian Feldman <green@freebsd.org>
Feet, hands provided by:	Stefan Ehmann <shoesoft@gmx.net>,
				Dimitry Andric <dimitry@andric.com>
				Arjan van Leeuwen <avleeuwen@piwebs.com>
2004-06-07 21:45:44 +00:00
ume
3a5bdeaf2c allow more than MLEN bytes for ancillary data to meet the
requirement of Section 20.1 of RFC3542.

Obtained from:	KAME
MFC after:	1 week
2004-06-07 09:59:50 +00:00
rwatson
576b26bafd Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

          so_qlen          so_incqlen         so_qstate
          so_comp          so_incomp          so_list
          so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and  address
lock order concerns.  In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
  always allocate a file descriptor with falloc() so that if we do
  find a socket, we don't have to encounter the "Oh, there wasn't
  a socket" race that can occur if falloc() sleeps in the current
  code, which broke inbound accept() ordering, not to mention
  requiring backing out socket state changes in a way that raced
  with the protocol level.  We may want to add a lockless read of
  the queue state if polling of empty queues proves to be important
  to optimize.

- In accept1(), soref() the socket while holding the accept lock
  so that the socket cannot be free'd in a race with the protocol
  layer.  Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
  insert our new socket once we've committed to inserting it, or
  races can occur that cause the incomplete socket queue to
  overfill.  In the previously implementation, it was sufficient
  to simply tested once since calling soabort() didn't release
  synchronization permitting another thread to insert a socket as
  we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
  caller to remove a socket from the incomplete connection queue
  before calling soabort(), which prevents soabort() from having
  to walk into the accept socket to release the socket from its
  queue, and avoids races when releasing the accept mutex to enter
  soabort(), permitting soabort() to avoid lock ordering issues
  with the caller.

- Generally cluster accept queue related operations together
  throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.
2004-06-02 04:15:39 +00:00
rwatson
bddadcf71a The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket.  This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.
2004-06-01 02:42:56 +00:00
bmilekic
f7574a2276 Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00
rwatson
7128a3b5cc Unconditionally lock Giant in do_sendfile(), rather than locking it
conditional on debug.mpsafenet.  We can try pushing down Giant here
later, but we don't want to enter VFS without holding Giant.

Bumped into by:	kris
2004-05-08 02:24:21 +00:00
alc
b57e5e03fd Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation.  This flag's principal use is shortly after
allocation.  For such cases, clearing the flag is pointless.  The only
unusual use of PG_ZERO is in vfs_bio_clrbuf().  However, allocbuf() never
requests a prezeroed page.  So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by:	tegge@
2004-05-06 05:03:23 +00:00
silby
d35a5d60b9 Fix a regression in my change which sends headers along with data; a
side effect of that change caused headers to not be sent if a 0 byte
file was passed to sendfile.  This change fixes that behavior, allowing
sendfile to send out the headers even with a 0 byte file again.

Noticed by:	Dirk Engling
2004-04-08 07:14:34 +00:00
imp
74cf37bd00 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-05 21:03:37 +00:00
rwatson
40e717b764 Detatch incorrect spellings of detach. 2004-04-04 19:15:45 +00:00
alc
1ec4d75266 In some cases, sf_buf_alloc() should sleep with pri PCATCH; in others, it
should not.  Add a new parameter so that the caller can specify which is
the case.

Reported by:	dillon
2004-04-03 09:16:27 +00:00
rwatson
7360350f89 Conditionally acquire Giant when entering the sockets layer via the
socket-specific system calls based on debug.mpsafenet, rather than
acquiring Giant unconditionally.
2004-03-29 02:21:56 +00:00
rwatson
52c072609b When validating that the length sum in recvit(), we fail to release
Giant on an error.  Add a Giant acquisition.

Reviewed by:	sam, bms
2004-03-29 01:37:06 +00:00
alc
a2e820d27b Refactor the existing machine-dependent sf_buf_free() into a machine-
dependent function by the same name and a machine-independent function,
sf_buf_mext().  Aside from the virtue of making more of the code machine-
independent, this change also makes the interface more logical.  Before,
sf_buf_free() did more than simply undo an sf_buf_alloc(); it also
unwired and if necessary freed the page.  That is now the purpose of
sf_buf_mext().  Thus, sf_buf_alloc() and sf_buf_free() can now be used
as a general-purpose emphemeral map cache.
2004-03-16 19:04:28 +00:00
rwatson
48d4fe5ea4 Remove unneeded label 'done2' from socket(). We now grab Giant
only around socreate(), and don't need it for file descriptor
accesses.

Submitted by:	sam
2004-03-04 01:57:48 +00:00
silby
9428de17de Add the SF_NODISKIO flag to sendfile. This flag causes sendfile to be
mindful of blocking on disk I/O and instead return EBUSY when such
blocking would occur.

Results from the DeBox project indicate that blocking on disk I/O
can slow the performance of a kqueue/poll based webserver.  Using
a flag such as SF_NODISKIO and throwing connections that would block
to helper processes/threads helped increase performance.

Currently, only the Flash webserver uses this flag, although it could
probably be applied to thttpd with relative ease.

Idea by:	Yaoping Ruan & Vivek Pai
2004-02-08 07:35:48 +00:00
silby
ca8156c1ae Rename iov_to_uio to uiofromiov to be more consistent with other
uio* functions.

Suggested by:	bde
2004-02-04 08:43:21 +00:00
silby
a4c32edec5 Rewrite sendfile's header support so that headers are now sent in the first
packet along with data, instead of in their own packet.  When serving files
of size (packetsize - headersize) or smaller, this will result in one less
packet crossing the network.  Quick testing with thttpd and http_load has
shown a noticeable performance improvement in this case (350 vs 330 fetches
per second.)

Included in this commit are two support routines, iov_to_uio, and m_uiotombuf;
these routines are used by sendfile to construct the header mbuf chain that
will be linked to the rest of the data in the socket buffer.
2004-02-01 07:56:44 +00:00
kan
a1caf18323 One more instance of magic number used in place of IO_SEQSHIFT.
Submitted by:	alc
2004-01-19 20:45:43 +00:00
des
f270054cd3 New file descriptor allocation code, derived from similar code introduced
in OpenBSD by Niels Provos.  The patch introduces a bitmap of allocated
file descriptors which is used to locate available descriptors when a new
one is needed.  It also moves the task of growing the file descriptor table
out of fdalloc(), reducing complexity in both fdalloc() and do_dup().

Debts of gratitude are owed to tjr@ (who provided the original patch on
which this work is based), grog@ (for the gdb(4) man page) and rwatson@
(for assistance with pxeboot(8)).
2004-01-15 10:15:04 +00:00
des
4efce4e230 Back out 1.166, which was committed by mistake. 2004-01-11 20:07:15 +00:00
des
97917ae33d Mechanical whitespace cleanup + other minor style nits. 2004-01-11 19:56:42 +00:00
des
8de0243528 Mechanical whitespace cleanup + minor style nits. 2004-01-11 19:43:14 +00:00
des
1d50ccc7f9 More unparenthesized return values. 2004-01-10 17:14:53 +00:00
des
65c30ff68d Style: parenthesize return values. 2004-01-10 13:03:43 +00:00
truckman
417693c813 Add a somewhat redundant check on the len arguement to getsockaddr() to
avoid relying on the minimum memory allocation size to avoid problems.
The check is somewhat redundant because the consumers of the returned
structure will check that sa_len is a protocol-specific larger size.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by:	nectar
MFC after:	30 days
2004-01-10 08:28:54 +00:00
silby
a7d8091ae5 Track three new sendfile-related statistics:
- The number of times sendfile had to do disk I/O
- The number of times sfbuf allocation failed
- The number of times sfbuf allocation had to wait
2003-12-28 08:57:09 +00:00
dwmalone
92fec55360 In socket(2) we only need Giant around the call to socreate, so just
grab it there.
2003-12-25 23:44:38 +00:00
alfred
c4b798a73d Add restrict qualifiers.
PR: 44394
Submitted by: Craig Rodrigues <rodrige@attbi.com>
2003-12-24 18:47:43 +00:00
dg
da88330aaa Fixed a bug in sendfile(2) where the sent data would be corrupted due
to sendfile(2) being erroneously automatically restarted after a signal
is delivered. Fixed by converting ERESTART to EINTR prior to exiting.

Updated manual page to indicate the potential EINTR error, its cause
and consequences.

Approved by: re@freebsd.org
2003-12-01 22:12:50 +00:00
alc
74614e7f63 - Modify alpha's sf_buf implementation to use the direct virtual-to-
physical mapping.
 - Move the sf_buf API to its own header file; make struct sf_buf's
   definition machine dependent.  In this commit, we remove an
   unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
   Ultimately, we may eliminate struct sf_buf on those architecures
   except as an opaque pointer that references a vm page.
2003-11-16 06:11:26 +00:00
dwmalone
be405a4cbd falloc allocates a file structure and adds it to the file descriptor
table, acquiring the necessary locks as it works. It usually returns
two references to the new descriptor: one in the descriptor table
and one via a pointer argument.

As falloc releases the FILEDESC lock before returning, there is a
potential for a process to close the reference in the file descriptor
table before falloc's caller gets to use the file. I don't think this
can happen in practice at the moment, because Giant indirectly protects
closes.

To stop the file being completly closed in this situation, this change
makes falloc set the refcount to two when both references are returned.
This makes life easier for several of falloc's callers, because the
first thing they previously did was grab an extra reference on the
file.

Reviewed by:	iedowse
Idea run past:	jhb
2003-10-19 20:41:07 +00:00
alc
8b0114def1 Migrate the sf_buf allocator that is used by sendfile(2) and zero-copy
sockets into machine-dependent files.  The rationale for this
migration is illustrated by the modified amd64 allocator.  It uses the
amd64's direct map to avoid emphemeral mappings in the kernel's
address space.  On an SMP, the emphemeral mappings result in an IPI
for TLB shootdown for each transmitted page.  Yuck.

Maintainers of other 64-bit platforms with direct maps should be able
to use the amd64 allocator as a reference implementation.
2003-08-29 20:04:10 +00:00
kan
91297961f6 Drop Giant in recvit before returning an error to the caller to avoid
leaking the Giant on the syscall exit.
2003-08-11 19:37:11 +00:00
yar
65e4901760 If connect(2) has been interrupted by a signal and therefore the
connection is to be established asynchronously, behave as in the
case of non-blocking mode:

- keep the SS_ISCONNECTING bit set thus indicating that
  the connection establishment is in progress, which is the case
  (clearing the bit in this case was just a bug);

- return EALREADY, instead of the confusing and unreasonable
  EADDRINUSE, upon further connect(2) attempts on this socket
  until the connection is established (this also brings our
  connect(2) into accord with IEEE Std 1003.1.)
2003-08-06 14:04:47 +00:00
dwmalone
cb188056e6 Do some minor Giant pushdown made possible by copyin, fget, fdrop,
malloc and mbuf allocation all not requiring Giant.

1) ostat, fstat and nfstat don't need Giant until they call fo_stat.
2) accept can copyin the address length without grabbing Giant.
3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit.
4) move Giant grabbing from each indivitual recv* syscall to recvit.
2003-08-04 21:28:57 +00:00
alc
4d05c167d2 Use kmem_alloc_nofault() rather than kmem_alloc_pageable() in sf_buf_init().
(See revision 1.140 of kern/sys_pipe.c for a detailed rationale.)

Submitted by:	tegge
2003-08-02 04:18:56 +00:00
truckman
b30ab68043 VOP_GETVOBJECT() wants to be called with the vnode lock held. 2003-06-19 03:55:01 +00:00
alc
e8221b068f Finish the vm object locking in sendfile(2). More generally,
the vm locking in sendfile(2) is complete.
2003-06-12 05:52:09 +00:00
alc
4451de3f80 Lock the vm object when removing a page. 2003-06-11 21:23:04 +00:00
obrien
3b8fff9e4c Use __FBSDID(). 2003-06-11 00:56:59 +00:00
dwmalone
a02706ac93 Grab giant in sendit rather than kern_sendit because sockargs may
allocate mbufs with M_TRYWAIT, which may require Giant.

Reviewed by:	bmilekic
Approved by:	re (scottl)
2003-05-29 18:36:26 +00:00
dwmalone
86e87d5e2d Split sendit into two parts. The first part, still called sendit, that
does the copyin stuff and then calls the second part kern_sendit to do
the hard work. Don't bother holding Giant during the copyin phase.

The intent of this is to allow the Linux emulator to impliment send*
syscalls without using the stackgap.
2003-05-05 20:33:38 +00:00
alc
3333da28bf Recent changes to uipc_cow.c have eliminated the need for some sf_buf-
related variables to be global.  Make them either local to sf_buf_init() or
static.
2003-03-31 06:25:42 +00:00
alc
6f59be774d Pass the vm_page's address to sf_buf_alloc(); map the vm_page as part
of sf_buf_alloc() instead of expecting sf_buf_alloc()'s caller to map it.

The ultimate reason for this change is to enable two optimizations:
(1) that there never be more than one sf_buf mapping a vm_page at a time
and (2) 64-bit architectures can transparently use their 1-1 virtual
to physical mapping (e.g., "K0SEG") avoiding the overhead of pmap_qenter()
and pmap_qremove().
2003-03-29 06:14:14 +00:00
alc
4641d3d127 Pass the sf buf to MEXTADD() as the optional argument. This permits
the simplification of socow_iodone() and sf_buf_free(); they don't
have to reverse engineer the sf buf from the data's address.
2003-03-16 07:19:12 +00:00
alc
e5b46f193e Remove GIANT_REQUIRED from sf_buf_free(). 2003-03-06 04:48:19 +00:00
tegge
2072a48a93 Sync new socket nonblocking/async state with file flags in accept().
PR:		1775
Reviewed by:	mbr
2003-02-23 23:00:28 +00:00
cognet
8d83e0054a Remove duplicate includes.
Submitted by:	Cyril Nguyen-Huu <cyril@ci0.org>
2003-02-20 03:26:11 +00:00
imp
cf874b345d Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
ume
56a5dfaa24 Break out the bind and connect syscalls to intend to make calling
these syscalls internally easy.
This is preparation for force coming IPv6 support for Linuxlator.

Submitted by:	dwmalone
MFC after:	10 days
2003-02-03 17:36:52 +00:00
alfred
b5c0015ac9 Consolidate MIN/MAX macros into one place (param.h).
Submitted by: Hiten Pandya <hiten@unixdaemons.com>
2003-02-02 13:17:30 +00:00
alfred
bf8e8a6e8f Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
dillon
ccd5574cc6 Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.
2003-01-13 00:33:17 +00:00
dillon
ddf9ef103e Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary.  There are no operational changes in this
commit.
2003-01-12 01:37:13 +00:00
phk
26984a4003 Move the declaration of the socket fileops from socketvar.h to file.h.
This allows us to use the new typedefs and removes the needs for a number
of forward struct declarations in socketvar.h
2002-12-23 22:46:47 +00:00
rwatson
1f2df65750 Integrate mac_check_socket_send() and mac_check_socket_receive()
checks from the MAC tree: allow policies to perform access control
for the ability of a process to send and receive data via a socket.
At some point, we might also pass in additional address information
if an explicit address is requested on send.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-06 14:39:15 +00:00
truckman
da2757cbc5 In an SMP environment post-Giant it is no longer safe to blindly
dereference the struct sigio pointer without any locking.  Change
fgetown() to take a reference to the pointer instead of a copy of the
pointer and call SIGIO_LOCK() before copying the pointer and
dereferencing it.

Reviewed by:	rwatson
2002-10-03 02:13:00 +00:00
archie
bf4ebb4609 accept(2) on a socket that has been shutdown(2) normally returns
ECONNABORTED. Make this happen in the non-blocking case as well.
The previous behavior was to return EAGAIN, which (a) is not
consistent with the blocking case and (b) causes the application
to think the socket is still valid.

PR:		bin/42100
Reviewed by:	freebsd-net
MFC after:	3 days
2002-08-28 20:56:01 +00:00
rwatson
44404e4547 In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
  "cred", and change the semantics of consumers of fo_read() and
  fo_write() to pass the active credential of the thread requesting
  an operation rather than the cached file cred.  The cached file
  cred is still available in fo_read() and fo_write() consumers
  via fp->f_cred.  These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
  pipe_read/write() now authorize MAC using active_cred rather
  than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
  VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred.  Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not.  If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-08-15 20:55:08 +00:00
rwatson
10845c31e4 Fix return case for negative namelen by jumping to normal exit processing
rather than immediately returning, or we may not unlock necessary locks.

Noticed by:	Mike Heffner <mheffner@acm.vt.edu>
2002-08-15 17:34:03 +00:00
dg
6dce2e7eff Moved sf_buf_alloc and sf_buf_free function declarations to sys/socketvar.h
so that they can be seen by external callers.
2002-08-13 19:03:19 +00:00
dg
7a86c9d738 Remove obsolete comment about sf_buf_* functions being static. They were
made un-static in rev 1.114.
2002-08-13 18:20:08 +00:00
semenu
68c52ce00e Fix sendfile(), who was calling vn_rdwr() without aresid parameter and
thus hiting EIO at the end of file. This is believed to be a feature
(not a bug) of vn_rdwr(), so we turn it off by supplying aresid param.

Reviewed by:	rwatson, dg
2002-08-11 20:33:11 +00:00
nectar
6cc73b8947 While we're at it, add range checks similar to those in previous commit to
getsockname() and getpeername(), too.
2002-08-09 12:58:11 +00:00
rwatson
1bb6530f3e Add additional range checks for copyout targets.
Submitted by:	Silvio Cesare <silvio@qualys.com>
2002-08-09 05:50:32 +00:00
rwatson
a5dcc1fd3d Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by:	bde
2002-08-01 17:47:56 +00:00
rwatson
7f656e6806 Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument connect(), listen(), and bind() system calls to invoke
MAC framework entry points to permit policies to authorize these
requests.  This can be useful for policies that want to limit
the activity of processes involving particular types of IPC and
network activity.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 16:39:49 +00:00
alc
34a3674724 o In do_sendfile(), replace vm_page_sleep_busy() by vm_page_sleep_if_busy()
and extend the scope of the page queues lock to cover all accesses
   to the page's flags and busy fields.
2002-07-30 18:51:07 +00:00
arr
249ad7ccdb - Make use of the VM_ALLOC_WIRED flag in the call to vm_page_alloc() in
do_sendfile().  This allows us to rearrange an if statement in order to
  avoid doing an unnecesary call to vm_page_lock_queues(), and an attempt
  at re-wiring the pages (which were wired in the vm_page_alloc() call).

Reviewed by:	alc, jhb
2002-07-23 01:09:34 +00:00
alc
5bc529bb34 Lock accesses to the page queues by sendfile() and friends. 2002-07-13 03:10:55 +00:00
alfred
598d9de715 Create a bug-for-bug FreeBSD4 compatible version of sendfile and move the
fixed sendfile over.  This is needed to preserve binary compatibility from
4.x to 5.x.
2002-07-12 06:51:57 +00:00
alfred
d47feb4376 nuke more instances of caddr_t 2002-06-29 00:02:01 +00:00
alfred
b0475cc9d5 remove or replace caddr_t with void.
make the mbuf external free function take a void * rather than caddr_t.
2002-06-28 23:48:23 +00:00
ken
0d3a835f3f At long last, commit the zero copy sockets code.
MAKEDEV:	Add MAKEDEV glue for the ti(4) device nodes.

ti.4:		Update the ti(4) man page to include information on the
		TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
		and also include information about the new character
		device interface and the associated ioctls.

man9/Makefile:	Add jumbo.9 and zero_copy.9 man pages and associated
		links.

jumbo.9:	New man page describing the jumbo buffer allocator
		interface and operation.

zero_copy.9:	New man page describing the general characteristics of
		the zero copy send and receive code, and what an
		application author should do to take advantage of the
		zero copy functionality.

NOTES:		Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
		TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files:	Add uipc_jumbo.c and uipc_cow.c.

conf/options:	Add the 5 options mentioned above.

kern_subr.c:	Receive side zero copy implementation.  This takes
		"disposable" pages attached to an mbuf, gives them to
		a user process, and then recycles the user's page.
		This is only active when ZERO_COPY_SOCKETS is turned on
		and the kern.ipc.zero_copy.receive sysctl variable is
		set to 1.

uipc_cow.c:	Send side zero copy functions.  Takes a page written
		by the user and maps it copy on write and assigns it
		kernel virtual address space.  Removes copy on write
		mapping once the buffer has been freed by the network
		stack.

uipc_jumbo.c:	Jumbo disposable page allocator code.  This allocates
		(optionally) disposable pages for network drivers that
		want to give the user the option of doing zero copy
		receive.

uipc_socket.c:	Add kern.ipc.zero_copy.{send,receive} sysctls that are
		enabled if ZERO_COPY_SOCKETS is turned on.

		Add zero copy send support to sosend() -- pages get
		mapped into the kernel instead of getting copied if
		they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
		can be used elsewhere.  (uipc_cow.c)

if_media.c:	In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
		calling malloc() with M_WAITOK.  Return an error if
		the M_NOWAIT malloc fails.

		The ti(4) driver and the wi(4) driver, at least, call
		this with a mutex held.  This causes witness warnings
		for 'ifconfig -a' with a wi(4) or ti(4) board in the
		system.  (I've only verified for ti(4)).

ip_output.c:	Fragment large datagrams so that each segment contains
		a multiple of PAGE_SIZE amount of data plus headers.
		This allows the receiver to potentially do page
		flipping on receives.

if_ti.c:	Add zero copy receive support to the ti(4) driver.  If
		TI_PRIVATE_JUMBOS is not defined, it now uses the
		jumbo(9) buffer allocator for jumbo receive buffers.

		Add a new character device interface for the ti(4)
		driver for the new debugging interface.  This allows
		(a patched version of) gdb to talk to the Tigon board
		and debug the firmware.  There are also a few additional
		debugging ioctls available through this interface.

		Add header splitting support to the ti(4) driver.

		Tweak some of the default interrupt coalescing
		parameters to more useful defaults.

		Add hooks for supporting transmit flow control, but
		leave it turned off with a comment describing why it
		is turned off.

if_tireg.h:	Change the firmware rev to 12.4.11, since we're really
		at 12.4.11 plus fixes from 12.4.13.

		Add defines needed for debugging.

		Remove the ti_stats structure, it is now defined in
		sys/tiio.h.

ti_fw.h:	12.4.11 firmware.

ti_fw2.h:	12.4.11 firmware, plus selected fixes from 12.4.13,
		and my header splitting patches.  Revision 12.4.13
		doesn't handle 10/100 negotiation properly.  (This
		firmware is the same as what was in the tree previously,
		with the addition of header splitting support.)

sys/jumbo.h:	Jumbo buffer allocator interface.

sys/mbuf.h:	Add a new external mbuf type, EXT_DISPOSABLE, to
		indicate that the payload buffer can be thrown away /
		flipped to a userland process.

socketvar.h:	Add prototype for socow_setup.

tiio.h:		ioctl interface to the character portion of the ti(4)
		driver, plus associated structure/type definitions.

uio.h:		Change prototype for uiomoveco() so that we'll know
		whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c:	In vm_fault(), check to see whether we need to do a page
		based copy on write fault.

vm_object.c:	Add a new function, vm_object_allocate_wait().  This
		does the same thing that vm_object allocate does, except
		that it gives the caller the opportunity to specify whether
		it should wait on the uma_zalloc() of the object structre.

		This allows vm objects to be allocated while holding a
		mutex.  (Without generating WITNESS warnings.)

		vm_object_allocate() is implemented as a call to
		vm_object_allocate_wait() with the malloc flag set to
		M_WAITOK.

vm_object.h:	Add prototype for vm_object_allocate_wait().

vm_page.c:	Add page-based copy on write setup, clear and fault
		routines.

vm_page.h:	Add page based COW function prototypes and variable in
		the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
alfred
619c88aeeb Implement SO_NOSIGPIPE option for sockets. This allows one to request that
an EPIPE error return not generate SIGPIPE on sockets.

Submitted by: lioux
Inspired by: Darwin
2002-06-20 18:52:54 +00:00
jhb
fbebc83b5b Catch up to changes in ktrace API. 2002-06-07 05:37:18 +00:00
tanimura
e6fa9b9e92 Back out my lats commit of locking down a socket, it conflicts with hsu's work.
Requested by:	hsu
2002-05-31 11:52:35 +00:00
tanimura
92d8381dd5 Lock down a socket, milestone 1.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
  socket buffer. The mutex in the receive buffer also protects the data
  in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

  - so_count
  - so_options
  - so_linger
  - so_state

o Remove *_locked() socket APIs.  Make the following socket APIs
  touching the members above now require a locked socket:

 - sodisconnect()
 - soisconnected()
 - soisconnecting()
 - soisdisconnected()
 - soisdisconnecting()
 - sofree()
 - soref()
 - sorele()
 - sorwakeup()
 - sotryfree()
 - sowakeup()
 - sowwakeup()

Reviewed by:	alfred
2002-05-20 05:41:09 +00:00
rwatson
4d39491e7e In sendfile(), use the vn_rdwr() helper function, rather than manually
constructing a struct aio and invoking VOP_READ() directly.  This cleans
up the code a little, but also has the advantage of making sure almost
all vnode read/write access in the kernel goes through the helper
function, meaning that instrumentation of that helper function can impact
almost all relevant read/write operations.  In this case, it permits us
to put MAC hooks into vn_rdwr() and not modify uipc_syscalls.c (yet).

In general, if helper vn_*() functions exist, they should be used in
preference to direct VOP's in system call service code.

Submitted by:	green
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-04-19 13:46:24 +00:00
jhb
db9aa81e23 Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on:	i386, alpha, sparc64
2002-04-04 21:03:38 +00:00
bde
90f30ee936 Fixed some style bugs in the removal of __P(()). The main ones were
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses.  Switch to KNF
formatting and/or rewrap the whole prototype in some cases.
2002-03-24 05:09:11 +00:00