freebsd-skq

Author	SHA1	Message	Date
phk	cad3685bce	Use fget_locked() instead of homerolled	2004-11-07 16:09:56 +00:00
alc	25b80a64b9	The synchronization provided by vm object locking has eliminated the need for most calls to vm_page_busy(). Specifically, most calls to vm_page_busy() occur immediately prior to a call to vm_page_remove(). In such cases, the containing vm object is locked across both calls. Consequently, the setting of the vm page's PG_BUSY flag is not even visible to other threads that are following the synchronization protocol. This change (1) eliminates the calls to vm_page_busy() that immediately precede a call to vm_page_remove() or functions, such as vm_page_free() and vm_page_rename(), that call it and (2) relaxes the requirement in vm_page_remove() that the vm page's PG_BUSY flag is set. Now, the vm page's PG_BUSY flag is set only when the vm object lock is released while the vm page is still in transition. Typically, this is when it is undergoing I/O.	2004-11-03 20:17:31 +00:00
rwatson	d961169e94	Move from using the socket reference count to the file reference count to prevent sockets from being garbage collected during socket-specific system calls. This is the same approach used in most VFS-specific system calls, as well as generic file descriptor system calls such as read() and write(). To do this, add a utility function getsock(), which is logically identical to getvnode() used for the same purpose in VFS. Unlike fgetsock(), it returns with the file reference count elevated, but no bump of the socket reference count. Replace matching calls to fputsock() with fdrop(). This change is made to all socket system calls other than sendfile() and accept(), but the approach should be applicable to those system calls also. This shaves about four mutex operations off of each of these system calls, including send() and recv() variants, adding about 1% to pps on minimal UDP packets for UP using netblast, and 4% on SMP. Reviewed by: pjd	2004-10-24 23:45:01 +00:00
alc	e24e0aa793	Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup().	2004-10-24 20:09:59 +00:00
alc	d66bfa760a	Modify the vm object locking in do_sendfile() so that the containing object is locked when vm_page_io_finish() is called on a page. This is to satisfy a new, post-RELENG_5 assertion in vm_page_io_finish(). (I am in the process of transitioning the responsibility for synchronizing access to various fields/flags on the page from the global page queues lock to the per-object lock.) Tripped over by: obrien@	2004-10-20 17:44:40 +00:00
alc	ad2a4ca3e0	Add a SOCKBUF_LOCK() to a rarely executed path in do_sendfile().	2004-10-02 05:37:47 +00:00
jmg	bc1805c6e8	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)	2004-08-15 06:24:42 +00:00
dwmalone	c8c1b8f415	Add a kern_setsockopt and kern_getsockopt which can read the option values from either user land or from the kernel. Use them for [gs]etsockopt and to clean up some calls to [gs]etsockopt in the Linux emulation code that uses the stackgap.	2004-07-17 21:06:36 +00:00
phk	b9f13e4266	Clean up and wash struct iovec and struct uio handling. Add copyiniov() which copies a struct iovec array in from userland into a malloc'ed struct iovec. Caller frees. Change uiofromiov() to malloc the uio (caller frees) and name it copyinuio() which is more appropriate. Add cloneuio() which returns a malloc'ed copy. Caller frees. Use them throughout.	2004-07-10 15:42:16 +00:00
rwatson	fb654efba8	Remove spl()'s from do_sendfile().	2004-07-09 01:46:03 +00:00
rwatson	deac06df05	Acquire socket lock in the "waiting for connection" loop in kern_connect(), replacing tsleep() with msleep() with the socket mutex.	2004-06-24 01:43:23 +00:00
bms	00a26380d4	Fix an inconsistency in socket option propagation on accept(). Propagate the SS_NBIO flag from the parent socket to the child socket during an accept() operation. The file descriptor O_NONBLOCK flag would have been propagated already by the fflag assignment, and therefore would have been inconsistent with the underlying socket's so_state member. This makes accept() more closely adhere to the API contract we effectively outline in the manual page. Note also that Linux continues to differ here; O_NONBLOCK is not propagated. The other BSDs do propagate the flag, as does Solaris. The Single UNIX Specification does not offer specific advice on this issue. PR: kern/45733 Requested by: Jayanth Vijayaraghavan Reviewed by: rwatson	2004-06-22 23:58:09 +00:00
rwatson	e5f4cab982	Assert socket buffer lock in sb_lock() to protect socket buffer sleep lock state. Convert tsleep() into msleep() with socket buffer mutex as argument. Hold socket buffer lock over sbunlock() to protect sleep lock state. Assert socket buffer lock in sbwait() to protect the socket buffer wait state. Convert tsleep() into msleep() with socket buffer mutex as argument. Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK() in order to call into these functions with the lock, as well as to start protecting other socket buffer use in their implementation. Drop the socket buffer mutexes around calls into the protocol layer, around potentially blocking operations, for copying to/from user space, and VM operations relating to zero-copy. Assert the socket buffer mutex strategically after code sections or at the beginning of loops. In some cases, modify return code to ensure locks are properly dropped. Convert the potentially blocking allocation of storage for the remote address in soreceive() into a non-blocking allocation; we may wish to move the allocation earlier so that it can block prior to acquisition of the socket buffer lock. Drop some spl use. NOTE: Some races exist in the current structuring of sosend() and soreceive(). This commit only merges basic socket locking in this code; follow-up commits will close additional races. As merged, these changes are not sufficient to run without Giant safely. Reviewed by: juli, tjr	2004-06-19 03:23:14 +00:00
rwatson	f2c0db1521	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coallesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.	2004-06-14 18:16:22 +00:00
rwatson	f1bc833e95	Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so): - Hold socket lock over calls to MAC entry points reading or manipulating socket labels. - Assert socket lock in MAC entry point implementations. - When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.	2004-06-13 02:50:07 +00:00
rwatson	7c0b73a950	Correct whitespace errors in merge from rwatson_netperf: tabs instead of spaces, no trailing tab at the end of line. Pointed out by: csjp	2004-06-12 23:36:59 +00:00
rwatson	82295697cd	Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS	2004-06-12 20:47:32 +00:00
phk	86602fc06c	Deorbit COMPAT_SUNOS. We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither a sparc32 port nor a SunOS4.x compatibility desire these days.	2004-06-11 11:16:26 +00:00
rwatson	8555f72de8	Correct a resource leak introduced in recent accept locking changes: when I reordered events in accept1() to allocate a file descriptor earlier, I didn't properly update use of goto on exit to unwind for cases where the file descriptor is now held, but wasn't previously. The result was that, in the event of accept() on a non-blocking socket, or in the event of a socket error, a file descriptor would be leaked. This ended up being non-fatal in many cases, as the file descriptor would be properly GC'd on process exit, so only showed up for processes that do a lot of non-blocking accept() calls, and also live for a long time (such as qmail). This change updates the use of goto targets to do additional unwinding. Eyes provided by: Brian Feldman <green@freebsd.org> Feet, hands provided by: Stefan Ehmann <shoesoft@gmx.net>, Dimitry Andric <dimitry@andric.com> Arjan van Leeuwen <avleeuwen@piwebs.com>	2004-06-07 21:45:44 +00:00
ume	3a5bdeaf2c	allow more than MLEN bytes for ancillary data to meet the requirement of Section 20.1 of RFC3542. Obtained from: KAME MFC after: 1 week	2004-06-07 09:59:50 +00:00
rwatson	576b26bafd	Integrate accept locking from rwatson_netperf, introducing a new global mutex, accept_mtx, which serializes access to the following fields across all sockets: so_qlen so_incqlen so_qstate so_comp so_incomp so_list so_head While providing only coarse granularity, this approach avoids lock order issues between sockets by avoiding ownership of the fields by a specific socket and its per-socket mutexes. While here, rewrite soclose(), sofree(), soaccept(), and sonewconn() to add assertions, close additional races and address lock order concerns. In particular: - Reorganize the optimistic concurrency behavior in accept1() to always allocate a file descriptor with falloc() so that if we do find a socket, we don't have to encounter the "Oh, there wasn't a socket" race that can occur if falloc() sleeps in the current code, which broke inbound accept() ordering, not to mention requiring backing out socket state changes in a way that raced with the protocol level. We may want to add a lockless read of the queue state if polling of empty queues proves to be important to optimize. - In accept1(), soref() the socket while holding the accept lock so that the socket cannot be free'd in a race with the protocol layer. Likewise in netgraph equivilents of the accept1() code. - In sonewconn(), loop waiting for the queue to be small enough to insert our new socket once we've committed to inserting it, or races can occur that cause the incomplete socket queue to overfill. In the previously implementation, it was sufficient to simply tested once since calling soabort() didn't release synchronization permitting another thread to insert a socket as we discard a previous one. - In soclose()/sofree()/et al, it is the responsibility of the caller to remove a socket from the incomplete connection queue before calling soabort(), which prevents soabort() from having to walk into the accept socket to release the socket from its queue, and avoids races when releasing the accept mutex to enter soabort(), permitting soabort() to avoid lock ordering issues with the caller. - Generally cluster accept queue related operations together throughout these functions in order to facilitate locking. Annotate new locking in socketvar.h.	2004-06-02 04:15:39 +00:00
rwatson	bddadcf71a	The SS_COMP and SS_INCOMP flags in the so_state field indicate whether the socket is on an accept queue of a listen socket. This change renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new state field on the socket, so_qstate, as the locking for these flags is substantially different for the locking on the remainder of the flags in so_state.	2004-06-01 02:42:56 +00:00
bmilekic	f7574a2276	Bring in mbuma to replace mballoc. mbuma is an Mbuf & Cluster allocator built on top of a number of extensions to the UMA framework, all included herein. Extensions to UMA worth noting: - Better layering between slab <-> zone caches; introduce Keg structure which splits off slab cache away from the zone structure and allows multiple zones to be stacked on top of a single Keg (single type of slab cache); perhaps we should look into defining a subset API on top of the Keg for special use by malloc(9), for example. - UMA_ZONE_REFCNT zones can now be added, and reference counters automagically allocated for them within the end of the associated slab structures. uma_find_refcnt() does a kextract to fetch the slab struct reference from the underlying page, and lookup the corresponding refcnt. mbuma things worth noting: - integrates mbuf & cluster allocations with extended UMA and provides caches for commonly-allocated items; defines several zones (two primary, one secondary) and two kegs. - change up certain code paths that always used to do: m_get() + m_clget() to instead just use m_getcl() and try to take advantage of the newly defined secondary Packet zone. - netstat(1) and systat(1) quickly hacked up to do basic stat reporting but additional stats work needs to be done once some other details within UMA have been taken care of and it becomes clearer to how stats will work within the modified framework. From the user perspective, one implication is that the NMBCLUSTERS compile-time option is no longer used. The maximum number of clusters is still capped off according to maxusers, but it can be made unlimited by setting the kern.ipc.nmbclusters boot-time tunable to zero. Work should be done to write an appropriate sysctl handler allowing dynamic tuning of kern.ipc.nmbclusters at runtime. Additional things worth noting/known issues (READ): - One report of 'ips' (ServeRAID) driver acting really slow in conjunction with mbuma. Need more data. Latest report is that ips is equally sucking with and without mbuma. - Giant leak in NFS code sometimes occurs, can't reproduce but currently analyzing; brueffer is able to reproduce but THIS IS NOT an mbuma-specific problem and currently occurs even WITHOUT mbuma. - Issues in network locking: there is at least one code path in the rip code where one or more locks are acquired and we end up in m_prepend() with M_WAITOK, which causes WITNESS to whine from within UMA. Current temporary solution: force all UMA allocations to be M_NOWAIT from within UMA for now to avoid deadlocks unless WITNESS is defined and we can determine with certainty that we're not holding any locks when we're M_WAITOK. - I've seen at least one weird socketbuffer empty-but- mbuf-still-attached panic. I don't believe this to be related to mbuma but please keep your eyes open, turn on debugging, and capture crash dumps. This change removes more code than it adds. A paper is available detailing the change and considering various performance issues, it was presented at BSDCan2004: http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf Please read the paper for Future Work and implementation details, as well as credits. Testing and Debugging: rwatson, brueffer, Ketrien I. Saihr-Kesenchedra, ... Reviewed by: Lots of people (for different parts)	2004-05-31 21:46:06 +00:00
rwatson	7128a3b5cc	Unconditionally lock Giant in do_sendfile(), rather than locking it conditional on debug.mpsafenet. We can try pushing down Giant here later, but we don't want to enter VFS without holding Giant. Bumped into by: kris	2004-05-08 02:24:21 +00:00
alc	b57e5e03fd	Make vm_page's PG_ZERO flag immutable between the time of the page's allocation and deallocation. This flag's principal use is shortly after allocation. For such cases, clearing the flag is pointless. The only unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed page. Reviewed by: tegge@	2004-05-06 05:03:23 +00:00
silby	d35a5d60b9	Fix a regression in my change which sends headers along with data; a side effect of that change caused headers to not be sent if a 0 byte file was passed to sendfile. This change fixes that behavior, allowing sendfile to send out the headers even with a 0 byte file again. Noticed by: Dirk Engling	2004-04-08 07:14:34 +00:00
imp	74cf37bd00	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core	2004-04-05 21:03:37 +00:00
rwatson	40e717b764	Detatch incorrect spellings of detach.	2004-04-04 19:15:45 +00:00
alc	1ec4d75266	In some cases, sf_buf_alloc() should sleep with pri PCATCH; in others, it should not. Add a new parameter so that the caller can specify which is the case. Reported by: dillon	2004-04-03 09:16:27 +00:00
rwatson	7360350f89	Conditionally acquire Giant when entering the sockets layer via the socket-specific system calls based on debug.mpsafenet, rather than acquiring Giant unconditionally.	2004-03-29 02:21:56 +00:00
rwatson	52c072609b	When validating that the length sum in recvit(), we fail to release Giant on an error. Add a Giant acquisition. Reviewed by: sam, bms	2004-03-29 01:37:06 +00:00
alc	a2e820d27b	Refactor the existing machine-dependent sf_buf_free() into a machine- dependent function by the same name and a machine-independent function, sf_buf_mext(). Aside from the virtue of making more of the code machine- independent, this change also makes the interface more logical. Before, sf_buf_free() did more than simply undo an sf_buf_alloc(); it also unwired and if necessary freed the page. That is now the purpose of sf_buf_mext(). Thus, sf_buf_alloc() and sf_buf_free() can now be used as a general-purpose emphemeral map cache.	2004-03-16 19:04:28 +00:00
rwatson	48d4fe5ea4	Remove unneeded label 'done2' from socket(). We now grab Giant only around socreate(), and don't need it for file descriptor accesses. Submitted by: sam	2004-03-04 01:57:48 +00:00
silby	9428de17de	Add the SF_NODISKIO flag to sendfile. This flag causes sendfile to be mindful of blocking on disk I/O and instead return EBUSY when such blocking would occur. Results from the DeBox project indicate that blocking on disk I/O can slow the performance of a kqueue/poll based webserver. Using a flag such as SF_NODISKIO and throwing connections that would block to helper processes/threads helped increase performance. Currently, only the Flash webserver uses this flag, although it could probably be applied to thttpd with relative ease. Idea by: Yaoping Ruan & Vivek Pai	2004-02-08 07:35:48 +00:00
silby	ca8156c1ae	Rename iov_to_uio to uiofromiov to be more consistent with other uio* functions. Suggested by: bde	2004-02-04 08:43:21 +00:00
silby	a4c32edec5	Rewrite sendfile's header support so that headers are now sent in the first packet along with data, instead of in their own packet. When serving files of size (packetsize - headersize) or smaller, this will result in one less packet crossing the network. Quick testing with thttpd and http_load has shown a noticeable performance improvement in this case (350 vs 330 fetches per second.) Included in this commit are two support routines, iov_to_uio, and m_uiotombuf; these routines are used by sendfile to construct the header mbuf chain that will be linked to the rest of the data in the socket buffer.	2004-02-01 07:56:44 +00:00
kan	a1caf18323	One more instance of magic number used in place of IO_SEQSHIFT. Submitted by: alc	2004-01-19 20:45:43 +00:00
des	f270054cd3	New file descriptor allocation code, derived from similar code introduced in OpenBSD by Niels Provos. The patch introduces a bitmap of allocated file descriptors which is used to locate available descriptors when a new one is needed. It also moves the task of growing the file descriptor table out of fdalloc(), reducing complexity in both fdalloc() and do_dup(). Debts of gratitude are owed to tjr@ (who provided the original patch on which this work is based), grog@ (for the gdb(4) man page) and rwatson@ (for assistance with pxeboot(8)).	2004-01-15 10:15:04 +00:00
des	4efce4e230	Back out 1.166, which was committed by mistake.	2004-01-11 20:07:15 +00:00
des	97917ae33d	Mechanical whitespace cleanup + other minor style nits.	2004-01-11 19:56:42 +00:00
des	8de0243528	Mechanical whitespace cleanup + minor style nits.	2004-01-11 19:43:14 +00:00
des	1d50ccc7f9	More unparenthesized return values.	2004-01-10 17:14:53 +00:00
des	65c30ff68d	Style: parenthesize return values.	2004-01-10 13:03:43 +00:00
truckman	417693c813	Add a somewhat redundant check on the len arguement to getsockaddr() to avoid relying on the minimum memory allocation size to avoid problems. The check is somewhat redundant because the consumers of the returned structure will check that sa_len is a protocol-specific larger size. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: nectar MFC after: 30 days	2004-01-10 08:28:54 +00:00
silby	a7d8091ae5	Track three new sendfile-related statistics: - The number of times sendfile had to do disk I/O - The number of times sfbuf allocation failed - The number of times sfbuf allocation had to wait	2003-12-28 08:57:09 +00:00
dwmalone	92fec55360	In socket(2) we only need Giant around the call to socreate, so just grab it there.	2003-12-25 23:44:38 +00:00
alfred	c4b798a73d	Add restrict qualifiers. PR: 44394 Submitted by: Craig Rodrigues <rodrige@attbi.com>	2003-12-24 18:47:43 +00:00
dg	da88330aaa	Fixed a bug in sendfile(2) where the sent data would be corrupted due to sendfile(2) being erroneously automatically restarted after a signal is delivered. Fixed by converting ERESTART to EINTR prior to exiting. Updated manual page to indicate the potential EINTR error, its cause and consequences. Approved by: re@freebsd.org	2003-12-01 22:12:50 +00:00
alc	74614e7f63	- Modify alpha's sf_buf implementation to use the direct virtual-to- physical mapping. - Move the sf_buf API to its own header file; make struct sf_buf's definition machine dependent. In this commit, we remove an unnecessary field from struct sf_buf on the alpha, amd64, and ia64. Ultimately, we may eliminate struct sf_buf on those architecures except as an opaque pointer that references a vm page.	2003-11-16 06:11:26 +00:00
dwmalone	be405a4cbd	falloc allocates a file structure and adds it to the file descriptor table, acquiring the necessary locks as it works. It usually returns two references to the new descriptor: one in the descriptor table and one via a pointer argument. As falloc releases the FILEDESC lock before returning, there is a potential for a process to close the reference in the file descriptor table before falloc's caller gets to use the file. I don't think this can happen in practice at the moment, because Giant indirectly protects closes. To stop the file being completly closed in this situation, this change makes falloc set the refcount to two when both references are returned. This makes life easier for several of falloc's callers, because the first thing they previously did was grab an extra reference on the file. Reviewed by: iedowse Idea run past: jhb	2003-10-19 20:41:07 +00:00
alc	8b0114def1	Migrate the sf_buf allocator that is used by sendfile(2) and zero-copy sockets into machine-dependent files. The rationale for this migration is illustrated by the modified amd64 allocator. It uses the amd64's direct map to avoid emphemeral mappings in the kernel's address space. On an SMP, the emphemeral mappings result in an IPI for TLB shootdown for each transmitted page. Yuck. Maintainers of other 64-bit platforms with direct maps should be able to use the amd64 allocator as a reference implementation.	2003-08-29 20:04:10 +00:00
kan	91297961f6	Drop Giant in recvit before returning an error to the caller to avoid leaking the Giant on the syscall exit.	2003-08-11 19:37:11 +00:00
yar	65e4901760	If connect(2) has been interrupted by a signal and therefore the connection is to be established asynchronously, behave as in the case of non-blocking mode: - keep the SS_ISCONNECTING bit set thus indicating that the connection establishment is in progress, which is the case (clearing the bit in this case was just a bug); - return EALREADY, instead of the confusing and unreasonable EADDRINUSE, upon further connect(2) attempts on this socket until the connection is established (this also brings our connect(2) into accord with IEEE Std 1003.1.)	2003-08-06 14:04:47 +00:00
dwmalone	cb188056e6	Do some minor Giant pushdown made possible by copyin, fget, fdrop, malloc and mbuf allocation all not requiring Giant. 1) ostat, fstat and nfstat don't need Giant until they call fo_stat. 2) accept can copyin the address length without grabbing Giant. 3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit. 4) move Giant grabbing from each indivitual recv* syscall to recvit.	2003-08-04 21:28:57 +00:00
alc	4d05c167d2	Use kmem_alloc_nofault() rather than kmem_alloc_pageable() in sf_buf_init(). (See revision 1.140 of kern/sys_pipe.c for a detailed rationale.) Submitted by: tegge	2003-08-02 04:18:56 +00:00
truckman	b30ab68043	VOP_GETVOBJECT() wants to be called with the vnode lock held.	2003-06-19 03:55:01 +00:00
alc	e8221b068f	Finish the vm object locking in sendfile(2). More generally, the vm locking in sendfile(2) is complete.	2003-06-12 05:52:09 +00:00
alc	4451de3f80	Lock the vm object when removing a page.	2003-06-11 21:23:04 +00:00
obrien	3b8fff9e4c	Use __FBSDID().	2003-06-11 00:56:59 +00:00
dwmalone	a02706ac93	Grab giant in sendit rather than kern_sendit because sockargs may allocate mbufs with M_TRYWAIT, which may require Giant. Reviewed by: bmilekic Approved by: re (scottl)	2003-05-29 18:36:26 +00:00
dwmalone	86e87d5e2d	Split sendit into two parts. The first part, still called sendit, that does the copyin stuff and then calls the second part kern_sendit to do the hard work. Don't bother holding Giant during the copyin phase. The intent of this is to allow the Linux emulator to impliment send* syscalls without using the stackgap.	2003-05-05 20:33:38 +00:00
alc	3333da28bf	Recent changes to uipc_cow.c have eliminated the need for some sf_buf- related variables to be global. Make them either local to sf_buf_init() or static.	2003-03-31 06:25:42 +00:00
alc	6f59be774d	Pass the vm_page's address to sf_buf_alloc(); map the vm_page as part of sf_buf_alloc() instead of expecting sf_buf_alloc()'s caller to map it. The ultimate reason for this change is to enable two optimizations: (1) that there never be more than one sf_buf mapping a vm_page at a time and (2) 64-bit architectures can transparently use their 1-1 virtual to physical mapping (e.g., "K0SEG") avoiding the overhead of pmap_qenter() and pmap_qremove().	2003-03-29 06:14:14 +00:00
alc	4641d3d127	Pass the sf buf to MEXTADD() as the optional argument. This permits the simplification of socow_iodone() and sf_buf_free(); they don't have to reverse engineer the sf buf from the data's address.	2003-03-16 07:19:12 +00:00
alc	e5b46f193e	Remove GIANT_REQUIRED from sf_buf_free().	2003-03-06 04:48:19 +00:00
tegge	2072a48a93	Sync new socket nonblocking/async state with file flags in accept(). PR: 1775 Reviewed by: mbr	2003-02-23 23:00:28 +00:00
cognet	8d83e0054a	Remove duplicate includes. Submitted by: Cyril Nguyen-Huu <cyril@ci0.org>	2003-02-20 03:26:11 +00:00
imp	cf874b345d	Back out M_* changes, per decision of the TRB. Approved by: trb	2003-02-19 05:47:46 +00:00
ume	56a5dfaa24	Break out the bind and connect syscalls to intend to make calling these syscalls internally easy. This is preparation for force coming IPv6 support for Linuxlator. Submitted by: dwmalone MFC after: 10 days	2003-02-03 17:36:52 +00:00
alfred	b5c0015ac9	Consolidate MIN/MAX macros into one place (param.h). Submitted by: Hiten Pandya <hiten@unixdaemons.com>	2003-02-02 13:17:30 +00:00
alfred	bf8e8a6e8f	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.	2003-01-21 08:56:16 +00:00
dillon	ccd5574cc6	Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.	2003-01-13 00:33:17 +00:00
dillon	ddf9ef103e	Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it. Change struct xfile xf_data to xun_data (ABI is still compatible). If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.	2003-01-12 01:37:13 +00:00
phk	26984a4003	Move the declaration of the socket fileops from socketvar.h to file.h. This allows us to use the new typedefs and removes the needs for a number of forward struct declarations in socketvar.h	2002-12-23 22:46:47 +00:00
rwatson	1f2df65750	Integrate mac_check_socket_send() and mac_check_socket_receive() checks from the MAC tree: allow policies to perform access control for the ability of a process to send and receive data via a socket. At some point, we might also pass in additional address information if an explicit address is requested on send. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2002-10-06 14:39:15 +00:00
truckman	da2757cbc5	In an SMP environment post-Giant it is no longer safe to blindly dereference the struct sigio pointer without any locking. Change fgetown() to take a reference to the pointer instead of a copy of the pointer and call SIGIO_LOCK() before copying the pointer and dereferencing it. Reviewed by: rwatson	2002-10-03 02:13:00 +00:00
archie	bf4ebb4609	accept(2) on a socket that has been shutdown(2) normally returns ECONNABORTED. Make this happen in the non-blocking case as well. The previous behavior was to return EAGAIN, which (a) is not consistent with the blocking case and (b) causes the application to think the socket is still valid. PR: bin/42100 Reviewed by: freebsd-net MFC after: 3 days	2002-08-28 20:56:01 +00:00
rwatson	44404e4547	In order to better support flexible and extensible access control, make a series of modifications to the credential arguments relating to file read and write operations to cliarfy which credential is used for what: - Change fo_read() and fo_write() to accept "active_cred" instead of "cred", and change the semantics of consumers of fo_read() and fo_write() to pass the active credential of the thread requesting an operation rather than the cached file cred. The cached file cred is still available in fo_read() and fo_write() consumers via fp->f_cred. These changes largely in sys_generic.c. For each implementation of fo_read() and fo_write(), update cred usage to reflect this change and maintain current semantics: - badfo_readwrite() unchanged - kqueue_read/write() unchanged pipe_read/write() now authorize MAC using active_cred rather than td->td_ucred - soo_read/write() unchanged - vn_read/write() now authorize MAC using active_cred but VOP_READ/WRITE() with fp->f_cred Modify vn_rdwr() to accept two credential arguments instead of a single credential: active_cred and file_cred. Use active_cred for MAC authorization, and select a credential for use in VOP_READ/WRITE() based on whether file_cred is NULL or not. If file_cred is provided, authorize the VOP using that cred, otherwise the active credential, matching current semantics. Modify current vn_rdwr() consumers to pass a file_cred if used in the context of a struct file, and to always pass active_cred. When vn_rdwr() is used without a file_cred, pass NOCRED. These changes should maintain current semantics for read/write, but avoid a redundant passing of fp->f_cred, as well as making it more clear what the origin of each credential is in file descriptor read/write operations. Follow-up commits will make similar changes to other file descriptor operations, and modify the MAC framework to pass both credentials to MAC policy modules so they can implement either semantic for revocation. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-08-15 20:55:08 +00:00
rwatson	10845c31e4	Fix return case for negative namelen by jumping to normal exit processing rather than immediately returning, or we may not unlock necessary locks. Noticed by: Mike Heffner <mheffner@acm.vt.edu>	2002-08-15 17:34:03 +00:00
dg	6dce2e7eff	Moved sf_buf_alloc and sf_buf_free function declarations to sys/socketvar.h so that they can be seen by external callers.	2002-08-13 19:03:19 +00:00
dg	7a86c9d738	Remove obsolete comment about sf_buf_* functions being static. They were made un-static in rev 1.114.	2002-08-13 18:20:08 +00:00
semenu	68c52ce00e	Fix sendfile(), who was calling vn_rdwr() without aresid parameter and thus hiting EIO at the end of file. This is believed to be a feature (not a bug) of vn_rdwr(), so we turn it off by supplying aresid param. Reviewed by: rwatson, dg	2002-08-11 20:33:11 +00:00
nectar	6cc73b8947	While we're at it, add range checks similar to those in previous commit to getsockname() and getpeername(), too.	2002-08-09 12:58:11 +00:00
rwatson	1bb6530f3e	Add additional range checks for copyout targets. Submitted by: Silvio Cesare <silvio@qualys.com>	2002-08-09 05:50:32 +00:00
rwatson	a5dcc1fd3d	Include file cleanup; mac.h and malloc.h at one point had ordering relationship requirements, and no longer do. Reminded by: bde	2002-08-01 17:47:56 +00:00
rwatson	7f656e6806	Introduce support for Mandatory Access Control and extensible kernel access control. Instrument connect(), listen(), and bind() system calls to invoke MAC framework entry points to permit policies to authorize these requests. This can be useful for policies that want to limit the activity of processes involving particular types of IPC and network activity. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-07-31 16:39:49 +00:00
alc	34a3674724	o In do_sendfile(), replace vm_page_sleep_busy() by vm_page_sleep_if_busy() and extend the scope of the page queues lock to cover all accesses to the page's flags and busy fields.	2002-07-30 18:51:07 +00:00
arr	249ad7ccdb	- Make use of the VM_ALLOC_WIRED flag in the call to vm_page_alloc() in do_sendfile(). This allows us to rearrange an if statement in order to avoid doing an unnecesary call to vm_page_lock_queues(), and an attempt at re-wiring the pages (which were wired in the vm_page_alloc() call). Reviewed by: alc, jhb	2002-07-23 01:09:34 +00:00
alc	5bc529bb34	Lock accesses to the page queues by sendfile() and friends.	2002-07-13 03:10:55 +00:00
alfred	598d9de715	Create a bug-for-bug FreeBSD4 compatible version of sendfile and move the fixed sendfile over. This is needed to preserve binary compatibility from 4.x to 5.x.	2002-07-12 06:51:57 +00:00
alfred	d47feb4376	nuke more instances of caddr_t	2002-06-29 00:02:01 +00:00
alfred	b0475cc9d5	remove or replace caddr_t with void. make the mbuf external free function take a void * rather than caddr_t.	2002-06-28 23:48:23 +00:00
ken	0d3a835f3f	At long last, commit the zero copy sockets code. MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes. ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls. man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links. jumbo.9: New man page describing the jumbo buffer allocator interface and operation. zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality. NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT. conf/files: Add uipc_jumbo.c and uipc_cow.c. conf/options: Add the 5 options mentioned above. kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1. uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack. uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive. uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on. Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions. uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. (uipc_cow.c) if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails. The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)). ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives. if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers. Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface. Add header splitting support to the ti(4) driver. Tweak some of the default interrupt coalescing parameters to more useful defaults. Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off. if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13. Add defines needed for debugging. Remove the ti_stats structure, it is now defined in sys/tiio.h. ti_fw.h: 12.4.11 firmware. ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.) sys/jumbo.h: Jumbo buffer allocator interface. sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process. socketvar.h: Add prototype for socow_setup. tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions. uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable. ufs_readwrite.c:Update for new prototype of uiomoveco(). vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault. vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object allocate does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structre. This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.) vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK. vm_object.h: Add prototype for vm_object_allocate_wait(). vm_page.c: Add page-based copy on write setup, clear and fault routines. vm_page.h: Add page based COW function prototypes and variable in the vm_page structure. Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.	2002-06-26 03:37:47 +00:00
alfred	619c88aeeb	Implement SO_NOSIGPIPE option for sockets. This allows one to request that an EPIPE error return not generate SIGPIPE on sockets. Submitted by: lioux Inspired by: Darwin	2002-06-20 18:52:54 +00:00
jhb	fbebc83b5b	Catch up to changes in ktrace API.	2002-06-07 05:37:18 +00:00
tanimura	e6fa9b9e92	Back out my lats commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu	2002-05-31 11:52:35 +00:00
tanimura	92d8381dd5	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred	2002-05-20 05:41:09 +00:00
rwatson	4d39491e7e	In sendfile(), use the vn_rdwr() helper function, rather than manually constructing a struct aio and invoking VOP_READ() directly. This cleans up the code a little, but also has the advantage of making sure almost all vnode read/write access in the kernel goes through the helper function, meaning that instrumentation of that helper function can impact almost all relevant read/write operations. In this case, it permits us to put MAC hooks into vn_rdwr() and not modify uipc_syscalls.c (yet). In general, if helper vn_*() functions exist, they should be used in preference to direct VOP's in system call service code. Submitted by: green Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-04-19 13:46:24 +00:00
jhb	db9aa81e23	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64	2002-04-04 21:03:38 +00:00
bde	90f30ee936	Fixed some style bugs in the removal of __P(()). The main ones were not removing tabs before "__P((", and not outdenting continuation lines to preserve non-KNF lining up of code with parentheses. Switch to KNF formatting and/or rewrap the whole prototype in some cases.	2002-03-24 05:09:11 +00:00

1 2 3 4 5 ...

255 Commits