Commit Graph

204 Commits

Author SHA1 Message Date
Robert Watson
ac45e92ff2 Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful.  Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they are not fully updated by
this commit.  This will be corrected shortly in follow-up commits to
these components.

MFC after:      3 months
2006-04-01 15:15:05 +00:00
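
A minimal user-space sketch of the convention described above, using simplified
stand-in types rather than the real struct socket and struct pr_usrreqs:
pru_abort() returns void, and the caller frees the socket unconditionally once
it returns, so protocols must no longer free it themselves.

    #include <stdio.h>
    #include <stdlib.h>

    struct socket {
        void (*pru_abort)(struct socket *so);   /* was: int (*pru_abort)() */
    };

    /* The protocol tears down its own state only; it no longer frees 'so'. */
    static void
    proto_abort(struct socket *so)
    {
        (void)so;
    }

    /* soabort() model: free the socket unconditionally after pru_abort(). */
    static void
    soabort_sketch(struct socket *so)
    {
        so->pru_abort(so);
        free(so);
    }

    int
    main(void)
    {
        struct socket *so = calloc(1, sizeof(*so));

        if (so == NULL)
            return (1);
        so->pru_abort = proto_abort;
        soabort_sketch(so);
        printf("socket aborted and freed by the caller\n");
        return (0);
    }
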
Robert Watson
63b01ffd34 Add a sysctl, regression.sonewconn_earlytest, which, when options
REGRESSION is enabled, allows user space to dictate that sonewconn()
should skip its "skip the hard work" check to see if the listen
queue is full, and instead proceed with allocation of a socket and
trimming of the overflowed queue.  This makes it easier to test the
queue overflow logic.

MFC after:	1 month
2006-03-26 22:44:37 +00:00
Robert Watson
92c07a345e Change soabort() from returning int to returning void, since all
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.
2006-03-16 07:03:14 +00:00
John Polstra
ba3612cd5c Fix a bug in the loop in sonewconn that makes room on the incomplete
connection queue for a new connection.  It was removing connections
from the wrong list.

Submitted by:	Paul Mikesell
Sponsored by:	Isilon Systems
MFC after:	1 week
2005-11-22 01:55:29 +00:00
Andre Oppermann
34333b16cd Retire MT_HEADER mbuf type and change its users to use MT_DATA.
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it.  It only adds a layer of confusion.  The
distinction between header mbufs and data mbufs is made solely
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit.  For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-02 13:46:32 +00:00
Robert Watson
d374e81efd Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN.  This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in the queue limit should it wish to act on them.  This
continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface.  This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
2005-10-30 19:44:40 +00:00
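
A user-space sketch of the arrangement described above, with a pthread mutex
standing in for the socket lock and simplified, illustrative fields and flag
values: the protocol's pru_listen() implementation calls back into the socket
layer, which sets the accepting flag and the queue limit under a single lock
acquisition.

    #include <pthread.h>
    #include <stdio.h>

    #define SO_ACCEPTCONN_MODEL 0x0002      /* illustrative flag value */

    struct socket_model {
        pthread_mutex_t so_lock;
        int             so_options;
        int             so_qlimit;
    };

    /* Model of solisten_proto(): flag and queue limit are set under one
     * acquisition of the socket lock, with the backlog passed in by the
     * protocol. */
    static void
    solisten_proto_model(struct socket_model *so, int backlog)
    {
        pthread_mutex_lock(&so->so_lock);
        so->so_options |= SO_ACCEPTCONN_MODEL;
        so->so_qlimit = backlog;
        pthread_mutex_unlock(&so->so_lock);
    }

    int
    main(void)
    {
        struct socket_model so = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };

        solisten_proto_model(&so, 128);
        printf("listening, backlog %d\n", so.so_qlimit);
        return (0);
    }
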
Robert Watson
7da7362b95 Re-comment sbcompress() to explain what it is that it does; it took me
quite a bit of reading to figure it out, and I want to avoid figuring
it out again.

Convert an if (foo) else printf("this is almost a panic") into a
KASSERT.

MFC after:	3 days
2005-09-18 10:30:10 +00:00
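
For illustration, a small user-space model of the kernel's KASSERT(exp,
(fmt, ...)) idiom referred to above; the condition and message here are made
up, and the real macro only takes effect in kernels built with INVARIANTS.

    #include <stdio.h>
    #include <stdlib.h>

    /* User-space stand-in for KASSERT(exp, (fmt, ...)): panic (here, abort)
     * with a formatted message when the invariant does not hold. */
    #define KASSERT(exp, msg) do {                                      \
        if (!(exp)) {                                                   \
            printf msg;                                                 \
            printf("\n");                                               \
            abort();                                                    \
        }                                                               \
    } while (0)

    int
    main(void)
    {
        int space = 100, len = 42;

        /* Before: if (len <= space) ... else printf("almost a panic");
         * after: state the invariant outright and stop if it is violated. */
        KASSERT(len <= space,
            ("sbcompress sketch: %d bytes do not fit in %d", len, space));
        printf("invariant held\n");
        return (0);
    }
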
Suleiman Souhlal
571dcd15e2 Fix the recent panics/LORs/hangs created by my kqueue commit by:
- Introducing the possibility of using locks other than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by:	jmg
Approved by:	re (scottl), scottl (mentor override)
Pointyhat to:	ssouhlal
Will be happy:	everyone
2005-07-01 16:28:32 +00:00
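
A user-space model of the defaulting behaviour described above; the names and
the knlist_init() signature are simplified (only the lock and unlock callbacks
of the three are modelled), the point being the NULL-means-use-mutex-operations
convention.

    #include <pthread.h>
    #include <stdio.h>

    struct knlist_model {
        void  *kl_lockarg;
        void  (*kl_lock)(void *);
        void  (*kl_unlock)(void *);
    };

    /* Default callbacks: plain mutex lock/unlock. */
    static void
    mtx_lock_model(void *arg)
    {
        pthread_mutex_lock(arg);
    }

    static void
    mtx_unlock_model(void *arg)
    {
        pthread_mutex_unlock(arg);
    }

    /* If the caller passes NULL callbacks, assume mutex operations. */
    static void
    knlist_init_model(struct knlist_model *knl, void *lockarg,
        void (*lockf)(void *), void (*unlockf)(void *))
    {
        knl->kl_lockarg = lockarg;
        knl->kl_lock = (lockf != NULL) ? lockf : mtx_lock_model;
        knl->kl_unlock = (unlockf != NULL) ? unlockf : mtx_unlock_model;
    }

    int
    main(void)
    {
        pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
        struct knlist_model knl;

        knlist_init_model(&knl, &mtx, NULL, NULL);  /* defaults apply */
        knl.kl_lock(knl.kl_lockarg);
        printf("knlist locked via the default mutex callbacks\n");
        knl.kl_unlock(knl.kl_lockarg);
        return (0);
    }
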
Robert Watson
83b3d58d05 In the current world order, each socket has two mutexes: a mutex
that protects socket and receive socket buffer state, and a second
mutex to protect send socket buffer state.  In some places, the
mutex shared between the socket and receive socket buffer will be
acquired twice, once by each layer, resulting in some
inconsistency, but providing the abstraction benefit of being able
to more easily separate the two mutexes in the future if desired.

When transitioning a socket to the SS_ISDISCONNECTING or
SS_ISDISCONNECTED states, grab the socket/receive socket buffer lock
once rather than grabbing it as the socket lock, modifying socket
state, then grabbing a second time as the receive lock in order to
modify the socket buffer state to indicate no further data can be
read.  This change is believed to close a race between the change in
socket state and the change in socket buffer state, which for a
remotely initiated close on a UNIX domain socket, resulted in
soreceive() returning ENOTCONN rather than an EOF condition.

A similar race still exists in the case of send, however, and is
harder to fix as the socket and send socket buffer mutexes are not
the same, and we would like to avoid holding combinations of socket
mutexes over sb_upcall until we've finished clarifying the locking
protocol for upcalls.

This change has the side effect of reducing the number of mutex
operations to initiate disconnect or perform disconnect on a
socket by two.

PR:		78824
Reported by:	Marc Olzheim <marcolz@stack.nl>
MFC after:	2 weeks
2005-05-27 17:16:43 +00:00
Robert Watson
59f21d5ab1 Extend the coverage of the accept and socket mutexes in soisconnected()
so that the socket lock is held over the test-and-set removal of the
accept filter option during connect, and the two socket mutex regions
(transition to connected, perform accept filter) are combined.
2005-03-12 13:39:39 +00:00
Robert Watson
9bfb7389bc When upcalling from a socket in soisconnected() for an accept filter,
call with flag M_DONTWAIT rather than M_TRYWAIT, as we don't want to
do blocking memory allocation (etc) in the netisr.

MFC after:	3 days
2005-03-07 13:50:16 +00:00
Robert Watson
2b85a170d1 Prefer NULL to returning 0 cast to a pointer type.
MFC after:	3 days
2005-02-20 15:56:13 +00:00
Robert Watson
280249a66a In sonewconn(), set the new socket's state to show the protocol-provided
connection status before inserting the new socket into the listen
socket's accept queue, or there might be a race in which another thread
wakes up when the accept lock is released, and sees the socket before its
state is set correctly.  The wakeup still occurs after the accept lock is
released.  There have been no diagnoses of this bug in real-world systems
(as yet).

MFC after:	3 days
2005-02-17 12:53:45 +00:00
Warner Losh
9454b2d864 /* -> /*- for copyright notices, minor format tweaks as necessary 2005-01-06 23:35:40 +00:00
Robert Watson
1ef121cf6b In sonewconn(), the s/if/while/ change to wait for room at the tail of
the accept queue is a feature, not a bug/issue, so remove the XXXRW
from the comment.
2004-12-23 01:16:21 +00:00
Maxim Konovalov
2899faf96b Fix a typo in a comparison that appeared in rev. 1.125.
Submitted by:	JINMEI Tatuya
2004-10-27 05:37:58 +00:00
Andre Oppermann
312c75c362 Support for dynamically loadable and unloadable protocols within existing protocol
families.

The protosw[] array of any particular protocol family ("domain") is of fixed size
defined at compile time.  This made it impossible to dynamically add or remove any
protocols to or from it.  We work around this by introducing so-called SPACERs
which are embedded into the protosw[] array at compile time.  The SPACERs have
a special protocol number (32767) to indicate the fact that they are SPACERs but
are otherwise NULL.  Only as many protocols can be dynamically loaded as SPACERs
are provided in the protosw[] array.

The pr_usrreqs structure is treated specially and contains pointers to dummy
functions that only return EOPNOTSUPP.  This is needed because the use of those
function pointers is usually not checked within the kernel, as until now they
were assumed to be valid function pointers.  Instead of fixing all potential
callers we just return a proper error code.

Two new functions provide a clean API to register and unregister a protocol.  The
register function expects a pointer to a valid and complete struct protosw including
a pointer to struct pr_usrreqs provided by the caller.  Upon successful registration
the pr_init() function will be called to finish initialization of the protocol.  The
unregister function restores the SPACER in place of the protocol again.  It is the
responsibility of the caller to ensure proper closing of all sockets and freeing
of memory allocated by the unloading protocol.

 sys/protosw.h

  o Define generic PROTO_SPACER to be 32767
  o Prototypes for all pru_*_notsupp() functions
  o Prototypes for pf_proto_[un]register() functions

 kern/uipc_domain.c

  o Global struct pr_usrreqs nousrreqs containing valid pointers to the
    pru_*_notsupp() functions
  o New functions pf_proto_[un]register()

 kern/uipc_socket2.c

  o New function bodies for all pru_*_notsupp() functions
2004-10-19 15:13:30 +00:00
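
A compact user-space model of the SPACER scheme described above (field and
function names are simplified, not the kernel's): registration claims the
first free spacer slot, and unregistration puts the spacer back.

    #include <stdio.h>

    #define PROTO_SPACER 32767
    #define NPROTOS      4

    struct proto_model {
        int         pr_protocol;
        const char *pr_name;
    };

    /* Compile-time table: two native protocols plus two spacer slots. */
    static struct proto_model protosw_model[NPROTOS] = {
        { 6,  "tcp" },
        { 17, "udp" },
        { PROTO_SPACER, NULL },
        { PROTO_SPACER, NULL },
    };

    /* Claim the first free spacer; 0 on success, -1 if the table is full. */
    static int
    pf_proto_register_model(const struct proto_model *npr)
    {
        for (int i = 0; i < NPROTOS; i++) {
            if (protosw_model[i].pr_protocol == PROTO_SPACER) {
                protosw_model[i] = *npr;
                return (0);
            }
        }
        return (-1);
    }

    /* Restore the spacer in place of a previously registered protocol. */
    static int
    pf_proto_unregister_model(int protocol)
    {
        for (int i = 0; i < NPROTOS; i++) {
            if (protosw_model[i].pr_protocol == protocol) {
                protosw_model[i].pr_protocol = PROTO_SPACER;
                protosw_model[i].pr_name = NULL;
                return (0);
            }
        }
        return (-1);
    }

    int
    main(void)
    {
        struct proto_model sctp = { 132, "sctp" };

        printf("register: %d\n", pf_proto_register_model(&sctp));
        printf("unregister: %d\n", pf_proto_unregister_model(132));
        return (0);
    }
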
John-Mark Gurney
ad3b9257c2 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowledge of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using its filter ops.

Currently, it uses MTX_DUPOK even though it is not always safe to
acquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
Robert Watson
1e4d7da707 Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
  acquire the socket buffer lock and use the _locked() variants of both
  calls.  Note that the _locked() sowakeup() versions unlock the mutex on
  return.  This is done in uipc_send(), divert_packet(), mroute
  socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
  explicit unlock and use the _locked() sowakeup() variant.  This is done
  in soisdisconnecting(), soisdisconnected() when setting the can't send/
  receive flags and dropping data, and in uipc_rcvd() when adjusting
  back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduced mutex costs.
2004-06-26 19:10:39 +00:00
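
A user-space sketch of the pattern this commit applies, with a pthread mutex
standing in for the socket buffer mutex: take the lock once, append with the
_locked() variant, and let the _locked() wakeup release the lock on return,
instead of unlocking and relocking between the two calls.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t sb_mtx = PTHREAD_MUTEX_INITIALIZER;
    static int sb_cc;                       /* bytes queued (model) */

    /* _locked() variants assume the caller already holds sb_mtx. */
    static void
    sbappend_locked_model(int len)
    {
        sb_cc += len;
    }

    /* As noted above, the _locked() wakeup releases the mutex on return. */
    static void
    sowwakeup_locked_model(void)
    {
        printf("wakeup: %d bytes queued\n", sb_cc);
        pthread_mutex_unlock(&sb_mtx);
    }

    int
    main(void)
    {
        /* Before: sbappend() locked and unlocked, then sowwakeup() locked
         * and unlocked again; after: one explicit acquisition covers both. */
        pthread_mutex_lock(&sb_mtx);
        sbappend_locked_model(512);
        sowwakeup_locked_model();           /* returns with sb_mtx released */
        return (0);
    }
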
Robert Watson
3f11a2f374 Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted.  sbreserve() now acquires
the lock before calling sbreserve_locked().  In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order.  In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.
2004-06-24 01:37:04 +00:00
Robert Watson
a34b704666 Merge next step in socket buffer locking:
- sowakeup() now asserts the socket buffer lock on entry.  Move
  the call to KNOTE higher in sowakeup() so that it is made with
  the socket buffer lock held for consistency with other calls.
  Release the socket buffer lock prior to calling into pgsigio(),
  so_upcall(), or aio_swake().  Locking for this event management
  will need revisiting in the future, but this model avoids lock
  order reversals when upcalls into other subsystems result in
  socket/socket buffer operations.  Assert that the socket buffer
  lock is not held at the end of the function.

- Wrapper macros for sowakeup(), sorwakeup(), and sowwakeup() now
  have _locked versions which assert the socket buffer lock on
  entry.  If a wakeup is required by sb_notify(), invoke
  sowakeup(); otherwise, unconditionally release the socket buffer
  lock.  This results in the socket buffer lock being released
  whether a wakeup is required or not.

- Break out socantsendmore() into socantsendmore_locked() that
  asserts the socket buffer lock.  socantsendmore()
  unconditionally locks the socket buffer before calling
  socantsendmore_locked().  Note that both functions return with
  the socket buffer unlocked as socantsendmore_locked() calls
  sowwakeup_locked() which has the same properties.  Assert that
  the socket buffer is unlocked on return.

- Break out socantrcvmore() into socantrcvmore_locked() that
  asserts the socket buffer lock.  socantrcvmore() unconditionally
  locks the socket buffer before calling socantrcvmore_locked().
  Note that both functions return with the socket buffer unlocked
  as socantrcvmore_locked() calls sorwakeup_locked() which has
  similar properties.  Assert that the socket buffer is unlocked
  on return.

- Break out sbrelease() into a sbrelease_locked() that asserts the
  socket buffer lock.  sbrelease() unconditionally locks the
  socket buffer before calling sbrelease_locked().
  sbrelease_locked() now invokes sbflush_locked() instead of
  sbflush().

- Assert the socket buffer lock in socket buffer sanity check
  functions sblastrecordchk(), sblastmbufchk().

- Assert the socket buffer lock in SBLINKRECORD().

- Break out various sbappend() functions into sbappend_locked()
  (and variations on that name) that assert the socket buffer
  lock.  The !_locked() variations unconditionally lock the socket
  buffer before calling their _locked counterparts.  Internally,
  make sure to call _locked() support routines, etc, if already
  holding the socket buffer lock.

- Break out sbinsertoob() into sbinsertoob_locked() that asserts
  the socket buffer lock.  sbinsertoob() unconditionally locks the
  socket buffer before calling sbinsertoob_locked().

- Break out sbflush() into sbflush_locked() that asserts the
  socket buffer lock.  sbflush() unconditionally locks the socket
  buffer before calling sbflush_locked().  Update panic strings
  for new function names.

- Break out sbdrop() into sbdrop_locked() that asserts the socket
  buffer lock.  sbdrop() unconditionally locks the socket buffer
  before calling sbdrop_locked().

- Break out sbdroprecord() into sbdroprecord_locked() that asserts
  the socket buffer lock.  sbdroprecord() unconditionally locks
  the socket buffer before calling sbdroprecord_locked().

- sofree() now calls socantsendmore_locked() and re-acquires the
  socket buffer lock on return.  It also now calls
  sbrelease_locked().

- sorflush() now calls socantrcvmore_locked() and re-acquires the
  socket buffer lock on return.  Clean up/mess up other behavior
  in sorflush() relating to the temporary stack copy of the socket
  buffer used with dom_dispose by more properly initializing the
  temporary copy, and selectively bzeroing/copying more carefully
  to prevent WITNESS from getting confused by improperly
  initialized mutexes.  Annotate why that's necessary, or at
  least, needed.

- soisconnected() now calls sbdrop_locked() before unlocking the
  socket buffer to avoid locking overhead.

Some parts of this change were:

Submitted by:	sam
Sponsored by:	FreeBSD Foundation
Obtained from:	BSD/OS
2004-06-21 00:20:43 +00:00
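
The recurring shape of the change above, modelled in user space: the _locked()
body asserts lock ownership and does the work, while the unlocked wrapper
unconditionally acquires the lock and calls it.  The names and the assertion
are illustrative stand-ins for the socket buffer lock macros.

    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t sb_mtx = PTHREAD_MUTEX_INITIALIZER;
    static int sb_locked;       /* model of the ownership the kernel asserts */
    static int sb_cc = 3;

    #define SOCKBUF_LOCK_ASSERT_MODEL() assert(sb_locked == 1)

    static void
    sockbuf_lock_model(void)
    {
        pthread_mutex_lock(&sb_mtx);
        sb_locked = 1;
    }

    static void
    sockbuf_unlock_model(void)
    {
        sb_locked = 0;
        pthread_mutex_unlock(&sb_mtx);
    }

    /* sbflush_locked() model: asserts the lock and does the real work. */
    static void
    sbflush_locked_model(void)
    {
        SOCKBUF_LOCK_ASSERT_MODEL();
        sb_cc = 0;
    }

    /* sbflush() model: unconditionally locks, then calls the _locked body. */
    static void
    sbflush_model(void)
    {
        sockbuf_lock_model();
        sbflush_locked_model();
        sockbuf_unlock_model();
    }

    int
    main(void)
    {
        sbflush_model();
        printf("flushed, sb_cc=%d\n", sb_cc);
        return (0);
    }
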
Robert Watson
31f555a1c5 Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.  Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state.  Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation.  Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy.  Assert the socket buffer mutex
strategically after code sections or at the beginning of loops.  In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive().  This commit only merges basic socket locking in this
code; follow-up commits will close additional races.  As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by:	juli, tjr
2004-06-19 03:23:14 +00:00
Robert Watson
9535efc00d Merge additional socket buffer locking from rwatson_netperf:
- Lock down low hanging fruit use of sb_flags with socket buffer
  lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
  socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
  grab the receive socket buffer lock, which are currently actually
  the same lock.  Depending on how we want to play our cards, we
  may want to coalesce these lock uses to reduce overhead.

- Convert an if()->panic() into a KASSERT relating to so_state in
  soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.
2004-06-17 22:48:11 +00:00
Robert Watson
7721f5d760 Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field.  This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.
2004-06-15 03:51:44 +00:00
Robert Watson
c0b99ffa02 The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality.  This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties.  The
following fields are moved to sb_state:

  SS_CANTRCVMORE            (so_state)
  SS_CANTSENDMORE           (so_state)
  SS_RCVATMARK              (so_state)

Rename respectively to:

  SBS_CANTRCVMORE           (so_rcv.sb_state)
  SBS_CANTSENDMORE          (so_snd.sb_state)
  SBS_RCVATMARK             (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts).  In the future, we may
wish to coalesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
2004-06-14 18:16:22 +00:00
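
An illustrative sketch of the relocation (flag values shown are for
illustration only): the receive-side and send-side bits now live in the
per-buffer sb_state field, so a test that used to read so_state reads
so_rcv.sb_state (or so_snd.sb_state) instead, under that buffer's lock.

    #include <stdio.h>

    #define SBS_CANTSENDMORE 0x0010     /* lives in so_snd.sb_state */
    #define SBS_CANTRCVMORE  0x0020     /* lives in so_rcv.sb_state */
    #define SBS_RCVATMARK    0x0040     /* lives in so_rcv.sb_state */

    struct sockbuf_model { short sb_state; };
    struct socket_model  { short so_state; struct sockbuf_model so_rcv, so_snd; };

    int
    main(void)
    {
        struct socket_model so = { 0, { 0 }, { 0 } };

        /* Before: if (so->so_state & SS_CANTRCVMORE) ...; after, the bit
         * sits with the receive buffer it describes. */
        so.so_rcv.sb_state |= SBS_CANTRCVMORE;
        if (so.so_rcv.sb_state & SBS_CANTRCVMORE)
            printf("EOF: no more data will be received\n");
        return (0);
    }
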
Robert Watson
310e7ceb94 Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
  manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
  copy while holding the socket lock, then release the socket lock
  to externalize to userspace.
2004-06-13 02:50:07 +00:00
Robert Watson
e7dd9a1001 Mark sun_noname as const since it's immutable. Update definitions
of functions that potentially accept &sun_noname (sbappendaddr(),
et al) to accept a const sockaddr pointer.
2004-06-04 04:07:08 +00:00
Robert Watson
2658b3bb8e Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

          so_qlen          so_incqlen         so_qstate
          so_comp          so_incomp          so_list
          so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns.  In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
  always allocate a file descriptor with falloc() so that if we do
  find a socket, we don't have to encounter the "Oh, there wasn't
  a socket" race that can occur if falloc() sleeps in the current
  code, which broke inbound accept() ordering, not to mention
  requiring backing out socket state changes in a way that raced
  with the protocol level.  We may want to add a lockless read of
  the queue state if polling of empty queues proves to be important
  to optimize.

- In accept1(), soref() the socket while holding the accept lock
  so that the socket cannot be free'd in a race with the protocol
  layer.  Likewise in netgraph equivalents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
  insert our new socket once we've committed to inserting it, or
  races can occur that cause the incomplete socket queue to
  overfill.  In the previous implementation, it was sufficient to
  test only once, since calling soabort() didn't release
  synchronization permitting another thread to insert a socket as
  we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
  caller to remove a socket from the incomplete connection queue
  before calling soabort(), which prevents soabort() from having
  to walk into the accept socket to release the socket from its
  queue, and avoids races when releasing the accept mutex to enter
  soabort(), permitting soabort() to avoid lock ordering issues
  with the caller.

- Generally cluster accept queue related operations together
  throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.
2004-06-02 04:15:39 +00:00
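
A minimal user-space model of the coarse-grained scheme described above, with a
single global mutex standing in for accept_mtx and a trimmed-down socket
structure: every read or write of the accept queue fields happens under that
one lock.

    #include <pthread.h>
    #include <stdio.h>

    /* One global lock serializes the accept queue fields of all sockets. */
    static pthread_mutex_t accept_mtx_model = PTHREAD_MUTEX_INITIALIZER;

    struct socket_model {
        int so_qlen;            /* completed connections queued */
        int so_qlimit;          /* backlog limit */
    };

    /* Queue-state reads and writes happen only under the global lock. */
    static int
    queue_connection_model(struct socket_model *head)
    {
        int ok;

        pthread_mutex_lock(&accept_mtx_model);
        ok = (head->so_qlen < head->so_qlimit);
        if (ok)
            head->so_qlen++;
        pthread_mutex_unlock(&accept_mtx_model);
        return (ok);
    }

    int
    main(void)
    {
        struct socket_model listener = { 0, 2 };

        for (int i = 0; i < 3; i++)
            printf("connection %d %s\n", i,
                queue_connection_model(&listener) ? "queued" : "dropped");
        return (0);
    }
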
Robert Watson
36568179e3 The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket.  This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.
2004-06-01 02:42:56 +00:00
Bosko Milekic
099a0e588c Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and looks up the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00
Paul Saab
c2696aaf51 The syncache broke rev 1.23, which was done to fix the "thundering herd"
problem in Apache.  Fix it.

Reviewed by:	peter
2004-05-19 00:22:10 +00:00
Warner Losh
7f8a436ff2 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-05 21:03:37 +00:00
Paul Saab
2eada6bc8e Remove some NetBSD debug code that crept into rev 1.116 2004-03-22 10:17:40 +00:00
Robert Watson
746e5bf09b Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct the mflags passed into mac_init_socket() from the previous
commit to not include M_ZERO.

Submitted by:	sam
2004-03-01 03:14:23 +00:00
Robert Watson
2bc87dcfbe Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument.  Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1.  This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by:	sam
2004-02-29 17:54:05 +00:00
John Baldwin
91d5354a2c Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count.  The plimit
  structure is treated similarly to struct ucred in that it is always copy
  on write, so having a reference to a structure is sufficient to read from
  it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
  limits from a process to keep the limit structure from changing out from
  under you while reading from it.
- Various global limits that are ints are not protected by a lock since
  int writes are atomic on all the archs we support and thus a lock
  wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
  behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
  either an rlimit, or the current or max individual limit of the specified
  resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
  other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
  (it didn't use the stackgap when it should have) but uses lim_rlimit()
  and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
  but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits.  It
  also no longer uses the stackgap for accessing sysctl's for the
  ibcs2_sysconf() syscall but uses kernel_sysctl() instead.  As a result,
  ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by:	mtm (mostly, I only did a few cleanups and catchups)
Tested on:	i386
Compiled on:	alpha, amd64
2004-02-04 21:52:57 +00:00
Robert Watson
a557af222b Introduce a MAC label reference in 'struct inpcb', which caches
the   MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols.  This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks.  Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
and test policies.

Reviewed by:	sam, bms
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
Seigo Tanimura
512824f8f7 - Implement selwakeuppri() which allows raising the priority of a
thread being woken up.  The thread woken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if a collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
Sam Leffler
395bb18680 Speed up stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by:	ps/jayanth
Obtained from:	netbsd (jason thorpe)
2003-10-28 05:47:40 +00:00
Mike Silbersack
184dcdc7c8 Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.
2003-10-21 18:28:36 +00:00
Scott Long
c43cad1ac1 Guard against MLEN growing larger than a uint8_t due to MSIZE growing to a
value of 512 in LINT.  This keeps gcc from complaining.
2003-07-26 07:23:24 +00:00
David E. O'Brien
677b542ea2 Use __FBSDID(). 2003-06-11 00:56:59 +00:00
Mark Murray
51da11a27a Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.
2003-04-30 12:57:40 +00:00
Peter Wemm
86bb731626 Missing M_TRYWAIT from so_upcall third argument. 2003-02-21 22:23:40 +00:00
Warner Losh
a163d034fa Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
Hartmut Brandt
1b978d453b Make the variable types, the sysctl macros and the sysctl handler for
kern.ipc.{maxsockbuf,sockbuf_waste_factor} agree that those variables
are of type unsigned long.

PR:		sparc64/47389
Approved by:	jake (mentor)
2003-02-03 06:50:59 +00:00
Alfred Perlstein
44956c9863 Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
Tim J. Robbins
b3f1af6b8e Don't count mbufs with m_type == MT_HEADER or MT_OOBDATA as control data
in sballoc(), sbcompress(), sbdrop() and sbfree(). Fixes fstat() st_size
reporting and kevent() EVFILT_READ on TCP sockets.
2003-01-11 07:51:52 +00:00
Kelly Yancey
04ac9b97b5 Spotted a couple of places where the socket buffer's counters were being
manipulated directly (rather than using sballoc()/sbfree()); update them
to tweak the new sb_ctl field too.

Sponsored by:	NTT Multimedia Communications Labs
2002-11-05 18:52:25 +00:00
Alan Cox
5ee0a409fc Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing
a panic because the socket's state isn't as expected by sofree().

Discussed with: dillon, fenner
2002-11-02 05:14:31 +00:00
Poul-Henning Kamp
7ed60de837 Use m_length() instead of home-rolled versions. 2002-09-18 19:44:14 +00:00
David Greenman
79cb7eb41c Further improved the performance of sbreserve() by moving the calculation
of the adjusted sb_max into a sysctl handler for sb_max and assigning it to
a variable that is used instead.  This eliminates the 32bit multiply and
divide that were previously done in the fast path.
2002-08-16 18:41:48 +00:00
David Greenman
8c71ce8a4e Rewrote the space check algorithm in sbreserve() so that the extremely
expensive (!) 64bit multiply, divide, and comparison aren't necessary
(this came in originally from rev 1.19 to fix an overflow with large
sb_max or MCLBYTES).
The 64bit math in this function was measured in some kernel profiles as
being as much as 5-8% of the total overhead of the TCP/IP stack and
is eliminated with this commit. There is a harmless rounding error (of
about .4% with the standard values) introduced with this change;
however, it is in the conservative direction (downward toward a
slightly smaller maximum socket buffer size).

MFC after:	3 days
2002-08-16 05:08:46 +00:00
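
A user-space sketch combining the two sbreserve() changes above, assuming the
historical adjustment sb_max * MCLBYTES / (MSIZE + MCLBYTES): the multiply and
divide happen once, when the limit is set (in the kernel, in the sysctl
handler for sb_max), and the per-call check becomes a single comparison
against the precomputed value.

    #include <stdio.h>

    #define MSIZE    256UL          /* illustrative mbuf size */
    #define MCLBYTES 2048UL         /* illustrative cluster size */

    static unsigned long sb_max = 256UL * 1024UL;   /* kern.ipc.maxsockbuf */
    static unsigned long sb_max_adj;                /* precomputed limit */

    /* Done once when sb_max changes, so the expensive math leaves the
     * per-reserve fast path. */
    static void
    set_sb_max_model(unsigned long new_max)
    {
        sb_max = new_max;
        sb_max_adj = (unsigned long)((unsigned long long)sb_max * MCLBYTES /
            (MSIZE + MCLBYTES));
    }

    /* Fast path: a single comparison against the precomputed value. */
    static int
    sbreserve_model(unsigned long cc)
    {
        return (cc <= sb_max_adj);
    }

    int
    main(void)
    {
        set_sb_max_model(sb_max);
        printf("sb_max_adj=%lu, reserve 128k: %s\n", sb_max_adj,
            sbreserve_model(128UL * 1024UL) ? "ok" : "too big");
        return (0);
    }
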
Robert Watson
f9d0d52459 Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by:	bde
2002-08-01 17:47:56 +00:00
Robert Watson
335654d73e Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on sockets.
In particular, invoke entry points during socket allocation and
destruction, as well as creation by a process or during an
accept-scenario (sonewconn).  For UNIX domain sockets, also assign
a peer label.  As the socket code isn't locked down yet, locking
interactions are not yet clear.  Various protocol stack socket
operations (such as peer label assignment for IPv4) will follow.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 03:03:22 +00:00
David Malone
25dec7474c If a socket is disconnected for some reason (like a TCP connection
not responding) then drop any data on the outgoing queue in
soisdisconnected because there is no way to get it to its destination
any longer.

The only objection to this patch I got on -net was from Terry, who
wasn't sure that the condition in question could arise, so I provided
some example code.
2002-07-27 23:06:52 +00:00
Robert Drehmel
e25dadb05d Fix -Werror build for sparc64: Use the appropriate conversion
specifier for an 'unsigned int' argument.
2002-07-26 12:57:57 +00:00
Alfred Perlstein
802082390b More caddr_t removal.
Change struct knote's kn_hook from caddr_t to void *.
2002-06-29 00:29:12 +00:00
Seigo Tanimura
03e4918190 Remove so*_locked(), which were backed out by mistake. 2002-06-18 07:42:02 +00:00
Seigo Tanimura
4cc20ab1f0 Back out my last commit of locking down a socket; it conflicts with hsu's work.
Requested by:	hsu
2002-05-31 11:52:35 +00:00
Mike Silbersack
184fec1a09 Subtle fix to the accept filter LRU code. In some cases, a newly
initialized socket with no qlimit was being passed in.  In order
to handle this case properly, we must not use >= when comparing
queue sizes to qlimit.  As a result of this improper handling,
a panic could result in certain cases.

PR:		38325
MFC after:	3 days
2002-05-20 17:34:31 +00:00
Seigo Tanimura
243917fe3b Lock down a socket, milestone 1.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
  socket buffer. The mutex in the receive buffer also protects the data
  in struct socket.

o Determine the lock strategy for each member in struct socket.

o Lock down the following members:

  - so_count
  - so_options
  - so_linger
  - so_state

o Remove *_locked() socket APIs.  The following socket APIs, which
  touch the members above, now require a locked socket:

 - sodisconnect()
 - soisconnected()
 - soisconnecting()
 - soisdisconnected()
 - soisdisconnecting()
 - sofree()
 - soref()
 - sorele()
 - sorwakeup()
 - sotryfree()
 - sowakeup()
 - sowwakeup()

Reviewed by:	alfred
2002-05-20 05:41:09 +00:00
Seigo Tanimura
9d0fc9636e Do not forget to increase the number of completely connected sockets in
soisconnected_locked().

Forgotten by:	tanimura
2002-05-07 16:17:44 +00:00
Alfred Perlstein
f132072368 Redo the sigio locking.
Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner, change pgsigio's
sigio * argument into a **; that way we can lock internally to the
function.
2002-05-01 20:44:46 +00:00
Seigo Tanimura
960ed29c4b Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.
Requested by:	bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.

While I am here, sort include files alphabetically, where possible.
2002-04-30 01:54:54 +00:00
Seigo Tanimura
acbbcc5f1d Fix the code fragment clobbered in my last commit. 2002-04-27 09:33:49 +00:00
Seigo Tanimura
d48d4b2501 Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket.  This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked.  Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.
2002-04-27 08:24:29 +00:00
Mike Silbersack
e1f1827f98 Make sure that sockets undergoing accept filtering are aborted in an
LRU fashion when the listen queue fills up.  Previously, there was
no mechanism to kick out old sockets, leading to an easy DoS of
daemons using accept filtering.

Reviewed by:	alfred
MFC after:	3 days
2002-04-26 02:07:46 +00:00
Mike Silbersack
c473d3e406 Remove sodropablereq - this function hasn't been used since the
syncache went in.

MFC after:	3 days
2002-04-24 04:11:08 +00:00
Jeff Roberson
54d77689ed Back out part of my previous commit; I was wrong about vm_zone's handling of
limits on zones w/o objects.
2002-03-20 04:39:32 +00:00
Jeff Roberson
c897b81311 Remove references to vm_zone.h and switch over to the new uma API.
Also, remove maxsockets.  If you look carefully you'll notice that the old
zone allocator never honored this anyway.
2002-03-20 04:09:59 +00:00
Matthew Dillon
ecde8f7c29 Get rid of the twisted MFREE() macro entirely.
Reviewed by:	dg, bmilekic
MFC after:	3 days
2002-02-05 02:00:56 +00:00
Mike Silbersack
9f5193ca0b Revert 1.81; 1.19 fixed this already in a different way. 2002-01-09 01:45:17 +00:00
Mike Silbersack
5213c50d83 Reorder a calculation in sbreserve so that it does not overflow
with multi-megabyte socket buffer sizes.

PR:		7420
MFC after:	3 weeks
2002-01-06 06:50:54 +00:00
Alfred Perlstein
21d56e9c33 Make AIO a loadable module.
Remove the explicit call to aio_proc_rundown() from exit1(); instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exit(9) and rm_at_exit(9); these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's architecturally neutral;
the problem was that one had to pass it a parameter indicating the
number of arguments which were actually the number of "int".  Fix
it by using an inline version of the AS macro against the syscall
arguments.  (AS should be available globally but we'll get to that
later.)

Add a primitive system for dynamically adding kqueue ops; it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.
2001-12-29 07:13:47 +00:00
Peter Wemm
205b2b6107 Avoid an interaction between syncache and accept filters. The syncache
code only passed up the connection to the tcp stack when it was complete,
so it went directly into the so_comp (complete) queue.  However, with
accept filters, there is an additional phase before calling it "complete".

Reviewed by: jlemon
2001-12-21 04:30:49 +00:00
Robert Watson
f8cf411e49 o Back out portions of 1.50 and 1.47, eliminating sonewconn3() and
always deriving the credential for a newly accepted connection from
  the listen socket.  Previously, the selection of the credential
  depended on the protocol: UNIX domain sockets would use the
  connecting process's credential, and protocols supporting creation
  of the socket before the receiving end called accept() would use
  the listening socket's credential.  After this change, it is always
  the listening socket's credential.

Reviewed by:	green
2001-12-13 22:09:37 +00:00
Matthew Dillon
b1e4abd246 Give struct socket structures a ref counting interface similar to
vnodes.  This will hopefully serve as a base from which we can
expand the MP code.  We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
2001-11-17 03:07:11 +00:00
John Baldwin
bd78cece5d Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
  another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
  refcount is > 1 and false (0) otherwise.
2001-10-11 23:38:17 +00:00
David Malone
59bdd40568 Allow sbcreatecontrol to make cluster sized control messages. 2001-10-04 12:59:53 +00:00
Julian Elischer
b40ce4165d KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
Make the kernel aware that there are smaller units of scheduling than the
process (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
Jonathan Lemon
84241bd0dc Fix up indentation. 2001-06-29 04:01:38 +00:00
Peter Wemm
0978669829 "Fix" the previous initial attempt at fixing TUNABLE_INT(). This time
around, use a common function for looking up and extracting the tunables
from the kernel environment.  This saves duplicating the same function
over and over again.  This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.
2001-06-08 05:24:21 +00:00
Peter Wemm
4422746fdf Back out part of my previous commit. This was a last minute change
and I botched testing.  This is a perfect example of how NOT to do
this sort of thing. :-(
2001-06-07 03:17:26 +00:00
Peter Wemm
81930014ef Make the TUNABLE_*() macros look and behave more consistently like the
SYSCTL_*() macros.  TUNABLE_INT_DECL() was an odd name because it didn't
actually declare the int, which is what the name suggests it would do.
2001-06-06 22:17:08 +00:00
Jesper Skriver
5b86eac4e5 Revert the last bits of my bogus move of NMBCLUSTERS
to <sys/param.h>
2001-06-01 21:47:34 +00:00
Jesper Skriver
e916d96e64 Move the definition of NMBCLUSTERS from src/sys/kern/uipc_mbuf.c
to <sys/param.h>, so it's available to src/sys/netinet/ip_input.c,
and remove the now unneeded includes of "opt_param.h".

MFC after:	1 week
2001-05-31 21:56:44 +00:00
Mark Murray
fb919e4d5a Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by:	bde (with reservations)
2001-05-01 08:13:21 +00:00
David Malone
32af0d74f0 Make sbcompress use the new M_WRITABLE macro. Previously sbcompress
could not compress into clusters. This could result in lots of
wasted clusters while receiving small packets from an interface
that uses clusters for all its packets.

Patch is partially from BSDi (limiting the size of the copy) and
based on a patch for 4.1 by Ian Dowse <iedowse@maths.tcd.ie> and
myself.

Reviewed by:	bmilekic
Obtained From:	BSDi
Submitted by:	iedowse
2000-11-19 22:22:47 +00:00
Don Lewis
f535380cb6 Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize().  Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent.  Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access.  Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid.  Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c.  Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().
2000-09-05 22:11:13 +00:00
Brian Feldman
b6240737d5 Fix hangs caused by overzealous code removal.
Thanks, Nickolay, for figuring out this is the problem.

Submitted by:	Nickolay Dudorov <nnd@mail.nsk.ru>
2000-08-31 11:31:58 +00:00
Brian Feldman
343079d9b2 Remove an extraneous setting of sb_hiwat. 2000-08-30 00:09:57 +00:00
Brian Feldman
6aef685fbb Remove any possibility of hiwat-related race conditions by changing
the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and
a u_long target to set it to.  The whole thing is splnet().

This fixes a problem that jdp has been able to provoke.
2000-08-29 11:28:06 +00:00
Paul Saab
030f7b3faa Remove unnecessary call to splnet when setting an accept filter
since we are already at splnet.
2000-07-31 08:23:43 +00:00
Alfred Perlstein
c636255150 Fix races in the uidinfo subsystem; several problems existed:
1) while allocating a uidinfo struct malloc is called with M_WAITOK,
   it's possible that while asleep another process by the same user
   could have woken up earlier and inserted an entry into the uid
   hash table.  Having redundant entries causes inconsistencies that
   we can't handle.

   fix: do a non-waiting malloc, and if that fails then do a blocking
   malloc, after waking up check that no one else has inserted an entry
   for us already.

2) Because many checks for sbsize were done as "test then set" in a
   non-atomic manner, it was possible to exceed the limits put in place
   via races.

   fix: instead of querying the count then setting, we just attempt to
   set the count and leave it up to the function to return success or
   failure.

3) The uidinfo code was inlining and repeating lookups, insertions, and
   deletions; these needed to be in their own functions for clarity.

Reviewed by: green
2000-06-22 22:27:16 +00:00
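
A user-space sketch of item 2 above (field and function names are
illustrative): rather than reading the per-uid total, comparing, and then
writing as separate steps, the check and the update happen in one critical
section and the function reports success or failure.

    #include <pthread.h>
    #include <stdio.h>

    struct uidinfo_model {
        pthread_mutex_t ui_mtx;
        unsigned long   ui_sbsize;      /* socket buffer bytes charged */
    };

    /* Attempt to change the charged amount from old_hiwat to new_hiwat;
     * refuse the change if it would push the total past 'max'. */
    static int
    chgsbsize_model(struct uidinfo_model *uip, unsigned long old_hiwat,
        unsigned long new_hiwat, unsigned long max)
    {
        int ok = 1;

        pthread_mutex_lock(&uip->ui_mtx);
        if (new_hiwat > old_hiwat) {
            if (uip->ui_sbsize + (new_hiwat - old_hiwat) > max)
                ok = 0;                 /* would exceed the limit */
            else
                uip->ui_sbsize += new_hiwat - old_hiwat;
        } else
            uip->ui_sbsize -= old_hiwat - new_hiwat;
        pthread_mutex_unlock(&uip->ui_mtx);
        return (ok);
    }

    int
    main(void)
    {
        struct uidinfo_model ui = { PTHREAD_MUTEX_INITIALIZER, 0 };

        printf("grow to 64k: %s\n",
            chgsbsize_model(&ui, 0, 64 * 1024, 128 * 1024) ? "ok" : "denied");
        printf("grow to 256k: %s\n",
            chgsbsize_model(&ui, 64 * 1024, 256 * 1024, 128 * 1024) ?
            "ok" : "denied");
        return (0);
    }
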
Alfred Perlstein
a79b71281c Return of the accept filter, part II.
Accept filters are now loadable as well as able to be compiled into
the kernel.

Two accept filters are provided: one that returns sockets when data
arrives, the other when an HTTP request is completed (doesn't work
with 0.9 requests).

Reviewed by: jmg
2000-06-20 01:09:23 +00:00
Alfred Perlstein
a72fda7154 Back out accept optimizations.
Requested by: jmg, dcs, jdp, nate
2000-06-18 08:49:13 +00:00
Alfred Perlstein
8f4e4aa5f1 Add socket options DELAYACCEPT and HTTPACCEPT which will not allow an accept()
until the incoming connection has either data waiting or what looks like an
HTTP request header already in the socketbuffer.  This ought to reduce
the context switch time and overhead for processing requests.

The initial idea and code for HTTPACCEPT came from Yahoo engineers and has
been cleaned up, and a more lightweight DELAYACCEPT for non-HTTP servers
has been added.

Reviewed by: silence on hackers.
2000-06-15 18:18:43 +00:00
Jonathan Lemon
cb679c385e Introduce kqueue() and kevent(), a kernel event notification facility. 2000-04-16 18:53:38 +00:00
Yoshinobu Inoue
7d0d8dc306 CMSG_XXX macro alignment fixes to follow RFC2292.
Approved by: jkh

Submitted by: Partly from tech@openbsd
Reviewed by: itojun
2000-03-03 11:13:12 +00:00