freebsd-dev

Author	SHA1	Message	Date
Bjoern A. Zeeb	36b5ba0c49	Style changes only. Put the return type on an extra line[1] and add an empty line at the beginning as we do not have any local variables. Submitted by: rwatson [1] Reviewed by: rwatson MFC after: 4 weeks	2008-12-10 22:10:37 +00:00
Bjoern A. Zeeb	413628a7e3	MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible	2008-11-29 14:32:14 +00:00
Konstantin Belousov	b4cf0e62f4	Add sv_flags field to struct sysentvec with intention to provide description of the ABI of the currently executing image. Change some places to test the flags instead of explicit comparing with address of known sysentvec structures to determine ABI features. Discussed with: dchagin, imp, jhb, peter	2008-11-22 12:36:15 +00:00
Julian Elischer	bc97ba5100	Fix a scope problem in the multiple routing table code that stopped the SO_SETFIB socket option from working correctly. Obtained from: Ironport MFC after: 3 days	2008-11-19 19:19:30 +00:00
Kip Macy	5a1760fc92	make sure that SO_NO_DDP and SO_NO_OFFLOAD get passed in correctly PR: 127360 MFC after: 3 days	2008-10-17 01:25:45 +00:00
Robert Watson	ff601c3645	In soreceive_dgram, when a 0-length buffer is passed into recv(2) and no data is ready, return 0 rather than blocking or returning EAGAIN. This is consistent with the behavior of soreceive_generic (soreceive) in earlier versions of FreeBSD, and restores this behavior for UDP. Discussed with: jhb, sam MFC after: 3 days	2008-10-07 20:57:55 +00:00
Robert Watson	ffe72750d9	Remove temporary debugging KASSERT's introduced to detect protocols improperly invoking sosend(), soreceive(), and sopoll() instead of attach either specialized or _generic() versions of those functions to their pru_sosend, pru_soreceive, and pru_sopoll protosw methods. MFC after: 3 days	2008-10-07 09:57:03 +00:00
John Baldwin	1af1c6cd8a	Wait until after dropping the receive socket buffer lock to allocate space to store the socket address stored in the first mbuf in a packet chain. This reduces contention on the lock and CPU system time in certain UDP workloads. Tested by: ps Reviewed by: rwatson MFC after: 1 week	2008-10-01 19:14:05 +00:00
Robert Watson	25edc6dd20	Various cleanups for soreceive_dgram(): - Update or remove comments that were left over from the original soreceive_generic() implementation. Quite a few were misleading in the context of the new code. - Since soreceive_dgram() has a simpler structure, replace several gotos with a while loop making the invariants more clear. - In the blocking while loop, don't try to handle cases incompatible with the loop invariant (since m is always NULL, don't check for and handle non-NULL). - Don't drop and re-acquire the socket buffer lock unnecessarily after sbwait() returns, which may help reduce lock contention (etc). - Assume PR_ATOMIC since we assert it at the top of the function. MFC after: 3 days	2008-10-01 13:26:52 +00:00
John Baldwin	c4688866d3	Update the function name in several assertions in soreceive_dgram(). Approved by: rwatson MFC after: 3 days	2008-09-30 18:44:26 +00:00
Robert Watson	26ec197d15	Remove XXXRW in soreceive_dgram that proves unnecessary. Remove unused orig_resid variable in soreceive_dgram. Submitted by: alfred X-MFC with: soreceive_dgram (r180198, r180211)	2008-09-02 16:55:21 +00:00
Kip Macy	dd0e6c383a	Add accessor functions for socket fields. MFC after: 1 week	2008-07-21 00:49:34 +00:00
Robert Watson	6992381eca	Update copyright date in light of soreceive_dgram(9).	2008-07-03 06:47:45 +00:00
Robert Watson	5df3e83946	Add soreceive_dgram(9), an optimized socket receive function for use by datagram-only protocols, such as UDP. This version removes use of sblock(), which is not required due to an inability to interlace data improperly with datagrams, as well as avoiding some of the larger loops and state management that don't apply on datagram sockets. This is experimental code, so hook it up only for UDPv4 for testing; if there are problems we may need to revise it or turn it off by default, but it offers significant performance improvements for threaded UDP applications such as BIND9, nsd, and memcached using UDP. Tested by: kris, ps	2008-07-02 23:23:27 +00:00
Julian Elischer	8b07e49a00	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extent that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] when for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us as to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. if nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility call setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being reponded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect withthe proces that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routes accordingly. ipfw has grown 2 new keywords: setfib N ip from anay to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. it will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. this will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing the data. instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibilty of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the desination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco	2008-05-09 23:03:00 +00:00
Randall Stewart	cf71e4381a	Add pru_flush routine so a transport can flush itself during Shutdown MFC after: 1 week	2008-04-14 18:06:04 +00:00
Ruslan Ermilov	ea26d58729	Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT. Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true since the advent of MBUMA. Reviewed by: arch There are ongoing disputes as to whether we want to switch to directly using UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.	2008-03-25 09:39:02 +00:00
Maxim Sobolev	073d8ba485	Revert previous change - it appears that the limit I was hitting was a maxsockets limit, not maxfiles limit. The question remains why those limits are handled differently (with error code for maxfiles but with sleep for maxsokets), but those would be addressed in a separate commit if necessary. Requested by: rwhatson, jeff	2008-03-19 09:58:25 +00:00
Maxim Sobolev	c9370ff4d0	Properly set size of the file_zone to match kern.maxfiles parameter. Otherwise the parameter is no-op, since zone by default limits number of descriptors to some 12K entries. Attempt to allocate more ends up sleeping on zonelimit. MFC after: 2 weeks	2008-03-16 06:21:30 +00:00
Robert Watson	3f0bfcccfd	Further clean up sorflush: - Expose sbrelease_internal(), a variant of sbrelease() with no expectations about the validity of locks in the socket buffer. - Use sbrelease_internel() in sorflush(), and as a result avoid intializing and destroying a socket buffer lock for the temporary stack copy of the actual buffer, asb. - Add a comment indicating why we do what we do, and remove an XXX since things have gotten less ugly in sorflush() lately. This makes socket close cleaner, and possibly also marginally faster. MFC after: 3 weeks	2008-02-04 12:25:13 +00:00
Robert Watson	265de5bb62	Correct two problems relating to sorflush(), which is called to flush read socket buffers in shutdown() and close(): - Call socantrcvmore() before sblock() to dislodge any threads that might be sleeping (potentially indefinitely) while holding sblock(), such as a thread blocked in recv(). - Flag the sblock() call as non-interruptible so that a signal delivered to the thread calling sorflush() doesn't cause sblock() to fail. The sblock() is required to ensure that all other socket consumer threads have, in fact, left, and do not enter, the socket buffer until we're done flushin it. To implement the latter, change the 'flags' argument to sblock() to accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK flag. When SBL_NOINTR is set, it forces a non-interruptible sx acquisition, regardless of the setting of the disposition of SB_NOINTR on the socket buffer; without this change it would be possible for another thread to clear SB_NOINTR between when the socket buffer mutex is released and sblock() is invoked. Reviewed by: bz, kmacy Reported by: Jos Backus <jos at catnook dot com>	2008-01-31 08:22:24 +00:00
Robert Watson	30d239bc4c	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer	2007-10-24 19:04:04 +00:00
David Malone	041b706b2f	Despite several examples in the kernel, the third argument of sysctl_handle_int is not sizeof the int type you want to export. The type must always be an int or an unsigned int. Remove the instances where a sizeof(variable) is passed to stop people accidently cut and pasting these examples. In a few places this was sysctl_handle_int was being used on 64 bit types, which would truncate the value to be exported. In these cases use sysctl_handle_quad to export them and change the format to Q so that sysctl(1) can still print them.	2007-06-04 18:25:08 +00:00
Jeff Roberson	1c4bcd050a	- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. reads proceed without locks. - Aggregate exiting threads rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch() to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to it's exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits. Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)	2007-06-01 01:12:45 +00:00
Robert Watson	d19e16a72c	Generally migrate to ANSI function headers, and remove 'register' use.	2007-05-16 20:41:08 +00:00
Pyun YongHyeon	ccd8d954f3	Add missing socket buffer unlock before returning to userland. Reviewed by: rwatson	2007-05-08 12:34:14 +00:00
Robert Watson	7abab91135	sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags on each socket buffer with the socket buffer's mutex. This sleep lock is used to serialize I/O on sockets in order to prevent I/O interlacing. This change replaces the custom sleep lock with an sx(9) lock, which results in marginally better performance, better handling of contention during simultaneous socket I/O across multiple threads, and a cleaner separation between the different layers of locking in socket buffers. Specifically, the socket buffer mutex is now solely responsible for serializing simultaneous operation on the socket buffer data structure, and not for I/O serialization. While here, fix two historic bugs: (1) a bug allowing I/O to be occasionally interlaced during long I/O operations (discovere by Isilon). (2) a bug in which failed non-blocking acquisition of the socket buffer I/O serialization lock might be ignored (discovered by sam). SCTP portion of this patch submitted by rrs.	2007-05-03 14:42:42 +00:00
Robert Watson	8c799760e1	Following movement of functions from uipc_socket2.c to uipc_socket.c and uipc_sockbuf.c, clean up and update comments.	2007-03-26 17:05:09 +00:00
Robert Watson	20d9e5e87c	Complete removal of uipc_socket2.c by moving the last few functions to other C files: - Move sbcreatecontrol() and sbtoxsockbuf() to uipc_sockbuf.c. While sbcreatecontrol() is really an mbuf allocation routine, it does its work with awareness of the layout of socket buffer memory. - Move pru_*() protocol switch stubs to uipc_socket.c where the non-stub versions of several of these functions live. Likewise, move socket state transition calls (soisconnecting(), etc) to uipc_socket.c. Moveo sodupsockaddr() and sotoxsocket().	2007-03-26 08:59:03 +00:00
Gleb Smirnoff	cd68a3f706	Move the dom_dispose and pru_detach calls in sofree() earlier. Only after calling pru_detach we can be absolutely sure, that we don't have any references to the socket in the stack. This closes race between lockless sbdestroy() and data arriving on socket. Reviewed by: rwatson	2007-03-22 13:21:24 +00:00
John Baldwin	7568503421	- Use m_gethdr(), m_get(), and m_clget() instead of the macros in sosend_copyin(). - Use M_WAITOK instead of M_TRYWAIT in sosend_copyin(). - Don't check for NULL from M_WAITOK and return ENOBUFS. M_WAITOK/M_TRYWAIT allocations don't fail with NULL. Reviewed by: andre Requested by: andre (2)	2007-03-12 19:27:36 +00:00
Ruslan Ermilov	fac61393b9	Don't block on the socket zone limit during the socket() call which can easily lock up a system otherwise; instead, return ENOBUFS as documented in a manpage, thus reverting us to the FreeBSD 4.x behavior. Reviewed by: rwatson MFC after: 2 weeks	2007-02-26 10:45:21 +00:00
Robert Watson	f58dd47091	Rename somaxconn_sysctl() to sysctl_somaxconn() so that I will be able to claim that sofoo() functions all accept a socket as their first argument.	2007-02-15 10:11:00 +00:00
Bruce M Simpson	7dc8d021ea	Diff reduction with RELENG_6, style(9): Remove unnecessary brace; && should be on end of line. No functional changes.	2007-02-03 03:57:45 +00:00
Andre Oppermann	6a37f331d7	Generic socket buffer auto sizing support, header defines, flag inheritance. MFC after: 1 month	2007-02-01 17:53:41 +00:00
Andre Oppermann	7c32173ba8	Unbreak writes of 0 bytes. Zero byte writes happen when only ancillary control data but no payload data is passed. Change m_uiotombuf() to return at least one empty mbuf if the requested length was zero. Add comment to sosend_dgram and sosend_generic(). Diagnoses by: jhb Regression test by: rwatson Pointy hat to. andre	2007-01-22 14:50:28 +00:00
Robert Watson	abdeb3b01f	Canonicalize copyrights in some files I hold copyrights on: - Sort by date in license blocks, oldest copyright first. - All rights reserved after all copyrights, not just the first. - Use (c) to be consistent with other entries. MFC after: 3 days	2007-01-08 17:49:59 +00:00
Bruce M Simpson	a86ec33820	Drop all received data mbufs from a socket's queue if the MT_SONAME mbuf is dropped, to preserve the invariant in the PR_ADDR case. Add a regression test to detect this condition, but do not hook it up to the build for now. PR: kern/38495 Submitted by: James Juran Reviewed by: sam, rwatson Obtained from: NetBSD MFC after: 2 weeks	2006-12-23 21:07:07 +00:00
Mohan Srinivasan	84eab9ad73	Fix a race in soclose() where connections could be queued to the listening socket after the pass that cleans those queues. This results in these connections being orphaned (and leaked). The fix is to clean up the so queues after detaching the socket from the protocol. Thanks to ups and jhb for discussions and a thorough code review.	2006-11-22 23:54:29 +00:00
Andre Oppermann	1ae4d97d51	Use the improved m_uiotombuf() function instead of home grown sosend_copyin() to do the userland to kernel copying in sosend_generic() and sosend_dgram(). sosend_copyin() is retained for ZERO_COPY_SOCKETS which are not yet supported by m_uiotombuf(). Benchmaring shows significant improvements (95% confidence): 66% less cpu (or 2.9 times better) with new sosend vs. old sosend (non-TSO) 65% less cpu (or 2.8 times better) with new sosend vs. old sosend (TSO) (Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back at 1000Base-TX full duplex.) Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 month	2006-11-02 17:45:28 +00:00
Robert Watson	aed5570872	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA	2006-10-22 11:52:19 +00:00
Bruce M Simpson	4a75dc2585	Fix a case where socket I/O atomicity is violated due to not dropping the entire record when a non-data mbuf is removed in the soreceive() path. This only triggers a panic directly when compiled with INVARIANTS. PR: 38495 Submitted by: James Juran MFC after: 1 week	2006-09-22 15:34:16 +00:00
Pawel Jakub Dawidek	689f94bfe6	Fix a lock leak in an error case. Reported by: netchild Reviewed by: rwatson	2006-09-13 06:58:40 +00:00
Andre Oppermann	805def2e04	New sockets created by incoming connections into listen sockets should inherit all settings and options except listen specific options. Add the missing send/receive timeouts and low watermarks. Remove inheritance of the field so_timeo which is unused. Noticed by: phk Reviewed by: rwatson Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 days	2006-09-10 17:08:06 +00:00
George V. Neville-Neil	daa5817e92	Fix a kernel panic based on receiving an ICMPv6 Packet too Big message. PR: 99779 Submitted by: Jinmei Tatuya Reviewed by: clement, rwatson MFC after: 1 week	2006-08-18 14:05:13 +00:00
Robert Watson	79ad81c06d	Before performing a sodealloc() when pru_attach() fails, assert that the socket refcount remains 1, and then drop to 0 before freeing the socket. PR: 101763 Reported by: Gleb Kozyrev <gkozyrev at ukr dot net>	2006-08-11 23:03:10 +00:00
Robert Watson	9126410f4b	Move destroying kqueue state from above pru_detach to below it in sofree(), as a number of protocols expect to be able to call soisdisconnected() during detach. That may not be a good assumption, but until I'm sure if it's a good assumption or not, allow it.	2006-08-02 18:37:44 +00:00
Robert Watson	c0e1415d51	Move updated of 'numopensockets' from bottom of sodealloc() to the top, eliminating a second set of identical mutex operations at the bottom. This allows brief exceeding of the max sockets limit, but only by sockets in the last stages of being torn down.	2006-08-02 00:45:27 +00:00
Robert Watson	eaa6dfbcc2	Reimplement socket buffer tear-down in sofree(): as the socket is no longer referenced by other threads (hence our freeing it), we don't need to set the can't send and can't receive flags, wake up the consumers, perform two levels of locking, etc. Implement a fast-path teardown, sbdestroy(), which flushes and releases each socket buffer. A manual dom_dispose of the receive buffer is still required explicitly to GC any in-flight file descriptors, etc, before flushing the buffer. This results in a 9% UP performance improvement and 16% SMP performance improvement on a tight loop of socket();close(); in micro-benchmarking, but will likely also affect CPU-bound macro-benchmark performance.	2006-08-01 10:30:26 +00:00
Robert Watson	b0668f7151	soreceive_generic(), and sopoll_generic(). Add new functions sosend(), soreceive(), and sopoll(), which are wrappers for pru_sosend, pru_soreceive, and pru_sopoll, and are now used univerally by socket consumers rather than either directly invoking the old so*() functions or directly invoking the protocol switch method (about an even split prior to this commit). This completes an architectural change that was begun in 1996 to permit protocols to provide substitute implementations, as now used by UDP. Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to perform these operations on sockets -- in particular, distributed file systems and socket system calls. Architectural head nod: sam, gnn, wollman	2006-07-24 15:20:08 +00:00
Robert Watson	809c2b789c	Update various uipc_socket.c comments, and reformat others.	2006-07-23 20:36:04 +00:00
Robert Watson	a152f8a361	Change semantics of socket close and detach. Add a new protocol switch function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach of the protocol retained a reference. This faciliates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true. Reviewed by: gnn	2006-07-21 17:11:15 +00:00
Robert Watson	5cd1a27145	Change comment on soabort() to more accurately describe how/when soabort() is used. Remove trailing white space.	2006-07-16 23:09:39 +00:00
Robert Watson	5908c617bb	Several protocol switch functions (pru_abort, pru_detach, pru_sosetlabel) return void, so don't implement no-op versions of these functions. Instead, consistently check if those switch pointers are NULL before invoking them.	2006-07-11 23:18:28 +00:00
Robert Watson	f949ae9b31	When pru_attach() fails, call sodealloc() on the socket rather than using sorele() and the full tear-down path. Since protocol state allocation failed, this is not required (and is arguably undesirable). This matches the behavior of sonewconn() under the same circumstances.	2006-07-11 21:56:58 +00:00
Robert Watson	721150ad8f	When retrieving SO_ERROR via getsockopt(), hold the socket lock around the retrieval and replacement with 0. MFC after: 1 week	2006-06-18 19:02:49 +00:00
Robert Watson	b37ffd3189	Move some functions and definitions from uipc_socket2.c to uipc_socket.c: - Move sonewconn(), which creates new sockets for incoming connections on listen sockets, so that all socket allocate code is together in uipc_socket.c. - Move 'maxsockets' and associated sysctls to uipc_socket.c with the socket allocation code. - Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it to sysctl.h and remove lots of scattered implementations in various IPC modules. - Sort sodealloc() after soalloc() in uipc_socket.c for dependency order reasons. Statisticize soalloc() and sodealloc() as they are now required only in uipc_socket.c, and are internal to the socket implementation. After this change, socket allocation and deallocation is entirely centralized in one file, and uipc_socket2.c consists entirely of socket buffer manipulation and default protocol switch functions. MFC after: 1 month	2006-06-10 14:34:07 +00:00
Robert Watson	e02421f3fb	Rearrange code in soalloc() so that it's less indented by returning early if uma_zalloc() from the socket zone fails. No functional change. MFC after: 1 week	2006-06-08 22:33:18 +00:00
Robert Watson	0cec9959e8	Assert that sockets passed into soabort() not be SQ_COMP or SQ_INCOMP, since that removal should have been done a layer up. MFC after: 3 months	2006-04-23 18:15:54 +00:00
Robert Watson	28ea180136	Add missing 'not' to SQ_COMP comment. MFC after: 3 months	2006-04-23 15:37:23 +00:00
Robert Watson	6ca35d4b81	Move handling of SQ_COMP exception case in sofree() to the top of the function along with the remainder of the reference checking code. Move comment from body to header with remainder of comments. Inclusion of a socket in a completed connection queue counts as a true reference, and should not be handled as an under-documented edge case. MFC after: 3 months	2006-04-23 15:33:38 +00:00
Robert Watson	bc725eafc7	Chance protocol switch method pru_detach() so that it returns void rather than an error. Detaches do not "fail", they other occur or the protocol flags SS_PROTOREF to take ownership of the socket. soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals. Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it. In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach. netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic. MFC after: 3 months	2006-04-01 15:42:02 +00:00
Robert Watson	ac45e92ff2	Change protocol switch pru_abort() API so that it returns void rather than an int, as an error here is not meaningful. Modify soabort() to unconditionally free the socket on the return of pru_abort(), and modify most protocols to no longer conditionally free the socket, since the caller will do this. This commit likely leaves parts of netinet and netinet6 in a situation where they may panic or leak memory, as they have not are not fully updated by this commit. This will be corrected shortly in followup commits to these components. MFC after: 3 months	2006-04-01 15:15:05 +00:00
Robert Watson	7f689de232	Assert so->so_pcb is NULL in sodealloc() -- the protocol state should not be present at this point. We will eventually remove this assert because the socket layer should never look at so_pcb, but for now it's a useful debugging tool. MFC after: 3 months	2006-04-01 10:45:52 +00:00
Robert Watson	220c1357ed	Add a somewhat sizable comment documenting the semantics of various kernel socket calls relating to the creation and destruction of sockets. This will eventually form the foundation of socket(9), but is currently in too much flux to do so. MFC after: 3 months	2006-04-01 10:43:02 +00:00
Robert Watson	92c07a345e	Change soabort() from returning int to returning void, since all consumers ignore the return value, soabort() is required to succeed, and protocols produce errors here to report multiple freeing of the pcb, which we hope to eliminate.	2006-03-16 07:03:14 +00:00
Robert Watson	93709ad0be	As with socket consumer references (so_count), make sofree() return without GC'ing the socket if a strong protocol reference to the socket is present (SS_PROTOREF).	2006-03-15 12:45:35 +00:00
Robert Watson	13f322c2fc	Improve consistency of return() style. MFC after: 3 days	2006-02-12 15:00:27 +00:00
Robert Watson	b8ae1cd619	Add sosend_dgram(), a greatly reduced and simplified version of sosend() intended for use solely with atomic datagram socket types, and relies on the previous break-out of sosend_copyin(). Changes to allow UDP to optionally use this instead of sosend() will be committed as a follow-up.	2006-01-13 10:22:01 +00:00
John Baldwin	398293a8de	Fix snderr() to not leak the socket buffer lock if an error occurs in sosend(). Robert accidentally changed the snderr() macro to jump to the out label which assumes the lock is already released rather than the release label which drops the lock in his previous change to sosend(). This should fix the recent panics about returning from write(2) with the socket lock held and the most recent LOR on current@.	2005-11-29 23:07:14 +00:00
Robert Watson	66dd8a6f99	Move zero copy statistics structure before sosend_copyin(). MFC after: 1 month Reported by: tinderbox, sam	2005-11-28 21:45:36 +00:00
Robert Watson	a725629cf8	Break out functionality in sosend() responsible for building mbuf chains and copying in mbufs from the body of the send logic, creating a new function sosend_copyin(). This changes makes sosend() almost readable, and will allow the same logic to be used by tailored socket send routines. MFC after: 1 month Reviewed by: andre, glebius	2005-11-28 18:09:03 +00:00
Andre Oppermann	34333b16cd	Retire MT_HEADER mbuf type and change its users to use MT_DATA. Having an additional MT_HEADER mbuf type is superfluous and redundant as nothing depends on it. It only adds a layer of confusion. The distinction between header mbuf's and data mbuf's is solely done through the m->m_flags M_PKTHDR flag. Non-native code is not changed in this commit. For compatibility MT_HEADER is mapped to MT_DATA. Sponsored by: TCP/IP Optimization Fundraise 2005	2005-11-02 13:46:32 +00:00
Robert Watson	d374e81efd	Push the assignment of a new or updated so_qlimit from solisten() following the protocol pru_listen() call to solisten_proto(), so that it occurs under the socket lock acquisition that also sets SO_ACCEPTCONN. This requires passing the new backlog parameter to the protocol, which also allows the protocol to be aware of changes in queue limit should it wish to do something about the new queue limit. This continues a move towards the socket layer acting as a library for the protocol. Bump __FreeBSD_version due to a change in the in-kernel protocol interface. This change has been tested with IPv4 and UNIX domain sockets, but not other protocols.	2005-10-30 19:44:40 +00:00
Paul Saab	53f5742d33	Allow 32bit get/setsockopt with SO_SNDTIMEO or SO_RECVTIMEO to work.	2005-10-27 04:26:35 +00:00
Robert Watson	8434c29b28	Add three new read-only socket options, which allow regression tests and other applications to query the state of the stack regarding the accept queue on a listen socket: SO_LISTENQLIMIT Return the value of so_qlimit (socket backlog) SO_LISTENQLEN Return the value of so_qlen (complete sockets) SO_LISTENINCQLEN Return the value of so_incqlen (incomplete sockets) Minor white space tweaks to existing socket options to make them consistent. Discussed with: andre MFC after: 1 week	2005-09-18 21:08:03 +00:00
Robert Watson	bc6b8b5d64	Fix spelling in a comment. MFC after: 3 days	2005-09-18 10:46:34 +00:00
Maxim Konovalov	aada5cccd8	Backout rev. 1.246, it breaks code uses shutdown(2) on non-connected sockets. Pointed out by: rwatson	2005-09-15 13:18:05 +00:00
Maxim Konovalov	c5cff17017	o Return ENOTCONN when shutdown(2) on non-connected socket. PR: kern/84761 Submitted by: James Juran R-test: tools/regression/sockets/shutdown MFC after: 1 month	2005-09-15 11:45:36 +00:00
Gleb Smirnoff	016e62123a	In soreceive(), when a first mbuf is removed from socket buffer use sockbuf_pushsync(). Previous manipulation could lead to an inconsistent mbuf. Reviewed by: rwatson	2005-09-06 17:05:11 +00:00
Kelly Yancey	dcb5fef5db	Make getsockopt(..., SOL_SOCKET, SO_ACCEPTCONN, ...) work per IEEE Std 1003.1 (POSIX).	2005-08-01 21:15:09 +00:00
George V. Neville-Neil	0d52d7b01a	Fix for PR 83885. Make sure that there actually is a next packet before setting nextrecord to that field. PR: 83885 Submitted by: hirose@comm.yamaha.co.jp Obtained from: Patch suggested in the PR MFC after: 1 week	2005-07-28 10:10:01 +00:00
Suleiman Souhlal	571dcd15e2	Fix the recent panics/LORs/hangs created by my kqueue commit by: - Introducing the possibility of using locks different than mutexes for the knlist locking. In order to do this, we add three arguments to knlist_init() to specify the functions to use to lock, unlock and check if the lock is owned. If these arguments are NULL, we assume mtx_lock, mtx_unlock and mtx_owned, respectively. - Using the vnode lock for the knlist locking, when doing kqueue operations on a vnode. This way, we don't have to lock the vnode while holding a mutex, in filt_vfsread. Reviewed by: jmg Approved by: re (scottl), scottl (mentor override) Pointyhat to: ssouhlal Will be happy: everyone	2005-07-01 16:28:32 +00:00
Brooks Davis	fc74a9f93a	Stop embedding struct ifnet at the top of driver softcs. Instead the struct ifnet or the layer 2 common structure it was embedded in have been replaced with a struct ifnet pointer to be filled by a call to the new function, if_alloc(). The layer 2 common structure is also allocated via if_alloc() based on the interface type. It is hung off the new struct ifnet member, if_l2com. This change removes the size of these structures from the kernel ABI and will allow us to better manage them as interfaces come and go. Other changes of note: - Struct arpcom is no longer referenced in normal interface code. Instead the Ethernet address is accessed via the IFP2ENADDR() macro. To enforce this ac_enaddr has been renamed to _ac_enaddr. - The second argument to ether_ifattach is now always the mac address from driver private storage rather than sometimes being ac_enaddr. Reviewed by: sobomax, sam	2005-06-10 16:49:24 +00:00
Scott Long	8bde93598a	Drat! Committed from the wrong branch. Restore HEAD to its previous goodness.	2005-06-09 19:59:09 +00:00
Scott Long	76b472dbda	Back out 1.68.2.26. It was a mis-guided change that was already backed out of HEAD and should not have been MFC'd. This will restore UDP socket functionality, which will correct the recent NFS problems. Submitted by: rwatson	2005-06-09 19:56:38 +00:00
Andrew Gallatin	92dd256bd4	Allow sends sent from non page-aligned userspace addresses to be considered for zero-copy sends. Reviewed by: alc Submitted by: Romer Gil at Rice University	2005-06-05 17:13:23 +00:00
Robert Watson	a59f81d263	Move the logic implementing retrieval of the SO_ACCEPTFILTER socket option from uipc_socket.c to uipc_accf.c in do_getopt_accept_filter(), so that it now matches do_setopt_accept_filter(). Slightly reformulate the logic to match the optimistic allocation of storage for the argument in advance, and slightly expand the coverage of the socket lock.	2005-03-12 12:57:18 +00:00
Robert Watson	56856fbfb4	Remove an additional commented out reference to a possible future sx lock.	2005-03-11 19:16:02 +00:00
Robert Watson	2b37548a71	When setting up a socket in socreate(), there's no need to lock the socket lock around knlist_init(), so don't. Hard code the setting of the socket reference count to 1 rather than using soref() to avoid asserting the socket lock, since we've not yet exposed the socket to other threads. This removes two mutex operations from each socket allocation.	2005-03-11 16:30:02 +00:00
Robert Watson	5fab68b19e	Remove suggestive sx_init() comment in soalloc(). We will have something like this at some point, but for now it clutters the source.	2005-03-11 16:26:33 +00:00
Robert Watson	0daccb9c94	In the current world order, solisten() implements the state transition of a socket from a regular socket to a listening socket able to accept new connections. As part of this state transition, solisten() calls into the protocol to update protocol-layer state. There were several bugs in this implementation that could result in a race wherein a TCP SYN received in the interval between the protocol state transition and the shortly following socket layer transition would result in a panic in the TCP code, as the socket would be in the TCPS_LISTEN state, but the socket would not have the SO_ACCEPTCONN flag set. This change does the following: - Pushes the socket state transition from the socket layer solisten() to to socket "library" routines called from the protocol. This permits the socket routines to be called while holding the protocol mutexes, preventing a race exposing the incomplete socket state transition to TCP after the TCP state transition has completed. The check for a socket layer state transition is performed by solisten_proto_check(), and the actual transition is performed by solisten_proto(). - Holds the socket lock for the duration of the socket state test and set, and over the protocol layer state transition, which is now possible as the socket lock is acquired by the protocol layer, rather than vice versa. This prevents additional state related races in the socket layer. This permits the dual transition of socket layer and protocol layer state to occur while holding locks for both layers, making the two changes atomic with respect to one another. Similar changes are likely require elsewhere in the socket/protocol code. Reported by: Peter Holm <peter@holm.cc> Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net> Philosophical head nod: gnn	2005-02-21 21:58:17 +00:00
Robert Watson	a00428ef92	In soreceive(), when considering delivery to a socket in SS_ISCONFIRMING, only call the protocol's pru_rcvd() if the protocol has the flag PR_WANTRCVD set. This brings that instance of pru_rcvd() into line with the rest, which do check the flag. MFC after: 3 days	2005-02-20 15:54:44 +00:00
Robert Watson	a7ae36bc45	Correct a typo in the comment describing soreceive_rcvoob(). MFC after: 3 days	2005-02-18 19:15:22 +00:00
Robert Watson	1b5c4b15b4	In soconnect(), when resetting so->so_error, the socket lock is not required due to a straight integer write in which minor races are not a problem.	2005-02-18 19:13:51 +00:00
Robert Watson	78e436448f	Move do_setopt_accept_filter() from uipc_socket.c to uipc_accf.c, where the rest of the accept filter code currently lives. MFC after: 3 days	2005-02-18 18:54:42 +00:00
Robert Watson	627de7fa2c	Re-order checks in socheckuid() so that we check all deny cases before returning accept. MFC after: 3 days	2005-02-18 18:43:33 +00:00
Robert Watson	0d89301c51	In solisten(), unconditionally set the SO_ACCEPTCONN option in so->so_options when solisten() will succeed, rather than setting it conditionally based on there not being queued sockets in the completed socket queue. Otherwise, if the protocol exposes new sockets via the completed queue before solisten() completes, the listen() system call will succeed, but the socket and protocol state will be out of sync. For TCP, this didn't happen in practice, as the TCP code will panic if a new connection comes in after the tcpcb has been transitioned to a listening state but the socket doesn't have SO_ACCEPTCONN set. This is historical behavior resulting from bitrot since 4.3BSD, in which that line of code was associated with the conditional NULL'ing of the connection queue pointers (one-time initialization to be performed during the transition to a listening socket), which are now initialized separately. Discussed with: fenner, gnn MFC after: 3 days	2005-02-18 00:52:17 +00:00
Gleb Smirnoff	90d52f2f21	- Convert so_qlen, so_incqlen, so_qlimit fields of struct socket from short to unsigned short. - Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX. Before this change setting somaxconn to smth above 32767 and calling listen(fd, -1) lead to a socket, which doesn't accept connections at all. Reviewed by: rwatson Reported by: Igor Sysoev	2005-01-24 12:20:21 +00:00
Maxim Sobolev	fdf84ec4c6	When re-connecting already connected datagram socket ensure to clean up its pending error state, which may be set in some rare conditions resulting in connect() syscall returning that bogus error and making application believe that attempt to change association has failed, while it has not in fact. There is sockets/reconnect regression test which excersises this bug. MFC after: 2 weeks	2005-01-12 10:15:23 +00:00
Warner Losh	9454b2d864	/* -> /*- for copyright notices, minor format tweaks as necessary	2005-01-06 23:35:40 +00:00
Robert Watson	ba65391172	Remove an XXXRW indicating atomic operations might be used as a substitute for a global mutex protecting the socket count and generation number. The observation that soreceive_rcvoob() can't return an mbuf chain is a property, not a bug, so remove the XXXRW. In sorflush, s/existing/previous/ for code when describing prior behavior. For SO_LINGER socket option retrieval, remove an XXXRW about why we hold the mutex: this is correct and not dubious. MFC after: 2 weeks	2004-12-23 01:07:12 +00:00
Robert Watson	81b5dbecd4	In soalloc(), simplify the mac_init_socket() handling to remove unnecessary use of a global variable and simplify the return case. While here, use ()'s around return values. In sodealloc(), remove a comment about why we bump the gencnt and decrement the socket count separately. It doesn't add substantially to the reading, and clutters the function. MFC after: 2 weeks	2004-12-23 00:59:43 +00:00
Alan Cox	c73e3e9223	Remove unneeded code from the zero-copy receive path. Discussed with: gallatin@ Tested by: ken@	2004-12-10 04:49:13 +00:00
Alan Cox	1c4dbedac4	Tidy up the zero-copy receive path: Remove an unneeded argument to uiomoveco() and userspaceco().	2004-12-08 05:25:08 +00:00
Paul Saab	d297f70246	If soreceive() is called from a socket callback, there's no reason to do a window update to the peer (thru an ACK) from soreceive() itself. TCP will do that upon return from the socket callback. Sending a window update from soreceive() results in a lock reversal. Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com Reviewed by: rwatson	2004-11-29 23:10:59 +00:00
Paul Saab	85d11adf25	Make soreceive(MSG_DONTWAIT) nonblocking. If MSG_DONTWAIT is passed into soreceive(), then pass in M_DONTWAIT to m_copym(). Also fix up error handling for the case where m_copym() returns failure. Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com Reviewed by: rwatson	2004-11-29 23:09:07 +00:00
Gleb Smirnoff	1449a2f547	Since sb_timeo type was increased to int, use INT_MAX instead of SHRT_MAX. This also gives us ability to close PR. PR: kern/42352 Approved by: julian (mentor) MFC after: 1 week	2004-11-09 18:35:26 +00:00
Robert Watson	aae2782bff	Acquire the accept mutex in soabort() before calling sotryfree(), as that is now required. RELENG_5_3 candidate. Foot provided by: Dikshie <dikshie at ppk dot itb dot ac dot id>	2004-11-02 17:15:13 +00:00
Andre Oppermann	3a82a5451c	socreate() does an early abort if either the protocol cannot be found, or pru_attach is NULL. With loadable protocols the SPACER dummy protocols have valid function pointers for all methods to functions returning just EOPNOTSUPP. Thus the early abort check would not detect immediately that attach is not supported for this protocol. Instead it would correctly get the EOPNOTSUPP error later on when it calls the protocol specific attach function. Add testing against the pru_attach_notsupp() function pointer to the early abort check as well.	2004-10-23 19:06:43 +00:00
Robert Watson	81158452be	Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()): - This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd. - This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket. This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements. RELENG_5_3 candidate. MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>	2004-10-18 22:19:43 +00:00
Robert Watson	35b260cd69	Rework sofree() logic to take into account a possible race with accept(). Sockets in the listen queues have reference counts of 0, so if the protocol decides to disconnect the pcb and try to free the socket, this triggered a race with accept() wherein accept() would bump the reference count before sofree() had removed the socket from the listen queues, resulting in a panic in sofree() when it discovered it was freeing a referenced socket. This might happen if a RST came in prior to accept() on a TCP connection. The fix is two-fold: to expand the coverage of the accept mutex earlier in sofree() to prevent accept() from grabbing the socket after the "is it really safe to free" tests, and to expand the logic of the "is it really safe to free" tests to check that the refcount is still 0 (i.e., we didn't race). RELENG_5 candidate. Much discussion with and work by: green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>	2004-10-11 08:11:26 +00:00
Robert Watson	76f6939888	Expand the scope of the socket buffer locks in sopoll() to include the state test as well as set, or we risk a race between a socket wakeup and registering for select() or poll() on the socket. This does increase the cost of the poll operation, but can probably be optimized some in the future. This appears to correct poll() "wedges" experienced with X11 on SMP systems with highly interactive applications, and might affect a plethora of other select() driven applications. RELENG_5 candidate. Problem reported by: Maxim Maximov <mcsi at mcsi dot pp dot ru> Debugged with help of: dwhite	2004-09-05 14:33:21 +00:00
Robert Watson	fe0f2d4e11	Conditional acquisition of socket buffer mutexes when testing socket buffers with kqueue filters is no longer required: the kqueue framework will guarantee that the mutex is held on entering the filter, either due to a call from the socket code already holding the mutex, or by explicitly acquiring it. This removes the last of the conditional socket locking.	2004-08-24 05:28:18 +00:00
Robert Watson	7b38f0d3c3	Back out uipc_socket.c:1.208, as it incorrectly assumes that all sockets are connection-oriented for the purposes of kqueue registration. Since UDP sockets aren't connection-oriented, this appeared to break a great many things, such as RPC-based applications and services (i.e., NFS). Since jmg isn't around I'm backing this out before too many more feet are shot, but intend to investigate the right solution with him once he's available. Apologies to: jmg Discussed with: imp, scottl	2004-08-20 16:24:23 +00:00
John-Mark Gurney	5d6dd4685a	make sure that the socket is either accepting connections or is connected when attaching a knote to it... otherwise return EINVAL... Pointed out by: benno	2004-08-20 04:15:30 +00:00
John-Mark Gurney	ad3b9257c2	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)	2004-08-15 06:24:42 +00:00
Robert Watson	217a4b6e4e	Replace a reference to splnet() with a reference to locking in a comment.	2004-08-11 03:43:10 +00:00
Robert Watson	99901d0afb	Do some initial locking on accept filter registration and attach. While here, close some races that existed in the pre-locking world during low memory conditions. This locking isn't perfect, but it's closer than before.	2004-07-25 23:29:47 +00:00
David Malone	cdb71f7526	The recent changes to control message passing broke some things that get certain types of control messages (ping6 and rtsol are examples). This gets the new code closer to working: 1) Collect control mbufs for processing in the controlp == NULL case, so that they can be freed by externalize. 2) Loop over the list of control mbufs, as the externalize function may not know how to deal with chains. 3) In the case where there is no externalize function, remember to add the control mbuf to the controlp list so that it will be returned. 4) After adding stuff to the controlp list, walk to the end of the list of stuff that was added, incase we added a chain. This code can be further improved, but this is enough to get most things working again. Reviewed by: rwatson	2004-07-18 19:10:36 +00:00
Robert Watson	dad7b41a9b	When entering soclose(), assert that SS_NOFDREF is not already set.	2004-07-16 00:37:34 +00:00
David Malone	dcee93dcf9	Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a a better name. I have a kern_[sg]etsockopt which I plan to commit shortly, but the arguments to these function will be quite different from so_setsockopt. Approved by: alfred	2004-07-12 21:42:33 +00:00
Alfred Perlstein	d58d3648dd	Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts. Tune the timeout from 5 seconds to 12 seconds. Provide a sysctl to show how many reconnects the NFS client has done. Seems to fix IPv6 from: kuriyama	2004-07-12 06:22:42 +00:00
Robert Watson	a294c3664f	Use sockbuf_pushsync() to synchronize stack and socket buffer state in soreceive() after removing an MT_SONAME mbuf from the head of the socket buffer. When processing MT_CONTROL mbufs in soreceive(), first remove all of the MT_CONTROL mbufs from the head of the socket buffer to a local mbuf chain, then feed them into dom_externalize() as a set, which both avoids thrashing the socket buffer lock when handling multiple control mbufs, and also avoids races with other threads acting on the socket buffer when the socket buffer mutex is released to enter the externalize code. Existing races that might occur if the protocol externalize method blocked during processing have also been closed. Now that we synchronize socket buffer and stack state following modifications to the socket buffer, turn the manual synchronization that previously followed control mbuf processing with a set of assertions. This can eventually be removed. The soreceive() code is now substantially more MPSAFE.	2004-07-11 23:13:14 +00:00
Robert Watson	b7562e178c	Add sockbuf_pushsync(), an inline function that, following a change to the head of the mbuf chains in a socket buffer, re-synchronizes the cache pointers used to optimize socket buffer appends. This will be used by soreceive() before dropping socket buffer mutexes to make sure a consistent version of the socket buffer is visible to other threads. While here, update copyright to account for substantial rewrite of much socket code required for fine-grained locking.	2004-07-11 22:59:32 +00:00
Robert Watson	d861372b14	Add additional annotations to soreceive(), documenting the effects of locking on 'nextrecord' and concerns regarding potentially inconsistent or stale use of socket buffer or stack fields if they aren't carefully synchronized whenever the socket buffer mutex is released. Document that the high-level sblock() prevents races against other readers on the socket. Also document the 'type' logic as to how soreceive() guarantees that it will only return one of normal data or inline out-of-band data.	2004-07-11 18:29:47 +00:00
Robert Watson	0014b343e0	In the 'dontblock' section of soreceive(), assert that the mbuf on hand ('m') is in fact the first mbuf in the receive socket buffer.	2004-07-11 01:44:12 +00:00
Robert Watson	5e44d93ffc	Break out non-inline out-of-band data receive code from soreceive() and put it in its own helper function soreceive_rcvoob().	2004-07-11 01:34:34 +00:00
Robert Watson	a04b09398c	Assign pointers values of NULL rather than 0 in soreceive().	2004-07-11 01:22:40 +00:00
Robert Watson	7e17bc9f26	When the MT_SONAME mbuf is popped off of a receive socket buffer associated with a PR_ADDR protocol, make sure to update the m_nextpkt pointer of the new head mbuf on the chain to point to the next record. Otherwise, when we release the socket buffer mutex, the socket buffer mbuf chain may be in an inconsistent state.	2004-07-10 21:43:35 +00:00
Robert Watson	5c2b7a2273	Now socket buffer locks are being asserted at higher code blocks in soreceive(), remove some leaf assertions that are redundant.	2004-07-10 04:38:06 +00:00
Robert Watson	32775a01da	Assert socket buffer lock at strategic points between sections of code in soreceive() to confirm we've moved from block to block properly maintaining locking invariants.	2004-07-10 03:47:15 +00:00
Robert Watson	6a72b225b7	Drop the socket buffer lock around a call to m_copym() with M_TRYWAIT. A subset of locking changes to soreceive() in the queue for merging. Bumped into by: Willem Jan Withagen <wjw@withagen.nl>	2004-07-05 19:29:33 +00:00
Robert Watson	a290574663	Add a new global mutex, so_global_mtx, which protects the global variables so_gencnt, numopensockets, and the per-socket field so_gencnt. Annotate this this might be better done with atomic operations. Annotate what accept_mtx protects.	2004-06-27 03:22:15 +00:00
Robert Watson	11c40a39b6	Replace comment on spl state when calling soabort() with a comment on locking state. No socket locks should be held when calling soabort() as it will call into protocol code that may acquire socket locks.	2004-06-26 17:12:29 +00:00
Robert Watson	c6b93bf29a	Lock socket buffers when processing setting socket options SO_SNDLOWAT or SO_RCVLOWAT for read-modify-write.	2004-06-24 04:28:30 +00:00
Robert Watson	adb4cf0fbc	Slide socket buffer lock earlier in sopoll() to cover the call into selrecord(), setting up select and flagging the socker buffers as SB_SEL and setting up select under the lock.	2004-06-24 00:54:26 +00:00
Robert Watson	fea24c0a71	Remove spl's from uipc_socket to ease in merging.	2004-06-22 03:49:22 +00:00
Robert Watson	a34b704666	Merge next step in socket buffer locking: - sowakeup() now asserts the socket buffer lock on entry. Move the call to KNOTE higher in sowakeup() so that it is made with the socket buffer lock held for consistency with other calls. Release the socket buffer lock prior to calling into pgsigio(), so_upcall(), or aio_swake(). Locking for this event management will need revisiting in the future, but this model avoids lock order reversals when upcalls into other subsystems result in socket/socket buffer operations. Assert that the socket buffer lock is not held at the end of the function. - Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now have _locked versions which assert the socket buffer lock on entry. If a wakeup is required by sb_notify(), invoke sowakeup(); otherwise, unconditionally release the socket buffer lock. This results in the socket buffer lock being released whether a wakeup is required or not. - Break out socantsendmore() into socantsendmore_locked() that asserts the socket buffer lock. socantsendmore() unconditionally locks the socket buffer before calling socantsendmore_locked(). Note that both functions return with the socket buffer unlocked as socantsendmore_locked() calls sowwakeup_locked() which has the same properties. Assert that the socket buffer is unlocked on return. - Break out socantrcvmore() into socantrcvmore_locked() that asserts the socket buffer lock. socantrcvmore() unconditionally locks the socket buffer before calling socantrcvmore_locked(). Note that both functions return with the socket buffer unlocked as socantrcvmore_locked() calls sorwakeup_locked() which has similar properties. Assert that the socket buffer is unlocked on return. - Break out sbrelease() into a sbrelease_locked() that asserts the socket buffer lock. sbrelease() unconditionally locks the socket buffer before calling sbrelease_locked(). sbrelease_locked() now invokes sbflush_locked() instead of sbflush(). - Assert the socket buffer lock in socket buffer sanity check functions sblastrecordchk(), sblastmbufchk(). - Assert the socket buffer lock in SBLINKRECORD(). - Break out various sbappend() functions into sbappend_locked() (and variations on that name) that assert the socket buffer lock. The !_locked() variations unconditionally lock the socket buffer before calling their _locked counterparts. Internally, make sure to call _locked() support routines, etc, if already holding the socket buffer lock. - Break out sbinsertoob() into sbinsertoob_locked() that asserts the socket buffer lock. sbinsertoob() unconditionally locks the socket buffer before calling sbinsertoob_locked(). - Break out sbflush() into sbflush_locked() that asserts the socket buffer lock. sbflush() unconditionally locks the socket buffer before calling sbflush_locked(). Update panic strings for new function names. - Break out sbdrop() into sbdrop_locked() that asserts the socket buffer lock. sbdrop() unconditionally locks the socket buffer before calling sbdrop_locked(). - Break out sbdroprecord() into sbdroprecord_locked() that asserts the socket buffer lock. sbdroprecord() unconditionally locks the socket buffer before calling sbdroprecord_locked(). - sofree() now calls socantsendmore_locked() and re-acquires the socket buffer lock on return. It also now calls sbrelease_locked(). - sorflush() now calls socantrcvmore_locked() and re-acquires the socket buffer lock on return. Clean up/mess up other behavior in sorflush() relating to the temporary stack copy of the socket buffer used with dom_dispose by more properly initializing the temporary copy, and selectively bzeroing/copying more carefully to prevent WITNESS from getting confused by improperly initialized mutexes. Annotate why that's necessary, or at least, needed. - soisconnected() now calls sbdrop_locked() before unlocking the socket buffer to avoid locking overhead. Some parts of this change were: Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS	2004-06-21 00:20:43 +00:00
Robert Watson	fa8368a8fe	When retrieving the SO_LINGER socket option for user space, hold the socket lock over pulling so_options and so_linger out of the socket structure in order to retrieve a consistent snapshot. This may be overkill if user space doesn't require a consistent snapshot.	2004-06-20 17:50:42 +00:00
Robert Watson	6f4b1b5578	Convert an if->panic in soclose() into a call to KASSERT().	2004-06-20 17:47:51 +00:00
Robert Watson	ed2f7766b0	Annotate some ordering-related issues in solisten() which are not yet resolved by socket locking: in particular, that we test the connection state at the socket layer without locking, request that the protocol begin listening, and then set the listen state on the socket non-atomically, resulting in a non-atomic cross-layer test-and-set.	2004-06-20 17:38:19 +00:00
Robert Watson	31f555a1c5	Assert socket buffer lock in sb_lock() to protect socket buffer sleep lock state. Convert tsleep() into msleep() with socket buffer mutex as argument. Hold socket buffer lock over sbunlock() to protect sleep lock state. Assert socket buffer lock in sbwait() to protect the socket buffer wait state. Convert tsleep() into msleep() with socket buffer mutex as argument. Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK() in order to call into these functions with the lock, as well as to start protecting other socket buffer use in their implementation. Drop the socket buffer mutexes around calls into the protocol layer, around potentially blocking operations, for copying to/from user space, and VM operations relating to zero-copy. Assert the socket buffer mutex strategically after code sections or at the beginning of loops. In some cases, modify return code to ensure locks are properly dropped. Convert the potentially blocking allocation of storage for the remote address in soreceive() into a non-blocking allocation; we may wish to move the allocation earlier so that it can block prior to acquisition of the socket buffer lock. Drop some spl use. NOTE: Some races exist in the current structuring of sosend() and soreceive(). This commit only merges basic socket locking in this code; follow-up commits will close additional races. As merged, these changes are not sufficient to run without Giant safely. Reviewed by: juli, tjr	2004-06-19 03:23:14 +00:00
Robert Watson	7b574f2e45	Hold SOCK_LOCK(so) while frobbing so_options. Note that while the local race is corrected, there's still a global race in sosend() relating to so_options and the SO_DONTROUTE flag.	2004-06-18 04:02:56 +00:00
Robert Watson	c012260726	Merge some additional leaf node socket buffer locking from rwatson_netperf: Introduce conditional locking of the socket buffer in fifofs kqueue filters; KNOTE() will be called holding the socket buffer locks in fifofs, but sometimes the kqueue() system call will poll using the same entry point without holding the socket buffer lock. Introduce conditional locking of the socket buffer in the socket kqueue filters; KNOTE() will be called holding the socket buffer locks in the socket code, but sometimes the kqueue() system call will poll using the same entry points without holding the socket buffer lock. Simplify the logic in sodisconnect() since we no longer need spls. NOTE: To remove conditional locking in the kqueue filters, it would make sense to use a separate kqueue API entry into the socket/fifo code when calling from the kqueue() system call.	2004-06-18 02:57:55 +00:00
Robert Watson	9535efc00d	Merge additional socket buffer locking from rwatson_netperf: - Lock down low hanging fruit use of sb_flags with socket buffer lock. - Lock down low hanging fruit use of so_state with socket lock. - Lock down low hanging fruit use of so_options. - Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with socket buffer lock. - Annotate situations in which we unlock the socket lock and then grab the receive socket buffer lock, which are currently actually the same lock. Depending on how we want to play our cards, we may want to coallesce these lock uses to reduce overhead. - Convert a if()->panic() into a KASSERT relating to so_state in soaccept(). - Remove a number of splnet()/splx() references. More complex merging of socket and socket buffer locking to follow.	2004-06-17 22:48:11 +00:00
Robert Watson	c0b99ffa02	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coallesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.	2004-06-14 18:16:22 +00:00
Robert Watson	395a08c904	Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS	2004-06-12 20:47:32 +00:00
Robert Watson	f6c0cce6d9	Introduce a mutex into struct sockbuf, sb_mtx, which will be used to protect fields in the socket buffer. Add accessor macros to use the mutex (SOCKBUF_()). Initialize the mutex in soalloc(), and destroy it in sodealloc(). Add addition, add SOCK_() access macros which will protect most remaining fields in the socket; for the time being, use the receive socket buffer mutex to implement socket level locking to reduce memory overhead. Submitted by: sam Sponosored by: FreeBSD Foundation Obtained from: BSD/OS	2004-06-12 16:08:41 +00:00
Stefan Farfeleder	1a5ff9285a	Avoid assignments to cast expressions. Reviewed by: md5 Approved by: das (mentor)	2004-06-08 13:08:19 +00:00
Robert Watson	2658b3bb8e	Integrate accept locking from rwatson_netperf, introducing a new global mutex, accept_mtx, which serializes access to the following fields across all sockets: so_qlen so_incqlen so_qstate so_comp so_incomp so_list so_head While providing only coarse granularity, this approach avoids lock order issues between sockets by avoiding ownership of the fields by a specific socket and its per-socket mutexes. While here, rewrite soclose(), sofree(), soaccept(), and sonewconn() to add assertions, close additional races and address lock order concerns. In particular: - Reorganize the optimistic concurrency behavior in accept1() to always allocate a file descriptor with falloc() so that if we do find a socket, we don't have to encounter the "Oh, there wasn't a socket" race that can occur if falloc() sleeps in the current code, which broke inbound accept() ordering, not to mention requiring backing out socket state changes in a way that raced with the protocol level. We may want to add a lockless read of the queue state if polling of empty queues proves to be important to optimize. - In accept1(), soref() the socket while holding the accept lock so that the socket cannot be free'd in a race with the protocol layer. Likewise in netgraph equivilents of the accept1() code. - In sonewconn(), loop waiting for the queue to be small enough to insert our new socket once we've committed to inserting it, or races can occur that cause the incomplete socket queue to overfill. In the previously implementation, it was sufficient to simply tested once since calling soabort() didn't release synchronization permitting another thread to insert a socket as we discard a previous one. - In soclose()/sofree()/et al, it is the responsibility of the caller to remove a socket from the incomplete connection queue before calling soabort(), which prevents soabort() from having to walk into the accept socket to release the socket from its queue, and avoids races when releasing the accept mutex to enter soabort(), permitting soabort() to avoid lock ordering issues with the caller. - Generally cluster accept queue related operations together throughout these functions in order to facilitate locking. Annotate new locking in socketvar.h.	2004-06-02 04:15:39 +00:00
Robert Watson	36568179e3	The SS_COMP and SS_INCOMP flags in the so_state field indicate whether the socket is on an accept queue of a listen socket. This change renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new state field on the socket, so_qstate, as the locking for these flags is substantially different for the locking on the remainder of the flags in so_state.	2004-06-01 02:42:56 +00:00
Don Lewis	866046f5a6	Add MSG_NBIO flag option to soreceive() and sosend() that causes them to behave the same as if the SS_NBIO socket flag had been set for this call. The SS_NBIO flag for ordinary sockets is set by fcntl(fd, F_SETFL, O_NONBLOCK). Pass the MSG_NBIO flag to the soreceive() and sosend() calls in fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag on the underlying socket for each I/O operation. The O_NONBLOCK flag is a property of the descriptor, and unlike ordinary sockets, fifos may be referenced by multiple descriptors.	2004-06-01 01:18:51 +00:00
Bosko Milekic	099a0e588c	Bring in mbuma to replace mballoc. mbuma is an Mbuf & Cluster allocator built on top of a number of extensions to the UMA framework, all included herein. Extensions to UMA worth noting: - Better layering between slab <-> zone caches; introduce Keg structure which splits off slab cache away from the zone structure and allows multiple zones to be stacked on top of a single Keg (single type of slab cache); perhaps we should look into defining a subset API on top of the Keg for special use by malloc(9), for example. - UMA_ZONE_REFCNT zones can now be added, and reference counters automagically allocated for them within the end of the associated slab structures. uma_find_refcnt() does a kextract to fetch the slab struct reference from the underlying page, and lookup the corresponding refcnt. mbuma things worth noting: - integrates mbuf & cluster allocations with extended UMA and provides caches for commonly-allocated items; defines several zones (two primary, one secondary) and two kegs. - change up certain code paths that always used to do: m_get() + m_clget() to instead just use m_getcl() and try to take advantage of the newly defined secondary Packet zone. - netstat(1) and systat(1) quickly hacked up to do basic stat reporting but additional stats work needs to be done once some other details within UMA have been taken care of and it becomes clearer to how stats will work within the modified framework. From the user perspective, one implication is that the NMBCLUSTERS compile-time option is no longer used. The maximum number of clusters is still capped off according to maxusers, but it can be made unlimited by setting the kern.ipc.nmbclusters boot-time tunable to zero. Work should be done to write an appropriate sysctl handler allowing dynamic tuning of kern.ipc.nmbclusters at runtime. Additional things worth noting/known issues (READ): - One report of 'ips' (ServeRAID) driver acting really slow in conjunction with mbuma. Need more data. Latest report is that ips is equally sucking with and without mbuma. - Giant leak in NFS code sometimes occurs, can't reproduce but currently analyzing; brueffer is able to reproduce but THIS IS NOT an mbuma-specific problem and currently occurs even WITHOUT mbuma. - Issues in network locking: there is at least one code path in the rip code where one or more locks are acquired and we end up in m_prepend() with M_WAITOK, which causes WITNESS to whine from within UMA. Current temporary solution: force all UMA allocations to be M_NOWAIT from within UMA for now to avoid deadlocks unless WITNESS is defined and we can determine with certainty that we're not holding any locks when we're M_WAITOK. - I've seen at least one weird socketbuffer empty-but- mbuf-still-attached panic. I don't believe this to be related to mbuma but please keep your eyes open, turn on debugging, and capture crash dumps. This change removes more code than it adds. A paper is available detailing the change and considering various performance issues, it was presented at BSDCan2004: http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf Please read the paper for Future Work and implementation details, as well as credits. Testing and Debugging: rwatson, brueffer, Ketrien I. Saihr-Kesenchedra, ... Reviewed by: Lots of people (for different parts)	2004-05-31 21:46:06 +00:00
Robert Watson	123f024b24	Compare pointers with NULL rather than using pointers are booleans in if/for statements. Assign pointers to NULL rather than typecast 0. Compare pointers with NULL rather than 0.	2004-04-09 13:23:51 +00:00
Warner Losh	7f8a436ff2	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core	2004-04-05 21:03:37 +00:00
Robert Watson	8e44a7ec13	In sofree(), avoid nested declaration and initialization in declaration. Observe that initialization in declaration is frequently incompatible with locking, not just a bad idea due to style(9). Submitted by: bde	2004-03-31 03:48:35 +00:00
Robert Watson	181e65db5b	Use a common return path for filt_soread() and filt_sowrite() to simplify the impact of locking on these functions. Submitted by: sam Sponsored by: FreeBSD Foundation	2004-03-29 18:06:15 +00:00
Robert Watson	71c90a2944	In sofree(), moving caching of 'head' from 'so->so_head' to later in the function once it has been determined to be non-NULL to simplify locking on an earlier return.	2004-03-29 17:57:43 +00:00
Robert Watson	746e5bf09b	Rename dup_sockaddr() to sodupsockaddr() for consistency with other functions in kern_socket.c. Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT in from the caller context rather than "1" or "0". Correct mflags pass into mac_init_socket() from previous commit to not include M_ZERO. Submitted by: sam	2004-03-01 03:14:23 +00:00
Scott Long	740d9ba692	Convert the other use of flags to mflags in soalloc().	2004-03-01 01:14:28 +00:00
Robert Watson	2bc87dcfbe	Modify soalloc() API so that it accepts a malloc flags argument rather than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT rather than 0 or 1. This simplifies the soalloc() logic, and also makes the waiting behavior of soalloc() more clear in the calling context. Submitted by: sam	2004-02-29 17:54:05 +00:00
Brian Feldman	f662a93197	Always socantsendmore() before deallocating a socket. This, in turn, calls selwakeup() if necessary (which it is, if you don't want freed memory hanging around on your td->td_selq). Props to: alfred	2004-02-12 01:48:40 +00:00
Poul-Henning Kamp	be8a62e821	Introduce the SO_BINTIME option which takes a high-resolution timestamp at packet arrival. For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL since it has higher resolution and lower overhead. Simultaneous use of the two options is possible and they will return consistent timestamps. This introduces an extra test and a function call for SO_TIMEVAL, but I have not been able to measure that.	2004-01-31 10:40:25 +00:00
Ruslan Ermilov	0541040c46	Since "m" is not part of the "mp" chain, need to free() it. Reported by: Stanford Metacompilation research group	2004-01-18 14:02:53 +00:00
Robert Watson	9e71dd0feb	Reduce gratuitous redundancy and length in function names: mac_setsockopt_label_set() -> mac_setsockopt_label() mac_getsockopt_label_get() -> mac_getsockopt_label() mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel() Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2003-11-16 18:25:20 +00:00
Robert Watson	12cbb9dc56	When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make sure to sooptcopyin() the (struct mac) so that the MAC Framework knows which label types are being requested. This fixes process queries of socket labels. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2003-11-16 03:53:36 +00:00
Seigo Tanimura	512824f8f7	- Implement selwakeuppri() which allows raising the priority of a thread being waken up. The thread waken up can run at a priority as high as after tsleep(). - Replace selwakeup()s with selwakeuppri()s and pass appropriate priorities. - Add cv_broadcastpri() which raises the priority of the broadcast threads. Used by selwakeuppri() if collision occurs. Not objected in: -arch, -current	2003-11-09 09:17:26 +00:00
Sam Leffler	395bb18680	speedup stream socket recv handling by tracking the tail of the mbuf chain instead of walking the list for each append Submitted by: ps/jayanth Obtained from: netbsd (jason thorpe)	2003-10-28 05:47:40 +00:00
Mike Silbersack	184dcdc7c8	Change all SYSCTLS which are readonly and have a related TUNABLE from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide more useful error messages.	2003-10-21 18:28:36 +00:00
Jeffrey Hsu	cc3426866c	Make the second argument to sooptcopyout() constant in order to simplify the upcoming PIM patches. Submitted by: Pavlin Radoslavov <pavlin@icir.org>	2003-08-05 00:27:54 +00:00
Robert Drehmel	4e19fe1081	To avoid a kernel panic provoked by a NULL pointer dereference, do not clear the `sb_sel' member of the sockbuf structure while invalidating the receive sockbuf in sorflush(), called from soshutdown(). The panic was reproduceable from user land by attaching a knote with EVFILT_READ filters to a socket, disabling further reads from it using shutdown(2), and then closing it. knote_remove() was called to remove all knotes from the socket file descriptor by detaching each using its associated filterops' detach call- back function, sordetach() in this case, which tried to remove itself from the invalidated sockbuf's klist (sb_sel.si_note). PR: kern/54331	2003-07-17 23:49:10 +00:00
Jeffrey Hsu	330841c763	Rev 1.121 meant to pass the value 1 to soalloc() to indicate waitok. Reported by: arr	2003-07-14 20:39:22 +00:00
David E. O'Brien	677b542ea2	Use __FBSDID().	2003-06-11 00:56:59 +00:00
Alexander Kabaev	104a9b7e3e	Deprecate machine/limits.h in favor of new sys/limits.h. Change all in-tree consumers to include <sys/limits.h> Discussed on: standards@ Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>	2003-04-29 13:36:06 +00:00
Olivier Houchard	695d74f337	Use while (controlp != NULL) instead of do ... while (control != NULL) There are valid cases where *controlp will be NULL at this point. Discussed with: dwmalone	2003-04-14 14:44:36 +00:00
Dag-Erling Smørgrav	8994a245e0	Clean up whitespace, s/register //, refrain from strong urge to ANSIfy.	2003-03-02 15:56:49 +00:00
Dag-Erling Smørgrav	c952458814	uiomove-related caddr_t -> void * (just the low-hanging fruit)	2003-03-02 15:50:23 +00:00
Olivier Houchard	d6bf23783f	Remove duplicate includes. Submitted by: Cyril Nguyen-Huu <cyril@ci0.org>	2003-02-20 03:26:11 +00:00
Warner Losh	a163d034fa	Back out M_* changes, per decision of the TRB. Approved by: trb	2003-02-19 05:47:46 +00:00
Alfred Perlstein	44956c9863	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.	2003-01-21 08:56:16 +00:00
Thomas Moestl	6f7cab9301	Disallow listen() on sockets which are in the SS_ISCONNECTED or SS_ISCONNECTING state, returning EINVAL (which is what POSIX mandates in this case). listen() on connected or connecting sockets would cause them to enter a bad state; in the TCP case, this could cause sockets to go catatonic or panics, depending on how the socket was connected. Reviewed by: -net MFC after: 2 weeks	2003-01-17 19:20:00 +00:00
Matthew Dillon	48e3128b34	Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.	2003-01-13 00:33:17 +00:00
Matthew Dillon	cd72f2180b	Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it. Change struct xfile xf_data to xun_data (ABI is still compatible). If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.	2003-01-12 01:37:13 +00:00
Alfred Perlstein	a09de2f7cd	In sodealloc(), if there is an accept filter present on the socket then call do_setopt_accept_filter(so, NULL) which will free the filter instead of duplicating the code in do_setopt_accept_filter(). Pointed out by: Hiten Pandya <hiten@angelica.unixdaemons.com>	2003-01-05 11:14:04 +00:00
Poul-Henning Kamp	6ce9c72c30	s/sokqfilter/soo_kqfilter/ for consistency with the naming of all other socket/file operations.	2002-12-23 21:37:28 +00:00
Maxim Konovalov	8819f45b51	Small SO_RCVTIMEO and SO_SNDTIMEO values are mistakenly taken to be zero. PR: kern/32827 Submitted by: Hartmut Brandt <brandt@fokus.gmd.de> Approved by: re (jhb) MFC after: 2 weeks	2002-11-27 13:34:04 +00:00
Alfred Perlstein	29f194457c	Fix instances of macros with improperly parenthasized arguments. Verified by: md5	2002-11-09 12:55:07 +00:00
Kelly Yancey	247a32f22a	Fix filt_soread() to properly flag a kevent when a 0-byte datagram is received. Verified by: dougb, Manfred Antar <null@pozo.com> Sponsored by: NTT Multimedia Communications Labs	2002-11-05 18:48:46 +00:00
Alan Cox	5ee0a409fc	Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing a panic because the socket's state isn't as expected by sofree(). Discussed with: dillon, fenner	2002-11-02 05:14:31 +00:00
Kelly Yancey	e0f640e82d	Track the number of non-data chararacters stored in socket buffers so that the data value returned by kevent()'s EVFILT_READ filter on non-TCP sockets accurately reflects the amount of data that can be read from the sockets by applications. PR: 30634 Reviewed by: -net, -arch Sponsored by: NTT Multimedia Communications Labs MFC after: 2 weeks	2002-11-01 21:27:59 +00:00
Robert Watson	6151efaa54	Trim extraneous #else and #endif MAC comments per style(9).	2002-10-28 21:17:53 +00:00
Robert Watson	83985c267e	Modify label allocation semantics for sockets: pass in soalloc's malloc flags so that we can call malloc with M_NOWAIT if necessary, avoiding potential sleeps while holding mutexes in the TCP syncache code. Similar to the existing support for mbuf label allocation: if we can't allocate all the necessary label store in each policy, we back out the label allocation and fail the socket creation. Sync from MAC tree. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2002-10-05 21:23:47 +00:00
Robert Watson	ea6027a8e1	Make similar changes to fo_stat() and fo_poll() as made earlier to fo_read() and fo_write(): explicitly use the cred argument to fo_poll() as "active_cred" using the passed file descriptor's f_cred reference to provide access to the file credential. Add an active_cred argument to fo_stat() so that implementers have access to the active credential as well as the file credential. Generally modify callers of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which was redundantly provided via the fp argument. This set of modifications also permits threads to perform these operations on behalf of another thread without modifying their credential. Trickle this change down into fo_stat/poll() implementations: - badfo_poll(), badfo_stat(): modify/add arguments. - kqueue_poll(), kqueue_stat(): modify arguments. - pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to MAC checks rather than td->td_ucred. - soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather than cred to pru_sopoll() to maintain current semantics. - sopoll(): moidfy arguments. - vn_poll(), vn_statfile(): modify/add arguments, pass new arguments to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL() to maintian current semantics. - vn_close(): rename cred to file_cred to reflect reality while I'm here. - vn_stat(): Add active_cred and file_cred arguments to vn_stat() and consumers so that this distinction is maintained at the VFS as well as 'struct file' layer. Pass active_cred instead of td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics. - fifofs: modify the creation of a "filetemp" so that the file credential is properly initialized and can be used in the socket code if desired. Pass ap->a_td->td_ucred as the active credential to soo_poll(). If we teach the vnop interface about the distinction between file and active credentials, we would use the active credential here. Note that current inconsistent passing of active_cred vs. file_cred to VOP's is maintained. It's not clear why GETATTR would be authorized using active_cred while POLL would be authorized using file_cred at the file system level. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-08-16 12:52:03 +00:00
Robert Watson	5c5384fe80	Use the credential authorizing the socket creation operation to perform the jail check and the MAC socket labeling in socreate(). This handles socket creation using a cached credential better (such as in the NFS client code when rebuilding a socket following a disconnect: the new socket should be created using the nfsmount cached cred, not the cred of the thread causing the socket to be rebuilt). Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-08-12 16:49:03 +00:00
Robert Watson	f9d0d52459	Include file cleanup; mac.h and malloc.h at one point had ordering relationship requirements, and no longer do. Reminded by: bde	2002-08-01 17:47:56 +00:00
Robert Watson	b827919594	Introduce support for Mandatory Access Control and extensible kernel access control. Implement two IOCTLs at the socket level to retrieve the primary and peer labels from a socket. Note that this user process interface will be changing to improve multi-policy support. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-08-01 03:45:40 +00:00
Robert Watson	335654d73e	Introduce support for Mandatory Access Control and extensible kernel access control. Invoke the necessary MAC entry points to maintain labels on sockets. In particular, invoke entry points during socket allocation and destruction, as well as creation by a process or during an accept-scenario (sonewconn). For UNIX domain sockets, also assign a peer label. As the socket code isn't locked down yet, locking interactions are not yet clear. Various protocol stack socket operations (such as peer label assignment for IPv4) will follow. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-07-31 03:03:22 +00:00
Mike Barcroft	5f0de71223	Catch up to rev 1.87 of sys/sys/socketvar.h (sb_cc changed from u_long to u_int). Noticed by: sparc64 tinderbox	2002-07-24 14:21:41 +00:00
Alfred Perlstein	802082390b	More caddr_t removal. Change struct knote's kn_hook from caddr_t to void *.	2002-06-29 00:29:12 +00:00

... 2 3 4 5 6 ...

474 Commits