freebsd-nq

Author	SHA1	Message	Date
Roy Marples	7045b1603b	socket: Implement SO_RERROR. SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically, receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652	2021-07-28 09:35:09 -07:00
Alexander V. Chernikov	924d1c9a05	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." The wrong version of the change was pushed inadvertently. This reverts commit `4a01b854ca`.	2021-02-08 22:32:32 +00:00
Alexander V. Chernikov	4a01b854ca	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically, receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.	2021-02-08 21:42:20 +00:00
Gordon Bergling	3d265fce43	Fix a few mandoc issues - skipping paragraph macro: Pp after Sh - sections out of conventional order: Sh EXAMPLES - whitespace at end of input line - normalizing date format	2020-10-09 19:12:44 +00:00
John Baldwin	ae84ff9c47	Document SO_NO_OFFLOADS and SO_NO_DDP. Reviewed by: bcr, np MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25043	2020-06-03 18:59:31 +00:00
Alan Somers	8d910a4282	getsockopt.2: clarify that SO_TIMESTAMP is not 100% reliable. When SO_TIMESTAMP is set, the kernel will attempt to attach a timestamp as ancillary data to each IP datagram that is received on the socket. However, it may fail, for example due to insufficient memory. In that case the packet will still be received but no timestamp will be attached. Reviewed by: kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D21607	2019-09-11 19:48:32 +00:00
Sergey Kandaurov	78c8b9477c	Document the ENOBUFS errno in setsockopt(2). In particular, it is the case if SO_SNDBUF/SO_RCVBUF would exceed sb_max_adj. PR: 200649 MFC after: 1 week	2019-02-09 21:33:32 +00:00
Michael Tuexen	6b01d4d433	Add SOL_SOCKET level socket option with name SO_DOMAIN to get the domain of a socket. This is helpful when testing and Solaris and Linux have the same socket option using the same name. Reviewed by: bcr@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16791	2018-08-21 14:04:30 +00:00
Sean Bruno	1a43cff92a	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple programs or threads to bind to the same port; incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and cannot be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from a 16-bit to a 32-bit value to allow for more options. Add a hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-06-06 15:45:57 +00:00
Lawrence Stewart	9891578a40	Plug a memory leak and potential NULL-pointer dereference introduced in r331214. Each TCP connection that uses the system default cc_newreno(4) congestion control algorithm module leaks a "struct newreno" (8 bytes of memory) at connection initialisation time. The NULL-pointer dereference is only germane when using the ABE feature, which is disabled by default. While at it: - Defer the allocation of memory until it is actually needed given that ABE is optional and disabled by default. - Document the ENOMEM errno in getsockopt(2)/setsockopt(2). - Document ENOMEM and ENOBUFS in tcp(4) as being synonymous given that they are used interchangeably throughout the code. - Fix a few other nits also accidentally omitted from the original patch. Reported by: Harsh Jain on freebsd-net@ Tested by: tjh@ Differential Revision: https://reviews.freebsd.org/D15358	2018-05-17 02:46:27 +00:00
Michael Tuexen	703e1e3d0f	Fix minor formatting issue.	2017-08-13 15:15:40 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4. Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Maxim Sobolev	dd1badb4a3	Improve wording around SO_TS_CLOCK documentation. Submitted by: wblock Differential Revision: https://reviews.freebsd.org/D9171	2017-01-20 18:37:14 +00:00
Hans Petter Selasky	f3e7afe2d7	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API supports rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. 
The network adapter frees the mbuf and returns EAGAIN which causes ip_output() to release and clear the send tag. Upon the next ip_output(), allocation of a new "snd_tag" will be attempted. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months	2017-01-18 13:31:17 +00:00
Maxim Sobolev	339efd75a4	Add a new socket option SO_TS_CLOCK to pick from several different clock sources to return timestamps when SO_TIMESTAMP is enabled. Two additional clock sources are: o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME); o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC). In addition to this, this option provides a unified interface to get bintime (equivalent of using SO_BINTIME), except that it is also supported with IPv6 where SO_BINTIME has never been supported. The long term plan is to deprecate SO_BINTIME and move everything to using SO_TS_CLOCK. The idea for this enhancement was briefly discussed at the Net session during the dev summit in Ottawa last June and the general input was positive. This change is believed to benefit network benchmarks/profiling as well as other scenarios where precise time of arrival measurement is necessary. There are two regression test cases as part of this commit: one extends unix domain test code (unix_cmsg) to test new SCM_XXX types and another one implements a totally new test case which exchanges UDP packets between two processes using both conventional methods (i.e. calling clock_gettime(2) before recv(2) and after send(2)), as well as using setsockopt()+recv() in the receive path. The resulting delays are checked for sanity for all supported clock types. Reviewed by: adrian, gnn Differential Revision: https://reviews.freebsd.org/D9171	2017-01-16 17:46:38 +00:00
George V. Neville-Neil	599c412493	Correct the returned message lengths for timeval and bintime control messages (SO_BINTIME, SO_TIMEVAL). Obtained from: phk	2013-04-05 18:09:43 +00:00
Konstantin Belousov	1d2ea43149	Document SO_PROTOCOL socket option. Discussed with: bz Reviewed by: glebius MFC after: 2 weeks	2012-02-26 13:57:24 +00:00
Sergey Kandaurov	0a1c3432f6	Add history for setsockopt(2). PR: docs/162719 Submitted by: Niclas Zeising <niclas at zeising gmail> MFC after: 1 week	2011-11-21 14:36:19 +00:00
Luigi Rizzo	5c9d0a9ad3	This commit implements the SO_USER_COOKIE socket option, which lets you tag a socket with a uint32_t value. The cookie can then be used by the kernel for various purposes, e.g. setting the skipto rule or pipe number in ipfw (this is the reason SO_USER_COOKIE has been implemented; however there is nothing ipfw-specific in its implementation). The ipfw-related code that uses the option will be committed separately. This change adds a field to 'struct socket', but the struct is not part of any driver or userland-visible ABI so the change should be harmless. See the discussion at http://lists.freebsd.org/pipermail/freebsd-ipfw/2009-October/004001.html Idea and code from Paul Joe, small modifications and manpage changes by myself. Submitted by: Paul Joe MFC after: 1 week	2010-11-12 13:02:26 +00:00
Edward Tomasz Napierala	bc8036862b	Make it clear where to look for protocol-specific socket options. Reviewed by: rwatson Approved by: re (kib)	2009-06-30 20:53:56 +00:00
Wojciech A. Koszek	98fbfcd632	Bring missing getsockopt(2) options: SO_LABEL SO_PEERLABEL SO_LISTENQLIMIT SO_LISTENQLEN SO_LISTENINCQLEN to the manual page. Till now those were only present in sys/socket.h file. Reviewed by: rwatson, gnn, keramida (with mdoc hat)	2008-06-12 22:58:35 +00:00
Julian Elischer	65cb6b6834	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x). Currently the only protocol that can make use of the multiple tables is IPv4. Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x), but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as an EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in the timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). 
The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extend that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X], where for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). 
These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. If nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility called setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. 
A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being responded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect with the process that set up the tunnel. Thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. Messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, it can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly. ipfw has grown 2 new keywords: setfib N ip from any to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. 
I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. It will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. This will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. Instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives the FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibility of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the destination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. 
This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco	2008-05-09 23:00:21 +00:00
Bruce M Simpson	fd46d76ecf	Wordsmithery. Pointed out by: ru	2007-03-09 19:43:42 +00:00
Bruce M Simpson	7b7b32179e	Document SO_ACCEPTCONN. Submitted by: Vlad GALU (with changes) MFC after: 3 days	2007-03-08 12:57:12 +00:00
Maxim Konovalov	eb15e82311	o Document SO_TIMESTAMP and SO_BINSTAMP socket options. PR: docs/107696 Submitted by: Rob Robertson Reviewed by: ru Obtained from: NetBSD (mostly) MFC after: 1 week	2007-01-11 18:45:41 +00:00
Warner Losh	c879ae3536	Per Regents of the University of California letter, remove advertising clause. # If I've done so improperly on a file, please let me know.	2007-01-09 00:28:16 +00:00
Ruslan Ermilov	a73a3ab56b	Markup fixes.	2006-09-17 21:27:35 +00:00
Maxim Konovalov	6ad8b89261	o Document SO_NOSIGPIPE, touch .Dd. PR: docs/78479 Submitted by: Mikko Tyolajarvi MFC after: 2 weeks	2006-04-15 13:37:35 +00:00
Ruslan Ermilov	24a0682c64	Sort sections.	2005-01-20 09:17:07 +00:00
Ruslan Ermilov	1a0a934547	Mechanically kill hard sentence breaks.	2004-07-02 23:52:20 +00:00
Alfred Perlstein	7e2a61e17d	Add restrict qualifiers. (docs) PR: 44394 Submitted by: Craig Rodrigues <rodrige@attbi.com>	2003-12-24 18:52:41 +00:00
Ruslan Ermilov	743d5d518c	mdoc(7): Properly mark C headers.	2003-09-10 19:24:35 +00:00
Ruslan Ermilov	2efeeba554	mdoc(7) police: "The .Fa argument.".	2002-12-19 09:40:28 +00:00
Ruslan Ermilov	2faeeff4c9	mdoc(7) police: Tidy up the syscall language. Stop calling system calls "function calls". Use "The .Fn system call" a-la "The .Nm utility". When referring to a non-BSD implementation in the HISTORY section, call syscall a function, to be safe.	2002-12-18 09:22:32 +00:00
Yaroslav Tykhiy	f381837242	Minor grammar and punctuation fixes in the SO_ACCEPTFILTER description.	2002-01-04 18:17:07 +00:00
Yaroslav Tykhiy	996d4dc275	State clearly that one should call listen(2) on a socket at first and try to set an accept_filter(9) on it only after that. Also document errno value that will be set if installing the filter on a non-listening socket.	2002-01-04 18:12:38 +00:00
Ruslan Ermilov	db8caf03e5	Remove the internal implementation details of wrapping syscalls, which do not match the reality anyway. Approved by: deischen, bde	2001-10-26 17:38:20 +00:00
Ruslan Ermilov	32eef9aeb1	mdoc(7) police: Use the new .In macro for #include statements.	2001-10-01 16:09:29 +00:00
Ruslan Ermilov	d6002fef6f	Use ``.Rv -std'' wherever possible. Submitted by: yar	2001-08-31 09:57:38 +00:00
Dima Dorfman	7ebcc426ef	Remove whitespace at EOL.	2001-07-15 07:53:42 +00:00
Ruslan Ermilov	a307d59838	mdoc(7) police: removed HISTORY info from the .Os call.	2001-07-10 13:41:46 +00:00
Dima Dorfman	70d51341bf	mdoc(7) police: remove extraneous .Pp before and/or after .Sh.	2001-07-09 09:54:33 +00:00
Ruslan Ermilov	d0353b836e	mdoc(7) police: split punctuation characters + misc fixes.	2001-02-01 16:38:02 +00:00
Alfred Perlstein	372e9eb0af	use .Pp instead of faking it with an extra newline Pointed out by: sheldonh	2000-07-20 11:05:52 +00:00
Alfred Perlstein	f47d88b0b7	document get/set sockopt usage with accept_filter(9)	2000-07-20 10:33:08 +00:00
Chris Costello	bb33e42207	Replace .Va, .Ar and .Nm with .Fa or .Va where necessary, examples: ``.Ar errno'' -> ``.Va errno'' ``.Nm ops'' -> ``.Fa ops'' ``.Va fd'' -> ``.Fa fd''	2000-06-23 05:05:44 +00:00
Alexey Zelkin	4f79a4117a	Use `Er' variable to define first column width in ERRORS section. It was initially suggested by mdoc(7) style, but was broken over the years	2000-05-04 13:09:25 +00:00
Alexey Zelkin	25bb73e063	Introduce ".Lb" macro to libc manpages. More libraries manpages updates following.	2000-04-21 09:42:15 +00:00
Poul-Henning Kamp	9b962c56a4	General clean-up of socket.h and associated sources to synchronise up with NetBSD and the Single Unix Specification v2. This updates some structures with other, almost equivalent types and effort is under way to get the whole more consistent. Also removes a double definition of INET6 and some other clean-ups. Reviewed by: green, bde, phk Some part obtained from: NetBSD, SUSv2 specification	1999-11-24 20:49:04 +00:00
Peter Wemm	7f3dea244c	$Id$ -> $FreeBSD$	1999-08-28 00:22:10 +00:00

60 Commits