freebsd-skq

Author	SHA1	Message	Date
Pedro F. Giffuni	b85f65af68	kern: for pointers replace 0 with NULL. These are mostly cosmetical, no functional change. Found with devel/coccinelle.	2016-04-15 16:10:11 +00:00
Navdeep Parhar	fddd4f6273	Plug leak in m_unshare. m_unshare passes on the source mbuf's flags as-is to m_getcl and this results in a leak if the flags include M_NOFREE. The fix is to clear the bits not listed in M_COPYALL before calling m_getcl. M_RDONLY should probably be filtered out too but that's outside the scope of this fix. Add assertions in the zone_mbuf and zone_pack ctors to catch similar bugs. Update netmap_get_mbuf to not pass M_NOFREE to m_getcl. It's not clear what the original code was trying to do but it's likely incorrect. Updated code is no different functionally but it avoids the newly added assertions. Reviewed by: gnn@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5698	2016-03-26 23:39:53 +00:00
George V. Neville-Neil	dcd070d80a	Move mbuf provider under SDT to indicate that it is FreeBSD specific and not a stable interface. Reviewed by: markj MFC after: 2 weeks Sponsored by: Rubicon Communications (Netgate) Differential Revision: https://reviews.freebsd.org/D5716	2016-03-24 08:26:06 +00:00
George V. Neville-Neil	480f4e946d	Add an mbuf provider to DTrace. The mbuf provider is made up of a set of Statically Defined Tracepoints which help us look into mbufs as they are allocated and freed. This can be used to inspect the buffers or for a simplified mbuf leak detector. New tracepoints are: mbuf:::m-init mbuf:::m-gethdr mbuf:::m-get mbuf:::m-getcl mbuf:::m-clget mbuf:::m-cljget mbuf:::m-cljset mbuf:::m-free mbuf:::m-freem There is also a translator for mbufs which gives some visibility into the structure, see mbuf.d for more details. Reviewed by: bz, markj MFC after: 2 weeks Sponsored by: Rubicon Communications (Netgate) Differential Revision: https://reviews.freebsd.org/D5682	2016-03-22 13:16:52 +00:00
Gleb Smirnoff	56a5f52e80	New way to manage reference counting of mbuf external storage. The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used. The first mbuf to attach a cluster stores the refcount. The further mbufs to reference the cluster point at refcount in the first mbuf. The first mbuf is freed only when the last reference is freed. The benefit over refcounts stored in separate slabs is that now refcounts of different, unrelated mbufs do not share a cache line. For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd() becomes void, making widely used M_EXTADD macro safe. For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization exactly against the cache aliasing problem with regular refcounting. Discussed with: rrs, rwatson, gnn, hiren, sbruno, np Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D5396 Sponsored by: Netflix	2016-03-01 00:17:14 +00:00
Gleb Smirnoff	5e4bc63b7c	o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all mbuf(9) manipulation functions into uipc_mbuf.c. This looks like the initial intent, but had diffused in the last decade. o Gather all declarations in mbuf.h in one place and sort them. o Uninline m_clget() and m_cljget(). There are no functional changes in this patch. The patch comes from a larger version, where all mbuf(9) allocation was uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c. The performance impact of the total uninlining is still unclear, so we are holding on now with larger version. Together with: melifaro, olivier	2016-02-11 21:32:23 +00:00
Gleb Smirnoff	2bab0c5535	New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and up to now. The new sendfile is the code that Netflix uses to send their multiple tens of gigabits of data per second. The new implementation features asynchronous I/O, when I/O operations are launched, but not awaited to be complete. An explanation of why such behavior is beneficial compared to old one is going to be too long for a commit message, so we will skip it here. Additional features of new syscall are extra flags, which provide an application more control over data sent. The SF_NOCACHE flag tells kernel that data shouldn't be cached after it was sent. The SF_READAHEAD() macro allows to specify readahead size in pages. The new syscalls is a drop in replacement. No modifications are required to applications. One can take nginx binary for stable/10 and run it successfully on head. Although SF_NODISKIO lost its original sense, as now sendfile doesn't block, and now means something completely different (tm), using the new sendfile the old way is absolutely safe. Celebrates: Netflix global launch! Sponsored by: Nginx, Inc. Sponsored by: Netflix Relnotes: yes	2016-01-08 20:34:57 +00:00
Hiren Panchasara	86a996e6bd	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
Gleb Smirnoff	e40e8705db	Fix regression from r248371. We need to copy packet header to new mbuf. Unlike in the pre-r248371 code, assert that M_PKTHDR is set only on a first mbuf. Reported & tested by: Andriy Voskoboinyk <s3erios gmail.com> Sponsored by: Nginx, Inc.	2015-10-07 12:40:00 +00:00
Gleb Smirnoff	640082d498	Remove debugging variable from r143761.	2015-10-06 09:43:49 +00:00
Alexander V. Chernikov	0cbefd30cb	Add const-qualifiers for source mbuf argument in m_dup(), m_copym(), m_dup_pkthdr() and m_tag_copy_chain().	2015-08-08 15:50:46 +00:00
Navdeep Parhar	9523d1bfc3	Fix leak in tcp_lro_rx. Simply clearing M_PKTHDR isn't enough, any tags hanging off the header need to be freed too. Differential Revision: https://reviews.freebsd.org/D2708 Reviewed by: ae@, hiren@	2015-06-30 17:19:58 +00:00
Andrey V. Elsukov	089bb672c4	m_dup() is supposed to give a writable copy of an mbuf chain. It uses m_dup_pkthdr(), that uses M_COPYFLAGS mask to copy m_flags field. If original mbuf chain has M_RDONLY flag, its copy also will have it. Reset this flag explicitly. MFC after: 2 weeks	2015-05-07 18:35:01 +00:00
Gleb Smirnoff	ee52391ebe	Use anonymous unions and structs to organize shared space in mbuf(9), instead of preprocessor macros. This will make debugger output of 'print *m' exactly match the names we use in code, making life of a kernel hacker way more pleasant. And this also allows to rename struct_m_ext back to m_ext.	2015-02-17 20:52:51 +00:00
Gleb Smirnoff	ec9d83dd9b	Use anonymous unions to add possibility to put mbufs into queue(3) STAILQs and SLISTs using the same structure field as good old m_next and m_nextpkt linkage occupy. New code is encouraged to use queue(3) macros, instead of implementing the wheel. However, better not to have a mixture of old style and queue(3) in one file or subsystem. Reviewed by: rwatson, rrs, rpaulo Differential Revision: D1499	2015-02-17 19:32:11 +00:00
Robert Watson	3d1a9ed34e	In order to support ongoing work to implement variable-size mbufs, and more generally make it easier to extend 'struct mbuf in the future', make a number of changes to the data structure: - As we anticipate embedding mbufs headers within variable-size regions of memory in the future, change the definitions of byte arrays embedded in mbufs to be of size [0] rather than [MLEN] and [MHLEN]. In fact, the cxgbe driver already uses 'struct mbuf' on the front of other storage sizes, but we would like the global mbuf allocator do be able to do this as well. - Fold 'struct m_hdr' into 'struct mbuf' itself, eliminating a set of macros that aliased 'mh_foo' field names to 'm_foo' names such as 'm_next'. These present a particular problem as we would like to add new mbuf-header fields -- e.g., 'm_size' -- that, if similarly named via macros, would introduce collisions with many other variable names in the kernel. - Rename 'struct m_ext' to 'struct struct_m_ext' so that we can add compile-time assertions without bumping into the still-extant 'm_ext' macro. - Remove the MSIZE compile-time assertion for 'struct mbuf', but add new assertions for alignment of embedded data arrays (64-bit alignment even on 32-bit platforms), and for the sizes the mbuf header, packet header, and m_ext structure. - Document that these assertions exist in comments in mbuf.h. This change is not intended to cause (non-trivial) behavioural differences, but is a precursor to further mbuf-allocator work. Differential Revision: https://reviews.freebsd.org/D1483 Reviewed by: bz, gnn, np, glebius ("go ahead, I trust you") Sponsored by: EMC / Isilon Storage Division	2015-01-14 23:44:00 +00:00
Robert Watson	a5e9a6d56f	Garbage collect m_copymdata(), an mbuf utility routine introduced in FreeBSD 7 that has not been used since. It contains a number of unresolved bugs including an inverted bcopy() and incorrect handling of read-only mbufs using internal storage. Removing this unused code is substantially essier than fixing it in order to update it to the coming mbuf world order -- but it can always be restored from revision history if it turns out to prove useful for future work. Pointed out by: jmallett Sponsored by: EMC / Isilon Storage Division	2015-01-10 10:41:23 +00:00
Robert Watson	b66f2a48e6	Replace hand-crafted versions of M_SIZE() and M_START() in uipc_mbuf.c with calls to the centralised macros, reducing direct use of MLEN and MHLEN. Differential Revision: https://reviews.freebsd.org/D1444 Reviewed by: bz Sponsored by: EMC / Isilon Storage Division	2015-01-08 11:16:21 +00:00
Robert Watson	ed6a66ca6c	To ease changes to underlying mbuf structure and the mbuf allocator, reduce the knowledge of mbuf layout, and in particular constants such as M_EXT, MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single M_ALIGN() macro, implemented by a now-inlined m_align() function: - Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline. - Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align(). - Update consumers around the tree to simply use M_ALIGN(). This change eliminates a number of cases where mbuf consumers must be aware of whether or not mbufs returned by the allocator use external storage, but also assumptions about the size of the returned mbuf. This will make it easier to introduce changes in how we use external storage, as well as features such as variable-size mbufs. Differential Revision: https://reviews.freebsd.org/D1436 Reviewed by: glebius, trasz, gnn, bz Sponsored by: EMC / Isilon Storage Division	2015-01-05 09:58:32 +00:00
Gleb Smirnoff	651e4e6a30	Merge from projects/sendfile: extend protocols API to support sending not ready data: o Add new flag to pru_send() flags - PRUS_NOTREADY. o Add new protocol method pru_ready(). Sponsored by: Nginx, Inc. Sponsored by: Netflix	2014-11-30 13:24:21 +00:00
Gleb Smirnoff	7ee2d05890	Change a very strange code in m_demote() to simple assertion. Sponsored by: Nginx, Inc.	2014-09-04 19:27:30 +00:00
Gleb Smirnoff	1967edba02	Provide m_catpkt(), a wrapper around m_cat() that deals with M_PKTHDR mbufs. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-09-04 09:07:14 +00:00
Gleb Smirnoff	c71b4037ff	Use assignment instead of bcopy. Submitted by: jmg	2014-07-18 14:59:35 +00:00
Gleb Smirnoff	1fbe6a82f4	Improve reference counting of EXT_SFBUF pages attached to mbufs. o Do not use UMA refcount zone. The problem with this zone is that several refcounting words (16 on amd64) share the same cache line, and issueing atomic(9) updates on them creates cache line contention. Also, allocating and freeing them is extra CPU cycles. Instead, refcount the page directly via vm_page_wire() and the sfbuf via sf_buf_alloc(sf_buf_page(sf)) [1]. o Call refcounting/freeing function for EXT_SFBUF via direct function call, instead of function pointer. This removes barrier for CPU branch predictor. o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to satisfy assertion in mb_dtor_mbuf(). Remove the assertion from mb_dtor_mbuf(). Use bcopy() instead of manual assignments to copy m_ext in mb_dupcl(). [1] This has some problems for now. Using sf_buf_alloc() merely to increase refcount is expensive, and is broken on sparc64. To be fixed. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-07-11 19:40:50 +00:00
Gleb Smirnoff	fcc34a238c	Fix style bug: rename the refcount field of m_ext to ext_cnt, to match other members. Sponsored by: Nginx, Inc.	2014-07-11 14:34:29 +00:00
Gleb Smirnoff	15c28f87b8	All mbuf external free functions never fail, so let them be void. Sponsored by: Nginx, Inc.	2014-07-11 13:58:48 +00:00
Gleb Smirnoff	c46713e636	Whitespace only.	2014-05-30 08:22:58 +00:00
Gleb Smirnoff	94985f742b	Remove historical macro. Sponsored by: Nginx, Inc.	2014-01-16 13:42:50 +00:00
Gleb Smirnoff	77badb18cd	Fix a very bad typo from r248887. Submitted by: art	2013-11-14 09:45:33 +00:00
Andre Oppermann	f729ede69e	Pad m_hdr on 32bit architectures to to prevent alignment and padding problems with the way MLEN, MHLEN, and struct mbuf are set up. CTASSERT's are provided to detect such issues at compile time in the future. The #define MLEN and MHLEN calculation do not take actual compiler- induced alignment and padding inside the complete struct mbuf into account. Accordingly appropriate attention is required when changing members of struct mbuf. Ideally one would calculate MLEN as (MSIZE - sizeof(((struct mbuf *)0)->m_hdr) but that doesn't work as the compiler refuses to operate on an as of yet incomplete structure. In particular ARM 32bit has more strict alignment requirements which caused 4 bytes of padding between m_hdr and pkthdr in struct mbuf because of the 64bit members in pkthdr. This wasn't picked up by MLEN and MHLEN causing an overflow of the mbuf provided data storage by overestimating its size. I386 didn't show this problem because it handles unaligned access just fine, albeit at a small performance penalty. On 64bit architectures the struct mbuf layout is 64bit aligned in all places. Reported by: Thomas Skibo <ThomasSkibo-at-sbcglobal-dot-net> Tested by: tuexen, ian, Thomas Skibo (extended patch) Sponsored by: The FreeBSD Foundation	2013-08-27 20:52:02 +00:00
Andre Oppermann	bb25e5ab00	Give (*ext_free) an int return value allowing for very sophisticated external mbuf buffer management capabilities in the future. For now only EXT_FREE_OK is defined with current legacy behavior. Sponsored by: The FreeBSD Foundation	2013-08-25 10:57:09 +00:00
Andre Oppermann	1b4381afbb	Restructure the mbuf pkthdr to make it fit for upcoming capabilities and features. The changes in particular are: o Remove rarely used "header" pointer and replace it with a 64bit protocol/ layer specific union PH_loc for local use. Protocols can flexibly overlay their own 8 to 64 bit fields to store information while the packet is worked on. o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc instead of pkthdr.header. o Extend csum_flags to 64bits to allow for additional future offload information to be carried (e.g. iSCSI, IPsec offload, and others). o Move the RSS hash type enumerator from abusing m_flags to its own 8bit rsstype field. Adjust accessor macros. o Add cosqos field to store Class of Service / Quality of Service information with the packet. It is not yet supported in any drivers but allows us to get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with a modernized ALTQ. o Add four 8 bit fields l[2-5]hlen to store the relative header offsets from the start of the packet. This is important for various offload capabilities and to relieve the drivers from having to parse the packet and protocol headers to find out location of checksums and other information. Header parsing in drivers is a lot of copy-paste and unhandled corner cases which we want to avoid. o Add another flexible 64bit union to map various additional persistent packet information, like ether_vtag, tso_segsz and csum fields. Depending on the csum_flags settings some fields may have different usage making it very flexible and adaptable to future capabilities. o Restructure the CSUM flags to better signify their outbound (down the stack) and inbound (up the stack) use. The CSUM flags used to be a bit chaotic and rather poorly documented leading to incorrect use in many places. Bring clarity into their use through better naming. Compatibility mappings are provided to preserve the API. The drivers can be corrected one by one and MFC'd without issue. o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures). Sponsored by: The FreeBSD Foundation	2013-08-24 19:51:18 +00:00
Andre Oppermann	9a73687609	Add an mbuf pointer parameter to (*ext_free) to give the external free function access to the mbuf the external memory was attached to. Mechanically adjust all users to include the mbuf parameter. This fixes a long standing annoyance for external free functions. Before one had to sacrifice one of the argument pointers for this. Sponsored by: The FreeBSD Foundation	2013-08-24 16:57:44 +00:00
Andre Oppermann	894734cbd6	dd a 24 bits wide ext_flags field to m_ext by reducing ext_type to 8 bits. ext_type is an enumerator and the number of types we have is a mere dozen. A couple of ext_types are renumbered to fit within 8 bits. EXT_VENDOR[1-4] and EXT_EXP[1-4] types for vendor-internal and experimental local mapping. The ext_flags field is currently unused but has a couple of flags already defined for future use. Again vendor and experimental flags are provided for local mapping. EXT_FLAG_BITS is provided for the printf(9) %b identifier. Initialize and copy ext_flags in the relevant mbuf functions. Improve alignment and packing of struct m_ext on 32 and 64 archs by carefully sorting the fields.	2013-08-24 13:15:42 +00:00
Andre Oppermann	1227f20d26	Revert r254520 and resurrect the M_NOFREE mbuf flag and functionality. Requested by: np, grehan	2013-08-21 18:12:04 +00:00
Andre Oppermann	aa3cb8fb64	Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree users for a very long time, if ever. Should such a functionality ever be needed again the appropriate and much better way to do it is through a custom EXT_SOMETHING external mbuf type together with a dedicated *ext_free function. Discussed with: trociny, glebius	2013-08-19 11:16:53 +00:00
Gleb Smirnoff	59b9c4f289	Nuke mbstat. It wasn't used for mbuf statistics since FreeBSD 5. Now that r253351 moved sendfile() stats to a separate struct, the last field used in mbstat is m_mcfail, which is updated, but never read or obtained from userland.	2013-07-15 12:18:36 +00:00
Gleb Smirnoff	21f398487c	Fix bug in m_split() in a case when split len matches len of the first mbuf, and the first mbuf is M_PKTHDR. PR: kern/176144 Submitted by: Jacques Fourie <jacques.fourie gmail.com>	2013-03-29 14:10:40 +00:00
Gleb Smirnoff	4f67e14304	In m_align() add assertions that mbuf is virgin, similar to assertions in M_ALIGN(), MH_ALIGN, MEXT_ALIGN() macros.	2013-03-17 07:41:14 +00:00
Gleb Smirnoff	c95be8b536	- Replace compat macros with function calls. - Remove superfluous cleaning of m_len after allocating. Sponsored by: Nginx, Inc.	2013-03-16 08:57:36 +00:00
Gleb Smirnoff	5368b81eb0	Contrary to what the deleted comment said, the m_move_pkthdr() will not smash the M_EXT and data pointer, so it is safe to pass an mbuf with external storage procuded by m_getcl() to m_move_pkthdr(). Reviewed by: andre Sponsored by: Nginx, Inc.	2013-03-16 08:55:21 +00:00
Gleb Smirnoff	3112ae7644	Make m_get2() never use clusters that are bigger than PAGE_SIZE. Requested by: andre, jhb Sponsored by: Nginx, Inc.	2013-03-15 10:15:07 +00:00
Gleb Smirnoff	41a7572b26	Functions m_getm2() and m_get2() have different order of arguments, and that can drive someone crazy. While m_get2() is young and not documented yet, change its order of arguments to match m_getm2(). Sorry for churn, but better now than later.	2013-03-12 13:42:47 +00:00
Gleb Smirnoff	8c629bdf05	The m_extadd() can fail due to memory allocation failure, thus: - Make it return int, not void. - Add wait parameter. - Update MEXTADD() macro appropriately, defaults to M_NOWAIT, as before this change. Sponsored by: Nginx, Inc.	2013-03-12 12:12:16 +00:00
Gleb Smirnoff	29110f87a6	- Move large functions m_getjcl() and m_get2() to kern/uipc_mbuf.c - style(9) fixes to mbuf.h Reviewed by: bde	2013-01-24 09:29:41 +00:00
Gleb Smirnoff	eb1b1807af	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually	2012-12-05 08:04:20 +00:00
Kevin Lo	a2c36a0234	Since the macro dtom() has been removed, fix comments about the dtom. Reviewed by: glebius	2012-10-29 10:04:28 +00:00
Andre Oppermann	14d7c5b11c	Improve m_cat() by being able to also merge contents from M_EXT mbuf's by doing proper testing with M_WRITABLE(). In m_collapse() replace an incomplete manual check for M_RDONLY with the M_WRITABLE() macro that also tests for shared buffers and other cases that make a particular mbuf immutable. MFC after: 2 weeks	2012-10-28 18:38:51 +00:00
Konstantin Belousov	526d0bd547	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month	2012-02-21 01:05:12 +00:00
Kenneth D. Merry	7e949c467c	Xen netback driver rewrite. share/man/man4/Makefile, share/man/man4/xnb.4, sys/dev/xen/netback/netback.c, sys/dev/xen/netback/netback_unit_tests.c: Rewrote the netback driver for xen to attach properly via newbus and work properly in both HVM and PVM mode (only HVM is tested). Works with the in-tree FreeBSD netfront driver or the Windows netfront driver from SuSE. Has not been extensively tested with a Linux netfront driver. Does not implement LRO, TSO, or polling. Includes unit tests that may be run through sysctl after compiling with XNB_DEBUG defined. sys/dev/xen/blkback/blkback.c, sys/xen/interface/io/netif.h: Comment elaboration. sys/kern/uipc_mbuf.c: Fix page fault in kernel mode when calling m_print() on a null mbuf. Since m_print() is only used for debugging, there are no performance concerns for extra error checking code. sys/kern/subr_scanf.c: Add the "hh" and "ll" width specifiers from C99 to scanf(). A few callers were already using "ll" even though scanf() was handling it as "l". Submitted by: Alan Somers <alans@spectralogic.com> Submitted by: John Suykerbuyk <johns@spectralogic.com> Sponsored by: Spectra Logic MFC after: 1 week Reviewed by: ken	2012-01-26 16:35:09 +00:00

1 2 3 4 5

239 Commits