freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	e243367b64	mbuf: add a way to mark flowid as calculated from the internal headers In some settings offload might calculate hash from decapsulated packet. Reserve a bit in packet header rsstype to indicate that. Add m_adj_decap() that acts similarly to m_adj, but also either clear flowid if it is not marked as inner, or transfer it to the decapsulated header, clearing inner indicator. It depends on the internals of m_adj() that reuses the argument packet header for the result. Use m_adj_decap() for decapsulating vxlan(4) and gif(4) input packets. Reviewed by: ae, hselasky, np Sponsored by: Nvidia Networking / Mellanox Technologies MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28773	2021-03-31 14:38:26 +03:00
Mark Johnston	608c44f96e	m_uiotombuf_nomap(): Stop clearing PG_ZERO in newly allocated pages The caller should not be passing M_ZERO in the first place, so PG_ZERO will not be preserved by the page allocator and clearing it accomplishes nothing. Reviewed by: gallatin, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28808	2021-02-22 10:04:46 -05:00
John Baldwin	c2a8fd6f05	Permit sending empty fragments for TLS 1.0. Due to a weakness in the TLS 1.0 protocol, OpenSSL will periodically send empty TLS records ("empty fragments"). These TLS records have no payload (and thus a page count of zero). m_uiotombuf_nomap() was returning NULL instead of an empty mbuf, and a few places needed to be updated to treat an empty TLS record as having a page count of "1" as 0 means "no work to do" (e.g. nothing to encrypt, or nothing to mark ready via sbready()). Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26729	2020-10-13 17:30:34 +00:00
Mateusz Guzik	6fed89b179	kern: clean up empty lines in .c and .h files	2020-09-01 22:12:32 +00:00
Andrey V. Elsukov	edde7a538b	Add m__getjcl SDT probe. Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2020-08-05 11:39:09 +00:00
Brandon Bergren	f57d153e2e	[PowerPC] Fix powerpcspe build failure after r360569 On powerpcspe, vm_paddr_t is 64 bit despite it being a 32 bit platform. Adjust compile time assertion to compensate.	2020-05-07 17:58:07 +00:00
Gleb Smirnoff	61664ee700	Step 4.2: start divorce of M_EXT and M_EXTPG They have more differencies than similarities. For now there is lots of code that would check for M_EXT only and work correctly on M_EXTPG buffers, so still carry M_EXT bit together with M_EXTPG. However, prepare some code for explicit check for M_EXTPG. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:37:16 +00:00
Gleb Smirnoff	365e8da44a	Mechanically rename MBUF_EXT_PGS_ASSERT() to M_ASSERTEXTPG() to match classical M_ASSERTPKTHDR. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:27:41 +00:00
Gleb Smirnoff	6edfd179c8	Step 4.1: mechanically rename M_NOMAP to M_EXTPG Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:21:11 +00:00
Gleb Smirnoff	7b6c99d08d	Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf within m_epg namespace. All edits except the 'struct mbuf' declaration and mb_dupcl() were done mechanically with sed: s/->m_ext_pgs.nrdy/->m_epg_nrdy/g s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g s/->m_ext_pgs.trail_len/->m_epg_trllen/g s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g s/->m_ext_pgs.flags/->m_epg_flags/g s/->m_ext_pgs.record_type/->m_epg_record_type/g s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g s/->m_ext_pgs.tls/->m_epg_tls/g s/->m_ext_pgs.so/->m_epg_so/g s/->m_ext_pgs.seqno/->m_epg_seqno/g s/->m_ext_pgs.stailq/->m_epg_stailq/g Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:12:56 +00:00
Gleb Smirnoff	bccf6e26e9	Step 2.5: Stop using 'struct mbuf_ext_pgs' in the kernel itself. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:08:05 +00:00
Gleb Smirnoff	c4ee38f8e8	Step 2.3: Rename mbuf_ext_pg_len() to m_epg_pagelen() that uses mbuf argument. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-02 23:52:35 +00:00
Gleb Smirnoff	7433a5a966	Start moving into EPG_/epg_ namespace. There is only one flag, but next commit brings in second flag, so let them already be in the future namespace. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-02 22:49:14 +00:00
Gleb Smirnoff	0c1032665c	Continuation of multi page mbuf redesign from r359919. The following series of patches addresses three things: Now that array of pages is embedded into mbuf, we no longer need separate structure to pass around, so struct mbuf_ext_pgs is an artifact of the first implementation. And struct mbuf_ext_pgs_data is a crutch to accomodate the main idea r359919 with minimal churn. Also, M_EXT of type EXT_PGS are just a synonym of M_NOMAP. The namespace for the newfeature is somewhat inconsistent and sometimes has a lengthy prefixes. In these patches we will gradually bring the namespace to "m_epg" prefix for all mbuf fields and most functions. Step 1 of 4: o Anonymize mbuf_ext_pgs_data, embed in m_ext o Embed mbuf_ext_pgs o Start documenting all this entanglement Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-02 22:39:26 +00:00
Andrew Gallatin	23feb56348	KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself. While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to carry more than a single page for sendfile, they are rather cache unfriendly when accessing the ext_pgs metadata and data. This is because the ext_pgs part of the mbuf is allocated separately, and almost guaranteed to be cold in cache. This change takes advantage of the fact that unmapped mbufs are never used at the same time as pkthdr mbufs. Given this fact, we can overlap the ext_pgs metadata with the mbuf pkthdr, and carry the ext_pgs meta directly in the mbuf itself. Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array of pages) directly after the existing m_ext. In order to be able to carry 5 pages (which is the minimum required for a 16K TLS record which is not perfectly aligned) on LP64, I've had to steal ext_arg2. The only user of this in the xmit path is sendfile, and I've adjusted it to use arg1 when using unmapped mbufs. This change is almost entirely mechanical, except that we change mb_alloc_ext_pgs() to no longer allow allocating pkthdrs, the change to avoid ext_arg2 as mentioned above, and the removal of the ext_pgs zone, This change saves roughly 2% "raw" CPU (~59% -> 57%), or over 3% "scaled" CPU on a Netflix 100% software kTLS workload at 90+ Gb/s on Broadwell Xeons. In a follow-on commit, I plan to remove some hacks to avoid access ext_pgs fields of mbufs, since they will now be in cache. Many thanks to glebius for helping to make this better in the Netflix tree. Reviewed by: hselasky, jhb, rrs, glebius (early version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24213	2020-04-14 14:46:06 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Mateusz Guzik	3ff65f71cb	Remove duplicated empty lines from kern/*.c No functional changes.	2020-01-30 20:05:05 +00:00
Andrew Gallatin	b2dba6634b	kTLS: Fix a bug where we would not encrypt anon data inplace. Software Kernel TLS needs to allocate a new destination crypto buffer when encrypting data from the page cache, so as to avoid overwriting shared clear-text file data with encrypted data specific to a single socket. When the data is anonymous, eg, not tied to a file, then we can encrypt in place and avoid allocating a new page. This fixes a bug where the existing code always assumes the data is private, and never encrypts in place. This results in unneeded page allocations and potentially more memory bandwidth consumption when doing socket writes. When the code was written at Netflix, ktls_encrypt() looked at private sendfile flags to determine if the pages being encrypted where part of the page cache (coming from sendfile) or anonymous (coming from sosend). This was broken internally at Netflix when the sendfile flags were made private, and the M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will always be false for M_NOMAP mbufs, since one cannot just mtod() them. This change introduces a new flags field to the mbuf_ext_pgs struct by stealing a byte from the tls hdr. Note that the current header is still 2 bytes larger than the largest header we support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON when creating an unmapped mbuf in m_uiotombuf_nomap() (which is the path that socket writes take), and we check for that flag in ktls_encrypt() when looking for anon pages. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21796	2019-09-27 20:08:19 +00:00
Mark Johnston	fee2a2fa39	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Mark Johnston	9fb7c918ef	Remove manual wire_count adjustments from the unmapped mbuf code. The original code came from a desire to minimize the number of updates to v_wire_count, which prior to r329187 was updated using atomics. However, there is no significant benefit to batching today, so simply allocate pages using VM_ALLOC_WIRED and rely on system accounting. Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D21323	2019-08-21 20:01:52 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Andrew Gallatin	50575ce11c	Track TCP connection's NUMA domain in the inpcb Drivers can now pass up numa domain information via the mbuf numa domain field. This information is then used by TCP syncache_socket() to associate that information with the inpcb. The domain information is then fed back into transmitted mbufs in ip{6}_output(). This mechanism is nearly identical to what is done to track RSS hash values in the inp_flowid. Follow on changes will use this information for lacp egress port selection, binding TCP pacers to the appropriate NUMA domain, etc. Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20028	2019-04-25 15:37:28 +00:00
Fabien Thomas	f8e73c47d8	Add a SPD cache to speed up lookups. When large SPDs are used, we face two problems: - too many CPU cycles are spent during the linear searches in the SPD for each packet - too much contention on multi socket systems, since we use a single shared lock. Main changes: - added the sysctl tree 'net.key.spdcache' to control the SPD cache (disabled by default). - cache the sp indexes that are used to perform SP lookups. - use a range of dedicated mutexes to protect the cache lines. Submitted by: Emeric Poupon <emeric.poupon@stormshield.eu> Reviewed by: ae Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D15050	2018-05-22 15:54:25 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Andriy Voskoboinyk	6623429867	mbuf(9): unbreak m_fragment() - Fix it by replacing m_cat() with m_prev->m_next = m_new (m_cat() will try to append data - as a result, there will be no fragmentation). - Move some constants out of the loop. Was previously tested with D4077. Differential Revision: https://reviews.freebsd.org/D4090	2017-10-16 21:46:11 +00:00
Gleb Smirnoff	07e87a1d55	In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively, in mb_free_ext() always use fields from the original refcount holding mbuf (see. r296242) mbuf. Cuts another cache miss from mb_free_ext(). However, treat EXT_EXTREF mbufs differently, since they are different - they don't have a refcount holding mbuf. Provide longer comments in m_ext declaration to explain this change and change from r296242. In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 20:51:58 +00:00
Conrad Meyer	f5b7359a00	Fix one more place uio_resid is truncated to int A follow-up to r231949 and r194990. Reported by: pho@ Reviewed by: kib@, markj@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11373	2017-06-27 17:23:20 +00:00
Gleb Smirnoff	14984031b7	m_mbuftouio() doesn't modify the mbuf.	2017-03-07 19:00:50 +00:00
Mark Johnston	d53d6fa9a8	Suppress a warning about m_assertbuf being unused. MFC after: 1 week	2017-01-15 03:53:20 +00:00
Bryan Drewery	28323add09	Fix improper use of "its". Sponsored by: Dell EMC Isilon	2016-11-08 23:59:41 +00:00
Ed Maste	69a2875821	Renumber license clauses in sys/kern to avoid skipping #3	2016-09-15 13:16:20 +00:00
Kevin Lo	cee4a05669	In m_devget(), if the data fits in a packet header mbuf, check the amount of data is less than or equal to MHLEN instead of MLEN when placing initial small packet header at end of mbuf. Reviewed by: glebius MFC after: 3 days	2016-09-08 01:02:53 +00:00
Pedro F. Giffuni	b85f65af68	kern: for pointers replace 0 with NULL. These are mostly cosmetical, no functional change. Found with devel/coccinelle.	2016-04-15 16:10:11 +00:00
Navdeep Parhar	fddd4f6273	Plug leak in m_unshare. m_unshare passes on the source mbuf's flags as-is to m_getcl and this results in a leak if the flags include M_NOFREE. The fix is to clear the bits not listed in M_COPYALL before calling m_getcl. M_RDONLY should probably be filtered out too but that's outside the scope of this fix. Add assertions in the zone_mbuf and zone_pack ctors to catch similar bugs. Update netmap_get_mbuf to not pass M_NOFREE to m_getcl. It's not clear what the original code was trying to do but it's likely incorrect. Updated code is no different functionally but it avoids the newly added assertions. Reviewed by: gnn@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5698	2016-03-26 23:39:53 +00:00
George V. Neville-Neil	dcd070d80a	Move mbuf provider under SDT to indicate that it is FreeBSD specific and not a stable interface. Reviewed by: markj MFC after: 2 weeks Sponsored by: Rubicon Communications (Netgate) Differential Revision: https://reviews.freebsd.org/D5716	2016-03-24 08:26:06 +00:00
George V. Neville-Neil	480f4e946d	Add an mbuf provider to DTrace. The mbuf provider is made up of a set of Statically Defined Tracepoints which help us look into mbufs as they are allocated and freed. This can be used to inspect the buffers or for a simplified mbuf leak detector. New tracepoints are: mbuf:::m-init mbuf:::m-gethdr mbuf:::m-get mbuf:::m-getcl mbuf:::m-clget mbuf:::m-cljget mbuf:::m-cljset mbuf:::m-free mbuf:::m-freem There is also a translator for mbufs which gives some visibility into the structure, see mbuf.d for more details. Reviewed by: bz, markj MFC after: 2 weeks Sponsored by: Rubicon Communications (Netgate) Differential Revision: https://reviews.freebsd.org/D5682	2016-03-22 13:16:52 +00:00
Gleb Smirnoff	56a5f52e80	New way to manage reference counting of mbuf external storage. The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used. The first mbuf to attach a cluster stores the refcount. The further mbufs to reference the cluster point at refcount in the first mbuf. The first mbuf is freed only when the last reference is freed. The benefit over refcounts stored in separate slabs is that now refcounts of different, unrelated mbufs do not share a cache line. For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd() becomes void, making widely used M_EXTADD macro safe. For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization exactly against the cache aliasing problem with regular refcounting. Discussed with: rrs, rwatson, gnn, hiren, sbruno, np Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D5396 Sponsored by: Netflix	2016-03-01 00:17:14 +00:00
Gleb Smirnoff	5e4bc63b7c	o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all mbuf(9) manipulation functions into uipc_mbuf.c. This looks like the initial intent, but had diffused in the last decade. o Gather all declarations in mbuf.h in one place and sort them. o Uninline m_clget() and m_cljget(). There are no functional changes in this patch. The patch comes from a larger version, where all mbuf(9) allocation was uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c. The performance impact of the total uninlining is still unclear, so we are holding on now with larger version. Together with: melifaro, olivier	2016-02-11 21:32:23 +00:00
Gleb Smirnoff	2bab0c5535	New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and up to now. The new sendfile is the code that Netflix uses to send their multiple tens of gigabits of data per second. The new implementation features asynchronous I/O, when I/O operations are launched, but not awaited to be complete. An explanation of why such behavior is beneficial compared to old one is going to be too long for a commit message, so we will skip it here. Additional features of new syscall are extra flags, which provide an application more control over data sent. The SF_NOCACHE flag tells kernel that data shouldn't be cached after it was sent. The SF_READAHEAD() macro allows to specify readahead size in pages. The new syscalls is a drop in replacement. No modifications are required to applications. One can take nginx binary for stable/10 and run it successfully on head. Although SF_NODISKIO lost its original sense, as now sendfile doesn't block, and now means something completely different (tm), using the new sendfile the old way is absolutely safe. Celebrates: Netflix global launch! Sponsored by: Nginx, Inc. Sponsored by: Netflix Relnotes: yes	2016-01-08 20:34:57 +00:00
Hiren Panchasara	86a996e6bd	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
Gleb Smirnoff	e40e8705db	Fix regression from r248371. We need to copy packet header to new mbuf. Unlike in the pre-r248371 code, assert that M_PKTHDR is set only on a first mbuf. Reported & tested by: Andriy Voskoboinyk <s3erios gmail.com> Sponsored by: Nginx, Inc.	2015-10-07 12:40:00 +00:00
Gleb Smirnoff	640082d498	Remove debugging variable from r143761.	2015-10-06 09:43:49 +00:00
Alexander V. Chernikov	0cbefd30cb	Add const-qualifiers for source mbuf argument in m_dup(), m_copym(), m_dup_pkthdr() and m_tag_copy_chain().	2015-08-08 15:50:46 +00:00
Navdeep Parhar	9523d1bfc3	Fix leak in tcp_lro_rx. Simply clearing M_PKTHDR isn't enough, any tags hanging off the header need to be freed too. Differential Revision: https://reviews.freebsd.org/D2708 Reviewed by: ae@, hiren@	2015-06-30 17:19:58 +00:00
Andrey V. Elsukov	089bb672c4	m_dup() is supposed to give a writable copy of an mbuf chain. It uses m_dup_pkthdr(), that uses M_COPYFLAGS mask to copy m_flags field. If original mbuf chain has M_RDONLY flag, its copy also will have it. Reset this flag explicitly. MFC after: 2 weeks	2015-05-07 18:35:01 +00:00
Gleb Smirnoff	ee52391ebe	Use anonymous unions and structs to organize shared space in mbuf(9), instead of preprocessor macros. This will make debugger output of 'print *m' exactly match the names we use in code, making life of a kernel hacker way more pleasant. And this also allows to rename struct_m_ext back to m_ext.	2015-02-17 20:52:51 +00:00
Gleb Smirnoff	ec9d83dd9b	Use anonymous unions to add possibility to put mbufs into queue(3) STAILQs and SLISTs using the same structure field as good old m_next and m_nextpkt linkage occupy. New code is encouraged to use queue(3) macros, instead of implementing the wheel. However, better not to have a mixture of old style and queue(3) in one file or subsystem. Reviewed by: rwatson, rrs, rpaulo Differential Revision: D1499	2015-02-17 19:32:11 +00:00
Robert Watson	3d1a9ed34e	In order to support ongoing work to implement variable-size mbufs, and more generally make it easier to extend 'struct mbuf in the future', make a number of changes to the data structure: - As we anticipate embedding mbufs headers within variable-size regions of memory in the future, change the definitions of byte arrays embedded in mbufs to be of size [0] rather than [MLEN] and [MHLEN]. In fact, the cxgbe driver already uses 'struct mbuf' on the front of other storage sizes, but we would like the global mbuf allocator do be able to do this as well. - Fold 'struct m_hdr' into 'struct mbuf' itself, eliminating a set of macros that aliased 'mh_foo' field names to 'm_foo' names such as 'm_next'. These present a particular problem as we would like to add new mbuf-header fields -- e.g., 'm_size' -- that, if similarly named via macros, would introduce collisions with many other variable names in the kernel. - Rename 'struct m_ext' to 'struct struct_m_ext' so that we can add compile-time assertions without bumping into the still-extant 'm_ext' macro. - Remove the MSIZE compile-time assertion for 'struct mbuf', but add new assertions for alignment of embedded data arrays (64-bit alignment even on 32-bit platforms), and for the sizes the mbuf header, packet header, and m_ext structure. - Document that these assertions exist in comments in mbuf.h. This change is not intended to cause (non-trivial) behavioural differences, but is a precursor to further mbuf-allocator work. Differential Revision: https://reviews.freebsd.org/D1483 Reviewed by: bz, gnn, np, glebius ("go ahead, I trust you") Sponsored by: EMC / Isilon Storage Division	2015-01-14 23:44:00 +00:00
Robert Watson	a5e9a6d56f	Garbage collect m_copymdata(), an mbuf utility routine introduced in FreeBSD 7 that has not been used since. It contains a number of unresolved bugs including an inverted bcopy() and incorrect handling of read-only mbufs using internal storage. Removing this unused code is substantially essier than fixing it in order to update it to the coming mbuf world order -- but it can always be restored from revision history if it turns out to prove useful for future work. Pointed out by: jmallett Sponsored by: EMC / Isilon Storage Division	2015-01-10 10:41:23 +00:00

1 2 3 4 5 ...

272 Commits