freebsd-dev

Author	SHA1	Message	Date
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
John Baldwin	afa60c068e	Invoke ext_free function when freeing an unmapped mbuf. Fix a mis-merge when extracting the unmapped mbuf changes from Netflix's in-kernel TLS changes where the call to the function that freed the backing pages from an unmapped mbuf was missed. Sponsored by: Chelsio Communications	2019-07-02 22:58:21 +00:00
John Baldwin	cec06a3edc	Add support for using unmapped mbufs with sendfile(2). This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl. It is disabled by default. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:49:35 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Gleb Smirnoff	eec189c70b	Add new m_ext type for data for M_NOFREE mbufs, which doesn't actually do anything except several assertions. This type is going to be used for temporary on stack mbufs, that point into data in receive ring of a NIC, that shall not be freed. Such mbuf can not be stored or reallocated, its life time is current context.	2019-01-31 22:37:28 +00:00
Andriy Voskoboinyk	58c43838f7	m_getm2: correct a comment. The comment states that function always return a top of allocated mbuf; however, the function actually return the overall mbuf chain top pointer. Since there are already existing users of it (via m_getm(4) macro), rephrase the comment and leave behavior unchanged. PR: 134335 MFC after: 12 days	2019-01-27 16:44:27 +00:00
Conrad Meyer	0d1467b199	netdump: Fix netdumping with INVARIANTS kernels Correct boneheaded assertion I added in r339501. Mea culpa. The intent is to notice when an M_WAITOK zone allocation would fail during netdump, not to prevent all use of mbufs during netdump. Reviewed by: markj X-MFC-With: r339501 Differential Revision: https://reviews.freebsd.org/D17957	2018-11-12 05:24:20 +00:00
Mark Johnston	9978bd996b	Add malloc_domainset(9) and _domainset variants to other allocator KPIs. Remove malloc_domain(9) and most other _domain KPIs added in r327900. The new functions allow the caller to specify a general NUMA domain selection policy, rather than specifically requesting an allocation from a specific domain. The latter policy tends to interact poorly with M_WAITOK, resulting in situations where a caller is blocked indefinitely because the specified domain is depleted. Most existing consumers of the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy, in which we fall back to other domains to satisfy the allocation request. This change also defines a set of DOMAINSET_FIXED() policies, which only permit allocations from the specified domain. Discussed with: gallatin, jeff Reported and tested by: pho (previous version) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17418	2018-10-30 18:26:34 +00:00
Conrad Meyer	3937ee7557	netdump: Zone mbufs should be allocated before dump Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17306	2018-10-20 22:24:58 +00:00
Mark Johnston	5475ca5aca	Add an mbuf allocator for netdump. The aim is to permit mbuf allocations after a panic without calling into the page allocator, without imposing any runtime overhead during regular operation of the system, and without modifying driver code. The approach taken is to preallocate a number of mbufs and clusters, storing them in linked lists, and using the lists to back some UMA cache zones. At panic time, the mbuf and cluster zone pointers are overwritten with those of the cache zones so that the mbuf allocator returns preallocated items. Using this scheme, drivers which cache mbuf zone pointers from m_getzone() require special handling when implementing netdump support. Reviewed by: cem (earlier version), julian, sbruno MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15251	2018-05-06 00:19:48 +00:00
Mark Johnston	c2ba2d1b0e	Style. MFC after: 3 days	2018-05-06 00:11:30 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Pedro F. Giffuni	8a36da99de	sys/kern: adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:20:12 +00:00
Gleb Smirnoff	9c82bec42d	Improvements to sendfile(2) mbuf free routine. o Fall back to default m_ext free mech, using function pointer in m_ext_free, and remove sf_ext_free() called directly from mbuf code. Testing on modern CPUs showed no regression. o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses SF_SYNC flag. Lack of the flag allows us not to dereference ext_arg2, saving from a cache line miss. o Create function sendfile_free_page() that later will be used, for multi-page mbufs. For now compiler will inline it into sendfile_free_mext(). In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 21:06:16 +00:00
Gleb Smirnoff	07e87a1d55	In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively, in mb_free_ext() always use fields from the original refcount holding mbuf (see. r296242) mbuf. Cuts another cache miss from mb_free_ext(). However, treat EXT_EXTREF mbufs differently, since they are different - they don't have a refcount holding mbuf. Provide longer comments in m_ext declaration to explain this change and change from r296242. In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 20:51:58 +00:00
Gleb Smirnoff	e8fd18f306	Shorten list of arguments to mbuf external storage freeing function. All of these arguments are stored in m_ext, so there is no reason to pass them in the argument list. Not all functions need the second argument, some don't even need the first one. The second argument lives in next cache line, so not dereferencing it is a performance gain. This was discovered in sendfile(2), which will be covered by next commits. The second goal of this commit is to bring even more flexibility to m_ext mbufs, allowing to create more fields in m_ext, opaque to the generic mbuf code, and potentially set and dereferenced by subsystems. Reviewed by: gallatin, kbowling Differential Revision: https://reviews.freebsd.org/D12615	2017-10-09 20:35:31 +00:00
Scott Long	4c7070db25	Import the 'iflib' API library for network drivers. From the author: "iflib is a library to eliminate the need for frequently duplicated device independent logic propagated (poorly) across many network drivers." Participation is purely optional. The IFLIB kernel config option is provided for drivers that want to transition between legacy and iflib modes of operation. ixl and ixgbe driver conversions will be committed shortly. We hope to see participation from the Broadcom and maybe Chelsio drivers in the near future. Submitted by: mmacy@nextbsd.org Reviewed by: gallatin Differential Revision: D5211	2016-05-18 04:35:58 +00:00
Gleb Smirnoff	17cd649f4a	Add a comment and KASSERT that a M_NOFREE mbuf has always EXT_EXTREF ext. Submitted by: kmacy	2016-05-17 23:15:16 +00:00
Pedro F. Giffuni	e3043798aa	sys/kern: spelling fixes in comments. No functional change.	2016-04-29 22:15:33 +00:00
Navdeep Parhar	fddd4f6273	Plug leak in m_unshare. m_unshare passes on the source mbuf's flags as-is to m_getcl and this results in a leak if the flags include M_NOFREE. The fix is to clear the bits not listed in M_COPYALL before calling m_getcl. M_RDONLY should probably be filtered out too but that's outside the scope of this fix. Add assertions in the zone_mbuf and zone_pack ctors to catch similar bugs. Update netmap_get_mbuf to not pass M_NOFREE to m_getcl. It's not clear what the original code was trying to do but it's likely incorrect. Updated code is no different functionally but it avoids the newly added assertions. Reviewed by: gnn@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5698	2016-03-26 23:39:53 +00:00
Gleb Smirnoff	c8f59118be	Space and style(9) corrections for recent mbuf changes.	2016-03-24 20:06:52 +00:00
George V. Neville-Neil	480f4e946d	Add an mbuf provider to DTrace. The mbuf provider is made up of a set of Statically Defined Tracepoints which help us look into mbufs as they are allocated and freed. This can be used to inspect the buffers or for a simplified mbuf leak detector. New tracepoints are: mbuf:::m-init mbuf:::m-gethdr mbuf:::m-get mbuf:::m-getcl mbuf:::m-clget mbuf:::m-cljget mbuf:::m-cljset mbuf:::m-free mbuf:::m-freem There is also a translator for mbufs which gives some visibility into the structure, see mbuf.d for more details. Reviewed by: bz, markj MFC after: 2 weeks Sponsored by: Rubicon Communications (Netgate) Differential Revision: https://reviews.freebsd.org/D5682	2016-03-22 13:16:52 +00:00
Gleb Smirnoff	0ea37a86ba	Fix regression in r296242 affecting several drivers. For EXT_NET_DRV, EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free callback, then free the mbuf, otherwise we will derefernce memory that was just freed. Reported and tested: jhibbits	2016-03-02 02:12:01 +00:00
Gleb Smirnoff	56a5f52e80	New way to manage reference counting of mbuf external storage. The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used. The first mbuf to attach a cluster stores the refcount. The further mbufs to reference the cluster point at refcount in the first mbuf. The first mbuf is freed only when the last reference is freed. The benefit over refcounts stored in separate slabs is that now refcounts of different, unrelated mbufs do not share a cache line. For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd() becomes void, making widely used M_EXTADD macro safe. For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization exactly against the cache aliasing problem with regular refcounting. Discussed with: rrs, rwatson, gnn, hiren, sbruno, np Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D5396 Sponsored by: Netflix	2016-03-01 00:17:14 +00:00
Gleb Smirnoff	5e4bc63b7c	o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all mbuf(9) manipulation functions into uipc_mbuf.c. This looks like the initial intent, but had diffused in the last decade. o Gather all declarations in mbuf.h in one place and sort them. o Uninline m_clget() and m_cljget(). There are no functional changes in this patch. The patch comes from a larger version, where all mbuf(9) allocation was uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c. The performance impact of the total uninlining is still unclear, so we are holding on now with larger version. Together with: melifaro, olivier	2016-02-11 21:32:23 +00:00
Gleb Smirnoff	b4b12e52fb	Garbage collect unused arguments of m_init().	2016-02-10 18:54:18 +00:00
Gleb Smirnoff	e60b2fcbeb	Redo r292484. Embed task(9) into zone, so that uz_maxaction is called in a context that can sleep, allowing consumers of the KPI to run their drain routines without any extra measures. Discussed with: jtl	2016-02-03 23:30:17 +00:00
Gleb Smirnoff	9542ea7b80	Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows to make uma_dbg.h not depend on uma_int.h, which allows to uninclude uma_int.h from the mbuf(9) allocator.	2016-02-03 22:02:36 +00:00
Jonathan T. Looney	54503a13d8	Add a safety net to reclaim mbufs when one of the mbuf zones become exhausted. It is possible for a bug in the code (or, theoretically, even unusual network conditions) to exhaust all possible mbufs or mbuf clusters. When this occurs, things can grind to a halt fairly quickly. However, we currently do not call mb_reclaim() unless the entire system is experiencing a low-memory condition. While it is best to try to prevent exhaustion of one of the mbuf zones, it would also be useful to have a mechanism to attempt to recover from these situations by freeing "expendable" mbufs. This patch makes two changes: a) The patch adds a generic API to the UMA zone allocator to set a function that should be called when an allocation fails because the zone limit has been reached. Because of the way this function can be called, it really should do minimal work. b) The patch uses this API to try to free mbufs when an allocation fails from one of the mbuf zones because the zone limit has been reached. The function schedules a callout to run mb_reclaim(). Differential Revision: https://reviews.freebsd.org/D3864 Reviewed by: gnn Comments by: rrs, glebius MFC after: 2 weeks Sponsored by: Juniper Networks	2015-12-20 02:05:33 +00:00
Ryan Stone	f2c2231e0c	Fix integer truncation bug in malloc(9) A couple of internal functions used by malloc(9) and uma truncated a size_t down to an int. This could cause any number of issues (e.g. indefinite sleeps, memory corruption) if any kernel subsystem tried to allocate 2GB or more through malloc. zfs would attempt such an allocation when run on a system with 2TB or more of RAM. Note to self: When this is MFCed, sparc64 needs the same fix. Differential revision: https://reviews.freebsd.org/D2106 Reviewed by: kib Reported by: Michael Fuckner <michael@fuckner.net> Tested by: Michael Fuckner <michael@fuckner.net> MFC after: 2 weeks	2015-04-01 12:42:26 +00:00
Navdeep Parhar	a9fa76f27e	Test for absence of M_NOFREE before attempting to purge the mbuf's tags. This will leave more state intact should the assertion go off. MFC after: 1 month	2014-09-30 23:16:26 +00:00
Gleb Smirnoff	1fbe6a82f4	Improve reference counting of EXT_SFBUF pages attached to mbufs. o Do not use UMA refcount zone. The problem with this zone is that several refcounting words (16 on amd64) share the same cache line, and issueing atomic(9) updates on them creates cache line contention. Also, allocating and freeing them is extra CPU cycles. Instead, refcount the page directly via vm_page_wire() and the sfbuf via sf_buf_alloc(sf_buf_page(sf)) [1]. o Call refcounting/freeing function for EXT_SFBUF via direct function call, instead of function pointer. This removes barrier for CPU branch predictor. o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to satisfy assertion in mb_dtor_mbuf(). Remove the assertion from mb_dtor_mbuf(). Use bcopy() instead of manual assignments to copy m_ext in mb_dupcl(). [1] This has some problems for now. Using sf_buf_alloc() merely to increase refcount is expensive, and is broken on sparc64. To be fixed. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-07-11 19:40:50 +00:00
Gleb Smirnoff	fcc34a238c	Fix style bug: rename the refcount field of m_ext to ext_cnt, to match other members. Sponsored by: Nginx, Inc.	2014-07-11 14:34:29 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating childrens of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading cludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Ed Maste	b7bd677fe1	Initialise m_pkthdr via bzero instead of explicitly zeroing each member Sponsored by: The FreeBSD Foundation	2014-04-04 21:09:06 +00:00
John Baldwin	d251e7006b	Ignore attempts to set the nmbcluster sysctls to their current value rather than failing with an error. Reviewed by: andre Approved by: re (delphij) MFC after: 2 weeks	2013-10-10 16:11:34 +00:00
Hiren Panchasara	b6f49c23a3	Fixing a small typo. Reviewed by: gjb Approved by: sbruno (mentor)	2013-09-05 18:18:23 +00:00
Andre Oppermann	ce28636bcf	After r254779 "error" must always be present in mb_ctor_pack(), not only when MAC is defined. Reported by: gjb / tinderbox Sponsored by: The FreeBSD Foundation	2013-08-24 21:25:53 +00:00
Andre Oppermann	ce6169e715	Remove unused m_free_fast(). The difference to m_free() is only 2 predictable branches nowadays. However as a pre-condition the caller had to ensure that the mbuf pkthdr did not have any mtags attached to it, costing some potential branches again. Sponsored by: The FreeBSD Foundation	2013-08-24 21:09:57 +00:00
Andre Oppermann	1b4381afbb	Restructure the mbuf pkthdr to make it fit for upcoming capabilities and features. The changes in particular are: o Remove rarely used "header" pointer and replace it with a 64bit protocol/ layer specific union PH_loc for local use. Protocols can flexibly overlay their own 8 to 64 bit fields to store information while the packet is worked on. o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc instead of pkthdr.header. o Extend csum_flags to 64bits to allow for additional future offload information to be carried (e.g. iSCSI, IPsec offload, and others). o Move the RSS hash type enumerator from abusing m_flags to its own 8bit rsstype field. Adjust accessor macros. o Add cosqos field to store Class of Service / Quality of Service information with the packet. It is not yet supported in any drivers but allows us to get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with a modernized ALTQ. o Add four 8 bit fields l[2-5]hlen to store the relative header offsets from the start of the packet. This is important for various offload capabilities and to relieve the drivers from having to parse the packet and protocol headers to find out location of checksums and other information. Header parsing in drivers is a lot of copy-paste and unhandled corner cases which we want to avoid. o Add another flexible 64bit union to map various additional persistent packet information, like ether_vtag, tso_segsz and csum fields. Depending on the csum_flags settings some fields may have different usage making it very flexible and adaptable to future capabilities. o Restructure the CSUM flags to better signify their outbound (down the stack) and inbound (up the stack) use. The CSUM flags used to be a bit chaotic and rather poorly documented leading to incorrect use in many places. Bring clarity into their use through better naming. Compatibility mappings are provided to preserve the API. The drivers can be corrected one by one and MFC'd without issue. o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures). Sponsored by: The FreeBSD Foundation	2013-08-24 19:51:18 +00:00
Andre Oppermann	894734cbd6	dd a 24 bits wide ext_flags field to m_ext by reducing ext_type to 8 bits. ext_type is an enumerator and the number of types we have is a mere dozen. A couple of ext_types are renumbered to fit within 8 bits. EXT_VENDOR[1-4] and EXT_EXP[1-4] types for vendor-internal and experimental local mapping. The ext_flags field is currently unused but has a couple of flags already defined for future use. Again vendor and experimental flags are provided for local mapping. EXT_FLAG_BITS is provided for the printf(9) %b identifier. Initialize and copy ext_flags in the relevant mbuf functions. Improve alignment and packing of struct m_ext on 32 and 64 archs by carefully sorting the fields.	2013-08-24 13:15:42 +00:00
Andre Oppermann	afb295cc9a	Avoid code duplication for mbuf initialization and use m_init() instead in mb_ctor_mbuf() and mb_ctor_pack().	2013-08-24 12:24:58 +00:00
Andre Oppermann	1227f20d26	Revert r254520 and resurrect the M_NOFREE mbuf flag and functionality. Requested by: np, grehan	2013-08-21 18:12:04 +00:00
Andre Oppermann	aa3cb8fb64	Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree users for a very long time, if ever. Should such a functionality ever be needed again the appropriate and much better way to do it is through a custom EXT_SOMETHING external mbuf type together with a dedicated *ext_free function. Discussed with: trociny, glebius	2013-08-19 11:16:53 +00:00
Jeff Roberson	5df87b21d3	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
Gleb Smirnoff	59b9c4f289	Nuke mbstat. It wasn't used for mbuf statistics since FreeBSD 5. Now that r253351 moved sendfile() stats to a separate struct, the last field used in mbstat is m_mcfail, which is updated, but never read or obtained from userland.	2013-07-15 12:18:36 +00:00
Andrey V. Elsukov	05d1f5bce0	Introduce new structure sfstat for collecting sendfile's statistics and remove corresponding fields from struct mbstat. Use PCPU counters and SFSTAT_INC() macro for update these statistics. Discussed with: glebius	2013-07-15 06:16:57 +00:00
Andre Oppermann	bc4a1b8ccd	Make use of the fact that uma_zone_set_max(9) already returns the rounded limit making a call to uma_zone_get_max(9) unnecessary. MFC after: 1 day	2013-07-11 12:53:13 +00:00
Andre Oppermann	e0c00adda2	Fix style issues, a typo in "kern.ipc.nmbufs" and correctly plave and expose the value of the tunable maxmbufmem as "kern.ipc.maxmbufmem" through sysctl. Reported by: smh MFC after: 1 day	2013-07-11 12:46:35 +00:00
Julian Elischer	4591f0d339	Initialising the new fibnum field to a known value turns out to be a GOOD IDEA (TM). Apparently MOST users set this (e.g. tcp and friends) but there are a few users that just assume that it is a sensible value but then go on to read it. These include SCTP, pf and the FLOWTABLE option (and maybe others).	2013-05-24 02:18:37 +00:00
Andre Oppermann	2ebcc8ac4a	Base the calculation of maxmbufmem in part on kmem_map size instead of kernel_map size to prevent kernel memory exhaustion by mbufs and a subsequent panic on physical page allocation failure. On architectures without a direct map all mbuf memory (except for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map. It is the limiting factor hence. For architectures with a direct map using the size of kmem_map is a good proxy of available kernel memory as well. If it is much smaller the mbuf limit may be sub-optimal but remains reasonable, while avoiding panics under exhaustion. The overall mbuf memory limit calculation may be reconsidered again later, however due to the many different mbuf sizes and different backing KVM maps it is a tricky subject. Found by: pho's new network stress test Pointed out by: alc (kmem_map instead of kernel_map) Tested by: pho	2013-04-24 13:54:55 +00:00
Andre Oppermann	371407162b	Move the mbuf memory limit calculations from init_param2() to tunable_mbinit() where it is next to where it is used later. Change the sysinit level of tunable_mbinit() from SI_SUB_TUNABLES to SI_SUB_KMEM after the VM is running. This allows to use better methods to determine the effectively available physical and virtual memory available to the kernel. Update comments. In a second step it can be merged into mbuf_init().	2013-01-17 21:28:31 +00:00
Pawel Jakub Dawidek	6e0b674628	Configure UMA warnings for the following zones: - unp_zone: kern.ipc.maxsockets limit reached - socket_zone: kern.ipc.maxsockets limit reached - zone_mbuf: kern.ipc.nmbufs limit reached - zone_clust: kern.ipc.nmbclusters limit reached - zone_jumbop: kern.ipc.nmbjumbop limit reached - zone_jumbo9: kern.ipc.nmbjumbo9 limit reached - zone_jumbo16: kern.ipc.nmbjumbo16 limit reached Note that those warnings are printed not often than every five minutes and can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0. Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks	2012-12-07 22:30:30 +00:00
Pawel Jakub Dawidek	45fe0bf7e4	Make use of the fact that uma_zone_set_max(9) already returns actual limit set.	2012-12-07 22:23:53 +00:00
Pawel Jakub Dawidek	4007b61cde	More style cleanups.	2012-12-07 22:22:04 +00:00
Pawel Jakub Dawidek	b0b1402537	Style cleanups.	2012-12-07 22:19:41 +00:00
Andre Oppermann	416a434cd0	Complete r243631 by applying the remainder of kern_mbuf.c that got lost while merging into the commit tree. MFC after: 1 month X-MFC-with: r243631	2012-11-27 23:16:56 +00:00
Andre Oppermann	ead46972a4	Base the mbuf related limits on the available physical memory or kernel memory, whichever is lower. The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM. The limit is set to half of the physical RAM or KVM (whichever is lower) as the baseline. In any normal scenario we want to leave at least half of the physmem/kvm for other kernel functions and userspace to prevent it from swapping too easily. Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm. At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets. Tidy up ordering in init_param2() and check up on some users of those values calculated here. Out of the overall mbuf memory limit 2K clusters and 4K (page size) clusters to get 1/4 each because these are the most heavily used mbuf sizes. 2K clusters are used for MTU 1500 ethernet inbound packets. 4K clusters are used whenever possible for sends on sockets and thus outbound packets. The larger cluster sizes of 9K and 16K are limited to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used these large clusters will end up only on the inbound path. They are not used on outbound, there it's still 4K. Yes, that will stay that way because otherwise we run into lots of complications in the stack. And it really isn't a problem, so don't make a scene. Normal mbufs (256B) weren't limited at all previously. This was problematic as there are certain places in the kernel that on allocation failure of clusters try to piece together their packet from smaller mbufs. The mbuf limit is the number of all other mbuf sizes together plus some more to allow for standalone mbufs (ACK for example) and to send off a copy of a cluster. Unfortunately there isn't a way to set an overall limit for all mbuf memory together as UMA doesn't support such a limiting. NB: Every cluster also has an mbuf associated with it. Two examples on the revised mbuf sizing limits: 1GB KVM: 512MB limit for mbufs 419,430 mbufs 65,536 2K mbuf clusters 32,768 4K mbuf clusters 9,709 9K mbuf clusters 5,461 16K mbuf clusters 16GB RAM: 8GB limit for mbufs 33,554,432 mbufs 1,048,576 2K mbuf clusters 524,288 4K mbuf clusters 155,344 9K mbuf clusters 87,381 16K mbuf clusters These defaults should be sufficient for even the most demanding network loads. MFC after: 1 month	2012-11-27 21:19:58 +00:00
Alfred Perlstein	79f62ed690	Allow maxusers to scale on machines with large address space. Some hooks are added to clamp down maxusers and nmbclusters for small address space systems. VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on physical memory. VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory. These are set to the old values on i386 to preserve the clamping that was being done to all arches. Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override for the calculation on a MD basis. Currently no arch defines this. Reviewed by: peter MFC after: 2 weeks	2012-11-10 02:08:40 +00:00
Kevin Lo	a2c36a0234	Since the macro dtom() has been removed, fix comments about the dtom. Reviewed by: glebius	2012-10-29 10:04:28 +00:00
Navdeep Parhar	812302c3eb	Allow nmbjumbop, nmbjumbo9, and nmbjumbo16 to be set directly via loader tunables. MFC after: 1 month	2012-08-23 21:32:02 +00:00
Ed Schouten	60ae52f785	Use ISO C99 integer types in sys/kern where possible. There are only about 100 occurences of the BSD-specific u_int*_t datatypes in sys/kern. The ISO C99 integer types are used here more often.	2010-06-21 09:55:56 +00:00
Alan Cox	3153e878dd	Add support to the virtual memory system for configuring machine- dependent memory attributes: Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the fact that there are machine-dependent memory attributes that have nothing to do with controlling the cache's behavior. Introduce vm_object_set_memattr() for setting the default memory attributes that will be given to an object's pages. Introduce and use pmap_page_{get,set}_memattr() for getting and setting a page's machine-dependent memory attributes. Add full support for these functions on amd64 and i386 and stubs for them on the other architectures. The function pmap_page_set_memattr() is also responsible for any other machine-dependent aspects of changing a page's memory attributes, such as flushing the cache or updating the direct map. The uses include kmem_alloc_contig(), vm_page_alloc(), and the device pager: kmem_alloc_contig() can now be used to allocate kernel memory with non-default memory attributes on amd64 and i386. vm_page_alloc() and the device pager will set the memory attributes for the real or fictitious page according to the object's default memory attributes. Update the various pmap functions on amd64 and i386 that map pages to incorporate each page's memory attributes in the mapping. Notes: (1) Inherent to this design are safety features that prevent the specification of inconsistent memory attributes by different mappings on amd64 and i386. In addition, the device pager provides a warning when a device driver creates a fictitious page with memory attributes that are inconsistent with the real page that the fictitious page is an alias for. (2) Storing the machine-dependent memory attributes for amd64 and i386 as a dedicated "int" in "struct md_page" represents a compromise between space efficiency and the ease of MFCing these changes to RELENG_7. In collaboration with: jhb Approved by: re (kib)	2009-07-12 23:31:20 +00:00
Alan Cox	e999111ae7	This change is the next step in implementing the cache control functionality required by video card drivers. Specifically, this change introduces vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all architectures. In addition, this changes adds a vm_cache_mode_t parameter to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the interfaces for allocating mapped kernel memory and physical memory, respectively, with non-default cache modes. In collaboration with: jhb	2009-06-26 04:47:43 +00:00
Kip Macy	5b204a113c	define helper routines for deferred mbuf initialization	2009-06-19 21:14:39 +00:00
Alan Cox	c45c00344f	Utilize the new function kmem_alloc_contig() to implement the UMA back-end allocator for the jumbo frames zones. This change has two benefits: (1) a custom back-end deallocator is no longer required. UMA's standard deallocator suffices. (2) It eliminates a potentially confusing artifact of using contigmalloc(): The malloc(9) statistics contain bogus information about the usage of jumbo frames. Specifically, the malloc(9) statistics report all jumbo frames in use whereas the UMA zone statistics report the "truth" about the number in use vs. the number free.	2009-06-18 17:59:04 +00:00
Robert Watson	bcf11e8d00	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd	2009-06-05 14:55:22 +00:00
Robert Watson	877e88123e	Temporary workaround for the limitations of the mbuf flowid field: zero the field in the mbuf constructors, since otherwise we have no way to tell if they are valid. In the future, Kip has plans to add a flag specifically to indicate validity, which is the preferred model.	2009-01-01 20:03:01 +00:00
Ruslan Ermilov	98bd6a1982	Removed a comment made obsolete by revisions 157927 and 174292.	2008-12-18 15:56:12 +00:00
Bjoern A. Zeeb	629386598e	Make sure nmbclusters are initialized before maxsockets by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE before the init_maxsockets() SYSINT at SI_ORDER_ANY. Reviewed by: rwatson, zec Sponsored by: The FreeBSD Foundation MFC after: 4 weeks	2008-12-10 22:17:09 +00:00
Kip Macy	8aa7a58108	make kern.ipc.nmbclusters actually have a useful effect on nmbclusters et al. initialize pkthdr in field order	2008-11-09 01:53:06 +00:00
Alan Cox	7630c26507	Reintroduce UMA_SLAB_KMAP; however, change its spelling to UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM. (UMA_SLAB_KMAP met its original demise in revision 1.30 of vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame allocators. Without it, UMA cannot correctly return pages from the jumbo frame zones to the VM system because it resets the pages' object field to NULL instead of the kernel object. In more detail, the jumbo frame zones are created with the option UMA_ZONE_REFCNT. This causes UMA to overwrite the pages' object field with the address of the slab. However, when UMA wants to release these pages, it doesn't know how to restore the object field, so it sets it to NULL. This change teaches UMA how to reset the object field to the kernel object. Crashes reported by: kris Fix tested by: kris Fix discussed with: jeff MFC after: 6 weeks	2008-04-04 18:41:12 +00:00
Robert Watson	237fdd787b	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink	2008-03-16 10:58:09 +00:00
Poul-Henning Kamp	cf827063a9	Give MEXTADD() another argument to make both void pointers to the free function controlable, instead of passing the KVA of the buffer storage as the first argument. Fix all conventional users of the API to pass the KVA of the buffer as the first argument, to make this a no-op commit. Likely break the only non-convetional user of the API, after informing the relevant committer. Update the mbuf(9) manual page, which was already out of sync on this point. Bump __FreeBSD_version to 800016 as there is no way to tell how many arguments a CPP macro needs any other way. This paves the way for giving sendfile(9) a way to wait for the passed storage to have been accessed before returning. This does not affect the memory layout or size of mbufs. Parental oversight by: sam and rwatson. No MFC is anticipated.	2008-02-01 19:36:27 +00:00
Randall Stewart	6b4959bf2f	- fix tab to space issue, hmm maybe I should use vi.	2007-12-15 23:14:53 +00:00
Randall Stewart	cf70a46b47	- Puts default limits on 4k/9k and 16k zones for mbufs all based on 1/2 of each of the successive limits tied to the limit for 2k clusters. - Adds real functionality in so that doing a sysctl to change these actually changes them :-) MFC after: 1 week	2007-12-05 15:29:44 +00:00
Alan Cox	ba63339a0a	Introduce an UMA backend page allocator for the jumbo frame zones that allocates physically contiguous memory. MFC after: 3 months Requested and reviewed by: Kip Macy Tested by: Andrew Gallatin and Pyun YongHyeon	2007-12-04 07:06:08 +00:00
David E. O'Brien	ef44c8d2a3	style(9)	2007-10-26 16:33:47 +00:00
Robert Watson	30d239bc4c	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer	2007-10-24 19:04:04 +00:00
Kip Macy	457869b973	This patch adds an M_NOFREE flag which allows one to mark an mbuf as not being independently freeable. This allows one to embed an mbuf in the cluster itself. This confers the benefits of the packet zone on all cluster sizes. Embedded mbufs currently suffer from the same limitation that packet zone mbufs do in that one cannot disconnect them and pass them around independently of the cluster. It would likely be possible to eliminate this limitation in the future by adding a second reference for the mbuf itself. Approved by: re(gnn)	2007-10-06 21:42:39 +00:00
Kip Macy	629b9e0853	Allow drivers to free an mbuf without having the mbuf be touched if the driver has already freed any attached tags Approved by: re(gnn)	2007-10-06 21:13:55 +00:00
David Malone	041b706b2f	Despite several examples in the kernel, the third argument of sysctl_handle_int is not sizeof the int type you want to export. The type must always be an int or an unsigned int. Remove the instances where a sizeof(variable) is passed to stop people accidently cut and pasting these examples. In a few places this was sysctl_handle_int was being used on 64 bit types, which would truncate the value to be exported. In these cases use sysctl_handle_quad to export them and change the format to Q so that sysctl(1) can still print them.	2007-06-04 18:25:08 +00:00
Kip Macy	0f4d9d04ea	Fix mb_ctor_clust and mb_dtor_clust to reference the appropriate zone, simplify setting refcnt Reviewed by: andre, rwatson, and glebius MFC after: 3 days	2007-04-04 21:27:01 +00:00
Mohan Srinivasan	6c125b8df6	Fix for problems that occur when all mbuf clusters migrate to the mbuf packet zone. Cluster allocations fail when this happens. Also processes that may have blocked on cluster allocations will never be woken up. Thanks to rwatson for an overview of the issue and pointers to the mbuma paper and his tool to dump out UMA zones. Reviewed by: andre@	2007-01-25 01:05:23 +00:00
Robert Watson	aed5570872	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA	2006-10-22 11:52:19 +00:00
Andre Oppermann	a855e2b4c0	Remove VLAN mtag UMA zones and initialize ether_vtag and tso_segsz packet header fields to zero on mbuf allocation. Sponsored by: TCP/IP Optimization Fundraise 2005	2006-09-17 13:44:32 +00:00
Robert Watson	b37ffd3189	Move some functions and definitions from uipc_socket2.c to uipc_socket.c: - Move sonewconn(), which creates new sockets for incoming connections on listen sockets, so that all socket allocate code is together in uipc_socket.c. - Move 'maxsockets' and associated sysctls to uipc_socket.c with the socket allocation code. - Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it to sysctl.h and remove lots of scattered implementations in various IPC modules. - Sort sodealloc() after soalloc() in uipc_socket.c for dependency order reasons. Statisticize soalloc() and sodealloc() as they are now required only in uipc_socket.c, and are internal to the socket implementation. After this change, socket allocation and deallocation is entirely centralized in one file, and uipc_socket2.c consists entirely of socket buffer manipulation and default protocol switch functions. MFC after: 1 month	2006-06-10 14:34:07 +00:00
Paul Saab	4f590175b7	Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.	2006-04-21 09:25:40 +00:00
Andre Oppermann	a7bd90ef93	Properly handle the case when the packet secondary zone can't allocate further mbuf clusters to attach to mbufs. Reported by: kris Tested by: kris Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 days	2006-03-08 14:05:38 +00:00
Gleb Smirnoff	73bb09f2d0	One more grammar nit. Submitted by: ru	2006-02-27 07:22:32 +00:00
Gleb Smirnoff	fcf9061858	Fix several typos and trim spaces at eol. PR: kern/93759 Submitted by: Antoine Brodin <antoine.brodin laposte.net>	2006-02-26 11:44:28 +00:00
Andre Oppermann	ec63cb90a3	Replace the 4k fixed sized jumbo mbuf clusters with PAGE_SIZE sized jumbo mbuf clusters. To make the variable size clear they are named MJUMPAGESIZE. Having jumbo clusters with the native PAGE_SIZE is more useful than a fixed 4k size according the device driver writers using this API. The 9k and 16k jumbo mbuf clusters remain unchanged. Requested by: glebius, gallatin Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 days	2006-02-17 14:14:15 +00:00
Gleb Smirnoff	75ee267c22	Merge the //depot/user/yar/vlan branch into CVS. It contains some collective work by yar, thompsa and myself. The checksum offloading part also involves work done by Mihail Balikov. The most important changes: o Instead of global linked list of all vlan softc use a per-trunk hash. The size of hash is dynamically adjusted, depending on number of entries. This changes struct ifnet, replacing counter of vlans with a pointer to trunk structure. This change is an improvement for setups with big number of VLANs, several interfaces and several CPUs. It is a small regression for a setup with a single VLAN interface. An alternative to dynamic hash is a per-trunk static array with 4096 entries, which is a compile time option - VLAN_ARRAY. In my experiments the array is not an improvement, probably because such a big trunk structure doesn't fit into CPU cache. o Introduce an UMA zone for VLAN tags. Since drivers depend on it, the zone is declared in kern_mbuf.c, not in optional vlan(4) driver. This change is a big improvement for any setup utilizing vlan(4). o Use rwlock(9) instead of mutex(9) for locking. We are the first ones to do this! :) o Some drivers can do hardware VLAN tagging + hardware checksum offloading. Add an infrastructure for this. Whenever vlan(4) is attached to a parent or parent configuration is changed, the flags on vlan(4) interface are updated. In collaboration with: yar, thompsa In collaboration with: Mihail Balikov <mihail.balikov interbgc.com>	2006-01-30 13:45:15 +00:00
Andre Oppermann	fd2413398c	In mb_zinit_pack() explicitly ignore the return value of uma_zalloc_arg(). The success of the cluster allocation is checked through a field in the mbuf structure. This change is non-functional but properly silences code inspection tools. Found by: Coverity Prevent(tm) Coverity ID: CID807 Sponsored by: TCP/IP Optimization Fundraise 2005	2006-01-23 15:49:01 +00:00
Andre Oppermann	36ae3fd3c3	Hide the 4k mbuf clusters if the normal clusters are defined to be 4k already. This unbreaks tinderbox. Submitted by: ru	2005-12-10 15:21:04 +00:00
Andre Oppermann	d5269a636b	Add an API for jumbo mbuf cluster allocation and also provide 4k clusters in addition to 9k and 16k ones. struct mbuf m_getjcl(int how, short type, int flags, int size) void m_cljget(struct mbuf *m, int how, int size) m_getjcl() returns an mbuf with a cluster of the specified size attached like m_getcl() does for 2k clusters. m_cljget() is different from m_clget() as it can allocate clusters without attaching them to an mbuf. In that case the return value is the pointer to the cluster of the requested size. If an mbuf was specified, it gets the cluster attached to it and the return value can be safely ignored. For size both take MCLBYTES, MJUM4BYTES, MJUM9BYTES, MJUM16BYTES. Reviewed by: glebius Tested by: glebius Sponsored by: TCP/IP Optimization Fundraise 2005	2005-12-08 13:13:06 +00:00
Gleb Smirnoff	49d46b616e	Fix panic string in last revision.	2005-11-06 16:47:59 +00:00

1 2 3 4

164 Commits