freebsd-skq

Author	SHA1	Message	Date
John Baldwin	beb817edfe	crypto: Add crypto_cursor_segment() to fetch both base and length. This function combines crypto_cursor_segbase() and crypto_cursor_seglen() into a single function. This is mostly beneficial in the unmapped mbuf case where back to back calls of these two functions have to iterate over the sub-components of unmapped mbufs twice. Bump __FreeBSD_version for crypto drivers in ports. Suggested by: markj Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30445	2021-05-25 16:59:19 -07:00
John Baldwin	6663f8a23e	sglist: Add sglist_append_single_mbuf(). This function appends the contents of a single mbuf to an sglist rather than an entire mbuf chain. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30135	2021-05-25 16:59:18 -07:00
Mark Johnston	38da497a4d	Add the KASAN runtime KASAN enables the use of LLVM's AddressSanitizer in the kernel. This feature makes use of compiler instrumentation to validate memory accesses in the kernel and detect several types of bugs, including use-after-frees and out-of-bounds accesses. It is particularly effective when combined with test suites or syzkaller. KASAN has high CPU and memory usage overhead and so is not suited for production environments. The runtime and pmap maintain a shadow of the kernel map to store information about the validity of memory mapped at a given kernel address. The runtime implements a number of functions defined by the compiler ABI. These are prefixed by __asan. The compiler emits calls to __asan_load() and __asan_store() around memory accesses, and the runtime consults the shadow map to determine whether a given access is valid. kasan_mark() is called by various kernel allocators to update state in the shadow map. Updates to those allocators will come in subsequent commits. The runtime also defines various interceptors. Some low-level routines are implemented in assembly and are thus not amenable to compiler instrumentation. To handle this, the runtime implements these routines on behalf of the rest of the kernel. The sanitizer implementation validates memory accesses manually before handing off to the real implementation. The sanitizer in a KASAN-configured kernel can be disabled by setting the loader tunable debug.kasan.disable=1. Obtained from: NetBSD MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29416	2021-04-13 17:42:20 -04:00
Dmitry Chagin	86887853c3	Remove reference to the pfctlinput2() from domain(9) after `237c1f932b`. Reviewed by: glebius MFC After: 1 week Differential Revision: https://reviews.freebsd.org/D29751	2021-04-14 00:40:20 +03:00
John Baldwin	76681661be	OCF: Remove support for asymmetric cryptographic operations. There haven't been any non-obscure drivers that supported this functionality and it has been impossible to test to ensure that it still works. The only known consumer of this interface was the engine in OpenSSL < 1.1. Modern OpenSSL versions do not include support for this interface as it was not well-documented. Reviewed by: cem Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D29736	2021-04-12 14:28:43 -07:00
Ka Ho Ng	86a52e262a	Document vnode_pager_setsize(9) MFC after: 3 days Sponsored by: The FreeBSD Foundation Reviewed by: bcr Approved by: philip (mentor) Differential Revision: https://reviews.freebsd.org/D29408	2021-04-07 19:11:26 +08:00
Warner Losh	e52368365d	config_intrhook: provide config_intrhook_drain config_intrhook_drain will remove the hook from the list as config_intrhook_disestablish does if the hook hasn't been called. If it has, config_intrhook_drain will wait for the hook to be disestablished in the normal course (or expedited, it's up to the driver to decide how and when to call config_intrhook_disestablish). This is intended for removable devices that use config_intrhook and might be attached early in boot, but that may be removed before the kernel can call the config_intrhook or before it ends. To prevent all races, the detach routine will need to call config_intrhook_train. Sponsored by: Netflix, Inc Reviewed by: jhb, mav, gde (in D29006 for man page) Differential Revision: https://reviews.freebsd.org/D29005	2021-03-11 09:45:10 -07:00
Hans Petter Selasky	c743a6bd4f	Implement mallocarray_domainset(9) variant of mallocarray(9). Reviewed by: kib @ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-03-06 11:38:55 +01:00
Ka Ho Ng	43afeee2fb	share/man/man9: document zero_region(9) The zero_region() kernel interface was previously undocumented. Add a new zero_region(9) manual page to document it. Submitted by: Ka Ho Ng <khng@freebsdfoundation.org> Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28914	2021-03-02 17:14:06 +08:00
Konstantin Belousov	55eb51ab66	Add VOP_READ_PGCACHE(9) PR: 253894 Reviewed by: gbe, rwatson Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28980	2021-03-01 01:38:33 +02:00
Ryan Libby	1c55eab104	bitset.9: add missing MLINKS Add MLINKS for new bitset(9) APIs in r364796 / `f878200180` and `ae4a8e5207`. Reported by: trasz Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D27944	2021-01-03 12:52:21 -08:00
Mark Johnston	c97e33e1fd	Add missing refcount.9 MLINKS	2020-12-07 14:53:34 +00:00
Mark Johnston	eb3b7cece2	Add some missing nv(9) MLINKS MFC after: 1 week	2020-10-23 14:25:48 +00:00
Emmanuel Vadot	675aae732d	Add backlight subsystem This is a simple subsystem that allow drivers to register as a backlight. Each backlight creates a device node under /dev/backlight/backlightX and an alias based on the name provided. Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26250	2020-10-02 18:18:01 +00:00
Konstantin Belousov	2be2e7e549	Remove stray line	2020-09-22 23:39:14 +00:00
Warner Losh	6b1d211602	Add devctl_notify(9) man page Document the calls to send messages to userland via devctl. devctl_notify will create a message for the specified system, subsystem and type, optionally adding additional information. Reviewed by: bcr Differential Revision: https://reviews.freebsd.org/D26520	2020-09-22 23:02:01 +00:00
Warner Losh	c6d67028c7	Document devctl_safe_quote_sb This routine centralizes the knowledge needed for properly quoting 'value' in all key="value" items that appear in devctl messages. Reviewed by: bcr Differential Revision: https://reviews.freebsd.org/D26520	2020-09-22 23:01:53 +00:00
Warner Losh	a329c23eb7	Add a devctl_process_running man page. Reviewed by: bcr Differential Revision: https://reviews.freebsd.org/D26520	2020-09-22 23:01:44 +00:00
Mitchell Horne	cba446e2c2	Add getenv(9) boolean parsing functions This adds the getenv_bool() function, to parse a boolean value from a kernel environment variable or tunable. This works for traditional boolean values like "0" and "1", and also "true" and "false" (case-insensitive). These semantics do not yet apply to sysctls declared using SYSCTL_BOOL with CTLFLAG_TUN (they still only parse 1 and 0). Also added are two wrapper functions, getenv_is_true() and getenv_is_false(). These are slightly simpler for callers wishing to perform a single check of a configuration variable. Reviewed by: jhb (slightly earlier version) Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26270	2020-09-21 15:24:44 +00:00
Li-Wen Hsu	95407a79cb	Remove vm_map_create(9) KPI's manpage according to r364302 Submitted by: Ka Ho Ng <khng300@gmail.com> Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26372	2020-09-10 06:32:25 +00:00
Conrad Meyer	8a0edc914f	Add prng(9) API Add prng(9) as a replacement for random(9) in the kernel. There are two major differences from random(9) and random(3): - General prng(9) APIs (prng32(9), etc) do not guarantee an implementation or particular sequence; they should not be used for repeatable simulations. - However, specific named API families are also exposed (for now: PCG), and those are expected to be repeatable (when so-guaranteed by the named algorithm). Some minor differences from random(3) and earlier random(9): - PRNG state for the general prng(9) APIs is per-CPU; this eliminates contention on PRNG state in SMP workloads. Each PCPU generator in an SMP system produces a unique sequence. - Better statistical properties than the Park-Miller ("minstd") PRNG (longer period, uniform distribution in all bits, passes BigCrush/PractRand analysis). - Faster than Park-Miller ("minstd") PRNG -- no division is required to step PCG-family PRNGs. For now, random(9) becomes a thin shim around prng32(). Eventually I would like to mechanically switch consumers over to the explicit API. Reviewed by: kib, markj (previous version both) Discussed with: markm Differential Revision: https://reviews.freebsd.org/D25916	2020-08-13 20:48:14 +00:00
Mateusz Guzik	51ea7bea91	vfs: add VOP_STAT The current scheme of calling VOP_GETATTR adds avoidable overhead. An example with tmpfs doing fstat (ops/s): before: 7488958 after: 7913833 Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D25910	2020-08-07 23:06:40 +00:00
Mark Johnston	96ad26eefb	Remove free_domain() and uma_zfree_domain(). These functions were introduced before UMA started ensuring that freed memory gets placed in domain-local caches. They no longer serve any purpose since UMA now provides their functionality by default. Remove them to simplyify the kernel memory allocator interfaces a bit. Reviewed by: cem, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25937	2020-08-04 13:58:36 +00:00
Edward Tomasz Napierala	55ec696d42	Add missing bitset(9) MLINKS. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25713	2020-07-19 12:22:32 +00:00
Gordon Bergling	d8fd37e1e1	devstat(9): Update the man page to reflect the current implementation - Rename devstat_add_entry to devstat_new_entry - Update the description of devstat_trans_flags - Add manpage aliases for devstat_start_transaction_bio and devstat_end_transaction_bio PR: 157316 Submitted by: novel Reviewed by: cem, bcr (mentor) Approved by: bcr (mentor) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D25677	2020-07-17 22:15:02 +00:00
John Baldwin	946b8f6fb0	Add crypto_initreq() and crypto_destroyreq(). These routines are similar to crypto_getreq() and crypto_freereq() but operate on caller-supplied storage instead of allocating crypto requests from a UMA zone. Reviewed by: markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25691	2020-07-16 21:30:46 +00:00
Gleb Smirnoff	91ddfec2d7	Fixup for r360574: install new mlinks for sglist(9) and remove old ones.	2020-07-07 02:41:51 +00:00
John Baldwin	23230d520a	Remove an extraneous line continuation from r361481.	2020-05-25 23:07:50 +00:00
John Baldwin	9c0e3d3a53	Add support for optional separate output buffers to in-kernel crypto. Some crypto consumers such as GELI and KTLS for file-backed sendfile need to store their output in a separate buffer from the input. Currently these consumers copy the contents of the input buffer into the output buffer and queue an in-place crypto operation on the output buffer. Using a separate output buffer avoids this copy. - Create a new 'struct crypto_buffer' describing a crypto buffer containing a type and type-specific fields. crp_ilen is gone, instead buffers that use a flat kernel buffer have a cb_buf_len field for their length. The length of other buffer types is inferred from the backing store (e.g. uio_resid for a uio). Requests now have two such structures: crp_buf for the input buffer, and crp_obuf for the output buffer. - Consumers now use helper functions (crypto_use_, e.g. crypto_use_mbuf()) to configure the input buffer. If an output buffer is not configured, the request still modifies the input buffer in-place. A consumer uses a second set of helper functions (crypto_use_output_) to configure an output buffer. - Consumers must request support for separate output buffers when creating a crypto session via the CSP_F_SEPARATE_OUTPUT flag and are only permitted to queue a request with a separate output buffer on sessions with this flag set. Existing drivers already reject sessions with unknown flags, so this permits drivers to be modified to support this extension without requiring all drivers to change. - Several data-related functions now have matching versions that operate on an explicit buffer (e.g. crypto_apply_buf, crypto_contiguous_subsegment_buf, bus_dma_load_crp_buf). - Most of the existing data-related functions operate on the input buffer. However crypto_copyback always writes to the output buffer if a request uses a separate output buffer. - For the regions in input/output buffers, the following conventions are followed: - AAD and IV are always present in input only and their fields are offsets into the input buffer. - payload is always present in both buffers. If a request uses a separate output buffer, it must set a new crp_payload_start_output field to the offset of the payload in the output buffer. - digest is in the input buffer for verify operations, and in the output buffer for compute operations. crp_digest_start is relative to the appropriate buffer. - Add a crypto buffer cursor abstraction. This is a more general form of some bits in the cryptosoft driver that tried to always use uio's. However, compared to the original code, this avoids rewalking the uio iovec array for requests with multiple vectors. It also avoids allocate an iovec array for mbufs and populating it by instead walking the mbuf chain directly. - Update the cryptosoft(4) driver to support separate output buffers making use of the cursor abstraction. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24545	2020-05-25 22:12:04 +00:00
John Baldwin	29fe41ddd7	Retire the CRYPTO_F_IV_GENERATE flag. The sole in-tree user of this flag has been retired, so remove this complexity from all drivers. While here, add a helper routine drivers can use to read the current request's IV into a local buffer. Use this routine to replace duplicated code in nearly all drivers. Reviewed by: cem Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24450	2020-04-20 22:24:49 +00:00
John Baldwin	c034143269	Refactor driver and consumer interfaces for OCF (in-kernel crypto). - The linked list of cryptoini structures used in session initialization is replaced with a new flat structure: struct crypto_session_params. This session includes a new mode to define how the other fields should be interpreted. Available modes include: - COMPRESS (for compression/decompression) - CIPHER (for simply encryption/decryption) - DIGEST (computing and verifying digests) - AEAD (combined auth and encryption such as AES-GCM and AES-CCM) - ETA (combined auth and encryption using encrypt-then-authenticate) Additional modes could be added in the future (e.g. if we wanted to support TLS MtE for AES-CBC in the kernel we could add a new mode for that. TLS modes might also affect how AAD is interpreted, etc.) The flat structure also includes the key lengths and algorithms as before. However, code doesn't have to walk the linked list and switch on the algorithm to determine which key is the auth key vs encryption key. The 'csp_auth_' fields are always used for auth keys and settings and 'csp_cipher_' for cipher. (Compression algorithms are stored in csp_cipher_alg.) - Drivers no longer register a list of supported algorithms. This doesn't quite work when you factor in modes (e.g. a driver might support both AES-CBC and SHA2-256-HMAC separately but not combined for ETA). Instead, a new 'crypto_probesession' method has been added to the kobj interface for symmteric crypto drivers. This method returns a negative value on success (similar to how device_probe works) and the crypto framework uses this value to pick the "best" driver. There are three constants for hardware (e.g. ccr), accelerated software (e.g. aesni), and plain software (cryptosoft) that give preference in that order. One effect of this is that if you request only hardware when creating a new session, you will no longer get a session using accelerated software. Another effect is that the default setting to disallow software crypto via /dev/crypto now disables accelerated software. Once a driver is chosen, 'crypto_newsession' is invoked as before. - Crypto operations are now solely described by the flat 'cryptop' structure. The linked list of descriptors has been removed. A separate enum has been added to describe the type of data buffer in use instead of using CRYPTO_F_* flags to make it easier to add more types in the future if needed (e.g. wired userspace buffers for zero-copy). It will also make it easier to re-introduce separate input and output buffers (in-kernel TLS would benefit from this). Try to make the flags related to IV handling less insane: - CRYPTO_F_IV_SEPARATE means that the IV is stored in the 'crp_iv' member of the operation structure. If this flag is not set, the IV is stored in the data buffer at the 'crp_iv_start' offset. - CRYPTO_F_IV_GENERATE means that a random IV should be generated and stored into the data buffer. This cannot be used with CRYPTO_F_IV_SEPARATE. If a consumer wants to deal with explicit vs implicit IVs, etc. it can always generate the IV however it needs and store partial IVs in the buffer and the full IV/nonce in crp_iv and set CRYPTO_F_IV_SEPARATE. The layout of the buffer is now described via fields in cryptop. crp_aad_start and crp_aad_length define the boundaries of any AAD. Previously with GCM and CCM you defined an auth crd with this range, but for ETA your auth crd had to span both the AAD and plaintext (and they had to be adjacent). crp_payload_start and crp_payload_length define the boundaries of the plaintext/ciphertext. Modes that only do a single operation (COMPRESS, CIPHER, DIGEST) should only use this region and leave the AAD region empty. If a digest is present (or should be generated), it's starting location is marked by crp_digest_start. Instead of using the CRD_F_ENCRYPT flag to determine the direction of the operation, cryptop now includes an 'op' field defining the operation to perform. For digests I've added a new VERIFY digest mode which assumes a digest is present in the input and fails the request with EBADMSG if it doesn't match the internally-computed digest. GCM and CCM already assumed this, and the new AEAD mode requires this for decryption. The new ETA mode now also requires this for decryption, so IPsec and GELI no longer do their own authentication verification. Simple DIGEST operations can also do this, though there are no in-tree consumers. To eventually support some refcounting to close races, the session cookie is now passed to crypto_getop() and clients should no longer set crp_sesssion directly. - Assymteric crypto operation structures should be allocated via crypto_getkreq() and freed via crypto_freekreq(). This permits the crypto layer to track open asym requests and close races with a driver trying to unregister while asym requests are in flight. - crypto_copyback, crypto_copydata, crypto_apply, and crypto_contiguous_subsegment now accept the 'crp' object as the first parameter instead of individual members. This makes it easier to deal with different buffer types in the future as well as separate input and output buffers. It's also simpler for driver writers to use. - bus_dmamap_load_crp() loads a DMA mapping for a crypto buffer. This understands the various types of buffers so that drivers that use DMA do not have to be aware of different buffer types. - Helper routines now exist to build an auth context for HMAC IPAD and OPAD. This reduces some duplicated work among drivers. - Key buffers are now treated as const throughout the framework and in device drivers. However, session key buffers provided when a session is created are expected to remain alive for the duration of the session. - GCM and CCM sessions now only specify a cipher algorithm and a cipher key. The redundant auth information is not needed or used. - For cryptosoft, split up the code a bit such that the 'process' callback now invokes a function pointer in the session. This function pointer is set based on the mode (in effect) though it simplifies a few edge cases that would otherwise be in the switch in 'process'. It does split up GCM vs CCM which I think is more readable even if there is some duplication. - I changed /dev/crypto to support GMAC requests using CRYPTO_AES_NIST_GMAC as an auth algorithm and updated cryptocheck to work with it. - Combined cipher and auth sessions via /dev/crypto now always use ETA mode. The COP_F_CIPHER_FIRST flag is now a no-op that is ignored. This was actually documented as being true in crypto(4) before, but the code had not implemented this before I added the CIPHER_FIRST flag. - I have not yet updated /dev/crypto to be aware of explicit modes for sessions. I will probably do that at some point in the future as well as teach it about IV/nonce and tag lengths for AEAD so we can support all of the NIST KAT tests for GCM and CCM. - I've split up the exising crypto.9 manpage into several pages of which many are written from scratch. - I have converted all drivers and consumers in the tree and verified that they compile, but I have not tested all of them. I have tested the following drivers: - cryptosoft - aesni (AES only) - blake2 - ccr and the following consumers: - cryptodev - IPsec - ktls_ocf - GELI (lightly) I have not tested the following: - ccp - aesni with sha - hifn - kgssapi_krb5 - ubsec - padlock - safe - armv8_crypto (aarch64) - glxsb (i386) - sec (ppc) - cesa (armv7) - cryptocteon (mips64) - nlmsec (mips64) Discussed with: cem Relnotes: yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D23677	2020-03-27 18:25:23 +00:00
Mateusz Piotrowski	f04020edc5	Sort UMA macros and create MLINKS for them This patch is a follow-up to r344518. Reported by: ngie Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D24165	2020-03-23 14:04:42 +00:00
Gleb Smirnoff	a7f12fce22	Remove struct callout_handle. Should have gone with r355732.	2020-01-22 05:47:59 +00:00
John Baldwin	4b28d96e5d	Remove the deprecated timeout(9) interface. All in-tree consumers have been converted to callout(9). Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D22602	2019-12-13 21:03:12 +00:00
Warner Losh	b832a7e505	Create new wrapper function: bus_delayed_attach_children() Delay the attachment of children, when requested, until after interrutps are running. This is often needed to allow children to run transactions on i2c or spi busses. It's a common enough idiom that it will be useful to have its own wrapper. Reviewed by: ian Differential Revision: https://reviews.freebsd.org/D21465	2019-12-13 19:39:33 +00:00
Ryan Libby	9825eadf2c	bitset: rename confusing macro NAND to ANDNOT s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too. The actual implementation is "and not" (or "but not"), i.e. A but not B. Fortunately this does appear to be what all existing callers want. Don't supply a NAND (not (A and B)) operation at this time. Discussed with: jeff Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22791	2019-12-13 09:32:16 +00:00
Andriy Gapon	337f6465a9	document taskqueue_start_threads_in_proc While here, fix taskqueue_start_threads_cpuset that was documented under old name of taskqueue_start_threads_pinned. MFC after: 4 weeks	2019-10-17 06:58:07 +00:00
Andriy Gapon	c812bea351	add superio.4 and superio.9 manual pages This adds basic documentation on what the superio driver is and how other drivers can interact with it. I decided to also document superio's ivar accessors. Reviewed by: bcr, brueffer (both manual contents only) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D21958	2019-10-11 11:13:47 +00:00
Mark Johnston	fee2a2fa39	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Mark Johnston	e46cfc2542	Revert a portion of r351628 that I did not mean to commit. Reported by: mjg MFC with: r351628	2019-09-03 14:39:36 +00:00
Mark Johnston	08cfa56ea3	Extend uma_reclaim() to permit different reclamation targets. The page daemon periodically invokes uma_reclaim() to reclaim cached items from each zone when the system is under memory pressure. This is important since the size of these caches is unbounded by default. However it also results in bursts of high latency when allocating from heavily used zones as threads miss in the per-CPU caches and must access the keg in order to allocate new items. With r340405 we maintain an estimate of each zone's usage of its (per-NUMA domain) cache of full buckets. Start making use of this estimate to avoid reclaiming the entire cache when under memory pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only items in excess of the estimate are reclaimed. Draining a zone reclaims all of the cached full buckets (the previous behaviour of uma_reclaim()), and may further drain the per-CPU caches in extreme cases. Now, when under memory pressure, the page daemon will trim zones rather than draining them. As a result, heavily used zones do not incur bursts of bucket cache misses following reclamation, but large, unused caches will be reclaimed as before. Reviewed by: jeff Tested by: pho (an earlier version) MFC after: 2 months Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16667	2019-09-01 22:22:43 +00:00
Mark Johnston	d794b3a3c2	Update and clean up the UMA man page. - Fix warnings from igor and mandoc. - Provide a brief description of the separation between zones and their backend slab allocators. - Document cache zones and secondary zones. - Document the kernel config options added in r350659. - Document the uma_zalloc_pcpu() and uma_zfree_pcpu() wrappers. - Document uma_zone_reserve(), uma_zone_reserve_kva() and uma_zone_prealloc(). - Document uma_zone_alloc() and uma_zone_freef(). - Add some missing MLINKs and Xrefs. MFC after: 2 weeks	2019-08-30 19:35:44 +00:00
Li-Wen Hsu	26a6feda9c	Follow r350693 to add a link for sbuf_nl_terminate(9) Sponsored by: The FreeBSD Foundation	2019-08-08 00:51:17 +00:00
Conrad Meyer	ac03832ef3	GEOM: Reduce unnecessary log interleaving with sbufs Similar to what was done for device_printfs in r347229. Convert g_print_bio() to a thin shim around g_format_bio(), which acts on an sbuf; documented in g_bio.9. Reviewed by: markj Discussed with: rlibby Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D21165	2019-08-07 19:28:35 +00:00
Mariusz Zaborski	7244507616	seqc: add man page Reviewed by: markj Earlier version reviewed by: emaste, mjg, bcr, 0mp Differential Revision: https://reviews.freebsd.org/D16744	2019-07-29 21:53:02 +00:00
Rick Macklem	4b5b98d22d	Create a man page for VOP_COPY_FILE_RANGE(9). r350315 created a Linux compatible copy_file_range(2) syscall. It uses a VOP method called VOP_COPY_FILE_RANGE so that file systems, such as the NFSv4.2 client can do file system specific copying. For NFSv4.2, this allows the copying to be done locally on the NFS server, avoiding transferring the data across the wire twice. This is a new man page (content changed). Reviewed by: kib, asomers Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20584	2019-07-25 06:20:00 +00:00
Emmanuel Vadot	1d6d0a43ce	pkgbase: move man pages from runtime-manual to runtime We don't split the other man pages in their own package so do the same for runtime. Reviewed by: bapt, gjb Differential Revision: https://reviews.freebsd.org/D20962	2019-07-19 15:12:20 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
Hans Petter Selasky	131b2b7658	Implement API for draining EPOCH(9) callbacks. The epoch_drain_callbacks() function is used to drain all pending callbacks which have been invoked by prior epoch_call() function calls on the same epoch. This function is useful when there are shared memory structure(s) referred to by the epoch callback(s) which are not refcounted and are rarely freed. The typical place for calling this function is right before freeing or invalidating the shared resource(s) used by the epoch callback(s). This function can sleep and is not optimized for performance. Differential Revision: https://reviews.freebsd.org/D20109 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-28 10:38:56 +00:00

1 2 3 4 5 ...

644 Commits