freebsd-dev

Author	SHA1	Message	Date
Kyle Evans	b5a7ac997f	kern_shm_open: push O_CLOEXEC into caller control The motivation for this change is to allow wrappers around shm to be written that don't set CLOEXEC. kern_shm_open currently accepts O_CLOEXEC but sets it unconditionally. kern_shm_open is used by the shm_open(2) syscall, which is mandated by POSIX to set CLOEXEC, and CloudABI's sys_fd_create1(). Presumably O_CLOEXEC is intended in the latter caller, but it's unclear from the context. sys_shm_open() now unconditionally sets O_CLOEXEC to meet POSIX requirements, and a comment has been dropped in to kern_fd_open() to explain the situation and add a pointer to where O_CLOEXEC setting is maintained for shm_open(2) correctness. CloudABI's sys_fd_create1() also unconditionally sets O_CLOEXEC to match previous behavior. This also has the side-effect of making flags correctly reflect the O_CLOEXEC status on this fd for the rest of kern_shm_open(), but a glance-over leads me to believe that it didn't really matter. Reviewed by: kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21119	2019-07-31 15:16:51 +00:00
Mark Johnston	49c3e8c8d1	Enable witness(4) blessings. witness has long had a facility to "bless" designated lock pairs. Lock order reversals between a pair of blessed locks are not reported upon. We have a number of long-standing false positive LOR reports; start marking well-understood LORs as blessed. This change hides reports about UFS vnode locks and the UFS dirhash lock, and UFS vnode locks and buffer locks, since those are the two that I observe most often. In the long term it would be preferable to be able to limit blessings to a specific site where a lock is acquired, and/or extend witness to understand why some lock order reversals are valid (for example, if code paths with conflicting lock orders are serialized by a third lock), but in the meantime the false positives frequently confuse users and generate bug reports. Reviewed by: cem, kib, mckusick MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21039	2019-07-30 17:09:58 +00:00
Mark Johnston	ed13ff4549	Regenerate after r350447.	2019-07-30 16:01:16 +00:00
Mark Johnston	f30f7b9870	Enable copy_file_range(2) in capability mode. copy_file_range() operates on a pair of file descriptors; it requires CAP_READ for the source descriptor and CAP_WRITE for the destination descriptor. Reviewed by: kevans, oshogbo Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21113	2019-07-30 15:59:44 +00:00
Xin LI	d4565741c6	Remove gzip'ed a.out support. The current implementation of gzipped a.out support was based on a very old version of InfoZIP which ships with an ancient modified version of zlib, and was removed from the GENERIC kernel in 1999 when we moved to an ELF world. PR: 205822 Reviewed by: imp, kib, emaste, Yoshihiro Ota <ota at j.email.ne.jp> Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21099	2019-07-30 05:13:16 +00:00
Mark Johnston	98549e2dc6	Centralize the logic in vfs_vmio_unwire() and sendfile_free_page(). Both of these functions atomically unwire a page, optionally attempt to free the page, and enqueue or requeue the page. Add functions vm_page_release() and vm_page_release_locked() to perform the same task. The latter must be called with the page's object lock held. As a side effect of this refactoring, the buffer cache will no longer attempt to free mapped pages when completing direct I/O. This is consistent with the handling of pages by sendfile(SF_NOCACHE). Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20986	2019-07-29 22:01:28 +00:00
Mariusz Zaborski	9db97ca0bd	proc: make clear_orphan a public API This will be useful for other patches with process descriptors. Change its name as well. Reviewed by: markj, kib	2019-07-29 21:42:57 +00:00
Alan Somers	0367bca479	sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error This is a partial merge of 350144 from projects/fuse2 PR: 236466 Reviewed by: markj MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21095	2019-07-29 20:50:26 +00:00
Mark Johnston	918988576c	Avoid relying on header pollution from sys/refcount.h. MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-07-29 20:26:01 +00:00
Alan Somers	e7d8ebc8ca	Better comments for vlrureclaim MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2019-07-28 16:07:27 +00:00
Alan Somers	2240d8c465	Add v_inval_buf_range, like vtruncbuf but for a range of a file v_inval_buf_range invalidates all buffers within a certain LBA range of a file. It will be used by fusefs(5). This commit is a partial merge of r346162, r346606, and r346756 from projects/fuse2. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21032	2019-07-28 00:48:28 +00:00
Rick Macklem	bf499e87f5	Update the generated syscall files for copy_file_range(2) added by r350315.	2019-07-25 05:55:55 +00:00
Rick Macklem	bbbbeca3e9	Add kernel support for a Linux compatible copy_file_range(2) syscall. This patch adds support to the kernel for a Linux compatible copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9). This syscall/VOP can be used by the NFSv4.2 client to implement the Copy operation against an NFSv4.2 server to do file copies locally on the server. The vn_generic_copy_file_range() function in this patch can be used by the NFSv4.2 server to implement the Copy operation. Fuse may also be able to use the VOP_COPY_FILE_RANGE() method. vn_generic_copy_file_range() attempts to maintain holes in the output file in the range to be copied, but may fail to do so if the input and output files are on different file systems with different _PC_MIN_HOLE_SIZE values. Separate commits will be done for the generated syscall files and userland changes. A commit for a compat32 syscall will be done later. Reviewed by: kib, asomers (plus comments by brooks, jilles) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20584	2019-07-25 05:46:16 +00:00
Mark Johnston	2fb62b1a46	Fix the turnstile_lock() KPI. turnstile_{lock,unlock}() were added for use in epoch. turnstile_lock() returned NULL to indicate that the calling thread had lost a race and the turnstile was no longer associated with the given lock, or the lock owner. However, reader-writer locks may not have a designated owner, in which case turnstile_lock() would return NULL and epoch_block_handler_preempt() would leak spinlocks as a result. Apply a minimal fix: return the lock owner as a separate return value. Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21048	2019-07-24 23:04:59 +00:00
Mark Johnston	e020a35f5b	Remove a redundant offset computation in elf_load_section(). With r344705 the offset is always zero. Submitted by: Wuyang Chung <wuyang.chung1@gmail.com>	2019-07-24 15:18:05 +00:00
Ed Maste	051e692a99	mqueuefs: fix struct file leak In some error cases we previously leaked a struct file. Submitted by: mjg, markj	2019-07-23 20:59:36 +00:00
Alan Somers	caaa7cee09	[skip ci] Fix the comment for cache_purge(9) This is a merge of r348738 from projects/fuse2 Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2019-07-22 21:03:52 +00:00
Konstantin Belousov	f1cf2b9dcb	Check and avoid overflow when incrementing fp->f_count in fget_unlocked() and fhold(). On sufficiently large machine, f_count can be legitimately very large, e.g. malicious code can dup same fd up to the per-process filedescriptors limit, and then fork as much as it can. On some smaller machine, I see kern.maxfilesperproc: 939132 kern.maxprocperuid: 34203 which already overflows u_int. More, the malicious code can create transient references by sending fds over unix sockets. I realized that this check is missed after reading https://secfault-security.com/blog/FreeBSD-SA-1902.fd.html Reviewed by: markj (previous version), mjg Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20947	2019-07-21 15:07:12 +00:00
Konstantin Belousov	47c3450e50	Fix leak of memory and file refs with sendmsg(2) over unix domain sockets. When sendmsg(2) successfully internalized one SCM_RIGHTS control message, but failed to process some other control message later, both file references and filedescent memory need to be freed. This was not done, only mbuf chain was freed. Noted, test case written, reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21000	2019-07-19 20:51:39 +00:00
Alan Somers	fca79580be	sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error PR: 236466 Sponsored by: The FreeBSD Foundation	2019-07-19 18:03:30 +00:00
Alan Somers	d26d63a4af	fusefs: multiple interruptibility improvements 1) Don't explicitly not mask SIGKILL. kern_sigprocmask won't allow it to be masked, anyway. 2) Fix an infinite loop bug. If a process received both a maskable signal lower than 9 (like SIGINT) and then received SIGKILL, fticket_wait_answer would spin. msleep would immediately return EINTR, but cursig would return SIGINT, so the sleep would get retried. Fix it by explicitly checking whether SIGKILL has been received. 3) Abandon the sig_isfatal optimization introduced by r346357. That optimization would cause fticket_wait_answer to return immediately, without waiting for a response from the server, if the process were going to exit anyway. However, it's vulnerable to a race: 1) fatal signal is received while fticket_wait_answer is sleeping. 2) fticket_wait_answer sends the FUSE_INTERRUPT operation. 3) fticket_wait_answer determines that the signal was fatal and returns without waiting for a response. 4) Another thread changes the signal to non-fatal. 5) The first thread returns to userspace. Instead of exiting, the process continues. 6) The application receives EINTR, wrongly believes that the operation was successfully interrupted, and restarts it. This could cause problems for non-idempotent operations like FUSE_RENAME. Reported by: kib (the race part) Sponsored by: The FreeBSD Foundation	2019-07-17 22:45:43 +00:00
Alan Somers	0122532ee0	F_READAHEAD: Fix r349248's overflow protection, broken by r349391 I accidentally broke the main point of r349248 when making stylistic changes in r349391. Restore the original behavior, and also fix an additional overflow that was possible when uio->uio_resid was nearly SSIZE_MAX. Reported by: cem Reviewed by: bde MFC after: 2 weeks MFC-With: 349248 Sponsored by: The FreeBSD Foundation	2019-07-17 17:01:07 +00:00
Eric van Gyzen	9d3ecb7e62	Adds signal number format to kern.corefile Add format capability to core file names to include signal that generated the core. This can help various validation workflows where all cores should not be considered equally (SIGQUIT is often intentional and not an error unlike SIGSEGV or SIGBUS) Submitted by: David Leimbach (leimy2k@gmail.com) Reviewed by: markj MFC after: 1 week Relnotes: sysctl kern.corefile can now include the signal number Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20970	2019-07-16 15:51:09 +00:00
John Baldwin	32451fb9fc	Add ptrace op PT_GET_SC_RET. This ptrace operation returns a structure containing the error and return values from the current system call. It is only valid when a thread is stopped during a system call exit (PL_FLAG_SCX is set). The sr_error member holds the error value from the system call. Note that this error value is the native FreeBSD error value that has _not_ been translated to an ABI-specific error value similar to the values logged to ktrace. If sr_error is zero, then the return values of the system call will be set in sr_retval[0] and sr_retval[1]. Reviewed by: kib MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D20901	2019-07-15 21:48:02 +00:00
John Baldwin	c18ca74916	Don't pass error from syscallenter() to syscallret(). syscallret() doesn't use error anymore. Fix a few other places to permit removing the return value from syscallenter() entirely. - Remove a duplicated assertion from arm's syscall(). - Use td_errno for amd64_syscall_ret_flush_l1d. Reviewed by: kib MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D2090	2019-07-15 21:25:16 +00:00
John Baldwin	1af9474b26	Always set td_errno to the error value of a system call. Early errors prior to a system call did not set td_errno. This commit sets td_errno for all errors during syscallenter(). As a result, syscallret() can now always use td_errno without checking TDP_NERRNO. Reviewed by: kib MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D20898	2019-07-15 21:16:01 +00:00
Konstantin Belousov	9f7b7da5bf	In do_sem2_wait(), balance umtx_key_get() with umtx_key_release() on retry. Reported by: ler Bisected and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 12 days	2019-07-15 19:18:25 +00:00
Konstantin Belousov	ad038195bd	In do_lock_pi(), do not return prematurely. If umtxq_check_susp() indicates an exit, we should clean the resources before returning. Do it by breaking out of the loop and relying on post-loop cleanup. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 12 days Differential revision: https://reviews.freebsd.org/D20949	2019-07-15 08:39:52 +00:00
Konstantin Belousov	40bd868ba7	Correctly check for casueword(9) success in do_set_ceiling(). After r349951, the return code must be checked instead of old == new comparison. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 12 days Differential revision: https://reviews.freebsd.org/D20949	2019-07-15 08:38:01 +00:00
Michael Tuexen	a85b7f125b	Improve the input validation for l_linger. When using the SOL_SOCKET level socket option SO_LINGER, the structure struct linger is used as the option value. The component l_linger is of type int, but internally copied to the field so_linger of the structure struct socket. The type of so_linger is short, but it is assumed to be non-negative and the value is used to compute ticks to be stored in a variable of type int. Therefore, perform input validation on l_linger similar to the one performed by NetBSD and OpenBSD. Thanks to syzkaller for making me aware of this issue. Thanks to markj@ for pointing out that a similar check should be added to so_linger_set(). Reviewed by: markj@ MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20948	2019-07-14 21:44:18 +00:00
Konstantin Belousov	30b3018d48	Provide protection against starvation of the ll/sc loops when accessing userspace. Casueword(9) on ll/sc architectures must be prepared for userspace constantly modifying the same cache line as containing the CAS word, and not loop infinitely. Otherwise, rogue userspace livelocks the kernel. To fix the issue, change casueword(9) interface to return new value 1 indicating that either comparison or store failed, instead of relying on the oldval == *oldvalp comparison. The primitive no longer retries the operation if it failed spuriously. Modify callers of casueword(9), all in kern_umtx.c, to handle retries, and react to stops and requests to terminate between retries. On x86, although cmpxchg should not return spurious failures, we can take advantage of the new interface and just return PSL.ZF. Reviewed by: andrew (arm64, previous version), markj Tested by: pho Reported by: https://xenbits.xen.org/xsa/advisory-295.txt Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D20772	2019-07-12 18:43:24 +00:00
Doug Moore	3f3f7c056f	Address problems in blist_alloc introduced in r349777. The swap block allocator could become corrupted if a retry to allocate swap space, after a larger allocation attempt failed, allocated a smaller set of free blocks that ended on a 32- or 64-block boundary. Add tests to detect this kind of failure-to-extend-at-boundary and prevent the associated accounting screwup. Reported by: pho Tested by: pho Reviewed by: alc Approved by: markj (mentor) Discussed with: kib Differential Revision: https://reviews.freebsd.org/D20893	2019-07-11 20:52:39 +00:00
Mark Johnston	2ffee5c1b2	Inherit P2_PROTMAX_{ENABLE,DISABLE} across fork(). Thus, when using proccontrol(1) to disable implicit application of PROT_MAX within a process, child processes will inherit this setting. Discussed with: kib MFC with: r349609 Sponsored by: The FreeBSD Foundation	2019-07-10 19:57:48 +00:00
John Baldwin	c26541e315	Use 'retval' label for first error in syscallenter(). This is more consistent with the rest of the function and lets us unindent most of the function. Reviewed by: kib MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D20897	2019-07-09 23:58:12 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
Doug Moore	31c82722c1	Change blist_next_leaf_alloc so that it can examine more than one leaf after the one where the possible block allocation begins, and allocate a larger number of blocks than the current limit. This does not affect the limit on minimum allocation size, which still cannot exceed BLIST_MAX_ALLOC. Use this change to modify swp_pager_getswapspace and its callers, so that they can allocate more than BLIST_MAX_ALLOC blocks if they are available. Tested by: pho Approved by: markj (mentor) Differential Revision: https://reviews.freebsd.org/D20579	2019-07-06 06:15:03 +00:00
Mark Johnston	6a01874c5a	Defer funsetown() calls for a TTY to tty_rel_free(). We were otherwise failing to call funsetown() for some descriptors associated with a tty, such as pts descriptors. Then, if the descriptor is closed before the owner exits, we may get memory corruption. Reported by: syzbot+c9b6206303bf47bac87e@syzkaller.appspotmail.com Reviewed by: ed MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-07-04 15:42:02 +00:00
Eric van Gyzen	8c5a9161d1	Save the last callout function executed on each CPU Save the last callout function pointer (and its argument) executed on each CPU for inspection by a debugger. Add a ddb `show callout_last` command to show these pointers. Add a kernel module that I used for testing that command. Relocate `ce_migration_cpu` to reduce padding and therefore preserve the size of `struct callout_cpu` (320 bytes on amd64) despite the added members. This should help diagnose reference-after-free bugs where the callout's mutex has already been freed when `softclock_call_cc` tries to unlock it. You might hope that the pointer would still be available, but it isn't. The argument to that function is on the stack (because `softclock_call_cc` uses it later), and that might be enough in some cases, but even then, it's very laborious. A pointer to the callout is saved right before these newly added fields, but that callout might have been freed. We still have the pointer to its associated mutex, and the name within might be enough, but it might also have been freed. Reviewed by: markj jhb MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20794	2019-07-03 19:22:44 +00:00
John Baldwin	afa60c068e	Invoke ext_free function when freeing an unmapped mbuf. Fix a mis-merge when extracting the unmapped mbuf changes from Netflix's in-kernel TLS changes where the call to the function that freed the backing pages from an unmapped mbuf was missed. Sponsored by: Chelsio Communications	2019-07-02 22:58:21 +00:00
John Baldwin	9b2d70da33	Fix description of debug.obsolete_panic. MFC after: 1 week	2019-07-02 22:57:24 +00:00
Konstantin Belousov	7fde3c6b28	More style. Re-wrap long lines, reformat comments, remove excessive blank line. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-07-02 21:03:06 +00:00
Konstantin Belousov	4b8b28e130	Style. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-07-02 19:32:48 +00:00
Konstantin Belousov	5dc7e31a09	Control implicit PROT_MAX() using procctl(2) and the FreeBSD note feature bit. In particular, allocate the bit to opt-out the image from implicit PROTMAX enablement. Provide procctl(2) verbs to set and query implicit PROTMAX handling. The knobs mimic the same per-image flag and per-process controls for ASLR. Reviewed by: emaste, markj (previous version) Discussed with: brooks Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D20795	2019-07-02 19:07:17 +00:00
Mark Johnston	6d958292f3	Fix handling of errors from sblock() in soreceive_stream(). Previously we would attempt to unlock the socket buffer despite having failed to lock it. Simply return an error instead: no resources need to be released at this point, and doing so is consistent with soreceive_generic(). PR: 238789 Submitted by: Greg Becker <greg@codeconcepts.com> MFC after: 1 week	2019-07-02 14:24:42 +00:00
Rick Macklem	555d8f2859	Factor out the code that does a VOP_SETATTR(size) from vn_truncate(). This patch factors the code in vn_truncate() that does the actual VOP_SETATTR() of size into a separate function called vn_truncate_locked(). This will allow the NFS server and the patch that adds a copy_file_range(2) syscall to call this function instead of duplicating the code and carrying over changes, such as the recent r347151. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D20808	2019-07-01 20:41:43 +00:00
Mark Johnston	7c3703a694	Use a consistent snapshot of the fd's rights in fget_mmap(). fget_mmap() translates rights on the descriptor to a VM protection mask. It was doing so without holding any locks on the descriptor table, so a writer could simultaneously be modifying those rights. Such a situation would be detected using a sequence counter, but not before an inconsistency could trigger assertion failures in the capability code. Fix the problem by copying the fd's rights to a structure on the stack, and perform the translation only once we know that that snapshot is consistent. Reported by: syzbot+ae359438769fda1840f8@syzkaller.appspotmail.com Reviewed by: brooks, mjg MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20800	2019-06-29 16:11:09 +00:00
Mark Johnston	02476c44c5	Fix mutual exclusion in pipe_direct_write(). We use PIPE_DIRECTW as a semaphore for direct writes to a pipe, where the reader copies data directly from pages mapped into the writer. However, when a reader finishes such a copy, it previously cleared PIPE_DIRECTW, allowing multiple writers to race and corrupt the state used to track wired pages belonging to the writer. Fix this by having the writer clear PIPE_DIRECTW and instead use the count of unread bytes to determine whether a write is finished. Reported by: syzbot+21811cc0a89b2a87a9e7@syzkaller.appspotmail.com Reviewed by: kib, mjg Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20784	2019-06-29 16:05:52 +00:00
John Baldwin	3807631b8e	Compress pending socket buffer data once it is marked ready. Apply similar logic from sbcompress to pending data in the socket buffer once it is marked ready via sbready. Normally sbcompress merges small mbufs to reduce the length of mbuf chains in the socket buffer. However, sbcompress cannot do this for mbufs marked M_NOTREADY. sbcompress_ready is now called from sbready when mbufs are marked ready to merge small mbuf chains once the data is available to copy. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:50:25 +00:00
John Baldwin	cec06a3edc	Add support for using unmapped mbufs with sendfile(2). This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl. It is disabled by default. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:49:35 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00


16712 Commits