Commit Graph

71 Commits

Author SHA1 Message Date
Mateusz Guzik
6fed89b179 kern: clean up empty lines in .c and .h files 2020-09-01 22:12:32 +00:00
Chuck Silvers
c2ea3d44bf Fix hang due to missing unbusy in sendfile when an async data I/O fails.
r359473 removed the page unbusy logic from sendfile_iodone() because when
vm_pager_get_pages_async() would return an error after failing to start
the async I/O (eg. because VOP_BMAP failed), sendfile_swapin() would also
unbusy the pages, and it was wrong to unbusy twice.  However this breaks
the case where vm_pager_get_pages_async() succeeds in starting an async I/O
and the async I/O is what fails.  In this case, sendfile_iodone() must
unbusy the pages, and because sendfile_iodone() doesn't know which case
it is in, sendfile_iodone() must always unbusy pages and relookup pages
which have been substituted with bogus_page, which in turn means that
sendfile_swapin() must never do unbusy or relookup for pages which have
been given to vm_pager_get_pages_async(), even if there is an error.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25136
2020-06-06 00:02:50 +00:00
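
An editorial sketch of the completion-side rule described above, assuming the handler is given the object, the page array, and the starting pindex (names and layout here are illustrative, not the actual kern_sendfile.c code):

#include <sys/param.h>
#include <sys/systm.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Hypothetical completion-side helper: the I/O-done handler owns unbusying,
 * so it relooks up any slots that were substituted with bogus_page and then
 * unbusies every page it was given.
 */
static void
example_iodone_unbusy(vm_object_t obj, vm_page_t *pa, int npages,
    vm_pindex_t pindex0)
{
	int i;

	for (i = 0; i < npages; i++) {
		if (pa[i] == bogus_page)
			pa[i] = vm_page_relookup(obj, pindex0 + i);
		vm_page_xunbusy(pa[i]);
	}
}
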
Gleb Smirnoff
61664ee700 Step 4.2: start divorce of M_EXT and M_EXTPG
They have more differences than similarities.  For now there is a lot
of code that checks only for M_EXT and works correctly on M_EXTPG
buffers, so the M_EXT bit is still carried together with M_EXTPG.
However, prepare some code for an explicit check for M_EXTPG.

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:37:16 +00:00
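
As an illustration of why the explicit check matters, a hedged sketch: only a non-M_EXTPG mbuf has contiguous data reachable through m_data, so code that formerly keyed on M_EXT alone must start asking about M_EXTPG (the helper below is made up):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Sketch: once the flags are divorced, only a non-M_EXTPG mbuf has
 * contiguous data reachable through m_data; M_EXTPG mbufs carry an
 * array of pages instead.
 */
static char *
example_mapped_data(struct mbuf *m)
{
	if (m->m_flags & M_EXTPG)
		return (NULL);		/* caller must walk the page array */
	return (mtod(m, char *));
}
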
Gleb Smirnoff
6edfd179c8 Step 4.1: mechanically rename M_NOMAP to M_EXTPG
Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:21:11 +00:00
Gleb Smirnoff
7b6c99d08d Step 3: anonymize struct mbuf_ext_pgs and move all its fields into struct mbuf
within the m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:12:56 +00:00
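
After the rename the metadata is reached directly on the mbuf; a small, hedged illustration using one of the fields named in the sed script above (the semantics ascribed to it are assumed):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Sketch: the ext_pgs metadata now lives directly in the mbuf, so a
 * consumer can consult it without chasing a side structure.  Treating
 * m_epg_trllen > 0 as "a TLS trailer is present" is an assumption made
 * for this illustration.
 */
static bool
example_epg_has_trailer(const struct mbuf *m)
{
	return ((m->m_flags & M_EXTPG) != 0 && m->m_epg_trllen > 0);
}
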
Gleb Smirnoff
bccf6e26e9 Step 2.5: Stop using 'struct mbuf_ext_pgs' in the kernel itself.
Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-03 00:08:05 +00:00
Gleb Smirnoff
0c1032665c Continuation of multi page mbuf redesign from r359919.
The following series of patches addresses three things:

Now that the array of pages is embedded into the mbuf, we no longer
need a separate structure to pass around, so struct mbuf_ext_pgs is an
artifact of the first implementation, and struct mbuf_ext_pgs_data is
a crutch to accommodate the main idea of r359919 with minimal churn.

Also, M_EXT of type EXT_PGS is just a synonym for M_NOMAP.

The namespace for the new feature is somewhat inconsistent and
sometimes has lengthy prefixes.  In these patches we will gradually
bring the namespace to the "m_epg" prefix for all mbuf fields and most
functions.

Step 1 of 4:

 o Anonymize mbuf_ext_pgs_data, embed in m_ext
 o Embed mbuf_ext_pgs
 o Start documenting all this entanglement

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598
2020-05-02 22:39:26 +00:00
Mark Johnston
b7eae75807 Make sendfile(SF_SYNC)'s CV wait interruptible.
Otherwise, since the CV is not signalled until data is drained from the
socket, it is trivial to create an unkillable process using
sendfile(SF_SYNC) and a process-private PF_LOCAL socket pair.  In
particular, the cv_wait() in sendfile() does not get interrupted until
data is drained from the receiving socket buffer.

Reported by:	pho
Discussed with:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-04-28 15:02:44 +00:00
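
The fix boils down to using the signal-interruptible CV primitive and propagating its error; a generic sketch of that pattern with placeholder names (not the actual sendfile synchronization objects):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

/*
 * Sketch: wait for a completion flag, but let signals interrupt the
 * sleep.  cv_wait_sig() returns EINTR or ERESTART when a signal is
 * delivered instead of sleeping until the CV is signalled.
 */
static int
example_wait_drained(struct mtx *m, struct cv *cv, bool *done)
{
	int error;

	error = 0;
	mtx_lock(m);
	while (!*done && error == 0)
		error = cv_wait_sig(cv, m);
	mtx_unlock(m);
	return (error);
}
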
Konstantin Belousov
acf65eef23 sendfile: When all io finished, assert that sfio->pa[] is in expected state.
It must contain a fully restored contiguous run of the wired pages
from the object, except for a possibly trimmed tail.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2020-04-18 03:09:25 +00:00
Konstantin Belousov
e795a04083 The pa argument for sendfile_iodone() is not necessarily a slice of sfio->pa.
This is true for zfs, but it is not for, e.g., the vnode or buffer pagers.
When fixing bogus pages, fix them in both places.  Rely on the fact
that pa[0] must have been invalid so it cannot be bogus.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2020-04-18 03:07:18 +00:00
Andrew Gallatin
23feb56348 KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone.

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
accessing ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by:	hselasky, jhb, rrs, glebius (early version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24213
2020-04-14 14:46:06 +00:00
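
The space for the ext_pgs metadata comes from the observation that a packet-header mbuf and an unmapped mbuf are never the same mbuf; a hypothetical layout sketch of that kind of overlap (this is not the real struct mbuf):

#include <sys/types.h>

/*
 * Hypothetical layout: because the "packet header" and "unmapped pages"
 * uses are mutually exclusive, their metadata can occupy the same bytes
 * inside the mbuf instead of living in a separate, cache-cold allocation.
 */
struct example_mbuf {
	struct example_mbuf *m_next;
	int	m_flags;		/* packet-header or unmapped flag, never both */
	union {
		struct {		/* valid when the mbuf is a packet header */
			int	len;
			void	*rcvif;
		} pkthdr;
		struct {		/* valid when the mbuf is unmapped */
			uint8_t	npgs;
			uint8_t	hdrlen;
			uint8_t	trllen;
			/* header/trailer bytes and page array follow */
		} epg;
	} m_u;
};
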
Konstantin Belousov
d25f1b21c9 sendfile_iodone: correct calculation of the page index for relookup.
This is yet another bug in r359473.

Reported and tested by:	delphij
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2020-04-12 05:10:48 +00:00
Konstantin Belousov
f709eee61a Do not pass bogus page to mbufs.
This is a bug in r359473.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2020-04-10 01:28:47 +00:00
Konstantin Belousov
c506a6386f kern_sendfile.c: fix bugs with handling of busy page states.
- Do not call into a vnode pager while leaving some pages from the
  same block as the current run xbusied.  This immediately deadlocks if
  the pager needs to instantiate the buffer.
- Only relookup bogus pages after the io finished, otherwise we might
  obliterate the valid pages with out-of-date disk content.  While there,
  expand the comment explaining this peculiarity.
- Do not double-unbusy on error.  Split the unbusy for the error case,
  which is left in sendfile_swapin(), from the more properly coded
  normal case in sendfile_iodone().
- Add an XXXKIB comment explaining the serious bug in the validation
  algorithm, not fixed by this patch series.

PR:	244713
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 22:13:32 +00:00
Konstantin Belousov
d866353677 kern_sendfile.c: do not release sfio reference on error.
It is already done by sendfile_iodone(), now consistently for all errors.
This de facto reverts r358597, after r359466.

Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 22:01:36 +00:00
Konstantin Belousov
59e1ac9d79 kern_sendfile.c: wait for completion of all in-flight ios before unwiring pages.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:57:28 +00:00
Konstantin Belousov
8f0a223c78 kern_sendfile.c: add specific malloc type.
Now sfio leaks are more easily seen in the malloc statistics than,
e.g., a leak of just wired or busy pages.

Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:50:51 +00:00
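
Defining a dedicated malloc type is a one-line declaration plus tagged allocations; a sketch of the idiom (the type and description names below are made up, not necessarily what the commit used):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>

/* A dedicated type makes sfio leaks show up under their own name. */
static MALLOC_DEFINE(M_EXAMPLE_SENDFILE, "sendfile_ex",
    "example sendfile I/O state");

static void *
example_alloc_sfio(size_t size)
{
	return (malloc(size, M_EXAMPLE_SENDFILE, M_WAITOK | M_ZERO));
}
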
Konstantin Belousov
0ac8511aef kern_sendfile.c style: order headers alphabetically.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:40:35 +00:00
Michael Tuexen
db4493f7b6 sendfile() does not currently support SCTP sockets.
Therefore, fail the call.

Reviewed by:		markj@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D24059
2020-03-13 18:38:28 +00:00
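
The change amounts to an early protocol guard; a hedged sketch of such a check (both its placement and the EINVAL error value are assumptions, not taken from the commit):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/protosw.h>
#include <netinet/in.h>

/* Sketch: refuse sendfile-style transmission on SCTP sockets. */
static int
example_reject_sctp(struct socket *so)
{
	if (so->so_proto->pr_protocol == IPPROTO_SCTP)
		return (EINVAL);	/* assumed error code */
	return (0);
}
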
Chuck Silvers
37bf88e790 If vm_pager_get_pages_async() returns an error, release the sfio->nios
refcount that we took earlier, which represents the I/O that ended up
not being started.

Reviewed by:	glebius
Approved by:	imp (mentor)
Sponsored by:	Netflix
2020-03-04 00:22:50 +00:00
Jeff Roberson
6be21eb778 Provide a lock-free alternative to resolve bogus pages. This is not likely
to be much of a perf win, just a nice code simplification.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23866
2020-02-28 21:42:48 +00:00
Jeff Roberson
7aaf252c96 Convert a few trivial consumers to the new unlocked grab API.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23847
2020-02-28 20:34:30 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
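
The annotation itself is just an extra flag on each node declaration; a minimal, hypothetical example of a node and leaf marked MPSAFE:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/* Hypothetical node/leaf pair, each explicitly marked MPSAFE. */
static int example_value = 0;

SYSCTL_NODE(_kern, OID_AUTO, example, CTLFLAG_RD | CTLFLAG_MPSAFE, NULL,
    "example node (hypothetical)");
SYSCTL_INT(_kern_example, OID_AUTO, value, CTLFLAG_RW | CTLFLAG_MPSAFE,
    &example_value, 0, "example knob (hypothetical)");
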
Gleb Smirnoff
6bc27f086a Generalize resource freeing in sendfile across different scenarios.
Now we execute sendfile_iodone() in all possible cases, which
guarantees that vm_object_pip_wakeup() is called and the sfio
structure is freed.

At the beginning of sendfile initialize sfio->m to NULL, which
indicates that the mbuf chain either doesn't exist or belongs to the
syscall (not to I/O completion).  Fill sfio->m only at a point when
we are positive that there are I/Os ongoing and before releasing the
syscall's reference on sfio.

In sendfile_iodone() perform vm_object_pip_wakeup() once the last
reference is released, then check sfio->m.  A NULL pointer indicates
that we only need to free the memory.

Reviewed by:	jtl, gallatin
2020-02-25 19:29:05 +00:00
Gleb Smirnoff
f85e1a806b Make ktls_frame() never fail. The caller must supply correct mbufs.
This makes the sendfile code a bit simpler.
2020-02-25 19:26:40 +00:00
Gleb Smirnoff
69302907d6 When sendfile_swapin() sweeps through pages in search of a bogus page,
skip the first and last pages.  This is a micro-optimisation.
2020-02-25 19:11:20 +00:00
Mark Johnston
d3631aa582 Avoid releasing object PIP in vn_sendfile() if no pages were grabbed.
sendfile(2) optionally takes a set of headers that get prepended to the
file data.  If the request length is less than that of the headers,
sendfile may not allocate an sfio structure, in which case its pointer
is null and we should be careful not to dereference it.  This was
introduced in r356902.

Reported by:	syzkaller
Sponsored by:	The FreeBSD Foundation
2020-02-05 16:09:21 +00:00
Jeff Roberson
91e31c3c08 Consistently use busy and vm_page_valid() rather than touching page bits
directly.  This improves API compliance, asserts, etc.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23283
2020-01-23 04:54:49 +00:00
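
The preferred pattern is to mark a busied page valid through the accessor rather than writing the bits; a small sketch:

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Sketch: a page that the caller has busied and just filled is marked
 * fully valid through the accessor instead of by writing the bits.
 */
static void
example_mark_filled(vm_page_t m)
{
	/* Old style: m->valid = VM_PAGE_BITS_ALL; */
	vm_page_valid(m);
}
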
Jeff Roberson
d6e13f3b4d Don't hold the object lock while calling getpages.
The vnode pager does not want the object lock held.  Moving this out allows
further object lock scope reduction in callers.  While here add some missing
paging in progress calls and an assert.  The object handle is now protected
explicitly with pip.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23033
2020-01-19 23:47:32 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in a limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
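
After the change the common unlock call takes only the vnode; a tiny before/after illustration with a hypothetical caller:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/vnode.h>

/* Sketch: the common unlock no longer passes a flags argument. */
static void
example_vnode_section(struct vnode *vp)
{
	vn_lock(vp, LK_SHARED | LK_RETRY);
	/* ... read-only work on the vnode ... */
	VOP_UNLOCK(vp);			/* previously: VOP_UNLOCK(vp, 0) */
}
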
Konstantin Belousov
b631c36f0d Record part of the owner struct thread pointer into busy_lock.
Record as many bits from curthread into busy_lock as fit.  Low bits
of the struct thread * representation are zero due to struct and zone
alignment, and they leave space for busy flags (perhaps except for the
statically allocated thread0).  Upper bits are not very interesting
for the assert, and in most practical situations the recorded value
should allow the owner to be identified manually with certainty.

Assert that unbusy is performed by the owner, except for a few places
where unbusy is done in an io completion handler.  For this case, add
_unchecked variants of the asserts and unbusy primitives.

Reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22298
2019-11-24 19:12:23 +00:00
Gleb Smirnoff
b8c923032f If vm_pager_get_pages_async() returns an error synchronously, we leak wired
and busy pages.  Add code that carefully cleans up the state in case
of a synchronous error return.  Cover the case where the first I/O went
asynchronously, but the second or Nth returned an error synchronously.

In collaboration with:	chs
Reviewed by:		jtl, kib
2019-11-06 23:45:43 +00:00
John Baldwin
9e14430d46 Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TCP_TXTLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TCP_TXTLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2019-10-08 21:34:06 +00:00
Mark Johnston
fee2a2fa39 Change synchronization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accommodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
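
Under the new rules a caller wires with the object lock (or the busy lock) held and lets vm_page_unwire() handle requeueing or freeing on the last reference; a simplified sketch of that usage:

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>

/*
 * Sketch of the new KPI: wire under the object lock, then let
 * vm_page_unwire() requeue or free the page when the last wiring drops.
 */
static void
example_wire_use_unwire(vm_object_t obj, vm_pindex_t pindex)
{
	vm_page_t m;

	VM_OBJECT_WLOCK(obj);
	m = vm_page_lookup(obj, pindex);
	if (m != NULL)
		vm_page_wire(m);
	VM_OBJECT_WUNLOCK(obj);
	if (m == NULL)
		return;

	/* ... use the page without holding the object lock ... */

	vm_page_unwire(m, PQ_INACTIVE);
}
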
John Baldwin
818d755318 Only define the 'tls' member of sfio if KERN_TLS is defined.
This field was not initialized in the !KERN_TLS case, triggering an
assertion failure when using sendfile(2).

Reported by:	pho, asomers
Sponsored by:	Netflix
2019-08-27 22:21:18 +00:00
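
The fix confines the field, and its initialization, to KERN_TLS builds; a schematic of the pattern using a made-up structure:

#include "opt_kern_tls.h"

struct ktls_session;

/* Schematic: a member that exists only in KERN_TLS builds. */
struct example_sfio {
	int	npages;
#ifdef KERN_TLS
	struct ktls_session *tls;	/* only present with the KERN_TLS option */
#endif
};
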
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotiation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile, sendfile_iodone() calls ktls_enqueue() instead of pru_ready(),
leaving the mbufs marked M_NOTREADY until encryption is completed.  For
other writes (vn_sendfile when pages are available, write(2), etc.),
PRUS_NOTREADY is set when invoking pru_send(), along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid disabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
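
As noted above, two sysctl knobs gate the feature: kern.ipc.tls.enable and kern.ipc.mb_use_ext_pgs. A small userland sketch that queries both (it assumes the nodes report an integer-sized value; error handling is minimal):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

/* Sketch: report the current values of the two KTLS-related knobs. */
int
main(void)
{
	const char *knobs[] = { "kern.ipc.tls.enable", "kern.ipc.mb_use_ext_pgs" };
	int value;
	size_t len;

	for (size_t i = 0; i < 2; i++) {
		value = 0;
		len = sizeof(value);
		if (sysctlbyname(knobs[i], &value, &len, NULL, 0) == -1)
			perror(knobs[i]);
		else
			printf("%s = %d\n", knobs[i], value);
	}
	return (0);
}
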
Gleb Smirnoff
814f33aafb Since r350426 this KASSERT doesn't serve any useful purpose. 2019-08-06 16:11:00 +00:00
Mark Johnston
98549e2dc6 Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().
Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page.  Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O.  This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20986
2019-07-29 22:01:28 +00:00
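
Callers now express "unwire and maybe free or requeue" as a single call; a sketch of the replacement idiom (the VPR_TRYFREE choice for the no-cache case is illustrative):

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Sketch: one call drops our wiring and either frees or requeues the
 * page; VPR_TRYFREE asks for an outright free when possible.
 */
static void
example_done_with_page(vm_page_t m, bool nocache)
{
	vm_page_release(m, nocache ? VPR_TRYFREE : 0);
}
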
Alan Somers
0367bca479 sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error
This is a partial merge of 350144 from projects/fuse2

PR:		236466
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21095
2019-07-29 20:50:26 +00:00
Alan Somers
fca79580be sendfile: don't panic when VOP_GETPAGES_ASYNC returns an error
PR:		236466
Sponsored by:	The FreeBSD Foundation
2019-07-19 18:03:30 +00:00
John Baldwin
cec06a3edc Add support for using unmapped mbufs with sendfile(2).
This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl.
It is disabled by default.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:49:35 +00:00
John Baldwin
19c40c76e1 Trim an extra space. 2019-06-11 22:06:05 +00:00
Gleb Smirnoff
47f1ea5109 Plug sendfile(2) on a listening socket with a proper error code.
Reported by:	ngie
Reviewed by:	ngie
Approved by:	re (delphij)
2018-10-16 15:57:16 +00:00
Gleb Smirnoff
147cd40fe7 Revert the second chunk of r333860.  The warning from gcc is a false positive.
npages won't ever be used in the no-space case.
2018-05-29 21:45:15 +00:00
Matt Macy
7fd6841438 sendfile: annotate unused value and ensure that npages is actually initialized 2018-05-19 05:10:51 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
Mark Johnston
1b5c869d64 Fix some races introduced in r332974.
With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue.  Introduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by:	pho
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15280
2018-05-04 17:17:30 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files.  In-progress network
driver compatibility improvements may add over 100 more, so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Mark Johnston
0eb50f9cd2 Have vm_page_{deactivate,launder}() requeue already-queued pages.
In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.

Reviewed by:	alc
Tested by:	pho
MFC after:	2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625
2018-03-18 16:40:56 +00:00
Mark Johnston
1d3a1bcfac Dequeue wired pages lazily.
Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by:	jeff
Discussed with:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D11943
2018-02-07 16:57:10 +00:00