Commit Graph

16659 Commits

Author SHA1 Message Date
Konstantin Belousov
5dc7e31a09 Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.

In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement.  Provide procctl(2) verbs to set and query
implicit PROTMAX handling.  The knobs mimic the same per-image flag
and per-process controls for ASLR.

Reviewed by:	emaste, markj (previous version)
Discussed with:	brooks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D20795
2019-07-02 19:07:17 +00:00
Mark Johnston
6d958292f3 Fix handling of errors from sblock() in soreceive_stream().
Previously we would attempt to unlock the socket buffer despite having
failed to lock it.  Simply return an error instead: no resources need
to be released at this point, and doing so is consistent with
soreceive_generic().

PR:		238789
Submitted by:	Greg Becker <greg@codeconcepts.com>
MFC after:	1 week
2019-07-02 14:24:42 +00:00
Rick Macklem
555d8f2859 Factor out the code that does a VOP_SETATTR(size) from vn_truncate().
This patch factors the code in vn_truncate() that does the actual
VOP_SETATTR() of size into a separate function called vn_truncate_locked().
This will allow the NFS server and the patch that adds a
copy_file_range(2) syscall to call this function instead of duplicating
the code and carrying over changes, such as the recent r347151.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D20808
2019-07-01 20:41:43 +00:00
Mark Johnston
7c3703a694 Use a consistent snapshot of the fd's rights in fget_mmap().
fget_mmap() translates rights on the descriptor to a VM protection
mask.  It was doing so without holding any locks on the descriptor
table, so a writer could simultaneously be modifying those rights.
Such a situation would be detected using a sequence counter, but
not before an inconsistency could trigger assertion failures in
the capability code.

Fix the problem by copying the fd's rights to a structure on the stack,
and perform the translation only once we know that that snapshot is
consistent.

Reported by:	syzbot+ae359438769fda1840f8@syzkaller.appspotmail.com
Reviewed by:	brooks, mjg
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20800
2019-06-29 16:11:09 +00:00
Mark Johnston
02476c44c5 Fix mutual exclusion in pipe_direct_write().
We use PIPE_DIRECTW as a semaphore for direct writes to a pipe, where
the reader copies data directly from pages mapped into the writer.
However, when a reader finishes such a copy, it previously cleared
PIPE_DIRECTW, allowing multiple writers to race and corrupt the state
used to track wired pages belonging to the writer.

Fix this by having the writer clear PIPE_DIRECTW and instead use the
count of unread bytes to determine whether a write is finished.

Reported by:	syzbot+21811cc0a89b2a87a9e7@syzkaller.appspotmail.com
Reviewed by:	kib, mjg
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20784
2019-06-29 16:05:52 +00:00
John Baldwin
3807631b8e Compress pending socket buffer data once it is marked ready.
Apply similar logic from sbcompress to pending data in the socket
buffer once it is marked ready via sbready.  Normally sbcompress
merges small mbufs to reduce the length of mbuf chains in the socket
buffer.  However, sbcompress cannot do this for mbufs marked
M_NOTREADY.  sbcompress_ready is now called from sbready when mbufs
are marked ready to merge small mbuf chains once the data is available
to copy.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:50:25 +00:00
John Baldwin
cec06a3edc Add support for using unmapped mbufs with sendfile(2).
This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl.
It is disabled by default.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:49:35 +00:00
John Baldwin
82334850ea Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages.  It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer.  This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers).  It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused.  To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability.  This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands.  For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output.  If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Discussed with:	ae, kp (firewalls)
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
Konstantin Belousov
2d7a555294 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-06-28 20:40:54 +00:00
Hans Petter Selasky
131b2b7658 Implement API for draining EPOCH(9) callbacks.
The epoch_drain_callbacks() function is used to drain all pending
callbacks which have been invoked by prior epoch_call() function calls
on the same epoch. This function is useful when there are shared
memory structure(s) referred to by the epoch callback(s) which are not
refcounted and are rarely freed. The typical place for calling this
function is right before freeing or invalidating the shared
resource(s) used by the epoch callback(s). This function can sleep and
is not optimized for performance.

Differential Revision: https://reviews.freebsd.org/D20109
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-28 10:38:56 +00:00
Alan Somers
0cfc1ef38d FIOBMAP2: inline vn_ioc_bmap2
Reported by:	kib
Reviewed by:	kib
MFC after:	2 weeks
MFC-With:	349238
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20783
2019-06-27 23:39:06 +00:00
Rick Macklem
e368095437 Add non-blocking trylock variants for the rangelock functions.
A future patch that will add a Linux compatible copy_file_range(2) syscall
needs to be able to lock the byte ranges of two files concurrently.
To do this without a risk of deadlock, a non-blocking variant of
vn_rangelock_rlock() called vn_rangelock_tryrlock() was needed.
This patch adds this, along with vn_rangelock_trywlock(), in order to
do this.
The patch also adds a couple of comments, that I hope clarify how the
algorithm used in kern_rangelock.c works.

Reviewed by:	kib, asomers (previous version)
Differential Revision:	https://reviews.freebsd.org/D20645
2019-06-27 23:10:40 +00:00
John Baldwin
1db2626a9b Fix comment in sofree() to reference sbdestroy().
r160875 added sbdestroy() as a wrapper around sbrelease_internal to be
called from sofree(), yet the comment added in the same revision to
sofree() still mentions sbrelease_internal().

Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20488
2019-06-27 22:50:11 +00:00
Alan Somers
4f53d57e8c fcntl: style changes to r349248
Reported by:	bde
MFC after:	2 weeks
MFC-With:	349248
Sponsored by:	The FreeBSD Foundation
2019-06-25 19:44:22 +00:00
Warner Losh
7a3e3a2859 Remove a couple of harmless stray references to nandfs.
Submitted by: tsoome@
2019-06-25 16:39:25 +00:00
Warner Losh
af9727f618 Add missing include of sys/boot.h
This change was dropped out in a rebase and I didn't catch that before
I committed.
2019-06-24 20:52:21 +00:00
Warner Losh
ec9abc1843 Move to using a common kernel path between the boot / laoder bits and
the kernel.
2019-06-24 20:34:53 +00:00
Konstantin Belousov
89f2ab0608 Switch to check for effective user id in r349320, and disable dumping
into existing files for sugid processes.

Despite using real user id pronounces the intent, it actually breaks
suid coredumps, while not making any difference for non-sugid
processes.  The reason for the breakage is that non-existent core file
is created with the effective uid (unless weird hacks like SUIDDIR are
configured).

Then, if user enabled kern.sugid_coredump, core dumping should not
overwrite core files owned by effective uid, but we cannot pretend to
use real uid for dumping.

PR:	68905
admbugs:	358
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-23 21:15:31 +00:00
Konstantin Belousov
7a29e0bf96 coredump: avoid writing to core files not owned by the real user.
Reported by: blake frantz <trew@hick.org>
PR:	68905
admbugs:	358
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-23 18:35:11 +00:00
Alan Somers
38b06f8ac4 fcntl: fix overflow when setting F_READAHEAD
VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.

Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936.  Formalize that by using a union type.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20710
2019-06-20 23:07:20 +00:00
Alan Somers
d49b446bfb Add FIOBMAP2 ioctl
This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux.  FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.

Reviewed by:	mckusick
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20705
2019-06-20 14:13:10 +00:00
Alan Somers
d01752c703 Add a VOP_BMAP(9) man page
Reviewed by:	mckusick
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20704
2019-06-20 13:59:46 +00:00
Alexander Motin
f91aa773be Add wakeup_any(), cheaper wakeup_one() for taskqueue(9).
wakeup_one() and underlying sleepq_signal() spend additional time trying
to be fair, waking thread with highest priority, sleeping longest time.
But in case of taskqueue there are many absolutely identical threads, and
any fairness between them is quite pointless.  It makes even worse, since
round-robin wakeups not only make previous CPU affinity in scheduler quite
useless, but also hide from user chance to see CPU bottlenecks, when
sequential workload with one request at a time looks evenly distributed
between multiple threads.

This change adds new SLEEPQ_UNFAIR flag to sleepq_signal(), making it wakeup
thread that went to sleep last, but no longer in context switch (to avoid
immediate spinning on the thread lock).  On top of that new wakeup_any()
function is added, equivalent to wakeup_one(), but setting the flag.
On top of that taskqueue(9) is switchied to wakeup_any() to wakeup its
threads.

As result, on 72-core Xeon v4 machine sequential ZFS write to 12 ZVOLs
with 16KB block size spend 34% less time in wakeup_any() and descendants
then it was spending in wakeup_one(), and total write throughput increased
by ~10% with the same as before CPU usage.

Reviewed by:	markj, mmacy
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D20669
2019-06-20 01:15:33 +00:00
Alexander Motin
49ee0fcea5 Use sbuf_cat() in GEOM confxml generation.
When it comes to megabytes of text, difference between sbuf_printf() and
sbuf_cat() becomes substantial.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-19 15:36:02 +00:00
Alexander Motin
eeea0fcf0f Fix typo in r349178.
Reported by:	ae
MFC after:	1 week
2019-06-19 13:30:50 +00:00
Alexander Motin
5c32e9fcb2 Optimize kern.geom.conf* sysctls.
On large systems those sysctls may generate megabytes of output.  Before
this change sbuf(9) code was resizing buffer by 4KB each time many times,
generating tons of TLB shootdowns.  Unfortunately in this case existing
sbuf_new_for_sysctl() mechanism, supposed to help with this issue, is not
applicable, since all the sbuf writes are done in different kernel thread.

This change improves situation in two ways:
 - on first sysctl call, not providing any output buffer, it sets special
sbuf drain function, just counting the data and so not needing big buffer;
 - on second sysctl call it uses as initial buffer size value saved on
previous call, so that in most cases there will be no reallocation, unless
GEOM topology changed significantly.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-06-18 21:05:10 +00:00
Xin LI
f89d207279 Separate kernel crc32() implementation to its own header (gsb_crc32.h) and
rename the source to gsb_crc32.c.

This is a prerequisite of unifying kernel zlib instances.

PR:		229763
Submitted by:	Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision:	https://reviews.freebsd.org/D20193
2019-06-17 19:49:08 +00:00
Alexander Motin
c80038a0a7 Update td_runtime of running thread on each statclock().
Normally td_runtime is updated on context switch, but there are some kernel
threads that due to high absolute priority may run for many seconds without
context switches (yes, that is bad, but that is true), which means their
td_runtime was not updated all that time, that made them invisible for top
other then as some general CPU usage.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-06-14 01:09:10 +00:00
Stephen Hurd
705aad98c6 Some devices take undesired actions when RTS and DTR are
asserted. Some development boards for example will reset on DTR,
and some radio interfaces will transmit on RTS.

This patch allows "stty -f /dev/ttyu9.init -rtsdtr" to prevent
RTS and DTR from being asserted on open(), allowing these devices
to be used without problems.

Reviewed by:    imp
Differential Revision:  https://reviews.freebsd.org/D20031
2019-06-12 18:07:04 +00:00
John Baldwin
0f70218343 Make the warning intervals for deprecated crypto algorithms tunable.
New sysctl/tunables can now set the interval (in seconds) between
rate-limited crypto warnings.  The new sysctls are:
- kern.cryptodev_warn_interval for /dev/crypto
- net.inet.ipsec.crypto_warn_interval for IPsec
- kern.kgssapi_warn_interval for KGSSAPI

Reviewed by:	cem
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D20555
2019-06-11 23:00:55 +00:00
John Baldwin
19c40c76e1 Trim an extra space. 2019-06-11 22:06:05 +00:00
Bjoern A. Zeeb
4c62bffef5 Fix dpcpu and vnet panics with complex types at the end of the section.
Apply a linker script when linking i386 kernel modules to apply padding
to a set_pcpu or set_vnet section.  The padding value is kind-of random
and is used to catch modules not compiled with the linker-script, so
possibly still having problems leading to kernel panics.

This is needed as the code generated on certain architectures for
non-simple-types, e.g., an array can generate an absolute relocation
on the edge (just outside) the section and thus will not be properly
relocated. Adding the padding to the end of the section will ensure
that even absolute relocations of complex types will be inside the
section, if they are the last object in there and hence relocation will
work properly and avoid panics such as observed with carp.ko or ipsec.ko.

There is a rather lengthy discussion of various options to apply in
the mentioned PRs and their depends/blocks, and the review.
There seems no best solution working across multiple toolchains and
multiple version of them, so I took the liberty of taking one,
as currently our users (and our CI system) are hitting this on
just i386 and we need some solution.  I wish we would have a proper
fix rather than another "hack".

Also backout r340009 which manually, temporarily fixed CARP before 12.0-R
"by chance" after a lead-up of various other link-elf.c and related fixes.

PR:			230857,238012
With suggestions from:	arichardson (originally last year)
Tested by:		lwhsu
Event:			Waterloo Hackathon 2019
Reported by:		lwhsu, olivier
MFC after:		6 weeks
Differential Revision:	https://reviews.freebsd.org/D17512
2019-06-08 17:44:42 +00:00
Alan Somers
46f8169aea Add a testing facility to manually reclaim a vnode
Add the debug.try_reclaim_vnode sysctl. When a pathname is written to it, it
will be reclaimed, as long as it isn't already or doomed. The purpose is to
gain test coverage for vnode reclamation, which is otherwise hard to
achieve.

Add the debug.ftry_reclaim_vnode sysctl.  It does the same thing, except
that its argument is a file descriptor instead of a pathname.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20519
2019-06-06 15:04:50 +00:00
Ed Maste
004caac2d8 style(9) / tidying for r348611
MFC with:	r348611
Event:		Waterloo Hackathon 2019
2019-06-04 13:45:30 +00:00
Ed Maste
74cd06b42e Expose the kernel's build-ID through sysctl
After our migration (of certain architectures) to lld the kernel is built
with a unique build-ID.  Make it available via a sysctl and uname(1) to
allow the user to identify their running kernel.

Submitted by:	Ali Mashtizadeh <ali_mashtizadeh.com>
MFC after:	2 weeks
Relnotes:	Yes
Event:		Waterloo Hackathon 2019
Differential Revision:	https://reviews.freebsd.org/D20326
2019-06-04 13:07:10 +00:00
John Baldwin
e8d8cc0139 Warn about deprecated features on all major OS versions.
Reviewed by:	imp
MFC after:	3 days
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D20490
2019-06-03 15:43:40 +00:00
Konstantin Belousov
f6e5ddff6b Remove dead check.
We already handled the case when symstrindex < 0 at line 680.

Reported by:	danfe using PVS-studio
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-03 15:23:37 +00:00
Brooks Davis
4af6033324 makesyscalls.sh: always use absolute path for syscalls.conf
syscalls.conf is included using "." which per the Open Group:

 If file does not contain a <slash>, the shell shall use the search
 path specified by PATH to find the directory containing file.

POSIX shells don't fall back to the current working directory.

Submitted by:	Nathaniel Wesley Filardo <nwf20@cl.cam.ac.uk>
Reviewed by:	bdrewery
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D20476
2019-05-30 20:56:23 +00:00
Dmitry Chagin
c8124e20e5 Remove wrong inline keyword.
Reported by:	markj
MFC after:	1 week
2019-05-30 16:11:20 +00:00
Konstantin Belousov
5c066cd2e2 Remove TODO comment after posixshmcontrol(1) added.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-05-30 16:04:00 +00:00
Konstantin Belousov
5d993207da Silence witness warning about duplicated mutex type.
The order is correct, it is nullfs vnode interlock -> lower vnode
interlock.  vop_stdadd_writecount() is called from nullfs
VOP_ADD_WRITECOUNT() and both take interlocks.

Requested by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2019-05-30 15:04:09 +00:00
Dmitry Chagin
c5afec6e89 Complete LOCAL_PEERCRED support. Cache pid of the remote process in the
struct xucred. Do not bump XUCRED_VERSION as struct layout is not changed.

PR:		215202
Reviewed by:	tijl
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20415
2019-05-30 14:24:26 +00:00
Konstantin Belousov
ab74c84333 Do not go into sleep in sleepq_catch_signals() when SIGSTOP from
PT_ATTACH was consumed.

In particular, do not clear TDP_FSTP in ptracestop() if td_wchan is
non-NULL. Leave it to sleepq_catch_signal() to clear and convert zero
return code to EINTR.

Otherwise, per submitter report, if the PT_ATTACH SIGSTOP was
delivered right after the thread was added to the sleepqueue but not
yet really sleep, and cursig() caused debugger attach, the thread
sleeps instead of returning to the userspace boundary with EINTR.

PR: 231445
Reported by:	Efi Weiss <valmarelox@gmail.com>
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D20381
2019-05-29 14:05:27 +00:00
Andrew Turner
5393cbce3d Teach the kernel KUBSAN runtime about alignment_assumption
This checks the alignment of a given pointer is sufficient for the
requested alignment asked for. This fixes the build with a recent
llvm/clang.

Sponsored by:	DARPA, AFRL
2019-05-28 09:12:15 +00:00
Justin Hibbits
a5868885fa kern/CTF: link_elf_ctf_get() on big endian platforms
Check the CTF magic number in big endian platforms.  This lets DTrace FBT
handle types correctly on these platforms.

Submitted by:	Brandon Bergren
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20413
2019-05-27 04:20:31 +00:00
Conrad Meyer
5d0e829978 Disable intr_storm_threshold mechanism by default
The ixl.4 manual page has documented that the threshold falsely detects
interrupt storms on 40Gbit NICs as long ago as 2015, and we have seen
similar false positives with the ioat(4) DMA device (which can push GB/s).

For example, synthetic load can be generated with tools/tools/ioat
'ioatcontrol 0 200 8192 1 1000' (allocate 200x8kB buffers, generate an
interrupt for each one, and do this for 1000 milliseconds).  With
storm-detection disabled, the Broadwell-EP version of this device is capable
of generating ~350k real interrupts per second.

The following historical context comes from jhb@: Originally, the threshold
worked around incorrect routing of PCI INTx interrupts on single-CPU systems
which would end up in a hard hang during boot.  Since the threshold was
added, our PCI interrupt routing was improved, most PCI interrupts use
edge-triggered MSI instead of level-triggered INTx, and typical systems have
multiple CPUs available to service interrupts.

On the off chance that the threshold is useful in the future, it remains
available as a tunable and sysctl.

Reviewed by:	jhb
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20401
2019-05-24 22:33:14 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
Alan Somers
65417f5e27 Remove "struct ucred*" argument from vtruncbuf
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it.  Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20377
2019-05-24 20:27:50 +00:00
Conrad Meyer
8298529226 EKCD: Add Chacha20 encryption mode
Add Chacha20 mode to Encrypted Kernel Crash Dumps.

Chacha20 does not require messages to be multiples of block size, so it is
valid to use the cipher on non-block-sized messages without the explicit
padding AES-CBC would require.  Therefore, allow use with simultaneous dump
compression.  (Continue to disallow use of AES-CBC EKCD with compression.)

dumpon(8) gains a -C cipher flag to select between chacha and aes-cbc.
It defaults to chacha if no -C option is provided.  The man page documents this
behavior.

Relnotes:	sure
Sponsored by:	Dell EMC Isilon
2019-05-23 20:12:24 +00:00
Konstantin Belousov
56d0e33e7a Add a kern.ipc.posix_shm_list sysctl.
The sysctl provides the listing on named linked posix shared memory
segments existing in the system.

Reuse shm_fill_kinfo() for filling individual struct kinfo_file.
Remove unneeded lock around reading of shmfd->shm_mode.

Reviewed by:	jilles, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20258
2019-05-23 12:35:40 +00:00