16646 Commits

Author SHA1 Message Date
kib
fc14c29b96 Silence witness warning about duplicated mutex type.
The order is correct, it is nullfs vnode interlock -> lower vnode
interlock.  vop_stdadd_writecount() is called from nullfs
VOP_ADD_WRITECOUNT() and both take interlocks.

Requested by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2019-05-30 15:04:09 +00:00
dchagin
a25b408b04 Complete LOCAL_PEERCRED support. Cache pid of the remote process in the
struct xucred. Do not bump XUCRED_VERSION as struct layout is not changed.

PR:		215202
Reviewed by:	tijl
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20415
2019-05-30 14:24:26 +00:00
kib
7219eaccfb Do not go into sleep in sleepq_catch_signals() when SIGSTOP from
PT_ATTACH was consumed.

In particular, do not clear TDP_FSTP in ptracestop() if td_wchan is
non-NULL. Leave it to sleepq_catch_signals() to clear and convert zero
return code to EINTR.

Otherwise, per submitter report, if the PT_ATTACH SIGSTOP was
delivered right after the thread was added to the sleepqueue but was not
yet actually sleeping, and cursig() caused debugger attach, the thread
sleeps instead of returning to the userspace boundary with EINTR.

PR: 231445
Reported by:	Efi Weiss <valmarelox@gmail.com>
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D20381
2019-05-29 14:05:27 +00:00
andrew
7c1bc5ffc2 Teach the kernel KUBSAN runtime about alignment_assumption
This checks that the alignment of a given pointer satisfies the requested
alignment. This fixes the build with a recent llvm/clang.

Sponsored by:	DARPA, AFRL
2019-05-28 09:12:15 +00:00
jhibbits
a8a0248516 kern/CTF: link_elf_ctf_get() on big endian platforms
Check the CTF magic number in big endian platforms.  This lets DTrace FBT
handle types correctly on these platforms.

Submitted by:	Brandon Bergren
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20413
2019-05-27 04:20:31 +00:00
cem
7aa3f66510 Disable intr_storm_threshold mechanism by default
The ixl(4) manual page has documented since at least 2015 that the threshold
falsely detects interrupt storms on 40Gbit NICs, and we have seen similar
false positives with the ioat(4) DMA device (which can push GB/s).

For example, synthetic load can be generated with tools/tools/ioat
'ioatcontrol 0 200 8192 1 1000' (allocate 200x8kB buffers, generate an
interrupt for each one, and do this for 1000 milliseconds).  With
storm-detection disabled, the Broadwell-EP version of this device is capable
of generating ~350k real interrupts per second.

The following historical context comes from jhb@: Originally, the threshold
worked around incorrect routing of PCI INTx interrupts on single-CPU systems
which would end up in a hard hang during boot.  Since the threshold was
added, our PCI interrupt routing was improved, most PCI interrupts use
edge-triggered MSI instead of level-triggered INTx, and typical systems have
multiple CPUs available to service interrupts.

On the off chance that the threshold is useful in the future, it remains
available as a tunable and sysctl.

Reviewed by:	jhb
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20401
2019-05-24 22:33:14 +00:00
jhb
5518ae8169 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output, sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code allocating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
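
The reference-counting life cycle described above can be sketched in userland C. This is a minimal illustration only: the toy_* names and the struct layout are hypothetical and are not the kernel's m_snd_tag API, and the ifp reference handling is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a send tag; not the kernel's struct m_snd_tag. */
struct toy_snd_tag {
    atomic_uint refcount;
    void (*free_cb)(struct toy_snd_tag *);  /* driver-specific teardown */
    bool freed;                             /* set by the demo callback */
};

/* The allocator hands back a tag holding one reference. */
static void
toy_tag_init(struct toy_snd_tag *t, void (*free_cb)(struct toy_snd_tag *))
{
    atomic_init(&t->refcount, 1);
    t->free_cb = free_cb;
    t->freed = false;
}

/* Every holder (socket, queued packet, wrapper tag) takes its own reference. */
static void
toy_tag_ref(struct toy_snd_tag *t)
{
    unsigned int old = atomic_fetch_add(&t->refcount, 1);

    assert(old > 0);    /* the caller must already hold a reference */
    (void)old;
}

/* Dropping the last reference invokes the teardown callback exactly once. */
static void
toy_tag_rele(struct toy_snd_tag *t)
{
    unsigned int old = atomic_fetch_sub(&t->refcount, 1);

    assert(old > 0);
    if (old == 1)
        t->free_cb(t);
}

/* Demo callback that only records that teardown happened. */
static void
toy_tag_free(struct toy_snd_tag *t)
{
    t->freed = true;
}
```

With this shape, a packet still sitting in a transmit ring keeps the tag alive after the socket drops its reference, which is the use-after-free case the commit message mentions.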

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
asomers
6798617f6d Remove "struct ucred*" argument from vtruncbuf
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it.  Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20377
2019-05-24 20:27:50 +00:00
cem
935cac69d7 EKCD: Add Chacha20 encryption mode
Add Chacha20 mode to Encrypted Kernel Crash Dumps.

Chacha20 does not require messages to be multiples of block size, so it is
valid to use the cipher on non-block-sized messages without the explicit
padding AES-CBC would require.  Therefore, allow use with simultaneous dump
compression.  (Continue to disallow use of AES-CBC EKCD with compression.)

dumpon(8) gains a -C cipher flag to select between chacha and aes-cbc.
It defaults to chacha if no -C option is provided.  The man page documents this
behavior.

Relnotes:	sure
Sponsored by:	Dell EMC Isilon
2019-05-23 20:12:24 +00:00
kib
83a359ea2a Add a kern.ipc.posix_shm_list sysctl.
The sysctl provides a listing of the named, linked POSIX shared memory
segments existing in the system.

Reuse shm_fill_kinfo() for filling individual struct kinfo_file.
Remove unneeded lock around reading of shmfd->shm_mode.

Reviewed by:	jilles, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20258
2019-05-23 12:35:40 +00:00
kib
b45119a2cb Report ref count of the backing object as st_nlink for posix shm fd.
Unless there are transient references to the object, the ref count is
equal to the number of the shared memory segment mappings plus one.

Reviewed by:	jilles, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20258
2019-05-23 12:27:45 +00:00
kib
f6d894c8f2 Make pack_kinfo() available for external callers.
Reviewed by:	jilles, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20258
2019-05-23 12:25:03 +00:00
cem
afcbe159e2 mqueuefs: Do not allow manipulation of the pseudo-dirents "." and ".."
"." and ".." names are not maintained in the mqueuefs dirent datastructure and
cannot be opened as mqueues.  Creating or removing them is invalid; return
EINVAL instead of crashing.

PR:		236836
Submitted by:	Torbjørn Birch Moltu <t.b.moltu AT lyse.net>
Discussed with:	jilles (earlier version)
2019-05-21 21:26:14 +00:00
cem
3038f1af7b Include ktr.h in more compilation units
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes.  Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone.  Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by:	peterj (arm64/mp_machdep.c)
X-MFC-With:	r347984
Sponsored by:	Dell EMC Isilon
2019-05-21 20:38:48 +00:00
kib
c84facbfbd NDFREE(): Fix unlocking for LOCKPARENT|LOCKLEAF and ndp->ni_dvp == ndp->ni_vp.
NDFREE() calculates unlock_dvp after ndp->ni_vp is unlocked and zeroed
out. This makes the comparison of ni_dvp with ni_vp always fail.
Move the calculation of unlock_dvp right after unlock_vp, so that the
code sees correct ni_vp value.

Reproduced by
	   chdir("/usr");
	   open("/..", O_BENEATH | O_RDONLY);
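
The ordering issue can be illustrated with a simplified userland sketch. The toy_* types are hypothetical, the flag handling of the real NDFREE() is left out, and the lock is modeled as a plain counter; the only point shown is that the decision about the parent must be taken before ni_vp is cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vnode whose lock is modeled as a simple hold count. */
struct toy_vnode {
    int locks;
};

struct toy_nameidata {
    struct toy_vnode *ni_dvp;   /* parent directory */
    struct toy_vnode *ni_vp;    /* leaf */
};

static void
toy_unlock(struct toy_vnode *vp)
{
    assert(vp->locks > 0);      /* a double unlock would trip this */
    vp->locks--;
}

static void
toy_ndfree(struct toy_nameidata *ndp)
{
    bool unlock_vp = (ndp->ni_vp != NULL);
    /*
     * Decide about the parent while ni_vp still holds its real value.
     * Computing this after ni_vp is set to NULL would make the
     * inequality always true and unlock the same vnode twice when
     * ni_dvp == ni_vp.
     */
    bool unlock_dvp = (ndp->ni_dvp != NULL && ndp->ni_dvp != ndp->ni_vp);

    if (unlock_vp) {
        toy_unlock(ndp->ni_vp);
        ndp->ni_vp = NULL;
    }
    if (unlock_dvp)
        toy_unlock(ndp->ni_dvp);
    ndp->ni_dvp = NULL;
}
```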

Reported by:	syzkaller
Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20304
2019-05-21 15:12:13 +00:00
stevek
b5aa188649 The older detection methods (smbios.bios.vendor and smbios.system.product)
are able to determine some virtual machines, but the vm_guest variable was
still only being set to VM_GUEST_VM.

Since we do know what some of them specifically are, we can set vm_guest
appropriately.

Also, if we see the CPUID has the HV flag, but we were unable to find a
definitive vendor in the Hypervisor CPUID Information Leaf, fall back to
the older detection methods, as they may be able to determine a specific
HV type.

Add VM_GUEST_PARALLELS value to VM_GUEST for Parallels.

Approved by:	cem
Differential Revision:	https://reviews.freebsd.org/D20305
2019-05-21 13:29:53 +00:00
markj
ecca59505e kcov depends on eventhandler.h.
MFC after:	3 days
2019-05-20 19:14:07 +00:00
cem
250e158ddf Extract eventhandler declarations to sys/_eventhandler.h
This allows replacing "sys/eventhandler.h" includes with "sys/_eventhandler.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
kib
312be5657c Fix rw->ro remount when there is a text vnode mapping.
Reported and tested by:	hrs
Sponsored by:	The FreeBSD Foundation
MFC after:	16 days
2019-05-19 09:18:09 +00:00
markj
9771568914 Update the DIAGNOSTIC-only vmem_check_sanity() after r347949.
Cursor tags are special and shouldn't be subject to the existing checks.

Reported by:	kib, David Wolfskill
MFC with:	r347949
2019-05-18 14:19:23 +00:00
markj
7d39a491bf Implement the M_NEXTFIT allocation strategy for vmem(9).
This is described in the vmem paper: "directs vmem to use the next free
segment after the one previously allocated."  The implementation adds a
new boundary tag type, M_CURSOR, which is linked into the segment list
and precedes the segment following the previous M_NEXTFIT allocation.
The cursor is used to locate the next free segment satisfying the
allocation constraints.

This implementation isn't O(1) since busy tags aren't coalesced, and we
may potentially scan the entire segment list during an M_NEXTFIT
allocation.

Reviewed by:	alc
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D17226
2019-05-18 01:46:38 +00:00
kib
a2f4dfb45b Grammar fixes for r347690.
Submitted by:	alc
MFC after:	3 days
2019-05-17 21:18:11 +00:00
stevek
dd31a8ef39 Instead of individual conditional statements to look for each hypervisor
type, use a table to make it easier to add more in the future, if needed.

Add VirtualBox detection to the table ("VBoxVBoxVBox" is the hypervisor
vendor string to look for.) Also add VM_GUEST_VBOX to the VM_GUEST
enumeration to indicate VirtualBox.
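
A table-driven lookup of this kind might look like the sketch below. The enum and function names are hypothetical; "VBoxVBoxVBox" comes from the message above, while the KVM entry is added only as a second illustrative row and is an assumption about what such a table could contain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical guest types; the kernel's VM_GUEST enumeration differs. */
enum toy_vm_guest {
    TOY_GUEST_NO,
    TOY_GUEST_VM,       /* some hypervisor, vendor not recognized */
    TOY_GUEST_KVM,
    TOY_GUEST_VBOX
};

/* One row per known hypervisor vendor string. */
static const struct {
    const char *vendor;
    enum toy_vm_guest guest;
} toy_vm_table[] = {
    { "KVMKVMKVM",    TOY_GUEST_KVM },   /* illustrative extra entry */
    { "VBoxVBoxVBox", TOY_GUEST_VBOX },
};

/* Map a vendor string to a guest type, defaulting to the generic value. */
static enum toy_vm_guest
toy_identify_hypervisor(const char *vendor)
{
    size_t i;

    assert(vendor != NULL);
    for (i = 0; i < sizeof(toy_vm_table) / sizeof(toy_vm_table[0]); i++) {
        if (strcmp(vendor, toy_vm_table[i].vendor) == 0)
            return (toy_vm_table[i].guest);
    }
    return (TOY_GUEST_VM);
}
```

Adding another hypervisor then means adding one row rather than another conditional.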

Save the CPUID base for the hypervisor entry that we detected. Driver code
may need to know about it in order to obtain additional CPUID features.

Approved by:	bryanv, jhb
Differential Revision:	https://reviews.freebsd.org/D16305
2019-05-17 17:21:32 +00:00
kib
b10ca25384 amd64 pmap: rework delayed invalidation, removing global mutex.
For machines having cmpxchg16b instruction, i.e. everything but very
early Athlons, provide lockless implementation of delayed
invalidation.

The implementation maintains a lockless singly-linked list with the
trick from the T.L. Harris article about volatile mark of the elements
being removed. Double-CAS is used to atomically update both link and
generation.  New thread starting DI appends itself to the end of the
queue, setting the generation to the generation of the last element
+1.  On DI finish, thread donates its generation to the previous
element.  The generation of the fake head of the list is the last
passed DI generation.  Basically, the implementation is a queued
spinlock but without spinlock.

Many thanks to both Peter Holm and Mark Johnston for bearing with me
while I produced intermediate versions of the patch.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
MFC note:	td_md.md_invl_gen should go to the end of struct thread
Differential revision:	https://reviews.freebsd.org/D19630
2019-05-16 13:28:48 +00:00
kib
d20c4c3cb2 subr_turnstile: Extract some common code to a helper.
Code walks the list of contested turnstiles to calculate the priority
to unlend.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-05-16 13:17:57 +00:00
kib
27769d9344 imgact_elf.c: Add comment explaining the malloc/VOP_UNLOCK() dance
from r347148.

Requested by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-05-16 13:03:54 +00:00
ae
1d10d5539c Remove bpf interface lock; it no longer exists.
2019-05-14 10:21:28 +00:00
cem
29adbefa47 Revert r346292 (permit_nonrandom_stackcookies)
We have a better, more comprehensive knob for this now:
kern.random.initial_seeding.bypass_before_seeding=1.

Requested by:	delphij
Sponsored by:	Dell EMC Isilon
2019-05-13 23:37:44 +00:00
mjg
18f8f341f0 cache: fix a brainfart in r347505
If bumping the counter goes over the limit we have to decrement it back.

Previous code would only bump the counter after adding the entry (thus allowing
the cache to go over the limit).
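
The bump-then-undo pattern can be sketched with C11 atomics. The function name and the use of atomic_long are assumptions made for illustration; the kernel code uses its own atomic primitives and counter type.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Reserve a slot before inserting an entry.  The counter is bumped first;
 * if that pushes it past the limit, the bump is undone and the caller
 * must not add the entry.
 */
static bool
toy_cache_reserve(atomic_long *count, long limit)
{
    long newval;

    assert(limit > 0);
    newval = atomic_fetch_add(count, 1) + 1;
    if (newval > limit) {
        atomic_fetch_sub(count, 1);     /* decrement it back */
        return (false);
    }
    return (true);
}
```

Because the reservation happens on entry, the counter can exceed the limit only briefly, between the bump and the undo, rather than staying over it once the entry is added.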

Sponsored by:	The FreeBSD Foundation
2019-05-12 07:56:01 +00:00
mjg
568cd19283 cache: bump numcache on entry, while here fix lnumcache type
Sponsored by:	The FreeBSD Foundation
2019-05-12 06:59:22 +00:00
mjg
7e703db433 cache: push sdt probes in cache_zap_locked to code doing the work
Avoids branching to check which probe to evaluate. The very same check was
being done later to do the actual work.

Sponsored by:	The FreeBSD Foundation
2019-05-12 06:39:30 +00:00
dougm
24c307c3c0 A new parameter to blist_alloc specifies an upper bound on the size of
the allocation request, so that the blocks allocated are from the next
set of free blocks big enough to satisfy the minimum requirements of
the request, and the number of blocks allocated are as many as
possible, up to the specified maximum. The implementation of
swp_pager_getswapspace uses this parameter to ask for a number of
blocks between the new halved request size and the previous failed
request size. Thus a request for 32 blocks may fail, but instead of
getting only 16 blocks, the caller asks for 16 to 31 next, and
might get 19 or 27, which is closer to what they originally wanted.

I expect this to lead to bigger block allocations and less block
fragmentation, at least in some cases.
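
The minimum/maximum request semantics can be shown on a toy allocator. This sketch uses a flat array of busy flags instead of the blist radix tree, and toy_alloc_range is a hypothetical name; it only demonstrates "first free run that satisfies the minimum, extended up to the maximum".

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Find the first free run of at least *count blocks and allocate as much
 * of it as possible, up to maxcount blocks.  On success, return the first
 * block and store the number of blocks actually allocated in *count.
 * On failure, return -1 and leave *count unchanged.
 */
static int
toy_alloc_range(bool *busy, int nblocks, int *count, int maxcount)
{
    int i, j, start, len;

    assert(*count > 0 && *count <= maxcount);
    i = 0;
    while (i < nblocks) {
        if (busy[i]) {
            i++;
            continue;
        }
        /* Measure the free run starting here, capped at maxcount. */
        start = i;
        while (i < nblocks && !busy[i] && i - start < maxcount)
            i++;
        len = i - start;
        if (len >= *count) {
            for (j = start; j < i; j++)
                busy[j] = true;
            *count = len;
            return (start);
        }
        /* Run too short; i now sits on a busy block or the end. */
    }
    return (-1);
}
```

With a 19-block free run as the first one large enough, a request for 16 to 31 blocks returns all 19 rather than only 16.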

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20001
2019-05-11 16:15:13 +00:00
dougm
0ecc670646 When bitpos can't be implemented with an inline ffs* instruction,
change the binary search so that it does not depend on a single bit
only being set in the bitmask. Use bitpos more generally, and avoid
some clearing of bits to accommodate its current behavior.
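
One way to write such a binary search is sketched below. It returns the position of the lowest set bit and stays correct when several bits are set; the name toy_bitpos and the 64-bit width are assumptions, and this is not the code from the commit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Binary search for the position of the lowest set bit in mask.
 * Invariant: the lowest set bit lies in the half-open range [lo, hi).
 */
static int
toy_bitpos(uint64_t mask)
{
    int lo, hi, mid;

    assert(mask != 0);
    lo = 0;
    hi = 64;
    while (lo + 1 < hi) {
        mid = (lo + hi) >> 1;
        /* Is any bit set strictly below position mid? */
        if (mask & ((UINT64_C(1) << mid) - 1))
            hi = mid;
        else
            lo = mid;
    }
    return (lo);
}
```

A search that tests whether the single set bit lies in the upper half works only for one-bit masks; testing the low half for any set bit, as here, removes that restriction.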

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20237
2019-05-11 09:09:10 +00:00
dougm
09ef417213 Revert r347469.
Approved by: kib (mentor)
2019-05-11 02:13:52 +00:00
dougm
a2a4184084 Don't use _Generic, as many systems don't know about it. Go back to a lo-tech switch statement.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20235
2019-05-10 23:12:37 +00:00
dougm
80622341ff When bitpos can't be implemented with an inline ffs* instruction,
change the binary search so that it does not depend on a single bit
only being set in the bitmask. Use bitpos more generally, and avoid
some clearing of bits to accommodate its current behavior.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20232
2019-05-10 22:49:01 +00:00
dougm
a2af9838ad Add a (q)uit option to the subr_blist test program.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20234
2019-05-10 22:02:29 +00:00
dougm
28ca43d2eb Replace the expression "-mask & ~mask" with a function call that does
the same thing, but is commented so that it might be better
understood.
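
For readers unfamiliar with the idiom, a commented helper along these lines shows what the expression computes. The function name is hypothetical and the 64-bit type is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * -mask keeps the lowest set bit of mask, clears everything below it, and
 * inverts everything above it.  ANDing with ~mask then removes the lowest
 * set bit itself, leaving the complement of mask restricted to positions
 * above that bit.  When mask is one contiguous run of 1 bits, the result
 * is simply all bits above the run.
 */
static inline uint64_t
toy_bits_above_lowest(uint64_t mask)
{
    assert(mask != 0);
    return (-mask & ~mask);
}
```

For example, the run 0x0C (bits 2 and 3) maps to a value with every bit from position 4 upward set.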

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20231
2019-05-10 19:55:29 +00:00
dougm
64c7cf7dc3 blist_next_leaf_alloc walks over all the meta-nodes between one leaf
and the next one, and if blocks are allocated from the next leaf, it
walks back toward where it started, as long as there are interleaving
meta-nodes to be updated on account of the last free blocks under
those meta-nodes being allocated. Only if the walk goes all the way
back to the starting point must we calculate the position of the
meta-node that is the least-common parent of one leaf and the next,
and update a bit in that meta-node to indicate the allocation of its
last free block.

There's no need to start calculating the position of that least-common
parent until the walk back reaches the original starting point, and
there's no need for a calculation that updates 'radius' to tell us
when we've walked back to the beginning, since comparing scan to next
suffices for that.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20229
2019-05-10 18:25:06 +00:00
dougm
a198e1a1fe Replace panic() with KASSERT() and provide more useful information when failure happens.
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20226
2019-05-10 18:22:40 +00:00
gallatin
fbc304aae0 Bind TCP HPTS (pacer) threads to NUMA domains
Bind the TCP pacer threads to NUMA domains and build per-domain
pacer-thread lookup tables. These tables allow us to use the
inpcb's NUMA domain information to match an inpcb with a pacer
thread on the same domain.

The motivation for this is to keep the TCP connection local to a
NUMA domain as much as possible.

Thanks to jhb for pre-reviewing an earlier version of the patch.

Reviewed by:	rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20134
2019-05-10 13:41:19 +00:00
mjg
c1523e6e58 Reduce umtx-related work on exec and exit
- there is no need to take the process lock to iterate the thread
  list after single-threading is enforced
- typically there are no mutexes to clean up (testable without taking
  the global umtx lock)
- typically there is no need to adjust the priority (testable without
  taking thread lock)

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20160
2019-05-08 16:30:38 +00:00
emaste
206ba42431 make sysent after r347228
Regenerate to add @generated tag in generated files.
2019-05-07 18:10:21 +00:00
cem
38fa0aaa51 device_printf: Use sbuf for more coherent prints on SMP
device_printf does multiple calls to printf allowing other console messages to
be inserted between the device name, and the rest of the message.  This change
uses sbuf to compose the two into a single buffer, and prints it all at once.

It exposes an sbuf drain function (drain-to-printf) for common use.

Update documentation to match; some unit tests included.

Submitted by:	jmg
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D16690
2019-05-07 17:47:20 +00:00
emaste
c7029e0b4a makesyscalls: use @generated tag in generated files
Multiple tools use @generated to identify generated files (for example,
in a review Phabricator will by default hide diffs in generated files).
Use the @generated tag in makesyscalls.sh as we've done for other
generated files.

Reviewed by:	cem
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20183
2019-05-07 16:17:33 +00:00
markj
72c6d2d761 Simplify the test against maxproc in fork1().
Previously nprocs_new would be tested against maxproc twice when
nprocs_new < maxproc - 10.  Eliminate the unnecessary comparison.

Submitted by:	Wuyang Chung <wuyang.chung1@gmail.com>
GitHub PR:	https://github.com/freebsd/freebsd/pull/397
MFC after:	1 week
2019-05-07 15:03:26 +00:00
dougm
7da63e7938 The intention of the blist cursor is for the search for free blocks to
resume where the last search left off. Suppose that there are no free
blocks of size 32, but plenty of size 16. If we repeatedly request
size 32 blocks, fail, and retry with size 16 blocks, then the failures
all reset the cursor to the beginning of memory, making the 16 block
allocation use a first fit, rather than next fit, strategy.

This change has blist_alloc make a copy of the cursor for its own
decision making, and only updates the real blist cursor after a
successful allocation, making those 16 block searches behave like
next-fit searches.
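
The cursor handling can be illustrated with a toy next-fit allocator over a flat array. The structure and function names are hypothetical and the real blist is a radix tree; the sketch only shows that a failed request leaves the shared cursor where it was.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBLOCKS 64

struct toy_blist {
    bool busy[TOY_NBLOCKS];
    int cursor;     /* where the next search starts */
};

/* Return the start of the first free run of count blocks in [start, end). */
static int
toy_find_run(const struct toy_blist *bl, int start, int end, int count)
{
    int i, run;

    run = 0;
    for (i = start; i < end; i++) {
        run = bl->busy[i] ? 0 : run + 1;
        if (run == count)
            return (i - count + 1);
    }
    return (-1);
}

static int
toy_blist_alloc(struct toy_blist *bl, int count)
{
    int blk, cursor, i;

    assert(count > 0);
    cursor = bl->cursor;        /* private copy for this search */
    blk = toy_find_run(bl, cursor, TOY_NBLOCKS, count);
    if (blk < 0 && cursor != 0) {
        cursor = 0;             /* wrap around, still only the copy */
        blk = toy_find_run(bl, cursor, TOY_NBLOCKS, count);
    }
    if (blk < 0)
        return (-1);            /* bl->cursor is left untouched */
    for (i = blk; i < blk + count; i++)
        bl->busy[i] = true;
    bl->cursor = blk + count;   /* commit only after success */
    return (blk);
}
```

In the scenario from the message, a failed 32-block request between two 16-block requests no longer sends the second 16-block search back to the start.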

Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20177
2019-05-06 22:12:15 +00:00
cem
6058a49bde List-ify kernel dump device configuration
Allow users to specify multiple dump configurations in a prioritized list.
This enables fallback to secondary device(s) if primary dump fails.  E.g.,
one might configure a preference for netdump, but fallback to disk dump as a
second choice if netdump is unavailable.

This change does not list-ify netdump configuration, which is tracked
separately from ordinary disk dumps internally; only one netdump
configuration can be made at a time, for now.  It also does not implement
IPv6 netdump.

savecore(8) is already capable of scanning and iterating multiple devices
from /etc/fstab or passed on the command line.

This change doesn't update the rc or loader variables 'dumpdev' in any way;
it can still be set to configure a single dump device, and rc.d/savecore
still uses it as a single device.  Only dumpon(8) is updated to be able to
configure the more complicated configurations for now.

As part of revving the ABI, unify netdump and disk dump configuration ioctl
/ structure, and leave room for IPv6 netdump as a future possibility.
Backwards-compatibility ioctls are added to smooth ABI transition,
especially for developers who may not keep kernel and userspace perfectly
synced.

Reviewed by:	markj, scottl (earlier version)
Relnotes:	maybe
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19996
2019-05-06 18:24:07 +00:00
kib
2dc0d9edaa Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
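
The sign convention can be sketched as follows: positive values count writers, negative values count text references, and each operation refuses a change that would contradict the current state. The toy_* names are hypothetical, and the vnode interlock that protects the real counter is omitted here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical vnode carrying only the counter of interest. */
struct toy_wc_vnode {
    int writecount;     /* > 0: writers, < 0: text references */
};

/* Add (or, with a negative inc, drop) write references. */
static int
toy_add_writecount(struct toy_wc_vnode *vp, int inc)
{
    if (vp->writecount < 0)
        return (ETXTBSY);       /* mapped as text, refuse writers */
    assert(vp->writecount + inc >= 0);
    vp->writecount += inc;
    return (0);
}

/* Take one text reference by decrementing the counter. */
static int
toy_set_text(struct toy_wc_vnode *vp)
{
    if (vp->writecount > 0)
        return (ETXTBSY);       /* open for writing, refuse exec */
    vp->writecount--;
    return (0);
}

/* Drop one text reference, e.g. when the mapping is removed. */
static void
toy_unset_text(struct toy_wc_vnode *vp)
{
    assert(vp->writecount < 0);
    vp->writecount++;
}
```

Because the check and the update happen in one operation, there is no window between testing for writability and taking the reference, which is the race the message attributes to vn_writecheck().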

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
kib
e3b87f7a32 imgact_elf: do not relock the text vnode if possible.
We unlock the vnode around malloc(M_WAITOK), to make it possible for
pagedaemon to flush vnode pages for us.  Instead of doing it
unconditionally, first try M_NOWAIT allocation, which typically
succeeds.  Only on failure, unlock the vnode and retry with M_WAITOK.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:04:01 +00:00