Commit Graph

8251 Commits

Author SHA1 Message Date
Alexander Motin
f2b6da18de Avoid code duplicaiton by using ipi_selected().
MFC after:	2 weeks
2020-07-21 17:18:38 +00:00
Konstantin Belousov
0675f4e1ca Simplify non-pti syscall entry on amd64.
Limit manipulations to use %rax as scratch to the pti portion of the
syscall entry code.

Submitted by:	alc
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25722
2020-07-19 17:47:55 +00:00
Konstantin Belousov
3ec7e1695c amd64 pmap: microoptimize local shootdowns for PCID PTI configurations
When pmap operates in PTI mode, we must reload %cr3 on return to
userspace.  In non-PCID mode the reload always flushes all non-global
TLB entries and we take advantage of it by only invalidating the KPT
TLB entries (there is no cached UPT entries at all).

In PCID mode, we flush both KPT and UPT TLB explicitly, but we can
take advantage of the fact that PCID mode command to reload %cr3
includes a flag to flush/not flush target TLB.  In particular, we can
avoid the flush for UPT, instead record that load of pc_ucr3 into %cr3
on return to usermode should be flushing.  This is done by providing
either all-1s or ~CR3_PCID_MASK in pc_ucr3_load_mask.  The mask is
automatically reset to all-1s on return to usermode.

Similarly, we can avoid flushing UPT TLB on context switch, replacing
it by setting pc_ucr3_load_mask.  This unifies INVPCID and non-INVPCID
PTI ifunc, leaving only 4 cases instead of 6.  This trick is also
applicable both to the TLB shootdown IPI handlers, since handlers
interrupt the target thread.

But then we need to check pc_curpmap in handlers, and this would
reopen the same race for INVPCID machines as was fixed in r306350 for
non-INVPCID.  To not introduce the same bug, unconditionally do
spinlock_enter() in pmap_activate().

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D25483
2020-07-18 18:19:57 +00:00
Edward Tomasz Napierala
3e9a214260 Regen after r363304.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-07-18 11:31:31 +00:00
Edward Tomasz Napierala
8d1d017175 Add a trivial linux(4) splice(2) implementation, which simply
returns EINVAL.  Fixes grep (grep-3.1-2build1).

PR:		kern/218699
Reported by:	avos
Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25636
2020-07-18 11:28:40 +00:00
Conrad Meyer
4ae224c663 Revert r240317 to prevent leaking pmap entries
Subsequent to r240317, kmem_free() was replaced with kva_free() (r254025).
kva_free() releases the KVA allocation for the mapped region, but no longer
clears the pmap (pagetable) entries.

An affected pmap_unmapdev operation would leave the still-pmap'd VA space
free for allocation by other KVA consumers.  However, this bug easily
avoided notice for ~7 years because most devices (1) never call
pmap_unmapdev and (2) on amd64, mostly fit within the DMAP and do not need
KVA allocations.  Other affected arch are less popular: i386, MIPS, and
PowerPC.  Arm64, arm32, and riscv are not affected.

Reported by:	Don Morris <dgmorris AT earthlink.net>
Submitted by:	Don Morris (amd64 part)
Reviewed by:	kib, markj, Don (!amd64 parts)
MFC after:	I don't intend to, but you might want to
Sponsored by:	Dell Isilon
Differential Revision:	https://reviews.freebsd.org/D25689
2020-07-16 23:29:26 +00:00
Mark Johnston
e64080e79c Switch from SCTP to SCTP_SUPPORT in GENERIC configs.
This removes SCTP from in-tree kernel configuration files.  Now, SCTP
can be enabled by simply loading the module, as discussed on
freebsd-net@.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25611
2020-07-16 15:09:04 +00:00
Mateusz Guzik
c4e64133d8 amd64: patch ffsl to use the compiler builtin
This shortens fdalloc by over 60 bytes. Correctness verified by running both
variants at the same time and comparing the result of each call.

Note someone(tm) should make a pass at converting everything else feasible.
2020-07-16 11:28:24 +00:00
Konstantin Belousov
a64fab1dd1 Grammar and typo fixes.
Submitted by:	alc
MFC after:	20 days
2020-07-15 09:48:36 +00:00
Konstantin Belousov
dc43978aa5 amd64: allow parallel shootdown IPIs
Stop using smp_ipi_mtx to protect global shootdown state, and
move/multiply the global state into pcpu.  Now each CPU can initiate
shootdown IPI independently from other CPUs.  Initiator enters
critical section, then fills its local PCPU shootdown info
(pc_smp_tlb_XXX), then clears scoreboard generation at location (cpu,
my_cpuid) for each target cpu.  After that IPI is sent to all targets
which scan for zeroed scoreboard generation words.  Upon finding such
word the shootdown data is read from corresponding cpu' pcpu, and
generation is set.  Meantime initiator loops waiting for all zeroed
generations in scoreboard to update.

Initiator does not disable interrupts, which should allow
non-invalidation IPIs from deadlocking, it only needs to disable
preemption to pin itself to the instance of the pcpu smp_tlb data.

The generation is set before the actual invalidation is performed in
handler. It is safe because target CPU cannot return to userspace
before handler finishes. In principle only NMI can preempt the
handler, but NMI would see the kernel handler frame and not touch
not-invalidated user page table.

Handlers loop until they do not see zeroed scoreboard generations.
This, together with hardware keeping one pending IPI in LAPIC IRR
should prevent lost shootdowns.

Notes.
1. The code does protect writes to LAPIC ICR with exclusion. I believe
   this is fine because we in fact do not send IPIs from interrupt
   handlers. More for !x2APIC mode where ICR access for write requires
   two registers write, we disable interrupts around it. If considered
   incorrect, I can add per-cpu spinlock around ipi_send().
2. Scoreboard lines owned by given target CPU can be padded to the
   cache line, to reduce ping-pong.

Reviewed by:	markj (previous version)
Discussed with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D25510
2020-07-14 20:37:50 +00:00
Scott Long
ffc568ba8b Revert r362998, r326999 while a better compatibility strategy is devised. 2020-07-09 22:38:36 +00:00
Scott Long
b302c2e5c9 Migrate the feature of excluding RAM pages to use "excludelist"
as its nomenclature.

MFC after:	1 week
2020-07-07 20:33:11 +00:00
Andrew Turner
fcf7a48191 Rerun kernel ifunc resolvers after all CPUs have started
On architectures that use RELA relocations it is safe to rerun the ifunc
resolvers on after all CPUs have started, but while they are sill parked.

On arm64 with big.LITTLE this is needed as some SoCs have shipped with
different ID register values the big and little clusters meaning we were
unable to rely on the register values from the boot CPU.

Add support for rerunning the resolvers on arm64 and amd64 as these are
both RELA using architectures.

Reviewed by:	kib
Sponsored by:	Innovate UK
Differential Revision:	https://reviews.freebsd.org/D25455
2020-07-05 14:38:22 +00:00
Conrad Meyer
c74a3041f0 Add domain policy allocation for amd64 fpu_kern_ctx
Like other types of allocation, fpu_kern_ctx are frequently allocated per-cpu.
Provide the API and sketch some example consumers.

fpu_kern_alloc_ctx_domain() preferentially allocates memory from the
provided domain, and falls back to other domains if that one is empty
(DOMAINSET_PREF(domain) policy).

Maybe it makes more sense to just shove one of these in the DPCPU area
sooner or later -- left for future work.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22053
2020-07-03 14:54:46 +00:00
Conrad Meyer
64612d4e44 geom(4): Kill GEOM_PART_EBR_COMPAT option
Take advantage of Warner's nice new real GEOM aliasing system and use it for
aliased partition names that actually work.

Our canonical EBR partition name is the weird, not-default-on-x86-prior-to-
this-revision "da1p4+00001234."  However, if compatibility mode (tunable
kern.geom.part.ebr.compat_aliases) is enabled (1, default), we continue to
provide the alias names like "da1p5" in addition to the weird canonical
names.

Naming partition providers was just one aspect of the COMPAT knob; in
addition it limited mutability, in part because it did not preserve existing
EBR header content aside from that of LBA 0.  This change saves the EBR
header for LBA 0, as well as for every EBR partition encountered.  That way,
when we write out the EBR partition table on modification, we can restore
any bootloader or other metadata in both LBA0 (the first data-containing EBR
may start after 0) as well as every logical EBR we read from the disk, and
only update the geometry metadata and linked list pointers that describe the
actual partitioning.

(This change does not add support for the 'bootcode' verb to EBR.)

PR:		232463
Reported by:	Manish Jain <bourne.identity AT hotmail.com>
Discussed with:	ae (no objection)
Relnotes:	maybe
Differential Revision:	https://reviews.freebsd.org/D24939
2020-07-01 02:16:36 +00:00
Kyle Evans
5403f186a7 linuxolator: implement memfd_create syscall
This effectively mirrors our libc implementation, but with minor fudging --
name needs to be copied in from userspace, so we just copy it straight into
stack-allocated memfd_name into the correct position rather than allocating
memory that needs to be cleaned up.

The sealing-related fcntl(2) commands, F_GET_SEALS and F_ADD_SEALS, have
also been implemented now that we support them.

Note that this implementation is still not quite at feature parity w.r.t.
the actual Linux version; some caveats, from my foggy memory:

- Need to implement SHM_GROW_ON_WRITE, default for memfd (in progress)
- LTP wants the memfd name exposed to fdescfs
- Linux allows open() of an fdescfs fd with O_TRUNC to truncate after dup.
  (?)

Interested parties can install and run LTP from ports (devel/linux-ltp) to
confirm any fixes.

PR:		240874
Reviewed by:	kib, trasz
Differential Revision:	https://reviews.freebsd.org/D21845
2020-06-29 03:09:14 +00:00
Konstantin Belousov
557905569d amd64 pmap: explain ptepindex.
Reviewed by:	markj
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25187
2020-06-27 19:29:07 +00:00
Edward Tomasz Napierala
a39cdcd7e7 Regen.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-06-27 14:43:29 +00:00
Edward Tomasz Napierala
308e194cbf Add proper types for linux message queue syscalls; mostly taken
from 32-bit Linuxulator.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25386
2020-06-27 14:42:08 +00:00
Edward Tomasz Napierala
36507f85dc Add syscall definitions for linux xattr syscalls.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25387
2020-06-27 14:39:44 +00:00
Edward Tomasz Napierala
8036e7876d Adjust types of linuxulator syscalls, to match include/linux/syscalls.h
in vanilla Linux git tree.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25385
2020-06-27 14:37:36 +00:00
Conrad Meyer
4daa95f85d bhyve(8): For prototyping, reattempt decode in userspace
If userspace has a newer bhyve than the kernel, it may be able to decode
and emulate some instructions vmm.ko is unaware of.  In this scenario,
reset decoder state and try again.

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D24464
2020-06-25 00:18:42 +00:00
Edward Tomasz Napierala
5ac2674278 Adapt linuxulator syscalls.master files to the new layout.
No functional changes.

Reviewed by:	brooks
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25381
2020-06-21 10:09:34 +00:00
Brandon Bergren
40b664f64b [PowerPC] More relocation fixes
It turns out relocating the symbol table itself can cause issues, like fbt
crashing because it applies the offsets to the kernel twice.

This had been previously brought up in rS333447 when the stoffs hack was
added, but I had been unaware of this and reimplemented symtab relocation.

Instead of relocating the symbol table, keep track of the relocation base
in ddb, so the ddb symbols behave like the kernel linker-provided symbols.

This is intended to be NFC on platforms other than PowerPC, which do not
use fully relocatable kernels. (The relbase will always be 0)

 * Remove the rest of the stoffs hack.
 * Remove my half-baked displace_symbol_table() function.
 * Extend ddb initialization to cope with having a relocation offset on the
   kernel symbol table.
 * Fix my kernel-as-initrd hack to work with booke64 by using a temporary
   mapping to access the data.
 * Fix another instance of __powerpc__ that is actually RELOCATABLE_KERNEL.
 * Change the behavior or X_db_symbol_values to apply the relocation base
   when updating valp, to match link_elf_symbol_values() behavior.

Reviewed by:	jhibbits
Sponsored by:	Tag1 Consulting, Inc.
Differential Revision:	https://reviews.freebsd.org/D25223
2020-06-21 03:39:26 +00:00
Edward Tomasz Napierala
bafd96b8dd Regen after r362440.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-06-20 18:31:02 +00:00
Edward Tomasz Napierala
52c81be11a Add linux_madvise(2) instead of having Linux apps call the native
FreeBSD madvise(2) directly.  While some of the flag values match,
most don't.

PR:		kern/230160
Reported by:	markj
Reviewed by:	markj
Discussed with:	brooks, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25272
2020-06-20 18:29:22 +00:00
Konstantin Belousov
17edf152e5 Control for Special Register Buffer Data Sampling mitigation.
New microcode update for Intel enables mitigation for SRBDS, which
slows down RDSEED and related instructions.  The update also provides
a control to limit the mitigation to SGX enclaves, which should
restore the speed of random generator by the cost of potential
cross-core bufer sampling.

See https://software.intel.com/security-software-guidance/insights/deep-dive-special-register-buffer-data-sampling

GIve the user control over it.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25221
2020-06-12 22:14:45 +00:00
Eric van Gyzen
6fba90f201 FPU init: allocate initial state from UMA to ensure alignment
The Intel Instruction Set Reference says this about the XSAVE instruction:

    Use of a destination operand not aligned to 64-byte boundary
    (in either 64-bit or 32-bit modes) results in a general-protection
    (#GP) exception.

This alignment happens naturally when all malloc buckets are powers
of two.  However, this change is necessary on some systems when
certain non-power-of-two (and non-multiple of 64) malloc buckets
are defined.

Reviewed by:	cem; kib; earlier version by jhb
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D25098
2020-06-12 21:17:56 +00:00
Eric van Gyzen
701acc2fd8 FPU: make xsave_area_desc static
...because it can be.

Reviewed by:	cem kib
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D25098
2020-06-12 21:12:26 +00:00
Eric van Gyzen
674cbe7908 FPU init: Do potentially blocking operations before disabling interrupts
In particular, uma_zcreate creates sysctl oids, which locks an sx lock,
which uses IPIs under contention.  IPIs tend not to work very well
when interrupts are disabled.  Who knew, right?

Reviewed by:	cem kib
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D25098
2020-06-12 21:10:45 +00:00
Konstantin Belousov
3b23ffe271 amd64 pmap: reorder IPI send and local TLB flush in TLB invalidations.
Right now code first flushes all local TLB entries that needs to be
flushed, then signals IPI to remote cores, and then waits for
acknowledgements while spinning idle.  In the VMWare article 'Don’t
shoot down TLB shootdowns!' it was noted that the time spent spinning
is lost, and can be more usefully used doing local TLB invalidation.

We could use the same invalidation handler for local TLB as for
remote, but typically for pmap == curpmap we can use INVLPG for locals
instead of INVPCID on remotes, since we cannot control context
switches on them.  Due to that, keep the local code and provide the
callbacks to be called from smp_targeted_tlb_shootdown() after IPIs
are fired but before spin wait starts.

Reviewed by:	alc, cem, markj, Anton Rang <rang at acm.org>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D25188
2020-06-10 22:07:57 +00:00
Mark Johnston
0cfac4d5c6 Handle getcpu() calls in vsyscall emulation on amd64.
linux_getcpu() has been implemented since r356241.

PR:		246339
Submitted by:	John Hay <john@sanren.ac.za>
MFC after:	1 week
2020-05-31 18:20:20 +00:00
Mark Johnston
81302f1d77 Fix boot on systems where NUMA domain 0 is unpopulated.
- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
  ensure that segments registered during hammer_time() are placed in the
  right domain.  Otherwise, since the SRAT is not parsed at that point,
  we just add them to domain 0, which may be incorrect and results in a
  domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
  If domain 0 is unpopulated, the allocation will simply fail, resulting
  in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
  affinity table, and change vm_phys_early_alloc() to handle wildcard
  domains.  This is necessary on amd64, where the page array is dense
  and pmap_page_array_startup() may allocate page table pages for
  non-existent page frames.

Reported and tested by:	Rafael Kitover <rkitover@gmail.com>
Reviewed by:	cem (earlier version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25001
2020-05-28 19:41:00 +00:00
Eric Joyner
71d104536b ice(4): Introduce new driver for Intel E800 Ethernet controllers
The ice(4) driver is the driver for the Intel E8xx series Ethernet
controllers; currently with codenames Columbiaville and
Columbia Park.

These new controllers support 100G speeds, as well as introducing
more queues, better virtualization support, and more offload
capabilities. Future work will enable virtual functions (like
in ixl(4)) and the other functionality outlined above.

For full functionality, the kernel should be compiled with
"device ice_ddp" like in the amd64 NOTES file, and/or
ice_ddp_load="YES" should be added to /boot/loader.conf so that
the DDP package file included in this commit can be downloaded
to the adapter. Otherwise, the adapter will fall back to a single
queue mode with limited functionality.

A man page for this driver will be forthcoming.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D21959
2020-05-26 23:35:10 +00:00
Roger Pau Monné
f44b653756 xen-locore: fix size in GDT descriptor
There was an off-by-one in the GDT descriptor size field used by the
early Xen boot code. The GDT descriptor size should be the size of the
GDT minus one. No functional change expected as a result of this
change.

Sponsored by:	Citrix Systems R&D
2020-05-26 10:24:06 +00:00
Conrad Meyer
852c303b61 copystr(9): Move to deprecate (attempt #2)
This reapplies logical r360944 and r360946 (reverting r360955), with fixed
copystr() stand-in replacement macro.  Eventually the goal is to convert
consumers and kill the macro, but for a first step it helps if the macro is
correct.

Prior commit message:

Unlike the other copy*() functions, it does not serve to copy from one
address space to another or protect against potential faults.  It's just
an older incarnation of the now-more-common strlcpy().

Add a coccinelle script to tools/ which can be used to mechanically
convert existing instances where replacement with strlcpy is trivial.
In the two cases which matched, fuse_vfsops.c and union_vfsops.c, the
code was further refactored manually to simplify.

Replace the declaration of copystr() in systm.h with a small macro
wrapper around strlcpy (with correction from brooks@ -- thanks).

Remove N redundant MI implementations of copystr.  For MIPS, this
entailed inlining the assembler copystr into the only consumer,
copyinstr, and making the latter a leaf function.

Reviewed by:		jhb (earlier version)
Discussed with:		brooks (thanks!)
Differential Revision:	https://reviews.freebsd.org/D24672
2020-05-25 16:40:48 +00:00
Mark Johnston
e115748932 Fix the build after r361033 when ACPI is disabled.
Reported by:	Herbert J. Skuhra <herbert@gojira.at>
2020-05-22 01:18:55 +00:00
Konstantin Belousov
ea6020830c amd64: Add a knob to flush RSB on context switches if machine has SMEP.
The flush is needed to prevent cross-process ret2spec, which is not handled
on kernel entry if IBPB is enabled but SMEP is present.
While there, add i386 RSB flush.

Reported by:	Anthony Steinhauser <asteinhauser@google.com>
Reviewed by:	markj, Anthony Steinhauser
Discussed with:	philip
admbugs:	961
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-05-20 22:00:31 +00:00
Conrad Meyer
f4ce062964 vmm(4): Add 12 user ABI compat after r349948
Reported by:	kp
Reviewed by:	jhb, kp
Tested by:	kp
Differential Revision:	https://reviews.freebsd.org/D24929
2020-05-20 17:27:54 +00:00
Conrad Meyer
8a68ae80f6 vmm(4), bhyve(8): Expose kernel-emulated special devices to userspace
Expose the special kernel LAPIC, IOAPIC, and HPET devices to userspace
for use in, e.g., fallback instruction emulation (when userspace has a
newer instruction decode/emulation layer than the kernel vmm(4)).

Plumb the ioctl through libvmmapi and register the memory ranges in
bhyve(8).

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D24525
2020-05-15 15:54:22 +00:00
Peter Grehan
ec048c7550 Hide host CPUID 0x15 TSC/Crystal ratio/freq info from guest
In recent Linux (5.3+) and OpenBSD (6.6+) kernels, and with hosts that
support CPUID 0x15, the local APIC frequency is determined directly
from the reported crystal clock to avoid calibration against the 8254
timer.

However, the local APIC frequency implemented by bhyve is 128MHz, where
most h/w systems report frequencies around 25MHz. This shows up on
OpenBSD guests as repeated keystrokes on the emulated PS2 keyboard
when using VNC, since the kernel's timers are now much shorter.

Fix by reporting all-zeroes for CPUID 0x15. This allows guests to fall
back to using the 8254 to calibrate the local APIC frequency.

Future work could be to compute values returned for 0x15 that would
match the host TSC and bhyve local APIC frequency, though all dependencies
on this would need to be examined (for example, Linux will start using
0x16 for some hosts).

PR:	246321
Reported by:	Jason Tubnor (and tested)
Reviewed by:	jhb
Approved by:	jhb, bz (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D24837
2020-05-14 22:18:12 +00:00
Mark Johnston
e76aab6ae2 Call acpi_pxm_set_proximity_info() slightly earlier on x86.
This function is responsible for setting pc_domain in each pcpu
structure.  Call it from the main function that starts APs, rather than
a separate SYSINIT.  This makes it easier to close the window where
UMA's per-CPU slab allocator may be called while pc_domain is
uninitialized.  In particular, the allocator uses pc_domain to allocate
domain-local pages, so allocations before this point end up using domain
0 for everything.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24757
2020-05-14 16:07:27 +00:00
Mark Johnston
821c4e77c5 Assert that page table traversal functions don't operate on superpages.
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D24828
2020-05-14 15:49:37 +00:00
Andriy Gapon
c4b4e8cd4e amd64/pmap: unbreak !NUMA case for fictitious pages
A fictitious page can have a physical address beyond the end of the RAM.
In the NUMA case there is some special code to handle such pages, but in
the other case the pages are handled the same as normal pages.  So, we
cannot assert that the physical address is within RAM addresses.

Suggested by:	kib
Reviewed by:	kib
X-MFC note:	NUMA support has not been MFC-ed
2020-05-12 09:31:48 +00:00
Conrad Meyer
051fc58cb3 Revert r360944 and r360946 until reported issues can be resolved
Reported by:	cy
2020-05-12 04:34:26 +00:00
Conrad Meyer
580744621f copystr(9): Move to deprecate [2/2]
Unlike the other copy*() functions, it does not serve to copy from one
address space to another or protect against potential faults.  It's just
an older incarnation of the now-more-common strlcpy().

Add a coccinelle script to tools/ which can be used to mechanically
convert existing instances where replacement with strlcpy is trivial.
In the two cases which matched, fuse_vfsops.c and union_vfsops.c, the
code was further refactored manually to simplify.

Replace the declaration of copystr() in systm.h with a small macro
wrapper around strlcpy.

Remove N redundant MI implementations of copystr.  For MIPS, this
entailed inlining the assembler copystr into the only consumer,
copyinstr, and making the latter a leaf function.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D24672
2020-05-11 22:57:21 +00:00
Mark Johnston
cd9c23b5eb Reinitialize thread0's stack base after enabling XSAVE.
Otherwise the initial call to set_top_of_stack(), which occurs before
fpuinit() sets the correct value for cpu_max_ext_state_size, leaves the
stack base at an incorrect location.  Then, when the full area is
zeroed, we end up erroneously zeroing part of the following page.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24754
2020-05-08 14:38:48 +00:00
John Baldwin
483d953a86 Initial support for bhyve save and restore.
Save and restore (also known as suspend and resume) permits a snapshot
to be taken of a guest's state that can later be resumed.  In the
current implementation, bhyve(8) creates a UNIX domain socket that is
used by bhyvectl(8) to send a request to save a snapshot (and
optionally exit after the snapshot has been taken).  A snapshot
currently consists of two files: the first holds a copy of guest RAM,
and the second file holds other guest state such as vCPU register
values and device model state.

To resume a guest, bhyve(8) must be started with a matching pair of
command line arguments to instantiate the same set of device models as
well as a pointer to the saved snapshot.

While the current implementation is useful for several uses cases, it
has a few limitations.  The file format for saving the guest state is
tied to the ABI of internal bhyve structures and is not
self-describing (in that it does not communicate the set of device
models present in the system).  In addition, the state saved for some
device models closely matches the internal data structures which might
prove a challenge for compatibility of snapshot files across a range
of bhyve versions.  The file format also does not currently support
versioning of individual chunks of state.  As a result, the current
file format is not a fixed binary format and future revisions to save
and restore will break binary compatiblity of snapshot files.  The
goal is to move to a more flexible format that adds versioning,
etc. and at that point to commit to providing a reasonable level of
compatibility.  As a result, the current implementation is not enabled
by default.  It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option
for userland builds, and the kernel option BHYVE_SHAPSHOT.

Submitted by:	Mihai Tiganus, Flavius Anton, Darius Mihai
Submitted by:	Elena Mihailescu, Mihai Carabas, Sergiu Weisz
Relnotes:	yes
Sponsored by:	University Politehnica of Bucharest
Sponsored by:	Matthew Grooms (student scholarships)
Sponsored by:	iXsystems
Differential Revision:	https://reviews.freebsd.org/D19495
2020-05-05 00:02:04 +00:00
Mark Johnston
1d6638472b Remove an obsolete TODO comment from several minidump implementations.
The comment referenced a non-existent function, and these minidump
implementations already buffer discontiguous physical data pages by
mapping them into a single VA range that gets passed to the dump device,
so there is no real advantage in batching calls to blk_write().

The RISC-V and MIPS minidump implementations still write a page at a
time and so would benefit from some form of batching.

MFC after:	2 weeks
Sponsored by:	Juniper Networks, Klara Inc.
2020-04-24 18:47:42 +00:00
Conrad Meyer
47332982bc vmm(4): Decode and emulate BEXTR
Clang 10 -march=native kernels on znver1 emit BEXTR for APIC reads,
apparently.  Decode and emulate the instruction.

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D24463
2020-04-21 21:34:24 +00:00