Adapt assembly generated by clang for memcmp and use it for <= 64 sized
compares (which are the vast majority).
Sample result of doing stats on Broadwell (% of samples):
before: 4.0 kernel bcmp cache_lookup
after : 0.7 kernel bcmp cache_lookup
The routine is most definitely still not optimal. Anyone interested in
spending time improving it is welcome to take over.
Reviewed by: kib
Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see
correct flags, most important is SMAP.
Tested by: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D15367
Supposedly, they PG_U bits there were set to easier making some kernel
page accessible to userspace in-place. Since it was not used for the
whole existence of the amd64 pmap.c and current design of the shared
pages prefers double-mapping over the in-place access, remove PG_U
both from the direct map and KVA slots.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This PML4 page is never used for the userspace process, so there is no
security implications. But the configuration trips SMAP check, which
should be corrected.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Since pop %ss/mov %ss instructions defer all interrupts and exceptions
for the next instruction, it is possible that the userspace watchpoint
trap executes on the first instruction of the kernel entry for
syscall/bpt.
In this case, DB# should be treated similarly to NMI: on amd64 we must
always load GSBASE even if the trap comes from kernel mode, and load
the kernel page table root into %cr3. Moreover, the trap must
use the dedicated stack, because we are still on the user stack when
trapped on syscall entry.
For i386, we must reload %cr3. The syscall instruction is not configured,
so there is no issue with executing on user stack when trapping.
Due to some CPU erratas it is not always possible to detect that the
userspace watchpoint triggered by inspecting %dr6. In trap(), compare the
trap %rip with the known unsafe entry points and if matched pretend that
the watchpoint did not fire at all.
Thank you to the MSRC Incident Response Team, and in particular Greg
Lenti and Nate Warfield, for coordinating the response to this issue
across multiple vendors.
Thanks to Computer Recycling at The Working Center of Kitchener for
making hardware available to allow us to test the patch on additional
CPU families.
Reviewed by: jhb
Discussed with: Matthew Dillon
Tested by: emaste
Sponsored by: The FreeBSD Foundation
Security: CVE-2018-8897
Security: FreeBSD-SA-18:06.debugreg
The parameter is effectively controllable by userspace. It does not matter
what it is set to as it is being passed to copyin - worst case the operation
will just fail.
While here stop computing it unless it is going to be used.
Noted by: dillon@backplane.com
There was a missing trick expanding the passed pattern to a full word
by multiplication. As a side effect non-zero patterns would be
incorrectly laid down.
This stems from the use of rep stosq which is word-sized, while the passed
argument is byte-sized.
I initially repurposed memcpy into memset without taking this into account.
All but non-bzero testing was performed with a variant utilizing ERMS, i.e.
using only stosb which happens to not into the problem whatsoever. So my bad
twice.
Thanks to Oliver Pinter for noting the problem and providing a testcase.
memmove is repurposed bcopy (arguments swapped, return value added)
The libkern variant is a wrapper around bcopy, so this is a big
improvement.
memset is repurposed memcpy. The librkern variant is doing fishy stuff,
including branching on 0 and calling bzero.
Both functions are rather crude and subject to partial depessimization.
This is a soft prerequisite to adding variants utilizing the
'Enhanced REP MOVSB/STOSB' bit and let the kernel patch at runtime.
The code was unnecessarily conditionally copying either 5 or 6 args.
It can blindly copy 6, which also means the size is known at compilation
time and the operation can be depessimized.
Note the entire syscall handling code is rather slow.
Tested on Skylake, sample result for getppid (calls/s):
without pti: 7310106 -> 10653569
with pti: 3304843 -> 4148306
Some syscalls (like read) did not note any difference, other have typically
very modest wins.
Required MD bits are only provided for x86.
Reviewed by: jhb (previous version, as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13838
A large number of devices don't support PCIe FLR, in particular
graphics adapters. Use PCI power management to perform the
reset if FLR fails or isn't available, by cycling the device
through the D3 state.
This has been tested by a number of users with Nvidia and AMD GPUs.
Submitted and tested by: Matt Macy
Reviewed by: jhb, imp, rgrimes
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15268
Dumpers may wish to print messages from an initialization hook; this
change ensures that such messages aren't mixed with output from the
generic dump code.
MFC after: 1 week
GCC warns about the potentially confusing use of the binary AND ('&')
operator with a left operand containing an addition expression. (The
confusion would be around the operator precedence between the + and & infix
operators.) The warning is converted into an error with -Werror.
No functional change.
This construct was actually introduced in r328083, but r333059 (re)moved the
closing parentheses.
For reference, see http://en.cppreference.com/w/c/language/operator_precedence .
- Microsemi SCSI driver for PQI controllers.
- Found on newer model HP servers.
- Restrict to AMD64 only as per developer request.
The driver provides support for the new generation of PQI controllers
from Microsemi. This driver is the first SCSI driver to implement the PQI
queuing model and it will replace the aacraid driver for Adaptec Series 9
controllers. HARDWARE Controllers supported by the driver include:
HPE Gen10 Smart Array Controller Family
OEM Controllers based on the Microsemi Chipset.
Submitted by: deepak.ukey@microsemi.com
Relnotes: yes
Sponsored by: Microsemi
Differential Revision: https://reviews.freebsd.org/D14514
hardware will ensure the stack pointer is aligned to a 16-byte
boundary before saving the fault state on the stack.
In the PTI case, handle this potential alignment adjustment by copying
both frames independently while unwinding the stack in between.
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15183
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.
Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893
Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().
Reviewed by: emaste
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15123
This will allow to hook a ddb script to "kdb.enter.trap" event.
Previously there was no specific name for this event, so it could only
be handled by either "kdb.enter.unknown" or "kdb.enter.default" hooks.
Both are very unspecific.
Having a specific event is useful because the fatal trap condition is
very similar to panic but it has an additional property that the current
stack frame is the frame where the trap occurred. So, both a register
dump and a stack bottom dump have additional information that can help
analyze the problem.
I have added the event only on architectures that have trap_fatal()
function defined. I haven't looked at other architectures. Their
maintainers can add support for the event later.
Sample script:
kdb.enter.trap=bt; show reg; x/aS $rsp,20; x/agx $rsp,20
Reviewed by: kib, jhb, markj
MFC after: 11 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D15093
Half of implementations always failed (returned (-1)) and they were
previously used in only one place.
Reviewed by: kib, andrew
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15102
Trampoline mappings are better treated as global since they are valid
in all address spaces, even for PTI. pmap_invalidate_range() must work
on global mappings for pti since kernel_pmap invalidations are really
same as for non-PTI.
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D15052
The miscellaneous x86 sysent->sv_setregs() implementations tried to
migrate PSL_T from the previous program to the new executed one, but
they evaluated regs->tf_eflags after the whole regs structure was
bzeroed. Make this functional by saving PSL_T value before zeroing.
Note that if the debugger is not attached, executing the first
instruction in the new program with PSL_T set results in SIGTRAP, and
since all intercepted signals are reset to default dispostion on
exec(2), this means that non-debugged process gets killed immediately
if PSL_T is inherited. In particular, since suid images drop
P_TRACED, attempt to set PSL_T for execution of such program would
kill the process.
Another issue with userspace PSL_T handling is that it is reset by
trap(). It is reasonable to clear PSL_T when entering SIGTRAP
handler, to allow the signal to be handled without recursion or
delivery of blocked fault. But it is not reasonable to return back to
the normal flow with PSL_T cleared. This is too late to change, I
think.
Discussed with: bde, Ali Mashtizadeh
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D14995
In pti-enabled pmap, the PCID allocation scheme assigns temporal id
for the kernel page table, and user page table twin PCID is
calculating by setting high bit in the kernel PCID. So the kernel AS
is mapped with per-vmspace PCID, and we must completely shut down all
mappings in KVA when switching contexts, so that newly switched thread
would see all changes in KVA occured while it was not executing.
After all, KVA is same between all threads.
Currently the pti context switch for the user part of the page table
gets its TLB entries flushed too. It is excessive. The same PCID
flushing algorithm that is used for non-pti pmap, correctly works for
the UVA mappings. The only shared TLB entries are the pages from KVA
accessed by the kernel entry trampoline. All of them are static
except per-thread TSS and LDT. For TSS and LDT, the lifetime of newly
allocated entries is the whole thread life, so it is fine as well. If
not fine, then explicit shutdowns for current pmap of the newly
allocated LDT and TSS pages would be enough.
Also restore the constant value for the pm_pcid for the kernel_pmap.
Before, for PTI pmap, pm_pcid was erronously rolled same as user
pmap's pm_pcid, but it was not used.
Reviewed by: markj (previous version)
Discussed with: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D14961
Previously linuxulator had three identical copies of
linux_exec_imgact_try. Deduplicate before adding another arch to
linuxulator.
Sponsored by: Turing Robotic Industries Inc
Differential Revision: https://reviews.freebsd.org/D14856
from userland without the need to use sysctls, it allows the old
sysctls to continue to function, but deprecates them at
FreeBSD_version 1200060 (Relnotes for deprecate).
The command line of bhyve is maintained in a backwards compatible way.
The API of libvmmapi is maintained in a backwards compatible way.
The sysctl's are maintained in a backwards compatible way.
Added command option looks like:
bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n]
The optional parts can be specified in any order, but only a single
integer invokes the backwards compatible parse. [,maxcpus=n] is
hidden by #ifdef until kernel support is added, though the api
is put in place.
bhyvectl --get-cpu-topology option added.
Reviewed by: grehan (maintainer, earlier version),
Reviewed by: bcr (manpages)
Approved by: bde (mentor), phk (mentor)
Tested by: Oleg Ginzburg <olevole@olevole.ru> (cbsd)
MFC after: 1 week
Relnotes: Y
Differential Revision: https://reviews.freebsd.org/D9930
SKZ63 Processor May Hang When Executing Code In an HLE Transaction
Region
Problem: Under certain conditions, if the processor acquires an HLE
(Hardware Lock Elision) lock via the XACQUIRE instruction in the Host
Physical Address range between 40000000H and 403FFFFFH, it may hang
with an internal timeout error (MCACOD 0400H) logged into
IA32_MCi_STATUS.
Move the pages from the range into the blacklist. Add a tunable to
not waste 4M if local DoS is not the issue.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15001
This is used as part of implementing run control in bhyve's debug
server. The hypervisor now maintains a set of "debugged" CPUs.
Attempting to run a debugged CPU will fail to execute any guest
instructions and will instead report a VM_EXITCODE_DEBUG exit to
the userland hypervisor. Virtual CPUs are placed into the debugged
state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl).
Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl).
The debug server suspends virtual CPUs when it wishes them to stop
executing in the guest (for example, when a debugger attaches to the
server). The debug server can choose to resume only a subset of CPUs
(for example, when single stepping) or it can choose to resume all
CPUs. The debug server must explicitly mark a CPU as resumed via
vm_resume_cpu() before the virtual CPU will successfully execute any
guest instructions.
Reviewed by: avg, grehan
Tested on: Intel (jhb), AMD (avg)
Differential Revision: https://reviews.freebsd.org/D14466
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
we patted the watchdog approximately once per 4KB page of memory. After
this change, we pat the watchdog approximately once per 128MB of memory.
On a sample machine, this translated to patting the watchdog approximately
every 5.4 seconds, which "seems reasonable". We can choose a different
value in the future, if warranted.
This has extensive field experience. It is a performance improvement, and
has not caused any known problems.
Reviewed by: imp, kib
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D14988
Add the missing breaks in the for loops, in order to exit the loop
when a suitable entry is found.
Also switch amd64 native_start_all_aps to use PHYS_TO_DMAP in order to
find the virtual address of the boot_trampoline and the initial page
tables.
Reported and tested by: pho
Sponsored by: Citrix Systems R&D
So that it doesn't rely on physmap[1] containing an address below
1MiB. Instead scan the full physmap and search for a suitable address
to place the trampoline code (below 1MiB) and the initial memory pages
(below 4GiB).
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D14878
The lcall trampoline enters kernel by int $0x80, which sets up invalid
length of the instruction for %rip rewind.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Having the IDT entry specify ring 0 DPL caused delivery of #GP instead
of #OF.
The instruction is not valid in 64bit mode, which probably explains
why the IDT entry for #OF was initially set this way. It is
interesting to note that the BOUND instruction works with the IDT #BR
entry DPL 0, most likely CPU considers #BR from BOUND as generated by
a machine, not user.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
x86/cpu_machdep.c now needs to include elan_mmcr.h when CPU_ELAN is set.
While here, also remove the now unneeded inclusion of isareg.h in i386
and amd64 vm_machdep.c.
Reported by: lwhsu
MFC after: 14 days
X-MFC with: r331878
Because I didn't see any reason not too.
I've been making some changes to the code and couldn't help but notice
that the i386 and am64 code was nearly identical.
MFC after: 17 days
If cpu_reset() is called on an AP and if it somehow fails to wake the
BSP, then it's better to attempt the reset on the AP than just sit there
spinning on an unusable and undebuggable system.
MFC after: 16 days
The processor is "parked" in a spin-loop already and that's sufficient
for the reset. There is nothing that stop_cpus() would add here, only
extra complexity and fragility.
The original processor does not need to enable interrupts now, in fact,
it must not do that.
MFC after: 2 weeks
The ocs_fc(4) driver supports the following hardware:
Emulex 16/8G FC GEN 5 HBAS
LPe15004 FC Host Bus Adapters
LPe160XX FC Host Bus Adapters
Emulex 32/16G FC GEN 6 HBAS
LPe3100X FC Host Bus Adapters
LPe3200X FC Host Bus Adapters
The driver supports target and initiator mode, and also supports FC-Tape.
Note that the driver only currently works on little endian platforms. It
is only included in the module build for amd64 and i386, and in GENERIC
on amd64 only.
Submitted by: Ram Kishore Vegesna <ram.vegesna@broadcom.com>
Reviewed by: mav
MFC after: 5 days
Relnotes: yes
Sponsored by: Broadcom
Differential Revision: https://reviews.freebsd.org/D11423
platforms. Original commit message as follows:
Only use CPUs in the domain the device is attached to for default
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.
Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14838