The TSC-s are checked and synchronized only if they were good
originally. That is, invariant, synchronized, etc.
This is necessary on an AMD-based system where after a wakeup from STR I
see that BSP clock differs from AP clocks by a count that roughly
corresponds to one second. The APs are in sync with each other. Not
sure if this is a hardware quirk or a firmware bug.
This is what I see after a resume with this change:
SMP: passed TSC synchronization test after adjustment
acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low
Reviewed by: kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15551
Instead, construct an auxargs array and copy it out all at once.
Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happend
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.
Return errors on copyout() and suword() failures and handle them in the
caller.
Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).
Reviewed by: kib
Comments from: emaste, jhb
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15485
We certainly should clear PSL_T when calling the SIGTRAP signal
handler, which is already done by all x86 sendsig(9) ABI code. On the
other hand, there is no obvious reason why PSL_T needs to be cleared
when returning from the signal handler. For instance, Linux allows
userspace to set PSL_T and keep tracing enabled for the desired
period. There are userspace programs which would use PSL_T if we make
it possible, for instance sbcl.
Remember if PSL_T was set by PT_STEP or PT_SETSTEP by mean of TDB_STEP
flag, and only clear it when the flag is set.
Discussed with: Ali Mashtizadeh
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D15054
- Add constants for fields in DR6 and the reserved fields in DR7. Use
these constants instead of magic numbers in most places that use DR6
and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
recommended by the SDM dating all the way back to the 386. This
allows debuggers to determine the cause of each exception. For
kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
to other parts of the handler (namely, user_dbreg_trap()). For user
traps, wait until after trapsignal to clear DR6 so that userland
debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
in trapsignal().
Reviewed by: kib, rgrimes
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15189
Speculative Store Bypass (SSB) is a speculative execution side channel
vulnerability identified by Jann Horn of Google Project Zero (GPZ) and
Ken Johnson of the Microsoft Security Response Center (MSRC)
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
Updated Intel microcode introduces a MSR bit to disable SSB as a
mitigation for the vulnerability.
Introduce a sysctl hw.spec_store_bypass_disable to provide global
control over the SSBD bit, akin to the existing sysctl that controls
IBRS. The sysctl can be set to one of three values:
0: off
1: on
2: auto
Future work will enable applications to control SSBD on a per-process
basis (when it is not enabled globally).
SSBD bit detection and control was verified with prerelease microcode.
Security: CVE-2018-3639
Tested by: emaste (previous version, without updated microcode)
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
When we issue shootdown IPIs, we first assign zero to pm_gens to
indicate the need to flush on the next context switch in case our IPI
misses the context, next we read pm_active. On context switch we set
our bit in pm_active, then we read pm_gen. It is crucial that both
threads see the memory in the program order, otherwise invalidation
thread might read pm_active bit as zero and the context switching
thread might read pm_gen as zero.
IA32 allows CPU for both reads to see zero. We must use the barriers
between write and read. The pm_active bit set is already locked, so
only the invalidation functions need it.
I never saw it in real life, or at least I do not have a good
reproduction case. I found this during code inspection when hunting
for the Xen TLB issue reported by cperciva.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15506
This turns on support for kernel dump encryption and compression, and
netdump. arm and mips platforms are omitted for now, since they are more
constrained and don't benefit as much from these features.
Reviewed by: cem, manu, rgrimes
Tested by: manu (arm64)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D15465
Currently, when using dd(1) to take a VM memory image, the capture never ends,
reading zeroes when it's beyond VM system memory max address.
Return EFAULT when trying to read beyond VM system memory max address.
Reviewed by: imp, grehan, anish
Approved by: grehan
Differential Revision: https://reviews.freebsd.org/D15156
Kernel debuggers depend on symbol names to find stack frames with a
trapframe rather than a normal stack frame. The labels used for the
shared interrupt entry point for the PTI and non-PTI cases did not
match the existing patterns confusing debuggers. Add the '.L' prefix
to mark these symbols as local so they are not visible in the symbol
table.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Chelsio Communications
From now on, linking amd64 kernel requires either lld or newer ld.bfd.
Reviewed by: jhb (as part of the large patch)
Discussed with: emaste
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D13838
Adapt assembly generated by clang for memcmp and use it for <= 64 sized
compares (which are the vast majority).
Sample result of doing stats on Broadwell (% of samples):
before: 4.0 kernel bcmp cache_lookup
after : 0.7 kernel bcmp cache_lookup
The routine is most definitely still not optimal. Anyone interested in
spending time improving it is welcome to take over.
Reviewed by: kib
Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see
correct flags, most important is SMAP.
Tested by: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D15367
Supposedly, they PG_U bits there were set to easier making some kernel
page accessible to userspace in-place. Since it was not used for the
whole existence of the amd64 pmap.c and current design of the shared
pages prefers double-mapping over the in-place access, remove PG_U
both from the direct map and KVA slots.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This PML4 page is never used for the userspace process, so there is no
security implications. But the configuration trips SMAP check, which
should be corrected.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Since pop %ss/mov %ss instructions defer all interrupts and exceptions
for the next instruction, it is possible that the userspace watchpoint
trap executes on the first instruction of the kernel entry for
syscall/bpt.
In this case, DB# should be treated similarly to NMI: on amd64 we must
always load GSBASE even if the trap comes from kernel mode, and load
the kernel page table root into %cr3. Moreover, the trap must
use the dedicated stack, because we are still on the user stack when
trapped on syscall entry.
For i386, we must reload %cr3. The syscall instruction is not configured,
so there is no issue with executing on user stack when trapping.
Due to some CPU erratas it is not always possible to detect that the
userspace watchpoint triggered by inspecting %dr6. In trap(), compare the
trap %rip with the known unsafe entry points and if matched pretend that
the watchpoint did not fire at all.
Thank you to the MSRC Incident Response Team, and in particular Greg
Lenti and Nate Warfield, for coordinating the response to this issue
across multiple vendors.
Thanks to Computer Recycling at The Working Center of Kitchener for
making hardware available to allow us to test the patch on additional
CPU families.
Reviewed by: jhb
Discussed with: Matthew Dillon
Tested by: emaste
Sponsored by: The FreeBSD Foundation
Security: CVE-2018-8897
Security: FreeBSD-SA-18:06.debugreg
The parameter is effectively controllable by userspace. It does not matter
what it is set to as it is being passed to copyin - worst case the operation
will just fail.
While here stop computing it unless it is going to be used.
Noted by: dillon@backplane.com
There was a missing trick expanding the passed pattern to a full word
by multiplication. As a side effect non-zero patterns would be
incorrectly laid down.
This stems from the use of rep stosq which is word-sized, while the passed
argument is byte-sized.
I initially repurposed memcpy into memset without taking this into account.
All but non-bzero testing was performed with a variant utilizing ERMS, i.e.
using only stosb which happens to not into the problem whatsoever. So my bad
twice.
Thanks to Oliver Pinter for noting the problem and providing a testcase.
memmove is repurposed bcopy (arguments swapped, return value added)
The libkern variant is a wrapper around bcopy, so this is a big
improvement.
memset is repurposed memcpy. The librkern variant is doing fishy stuff,
including branching on 0 and calling bzero.
Both functions are rather crude and subject to partial depessimization.
This is a soft prerequisite to adding variants utilizing the
'Enhanced REP MOVSB/STOSB' bit and let the kernel patch at runtime.
The code was unnecessarily conditionally copying either 5 or 6 args.
It can blindly copy 6, which also means the size is known at compilation
time and the operation can be depessimized.
Note the entire syscall handling code is rather slow.
Tested on Skylake, sample result for getppid (calls/s):
without pti: 7310106 -> 10653569
with pti: 3304843 -> 4148306
Some syscalls (like read) did not note any difference, other have typically
very modest wins.
Required MD bits are only provided for x86.
Reviewed by: jhb (previous version, as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D13838
A large number of devices don't support PCIe FLR, in particular
graphics adapters. Use PCI power management to perform the
reset if FLR fails or isn't available, by cycling the device
through the D3 state.
This has been tested by a number of users with Nvidia and AMD GPUs.
Submitted and tested by: Matt Macy
Reviewed by: jhb, imp, rgrimes
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15268
Dumpers may wish to print messages from an initialization hook; this
change ensures that such messages aren't mixed with output from the
generic dump code.
MFC after: 1 week
GCC warns about the potentially confusing use of the binary AND ('&')
operator with a left operand containing an addition expression. (The
confusion would be around the operator precedence between the + and & infix
operators.) The warning is converted into an error with -Werror.
No functional change.
This construct was actually introduced in r328083, but r333059 (re)moved the
closing parentheses.
For reference, see http://en.cppreference.com/w/c/language/operator_precedence .
- Microsemi SCSI driver for PQI controllers.
- Found on newer model HP servers.
- Restrict to AMD64 only as per developer request.
The driver provides support for the new generation of PQI controllers
from Microsemi. This driver is the first SCSI driver to implement the PQI
queuing model and it will replace the aacraid driver for Adaptec Series 9
controllers. HARDWARE Controllers supported by the driver include:
HPE Gen10 Smart Array Controller Family
OEM Controllers based on the Microsemi Chipset.
Submitted by: deepak.ukey@microsemi.com
Relnotes: yes
Sponsored by: Microsemi
Differential Revision: https://reviews.freebsd.org/D14514
> Description of fields to fill in above: 76 columns --|
> PR: If and which Problem Report is related.
> Submitted by: If someone else sent in the change.
> Reported by: If someone else reported the issue.
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> MFH: Ports tree branch name. Request approval for merge.
> Relnotes: Set to 'yes' for mention in release notes.
> Security: Vulnerability reference (one per line) or description.
> Sponsored by: If the change was sponsored by an organization.
> Pull Request: https://github.com/freebsd/freebsd/pull/### (*full* GitHub URL needed).
> Differential Revision: https://reviews.freebsd.org/D### (*full* phabric URL needed).
> Empty fields above will be automatically removed.
M share/man/man4/Makefile
AM share/man/man4/smartpqi.4
M sys/amd64/conf/GENERIC
M sys/conf/NOTES
M sys/conf/files.amd64
A sys/dev/smartpqi
AM sys/dev/smartpqi/smartpqi_cam.c
AM sys/dev/smartpqi/smartpqi_cmd.c
AM sys/dev/smartpqi/smartpqi_defines.h
AM sys/dev/smartpqi/smartpqi_discovery.c
AM sys/dev/smartpqi/smartpqi_event.c
AM sys/dev/smartpqi/smartpqi_helper.c
AM sys/dev/smartpqi/smartpqi_includes.h
AM sys/dev/smartpqi/smartpqi_init.c
AM sys/dev/smartpqi/smartpqi_intr.c
AM sys/dev/smartpqi/smartpqi_ioctl.c
AM sys/dev/smartpqi/smartpqi_ioctl.h
AM sys/dev/smartpqi/smartpqi_main.c
AM sys/dev/smartpqi/smartpqi_mem.c
AM sys/dev/smartpqi/smartpqi_misc.c
AM sys/dev/smartpqi/smartpqi_prototypes.h
AM sys/dev/smartpqi/smartpqi_queue.c
AM sys/dev/smartpqi/smartpqi_request.c
AM sys/dev/smartpqi/smartpqi_response.c
AM sys/dev/smartpqi/smartpqi_sis.c
AM sys/dev/smartpqi/smartpqi_structures.h
AM sys/dev/smartpqi/smartpqi_tag.c
M sys/modules/Makefile
A sys/modules/smartpqi
AM sys/modules/smartpqi/Makefile
hardware will ensure the stack pointer is aligned to a 16-byte
boundary before saving the fault state on the stack.
In the PTI case, handle this potential alignment adjustment by copying
both frames independently while unwinding the stack in between.
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15183
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.
Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893
Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().
Reviewed by: emaste
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15123
This will allow to hook a ddb script to "kdb.enter.trap" event.
Previously there was no specific name for this event, so it could only
be handled by either "kdb.enter.unknown" or "kdb.enter.default" hooks.
Both are very unspecific.
Having a specific event is useful because the fatal trap condition is
very similar to panic but it has an additional property that the current
stack frame is the frame where the trap occurred. So, both a register
dump and a stack bottom dump have additional information that can help
analyze the problem.
I have added the event only on architectures that have trap_fatal()
function defined. I haven't looked at other architectures. Their
maintainers can add support for the event later.
Sample script:
kdb.enter.trap=bt; show reg; x/aS $rsp,20; x/agx $rsp,20
Reviewed by: kib, jhb, markj
MFC after: 11 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D15093
Half of implementations always failed (returned (-1)) and they were
previously used in only one place.
Reviewed by: kib, andrew
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15102
Trampoline mappings are better treated as global since they are valid
in all address spaces, even for PTI. pmap_invalidate_range() must work
on global mappings for pti since kernel_pmap invalidations are really
same as for non-PTI.
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D15052
The miscellaneous x86 sysent->sv_setregs() implementations tried to
migrate PSL_T from the previous program to the new executed one, but
they evaluated regs->tf_eflags after the whole regs structure was
bzeroed. Make this functional by saving PSL_T value before zeroing.
Note that if the debugger is not attached, executing the first
instruction in the new program with PSL_T set results in SIGTRAP, and
since all intercepted signals are reset to default dispostion on
exec(2), this means that non-debugged process gets killed immediately
if PSL_T is inherited. In particular, since suid images drop
P_TRACED, attempt to set PSL_T for execution of such program would
kill the process.
Another issue with userspace PSL_T handling is that it is reset by
trap(). It is reasonable to clear PSL_T when entering SIGTRAP
handler, to allow the signal to be handled without recursion or
delivery of blocked fault. But it is not reasonable to return back to
the normal flow with PSL_T cleared. This is too late to change, I
think.
Discussed with: bde, Ali Mashtizadeh
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D14995
In pti-enabled pmap, the PCID allocation scheme assigns temporal id
for the kernel page table, and user page table twin PCID is
calculating by setting high bit in the kernel PCID. So the kernel AS
is mapped with per-vmspace PCID, and we must completely shut down all
mappings in KVA when switching contexts, so that newly switched thread
would see all changes in KVA occured while it was not executing.
After all, KVA is same between all threads.
Currently the pti context switch for the user part of the page table
gets its TLB entries flushed too. It is excessive. The same PCID
flushing algorithm that is used for non-pti pmap, correctly works for
the UVA mappings. The only shared TLB entries are the pages from KVA
accessed by the kernel entry trampoline. All of them are static
except per-thread TSS and LDT. For TSS and LDT, the lifetime of newly
allocated entries is the whole thread life, so it is fine as well. If
not fine, then explicit shutdowns for current pmap of the newly
allocated LDT and TSS pages would be enough.
Also restore the constant value for the pm_pcid for the kernel_pmap.
Before, for PTI pmap, pm_pcid was erronously rolled same as user
pmap's pm_pcid, but it was not used.
Reviewed by: markj (previous version)
Discussed with: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D14961