freebsd-skq

Author	SHA1	Message	Date
bde	d4e99d50f0	Fix recent breakages of kernel profiling, mostly on i386 (high resolution kernel profiling remains broken). memmove() was broken using ALTENTRY(). ALTENTRY() is only different from ENTRY() in the profiling case, and its use in that case was sort of backwards. The backwardness magically turned memmove() into memcpy() instead of completely breaking it. Only the high resolution parts of profiling itself were broken. Use ordinary ENTRY() for memmove(). Turn bcopy() into a tail call to memmove() to reduce complications. This gives slightly different pessimizations and profiling lossage. The pessimizations are minimized by not using a frame pointer() for bcopy(). Calls to profiling functions from exception trampolines were not relocated. This caused crashes on the first exception. Fix this using function pointers. Addresses of exception handlers in trampolines were not relocated. This caused unknown offsets in the profiling data. Relocate by abusing setidt_disp as for pmc although this is slower than necessary and requires namespace pollution. pmc seems to be missing some relocations. Stack traces and lots of other things in debuggers need similar relocations. Most user addresses were misclassified as unknown kernel addresses and then ignored. Treat all unknown addresses as user. Now only user addresses in the kernel text range are significantly misclassified (as known kernel addresses). The ibrs functions didn't preserve enough registers. This is the only recent breakage on amd64. Although these functions are written in asm, in the profiling case they call profiling functions which are mostly for the C ABI, so they only have to save call-used registers. They also have to save arg and return registers in some cases and actually save them in all cases to reduce complications. They end up saving all registers except %ecx on i386 and %r10 and %r11 on amd64. Saving these is only needed for 1 caller on each of amd64 and i386. Save them there. This is slightly simpler. Remove saving %ecx in handle_ibrs_exit on i386. Both handle_ibrs_entry and handle_ibrs_exit use %ecx, but only the latter needed to or did save it. But saving it there doesn't work for the profiling case. amd64 has more automatic saving of the most common scratch registers %rax, %rcx and %rdx (its complications for %r10 are from unusual use of %r10 by SYSCALL). Thus profiling of handle_ibrs_exit_rs() was not broken, and I didn't simplify the saving by moving the saving of these registers from it to the caller.	2018-06-02 04:25:09 +00:00
mmacy	2f6bd2cd39	hwpmc: remove unused pre-table driven bits for intel Intel now provides comprehensive tables for all performance counters and the various valid configuration permutations as text .json files. Libpmc has been converted to use these and hwpmc_core has been greatly simplified by moving to passthrough of the table values. The one gotcha is that said tables don't support pentium pro and and pentium IV. There's very few users of hwpmc on _amd64_ kernels on new hardware. It is unlikely that anyone is doing low level optimization on 15 year old Intel hardware. Nonetheless, if someone feels strongly enough to populate the corresponding tables for p4 and ppro I will reinstate the files in to the build. Code for the K8 counters and !x86 architectures remains unchanged.	2018-05-31 22:41:07 +00:00
dim	a34aa33cd6	Resolve conflicts between macros in fenv.h and ieeefp.h This is a follow-up to r321483, which disabled -Wmacro-redefined for some lib/msun tests. If an application included both fenv.h and ieeefp.h, several macros such as __fldcw(), __fldenv() were defined in both headers, with slightly different arguments, leading to conflicts. Fix this by putting all the common macros in the machine-specific versions of ieeefp.h. Where needed, update the arguments in places where the macros are invoked. This also slightly reduces the differences between the amd64 and i386 versions of ieeefp.h. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D15633	2018-05-31 20:22:47 +00:00
mjg	8a9e7fed77	amd64: switch pagecopy from non-temporal stores to rep movsq The copied data is accessed in part soon after and it results with additional cache misses during a -j 1 buildkernel WITHOUT_CTF=yes KERNFAST=1, as measured with pmc stat. before: 256165411 cache-references # 0.003 refs/inst 15105408 cache-misses # 5.897% 20.70 real # 99.67% cpu 13.24 user # 63.94% cpu 7.40 sys # 35.73% cpu after: 256764469 cache-references # 0.003 refs/inst 11913551 cache-misses # 4.640% 20.70 real # 99.67% cpu 13.19 user # 63.73% cpu 7.44 sys # 35.95% cpu Note the real time did not change, but traffic to RAM was reduced (multiple measurements performed with switching the implementation at runtime). Since nobody else is using non-temporal for this and there is no apparent benefit at least these days, don't use them either. Side note is that pagecopy arguments should probably get reversed to not have to flip them around in the primitive. Discussed with: jeff	2018-05-31 09:56:02 +00:00
brooks	db929686a9	Correct pointer subtraction in KASSERT(). The assertion would never fire without truly spectacular future programming errors. Reported by: Coverity CID: 1391370 Sponsored by: DARPA, AFRL	2018-05-29 20:03:24 +00:00
avg	546f863d51	re-synchronize TSC-s on SMP systems after resume, if necessary The TSC-s are checked and synchronized only if they were good originally. That is, invariant, synchronized, etc. This is necessary on an AMD-based system where after a wakeup from STR I see that BSP clock differs from AP clocks by a count that roughly corresponds to one second. The APs are in sync with each other. Not sure if this is a hardware quirk or a firmware bug. This is what I see after a resume with this change: SMP: passed TSC synchronization test after adjustment acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low Reviewed by: kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15551	2018-05-25 07:33:20 +00:00
brooks	a36ed4ef19	Avoid two suword() calls per auxarg entry. Instead, construct an auxargs array and copy it out all at once. Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent the array. This is the correct type where pairs of words just happend to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act on this array rather than introducing a new macro. Return errors on copyout() and suword() failures and handle them in the caller. Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64 which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32 (now removed due to the use of proper types). Reviewed by: kib Comments from: emaste, jhb Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15485	2018-05-24 16:25:18 +00:00
mmacy	97b23c16eb	take NUMA out	2018-05-24 04:31:53 +00:00
mmacy	da844acd6d	libpmcstat: compile in events based on json description	2018-05-24 04:30:06 +00:00
kib	cf836f6647	x86: stop unconditionally clearing PSL_T on the trace trap. We certainly should clear PSL_T when calling the SIGTRAP signal handler, which is already done by all x86 sendsig(9) ABI code. On the other hand, there is no obvious reason why PSL_T needs to be cleared when returning from the signal handler. For instance, Linux allows userspace to set PSL_T and keep tracing enabled for the desired period. There are userspace programs which would use PSL_T if we make it possible, for instance sbcl. Remember if PSL_T was set by PT_STEP or PT_SETSTEP by mean of TDB_STEP flag, and only clear it when the flag is set. Discussed with: Ali Mashtizadeh Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D15054	2018-05-23 21:39:29 +00:00
kib	90c2d6c602	Style. Wording and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D15054	2018-05-23 21:25:49 +00:00
kib	e6ad362804	Enable IBRS when entering an interrupt handler from usermode. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-05-22 13:25:15 +00:00
jhb	31acfe0f07	Cleanups related to debug exceptions on x86. - Add constants for fields in DR6 and the reserved fields in DR7. Use these constants instead of magic numbers in most places that use DR6 and DR7. - Refer to T_TRCTRAP as "debug exception" rather than a "trace trap" as it is not just for trace exceptions. - Always read DR6 for debug exceptions and only clear TF in the flags register for user exceptions where DR6.BS is set. - Clear DR6 before returning from a debug exception handler as recommended by the SDM dating all the way back to the 386. This allows debuggers to determine the cause of each exception. For kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value to other parts of the handler (namely, user_dbreg_trap()). For user traps, wait until after trapsignal to clear DR6 so that userland debuggers can read DR6 via PT_GETDBREGS while the thread is stopped in trapsignal(). Reviewed by: kib, rgrimes MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D15189	2018-05-22 00:45:00 +00:00
kib	1300dfd419	Add Intel Spec Store Bypass Disable control. Speculative Store Bypass (SSB) is a speculative execution side channel vulnerability identified by Jann Horn of Google Project Zero (GPZ) and Ken Johnson of the Microsoft Security Response Center (MSRC) https://bugs.chromium.org/p/project-zero/issues/detail?id=1528. Updated Intel microcode introduces a MSR bit to disable SSB as a mitigation for the vulnerability. Introduce a sysctl hw.spec_store_bypass_disable to provide global control over the SSBD bit, akin to the existing sysctl that controls IBRS. The sysctl can be set to one of three values: 0: off 1: on 2: auto Future work will enable applications to control SSBD on a per-process basis (when it is not enabled globally). SSBD bit detection and control was verified with prerelease microcode. Security: CVE-2018-3639 Tested by: emaste (previous version, without updated microcode) Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-21 21:08:19 +00:00
kib	5574ca2f4a	Preserve other bits in IA32_SPEC_CTL MSR when changing the IBRS and STIBP states. Tested by: emaste (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-21 21:05:55 +00:00
kib	e623971d49	Fix grammar. Submitted by: alc MFC after: 1 week	2018-05-21 19:15:05 +00:00
kib	d057b59f62	Add missed barrier for pm_gen/pm_active interaction. When we issue shootdown IPIs, we first assign zero to pm_gens to indicate the need to flush on the next context switch in case our IPI misses the context, next we read pm_active. On context switch we set our bit in pm_active, then we read pm_gen. It is crucial that both threads see the memory in the program order, otherwise invalidation thread might read pm_active bit as zero and the context switching thread might read pm_gen as zero. IA32 allows CPU for both reads to see zero. We must use the barriers between write and read. The pm_active bit set is already locked, so only the invalidation functions need it. I never saw it in real life, or at least I do not have a good reproduction case. I found this during code inspection when hunting for the Xen TLB issue reported by cperciva. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15506	2018-05-21 18:41:16 +00:00
mjg	7fc09e02a6	amd64: annotate pti with __read_frequently	2018-05-21 05:20:23 +00:00
markj	5594f37668	Enable kernel dump features in GENERIC for most platforms. This turns on support for kernel dump encryption and compression, and netdump. arm and mips platforms are omitted for now, since they are more constrained and don't benefit as much from these features. Reviewed by: cem, manu, rgrimes Tested by: manu (arm64) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D15465	2018-05-19 19:53:23 +00:00
mmacy	f437b75cf6	pmap: silence warnings	2018-05-19 05:58:05 +00:00
emaste	0c5a9cf17f	amd64 GENERIC: correct whitespace on smartpqi entry	2018-05-18 17:51:42 +00:00
antoine	8156c46154	vmmdev: return EFAULT when trying to read beyond VM system memory max address Currently, when using dd(1) to take a VM memory image, the capture never ends, reading zeroes when it's beyond VM system memory max address. Return EFAULT when trying to read beyond VM system memory max address. Reviewed by: imp, grehan, anish Approved by: grehan Differential Revision: https://reviews.freebsd.org/D15156	2018-05-15 17:20:58 +00:00
jhb	8e38b4b70f	Make the common interrupt entry point labels local labels. Kernel debuggers depend on symbol names to find stack frames with a trapframe rather than a normal stack frame. The labels used for the shared interrupt entry point for the PTI and non-PTI cases did not match the existing patterns confusing debuggers. Add the '.L' prefix to mark these symbols as local so they are not visible in the symbol table. Reviewed by: kib MFC after: 1 week Sponsored by: Chelsio Communications	2018-05-14 17:27:53 +00:00
kib	852722fdfd	Make fpusave() and fpurestore() on amd64 ifuncs. From now on, linking amd64 kernel requires either lld or newer ld.bfd. Reviewed by: jhb (as part of the large patch) Discussed with: emaste Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13838	2018-05-10 15:01:43 +00:00
mjg	c6c875dd41	amd64: depessimize bcmp for small buffers Adapt assembly generated by clang for memcmp and use it for <= 64 sized compares (which are the vast majority). Sample result of doing stats on Broadwell (% of samples): before: 4.0 kernel bcmp cache_lookup after : 0.7 kernel bcmp cache_lookup The routine is most definitely still not optimal. Anyone interested in spending time improving it is welcome to take over. Reviewed by: kib	2018-05-09 15:16:25 +00:00
kib	4895cdf089	Avoid calls to bzero() before ireloc. Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see correct flags, most important is SMAP. Tested by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D15367	2018-05-09 14:39:24 +00:00
kib	98f701752a	Remove PG_U from the rest of the kernel pmap ptes. Supposedly, they PG_U bits there were set to easier making some kernel page accessible to userspace in-place. Since it was not used for the whole existence of the amd64 pmap.c and current design of the shared pages prefers double-mapping over the in-place access, remove PG_U both from the direct map and KVA slots. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-05-09 12:09:08 +00:00
kib	dfdc3054ac	Remove PG_U from the recursive pte for kernel pmap' PML4 page. This PML4 page is never used for the userspace process, so there is no security implications. But the configuration trips SMAP check, which should be corrected. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-05-09 12:03:40 +00:00
kib	e393d204ee	Prepare DB# handler for deferred trigger of watchpoints. Since pop %ss/mov %ss instructions defer all interrupts and exceptions for the next instruction, it is possible that the userspace watchpoint trap executes on the first instruction of the kernel entry for syscall/bpt. In this case, DB# should be treated similarly to NMI: on amd64 we must always load GSBASE even if the trap comes from kernel mode, and load the kernel page table root into %cr3. Moreover, the trap must use the dedicated stack, because we are still on the user stack when trapped on syscall entry. For i386, we must reload %cr3. The syscall instruction is not configured, so there is no issue with executing on user stack when trapping. Due to some CPU erratas it is not always possible to detect that the userspace watchpoint triggered by inspecting %dr6. In trap(), compare the trap %rip with the known unsafe entry points and if matched pretend that the watchpoint did not fire at all. Thank you to the MSRC Incident Response Team, and in particular Greg Lenti and Nate Warfield, for coordinating the response to this issue across multiple vendors. Thanks to Computer Recycling at The Working Center of Kitchener for making hardware available to allow us to test the patch on additional CPU families. Reviewed by: jhb Discussed with: Matthew Dillon Tested by: emaste Sponsored by: The FreeBSD Foundation Security: CVE-2018-8897 Security: FreeBSD-SA-18:06.debugreg	2018-05-08 17:00:34 +00:00
mjg	e9b7f70752	amd64: stop asserting params != NULL in the syscall path The parameter is effectively controllable by userspace. It does not matter what it is set to as it is being passed to copyin - worst case the operation will just fail. While here stop computing it unless it is going to be used. Noted by: dillon@backplane.com	2018-05-07 21:32:08 +00:00
mjg	ec95c631af	amd64: fix up memset added in r333324 There was a missing trick expanding the passed pattern to a full word by multiplication. As a side effect non-zero patterns would be incorrectly laid down. This stems from the use of rep stosq which is word-sized, while the passed argument is byte-sized. I initially repurposed memcpy into memset without taking this into account. All but non-bzero testing was performed with a variant utilizing ERMS, i.e. using only stosb which happens to not into the problem whatsoever. So my bad twice. Thanks to Oliver Pinter for noting the problem and providing a testcase.	2018-05-07 20:54:42 +00:00
mjg	d886e55540	amd64: tweak the memmove comment regarding authorship To make it clear the mentioned author did not write memmove.	2018-05-07 17:37:07 +00:00
mjg	326c556da0	amd64: replace libkern's memset and memmove with assembly variants memmove is repurposed bcopy (arguments swapped, return value added) The libkern variant is a wrapper around bcopy, so this is a big improvement. memset is repurposed memcpy. The librkern variant is doing fishy stuff, including branching on 0 and calling bzero. Both functions are rather crude and subject to partial depessimization. This is a soft prerequisite to adding variants utilizing the 'Enhanced REP MOVSB/STOSB' bit and let the kernel patch at runtime.	2018-05-07 15:07:28 +00:00
mjg	7199be3e0a	amd64: syscall path bcopy -> memcpy	2018-05-04 22:41:12 +00:00
mjg	b3fd2731c4	amd64: get rid of the pessimized bcopy in syscall arg copy The code was unnecessarily conditionally copying either 5 or 6 args. It can blindly copy 6, which also means the size is known at compilation time and the operation can be depessimized. Note the entire syscall handling code is rather slow. Tested on Skylake, sample result for getppid (calls/s): without pti: 7310106 -> 10653569 with pti: 3304843 -> 4148306 Some syscalls (like read) did not note any difference, other have typically very modest wins.	2018-05-04 04:05:07 +00:00
kib	2a74077455	Implement support for ifuncs in the kernel linker. Required MD bits are only provided for x86. Reviewed by: jhb (previous version, as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13838	2018-05-03 21:37:46 +00:00
kib	0456e20163	Style. Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D13838	2018-05-03 10:17:37 +00:00
grehan	d0b707de44	Use PCI power-mgmt to reset a device if FLR fails. A large number of devices don't support PCIe FLR, in particular graphics adapters. Use PCI power management to perform the reset if FLR fails or isn't available, by cycling the device through the D3 state. This has been tested by a number of users with Nvidia and AMD GPUs. Submitted and tested by: Matt Macy Reviewed by: jhb, imp, rgrimes MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15268	2018-05-02 17:41:00 +00:00
markj	65dc9f377b	Print the dump progress indicator after calling dump_start(). Dumpers may wish to print messages from an initialization hook; this change ensures that such messages aren't mixed with output from the generic dump code. MFC after: 1 week	2018-05-01 17:32:43 +00:00
cem	7b5081122b	amd64/mp_machdep.c: Fix GCC build after r333059 GCC warns about the potentially confusing use of the binary AND ('&') operator with a left operand containing an addition expression. (The confusion would be around the operator precedence between the + and & infix operators.) The warning is converted into an error with -Werror. No functional change. This construct was actually introduced in r328083, but r333059 (re)moved the closing parentheses. For reference, see http://en.cppreference.com/w/c/language/operator_precedence .	2018-04-28 17:55:28 +00:00
tychon	26c4626cab	Expand the checks for UCR3 == PMAP_NO_CR3 to enable processes to be excluded from PTI. Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15100	2018-04-27 12:44:20 +00:00
sbruno	c93f787b7e	move smartpqi(4) controller out of NOTES and into sys/amd64/NOTES to appease LINT Submitted by: rpokala Reported by: npn	2018-04-26 22:43:25 +00:00
sbruno	0804f0eef0	martpqi(4): - Microsemi SCSI driver for PQI controllers. - Found on newer model HP servers. - Restrict to AMD64 only as per developer request. The driver provides support for the new generation of PQI controllers from Microsemi. This driver is the first SCSI driver to implement the PQI queuing model and it will replace the aacraid driver for Adaptec Series 9 controllers. HARDWARE Controllers supported by the driver include: HPE Gen10 Smart Array Controller Family OEM Controllers based on the Microsemi Chipset. Submitted by: deepak.ukey@microsemi.com Relnotes: yes Sponsored by: Microsemi Differential Revision: https://reviews.freebsd.org/D14514 > Description of fields to fill in above: 76 columns --\| > PR: If and which Problem Report is related. > Submitted by: If someone else sent in the change. > Reported by: If someone else reported the issue. > Reviewed by: If someone else reviewed your modification. > Approved by: If you needed approval for this commit. > Obtained from: If the change is from a third party. > MFC after: N [day[s]\|week[s]\|month[s]]. Request a reminder email. > MFH: Ports tree branch name. Request approval for merge. > Relnotes: Set to 'yes' for mention in release notes. > Security: Vulnerability reference (one per line) or description. > Sponsored by: If the change was sponsored by an organization. > Pull Request: https://github.com/freebsd/freebsd/pull/### (full GitHub URL needed). > Differential Revision: https://reviews.freebsd.org/D### (full phabric URL needed). > Empty fields above will be automatically removed. M share/man/man4/Makefile AM share/man/man4/smartpqi.4 M sys/amd64/conf/GENERIC M sys/conf/NOTES M sys/conf/files.amd64 A sys/dev/smartpqi AM sys/dev/smartpqi/smartpqi_cam.c AM sys/dev/smartpqi/smartpqi_cmd.c AM sys/dev/smartpqi/smartpqi_defines.h AM sys/dev/smartpqi/smartpqi_discovery.c AM sys/dev/smartpqi/smartpqi_event.c AM sys/dev/smartpqi/smartpqi_helper.c AM sys/dev/smartpqi/smartpqi_includes.h AM sys/dev/smartpqi/smartpqi_init.c AM sys/dev/smartpqi/smartpqi_intr.c AM sys/dev/smartpqi/smartpqi_ioctl.c AM sys/dev/smartpqi/smartpqi_ioctl.h AM sys/dev/smartpqi/smartpqi_main.c AM sys/dev/smartpqi/smartpqi_mem.c AM sys/dev/smartpqi/smartpqi_misc.c AM sys/dev/smartpqi/smartpqi_prototypes.h AM sys/dev/smartpqi/smartpqi_queue.c AM sys/dev/smartpqi/smartpqi_request.c AM sys/dev/smartpqi/smartpqi_response.c AM sys/dev/smartpqi/smartpqi_sis.c AM sys/dev/smartpqi/smartpqi_structures.h AM sys/dev/smartpqi/smartpqi_tag.c M sys/modules/Makefile A sys/modules/smartpqi AM sys/modules/smartpqi/Makefile	2018-04-26 16:59:06 +00:00
tychon	b5071925e0	If a trap is encountered upon executing iretq from within doreti() the hardware will ensure the stack pointer is aligned to a 16-byte boundary before saving the fault state on the stack. In the PTI case, handle this potential alignment adjustment by copying both frames independently while unwinding the stack in between. Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15183	2018-04-25 14:21:13 +00:00
markj	37d36b67f3	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
kib	dd08c073d0	Correct undesirable interaction between caching of %cr4 in bhyve and invltlb_glob(). Reviewed by: grehan, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15138	2018-04-24 13:44:19 +00:00
jhb	96955e60da	Simplify the code to allocate stack for auxv, argv[], and environment vectors. Remove auxarg_size as it was only used once right after a confusing assignment in each of the variants of exec_copyout_strings(). Reviewed by: emaste MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D15123	2018-04-19 16:00:34 +00:00
avg	cc87433116	set kdb_why to "trap" when calling kdb_trap from trap_fatal This will allow to hook a ddb script to "kdb.enter.trap" event. Previously there was no specific name for this event, so it could only be handled by either "kdb.enter.unknown" or "kdb.enter.default" hooks. Both are very unspecific. Having a specific event is useful because the fatal trap condition is very similar to panic but it has an additional property that the current stack frame is the frame where the trap occurred. So, both a register dump and a stack bottom dump have additional information that can help analyze the problem. I have added the event only on architectures that have trap_fatal() function defined. I haven't looked at other architectures. Their maintainers can add support for the event later. Sample script: kdb.enter.trap=bt; show reg; x/aS $rsp,20; x/agx $rsp,20 Reviewed by: kib, jhb, markj MFC after: 11 days Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D15093	2018-04-19 05:06:56 +00:00
avg	1bcaa50aa1	don't check for kdb reentry in trap_fatal(), it's impossible trap() checks for it earlier and calls kdb_reentry(). Discussed with: jhb MFC after: 12 days Sponsored by: Panzura	2018-04-18 15:44:54 +00:00
brooks	c35e9275fc	Remove the unused fuwintr() and suiwintr() functions. Half of implementations always failed (returned (-1)) and they were previously used in only one place. Reviewed by: kib, andrew Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15102	2018-04-17 18:04:28 +00:00
kib	4184d62cd8	Set PG_G global mapping bit on the trampoline ptes. Trampoline mappings are better treated as global since they are valid in all address spaces, even for PTI. pmap_invalidate_range() must work on global mappings for pti since kernel_pmap invalidations are really same as for non-PTI. Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D15052	2018-04-14 17:33:16 +00:00
tychon	d53228c4ee	Add SDT probes to vmexit on Intel. Submitted by: domagoj.stolfa_gmail.com Reviewed by: grehan, tychon Sponsored by: DARPA/AFRL Differential Revision: https://reviews.freebsd.org/D14656	2018-04-13 17:23:05 +00:00
kib	eda1e69e06	Fix PSL_T inheritance on exec for x86. The miscellaneous x86 sysent->sv_setregs() implementations tried to migrate PSL_T from the previous program to the new executed one, but they evaluated regs->tf_eflags after the whole regs structure was bzeroed. Make this functional by saving PSL_T value before zeroing. Note that if the debugger is not attached, executing the first instruction in the new program with PSL_T set results in SIGTRAP, and since all intercepted signals are reset to default dispostion on exec(2), this means that non-debugged process gets killed immediately if PSL_T is inherited. In particular, since suid images drop P_TRACED, attempt to set PSL_T for execution of such program would kill the process. Another issue with userspace PSL_T handling is that it is reset by trap(). It is reasonable to clear PSL_T when entering SIGTRAP handler, to allow the signal to be handled without recursion or delivery of blocked fault. But it is not reasonable to return back to the normal flow with PSL_T cleared. This is too late to change, I think. Discussed with: bde, Ali Mashtizadeh Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D14995	2018-04-12 20:43:39 +00:00
kib	5a40f55a22	Optimize context switch for PTI on PCID pmap. In pti-enabled pmap, the PCID allocation scheme assigns temporal id for the kernel page table, and user page table twin PCID is calculating by setting high bit in the kernel PCID. So the kernel AS is mapped with per-vmspace PCID, and we must completely shut down all mappings in KVA when switching contexts, so that newly switched thread would see all changes in KVA occured while it was not executing. After all, KVA is same between all threads. Currently the pti context switch for the user part of the page table gets its TLB entries flushed too. It is excessive. The same PCID flushing algorithm that is used for non-pti pmap, correctly works for the UVA mappings. The only shared TLB entries are the pages from KVA accessed by the kernel entry trampoline. All of them are static except per-thread TSS and LDT. For TSS and LDT, the lifetime of newly allocated entries is the whole thread life, so it is fine as well. If not fine, then explicit shutdowns for current pmap of the newly allocated LDT and TSS pages would be enough. Also restore the constant value for the pm_pcid for the kernel_pmap. Before, for PTI pmap, pm_pcid was erronously rolled same as user pmap's pm_pcid, but it was not used. Reviewed by: markj (previous version) Discussed with: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D14961	2018-04-12 19:59:36 +00:00
emaste	2aec82f4db	linuxulator: add else case braces to reduce diffs between archs Sponsored by: Turing Robotic Industries Inc.	2018-04-09 19:11:24 +00:00
emaste	c5db8e6d0d	linuxulator: deduplicate linux_exec_imgact_try Previously linuxulator had three identical copies of linux_exec_imgact_try. Deduplicate before adding another arch to linuxulator. Sponsored by: Turing Robotic Industries Inc Differential Revision: https://reviews.freebsd.org/D14856	2018-04-09 17:24:01 +00:00
rgrimes	f4d1671ab3	Add the ability to control the CPU topology of created VMs from userland without the need to use sysctls, it allows the old sysctls to continue to function, but deprecates them at FreeBSD_version 1200060 (Relnotes for deprecate). The command line of bhyve is maintained in a backwards compatible way. The API of libvmmapi is maintained in a backwards compatible way. The sysctl's are maintained in a backwards compatible way. Added command option looks like: bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n] The optional parts can be specified in any order, but only a single integer invokes the backwards compatible parse. [,maxcpus=n] is hidden by #ifdef until kernel support is added, though the api is put in place. bhyvectl --get-cpu-topology option added. Reviewed by: grehan (maintainer, earlier version), Reviewed by: bcr (manpages) Approved by: bde (mentor), phk (mentor) Tested by: Oleg Ginzburg <olevole@olevole.ru> (cbsd) MFC after: 1 week Relnotes: Y Differential Revision: https://reviews.freebsd.org/D9930	2018-04-08 19:24:49 +00:00
brooks	c68f7df1ad	Fix LINT (and static COMPAT_LINUX32) after r332122.	2018-04-08 17:10:32 +00:00
kib	3a596fa59c	Handle Skylake-X errata SKZ63. SKZ63 Processor May Hang When Executing Code In an HLE Transaction Region Problem: Under certain conditions, if the processor acquires an HLE (Hardware Lock Elision) lock via the XACQUIRE instruction in the Host Physical Address range between 40000000H and 403FFFFFH, it may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS. Move the pages from the range into the blacklist. Add a tunable to not waste 4M if local DoS is not the issue. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15001	2018-04-07 17:06:13 +00:00
jhb	a8d4adc969	Add a way to temporarily suspend and resume virtual CPUs. This is used as part of implementing run control in bhyve's debug server. The hypervisor now maintains a set of "debugged" CPUs. Attempting to run a debugged CPU will fail to execute any guest instructions and will instead report a VM_EXITCODE_DEBUG exit to the userland hypervisor. Virtual CPUs are placed into the debugged state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl). Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl). The debug server suspends virtual CPUs when it wishes them to stop executing in the guest (for example, when a debugger attaches to the server). The debug server can choose to resume only a subset of CPUs (for example, when single stepping) or it can choose to resume all CPUs. The debug server must explicitly mark a CPU as resumed via vm_resume_cpu() before the virtual CPU will successfully execute any guest instructions. Reviewed by: avg, grehan Tested on: Intel (jhb), AMD (avg) Differential Revision: https://reviews.freebsd.org/D14466	2018-04-06 22:03:43 +00:00
brooks	9d79658aab	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compabibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
jtl	77cc11ce79	Pat the watchdog less while producing a coredump. Prior to this change, we patted the watchdog approximately once per 4KB page of memory. After this change, we pat the watchdog approximately once per 128MB of memory. On a sample machine, this translated to patting the watchdog approximately every 5.4 seconds, which "seems reasonable". We can choose a different value in the future, if warranted. This has extensive field experience. It is a performance improvement, and has not caused any known problems. Reviewed by: imp, kib Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D14988	2018-04-06 17:06:22 +00:00
royger	e51e4bdd98	x86: fix trampoline memory allocation after r332073 Add the missing breaks in the for loops, in order to exit the loop when a suitable entry is found. Also switch amd64 native_start_all_aps to use PHYS_TO_DMAP in order to find the virtual address of the boot_trampoline and the initial page tables. Reported and tested by: pho Sponsored by: Citrix Systems R&D	2018-04-06 16:22:14 +00:00
royger	e1f89be1d3	remove GiB/MiB macros from param.h And instead define them in the files where they are used. Requested by: bde	2018-04-06 11:20:06 +00:00
royger	5f1547e410	x86: improve reservation of AP trampoline memory So that it doesn't rely on physmap[1] containing an address below 1MiB. Instead scan the full physmap and search for a suitable address to place the trampoline code (below 1MiB) and the initial memory pages (below 4GiB). Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14878	2018-04-05 14:39:51 +00:00
kib	85654290ba	Fix ERESTART for lcall $7,$0 syscalls. The lcall trampoline enters kernel by int $0x80, which sets up invalid length of the instruction for %rip rewind. Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-05 11:03:21 +00:00
kib	22ab7ec03b	Make the INTO instruction operational in 32bit mode. Having the IDT entry specify ring 0 DPL caused delivery of #GP instead of #OF. The instruction is not valid in 64bit mode, which probably explains why the IDT entry for #OF was initially set this way. It is interesting to note that the BOUND instruction works with the IDT #BR entry DPL 0, most likely CPU considers #BR from BOUND as generated by a machine, not user. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-05 11:03:05 +00:00
avg	3331ff57c2	fix i386 build with CPU_ELAN (LINT for instance) after r331878 x86/cpu_machdep.c now needs to include elan_mmcr.h when CPU_ELAN is set. While here, also remove the now unneeded inclusion of isareg.h in i386 and amd64 vm_machdep.c. Reported by: lwhsu MFC after: 14 days X-MFC with: r331878	2018-04-03 17:16:06 +00:00
avg	cbde65132d	unify amd64 and i386 cpu_reset() in x86/cpu_machdep.c Because I didn't see any reason not too. I've been making some changes to the code and couldn't help but notice that the i386 and am64 code was nearly identical. MFC after: 17 days	2018-04-02 13:45:23 +00:00
avg	f909741389	x86 cpu_reset: if failed to switch to BSP proceed to cpu_reset_real If cpu_reset() is called on an AP and if it somehow fails to wake the BSP, then it's better to attempt the reset on the AP than just sit there spinning on an unusable and undebuggable system. MFC after: 16 days	2018-04-02 08:06:18 +00:00
avg	a9c1c585d1	x86 cpu_reset_proxy: no need to stop_cpus() the original processor The processor is "parked" in a spin-loop already and that's sufficient for the reset. There is nothing that stop_cpus() would add here, only extra complexity and fragility. The original processor does not need to enable interrupts now, in fact, it must not do that. MFC after: 2 weeks	2018-04-02 07:45:13 +00:00
ken	570099bbdd	Bring in the Broadcom/Emulex Fibre Channel driver, ocs_fc(4). The ocs_fc(4) driver supports the following hardware: Emulex 16/8G FC GEN 5 HBAS LPe15004 FC Host Bus Adapters LPe160XX FC Host Bus Adapters Emulex 32/16G FC GEN 6 HBAS LPe3100X FC Host Bus Adapters LPe3200X FC Host Bus Adapters The driver supports target and initiator mode, and also supports FC-Tape. Note that the driver only currently works on little endian platforms. It is only included in the module build for amd64 and i386, and in GENERIC on amd64 only. Submitted by: Ram Kishore Vegesna <ram.vegesna@broadcom.com> Reviewed by: mav MFC after: 5 days Relnotes: yes Sponsored by: Broadcom Differential Revision: https://reviews.freebsd.org/D11423	2018-03-30 15:28:25 +00:00
jeff	bfe01083f9	Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all platforms. Original commit message as follows: Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-28 18:47:35 +00:00
jhb	7c78547f5c	Fix kernel builds without options DDB after r331650. Reported by: cy	2018-03-28 16:24:56 +00:00
jhb	24fa2df20b	Remove very old and unused signal information codes. These have been supplanted by the MI signal information codes in <sys/signal.h> since 7.0. The FPE_*_TRAP ones were deprecated even earlier in 1999. PR: 226579 (exp-run) Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14637	2018-03-27 20:57:51 +00:00
jeff	124dca372e	Backout r331606 until I can identify why it does not boot on some machines.	2018-03-27 10:20:50 +00:00
jeff	d1125a4e0d	Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-27 03:37:04 +00:00
kib	c29c42a09e	Improve the lcall $7,$0 syscall emulation on amd64. Current code, which copies the potential syscall arguments into the current frame, puts an arbitrary limit on the number of syscall arguments. Apparently, mmap(2) and lseek(2) (?) require larger number. But there is an issue that stack is only need to be mapped to contain the number of arguments required by the syscall, so copying arbitrary large number of words from the stack is not completely safe. Use different approach to convert lcall frame into int $0x80 frame in place, by doing the retl in kernel. This also allows to stop proceed vfork case specially, and stop making assumptions about %cs at the syscall time. Also, improve comments with the formulations provided by bde. Reviewed and tested by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 12:57:58 +00:00
jtl	54451a3dd7	Make the TCP blackbox code committed in r331347 be an optional feature controlled by the TCP_BLACKBOX option. Enable this as part of amd64 GENERIC. For now, leave it disabled on other platforms. Sponsored by: Netflix, Inc.	2018-03-24 12:48:10 +00:00
emaste	fa8b1b66d0	Remove redundant cast from Linuxulator SYSINITs	2018-03-23 20:32:54 +00:00
emaste	f64c2a93db	Fixup return style(9) in amd64 linux*_sysvec.c Sponsored by: Turing Robotic Industries Inc.	2018-03-23 17:28:04 +00:00
emaste	9f106a66bc	Sort headers in MD Linuxulator files Bring #includes closer to style(9) and reduce differences between the (three) MD versions of linux_machdep.c and linux_sysvec.c. Sponsored by: Turing Robotic Industries Inc.	2018-03-23 17:16:36 +00:00
kib	d93ae96f8a	Fixes for ptrace(PT_GETXSTATE_INFO) related to the padding in struct ptrace_xstate_info). struct ptrace_xstate_info has 64bit member but ends up with 32bit one. As result, on amd64 there is a 32bit padding at the end, but not on i386. We must clear the padding before doing the copyout. For compat32 case, we must copyout the structure which does not have the padding at the end. The later fixes 32bit gdb display of the YMM registers when running on amd64 kernel. Reported by: Vlad Tsyrklevich Reviewed by: brooks (previous version) Sponsored by: The FreeBSD Foundation admbugs: 765 MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14794	2018-03-22 20:44:27 +00:00
kevans	cef0ea8f72	Re-work efidev ordering to fix efirt preloaded by loader on amd64 On amd64, efi_enter calls fpu_kern_enter(). This may not be called until fpuinitstate has been invoked, resulting in a kernel panic with efirt_load="YES" in loader.conf(5). Move fpuinitstate a little earlier in SI_SUB_DRIVERS so that we can squeeze efirt between it and efirtc at SI_SUB_DRIVERS, SI_ORDER_ANY. efidev must be after efirt and doesn't really need to be at SI_SUB_DEVFS, so drop it at SI_SUB_DRIVER, SI_ORDER_ANY. The not immediately obvious dependency of fpuinitstate by efirt has been noted in both places. Discussed with: kib, andrew Reported by: Jakob Alvermark <jakob@alvermark.net> X-MFC-With: r330868	2018-03-22 18:24:00 +00:00
emaste	289be8516c	Share Linux errno table with libsysdecode Requested by: jhb Reviewed by: jhb Sponsored by: Turing Robotic Industries Inc.	2018-03-22 12:58:49 +00:00
kib	0a1d8bb0a4	Move the CR0.WP manipulation KPI to x86. This should allow to avoid some #ifdefs in the common x86/ code. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-20 20:20:49 +00:00
emaste	bd06dc7104	Make linuxulator fn declaration match definition I accidentally swapped 'linux_fixup_elf' to 'linux_elf_fixup' in amd64's declaration (only), while bringing this change over from git and encountering a conflict.	2018-03-20 19:28:52 +00:00
emaste	6fe54a5343	Rename assym.s to assym.inc assym is only to be included by other .s files, and should never actually be assembled by itself. Reviewed by: imp, bdrewery (earlier) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D14180	2018-03-20 17:58:51 +00:00
kib	ec36014ed1	Disable write protection around patching of XSAVE instruction in the context switch code. Some BIOSes give control to the OS with CR0.WP already set, making the kernel text read-only before cpu_startup(). Reported by: Peter Lei <peter.lei@ieee.org> Reviewed by: jtl Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14768	2018-03-20 17:47:29 +00:00
kib	60e489a73d	Provide KPI for handling of rw/ro kernel text. This is a pure syntax patch to create an interface to enable and later restore write access to the kernel text and other read-only mapped regions. It is in line with e.g. vm_fault_disable_pagefaults() by allowing the nesting. Discussed with: Peter Lei <peter.lei@ieee.org> Reviewed by: jtl Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14768	2018-03-20 17:43:50 +00:00
emaste	58e1b5421f	Rename linuxulator functions with linux_ prefix It's preferable to have a consistent prefix. This also reduces differences between the three linux*_sysvec.c files. Sponsored by: Turing Robotic Industries Inc.	2018-03-19 21:26:32 +00:00
emaste	47fc89046a	linux*_sysvec.c: rationalize whitespace and comments There's a fair amount of duplication between MD linuxulator files. Make indentation and comments consistent between the three versions of linux_sysvec.c to reduce diffs when comparing them. Sponsored by: Turing Robotic Industries Inc.	2018-03-19 15:11:10 +00:00
emaste	566c3d41cc	Share a single bsd-linux errno table across MD consumers Three copies of the linuxulator linux_sysvec.c contained identical BSD to Linux errno translation tables, and future work to support other architectures will also use the same table. Move the table to a common file to be used by all. Make it 'const int' to place it in .rodata. (Some existing Linux architectures use MD errno values, but x86 and Arm share the generic set.) This change should introduce no functional change; a followup will add missing errno values. MFC after: 3 weeks Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14665	2018-03-16 14:46:38 +00:00
emaste	9975d0e7b5	Remove stray ; at end of linux_vdso_deinstall()	2018-03-14 13:20:36 +00:00
kevans	29870e52c0	EFIRT: SetVirtualAddressMap with 1:1 mapping after exiting boot services This fixes a problem encountered on the Lenovo Thinkpad X220/Yoga 11e where runtime services would try to inexplicably jump to other parts of memory where it shouldn't be when attempting to enumerate EFI vars, causing a panic. The virtual mapping is enabled by default and can be disabled by setting efi_disable_vmap in loader.conf(5). Reviewed by: kib (earlier version) MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D14677	2018-03-13 17:10:52 +00:00
emaste	6851f84d1f	Use C99 boolean type for translate_osrel Migrate to modern types before creating MD Linuxolator bits for new architectures. Reviewed by: cem Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14676	2018-03-13 16:40:29 +00:00
emaste	bc4d21ce60	Apply some style(9) to Linuxulator linux_sysvec.c comments	2018-03-13 00:40:05 +00:00
ian	de8e5f9bb1	Revert r330780, it was improperly tested and results in taking a spin mutex before acquiring sleep mutexes. Reported by: kib@	2018-03-11 20:13:15 +00:00
ian	00abf0e72d	Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking.	2018-03-11 19:22:58 +00:00
tychon	6cbf64d8bf	Fix a lock recursion introduced in r327065. Reported by: kmacy Reviewed by: grehan, jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14548	2018-03-07 18:03:22 +00:00
jtl	8e9b6569cb	amd64: Protect the kernel text, data, and BSS by setting the RW/NX bits correctly for the data contained on each memory page. There are several components to this change: * Add a variable to indicate the start of the R/W portion of the initial memory. * Stop detecting NX bit support for each AP. Instead, use the value from the BSP and, if supported, activate the feature on the other APs just before loading the correct page table. (Functionally, we already assume that the BSP and all APs had the same support or lack of support for the NX bit.) * Set the RW and NX bits correctly for the kernel text, data, and BSS (subject to some caveats below). * Ensure DDB can write to memory when necessary (such as to set a breakpoint). * Ensure GDB can write to memory when necessary (such as to set a breakpoint). For this purpose, add new MD functions gdb_begin_write() and gdb_end_write() which the GDB support code can call before and after writing to memory. This change is not comprehensive: * It doesn't do anything to protect modules. * It doesn't do anything for kernel memory allocated after the kernel starts running. * In order to avoid excessive memory inefficiency, it may let multiple types of data share a 2M page, and assigns the most permissions needed for data on that page. Reviewed by: jhb, kib Discussed with: emaste MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14282	2018-03-06 14:28:37 +00:00
jtl	5a8fb2b9d6	We shouldn't need to execute code in the recursive page table mappings; therefore, it should be safe to set the NX bit on the PML4E for the recursive page table mappings. According to the Intel docs, the effect of the NX bit should propogate to any page reached through a PML4E which has the NX bit set. Reviewed by: kib, markj MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14333	2018-03-05 15:12:35 +00:00
jtl	886958fbdb	Prior to r329071, pmap_bootstrap() used pmap_kmem_choose() to round the first available virtual address to a 2MB boundary. After r329071, create_pagetables() rounds firstaddr up to a 2MB boundary. This ensures the kernel is mapped in super-pages, which is the point of the logic in pmap_kmem_choose(). Therefore, it is no longer necessary for pmap_bootstrap() to round up to the 2MB boundary again. As pmap_bootstrap() was the only user of pmap_kmem_choose(), we can delete pmap_kmem_choose(). Reviewed by: kib MFC after: 2 weeks X-MFC-with: r329071 Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14355	2018-03-05 15:10:17 +00:00
anish	919edd6e24	Move the new AMD-Vi IVHD [ACPI_IVRS_HARDWARE_NEW]definitions added in r329360 in contrib ACPI to local files till ACPI code adds new definitions reported by jkim. Rename ACPI_IVRS_HARDWARE_NEW to ACPI_IVRS_HARDWARE_EFRSUP, since new definitions add Extended Feature Register support. Use IvrsType to distinguish three types of IVHD - 0x10(legacy), 0x11 and 0x40(with EFR). IVHD 0x40 is also called mixed type since it supports HID device entries. Fix 2 coverity bugs reported by cem. Reported by:jkim, cem Approved by:grehan Differential Revision://reviews.freebsd.org/D14501	2018-03-05 02:28:25 +00:00
kib	d27cb27779	Unify bulk free operations in several pmaps. Submitted by: Yoshihiro Ota Reviewed by: markj MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13485	2018-03-04 20:53:20 +00:00
avg	785f44d456	db_nextframe/amd64: catch up with r328083 to recognize fast_syscall_common Since that change the system call stack traces look like this: ... sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe0028e13ac0 amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe0028e13bf0 fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0028e13bf0 So, db_nextframe() stopped recognizing the system call frame. This commit should fix that. Reviewed by: kib MFC after: 4 days	2018-03-03 15:10:37 +00:00
rpokala	03d8fd318a	imcsmb(4): Intel integrated Memory Controller (iMC) SMBus controller driver imcsmb(4) provides smbus(4) support for the SMBus controller functionality in the integrated Memory Controllers (iMCs) embedded in Intel Sandybridge- Xeon, Ivybridge-Xeon, Haswell-Xeon, and Broadwell-Xeon CPUs. Each CPU implements one or more iMCs, depending on the number of cores; each iMC implements two SMBus controllers (iMC-SMBs). * IMPORTANT NOTE * Because motherboard firmware or the BMC might try to use the iMC-SMBs for monitoring DIMM temperatures and/or managing an NVDIMM, the driver might need to temporarily disable those functions, or take a hardware interlock, before using the iMC-SMBs. Details on how to do this may vary from board to board, and the procedure may be proprietary. It is strongly suggested that anyone wishing to use this driver contact their motherboard vendor, and modify the driver as described in the manual page and in the driver itself. (For what it's worth, the driver as-is has been tested on various SuperMicro motherboards.) Reviewed by: avg, jhb MFC after: 1 week Relnotes: yes Sponsored by: Panasas Differential Revision: https://reviews.freebsd.org/D14447 Discussed with: avg, ian, jhb Tested by: allanjude (previous version), Panasas	2018-03-03 01:53:51 +00:00
emaste	5c793c607c	Rationalize license text on Linuxolator files Many licenses on Linuxolator files contained small variations from the standard FreeBSD license text. To avoid license proliferation switch to the standard 2-clause FreeBSD license for those files where I have permission from each of the listed copyright holders. Additional files still waiting on permission from others are listed in review D14210. Approved by: dchagin, rdivacky, sos MFC after: 1 week MFC with: r329370 Sponsored by: The FreeBSD Foundation	2018-03-01 13:52:18 +00:00
jhb	3e51017144	Add a new variant of the GLA2GPA ioctl for use by the debug server. Unlike the existing GLA2GPA ioctl, GLA2GPA_NOFAULT does not modify the guest. In particular, it does not inject any faults or modify PTEs in the guest when performing an address space translation. This is used by bhyve's debug server to read and write memory for the remote debugger. Reviewed by: grehan MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D14075	2018-02-26 19:19:05 +00:00
pkelsey	91dbc8d21a	Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option. The conditional compilation support is now centralized in tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum theoretical code/data footprint when TCP_RFC7413 is disabled, but nearly all the TFO code should wind up being removed by the optimizer, the additional footprint in the syncache entries is a single pointer, and the additional overhead in the tcpcb is at the end of the structure. This enables the TCP_RFC7413 kernel option by default in amd64 and arm64 GENERIC. Reviewed by: hiren MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14048	2018-02-26 03:03:41 +00:00
jkim	972c1ff3db	Partially revert r197863 to reduce diff against i386. When I wrote the patch, I wanted to remove SYSINIT() usage from amd64 code. There is no reason to keep the divergence any more because iwasaki merged most amd64 suspend/resume code to i386 with r235622. Note this also fixed an enge case reported by royger. [1] Suggested by: jhb Reviewed by: royger Tested by: royger [1] MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14400 [1]	2018-02-24 01:24:57 +00:00
cem	807035047f	Remove unused error return from API that cannot fail No implementation of fpu_kern_enter() can fail, and it was causing needless error checking boilerplate and confusion. Change the return code to void to match reality. (This trivial change took nine days to land because of the commit hook on sys/dev/random. Please consider removing the hook or otherwise lowering the bar -- secteam never seems to have free time to review patches.) Reported by: Lachlan McIlroy <Lachlan.McIlroy AT isilon.com> Reviewed by: delphij Approved by: secteam (delphij) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14380	2018-02-23 20:15:19 +00:00
emaste	8f7d85819e	Use linux types for linux-specific syscalls Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14065	2018-02-23 19:09:27 +00:00
emaste	5122d84e64	Correct pseudo misspelling in sys/ comments contrib code and #define in intel_ata.h unchanged.	2018-02-23 18:15:50 +00:00
emaste	86be7b282e	Remove accidental vim droppings Reported by: cy	2018-02-22 03:37:01 +00:00
emaste	f4add23ff5	Correct proper nouns in the Linuxulator - Capitalize Linux - Spell FreeBSD out in full - Address some style(9) on changed lines Sponsored by: Turing Robotic Industries Inc.	2018-02-22 02:24:17 +00:00
jhb	4c5555dd19	Add two new ioctls to bhyve for batch register fetch/store operations. These are a convenience for bhyve's debug server to use a single ioctl for 'g' and 'G' rather than a loop of individual get/set ioctl requests. Reviewed by: grehan MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D14074	2018-02-22 00:39:25 +00:00
kib	ee3d0fb8ef	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread' policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Scetched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384	2018-02-20 10:13:13 +00:00
emaste	624a2a708b	Rationalize license text on Linuxolator files Many licenses on Linuxolator files contained small variations from the standard FreeBSD license text. To avoid license proliferation switch to the standard 2-clause FreeBSD license for those files where I have permission from each of the listed copyright holders. Additional files waiting on permission from others are listed in review D14210. Approved by: kan, marcel, sos, rdivacky MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-02-16 15:00:14 +00:00
kib	3b2f153e73	Use local symbol for offset. Small global symbols confuse ddb which matches them against small unrelated displacements and makes the disassembly ugly. Reported by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-16 13:32:46 +00:00
avg	f05cb3b6cf	move vintr_intercept_enabled under INVARIANTS The function is not used outside of INVARIANTS since r328622. MFC after: 1 week	2018-02-16 07:02:14 +00:00
anish	046b55b4d5	This change fixes duplicate detection of same IOMMU/AMD-Vi device for Ryzen with EFR support. IVRS can have entry of type legacy and non-legacy present at same time for same AMD-Vi device. ivhd driver will ignore legacy if new IVHD type is present as specified in AMD-Vi specification. Earlier both of IVHD entries used and two ivhd devices were created. Add support for new IVHD type 0x11 and 0x40 in ACPI. Create new struct of type acpi_ivrs_hardware_new for these new type of IVHDs. Legacy type 0x10 will continue to use acpi_ivrs_hardware. Reviewed by: avg Approved by: grehan Differential Revision:https://reviews.freebsd.org/D13160	2018-02-16 05:17:00 +00:00
jkim	3873442ceb	Change size of padding to reflect reality. No functional change. Discussed with: kib	2018-02-15 20:42:38 +00:00
cem	47a3cad5ce	x86 pmap: Make memory mapped via pmap_qenter() non-executable The idea is, the pmap_qenter() API is now defined to not produce executable mappings. If you need executable mappings, use another API. Add pg_nx flag in pmap_qenter on x86 to make kernel pages non-executable. Other architectures that support execute-specific permissons on page table entries should subsequently be updated to match. Submitted by: Darrick Lew <darrick.freebsd AT gmail.com> Reviewed by: markj Discussed with: alc, jhb, kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14062	2018-02-14 23:35:47 +00:00
emaste	30058b6834	amd64/pmap: Move Foundation copyright to the 2-clause section Sponsored by: The FreeBSD Foundation	2018-02-13 19:19:26 +00:00
hselasky	4287266c4f	Import the mthca kernel side infiniband driver from Linux 4.9 and fix compilation under FreeBSD. The mthca driver was temporarily removed as part of the Linux 4.9 RoCE/infinband upgrade. Top commit in Linux source tree: 69973b830859bc6529a7a0468ba0d80ee5117826 Sponsored by: Mellanox Technologies	2018-02-13 17:04:34 +00:00
jeff	ba27b5187b	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
jtl	c63f80e908	Mark the pages used for the initial page-table entries as wired. This makes them consistent with the way other page-table pages are allocated. It also provides the rest of the VM system a good clue that these pages are used. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14269	2018-02-12 17:27:50 +00:00
imp	c78af2dc5f	We don't support gcc < 4.2.1, so varargs.h now is just #error always. Unifdef for versions prior to 4.2.1 and remove now-unused header files. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14323	2018-02-12 14:48:14 +00:00
tychon	e942535b5d	Provide further mitigation against CVE-2017-5715 by flushing the return stack buffer (RSB) upon returning from the guest. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kvm?id=117cc7a908c83697b0b737d15ae1eb5943afe35b Reviewed by: grehan Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14272	2018-02-12 14:45:27 +00:00
jtl	fcb98f3425	On bootup, the amd64 pmap initialization code creates page-table mappings for the pages used for the kernel and some initial allocations used for the page table. It maps the kernel and the blocks used for these initial allocations using 2MB pages. However, if the kernel does not end on a 2MB boundary, it still maps the last portion using a 2MB page, but reports that the unused 4K blocks within this 2MB allocation are free physical blocks. This means that these same physical blocks could also be mapped elsewhere - for example, into a user process. Given the proximity to the kernel text and data area, it seems wise to avoid allowing someone to write data to physical blocks also mapped into these virtual addresses. (Note that this isn't a security vulnerability: the direct map makes most/all memory on the system mapped into kernel space. And, nothing in the kernel should be trying to access these pages, as the virtual addresses are unused. It simply seems wise to avoid reusing these physical blocks while they are mapped to virtual addresses so close to the kernel text and data area.) Consequently, let's reserve the physical blocks covered by the page-table mappings for these initial allocations. Reviewed by: kib, markj MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14268	2018-02-09 17:46:33 +00:00
markj	7bc81b6db1	Use vm_page_unwire_noq() instead of directly modifying page wire counts. No functional change intended. Reviewed by: alc, kib (previous revision) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D14266	2018-02-08 19:28:51 +00:00
jeff	e67ec0d694	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
emaste	0eb6366bc8	Additional linuxolator whitespace cleanup, missed in r328890	2018-02-05 18:39:06 +00:00
emaste	5c9ea56c98	Linuxolator whitespace cleanup A version of each of the MD files by necessity exists for each CPU architecture supported by the Linuxolator. Clean these up so that new architectures do not inherit whitespace issues. Clean up shared Linuxolator files while here. Sponsored by: Turing Robotic Industries Inc.	2018-02-05 17:29:12 +00:00
kib	0eb8b964d2	When switching IBRS on, also enable STIBP (Single Thread Indirect Branch Predictors) mitigation. DOcument 336996-001 promises that CPUs which implement IBRS but not STIBP silently ignore setting of the bit instead of trapping. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 16:56:02 +00:00
kib	01b52fdebb	IBRS support, AKA Spectre hardware mitigation. It is coded according to the Intel document 336996-001, reading of the patches posted on lkml, and some additional consultations with Intel. For existing processors, you need a microcode update which adds IBRS CPU features, and to manually enable it by setting the tunable/sysctl hw.ibrs_disable to 0. Current status can be checked in sysctl hw.ibrs_active. The mitigation might be inactive if the CPU feature is not patched in, or if CPU reports that IBRS use is not required, by IA32_ARCH_CAP_IBRS_ALL bit. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14029	2018-01-31 14:36:27 +00:00
avg	bedef9326b	vmm/svm: post LAPIC interrupts using event injection, not virtual interrupts The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR fields of VMCB to inject a virtual interrupt into a guest VM. This method has many advantages over the direct event injection as it offloads all decisions of whether and when the interrupt can be delivered to the guest. But with a purely software emulated vAPIC the advantage is also a problem. The problem is that the hypervisor does not have any precise control over when the interrupt is actually delivered to the guest (or a notification about that). Because of that the hypervisor cannot update the interrupt vector in IRR and ISR in the same way as real hardware would. The hypervisor becomes aware that the interrupt is being serviced only upon the first VMEXIT after the interrupt is delivered. This creates a window between the actual interrupt delivery and the update of IRR and ISR. That means that IRR and ISR might not be correctly set up to the point of the end-of-interrupt signal. The described deviation has been observed to cause an interrupt loss in the following scenario. vCPU0 posts an inter-processor interrupt to vCPU1. The interrupt is injected as a virtual interrupt by the hypervisor. The interrupt is delivered to a guest and an interrupt handler is invoked. The handler performs a requested action and acknowledges the request by modifying a global variable. So far, there is no VMEXIT and the hypervisor is unaware of the events. Then, vCPU0 notices the acknowledgment and sends another IPI with the same vector. The IPI gets collapsed into the previous IPI in the IRR of vCPU1. Only after that a VMEXIT of vCPU1 occurs. At that time the vector is cleared in the IRR and is set in the ISR. vCPU1 has vAPIC state as if the second IPI has never been sent. The scenario is impossible on the real hardware because IRR and ISR are updated just before the interrupt handler gets started. I saw several possibilities of fixing the problem. One is to intercept the virtual interrupt delivery to update IRR and ISR at the right moment. The other is to deliver the LAPIC interrupts using the event injection, same as legacy interrupts. I opted to use the latter approach for several reasons. It's equivalent to what VMM/Intel does (in !VMX case). It appears to be what VirtualBox and KVM do. The code is already there (to support legacy interrupts). Another possibility was to use a special intermediate state for a vector after it is injected using a virtual interrupt and before it is known whether it was accepted or is still pending. That approach was implemented in https://reviews.freebsd.org/D13828 That method is more complex and does not have any clear advantage. Please see sections 15.20 and 15.21.4 of "AMD64 Architecture Programmer's Manual Volume 2: System Programming" (publication 24593, revision 3.29) for comparison between event injection and virtual interrupt injection. PR: 215972 Reported by: ajschot@hotmail.com, grehan Tested by: anish, grehan, Nils Beyer <nbe@renzel.net> Reviewed by: anish, grehan MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13780	2018-01-31 11:14:26 +00:00
jhb	b5c926aafe	Ensure 'name' is not NULL before passing to strcmp(). This avoids a nested page fault when obtaining a stack trace in DDB if the address from the first frame does not resolve to a known symbol. MFC after: 1 week Sponsored by: Chelsio Communications	2018-01-30 23:29:27 +00:00
bdrewery	59f27534cc	Don't use an .OBJDIR for 'make sysent'. Reported by: emaste, jhb Sponsored by: Dell EMC	2018-01-29 19:14:15 +00:00
imp	1912ffb2e5	Add ISA PNP tables to ISA drivers. Fix a few incidental comments. ACPI ISA PBP tables not tagged, there's bigger issues with them.	2018-01-29 00:22:30 +00:00
kib	545e25ea75	Use PCID to optimize PTI. Use PCID to avoid complete TLB shootdown when switching between user and kernel mode with PTI enabled. I use the model close to what I read about KAISER, user-mode PCID has 1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID. Full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. User-mode part of TLB is flushed on the pmap activations as well. Similarly, IPI TLB shootdowns must handle both kernel and user address spaces for each address. Note that machines which implement PCID but do not have INVPCID instructions, cause the usual complications in the IPI handlers, due to the need to switch to the target PCID temporary. This is racy, but because for PCID/no-INVPCID we disable the interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers. On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set and we do not clear TLB. I can imagine alternative use of PCID, where there is only one PCID allocated for the kernel pmap. Then, there is no need to shootdown kernel TLB entries on context switch. But copyout(3) would need to either use method similar to proc_rwmem() to access the userspace data, or (in reverse) provide a temporal mapping for the kernel buffer into user mode PCID and use trampoline for copy. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc (some aspects) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D13985	2018-01-27 11:49:37 +00:00
trasz	1da0bebce4	Add SPDX identifiers to linux_ptrace.c and cfumass.c. MFC after: 2 weeks	2018-01-24 17:04:01 +00:00
emaste	ae11f64597	Use BSD-2-Clause-FreeBSD license on linux_support.s These files previously had a 3-clause license and 'THE REGENTS' text. Switch to standard 2-clause text with kib's approval, and add the SPDX tag. Approved by: kib	2018-01-23 20:35:43 +00:00
pfg	ced875130d	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
kib	db37b511ae	Use correct symbol name in r328202. Sponsored by: The FreeBSD Foundation MFC after: 11 days	2018-01-20 18:05:14 +00:00
kib	14962b8ee9	Use predefined symbol for the CR3.PCID mask. Sponsored by: The FreeBSD Foundation MFC after: 11 days	2018-01-20 17:46:09 +00:00
royger	774ffaf6a8	xen: fix IDT setup after PTI On amd64 the IDT handler was not set correctly when using PTI. While there also fix the selectors to SEL_KPL. Obtained from: kib MFC with: r328083	2018-01-20 14:59:37 +00:00
kib	b4c82b3b07	PTI: Trap if we returned to userspace with kernel (full) page table still active. Map userspace portion of VA in the PTI kernel-mode page table as non-executable. This way, if we ever miss reloading ucr3 into %cr3 on the return to usermode, the process traps instead of executing in potentially vulnerable setup. Catch the condition of such trap and verify user-mode %cr3, which is saved by page fault handler. I peek this trick in some article about Linux implementation. Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 12 days DIfferential revision: https://reviews.freebsd.org/D13956	2018-01-19 22:10:29 +00:00
nwhitehorn	e79f2b9178	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
emaste	1cf1c6c06d	Enable KPTI by default on amd64 for non-AMD CPUs Kernel Page Table Isolation (KPTI) was introduced in r328083 as a mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected, per https://www.amd.com/en/corporate/speculative-execution: We believe AMD processors are not susceptible due to our use of privilege level protections within paging architecture and no mitigation is required. Thus default KPTI to off for AMD CPUs, and to on for others. This may be refined later as we obtain more specific information on the sets of CPUs that are and are not affected. Submitted by: Mitchell Horne Reviewed by: cem Relnotes: Yes Security: CVE-2017-5754 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D13971	2018-01-19 15:42:34 +00:00
jhb	e2ed91ad09	Use a dedicated per-CPU stack for machine check exceptions. Similar to NMIs, machine check exceptions can fire at any time and are not masked by IF. This means that machine checks can fire when the kstack is too deep to hold a trap frame, or at critical sections in trap handlers when a user %gs is used with a kernel %cs. Use the same strategy used for NMIs of using a dedicated per-CPU stack configured in IST 3. Store the CPU's pcpu pointer at the stop of the stack so that the machine check handler can reliably find the proper value for %gs (also borrowed from NMIs). This should also fix a similar issue with PTI with a MC# occurring while the CPU is executing on the trampoline stack. While here, bypass trap() entirely and just call mca_intr(). This avoids a bogus call to kdb_reenter() (there's no reason to try to reenter kdb if a MC# is raised). Reviewed by: kib Tested by: avg (on AMD without PTI) Differential Revision: https://reviews.freebsd.org/D13962	2018-01-18 23:50:21 +00:00
jhb	efbbc35271	Remove two no-longer-used labels from the NMI interrupt handler. Reviewed by: kib	2018-01-18 22:13:53 +00:00
jhb	8fb0491b83	Adjust branch target in NMI handler for the !PTI case. In the !PTI case the NMI handler jumped past the instructions that set %rdi to point to the current PCB, but the target instructions assumed %rdi were set. Reviewed by: kib Tested by: pho	2018-01-18 20:12:12 +00:00
kib	af04296ad4	Move the kernphys declaration to machine/md_var.h. Apparently machinde/cpu.h is supposed to contain MD implementations of MI interfaces. Also, remove kernphys declaration from machdep.c, since it is already provided by md_var.h. Requested and reviewed by: bde MFC after: 13 days	2018-01-18 15:15:35 +00:00
kib	10c1564cbe	Fix compilation with gcc. etext is already declared in machine/cpu.h, move kernphys declaration there too. Based on the patch by: bde MFC after: 13 days	2018-01-18 11:21:03 +00:00
kib	911f28f4eb	Fix compilation with gas. Submitted by: bde MFC after: 13 days	2018-01-18 11:19:58 +00:00
kib	e24bdf2ac4	Remove the 'last' argument from the pmap_pti_free_page(). It is in fact unused. Noted and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 13 days	2018-01-18 11:01:41 +00:00
jhb	391a83c86b	Save and restore guest debug registers. Currently most of the debug registers are not saved and restored during VM transitions allowing guest and host debug register values to leak into the opposite context. One result is that hardware watchpoints do not work reliably within a guest under VT-x. Due to differences in SVM and VT-x, slightly different approaches are used. For VT-x: - Enable debug register save/restore for VM entry/exit in the VMCS for DR7 and MSR_DEBUGCTL. - Explicitly save DR0-3,6 of the guest. - Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from %rflags for the host. Note that because DR6 is "software" managed and not stored in the VMCS a kernel debugger which single steps through VM entry could corrupt the guest DR6 (since a single step trap taken after loading the guest DR6 could alter the DR6 register). To avoid this, explicitly disable single-stepping via the trace flag before loading the guest DR6. A determined debugger could still defeat this by setting a breakpoint after the guest DR6 was loaded and then single-stepping. For SVM: - Enable debug register caching in the VMCB for DR6/DR7. - Explicitly save DR0-3 of the guest. - Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM saves the guest DR6 in the VMCB, the race with single-stepping described for VT-x does not exist. For both platforms, expose all of the guest DRx values via --get-drX and --set-drX flags to bhyvectl. Discussed with: avg, grehan Tested by: avg (SVM), myself (VT-x) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D13229	2018-01-17 23:11:25 +00:00
markj	0c26afc4c5	Annotate a couple of changes from r328083. Reviewed by: kib X-MFC with: r328083	2018-01-17 21:52:12 +00:00
kib	c35d24e497	PTI for amd64. The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability. PTI is turned off by default for now, enable with the loader tunable vm.pmap.pti=1. The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped: kernel text (but not modules text) PCPU GDT/IDT/user LDT/task structures IST stacks for NMI and doublefault handlers. Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables. then the hardware trap frame is copied to the normal kstack, and execution continues. IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed. On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occured on the kstack, see the comment in doret_iret detection code in trap(). Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefits besides complexity. PDPE pages for the kernel half of the user page tables are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over the all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386. The need to flush hidden kernel translations on the switch to user mode make global tables (PG_G) meaningless and even harming, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit. MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline did not switched from PTI stack is fatal. The fix is pending. Reviewed by: markj (partially) Tested by: pho (previous version) Discussed with: jeff, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-17 11:44:21 +00:00
kib	72f9a98571	Amd64 user_ldt_deref() is not used outside sys_machdep.c. Mark it as static. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-17 11:21:03 +00:00
pfg	a7c6776f59	x86: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these ire likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the sorrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:08:22 +00:00
tychon	03cbc447ee	Provide some mitigation against CVE-2017-5715 by clearing registers upon returning from the guest which aren't immediately clobbered by the host. This eradicates any remaining guest contents limiting their usefulness in an exploit gadget. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b6c02f38315b720c593c6079364855d276886aa Reviewed by: grehan, rgrimes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13573	2018-01-15 18:37:03 +00:00
kib	f98ceb5bd2	Add STAC and CLAC instructions wrappers. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13838	2018-01-14 12:39:50 +00:00
jeff	cc3d6a3370	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
jeff	f375b4dd66	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
kib	9a35b0521e	Fix grammar. Submitted by: alc MFC after: 3 days	2018-01-11 16:50:03 +00:00
kib	3a13615ee3	Remove redundand CLD instructions. We already clear %RFLAGS.DF on the kernel entry due to the compiler's ABI requirements. Suggested by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-11 13:22:13 +00:00
kib	2d65a3eac2	Do not clear %RFLAGS.DF on fast syscall entry. Hardware already did it for us due to the mask loaded into the MSR_SF_MASK msr register. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13838	2018-01-11 12:54:33 +00:00
kib	f00163c316	Move the hardware setup for fast syscalls into a common function. Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-11 12:40:43 +00:00
kib	8d33b830f9	Rename COMMON_TSS_RSP0 to TSS_RSP0. The symbol is just an offset in the hardware TSS structure, it is not limited to the common_tss instance. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-11 12:28:08 +00:00
kib	7429338dc0	Update comment explaining the check, to reality. Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-11 12:07:24 +00:00
cem	8899c9cd84	x86: Document purpose of _safe variants of {rd,wr}msr() Sponsored by: Dell EMC Isilon	2018-01-10 22:41:00 +00:00
avg	67ddd57974	vmm/svm: contigmalloc of the whole svm_softc is excessive This is a followup to r307903. struct svm_softc takes more than 200 kilobytes while what we really need is 3 contiguous pages for I/O permission map and 2 contiguous pages for MSR permission map. Other physically mapped structures have a size of a single page, so a proper alignment is sufficient for their correct mapping. Thus, only the permission maps are allocated with contigmalloc now, the softc is allocated with a regular malloc. Additionally, this commit adds a check that malloc returns memory with the expected page alignment and that contigmalloc does not fail. Unfortunately, at present svm_vminit() is expected to always succeed and there is no way to report an error. So, a contigmalloc failure leads to a panic. We should probably fix this. MFC after: 2 weeks	2018-01-09 14:22:18 +00:00
kib	dc8d51112c	Make it possible to re-evaluate cpu_features. Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of cpu_features, cpu_features2, cpu_stdext_features, and std_stdext_features2. The intent is to allow the kernel to see the changes in the CPU features after micocode update. Of course, the update is not atomic across variables and not synchronized with readers. See the man page warning as well. Reviewed by: imp (previous version), jilles Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13770	2018-01-05 21:06:19 +00:00
avg	85a273200a	Fix a couple of comments in AMD Virtual Machine Control Block structure MFC after: 1 week	2018-01-05 19:15:24 +00:00
kib	e478c0d2cf	Avoid re-check of usermode condition. It does not change anything in the behavior of trap_pfault(), while eliminating obfuscation of jumping to the code which checks for the condition reversed of the goto cause. Also avoid force initialize the rv variable, since it is now only accessed after storing vm_fault() return value. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13725	2018-01-01 20:47:03 +00:00
kib	10fbef9845	Remove MP SAFE marks and stray register name in comments. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-12-31 17:07:59 +00:00
cperciva	0858f13002	Use the TSLOG framework to record entry/exit timestamps for hammer_time. The entry must be logged "manually" using TSRAW rather than TSENTER since PCPU data structures have not yet been initialized and thus curthread cannot be accessed; &thread0 is what will become curthread later in hammer_time. Other MD initialization code should be similarly instrumented in order to gain visibility into the time spent before entering mi_startup; this will require some care and testing from people with access to such hardware.	2017-12-31 09:22:07 +00:00
eadler	421a929b1e	kernel: Fix several typos and minor errors - duplicate words - typos - references to old versions of FreeBSD Reviewed by: imp, benno	2017-12-27 03:23:21 +00:00
tychon	02cc877968	Recognize a pending virtual interrupt while emulating the halt instruction. Reviewed by: grehan, rgrimes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13573	2017-12-21 18:30:11 +00:00
kib	3d18a9d66f	Add atomic_load(9) and atomic_store(9) operations. They provide relaxed-ordered atomic access semantic. Due to the FreeBSD memory model, the operations are syntaxical wrappers around the volatile accesses. The volatile qualifier is used to ensure that the access not optimized out and in turn depends on the volatile semantic as implemented by supported compilers. The motivation for adding the operation is to help people coming from other systems or knowing the C11/C++ standards where atomics have special type and require use of the special access operations. It is still the case that FreeBSD requires plain load and stores of aligned integer types to be atomic. Suggested by: jhb Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 09:59:20 +00:00
markj	b0b9b4fcf4	Pass the trap frame to fasttrap hooks. The DTrace fasttrap entry points expect a struct reg containing the register values of the calling thread. Perform the conversion in fasttrap rather than in the trap handler: this reduces the number of ifdefs and avoids wasting stack space for traps that don't involve DTrace. MFC after: 2 weeks	2017-12-11 19:21:39 +00:00
bde	d81660feeb	Move instantiation of msgbufp from 9 MD files to subr_prf.c. This variable should be pure MI except possibly for reading it in MD dump routines. Its initialization was pure MD in 4.4BSD, but FreeBSD changed this in r36441 in 1998. There were many imperfections in r36441. This commit fixes only a small one, to simplify fixing the others 1 arch at a time. (r47678 added support for special/early/multiple message buffer initialization which I want in a more general form, but this was too fragile to use because hacking on the msgbufp global corrupted it, and was only used for 5 hours in -current...)	2017-12-07 07:55:38 +00:00
avg	209d01f38b	amd-vi: set iommu msi configuration using pci_enable_msi method This is better than directly changing PCI configuration space of the device because it makes the PCI bus aware of the configuration. Also, the change allows to drop a bunch of code that duplicated pci_enable_msi() functionality. I wonder if it's possible to further simplify the code by using pci_alloc_msi().	2017-12-04 17:10:52 +00:00
avg	057e143faa	vmm/amd: add ivhd device with a higher order ivhd should attach after the root PCI bus and, thus, after the ACPI Host-PCI bridge off which the bus hangs. This is because ivhd changes PCI configuration of a PCI IOMMU device that is located on the root bus. If the bus attaches after ivhd it clears the MSI portion of the configuration. As a result IOMMU event interrupts would never be delivered. For regular ACPI devices the order is calculated as ACPI_DEV_BASE_ORDER + level * 10 where level is a depth of the device in the ACPI namespace. I expect the depth of the Host-PCI bridge to be two or three, so ACPI_DEV_BASE_ORDER + 10 * 10 should be a sufficiently safe order for ivhd. This should fix the setup of the AMD-Vi event interrupt when vmm is preloaded (as opposed to kldload-ed).	2017-12-04 17:08:03 +00:00
avg	a4b48b00e7	amd-vi: clear event interrupt and overflow bits upon handling the interrupt This ensures that we can receive further event interrupts. See the description of the bits in the specification for MMIO Offset 2020h IOMMU Status Register. The bits are defined as set-by-hardware write-1-to-clear, same as all the bits in the status register. Discussed with: anish	2017-12-04 17:02:53 +00:00
scottl	49fb5dd79b	It's time to retire AHC_REG_PRETTY_PRINT and AHD_REG_PRETTY_PRINT from the standard kernels. They are still available as custom compile options.	2017-11-29 23:41:49 +00:00
brooks	c6fbed1a3a	Disable vim syntax highlighting. Vim's default pick doesn't understand that ';' is a comment character and the result looks horrible. Reviewed by: emaste	2017-11-28 18:23:17 +00:00
kib	dbbac96e0d	Fix index calculation for the page table pages for efirt 1:1 map. Stop issuing pre-assigned number to enumerate all page table pages, the assignment is incorrect. Instead automatically calculate the next unused index. This index in fact does not serve any purpose except to be unique to satisfy vm_page_grab() interface, we do not look up the page by the index later. Reported and tested by: emaste Reviewed by: andrew Sponsored by: The FreeBSD Foundation MFC after: 2 weeks PR: 223906 Differential revision: https://reviews.freebsd.org/D13273	2017-11-28 09:34:43 +00:00
fsu	24e4690114	Remap ENOATTR to ENODATA in the linuxulator. In the linux ENOADATA is frequently #defined as ENOATTR. The change is required for an xattrs support implementation. MFC after: 1 week Discussed with: netchild Approved by: pfg Differential Revision: https://reviews.freebsd.org/D13221	2017-11-27 17:03:11 +00:00
pfg	d754734f5c	sys/amd64: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:03:07 +00:00
ed	d13a2ec254	Use TO_PTR() to convert integers to pointers. For FreeBSD/arm64's cloudabi32 support, I'm going to need a TO_PTR() in this place. Also use it for all of the other source files, so that the difference remains as minimal as possible. MFC after: 2 weeks	2017-11-26 14:45:56 +00:00
hselasky	091ce9badd	Merge ^/head r326132 through r326161.	2017-11-24 12:13:27 +00:00
hselasky	7b5126003a	Merge ^/head r325999 through r326131.	2017-11-23 14:28:14 +00:00
kib	873f304292	Remove lint support from system headers and MD x86 headers. Reviewed by: dim, jhb Discussed with: imp Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13156	2017-11-23 11:40:16 +00:00
pfg	4736ccfd9c	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
hselasky	c6f05b2594	Merge ^/head r325842 through r325998.	2017-11-19 12:36:03 +00:00
pfg	9da7bdde06	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133	2017-11-18 14:26:50 +00:00
hselasky	7f64b39d2d	Merge ^/head r325663 through r325841.	2017-11-15 11:28:11 +00:00
hselasky	b909dfecb7	Remove no longer supported mthca driver. Sponsored by: Mellanox Technologies	2017-11-13 10:59:38 +00:00
mjg	3552994d2f	amd64: stop nesting preemption counter in spinlock_enter Discussed with: jhb	2017-11-12 03:13:01 +00:00
jeff	3c355d849c	Replace manyinstances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
kib	d8557dc01f	Zero the structure instead of the pointer to it. Reported by: Don Morris <Don.Morris@dell.com> MFC after: 4 days	2017-11-05 20:03:57 +00:00
kib	8113f89b71	x86: Do not emit unused TD_TID symbols. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-11-04 10:51:52 +00:00
kib	e5cf015156	Restore an optimization that was temporary disabled by r324665. In reclaim_pv_chunk(), rotate the pv chunks list so that next invocations of the reclaim do not scan the same pv chunks that could not be freed. Only do the rotation when there is no parallel scan, tracked by active_reclaims counter. To rotate, move all chunks that are before current iteration marker, after another marker that is inserted at the list tail on start of the reclaim. Reviewed by: alc Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-11-01 18:06:44 +00:00
kib	a6dcbd1557	Consistently ensure that we do not load MXCSR with reserved bits set. Some callers of fpusetregs()/npxsetregs(), most importantly set_fpcontext(), clear reserved bits. But some did not. Do the clearing in fpusetregs() and remove now redundand operation from set_fpcontext(). Reported by: Maxime Villard <max@m00nbsd.net> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-11-01 10:32:44 +00:00
grehan	02db9e4f6d	Emulate the "OR reg, r/m" instruction (opcode 0BH). This is needed for the HDA emulation with FreeBSD guests. Reviewed by: marcelo MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D12832	2017-11-01 03:26:53 +00:00
tijl	173e2ebded	Set the return address for stack entry points to zero. Stack unwinders treat zero as a stop condition. The value on the stack can be non-zero because thread stacks may be arbitrary memory provided via pthread_attr_setstack(3) or may be recycled from previous threads. Reference: https://lists.freebsd.org/pipermail/freebsd-current/2017-August/066855.html https://lists.freebsd.org/pipermail/freebsd-current/2017-October/067254.html Discussed with: kib MFC after: 1 week	2017-10-31 11:51:34 +00:00
ian	ae3d035140	Improve the performance of the hpet timer in bhyve guests by making the timer frequency a power of two. This changes the frequency from 10 to 16.7 MHz (2 ^ 24 HZ). Using a power of two avoids roundoff errors when doing arithmetic in sbintime_t units. Testing shows this can fix erratic ntpd behavior in guests using the hpet timer (which is the default for multicore guests). Reported by: bsam@	2017-10-29 20:50:03 +00:00
eadler	45275e3a26	Update several more URLs - Primarily http -> https - Primarily FreeBSD project URLs	2017-10-29 08:17:03 +00:00
jhb	7db3e0e96c	Rework pass through changes in r305485 to be safer. Specifically, devices that do not support PCI-e FLR and were not gracefully shutdown by the guest OS could continue to issue DMA requests after the VM was terminated. The changes in r305485 meant that those DMA requests were completed against the host's memory which could result in random memory corruption. Instead, leave ppt devices that are not attached to a VM disabled in the IOMMU and only restore the devices to the host domain if the ppt(4) driver is detached from a device. As an added safety belt, disable busmastering for a pass-through device when before adding it to the host domain during ppt(4) detach. PR: 222937 Tested by: Harry Schmalzbauer <freebsd@omnilan.de> Reviewed by: grehan MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12661	2017-10-27 14:57:14 +00:00
markj	1588800df4	Fix the VM_NRESERVLEVEL == 0 build. Add VM_NRESERVLEVEL guards in the pmaps that implement transparent superpage promotion using reservations. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12764	2017-10-23 15:34:05 +00:00
mjg	ca37b93652	Make the sleepq chain hash size configurable per-arch and bump on amd64. While here cache-align chains. This shortens longest found chain during poudriere -j 80 from 32 to 16. Pushing this higher up will probably require allocation on boot.	2017-10-22 20:43:50 +00:00
bz	48b1992757	With r181803 on 2008-08-17 23:27:27Z the first VIMAGE commit went into HEAD. Enable VIMAGE in GENERIC kernels and some others (where GENERIC does not exist) on HEAD. Disable building LINT-VIMAGE with VIMAGE being default. This should give it a lot more exposure in the run-up to 12 to help us evaluate whether to keep it on by default or not. We are also hoping to get better performance testing. The feature can be disabled using nooptions. Requested by: many Reviewed by: kristof, emaste, hiren X-MFC after: never Relnotes: yes Differential Revision: https://reviews.freebsd.org/D12639	2017-10-20 21:40:59 +00:00
mjg	397cc565b8	amd64: plug missed dt_lock in cpu_fork	2017-10-20 18:58:11 +00:00
mjg	86705ae197	amd64: __exclusive_cache_line pv_chunks_mutex and pv_list_locks Note that pv_list_locks is an array and currently it fits 2 locks per line. Resizing it and/or putting more locks in different lines requires several tests. MFC after: 1 week	2017-10-20 03:38:58 +00:00
mjg	c26a9dfc26	amd64: avoid acquiring dt lock if possible (which is the common case) Discussed with: kib MFC after: 1 week	2017-10-20 03:30:02 +00:00
markj	c87fb69add	Move kernel dump offset tracking into MI code. All of the kernel dump implementations keep track of the current offset ("dumplo") within the dump device. However, except for textdumps, they all write the dump sequentially, so we can reduce code duplication by having the MI code keep track of the current offset. The new dump_append() API can be used to write at the current offset. This is needed to implement support for kernel dump compression in the MI kernel dump code. Also simplify dump_encrypted_write() somewhat: use dump_write() instead of duplicating its bounds checks, and get rid of the redundant offset tracking. Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11722	2017-10-18 15:38:05 +00:00
kib	95ee5810e6	Fix the pv_chunks pc_lru tailq handling in reclaim_pv_chunk(). For processing, reclaim_pv_chunk() removes the pv_chunk from the lru list, which makes pc_lru linkage invalid. Then the pmap lock is released, which allows for other thread to free the last pv entry allocated from the chunk and call free_pv_chunk(), which tries to modify the invalid linkage. Similarly, the chunk is inserted into the private tailq new_tail temporary. Again, free_pv_chunk() might be run and corrupt the linkage for the new_tail after the pmap lock is dropped. This is a consequence of r299788 elimination of pvh_global_lock, which allowed for reclaim to run in parallel with other pmap calls which free pv chunks. As a fix, do not remove the chunk from pc_lru queue, use a marker to remember the position in the queue iteration. We can safely operate on the chunks after the chunk's pmap is locked, we fetched the chunk after the marker, and we checked that chunk pmap is same as we have locked, because chunk removal from pc_lru requires both pv_chunk_mutex and the pmap mutex owned. Note that the fix lost an optimization which was present in the previous algorithm. Namely, new_tail requeueing rotated the pv chunks list so that reclaim didn't scan the same pv chunks that couldn't be freed (because they contained a wired and/or superpage mapping) on every invocation. An additional change is planned which would improve this. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-16 15:16:24 +00:00
kib	8bde9fde3b	Change amd64_get_ldt() to return 'EOF' when the LDT is not yet allocated, when requested range of descriptors does not fit into currently allocated LDT, or trim the return if the range fits partially. Before, the function returned EINVAL. Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-09 16:20:39 +00:00
mjg	ea8bf70599	amd64: remove unused variable from pmap_delayed_invl_genp Reported by: gcc MFC after: 1 week	2017-10-05 18:51:48 +00:00
kib	6a6f4a405d	Ensure that after sucessfull i386_set_ldt() call, other threads can use LDT segments immediately. If the i386_set_ldt() call created a first LDT descriptor (and consequently created the LDT) for our address space, LDTR is currently loaded only on the CPU executing the syscall. Other CPUs executing threads sharing the address space, would only load LDTR after context switch. Uncomment set_user_ldt_rv() and call it on all CPUs. Remove critical section inside set_user_ldt(), it is not needed in the context of call from smp_rendezvous(). Set md_ldt after md_ldt_sd is initialized using the same code sequence as in user_ldt_free(). Do the whole initialization in a critical section, to not race with the context switching while we set LDT. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 13:12:59 +00:00
kib	656af25b35	Avoid a race betweem freeing LDT and context switches. cpu_switch.S uses curproc->p_md.md_ldt value as the flag indicating presence of the process LDT. The flag is checked and then ldt segment descriptor is copied into the CPU' GDT slot. Disallow context switches around clearing of the curproc LDT state by performing the cleanup in critical section. Ensure that the md_ldt flag is cleared before md_ldt_sd descriptor content is destroyed by inserting fence between the operations. We depend on the x86 memory model strong ordering guarantees, in particular, that cpu_switch.S observes the writes to md_ldt and md_ldt_sd in the expected order. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 12:50:03 +00:00
kib	4f3ec8ce9a	Improve amd64_get_ldt(). Provide consistent snapshot of the requested descriptors by preventing other threads from modifying LDT while we fetch the data, lock dt_lock around the read. Copy the data into intermediate buffer, which is copied out after the lock is dropped. Use guaranteed atomic (aligned volatile) reads of the descriptors to use same-size atomic as CPU update to set A bit in the descriptor type field. Improve overflow checking for the descriptors range calculations and remove unneeded casts. Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 12:29:34 +00:00
kib	a71964674c	Minor style fix. Requested by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 12:19:55 +00:00
kib	00ef4a21e2	Complete r323772 on amd64. Compilers are allowed to combine plain reads into group operations, e.g. 64bit element copies of one array into another can be legitimately optimized back to a memcpy() call, which r323772 tried to prevent. Qualify accesses to LDT descriptors with volatile dereference to ensure that each write indeed occurs. After that, our usual claim of native-size aligned writes being atomic applies. This is equivalent to atomic_store(memory_order_relaxed) C11 accesses, but our machine/atomic.h does not provide corresponding primitive. Noted and reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 12:16:45 +00:00
kib	3684a76fa7	Use ANSI C declaration for amd64_get_ldt(). Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 12:07:38 +00:00
kib	08f33cdf7c	Correct format specifiers in the debug code. Requested by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 12:01:39 +00:00
kib	c73f51f840	Remove useless comments. Requested by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 11:56:04 +00:00
kib	fdbe2e91b7	On amd64, mark the set_user_ldt() function as static. On i386, the function is used from the context switch code and needs to be accessible externally. Amd64 MD context switch does not lock an LDT spinlock and inlines switching in assembly. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 11:50:01 +00:00
kib	404978fe18	Reduce default max_ldt_segment value to 512. This makes the LDT to use only one page with default settings, avoiding the need to find contigous 2 pages in KVA. It seems that most users are fine even with 512 segments. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 11:36:55 +00:00
kib	cfbf7b56f3	Update comment to note that we skip LDT reload for kthreads as well. Noted by: bde Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-10-05 11:34:51 +00:00
kib	b20e21edec	Hide kernel stuff from userspace. Sponsored by: Mellanox Technologies	2017-10-02 08:37:43 +00:00
andrew	f448611e08	To prepare for adding EFI runtime services support on arm64 move the machine independent parts of the existing code to a new file that can be shared between amd64 and arm64. Reviewed by: kib (previous version), imp Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D12434	2017-10-01 19:52:47 +00:00
kib	466bfe2553	Do not do torn writes to active LDTs. Care must be taken when updating the active LDT, since parallel threads might try to load a segment descriptor which is currently updated. Since the results are undefined, this cannot be ignored by claiming to be an application race. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12413	2017-09-19 17:57:04 +00:00
kibab	241835d4e9	Add MMCCAM-enabled kernel config for IMX6, reduce debug noice in MMCCAM kernels CAM_DEBUG_TRACE results in way too much debug output than needed now. When debugging, it's always possible to turn on trace level using camcontrol. Approved by: imp (mentor) Differential Revision: https://reviews.freebsd.org/D12110	2017-09-13 10:56:02 +00:00
cem	3bf35bb3c4	Add smn(4) driver for AMD System Management Network AMD Family 17h CPUs have an internal network used to communicate between the host CPU and the PSP and SMU coprocessors. It exposes a simple 32-bit register space. Reviewed by: avg (no +1), mjoras, truckman Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12217	2017-09-05 15:13:41 +00:00
jpaetzel	bfd734f77c	Revert r323087 This needs more thinking out and consensus, and the commit message was wrong AND there was a typo in the commit. pointyhat: jpaetzel	2017-09-01 17:03:48 +00:00
jpaetzel	612bb8539d	Take options IPSEC out of GENERIC PR: 220170 Submitted by: delphij Reviewed by: ae, glebius MFC after: 2 weeks Differential Revision: D11806	2017-09-01 15:54:53 +00:00
jpaetzel	f7739d7e09	Allow kldload tcpmd5 PR: 220170 MFC after: 2 weeks	2017-08-31 20:16:28 +00:00
mav	5849e8f575	Add NTB driver for PLX/Avago/Broadcom PCIe switches. This driver supports both NTB-to-NTB and NTB-to-Root Port modes (though the second with predictable complications on hot-plug and reboot events). I tested it with PEX 8717 and PEX 8733 chips, but expect it should work with many other compatible ones too. It supports up to two NT bridges per chip, each of which can have up to 2 64-bit or 4 32-bit memory windows, 6 or 12 scratchpad registers and 16 doorbells. There are also 4 DMA engines in those chips, but they are not yet supported. While there, rename Intel NTB driver from generic ntb_hw(4) to more specific ntb_hw_intel(4), so now it is on par with this new ntb_hw_plx(4) driver and alike to Linux naming. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2017-08-30 21:16:32 +00:00
cem	6659d8ab68	Drop CACHE_LINE_SIZE to 64 bytes on x86 The actual cache line size has always been 64 bytes. The 128 number arose as an optimization for Core 2 era Intel processors. By default (configurable in BIOS), these CPUs would prefetch adjacent cache lines unintelligently. Newer CPUs prefetch more intelligently. The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400 series, "Dunnington"). If you are still using one of these CPUs, especially in a multi-socket configuration, consider locating the "adjacent cache line prefetch" option in BIOS and disabling it. Reported by: mjg Reviewed by: np Discussed with: jhb Sponsored by: Dell EMC Isilon	2017-08-28 22:28:41 +00:00
rlibby	0cabf7f260	amd64: drop q suffix from rd[fg]sbase for gas compatibility Reviewed by: kib Approved by: markj (mentor) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12133	2017-08-26 23:13:18 +00:00
kib	dd1b856d37	Save KGSBASE in pcb before overriding it with the guest value. Reported by: lwhsu, mjoras Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 18 days	2017-08-24 10:49:53 +00:00
kib	7bf099a0ef	Ensure that fs/gs bases are stored in pcb before copying the pcb for new process or thread. Reported and tested by: ae, dhw Sponsored by: The FreeBSD Foundation MFC after: 20 days	2017-08-22 18:15:47 +00:00
kib	f495f3ebd8	Make WRFSBASE and WRGSBASE instructions functional. Right now, we enable the CR4.FSGSBASE bit on CPUs which support the facility (Ivy and later), to allow usermode to read fs and gs bases without syscalls. This bit also controls the write access to bases from userspace, but WRFSBASE and WRGSBASE instructions currently cannot be used, because return path from both exceptions or interrupts overrides bases with the values from pcb. Supporting the instructions is useful because this means that usermode can implement green-threads completely in userspace without issuing syscalls to change all of the machine context. Support is implemented by saving the fs base and user gs base when PCB_FULL_IRET flag is set. The flag is set on the context switch, which potentially causes clobber of the bases due to activation of another context, and when explicit modification of the user context by a syscall or exception handler is performed. In particular, the patch moves setting of the flag before syscalls change context. The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on entry from userspace can be considered a bug fixes on its own. Reviewed by: jhb (previous version) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D12023	2017-08-21 17:38:02 +00:00
kib	ebad4d0743	Simplify the code. Noted by: Oliver Pinter Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-08-20 11:18:16 +00:00
kib	2dd98dac16	Simplify amd64 trap(). - Use more relevant name 'signo' instead of 'i' for the local variable which contains a signal number to send for the current exception. - Eliminate two labels 'userout' and 'out' which point to the very end of the trap() function. Instead use return directly. - Re-indent the prot_fault_translation block by reducing if() nesting. - Some more monor style changes. Requested and reviewed by: bde Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-08-20 09:52:25 +00:00

... 3 4 5 6 7 ...

8448 Commits