freebsd-dev

Author	SHA1	Message	Date
Brooks Davis	e8504bf9e7	FCP-101: Remove vx(4). Relnotes: yes FCP: https://github.com/freebsd/fcp/blob/master/fcp-0101.md Reviewed by: jhb, imp Differential Revision: https://reviews.freebsd.org/D20230	2019-05-17 15:24:26 +00:00
Brooks Davis	be345ff023	FCP-101: Remove txp(4). Relnotes: yes FCP: https://github.com/freebsd/fcp/blob/master/fcp-0101.md Reviewed by: jhb, imp Differential Revision: https://reviews.freebsd.org/D20230	2019-05-17 15:24:17 +00:00
Brooks Davis	b1b1c2fe38	FCP-101: Remove tx(4). Relnotes: yes FCP: https://github.com/freebsd/fcp/blob/master/fcp-0101.md Reviewed by: jhb, imp Differential Revision: https://reviews.freebsd.org/D20230	2019-05-17 15:24:08 +00:00
Brooks Davis	7c897ca91f	FCP-101: Remove tl(4). Relnotes: yes FCP: https://github.com/freebsd/fcp/blob/master/fcp-0101.md Reviewed by: jhb, imp Differential Revision: https://reviews.freebsd.org/D20230	2019-05-17 15:24:00 +00:00
Brooks Davis	3b70dd81f5	FCP-101: Remove sf(4). Relnotes: yes FCP: https://github.com/freebsd/fcp/blob/master/fcp-0101.md Reviewed by: jhb, imp Differential Revision: https://reviews.freebsd.org/D20230	2019-05-17 15:23:43 +00:00
Brooks Davis	607790d10f	FCP-101: Remove pcn(4). Relnotes: yes FCP: https://github.com/freebsd/fcp/blob/master/fcp-0101.md Reviewed by: jhb, imp Differential Revision: https://reviews.freebsd.org/D20230	2019-05-17 15:23:34 +00:00
Brooks Davis	05aa6e583b	FCP-101: Remove ed(4). Relnotes: yes FCP: https://github.com/freebsd/fcp/blob/master/fcp-0101.md Reviewed by: jhb, imp Differential Revision: https://reviews.freebsd.org/D20230	2019-05-17 15:23:02 +00:00
Brooks Davis	08ac01a92c	FCP-101: Remove de(4). Relnotes: yes FCP: https://github.com/freebsd/fcp/blob/master/fcp-0101.md Reviewed by: jhb, imp Differential Revision: https://reviews.freebsd.org/D20230	2019-05-17 15:22:54 +00:00
Konstantin Belousov	7c5a46a1bc	Remove resolver_qual from DEFINE_IFUNC/DEFINE_UIFUNC macros. In all practical situations, the resolver visibility is static. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: so (emaste) Differential revision: https://reviews.freebsd.org/D20281	2019-05-16 22:20:54 +00:00
Konstantin Belousov	07d7e24cf7	amd64 pmap: sysctl vm.pmap.pcid_save_cnt should be read-only. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-05-16 14:33:32 +00:00
Konstantin Belousov	5b2c83cb59	amd64 pmap: Add tunable vm.pmap.di_locked to set DI mode. This is done mostly for debugging in field. Also added the sysctl of the same name to report used mode. Sponsored by: The FreeBSD Foundation MFC after: 1 month	2019-05-16 14:29:09 +00:00
Konstantin Belousov	1febb0b0ae	amd64 pmap: Rename DI functions. pmap_delayed_invl_started -> pmap_delayed_invl_start pmap_delayed_invl_finished -> pmap_delayed_invl_finish Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 month	2019-05-16 13:40:54 +00:00
Konstantin Belousov	4d3b28bcdc	amd64 pmap: rework delayed invalidation, removing global mutex. For machines having the cmpxchg16b instruction, i.e. everything but very early Athlons, provide a lockless implementation of delayed invalidation. The implementation maintains a lock-less singly-linked list with the trick from the T.L. Harris article about the volatile mark of the elements being removed. Double-CAS is used to atomically update both link and generation. A new thread starting DI appends itself to the end of the queue, setting the generation to the generation of the last element +1. On DI finish, the thread donates its generation to the previous element. The generation of the fake head of the list is the last passed DI generation. Basically, the implementation is a queued spinlock but without a spinlock. Many thanks both to Peter Holm and Mark Johnston for keeping with me while I produced intermediate versions of the patch. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 month MFC note: td_md.md_invl_gen should go to the end of struct thread Differential revision: https://reviews.freebsd.org/D19630	2019-05-16 13:28:48 +00:00
Ryan Libby	d375016d8d	x86: spell vpxor %zmm0 as vpxord Fix gcc/gas amd64 & i386 build after r347566. Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20264	2019-05-15 18:13:43 +00:00
Edward Tomasz Napierala	060d0b57b8	Fix handling of r10 in Linux ptrace(2). This fixes decoding of the 'flags' argument to mmap(2) with Linux strace(1). Reviewed by: dchagin MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20223	2019-05-14 20:59:44 +00:00
Konstantin Belousov	7355a02bdd	Mitigations for Microarchitectural Data Sampling. Microarchitectural buffers on some Intel processors utilizing speculative execution may allow a local process to obtain a memory disclosure. An attacker may be able to read secret data from the kernel or from a process when executing untrusted code (for example, in a web browser). Reference: https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00233.html Security: CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091 Security: FreeBSD-SA-19:07.mds Reviewed by: jhb Tested by: emaste, lwhsu Approved by: so (gtetlow)	2019-05-14 17:02:20 +00:00
Mark Johnston	0ac6ef663b	Fix formatting. MFC after: 3 days	2019-05-14 15:19:48 +00:00
Dmitry Chagin	c5156c7785	Linuxulator depends on fundamental kernel settings such as SMP. Many of them are listed in opt_global.h, which is not generated when building modules outside of a kernel, so such modules never match the real configured kernel. So, we should prevent our users from building obviously defective modules. Therefore, remove the root cause of building modules outside of a kernel: the possibility of building modules with DEBUG or KTR flags. Also remove all of the DEBUG printfs, as they are incomplete and not informative in threaded programs; in addition, half of the system calls do not have a DEBUG printf. For debugging Linux programs we have dtrace, ktr and ktrace. PR: 222861 Reviewed by: trasz MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20178	2019-05-13 18:24:29 +00:00
Mark Johnston	54a3a11421	Provide separate accounting for user-wired pages. Historically we have not distinguished between kernel wirings and user wirings for accounting purposes. User wirings (via mlock(2)) were subject to a global limit on the number of wired pages, so if large swaths of physical memory were wired by the kernel, as happens with the ZFS ARC among other things, the limit could be exceeded, causing user wirings to fail. The change adds a new counter, v_user_wire_count, which counts the number of virtual pages wired by user processes via mlock(2) and mlockall(2). Only user-wired pages are subject to the system-wide limit which helps provide some safety against deadlocks. In particular, while sources of kernel wirings typically support some backpressure mechanism, there is no way to reclaim user-wired pages short of killing the wiring process. The limit is exported as vm.max_user_wired, renamed from vm.max_wired, and changed from u_int to u_long. The choice to count virtual user-wired pages rather than physical pages was done for simplicity. There are mechanisms that can cause user-wired mappings to be destroyed while maintaining a wiring of the backing physical page; these make it difficult to accurately track user wirings at the physical page layer. The change also closes some holes which allowed user wirings to succeed even when they would cause the system limit to be exceeded. For instance, mmap() may now fail with ENOMEM in a process that has called mlockall(MCL_FUTURE) if the new mapping would cause the user wiring limit to be exceeded. Note that bhyve -S is subject to the user wiring limit, which defaults to 1/3 of physical RAM. Users that wish to exceed the limit must tune vm.max_user_wired. Reviewed by: kib, ngie (mlock() test changes) Tested by: pho (earlier version) MFC after: 45 days Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19908	2019-05-13 16:38:48 +00:00
Mateusz Guzik	b72515e129	amd64: tidy up pagezero*/pagecopy (movq -> movl) Sponsored by: The FreeBSD Foundation	2019-05-12 07:11:44 +00:00
Mateusz Guzik	45372f1a6f	amd64: fixup MEMMOVE comment (10 -> r10) Sponsored by: The FreeBSD Foundation	2019-05-12 06:42:17 +00:00
Mateusz Guzik	a8c2fcb287	x86: store pending bitmapped IPIs in per-cpu areas This gets rid of the global cpu_ipi_pending array. While here, replace cmpset with fcmpset in the delivery code and opportunistically check if a given IPI is already pending. Sponsored by: The FreeBSD Foundation	2019-05-12 06:36:54 +00:00
Mateusz Guzik	8eae2be460	amd64: stop re-reading curpc in suword Plugs re-reads missed in r341719 Sponsored by: The FreeBSD Foundation	2019-05-12 06:34:58 +00:00
Andrew Gallatin	542970fa2d	Remove IPSEC from GENERIC due to performance issues Having IPSEC compiled into the kernel imposes a non-trivial performance penalty on multi-threaded workloads due to IPSEC refcounting. In my benchmarks of multi-threaded UDP transmit (connected sockets), I've seen a roughly 20% performance penalty when the IPSEC option is included in the kernel (16.8Mpps vs 13.8Mpps with 32 senders on a 14 core / 28 HTT Xeon 2697v3). This is largely due to key_addref() incrementing and decrementing an atomic reference count on the default policy. This causes all CPUs to stall on the same cacheline, as it bounces between different CPUs. Given that relatively few users use ipsec, and that it can be loaded as a module, it seems reasonable to ask those users to load the ipsec module so as to avoid imposing this penalty on the GENERIC kernel. It's my hope that this will make FreeBSD look better in "out of the box" benchmark comparisons with other operating systems. Many thanks to ae for fixing auto-loading of ipsec.ko when ifconfig tries to configure ipsec, and to cy for volunteering to ensure that the racoon ports will load the ipsec.ko module Reviewed by: cem, cy, delphij, gnn, jhb, jpaetzel Differential Revision: https://reviews.freebsd.org/D20163	2019-05-09 22:38:15 +00:00
Kyle Evans	251a32b5b2	tun/tap: merge and rename to `tuntap` tun(4) and tap(4) share the same general management interface and have a lot in common. Bugs exist in tap(4) that have been fixed in tun(4), and vice-versa. Let's reduce the maintenance requirements by merging them together and using flags to differentiate between the three interface types (tun, tap, vmnet). This fixes a couple of tap(4)/vmnet(4) issues right out of the gate: - tap devices may no longer be destroyed while they're open [0] - VIMAGE issues already addressed in tun by kp [0] emaste had removed an easy-panic-button in r240938 due to devdrn blocking. A naive glance over this leads me to believe that this isn't quite complete -- destroy_devl will only block while executing d_* functions, but doesn't block the device from being destroyed while a process has it open. The latter is the intent of the condvar in tun, so this is "fixed" (for certain definitions of the word -- it wasn't really broken in tap, it just wasn't quite ideal). ifconfig(8) also grew the ability to map an interface name to a kld, so that `ifconfig {tun,tap}0` can continue to autoload the correct module, and `ifconfig vmnet0 create` will now autoload the correct module. This is a low overhead addition. (MFC commentary) This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this, and how critical they are. Changes after this are likely easily MFC'd without taking this merge, but the merge will be easier. I have no plans to do this MFC as of now. Reviewed by: bcr (manpages), tuexen (testing, syzkaller/packetdrill) Input also from: melifaro Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20044	2019-05-08 02:32:11 +00:00
Conrad Meyer	fce2d624ea	vmm(4): Pass through RDSEED feature bit to guests Reviewed by: jhb Approved by: #bhyve (jhb) MFC after: 2 leapseconds Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20194	2019-05-08 00:40:08 +00:00
Edward Tomasz Napierala	faf2fa21d7	Support PTRACE_GETREGSET w/ NT_PRSTATUS in Linux ptrace(2). While Linux strace(1) doesn't strictly require it - it has a fallback to PTRACE_GETREGS - it's a newer interface, so we better support it before the old one is deprecated. Reviewed by: dchagin MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20152	2019-05-07 19:06:41 +00:00
Ed Maste	0e26cd440f	make sysent after r347228 Regenerate to add @generated tag in generated files.	2019-05-07 18:10:21 +00:00
Conrad Meyer	665919aaaf	x86: Implement MWAIT support for stopping a CPU IPI_STOP is used after panic or when ddb is entered manually. MONITOR/ MWAIT allows CPUs that support the feature to sleep in a low power way instead of spinning. Something similar is already used at idle. It is perhaps especially useful in oversubscribed VM environments, and is safe to use even if the panic/ddb thread is not the BSP. (Except in the presence of MWAIT errata, which are detected automatically on platforms with known wakeup problems.) It can be tuned/sysctled with "machdep.stop_mwait," which defaults to 0 (off). This commit also introduces the tunable "machdep.mwait_cpustop_broken," which defaults to 0, unless the CPU has known errata, but may be set to "1" in loader.conf to signal that mwait wakeup is broken on CPUs FreeBSD does not yet know about. Unfortunately, Bhyve doesn't yet support MONITOR extensions, so this doesn't help bhyve hypervisors running FreeBSD guests. Submitted by: Anton Rang <rang AT acm.org> (earlier version) Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20135	2019-05-04 20:34:26 +00:00
Conrad Meyer	83dc49beaf	x86: Define pc_monitorbuf as a logical structure Rather than just accessing it via pointer cast. No functional change intended. Discussed with: kib (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20135	2019-05-04 17:35:13 +00:00
John Baldwin	c2b4cedd78	Emulate the "ADD reg, r/m" instruction (opcode 03H). OVMF's flash variable storage is using add instructions when indexing the variable store bootrom location. Submitted by: D Scott Phillips <d.scott.phillips@intel.com> Reviewed by: rgrimes MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19975	2019-05-03 21:48:42 +00:00
Dmitry Chagin	d151344dbf	In order to reduce duplication between MD parts of the Linuxulator move bits that are MI out into the headers in compat/linux. For that remove bogus _packed attribute from struct l_sockaddr and use MI types for struct members. And continue to move into the linux_common module a code that is intended for both Linuxulator modules (both instruction set - 32 & 64 bit) or for external modules like linsysfs or linprocfs. To avoid header pollution introduce new sys/compat/linux_common.h header. Reviewed by: emaste MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20137	2019-05-03 08:42:49 +00:00
Conrad Meyer	d6745408c7	Add a COMPAT_FREEBSD12 kernel option. Use it wherever COMPAT_FREEBSD11 is currently specified, like r309749. Reviewed by: imp, jhb, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20120	2019-05-02 18:10:23 +00:00
Rodney W. Grimes	a488c9c99a	Add accessor function for vm->maxcpus Replace most uses of the VM_MAXCPU constant with an accessor function to vm->maxcpus, which for now is initialized and kept at the value of VM_MAXCPU. This is a rework of Fabian Freyer's (fabian.freyer_physik.tu-berlin.de) work from D10070 to adjust it for the cpu topology changes that occurred in r332298 Submitted by: Fabian Freyer (fabian.freyer_physik.tu-berlin.de) Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Approved by: bde (mentor), jhb (maintainer) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D18755	2019-04-25 22:51:36 +00:00
Dmitry Chagin	c034ecf316	Since r339624 HEAD does not need backslashes in syscalls.master; however, to allow merging r345471 to stable, add backslashes to syscalls.master. MFC after: 3 days	2019-04-23 18:10:46 +00:00
Konstantin Belousov	fdfe249b63	Fix initial x87 state after r345562. After the referenced commit, we did not set x87 and sse valid bits in the xstate_bv bitmask for initial fpu state (stored in memory), when using XSAVE. The state is loaded into FPU register file to initialize the process FPU state, and since both bits were clear, the default x87 and SSE states were loaded. By chance, FreeBSD ABI SSE2 state is same as FPU initial state, so the bug is not visible for 64bit processes. But on i386, the precision control should be set to double (53bit mantissa), instead of the default double extended (64bit mantissa). For 32bit processes on amd64, kernel reloads control word with the right mask, which only left native i386 and amd64 native but using x87 as affected. Fix it by setting minimal required xstate_bv mask. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-04-16 19:46:02 +00:00
Warner Losh	f7ab01581a	Move mpr/mps drivers from per-arch NOTES files into the MI notes file. They are in more arches than they aren't. Add appropriate nodevice directives in powerpc and arm.	2019-04-13 06:30:45 +00:00
Konstantin Belousov	2a508645b4	pci_cfgreg.c: Use io port config access for early boot time. Some early PCIe chipsets are explicitly listed in the white-list to enable use of the MMIO config space accesses, perhaps because ACPI tables were not a reliable source of the base MCFG address at that time. For those chipsets, the MCFG base was read from the known chipset MCFGbase config register. During the very early stage of boot, when access to the PCI config space is performed (see e.g. pci_early_quirks.c), we cannot map 255MB of registers because the method used with the pre-boot pmap overflows the initial kernel page tables. Move the fallback to read MCFGbase to the attachment method of the x86/legacy device, which removes code duplication, and results in the use of io accesses until MCFG is parsed or legacy attach is called. For amd64, pre-initialize cfgmech with CFGMECH_1, since right now we dynamically assign CFGMECH_1 to it anyway, and remove checks for CFGMECH_NONE. There is a mention in the Intel documentation for the corresponding chipsets that the OS must use either the io port or the MMIO access method, but we already break this rule by reading the MCFGbase register, so one more access seems to be innocent. Reported by: longwitz@incore.de PR: 236838 Reviewed by: avg (other version), jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D19833	2019-04-09 18:07:17 +00:00
Konstantin Belousov	5db2a4a812	Implement resets for PCI buses and PCIe bridges. For PCI device (i.e. child of a PCI bus), reset tries FLR if implemented and worked, and falls to power reset otherwise. For PCIe bus (child of a PCIe bridge or root port), reset disables PCIe link and then re-trains it, performing what is known as link-level reset. Reviewed by: imp (previous version), jhb (previous version) Sponsored by: Mellanox Technologies MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D19646	2019-04-05 19:25:26 +00:00
Warner Losh	d99880cf46	Add mpr, mps, mpt to NOTES file Add these to all the architectures that these are in the GENERIC kernel.	2019-04-05 02:54:02 +00:00
Jung-uk Kim	278f0de60d	Merge ACPICA 20190329.	2019-03-29 20:21:28 +00:00
Conrad Meyer	8207def158	x86: Use XSAVEOPT for fpusave(), when available Remove redundant npxsave_core definition while here. Suggested by: Anton Rang Reviewed by: kib, Anton Rang <rang AT acm.org> Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D19665	2019-03-26 22:45:41 +00:00
Dmitry Chagin	1f66cb5154	Regen from r345471. MFC after: 1 month	2019-03-24 14:51:17 +00:00
Dmitry Chagin	f730d606d5	Update syscall.master to 5.0. For 32-bit Linuxulator, ipc() syscall was historically the entry point for the IPC API. Starting in Linux 4.18, direct syscalls are provided for the IPC. Enable it. MFC after: 1 month	2019-03-24 14:50:02 +00:00
Dmitry Chagin	d9be8b39a5	Regen for r345469 (shmat()). MFC after: 1 month	2019-03-24 14:46:07 +00:00
Dmitry Chagin	7dabf89bcf	Linux between 4.18 and 5.0 split IPC system calls. In preparation for doing this in the Linuxulator modify our linux_shmat() to match actual Linux shmat() system call. MFC after: 1 month	2019-03-24 14:44:35 +00:00
Dmitry Chagin	a7b87a2d95	Revert r313993. AMD64_SET_**BASE expects a pointer to a pointer, but we were just passing in the pointer value itself. Set PCB_FULL_IRET for doreti to restore %fs, %gs and their corresponding bases. PR: 225105 Reported by: trasz@ MFC after: 1 month	2019-03-24 14:02:57 +00:00
Mark Johnston	64087fd7f3	Disallow preemptive creation of wired superpage mappings. There are some unusual cases where a process may cause an mlock()ed range of memory to be unmapped. If the application subsequently faults on that region, the handler may attempt to create a superpage mapping backed by the resident, wired pages. However, the pmap code responsible for creating such a mapping (pmap_enter_pde() on i386 and amd64) does not ensure that a leaf page table page is available if the superpage is later demoted; the demotion operation must therefore perform a non-blocking page allocation and must unmap the entire superpage if the allocation fails. The pmap layer ensures that this can never happen for wired mappings, and so the case described above breaks that invariant. For now, simply ensure that the MI fault handler never attempts to create a wired superpage except via promotion. Reviewed by: kib Reported by: syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19670	2019-03-21 19:52:50 +00:00
Marcin Wojtas	3caad0b8f4	Prevent loading SGX with incorrect EPC data It may happen on some machines, that even if SGX is disabled in firmware, the driver would still attach despite EPC base and size equal zero. Such behaviour causes a kernel panic when the module is unloaded. Add a simple check to make sure we only attach when these values are correctly set. Submitted by: Kornel Duleba <mindal@semihalf.com> Reviewed by: br Obtained from: Semihalf Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D19595	2019-03-19 02:33:58 +00:00
Konstantin Belousov	fd8d844f76	amd64 KPTI: add control from procctl(2). Add the infrastructure to allow MD procctl(2) commands, and use it to introduce amd64 PTI control and reporting. PTI mode cannot be modified for existing pmap, the knob controls PTI of the new vmspace created on exec. Requested by: jhb Reviewed by: jhb, markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D19514	2019-03-16 11:44:33 +00:00
Konstantin Belousov	6f1fe3305a	amd64: Add md process flags and first P_MD_PTI flag. PTI mode for the process pmap on exec is activated iff P_MD_PTI is set. On exec, the existing vmspace can be reused only if pti mode of the pmap matches the P_MD_PTI flag of the process. Add MD cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can veto reuse of the existing vmspace. MFC note: md_flags change struct proc KBI. Reviewed by: jhb, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D19514	2019-03-16 11:31:01 +00:00
Konstantin Belousov	c1c120b2cb	amd64: fix switching to the pmap with pti disabled. When the pmap with pti disabled (i.e. pm_ucr3 == PMAP_NO_CR3) is activated, tss.rsp0 was not updated. Any interrupt that happens before the next context switch would use the pti trampoline stack for the hardware frame, but fault and interrupt handlers are not prepared for this. Correctly update tss.rsp0 for both PMAP_NO_CR3 and pti pmaps. Note that this case, pti = 1 but pmap->pm_ucr3 == PMAP_NO_CR3, is not used at the moment. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D19514	2019-03-16 11:16:09 +00:00
Konstantin Belousov	a9262f497a	amd64: rewrite cpu_switch.S fragment to reload tss.rsp0 on context switch. New code avoids jumps. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D19514	2019-03-16 11:12:02 +00:00
Konstantin Belousov	39d70f6b80	Provide deterministic (and somewhat useful) value for RDPID result, and for %ecx after RDTSCP. Initialize TSC_AUX MSR with CPUID. It allows for userspace to cheaply identify CPU it was executed on some time ago, which is sometimes useful. Note: The values returned might be changed in future. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-03-15 16:43:28 +00:00
Warner Losh	329f0aa952	Kill tz_minuteswest and tz_dsttime. Research Unix, 7th Edition introduced TIMEZONE and DSTFLAG compile-time constants in sys/param.h to communicate these values for the machine. 4.2BSD moved from compile-time to run-time, introduced these variables and used them for localtime() to return the right offset from UTC (sometimes referred to as GMT; for this purpose it is the same). 4.4BSD migrated to using the tzdata code/database and these variables were basically unused. FreeBSD removed the real need for these with adjkerntz in 1995. However, some RTC clocks continued to use these variables, though they were largely unused otherwise. Later, phk centralized most of the uses in utc_offset, but left it using both tz_minuteswest and adjkerntz. POSIX (IEEE Std 1003.1-2017) states in the gettimeofday specification "If tzp is not a null pointer, the behavior is unspecified" so there's no standards reason to retain it anymore. In fact, gettimeofday has been marked as obsolescent, meaning it could be removed from a future release of the standard. It is the only interface defined in POSIX that references these two values. All other references come from the tzdata database via tzset(). These were used to more faithfully implement early unix ABIs which have been removed from FreeBSD. NetBSD completely eliminated these variables years ago. Linux has migrated to tzdata as well, though these variables technically still exist for compatibility with unspecified older programs. So, there's no real reason to have them these days. They are a historical vestige that's no longer used in any meaningful way. Reviewed By: jhb@, brooks@ Differential Revision: https://reviews.freebsd.org/D19550	2019-03-12 04:49:47 +00:00
Andrey V. Elsukov	40025d42fd	Fix typo. MFC after: 1 week	2019-03-07 10:01:32 +00:00
Matt Macy	030963c090	add gcov to LINT build MFC after: 1 week	2019-03-07 03:50:34 +00:00
John Baldwin	2e43efd0bb	Drop "All rights reserved" from my copyright statements. Reviewed by: rgrimes MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D19485	2019-03-06 22:11:45 +00:00
John Baldwin	2c352feb3b	Fix missed posted interrupts in VT-x in bhyve. When a vCPU is HLTed, interrupts with a priority below the processor priority (PPR) should not resume the vCPU while interrupts at or above the PPR should. With posted interrupts, bhyve maintains a bitmap of pending interrupts in PIR descriptor along with a single 'pending' bit. This bit is checked by a CPU running in guest mode at various places to determine if it should be checked. In addition, another CPU can force a CPU in guest mode to check for pending interrupts by sending an IPI to a special IDT vector reserved for this purpose. bhyve had a bug in that it would only notify a guest vCPU of an interrupt (e.g. by sending the special IPI or by resuming it if it was idle due to HLT) if an interrupt arrived that was higher priority than PPR and no interrupts were currently pending. This assumed that if the 'pending' bit was set, any needed notification was already in progress. However, if the first interrupt sent to a HLTed vCPU was lower priority than PPR and the second was higher than PPR, the first interrupt would set 'pending' but not notify the vCPU, and the second interrupt would not notify the vCPU because 'pending' was already set. To fix this, track the priority of pending interrupts in a separate per-vCPU bitmask and notify a vCPU anytime an interrupt arrives that is above PPR and higher than any previously-received interrupt. This was found and debugged in the bhyve port to SmartOS maintained by Joyent. Relevant SmartOS bugs with more background: https://smartos.org/bugview/OS-6829 https://smartos.org/bugview/OS-6930 https://smartos.org/bugview/OS-7354 Submitted by: Patrick Mooney <pmooney@pfmooney.com> Reviewed by: tychon, rgrimes Obtained from: SmartOS / Joyent MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19299	2019-03-01 20:43:48 +00:00
Edward Tomasz Napierala	1699546def	Remove sv_pagesize, originally introduced with r100384. In all of the architectures we have today, we always use PAGE_SIZE. While in theory one could define different things, none of the current architectures do, even the ones that have transitioned from 32-bit to 64-bit like i386 and arm. Some ancient mips binaries on other systems used 8k instead of 4k, but we don't support running those and likely never will due to their age and obscurity. Reviewed by: imp (who also contributed the commit message) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D19280	2019-03-01 16:16:38 +00:00
Konstantin Belousov	e7a9df16e6	Add kernel support for Intel userspace protection keys feature on Skylake Xeons. See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the RDPKRU and WRPKRU instructions. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D18893	2019-02-20 09:51:13 +00:00
Konstantin Belousov	87b1bf4f31	amd64: add defines and decode protection keys and SGX page faults reasons. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D18893	2019-02-20 09:46:44 +00:00
Warner Losh	b1ece24388	Remove drm from LINT kernels drm was accidentally left in the LINT kernels. Pointy hat to: imp	2019-02-19 21:20:50 +00:00
Konstantin Belousov	5ddeaf67c6	Provide convenience C wrappers for RDPKRU and WRPKRU instructions. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D18893	2019-02-19 19:17:20 +00:00
Konstantin Belousov	8cbe929be5	amd64: cleanup pmap_init_pat(). The pmap_works variable is always true for amd64. Remove it, the branch in the initialization taken when false, and corresponding sysctl. Remove pat_table[] local array, work on pat_index[] directly. Collapse whole initialization to not override already assigned values. Add comment explaining the choice for PAT4 and PAT7. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week MFC note: Leave the sysctl around Differential revision: https://reviews.freebsd.org/D19225	2019-02-18 16:02:00 +00:00
Mark Johnston	8cbc89c7d2	Fix refcount leaks in the SGX Linux compat ioctl handler. Some argument validation error paths would return without releasing the file reference obtained at the beginning of the function. While here, fix some style bugs and remove trivial debug prints. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19214	2019-02-17 16:43:44 +00:00
Konstantin Belousov	fa50a3552d	Implement Address Space Layout Randomization (ASLR) With this change, randomization can be enabled for all non-fixed mappings. It means that the base address for the mapping is selected with a guaranteed amount of entropy (bits). If the mapping was requested to be superpage aligned, the randomization honours the superpage attributes. Although the value of ASLR is diminishing over time as exploit authors work out simple ASLR bypass techniques, it eliminates the trivial exploitation of certain vulnerabilities, at least in theory. This implementation is relatively small and happens at the correct architectural level. Also, it is not expected to introduce regressions in existing cases when turned off (default for now), or cause any significant maintenance burden. The randomization is done on a best-effort basis - that is, the allocator falls back to a first fit strategy if fragmentation prevents entropy injection. It is trivial to implement a strong mode where failure to guarantee the requested amount of entropy results in mapping request failure, but I do not consider that to be usable. I have not fine-tuned the amount of entropy injected right now. It is only a quantitative change that will not change the implementation. The current amount is controlled by aslr_pages_rnd. To not spoil coalescing optimizations, to reduce the page table fragmentation inherent to ASLR, and to keep the transient superpage promotion for the malloced memory, locality clustering is implemented for anonymous private mappings, which are automatically grouped until fragmentation kicks in. The initial location for the anon group range is, of course, randomized. This is controlled by vm.cluster_anon, enabled by default. The default mode keeps the sbrk area unpopulated by other mappings, but this can be turned off, which gives much more breathing bits on architectures with small address space, such as i386. This is tied with the question of following an application's hint about the mmap(2) base address. Testing shows that ignoring the hint does not affect the function of common applications, but I would expect more demanding code could break. By default sbrk is preserved and mmap hints are satisfied, which can be changed by using the kern.elf{32,64}.aslr.honor_sbrk sysctl. ASLR is enabled on per-ABI basis, and currently it is only allowed on FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support for additional architectures will be added after further testing. Both per-process and per-image controls are implemented: - procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS; - NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible to force ASLR off for the given binary. (A tool to edit the feature control note is in development.) Global controls are: - kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2); - kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings; - kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2); - vm.cluster_anon - enables anon mapping clustering. PR: 208580 (exp runs) Exp-runs done by: antoine Reviewed by: markj (previous version) Discussed with: emaste Tested by: pho MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D5603	2019-02-10 17:19:45 +00:00
Conrad Meyer	7e804fd5c5	Revert r343713 temporarily The COVERAGE option breaks xtoolchain-gcc GENERIC kernel early boot extremely badly and hasn't been fixed for the ~week since it was committed. Please enable for GENERIC only when it doesn't do that. Related fallout reported by: lwhsu, tuexen (pr 235611)	2019-02-10 07:54:46 +00:00
Ed Maste	e26563b8c7	Retire SPX_HACK option unused after r342244	2019-02-06 17:21:25 +00:00
Konstantin Belousov	762138f78f	amd64: clear callee-preserved registers on syscall exit. %r8, %r10, and on non-KPTI configuration %r9 were not restored on fast return from a syscall. Reviewed by: markj Approved by: so Security: CVE-2019-5595 Sponsored by: The FreeBSD Foundation MFC after: 0 minutes	2019-02-05 17:49:27 +00:00
Andrew Turner	634a8a8873	Enable COVERAGE and KCOV by default on arm64 and amd64. This allows userspace to trace the kernel using the coverage sanitizer found in clang. It will also allow other coverage tools to be built as modules and attach into the same framework. Sponsored by: DARPA, AFRL	2019-02-03 12:46:27 +00:00
Konstantin Belousov	c75f49f7d8	Make iflib a loadable module. iflib is already a module, but it is unconditionally compiled into the kernel. There are drivers which do not need iflib(4), and there are situations where somebody might not want iflib in kernel because of using the corresponding driver as module. Reviewed by: marius Discussed with: erj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D19041	2019-01-31 19:05:56 +00:00
Andrew Turner	524553f56d	Extract the coverage sanitizer KPI to a new file. This will allow multiple consumers of the coverage data to be compiled into the kernel together. The only requirement is only one can be registered at a given point in time, however it is expected they will only register when the coverage data is needed. A new kernel config option COVERAGE is added. This will allow kcov to become a module that can be loaded as needed, or compiled into the kernel. While here clean up the #include style a little. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D18955	2019-01-29 11:04:17 +00:00
Andriy Voskoboinyk	86d535ab47	Garbage collect AH_SUPPORT_AR5416 config option. It does nothing since r318857.	2019-01-25 13:48:40 +00:00
Ed Maste	1b1f24b936	linuxulator: fix stack memory disclosure in linux_sigaltstack admbugs: 765 Reported by: Vlad Tsyrklevich <vlad@tsyrklevich.net> Reviewed by: andrew MFC after: 1 day Security: Kernel memory disclosure Sponsored by: The FreeBSD Foundation	2019-01-21 16:25:40 +00:00
Andriy Voskoboinyk	4945f79a4c	Remove IEEE80211_AMPDU_AGE config option. It is noop since r297774.	2019-01-20 15:17:56 +00:00
Conrad Meyer	d0c7cde53e	vmm(4): Mask Spectre feature bits on AMD hosts For parity with Intel hosts, which already mask out the CPUID feature bits that indicate the presence of the SPEC_CTRL MSR, do the same on AMD. Eventually we may want to have a better support story for guests, but for now, limit the damage of incorrectly indicating an MSR we do not yet support. Eventually, we may want a generic CPUID override system for administrators, or for minimum supported feature set in heterogenous environments with failover. That is a much larger scope effort than this bug fix. PR: 235010 Reported by: Rys Sommefeldt <rys AT sommefeldt.com> Sponsored by: Dell EMC Isilon	2019-01-18 23:54:51 +00:00
Konstantin Belousov	f1dc49f33a	Trim whitespace at EoL, use tabs instead of spaces for indent. PR: 235004 Submitted by: Jose Luis Duran <jlduran@gmail.com> MFC after: 3 days	2019-01-17 05:15:25 +00:00
Conrad Meyer	15b7da10ac	vmm(4): Take steps towards multicore bhyve AMD support vmm's CPUID emulation presented Intel topology information to the guest, but disabled AMD topology information and in some cases passed through garbage. I.e., CPUID leaves 0x8000_001[de] were passed through to the guest, but guest CPUs can migrate between host threads, so the information presented was not consistent. This could easily be observed with 'cpucontrol -i 0xfoo /dev/cpuctl0'. Slightly improve this situation by enabling the AMD topology feature flag and presenting at least the CPUID fields used by FreeBSD itself to probe topology on more modern AMD64 hardware (Family 15h+). Older stuff is probably less interesting. I have not been able to empirically confirm it is sufficient, but it should not regress anything either. Reviewed by: araujo (previous version) Relnotes: sure	2019-01-16 02:19:04 +00:00
Andrew Turner	b3c0d957a2	Add support for the Clang Coverage Sanitizer in the kernel (KCOV). When building with KCOV enabled the compiler will insert function calls to probes allowing us to trace the execution of the kernel from userspace. These probes are on function entry (trace-pc) and on comparison operations (trace-cmp). Userspace can enable the use of these probes on a single kernel thread with an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE, then mmap the allocated buffer and enable tracing with KIOENABLE, with the trace mode being passed in as the int argument. When complete KIODISABLE is used to disable tracing. The first item in the buffer is the number of trace events that have happened. Userspace can write 0 to this to reset the tracing, and is expected to do so on first use. The format of the buffer depends on the trace mode. When in PC tracing just the return address of the probe is stored. Under comparison tracing the comparison type, the two arguments, and the return address are traced. The former method uses one entry per trace event, while the latter uses 4. As such they are incompatible so only a single mode may be enabled. KCOV is expected to help fuzzing the kernel, and while in development has already found a number of issues. It is required for the syzkaller system call fuzzer [1]. Other kernel fuzzers could also make use of it, either with the current interface, or by extending it with new modes. A man page is currently being worked on and is expected to be committed soon, however having the code in the kernel now is useful for other developers to use. [1] https://github.com/google/syzkaller Submitted by: Mitchell Horne <mhorne063@gmail.com> (Earlier version) Reviewed by: kib Testing by: tuexen Sponsored by: DARPA, AFRL Sponsored by: The FreeBSD Foundation (Mitchell Horne) Differential Revision: https://reviews.freebsd.org/D14599	2019-01-12 11:21:28 +00:00
Fedor Uporov	6651cf410c	Fix errno values returned from DUMMY_XATTR linuxulator calls Reported by: weiss@uni-mainz.de Reviewed by: markj MFC after: 1 day Differential Revision: https://reviews.freebsd.org/D18812	2019-01-11 07:58:25 +00:00
Konstantin Belousov	f0d85a5dc5	x86: Report per-cpu IPI TLB shootdown generation in ddb 'show pcpu' output. It is useful for inspecting tlb shootdown hangs. The smp_tlb_generation value is available using regular ddb data inspection commands. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-01-04 17:25:47 +00:00
Mark Johnston	9bfc7fa41d	Avoid setting PG_U unconditionally in pmap_enter_quick_locked(). This KPI may in principle be used to create kernel mappings, in which case we certainly should not be setting PG_U. In any case, PG_U must be set on all layers in the page tables to grant user mode access, and we were only setting it on leaf entries. Thus, this change should have no functional impact. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-01-02 15:36:35 +00:00
Mateusz Guzik	628888f0e0	Remove iBCS2, part2: general kernel Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation	2018-12-19 21:57:58 +00:00
Mateusz Guzik	3c76ace36b	amd64: stop re-reading curpc on subyte/suword Originally read value is still safely kept. Re-reading code was there for previous iterations which were partially shared with i386. Sponsored by: The FreeBSD Foundation	2018-12-08 04:53:08 +00:00
Mark Johnston	352aaa5122	Plug memory disclosures via ptrace(2). On some architectures, the structures returned by PT_GET*REGS were not fully populated and could contain uninitialized stack memory. The same issue existed with the register files in procfs. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: kib MFC after: 3 days Security: kernel stack memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18421	2018-12-03 20:54:17 +00:00
Mateusz Guzik	ddf6571230	amd64: align target memmove buffer to 16 bytes before using rep movs See the review for sample test results. Reviewed by: kib (kernel part) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18401	2018-12-01 14:20:32 +00:00
Mateusz Guzik	94243af2da	amd64: handle small memmove buffers with overlapping stores Handling sizes of > 32 backwards will be updated later. Reviewed by: kib (kernel part) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18387	2018-11-30 20:58:08 +00:00
Mateusz Guzik	2847cfce54	amd64: remove stale attribution for memmove work While the routine started as expanded bcopy, it is now entirely rewritten. Sponsored by: The FreeBSD Foundation	2018-11-30 00:47:36 +00:00
Mateusz Guzik	dd219e5ea5	amd64: tidy up copying backwards in memmove For non-ERMS case the code used to handle possible trailing bytes with movsb first and then followed it up with movsq. This also happened to alter how calculations were done for other cases. Handle the tail with regular movs, just like when copying forward. Use leaq to calculate the right offset from the get go, instead of doing separate add and sub. This adjusts the offset for non-rep cases so that they can be used to handle the tail. The routine is still a work in progress. Sponsored by: The FreeBSD Foundation	2018-11-30 00:45:10 +00:00
Konstantin Belousov	32b083531f	Fix assert condition in pmap_large_unmap(). pmap_large_unmap() asserts that an unmapping request covers the entirety of a 2M or 1G page. The logic in the asserts was out of date with the loop logic. Correct the test to actually check that destroying the current superpage mapping does not unmap addresses beyond those requested by the caller. Submitted by: D Scott Phillips <d.scott.phillips@intel.com> Reviewed by: alc MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18345	2018-11-27 21:40:51 +00:00
Eric van Gyzen	607a0eb2f1	Remove superfluous bzero in getcontext/swapcontext/sendsig We zero the whole structure; we don't need to zero the __spare__ field again. Remove trailing whitespace. MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2018-11-26 20:56:05 +00:00
Eric van Gyzen	f5e7d8bdb5	Prevent kernel stack disclosure in getcontext/swapcontext Expand r338982 to cover freebsd32 interfaces on amd64, mips, and powerpc. MFC after: 2 days Security: FreeBSD-EN-18:12.mem Security: CVE-2018-17155 Sponsored by: Dell EMC Isilon	2018-11-26 20:50:55 +00:00
Mark Johnston	2910a16124	Clear unused bytes in ia32_osendsig(). Mirror the fix for the native i386 implementation from r218327. This code is compiled only when the non-default COMPAT_43 option is configured. Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18298	2018-11-22 17:51:19 +00:00
Konstantin Belousov	2343757338	Align IA32_ARCH_CAP MSR definitions and use with SDM rev. 068. SDM rev. 068 was released yesterday and it contains the description of the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for all bits present in the document, and decodes them in the CPU identification lines printed on boot. But also, the document defines SSB_NO as bit 4, while FreeBSD used bit 2 to detect the need to work-around Speculative Store Bypass issue. Change code to use the bit from SDM. Similarly, the document describes bit 3 as an indicator that L1TF issue is not present, in particular, no L1D flush is needed on VMENTRY. We used RDCL_NO to avoid flushing, and again I changed the code to follow new spec from SDM. In fact my Apollo Lake machine with latest ucode shows this: IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO> Reviewed by: bwidawsk Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D18006	2018-11-16 21:27:11 +00:00
Mateusz Guzik	088ac3ef4b	amd64: handle small memset buffers with overlapping stores Instead of jumping to locations which store the exact number of bytes, use displacement to move the destination. In particular the following clears an area between 8-16 (inclusive) branch-free: movq %r10,(%rdi) movq %r10,-8(%rdi,%rcx) For instance for rcx of 10 the second line is rdi + 10 - 8 = rdi + 2. Writing 8 bytes starting at that offset overlaps with 6 bytes written previously and writes 2 new, giving 10 in total. Provides a nice win for smaller stores. Other ones are erratic depending on the microarchitecture. General idea taken from NetBSD (restricted use of the trick) and bionic string functions (use for various ranges like in this patch). Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17660	2018-11-16 00:44:22 +00:00
Matt Macy	f1bac7bb74	Add ZFS to amd64 NOTES to catch future breakage of static linking	2018-11-13 23:08:46 +00:00
Niclas Zeising	af14df7703	Add evdev support to amd64 and i386 kernels Include evdev support and drivers in the amd64 and i386 GENERIC and MINIMAL kernels. Evdev is used by X and wayland to handle input devices, and this change, together with upcoming changes in ports will make us handle input devices better in graphical UIs. Reviewed by: wulf, bapt, imp Approved by: imp Differential Revision: https://reviews.freebsd.org/D17912	2018-11-12 21:01:28 +00:00
Konstantin Belousov	83813c6696	Apply fix to un-cripple max cpu id on BSP earlier. We need to know actual value for the standard extended features before ifuncs are resolved. Reported and tested by: madpilot Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-11-12 19:17:26 +00:00
Mateusz Guzik	f1161465f4	amd64: align memset buffers to 16 bytes before using rep stos Both Intel manual and Agner Fog's docs suggest aligning to 16. See the review for benchmark results. Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17661	2018-11-08 15:12:36 +00:00
Andrew Turner	3869df5d71	Add the KUBSAN options to the arm64 and amd64 GENERIC kernel config files. As the kernel file size may be too large to run with a stock loader comment them out for now. Sponsored by: DARPA, AFRL	2018-11-06 17:47:58 +00:00
Tijl Coosemans	02bf7e5e40	Fix builds with COMPAT_LINUX32 in the kernel config. MFC after: 3 days	2018-11-06 15:29:44 +00:00
Tijl Coosemans	8fc08087a1	On amd64 both Linux compat modules, linux.ko and linux64.ko, provide linux_ioctl_(un)register_handler that allows other driver modules to register ioctl handlers. The ioctl syscall implementation in each Linux compat module iterates over the list of handlers and forwards the call to the appropriate driver. Because the registration functions have the same name in each module it is not possible for a driver to support both 32 and 64 bit linux compatibility. Move the list of ioctl handlers to linux_common.ko so it is shared by both Linux modules and all drivers receive both 32 and 64 bit ioctl calls with one registration. These ioctl handlers normally forward the call to the FreeBSD ioctl handler which can handle both 32 and 64 bit. Keep the special COMPAT_LINUX32 ioctl handlers in linux.ko in a separate list for now and let the ioctl syscall iterate over that list first. Later, COMPAT_LINUX32 support can be added to the 64 bit ioctl handlers via a runtime check for ILP32 like is done for COMPAT_FREEBSD32 and then this separate list would disappear again. That is a much bigger effort however and this commit is meant to be MFCable. This enables linux64 support in x11/nvidia-driver*. PR: 206711 Reviewed by: kib MFC after: 3 days	2018-11-06 13:51:08 +00:00
John Baldwin	7f7f6f85a1	Add a custom implementation of cpu_lock_delay() for x86. Avoid using DELAY() since it can try to use spin locks on CPUs without a P-state invariant TSC. For cpu_lock_delay(), always use the TSC if it exists (even if it is not P-state invariant) to delay for a microsecond. If the TSC does not exist, read from I/O port 0x84 to delay instead. PR: 228768 Reported by: Roger Hammerstein <cheeky.m@live.com> Reviewed by: kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D17851	2018-11-05 22:54:03 +00:00
John Baldwin	4cbbb74888	Add a KPI for the delay while spinning on a spin lock. Replace a call to DELAY(1) with a new cpu_lock_delay() KPI. Currently cpu_lock_delay() is defined to DELAY(1) on all platforms. However, platforms with a DELAY() implementation that uses spin locks should implement a custom cpu_lock_delay() that doesn't use locks. Reviewed by: kib MFC after: 3 days	2018-11-05 21:34:17 +00:00
John Baldwin	b317cfd4c0	Don't enter DDB for fatal traps before panic by default. Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic' and make the calls to kdb_trap() in MD fatal trap handlers prior to calling panic() conditional on this new knob instead of 'debugger_on_panic'. Disable the new knob by default. Developers who wish to recover from a fatal fault by adjusting saved register state and retrying the faulting instruction can still do so by enabling the new knob. However, for the more common case this makes the user experience for panics due to a fatal fault match the user experience for other panics, e.g. 'c' in DDB will generate a crash dump and reboot the system rather than being stuck in an infinite loop of fatal fault messages and DDB prompts. Reviewed by: kib, avg MFC after: 2 months Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D17768	2018-11-01 21:34:17 +00:00
Konstantin Belousov	6bc6a54280	Add pci_early function to detect Intel stolen memory. On some Intel devices BIOS does not properly reserve memory (called "stolen memory") for the GPU. If the stolen memory is claimed by the OS, functions that depend on stolen memory (like frame buffer compression) can't be used. A function called pci_early_quirks that is called before the virtual memory system is started was added. In Linux, this PCI early quirks function iterates through all PCI slots to check for any device that requires quirks. While this more generic solution is preferable I only ported the Intel graphics specific parts because I think my implementation would be too similar to Linux GPL'd solution after looking at the Linux code too much. The code regarding Intel graphics stolen memory was ported from Linux. In the case of Intel graphics stolen memory this pci_early_quirks will read the stolen memory base and size from north bridge registers. The values are stored in global variables that are later read by linuxkpi_gplv2. Linuxkpi stores these values in a Linux-specific structure that is read by the drm driver. Relevant linuxkpi code is here: https://github.com/FreeBSDDesktop/kms-drm/blob/drm-v4.16/linuxkpi/gplv2/src/linux_compat.c#L37 For now, only amd64 arch is supported since that is the only arch supported by the new drm drivers. I was told that Intel GPUs are always located on 0:2:0 so these values are hard coded for now. Note that the structure and early execution of the detection code is not required in its current form, but we expect that the code will be added shortly which fixes the potential BIOS bugs by reserving the stolen range in phys_avail[]. This must be done as early as possible to avoid conflicts with the potential usage of the memory in kernel.
Submitted by: Johannes Lundberg <johalun0@gmail.com> Reviewed by: bwidawsk, imp MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16719 Differential revision: https://reviews.freebsd.org/D17775	2018-10-31 23:17:00 +00:00
Kyle Evans	be352d20d5	Compile in VERBOSE_SYSINIT support by default, remain silent by default The loader tunable 'debug.verbose_sysinit' may be used to toggle verbosity. This is added to the debugging section of these kernconfs to be turned off in stable branches for clarity of intent. MFC after: never	2018-10-31 22:38:19 +00:00
Marcelo Araujo	ec9e3fb095	Merge cases with upper block. This is a cosmetic change only to simplify code. Reported by: anish Sponsored by: iXsystems Inc.	2018-10-31 01:27:44 +00:00
Marcelo Araujo	5bae7542d4	Emulate machine check related MSR_EXTFEATURES to allow guest OSes to boot on AMD FX Series. PR: 224476 Submitted by: Keita Uchida <m@jgz.jp> Reviewed by: rgrimes Sponsored by: iXsystems Inc. Differential Revision: https://reviews.freebsd.org/D17713	2018-10-30 10:02:23 +00:00
Konstantin Belousov	9775a6ebd2	amd64: Use ifuncs to select suitable implementation of set_pcb_flags(). There is no reason to check for PCB_FULL_IRET if FSGSBASE instructions are not supported. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-10-29 23:52:31 +00:00
Konstantin Belousov	93177620ee	Style. Wrap long lines, use +4 spaces for continuation indent. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-10-29 23:45:17 +00:00
Yuri Pankov	8d56c80545	Provide basic descriptions for VMX exit reason (from "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3"). Add the document to SEE ALSO in bhyve.8 (and pet manlint here a bit). Reviewed by: jhb, rgrimes, 0mp Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D17531	2018-10-27 21:24:28 +00:00
Mateusz Guzik	099c6f6d45	amd64: finish the tail in memset with an overlapping store Instead of finding the exact size to fit in we can just shift the target by -8 + tail. Doing a blind write to a previously rep stosq'ed area comes with a penalty so do it conditionally. Sample win on EPYC when zeroing a 257 sized buffer (tail = 1) aligned to 16 bytes: before: 44782846 ops/s after: 46118614 ops/s Idea stolen from NetBSD. Sponsored by: The FreeBSD Foundation	2018-10-22 06:44:20 +00:00
Warner Losh	6a18678249	Remove the ncr(4) driver. This driver has been obsolete since FreeBSD 4.x. It should have been removed then since the sym(4) driver had subsumed it. The driver was commented out of GENERIC in 2000. RelNotes: Yes	2018-10-22 02:36:18 +00:00
Warner Losh	49a93324fe	Remove stg(4) driver stg(4) is marked as gone in 12. Remove it. There are no sightings of it in the nycbug dmesg database. It was for an obscure SCSI card that sold mostly in Japan, and was especially popular among pc98 hackers in the 4.x time frame. It was also only enabled on i386. Relnote: Yes	2018-10-22 02:35:50 +00:00
Warner Losh	08204c2cc3	Remove nsp(4) driver nsp(4) is marked as gone in 12. Remove it. There are no sightings of it in the nycbug dmesg database. It was for an obscure SCSI card that sold mostly in Japan, and was especially popular among pc98 hackers in the 4.x time frame. It was also only enabled on i386. Relnote: Yes	2018-10-22 02:35:38 +00:00
Warner Losh	2dfd358865	Remove ncv(4) driver ncv(4) is marked as gone in 12. Remove it. There are no sightings of it in the nycbug dmesg database. It was for an obscure SCSI card that sold mostly in Japan, and was especially popular among pc98 hackers in the 4.x time frame. Relnote: Yes	2018-10-22 02:35:26 +00:00
Warner Losh	e9b5375b04	Retire dpt(4) Marked as gone in 12 and not relevant since the early 90s. No sightings in nycbug's dmesg database. Relnotes: yes	2018-10-22 02:35:12 +00:00
Warner Losh	48ac1a9566	Remove the gone_in(12) devices. We're planning on removing adv, adw, aha, aic, bt, ncv, nsp, and stg soon. They have been tagged for removal in 12. At least get them out of GENERIC. MFC after: 3 days Relnotes: yes	2018-10-22 02:28:18 +00:00
Mateusz Guzik	bbf3607b86	amd64: tidy up memset to have rax set earlier for small sizes	2018-10-21 10:46:00 +00:00
Konstantin Belousov	2dec2b4a34	amd64: flush L1 data cache on syscall return with an error. The knob allows to select the flushing mode or turn it off/on. The idea, as well as the list of the ignored syscall errors, were taken from https://www.openwall.com/lists/kernel-hardening/2018/10/11/10 . I was not able to measure statistically significant difference between flush enabled vs disabled using syscall_timing getuid. Reviewed by: bwidawsk Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17536	2018-10-20 23:17:24 +00:00
Mark Johnston	36209a40d1	Add an assertion to pmap_enter(). When modifying an existing managed mapping, we should find a PV entry for the old mapping. Verify this. Before r335784 this would have been implicitly tested by the fact that we always freed the PV entry for the old mapping. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17626	2018-10-20 20:53:35 +00:00
Mateusz Guzik	c88205e76e	amd64: relax constraints in curthread and curpcb This makes the compiler less likely to reload the content from %gs. The 'P' modifier drops all syntax prefixes and 'n' constraint treats input as a known at compilation time immediate integer. Example reloading victim was spinlock_enter. Stolen from: OpenBSD Reported by: jtl Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17615	2018-10-20 17:00:18 +00:00
Konstantin Belousov	3ade944019	Do not flush cache for PCIe config window. Apparently AMD machines cannot tolerate this. This was uncovered by r339386, where cache flush started really flushing the requested range. Introduce pmap_mapdev_pciecfg(), which simply does not flush cache comparing with pmap_mapdev(). It assumes that the MCFG region was never accessed through the cacheable mapping, which is most likely true for machine to boot at all. Note that i386 does not need the change, since the architecture handles access per-page due to the KVA shortage, and page remapping already does not flush the cache. Reported and tested by: mjg, Mike Tancsa <mike@sentex.net> Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D17612	2018-10-18 20:49:16 +00:00
Konstantin Belousov	2fd0c8e7ca	Provide pmap_large_map() KPI on amd64. The KPI allows to map very large contiguous physical memory regions into KVA, which are not covered by DMAP. I see both with QEMU and with some real hardware started shipping, the regions for NVDIMMs might be very far apart from the normal RAM, and we expect that at least initial users of NVDIMM could install very large amount of such memory. IMO it is not reasonable to extend DMAP to cover that far-away regions both because it could overflow existing 4T window for DMAP in KVA, and because it costs in page table pages allocations, for gap and for possibly unused NV RAM. Also, KPI provides some special functionality for fast cache flushing based on the knowledge of the NVRAM mapping use. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17070	2018-10-16 17:28:10 +00:00
Konstantin Belousov	9d5d89b209	Add clwb(). Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 3 days Differential revision: https://reviews.freebsd.org/D17070	2018-10-16 17:00:42 +00:00
John Baldwin	de679f6efa	Reload the LDT selector after an AMD-v #VMEXIT. cpu_switch() always reloads the LDT, so this can only affect the hypervisor process itself. Fix this by explicitly reloading the host LDT selector after each #VMEXIT. The stock bhyve process on FreeBSD never uses a custom LDT, so this change is cosmetic. Reviewed by: kib Tested by: Mike Tancsa <mike@sentex.net> Approved by: re (gjb) MFC after: 2 weeks	2018-10-15 18:12:25 +00:00
Mateusz Guzik	6816c88458	amd64: partially depessimize cpu_fetch_syscall_args and cpu_set_syscall_retval Vast majority of syscalls take 6 or less arguments. Move handling of other cases to a fallback function. Similarly, special casing for _syscall and __syscall magic syscalls is moved away. Return is almost always 0. The change replaces 3 branches with 1 in the common case. Also the 'frame' variable convinces clang not to reload it on each access. Reviewed by: kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17542	2018-10-13 21:18:31 +00:00
Eric Joyner	77c1fcec91	ixl/iavf(4): Change ixlv to iavf and update it to use iflib(9) Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4). This commit also re-adds the VF driver to GENERIC since it now compiles and functions. The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is now intended to be used with future products, not just with Fortville/Fort Park VFs. A man page update that documents these drivers is forthcoming in a separate commit. Reviewed by: sbruno@, kbowling@ Tested by: jeffrey.e.pieper@intel.com Approved by: re (gjb@) Relnotes: yes Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D16429	2018-10-12 22:40:54 +00:00
Mateusz Guzik	3cf1291d2e	amd64: employ MEMMOVE in copyin/copyout See r339205 for justification. Reviewed by: kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17526	2018-10-12 21:59:09 +00:00
Konstantin Belousov	555225c062	Call initializecpucache() before ifuncs are resolved. The function tweaks CPU capabilities based on the VM platform and tunables, which affected selection of the cache flush method before ifuncs were used, and should affect the cache flush in the same way after ifunc. PR: 232081 Reported by: phk Analyzed by: avg Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2018-10-12 16:00:21 +00:00
Konstantin Belousov	78a3652794	bhyve: emulate CLFLUSH and CLFLUSHOPT. Apparently CLFLUSH on mmio can cause VM exit, as reported in the PR. I do not see that anything useful can be done except emulating page faults on invalid addresses. Due to the instruction encoding peculiarity, also emulate SFENCE. PR: 232081 Reported by: phk Reviewed by: araujo, avg, jhb (all: previous version) Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17482	2018-10-12 15:30:15 +00:00
Mateusz Guzik	3c9a1d0493	amd64: make memmove and memcpy less slow with mov The reasoning is the same as with the memset change, see r339205 Reviewed by: kib (previous version) Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17441	2018-10-11 23:37:57 +00:00
Mateusz Guzik	3f102f5881	Provide string functions for use before ifuncs get resolved. The change is a no-op for architectures which don't ifunc memset, memcpy nor memmove. Convert places which need them. Xen bits by royger. Reviewed by: kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17487	2018-10-11 23:28:04 +00:00
John Baldwin	b843f9be5e	Fully restore the GDTR, IDTR, and LDTR after VT-x VM exits. The VT-x VMCS only stores the base address of the GDTR and IDTR. As a result, VM exits use a fixed limit of 0xffff for the host GDTR and IDTR losing the smaller limits set in when the initial GDT is loaded on each CPU during boot. Explicitly save and restore the full GDTR and IDTR contents around VM entries and exits to restore the correct limit. Similarly, explicitly save and restore the LDT selector. VM exits always clear the host LDTR as if the LDT was loaded with a NULL selector and a userspace hypervisor is probably using a NULL selector anyway, but save and restore the LDT explicitly just to be safe. PR: 230773 Reported by: John Levon <levon@movementarian.org> Reviewed by: kib Tested by: araujo Approved by: re (rgrimes) MFC after: 1 week	2018-10-11 18:27:19 +00:00
Brooks Davis	c7d0908e1c	Regenerated assorted syscall related files after: - r327895: Implement 'domainset'... - r329876: Use linux types for linux-specific syscalls Diff generated with: find . -name syscalls.conf | xargs dirname | xargs -n1 -I DIR make -C DIR sysent Approved by: re (kib) Sponsored by: DARPA, AFRL	2018-10-09 20:42:17 +00:00
Michael Tuexen	6b45121a6d	Address the warning regarding duplicate option 'GEOM_PART_GPT' when configuring kernels for i386, amd64, and arm64. The 'GEOM_PART_GPT' option was added to the DEFAULTS configuration in r337967. Approved by: re (kib@) Reviewed by: ler@ Differential Revision: https://reviews.freebsd.org/D17458 Sponsored by: Netflix, Inc.	2018-10-07 15:54:13 +00:00
Mateusz Guzik	97bb9a0818	amd64: make memset less slow with mov rep stos has a high startup time even on modern microarchitectures like Skylake. Intel optimization manuals discuss how for small sizes it is beneficial to go for streaming stores. Since those cannot be used without extra penalty in the kernel I investigated performance impact of just regular movs. The patch below implements a very simple scheme: a 32-byte loop followed by filling in the remainder of at most 31 bytes. It has a 256 breaking point on which it falls back to rep stos. It provides a significant win over the current primitive on several machines I tested (both Intel and AMD). A 64-byte loop did not provide any benefit even for multiple of 64 sizes. See the review for benchmark data. Reviewed by: kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17398	2018-10-05 19:25:09 +00:00
Mateusz Guzik	9657b80ce7	amd64: hide non-erms jump label under non-erms copyin/copyout This change is a no-op in terms of semantics, but has a side effect of removing a perfectly useless nop sled for CPUs with ERMS. Approved by: re (gjb) Sponsored by: The FreeBSD Foundation	2018-10-04 20:01:48 +00:00
Mark Johnston	cb4961abc1	Apply r339046 to i386. Belatedly add a comment to the amd64 pmap explaining why we initialize the kernel pmap's resident page count. Reviewed by: alc, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17377	2018-10-01 18:48:33 +00:00
Mark Johnston	c6c770d041	Count bootstrap data as resident in the kernel pmap. Such data may later be unmapped. This occurs, for example, when a loader-provided microcode update file is discarded. Reviewed by: alc, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17340	2018-10-01 14:47:49 +00:00
Konstantin Belousov	f76bec2a28	Revert part of the r338891 which reordered local invalidation and IPI. For PCID case, there is a dependency between pm_gen zeroing and reading pm_active for IPI target selection, to ensure that the invalidation is not missed. Reported and tested by: mjg Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2018-09-28 14:08:20 +00:00
Mateusz Guzik	825eeb55f4	amd64: fix return value of copyinstr after r338970 The function stopped swapping rdi and rsi, but the error handling code was not updated with the new register name. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation	2018-09-27 20:48:07 +00:00
John Baldwin	83382d027f	Don't clear DR6 for debug exceptions from userland. This reverts part of r333368. The attempt to clear DR6 was occurring too soon as trapsignal() does not pause to let the debugger notice the SIGTRAP and query DR6. The signal exchange does not occur until much later during ast(). As a result, GDB was no longer recognizing hardware breakpoints and watchpoints on x86. In addition, any userland programs that want to inspect DR6 in a SIGTRAP handler don't have a way to do this if we clear DR6 in the exception handler. Instead of relying on the kernel to clear DR6, debuggers will have to explicitly clear it after a trace trap (which they needed to do on older kernels anyway). Reviewed by: kib Approved by: re (delphij) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D17319	2018-09-27 17:33:59 +00:00
Mateusz Guzik	5910b87605	amd64: macroify and mostly depessimize copyinstr See r338968 for details. Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17288	2018-09-27 15:53:36 +00:00
Mateusz Guzik	3d95cc51bb	amd64: mostly depessimize copystr - remove a forward branch in the common case - replace xchg + lodsb/stosb loop with simple movs A simple test on Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz copying /foo/bar/baz in a loop goes from 295715863 ops/s to 465807408. Further changes are pending. Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17281	2018-09-27 15:27:53 +00:00
Mateusz Guzik	0e59ecce47	amd64: clean up copyin/copyout - move the PSL.AC comment to the fault handler - stop testing for zero-sized ops. after several minutes of package building there were no copyin calls with zero bytes and very few copyout. the semantic of returning 0 in this case is preserved - shorten exit paths by clearing %eax earlier - replace xchg with 3 movs. this is what compilers do. a naive benchmark on EPYC suggests about 1% increase in throughput thanks to this change. - remove the useless movb %cl,%al from copyout. it looks like a leftover from many years ago Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17286	2018-09-27 15:24:16 +00:00
Mateusz Guzik	a8e3f99ec1	amd64: implement memcmp in assembly Both the in-kernel C variant and libc asm variant have very poor performance. The former compiles to a single byte comparison loop, which breaks down even for small sizes. The latter uses rep cmpsq/b which turn out to have very poor throughput and are slower than a hand-coded 32-byte comparison loop. Depending on size this is about 3-4 times faster than the current routines. Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17328	2018-09-27 14:05:44 +00:00
Andrew Turner	27d2645787	Handle a guest executing a vm instruction by trapping and raising an undefined instruction exception. Previously we would exit the guest, however an unprivileged user could execute these. Found with: syzkaller Reviewed by: araujo, tychon (previous version) Approved by: re (kib) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17192	2018-09-27 11:16:19 +00:00
Konstantin Belousov	05e1cca97a	Fix some uses of dmaplimit. dmaplimit is the first byte after the end of DMAP. Reported by: "Johnson, Archna" <Archna.Johnson@netapp.com> Reviewed by: alc, markj Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17318	2018-09-25 20:07:58 +00:00
Konstantin Belousov	cbe100dfca	Fix an issue in r338862. For pmap_invalidate_all_pcid(), only reset pm_gen for non-kernel pmaps, as it was done before the conversion to ifuncs. The reset is useless but innocent for kernel_pmap. Coverity reported that cpuid is used uninitialized in this case. Reported by: cem Reviewed by: alc, cem, markj CID: 1395807 Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D17314	2018-09-25 18:24:25 +00:00
Konstantin Belousov	108ff63e8a	Further reorganize pmap_invalidate TLB code. Split calculation of mask for shootdown IPI and local invalidation. Reorder IPI before local. Suggested by: alc Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D17277	2018-09-22 17:04:39 +00:00
Mark Johnston	6cdde6fda2	Use the GNU as-compatible .endm instead of .endmacro. Approved by: re (gjb)	2018-09-21 20:20:03 +00:00
Konstantin Belousov	d995b5b1ea	Convert x86 TLB top-level invalidation functions to ifuncs. Note that shootdown IPI handlers are already per-mode. Suggested by: alc Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D17184	2018-09-21 17:53:06 +00:00
Mateusz Guzik	68c7542edf	amd64: even up copyin/copyout with memcpy + other cleanup - _fault handlers for both primitives are identical, provide just one - change the copying scheme to match memcpy (in particular jump avoidance for the most common case of multiple of 8) - stop re-reading pcb address on exit, just store it locally (in r9) Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17265	2018-09-21 15:00:46 +00:00
Mateusz Guzik	bb32268086	amd64: check for small size in memmove, memcpy and memset If the size is 15 bytes or less avoid spinning up rep just to copy the 8 bytes. In my tests on EPYC and old Intel microarchs without ERMS (like Westmere) it provided a nice win over the current version (e.g. for EPYC memset with 15 bytes of size goes from 59712651 ops/s to 70600095) all while almost not pessimizing the other cases. Data collected during package building shows that < 16 sizes are pretty common. Verified with the glibc test suite. Approved by: re (kib)	2018-09-21 12:27:36 +00:00
Mateusz Guzik	5254f065ab	amd64: macroify copyin/copyout and provide erms variants, follow up Fix a fat-fingered typo with a "funny" side-effect: when doing copyin on a cpu without ERMS and with size being a multiple of 8 a page fault would be triggered resulting in EFAULT. Pointy hat: mjg Approved by: re (implicit)	2018-09-20 20:32:08 +00:00
Mateusz Guzik	4bf0035fd1	amd64: macroify copyin/copyout and provide erms variants Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17257	2018-09-20 18:30:17 +00:00
Mateusz Guzik	a286a3099c	amd64: move fusufault after all users A lot of functions have the following check: cmpq %rax,%rdi /* verify address is valid */ ja fusufault The label is present earlier in kernel .text, which means this is a jump backwards. Absent any information in branch predictor, the cpu predicts it as taken. Since it is almost never taken in practice, this results in a completely avoidable misprediction. Move it past all consumers, so that it is predicted as not taken. Approved by: re (kib)	2018-09-20 13:29:43 +00:00
Konstantin Belousov	d12c446550	Convert x86 cache invalidation functions to ifuncs. This simplifies the runtime logic and reduces the number of runtime-constant branches. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D16736	2018-09-19 19:35:02 +00:00
Konstantin Belousov	215aa93033	amd64 pmap: remove tautological assert. pm_pcid is unsigned. Reviewed by: cem, markj CID: 1395727 Noted by: cem Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 3 days Differential revision: https://reviews.freebsd.org/D17235	2018-09-19 15:39:16 +00:00
Konstantin Belousov	3c022be2ca	Use ifunc to resolve context switching mode on amd64. Patch removes all checks for pti/pcid/invpcid from the context switch path. I verified this by looking at the generated code, compiling with the in-tree clang. The invpcid_works1 trick required inline attribute for pmap_activate_sw_pcid_pti() to work. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D17181	2018-09-17 15:52:19 +00:00
Mateusz Guzik	d6943c5804	amd64: tidy up kernel memmove, take 2 There is no need to use %rax for temporary values and avoiding doing so shortens the func. Handle the explicit 'check for tail' depessimization for backwards copying. This reduces the diff against userspace. Tested with the glibc test suite. Approved by: re (kib)	2018-09-17 15:51:49 +00:00
Konstantin Belousov	09a6ada991	Calculate PTI, PCID and INVPCID modes earlier, before ifuncs are resolved. This will be used in following conversion of pmap_activate_sw(). Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D17181	2018-09-17 15:34:19 +00:00
Konstantin Belousov	76ed0c542f	Make the PTI violation check to follow style of the SMAP check. No functional changes. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D17181	2018-09-17 14:59:05 +00:00
Mateusz Guzik	9d1b868da0	Revert amd64: tidy up kernel memmove There is a braino in the non-erms variant which breaks the functionality. Will be fixed at a later time with a different patch. Reported by: Manfred Antar Approved by: re (implicit)	2018-09-16 21:46:27 +00:00
Mateusz Guzik	17f67f63b9	amd64: tidy up kernel memmove There is no need to use %rax for temporary values and avoiding doing so shortens the func. Handle the explicit 'check for tail' depessimization for backwards copying. This reduces the diff against userspace. Approved by: re (kib)	2018-09-16 19:28:27 +00:00
Konstantin Belousov	bd6c14afa7	Remove unneeded new line from the panic string. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D17181	2018-09-16 18:36:42 +00:00
Mateusz Guzik	c51b7ab9e3	amd64: implement pagezero_erms Intel docs claim such a memset (rep stosb + 4096 bytes) is special-cased by microarchs. They also switched Linux to use it for this purpose. Approved by: re (gjb)	2018-09-14 15:29:35 +00:00
Mateusz Guzik	13ea074dc3	amd64: implement ERMS-based memmove, memcpy and memset Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17124	2018-09-13 14:53:51 +00:00
Mateusz Guzik	e382dd47aa	amd64: enable options NUMA in GENERIC and MINIMAL Reviewed by: gallatin, cem, scottl Approved by: re (kib) Relnotes: yes Sponsored by: Dell EMC Isilon, Netflix Differential Revision: https://reviews.freebsd.org/D17059	2018-09-11 23:54:31 +00:00
Mateusz Guzik	12360b3079	amd64: depessimize copyinstr_smap The stac/clac combo around each byte copy is causing a measurable slowdown in benchmarks. Do it only before and after all data is copied. While here reorder the code to avoid a forward branch in the common case. Note the copying loop (originating from copyinstr) is avoidably slow and will be fixed later. Reviewed by: kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17063	2018-09-06 19:42:40 +00:00
Konstantin Belousov	20df4f456d	amd64: Properly re-merge r334537 into SMAP-ified copyin(9) and copyout(9). Also this fixes the eflags.ac leak from copyin_smap() when the copied data length is multiple of eight bytes. Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2018-09-04 19:27:53 +00:00
Konstantin Belousov	e21c5abc2a	amd64: For non-PTI mode, do not initialize PCPU kcr3 to KPML4phys. Non-PTI mode does not switch kcr3, which means that kcr3 is almost always stale. This is important for the NMI handler, which reloads %cr3 with PCPU(kcr3) if the value is different from PMAP_NO_CR3. The end result is that curpmap in NMI handler does not match the page table loaded into hardware. The manifestation was copyin(9) looping forever when a usermode access page fault cannot be resolved by vm_fault() updating a different page table. Reported by: mmacy Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Approved by: re (gjb)	2018-09-04 19:26:54 +00:00
Konstantin Belousov	50cd0be78f	Catch exceptions during EFI RT calls on amd64. This appeared to be required to have EFI RT support and EFI RTC enabled by default, because there are too many reports of faulting calls on many different machines. The knob is added to leave the exceptions unhandled to allow to debug the actual bugs. Reviewed by: kevans Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D16972	2018-09-02 21:37:05 +00:00
Konstantin Belousov	1565fb29a7	Add amd64 mdthread fields needed for the upcoming EFI RT exception handling. This is split into a separate commit from the main change to make it easier to handle possible revert after upcoming KBI freeze. Reviewed by: kevans Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D16972	2018-09-02 21:16:43 +00:00
Konstantin Belousov	9eb958988a	Swap order of dereferencing PCPU curpmap and checking for usermode in trap_pfault() KPTI violation check. EFI RT may set curpmap to NULL for the duration of the call for some machines (PCID but no INVPCID). Since apparently EFI RT code must be ready for exceptions from the calls, avoid dereferencing curpmap until we know that this call does not come from usermode. Reviewed by: kevans Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D16972	2018-09-02 20:07:36 +00:00
Konstantin Belousov	d4be3789fe	Normalize use of semicolon with EFI_TIME_LOCK macros. Reviewed by: kevans Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D16972	2018-09-02 19:48:41 +00:00
Konstantin Belousov	f0165b1ca6	Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines. Exposing max_offset and min_offset defines in public headers is causing clashes with variable names, for example when building QEMU. Based on the submission by: royger Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Approved by: re (marius) Differential revision: https://reviews.freebsd.org/D16881	2018-08-29 12:24:19 +00:00
Konstantin Belousov	d367236183	Several bug fixes and robustness improvements for the AP boot page table allocation. At the time that mp_bootaddress() is called, phys_avail[] array does not reflect some memory reservations already done, like kernel placement. Recent changes to DMAP protection which make kernel text read-only in DMAP revealed this, where on some machines AP boot page tables selection appears to intersect with the kernel itself. Fix this by checking the addresses selected using the same algorithm as bootaddr_rwx(). Also, try to chomp pages for the page table not only at the start of the contiguous range, but also at the end. This should improve robustness when the only suitable range is already consumed by the kernel. Reported and tested by: Michael Gmelin <freebsd@grem.de> Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D16907	2018-08-28 18:47:02 +00:00
Alan Cox	49bfa624ac	Eliminate the arena parameter to kmem_free(). Implicitly this corrects an error in the function hypercall_memfree(), where the wrong arena was being passed to kmem_free(). Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are mapped in kmem with execute permissions. Use this flag to determine which arena the kmem virtual addresses are returned to. Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it redundant. Update the nearby comment for UMA_SLAB_KERNEL. Reviewed by: kib, markj Discussed with: jeff Approved by: re (marius) Differential Revision: https://reviews.freebsd.org/D16845	2018-08-25 19:38:08 +00:00
Konstantin Belousov	60b7423434	Unify amd64 and i386 vmspace0 pmap activation. Add pmap_activate_boot() for i386, move the invocation on APs from MD init_secondary() to x86 init_secondary_tail(). Suggested by: alc Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation Approved by: re (marius) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16893	2018-08-25 15:21:28 +00:00
Warner Losh	592ffb2175	Revert drm2 removal. Revert r338177, r338176, r338175, r338174, r338172 After long consultations with re@, core members and mmacy, revert these changes. Followup changes will be made to mark them as deprecated and print a message about where to find the up-to-date driver. Followup commits will be made to make this clear in the installer. Followup commits to reduce POLA in ways we're still exploring. It's anticipated that after the freeze, this will be removed in 13-current (with the residual of the drm2 code copied to sys/arm/dev/drm2 for the TEGRA port's use w/o the intel or radeon drivers). Due to the impending freeze, there was no formal core vote for this. I've been talking to different core members all day, as well as Matt Macy and Glen Barber. Nobody is completely happy, all are grudgingly going along with this. Work is in progress to mitigate the negative effects as much as possible. Requested by: re@ (gjb, rgrimes)	2018-08-24 00:02:00 +00:00
Mark Johnston	36716fe2e6	Prepare the kernel linker to handle PC-relative ifunc relocations. The boot-time ifunc resolver assumes that it only needs to apply IRELATIVE relocations to PLT entries. With an upcoming optimization, this assumption no longer holds, so add the support required to handle PC-relative relocations targeting GNU_IFUNC symbols. - Provide a custom symbol lookup routine that can be used in early boot. The default lookup routine uses kobj, which is not functional at that point. - Apply all existing relocations during boot rather than filtering IRELATIVE relocations. - Ensure that we continue to apply ifunc relocations in a second pass when loading a kernel module. Reviewed by: kib MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16749	2018-08-22 20:44:30 +00:00
Konstantin Belousov	614a9ce31a	Skip PMAP_PCID_KERN + 1 PCPU pcid_next value on APs as well. r337838 did it for BSP. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-08-22 14:58:52 +00:00
Matt Macy	d157fbd5b4	Remove legacy drm and drm2 from tree As discussed on the MLs drm2 conflicts with the ports' version and there is no upstream for most if not all of drm. Both have been merged in to a single port. Users on powerpc, 32-bit hardware, or with GPUs predating Radeon and i915 will need to install the graphics/drm-legacy-kmod. All other users should be able to use one of the LinuxKPI-based ports: graphics/drm-stable-kmod, graphics/drm-next-kmod, graphics/drm-devel-kmod. MFC: never Approved by: core@	2018-08-22 01:50:12 +00:00
Alan Cox	83a90bffd8	Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter became unused in FreeBSD 12.x as a side-effect of the NUMA-related changes.) Reviewed by: kib, markj Discussed with: jeff, re@ Differential Revision: https://reviews.freebsd.org/D16825	2018-08-21 16:43:46 +00:00
Konstantin Belousov	a997bcc015	Update comment about ABI of flush_l1s_sw to match the reality. CPUID instruction clobbers %rbx and %rdx. Sponsored by: The FreeBSD Foundation MFC after: 13 days	2018-08-20 19:09:39 +00:00
Konstantin Belousov	b0568ddbec	Always initialize PCPU kcr3 for vmspace0 pmap. If an exception or NMI occurs before CPU switched to a pmap different from vmspace0, PCPU kcr3 is left zero for pti config, which causes triple-fault in the handler. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-08-20 19:07:57 +00:00
John Baldwin	a800b45c18	Merge amd64 and i386 <machine/intr_machdep.h> headers. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16803	2018-08-20 12:31:39 +00:00
Konstantin Belousov	c1141fba00	Update L1TF workaround to sustain L1D pollution from NMI. Current mitigation for L1TF in bhyve flushes L1D either by an explicit WRMSR command, or by software reading enough uninteresting data to fully populate all lines of L1D. If NMI occurs after either of methods is completed, but before VM entry, L1D becomes polluted with the cache lines touched by NMI handlers. There is no interesting data which NMI accesses, but something sensitive might be co-located on the same cache line, and then L1TF exposes that to a rogue guest. Use VM entry MSR load list to ensure atomicity of L1D cache and VM entry if updated microcode was loaded. If only software flush method is available, try to help the bhyve sw flusher by also flushing L1D on NMI exit to kernel mode. Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16790	2018-08-19 18:47:16 +00:00
John Baldwin	8cd385fda0	Make 'device crypto' lines more consistent. - In configurations with a pseudo devices section, move 'device crypto' into that section. - Use a consistent comment. Note that other things common in kernel configs such as GELI also require 'device crypto', not just IPSEC. Reviewed by: rgrimes, cem, imp Differential Revision: https://reviews.freebsd.org/D16775	2018-08-18 20:32:08 +00:00
Warner Losh	62ee5bbd73	GPT is standard in x86 and arm64 land. Add it to DEFAULTS with the others. Differential Revision: https://reviews.freebsd.org/D16740	2018-08-17 14:47:21 +00:00
Konstantin Belousov	54564eda77	Fix early EFIRT on PCID machines after r337773. Ensure that the valid PCID state is created for proc0 pmap, since it might be used by efirt enter() before first context switch on the BSP. Sponsored by: The FreeBSD Foundation MFC after: 6 days	2018-08-15 12:48:49 +00:00
Konstantin Belousov	c30578feeb	Provide part of the mitigation for L1TF-VMM. On the guest entry in bhyve, flush L1 data cache, using either L1D flush command MSR if available, or by reading enough uninteresting data to fill whole cache. Flush is automatically enabled on CPUs which do not report RDCL_NO, and can be disabled with the hw.vmm.l1d_flush tunable/kenv. Security: CVE-2018-3646 Reviewed by: emaste, jhb, Tony Luck <tony.luck@intel.com> Sponsored by: The FreeBSD Foundation	2018-08-14 17:29:41 +00:00
Konstantin Belousov	9840c7373c	Reserve page at the physical address zero on amd64. We always zero the invalidated PTE/PDE for superpage, which means that L1TF CPU vulnerability (CVE-2018-3620) can be only used for reading from the page at zero. Note that both i386 and amd64 exclude the page from phys_avail[] array, so this change is redundant, but I think that phys_avail[] on UEFI-boot does not need to do that. Eventually the blacklisting should be made conditional on CPUs which report that they are not vulnerable to L1TF. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation	2018-08-14 17:14:33 +00:00
Konstantin Belousov	8fba5348fc	amd64: ensure that curproc->p_vmspace pmap always matches PCPU curpmap. When performing context switch on a machine without PCID, if current %cr3 equals to the new pmap %cr3, which is typical for kernel_pmap vs. kernel process, I overlooked to update PCPU curpmap value. Remove check for %cr3 not equal to pm_cr3 for doing the update. It is believed that this case cannot happen at all, due to other changes in this revision. Also, do not set the very first curpmap to kernel_pmap, it should be vmspace0 pmap instead to match curproc. Move the common code to activate the initial pmap both on BSP and APs into pmap_activate_boot() helper. Reported by: eadler, ambrisko Discussed with: kevans Reviewed by: alc, markj (previous version) Tested by: ambrisko (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16618	2018-08-14 16:37:14 +00:00
Konstantin Belousov	ef52dc71eb	Fix typo. Noted by: alc MFC after: 3 days	2018-08-14 16:27:17 +00:00
Mark Johnston	97edfc1b45	Implement kernel support for early loading of Intel microcode updates. Updates in the format described in section 9.11 of the Intel SDM can now be applied as one of the first steps in booting the kernel. Updates that are loaded this way are automatically re-applied upon exit from ACPI sleep states, in contrast with the existing cpucontrol(8)-based method. For the time being only Intel updates are supported. Microcode update files are passed to the kernel via loader(8). The file type must be "cpu_microcode" in order for the file to be recognized as a candidate microcode update. Updates for multiple CPU types may be concatenated together into a single file, in which case the kernel will select and apply a matching update. Memory used to store the update file will be freed back to the system once the update is applied, so this approach will not consume more memory than required. Reviewed by: kib MFC after: 6 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16370	2018-08-13 17:13:09 +00:00
Konstantin Belousov	cb0eecdf92	Futex support functions in linux.ko and linux32.ko on amd64 should be aware of SMAP. Reported and tested by: Johannes Lundberg <johalun0@gmail.com>, wulf Sponsored by: The FreeBSD Foundation	2018-08-07 18:29:10 +00:00
Kyle Evans	3395e43a04	efirt: Don't enter EFI context early, convert addrs to KVA instead efi_enter here was needed because efi_runtime dereference causes a fault outside of EFI context, due to runtime table living in runtime service space. This may cause problems early in boot, though, so instead access it by converting paddr to KVA for access. While here, remove the other direct PHYS_TO_DMAP calls and the explicit DMAP requirement from efidev. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D16591	2018-08-04 21:41:10 +00:00
Konstantin Belousov	54c531cacd	Add END()s for amd64 linux futex support routines. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-08-04 13:57:50 +00:00
Konstantin Belousov	35efb3b1de	Fix typo in copyinstr_smap, resulting in mis-handling of too long strings. Reported and tested by: pho PR: 230286 Sponsored by: The FreeBSD Foundation	2018-08-03 15:35:29 +00:00
Konstantin Belousov	e45b89d23d	Add pmap_is_valid_memattr(9). Discussed with: alc Sponsored by: The FreeBSD Foundation, Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15583	2018-08-01 18:45:51 +00:00
Mark Johnston	8a5efe3601	Make sure that ENTRY() and END() refer to the same symbol. X-MFC with: r336876	2018-08-01 15:50:42 +00:00
Marcelo Araujo	be963beee6	- Add the ability to run bhyve(8) within a jail(8). This patch adds a new sysctl(8) knob "security.jail.vmm_allowed", by default this option is disabled. Submitted by: Shawn Webb <shawn.webb____hardenedbsd.org> Reviewed by: jamie@ and myself. Relnotes: Yes. Sponsored by: HardenedBSD and G2, Inc. Differential Revision: https://reviews.freebsd.org/D16057	2018-08-01 00:39:21 +00:00
Mark Johnston	40fd44953c	COMPAT_LINUX32 has not depended on COMPAT_43 in some time. MFC after: 3 days	2018-07-31 21:40:13 +00:00
Kyle Evans	164138e7d8	amd64/GENERIC: Enable EFIRT by default As noted in UPDATING, the new loader tunable efi.rt_disabled may be used to disable EFIRT at runtime. It should have no effect if you are not booted via UEFI boot. MFC after: 6 weeks	2018-07-30 17:54:18 +00:00
Konstantin Belousov	8e36389535	Remove unneeded CLD instructions in the SMAP-ed version of several functions from support.S. I believe they re-appeared due to me mis-merging my r327820 into the topic branch. Sponsored by: The FreeBSD Foundation	2018-07-30 16:54:51 +00:00
Konstantin Belousov	b3a7db3b06	Use SMAP on amd64. Ifuncs selectors dispatch copyin(9) family to the suitable variant, to set rflags.AC around userspace access. Rflags.AC bit is cleared in all kernel entry points unconditionally even on machines not supporting SMAP. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13838	2018-07-29 20:47:00 +00:00
Warner Losh	67d33338c0	Rename VM_FREELIST_ISADMA to VM_FREELIST_LOWMEM. There's no difference between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM except for the default boundary (16MB on x86 and 256MB on MIPS, but they are otherwise the same). We don't need both for any system we support (there were some really old ARC systems that did have ISA/EISA bus, but we never ran on them and they are too old to ever grow support for). Differential Review: https://reviews.freebsd.org/D16290	2018-07-27 18:34:20 +00:00
Mark Johnston	6c85795a25	Fix handling of KVA in kmem_bootstrap_free(). Do not use vm_map_remove() to release KVA back to the system. Because kernel map entries do not have an associated VM object, with r336030 the vm_map_remove() call will not update the kernel page tables. Avoid relying on the vm_map layer and instead update the pmap and release KVA to the kernel arena directly in kmem_bootstrap_free(). Because the pmap updates will generally result in superpage demotions, modify pmap_init() to insert PTPs shadowed by superpage mappings into the kernel pmap's radix tree. While here, port r329171 to i386. Reported by: alc Reviewed by: alc, kib X-MFC with: r336505 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16426	2018-07-27 15:46:34 +00:00
Konstantin Belousov	45ed991d96	On amd64, enable workarounds for several Ryzen errata as described in the AMD document 55449 'Revision Guide for AMD Family 17h Models 00h-0Fh Processors' rev 1.12. The errata numbers are mentioned near each action. It seems that newer BIOSes already include required chicken bits settings, so the magic MSR updates are only needed when BIOS cannot be updated. On the other hand, MWAIT avoidance seems to be important. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-07-27 15:31:20 +00:00
Konstantin Belousov	41bed185c1	Extend ranges of the critical sections to ensure that context switch code never sees FPU pcb flags not consistent with the hardware state. This is uncovered by the eager FPU switch mode. Analyzed, reviewed and tested by: gleb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-07-24 19:22:52 +00:00
Mark Johnston	483f692ea6	Have preload_delete_name() free pages backing preloaded data. On i386 and amd64, add a vm_phys segment for physical memory used to store the kernel binary and other preloaded data. This makes it possible to free such memory back to the system once it is no longer needed, e.g., when a preloaded kernel module is unloaded. Previously, it would have remained unused. Reviewed by: kib, royger MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16330	2018-07-19 20:00:28 +00:00
Roger Pau Monné	b0663c33c2	xen: implement early init helper for PVHv2 In order to setup an initial environment and jump into the generic hammer_time initialization function. Some of the code is shared with PVHv1, while other code is PVHv2 specific. This allows booting FreeBSD as a PVHv2 DomU and Dom0. Sponsored by: Citrix Systems R&D	2018-07-19 08:44:52 +00:00
Roger Pau Monné	f2577f25c1	xen: add PVHv2 entry point The PVHv2 entry point is fairly similar to the multiboot1 one. The kernel is started in protected mode with paging disabled. More information about the exact BSP state can be found in the pvh.markdown document on the Xen tree. This entry point is going to be joined with the native entry point at hammer_time, and in order to do so the BSP needs to be bootstrapped into long mode with the same set of page tables as used on bare metal. Sponsored by: Citrix Systems R&D	2018-07-19 07:39:35 +00:00
Konstantin Belousov	53dec71d39	Expand x86 struct pcpus to UMA_PCPU_ALLOC_SIZE AKA PAGE_SIZE. This restores counters(9) operation. Revert r336024. Improve assert of pcpu size on x86. Reviewed by: mmacy Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D16163	2018-07-06 19:50:44 +00:00
Konstantin Belousov	fb0a281196	Revert to recommit with the proper message.	2018-07-06 19:50:25 +00:00
Konstantin Belousov	1614716655	Save a call to pmap_remove() if entry cannot have any pages mapped. Due to the way rtld creates mappings for the shared objects, each dso causes unmap of at least three guard map entries. For instance, in the buildworld load, this change reduces the amount of pmap_remove() calls by 1/5. Profiled by: alc Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16148	2018-07-06 19:48:47 +00:00
Hans Petter Selasky	a7a7f5b472	Make sure kernel modules built by default are portable between UP and SMP systems by extending defined(SMP) to include defined(KLD_MODULE). This is a regression issue after r335873 . Discussed with: mmacy@ Sponsored by: Mellanox Technologies	2018-07-06 10:13:42 +00:00
Matt Macy	428194fed2	counter(9): unbreak amd64 following r336020 Apply temporary fix to counter until daylight hours. The fact that the assembly for counter_u64_add relied on the sizeof(struct pcpu) was the basis for the otherwise arbitrary offset never came up in D15933. critical_{enter,exit} is now inline so the only real added overhead is the added (mostly false) conditional branch in exit.	2018-07-06 10:10:00 +00:00
Matt Macy	ab3059a8e7	Back pcpu zone with domain correct pages - Change pcpu zone consumers to use a stride size of PAGE_SIZE. (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier) - Allocate page from the correct domain for a given cpu. - Don't initialize pc_domain to non-zero value if NUMA is not defined There are some misconceptions surrounding this field. It is the _VM_ NUMA domain and should only ever correspond to valid domain values as understood by the VM. The former slab size of sizeof(struct pcpu) was somewhat arbitrary. The new value is PAGE_SIZE because that's the smallest granularity which the VM can allocate a slab for a given domain. If you have fewer than PAGE_SIZE/8 counters on your system there will be some memory wasted, but this is obviously something where you want the cache line to be coming from the correct domain. Reviewed by: jeff Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15933	2018-07-06 02:06:03 +00:00
Konstantin Belousov	945a6b310b	Extend r335969 to superpages. It is possible that a fictitious unmanaged userspace mapping of superpage is created on x86, e.g. by pmap_object_init_pt(), with the physical address outside the vm_page_array[] coverage. Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16085	2018-07-05 17:28:06 +00:00
Konstantin Belousov	a0ef97f6fa	Revert r335999 to re-commit with the correct error message.	2018-07-05 17:26:13 +00:00
Konstantin Belousov	c59dfa63bf	In x86 pmap_extract_and_hold(), there is no need to recalculate the physical address, which is readily available after successful vm_page_pa_tryrelock(). Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16085	2018-07-05 16:38:54 +00:00
Konstantin Belousov	81dac87135	In x86 pmap_extract_and_hold(), there is no need to recalculate the physical address, which is readily available after successful vm_page_pa_tryrelock(). Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16085	2018-07-05 16:27:34 +00:00
Alan Cox	c11c3d64e0	As of r335784, if pmap_enter() replaces a managed mapping by an unmanaged mapping, then it leaks the unlinked PV entry. This change eliminates that leak, freeing the PV entry. Reviewed by: kib, markj X-MFC with: r335784 Differential Revision: https://reviews.freebsd.org/D16130	2018-07-05 02:04:18 +00:00
Konstantin Belousov	84a15fe70d	In x86 pmap_extract_and_hold()s, handle the case of PHYS_TO_VM_PAGE() returning NULL. vm_fault_quick_hold_pages() can be legitimately called on userspace mappings backed by fictitious pages created by unmanaged device and sg pagers. Note that other architectures pmap_extract_and_hold() might need similar fix, but I postponed the examination. Reported by: bde Discussed with: alc Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16085	2018-07-04 21:21:59 +00:00
John Baldwin	79ba91952d	Use 'e' instead of 'i' constraints with 64-bit atomic operations on amd64. The ADD, AND, OR, and SUB instructions take at most a 32-bit sign-extended immediate operand. 64-bit constants that do not fit into that constraint need to be loaded into a register. The 'i' constraint tells the compiler it can pass any integer constant to the assembler, whereas the 'e' constraint only permits constants that fit into a 32-bit sign-extended value. This fixes using atomic_add/clear/set/subtract_long/64 with constants that do not fit into a 32-bit sign-extended immediate. Reported by: several folks Tested by: Pete Wright <pete@nomadlogic.org> MFC after: 2 weeks	2018-07-03 22:03:28 +00:00
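Note on 79ba91952d: the distinction between the two constraints comes down to whether a 64-bit constant survives a round trip through a sign-extended 32-bit immediate. The following is a minimal Python model of that range check, not the kernel code; the helper names `to_int64` and `fits_simm32` are made up for illustration.

```python
def to_int64(value: int) -> int:
    """Interpret an arbitrary integer as a 64-bit two's-complement value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def fits_simm32(value: int) -> bool:
    """True if the 64-bit constant can be encoded as a sign-extended
    32-bit immediate (the range the message attributes to 'e')."""
    v = to_int64(value)
    return -(1 << 31) <= v < (1 << 31)


# Constants in range can be used as immediates; the others have to be
# loaded into a register first, as the commit message explains.
for constant in (1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFFFFFFFFFF, 1 << 32):
    kind = "immediate" if fits_simm32(constant) else "register"
    print(f"{constant:#018x}: {kind}")
```

Note that 0x80000000 does not fit (sign extension would turn it into 0xffffffff80000000), while an all-ones 64-bit constant does, because it is -1.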
Matt Macy	f4b3640475	inline atomics and allow tied modules to inline locks - inline atomics in modules on i386 and amd64 (they were always inline on other arches) - allow modules to opt in to inlining locks by specifying MODULE_TIED=1 in the makefile Reviewed by: kib Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16079	2018-07-02 19:48:38 +00:00
Mark Johnston	1253de1eb6	Invalidate the mapping before updating its physical address. Doing so ensures that all threads sharing the pmap have a consistent view of the mapping. This fixes the problem described in the commit log messages for r329254 without the overhead of an extra fault in the common case. Once other pmap_enter() implementations are similarly modified, the workaround added in r329254 can be removed, reducing the overhead of CoW faults. With this change we can reuse the PV entry from the old mapping, potentially avoiding a call to reclaim_pv_chunk(). Otherwise, there is nothing preventing the old PV entry from being reclaimed. In rare cases this could result in the PTE's page table page being freed, leading to a use-after-free of the page when the updated PTE is written following the allocation of the PV entry for the new mapping. Reported and tested by: pho Reviewed by: alc, kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D16005	2018-06-28 21:40:31 +00:00
Konstantin Belousov	7f12ebe583	Do not leave stray qword on top of stack for interrupts and exceptions without error code. Doing so mis-aligned the stack. Since the only consumer of the SSE instructions with the alignment requirements is AES-NI module, and since the FPU context cannot be accessed in interrupts, the only situation where the alignment matters is the compat32 syscalls, as reported in the PR. PR: 229222 Reported and tested by: dewayne@heuristicsystems.com.au Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-06-25 11:29:04 +00:00
Mark Johnston	a8be239d69	Re-count available PV entries after reclaiming a PV chunk. The call to reclaim_pv_chunk() in reserve_pv_entries() may free a PV chunk with free entries belonging to the current pmap. In this case we must account for the free entries that were reclaimed, or reserve_pv_entries() may return without having reserved the requested number of entries. Reviewed by: alc, kib Tested by: pho (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D15911	2018-06-23 10:41:52 +00:00
Chuck Tuffli	3575504976	Fix the Linux kernel version number calculation The Linux compatibility code was converting the version number (e.g. 2.6.32) in two different ways and then comparing the results. The linux_map_osrel() function converted MAJOR.MINOR.PATCH similar to what FreeBSD does natively. I.e. where major=v0, minor=v1, and patch=v2 v = v0 * 1000000 + v1 * 1000 + v2; The LINUX_KERNVER() macro, on the other hand, converted the value with bit shifts. I.e. where major=a, minor=b, and patch=c v = (((a) << 16) + ((b) << 8) + (c)) The Linux kernel uses the latter format via the KERNEL_VERSION() macro in include/generated/uapi/linux/version.h Fix is to use the LINUX_KERNVER() macro in linux_map_osrel() as well as in the .trans_osrel functions. PR: 229209 Reviewed by: emaste, cem, imp (mentor) Approved by: imp (mentor) Differential Revision: https://reviews.freebsd.org/D15952	2018-06-22 00:02:03 +00:00
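Note on 3575504976: the two encodings quoted in the message give unrelated numbers for the same version, which is why comparing one against the other misbehaves. The sketch below transcribes both formulas into Python purely for illustration; the function names are hypothetical, and only the arithmetic is taken from the commit message.

```python
def decimal_encoding(major: int, minor: int, patch: int) -> int:
    """v = v0 * 1000000 + v1 * 1000 + v2, as quoted for linux_map_osrel()."""
    return major * 1000000 + minor * 1000 + patch


def shift_encoding(major: int, minor: int, patch: int) -> int:
    """v = (a << 16) + (b << 8) + c, as quoted for LINUX_KERNVER()."""
    return (major << 16) + (minor << 8) + patch


version = (2, 6, 32)
print(decimal_encoding(*version))     # decimal form of 2.6.32
print(hex(shift_encoding(*version)))  # one byte per component

# A check such as "is the emulated kernel older than 3.0.0?" goes wrong
# when the two sides use different encodings.
mixed = decimal_encoding(2, 6, 32) < shift_encoding(3, 0, 0)
consistent = shift_encoding(2, 6, 32) < shift_encoding(3, 0, 0)
print(mixed, consistent)
```

With a single encoding on both sides the comparison orders versions as expected, which is the effect of using LINUX_KERNVER() everywhere.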
Matt Macy	92689b3f02	remove ixl iwarp and ixlv from the build until they are in a working state	2018-06-19 02:48:53 +00:00
Eric Joyner	1031d839aa	ixl(4): Update to use iflib Update the driver to use iflib in order to bring performance, maintainability, and (hopefully) stability benefits to the driver. The driver currently isn't completely ported; features that are missing: - VF driver (ixlv) - SR-IOV host support - RDMA support The plan is to have these re-added to the driver before the next FreeBSD release. Reviewed by: gallatin@ Contributions by: gallatin@, mmacy@, krzysztof.galazka@intel.com Tested by: jeffrey.e.pieper@intel.com MFC after: 1 month Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D15577	2018-06-18 20:12:54 +00:00
Ed Maste	931e2a1a6e	linuxulator: do not include legacy syscalls on arm64 Existing linuxulator platforms (i386, amd64) support legacy syscalls, such as non-*at ones like open, but arm64 and other new platforms do not. Wrap these in #ifdef LINUX_LEGACY_SYSCALLS, #defined in the MD linux.h files. We may need finer grained control in the future but this is sufficient for now. Reviewed by: andrew Sponsored by: Turing Robotic Industries Differential Revision: https://reviews.freebsd.org/D15237	2018-06-15 14:41:51 +00:00
Konstantin Belousov	459ccd3c5f	linuxolator/amd64: Don't mangle %r10 on return from syscall for EJUSTRETURN. This fixes the %r10 content for rt_sigreturn. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> MFC after: 1 week	2018-06-14 12:35:57 +00:00
Konstantin Belousov	5803d744c7	Reorganize code flow in fpudna()/npxdna() to highlight the critical section scope. Sprinkle __predict_false() for conditions known to never occur or occur only on rare platforms. Sponsored by: The FreeBSD Foundation	2018-06-14 11:09:51 +00:00
Konstantin Belousov	fa7fad8ab9	Remove printf() in #NM handler. Give up and remove the almost useless informational message reporting that device not available exception occurred while our state tracking indicates the current CPU has FPU context loaded for the current thread. It seems that this is a recurring bug with some VM monitors. Sponsored by: The FreeBSD Foundation	2018-06-14 10:33:26 +00:00
Konstantin Belousov	d1a07e31e5	Enable eager FPU context switch by default on amd64. With compilers making increasing use of vector instructions the performance benefit of lazily switching FPU state is no longer a desirable tradeoff. Linux switched to eager FPU context switch some time ago, and the idea was floated on the FreeBSD-current mailing list some years ago[1]. Enable eager FPU context switch by default on amd64, with a tunable/sysctl available to turn it back off. [1] https://lists.freebsd.org/pipermail/freebsd-current/2015-March/055198.html Reviewed by: jhb Tested by: pho Sponsored by: The FreeBSD Foundation	2018-06-13 17:55:09 +00:00
Jonathan T. Looney	0766f278d8	Make UMA and malloc(9) return non-executable memory in most cases. Most kernel memory that is allocated after boot does not need to be executable. There are a few exceptions. For example, kernel modules do need executable memory, but they don't use UMA or malloc(9). The BPF JIT compiler also needs executable memory and did use malloc(9) until r317072. (Note that a side effect of r316767 was that the "small allocation" path in UMA on amd64 already returned non-executable memory. This meant that some calls to malloc(9) or the UMA zone(9) allocator could return executable memory, while others could return non-executable memory. This change makes the behavior consistent.) This change makes malloc(9) return non-executable memory unless the new M_EXEC flag is specified. After this change, the UMA zone(9) allocator will always return non-executable memory, and a KASSERT will catch attempts to use the M_EXEC flag to allocate executable memory using uma_zalloc() or its variants. Allocations that do need executable memory have various choices. They may use the M_EXEC flag to malloc(9), or they may use a different VM interface to obtain executable pages. Now that malloc(9) again allows executable allocations, this change also reverts most of r317072. PR: 228927 Reviewed by: alc, kib, markj, jhb (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15691	2018-06-13 17:04:41 +00:00
Marcelo Araujo	ebc3c37c6f	Add SPDX tags to vmm(4). MFC after: 4 weeks. Sponsored by: iXsystems Inc.	2018-06-13 07:02:58 +00:00
Jung-uk Kim	6362b1a6b1	Fix number of auxargs entries to copy out for 32-bit Linuxulator. PR: 228790	2018-06-12 22:54:48 +00:00
Konstantin Belousov	b45e10c3f4	Fix braino in r334799. Maxmem is in pages. Reported by: ae, pho Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-06-11 15:28:20 +00:00
Bruce Evans	3cd246d9a9	Untangle configuration ifdefs a little. On x86, msi is optional on pci, and also on apic in common and i386 files (except for xen it is optional only on xenhvm), but it was not ifdefed except on apic in common and i386 files. This is all that is left from an attempt to build a (sub-)minimal kernel without any devices. The isa "option" is still used without ifdefs in many standard files even on amd64. ISAPNP is not optional on at least i386. ATPIC is not optional on i386 (it is used mainly for Xspuriousint). But pci is now supposed to be optional on x86.	2018-06-10 14:49:13 +00:00
Mark Johnston	f090f67503	Tell the compiler that rdtscp clobbers %ecx.	2018-06-09 18:31:19 +00:00
Tycho Nightingale	4d20e87b7e	Don't bother looking for non-executable pages when a process is excluded from PTI. Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15708	2018-06-08 20:35:58 +00:00
Matt Macy	eb7c901995	hwpmc: simplify calling convention for hwpmc interrupt handling pmc_process_interrupt takes 5 arguments when only 3 are needed. cpu is always available in curcpu and inuserspace can always be derived from the passed trapframe. While facially a reasonable cleanup this change was motivated by the need to workaround a compiler bug. core2_intr(cpu, tf) -> pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) -> pmc_add_sample(cpu, ring, pm, tf, inuserspace) In the process of optimizing the tail call the tf pointer was getting clobbered: (kgdb) up at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709 4709 pmc_save_kernel_callchain(ps->ps_pc, (kgdb) up 1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf, resulting in a crash in pmc_save_kernel_callchain.	2018-06-08 04:58:03 +00:00
Mateusz Guzik	dfa5753e09	amd64: remove now unused bzero, bcmp and bcopy. move pagecopy higher up.	2018-06-08 04:18:42 +00:00
Mateusz Guzik	c9ca1a70cc	amd64: fix a retarded bug in memset memset fills the target buffer from a byte-sized value passed in as the second argument. The fully-sized (8 bytes) register containing it is named %rsi. Lower 4 bytes can be referred to as %esi and finally the lowest byte is %sil. Vast majority of all the callers just zero the target buffer and set it up by doing xor %esi,%esi which has a side-effect of zeroing the upper parts of the register as well. Some others do a word-sized move to %esi which has the same result. However, there are callers which only fill %sil. This does not clear up the rest of the register. The value of %rsi is multiplied by $0x0101010101010101 to create a 8-byte sized pattern for 8-byte stores. Prior to the patch, the func just blindly took %rsi assuming the unwanted bytes are zeroed out. Since this is not the case for the callers which only play with %sil (the rest of the register can have absolutely anything), the resulting pattern can be garbage. This has potential for funny bugs. One side effect (which was not amusing) after enabling it instead of bzero was that the kernel was hanging on boot as a xen domU. Reported by: Trond Endrestøl <Trond.Endrestol fagskolen.gjovik.no> Pointy hat: me	2018-06-08 00:47:24 +00:00
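Note on c9ca1a70cc: the failure mode is easy to reproduce with plain integer arithmetic. The sketch below is a Python model of the 64-bit multiply described in the message, not the assembly itself. Masking to the low byte is an assumption about how the fill value can be isolated; the message does not say which instruction the fix uses, and the function names are invented for illustration.

```python
MASK64 = (1 << 64) - 1
REPLICATE = 0x0101010101010101  # multiplier quoted in the commit message


def pattern_unmasked(rsi: int) -> int:
    """Multiply the whole register, as the message says the old code did."""
    return (rsi * REPLICATE) & MASK64


def pattern_masked(rsi: int) -> int:
    """Use only the low byte (%sil) before replicating it.
    Assumption: masking stands in for whatever the real fix does."""
    return ((rsi & 0xFF) * REPLICATE) & MASK64


clean = 0x00000000000000AB  # caller zeroed the upper bytes
dirty = 0xDEADBEEF000000AB  # caller only wrote %sil

print(hex(pattern_unmasked(clean)))  # correct fill pattern
print(hex(pattern_unmasked(dirty)))  # garbage from the stale upper bytes
print(hex(pattern_masked(dirty)))    # correct fill pattern again
```

Callers that zero the full register never see the difference, which is consistent with the message's remark that most callers were unaffected.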
Konstantin Belousov	943defc3a0	Account for dmap limit when selecting the pages for the bootstrap pagetables. physmap[] can be inconsistent with the physical memory limit due to buggy bios, or to the hw.physmem tunable. Since bootstrap pagetables are initialized by accesses through the DMAP, we must ensure that DMAP really cover the selected pages. This is only relevant when machine has less than 4G RAM and buggy BIOS, which is the combination on Acer Chromebook 720. The call to mp_bootaddress() is moved later to have Maxmem initialized. An alternative could be to always cover 4G for DMAP, but this change seems to be simpler. Reported and tested by: grembo Reviewed by: royger Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15675	2018-06-07 17:04:34 +00:00
Matt Macy	155046394a	cpufunc: add rdtscp for x86	2018-06-07 00:54:11 +00:00
Matt Macy	07d80fd8dc	hwpmc: ABI fixes - increase pmc cpuid field from 8 to 12 bits - add cpuid version string to initialize entry in the log so that filter can identify which counter index an event name maps to - GC unused config flags - make fixed counter assignment more robust as well as the changes needed to be properly identified for filter	2018-06-04 02:05:48 +00:00
Mateusz Guzik	d0a22279db	Remove an unused argument to turnstile_unpend. PR: 228694 Submitted by: Julian Pszczołowski <julian.pszczolowski@gmail.com>	2018-06-02 22:37:53 +00:00
Mateusz Guzik	15825d5b78	amd64: add a mild depessimization to rep mov/stos users Currently all the primitives are waiting for a rewrite, tidy them up in the meantime. Vast majority of cases pass sizes which are multiple of 8. Which means the following rep stosb/movb has nothing to do. Turns out testing first if there is anything to do is a big win across the board (cpus with and without ERMS, Intel and AMD) while not pessimizing the case where there is work to do. Sample results for zeroing 64 bytes (ops/second): Ryzen Threadripper 1950X 91433212 -> 147265741 Intel(R) Xeon(R) CPU X5675 @ 3.07GHz 90714044 -> 121992888 bzero and bcopy are on their way out and were not modified. Nothing in the tree uses them.	2018-06-02 20:14:43 +00:00
Bruce Evans	c507c512b9	Finish COMPAT_AOUT support for amd64. It wasn't in any amd64 or MI file in /sys/conf, so was unavailable in configurations that don't use modules, and was not testable or notable in NOTES. Its normal configuration (not using a module) is still silently deprecated in aout(4) by not mentioning it there. Update i386 NOTES for COMPAT_AOUT. It is not i386-only, or even very MD. Sort its entry better. Finish gzip configuration (but not support) for amd64. gzip is really gzipped aout. It is currently broken even for i386 (a call to vm fails). amd64 has always attempted to configure and test it, but it depends on COMPAT_AOUT (as noted). The bug that it depends on unconfigured files was not detected since it is configured as a device. All other optional image activators are configured properly using an option.	2018-06-02 06:40:15 +00:00
Bruce Evans	49c871278a	Fix high resolution kernel profiling just enough to not crash at boot time, especially for SMP. If configured, it turns itself on at boot time for calibration, so is fragile even if never otherwise used. Both types of kernel profiling were supposed to use a global spinlock in the SMP case. If hi-res profiling is configured (but not necessarily used), this was supposed to be optimized by only using it when necessary, and slightly more efficiently, in asm. But it was not done at all for mcount entry where it is necessary. This caused crashes in the SMP case when either type of profiling was enabled. For mcount exit, it only caused wrong times. The times were wrongest with an i8254 timer since using that requires exclusive access to the hardware. The i8254 timer was too slow to use here 20 years ago and is much less usable now, but it is the default for the SMP case since TSCs weren't invariant when SMP was new. Do the locking in all hi-res SMP cases for simplicity. Calibration uses special asms, and the clobber lists in these were sort of inverted. They contained the arg and return registers which are not clobbered, but on amd64 they didn't contain the residue of the call-used registers which may be clobbered (%r10 and %r11). This usually caused hangs at boot time. This usually affected even the UP case.	2018-06-02 05:48:44 +00:00
Bruce Evans	dbe3061729	Fix recent breakages of kernel profiling, mostly on i386 (high resolution kernel profiling remains broken). memmove() was broken using ALTENTRY(). ALTENTRY() is only different from ENTRY() in the profiling case, and its use in that case was sort of backwards. The backwardness magically turned memmove() into memcpy() instead of completely breaking it. Only the high resolution parts of profiling itself were broken. Use ordinary ENTRY() for memmove(). Turn bcopy() into a tail call to memmove() to reduce complications. This gives slightly different pessimizations and profiling lossage. The pessimizations are minimized by not using a frame pointer() for bcopy(). Calls to profiling functions from exception trampolines were not relocated. This caused crashes on the first exception. Fix this using function pointers. Addresses of exception handlers in trampolines were not relocated. This caused unknown offsets in the profiling data. Relocate by abusing setidt_disp as for pmc although this is slower than necessary and requires namespace pollution. pmc seems to be missing some relocations. Stack traces and lots of other things in debuggers need similar relocations. Most user addresses were misclassified as unknown kernel addresses and then ignored. Treat all unknown addresses as user. Now only user addresses in the kernel text range are significantly misclassified (as known kernel addresses). The ibrs functions didn't preserve enough registers. This is the only recent breakage on amd64. Although these functions are written in asm, in the profiling case they call profiling functions which are mostly for the C ABI, so they only have to save call-used registers. They also have to save arg and return registers in some cases and actually save them in all cases to reduce complications. They end up saving all registers except %ecx on i386 and %r10 and %r11 on amd64. Saving these is only needed for 1 caller on each of amd64 and i386. Save them there. 
This is slightly simpler. Remove saving %ecx in handle_ibrs_exit on i386. Both handle_ibrs_entry and handle_ibrs_exit use %ecx, but only the latter needed to or did save it. But saving it there doesn't work for the profiling case. amd64 has more automatic saving of the most common scratch registers %rax, %rcx and %rdx (its complications for %r10 are from unusual use of %r10 by SYSCALL). Thus profiling of handle_ibrs_exit_rs() was not broken, and I didn't simplify the saving by moving the saving of these registers from it to the caller.	2018-06-02 04:25:09 +00:00
Matt Macy	e92a1350b5	hwpmc: remove unused pre-table driven bits for intel Intel now provides comprehensive tables for all performance counters and the various valid configuration permutations as text .json files. Libpmc has been converted to use these and hwpmc_core has been greatly simplified by moving to passthrough of the table values. The one gotcha is that said tables don't support pentium pro and pentium IV. There are very few users of hwpmc on _amd64_ kernels on new hardware. It is unlikely that anyone is doing low level optimization on 15 year old Intel hardware. Nonetheless, if someone feels strongly enough to populate the corresponding tables for p4 and ppro I will reinstate the files in to the build. Code for the K8 counters and !x86 architectures remains unchanged.	2018-05-31 22:41:07 +00:00
Dimitry Andric	b451efbedc	Resolve conflicts between macros in fenv.h and ieeefp.h This is a follow-up to r321483, which disabled -Wmacro-redefined for some lib/msun tests. If an application included both fenv.h and ieeefp.h, several macros such as __fldcw(), __fldenv() were defined in both headers, with slightly different arguments, leading to conflicts. Fix this by putting all the common macros in the machine-specific versions of ieeefp.h. Where needed, update the arguments in places where the macros are invoked. This also slightly reduces the differences between the amd64 and i386 versions of ieeefp.h. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D15633	2018-05-31 20:22:47 +00:00
Mateusz Guzik	64415b8b22	amd64: switch pagecopy from non-temporal stores to rep movsq The copied data is accessed in part soon after and it results with additional cache misses during a -j 1 buildkernel WITHOUT_CTF=yes KERNFAST=1, as measured with pmc stat. before: 256165411 cache-references # 0.003 refs/inst 15105408 cache-misses # 5.897% 20.70 real # 99.67% cpu 13.24 user # 63.94% cpu 7.40 sys # 35.73% cpu after: 256764469 cache-references # 0.003 refs/inst 11913551 cache-misses # 4.640% 20.70 real # 99.67% cpu 13.19 user # 63.73% cpu 7.44 sys # 35.95% cpu Note the real time did not change, but traffic to RAM was reduced (multiple measurements performed with switching the implementation at runtime). Since nobody else is using non-temporal for this and there is no apparent benefit at least these days, don't use them either. Side note is that pagecopy arguments should probably get reversed to not have to flip them around in the primitive. Discussed with: jeff	2018-05-31 09:56:02 +00:00
Brooks Davis	cbf7e0cba7	Correct pointer subtraction in KASSERT(). The assertion would never fire without truly spectacular future programming errors. Reported by: Coverity CID: 1391370 Sponsored by: DARPA, AFRL	2018-05-29 20:03:24 +00:00
Andriy Gapon	279be68bfd	re-synchronize TSC-s on SMP systems after resume, if necessary The TSC-s are checked and synchronized only if they were good originally. That is, invariant, synchronized, etc. This is necessary on an AMD-based system where after a wakeup from STR I see that BSP clock differs from AP clocks by a count that roughly corresponds to one second. The APs are in sync with each other. Not sure if this is a hardware quirk or a firmware bug. This is what I see after a resume with this change: SMP: passed TSC synchronization test after adjustment acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low Reviewed by: kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15551	2018-05-25 07:33:20 +00:00
Brooks Davis	5f77b8a88b	Avoid two suword() calls per auxarg entry. Instead, construct an auxargs array and copy it out all at once. Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent the array. This is the correct type where pairs of words just happened to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act on this array rather than introducing a new macro. Return errors on copyout() and suword() failures and handle them in the caller. Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64 which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32 (now removed due to the use of proper types). Reviewed by: kib Comments from: emaste, jhb Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15485	2018-05-24 16:25:18 +00:00
Matt Macy	14d13423dd	take NUMA out	2018-05-24 04:31:53 +00:00
Matt Macy	e98bbcf9ca	libpmcstat: compile in events based on json description	2018-05-24 04:30:06 +00:00
Konstantin Belousov	8936419a6c	x86: stop unconditionally clearing PSL_T on the trace trap. We certainly should clear PSL_T when calling the SIGTRAP signal handler, which is already done by all x86 sendsig(9) ABI code. On the other hand, there is no obvious reason why PSL_T needs to be cleared when returning from the signal handler. For instance, Linux allows userspace to set PSL_T and keep tracing enabled for the desired period. There are userspace programs which would use PSL_T if we make it possible, for instance sbcl. Remember if PSL_T was set by PT_STEP or PT_SETSTEP by means of the TDB_STEP flag, and only clear it when the flag is set. Discussed with: Ali Mashtizadeh Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D15054	2018-05-23 21:39:29 +00:00
Konstantin Belousov	61bc50d032	Style. Wording and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D15054	2018-05-23 21:25:49 +00:00
Konstantin Belousov	14f7050dba	Enable IBRS when entering an interrupt handler from usermode. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-05-22 13:25:15 +00:00
John Baldwin	9e2154ff1c	Cleanups related to debug exceptions on x86. - Add constants for fields in DR6 and the reserved fields in DR7. Use these constants instead of magic numbers in most places that use DR6 and DR7. - Refer to T_TRCTRAP as "debug exception" rather than a "trace trap" as it is not just for trace exceptions. - Always read DR6 for debug exceptions and only clear TF in the flags register for user exceptions where DR6.BS is set. - Clear DR6 before returning from a debug exception handler as recommended by the SDM dating all the way back to the 386. This allows debuggers to determine the cause of each exception. For kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value to other parts of the handler (namely, user_dbreg_trap()). For user traps, wait until after trapsignal to clear DR6 so that userland debuggers can read DR6 via PT_GETDBREGS while the thread is stopped in trapsignal(). Reviewed by: kib, rgrimes MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D15189	2018-05-22 00:45:00 +00:00
Konstantin Belousov	3621ba1ede	Add Intel Spec Store Bypass Disable control. Speculative Store Bypass (SSB) is a speculative execution side channel vulnerability identified by Jann Horn of Google Project Zero (GPZ) and Ken Johnson of the Microsoft Security Response Center (MSRC) https://bugs.chromium.org/p/project-zero/issues/detail?id=1528. Updated Intel microcode introduces a MSR bit to disable SSB as a mitigation for the vulnerability. Introduce a sysctl hw.spec_store_bypass_disable to provide global control over the SSBD bit, akin to the existing sysctl that controls IBRS. The sysctl can be set to one of three values: 0: off 1: on 2: auto Future work will enable applications to control SSBD on a per-process basis (when it is not enabled globally). SSBD bit detection and control was verified with prerelease microcode. Security: CVE-2018-3639 Tested by: emaste (previous version, without updated microcode) Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-21 21:08:19 +00:00
Konstantin Belousov	2320153fcc	Preserve other bits in IA32_SPEC_CTL MSR when changing the IBRS and STIBP states. Tested by: emaste (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-21 21:05:55 +00:00
Konstantin Belousov	5988464ec4	Fix grammar. Submitted by: alc MFC after: 1 week	2018-05-21 19:15:05 +00:00
Konstantin Belousov	0a4b04a616	Add missed barrier for pm_gen/pm_active interaction. When we issue shootdown IPIs, we first assign zero to pm_gens to indicate the need to flush on the next context switch in case our IPI misses the context, next we read pm_active. On context switch we set our bit in pm_active, then we read pm_gen. It is crucial that both threads see the memory in the program order, otherwise invalidation thread might read pm_active bit as zero and the context switching thread might read pm_gen as zero. IA32 allows CPU for both reads to see zero. We must use the barriers between write and read. The pm_active bit set is already locked, so only the invalidation functions need it. I never saw it in real life, or at least I do not have a good reproduction case. I found this during code inspection when hunting for the Xen TLB issue reported by cperciva. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15506	2018-05-21 18:41:16 +00:00
Mateusz Guzik	edacda736b	amd64: annotate pti with __read_frequently	2018-05-21 05:20:23 +00:00
Mark Johnston	892bdccca0	Enable kernel dump features in GENERIC for most platforms. This turns on support for kernel dump encryption and compression, and netdump. arm and mips platforms are omitted for now, since they are more constrained and don't benefit as much from these features. Reviewed by: cem, manu, rgrimes Tested by: manu (arm64) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D15465	2018-05-19 19:53:23 +00:00
Matt Macy	f5ad6b4b00	pmap: silence warnings	2018-05-19 05:58:05 +00:00
Ed Maste	3dc3b1235a	amd64 GENERIC: correct whitespace on smartpqi entry	2018-05-18 17:51:42 +00:00
Antoine Brodin	147d12a7d3	vmmdev: return EFAULT when trying to read beyond VM system memory max address Currently, when using dd(1) to take a VM memory image, the capture never ends, reading zeroes when it's beyond VM system memory max address. Return EFAULT when trying to read beyond VM system memory max address. Reviewed by: imp, grehan, anish Approved by: grehan Differential Revision: https://reviews.freebsd.org/D15156	2018-05-15 17:20:58 +00:00
John Baldwin	0b3e6e4c50	Make the common interrupt entry point labels local labels. Kernel debuggers depend on symbol names to find stack frames with a trapframe rather than a normal stack frame. The labels used for the shared interrupt entry point for the PTI and non-PTI cases did not match the existing patterns confusing debuggers. Add the '.L' prefix to mark these symbols as local so they are not visible in the symbol table. Reviewed by: kib MFC after: 1 week Sponsored by: Chelsio Communications	2018-05-14 17:27:53 +00:00
Konstantin Belousov	8b4fc8b11c	Make fpusave() and fpurestore() on amd64 ifuncs. From now on, linking amd64 kernel requires either lld or newer ld.bfd. Reviewed by: jhb (as part of the large patch) Discussed with: emaste Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13838	2018-05-10 15:01:43 +00:00
Mateusz Guzik	20ca271fdd	amd64: depessimize bcmp for small buffers Adapt assembly generated by clang for memcmp and use it for <= 64 sized compares (which are the vast majority). Sample result of doing stats on Broadwell (% of samples): before: 4.0 kernel bcmp cache_lookup after : 0.7 kernel bcmp cache_lookup The routine is most definitely still not optimal. Anyone interested in spending time improving it is welcome to take over. Reviewed by: kib	2018-05-09 15:16:25 +00:00
Konstantin Belousov	55c9d75e6b	Avoid calls to bzero() before ireloc. Evaluate cpu_stdext_feature early so that the moved link_elf_ireloc() sees the correct flags, most importantly SMAP. Tested by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D15367	2018-05-09 14:39:24 +00:00
Konstantin Belousov	71d1bbce91	Remove PG_U from the rest of the kernel pmap ptes. Supposedly, the PG_U bits there were set to make it easier to make some kernel page accessible to userspace in-place. Since it was not used for the whole existence of the amd64 pmap.c and current design of the shared pages prefers double-mapping over the in-place access, remove PG_U both from the direct map and KVA slots. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-05-09 12:09:08 +00:00
Konstantin Belousov	5aaa5bc3d6	Remove PG_U from the recursive pte for kernel pmap's PML4 page. This PML4 page is never used for the userspace process, so there are no security implications. But the configuration trips SMAP check, which should be corrected. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-05-09 12:03:40 +00:00
Konstantin Belousov	053641bb1c	Prepare DB# handler for deferred trigger of watchpoints. Since pop %ss/mov %ss instructions defer all interrupts and exceptions for the next instruction, it is possible that the userspace watchpoint trap executes on the first instruction of the kernel entry for syscall/bpt. In this case, DB# should be treated similarly to NMI: on amd64 we must always load GSBASE even if the trap comes from kernel mode, and load the kernel page table root into %cr3. Moreover, the trap must use the dedicated stack, because we are still on the user stack when trapped on syscall entry. For i386, we must reload %cr3. The syscall instruction is not configured, so there is no issue with executing on user stack when trapping. Due to some CPU errata it is not always possible to detect that the userspace watchpoint triggered by inspecting %dr6. In trap(), compare the trap %rip with the known unsafe entry points and if matched pretend that the watchpoint did not fire at all. Thank you to the MSRC Incident Response Team, and in particular Greg Lenti and Nate Warfield, for coordinating the response to this issue across multiple vendors. Thanks to Computer Recycling at The Working Center of Kitchener for making hardware available to allow us to test the patch on additional CPU families. Reviewed by: jhb Discussed with: Matthew Dillon Tested by: emaste Sponsored by: The FreeBSD Foundation Security: CVE-2018-8897 Security: FreeBSD-SA-18:06.debugreg	2018-05-08 17:00:34 +00:00
Mateusz Guzik	a9456603f2	amd64: stop asserting params != NULL in the syscall path The parameter is effectively controllable by userspace. It does not matter what it is set to as it is being passed to copyin - worst case the operation will just fail. While here stop computing it unless it is going to be used. Noted by: dillon@backplane.com	2018-05-07 21:32:08 +00:00
Mateusz Guzik	bed34b0b04	amd64: fix up memset added in r333324 There was a missing trick expanding the passed pattern to a full word by multiplication. As a side effect non-zero patterns would be incorrectly laid down. This stems from the use of rep stosq which is word-sized, while the passed argument is byte-sized. I initially repurposed memcpy into memset without taking this into account. All but non-bzero testing was performed with a variant utilizing ERMS, i.e. using only stosb which happens to not run into the problem whatsoever. So my bad twice. Thanks to Oliver Pinter for noting the problem and providing a testcase.	2018-05-07 20:54:42 +00:00
Mateusz Guzik	f185a3dc33	amd64: tweak the memmove comment regarding authorship To make it clear the mentioned author did not write memmove.	2018-05-07 17:37:07 +00:00
Mateusz Guzik	6a909b9680	amd64: replace libkern's memset and memmove with assembly variants memmove is repurposed bcopy (arguments swapped, return value added) The libkern variant is a wrapper around bcopy, so this is a big improvement. memset is repurposed memcpy. The libkern variant is doing fishy stuff, including branching on 0 and calling bzero. Both functions are rather crude and subject to partial depessimization. This is a soft prerequisite to adding variants utilizing the 'Enhanced REP MOVSB/STOSB' bit and let the kernel patch at runtime.	2018-05-07 15:07:28 +00:00
Mateusz Guzik	ac7edb45e1	amd64: syscall path bcopy -> memcpy	2018-05-04 22:41:12 +00:00
Mateusz Guzik	f0648bcc04	amd64: get rid of the pessimized bcopy in syscall arg copy The code was unnecessarily conditionally copying either 5 or 6 args. It can blindly copy 6, which also means the size is known at compilation time and the operation can be depessimized. Note the entire syscall handling code is rather slow. Tested on Skylake, sample result for getppid (calls/s): without pti: 7310106 -> 10653569 with pti: 3304843 -> 4148306 Some syscalls (like read) did not note any difference, other have typically very modest wins.	2018-05-04 04:05:07 +00:00
Konstantin Belousov	7035cf14ee	Implement support for ifuncs in the kernel linker. Required MD bits are only provided for x86. Reviewed by: jhb (previous version, as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13838	2018-05-03 21:37:46 +00:00
Konstantin Belousov	9ea6332090	Style. Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D13838	2018-05-03 10:17:37 +00:00
Peter Grehan	adb947a67a	Use PCI power-mgmt to reset a device if FLR fails. A large number of devices don't support PCIe FLR, in particular graphics adapters. Use PCI power management to perform the reset if FLR fails or isn't available, by cycling the device through the D3 state. This has been tested by a number of users with Nvidia and AMD GPUs. Submitted and tested by: Matt Macy Reviewed by: jhb, imp, rgrimes MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15268	2018-05-02 17:41:00 +00:00
Mark Johnston	20f85b1ddd	Print the dump progress indicator after calling dump_start(). Dumpers may wish to print messages from an initialization hook; this change ensures that such messages aren't mixed with output from the generic dump code. MFC after: 1 week	2018-05-01 17:32:43 +00:00
Conrad Meyer	538184fa2f	amd64/mp_machdep.c: Fix GCC build after r333059 GCC warns about the potentially confusing use of the binary AND ('&') operator with a left operand containing an addition expression. (The confusion would be around the operator precedence between the + and & infix operators.) The warning is converted into an error with -Werror. No functional change. This construct was actually introduced in r328083, but r333059 (re)moved the closing parentheses. For reference, see http://en.cppreference.com/w/c/language/operator_precedence .	2018-04-28 17:55:28 +00:00
Tycho Nightingale	27275f8a52	Expand the checks for UCR3 == PMAP_NO_CR3 to enable processes to be excluded from PTI. Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15100	2018-04-27 12:44:20 +00:00
Sean Bruno	14ec0f3a3b	move smartpqi(4) controller out of NOTES and into sys/amd64/NOTES to appease LINT Submitted by: rpokala Reported by: npn	2018-04-26 22:43:25 +00:00
Sean Bruno	1e66f787c8	smartpqi(4): - Microsemi SCSI driver for PQI controllers. - Found on newer model HP servers. - Restrict to AMD64 only as per developer request. The driver provides support for the new generation of PQI controllers from Microsemi. This driver is the first SCSI driver to implement the PQI queuing model and it will replace the aacraid driver for Adaptec Series 9 controllers. HARDWARE Controllers supported by the driver include: HPE Gen10 Smart Array Controller Family OEM Controllers based on the Microsemi Chipset. Submitted by: deepak.ukey@microsemi.com Relnotes: yes Sponsored by: Microsemi Differential Revision: https://reviews.freebsd.org/D14514	2018-04-26 16:59:06 +00:00
Tycho Nightingale	19c5cea336	If a trap is encountered upon executing iretq from within doreti() the hardware will ensure the stack pointer is aligned to a 16-byte boundary before saving the fault state on the stack. In the PTI case, handle this potential alignment adjustment by copying both frames independently while unwinding the stack in between. Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15183	2018-04-25 14:21:13 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Konstantin Belousov	b7941dc91e	Correct undesirable interaction between caching of %cr4 in bhyve and invltlb_glob(). Reviewed by: grehan, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15138	2018-04-24 13:44:19 +00:00
John Baldwin	73c8686e91	Simplify the code to allocate stack for auxv, argv[], and environment vectors. Remove auxarg_size as it was only used once right after a confusing assignment in each of the variants of exec_copyout_strings(). Reviewed by: emaste MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D15123	2018-04-19 16:00:34 +00:00
Andriy Gapon	f3f6ecb450	set kdb_why to "trap" when calling kdb_trap from trap_fatal This will allow to hook a ddb script to "kdb.enter.trap" event. Previously there was no specific name for this event, so it could only be handled by either "kdb.enter.unknown" or "kdb.enter.default" hooks. Both are very unspecific. Having a specific event is useful because the fatal trap condition is very similar to panic but it has an additional property that the current stack frame is the frame where the trap occurred. So, both a register dump and a stack bottom dump have additional information that can help analyze the problem. I have added the event only on architectures that have trap_fatal() function defined. I haven't looked at other architectures. Their maintainers can add support for the event later. Sample script: kdb.enter.trap=bt; show reg; x/aS $rsp,20; x/agx $rsp,20 Reviewed by: kib, jhb, markj MFC after: 11 days Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D15093	2018-04-19 05:06:56 +00:00
Andriy Gapon	6d83b2e971	don't check for kdb reentry in trap_fatal(), it's impossible trap() checks for it earlier and calls kdb_reentry(). Discussed with: jhb MFC after: 12 days Sponsored by: Panzura	2018-04-18 15:44:54 +00:00
Brooks Davis	9c11d8d483	Remove the unused fuwintr() and suiwintr() functions. Half of implementations always failed (returned (-1)) and they were previously used in only one place. Reviewed by: kib, andrew Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15102	2018-04-17 18:04:28 +00:00
Konstantin Belousov	23084818ff	Set PG_G global mapping bit on the trampoline ptes. Trampoline mappings are better treated as global since they are valid in all address spaces, even for PTI. pmap_invalidate_range() must work on global mappings for pti since kernel_pmap invalidations are really same as for non-PTI. Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D15052	2018-04-14 17:33:16 +00:00
Tycho Nightingale	6ac73777ea	Add SDT probes to vmexit on Intel. Submitted by: domagoj.stolfa_gmail.com Reviewed by: grehan, tychon Sponsored by: DARPA/AFRL Differential Revision: https://reviews.freebsd.org/D14656	2018-04-13 17:23:05 +00:00
Konstantin Belousov	7c5d1690e9	Fix PSL_T inheritance on exec for x86. The miscellaneous x86 sysent->sv_setregs() implementations tried to migrate PSL_T from the previous program to the new executed one, but they evaluated regs->tf_eflags after the whole regs structure was bzeroed. Make this functional by saving PSL_T value before zeroing. Note that if the debugger is not attached, executing the first instruction in the new program with PSL_T set results in SIGTRAP, and since all intercepted signals are reset to default disposition on exec(2), this means that non-debugged process gets killed immediately if PSL_T is inherited. In particular, since suid images drop P_TRACED, attempt to set PSL_T for execution of such program would kill the process. Another issue with userspace PSL_T handling is that it is reset by trap(). It is reasonable to clear PSL_T when entering SIGTRAP handler, to allow the signal to be handled without recursion or delivery of blocked fault. But it is not reasonable to return back to the normal flow with PSL_T cleared. This is too late to change, I think. Discussed with: bde, Ali Mashtizadeh Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D14995	2018-04-12 20:43:39 +00:00
Konstantin Belousov	b7dbf1132e	Optimize context switch for PTI on PCID pmap. In pti-enabled pmap, the PCID allocation scheme assigns temporal id for the kernel page table, and user page table twin PCID is calculated by setting high bit in the kernel PCID. So the kernel AS is mapped with per-vmspace PCID, and we must completely shut down all mappings in KVA when switching contexts, so that newly switched thread would see all changes in KVA occurred while it was not executing. After all, KVA is same between all threads. Currently the pti context switch for the user part of the page table gets its TLB entries flushed too. It is excessive. The same PCID flushing algorithm that is used for non-pti pmap, correctly works for the UVA mappings. The only shared TLB entries are the pages from KVA accessed by the kernel entry trampoline. All of them are static except per-thread TSS and LDT. For TSS and LDT, the lifetime of newly allocated entries is the whole thread life, so it is fine as well. If not fine, then explicit shutdowns for current pmap of the newly allocated LDT and TSS pages would be enough. Also restore the constant value for the pm_pcid for the kernel_pmap. Before, for PTI pmap, pm_pcid was erroneously rolled same as user pmap's pm_pcid, but it was not used. Reviewed by: markj (previous version) Discussed with: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D14961	2018-04-12 19:59:36 +00:00
Ed Maste	c7fb0e1ddf	linuxulator: add else case braces to reduce diffs between archs Sponsored by: Turing Robotic Industries Inc.	2018-04-09 19:11:24 +00:00
Ed Maste	b267239d4b	linuxulator: deduplicate linux_exec_imgact_try Previously linuxulator had three identical copies of linux_exec_imgact_try. Deduplicate before adding another arch to linuxulator. Sponsored by: Turing Robotic Industries Inc Differential Revision: https://reviews.freebsd.org/D14856	2018-04-09 17:24:01 +00:00
Rodney W. Grimes	01d822d33b	Add the ability to control the CPU topology of created VMs from userland without the need to use sysctls, it allows the old sysctls to continue to function, but deprecates them at FreeBSD_version 1200060 (Relnotes for deprecate). The command line of bhyve is maintained in a backwards compatible way. The API of libvmmapi is maintained in a backwards compatible way. The sysctl's are maintained in a backwards compatible way. Added command option looks like: bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n] The optional parts can be specified in any order, but only a single integer invokes the backwards compatible parse. [,maxcpus=n] is hidden by #ifdef until kernel support is added, though the api is put in place. bhyvectl --get-cpu-topology option added. Reviewed by: grehan (maintainer, earlier version), Reviewed by: bcr (manpages) Approved by: bde (mentor), phk (mentor) Tested by: Oleg Ginzburg <olevole@olevole.ru> (cbsd) MFC after: 1 week Relnotes: Y Differential Revision: https://reviews.freebsd.org/D9930	2018-04-08 19:24:49 +00:00
Brooks Davis	1a449272a3	Fix LINT (and static COMPAT_LINUX32) after r332122.	2018-04-08 17:10:32 +00:00
Konstantin Belousov	e55d32b7b3	Handle Skylake-X errata SKZ63. SKZ63 Processor May Hang When Executing Code In an HLE Transaction Region Problem: Under certain conditions, if the processor acquires an HLE (Hardware Lock Elision) lock via the XACQUIRE instruction in the Host Physical Address range between 40000000H and 403FFFFFH, it may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS. Move the pages from the range into the blacklist. Add a tunable to not waste 4M if local DoS is not the issue. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15001	2018-04-07 17:06:13 +00:00
John Baldwin	fc276d92ae	Add a way to temporarily suspend and resume virtual CPUs. This is used as part of implementing run control in bhyve's debug server. The hypervisor now maintains a set of "debugged" CPUs. Attempting to run a debugged CPU will fail to execute any guest instructions and will instead report a VM_EXITCODE_DEBUG exit to the userland hypervisor. Virtual CPUs are placed into the debugged state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl). Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl). The debug server suspends virtual CPUs when it wishes them to stop executing in the guest (for example, when a debugger attaches to the server). The debug server can choose to resume only a subset of CPUs (for example, when single stepping) or it can choose to resume all CPUs. The debug server must explicitly mark a CPU as resumed via vm_resume_cpu() before the virtual CPU will successfully execute any guest instructions. Reviewed by: avg, grehan Tested on: Intel (jhb), AMD (avg) Differential Revision: https://reviews.freebsd.org/D14466	2018-04-06 22:03:43 +00:00
Brooks Davis	6469bdcdb6	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
Jonathan T. Looney	6a740d0bcf	Pat the watchdog less while producing a coredump. Prior to this change, we patted the watchdog approximately once per 4KB page of memory. After this change, we pat the watchdog approximately once per 128MB of memory. On a sample machine, this translated to patting the watchdog approximately every 5.4 seconds, which "seems reasonable". We can choose a different value in the future, if warranted. This has extensive field experience. It is a performance improvement, and has not caused any known problems. Reviewed by: imp, kib Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D14988	2018-04-06 17:06:22 +00:00
Roger Pau Monné	e0f92f5c77	x86: fix trampoline memory allocation after r332073 Add the missing breaks in the for loops, in order to exit the loop when a suitable entry is found. Also switch amd64 native_start_all_aps to use PHYS_TO_DMAP in order to find the virtual address of the boot_trampoline and the initial page tables. Reported and tested by: pho Sponsored by: Citrix Systems R&D	2018-04-06 16:22:14 +00:00
Roger Pau Monné	444c6d6f03	remove GiB/MiB macros from param.h And instead define them in the files where they are used. Requested by: bde	2018-04-06 11:20:06 +00:00
Roger Pau Monné	9dba82a442	x86: improve reservation of AP trampoline memory So that it doesn't rely on physmap[1] containing an address below 1MiB. Instead scan the full physmap and search for a suitable address to place the trampoline code (below 1MiB) and the initial memory pages (below 4GiB). Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14878	2018-04-05 14:39:51 +00:00
Konstantin Belousov	2d7e563c39	Fix ERESTART for lcall $7,$0 syscalls. The lcall trampoline enters kernel by int $0x80, which sets up invalid length of the instruction for %rip rewind. Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-05 11:03:21 +00:00
Konstantin Belousov	f407f5fb88	Make the INTO instruction operational in 32bit mode. Having the IDT entry specify ring 0 DPL caused delivery of #GP instead of #OF. The instruction is not valid in 64bit mode, which probably explains why the IDT entry for #OF was initially set this way. It is interesting to note that the BOUND instruction works with the IDT #BR entry DPL 0, most likely CPU considers #BR from BOUND as generated by a machine, not user. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-05 11:03:05 +00:00
Andriy Gapon	3da25bdb02	fix i386 build with CPU_ELAN (LINT for instance) after r331878 x86/cpu_machdep.c now needs to include elan_mmcr.h when CPU_ELAN is set. While here, also remove the now unneeded inclusion of isareg.h in i386 and amd64 vm_machdep.c. Reported by: lwhsu MFC after: 14 days X-MFC with: r331878	2018-04-03 17:16:06 +00:00
Andriy Gapon	8428d0f154	unify amd64 and i386 cpu_reset() in x86/cpu_machdep.c Because I didn't see any reason not to. I've been making some changes to the code and couldn't help but notice that the i386 and amd64 code was nearly identical. MFC after: 17 days	2018-04-02 13:45:23 +00:00
Andriy Gapon	ace498d81e	x86 cpu_reset: if failed to switch to BSP proceed to cpu_reset_real If cpu_reset() is called on an AP and if it somehow fails to wake the BSP, then it's better to attempt the reset on the AP than just sit there spinning on an unusable and undebuggable system. MFC after: 16 days	2018-04-02 08:06:18 +00:00
Andriy Gapon	5d29acd810	x86 cpu_reset_proxy: no need to stop_cpus() the original processor The processor is "parked" in a spin-loop already and that's sufficient for the reset. There is nothing that stop_cpus() would add here, only extra complexity and fragility. The original processor does not need to enable interrupts now, in fact, it must not do that. MFC after: 2 weeks	2018-04-02 07:45:13 +00:00
Kenneth D. Merry	ef270ab1b6	Bring in the Broadcom/Emulex Fibre Channel driver, ocs_fc(4). The ocs_fc(4) driver supports the following hardware: Emulex 16/8G FC GEN 5 HBAS LPe15004 FC Host Bus Adapters LPe160XX FC Host Bus Adapters Emulex 32/16G FC GEN 6 HBAS LPe3100X FC Host Bus Adapters LPe3200X FC Host Bus Adapters The driver supports target and initiator mode, and also supports FC-Tape. Note that the driver only currently works on little endian platforms. It is only included in the module build for amd64 and i386, and in GENERIC on amd64 only. Submitted by: Ram Kishore Vegesna <ram.vegesna@broadcom.com> Reviewed by: mav MFC after: 5 days Relnotes: yes Sponsored by: Broadcom Differential Revision: https://reviews.freebsd.org/D11423	2018-03-30 15:28:25 +00:00
Jeff Roberson	27a3c9d710	Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all platforms. Original commit message as follows: Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-28 18:47:35 +00:00
John Baldwin	dbb4ba297b	Fix kernel builds without options DDB after r331650. Reported by: cy	2018-03-28 16:24:56 +00:00
John Baldwin	d41e41f9f0	Remove very old and unused signal information codes. These have been supplanted by the MI signal information codes in <sys/signal.h> since 7.0. The FPE_*_TRAP ones were deprecated even earlier in 1999. PR: 226579 (exp-run) Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14637	2018-03-27 20:57:51 +00:00
Jeff Roberson	261c408744	Backout r331606 until I can identify why it does not boot on some machines.	2018-03-27 10:20:50 +00:00
Jeff Roberson	a48de40bcc	Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-27 03:37:04 +00:00
Konstantin Belousov	a37d4032ed	Improve the lcall $7,$0 syscall emulation on amd64. Current code, which copies the potential syscall arguments into the current frame, puts an arbitrary limit on the number of syscall arguments. Apparently, mmap(2) and lseek(2) (?) require a larger number. But the stack only needs to be mapped far enough to contain the number of arguments required by the syscall, so copying an arbitrarily large number of words from the stack is not completely safe. Use a different approach: convert the lcall frame into an int $0x80 frame in place, by doing the retl in kernel. This also allows us to stop handling the vfork case specially, and to stop making assumptions about %cs at the syscall time. Also, improve comments with the formulations provided by bde. Reviewed and tested by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 12:57:58 +00:00
Jonathan T. Looney	e24e568336	Make the TCP blackbox code committed in r331347 be an optional feature controlled by the TCP_BLACKBOX option. Enable this as part of amd64 GENERIC. For now, leave it disabled on other platforms. Sponsored by: Netflix, Inc.	2018-03-24 12:48:10 +00:00
Ed Maste	f8268d4d97	Remove redundant cast from Linuxulator SYSINITs	2018-03-23 20:32:54 +00:00
Ed Maste	ad448975e6	Fixup return style(9) in amd64 linux*_sysvec.c Sponsored by: Turing Robotic Industries Inc.	2018-03-23 17:28:04 +00:00
Ed Maste	c0aa0e2c27	Sort headers in MD Linuxulator files Bring #includes closer to style(9) and reduce differences between the (three) MD versions of linux_machdep.c and linux_sysvec.c. Sponsored by: Turing Robotic Industries Inc.	2018-03-23 17:16:36 +00:00
Konstantin Belousov	54f30ad961	Fixes for ptrace(PT_GETXSTATE_INFO) related to the padding in struct ptrace_xstate_info. struct ptrace_xstate_info has a 64bit member but ends with a 32bit one. As a result, on amd64 there is a 32bit padding at the end, but not on i386. We must clear the padding before doing the copyout. For the compat32 case, we must copyout the structure which does not have the padding at the end. The latter fixes 32bit gdb display of the YMM registers when running on an amd64 kernel. Reported by: Vlad Tsyrklevich Reviewed by: brooks (previous version) Sponsored by: The FreeBSD Foundation admbugs: 765 MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14794	2018-03-22 20:44:27 +00:00
Kyle Evans	ad456dd9fa	Re-work efidev ordering to fix efirt preloaded by loader on amd64 On amd64, efi_enter calls fpu_kern_enter(). This may not be called until fpuinitstate has been invoked, resulting in a kernel panic with efirt_load="YES" in loader.conf(5). Move fpuinitstate a little earlier in SI_SUB_DRIVERS so that we can squeeze efirt between it and efirtc at SI_SUB_DRIVERS, SI_ORDER_ANY. efidev must be after efirt and doesn't really need to be at SI_SUB_DEVFS, so drop it at SI_SUB_DRIVER, SI_ORDER_ANY. The not immediately obvious dependency of fpuinitstate by efirt has been noted in both places. Discussed with: kib, andrew Reported by: Jakob Alvermark <jakob@alvermark.net> X-MFC-With: r330868	2018-03-22 18:24:00 +00:00
Ed Maste	1ac2776bbb	Share Linux errno table with libsysdecode Requested by: jhb Reviewed by: jhb Sponsored by: Turing Robotic Industries Inc.	2018-03-22 12:58:49 +00:00
Konstantin Belousov	8fbcc3343f	Move the CR0.WP manipulation KPI to x86. This should allow to avoid some #ifdefs in the common x86/ code. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-20 20:20:49 +00:00
Ed Maste	b7d779b3e5	Make linuxulator fn declaration match definition I accidentally swapped 'linux_fixup_elf' to 'linux_elf_fixup' in amd64's declaration (only), while bringing this change over from git and encountering a conflict.	2018-03-20 19:28:52 +00:00
Ed Maste	fc2a8776a2	Rename assym.s to assym.inc assym is only to be included by other .s files, and should never actually be assembled by itself. Reviewed by: imp, bdrewery (earlier) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D14180	2018-03-20 17:58:51 +00:00
Konstantin Belousov	9cffc92c62	Disable write protection around patching of XSAVE instruction in the context switch code. Some BIOSes give control to the OS with CR0.WP already set, making the kernel text read-only before cpu_startup(). Reported by: Peter Lei <peter.lei@ieee.org> Reviewed by: jtl Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14768	2018-03-20 17:47:29 +00:00
Konstantin Belousov	2337dc6430	Provide KPI for handling of rw/ro kernel text. This is a pure syntax patch to create an interface to enable and later restore write access to the kernel text and other read-only mapped regions. It is in line with e.g. vm_fault_disable_pagefaults() by allowing the nesting. Discussed with: Peter Lei <peter.lei@ieee.org> Reviewed by: jtl Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14768	2018-03-20 17:43:50 +00:00
Ed Maste	dc85846736	Rename linuxulator functions with linux_ prefix It's preferable to have a consistent prefix. This also reduces differences between the three linux*_sysvec.c files. Sponsored by: Turing Robotic Industries Inc.	2018-03-19 21:26:32 +00:00
Ed Maste	9bec2ea66e	linux*_sysvec.c: rationalize whitespace and comments There's a fair amount of duplication between MD linuxulator files. Make indentation and comments consistent between the three versions of linux_sysvec.c to reduce diffs when comparing them. Sponsored by: Turing Robotic Industries Inc.	2018-03-19 15:11:10 +00:00
Ed Maste	6e481f83f7	Share a single bsd-linux errno table across MD consumers Three copies of the linuxulator linux_sysvec.c contained identical BSD to Linux errno translation tables, and future work to support other architectures will also use the same table. Move the table to a common file to be used by all. Make it 'const int' to place it in .rodata. (Some existing Linux architectures use MD errno values, but x86 and Arm share the generic set.) This change should introduce no functional change; a followup will add missing errno values. MFC after: 3 weeks Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14665	2018-03-16 14:46:38 +00:00
Ed Maste	7b194b3d3b	Remove stray ; at end of linux_vdso_deinstall()	2018-03-14 13:20:36 +00:00
Kyle Evans	63ee68c220	EFIRT: SetVirtualAddressMap with 1:1 mapping after exiting boot services This fixes a problem encountered on the Lenovo Thinkpad X220/Yoga 11e where runtime services would try to inexplicably jump to other parts of memory where it shouldn't be when attempting to enumerate EFI vars, causing a panic. The virtual mapping is enabled by default and can be disabled by setting efi_disable_vmap in loader.conf(5). Reviewed by: kib (earlier version) MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D14677	2018-03-13 17:10:52 +00:00
Ed Maste	a95659f75f	Use C99 boolean type for translate_osrel Migrate to modern types before creating MD Linuxolator bits for new architectures. Reviewed by: cem Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14676	2018-03-13 16:40:29 +00:00
Ed Maste	4ba257591b	Apply some style(9) to Linuxulator linux_sysvec.c comments	2018-03-13 00:40:05 +00:00
Ian Lepore	c7053bbe54	Revert r330780, it was improperly tested and results in taking a spin mutex before acquiring sleep mutexes. Reported by: kib@	2018-03-11 20:13:15 +00:00
Ian Lepore	86051be993	Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking.	2018-03-11 19:22:58 +00:00
Tycho Nightingale	490768e24a	Fix a lock recursion introduced in r327065. Reported by: kmacy Reviewed by: grehan, jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14548	2018-03-07 18:03:22 +00:00
Jonathan T. Looney	beb2406556	amd64: Protect the kernel text, data, and BSS by setting the RW/NX bits correctly for the data contained on each memory page. There are several components to this change: * Add a variable to indicate the start of the R/W portion of the initial memory. * Stop detecting NX bit support for each AP. Instead, use the value from the BSP and, if supported, activate the feature on the other APs just before loading the correct page table. (Functionally, we already assume that the BSP and all APs had the same support or lack of support for the NX bit.) * Set the RW and NX bits correctly for the kernel text, data, and BSS (subject to some caveats below). * Ensure DDB can write to memory when necessary (such as to set a breakpoint). * Ensure GDB can write to memory when necessary (such as to set a breakpoint). For this purpose, add new MD functions gdb_begin_write() and gdb_end_write() which the GDB support code can call before and after writing to memory. This change is not comprehensive: * It doesn't do anything to protect modules. * It doesn't do anything for kernel memory allocated after the kernel starts running. * In order to avoid excessive memory inefficiency, it may let multiple types of data share a 2M page, and assigns the most permissions needed for data on that page. Reviewed by: jhb, kib Discussed with: emaste MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14282	2018-03-06 14:28:37 +00:00
Jonathan T. Looney	99e159dcf6	We shouldn't need to execute code in the recursive page table mappings; therefore, it should be safe to set the NX bit on the PML4E for the recursive page table mappings. According to the Intel docs, the effect of the NX bit should propagate to any page reached through a PML4E which has the NX bit set. Reviewed by: kib, markj MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14333	2018-03-05 15:12:35 +00:00
Jonathan T. Looney	66ce8430aa	Prior to r329071, pmap_bootstrap() used pmap_kmem_choose() to round the first available virtual address to a 2MB boundary. After r329071, create_pagetables() rounds firstaddr up to a 2MB boundary. This ensures the kernel is mapped in super-pages, which is the point of the logic in pmap_kmem_choose(). Therefore, it is no longer necessary for pmap_bootstrap() to round up to the 2MB boundary again. As pmap_bootstrap() was the only user of pmap_kmem_choose(), we can delete pmap_kmem_choose(). Reviewed by: kib MFC after: 2 weeks X-MFC-with: r329071 Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14355	2018-03-05 15:10:17 +00:00
Anish Gupta	9363000dfe	Move the new AMD-Vi IVHD [ACPI_IVRS_HARDWARE_NEW] definitions added in r329360 in contrib ACPI to local files till ACPI code adds new definitions reported by jkim. Rename ACPI_IVRS_HARDWARE_NEW to ACPI_IVRS_HARDWARE_EFRSUP, since new definitions add Extended Feature Register support. Use IvrsType to distinguish three types of IVHD - 0x10 (legacy), 0x11 and 0x40 (with EFR). IVHD 0x40 is also called mixed type since it supports HID device entries. Fix 2 coverity bugs reported by cem. Reported by: jkim, cem Approved by: grehan Differential Revision: https://reviews.freebsd.org/D14501	2018-03-05 02:28:25 +00:00
Konstantin Belousov	8c8ee2ee1c	Unify bulk free operations in several pmaps. Submitted by: Yoshihiro Ota Reviewed by: markj MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13485	2018-03-04 20:53:20 +00:00
Andriy Gapon	ae0eceab56	db_nextframe/amd64: catch up with r328083 to recognize fast_syscall_common Since that change the system call stack traces look like this: ... sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe0028e13ac0 amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe0028e13bf0 fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0028e13bf0 So, db_nextframe() stopped recognizing the system call frame. This commit should fix that. Reviewed by: kib MFC after: 4 days	2018-03-03 15:10:37 +00:00
Ravi Pokala	24f93aa05f	imcsmb(4): Intel integrated Memory Controller (iMC) SMBus controller driver imcsmb(4) provides smbus(4) support for the SMBus controller functionality in the integrated Memory Controllers (iMCs) embedded in Intel Sandybridge- Xeon, Ivybridge-Xeon, Haswell-Xeon, and Broadwell-Xeon CPUs. Each CPU implements one or more iMCs, depending on the number of cores; each iMC implements two SMBus controllers (iMC-SMBs). * IMPORTANT NOTE * Because motherboard firmware or the BMC might try to use the iMC-SMBs for monitoring DIMM temperatures and/or managing an NVDIMM, the driver might need to temporarily disable those functions, or take a hardware interlock, before using the iMC-SMBs. Details on how to do this may vary from board to board, and the procedure may be proprietary. It is strongly suggested that anyone wishing to use this driver contact their motherboard vendor, and modify the driver as described in the manual page and in the driver itself. (For what it's worth, the driver as-is has been tested on various SuperMicro motherboards.) Reviewed by: avg, jhb MFC after: 1 week Relnotes: yes Sponsored by: Panasas Differential Revision: https://reviews.freebsd.org/D14447 Discussed with: avg, ian, jhb Tested by: allanjude (previous version), Panasas	2018-03-03 01:53:51 +00:00
Ed Maste	023b850b62	Rationalize license text on Linuxolator files Many licenses on Linuxolator files contained small variations from the standard FreeBSD license text. To avoid license proliferation switch to the standard 2-clause FreeBSD license for those files where I have permission from each of the listed copyright holders. Additional files still waiting on permission from others are listed in review D14210. Approved by: dchagin, rdivacky, sos MFC after: 1 week MFC with: r329370 Sponsored by: The FreeBSD Foundation	2018-03-01 13:52:18 +00:00
John Baldwin	5f8754c077	Add a new variant of the GLA2GPA ioctl for use by the debug server. Unlike the existing GLA2GPA ioctl, GLA2GPA_NOFAULT does not modify the guest. In particular, it does not inject any faults or modify PTEs in the guest when performing an address space translation. This is used by bhyve's debug server to read and write memory for the remote debugger. Reviewed by: grehan MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D14075	2018-02-26 19:19:05 +00:00
Patrick Kelsey	18a7530938	Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option. The conditional compilation support is now centralized in tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum theoretical code/data footprint when TCP_RFC7413 is disabled, but nearly all the TFO code should wind up being removed by the optimizer, the additional footprint in the syncache entries is a single pointer, and the additional overhead in the tcpcb is at the end of the structure. This enables the TCP_RFC7413 kernel option by default in amd64 and arm64 GENERIC. Reviewed by: hiren MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14048	2018-02-26 03:03:41 +00:00
Jung-uk Kim	0ef8c0cb57	Partially revert r197863 to reduce diff against i386. When I wrote the patch, I wanted to remove SYSINIT() usage from amd64 code. There is no reason to keep the divergence any more because iwasaki merged most amd64 suspend/resume code to i386 with r235622. Note this also fixed an edge case reported by royger. [1] Suggested by: jhb Reviewed by: royger Tested by: royger [1] MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14400 [1]	2018-02-24 01:24:57 +00:00
Conrad Meyer	849ce31a82	Remove unused error return from API that cannot fail No implementation of fpu_kern_enter() can fail, and it was causing needless error checking boilerplate and confusion. Change the return code to void to match reality. (This trivial change took nine days to land because of the commit hook on sys/dev/random. Please consider removing the hook or otherwise lowering the bar -- secteam never seems to have free time to review patches.) Reported by: Lachlan McIlroy <Lachlan.McIlroy AT isilon.com> Reviewed by: delphij Approved by: secteam (delphij) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14380	2018-02-23 20:15:19 +00:00
Ed Maste	716cfaab96	Use linux types for linux-specific syscalls Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14065	2018-02-23 19:09:27 +00:00
Ed Maste	315fbaeca2	Correct pseudo misspelling in sys/ comments contrib code and #define in intel_ata.h unchanged.	2018-02-23 18:15:50 +00:00
Ed Maste	a0409b6f36	Remove accidental vim droppings Reported by: cy	2018-02-22 03:37:01 +00:00
Ed Maste	eae594f7d5	Correct proper nouns in the Linuxulator - Capitalize Linux - Spell FreeBSD out in full - Address some style(9) on changed lines Sponsored by: Turing Robotic Industries Inc.	2018-02-22 02:24:17 +00:00
John Baldwin	4f8666989a	Add two new ioctls to bhyve for batch register fetch/store operations. These are a convenience for bhyve's debug server to use a single ioctl for 'g' and 'G' rather than a loop of individual get/set ioctl requests. Reviewed by: grehan MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D14074	2018-02-22 00:39:25 +00:00
Konstantin Belousov	2c0f13aa59	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread's policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() are supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct function calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Sketched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384	2018-02-20 10:13:13 +00:00
Ed Maste	0ba1b36553	Rationalize license text on Linuxolator files Many licenses on Linuxolator files contained small variations from the standard FreeBSD license text. To avoid license proliferation switch to the standard 2-clause FreeBSD license for those files where I have permission from each of the listed copyright holders. Additional files waiting on permission from others are listed in review D14210. Approved by: kan, marcel, sos, rdivacky MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-02-16 15:00:14 +00:00
Konstantin Belousov	13cad9af82	Use local symbol for offset. Small global symbols confuse ddb which matches them against small unrelated displacements and makes the disassembly ugly. Reported by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-16 13:32:46 +00:00
Andriy Gapon	7b394c1066	move vintr_intercept_enabled under INVARIANTS The function is not used outside of INVARIANTS since r328622. MFC after: 1 week	2018-02-16 07:02:14 +00:00
Anish Gupta	0b37d3d90e	This change fixes duplicate detection of the same IOMMU/AMD-Vi device for Ryzen with EFR support. IVRS can have entries of type legacy and non-legacy present at the same time for the same AMD-Vi device. The ivhd driver will ignore the legacy entry if the new IVHD type is present, as specified in the AMD-Vi specification. Earlier, both IVHD entries were used and two ivhd devices were created. Add support for new IVHD types 0x11 and 0x40 in ACPI. Create a new struct of type acpi_ivrs_hardware_new for these new types of IVHDs. Legacy type 0x10 will continue to use acpi_ivrs_hardware. Reviewed by: avg Approved by: grehan Differential Revision: https://reviews.freebsd.org/D13160	2018-02-16 05:17:00 +00:00
Jung-uk Kim	ea4fe1da62	Change size of padding to reflect reality. No functional change. Discussed with: kib	2018-02-15 20:42:38 +00:00
Conrad Meyer	5bd0149714	x86 pmap: Make memory mapped via pmap_qenter() non-executable The idea is, the pmap_qenter() API is now defined to not produce executable mappings. If you need executable mappings, use another API. Add pg_nx flag in pmap_qenter on x86 to make kernel pages non-executable. Other architectures that support execute-specific permissions on page table entries should subsequently be updated to match. Submitted by: Darrick Lew <darrick.freebsd AT gmail.com> Reviewed by: markj Discussed with: alc, jhb, kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14062	2018-02-14 23:35:47 +00:00
Ed Maste	83ab0f9f33	amd64/pmap: Move Foundation copyright to the 2-clause section Sponsored by: The FreeBSD Foundation	2018-02-13 19:19:26 +00:00
Hans Petter Selasky	33ec1ccbae	Import the mthca kernel side infiniband driver from Linux 4.9 and fix compilation under FreeBSD. The mthca driver was temporarily removed as part of the Linux 4.9 RoCE/infiniband upgrade. Top commit in Linux source tree: 69973b830859bc6529a7a0468ba0d80ee5117826 Sponsored by: Mellanox Technologies	2018-02-13 17:04:34 +00:00
Jeff Roberson	e958ad4cf3	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
Jonathan T. Looney	48fca66157	Mark the pages used for the initial page-table entries as wired. This makes them consistent with the way other page-table pages are allocated. It also provides the rest of the VM system a good clue that these pages are used. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14269	2018-02-12 17:27:50 +00:00
Warner Losh	982e7bdafc	We don't support gcc < 4.2.1, so varargs.h now is just #error always. Unifdef for versions prior to 4.2.1 and remove now-unused header files. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14323	2018-02-12 14:48:14 +00:00
Tycho Nightingale	58a6aaf7ec	Provide further mitigation against CVE-2017-5715 by flushing the return stack buffer (RSB) upon returning from the guest. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kvm?id=117cc7a908c83697b0b737d15ae1eb5943afe35b Reviewed by: grehan Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14272	2018-02-12 14:45:27 +00:00
Jonathan T. Looney	31ba4c7b5b	On bootup, the amd64 pmap initialization code creates page-table mappings for the pages used for the kernel and some initial allocations used for the page table. It maps the kernel and the blocks used for these initial allocations using 2MB pages. However, if the kernel does not end on a 2MB boundary, it still maps the last portion using a 2MB page, but reports that the unused 4K blocks within this 2MB allocation are free physical blocks. This means that these same physical blocks could also be mapped elsewhere - for example, into a user process. Given the proximity to the kernel text and data area, it seems wise to avoid allowing someone to write data to physical blocks also mapped into these virtual addresses. (Note that this isn't a security vulnerability: the direct map makes most/all memory on the system mapped into kernel space. And, nothing in the kernel should be trying to access these pages, as the virtual addresses are unused. It simply seems wise to avoid reusing these physical blocks while they are mapped to virtual addresses so close to the kernel text and data area.) Consequently, let's reserve the physical blocks covered by the page-table mappings for these initial allocations. Reviewed by: kib, markj MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14268	2018-02-09 17:46:33 +00:00
Mark Johnston	ab7c09f121	Use vm_page_unwire_noq() instead of directly modifying page wire counts. No functional change intended. Reviewed by: alc, kib (previous revision) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D14266	2018-02-08 19:28:51 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Ed Maste	8a3b44cfc2	Additional linuxolator whitespace cleanup, missed in r328890	2018-02-05 18:39:06 +00:00
Ed Maste	132f90c660	Linuxolator whitespace cleanup A version of each of the MD files by necessity exists for each CPU architecture supported by the Linuxolator. Clean these up so that new architectures do not inherit whitespace issues. Clean up shared Linuxolator files while here. Sponsored by: Turing Robotic Industries Inc.	2018-02-05 17:29:12 +00:00
Konstantin Belousov	f7f14d9dea	When switching IBRS on, also enable STIBP (Single Thread Indirect Branch Predictors) mitigation. Document 336996-001 promises that CPUs which implement IBRS but not STIBP silently ignore setting of the bit instead of trapping. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 16:56:02 +00:00
Konstantin Belousov	319117fd57	IBRS support, AKA Spectre hardware mitigation. It is coded according to the Intel document 336996-001, reading of the patches posted on lkml, and some additional consultations with Intel. For existing processors, you need a microcode update which adds IBRS CPU features, and to manually enable it by setting the tunable/sysctl hw.ibrs_disable to 0. Current status can be checked in sysctl hw.ibrs_active. The mitigation might be inactive if the CPU feature is not patched in, or if CPU reports that IBRS use is not required, by IA32_ARCH_CAP_IBRS_ALL bit. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14029	2018-01-31 14:36:27 +00:00
Andriy Gapon	6a8b7aa424	vmm/svm: post LAPIC interrupts using event injection, not virtual interrupts The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR fields of VMCB to inject a virtual interrupt into a guest VM. This method has many advantages over the direct event injection as it offloads all decisions of whether and when the interrupt can be delivered to the guest. But with a purely software emulated vAPIC the advantage is also a problem. The problem is that the hypervisor does not have any precise control over when the interrupt is actually delivered to the guest (or a notification about that). Because of that the hypervisor cannot update the interrupt vector in IRR and ISR in the same way as real hardware would. The hypervisor becomes aware that the interrupt is being serviced only upon the first VMEXIT after the interrupt is delivered. This creates a window between the actual interrupt delivery and the update of IRR and ISR. That means that IRR and ISR might not be correctly set up to the point of the end-of-interrupt signal. The described deviation has been observed to cause an interrupt loss in the following scenario. vCPU0 posts an inter-processor interrupt to vCPU1. The interrupt is injected as a virtual interrupt by the hypervisor. The interrupt is delivered to a guest and an interrupt handler is invoked. The handler performs a requested action and acknowledges the request by modifying a global variable. So far, there is no VMEXIT and the hypervisor is unaware of the events. Then, vCPU0 notices the acknowledgment and sends another IPI with the same vector. The IPI gets collapsed into the previous IPI in the IRR of vCPU1. Only after that a VMEXIT of vCPU1 occurs. At that time the vector is cleared in the IRR and is set in the ISR. vCPU1 has vAPIC state as if the second IPI has never been sent. The scenario is impossible on the real hardware because IRR and ISR are updated just before the interrupt handler gets started. 
I saw several possibilities of fixing the problem. One is to intercept the virtual interrupt delivery to update IRR and ISR at the right moment. The other is to deliver the LAPIC interrupts using the event injection, same as legacy interrupts. I opted to use the latter approach for several reasons. It's equivalent to what VMM/Intel does (in !VMX case). It appears to be what VirtualBox and KVM do. The code is already there (to support legacy interrupts). Another possibility was to use a special intermediate state for a vector after it is injected using a virtual interrupt and before it is known whether it was accepted or is still pending. That approach was implemented in https://reviews.freebsd.org/D13828 That method is more complex and does not have any clear advantage. Please see sections 15.20 and 15.21.4 of "AMD64 Architecture Programmer's Manual Volume 2: System Programming" (publication 24593, revision 3.29) for comparison between event injection and virtual interrupt injection. PR: 215972 Reported by: ajschot@hotmail.com, grehan Tested by: anish, grehan, Nils Beyer <nbe@renzel.net> Reviewed by: anish, grehan MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13780	2018-01-31 11:14:26 +00:00
John Baldwin	05d56d83b6	Ensure 'name' is not NULL before passing to strcmp(). This avoids a nested page fault when obtaining a stack trace in DDB if the address from the first frame does not resolve to a known symbol. MFC after: 1 week Sponsored by: Chelsio Communications	2018-01-30 23:29:27 +00:00
Bryan Drewery	595109196a	Don't use an .OBJDIR for 'make sysent'. Reported by: emaste, jhb Sponsored by: Dell EMC	2018-01-29 19:14:15 +00:00
Warner Losh	d6b6639713	Add ISA PNP tables to ISA drivers. Fix a few incidental comments. ACPI ISA PNP tables not tagged, there are bigger issues with them.	2018-01-29 00:22:30 +00:00
Konstantin Belousov	c8f9c1f3d9	Use PCID to optimize PTI. Use PCID to avoid complete TLB shootdown when switching between user and kernel mode with PTI enabled. I use the model close to what I read about KAISER, user-mode PCID has 1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID. Full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. User-mode part of TLB is flushed on the pmap activations as well. Similarly, IPI TLB shootdowns must handle both kernel and user address spaces for each address. Note that machines which implement PCID but do not have INVPCID instructions cause the usual complications in the IPI handlers, due to the need to switch to the target PCID temporarily. This is racy, but because for PCID/no-INVPCID we disable the interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers. On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set and we do not clear TLB. I can imagine alternative use of PCID, where there is only one PCID allocated for the kernel pmap. Then, there is no need to shootdown kernel TLB entries on context switch. But copyout(3) would need to either use method similar to proc_rwmem() to access the userspace data, or (in reverse) provide a temporary mapping for the kernel buffer into user mode PCID and use trampoline for copy. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc (some aspects) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D13985	2018-01-27 11:49:37 +00:00
Edward Tomasz Napierala	28f3d8b2c2	Add SPDX identifiers to linux_ptrace.c and cfumass.c. MFC after: 2 weeks	2018-01-24 17:04:01 +00:00
Ed Maste	7eb2159f6a	Use BSD-2-Clause-FreeBSD license on linux_support.s These files previously had a 3-clause license and 'THE REGENTS' text. Switch to standard 2-clause text with kib's approval, and add the SPDX tag. Approved by: kib	2018-01-23 20:35:43 +00:00
Pedro F. Giffuni	ac2fffa4b7	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
Konstantin Belousov	c398c14664	Use correct symbol name in r328202. Sponsored by: The FreeBSD Foundation MFC after: 11 days	2018-01-20 18:05:14 +00:00
Konstantin Belousov	3a5e472e17	Use predefined symbol for the CR3.PCID mask. Sponsored by: The FreeBSD Foundation MFC after: 11 days	2018-01-20 17:46:09 +00:00
Roger Pau Monné	50a53194f6	xen: fix IDT setup after PTI On amd64 the IDT handler was not set correctly when using PTI. While there also fix the selectors to SEL_KPL. Obtained from: kib MFC with: r328083	2018-01-20 14:59:37 +00:00
Konstantin Belousov	b4dfc9d7ad	PTI: Trap if we returned to userspace with kernel (full) page table still active. Map userspace portion of VA in the PTI kernel-mode page table as non-executable. This way, if we ever miss reloading ucr3 into %cr3 on the return to usermode, the process traps instead of executing in a potentially vulnerable setup. Catch the condition of such trap and verify user-mode %cr3, which is saved by page fault handler. I picked up this trick from an article about the Linux implementation. Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 12 days Differential revision: https://reviews.freebsd.org/D13956	2018-01-19 22:10:29 +00:00
Nathan Whitehorn	9a8196ce19	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
Ed Maste	b3327f62f0	Enable KPTI by default on amd64 for non-AMD CPUs Kernel Page Table Isolation (KPTI) was introduced in r328083 as a mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected, per https://www.amd.com/en/corporate/speculative-execution: We believe AMD processors are not susceptible due to our use of privilege level protections within paging architecture and no mitigation is required. Thus default KPTI to off for AMD CPUs, and to on for others. This may be refined later as we obtain more specific information on the sets of CPUs that are and are not affected. Submitted by: Mitchell Horne Reviewed by: cem Relnotes: Yes Security: CVE-2017-5754 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D13971	2018-01-19 15:42:34 +00:00
John Baldwin	68fd3b0ef5	Use a dedicated per-CPU stack for machine check exceptions. Similar to NMIs, machine check exceptions can fire at any time and are not masked by IF. This means that machine checks can fire when the kstack is too deep to hold a trap frame, or at critical sections in trap handlers when a user %gs is used with a kernel %cs. Use the same strategy used for NMIs of using a dedicated per-CPU stack configured in IST 3. Store the CPU's pcpu pointer at the top of the stack so that the machine check handler can reliably find the proper value for %gs (also borrowed from NMIs). This should also fix a similar issue with PTI with a MC# occurring while the CPU is executing on the trampoline stack. While here, bypass trap() entirely and just call mca_intr(). This avoids a bogus call to kdb_reenter() (there's no reason to try to reenter kdb if a MC# is raised). Reviewed by: kib Tested by: avg (on AMD without PTI) Differential Revision: https://reviews.freebsd.org/D13962	2018-01-18 23:50:21 +00:00
John Baldwin	f36b1fe0bd	Remove two no-longer-used labels from the NMI interrupt handler. Reviewed by: kib	2018-01-18 22:13:53 +00:00
John Baldwin	7f513d17b2	Adjust branch target in NMI handler for the !PTI case. In the !PTI case the NMI handler jumped past the instructions that set %rdi to point to the current PCB, but the target instructions assumed %rdi was set. Reviewed by: kib Tested by: pho	2018-01-18 20:12:12 +00:00
Konstantin Belousov	3705dda7e4	Move the kernphys declaration to machine/md_var.h. Apparently machine/cpu.h is supposed to contain MD implementations of MI interfaces. Also, remove kernphys declaration from machdep.c, since it is already provided by md_var.h. Requested and reviewed by: bde MFC after: 13 days	2018-01-18 15:15:35 +00:00
Konstantin Belousov	ac97ccbab5	Fix compilation with gcc. etext is already declared in machine/cpu.h, move kernphys declaration there too. Based on the patch by: bde MFC after: 13 days	2018-01-18 11:21:03 +00:00
Konstantin Belousov	406bc0da95	Fix compilation with gas. Submitted by: bde MFC after: 13 days	2018-01-18 11:19:58 +00:00
Konstantin Belousov	0d6c61ec30	Remove the 'last' argument from the pmap_pti_free_page(). It is in fact unused. Noted and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 13 days	2018-01-18 11:01:41 +00:00
John Baldwin	65eefbe422	Save and restore guest debug registers. Currently most of the debug registers are not saved and restored during VM transitions allowing guest and host debug register values to leak into the opposite context. One result is that hardware watchpoints do not work reliably within a guest under VT-x. Due to differences in SVM and VT-x, slightly different approaches are used. For VT-x: - Enable debug register save/restore for VM entry/exit in the VMCS for DR7 and MSR_DEBUGCTL. - Explicitly save DR0-3,6 of the guest. - Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from %rflags for the host. Note that because DR6 is "software" managed and not stored in the VMCS a kernel debugger which single steps through VM entry could corrupt the guest DR6 (since a single step trap taken after loading the guest DR6 could alter the DR6 register). To avoid this, explicitly disable single-stepping via the trace flag before loading the guest DR6. A determined debugger could still defeat this by setting a breakpoint after the guest DR6 was loaded and then single-stepping. For SVM: - Enable debug register caching in the VMCB for DR6/DR7. - Explicitly save DR0-3 of the guest. - Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM saves the guest DR6 in the VMCB, the race with single-stepping described for VT-x does not exist. For both platforms, expose all of the guest DRx values via --get-drX and --set-drX flags to bhyvectl. Discussed with: avg, grehan Tested by: avg (SVM), myself (VT-x) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D13229	2018-01-17 23:11:25 +00:00
Mark Johnston	9cb26f73ea	Annotate a couple of changes from r328083. Reviewed by: kib X-MFC with: r328083	2018-01-17 21:52:12 +00:00
Konstantin Belousov	bd50262f70	PTI for amd64. The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability. PTI is turned off by default for now, enable with the loader tunable vm.pmap.pti=1. The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped: kernel text (but not modules text); PCPU; GDT/IDT/user LDT/task structures; IST stacks for NMI and doublefault handlers. Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables, then the hardware trap frame is copied to the normal kstack, and execution continues. IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed. On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occurred on the kstack, see the comment in doret_iret detection code in trap(). Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work.
Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefit besides complexity. PDPE pages for the kernel half of the user page tables are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386. The need to flush hidden kernel translations on the switch to user mode makes global tables (PG_G) meaningless and even harmful, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit. MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline has not switched from PTI stack is fatal. The fix is pending. Reviewed by: markj (partially) Tested by: pho (previous version) Discussed with: jeff, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-17 11:44:21 +00:00
Konstantin Belousov	94b011c4bc	Amd64 user_ldt_deref() is not used outside sys_machdep.c. Mark it as static. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-17 11:21:03 +00:00
Pedro F. Giffuni	74641f0bc6	x86: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these are likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the surrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:08:22 +00:00
Tycho Nightingale	91fe5fe7e7	Provide some mitigation against CVE-2017-5715 by clearing registers upon returning from the guest which aren't immediately clobbered by the host. This eradicates any remaining guest contents limiting their usefulness in an exploit gadget. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b6c02f38315b720c593c6079364855d276886aa Reviewed by: grehan, rgrimes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13573	2018-01-15 18:37:03 +00:00
Konstantin Belousov	5f7b9ff2e3	Add STAC and CLAC instructions wrappers. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13838	2018-01-14 12:39:50 +00:00
Jeff Roberson	b6715dab8f	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Konstantin Belousov	c751f90c0c	Fix grammar. Submitted by: alc MFC after: 3 days	2018-01-11 16:50:03 +00:00
Konstantin Belousov	6da5c56ae5	Remove redundant CLD instructions. We already clear %RFLAGS.DF on the kernel entry due to the compiler's ABI requirements. Suggested by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-11 13:22:13 +00:00
Konstantin Belousov	4975c202ac	Do not clear %RFLAGS.DF on fast syscall entry. Hardware already did it for us due to the mask loaded into the MSR_SF_MASK msr register. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13838	2018-01-11 12:54:33 +00:00
Konstantin Belousov	0f7c159f6b	Move the hardware setup for fast syscalls into a common function. Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-11 12:40:43 +00:00
Konstantin Belousov	4275e16fa9	Rename COMMON_TSS_RSP0 to TSS_RSP0. The symbol is just an offset in the hardware TSS structure, it is not limited to the common_tss instance. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-11 12:28:08 +00:00
Konstantin Belousov	3ee6e65875	Update comment explaining the check, to reality. Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-11 12:07:24 +00:00
Conrad Meyer	e6fcf7898d	x86: Document purpose of _safe variants of {rd,wr}msr() Sponsored by: Dell EMC Isilon	2018-01-10 22:41:00 +00:00
Andriy Gapon	091da2dfa5	vmm/svm: contigmalloc of the whole svm_softc is excessive This is a followup to r307903. struct svm_softc takes more than 200 kilobytes while what we really need is 3 contiguous pages for I/O permission map and 2 contiguous pages for MSR permission map. Other physically mapped structures have a size of a single page, so a proper alignment is sufficient for their correct mapping. Thus, only the permission maps are allocated with contigmalloc now, the softc is allocated with a regular malloc. Additionally, this commit adds a check that malloc returns memory with the expected page alignment and that contigmalloc does not fail. Unfortunately, at present svm_vminit() is expected to always succeed and there is no way to report an error. So, a contigmalloc failure leads to a panic. We should probably fix this. MFC after: 2 weeks	2018-01-09 14:22:18 +00:00
Konstantin Belousov	0530a9360f	Make it possible to re-evaluate cpu_features. Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of cpu_features, cpu_features2, cpu_stdext_features, and std_stdext_features2. The intent is to allow the kernel to see the changes in the CPU features after microcode update. Of course, the update is not atomic across variables and not synchronized with readers. See the man page warning as well. Reviewed by: imp (previous version), jilles Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13770	2018-01-05 21:06:19 +00:00
Andriy Gapon	5f3c7d6580	Fix a couple of comments in AMD Virtual Machine Control Block structure MFC after: 1 week	2018-01-05 19:15:24 +00:00
Konstantin Belousov	84874cc151	Avoid re-check of usermode condition. It does not change anything in the behavior of trap_pfault(), while eliminating obfuscation of jumping to the code which checks for the condition reversed of the goto cause. Also avoid force initialize the rv variable, since it is now only accessed after storing vm_fault() return value. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13725	2018-01-01 20:47:03 +00:00
Konstantin Belousov	1865d6b851	Remove MP SAFE marks and stray register name in comments. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-12-31 17:07:59 +00:00
Colin Percival	31a55efdc5	Use the TSLOG framework to record entry/exit timestamps for hammer_time. The entry must be logged "manually" using TSRAW rather than TSENTER since PCPU data structures have not yet been initialized and thus curthread cannot be accessed; &thread0 is what will become curthread later in hammer_time. Other MD initialization code should be similarly instrumented in order to gain visibility into the time spent before entering mi_startup; this will require some care and testing from people with access to such hardware.	2017-12-31 09:22:07 +00:00
Eitan Adler	caa7e52f3f	kernel: Fix several typos and minor errors - duplicate words - typos - references to old versions of FreeBSD Reviewed by: imp, benno	2017-12-27 03:23:21 +00:00
Tycho Nightingale	9e33a61693	Recognize a pending virtual interrupt while emulating the halt instruction. Reviewed by: grehan, rgrimes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13573	2017-12-21 18:30:11 +00:00
Konstantin Belousov	30d4f9e888	Add atomic_load(9) and atomic_store(9) operations. They provide relaxed-ordered atomic access semantic. Due to the FreeBSD memory model, the operations are syntactic wrappers around the volatile accesses. The volatile qualifier is used to ensure that the access is not optimized out and in turn depends on the volatile semantic as implemented by supported compilers. The motivation for adding the operation is to help people coming from other systems or knowing the C11/C++ standards where atomics have special type and require use of the special access operations. It is still the case that FreeBSD requires plain load and stores of aligned integer types to be atomic. Suggested by: jhb Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 09:59:20 +00:00
Mark Johnston	5bab623438	Pass the trap frame to fasttrap hooks. The DTrace fasttrap entry points expect a struct reg containing the register values of the calling thread. Perform the conversion in fasttrap rather than in the trap handler: this reduces the number of ifdefs and avoids wasting stack space for traps that don't involve DTrace. MFC after: 2 weeks	2017-12-11 19:21:39 +00:00
Bruce Evans	fb3cc1c37d	Move instantiation of msgbufp from 9 MD files to subr_prf.c. This variable should be pure MI except possibly for reading it in MD dump routines. Its initialization was pure MD in 4.4BSD, but FreeBSD changed this in r36441 in 1998. There were many imperfections in r36441. This commit fixes only a small one, to simplify fixing the others 1 arch at a time. (r47678 added support for special/early/multiple message buffer initialization which I want in a more general form, but this was too fragile to use because hacking on the msgbufp global corrupted it, and was only used for 5 hours in -current...)	2017-12-07 07:55:38 +00:00
Andriy Gapon	a7437a3e9d	amd-vi: set iommu msi configuration using pci_enable_msi method This is better than directly changing PCI configuration space of the device because it makes the PCI bus aware of the configuration. Also, the change allows to drop a bunch of code that duplicated pci_enable_msi() functionality. I wonder if it's possible to further simplify the code by using pci_alloc_msi().	2017-12-04 17:10:52 +00:00
Andriy Gapon	df92c28d6a	vmm/amd: add ivhd device with a higher order ivhd should attach after the root PCI bus and, thus, after the ACPI Host-PCI bridge off which the bus hangs. This is because ivhd changes PCI configuration of a PCI IOMMU device that is located on the root bus. If the bus attaches after ivhd it clears the MSI portion of the configuration. As a result IOMMU event interrupts would never be delivered. For regular ACPI devices the order is calculated as ACPI_DEV_BASE_ORDER + level * 10 where level is a depth of the device in the ACPI namespace. I expect the depth of the Host-PCI bridge to be two or three, so ACPI_DEV_BASE_ORDER + 10 * 10 should be a sufficiently safe order for ivhd. This should fix the setup of the AMD-Vi event interrupt when vmm is preloaded (as opposed to kldload-ed).	2017-12-04 17:08:03 +00:00
Andriy Gapon	8f09494d1e	amd-vi: clear event interrupt and overflow bits upon handling the interrupt This ensures that we can receive further event interrupts. See the description of the bits in the specification for MMIO Offset 2020h IOMMU Status Register. The bits are defined as set-by-hardware write-1-to-clear, same as all the bits in the status register. Discussed with: anish	2017-12-04 17:02:53 +00:00
Scott Long	c15269ccb8	It's time to retire AHC_REG_PRETTY_PRINT and AHD_REG_PRETTY_PRINT from the standard kernels. They are still available as custom compile options.	2017-11-29 23:41:49 +00:00
Brooks Davis	5cd667e65f	Disable vim syntax highlighting. Vim's default pick doesn't understand that ';' is a comment character and the result looks horrible. Reviewed by: emaste	2017-11-28 18:23:17 +00:00


8376 Commits