freebsd-skq

Author	SHA1	Message	Date
kib	f495f3ebd8	Make WRFSBASE and WRGSBASE instructions functional. Right now, we enable the CR4.FSGSBASE bit on CPUs which support the facility (Ivy and later), to allow usermode to read fs and gs bases without syscalls. This bit also controls the write access to bases from userspace, but WRFSBASE and WRGSBASE instructions currently cannot be used, because return path from both exceptions or interrupts overrides bases with the values from pcb. Supporting the instructions is useful because this means that usermode can implement green-threads completely in userspace without issuing syscalls to change all of the machine context. Support is implemented by saving the fs base and user gs base when PCB_FULL_IRET flag is set. The flag is set on the context switch, which potentially causes clobber of the bases due to activation of another context, and when explicit modification of the user context by a syscall or exception handler is performed. In particular, the patch moves setting of the flag before syscalls change context. The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on entry from userspace can be considered a bug fixes on its own. Reviewed by: jhb (previous version) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D12023	2017-08-21 17:38:02 +00:00
cem	5a79b729aa	x86: Add dynamic interrupt rebalancing Add an option to dynamically rebalance interrupts across cores (hw.intrbalance); off by default. The goal is to minimize preemption. By placing interrupt sources on distinct CPUs, ithreads get preferentially scheduled on distinct CPUs. Overall preemption is reduced and latency is reduced. In our workflow it reduced "fighting" between two high-frequency interrupt sources. Reduced latency was proven by, e.g., SPEC2008. Submitted by: jeff@ (earlier version) Reviewed by: kib@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10435	2017-08-16 18:48:53 +00:00
br	c51ad5f03a	Add support for Intel Software Guard Extensions (Intel SGX). Intel SGX allows to manage isolated compartments "Enclaves" in user VA space. Enclaves memory is part of processor reserved memory (PRM) and always encrypted. This allows to protect user application code and data from upper privilege levels including OS kernel. This includes SGX driver and optional linux ioctl compatibility layer. Intel SGX SDK for FreeBSD is also available. Note this requires support from hardware (available since late Intel Skylake CPUs). Many thanks to Robert Watson for support and Konstantin Belousov for code review. Project wiki: https://wiki.freebsd.org/Intel_SGX. Reviewed by: kib Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D11113	2017-08-16 10:38:06 +00:00
kib	8353aff0e8	Add {rd,wr}{fs,gs}base C wrappers for instructions. Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-08-14 11:20:54 +00:00
imp	3c9b264a76	Fail to open efirt device when no EFI on system. libefivar expects opening /dev/efi to indicate if the we can make efi runtime calls. With a null routine, it was always succeeding leading efi_variables_supported() to return the wrong value. Only succeed if we have an efi_runtime table. Also, while I'm hear, out of an abundance of caution, add a likely redundant check to make sure efi_systbl is not NULL before dereferencing it. I know it can't be NULL if efi_cfgtbl is non-NULL, but the compiler doesn't.	2017-08-08 20:44:16 +00:00
cem	d4a7946bbd	x86: Tag some intrinsics with __pure2 Some C wrappers for x86 instructions do not touch global memory and only act on their arguments; they can be marked __pure2, aka __const__. Without this annotation, Clang 3.9.1 is not intelligent enough on its own to grok that these functions are __const__. Submitted by: Anton Rang <anton.rang AT isilon.com> Sponsored by: Dell EMC Isilon	2017-08-03 22:28:30 +00:00
truckman	43c5cd502a	Lower the amd64 shared page, which contains the signal trampoline, from the top of user memory to one page lower on machines with the Ryzen (AMD Family 17h) CPU. This pushes ps_strings and the stack down by one page as well. On Ryzen there is some sort of interaction between code running at the top of user memory address space and interrupts that can cause FreeBSD to either hang or silently reset. This sounds similar to the problem found with DragonFly BSD that was fixed with this commit: https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20 but our signal trampoline location was already lower than the address that DragonFly moved their signal trampoline to. It also does not appear to be related to SMT as described here: https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498 "Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ. The %rip the IRETQ returns to (e.g. userland %rip address) matters a LOT. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been because the able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state." since the system instability has been observed on FreeBSD with SMT disabled. Interrupts to appear to play a factor since running a signal-intensive process on the first CPU core, which handles most of the interrupts on my machine, is far more likely to trigger the problem than running such a process on any other core. Also lower sv_maxuser to prevent a malicious user from using mmap() to load and execute code in the top page of user memory that was made available when the shared page was moved down. Make the same changes to the 64-bit Linux emulator. PR: 219399 Reported by: nbe@renzel.net Reviewed by: kib Reviewed by: dchagin (previous version) Tested by: nbe@renzel.net (earlier version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11780	2017-08-02 01:43:35 +00:00
alc	621c14d3f9	Add support for pmap_enter(..., psind=1) to the amd64 pmap. In other words, add support for explicitly requesting that pmap_enter() create a 2MB page mapping. (Essentially, this feature allows the machine-independent layer to create superpage mappings preemptively, and not wait for automatic promotion to occur.) Export pmap_ps_enabled() to the machine-independent layer. Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or reclaim a PV entry when one is not available. Refactor pmap_enter_pde() into two functions, one by the same name, that is a general-purpose function for creating PDE PG_PS mappings, and another, pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only mappings for execve(2), mmap(2), and shmat(2). Submitted by: Yufeng Zhou <yz70@rice.edu> (an earlier version) Reviewed by: kib, markj Tested by: pho MFC after: 10 days Differential Revision: https://reviews.freebsd.org/D11556	2017-07-23 06:33:58 +00:00
rlibby	1eb6bbbb6e	efi: restrict visibility of EFIABI_ATTR-declared functions In-tree gcc (4.2) doesn't understand __attribute__((ms_abi)) (EFIABI_ATTR). Avoid declaring functions with that attribute when the compiler is detected to be gcc < 4.4. Reviewed by: kib, imp (previous version) Approved by: markj (mentor) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11636	2017-07-20 06:47:06 +00:00
jah	d1caaa9300	Clean up MD pollution of bus_dma.h: --Remove special-case handling of sparc64 bus_dmamap* functions. Replace with a more generic mechanism that allows MD busdma implementations to generate inline mapping functions by defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>. This is currently useful for sparc64, x86, and arm64, which all implement non-load dmamap operations as simple wrappers around map objects which may be bus- or device-specific. --Remove NULL-checked bus_dmamap macros. Implement the equivalent NULL checks in the inlined x86 implementation. For non-x86 platforms, these checks are a minor pessimization as those platforms do not currently allow NULL maps. NULL maps were originally allowed on arm64, which appears to have been the motivation behind adding arm[64]-specific barriers to bus_dma.h, but that support was removed in r299463. --Simplify the internal interface used by the bus_dmamap_load* variants and move it to bus_dma_internal.h --Fix some drivers that directly include sys/bus_dma.h despite the recommendations of bus_dma(9) Reviewed by: kib (previous revision), marius Differential Revision: https://reviews.freebsd.org/D10729	2017-07-01 05:35:29 +00:00
kib	7b6fe97487	Make struct syscall_args visible to userspace compilation environment from machine/proc.h, consistently on all architectures. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 weeks X-Differential revision: https://reviews.freebsd.org/D11080	2017-06-12 20:53:44 +00:00
trasz	31b7c45db9	Bump default MAXTSIZ (kern.maxtsiz) from 128MB to 32GB. The old limit prevents one from running eg clang built with debug; the new one is arbitrary (equal to MAXDSIZ) and... well, should be quite future-proof. Same fix might be applicable to other 64 bit architectures; I'll ask their respective maintainers to make sure it won't break anything. Reviewed by: kib MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D10758	2017-05-17 08:38:41 +00:00
glebius	21ead51d79	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
pkelsey	33064e92a2	Corrected misspelled versions of rendezvous. The MFC will include a compat definition of smp_no_rendevous_barrier() that calls smp_no_rendezvous_barrier(). Reviewed by: gnn, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10313	2017-04-09 02:00:03 +00:00
bde	254458ab34	Fix printing of negative offsets (typically from frame pointers) again. I fixed this in 1997, but the fix was over-engineered and fragile and was broken in 2003 if not before. i386 parameters were copied to 8 other arches verbatim, mostly after they stopped working on i386, and mostly without the large comment saying how the values were chosen on i386. powerpc has a non-verbatim copy which just changes the uncritical parameter and seems to add a sign extension bug to it. Just treat negative offsets as offsets if they are no more negative than -db_offset_max (default -64K), and remove all the broken parameters. -64K is not very negative, but it is enough for frame and stack pointer offsets since kernel stacks are small. The over-engineering was mainly to go more negative than -64K for the negative offset format, without affecting printing for more than a single address. Addresses in the top 64K of a (full 32-bit or 64-bit) address space are now printed less well, but there aren't many interesting ones. For arches that have many interesting ones very near the top (e.g., 68k has interrupt vectors there), there would be no good limit for the negative offset format and -64K is a good as anything.	2017-03-26 18:46:35 +00:00
markj	c4e0ff355e	Add support for 8- and 16-bit atomic_(f)cmpset to x86. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D10068	2017-03-22 17:29:04 +00:00
bde	02d7df1d66	Don't access the reserved registers %dr4 and %dr5 on i386. On the original i386, %dr[4-5] were unimplemented but not very clearly reserved, so debuggers read them to print them. i386 was still doing this. On the original athlon64, %dr[4-5] are documented as reserved but are aliased to %dr[6-7] unless CR4_DE is set, when accessing them traps. On 2 of my systems, accessing %dr[4-5] trapped sometimes. On my Haswell system, the apparent randomness was because the boot CPU starts with CR4_DE set while all other CPUs start with CR4_DE clear. FreeBSD doesn't support the data breakpoints enabled by CR4_DE and it never changes this flag, so the flag remains different across CPUs and the behaviour seemed inconsistent except while booting when the CPU doesn't change. The invalid accesses broke: - read access for printing the registers in ddb "show watches" on CPUs with CR4_DE set - read accesses in fill_dbregs() on CPUs with CR4_DE set. This didn't implement panic(3) since the user case always skipped %dr[4-5]. - write accesses in set_dbregs(). This also didn't affect userland. When it didn't trap, the aliasing made it fragile. Don't print the dummy (zero) values of %dr[4-5] in "show watches" for i386 or amd64. Fix style bugs near this printing. amd64 also has space in the dbregs struct for the reserved %dr[8-15] and already didn't print the dummy values for these, and never accessed any of the 10 reserved debug registers. Remove cpufuncs for making the invalid accesses. Even amd64 had these.	2017-03-17 13:49:05 +00:00
imp	7e6cabd06e	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
alc	c062fef7f7	Refine the fix from r312954. Specifically, add a new PDE-only flag, PG_PROMOTED, that indicates whether lingering 4KB page mappings might need to be flushed on a PDE change that restricts or destroys a 2MB page mapping. This flag allows the pmap to avoid range invalidations that are both unnecessary and costly. Reviewed by: kib, markj MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D9665	2017-02-26 19:54:02 +00:00
jah	85257d4d56	Bring back r313037, with fixes for mips: Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to replace pcpu_find(curcpu) in MI code. Reviewed by: andreast, kan, lidl Tested by: lidl(mips, sparc64), andreast(powerpc) Differential Revision: https://reviews.freebsd.org/D9587	2017-02-19 02:03:09 +00:00
jah	f5659d40d3	Revert r313037 The switch to get_pcpu() in MI code seems to cause hangs on MIPS. Back out until we can get a better idea of what's happening there. Reported by: kan, lidl	2017-02-04 06:24:49 +00:00
jah	fc31303beb	Implement get_pcpu() for the remaining architectures and use it to replace pcpu_find(curcpu) in MI code.	2017-02-01 03:32:49 +00:00
kib	d52cc80daf	Use SFENCE for ordering CLFLUSHOPT. SDM states that CLFLUSHOPT instructions can be ordered with other writes by SFENCE, heavier MFENCE is not required. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-01-20 19:08:44 +00:00
mjg	50a477c2ee	amd64: add atomic_fcmpset Reviewed by: kib, jhb	2017-01-03 21:00:24 +00:00
jhb	1018a82d1b	Move declarations of invpcid_works and pmap_pcid_enabled to pmap.h. Previously these were only declared under #ifdef SMP in <machine/smp.h>. However, these variables are defind in pmap.c unconditionally, and efirt.c references them unconditionally. This fixes non-SMP kernel builds. Discussed with: kib MFC after: 1 week	2016-10-31 18:37:05 +00:00
kib	559623d89a	Re-apply r306516 (by cem): Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags Reduce contention during TLB invalidation operations by using a per-CPU completion flag, rather than a single atomically-updated variable. On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements show that smp_tlb_shootdown is about 50% faster with this patch; observations with VTune show that the percentage of time spent in invlrng_single_page on an interrupt (actually doing invalidation, rather than synchronization) increases from 31% with the old mechanism to 71% with the new one. (Running a basic file server workload.) Submitted by: Anton Rang <rang at acm.org> Reviewed by: cem (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8041	2016-10-04 17:01:24 +00:00
cem	de42bf751c	Revert r306516 for now, it is incomplete on i386 Noted by: kib	2016-09-30 18:58:50 +00:00
cem	22e3a710d0	Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags Reduce contention during TLB invalidation operations by using a per-CPU completion flag, rather than a single atomically-updated variable. On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements show that smp_tlb_shootdown is about 50% faster with this patch; observations with VTune show that the percentage of time spent in invlrng_single_page on an interrupt (actually doing invalidation, rather than synchronization) increases from 31% with the old mechanism to 71% with the new one. (Running a basic file server workload.) Submitted by: Anton Rang <rang at acm.org> Reviewed by: cem (earlier version), kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8041	2016-09-30 18:12:16 +00:00
imp	5720aec884	Change the efi_get_table interface to a void ** so we can return the pointer by dereferencing the pointer. Reviewed by: kib@ MFC After: 2 weeks Sponsored by: Netflix, Inc	2016-09-22 19:04:51 +00:00
kib	df422cbea3	Add kernel interfaces to call EFI Runtime Services. Runtime services require special execution environment for the call. Besides that, OS must inform firmware about runtime virtual memory map which will be active during the calls, with the SetVirtualAddressMap() runtime call, done while the 1:1 mapping is still used. There are two complication: the SetVirtualAddressMap() effectively must be done from loader, which needs to know kernel address map in advance. More, despite not explicitely mentioned in the specification, both 1:1 and the map passed to SetVirtualAddressMap() must be active during the SetVirtualAddressMap() call. Second, there are buggy BIOSes which require both mappings active during runtime calls as well, most likely because they fail to identify all relocations to perform. On amd64, we can get rid of both problems by providing 1:1 mapping for the duration of runtime calls, by temprorary remapping user addresses. As result, we avoid the need for loader to know about future kernel address map, and avoid bugs in BIOSes. Typically BIOS only maps something in low 4G. If not runtime bugs, we would take advantage of the DMAP, as previous versions of this patch did. Similar but more complicated trick can be used even for i386 and 32bit runtime, if and when the EFI boot on i386 is supported. We would need a trampoline page, since potentially whole 4G of VA would be switched on calls, instead of only userspace portion on amd64. Context switches are disabled for the duration of the call, FPU access is granted, and interrupts are not disabled. The later is possible because kernel is mapped during calls. To test, the sysctl mib debug.efi_time is provided, setting it to 1 makes one call to EFI get_time() runtime service, on success the efitm structure is printed to the control terminal. Load efirt.ko, or add EFIRT option to the kernel config, to enable code. Discussed with: emaste, imp Tested by: emaste (mac, qemu) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-21 11:31:58 +00:00
kib	0b0178b3a6	Add a way for the architecture to specify the calling ABI for methods in the EFI Runtime Services Table. On amd64, the calling conventions are MS. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-21 10:35:44 +00:00
kib	ff88ae0aef	Add amd64 functions to load/store GDT register, store IDT and TR registers. Note that lgdt() name is already used for function which, besides loading GDT, also reloads segment descriptors cache, thus new function is named bare_lgdt(). Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-21 10:10:36 +00:00
kib	5aff5acf9c	Export the pmap_cache_bits() and pmap_pinit_pml4() functions from the amd64 pmap. The new pmap_pinit_pml4() function initializes the level 4 page table with entries for the kernel mappings. Both functions are needed for upcoming EFI Runtime Services support. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-21 10:05:51 +00:00
kib	3c5296b91a	Move pmap_p*e_index() inline functions from pmap.c to pmap.h. They are already used in minidump code. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-20 09:38:07 +00:00
bde	634d4e4a33	Decode some REX prefixes in inst_call(). This makes the 'next' and 'until' commands work in more cases.	2016-09-15 18:30:53 +00:00
bde	bf8d177543	Abort single stepping in ddb if the trap is not for single-stepping. This is not very easy to do, since ddb didn't know when traps are for single-stepping. It more or less assumed that traps are either breakpoints or single-step, but even for x86 this became inadequate with the release of the i386 in ~1986, and FreeBSD passes it other trap types for NMIs and panics. On x86, teach ddb when a trap is for single stepping using the %dr6 register. Unknown traps are now treated almost the same as breakpoints instead of as the same as single-steps. Previously, the classification of breakpoints was almost correct and everything else was unknown so had to be treated as a single-step. Now the classification of single- steps is precise, the classification of breakpoints is almost correct (as before) and everything else is unknown and treated like a breakpoint. This fixes: - breakpoints not set by ddb, including the main one in kdb_enter(), were treated as single-steps and not stopped on when stepping (except for the usual, simple case of a step with residual count 1). As special cases, kdb_enter() didn't stop for fatal traps or panics - similarly for "hardware breakpoints". Use a new MD macro IS_SSTEP_TRAP(type, code) to code to classify single-steps. This is excessively complicated for bug-for-bug and backwards compatibilty. Design errors apparently started in Mach in ~1990 or perhaps in the FreeBSD interface in ~1993. Common trap types like single steps should have a unique MI code (like the TRAP* codes for user SIGTRAP) so that debuggers don't need macros like IS_SSTEP_TRAP() to decode them. But 'type' is actually an ambiguous MD trap number, and code was always 0 (now it is (int)%dr6 on x86). So it was impossible to determine the trap type from the args. Global variables had to be used. There is already a classification macro db_pc_is_single_step(), but this just gets in the way. It is only used to recover from bugs in IS_BREAKPOINT_TRAP(). On some arches, IS_BREAKPOINT_TRAP() just duplicates the ambiguity in 'type' and misclassifies single-steps as breakpoints. It defaults to 'false', which is the opposite of what is needed for bug-for-bug compatibility. When this is cleaned up, MI classification bits should be passed in 'code'. This could be done now for positive-logic bits, since 'code' was always 0, but some negative logic is needed for compatibility so a simple MI classificition is not usable yet. After reading %dr6, clear the single-step bit in it so that the type of the next debugger trap can be decoded. This is a little ddb-specific. ddb doesn't understand the need to clear this bit and doing it before calling kdb is easiest. gdb would need to reverse this to support hardware breakpoints, but it just doesn't support them now since gdbstub doesn't support %dr*. Fix a bug involving %dr6: when emulating a single-step trap for vm86, set the bit for it in %dr6. Userland debuggers need this. ddb now needs this for vm86 bios calls. The bit gets copied to 'code' then cleared again. Fix related style bugs: - when clearing bits for hardware breakpoints in %dr6, spell the mask as ~0xf on both amd64 and i386 to get the correct number of bits using sign extension and not need a comment about using the wrong mask on amd64 (amd64 traps for invalid results but clearing the reserved top bits didn't trap since they are 0). - rewrite my old wrong comments about using %dr6 for ddb watchpoints.	2016-09-15 17:24:23 +00:00
jhb	bc4a384597	Remove 'cpu' and 'cpu_class' on amd64. The 'cpu' and 'cpu_class' variables were always set to the same value on amd64 and are legacy holdovers from i386. Remove them entirely on amd64. Reviewed by: imp, kib (older version) Differential Revision: https://reviews.freebsd.org/D7888	2016-09-15 17:05:54 +00:00
kib	cb80ddd115	Add FPU_KERN_NOCTX flag to the fpu_kern_enter() function on amd64. The flag specifies that the block which uses FPU must be executed in critical section, i.e. take no context switches, and does not need an FPU save area during the execution. It is intended to be applied around fast and short code pathes where save area allocation is impossible or undesirable, due to context or due to the relative cost of calculation vs. allocation. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-11 09:14:07 +00:00
jhb	a40f879158	MFC 304637: Fix build for !SMP kernels after the Xen MSIX workaround. Move msix_disable_migration under #ifdef SMP since it doesn't make sense for !SMP kernels. PR: 212014	2016-09-09 19:57:32 +00:00
bde	eb85aca715	On amd64, declare sse2_pagezero() and start using it again, but only for zeroing pages in idle where nontemporal writes are clearly best. This is almost a no-op since zeroing in idle works does nothing good and is off by default. Fix END() statement forgotten in previous commit. Align the loop in sse2_pagezero(). Since it writes to main memory, the loop doesn't have to be very carefully written to keep up. Unrolling it was considered useless or harmful and was not done on i386, but that was too careless. Timing for i386: the loop was not unrolled at all, and moved only 4 bytes/iteration. So on a 2GHz CPU, it needed to run at 2 cycles/ iteration to keep up with a memory speed of just 4GB/sec. But when it crossed a 16-byte boundary, on old CPUs it ran at 3 cycles/ iteration so it gave a maximum speed of 2.67GB/sec and couldn't even keep up with PC3200 memory. Fix the alignment so that it keep up with 4GB/sec memory, and unroll once to get nearer to 8GB/sec. Further unrolling might be useless or harmful since it would prevent the loop fitting in 16-bytes. My test system with an old CPU and old DDR1 only needed 5+ GB/sec. My test system with a new CPU and DDR3 doesn't need any changes to keep up ~16GB/sec. Timing for amd64: with 8-byte accesses and newer faster CPUs it is easy to reach 16GB/sec but not so easy to go much faster. The alignment doesn't matter much if the CPU is not very old. The loop was already unrolled 4 times, but needs 32 bytes and uses a fancy method that doesn't work for 2-way unrolling in 16 bytes. Just align it to 32-bytes.	2016-08-29 13:07:21 +00:00
jhb	4e659fa057	Fix build for !SMP kernels after the Xen MSIX workaround. Move msix_disable_migration under #ifdef SMP since it doesn't make sense for !SMP kernels. PR: 212014 Reported by: Glyn Grinstead <glyn@grinstead.org> MFC after: 3 days	2016-08-22 21:23:17 +00:00
jhb	fbbd0c6298	MFC 302181,302635: Disable MSI-X migration on older Xen hypervisors. 302181: Add a tunable to disable migration of MSI-X interrupts. The new 'machdep.disable_msix_migration' tunable can be set to 1 to disable migration of MSI-X interrupts. Xen versions prior to 4.6.0 do not properly handle updates to MSI-X table entries after the initial write. In particular, the operation to unmask a table entry after updating it during migration is not propagated to the "real" table for passthrough devices causing the interrupt to remain masked. At least some systems in EC2 are affected by this bug when using SRIOV. The tunable can be set in loader.conf as a workaround. 302635: xen: automatically disable MSI-X interrupt migration If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and 70a3cb are required on the hypervisor side for this to be fixed, and those are only included in 4.6.0, so stay on the safe side and disable MSI-X interrupt migration on anything older than 4.6.0. It should not cause major performance degradation unless a lot of MSI-X interrupts are allocated.	2016-08-05 17:13:25 +00:00
mav	fcb5c9368b	Add more UEFI/e820 memory types from latest specifications. This is only cosmetics. MFC after: 2 weeks	2016-07-24 09:15:11 +00:00
royger	844ce8697a	xen: automatically disable MSI-X interrupt migration If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and 70a3cb are required on the hypervisor side for this to be fixed, and those are only included in 4.6.0, so stay on the safe side and disable MSI-X interrupt migration on anything older than 4.6.0. It should not cause major performance degradation unless a lot of MSI-X interrupts are allocated. Sponsored by: Citrix Systems R&D MFC after: 3 days Reviewed by: jhb Differential revision: https://reviews.freebsd.org/D7148	2016-07-12 08:43:09 +00:00
nwhitehorn	89d01c24d1	Replace a number of conflations of mp_ncpus and mp_maxid with either mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try, for example, to run tasks on CPUs that did not exist or to allocate too few buffers on systems with sparse CPU IDs in which there are holes in the range and mp_maxid > mp_ncpus. Such circumstances generally occur on systems with SMT, but on which SMT is disabled. This patch restores system operation at least on POWER8 systems configured in this way. There are a number of other places in the kernel with potential problems in these situations, but where sparse CPU IDs are not currently known to occur, mostly in the ARM machine-dependent code. These will be fixed in a follow-up commit after the stable/11 branch. PR: kern/210106 Reviewed by: jhb Approved by: re (glebius)	2016-07-06 14:09:49 +00:00
sephe	ad3696213e	MFC 301015 hyperv/vmbus: Rename ISR functions MFC after: 1 week Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6601	2016-06-24 01:20:33 +00:00
sephe	ee9748f41f	MFC 299912 atomic: Add testandclear on i386/amd64 Reviewed by: kib Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6381	2016-06-23 02:21:37 +00:00
kib	60a7096da4	MFC r299350: Add locking annotations to amd64 struct md_page members.	2016-05-17 07:55:49 +00:00
sephe	6babf96582	atomic: Add testandclear on i386/amd64 Reviewed by: kib Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6381	2016-05-16 07:19:33 +00:00
kib	b4ddfd6167	Eliminate pvh_global_lock from the amd64 pmap. The only current purpose of the pvh lock was explained there On Wed, Jan 09, 2013 at 11:46:13PM -0600, Alan Cox wrote: > Let me lay out one example for you in detail. Suppose that we have > three processors and two of these processors are actively using the same > pmap. Now, one of the two processors sharing the pmap performs a > pmap_remove(). Suppose that one of the removed mappings is to a > physical page P. Moreover, suppose that the other processor sharing > that pmap has this mapping cached with write access in its TLB. Here's > where the trouble might begin. As you might expect, the processor > performing the pmap_remove() will acquire the fine-grained lock on the > PV list for page P before destroying the mapping to page P. Moreover, > this processor will ensure that the vm_page's dirty field is updated > before releasing that PV list lock. However, the TLB shootdown for this > mapping may not be initiated until after the PV list lock is released. > The processor performing the pmap_remove() is not problematic, because > the code being executed by that processor won't presume that the mapping > is destroyed until the TLB shootdown has completed and pmap_remove() has > returned. However, the other processor sharing the pmap could be > problematic. Specifically, suppose that the third processor is > executing the page daemon and concurrently trying to reclaim page P. > This processor performs a pmap_remove_all() on page P in preparation for > reclaiming the page. At this instant, the PV list for page P may > already be empty but our second processor still has a stale TLB entry > mapping page P. So, changes might still occur to the page after the > page daemon believes that all mappings have been destroyed. (If the PV > entry had still existed, then the pmap lock would have ensured that the > TLB shootdown completed before the pmap_remove_all() finished.) Note, > however, the page daemon will know that the page is dirty. It can't > possibly mistake a dirty page for a clean one. However, without the > current pvh global locking, I don't think anything is stopping the page > daemon from starting the laundering process before the TLB shootdown has > completed. > > I believe that a similar example could be constructed with a clean page > P' and a stale read-only TLB entry. In this case, the page P' could be > "cached" in the cache/free queues and recycled before the stale TLB > entry is flushed. TLBs for addresses with updated PTEs are always flushed before pmap lock is unlocked. On the other hand, amd64 pmap code does not always flushes TLBs before PV list locks are unlocked, if previously PTEs were cleared and PV entries removed. To handle the situations where a thread might notice empty PV list but third thread still having access to the page due to TLB invalidation not finished yet, introduce delayed invalidation. Comparing with the pvh_global_lock, DI does not block entered thread when pmap_remove_all() or pmap_remove_write() (callers of pmap_delayed_invl_wait()) are executed in parallel. But _invl_wait() callers are blocked until all previously noted DI blocks are leaved, thus ensuring that neccessary TLB invalidations were performed before returning from pmap_remove_all() or pmap_remove_write(). See comments for detailed description of the mechanism, and also for the explanations why several pmap methods, most important pmap_enter(), do not need DI protection. Reviewed by: alc, jhb (turnstile KPI usage) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D5747	2016-05-14 23:35:11 +00:00

1 2 3 4 5 ...

2031 Commits