freebsd-skq

Author	SHA1	Message	Date
mjg	50a477c2ee	amd64: add atomic_fcmpset Reviewed by: kib, jhb	2017-01-03 21:00:24 +00:00
kib	149e8d3d06	Fix typo. Remove spurious blank line. MFC after: 3 days	2016-12-18 09:32:23 +00:00
jhb	203d8b88f7	Enable EARLY_AP_STARTUP on amd64 and i386 kernels by default. PR: 199321, 203682 MFC after: 2 months Sponsored by: Netflix	2016-12-16 21:10:37 +00:00
kib	64b6cb0328	Provide non-final but valid PCB pointer for thread0 for duration of hammer_time(). This makes assembler exception handlers not fault itself when setting PCB flags, and allow normal kernel trap handler to get control. The pointer is reset after FPU parameters are obtained. Set thread0.td_critnest to 1 for duration of hammer_time() as well. In particular, page faults at that early stage panic immediately instead of trying to call not yet operational VM to resolve it. As result, faults during second half of the hammer_time() execution have a chance to be reported instead of silent machine reboot or hang. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-14 11:40:31 +00:00
def	f63c437216	Add support for encrypted kernel crash dumps. Changes include modifications in kernel crash dump routines, dumpon(8) and savecore(8). A new tool called decryptcore(8) was added. A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump configuration in the diocskerneldump_arg structure to the kernel. The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for backward ABI compatibility. dumpon(8) generates a one-time random symmetric key and encrypts it using an RSA public key in capability mode. Currently only AES-256-CBC is supported but EKCD was designed to implement support for other algorithms in the future. The public key is chosen using the -k flag. The dumpon rc(8) script can do this automatically during startup using the dumppubkey rc.conf(5) variable. Once the keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O control. When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random IV and sets up the key schedule for the specified algorithm. Each time the kernel tries to write a crash dump to the dump device, the IV is replaced by a SHA-256 hash of the previous value. This is intended to make a possible differential cryptanalysis harder since it is possible to write multiple crash dumps without reboot by repeating the following commands: # sysctl debug.kdb.enter=1 db> call doadump(0) db> continue # savecore A kernel dump key consists of an algorithm identifier, an IV and an encrypted symmetric key. The kernel dump key size is included in a kernel dump header. The size is an unsigned 32-bit integer and it is aligned to a block size. The header structure has 512 bytes to match the block size so it was required to make a panic string 4 bytes shorter to add a new field to the header structure. If the kernel dump key size in the header is nonzero it is assumed that the kernel dump key is placed after the first header on the dump device and the core dump is encrypted. 
Separate functions were implemented to write the kernel dump header and the kernel dump key as they need to be unencrypted. The dump_write function encrypts data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps are not supported due to the way they are constructed which makes it impossible to use the CBC mode for encryption. It should be also noted that textdumps don't contain sensitive data by design as a user decides what information should be dumped. savecore(8) writes the kernel dump key to a key.# file if its size in the header is nonzero. # is the number of the current core dump. decryptcore(8) decrypts the core dump using a private RSA key and the kernel dump key. This is performed by a child process in capability mode. If the decryption was not successful the parent process removes a partially decrypted core dump. Description on how to encrypt crash dumps was added to the decryptcore(8), dumpon(8), rc.conf(5) and savecore(8) manual pages. EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU. The feature still has to be tested on arm and arm64 as it wasn't possible to run FreeBSD due to the problems with QEMU emulation and lack of hardware. Designed by: def, pjd Reviewed by: cem, oshogbo, pjd Partial review: delphij, emaste, jhb, kib Approved by: pjd (mentor) Differential Revision: https://reviews.freebsd.org/D4712	2016-12-10 16:20:39 +00:00
imp	3459557545	Permit loading of efirt module even when there's no EFI to call. The module loading is successful, but attempts to use it will not be successful. This is similar to what we do (did?) with ACPI on non-ACPI systems. We succeed if we can't find the necessary information to hook into EFI, but still fail if we're unable to allocate resources if we do find EFI. Not Objected to by: kib@ MFC After: 3 days	2016-12-09 23:37:11 +00:00
markj	8bb19c4929	Add a COMPAT_FREEBSD11 kernel option. Use it wherever COMPAT_FREEBSD10 is currently specified. Reviewed by: glebius, imp, jhb Differential Revision: https://reviews.freebsd.org/D8736	2016-12-09 18:54:12 +00:00
glebius	f8eae77f98	Treat R_X86_64_PLT32 relocs as R_X86_64_PC32. If we load a binary that is designed to be a library, it produces relocatable code via assembler directives in the assembly itself (rather than compiler options). This emits R_X86_64_PLT32 relocations, which are not handled by the kernel linker. Submitted by: gallatin Reviewed by: kib	2016-12-09 18:07:28 +00:00
alc	7571ef95c1	Previously, vm_radix_remove() would panic if the radix trie didn't contain a vm_page_t at the specified index. However, with this change, vm_radix_remove() no longer panics. Instead, it returns NULL if there is no vm_page_t at the specified index. Otherwise, it returns the vm_page_t. The motivation for this change is that it simplifies the use of radix tries in the amd64, arm64, and i386 pmap implementations. Instead of performing a lookup before every remove, the pmap can simply perform the remove. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D8708	2016-12-08 04:29:29 +00:00
jhb	f987575fac	Report page faults due to reserved bits in PTEs as a separate fault type. Rather than reporting a page fault due to a bad PTE as a protection violation with the "rsv" flag, treat these faults as a separate type of fault altogether. MFC after: 1 month	2016-11-19 01:34:12 +00:00
bdrewery	30f99dbeef	Fix improper use of "its". Sponsored by: Dell EMC Isilon	2016-11-08 23:59:41 +00:00
cem	7ae132fee1	Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code. This can be handy in tracking down what code touched hung bios and bufs last. The full history is especially useful, but adds enough bloat that it shouldn't be enabled in release builds. Function names (or arbitrary string constants) are tracked in a fixed-size ring in bufs. Bios gain a pointer to the upper buf for tracking. SCSI CCBs gain a pointer to the upper bio for tracking. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8366	2016-10-31 23:09:52 +00:00
jhb	1018a82d1b	Move declarations of invpcid_works and pmap_pcid_enabled to pmap.h. Previously these were only declared under #ifdef SMP in <machine/smp.h>. However, these variables are defined in pmap.c unconditionally, and efirt.c references them unconditionally. This fixes non-SMP kernel builds. Discussed with: kib MFC after: 1 week	2016-10-31 18:37:05 +00:00
avg	9127df4438	fix a syntax error in r308039 ... that I somehow introduced between testing the change and committing it. MFC after: 1 week X-MFC with: r307903	2016-10-28 15:57:55 +00:00
avg	6310254212	vmm: another take at maximum address passed to contigmalloc Just using vm_paddr_t value with all bits set. That should work as long as the type is unsigned. While there, fix a couple of whitespace issues nearby. MFC after: 1 week X-MFC with: r307903	2016-10-28 14:38:01 +00:00
jhb	95a3814f21	MFamd64: Add bounds checks on addresses used with /dev/mem. Reject attempts to read from or memory map offsets in /dev/mem that are beyond the maximum-supported physical address of the current CPU. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7408	2016-10-27 21:23:14 +00:00
glebius	856adf7415	The argument validation in r296956 was not enough to close all possible overflows in sysarch(2). Submitted by: Kun Yang <kun.yang chaitin.com> Patch by: kib Security: SA-16:15	2016-10-25 17:13:46 +00:00
avg	a2af253a41	fix up r307903, use correct max address definition MFC after: 1 week X-MFC with: r307903	2016-10-25 10:59:21 +00:00
avg	7d3b940604	vmm/svm: iopm_bitmap and msr_bitmap must be contiguous in physical memory To achieve that the whole svm_softc is allocated with contigmalloc now. It would be more efficient to de-embed those arrays and allocate only them with contigmalloc. Previously, if malloc(9) used non-contiguous pages for the arrays, then random bits in physical pages next to the first page would be used to determine permissions for I/O port and MSR accesses. That could result in a guest dangerously modifying the host hardware configuration. One example is that sometimes NMI watchdog driver in a Linux guest would be able to configure a performance counter on a host system. The counter would generate an interrupt and if hwpmc(4) driver is loaded on the host, then the interrupt would be delivered as an NMI. Discussed with: jhb Reviewed by: grehan MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D8321	2016-10-25 10:34:14 +00:00
kib	45100446da	Follow-up to r307866: - Make !KDB config buildable. - Simplify interface to nmi_handle_intr() by evaluating panic_on_nmi in one place, namely nmi_call_kdb(). This allows to remove do_panic argument from the functions, and to remove i386/amd64 duplication of the variable and sysctl definitions. Note that now NMI causes panic(9) instead of trap_fatal() reporting and then panic(9), consistently for NMIs delivered while CPU operated in ring 0 and 3. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-10-24 20:47:46 +00:00
kib	a04db702cd	Handle broadcast NMIs. On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs reporting hardware errors are broadcasted to all CPUs. When kernel is configured to enter kdb on NMI, the outcome is problematic, because each CPU tries to enter kdb. All CPUs are executing NMI handlers, which set the latches disabling the nested NMI delivery; this means that stop_cpus_hard(), used by kdb_enter() to stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work. One indication of this is the harmless but annoying diagnostic "timeout stopping cpus". Much more harming behaviour is that because all CPUs try to enter kdb, and if ddb is used as debugger, all CPUs issue prompt on console and race for the input, not to mention the simultaneous use of the ddb shared state. Try to fix this by introducing a pseudo-lock for simultaneous attempts to handle NMIs. If one core happens to enter NMI trap handler, other cores see it and simulate reception of the IPI_STOP_HARD. More, generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting for the acknowledgement, relying on the nmi handler on other cores suspending and then restarting the CPU. Since it is impossible to detect at runtime whether some stray NMI is broadcast or unicast, add a knob for administrator (really developer) to configure debugging NMI handling mode. The updated patch was debugged with the help from Andrey Gapon (avg) and discussed with him. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8249	2016-10-24 16:40:27 +00:00
jkim	229f578eb8	Implement BPF_MOD and BPF_XOR instructions. These two ALU instructions first appeared on Linux. Then, libpcap adopted and made them available since 1.6.2. Now more platforms including NetBSD have them in kernel. So do we.	2016-10-21 06:55:07 +00:00
jkim	42d876c52e	Reduce code for conditional jumps.	2016-10-21 06:09:30 +00:00
jkim	b35ec131f8	Fix compiler warnings for user land.	2016-10-21 06:06:54 +00:00
stevek	8cc74c4a97	Add sysctl to make amd64 minidump retry count tunable at runtime. PR: 213462 Submitted by: RaviPrakash Darbha <rdarbha@juniper.net> Reviewed by: cemi, markj Approved by: sjg (mentor) Obtained from: Juniper Networks Differential Revision: https://reviews.freebsd.org/D8254	2016-10-17 22:57:41 +00:00
kib	eb0078e23d	Do not try to create /dev/efi device node before devfs is initialized. Split efirt.ko initialization into early stage where runtime services KPI environment is created, to be used e.g. for RTC, and the later devfs node creation stage, per module. Switch the efi device to use make_dev_s(9) instead of make_dev(9). At least, this gracefully handles the duplicated device name issue. Remove ARGSUSED comment from efidev_ioctl(), all unused arguments are annotated with __unused attribute. Reported by: ambrisko, O. Hartmann <ohartman@zedat.fu-berlin.de> Reviewed by: imp Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-10-16 06:07:43 +00:00
jhb	f689fd5a63	Drop support for using mmap() with /dev/kmem. Using the device pager with /dev/kmem is not stable since KVA mappings are transient, but the device pager caches the PA associated with a given offset forever. Interestingly, mips' implementation of memmap() already refused requests for /dev/kmem. Note that kvm_read/kvm_write do not use mmap, but use read and write on /dev/kmem, so this should not affect libkvm users. Reviewed by: kib MFC after: 2 months	2016-10-14 20:01:07 +00:00
jtl	62030781cd	In the TCP stack, the hhook(9) framework provides hooks for kernel modules to add actions that run when a TCP frame is sent or received on a TCP session in the ESTABLISHED state. In the base tree, this functionality is only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd, and cc_vegas congestion control modules. Presently, we incur overhead to check for hooks each time a TCP frame is sent or received on an ESTABLISHED TCP session. This change adds a new compile-time option (TCP_HHOOK) to determine whether to include the hhook(9) framework for TCP. To retain backwards compatibility, I added the TCP_HHOOK option to every configuration file that already defined "options INET". (Therefore, this patch introduces no functional change. In order to see a functional difference, you need to compile a custom kernel without the TCP_HHOOK option.) This change will allow users to easily exclude this functionality from their kernel, should they wish to do so. Note that any users who use a custom kernel configuration and use one of the congestion control modules listed above will need to add the TCP_HHOOK option to their kernel configuration. Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8185	2016-10-12 02:16:42 +00:00
imp	41d4ddad0f	Create /dev/efidev to provide an ioctl interface to userland. It supports userland interfaces to UEFI Runtime Services. This is intended to be the MI portion of EFI RuntimeServices support. Differential Revision: https://reviews.freebsd.org/D8128 Reviewed by: kib@, wblock@, Ganael Laplanche	2016-10-11 22:24:30 +00:00
kib	559623d89a	Re-apply r306516 (by cem): Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags Reduce contention during TLB invalidation operations by using a per-CPU completion flag, rather than a single atomically-updated variable. On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements show that smp_tlb_shootdown is about 50% faster with this patch; observations with VTune show that the percentage of time spent in invlrng_single_page on an interrupt (actually doing invalidation, rather than synchronization) increases from 31% with the old mechanism to 71% with the new one. (Running a basic file server workload.) Submitted by: Anton Rang <rang at acm.org> Reviewed by: cem (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8041	2016-10-04 17:01:24 +00:00
cem	de42bf751c	Revert r306516 for now, it is incomplete on i386 Noted by: kib	2016-09-30 18:58:50 +00:00
cem	22e3a710d0	Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags Reduce contention during TLB invalidation operations by using a per-CPU completion flag, rather than a single atomically-updated variable. On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements show that smp_tlb_shootdown is about 50% faster with this patch; observations with VTune show that the percentage of time spent in invlrng_single_page on an interrupt (actually doing invalidation, rather than synchronization) increases from 31% with the old mechanism to 71% with the new one. (Running a basic file server workload.) Submitted by: Anton Rang <rang at acm.org> Reviewed by: cem (earlier version), kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8041	2016-09-30 18:12:16 +00:00
hselasky	5e41da7ccd	Move the ConnectX-3 and ConnectX-2 driver from sys/ofed into sys/dev/mlx4 like other PCI network drivers. The sys/ofed directory is now mainly reserved for generic infiniband code, with exception of the mthca driver. - Add new manual page, mlx4en(4), describing how to configure and load mlx4en. - All relevant driver C-files are now prefixed mlx4, mlx4_en and mlx4_ib respectively to avoid object filename collisions when compiling the kernel. This also fixes an issue with proper dependency file generation for the C-files in question. - Device mlxen is now device mlx4en and depends on device mlx4, see mlx4en(4). Only the network device name remains unchanged. - The mlx4 and mlx4en modules are now built by default on i386 and amd64 targets. Only building the mlx4ib module depends on WITH_OFED=YES . Sponsored by: Mellanox Technologies	2016-09-30 08:23:06 +00:00
kib	7dd86df41b	Handle TLB shootdown IPI during the EFI runtime calls, on SandyBridge and IvyBridge machines, which support PCID but do not have INVPCID instruction. MFC after: 1 week	2016-09-26 17:25:25 +00:00
kib	3deaeb8d22	For machines which support PCID but do not have INVPCID instruction, i.e. SandyBridge and IvyBridge, correct a race between pmap_activate() and invltlb_pcid_handler(). Reported by and tested by: Slawa Olhovchenkov <slw@zxy.spb.ru> MFC after: 1 week	2016-09-26 17:22:44 +00:00
bde	94a237c17c	Fix vm86 initialization, part 3 of 2 and a half. (Actually, just fix early printfs and debugging of vm86 initialization and some other early initialization in some cases.) Add an option debug.late_console (with default 1=off) to move console and kdb initialization back where it was. Do the same for amd64 although there is no vm86 there. On my test system, debug.late_console=0 works for the syscons, sio and uart console drivers on amd64 and i386, and for vt on i386 but not on amd64. The early printfs fixed by debug.late_console=0 are: - on i386, the message about lost memory above 4G - with -v in otherwise normal use, about 20 printfs for SMAP - other debugging messages for memory sizing. Mostly under -v and not printed in normal use. Document in a comment how much earlier the initialization and early printf()s can be. That is very early for the console. Not much more than curthread is needed. kdb use obviously needs to be not so early, since it needs IDT initialization and that is done relatively late for convenience and historical reasons.	2016-09-25 14:56:24 +00:00
imp	5720aec884	Change the efi_get_table interface to a void ** so we can return the pointer by dereferencing the pointer. Reviewed by: kib@ MFC After: 2 weeks Sponsored by: Netflix, Inc	2016-09-22 19:04:51 +00:00
markj	f2a1dd4de8	Regenerate syscall provider argument strings.	2016-09-22 04:50:03 +00:00
kib	df422cbea3	Add kernel interfaces to call EFI Runtime Services. Runtime services require special execution environment for the call. Besides that, OS must inform firmware about runtime virtual memory map which will be active during the calls, with the SetVirtualAddressMap() runtime call, done while the 1:1 mapping is still used. There are two complications: the SetVirtualAddressMap() effectively must be done from loader, which needs to know kernel address map in advance. More, despite not explicitly mentioned in the specification, both 1:1 and the map passed to SetVirtualAddressMap() must be active during the SetVirtualAddressMap() call. Second, there are buggy BIOSes which require both mappings active during runtime calls as well, most likely because they fail to identify all relocations to perform. On amd64, we can get rid of both problems by providing 1:1 mapping for the duration of runtime calls, by temporary remapping user addresses. As result, we avoid the need for loader to know about future kernel address map, and avoid bugs in BIOSes. Typically BIOS only maps something in low 4G. If not runtime bugs, we would take advantage of the DMAP, as previous versions of this patch did. Similar but more complicated trick can be used even for i386 and 32bit runtime, if and when the EFI boot on i386 is supported. We would need a trampoline page, since potentially whole 4G of VA would be switched on calls, instead of only userspace portion on amd64. Context switches are disabled for the duration of the call, FPU access is granted, and interrupts are not disabled. The latter is possible because kernel is mapped during calls. To test, the sysctl mib debug.efi_time is provided, setting it to 1 makes one call to EFI get_time() runtime service, on success the efitm structure is printed to the control terminal. Load efirt.ko, or add EFIRT option to the kernel config, to enable code. 
Discussed with: emaste, imp Tested by: emaste (mac, qemu) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-21 11:31:58 +00:00
kib	eaff6cf773	Rename efi_systbl to efi_systbl_phys, the variable contains the physical address of the EFI System Table. Add _KERNEL guard around its declaration in sys/efi.h. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-21 10:55:28 +00:00
kib	0b0178b3a6	Add a way for the architecture to specify the calling ABI for methods in the EFI Runtime Services Table. On amd64, the calling conventions are MS. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-21 10:35:44 +00:00
kib	ff88ae0aef	Add amd64 functions to load/store GDT register, store IDT and TR registers. Note that lgdt() name is already used for function which, besides loading GDT, also reloads segment descriptors cache, thus new function is named bare_lgdt(). Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-21 10:10:36 +00:00
kib	5aff5acf9c	Export the pmap_cache_bits() and pmap_pinit_pml4() functions from the amd64 pmap. The new pmap_pinit_pml4() function initializes the level 4 page table with entries for the kernel mappings. Both functions are needed for upcoming EFI Runtime Services support. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-21 10:05:51 +00:00
kib	e9e29e2686	MFC r305939: Remove trailing space.	2016-09-21 08:14:55 +00:00
kib	3c5296b91a	Move pmap_p*e_index() inline functions from pmap.c to pmap.h. They are already used in minidump code. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-20 09:38:07 +00:00
emaste	a8b2a6847f	Catch up to sys/capability.h rename to sys/capsicum.h in r263232 MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-09-19 18:44:43 +00:00
kib	90ee5f51f2	Consolidate four efi_next_descriptor() definitions. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-18 17:38:02 +00:00
kib	3fd792803a	Remove trailing space. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-09-18 17:33:49 +00:00
bde	b8aaa2c367	Fix decoding of tf_rsp on amd64, and move TF_HAS_STACKREGS() to the i386-only section, and fix a comment about the amd64 kernel trapframe not having stackregs. tf_rsp doesn't need decoding on amd64, but had an old clone of i386 code to do this in 1 place, and since the amd64 kernel trapframe does have stackregs, the result was an off-by-16 error for %rsp in an error message.	2016-09-16 07:09:35 +00:00
bde	d6a5db2944	(1) Ifdef the new dr6 variable for KDB. While here, avoid using the old variable 'code' and remove it in trap(). ('code' was meant for holding things like %dr6, but is too small to hold %dr6 on amd64 and was reduced to an obfuscation of tf_err, with early truncation on amd64.) Submitted by: Michael Butler (imb@...)	2016-09-16 04:58:37 +00:00
bde	634d4e4a33	Decode some REX prefixes in inst_call(). This makes the 'next' and 'until' commands work in more cases.	2016-09-15 18:30:53 +00:00
bde	bf8d177543	Abort single stepping in ddb if the trap is not for single-stepping. This is not very easy to do, since ddb didn't know when traps are for single-stepping. It more or less assumed that traps are either breakpoints or single-step, but even for x86 this became inadequate with the release of the i386 in ~1986, and FreeBSD passes it other trap types for NMIs and panics. On x86, teach ddb when a trap is for single stepping using the %dr6 register. Unknown traps are now treated almost the same as breakpoints instead of as the same as single-steps. Previously, the classification of breakpoints was almost correct and everything else was unknown so had to be treated as a single-step. Now the classification of single-steps is precise, the classification of breakpoints is almost correct (as before) and everything else is unknown and treated like a breakpoint. This fixes: - breakpoints not set by ddb, including the main one in kdb_enter(), were treated as single-steps and not stopped on when stepping (except for the usual, simple case of a step with residual count 1). As special cases, kdb_enter() didn't stop for fatal traps or panics - similarly for "hardware breakpoints". Use a new MD macro IS_SSTEP_TRAP(type, code) to classify single-steps. This is excessively complicated for bug-for-bug and backwards compatibility. Design errors apparently started in Mach in ~1990 or perhaps in the FreeBSD interface in ~1993. Common trap types like single steps should have a unique MI code (like the TRAP* codes for user SIGTRAP) so that debuggers don't need macros like IS_SSTEP_TRAP() to decode them. But 'type' is actually an ambiguous MD trap number, and code was always 0 (now it is (int)%dr6 on x86). So it was impossible to determine the trap type from the args. Global variables had to be used. There is already a classification macro db_pc_is_single_step(), but this just gets in the way. It is only used to recover from bugs in IS_BREAKPOINT_TRAP(). 
On some arches, IS_BREAKPOINT_TRAP() just duplicates the ambiguity in 'type' and misclassifies single-steps as breakpoints. It defaults to 'false', which is the opposite of what is needed for bug-for-bug compatibility. When this is cleaned up, MI classification bits should be passed in 'code'. This could be done now for positive-logic bits, since 'code' was always 0, but some negative logic is needed for compatibility so a simple MI classification is not usable yet. After reading %dr6, clear the single-step bit in it so that the type of the next debugger trap can be decoded. This is a little ddb-specific. ddb doesn't understand the need to clear this bit and doing it before calling kdb is easiest. gdb would need to reverse this to support hardware breakpoints, but it just doesn't support them now since gdbstub doesn't support %dr*. Fix a bug involving %dr6: when emulating a single-step trap for vm86, set the bit for it in %dr6. Userland debuggers need this. ddb now needs this for vm86 bios calls. The bit gets copied to 'code' then cleared again. Fix related style bugs: - when clearing bits for hardware breakpoints in %dr6, spell the mask as ~0xf on both amd64 and i386 to get the correct number of bits using sign extension and not need a comment about using the wrong mask on amd64 (amd64 traps for invalid results but clearing the reserved top bits didn't trap since they are 0). - rewrite my old wrong comments about using %dr6 for ddb watchpoints.	2016-09-15 17:24:23 +00:00
jhb	bc4a384597	Remove 'cpu' and 'cpu_class' on amd64. The 'cpu' and 'cpu_class' variables were always set to the same value on amd64 and are legacy holdovers from i386. Remove them entirely on amd64. Reviewed by: imp, kib (older version) Differential Revision: https://reviews.freebsd.org/D7888	2016-09-15 17:05:54 +00:00
bz	de4b915bfb	Try to fix LINT builds after r305807. Seems to be a simple s&r error I missed while reading through the 1st time as well.	2016-09-14 16:08:23 +00:00
bde	d58cd5baa4	Use the MI macro TRAPF_USERMODE() instead of open-coded checks for SEL_UPL and sometimes PSL_VM. This is just a style change on amd64, but on i386 it fixes 1 unimportant place where the PSL_VM check was missing and starts fixing 1 important place where the PSL_VM check had a logic error. Fix logic errors in treating vm86 bioscall mode as kernel mode. The main place checked all the necessary flags, but put the necessary parentheses for the PSL_VM and PCB_VM86CALL checks in the wrong place. The broken case is only reached if a vm86 bioscall uses a %cs which is nonzero mod 4, but that is unusual -- most bios calls start with %cs = 0xc000 or 0xf000 and rarely change it. Another place was missing the check for PCB_VM86CALL, but was only reachable if there are bugs virtualizing PSL_I. Add a macro TF_HAS_STACKREGS() and use this instead of converting open-coded checks of SEL_UPL, etc. to TRAPF_USERMODE() when we only care about whether the frame has stack registers. This fixes 3 places in my recent fix for register variables in vm86 mode where I messed up the PSL_VM check and cleans up other places.	2016-09-14 12:57:40 +00:00
kib	cb80ddd115	Add FPU_KERN_NOCTX flag to the fpu_kern_enter() function on amd64. The flag specifies that the block which uses FPU must be executed in critical section, i.e. take no context switches, and does not need an FPU save area during the execution. It is intended to be applied around fast and short code paths where save area allocation is impossible or undesirable, due to context or due to the relative cost of calculation vs. allocation. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-11 09:14:07 +00:00
alc	44f29780e8	Various changes to pmap_ts_referenced() Move PMAP_TS_REFERENCED_MAX out of the various pmap implementations and into vm/pmap.h, and describe what its purpose is. Eliminate the archaic "XXX" comment about its value. I don't believe that its exact value, e.g., 5 versus 6, matters. Update the arm64 and riscv pmap implementations of pmap_ts_referenced() to opportunistically update the page's dirty field. On amd64, use the PDE value already cached in a local variable rather than dereferencing a pointer again and again. Reviewed by: kib, markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D7836	2016-09-10 16:49:25 +00:00
jhb	7380cda6f5	MFC 303713: Correct assertion on vcpuid argument to vm_gpa_hold(). PR: 208168	2016-09-09 20:30:36 +00:00
jhb	a40f879158	MFC 304637: Fix build for !SMP kernels after the Xen MSIX workaround. Move msix_disable_migration under #ifdef SMP since it doesn't make sense for !SMP kernels. PR: 212014	2016-09-09 19:57:32 +00:00
avg	e62f31d754	work around AMD erratum 793 for family 16h, models 00h-0Fh	2016-09-07 14:24:29 +00:00
jhb	b83d0562bd	Reset PCI pass through devices via PCI-e FLR during VM start and end. Add routines to trigger a function level reset (FLR) of a PCI-express device via the PCI-express device control register. This also includes support routines to wait for pending transactions to complete as well as calculating the maximum completion timeout permitted by a device. Change the ppt(4) driver to reset pass through devices before attaching to a VM during startup and before detaching from a VM during shutdown. Reviewed by: imp, wblock (earlier version) MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D7751	2016-09-06 21:15:35 +00:00
jhb	9b7bf59c96	Update the I/O MMU in bhyve when PCI devices are added and removed. When the I/O MMU is active in bhyve, all PCI devices need valid entries in the DMAR context tables. The I/O MMU code does a single enumeration of the available PCI devices during initialization to add all existing devices to a domain representing the host. The ppt(4) driver then moves pass through devices in and out of domains for virtual machines as needed. However, when new PCI devices were added at runtime either via SR-IOV or HotPlug, the I/O MMU tables were not updated. This change adds a new set of EVENTHANDLERS that are invoked when PCI devices are added and deleted. The I/O MMU driver in bhyve installs handlers for these events which it uses to add and remove devices to the "host" domain. Reviewed by: imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D7667	2016-09-06 20:17:54 +00:00
jhb	31bd6f147b	Remove remnants of PERFMON and I586_PMC_GUPROF from amd64. These options were never fully ported over from i386.	2016-09-06 19:25:32 +00:00
jhb	5f56f30076	Leave ppt devices in the host domain when they are not attached to a VM. This allows a pass through device to be reset to a normal device driver on the host and reused on the host. ppt devices are now always active in some I/O MMU domain when the I/O MMU is active, either the host domain or the domain of a VM they are attached to. Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D7666	2016-09-06 18:53:17 +00:00
markj	fb5804c98d	Remove support for idle page zeroing. Idle page zeroing has been disabled by default on all architectures since r170816 and has some bugs that make it seemingly unusable. Specifically, the idle-priority pagezero thread exacerbates contention for the free page lock, and yields the CPU without releasing it in non-preemptive kernels. The pagezero thread also does not behave correctly when superpage reservations are enabled: its target is a function of v_free_count, which includes reserved-but-free pages, but it is only able to zero pages belonging to the physical memory allocator. Reviewed by: alc, imp, kib Differential Revision: https://reviews.freebsd.org/D7714	2016-09-03 20:38:13 +00:00
alc	24a2d27767	As an optimization to the machine-independent layer, change the machine-dependent pmap_ts_referenced() so that it updates the page's dirty field if a modified bit is found while counting reference bits. This opportunistic update can be performed at low cost and can eliminate the need for some future calls to pmap_is_modified() by the machine-independent layer. Reviewed by: kib, markj MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7722	2016-09-01 15:57:44 +00:00
bde	aa589d0481	Shorten banal comments about zeroing and copying pages. Don't give implementation details that last echoed the code 15-20 years ago. But add a detail about pagezero() on i386. Switch from Mach style to BSD style.	2016-08-29 14:38:31 +00:00
bde	eb85aca715	On amd64, declare sse2_pagezero() and start using it again, but only for zeroing pages in idle where nontemporal writes are clearly best. This is almost a no-op since zeroing in idle does nothing good and is off by default. Fix END() statement forgotten in previous commit. Align the loop in sse2_pagezero(). Since it writes to main memory, the loop doesn't have to be very carefully written to keep up. Unrolling it was considered useless or harmful and was not done on i386, but that was too careless. Timing for i386: the loop was not unrolled at all, and moved only 4 bytes/iteration. So on a 2GHz CPU, it needed to run at 2 cycles/iteration to keep up with a memory speed of just 4GB/sec. But when it crossed a 16-byte boundary, on old CPUs it ran at 3 cycles/iteration so it gave a maximum speed of 2.67GB/sec and couldn't even keep up with PC3200 memory. Fix the alignment so that it keeps up with 4GB/sec memory, and unroll once to get nearer to 8GB/sec. Further unrolling might be useless or harmful since it would prevent the loop fitting in 16-bytes. My test system with an old CPU and old DDR1 only needed 5+ GB/sec. My test system with a new CPU and DDR3 doesn't need any changes to keep up ~16GB/sec. Timing for amd64: with 8-byte accesses and newer faster CPUs it is easy to reach 16GB/sec but not so easy to go much faster. The alignment doesn't matter much if the CPU is not very old. The loop was already unrolled 4 times, but needs 32 bytes and uses a fancy method that doesn't work for 2-way unrolling in 16 bytes. Just align it to 32-bytes.	2016-08-29 13:07:21 +00:00
bde	61cf0d838a	Restore the nontemporal pagezero() under the name sse2_pagezero() (the same name as for i386). It is not reconnected yet. Which method is better is too machine-dependent and system-dependent to replace the old method unconditionally.	2016-08-29 06:07:43 +00:00
jhb	731da97ac4	Enable I/O MMU when PCI pass through is first used. Rather than enabling the I/O MMU when the vmm module is loaded, defer initialization until the first attempt to pass a PCI device through to a guest. If the I/O MMU fails to initialize or is not present, then fail the attempt to pass a PCI device through to a guest. The hw.vmm.force_iommu tunable has been removed since the I/O MMU is no longer enabled during boot. However, the I/O MMU support can be disabled by setting the hw.vmm.iommu.enable tunable to 0 to prevent use of the I/O MMU on any systems where it is buggy. Reviewed by: grehan MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D7448	2016-08-26 20:15:22 +00:00
ed	d81be03d3f	Make execution of 32-bit CloudABI executables work on amd64. A nice thing about requiring a vDSO is that it makes it incredibly easy to provide full support for running 32-bit processes on 64-bit systems. Instead of letting the kernel be responsible for composing/decomposing 64-bit arguments across multiple registers/stack slots, all of this can now be done in the vDSO. This means that there is no need to provide duplicate copies of certain system calls, like the sys_lseek() and freebsd32_lseek() we have for COMPAT_FREEBSD32. This change imports a new vDSO from the CloudABI repository that has automatically generated code in it that copies system call arguments into a buffer, padding them to eight bytes and zero-extending any pointers/size_t arguments. After returning from the kernel, it does the inverse: extracting return values, in the process truncating pointers/size_t values to 32 bits. Obtained from: https://github.com/NuxiNL/cloudabi	2016-08-24 10:51:33 +00:00
ed	c0aa6fd209	Convert pointers obtained from the threadattr_t structure with TO_PTR(). In all of these source files, the userspace pointer size corresponds with the kernelspace pointer size, meaning that casting directly works. As I'm planning on making 32-bit execution on 64-bit systems work as well, use TO_PTR() here as well, so that the changes between source files remain minimal.	2016-08-24 10:13:18 +00:00
jhb	4e659fa057	Fix build for !SMP kernels after the Xen MSIX workaround. Move msix_disable_migration under #ifdef SMP since it doesn't make sense for !SMP kernels. PR: 212014 Reported by: Glyn Grinstead <glyn@grinstead.org> MFC after: 3 days	2016-08-22 21:23:17 +00:00
jhb	e24281ea43	Remove the si(4) driver and sicontrol(8) for Specialix serial cards. The si(4) driver supported multiport serial adapters for ISA, EISA, and PCI buses. This driver does not use bus_space, instead it depends on direct use of the pointer returned by rman_get_virtual(). It is also still locked by Giant and calls for patch testing to convert it to use bus_space were unanswered. Relnotes: yes	2016-08-19 21:14:27 +00:00
kib	e07c032881	MFC r303913: Unconditionally perform checks that FPU region was entered, when #NM exception is caught in kernel mode.	2016-08-17 07:09:22 +00:00
avg	5461062c89	MFC r302835: fix-up for configuration of AMD Family 10h processors borrowed from Linux	2016-08-15 09:04:31 +00:00
kib	04bce34a47	The pmap_delayed_invl_wait() function blocks on turnstile, it does not spin, in the committed version. Remove stray '*' in the text. Sponsored by: The FreeBSD Foundation. MFC after: 3 days	2016-08-11 12:37:11 +00:00
ed	cc2c089a3f	Provide the CloudABI vDSO to its executables. CloudABI executables already provide support for passing in vDSOs. This functionality is used by the emulator for OS X to inject system call handlers. On FreeBSD, we could use it to optimize calls to gettimeofday(), etc. Though I don't have any plans to optimize any system calls right now, let's go ahead and already pass in a vDSO. This will allow us to simplify the executables, as the traditional "syscall" shims can be removed entirely. It also means that we gain more flexibility with regards to adding and removing system calls. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D7438	2016-08-10 21:02:41 +00:00
kib	acae466016	Unconditionally perform checks that FPU region was entered, when #NM exception is caught in kernel mode. There are third-party modules which trigger the issue, and since the problem causes usermode state corruption at least, panic in production kernels as well. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-10 13:44:03 +00:00
jhb	fbbd0c6298	MFC 302181,302635: Disable MSI-X migration on older Xen hypervisors. 302181: Add a tunable to disable migration of MSI-X interrupts. The new 'machdep.disable_msix_migration' tunable can be set to 1 to disable migration of MSI-X interrupts. Xen versions prior to 4.6.0 do not properly handle updates to MSI-X table entries after the initial write. In particular, the operation to unmask a table entry after updating it during migration is not propagated to the "real" table for passthrough devices causing the interrupt to remain masked. At least some systems in EC2 are affected by this bug when using SRIOV. The tunable can be set in loader.conf as a workaround. 302635: xen: automatically disable MSI-X interrupt migration If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and 70a3cb are required on the hypervisor side for this to be fixed, and those are only included in 4.6.0, so stay on the safe side and disable MSI-X interrupt migration on anything older than 4.6.0. It should not cause major performance degradation unless a lot of MSI-X interrupts are allocated.	2016-08-05 17:13:25 +00:00
jhb	1c2702c041	Don't permit mappings of invalid physical addresses on amd64 via /dev/mem. Discussed with: kib	2016-08-04 17:55:23 +00:00
jhb	b1de97ff2b	Correct assertion on vcpuid argument to vm_gpa_hold(). PR: 208168 Submitted by: Dave Cameron <daverabbitz@ihug.co.nz> Reviewed by: grehan MFC after: 1 month	2016-08-03 15:20:10 +00:00
kib	9a5f028012	Merge i386 and amd64 variants of mp_watchdog.c into x86/, there is no difference between files. For pc98, put x86/mp_x86.c into the same place as used by i386 file list. Fix typo in comment. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-03 13:51:53 +00:00
mjg	b584a8e1ae	amd64: implement pagezero using rep stos The current implementation uses non-temporal writes. This turns out to be detrimental to performance if the page is used shortly after, which is the typical case with page faults. Switch to rep stos. Reviewed by: kib MFC after: 1 week	2016-07-31 11:34:08 +00:00
brooks	017f31c108	Don't create pointless backups of generated files in "make sysent". Any sensible workflow will include a revision control system from which to restore the old files if required. In normal usage, developers just have to clean up the mess. Reviewed by: jhb Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D7353	2016-07-28 21:29:04 +00:00
mav	fcb5c9368b	Add more UEFI/e820 memory types from latest specifications. This is only cosmetics. MFC after: 2 weeks	2016-07-24 09:15:11 +00:00
dchagin	98960de4ab	MFC r302517: Fix a copy/paste bug introduced during X86_64 Linuxulator work. FreeBSD supports the NX bit on X86_64 processors out of the box, for i386 emulation use READ_IMPLIES_EXEC flag, introduced in r302515. While here move common part of mmap() and mprotect() code to the files in compat/linux to reduce code duplication between Linuxulators. MFC r302518, r302626: Add linux_mmap.c to the appropriate conf/files.	2016-07-17 15:23:32 +00:00
dchagin	1886392a37	Regen for r302962 (Linux personality), record mergeinfo for r320516.	2016-07-17 15:11:23 +00:00
dchagin	8576a4ebaa	MFC r302515: Implement Linux personality() system call mainly due to READ_IMPLIES_EXEC flag. In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap(). Linux/i386 set this flag automatically if the binary requires executable stack. READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.	2016-07-17 15:07:33 +00:00
mav	480936abd1	Increase number of I/O APIC pins from 24 to 32 to give PCI up to 16 IRQs. Move HPET to the top of the supported 0-31 range. Proposed by: jhb@, grehan@	2016-07-14 14:35:25 +00:00
avg	00ea714475	remove a stray change from r302834 MFC after: 3 weeks X-MFC with: r302834	2016-07-14 11:13:26 +00:00
avg	555e531ac1	fix-up for configuration of AMD Family 10h processors borrowed from Linux http://lxr.free-electrons.com/source/arch/x86/kernel/cpu/amd.c#L643 BIOS may configure Family 10h processors to convert WC+ cache type to CD. That can hurt performance of guest VMs using nested paging. Reviewed by: kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D6059	2016-07-14 11:03:05 +00:00
badger	5908cb719e	Add explicit detection of KVM hypervisor Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use vm_guest in conditionals testing for KVM. Also, fix a conditional checking if we're running in a VM which caught only the generic VM case, but not more specific VMs (KVM, VMWare, etc.). (Spotted by: vangyzen). Differential revision: https://reviews.freebsd.org/D7172 Sponsored by: Dell Inc. Approved by: kib (mentor), vangyzen (mentor) Reviewed by: alc MFC after: 4 weeks	2016-07-13 19:19:18 +00:00
royger	844ce8697a	xen: automatically disable MSI-X interrupt migration If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and 70a3cb are required on the hypervisor side for this to be fixed, and those are only included in 4.6.0, so stay on the safe side and disable MSI-X interrupt migration on anything older than 4.6.0. It should not cause major performance degradation unless a lot of MSI-X interrupts are allocated. Sponsored by: Citrix Systems R&D MFC after: 3 days Reviewed by: jhb Differential revision: https://reviews.freebsd.org/D7148	2016-07-12 08:43:09 +00:00
dchagin	c93d4a7bde	Fix a copy/paste bug introduced during X86_64 Linuxulator work. FreeBSD supports the NX bit on X86_64 processors out of the box, for i386 emulation use READ_IMPLIES_EXEC flag, introduced in r302515. While here move common part of mmap() and mprotect() code to the files in compat/linux to reduce code duplication between Linuxulators. Reported by: Johannes Jost Meixner, Shawn Webb MFC after: 1 week XMFC with: r302515, r302516	2016-07-10 08:22:04 +00:00
dchagin	7acd3da18d	Regen for r302215 (Linux personality).	2016-07-10 08:17:16 +00:00
dchagin	50efd461d3	Implement Linux personality() system call mainly due to READ_IMPLIES_EXEC flag. In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap(). Linux/i386 set this flag automatically if the binary requires executable stack. READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.	2016-07-10 08:15:50 +00:00
ed	887bfdc0a4	Don't forget to set sa->narg for CloudABI system calls. It turns out that this value is not used within the system call code under normal conditions, except when using tracing tools like ktrace. If we forget to set this value, it is set to random garbage. This may cause ktrace to hang indefinitely, making it impossible to kill. Reported by: Michael Plass PR: 210800 MFC before: 11.0-RELEASE	2016-07-08 20:09:21 +00:00
nwhitehorn	89d01c24d1	Replace a number of conflations of mp_ncpus and mp_maxid with either mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try, for example, to run tasks on CPUs that did not exist or to allocate too few buffers on systems with sparse CPU IDs in which there are holes in the range and mp_maxid > mp_ncpus. Such circumstances generally occur on systems with SMT, but on which SMT is disabled. This patch restores system operation at least on POWER8 systems configured in this way. There are a number of other places in the kernel with potential problems in these situations, but where sparse CPU IDs are not currently known to occur, mostly in the ARM machine-dependent code. These will be fixed in a follow-up commit after the stable/11 branch. PR: kern/210106 Reviewed by: jhb Approved by: re (glebius)	2016-07-06 14:09:49 +00:00
sephe	ad3696213e	MFC 301015 hyperv/vmbus: Rename ISR functions MFC after: 1 week Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6601	2016-06-24 01:20:33 +00:00
sephe	ee9748f41f	MFC 299912 atomic: Add testandclear on i386/amd64 Reviewed by: kib Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6381	2016-06-23 02:21:37 +00:00
sephe	29d5b44ca0	MFC 297931,298022 297931 Expose doreti as a global symbol on amd64 and i386. doreti provides the common code path for returning from interrupt handlers on x86. Exposing doreti as a global symbol allows kernel modules to include low-level interrupt handlers instead of requiring all low-level handlers to be statically compiled into the kernel. Submitted by: Howard Su <howard0su@gmail.com> Reviewed by: kib 298022 hyperv: Deprecate HYPERV option by moving Hyper-V IDT vector into vmbus Submitted by: Jun Su <junsu microsoft com> Reviewed by: jhb, kib, sephe Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D5910	2016-06-21 04:51:55 +00:00
kib	25fbd81576	MFC r301853: Do not access pv_table array for fictitious pages.	2016-06-20 09:15:03 +00:00
kib	496a3b1f65	Update comments for the MD functions managing contexts for new threads, to make it less confusing and using modern kernel terms. Rename the functions to reflect current use of the functions, instead of the historic KSE conventions: cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads) cpu_set_upcall -> cpu_copy_thread (for forks) cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation) Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs) Differential revision: https://reviews.freebsd.org/D6731	2016-06-16 12:05:44 +00:00
kib	7e7c56668a	Do not access pv_table array for fictitious pages, since the array does not cover the dynamically registered fictitious ranges, and fictitious pages mappings are not promoted. Offer a dummy struct md_page to fetch constant superpage pv list generation to satisfy logic. Also, by initializing the pv_dummy pv_list to empty, we can remove several explicit PG_FICTITIOUS tests. Reported and tested by: Michael Butler <imb@protected-networks.net> (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D6728 Approved by: re (hrs)	2016-06-13 03:45:08 +00:00
kib	9be8cdbf64	MFC r301457: Avoid spurious EINVAL in amd64 pmap_change_attr().	2016-06-12 02:42:08 +00:00
kib	d80c39f48c	Avoid spurious EINVAL in amd64 pmap_change_attr(). Do not try to change attributes for DMAP when working on a mapping which is not covered by the DMAP. This was reported on real system where a BAR of a device (NTB) was mapped outside the PCI window. Reported and tested by: mav Reviewed by: jhb, mav Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D6668	2016-06-05 17:11:23 +00:00
dchagin	74905f75d4	MFC r300415: Add macro to convert errno and use it when appropriate.	2016-06-05 07:34:10 +00:00
dchagin	a24d09886b	MFC r300359, r300360: Correct an argument param of linux_sched_* system calls as a struct l_sched_param is not defined due to its nature.	2016-06-05 05:49:33 +00:00
kib	b049cb19c0	In pmap_advise(), avoid leaking DI start for EPT pmaps which needs A/D emulation. Assert that syscalls do not leak DI. Reported by: gjb Sponsored by: The FreeBSD Foundation	2016-05-27 18:45:11 +00:00
jkim	45ae491494	Both Clang and GCC cannot generate efficient reserve_pv_entries(). http://docs.freebsd.org/cgi/mid.cgi?552BFEB2.8040407 Re-implement it entirely in inline assembly not to let compilers do silly spilling to memory. For non-POPCNT case, use newly added bit_count(3). Reported by: alc Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D6541	2016-05-25 23:06:52 +00:00
jkim	63911f4577	Document POPCNT erratum for 6th Generation Intel Core processors.	2016-05-23 23:00:47 +00:00
dchagin	bb2ef25b41	MFC r299249: Add a forgotten in r283424 .eh_frame section with CFI & FDE records to allow stack unwinding through signal handler.	2016-05-23 05:31:53 +00:00
kib	00696f008c	MFC r300305, r300332: Check for overflow and return EINVAL if detected. Use unsigned index.	2016-05-23 00:58:52 +00:00
dchagin	791b4b1122	Add macro to convert errno and use it when appropriate. MFC after: 1 week	2016-05-22 12:46:34 +00:00
dchagin	d8b7958da0	Regen after r300359 (struct l_sched_param removal). MFC after: 1 week	2016-05-21 08:03:13 +00:00
dchagin	4f09fcdf95	Correct an argument param of linux_sched_* system calls as a struct l_sched_param is not defined due to its nature. MFC after: 1 week	2016-05-21 08:01:14 +00:00
kib	515614230c	Check for overflow and return EINVAL if detected. Backport this and r300305 to i386. PR: 209661 Reported and reviewed by: cturt Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-05-20 19:50:32 +00:00
kib	b357843884	Use unsigned type for the loop index to make overflow checks effective. PR: 209661 Reported by: cturt Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-05-20 15:32:48 +00:00
eadler	156fd4834a	Don't repeat the the word 'the' (one manual change to fix grammar) Confirmed With: db Approved by: secteam (not really, but this is a comment typo fix)	2016-05-17 12:52:31 +00:00
avg	3a90d30e3f	MFC r298737: fix up r300036	2016-05-17 08:36:54 +00:00
avg	cfc0337581	MFC r298736: ensure that initial local apic id is sane on AMD 10h systems	2016-05-17 08:33:40 +00:00
kib	60a7096da4	MFC r299350: Add locking annotations to amd64 struct md_page members.	2016-05-17 07:55:49 +00:00
sephe	6babf96582	atomic: Add testandclear on i386/amd64 Reviewed by: kib Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6381	2016-05-16 07:19:33 +00:00
kib	b4ddfd6167	Eliminate pvh_global_lock from the amd64 pmap. The only current purpose of the pvh lock was explained there On Wed, Jan 09, 2013 at 11:46:13PM -0600, Alan Cox wrote: > Let me lay out one example for you in detail. Suppose that we have > three processors and two of these processors are actively using the same > pmap. Now, one of the two processors sharing the pmap performs a > pmap_remove(). Suppose that one of the removed mappings is to a > physical page P. Moreover, suppose that the other processor sharing > that pmap has this mapping cached with write access in its TLB. Here's > where the trouble might begin. As you might expect, the processor > performing the pmap_remove() will acquire the fine-grained lock on the > PV list for page P before destroying the mapping to page P. Moreover, > this processor will ensure that the vm_page's dirty field is updated > before releasing that PV list lock. However, the TLB shootdown for this > mapping may not be initiated until after the PV list lock is released. > The processor performing the pmap_remove() is not problematic, because > the code being executed by that processor won't presume that the mapping > is destroyed until the TLB shootdown has completed and pmap_remove() has > returned. However, the other processor sharing the pmap could be > problematic. Specifically, suppose that the third processor is > executing the page daemon and concurrently trying to reclaim page P. > This processor performs a pmap_remove_all() on page P in preparation for > reclaiming the page. At this instant, the PV list for page P may > already be empty but our second processor still has a stale TLB entry > mapping page P. So, changes might still occur to the page after the > page daemon believes that all mappings have been destroyed. (If the PV > entry had still existed, then the pmap lock would have ensured that the > TLB shootdown completed before the pmap_remove_all() finished.) 
Note, > however, the page daemon will know that the page is dirty. It can't > possibly mistake a dirty page for a clean one. However, without the > current pvh global locking, I don't think anything is stopping the page > daemon from starting the laundering process before the TLB shootdown has > completed. > > I believe that a similar example could be constructed with a clean page > P' and a stale read-only TLB entry. In this case, the page P' could be > "cached" in the cache/free queues and recycled before the stale TLB > entry is flushed. TLBs for addresses with updated PTEs are always flushed before pmap lock is unlocked. On the other hand, amd64 pmap code does not always flush TLBs before PV list locks are unlocked, if previously PTEs were cleared and PV entries removed. To handle the situations where a thread might notice empty PV list but third thread still having access to the page due to TLB invalidation not finished yet, introduce delayed invalidation. Comparing with the pvh_global_lock, DI does not block entered thread when pmap_remove_all() or pmap_remove_write() (callers of pmap_delayed_invl_wait()) are executed in parallel. But _invl_wait() callers are blocked until all previously noted DI blocks are left, thus ensuring that necessary TLB invalidations were performed before returning from pmap_remove_all() or pmap_remove_write(). See comments for detailed description of the mechanism, and also for the explanations why several pmap methods, most important pmap_enter(), do not need DI protection. Reviewed by: alc, jhb (turnstile KPI usage) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D5747	2016-05-14 23:35:11 +00:00
alc	dcd0f3bb88	Eliminate an unused #include. For a brief period of time, _unrhdr.h was used to implement PCID support on amd64. Reviewed by: kib	2016-05-13 20:14:41 +00:00
kib	05241d701e	Add locking annotations to amd64 struct md_page members. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-10 09:58:51 +00:00
jhb	6bae79f884	Add a new bus method to fetch device-specific CPU sets. bus_get_cpus() returns a specified set of CPUs for a device. It accepts an enum for the second parameter that indicates the type of cpuset to request. Currently two values are supported: - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the device when DEVICE_NUMA is enabled) - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core) For systems that do not support NUMA (or if it is not enabled in the kernel config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus' by default. The idea is that INTR_CPUS should always return a valid set. Device drivers which want to use per-CPU interrupts should start using INTR_CPUS instead of simply assigning interrupts to all available CPUs. In the future we may wish to add tunables to control the policy of INTR_CPUS (e.g. should it be local-only or global, should it ignore SMT threads or not). The x86 nexus driver exposes the internal set of interrupt CPUs from the x86 interrupt code via INTR_CPUS. The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also AND the global INTR_CPUS set from the nexus driver with the per-domain set from _PXM to generate a local INTR_CPUS set for child devices. Compared to the r298933, this version uses 'struct _cpuset' in <sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h> (<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though <sys/_bitset.h> does not after recent changes).	2016-05-09 20:50:21 +00:00
dchagin	c62a8ee2d0	Add a forgotten in r283424 .eh_frame section with CFI & FDE records to allow stack unwinding through signal handler. Reported by: Dmitry Sivachenko MFC after: 2 weeks	2016-05-09 07:38:47 +00:00
jhb	eb663acb54	Native PCI-express HotPlug support. PCI-express HotPlug support is implemented via bits in the slot registers of the PCI-express capability of the downstream port along with an interrupt that triggers when bits in the slot status register change. This is implemented for FreeBSD by adding HotPlug support to the PCI-PCI bridge driver which attaches to the virtual PCI-PCI bridges representing downstream ports on HotPlug slots. The PCI-PCI bridge driver registers an interrupt handler to receive HotPlug events. It also uses the slot registers to determine the current HotPlug state and drive an internal HotPlug state machine. For simplicity of implementation, the PCI-PCI bridge device detaches and deletes the child PCI device when a card is removed from a slot and creates and attaches a PCI child device when a card is inserted into the slot. The PCI-PCI bridge driver provides a bus_child_present which claims that child devices are present on HotPlug-capable slots only when a card is inserted. Rather than requiring a timeout in the RC for config accesses to not-present children, the pcib_read/write_config methods fail all requests when a card is not present (or not yet ready). These changes include support for various optional HotPlug capabilities such as a power controller, mechanical latch, electro-mechanical interlock, indicators, and an attention button. It also includes support for devices which require waiting for command completion events before initiating a subsequent HotPlug command. However, it has only been tested on ExpressCard systems which support surprise removal and have none of these optional capabilities. PCI-express HotPlug support is conditional on the PCI_HP option which is enabled by default on arm64, x86, and powerpc. Reviewed by: adrian, imp, vangyzen (older versions) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D6136	2016-05-05 22:26:23 +00:00
alc	52f3fcfa90	Explain why pmap_copy(), pmap_enter_pde(), and pmap_enter_quick_locked() call pmap_invalidate_page() even though they are not destroying a leaf-level page table entry. Eliminate some bogus white-space characters in a comment. Reviewed by: kib	2016-05-04 17:54:13 +00:00
avg	ecc2c61c90	MFC r297857: re-enable AMD Topology extension on certain models if disabled by BIOS	2016-05-04 11:53:30 +00:00
pfg	7f85f79cce	sys/amd64: Small spelling fixes. No functional change.	2016-05-03 22:13:04 +00:00
pfg	826c10b2f3	vmm(4): Small spelling fixes. Reviewed by: grehan	2016-05-03 22:07:18 +00:00
mav	896cbb26a0	MFC r297243: Polish wbwd(4) driver and add more supported chips.	2016-05-03 07:48:52 +00:00
jhb	c71e075efb	Revert bus_get_cpus() for now. I really thought I had run this through the tinderbox before committing, but many places need <sys/types.h> -> <sys/param.h> for <sys/bus.h> now.	2016-05-03 01:17:40 +00:00
jhb	2da46e01a0	Add a new bus method to fetch device-specific CPU sets. bus_get_cpus() returns a specified set of CPUs for a device. It accepts an enum for the second parameter that indicates the type of cpuset to request. Currently two values are supported: - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the device when DEVICE_NUMA is enabled) - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core) For systems that do not support NUMA (or if it is not enabled in the kernel config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus' by default. The idea is that INTR_CPUS should always return a valid set. Device drivers which want to use per-CPU interrupts should start using INTR_CPUS instead of simply assigning interrupts to all available CPUs. In the future we may wish to add tunables to control the policy of INTR_CPUS (e.g. should it be local-only or global, should it ignore SMT threads or not). The x86 nexus driver exposes the internal set of interrupt CPUs from the x86 interrupt code via INTR_CPUS. The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also AND the global INTR_CPUS set from the nexus driver with the per-domain set from _PXM to generate a local INTR_CPUS set for child devices. Reviewed by: wblock (manpage) Differential Revision: https://reviews.freebsd.org/D5519	2016-05-02 18:00:38 +00:00
jhb	050f1049b2	Move 'device pci' for the PCI bus driver to the MI NOTES file. The PCI bus was already listed in all of the MD NOTES files and the driver should at least compile on all platforms.	2016-04-29 23:53:55 +00:00
avg	08fbecbeed	fix missing variable in r298736 Pointyhat to: avg Reported by: Ivan Klymenko <fidaj@ukr.net> MFC after: 2 weeks X-MFC with: r298736	2016-04-28 09:40:24 +00:00
avg	f68c6e4879	ensure that initial local apic id is sane on AMD 10h systems Summary: The Initial Local APIC ID is returned by CPUID function 1 (in EBX). On AMD Family 10h systems the way that ID is built is controlled by an MSR bit (InitApicIdCpuIdLo). BKDG instructs BIOS to set it in a certain way, but a BIOS can be buggy. In that case the ID can confuse tools that use it, e.g. hwloc. For example, on a system that I own real Local APIC IDs are configured as 0, 1, 2, 3, but IDs reported via CPUID.1 are 0, 0x40, 0x80, 0xc0. See: https://github.com/open-mpi/hwloc/issues/183 Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D6060	2016-04-28 08:29:57 +00:00
pfg	1df61c536d	MFC r298482: Cleanup redundant parenthesis from existing howmany()/roundup() macro uses. Requested by: dchagin	2016-04-26 17:39:54 +00:00
avg	d401a99e64	MFC r297846: [amd64] dtrace_invop handler is to be called only for kernel exceptions	2016-04-26 07:40:07 +00:00
cem	241a3b76d8	AMD64 pmap: Use howmany() macro Use param.h howmany() instead of hand-rolled version. Sponsored by: EMC / Isilon Storage Division	2016-04-24 21:35:01 +00:00
pfg	b4106812fd	Cleanup redundant parenthesis from existing howmany()/roundup() macro uses.	2016-04-22 16:57:42 +00:00
pfg	729533413f	sys: use our roundup2/rounddown2() macros when param.h is available. rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement. This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.	2016-04-21 19:57:40 +00:00
pfg	be4082c832	X86: use our nitems() macro when it is available through param.h. No functional change, only trivial cases are done in this sweep. Discussed in: freebsd-current	2016-04-19 23:41:46 +00:00
cem	98188ed5c2	Add 4Kn kernel dump support (And 4Kn minidump support, but only for amd64.) Make sure all I/O to the dump device is of the native sector size. To that end, we keep a native sector sized buffer associated with dump devices (di->blockbuf) and use it to pad smaller objects as needed (e.g. kerneldumpheader). Add dump_write_pad() as a convenience API to dump smaller objects with zero padding. (Rather than pull in NPM leftpad, we wrote our own.) Savecore(1) has been updated to deal with these dumps. The format for 512-byte sector dumps should remain backwards compatible. Minidumps for other architectures are left as an exercise for the reader. PR: 194279 Submitted by: ambrisko@ Reviewed by: cem (earlier version), rpokala Tested by: rpokala (4Kn/512 except 512 fulldump), cem (512 fulldump) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D5848	2016-04-15 17:45:12 +00:00
sephe	3d59317312	hyperv: Deprecate HYPERV option by moving Hyper-V IDT vector into vmbus Submitted by: Jun Su <junsu microsoft com> Reviewed by: jhb, kib, sephe Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D5910	2016-04-15 02:20:18 +00:00
jhb	b5f76666d8	Expose doreti as a global symbol on amd64 and i386. doreti provides the common code path for returning from interrupt handlers on x86. Exposing doreti as a global symbol allows kernel modules to include low-level interrupt handlers instead of requiring all low-level handlers to be statically compiled into the kernel. Submitted by: Howard Su <howard0su@gmail.com> Reviewed by: kib	2016-04-13 17:37:31 +00:00
jhb	2ce9aa06e4	Enable DEVICE_NUMA with up to 8 domains by default on amd64. 8 memory domains should handle a quad-socket board with dual-domain processors. Reviewed by: kib Relnotes: maybe? Differential Revision: https://reviews.freebsd.org/D5893	2016-04-12 21:23:44 +00:00
avg	f7d20d3734	re-enable AMD Topology extension on certain models if disabled by BIOS Some BIOSes disable AMD Topology extension on AMD Family 15h notebook processors. We re-enable the extension, so that we can properly discover core and cache topology. Linux seems to do the same. Reported by: Johannes Dieterich <dieterich.joh@gmail.com> Reviewed by: jhb, kib Tested by: Johannes Dieterich <dieterich.joh@gmail.com> (earlier version) MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D5883	2016-04-12 13:30:39 +00:00
avg	73eedd1d08	[amd64] dtrace_invop handler is to be called only for kernel exceptions DTrace-related exceptions in userland code are handled elsewhere. One practical problem was a crash in dtrace_invop_start() when saved %rsp pointed to a virtual address that was not backed. i386 code already ignored userland exceptions. Reviewed by: markj, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D5906	2016-04-12 06:46:54 +00:00
anish	3d3fd1fdc9	Allow guest writes to AMD microcode update[0xc0010020] MSR without updating actual hardware MSR. This allows guest microcode update to go through, which was otherwise failing because wrmsr() was returning EINVAL. Submitted by: Yamagi Burmeister Approved by: grehan MFC after: 2 weeks	2016-04-11 05:09:43 +00:00
hselasky	28b34f7466	MFC r294526: Add missing atomic wrapper macro. Reviewed by: alfred @ Sponsored by: Mellanox Technologies	2016-04-07 07:21:27 +00:00
ed	e55c02e6f8	Make CloudABI's way of doing TLS more friendly to userspace emulators. We're currently seeing how hard it would be to run CloudABI binaries on operating systems that cannot be modified easily (Windows, Mac OS X). The idea is that we want to just run them without any sandboxing. Now that CloudABI executables are PIE, this is already a bit easier, but TLS is still problematic: - CloudABI executables want to write to the %fs, which typically requires extra system calls by the emulator every time it needs to switch between CloudABI's and its own TLS. - If CloudABI executables overwrite the %fs base unconditionally, it also becomes harder for the emulator to store a backup of the old value of %fs. To solve this, let's no longer overwrite %fs, but just %fs:0. As CloudABI's C library does not use a TCB, this space can now be used by an emulator to keep track of its internal state. The executable can now safely overwrite %fs:0, as long as it makes sure that the TCB is copied over to the new TLS area. Ensure that there is an initial TLS area set up when the process starts, only containing a bogus TCB. We don't really care about its contents on FreeBSD. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D5836	2016-04-06 11:11:31 +00:00
bapt	577607dffc	Add kern.features flags for linux and linux64 modules kern.features.linux: 1 meaning linux 32 bits binaries are supported kern.features.linux64: 1 meaning linux 64 bits binaries are supported The goal here is to help 3rd party applications (including ports) to determine if the host supports linux emulation Reviewed by: dchagin MFC after: 1 week Relnotes: yes Differential Revision: D5830	2016-04-05 22:36:48 +00:00
jhb	ff4b317e50	Move i386/i386/autoconf.c to sys/x86/x86 and use it on both amd64 and i386.	2016-04-03 23:03:54 +00:00
ed	910e4d679c	Make Position Independent Executables work for CloudABI. - Set BI_CAN_EXEC_DYN, so we can execute ET_DYN ELF files in addition to regular ET_EXECs. - Provide an AT_BASE entry in the auxiliary vector, so the executable knows at which address it got loaded and can apply relocations.	2016-03-31 18:52:00 +00:00
kib	eb986c64f5	Type of the interrupt handlers on x86 cannot be expressed in C. Simplify and unify placeholder type definitions. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D5771	2016-03-29 19:56:48 +00:00
dchagin	a5f7ea1073	Revert r297310 as the SOL_XXX are equal to the IPPROTO_XX except SOL_SOCKET. Pointed out by: ae@	2016-03-27 10:09:10 +00:00
dchagin	22c1ebea21	Convert Linux SOL_IPV6 level. MFC after: 1 week	2016-03-27 08:12:01 +00:00
dchagin	dfb7b90783	MFC r297062: Regen for r297061 (fstatfs64 Linux syscall).	2016-03-27 06:17:19 +00:00
dchagin	ce9255cc9f	MFC r297061; Implement fstatfs64 system call. PR: 181012 Submitted by: John Wehle	2016-03-27 06:10:51 +00:00
mav	2e42421c8b	Polish wbwd(4) driver and add more supported chips. MFC after: 1 month	2016-03-24 20:52:35 +00:00
jhb	0566758cff	Enable interrupts on the BSP once all PICs are initialized. This moves the enabling of interrupts slightly earlier (the old location was still before devices were enumerated and probed) and does it in the interrupt code (rather than in the device configuration code). This also avoids tripping over an assertion on the first TLB shootdown with earlier AP startup. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D5710	2016-03-24 00:24:07 +00:00
dchagin	26a09ec651	Regen for r297061 (fstatfs64 Linux syscall). MFC after: 1 week	2016-03-20 13:23:01 +00:00
dchagin	6aee9bb2b4	Implement fstatfs64 system call. PR: 181012 Submitted by: John Wehle MFC after: 1 week	2016-03-20 13:21:20 +00:00
glebius	4c0b9655c9	Merge r296956: Due to invalid use of a signed intermediate value in the bounds checking during argument validity verification, unbound zero'ing of the process LDT and adjacent memory can be initiated from usermode. Submitted by: CORE Security Patch by: kib Security: SA-16:15	2016-03-16 22:35:55 +00:00
glebius	ecffce941a	Due to invalid use of a signed intermediate value in the bounds checking during argument validity verification, unbound zero'ing of the process LDT and adjacent memory can be initiated from usermode. Submitted by: CORE Security Patch by: kib Security: SA-16:15	2016-03-16 22:33:12 +00:00
kib	6f44a5bccb	MFC r296908: Force the desired alignment of the user save area.	2016-03-16 16:42:01 +00:00
kib	3d336a82bd	The PKRU state size is 4 bytes, its support makes the XSAVE area size non-multiple of 64 bytes. Thereafter, the user state save area is misaligned, which triggers assertion in the debugging kernels, or segmentation violation on accesses for non-debugging configs. Force the desired alignment of the user save area as the fix (workaround is to disable bit 9 in the hw.xsave_mask loader tunable). This correction is required for booting on the upcoming Intel Purley platform. Reported and tested by: "Pieper, Jeffrey E" <jeffrey.e.pieper@intel.com>, jimharris Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-03-15 15:42:53 +00:00
jhb	e641458f70	Fix reporting of the CloudABI ABI in kdump. - Advertise the word size for CloudABI ABIs via the SV_LP64 flag. All of the other ABIs include either SV_ILP32 or SV_LP64. - Fix kdump to not assume a 32-bit ABI if the ABI flags field is non-zero but SV_LP64 isn't set. Instead, only assume a 32-bit ABI if SV_ILP32 is set and fallback to the unknown value of "00" if neither SV_LP64 nor SV_ILP32 is set. Reviewed by: kib, ed Differential Revision: https://reviews.freebsd.org/D5560	2016-03-09 18:38:30 +00:00
kib	ad5b5f3479	MFC r295966: Return dst as the result from memcpy(9) on amd64. PR: 207422	2016-03-09 10:21:13 +00:00
marcel	b1abdb4699	Bump VM_MAX_MEMSEGS from 2 to 3 to match the number of VM segment identifiers present in vmmapi.h. In particular, it's now possible to create a VM_FRAMEBUFFER segment.	2016-02-26 16:18:47 +00:00
kib	2d14e03626	Return dst as the result from memcpy(9) on amd64. PR: 207422 MFC after: 1 week	2016-02-24 11:58:15 +00:00
skra	bad1d5e697	As <machine/vm.h> is included from <vm/vm.h>, there is no need to include it explicitly when <vm/vm.h> is already included. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D5380	2016-02-22 09:10:23 +00:00
skra	ad68cf93b1	As <machine/vmparam.h> is included from <vm/vm_param.h>, there is no need to include it explicitly when <vm/vm_param.h> is already included. Suggested by: alc Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D5379	2016-02-22 09:08:04 +00:00
skra	f4b6499ab5	As <machine/pmap.h> is included from <vm/pmap.h>, there is no need to include it explicitly when <vm/pmap.h> is already included. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D5373	2016-02-22 09:02:20 +00:00
glebius	b3c4f0ddbf	Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a requirement for uma_int.h. Suggested by: jhb	2016-02-09 20:22:35 +00:00
glebius	953ea03018	Redo r292484. Embed task(9) into zone, so that uz_maxaction is called in a context that can sleep, allowing consumers of the KPI to run their drain routines without any extra measures. Discussed with: jtl	2016-02-03 23:30:17 +00:00
kib	bac8688b17	MFC r294311: Clear whole XMM register file instead of only XMM0. Also clear x87 registers. This brings amd64 on par with i386, providing consistent initial FPU state. PR: 206370 MFC r294312: Use ANSI definitions. Wrap long line. MFC r294313: Adjust i386 comment to match amd64 one after r294311. Approved by: re (gjb)	2016-02-02 14:16:07 +00:00
grehan	83c1d10f0c	MFC r284539, r284630, r284688, r284877, r285217, r285218, r286837, r286838, r288470, r288522, r288524, r288826, r289001 Pull in bhyve bug fixes and changes to allow UEFI booting. This provides Windows support. Tested on Intel and AMD with: - Arch Linux i386+amd64 (kernel 4.3.3) - Ubuntu 15.10 server 64-bit - FreeBSD-CURRENT/amd64 20160127 snap - FreeBSD 10.2 i386+amd64 - OpenBSD 5.8 i386+amd64 - SmartOS latest - Windows 10 build 1511 Huge thanks to Yamagi Burmeister who submitted the patch and did the majority of the testing. r284539 - bootrom mem allocation support r284630 - Add SO_REUSEADDR when starting debug port r284688 - Fix a regression in "movs" emulation r284877 - verify_gla() non-zero segment base fix r285217 - Always assert DCD and DSR in the uart r285218 - devmem nodes moved to /dev/vmm.io/ r286837 - Add define for SATA Check-Power-Mode r286838 - Add simple (no-op) SATA cmd emulations r288470 - Increase virtio-blk indirect descs r288522 - Firmware guest query interface r288524 - Fix post-test typo r288826 - Clean up SATA unimplemented cmd msg r289001 - Add -l option to specify userboot path Submitted by: Yamagi Burmeister Approved by: re (kib)	2016-02-01 14:56:11 +00:00
jhb	169dd4da8e	Convert ss_sp in stack_t and sigstack to void *. POSIX requires these members to be of type void * rather than the char * inherited from 4BSD. NetBSD and OpenBSD both changed their fields to void * back in 1998. No new build failures were reported via an exp-run. PR: 206503 (exp-run) Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D5092	2016-01-27 17:55:01 +00:00
delphij	00bfa82a7b	MFC r294900: Implement AT_SECURE properly. AT_SECURE auxv entry has been added to the Linux 2.5 kernel to pass a boolean flag indicating whether secure mode should be enabled. 1 means that the program has changed its credentials during the execution. Once exported, AT_SECURE is used by the glibc issetugid() call. Submitted by: imp, dchagin Security: FreeBSD-SA-16:10.linux Security: CVE-2016-1883	2016-01-27 07:28:55 +00:00
delphij	c07f0f872d	Implement AT_SECURE properly. AT_SECURE auxv entry has been added to the Linux 2.5 kernel to pass a boolean flag indicating whether secure mode should be enabled. 1 means that the program has changed its credentials during the execution. Once exported, AT_SECURE is used by the glibc issetugid() call. Submitted by: imp, dchagin Security: FreeBSD-SA-16:10.linux Security: CVE-2016-1883	2016-01-27 07:20:55 +00:00
dchagin	1b8c6467ee	MFC r294620: Fix a typo. MFC r294621: Remove obsolete comment.	2016-01-26 06:05:55 +00:00
ian	33902405d5	MFC r293045, r293046: Make the 'env' directive described in config(5) work on all architectures, providing compiled-in static environment data that is used instead of any data passed in from a boot loader. Previously 'env' worked only on i386 and arm xscale systems, because it required the MD startup code to examine the global envmode variable and decide whether to use static_env or an environment obtained from the boot loader, and set the global kern_envp accordingly. Most startup code wasn't doing so. Making things even more complex, some mips startup code uses an alternate scheme that involves calling init_static_kenv() to pass an empty buffer and its size, then uses a series of kern_setenv() calls to populate that buffer. Now all MD startup code calls init_static_kenv(), and that routine provides a single point where envmode is checked and the decision is made whether to use the compiled-in static_kenv or the values provided by the MD code. The routine also continues to serve its original purpose for mips; if a non-zero buffer size is passed the routine installs the empty buffer ready to accept kern_setenv() values. Now if the size is zero, the provided buffer full of existing env data is installed. A NULL pointer can be passed if the boot loader provides no env data; this allows the static env to be installed if envmode is set to do so. Most of the work here is a near-mechanical change to call the init function instead of directly setting kern_envp. A notable exception is in xen/pv.c; that code was originally installing a buffer full of preformatted env data along with its non-zero size (like mips code does), which would have allowed kern_setenv() calls to wipe out the preformatted data. Now it passes a zero for the size so that the buffer of data it installs is treated as non-writeable. Also, revert accidental change that snuck into r293045.	2016-01-24 21:04:06 +00:00
dchagin	1fea9511b6	Remove obsolete comment. MFC after: 3 days	2016-01-23 08:08:06 +00:00
dchagin	4258ee00b5	Fix a typo. MFC after: 3 days	2016-01-23 08:04:29 +00:00
hselasky	3e2da6e430	Add missing atomic wrapper macro. Reviewed by: alfred @ Sponsored by: Mellanox Technologies MFC after: 1 week	2016-01-21 18:22:50 +00:00
jhb	24f353ebf7	Regen for r294368.	2016-01-20 01:11:01 +00:00
jhb	77733541d4	MFC 289769,289822,290143,290144: Rename remaining linux32 symbols from linux_* to linux32_*. 289769: Rename remaining linux32 symbols such as linux_sysent[] and linux_syscallnames[] from linux_* to linux32_* to avoid conflicts with linux64.ko. While here, add support for linux64 binaries to systrace. - Update NOPROTO entries in amd64/linux/syscalls.master to match the main table to fix systrace build. - Add a special case for union l_semun arguments to the systrace generation. - The systrace_linux32 module now only builds the systrace_linux32.ko module on amd64. - Add a new systrace_linux module that builds on both i386 and amd64. For i386 it builds the existing systrace_linux.ko. For amd64 it builds a systrace_linux.ko for 64-bit binaries. 289822: Fix build for the KTR-enabled kernels. 290143: Fix build with DEBUG defined. 290144: Update for LINUX32 rename. The assembler didn't complain about undefined symbols but just used 0 after the rename.	2016-01-20 01:09:53 +00:00
kib	63cfb09a9a	Use ANSI definitions. Wrap long line. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-01-19 08:08:08 +00:00
kib	bee088bb00	Clear whole XMM register file instead of only XMM0. Also clear x87 registers. This brings amd64 on par with i386, providing consistent initial FPU state. Note that we do not clear any extended state, at least because kernel does not understand extended state structure and consequences of zero overwrite after fninit()/fpusave(). Submitted by: joss.upton@yahoo.com PR: 206370 MFC after: 2 weeks	2016-01-19 08:04:02 +00:00
jhb	e6d9c6386f	MFC 290728: Export various helper variables describing the layout and size of certain kernel structures for use by debuggers. This mostly aids in examining cores from a kernel without debug symbols as a debugger can infer these values if debug symbols are available. One set of variables describes the layout of 'struct linker_file' to walk the list of loaded kernel modules. A second set of variables describes the layout of 'struct proc' and 'struct thread' to walk the list of processes in the kernel and the threads in each process. The 'pcb_size' variable is used to index into the stoppcbs[] array. The 'vm_maxuser_address' is used to distinguish kernel virtual addresses from user addresses. This doesn't have to be perfect, and 'vm_maxuser_address' is a cheap and simple way to differentiate kernel pointers from simple values like TIDs and PIDs. While here, annotate the fields in struct pcb used by kgdb on amd64 and i386 to note that their ABI should be preserved. Annotations for other platforms will be added in the future.	2016-01-18 18:27:21 +00:00
emaste	800cde159e	MFC r293343: Move amd64 metadata.h to x86 and share with i386	2016-01-18 15:52:07 +00:00
emaste	31c7f199a4	MFC r281381: Use explicitly sized types in EFI module metadata This will allow the same metadata struct to be used on all platforms.	2016-01-18 15:43:00 +00:00
dchagin	ed08737097	MFC r293613: Implement vsyscall hack. Prior to 2.13 glibc uses vsyscall instead of vdso. An upcoming linux_base-c6 needs it.	2016-01-16 07:56:49 +00:00
glebius	f65cb2db64	Regen after r293907.	2016-01-14 10:15:21 +00:00
glebius	d87c627c80	Change linux get_robust_list system call to match actual linux one. The set_robust_list system call requests the kernel to record the head of the list of robust futexes owned by the calling thread. The head argument is the list head to record. The get_robust_list system call should return the head of the robust list of the thread whose thread id is specified in pid argument. The list head should be stored in the location pointed to by head argument. In contrast, our implementation of get_robust_list system call copies the known portion of memory pointed by recorded in set_robust_list system call pointer to the head of the robust list to the location pointed by head argument. So, it is possible for a local attacker to read portions of kernel memory, which may result in a privilege escalation. Submitted by: mjg Security: SA-16:03.linux	2016-01-14 10:13:58 +00:00
glebius	924e9fd65e	o Fix SCTP ICMPv6 error message vulnerability. [SA-16:01.sctp] o Fix Linux compatibility layer incorrect futex handling. [SA-16:03.linux] o Fix Linux compatibility layer setgroups(2) system call. [SA-16:04.linux] o Fix TCP MD5 signature denial of service. [SA-16:05.tcp] o Fix insecure default bsnmpd.conf permissions. [SA-16:06.bsnmpd] Security: FreeBSD-SA-16:01.sctp, CVE-2016-1879 Security: FreeBSD-SA-16:03.linux, CVE-2016-1880 Security: FreeBSD-SA-16:04.linux, CVE-2016-1881 Security: FreeBSD-SA-16:05.tcp, CVE-2016-1882 Security: FreeBSD-SA-16:06.bsnmpd, CVE-2015-5677	2016-01-14 09:11:42 +00:00
jkim	9dcfa1d85c	Remove dead code when the target processor has POPCNT instruction.	2016-01-13 19:19:50 +00:00
dchagin	e706df7b9a	Implement vsyscall hack. Prior to 2.13 glibc uses vsyscall instead of vdso. An upcoming linux_base-c6 needs it. Differential Revision: https://reviews.freebsd.org/D1090 Reviewed by: kib, trasz MFC after: 1 week	2016-01-09 20:18:53 +00:00
dchagin	07e5594c02	MFC r289055 (by mjg@): linux: fix handling of out-of-bounds syscall attempts Due to an off by one the code would read an entry past the table, as opposed to the last entry which contains the nosys handler. This fixes my fault. MFC r289058 (by cem@): Fix missing semi-colon from r289055. MFC r289768 (by jhb@): Merge r289055 to amd64/linux32: linux: fix handling of out-of-bounds syscall attempts Due to an off by one the code would read an entry past the table, as opposed to the last entry which contains the nosys handler.	2016-01-09 18:32:52 +00:00
dchagin	d1e4a825ff	MFC r284159: Futex is an aligned 32-bit integer. Use the proper instruction and operand when dereferencing futex pointer.	2016-01-09 18:19:18 +00:00
dchagin	1fd2c934ac	MFC r283544: When I merged the lemul branch I missed kib@'s r282708 commit. This is not the final fix as I need to properly clean up thread resources before other threads suicide.	2016-01-09 18:07:48 +00:00
dchagin	6470ace45c	Regen for r293592.	2016-01-09 17:56:04 +00:00
dchagin	ddaf8065bb	MFC r283492: Implement Linux specific syncfs() system call.	2016-01-09 17:54:37 +00:00
dchagin	87e0367fbe	Regen for r293588.	2016-01-09 17:51:17 +00:00
dchagin	bbbcfd1903	MFC r283488: Implement recvmmsg() and sendmmsg() system calls.	2016-01-09 17:50:13 +00:00
dchagin	2e5298109d	MFC r283487: Reduce duplication between MD Linux code by moving msg related struct definitions out into the compat/linux/linux_socket.h	2016-01-09 17:49:05 +00:00
dchagin	a803e87674	Regen for r293585.	2016-01-09 17:47:57 +00:00
dchagin	735299091c	MFC r283484: Implement epoll_pwait() system call.	2016-01-09 17:45:02 +00:00
dchagin	e412c865a0	Regen for r293582.	2016-01-09 17:42:25 +00:00
dchagin	1e80f16f0f	MFC r283480: Add utimensat() system call.	2016-01-09 17:41:00 +00:00
dchagin	48b0af056f	MFC r283479: The kernel sends signals to the processes via ABI specific sv_sendsig method. Native ABI does not need signal conversion, only emulators may want this. Usually emulators implement their own sv_sendsig method. For now only the ibcs2 emulator does not have its own sv_sendsig implementation and depends on the native sendsig() method. So, remove any extra attempts to convert signal numbers from native sendsig() methods except from i386 where ibcs2 is living.	2016-01-09 17:39:41 +00:00
dchagin	05243c7228	MFC r283474: Rework signal code to allow using it by other modules, like linprocfs: 1. Linux sigset always 64 bit on all platforms. In order to move Linux sigset code to the linux_common module define it as 64 bit int. Move Linux sigset manipulation routines to the MI path. 2. Move Linux signal number definitions to the MI path. In general, they are the same on all platforms except for a few signals. 3. Map Linux RT signals to the FreeBSD RT signals and hide signal conversion tables to avoid conversion errors. 4. Emulate Linux SIGPWR signal via FreeBSD SIGRTMIN signal which is outside of allowed on Linux signal numbers. PR: 197216	2016-01-09 17:29:08 +00:00
dchagin	858c17f9b3	MFC r283471: According to Linux man sigaltstack(3) shall return EINVAL if the ss argument is not a null pointer, and the ss_flags member pointed to by ss contains flags other than SS_DISABLE. However, in fact, Linux also allows SS_ONSTACK flag which is simply ignored. For buggy apps (at least mono) ignore flags other than SS_DISABLE as Linux does. While here move MI part of sigaltstack code to the appropriate place.	2016-01-09 17:22:51 +00:00
dchagin	1aaf87d264	Regen for r293569.	2016-01-09 17:20:19 +00:00
dchagin	8ab518aec9	MFC r283467: Call nosys in case when the incorrect syscall number is specified. It's my fault, fixed by mjg@ at r289055.	2016-01-09 17:18:03 +00:00
dchagin	c4895a81f6	Regen for r293567.	2016-01-09 17:15:03 +00:00
dchagin	5b01285f9b	MFC r283465: Add preliminary fallocate system call implementation to emulate posix_fallocate() function.	2016-01-09 17:13:43 +00:00
dchagin	09f25351da	Regen for r293555.	2016-01-09 17:00:15 +00:00
dchagin	682bdd605d	MFC r283451: Implement ppoll() system call.	2016-01-09 16:58:57 +00:00
dchagin	9d7b3777ea	MFC r283446: Include opt_compat.h, so that COMPAT_LINUX32 is defined, and we can access to the semop structs and functions.	2016-01-09 16:52:25 +00:00
dchagin	b7022d5321	Regen for r293549.	2016-01-09 16:50:09 +00:00
dchagin	1eeab3feb9	MFC r283444: Implement eventfd system call.	2016-01-09 16:48:50 +00:00
dchagin	623ca98188	MFC r283443: Put the correct value for the abi_nfdbits parameter of kern_select() for all supported Linuxulators.	2016-01-09 16:47:36 +00:00
dchagin	2ce85f55b6	Regen for r293546.	2016-01-09 16:45:54 +00:00
dchagin	ea9daca708	MFC r283441: Implement epoll family system calls. This is a tiny wrapper around kqueue() to implement epoll subset of functionality. The kqueue user data are 32bit on i386 which is not enough for epoll user data, so we keep user data in the proc emuldata. Initial patch developed by rdivacky@ in 2007, then extended by Yuri Victorovich @ r255672 and finished by me in collaboration with mjg@ and jillies@.	2016-01-09 16:44:17 +00:00
dchagin	2205518265	MFC r283437: To avoid code duplication move open/fcntl definitions to the MI header file.	2016-01-09 16:31:10 +00:00
dchagin	f31e70952f	MFC r283436: Use the BSD_TO_LINUX_SIGNAL() wherever there is no need to check the ABI as it is known.	2016-01-09 16:29:51 +00:00
dchagin	a1e3b366c7	MFC r283432: Being exported through vdso the note.Linux section used by glibc to determine the kernel version (this saves one uname call). Temporarily disable the export of a note.Linux section until I figured out how to change the kernel version in the note.Linux on the fly.	2016-01-09 16:25:30 +00:00
dchagin	68ddeae1b4	MFC r283431: Add AT_RANDOM and AT_EXECFN auxiliary vector entries which are used by glibc. At least since glibc version 2.16 using AT_RANDOM is mandatory.	2016-01-09 16:24:30 +00:00
dchagin	6a70519414	Regen for r293533.	2016-01-09 16:23:11 +00:00
dchagin	f186d260e2	MFC r283428: Change linux faccessat syscall definition to match actual linux one. The AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are actually implemented within the glibc wrapper function for faccessat(). If either of these flags are specified, then the wrapper function employs fstatat() to determine access permissions.	2016-01-09 16:21:39 +00:00
dchagin	199f40b151	Regen for r293530.	2016-01-09 16:16:16 +00:00
dchagin	aaac17a4f9	MFC r283424: Add preliminary support for x86-64 Linux binaries.	2016-01-09 16:14:24 +00:00
dchagin	6c12e20ac1	MFC r283422: Refund the proc emuldata struct for future use. For now move flags from thread emuldata to proc emuldata as it was originally intended. As we can have both 64 & 32 bit Linuxulator running any eventhandler can be called twice for us. To prevent this move eventhandlers code from linux_emul.c to the linux_common.ko module.	2016-01-09 16:11:09 +00:00
dchagin	d30e84112a	MFC r283421: Introduce a new module linux_common.ko which is intended for the following primary purposes: 1. Remove the dependency of linsysfs and linprocfs modules from linux.ko, which will be architecture specific on amd64. 2. Incorporate into linux_common.ko general code for platforms on which we'll support two Linuxulator modules (for both instruction set - 32 & 64 bit). 3. Move malloc(9) declaration to linux_common.ko, to enable getting memory usage statistics properly. Currently linux_common.ko incorporates a code from linux_mib.c and linux_util.c and linprocfs, linsysfs and linux kernel modules depend on linux_common.ko. Temporarily remove dtrace garbage from linux_mib.c and linux_util.c	2016-01-09 16:08:22 +00:00
dchagin	793bb9c3f7	MFC r283416: x86_64 Linux does not use multiplexing on ipc system calls. Move struct ipc_perm definition to the MD path as it differs for 64 and 32 bit platforms.	2016-01-09 15:56:01 +00:00
dchagin	f536abaf31	MFC r283411: Remove stale comment about a signal trampoline which is moved to the shared page at r219609.	2016-01-09 15:49:42 +00:00
dchagin	a034df74fd	MFC r283410: Put linux_platform into the vdso to avoid copying it onto the stack at every exec.	2016-01-09 15:48:11 +00:00
dchagin	5c3e282c6e	MFC r283408: Eliminate a now unused global declaration of elf_linux_sysvec.	2016-01-09 15:46:05 +00:00
dchagin	18c1672334	MFC r283407: Implement vdso - virtual dynamic shared object. Through vdso Linux exposes functions from kernel with proper DWARF CFI information so that it becomes easier to unwind through them. Using vdso is mandatory for thread cancellation && cleanup on a modern glibc.	2016-01-09 15:44:38 +00:00
dchagin	2e9cc3f70d	Regen for r293511.	2016-01-09 15:40:44 +00:00
dchagin	4ed27590e5	MFC r283403: Implement pselect6() system call.	2016-01-09 15:39:41 +00:00
dchagin	4992ef5f9d	Regen for r293510.	2016-01-09 15:38:16 +00:00
dchagin	027f6631c0	MFC r283401: Implement prlimit64() system call.	2016-01-09 15:37:10 +00:00
dchagin	a82405c150	Regen for r293508.	2016-01-09 15:35:57 +00:00
dchagin	b4d7be064f	MFC r283399: Implement dup3() system call.	2016-01-09 15:34:54 +00:00
dchagin	e327d1c9cc	Regen for r293505.	2016-01-09 15:32:33 +00:00
dchagin	df59792813	MFC r283396: Implement rt_sigqueueinfo() system call.	2016-01-09 15:31:15 +00:00
dchagin	4e3ae75e5e	Regen for r293503.	2016-01-09 15:29:10 +00:00
dchagin	3c97a00938	MFC r283394: Implement waitid() system call.	2016-01-09 15:28:05 +00:00
dchagin	a14064e328	MFC r283391: To reduce code duplication introduce linux_copyout_rusage() method. Use it in linux_wait4() system call and move linux_wait4() to the MI path. While here add a prototype for the static bsd_to_linux_rusage().	2016-01-09 15:23:54 +00:00
dchagin	2646cf70a0	MFC r283385: Some style(9) && whitespaces fixes. No functional changes.	2016-01-09 15:18:36 +00:00
dchagin	cb3b38d164	MFC r283383: Switch linuxulator to use the native 1:1 threads. The reasons: 1. Get rid of the stubs/quirks with process dethreading, process reparenting when the process group leader exits, and related problems in wait(), waitpid(), etc. 2. Reuse our kernel code instead of writing excessive thread management routines in Linuxulator. Implementation details: 1. The thread is created via kern_thr_new() in the clone() call with the CLONE_THREAD parameter. Thus, everything else is a process. 2. The test that the process has threads is done via the P_HADTHREADS bit of p_flag in struct proc. 3. Per thread emulator state data structure is now located in the struct thread and freed in the thread_dtor() hook. Holding p_mtx is mandatory when referencing emuldata from other threads. 4. PID mangling has changed. Now Linux pid is the native tid and Linux tgid is the native pid, with the exception of the first thread in the process where tid and pid are one and the same. Ugliness: When the Linux thread is the initial thread in the thread group, the thread id is equal to the process id. Glibc depends on this magic (assert in pthread_getattr_np.c). So for system calls that take thread id as a parameter we should use the special method to reference struct thread.	2016-01-09 15:16:13 +00:00
dchagin	2b83b41438	MFC r283382: In preparation for switching linuxulator to use the native 1:1 threads add a hook for cleaning thread resources before the thread dies.	2016-01-09 14:53:08 +00:00
dchagin	c12aa632f3	Regen for r293487.	2016-01-09 14:48:23 +00:00
dchagin	31e61f6749	MFC r283379: Implement a Linux version of sched_getparam() && sched_setparam(). Temporarily use the first thread in proc.	2016-01-09 14:47:08 +00:00
dchagin	65d490113d	MFC r283378: Remove a now unused include.	2016-01-09 14:45:41 +00:00
dchagin	fd9d33be2a	MFC r283374: In preparation for switching linuxulator to use the native 1:1 threads refactor kern_sched_rr_get_interval() and sys_sched_rr_get_interval(). Add a kern_sched_rr_get_interval() counterpart which takes a targettd parameter to allow specifying the target thread directly by callee (new Linuxulator). Linuxulator temporarily uses first thread in proc. Move linux_sched_rr_get_interval() to the MI part.	2016-01-09 14:40:38 +00:00
dchagin	994d3d5889	Regen for r293478.	2016-01-09 14:34:29 +00:00
dchagin	e060fa6fed	MFC r283370: In preparation for switching linuxulator to use the native 1:1 threads introduce linux_exit() stub instead of sys_exit() call (which terminates process). In the new linuxulator exit() system call terminates the calling thread (not a whole process).	2016-01-09 14:33:10 +00:00
dchagin	358125d39c	MFC r283369: In preparation for switching linuxulator to use the native 1:1 threads print the thread id in addition to the pid in debug messages.	2016-01-09 14:31:03 +00:00
emaste	2029b75c0e	Move amd64 metadata.h to x86 and share with i386 MFC after: 1 week	2016-01-07 19:47:26 +00:00
ian	3d96cedc35	Make the 'env' directive described in config(5) work on all architectures, providing compiled-in static environment data that is used instead of any data passed in from a boot loader. Previously 'env' worked only on i386 and arm xscale systems, because it required the MD startup code to examine the global envmode variable and decide whether to use static_env or an environment obtained from the boot loader, and set the global kern_envp accordingly. Most startup code wasn't doing so. Making things even more complex, some mips startup code uses an alternate scheme that involves calling init_static_kenv() to pass an empty buffer and its size, then uses a series of kern_setenv() calls to populate that buffer. Now all MD startup code calls init_static_kenv(), and that routine provides a single point where envmode is checked and the decision is made whether to use the compiled-in static_kenv or the values provided by the MD code. The routine also continues to serve its original purpose for mips; if a non-zero buffer size is passed the routine installs the empty buffer ready to accept kern_setenv() values. Now if the size is zero, the provided buffer full of existing env data is installed. A NULL pointer can be passed if the boot loader provides no env data; this allows the static env to be installed if envmode is set to do so. Most of the work here is a near-mechanical change to call the init function instead of directly setting kern_envp. A notable exception is in xen/pv.c; that code was originally installing a buffer full of preformatted env data along with its non-zero size (like mips code does), which would have allowed kern_setenv() calls to wipe out the preformatted data. Now it passes a zero for the size so that the buffer of data it installs is treated as non-writeable.	2016-01-02 02:53:48 +00:00
jhb	994c23f093	Move shared variables from {amd64,i386}/initcpu.c to x86/identcpu.c. While here, move the common bits of <machine/cputypes.h> to <x86/cputypes.h> as well. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D4670	2015-12-23 21:41:42 +00:00
ngie	3dc5879a19	Remove redundant ctx_switch_xsave declaration in sys/amd64/include/md_var.h This variable was added to sys/x86/include/x86_var.h recently. This unbreaks building kernel source that #includes both md_var.h and x86_var.h with gcc 4.2.1 on amd64 Differential Revision: https://reviews.freebsd.org/D4686 Reviewed by: kib X-MFC with: r291949 Sponsored by: EMC / Isilon Storage Division	2015-12-22 20:08:32 +00:00
dim	7295d680ea	MFC r277735 (by royger): amd64: allow base memory segment to start at address different than 0 Current code requires that the first physical memory segment starts at 0, but this is not really needed. We only need to make sure the bootstrap code and page tables for APs are allocated below 4GB. This patch removes this requirement and allows booting a Dell R710 from UEFI, where the first physical memory segment starts at 0x10000. Sponsored by: Citrix Systems R&D Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D1417	2015-12-21 17:15:03 +00:00
imp	3e2743eaf6	Save the physical address passed into the kernel of the UEFI system table.	2015-12-19 19:01:43 +00:00
kib	9167bbc1e5	MFC r291948: Use ANSI C definition.	2015-12-14 07:54:45 +00:00
kib	f124247e27	Merge common parts of i386 and amd64 md_var.h and smp.h into new headers x86/include x86_var.h and x86_smp.h. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D4358	2015-12-07 17:41:20 +00:00
kib	fcdb3dc23f	Use ANSI C definition. MFC after: 1 week	2015-12-07 17:24:55 +00:00
cem	a15dada94b	pmap_invalidate_range: For very large ranges, flush the whole TLB Typical TLBs have 40-512 entries available. At some point, iterating every single page in a requested invalidation range and issuing invlpg on it is more expensive than flushing the TLB and allowing it to reload on demand. Broadwell CPUs have 1536 L2 TLB entries, so I've picked the arbitrary number 4096 entries as a heuristic at which point we flush TLB rather than invalidating every single potential page. Reviewed by: alc Feedback from: jhb, kib MFC notes: Depends on r291688 Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4280	2015-12-06 17:39:13 +00:00
kib	f741f698b7	For amd64 non-PCID machines, and for i386 machines with support for the PG_G global pte flag, pmap_invalidate_all() fails to flush global TLB entries []. This is because TLB shootdown handler for such configs reloads CR3, and on i386 pmap_invalidate_all() does the same for the initiating CPU. Note that current code does not issue total invalidation requests for the kernel_pmap. Rename amd64 function invltlb_globpcid() to invltlb_glob(), it is not specific for PCID for quite some time, and implement the same functionality for i386. Use the function instead of invltlb() in shootdown handlers and in i386 pmap_invalidate_all(), but only for the kernel pmap (which maps pages with the PG_G attribute set), which takes care of PG_G TLB entries on flush. To detect the affected pmap in i386 TLB shootdown handler, pmap should be passed to the smp_masked_invltlb() function, which makes amd64 and i386 TLB shootdown code almost identical. Merge the code under x86/. Noted by: jhb [] Reviewed by: cem, jhb, pho Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D4346	2015-12-03 11:14:14 +00:00
kib	ee461b4bba	Remove sv_prepsyscall, sv_sigsize and sv_sigtbl members of the struct sysent. sv_prepsyscall is unused. sv_sigsize and sv_sigtbl translate signal number from the FreeBSD namespace into the ABI domain. It is only utilized on i386 for iBCS2 binaries. The issue with this approach is that signals for iBCS2 were delivered with the FreeBSD signal frame layout, which does not follow iBCS2. The same note is true for any other potential user of sv_sigtbl. In other words, if ABI needs signal number translation, it really needs custom sv_sendsig method instead. Sponsored by: The FreeBSD Foundation	2015-11-28 08:49:07 +00:00
emaste	c168857c6a	Fix whitespace on addition of IPSEC option	2015-11-26 21:35:50 +00:00
kib	e0c4faece4	Split kernel timekeep ABI structure vdso_sv_tk out of the struct sysentvec. This allows the timekeep data to be shared between similar ABIs which cannot share sysentvec. Make the timekeep_push_vdso() tick callback to the timekeep structures instead of sysentvecs. If several sysentvec share the vdso_sv_tk structure, we would update the userspace data several times on each tick, without the change. Only allocate vdso_sv_tk in the exec_sysvec_init() sysinit when sysentvec is marked with the new SV_TIMEKEEP flag. This saves allocation and update of unneeded vdso_sv_tk for ABIs which do not provide userspace gettimeofday yet, which are the PowerPC arches right now. Make vdso_sv_tk allocator public, namely split out and export alloc_sv_tk() and alloc_sv_tk_compat32(). ABIs which share timekeep data now can allocate it manually and share as appropriate. Requested by: nwhitehorn Tested by: nwhitehorn, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-11-23 07:09:35 +00:00
markj	3e47d7787e	Remove unneeded includes of opt_kdtrace.h. As of r258541, KDTRACE_HOOKS is defined in opt_global.h, so opt_kdtrace.h is not needed when defining SDT(9) probes.	2015-11-22 02:01:01 +00:00
jhb	f91ade9d24	MFC 284325: Report the values of x86 segment registers to remote debuggers. While here, also report %eflags from the i386 trapframe.	2015-11-13 00:50:34 +00:00
jhb	d10f133720	MFC 285783: Various changes to the registers displayed in DDB for x86. - Fix segment registers to only display the low 16 bits. - Remove unused handlers and entries for the debug registers. - Display xcr0 (if valid) in 'show sysregs'. - Add '0x' prefix to MSR values to match other values in 'show sysregs'. - MFamd64: Display various MSRs in 'show sysregs'. - Add a 'show dbregs' to display the value of debug registers. - Dynamically size the column width for register values to properly align columns on 64-bit platforms. - Display %gs for i386 in 'show registers'.	2015-11-12 23:49:47 +00:00
jhb	2285630285	MFC 285773,285775,285776: Various fixes for stack unwinding in DDB on x86. 285773: Remove some dead code from DDB's amd64 stack unwinder. The amd64 port copied some code from i386 to fetch function arguments and display them in backtraces. However, it was commented out and can't easily be implemented since the function arguments are passed in registers rather than on the stack in amd64. Remove it in preparation for some bug fixes in this area. 285775: Improve stack unwinding on i386 and amd64 after an IP fault. If we can't find a symbol corresponding to the faulting instruction, assume that the previously-executed function is a call and attempt to find the calling function using the return address on the stack. Otherwise we end up associating the last stack frame with the current call, which is incorrect and causes the unwinder to skip printing of the calling function, resulting in a confusing backtrace. 285776: Let the unwinder handle faults during function prologues or epilogues. The i386 and amd64 DDB stack unwinders contain code to detect and handle the case where the first frame is not completely set up or torn down. This code was accidentally unused however, since db_backtrace() was never called with a non-NULL trap frame. This change fixes that. Also remove get_rsp() from the amd64 code. It appears to have come from i386, which needs to take into account whether the exception triggered a CPL switch, since SS:ESP is only pushed onto the stack if so. On amd64, SS:RSP is pushed regardless, so get_rsp() was doing the wrong thing for kernel-mode exceptions. As a result, we can also remove custom print functions for these registers.	2015-11-12 22:45:51 +00:00
jhb	c1d9f70889	Export various helper variables describing the layout and size of certain kernel structures for use by debuggers. This mostly aids in examining cores from a kernel without debug symbols as a debugger can infer these values if debug symbols are available. One set of variables describes the layout of 'struct linker_file' to walk the list of loaded kernel modules. A second set of variables describes the layout of 'struct proc' and 'struct thread' to walk the list of processes in the kernel and the threads in each process. The 'pcb_size' variable is used to index into the stoppcbs[] array. The 'vm_maxuser_address' is used to distinguish kernel virtual addresses from user addresses. This doesn't have to be perfect, and 'vm_maxuser_address' is a cheap and simple way to differentiate kernel pointers from simple values like TIDs and PIDs. While here, annotate the fields in struct pcb used by kgdb on amd64 and i386 to note that their ABI should be preserved. Annotations for other platforms will be added in the future. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3773	2015-11-12 22:00:59 +00:00
kib	70c328a1bb	MFC r289824: Add CLFLUSHOPT instruction wrappers. MFC r290188: Fix prefix on i386.	2015-10-30 10:02:57 +00:00
cem	dffa7f0590	pmap_change_attr: Only fixup DMAP for DMAPed ranges pmap_change_attr must change the memory type of both the requested KVA and the corresponding DMAP mappings (if such mappings exist), to satisfy an Intel requirement that two or more mappings to the same physical pages must have the same memory type. However, not all kernel mapped pages have corresponding DMAP mappings -- for example, 64-bit BARs. Skip fixing up the DMAP for out-of-bounds addresses. Submitted by: Steve Wahl <steve_wahl@dell.com> Reviewed by: alc, jhb Sponsored by: Dell Compellent Differential Revision: https://reviews.freebsd.org/D4030	2015-10-29 19:07:00 +00:00
jhb	9ee931e10a	Update for LINUX32 rename. The assembler didn't complain about undefined symbols but just used 0 after the rename.	2015-10-29 15:20:47 +00:00
jhb	617f6c60b6	Fix build with DEBUG defined. Reported by: hselasky	2015-10-29 15:16:47 +00:00
mckusick	485c0ddc22	Bring the tags and links entries for amd64 up to date. Based on how out of date it is, I doubt that anyone other than me and my code-reading students still use it.	2015-10-27 22:59:24 +00:00
kib	919ebacc13	Intel SDM before revision 56 described the CLFLUSH instruction as only ordered with the MFENCE instruction. Similar weak guarantees are also specified by the AMD APM vol. 3 rev. 3.22. x86 pmap methods pmap_invalidate_cache_range() and pmap_invalidate_cache_pages() braced CLFLUSH loop with MFENCE both before and after the loop. In the revision 56 of SDM, Intel stated that all existing implementations of CLFLUSH are strict, CLFLUSH instructions execution is ordered WRT other CLFLUSH and writes. Also, the strict behaviour is made architectural. A new instruction CLFLUSHOPT (which was documented for some time in the Instruction Set Extensions Programming Reference) provides the weak behaviour which was previously attributed to CLFLUSH. Use CLFLUSHOPT when available. When CLFLUSH is used on Intel CPUs, do not execute MFENCE before and after the flushing loop. Reviewed by: alc Sponsored by: The FreeBSD Foundation	2015-10-24 21:37:47 +00:00
kib	7eb36dd3f9	Add CLFLUSHOPT instruction wrappers. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-23 11:45:38 +00:00
avg	bef317767a	MFC r261891: provide fast versions of ffsl and flsl for i386; ffsll and flsll for amd64	2015-10-23 10:05:43 +00:00
jhb	4a875c71d5	Regen for linux32 rename and linux64 systrace.	2015-10-22 21:33:37 +00:00
jhb	9740ac3060	Rename remaining linux32 symbols such as linux_sysent[] and linux_syscallnames[] from linux_* to linux32_* to avoid conflicts with linux64.ko. While here, add support for linux64 binaries to systrace. - Update NOPROTO entries in amd64/linux/syscalls.master to match the main table to fix systrace build. - Add a special case for union l_semun arguments to the systrace generation. - The systrace_linux32 module now only builds the systrace_linux32.ko. module on amd64. - Add a new systrace_linux module that builds on both i386 and amd64. For i386 it builds the existing systrace_linux.ko. For amd64 it builds a systrace_linux.ko for 64-bit binaries. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D3954	2015-10-22 21:28:20 +00:00
jhb	2705fe5cc1	Merge r289055 to amd64/linux32: linux: fix handling of out-of-bounds syscall attempts Due to an off by one the code would read an entry past the table, as opposed to the last entry which contains the nosys handler.	2015-10-22 21:23:58 +00:00
ed	7fb0afec66	Refactoring: move out generic bits from cloudabi64_sysvec.c. In order to make it easier to support CloudABI on ARM64, move out all of the bits from the AMD64 cloudabi_sysvec.c into a new file cloudabi_module.c that would otherwise remain identical. This reduces the AMD64 specific code to just ~160 lines. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3974	2015-10-22 09:07:53 +00:00
royger	d31f5fb7b9	x86/xen: Consolidate xen-os.h in a single place amd64 and i386 platform code contain very similar xen/xen-os.h The only differences are: - Functions/variables/types which were unused in i386/xen/xen-os.h: * xen_xchg * __xchg_dummy * __xg * __xchg * atomic_t * atomic_inc * rdtscll The functions/variables/types unused in xen-os.h can be dropped and there are no more differences between amd64 and i386. The new header is placed in x86/include/xen and each platform will have dummy headers include x86/xen/.h. This is to be able to include machine/xen/.h in the PV drivers. Submitted by: Julien Grall <julien.grall@citrix.com> Reviewed by: royger Differential Revision: https://reviews.freebsd.org/D3880 Sponsored by: Citrix Systems R&D	2015-10-21 10:04:35 +00:00
mav	64d53c4c7d	Remove compatibility shims for legacy ATA device names. We got new ATA stack in FreeBSD 8.x, switched to it at 9.x, completely removed old stack at 10.x, so at 11.x it is time to remove compat shims.	2015-10-11 13:01:51 +00:00
mjg	d8dc4fc1ae	linux: fix handling of out-of-bounds syscall attempts Due to an off by one the code would read an entry past the table, as opposed to the last entry which contains the nosys handler. Reported by: Pawel Biernacki <pawel.biernacki gmail.com>	2015-10-08 21:08:35 +00:00
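
The off-by-one described in d8dc4fc1ae (and merged to amd64/linux32 in 2705fe5cc1) can be illustrated with a small sketch. This is not the kernel code: the table, the handler names (`h_exit`, `h_read`, `h_nosys`) and `lookup()` are hypothetical stand-ins, and the only thing taken from the commit message is the layout, where the last table slot holds the nosys handler and out-of-range syscall numbers must resolve to that slot rather than to the entry one past the table.

```c
#include <stddef.h>

/* Hypothetical handler type; illustrative only, not the kernel's sysent. */
typedef int (*handler_t)(void);

static int h_exit(void)  { return 1; }
static int h_read(void)  { return 2; }
static int h_nosys(void) { return -1; }	/* stands in for the nosys handler */

/* By convention in this sketch, the last slot is the nosys handler. */
static handler_t table[] = { h_exit, h_read, h_nosys };
#define TABLE_SIZE	(sizeof(table) / sizeof(table[0]))

/*
 * Map a syscall number to a handler. Any out-of-range number falls back
 * to the last valid entry, table[TABLE_SIZE - 1]. Using table[TABLE_SIZE]
 * as the fallback would read one entry past the end of the table.
 */
static handler_t
lookup(size_t code)
{
	if (code >= TABLE_SIZE)
		return (table[TABLE_SIZE - 1]);
	return (table[code]);
}
```

With this layout, `lookup(3)` and any larger number return `h_nosys`, the same handler that `lookup(2)` returns directly.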

8130 Commits