7352 Commits

Author SHA1 Message Date
kib
f6cfae6dab Add a comment about too strong semantic of atomic_load_acq() on x86.
Submitted by:	bde
MFC after:	2 weeks
2015-06-29 09:58:40 +00:00
kib
e85612a06d pcb_gs32sd is unused for long time, remove it. Keep the padding in pcb.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-29 07:53:44 +00:00
kib
9b07cc4555 Add x86 PT_GETFSBASE, PT_GETGSBASE machine-depended ptrace requests to
obtain the thread %fs and %gs bases.  Add x86 PT_SETFSBASE and
PT_SETGSBASE requests to set the bases from debuggers.  The set
requests, similarly to the sysarch({I386,AMD64}_SET_FSBASE),
override the corresponding segment registers.

The main purpose of the operations is to retrieve and modify the tcb
address for debuggee.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-29 07:07:24 +00:00
kib
6279b7c930 Remove unneeded data dependency, currently imposed by
atomic_load_acq(9), on it source, for x86.

Right now, atomic_load_acq() on x86 is sequentially consistent with
other atomics, code ensures this by doing store/load barrier by
performing locked nop on the source.  Provide separate primitive
__storeload_barrier(), which is implemented as the locked nop done on
a cpu-private variable, and put __storeload_barrier() before load, to
keep seq_cst semantic but avoid introducing false dependency on the
no-modification of the source for its later use.

Note that seq_cst property of x86 atomic_load_acq() is not documented
and not carried by atomics implementations on other architectures,
although some kernel code relies on the behaviour.  This commit does
not intend to change this.

Reviewed by:	alc
Discussed with:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-28 05:04:08 +00:00
tychon
d4a6573433 verify_gla() needs to account for non-zero segment base addresses.
Reviewed by:	neel
2015-06-26 18:00:29 +00:00
royger
c61d8ab317 amd64: set the correct LMA values
The current linker script generates program headers with VMA == LMA:

Entry point 0xffffffff802e7000
There are 6 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0xffffffff80200040 0xffffffff80200040
                 0x0000000000000150 0x0000000000000150  R E    8
  INTERP         0x0000000000000190 0xffffffff80200190 0xffffffff80200190
                 0x000000000000000d 0x000000000000000d  R      1
      [Requesting program interpreter: /red/herring]
  LOAD           0x0000000000000000 0xffffffff80200000 0xffffffff80200000
                 0x00000000010559b0 0x00000000010559b0  R E    200000
  LOAD           0x0000000001056000 0xffffffff81456000 0xffffffff81456000
                 0x0000000000132638 0x000000000052ecf8  RW     200000
  DYNAMIC        0x0000000001056000 0xffffffff81456000 0xffffffff81456000
                 0x00000000000000d0 0x00000000000000d0  RW     8
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RWE    8

This is fine for the FreeBSD loader, because it completely ignores p_paddr
and instead uses p_vaddr with a hardcoded offset. Other loaders however
acknowledge p_paddr (like the Xen ELF loader), in which case they will try
to load the kernel at the wrong place. Fix this by adding an AT keyword to
the first section specifying the physical address, other sections will
follow suit, so it ends up looking like:

Entry point 0xffffffff802e7000
There are 6 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0xffffffff80200040 0x0000000000200040
                 0x0000000000000150 0x0000000000000150  R E    8
  INTERP         0x0000000000000190 0xffffffff80200190 0x0000000000200190
                 0x000000000000000d 0x000000000000000d  R      1
      [Requesting program interpreter: /red/herring]
  LOAD           0x0000000000000000 0xffffffff80200000 0x0000000000200000
                 0x00000000010559b0 0x00000000010559b0  R E    200000
  LOAD           0x0000000001056000 0xffffffff81456000 0x0000000001456000
                 0x0000000000132638 0x000000000052ecf8  RW     200000
  DYNAMIC        0x0000000001056000 0xffffffff81456000 0x0000000001456000
                 0x00000000000000d0 0x00000000000000d0  RW     8
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RWE    8

Tested on bare metal using the native FreeBSD loader and grub2 from TRUEOS.

Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D2783
2015-06-26 07:12:17 +00:00
neel
2c76978d5a Restore the host's GS.base before returning from 'svm_launch()'.
Previously this was done by the caller of 'svm_launch()' after it returned.
This works fine as long as no code is executed in the interim that depends
on pcpu data.

The dtrace probe 'fbt:vmm:svm_launch:return' broke this assumption because
it calls 'dtrace_probe()' which in turn relies on pcpu data.

Reported by:	avg
MFC after:	1 week
2015-06-23 02:17:23 +00:00
neel
8c70d6c7af Restructure memory allocation in bhyve to support "devmem".
devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer
where doing a trap-and-emulate for every access is impractical. devmem is a
hybrid of system memory (sysmem) and emulated device models.

devmem is mapped in the guest address space via nested page tables similar
to sysmem. However the address range where devmem is mapped may be changed
by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is
usually mapped RO or RW as compared to RWX mappings for sysmem.

Each devmem segment is named (e.g. "bootrom") and this name is used to
create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom).
The device node supports mmap(2) and this decouples the host mapping of
devmem from its mapping in the guest address space (which can change).

Reviewed by:	tychon
Discussed with:	grehan
Differential Revision:	https://reviews.freebsd.org/D2762
MFC after:	4 weeks
2015-06-18 06:00:17 +00:00
jhb
fd66a5bf8b Report the values of x86 segment registers to remote debuggers.
While here, also report %eflags from the i386 trapframe.

Differential Revision:	https://reviews.freebsd.org/D2743
Reviewed by:	kib
Obtained from:	1 month
2015-06-12 15:14:08 +00:00
br
1383b5af08 Allow DTrace to be compiled-in to the kernel.
This will require for AArch64 as we dont have modules yet.

Sponsored by:	HEIF5
Sponsored by:	ARM Ltd.
Differential Revision:	https://reviews.freebsd.org/D1997
2015-06-10 15:53:39 +00:00
mjg
b6be1c5ace Fixup the build after r284215.
Submitted by:	Ivan Klymenko <fidaj ukr.net> [slighly modified]
2015-06-10 12:39:01 +00:00
mjg
d7bc9285a6 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
mjg
67f2eebb44 Generalised support for copy-on-write structures shared by threads.
Thread credentials are maintained as follows: each thread has a pointer to
creds and a reference on them. The pointer is compared with proc's creds on
userspace<->kernel boundary and updated if needed.

This patch introduces a counter which can be compared instead, so that more
structures can use this scheme without adding more comparisons on the boundary.
2015-06-10 10:43:59 +00:00
alc
cbefd9b195 Account for superpage mappings that are created by pmap_copy(). 2015-06-09 18:04:28 +00:00
tychon
3259f3f35f Support guest writes to the TSC by enabling the "use TSC offsetting"
execution control and writing the difference between the host TSC and
the guest TSC into the TSC offset in the VMCS upon encountering a
write.

Reviewed by:	neel
2015-06-09 00:14:47 +00:00
dchagin
b837e3372d Futex is an aligned 32-bit integer. Use the proper instruction and
operand when dereferencing futex pointer.
2015-06-08 17:39:25 +00:00
alc
263927b83e Retire VM_FREEPOOL_CACHE as the next step in eliminating PG_CACHE pages.
Differential Revision:	https://reviews.freebsd.org/D2712
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-06-08 04:59:32 +00:00
kib
a1956e48c3 Update print_INTEL_TLB() by the tag values from the Intel SDM
rev. 55.  The modern CPUs cache and TLB descriptions looked quite
questionable without the update, e.g. Haswell i7 4770S reported:
	Data TLB: 4 KB pages, 4-way set associative, 64 entries
	L2 cache: 256 kbytes, 8-way associative, 64 bytes/line
After the update, the report is:
	Data TLB: 1 GByte pages, 4-way set associative, 4 entries
	Data TLB: 4 KB pages, 4-way set associative, 64 entries
	Instruction TLB: 2M/4M pages, fully associative, 8 entries
	Instruction TLB: 4KByte pages, 8-way set associative, 64 entries
	64-Byte prefetching
	Shared 2nd-Level TLB: 4 KByte/2MByte pages, 8-way associative, 1024 entries
	L2 cache: 256 kbytes, 8-way associative, 64 bytes/line
Some tags were apparently removed from the table 3-21, Vol. 2A.  Keep
them around, but add a comment stating the removal.

Update the format line for cpu_stdext_feature according to the bits
from the SDM rev.55.  It appears that Haswells do not store %cs and
%ds values in the FPU save area.

Store content of the %ecx register from the CPUID leaf 0x7
subleaf 0 as cpu_stdext_feature2 and print defined bits from it,
again acording to SDM rev. 55.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-06-06 22:03:24 +00:00
neel
954e6fdc3b The 'verify_gla()' function is used to ensure that the effective address
after decoding the instruction matches the one provided by hardware.

Prior to r283293 'vie->num_valid' used to contain the actual length of
the instruction whereas now it contains the maximum instruction length
possible. This introduced a bug when calculating a RIP-relative base address.

Fix this by using 'vie->num_processed' rather than 'vie->num_valid' as the
length of the emulated instruction.

Reported and tested by:	tychon
MFC after:	1 week
2015-06-05 21:22:26 +00:00
neel
0a2b315713 Use tunable 'hw.vmm.svm.features' to disable specific SVM features even
though they might be available in hardware.

Use tunable 'hw.vmm.svm.num_asids' to limit the number of ASIDs used by
the hypervisor.

MFC after:	1 week
2015-06-04 02:12:23 +00:00
dim
2474a7a0d8 Remove unneeded NULL checks in amd64's trap_fatal().
Since td_name is an array member of struct thread, it can never be NULL,
so the check can be removed.  In addition, curproc can never be NULL,
so remove the if statement, and splice the two printfs() together.

While here, remove the u_long cast, and use the correct printf format
specifier curproc->p_pid.

Reviewed by:	kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D2695
2015-06-01 06:50:39 +00:00
kib
e2f56205b5 Remove several write-only variables, all reported by the gcc 4.9
buildkernel run.

Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros.  It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.

Review:	https://reviews.freebsd.org/D2665
Reviewed by:	rodrigc
Looked at by:	bjk
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-29 13:24:17 +00:00
neel
3f2b4fc770 Fix non-deterministic delays when accessing a vcpu that was in "running" or
"sleeping" state. This is done by forcing the vcpu to transition to "idle"
by returning to userspace with an exit code of VM_EXITCODE_REQIDLE.

MFC after:      2 weeks
2015-05-28 17:37:01 +00:00
kib
167672d7cd Enabled rewritten PCID support by default.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2015-05-27 09:50:18 +00:00
dchagin
50f49b7240 When I merged the lemul branch I missied kib@'s r282708 commit.
This is not the final fix as I need properly cleanup thread resources
before other threads suicide.

Tested by:	Ruslan Makhmatkhanov
2015-05-25 20:44:46 +00:00
dchagin
9ffc61f923 Regen for r283492. 2015-05-24 18:09:01 +00:00
dchagin
5cd0d07723 Implement Linux specific syncfs() system call. 2015-05-24 18:08:01 +00:00
dchagin
21f92dba1f Regen for r283488. 2015-05-24 18:05:21 +00:00
dchagin
ea835975de Implement recvmmsg() and sendmmsg() system calls. 2015-05-24 18:04:04 +00:00
dchagin
8164e61122 Reduce duplication between MD Linux code by moving msg related
struct definitions out into the compat/linux/linux_socket.h
2015-05-24 18:03:14 +00:00
dchagin
791c259f28 Regen for r283484. 2015-05-24 18:02:17 +00:00
dchagin
8e85763052 Implement epoll_pwait() system call. 2015-05-24 18:00:14 +00:00
dchagin
bf8c32ec80 Regen for r283480. 2015-05-24 17:58:24 +00:00
dchagin
5361a6bf58 Add utimensat() system call.
The patch developed by Jilles Tjoelker and Andrew Wilcox and
adopted for lemul branch by me.
2015-05-24 17:57:07 +00:00
dchagin
0969667a9e The kernel sends signals to the processes via ABI specific sv_sendsig method.
Native ABI do not need signal conversion, only emulators may want this. Usually
emulators implements its own sv_sendsig method. For now only ibcs2 emulator does
not have own sv_sendsig implementation and depends on native sendsig() method.
So, remove any extra attempts to convert signal numbers from native sendsig()
methods except from i386 where ibsc2 is living.
2015-05-24 17:56:02 +00:00
dchagin
fc55d94b46 Rework signal code to allow using it by other modules, like linprocfs:
1. Linux sigset always 64 bit on all platforms. In order to move Linux
sigset code to the linux_common module define it as 64 bit int. Move
Linux sigset manipulation routines to the MI path.

2. Move Linux signal number definitions to the MI path. In general, they
are the same on all platforms except for a few signals.

3. Map Linux RT signals to the FreeBSD RT signals and hide signal conversion
tables to avoid conversion errors.

4. Emulate Linux SIGPWR signal via FreeBSD SIGRTMIN signal which is outside
of allowed on Linux signal numbers.

PR:		197216
2015-05-24 17:47:20 +00:00
dchagin
cace25f46d According to Linux man sigaltstack(3) shall return EINVAL if the ss
argument is not a null pointer, and the ss_flags member pointed to by ss
contains flags other than SS_DISABLE. However, in fact, Linux also
allows SS_ONSTACK flag which is simply ignored.

For buggy apps (at least mono) ignore other than SS_DISABLE
flags as a Linux do.

While here move MI part of sigaltstack code to the appropriate place.

Reported by:	abi at abinet dot ru
2015-05-24 17:44:08 +00:00
dchagin
9d4a2ad595 Regen for r283467. 2015-05-24 17:39:18 +00:00
dchagin
92d496261e Call nosys in case when the incorrect syscall number is specified.
Reported by:	trinity
2015-05-24 17:38:02 +00:00
dchagin
df01339e31 Regen for r283465. 2015-05-24 17:35:42 +00:00
dchagin
a346bc7dc8 Add preliminary fallocate system call implementation
to emulate posix_fallocate() function.

Differential Revision:	https://reviews.freebsd.org/D1523
Reviewed by:	emaste
2015-05-24 17:33:21 +00:00
dchagin
54f00beef0 Regen for r283451. 2015-05-24 17:00:43 +00:00
dchagin
b37f23e513 Implement ppoll() system call.
Differential Revision:	https://reviews.freebsd.org/D1105
Reviewed by:	trasz
2015-05-24 16:59:25 +00:00
dchagin
a03d7a9e7f Include opt_compat.h, so that COMPAT_LINUX32 is defined, and we can
access to the semop structs and functions.

Submitted by:	cognet@

Differential Revision:	https://reviews.freebsd.org/D1095
Reviewed by:	trasz
2015-05-24 16:51:04 +00:00
dchagin
495cb86c87 Regen for r283444. 2015-05-24 16:50:17 +00:00
dchagin
d7e47c502a Implement eventfd system call.
Differential Revision:	https://reviews.freebsd.org/D1094
In collaboration with:	Jilles Tjoelker
2015-05-24 16:49:14 +00:00
dchagin
2336f8bb01 Put the correct value for the abi_nfdbits parameter of kern_select() for
all supported Linuxulators.

Differential Revision:	https://reviews.freebsd.org/D1093
Reviewed by:	trasz
2015-05-24 16:47:13 +00:00
dchagin
5e8bb6e371 Regen for r283441. 2015-05-24 16:42:49 +00:00
dchagin
5f069937a3 Implement epoll family system calls. This is a tiny wrapper
around kqueue() to implement epoll subset of functionality.
The kqueue user data are 32bit on i386 which is not enough for
epoll user data, so we keep user data in the proc emuldata.

Initial patch developed by rdivacky@ in 2007, then extended
by Yuri Victorovich @ r255672 and finished by me
in collaboration with mjg@ and jillies@.

Differential Revision:	https://reviews.freebsd.org/D1092
2015-05-24 16:41:39 +00:00
dchagin
db8a000521 To avoid code duplication move open/fcntl definitions to the MI
header file.

Differential Revision:	https://reviews.freebsd.org/D1087
Reviewed by:	trasz
2015-05-24 16:31:44 +00:00