As the Linux semop syscall is not defined in i386, and as it is equal
to the native semop syscall, call it directly.
Fix semop definition to match Linux actual one - nsops is size_t type.
MFC after: 2 weeks
The "call" variable comes from the user in privcmd_ioctl_hypercall().
It's an offset into the hypercall_page[] which has (PAGE_SIZE / 32)
elements. We need to put an upper bound on it to prevent an out of
bounds access.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Obtained from: Linux
Linux commit: 42d8644bd77dd2d747e004e367cb0c895a606f39
Fixes: bf7313e3b7 ("xen: implement the privcmd user-space device")
Submitted by: Elliott Mitchell <ehem+freebsd@m5p.com>
Reviewed by: royger
This option was never enabled in GENERIC and does not appear to work
(the cdevsw is stored in a global array but never passed to make_dev
to be associated with a character device).
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D35008
These files no longer depend on the macros required when these checks
were added.
PR: 263102 (exp-run)
Reviewed by: brooks, imp, emaste
Differential Revision: https://reviews.freebsd.org/D34804
All supported compilers (modern versions of GCC and clang) support
this.
PR: 263102 (exp-run)
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D34802
All supported compilers (modern versions of GCC and clang) support
this.
Many places didn't have an #else so would just silently do the wrong
thing. Ancient versions of icc (the original motivation for this) are
no longer a compiler FreeBSD supports.
PR: 263102 (exp-run)
Reviewed by: brooks, imp
Differential Revision: https://reviews.freebsd.org/D34797
These headers #include <sys/cdefs.h> right after checking if it has
already been #included. The nested #include already existed when the
check for _SYS_CDEFS_H_ was added, so the check shouldn't have been
added in the first place.
PR: 263102 (exp-run)
Reported by: brooks
Reviewed by: brooks, imp, emaste
Differential Revision: https://reviews.freebsd.org/D34796
The __diagused macro was used to cure a "set but not used" warning. This
broke the build for bhyve since __diagused is only defined in the
kernel. Define __diagused when not building the kernel.
Fixes: 5241577a22 ("vmm: fix set but not used warning")
Reported by: Jenkins
Historically 32-bit Linuxulator under amd64 emulated the real i386
behavior. Since 3d8dd983 the old i386 Linux world can't be used under
amd64 Linuxulator as it don't know anything about amd64 machine (which
is returned now by newuname() syscall). So, add a knob to allow to swith
the behavior and use i386 Linux binaries on amd64.
Set knob to the new behavior as I think this is common to the modern
Linux distros.
Reviewed by: Pau Amma (doc), emaste
Differential revision: https://reviews.freebsd.org/D34708
MFC after: 2 weeks
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).
Obtained from: CheriBSD
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33780
This register set contains the values of the fsbase and gsbase
registers. Note that these registers can already be controlled
individually via ptrace(2) via MD operations, so the main reason for
adding this is to include these register values in core dumps. In
particular, this will enable looking up the value of TLS variables
from core dumps in gdb.
The value of NT_X86_SEGBASES was chosen to match the value of
NT_386_TLS on Linux. The notes serve similar purposes, but FreeBSD
will never dump a note equivalent to NT_386_TLS (which dumps a single
segment descriptor rather than a pair of addresses) and picking a
currently-unused value in the NT_X86_* range could result in a future
conflict.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D34650
This permits I/O devices on the host to directly access wired memory
dedicated to guests using passthru devices. Note that wired memory
belonging to guests that do not use passthru devices has always been
accessible by I/O devices on the host.
bhyve maps guest physical addresses into the user address space of
the bhyve process by mmap'ing /dev/vmm/<vmname>. Device models pass
pointers derived from this mapping directly to system calls such as
preadv() to minimize copies when emulating DMA. If the backing store
for a device model is a raw host device (e.g. when exporting a raw disk
device such as /dev/ada<n> as a drive in the guest), the host device
driver (e.g. ahci for /dev/ada<n>) can itself use DMA on the host
directly to the guest's memory. However, if the guest's memory is
not present in the host IOMMU domain, these DMA requests by the host
device will fail without raising an error visible to the host device
driver or to the guest resulting in non-working I/O in the guest.
It is unclear why guest addresses were removed from the IOMMU host domain
initially, especially only for VM's with a passthru device as the
host IOMMU domain does not affect the permissions of passthru devices,
only devices on the host.
A considered alternative was using bounce buffers instead (D34535
is a proof of concept), but that adds additional overhead for unclear
benefit.
This solves a long-standing problem when using passthru devices and
physical disks in the same VM.
Thanks to: grehan (patience and help)
Thanks to: jhb (for improving the commit message)
PR: 260178
Reviewed by: grehan, jhb
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34607
Introduce a helper to fetch the TSC frequency from CPUID when running
under Xen.
Since the TSC can also be initialized early when running as a Xen
guest pull out the call to tsc_init() from the
early_clock_source_init() handlers and place it in clock_init(), as
otherwise all handlers would call tsc_init() anyway.
Reviewed by: markj
Sponsored by: Citrix Systems R&D
Differential revision: https://reviews.freebsd.org/D34581
Some PCI devices especially GPUs require a ROM to work properly.
The ROM is executed by boot firmware to initialize the device.
To add a ROM to a device use the new ROM option for passthru device
(e.g. -s passthru,0/2/0,rom=<path>/<to>/<rom>).
It's necessary that the ROM is executed by the boot firmware.
It won't be executed by any OS.
Additionally, the boot firmware should be configured to execute the
ROM file.
For that reason, it's only possible to use a ROM when using
OVMF with enabled bus enumeration.
Differential Revision: https://reviews.freebsd.org/D33129
Sponsored by: Beckhoff Automation GmbH & Co. KG
MFC after: 1 month
We assume EFI_PAGE_SIZE is the same as PAGE_SIZE, however this may not
be the case. Use the former when working with a list of pages from the
UEFI firmware so the correct size is used.
This will be needed on arm64 where PAGE_SIZE could be 16k or 64k in the
future. The other architectures have been updated to be consistent.
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34510
As in commit c3d830cf7c, we should finalize CPU identification before
probing the TSC frequency.
Fixes: 84369dd523 ("x86: Probe the TSC frequency earlier")
Reported by: khng
This lets us use the TSC to implement early DELAY, limiting the use of
the sometimes-unreliable 8254 PIT.
PR: 262155
Reviewed by: emaste
Tested by: emaste, mike tancsa <mike@sentex.net>, Stefan Hegnauer <stefan.hegnauer@gmx.ch>
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34367
Add a static assert for the siginfo_t, mcontext_t and ucontext_t
sizes. These are de-facto ABI options and cannot change size ever.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D34212
- Add a starting index to 'struct vmstats' and change the
VM_STATS ioctl to fetch the 64 stats starting at that index.
A compat shim for <= 13 continues to fetch only the first 64
stats.
- Extend vm_get_stats() in libvmmapi to use a loop and a static
thread local buffer which grows to hold the stats needed.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27463
These headers originate with the Xen project and shouldn't be mixed with
the main portion of the FreeBSD kernel. Notably they shouldn't be the
target of clean-up commits.
Switch to use the headers in sys/contrib/xen.
Reviewed by: royger
Modules no longer call kernel functions for atomic ops, and since the
previous commit, we always use lock prefix.
Submitted by: Elliott Mitchell <ehem+freebsd@m5p.com>
Reviewed by: jhb, markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34153
Atomics have significant other use besides providing in-system
primitives for safe memory updates. They are used for implementing
communication with out of system software or hardware following some
protocols.
For instance, even UP kernel might require a protocol using atomics to
communicate with the software-emulated device on SMP hypervisor. Or
real hardware might need atomic accesses as part of the proper
management protocol.
Another point is that UP configurations on x86 are extinct, so slight
performance hit by unconditionally use proper atomics is not important.
It is compensated by less code clutter, which in fact improves the
UP/i386 lifetime expectations.
Requested by: Elliott Mitchell <ehem+freebsd@m5p.com>
Reviewed by: Elliott Mitchell, imp, jhb, markj, royger
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34153
Eliminate shlq $3,address shift after masking of the va is done, which
is needed to convert pt_entry_t[] array index into byte offset.
Do it by preshifting the mask, and compensating the right shift of va.
Suggested by: alc
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33786
This adds the PT_GETREGSET and PT_SETREGSET ptrace types. These can be
used to access all the registers from a specified core dump note type.
The NT_PRSTATUS and NT_FPREGSET notes are initially supported. Other
machine-dependant types are expected to be added in the future.
The ptrace addr points to a struct iovec pointing at memory to hold the
registers along with its length. On success the length in the iovec is
updated to tell userspace the actual length the kernel wrote or, if the
base address is NULL, the length the kernel would have written.
Because the data field is an int the arguments are backwards when
compared to the Linux PTRACE_GETREGSET call.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19831
ASLR stack randomization will reappear in a forthcoming commit. Rather
than inserting a random gap into the stack mapping, the entire stack
mapping itself will be randomized in the same way that other mappings
are when ASLR is enabled.
No functional change intended, as the stack gap implementation is
currently disabled by default.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro. With stack address randomization the
ps_strings address is no longer fixed.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
The size of the ps_strings structure varies between ABIs, so this is
useful for computing the address of the ps_strings structure relative to
the top of the stack when stack address randomization is enabled.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
Some guests or driver might depend on MTRR to work properly. E.g. the
nvidia gpu driver won't work without MTRR.
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Beckhoff Automation GmbH & Co. KG
Differential Revision: https://reviews.freebsd.org/D33333
- Use ia32_freebsd4_* instead of ia32_*4.
- Use ia32_o* instead of ia32_*3.
Reviewed by: brooks, imp, kib
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33882
Missed issues in truss on at least armv7 and powerpcspe need to be
resolved before recommit.
This reverts commit 3889fb8af0.
This reverts commit 1544e0f5d1.
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).
Obtained from: CheriBSD
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33780
Suspend/Resume of Win10 leads that CPU0 is busy on handling interrupts.
Win10 does not use LAPIC timer to often and in most cases, and I see it
is disabled by writing 0 to Initial Count Register (for Timer).
During resume, restart timer only for enabled LAPIC and enabled timer
for that LAPIC.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33448
This removes one or two atomic update of the pte on access. Also it
makes the pte constant during its lifecycle.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33778
Pre-calculate masks and page table/page directory bases, for LA48 and
LA57 cases, trading a conditional for the memory accesses.
This shaves 672 bytes of .text, by the cost of 32 .data bytes. Note
that eliminated code contained one conditional, and several costly
movabs instructions, which are traded by two memory fetches per
vtop{t,d}e() inlines.
Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33776
Simplify control flow around handling of the execpath length and signal
trampoline. Cache the sysentvec pointer in a local variable.
No functional change intended.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33703
The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.
Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.
The differences between the FreeBSD version of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-adderess
scheme (2 source operands and a separately passed destination).
The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.
This patch set allows to unconditionally provide functions and macros
expected by 3rd party software written for GLIBC based systems, but
breaks builds of externally maintained sources that use any of the
following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.
One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.
Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT do
no longer require that option.
The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D33451
When a DMA request using bounce pages completes, a swi is triggered to
schedule pending DMA requests using the just-freed bounce pages. For
a long time this bus_dma swi has been tied to a "virtual memory" swi
(swi_vm). However, all of the swi_vm implementations are the same and
consist of checking a flag (busdma_swi_pending) which is always true
and if set calling busdma_swi. I suspect this dates back to the
pre-SMPng days and that the intention was for swi_vm to serve as a
mux. However, in the current scheme there's no need for the mux.
Instead, remove swi_vm and vm_ih. Each bus_dma implementation that
uses bounce pages is responsible for creating its own swi (busdma_ih)
which it now schedules directly. This swi invokes busdma_swi directly
removing the need for busdma_swi_pending.
One consequence is that the swi now works on RISC-V which had previously
failed to invoke busdma_swi from swi_vm.
Reviewed by: imp, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33447
The current implementations never correctly return TRUE. In all cases,
when they currently return TRUE, they should have returned FALSE. And,
in some cases, when they currently return FALSE, they should have
returned TRUE. Except for its effects on performance, specifically,
additional page faults and pointless calls to pmap_enter_quick() that
abort, this error is harmless. That is why it has gone unnoticed.
Add a comment to the amd64, arm64, and riscv implementations
describing how their return values are computed.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33659
Notably, the current compat_options only makes sense for native and
freebsd32 ABIs. For the others, it just adds cruft. Switch to having
sets of compat options, and default to the native set. Setup the other
ABIs where it doesn't make sense to opt-out of the native set.
This removes some redundant COMPAT_FREEBSD* stuff from Linuxolator bits.
line_expr in makesyscalls.lua is fixed to allow empty strings to be
specified, since they're harmless.
Reviewed by: brooks, kib (both earlier version)
Differential Revision: https://reviews.freebsd.org/D33356
Suggested by: Michael Pratt <mpratt@google.com>
Reviewed by: Michael Pratt <mpratt@google.com>, emaste
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D33423
The header exports the following:
- Definition of struct tcb.
- Helpers to get/set the tcb for the current thread.
- TLS_TCB_SIZE (size of TCB)
- TLS_TCB_ALIGN (alignment of TCB)
- TLS_VARIANT_I or TLS_VARIANT_II
- TLS_DTV_OFFSET (bias of pointers in dtv[])
- TLS_TP_OFFSET (bias of "thread pointer" relative to TCB)
Note that TLS_TP_OFFSET does not account for if the unbiased thread
pointer points to the start of the TCB (arm and x86) or the end of the
TCB (MIPS, PowerPC, and RISC-V).
Note also that for amd64, the struct tcb does not include the unused
tcb_spare field included in the current structure in libthr. libthr
does not use this field, and the existing calls in libc and rtld that
allocate a TCB for amd64 assume it is the size of 3 Elf_Addr's (and
thus do not allocate room for tcb_spare).
A <sys/_tls_variant_i.h> header is used by architectures using
Variant I TLS which uses a common struct tcb.
Reviewed by: kib (older version of x86/tls.h), jrtc27
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33351
After a round of cleanups in late 2020, all definitions are
functionally identical.
This removes a rotted __aligned(8) on arm. It was added in
b7112ead32 and was intended to align the
args member so that 64-bit types (off_t, etc) could be safely read on
armeb compiled with clang. With the removal of armev, this is no
longer needed (armv7 requires that 32-bit aligned reads of 64-bit
values be supported and we enable such support on armv6). As further
evidence this is unnecessary, cleanups to struct syscall_args have
resulted in args being 32-bit aligned on 32-bit systems. The sole
effect is to bloat the struct by 4 bytes.
Reviewed by: kib, jhb, imp
Differential Revision: https://reviews.freebsd.org/D33308
The headers were mostly identical on amd64 and i386.
No functional change intended.
Reviewed by: cperciva, mav, imp, kib, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33205
When a superpage mapping is destroyed and the original page table page
containing 4KB mappings that was being held in reserve is deallocated,
the recently introduced user page table page count was not being
decremented. Consequentially, the count was wrong and would grow over
time. For example, after multiple iterations of "buildworld", I was
seeing implausible counts, like the following:
vm.pmap.kernel_pt_page_count: 2184
vm.pmap.user_pt_page_count: 2280849
vm.pmap.pv_page_count: 106
With this change, I now see:
vm.pmap.kernel_pt_page_count: 2183
vm.pmap.user_pt_page_count: 344
vm.pmap.pv_page_count: 105
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D33276
It stopped being used in 3c256f5395, when trap() was reorganized to
have separate switch statements for user and kernel traps. Remove the
two leftover references and the flag itself.
Reviewed by: kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D33253
The vga and splash devices are part of the sc(4) system console. vt(4)
uses the vt_vga driver instead, and has some limited splash support
directly in vt_core.c. Leave the sc(4) options in GENERIC/MINIMAL (for
now) but group them together under an sc(4) comment.
Sponsored by: The FreeBSD Foundation
Followup to b8cf1c5c30, remove from MINIMAL in addition to GENERIC.
options VESA / vesa.ko provides VESA Bios Extensions (VBE) support for
the legacy sc(4) console. It is not used by the default console, vt(4).
PR: 253733
Fixes: b8cf1c5c30 ("Remove options VESA from x86 GENERIC")
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
options VESA / vesa.ko provides VESA Bios Extensions (VBE) support for
the legacy sc(4) console. It is not used by the default console, vt(4).
There is a report[1] of an incompatibility between VESA and the Nvidia
driver breaking suspend/resume. Since VESA is not used by the default
configuration anyway, just remove options VESA from GENERIC. The kernel
module is still available and may be loaded by sc(4) users who want to
select a VBE mode.
(Note that vt(4) does not support selecting a VBE mode. The loader can
set a VBE mode and vt(4) will use it via the vt_vbefb driver.)
[1] https://lists.freebsd.org/archives/freebsd-hackers/2021-November/000469.html
PR: 253733
Reported by: Stefan Blachmann [1]
Reviewed by: imp, manu, tsoome
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33141
Belatedly remove twa(4). It was supposed to go before 13.0, but was
overlooked.
Sponsored by: Netflix
Relnotes: yes
Reviewed by: scottl
Differential Revision: https://reviews.freebsd.org/D33114
Belatedly remove esp(4). It was tagged as gone in 13, but was overlooked
until now.
Sponsored by: Netflix
Reviewed by: scottl
Differential Revision: https://reviews.freebsd.org/D33115
Belatedly remove amr(4). It was slated to depart before 13.0 but was
overlooked until now.
Sponsored by: Netflix
Relnotes: yes
Reviewed by: scottl
Differential Revision: https://reviews.freebsd.org/D33113
Belatedly remove iir(4). It was slated to go before 13, but was
overlooked.
Sponsored by: Netflix
Relnotes: yes
Reviewed by: scottl
Differential Revision: https://reviews.freebsd.org/D33112
We'd said this was going away in 13, but was overlooked. Belatedly
remove.
Sponsored by: Netflix
Relnotes: yes
Reviewed by: scottl
Differential Revision: https://reviews.freebsd.org/D33111
in_cksum() and related routines are implemented separately for each
platform, but only i386 and arm have optimized versions. Other
platforms' copies of in_cksum.c are identical except for style
differences and support for big-endian CPUs.
Deduplicate the implementations for the rest of the platforms. This
will make it easier to implement in_cksum() for unmapped mbufs. On arm
and i386, define HAVE_MD_IN_CKSUM to mean that the MI implementation is
not to be compiled.
No functional change intended.
Reviewed by: kp, glebius
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33095
It was never implemented on powerpc or riscv and appears to have been
unused since it was added in 1998. No functional change intended.
Reviewed by: kp, glebius, cy
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33093
When constructing the set of dumpable pages, use the bitset provided by
the state argument, rather than assuming vm_page_dump invariably. For
normal kernel minidumps this will be a pointer to vm_page_dump, but when
dumping the live system it will not.
To do this, the functions in vm_dumpset.h are extended to accept the
desired bitset as an argument. Note that this provided bitset is assumed
to be derived from vm_page_dump, and therefore has the same size.
Reviewed by: kib, markj, jhb
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31992
Don't assume we are dumping the global message buffer, but use the one
provided by the state argument. While here, drop superfluous
cast to char *.
Reviewed by: markj, jhb
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31991
During a live dump, we may race with updates to the kernel page tables.
This is generally okay; we accept that the state of the system while
dumping may be somewhat inconsistent with its state when the dump was
invoked. However, when walking the kernel page tables, it is important
that we load each PDE/PTE only once while operating on it. Otherwise, it
is possible to have the relevant PTE change underneath us. For example,
after checking the valid bit, but before reading the physical address.
Convert the loads to atomics, and add some validation around the
physical addresses, to ensure that we do not try to dump a non-existent
or non-canonical physical address.
Similarly, don't read kernel_vm_end more than once, on the off chance
that pmap_growkernel() is called between the two page table walks.
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31990
It is useful for quickly checking an address against the DMAP region.
These definitions exist already on arm64 and riscv.
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D32962
The minidump code is written assuming that certain global state will not
change, and rightly so, since it executes from a kernel debugger
context. In order to support taking minidumps of a live system, we
should allow copies of relevant global state that is likely to change to
be passed as parameters to the minidumpsys() function.
This patch does the work of parameterizing this function, by adding a
struct minidumpstate argument. For now, this struct allows for copies of
the kernel message buffer, and the bitset that tracks which pages should
be dumped (vm_page_dump). Follow-up changes will actually make use of
these arguments.
Notably, dump_avail[] does not need a snapshot, since it is not expected
to change after system initialization.
The existing minidumpsys() definitions are renamed, and a thin MI
wrapper is added to kern_dump.c, which handles the construction of
the state struct. Thus, calling minidumpsys() remains as simple as
before.
Reviewed by: kib, markj, jhb
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31989
pipe requires no special handling.
ofreebsd32_sigpending did differ from osigpending in that it acted
on the siglist rather than the sigqueue, but this appears to be an
oversight in 3fbdb3c215.
ogetpagesize could theoretically have ABI-dependent results, but in
practice does not. If it does it would be easy handle in the central
implementation and be the least of the problems in changing the value of
PAGE_SIZE.
Reviewed by: kevans
We use pmap_invalidate_cpu_mask() to get the set of active CPUs. This
(32-byte) set is copied by value through multiple frames until we get to
smp_targeted_tlb_shootdown(), where it is copied yet again.
Avoid this copying by having smp_targeted_tlb_shootdown() make a local
copy of the active CPUs for the pmap, and drop the cpuset parameter,
simplifying callers. Also leverage the use of the non-destructive
CPU_FOREACH_ISSET to avoid unneeded copying within
smp_targeted_tlb_shootdown().
Reviewed by: alc, kib
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32792
This is in preference to simply filling the cpuset, and allows the
conditional in pmap_invalidate_cpu_mask() to be elided.
Also export pmap_invalidate_cpu_mask() outside of pmap.c for use in a
subsequent commit.
Suggested by: kib
Reviewed by: alc, kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32792
Define CC_NEWRENO in all the appropriate DEFAULTS and std.* config
files. It's the default congestion control algorithm. Add code to cc.c
so that CC_DEFAULT is "newreno" if it's not overriden in the config
file.
Sponsored by: Netflix
Fixes: b8d60729de ("tcp: Congestion control cleanup.")
Revired by: manu, hselasky, jhb, glebius, tuexen
Differential Revision: https://reviews.freebsd.org/D32964
The MINIMAL configs were overlooked. They are compiled as part of
universe, so this broke universe builds. Add the same defafults as for
GENERIC.
Sponsored by: Netflix
NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!!
This patch does a bit of cleanup on TCP congestion control modules. There were some rather
interesting surprises that one could get i.e. where you use a socket option to change
from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get
a memory failure and end up on cc_newreno. This is not what one would expect. The
new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK
and pass in to the init function preallocated memory. The CC init is expected in this
case *not* to fail but if it does and a module does break the
"no fail with memory given" contract we do fall back to the CC that was in place at the time.
This also fixes up a set of common newreno utilities that can be shared amongst other
CC modules instead of the other CC modules reaching into newreno and executing
what they think is a "common and understood" function. Lets put these functions in
cc.c and that way we have a common place that is easily findable by future developers or
bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE
and HYSTART++ without having to dance through hoops for other CC modules, instead
both newreno and the other modules just call into the common functions if they desire
that behavior or roll there own if that makes more sense.
Note: This commit changes the kernel configuration!! If you are not using GENERIC in
some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC,
CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined
as well if you desire. Note that if you create a kernel configuration that does not
define a congestion control module and includes INET or INET6 the kernel compile will
break. Also you need to define a default, generic adds 'options CC_DEFAULT=\"newreno\"
but you can specify any string that represents the name of the CC module (same names
that show up in the CC module list under net.inet.tcp.cc). If you fail to add the
options CC_DEFAULT in your kernel configuration the kernel build will also break.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
RELNOTES:YES
Differential Revision: https://reviews.freebsd.org/D32693
This moves linux_ptrace.c from sys/amd64/linux/ to sys/compat/linux/,
making it possible to use it on architectures other than amd64.
It also enables Linux ptrace(2) on arm64.
Relnotes: yes
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32868
When working on the ports these functions were slightly different, but
now there's no reason for them to be separate.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Make sys/amd64/linux/linux_ptrace.c machine-independent,
in preparation for moving it into sys/compat/linux/.
No functional changes.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32756
Note that this is largely untested at this point, as was
the previous version; I'm committing this mostly to get
rid of `struct linux_pt_reg`.
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32735
Previously it returned a shorter struct. I can't find any
modern software that uses it, but tests/ptrace from strace(1)
repo complained.
Differential Revision: https://reviews.freebsd.org/D32601
Translate ERESTART into Linux "internal" errno ERESTARTSYS.
This fixes the erestartsys.gen.test from strace(1).
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32623
to match the added accounting of the top-level page table pages.
Reviewed by: markj
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32569