Use AUXARGS_ENTRY_PTR to export these pointers. This is a followup to
r359987 and r359988.
Reviewed by: jhb
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24446
This simplifies discovery of these values, potentially with reducing the
number of syscalls we need to make at runtime. Longer term, we wish to
convert the startup process to pass an auxargs pointer to _start() and
use that rather than walking off the end of envv. This is cleaner,
more C-friendly, and for systems with strong bounds (e.g. CHERI)
necessary.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24407
There's no point in pre-checking that we can access the user's rmtp
pointer before we do it in copyout().
While here, improve style(9) compliance.
Reviewed by: imp
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24409
Copy the CP, PTRIN, etc macros from freebsd32.h into a sys/abi_compat.h
and replace existing definitation with includes where required. This
eliminates duplicate code and allows Linux and FreeBSD compatability
headers to be included in the same files.
Input from: cem, jhb
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24275
Include a temporarily compatibility shim as well for kernels predating
close_range, since closefrom is used in some critical areas.
Reviewed by: markj (previous version), kib
Differential Revision: https://reviews.freebsd.org/D24399
close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.
sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.
The flags argument of close_range(2) is currently unused, so any flags set
is currently EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.
This patch is based on a syscall of the same design that is expected to be
merged into Linux.
Reviewed by: kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision: https://reviews.freebsd.org/D21627
in the LinuxKPI.
This allows synchronize RCU to be used inside a SRCU read section.
No functional change intended.
Bump the __FreeBSD_version to force recompilation of external kernel modules.
PR: 242272
MFC after: 1 week
Sponsored by: Mellanox Technologies
Modern debuggers and process tracers use ptrace() rather than procfs
for debugging. ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code. This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.
PR: 244939 (exp-run)
Reviewed by: kib, mjg (earlier version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D23837
pci_iov_if.h was added to pci.h, but none of the kms-drm branches have
that. Rather than play whack a mole with the branches, move its inclusion to
linux_pci.c which is the only part of the code that needs it now.
Longer term, other solutions will be needed, but this gives us time to get those
deployed on all the supported versions.
This presents an extensible interface to the generic mmap(2)
implementation via a struct pointer intended to use a designated
initializer or compount literal. We take advantage of the mandatory
zeroing of fields not listed in the initializer.
Remove kern_mmap_fpcheck() and use kern_mmap_req().
The motivation for this change is a desire to keep the core
implementation from growing an ever-increasing number of arguments
that must be specified in the correct order for the lowest-level
implementations. In CheriBSD we have already added two more arguments.
Reviewed by: kib
Discussed with: kevans
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D23164
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.
This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884
On Linux the valid range of priorities for the SCHED_FIFO and SCHED_RR
scheduling policies is [1,99]. For SCHED_OTHER the single valid priority is
0. On FreeBSD it is [0,31] for all policies. Programs are supposed to
query the valid range using sched_get_priority_(min|max), but of course some
programs assume the Linux values are valid.
This commit adds a tunable compat.linux.map_sched_prio. When enabled
sched_get_priority_(min|max) return the Linux values and sched_setscheduler
and sched_(get|set)param translate between FreeBSD and Linux values.
Because there are more Linux levels than FreeBSD levels, multiple Linux
levels map to a single FreeBSD level, which means pre-emption might not
happen as it does on Linux, so the tunable allows to disable this behaviour.
It is enabled by default because I think it is unlikely that anyone runs
real-time software under Linux emulation on FreeBSD that critically relies
on correct pre-emption.
This fixes FMOD, a commercial sound library used by several games.
PR: 240043
Tested by: Alex S <iwtcex@gmail.com>
Reviewed by: dchagin
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D23790
Mesa's drm_syncobj usage, in the LinuxKPI.
While at it optimise the jiffies conversion functions to avoid repeated
and constant calculations.
Submitted by: Greg V <greg@unrelenting.technology>
Differential Revision: https://reviews.freebsd.org/D23846
MFC after: 1 week
Sponsored by: Mellanox Technologies
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
For drmkpi (D23085) we don't want the Linux struct file as we don't emulate
everything. Also the prototypes should be in shmem_fs.h to have 100%
compatibility with Linux.
Reviewed by: hselasky
MFC after: Maybe
Differential Revision: https://reviews.freebsd.org/D23764
This function test if the string str begins with the string pointed
at by prefix.
Reviewed by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23767
This function just test if the element is the first of the list.
Reviewed by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23766
realpath(3) is used a lot e.g., by clang and is a major source of getcwd
and fstatat calls. This can be done more efficiently in the kernel.
This works by performing a regular lookup while saving the name and found
parent directory. If the terminal vnode is a directory we can resolve it using
usual means. Otherwise we can use the name saved by lookup and resolve the
parent.
See the review for sample syscall counts.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23574
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked). Use it in
preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Reviewed by: hselasky, kib, zeising
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23631
A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery. Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.
The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).
With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.
The syscall is not exported from the stable libc version namespace on
purpose. It is intended to be used only by our C runtime
implementation internals.
Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773
Submitted by: Bora Özarslan <borako.ozarslan@gmail.com>
Submitted by: Yang Wang <2333@outlook.jp>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19917
- handle the CLOCK_{PROCESS,THREAD}_CPUTIME_ID specified directly;
- fix thread id calculation as in the Linuxulator we should
convert the user supplied thread id to struct thread * by linux_tdfind();
- fix CPUCLOCK_SCHED case by using kern_{process,thread}_cputime()
directly as native get_cputime() used by kern_clock_gettime() uses
native tdfind()/pfind() to find proccess/thread.
PR: 240990
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23341
MFC after: 2 weeks
so don't initialize nwhich in declaration and remove stale comment from r161304.
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D23339
MFC after: 2 weeks
for missing IP_RECVERR setsockopt(2) support. Without it, DNS
resolution is broken for glibc >= 2.30 (glibc BZ #24047).
From the user point of view this fixes "yum update" on recent
CentOS 8.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23234
This unbreaks Mono (mono-devel-4.6.2.7+dfsg-1ubuntu1 from Ubuntu Bionic);
previously would crash on "amd64_is_imm32" assert.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23306
This unbreaks Mono (mono-devel-4.6.2.7+dfsg-1ubuntu1 from Ubuntu Bionic);
previously would crash on "amd64_is_imm32" assert.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The Linux32 system call argument fetcher places each argument (passed in
registers in the Linux x86 system call convention) into an entry in the
generic system call args array. Each member of this array is 8 bytes
wide, so this approach is broken for system calls that take off_t
arguments.
Fix the problem by splitting l_loff_t arguments in the 32-bit system
call descriptions, the same as we do for FreeBSD32. Change entry points
to handle this using the PAIR32TO64 macro.
Move linux_ftruncate64() into compat/linux.
PR: 243155
Reported by: Alex S <iwtcex@gmail.com>
Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23210
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.
Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.
Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.
Reviewed by: brooks, imp
Nice!: emaste
Differential Revision: https://reviews.freebsd.org/D23197
This can happen if a file is closed during unix socket GC. The same bug
was fixed for devfs descriptors in r228361.
PR: 242913
Reported and tested by: iz-rpi03@hs-karlsruhe.de
Reviewed by: hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23178
sys_setsockopt. Just a cleanup; no functional changes.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22812
When either makesyscalls.lua or syscalls.master changes, all of the
${GENERATED} targets are now out-of-date. With make jobs > 1, this means we
will run the makesyscalls script in parallel for the same ABI, generating
the same set of output files.
Prior to r356603 , there is a large window for interlacing output for some
of the generated files that we were generating in-place rather than staging
in a temp dir. After that, we still should't need to run the script more
than once per-ABI as the first invocation should update all of them. Add
.ORDER to do so cleanly.
Reviewed by: brooks
Discussed with: sjg
Differential Revision: https://reviews.freebsd.org/D23099
- Export the offset into the backing object, not the object size.
- Fix a bug where we would print the previous entry's "offset" when a
map_entry has no object.
- Try to identify shared mappings. Linux prints "s" when the mapping
"may be shared". This attempt is not perfect, for example, we print
"p" for anonymous memory that may be shared via
minherit(INHERIT_SHARE).
PR: 240992
Reviewed by: kib
MFC after: 1 week
MFC note: no OBJ_ANON in stable/12
Differential Revision: https://reviews.freebsd.org/D23062
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.
Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036
When file sealing and shm_open2 were introduced, we should have grown a new
kern_shm_open2 helper that did the brunt of the work with the new interface
while kern_shm_open remains the same. Instead, more complexity was
introduced to kern_shm_open to handle the additional features and consumers
had to keep changing in somewhat awkward ways, and a kern_shm_open2 was
added to wrap kern_shm_open.
Backpedal on this and correct the situation- kern_shm_open returns to the
interface it had prior to file sealing being introduced, and neither
function needs an initial_seals argument anymore as it's handled in
kern_shm_open2 based on the shmflags.
Linux mmap rejects mmap() on a write-only file with EACCES.
linux_mmap_common currently does a fun dance to grab the fp associated with
the passed in fd, validates it, then drops the reference and calls into
kern_mmap(). Doing so is perhaps both fragile and premature; there's still
plenty of chance for the request to get rejected with a more appropriate
error, and it's prone to a race where the file we ultimately mmap has
changed after it drops its referenced.
This change alleviates the need to do this by providing a kern_mmap variant
that allows the caller to inspect the fp just before calling into the fileop
layer. The callback takes flags, prot, and maxprot as one could imagine
scenarios where any of these, in conjunction with the file itself, may
influence a caller's decision.
The file type check in the linux compat layer has been removed; EINVAL is
seemingly not an appropriate response to the file not being a vnode or
device. The fileop layer will reject the operation with ENODEV if it's not
supported, which more closely matches the common linux description of
mmap(2) return values.
If we discover that we're allowing an mmap() on a file type that Linux
normally wouldn't, we should restrict those explicitly.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22977
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
Its use of the page lock is incorrect, and it is not used by the DRM
modules.
Reviewed by: hselasky
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D23002
syscall is to query the CPU number and the NUMA domain the calling
thread is currently running on. The third argument is ignored.
It doesn't do anything regarding scheduling - it's literally
just a way to query the current state, without any guarantees
you won't get rescheduled an opcode later.
This unbreaks Java from CentOS 8
(java-11-openjdk-11.0.5.10-0.el8_0.x86_64).
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22972
copy_file_range(2) is implemented natively since r350315, make it available
for Linux binaries too.
Reviewed by: kib (mentor), trasz (previous version)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D22959
devices. It's required for LTP, among other things. It's not
complete, but good enough for now.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22950
srandom(9) is meaningless on SMP systems or any system with, say,
interrupts. One could never rely on random(9) to produce a reproducible
sequence of outputs on the basis of a specific srandom() seed because the
global state was shared by all kernel contexts. As such, removing it is
literally indistinguishable to random(9) consumers (as compared with
retaining it).
Mark random(9) as deprecated and slated for quick removal. This is not to
say we intend to remove all fast, non-cryptographic PRNG(s) in the kernel.
It/they just won't be random(9), as it exists today, in either name or
implementation.
Before random(9) is removed, a replacement will be provided and in-tree
consumers will be converted.
Note that despite the name, the random(9) interface does not bear any
resemblance to random(3). Instead, it is the same crummy 1988 Park-Miller
LCG used in libc rand(3).
removed from objects including calls to free. Pages must not be xbusy
when freed and not on an object. Strengthen assertions to match these
expectations. In practice very little code had to change busy handling
to meet these rules but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.
Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state. This removes redundant and
potentially problematic code that has proliferated.
Discussed with: markj
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22822
than "/compat/linux". Useful when you have several compat directories
with different Linux versions and you don't want to clash with files
installed by linux-c7 packages.
Reviewed by: bcr (manpages)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22574
Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.
This change has not been made to 4BSD but in principle it would be
more straightforward.
Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778
Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.
Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626
And remove the inline/deprecated attribute use entirely in stdlib.h, from
r355747. The intent was to provide a buildable API transitionary period, but
clearly that was counter-productive.
Reported by: delphij, imp, others
Probably all of these linuxkpi stubs should be '#ifndef' guarded, but maybe
that would prevent people from noticing when they are defined.
Introduced in r355759. For some reason I only ran a buildworld and not a
kernel. Mea culpa.
Reported by: Mark Millard
X-MFC-with: r355759
over the usual fsync(2).
This silences some warnings when running "apt-get upgrade".
Reviewed by: brooks, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22371
Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector. Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack. It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.
This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).
Reviewed by: kib
Tested on: amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22695
Use the power of variable to avoid spelling out source and generated
files too many times. The previous Makefiles were hard to read, hard to
edit, and badly formatted.
Reviewed by: kevans, emaste
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22714
- Use ustringp for the location of the argv and environment strings
and allow destp to travel further down the stack for the stackgap
and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs. This
used to hold translated system call arguments, but hasn't been used
since r159992.
Reviewed by: kib
Tested on: md64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22501
Linux epoll allow passing of any negative timeout value to epoll_wait()
to cause unbound blocking
Reviewed by: emaste
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22517
Such an events are legal and should be interpreted as EPOLLERR | EPOLLHUP.
Register a disabled kqueue event in that case as we do not support EPOLLHUP yet.
Required by Linux Steam client.
PR: 240590
Reported by: Alex S <iwtcex@gmail.com>
Reviewed by: emaste
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22516
Linux epoll EPOLL_CTL_ADD op handler should always check registration
of both EVFILT_READ and EVFILT_WRITE kevents to deceide if supplied
file descriptor fd is already registered with epoll instance.
Reviewed by: emaste
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22515
Linux epoll does not remove descriptor after one-shot event has been triggered.
Set EV_DISPATCH kqueue flag rather then EV_ONESHOT to get the same behavior.
Required by Linux Steam client.
PR: 240590
Reported by: Alex S <iwtcex@gmail.com>
Reviewed by: emaste, imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22513
The lua-based makesyscalls produces slightly different output than its
makesyscalls.sh predecessor, all whitespace differences more closely
matching the source syscalls.master.
flua is bootstrapped as part of the build for those on older
versions/revisions that don't yet have flua installed. Once upgraded past
r354833, "make sysent" will again naturally work as expected.
Reviewed by: brooks
Differential Revision: https://reviews.freebsd.org/D21894
Co-mingling two things here:
* Addressing some feedback from Konstantin and Kyle re: jail,
capability mode, and a few other things
* Adding audit support as promised.
The audit support change includes a partial refresh of OpenBSM from
upstream, where the change to add shm_rename has already been
accepted. Matthew doesn't plan to work on refreshing anything else to
support audit for those new event types.
Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: kib
Relnotes: Yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22083
when being passed O_NOFOLLOW. This fixes LTP testcase openat02:5.
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22384
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook. In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland. This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.
Reviewed by: brooks, emaste
Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22355
Pointer arguments should be of the form "<type> *..." and not "<type>* ...".
No functional change.
Reviewed by: kevans
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22373
When reporting a process' stats, we can't just provide the tty as an
unsigned long, as if we have no controlling tty, the tty would be NODEV, or
-1. Instaed, just special-case NODEV.
Submitted by: Juraj Lutter <otis@sk.FreeBSD.org>
MFC after: 1 week
In the cases where Linux returns an error (e.g. passing in an undefined
flag) there's no need for us to emit a message. (The target of this
message is a developer working on the linuxulatorm, not the author of
presumably broken Linux software).
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21606
a tmpfs to be mounted there, and because they like to verify it's
actually a mountpoint, a symlink won't do.
Reviewed by: dchagin (earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20333
Bump the __FreeBSD_version to force recompilation of
external kernel modules due to structure change.
Differential Revision: https://reviews.freebsd.org/D21564
Submitted by: Greg V <greg@unrelenting.technology>
MFC after: 1 week
Sponsored by: Mellanox Technologies
The LinuxKPI linux_dma code calls PCTRIE_INSERT with a
mutex held, but does not set M_NOWAIT when allocating
nodes, leading to a potential panic. All of this code
can handle an allocation failure here, so prefer an
allocation failure to sleeping on memory.
Also fix a related case where NOWAIT/WAITOK was not
specified. In this case it's not clear whether sleeping
is allowed so be conservative and assume not. There are
a lot of other paths in this code that can fail due to
a lack of memory anyway.
Differential Revision: https://reviews.freebsd.org/D22127
Reviewed by: imp
Sponsored by: Dell EMC Isilon
MFC After: 1 week
Move futex_mtx to linux_common.ko for amd64 and aarch64 along
with respective list/mutex init/destroy.
PR: 240989
Reported by: Alex S <iwtcex@gmail.com>
Move futex_list definition to linux.c which is included once
in linux.ko (i386) and in linux_common.ko (amd64 and aarch64)
allowing 32/64 bit linux programs to access the same futexes
in the latter case.
PR: 240989
Reviewed by: dchagin
Differential Revision: https://reviews.freebsd.org/D22073
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.
Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.
Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882
Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.
truss support is included; audit support will come later.
This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.
Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: jilles (earlier revision)
Reviewed by: brueffer (manpages, earlier revision)
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21423
shm_open2 allows a little more flexibility than the original shm_open.
shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate
shmflag argument that can be expanded later. Currently the only shmflag is
to allow file sealing on the returned fd.
shm_open and memfd_create will both be implemented in libc to use this new
syscall.
__FreeBSD_version is bumped to indicate the presence.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21393
Now that flags may be set on posixshm, add an argument to kern_shm_open()
for the initial seals. To maintain past behavior where callers of
shm_open(2) are guaranteed to not have any seals applied to the fd they're
given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special
flag could be opened later for shm_open(2) to indicate that sealing should
be allowed.
We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if
F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a
shmfd that already existed. A note's been added about the assumptions we've
made here as a hint towards anyone wanting to allow other seals to be
applied at creation.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21392
Just return EINVAL if flags != 0. The Linux man page documents one
case of EINVAL as "The filesystem does not support one of the flags in
flags."
After r351723 userland binaries will try using new system calls.
Reported by: mjg
Reviewed by: mjg, trasz
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21590
A work callback may restart itself. Loop in the drain function to see if the
work has been rescheduled and stop the subsequent reschedules, if any.
MFC after: 1 week
Sponsored by: Mellanox Technologies
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
It allows a process to request that stack gap was not applied to its
stacks, retroactively. Also it is possible to control the gaps in the
process after exec.
PR: 239894
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21352
Previously userspace would issue one syscall to resolve the sysctl and then
another one to actually use it. Do it all in one trip.
Fallback is provided in case newer libc happens to be running on an older
kernel.
Submitted by: Pawel Biernacki
Reported by: kib, brooks
Differential Revision: https://reviews.freebsd.org/D17282
glibc 2.24 and up (eg Ubuntu 19.04) fail with "FATAL: kernel too old".
This alone is not enough to make newer binaries actually work;
fix/hack/workaround is pending review at https://reviews.freebsd.org/D20687.
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20757
'compat.linux.osrelease=3.10.0-957.12.1.el7.x86_64', which
corresponds to CentOS 7.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20685
Require the vnode to be locked for the VOP_UNSET_TEXT() call. This
will be used by the following bug fix for a tmpfs issue.
Tested by: sbruno, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week