a tmpfs to be mounted there, and because they like to verify it's
actually a mountpoint, a symlink won't do.
Reviewed by: dchagin (earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20333
Move futex_mtx to linux_common.ko for amd64 and aarch64 along
with respective list/mutex init/destroy.
PR: 240989
Reported by: Alex S <iwtcex@gmail.com>
Move futex_list definition to linux.c which is included once
in linux.ko (i386) and in linux_common.ko (amd64 and aarch64)
allowing 32/64 bit linux programs to access the same futexes
in the latter case.
PR: 240989
Reviewed by: dchagin
Differential Revision: https://reviews.freebsd.org/D22073
Just return EINVAL if flags != 0. The Linux man page documents one
case of EINVAL as "The filesystem does not support one of the flags in
flags."
After r351723 userland binaries will try using new system calls.
Reported by: mjg
Reviewed by: mjg, trasz
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21590
glibc 2.24 and up (eg Ubuntu 19.04) fail with "FATAL: kernel too old".
This alone is not enough to make newer binaries actually work;
fix/hack/workaround is pending review at https://reviews.freebsd.org/D20687.
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20757
'compat.linux.osrelease=3.10.0-957.12.1.el7.x86_64', which
corresponds to CentOS 7.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20685
Require the vnode to be locked for the VOP_UNSET_TEXT() call. This
will be used by the following bug fix for a tmpfs issue.
Tested by: sbruno, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
After r349240 kern_mprotect returns EINVAL for unsupported bits in the prot
argument. Linux rtld uses PROT_GROWSDOWN and PROT_GROWS_UP when marking the
stack executable. Mask these bits like kern_mprotect used to do. For other
unsupported bits EINVAL is returned like Linux does.
Reviewed by: trasz, brooks
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20864
New system calls between 2.6.32 and 2.6.26 are already implemented.
This should be mostly NFC as far as contemporary Linux applications are
concerned though, as Linux kernel 3.2 is the oldest supported by a
number of popular distros today; work is in progress by others to enable
support for those applications.
Discussed with: trasz
MFC after: 1 month
Linux man(1) calls it for no good reason; this avoids the console spam
(eg '(man): ioctl fd=4, cmd=0x660b ('f',11) is not implemented').
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20690
return something reasonable, and helps linux binaries which attempt
to close all the files, eg apt(8).
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20692
This was added for emulation of Linux's CDROMSUBCHNL, but allows
users with read access to a cd(4) device to overwrite kernel memory
provided that the driver detects some media present.
Reimplement CDROMSUBCHNL by bouncing the data from CDIOCREADSUBCHANNEL
through the linux_cdrom_subchnl structure passed from userspace.
admbugs: 768
Reported by: Alex Fortune
Security: CVE-2019-5602
Security: FreeBSD-SA-19:11.cd_ioctl
struct xucred. Do not bump XUCRED_VERSION as struct layout is not changed.
PR: 215202
Reviewed by: tijl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20415
socket is non-blocking and connect() is not finished yet.
Initial patch developed by Steven Hartland in 2008 and adopted by me.
PR: 129169
Reported by: smh@
MFC after: 2 weeks
of them listed in opt_global.h which is not generated while building
modules outside of a kernel and such modules never match real cofigured
kernel.
So, we should prevent our users from building obviously defective modules.
Therefore, remove the root cause of the building of modules outside of a
kernel - the possibility of building modules with DEBUG or KTR flags.
And remove all of DEBUG printfs as it is incomplete and in threaded
programms not informative, also a half of system call does not have DEBUG
printf. For debuging Linux programms we have dtrace, ktr and ktrace ability.
PR: 222861
Reviewed by: trasz
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20178
alter the userspace sockaddr to convert the format between linux and BSD versions.
That's the minimum 3 of copyin/copyout operations for one syscall.
Also some syscall uses linux_sa_put() and linux_getsockaddr() when load
sockaddr to userspace or from userspace accordingly.
To avoid this chaos, especially converting sockaddr in the userspace,
rewrite these 4 functions to convert sockaddr only in kernel and leave
only 2 of this functions.
Also in order to reduce duplication between MD parts of the Linuxulator put
struct sockaddr conversion functions that are MI out into linux_common module.
PR: 232920
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20157
a final revision.
Fix style issues and change bool-like variables from int to bool.
Reviewed by: emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20141
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
move bits that are MI out into the headers in compat/linux.
For that remove bogus _packed attribute from struct l_sockaddr
and use MI types for struct members.
And continue to move into the linux_common module a code that is
intended for both Linuxulator modules (both instruction set - 32 & 64 bit)
or for external modules like linsysfs or linprocfs.
To avoid header pollution introduce new sys/compat/linux_common.h header.
Reviewed by: emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20137
the file associated with the given file descriptor.
Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567
For 32-bit Linuxulator, ipc() syscall was historically
the entry point for the IPC API. Starting in Linux 4.18, direct
syscalls are provided for the IPC. Enable it.
MFC after: 1 month
Most siginfo_to_lsiginfo callers already zeroed the l_siginfo_t before
callit it, but linux_waitid did not. Instead of zeroing in the called
function to address linux_waitid (as in commit 2e6ebe70), just do it in
linux_waitid.
admbugs: 765
Reported by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by: Andrew
MFC after: 1 day
Security: Kernel stack memory disclosure
Sponsored by: The FreeBSD Foundation
An integrity check such as a check-hash or a cross-correlation failed.
The integrity error falls between EINVAL that identifies errors in
parameters to a system call and EIO that identifies errors with the
underlying storage media. EINTEGRITY is typically raised by intermediate
kernel layers such as a filesystem or an in-kernel GEOM subsystem when
they detect inconsistencies. Uses include allowing the mount(8) command
to return a different exit value to automate the running of fsck(8)
during a system boot.
These changes make no use of the new error, they just add it. Later
commits will be made for the use of the new error number and it will
be added to additional manual pages as appropriate.
Reviewed by: gnn, dim, brueffer, imp
Discussed with: kib, cem, emaste, ed, jilles
Differential Revision: https://reviews.freebsd.org/D18765
Some kevent functions have a boolean "waitok" parameter for use when
calling malloc(9). Replace them with the corresponding malloc() flags:
the desired behaviour is known at compile-time, so this eliminates a
couple of conditional branches, and makes the code easier to read.
No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18318
This was hidden behind the LINUX_CMSG_NXTHDR macro which dereferences its
second argument. Stop using the macro as well as LINUX_CMSG_FIRSTHDR. Use
the size field of the kernel copy of the control message header to obtain
the next control message.
PR: 217901
MFC after: 2 days
X-MFC-With: r340631
Instead of calling m_append with a user address, allocate an mbuf cluster
and copy data into it using copyin. For the SCM_CREDS case, instead of
zeroing a stack variable and appending that to the mbuf, zero part of the
mbuf cluster directly. One mbuf cluster is also the size limit used by
the FreeBSD sendmsg syscall (uipc_syscalls.c:sockargs()).
PR: 217901
Reviewed by: kib
MFC after: 3 days
Doing so removes the dependency on proctree lock from sysctl process list
export which further reduces contention during poudriere -j 128 runs.
Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17825
Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17852
linux_ioctl_(un)register_handler that allows other driver modules to
register ioctl handlers. The ioctl syscall implementation in each Linux
compat module iterates over the list of handlers and forwards the call to
the appropriate driver. Because the registration functions have the same
name in each module it is not possible for a driver to support both 32 and
64 bit linux compatibility.
Move the list of ioctl handlers to linux_common.ko so it is shared by
both Linux modules and all drivers receive both 32 and 64 bit ioctl calls
with one registration. These ioctl handlers normally forward the call
to the FreeBSD ioctl handler which can handle both 32 and 64 bit.
Keep the special COMPAT_LINUX32 ioctl handlers in linux.ko in a separate
list for now and let the ioctl syscall iterate over that list first.
Later, COMPAT_LINUX32 support can be added to the 64 bit ioctl handlers
via a runtime check for ILP32 like is done for COMPAT_FREEBSD32 and then
this separate list would disappear again. That is a much bigger effort
however and this commit is meant to be MFCable.
This enables linux64 support in x11/nvidia-driver*.
PR: 206711
Reviewed by: kib
MFC after: 3 days
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory. Supposedly the interface is similar
to the same proposed Linux flags.
Reviewed by: jilles (code, previous version of manpages), 0mp (manpages)
Discussed with: allanjude, emaste, jonathan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17547
If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages. In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table. Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient. To simplify cleanup, control message truncation
is now only performed at control message boundaries.
The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
message boundary, e.g., if the caller received two control messages
and provided only the exact amount of buffer space needed for the
first.
PR: 131876
Reviewed by: ed (previous version)
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16561
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.
Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.
Bump _FreeBSD_version due to the breaking KPI change.
Discussed with: cem, jilles, ian, bde
Differential Revision: https://reviews.freebsd.org/D14725
Previously the linuxulator's linux_brk invoked the FreeBSD sys_break
syscall implementation directly. Instead, move the bulk of the existing
implementation to kern_break, and call that from both sys_break and
linux_brk.
This also addresses a minor bug in linux_brk in that we now return the
actual (rounded up) break address, rather than the requested value.
Reviewed by: brooks (earlier version)
Sponsored by: Turing Robotic Industries
Differential Revision: https://reviews.freebsd.org/D16019
The Linux compatibility code was converting the version number (e.g.
2.6.32) in two different ways and then comparing the results.
The linux_map_osrel() function converted MAJOR.MINOR.PATCH similar to
what FreeBSD does natively. I.e. where major=v0, minor=v1, and patch=v2
v = v0 * 1000000 + v1 * 1000 + v2;
The LINUX_KERNVER() macro, on the other hand, converted the value with
bit shifts. I.e. where major=a, minor=b, and patch=c
v = (((a) << 16) + ((b) << 8) + (c))
The Linux kernel uses the later format via the KERNEL_VERSION() macro in
include/generated/uapi/linux/version.h
Fix is to use the LINUX_KERNVER() macro in linux_map_osrel() as well as
in the .trans_osrel functions.
PR: 229209
Reviewed by: emaste, cem, imp (mentor)
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15952
So that the ptrace code will catch it and report it to attached
debugger. Enables debugging of threaded Linux binaries with FreeBSD
debugger.
Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com>
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15880
Linux 2.6.26 introduced 64-bit capability sets. Extend our stub
implementation to handle both 32- and 64-bit. (We still report no
capabilities in capget, and disallow any in capset.)
Reviewed by: chuck
Sponsored by: Turing Robotic Industries Inc.
Differential Revision: https://reviews.freebsd.org/D15887
Existing linuxulator platforms (i386, amd64) support legacy syscalls,
such as non-*at ones like open, but arm64 and other new platforms do
not.
Wrap these in #ifdef LINUX_LEGACY_SYSCALLS, #defined in the MD linux.h
files. We may need finer grained control in the future but this is
sufficient for now.
Reviewed by: andrew
Sponsored by: Turing Robotic Industries
Differential Revision: https://reviews.freebsd.org/D15237
The Linuxulator provides per-syscall debug control via the
compat.linux.debug sysctl. There's generally a 1:1 mapping between
sysctl setting and syscall, but faccessat was controlled by the access
setting, perhaps due to copy-paste.
Sponsored by: Turing Robotic Industries
The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestage of a need for it to be called something else (eg obreak)
was removed in r225617 which consistantly prefixed all syscall
implementations.
Reviewed by: emaste, kib (older version)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15638
the old encodings for the lower 16 and 32 bits and only using the
higher 32 bits for unusually large major and minor numbers. This
change breaks compatibility with the previous encoding (which was only
used in -current).
Fix truncation to (essentially) 16-bit dev_t in newnfs v3.
Any encoding of device numbers gives an ABI, so it can't be changed
without translations for compatibility. Extra bits give the much
larger complication that the translations need to compress into fewer
bits. Fortunately, more than 32 bits are rarely needed, so
compression is rarely needed except for 16-bit linux dev_t where it
was always needed but never done.
The previous encoding moved the major number into the top 32 bits.
Almost no translation code handled this, so the major number was blindly
truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with
major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent
this and blindly truncated it to 2. But if this mknod was run on any
released version of FreeBSD, it gives dev_t = 0x102. ffs can represent
this, but in the previous encoding it was not decoded, giving major = 0,
minor = 0x102.
The presence of bugs was most obvious for exporting dev_t's from an
old system to -current, since bugs in newnfs augment them. I fixed
oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed
to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then
further in -current. E.g., old ad0 with major = 234, minor = 0x10002
had the correct (major, minor) number on the wire, but newnfs truncated
this to (234, 2) and then the previous encoding shifted the major
number into oblivion as seen by ffs or old applications.
I first tried to fix this by translating on every ABI/API boundary, but
there are too many boundaries and too many sloppy translations by blind
truncation. So use the old encoding for the low 32 bits so that sloppy
translations work no worse than before provided the high 32 bits are
not set. Add some error checking for when bits are lost. Keep not
doing any error checking for translations for almost everything in
compat/linux.
compat/freebsd32/freebsd32_misc.c:
Optionally check for losing bits after possibly-truncating assignments as
before.
compat/linux/linux_stats.c:
Depend on the representation being compatible with Linux's (or just with
itself for local use) and spell some of the translations as assignments in
a macro that hides the details.
fs/nfsclient/nfs_clcomsubs.c:
Essentially the same fix as in 1996, except there is now no possible
truncation in makedev() itself. Also fix nearby style bugs.
kern/vfs_syscalls.c:
As for freebsd32. Also update the sysctl description to include file
numbers, and change it to describe device ids as device numbers.
sys/types.h:
Use inline functions (wrapped by macros) since the expressions are now
a bit too complicated for plain macros. Describe the encoding and
some of the reasons for it. 16-bit compatibility didn't leave many
reasonable choices for the 32-bit encoding, and 32-bit compatibility
doesn't leave many reasonable choices for the 64-bit encoding. My
choice is to put the 8 new minor bits in the low 8 bits of the top 32
bits. This minimizes discontiguities.
Reviewed by: kib (except for rewrite of the comment in linux_stats.c)
of 64-bit dev_t's (but not ones involving dev_t's).
st_size was supposed to be clamped in cvtstat() and linux's copy_stat(),
but the clamping code wasn't aware that st_size is signed, and also had
an obfuscated off-by-1 value for the unsigned limit, so its effect was
to produce a bizarre negative size instead of clamping.
Change freebsd32's copy_ostat() to be no worse than cvtstat(). It was
missing clamping and bzero()ing of padding.
Reviewed by: kib (except a final fix of the clamp to the signed maximum)
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
Use proper method to access userspace. For now, only the slow copyout
path is implemented.
Reported and tested by: tijl (previous version)
Sponsored by: The FreeBSD Foundation
Previously linuxulator had three identical copies of
linux_exec_imgact_try. Deduplicate before adding another arch to
linuxulator.
Sponsored by: Turing Robotic Industries Inc
Differential Revision: https://reviews.freebsd.org/D14856
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
strlcpy is used to copy a string into a buffer to be copied to userland,
previously leaving uninitialized data after the terminating NUL. Zero
the buffer first to avoid a kernel memory disclosure.
admbugs: 765, 811
MFC after: 1 day
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reported by: Vlad Tsyrklevich
Sponsored by: The FreeBSD Foundation
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text. To avoid license proliferation switch to
the standard 2-Clause FreeBSD license for those files where I have
permission from each of the listed copyright holders.
Approved by: rdivacky, marcel
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Three copies of the linuxulator linux_sysvec.c contained identical
BSD to Linux errno translation tables, and future work to support other
architectures will also use the same table. Move the table to a common
file to be used by all. Make it 'const int' to place it in .rodata.
(Some existing Linux architectures use MD errno values, but x86 and Arm
share the generic set.)
This change should introduce no functional change; a followup will add
missing errno values.
MFC after: 3 weeks
Sponsored by: Turing Robotic Industries Inc.
Differential Revision: https://reviews.freebsd.org/D14665
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text. To avoid license proliferation switch to
the standard 2-clause FreeBSD license for those files where I have
permission from each of the listed copyright holders. Additional files
still waiting on permission from others are listed in review D14210.
Approved by: dchagin, rdivacky, sos
MFC after: 1 week
MFC with: r329370
Sponsored by: The FreeBSD Foundation
binary mode user-space emulation layer. This is a regression issue after
r328436, when LinuxKPI character devices started to use DTYPE_DEV in
the "f_type" field of the associated file structure(s).
MFC after: 3 days
Found by: Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Mellanox Technologies
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text. To avoid license proliferation switch to
the standard 2-clause FreeBSD license for those files where I have
permission from each of the listed copyright holders. Additional files
waiting on permission from others are listed in review D14210.
Approved by: kan, marcel, sos, rdivacky
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
A version of each of the MD files by necessity exists for each CPU
architecture supported by the Linuxolator. Clean these up so that new
architectures do not inherit whitespace issues.
Clean up shared Linuxolator files while here.
Sponsored by: Turing Robotic Industries Inc.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).
Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.
Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.
To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.
Reviewed by: kib, jhibbits
Differential Revision: https://reviews.freebsd.org/D13180
Expose more information about PCI devices (and GPUs in particular) via
linsysfs to libdrm.
This allows unmodified modern 64-bit Linux libdrm to work, which allows
modern Linux Mesa to work. The submitter reports that he tested the change
with an Ubuntu 16.04 chroot + amdgpu from graphics/drm-next-kmod.
PR: 222375
Submitted by: Greg V <greg AT unrelenting.technology>
were copied to the buffer supplied by the user.
Also fix getrandom() if Linuxulator modules are built without the kernel.
PR: 219464
Submitted by: Maciej Pasternacki
Reported by: Maciej Pasternacki
MFC after: 1 week
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
According to the C standard, it is invalid to copy beyond the end of an
object, even if that object is obviously a member of a larger object (a
struct, in this case).
Appease the standard and Coverity by refactoring the copy in a
straightforward way. No functional change.
Reported by: Coverity (CWE-120)
CIDs: 1007819, 1007820, 1007821, 1007822, 1009668, 1009669
Security: no (false positive detection)
Sponsored by: Dell EMC Isilon
so overflow can still occur. Fix that. Also remove the extra check for
tv_sec size as under COMPAT_LINUX32 it is always true.
Pointed out by: bde@
MFC after: 1 week
nanosleep() updates rmtp on EINVAL. In that case, kern_nanosleep()
has not updated rmt, so sys_nanosleep() updates the user-space rmtp
by copying garbage from its stack frame. This is not only a kernel
memory disclosure, it's also not POSIX-compliant. Fix it to update
rmtp only on EINTR.
Reviewed by: jilles (via D10020), dchagin
MFC after: 3 days
Security: possibly
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D10044
related struct definitions out into the MI path.
Invert the native ipc structs to the Linux ipc structs convesion logic.
Since 64-bit variant of ipc structs has more precision convert native ipc
structs to the 64-bit Linux ipc structs and then truncate 64-bit values
into the non 64-bit if needed. Unlike Linux, return EOVERFLOW if the
values do not fit.
Fix SYSV IPC for 64-bit Linuxulator which never sets IPC_64 bit.
MFC after: 1 month
the syscalls that are not implemented in Linux kernel itself.
Cleanup DUMMY() macros.
Reviewed by: dchagin, trasz
Approved by: dchagin
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D9804
Linuxulator is x86 only.
The only notable differences in algnment for an LP64 64-bit system
when compared to a 32-bit system is an eight or large byte types
alignment.
MFC after: 1 month
would be useless anyway - there is no point in pretending to have block
devices; our "block" devices are in fact character ones, and can only
be accessed as such.
Discussed with: dchagin
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
parameter to mmap(2), even if MAP_FIXED is not explicitly specified.
Android ART is one example. Implement bug compatibility for this case
in linuxulator.
Reviewed by: dchagin@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9373
Convert linux_recv(), linux_send() and linux_accept() system call arguments
to the register_t type too.
PR: 217161
MFC after: 3 days
xMFC with: r313284,r313285,r313684
getdents64() with wrapper over kern_getdirentries().
The patch was originally written by emaste@ and then adapted by trasz@
and me.
Note:
1. I divided linux_getdents() and linux_readdir() as in case when the
getdents() called with count = 1 (readdir() case) it can overwrite
user stack (by writing to user buffer pointer more than 1 byte).
2. Linux returns EINVAL in case when user supplied buffer is not enough
to contain fetched dirent.
3. Linux returns ENOTDIR in case when fd points to not a directory.
Reviewed by: trasz@
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D2210
Rename kern_vm_* functions to kern_*. Move the prototypes to
syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.
Requested by: alc, jhb
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9535
Members of the syscall argument structures are padded to a word size. So,
for COMPAT_LINUX32 we should convert user supplied system call arguments
which is 32-bit in that case to the array of register_t.
Reported by: Oleg V. Nauman
MFC after: 1 week
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.
Reviewed by: ed, dchagin, kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9378
and wrong numbering for a few unimplemented syscalls.
For 32-bit Linuxulator, socketcall() syscall was historically
the entry point for the sockets API. Starting in Linux 4.3, direct
syscalls are provided for the sockets API. Enable it.
The initial version of patch was provided by trasz@ and extended by me.
Submitted by: trasz
MFC after: 2 week
Differential Revision: https://reviews.freebsd.org/D9381
and use it in compats instead of their sys_*() counterparts.
Reviewed by: kib, jhb, dchagin
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9383
of their sys_*() counterparts. The svr4 is left unchanged.
Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9379
instead of their sys_*() counterparts in various compats. The svr4
is left untouched, because there's no point.
Reviewed by: ed@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9367
Right now size of the structure is 472 bytes on amd64, which is
already large and stack allocations are indesirable. With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.
Extracted from: ino64 work by gleb
Discussed with: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.
PR: 201052
Reviewed by: emaste, jonathan
Discussed with: many
Differential Revision: https://reviews.freebsd.org/D7724
Make the kern_fsync() function public, so that it can be used by other
parts of the kernel. Fix up existing consumers to make use of it.
Requested by: kib
FreeBSD support NX bit on X86_64 processors out of the box, for i386 emulation
use READ_IMPLIES_EXEC flag, introduced in r302515.
While here move common part of mmap() and mprotect() code to the files in compat/linux
to reduce code dupcliation between Linuxulator's.
Reported by: Johannes Jost Meixner, Shawn Webb
MFC after: 1 week
XMFC with: r302515, r302516
In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap().
Linux/i386 set this flag automatically if the binary requires executable stack.
READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.
[1] Remove unneeded sockaddr conversion before kern_recvit() call as the from
argument is used to record result (the source address of the received message) only.
[2] In Linux the type of msg_namelen member of struct msghdr is signed but native
msg_namelen has a unsigned type (socklen_t). So use the proper storage to fetch fromlen
from userspace and than check the user supplied value and return EINVAL if it is less
than 0 as a Linux do.
Reported by: Thomas Mueller <tmueller at sysgo dot com> [1]
Reviewed by: kib@
Approved by: re (gjb, kib)
MFC after: 3 days
threads, to make it less confusing and using modern kernel terms.
Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
cpu_set_upcall -> cpu_copy_thread (for forks)
cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
Differential revision: https://reviews.freebsd.org/D6731
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.
A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held. The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.
The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths. Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.
The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive). Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.
Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot. When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.
The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.
Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode' page.
Reviewed by: jilles (previous version, supposedly the objection was fixed)
Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by: pho
Sponsored by: The FreeBSD Foundation
This is a minor follow-up to r297422, prompted by a Coverity warning. (It's
not a real defect, just a code smell.) OSD slot array reservations are an
array of pointers (void **) but were cast to void* and back unnecessarily.
Keep the correct type from reservation to use.
osd.9 is updated to match, along with a few trivial igor fixes.
Reported by: Coverity
CID: 1353811
Sponsored by: EMC / Isilon Storage Division