Commit Graph

1554 Commits

Author SHA1 Message Date
Dmitry Chagin
187715a420 linux(4): Implement clock_getres_time64 system call.
MFC after:	2 weeks
2021-06-07 05:21:32 +03:00
Dmitry Chagin
19f9a0e4df linux(4): Implement clock_settime64 system call.
MFC after:	2 weeks
2021-06-07 05:11:25 +03:00
Dmitry Chagin
99b6f43069 linux(4): Implement clock_gettime64 system call.
MFC after:	2 weeks
2021-06-07 05:04:42 +03:00
Dmitry Chagin
e4bffb80bb linux(4): Implement utimensat_time64 system call.
MFC after:	2 weeks
2021-06-07 04:54:30 +03:00
Dmitry Chagin
bfcce1a9f6 linux(4): add struct timespec64 definition and conversion routine for
future use.

MFC after:		2 weeks
2021-06-07 04:47:12 +03:00
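
Note (illustrative, not taken from the commit): the *_time64 calls pass time as a pair of signed 64-bit fields so that 32-bit Linux binaries can express times past 2038. A minimal sketch of such a conversion routine follows. The little-endian layout, the helper names, and the tv_nsec range check are assumptions made for illustration; the real routine is C kernel code and may validate differently.

```python
import struct

# Assumed layout: two little-endian signed 64-bit fields (tv_sec, tv_nsec).
TIMESPEC64 = struct.Struct("<qq")
NSEC_PER_SEC = 1_000_000_000


def encode_timespec64(tv_sec, tv_nsec):
    """Pack seconds and nanoseconds into the 16-byte form (hypothetical helper)."""
    return TIMESPEC64.pack(tv_sec, tv_nsec)


def decode_timespec64(raw):
    """Unpack a 16-byte timespec64 and reject an out-of-range tv_nsec."""
    tv_sec, tv_nsec = TIMESPEC64.unpack(raw)
    if not 0 <= tv_nsec < NSEC_PER_SEC:
        raise ValueError("tv_nsec out of range")
    return tv_sec, tv_nsec
```

A tv_sec of 2**31 does not fit in a signed 32-bit field, but round-trips unchanged through this 64-bit form.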
Dmitry Chagin
2a0fa277f6 linux(4): Microoptimize futimesat, utimes, utime.
While here wrap long line.

Differential Revision:	https://reviews.freebsd.org/D30488
MFC after:		2 weeks
2021-05-31 22:54:18 +03:00
Dmitry Chagin
b4f9b6eef2 linux(4): Handle AT_EMPTY_PATH in the utimensat syscall.
Differential Revision:	https://reviews.freebsd.org/D30518
MFC after:		2 weeks
2021-05-31 22:37:06 +03:00
Dmitry Chagin
8505eb5dd8 linux(4): Convert flags before use in utimensat.
Differential Revision:	https://reviews.freebsd.org/D30487
MFC after:		2 weeks
2021-05-31 22:30:37 +03:00
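
Note (illustrative, not taken from the commit): Linux and FreeBSD assign different bit values to the AT_ flags, so a Linux flag word has to be translated before it reaches native code. The sketch below shows the idea for two flags. The numeric values are quoted from memory of the respective <fcntl.h> headers and should be treated as assumptions; the function name and the rejection of unknown bits are hypothetical.

```python
# Linux-side values (assumed, from Linux <fcntl.h>).
LINUX_AT_SYMLINK_NOFOLLOW = 0x100
LINUX_AT_EMPTY_PATH = 0x1000

# FreeBSD-side values (assumed, from FreeBSD <fcntl.h>).
AT_SYMLINK_NOFOLLOW = 0x0200
AT_EMPTY_PATH = 0x4000

FLAG_MAP = {
    LINUX_AT_SYMLINK_NOFOLLOW: AT_SYMLINK_NOFOLLOW,
    LINUX_AT_EMPTY_PATH: AT_EMPTY_PATH,
}


def linux_to_bsd_at_flags(lflags):
    """Translate a Linux AT_ flag word; reject bits this sketch does not know."""
    known = 0
    for lbit in FLAG_MAP:
        known |= lbit
    if lflags & ~known:
        raise ValueError("unsupported flag bits")  # stands in for EINVAL
    bflags = 0
    for lbit, bbit in FLAG_MAP.items():
        if lflags & lbit:
            bflags |= bbit
    return bflags
```

The translated word differs from the input even for a single flag, which is why testing or passing the Linux value unconverted gives wrong results.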
Dmitry Chagin
a06c12464b linux(4): Add F_GETPIPE_SZ fcntl operation which returns the capacity
of the pipe referred by fd.

Differential Revision:	https://reviews.freebsd.org/D30517
MFC after:		2 weeks
2021-05-31 22:15:02 +03:00
Edward Tomasz Napierala
83043a741d linux: deduplicate DUMMY() entries
No functional changes.

Reviewed By:	emaste
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30524
2021-05-29 17:51:36 +00:00
Edward Tomasz Napierala
6d926e850d linux: add new syscall numbers
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30193
2021-05-28 09:02:16 +01:00
Dmitry Chagin
5184e2da41 linux_common: retire extra module version.
The second 'linuxcommon' line was added by c66f5b079d,
but the Linuxulator's modules depend on 'linux_common'.
To avoid such mistakes in the future, rename the moduledata name and module
name to 'linux_common' and retire the 'linuxcommon' line.

Reviewed by:		emaste
Differential Revision:	https://reviews.freebsd.org/D30409
MFC after:		2 weeks
2021-05-26 08:34:32 +03:00
Edward Tomasz Napierala
5e8caee259 linux: remove redundant SDT tracepoints
Remove all the 'entry' and 'return' probes; they clutter up the source
and are redundant to FBT.

Reviewed By:	dchagin
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30040
2021-05-05 13:59:00 +01:00
Edward Tomasz Napierala
ee384b229d linux(4): make linkat(2) handle AT_EMPTY_PATH
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29974
2021-05-04 13:09:46 +01:00
Edward Tomasz Napierala
5d1d844a77 kern_linkat: modify to accept AT_ flags instead of FOLLOW/NOFOLLOW
This makes this API match other kern_xxxat() functions.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29776
2021-04-25 14:13:12 +01:00
Edward Tomasz Napierala
156da725d3 linux(4): bump osrelease to 4.4.0.
This is required for the current Arch Linux binaries to work.

PR:		254112
Reviewed By:	emaste
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29218
2021-04-19 11:37:58 +01:00
Edward Tomasz Napierala
e47823b831 linux: support AT_EMPTY_PATH flag in fchownat(2)
This fixes rsyslog package installation scripts in Bionic.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29108
2021-04-16 16:27:20 +01:00
Edward Tomasz Napierala
4b45c2bb83 linux: make fstatat(2) handle AT_EMPTY_PATH
Without it, Qt5 apps from Focal fail to start, being unable to load
their plugins.  It's also necessary for glibc 2.33, as found in recent
Arch snapshots.

PR:		254112
Reviewed By:	kib
Sponsored by:	The FreeBSD Foundation, EPSRC
Differential Revision:	https://reviews.freebsd.org/D28192
2021-04-16 08:56:19 +01:00
Edward Tomasz Napierala
1663120ae4 linux: implement O_PATH
Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29773
2021-04-15 15:30:59 +01:00
Edward Tomasz Napierala
1b11173c00 linux: extend the LINUX_O_ constants to make room for O_PATH
No functional changes.

Sponsored By:	EPSRC
2021-04-15 15:04:44 +01:00
Edward Tomasz Napierala
ca6e1fa3ce linux: adjust ordering of Linux auxv and add dummy AT_HWCAP2
This should be a no-op; the purpose of this is to reduce
a spurious difference between Linuxulator and Linux, to make
debugging core dumps slightly easier.

Note that AT_HWCAP2 we pass to Linux binaries is always 0,
instead of being equal to 'cpu_feature2'.  This matches what
I've observed under Ubuntu Focal VM.

Reviewed By:	chuck, dchagin
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29609
2021-04-13 13:14:30 +01:00
John Baldwin
3b57ddb029 Rename linux_set_upcall_kse() to linux_set_upcall().
This matches the rename of cpu_set_upcall_kse() in
5c2cf81845.

Reviewed by:	kib, emaste
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D29295
2021-03-18 12:14:34 -07:00
Edward Tomasz Napierala
0dfbdd9fc2 linux(4): make getcwd(2) return ERANGE instead of ENOMEM
For native FreeBSD binaries, the return value from __getcwd(2)
doesn't really matter, as the libc wrapper takes over and returns
the proper errno.

PR:		kern/254120
Reported By:	Alex S <iwtcex@gmail.com>
Reviewed By:	kib
Sponsored By:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29217
2021-03-12 15:31:45 +00:00
Edward Tomasz Napierala
cd84c82c6a linux: add support for SO_PEERGROUPS
The su(8) and sudo(8) from Ubuntu Bionic use it.

Sponsored By:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28165
2021-03-06 19:48:58 +00:00
Mark Johnston
0fc8a79672 linux: Unmap the VDSO page when unloading
linux_shared_page_init() creates an object and grabs and maps a single
page to back the VDSO.  When destroying the VDSO object, we failed to
destroy the mapping and free KVA.  Fix this.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28696
2021-02-16 09:40:02 -05:00
Vladimir Kondratyev
b3c6fe663b epoll: Store epoll_event udata member in ext member of kevent.
The current epoll implementation stores the udata fields of the epoll_event
structure in a special dynamically-sized table rather than in the udata field
of the backing kevent structure, for two reasons:
1. Kevent's udata size is smaller than epoll's on 32-bit archs.
2. Kevent's udata can be clobbered when executing EPOLL_CTL_ADD, as kqueue
   modifies the existing event while epoll returns an error in this case.

Since r320043 introduced four new 64-bit user data members (ext[]),
we can store epoll udata in one of them and drop the aforementioned table.
According to the kqueue_register() source code, ext members are not updated
when an existing kevent is modified, which fixes point 2.

As a side effect the patch fixes PR/252582.

Reviewed by:	trasz
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D28169
2021-02-08 02:46:14 +03:00
Edward Tomasz Napierala
e44a78ce6f linux: add support for SO_PEERSEC getsockopt
It returns "unconfined", like Linux without SELinux would.

Sponsored By:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28164
2021-02-07 20:42:04 +00:00
Edward Tomasz Napierala
f6e8256a96 linux: fix handling of flags for 32 bit send(2) syscall
Previously the flags were passed as-is, which could result
in a spurious EAGAIN being returned for non-blocking sockets, which
broke some Steam games.

PR:		248065
Reported By:	Alex S <iwtcex@gmail.com>
Tested By:	Alex S <iwtcex@gmail.com>
Reviewed By:	emaste
MFC After:	3 days
Sponsored By:	The FreeBSD Foundation
2021-02-06 23:21:27 +00:00
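
Note (illustrative, not taken from the commit): the MSG_ flag bits differ between Linux and FreeBSD, so a flag word passed through unchanged can be read as a different flag. The sketch below translates a subset and shows one such collision. The numeric values are quoted from memory of the two <sys/socket.h> headers and are assumptions; the function name is hypothetical, and silently dropping unknown bits is a simplification.

```python
# (Linux value, FreeBSD value) per flag; values assumed from the two headers.
MSG_FLAGS = {
    "MSG_OOB": (0x1, 0x1),
    "MSG_PEEK": (0x2, 0x2),
    "MSG_DONTROUTE": (0x4, 0x4),
    "MSG_DONTWAIT": (0x40, 0x80),
    "MSG_EOR": (0x80, 0x8),
    "MSG_WAITALL": (0x100, 0x40),
    "MSG_NOSIGNAL": (0x4000, 0x20000),
}


def linux_to_bsd_msg_flags(lflags):
    """Translate known Linux MSG_ bits to their FreeBSD counterparts."""
    bflags = 0
    for linux_bit, bsd_bit in MSG_FLAGS.values():
        if lflags & linux_bit:
            bflags |= bsd_bit
    return bflags


def bsd_flag_names(bflags):
    """Name the flags a FreeBSD consumer would see in a flag word."""
    return sorted(n for n, (_, b) in MSG_FLAGS.items() if bflags & b)
```

With these values, the Linux MSG_EOR bit (0x80) read as-is looks like MSG_DONTWAIT on the FreeBSD side, while the translated word keeps its meaning.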
shu
14c40d2c29 linux: remove locks around callout_drain in timerfd_close()
The lock around callout_drain() is unnecessary and may cause
deadlock when one closes a timer descriptor during timer execution.

Reviewed By:	delphij
Submitted By:	ankohuu_outlook.com (Shunchao Hu)
Differential Revision: https://reviews.freebsd.org/D28148
2021-02-03 19:47:38 +00:00
shu
ae71b794cb linux: make timerfd_settime(2) set expirations count to zero
On Linux, read(2) from a timerfd file descriptor returns an unsigned
8-byte integer (uint64_t) containing the number of expirations
that have occurred, if the timer has already expired one or more
times since its settings were last modified using timerfd_settime(),
or since the last successful read(2).  That's to say, once we do
a read or call timerfd_settime(), timer fd's expiration count should
be zero.  Some Linux applications create a timerfd and add it to epoll
in LT (level-triggered) mode; when an event arrives, they call
timerfd_settime() instead of read() to stop the event source from
triggering.  On FreeBSD, timerfd_settime(2) didn't set the count to zero,
which caused high CPU utilization.

Submitted by:	ankohuu_outlook.com (Shunchao Hu)
Differential Revision: https://reviews.freebsd.org/D28231
2021-02-03 19:08:40 +00:00
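
Note (illustrative, not taken from the commit): the behaviour described above can be modeled with a counter that both read and settime clear. The class and its `reset_on_settime` switch are hypothetical; the switch exists only to contrast the old and new behaviour.

```python
class TimerFdModel:
    """Toy model of a timerfd expiration counter (not kernel code)."""

    def __init__(self, reset_on_settime=True):
        self.expirations = 0
        self.reset_on_settime = reset_on_settime

    def expire(self, times=1):
        """Simulate the timer firing."""
        self.expirations += times

    def readable(self):
        """A level-triggered poller reports the fd while this is true."""
        return self.expirations > 0

    def read(self):
        """Return the pending count and clear it, as read(2) does."""
        if self.expirations == 0:
            raise BlockingIOError("no expirations pending")
        count, self.expirations = self.expirations, 0
        return count

    def settime(self):
        """Re-arm the timer; the fixed behaviour also clears the count."""
        if self.reset_on_settime:
            self.expirations = 0
```

With `reset_on_settime=False` the descriptor stays readable after settime(), so a level-triggered loop that never calls read() would spin.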
Brooks Davis
7a1591c1b6 Rename kern_mmap_req to kern_mmap
Replace all uses of kern_mmap with kern_mmap_req, remove the old kern_mmap,
and rename kern_mmap_req to kern_mmap.

The helper saved some code churn initially, but having multiple
interfaces is sub-optimal.

Obtained from:	CheriBSD
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28292
2021-01-25 21:50:37 +00:00
Edward Tomasz Napierala
7d3310c4fc linux: remove spurious newline.
Sponsored by:	The FreeBSD Foundation
2021-01-19 09:56:45 +00:00
Edward Tomasz Napierala
feb96ee9c8 linux: mute "unsupported socket(AF_NETLINK, 3, NETLINK_AUDIT)" warnings
They are way too noisy with Focal.

Sponsored by:	The FreeBSD Foundation
2021-01-14 09:16:28 +00:00
Edward Tomasz Napierala
ec2700e015 linux: mute the "unsupported prctl option 23" warnings
Make the PR_CAPBSET_READ prctl(2) return EINVAL without logging
any warnings; this is way too noisy with Focal.

Sponsored by:	The FreeBSD Foundation
2021-01-13 10:31:56 +00:00
Edward Tomasz Napierala
a339b4223a linux: bump the default version from 3.10.0 to 3.17.0
This is required for Qt5, as found in Ubuntu Focal.  The library contains
the minimum kernel version encoded in an ELF note; this makes rtld ignore
it altogether, with a confusing error message.  Without it, things fail
like this:

$ konsole: error while loading shared libraries: libQt5Core.so.5: cannot
open shared object file: No such file or directory

For reference, the Qt kernel version requirements can be found at:
https://github.com/qt/qtbase/blob/dev/src/corelib/global/minimum-linux_p.h

Sponsored by:	The FreeBSD Foundation
Reviewed By:	emaste
Differential Revision:	https://reviews.freebsd.org/D28105
2021-01-13 10:02:16 +00:00
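
Note (illustrative, not taken from the commit): the check described above amounts to comparing the kernel version the emulation layer reports against the minimum a library declares. The sketch below models that comparison; the function names are hypothetical, and the real check is done by the Linux runtime linker against the ELF note.

```python
def parse_version(text):
    """Turn '3.17.0' into (3, 17, 0) so components compare numerically."""
    return tuple(int(part) for part in text.split("."))


def library_usable(reported_kernel, required_minimum):
    """True if the reported kernel version satisfies the declared minimum."""
    return parse_version(reported_kernel) >= parse_version(required_minimum)
```

Comparing tuples rather than strings matters: as strings, "3.9.0" would sort after "3.10.0".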
Mateusz Guzik
6b3a9a0f3d Convert remaining cap_rights_init users to cap_rights_init_one
semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)
2021-01-12 13:16:10 +00:00
Konstantin Belousov
7a202823aa Expose eventfd in the native API/ABI using a new __specialfd syscall
eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues.  Unfortunately, kqueues
are not passable between processes.  And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow.  This came up when debugging strange HW video decode
failures in Firefox.  A native implementation would avoid these problems
and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.

Submitted by:   greg@unrelenting.technology
Reviewed by:    markj (previous version)
MFC after:      2 weeks
Differential Revision:  https://reviews.freebsd.org/D26668
2020-12-27 12:57:26 +02:00
Konstantin Belousov
7cb901bf22 Remove useless ARGUSED annotations.
Submitted by:	greg@unrelenting.technology
2020-12-27 12:57:26 +02:00
Konstantin Belousov
11c9f2ff1a Add SPDX tag.
Submitted by:	greg@unrelenting.technology
2020-12-27 12:57:26 +02:00
Tijl Coosemans
77fb6b6644 Move V4L feature declarations and DTrace provider definitions from
linux_common.c to linux_util.c so they become available on i386.

linux_common.c defines the linux_common kernel module but this module does
not exist on i386 and linux_common.c is not included in the linux module.
linux_util.c is included in the linux_common module on amd64 and the linux
module on i386.

Remove linux_common.c from files.i386 again.  It was added recently in
r367433 when the DTrace provider definitions were moved.

The V4L feature declarations were moved to linux_common in r283423.
2020-12-06 10:58:55 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Konstantin Belousov
4815f175d0 Linuxolator: Replace use of eventhandlers by sysent hooks.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27309
2020-11-23 18:18:16 +00:00
Conrad Meyer
f8f74aaa84 linux(4) clone(2): Correctly handle CLONE_FS and CLONE_FILES
The two flags are distinct and it is impossible to correctly handle clone(2)
without the assistance of fork1().  This change depends on the pwddesc split
introduced in r367777.

I've added a fork_req flag, FR2_SHARE_PATHS, which indicates that p_pd
should be treated the opposite way p_fd is (based on RFFDG flag).  This is a
little ugly, but the benefit is that existing RFFDG API is preserved.
Holding FR2_SHARE_PATHS disabled, RFFDG indicates both p_fd and p_pd are
copied, while !RFFDG indicates both should be cloned.

In Chrome, clone(2) is used with CLONE_FS, without CLONE_FILES, and expects
independent fd tables.

The previous conflation of CLONE_FS and CLONE_FILES was introduced in
r163371 (2006).

Discussed with:	markj, trasz (earlier version)
Differential Revision:	https://reviews.freebsd.org/D27016
2020-11-17 21:20:11 +00:00
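
Note (illustrative, not taken from the commit): the flag logic described above can be checked with a small truth-table model. It reads "cloned" in the message as "shared". The CLONE_ values are the Linux ones and RFFDG is quoted from memory of FreeBSD's <sys/unistd.h>; the FR2_SHARE_PATHS value and both function names are hypothetical.

```python
LINUX_CLONE_FS = 0x200
LINUX_CLONE_FILES = 0x400
RFFDG = 1 << 2          # copy the fd table instead of sharing it
FR2_SHARE_PATHS = 0x2   # illustrative value, not checked against the headers


def clone_to_fork_flags(clone_flags):
    """Map Linux clone flags to (rfork flags, fork_req flags2)."""
    share_files = bool(clone_flags & LINUX_CLONE_FILES)
    share_fs = bool(clone_flags & LINUX_CLONE_FS)
    rf = 0 if share_files else RFFDG
    # Request the "opposite" treatment only when the two flags disagree.
    fr2 = FR2_SHARE_PATHS if share_files != share_fs else 0
    return rf, fr2


def fork_semantics(rf, fr2):
    """Return (fd table shared?, path state shared?) as the message describes."""
    fd_shared = not (rf & RFFDG)
    pd_shared = fd_shared != bool(fr2 & FR2_SHARE_PATHS)
    return fd_shared, pd_shared
```

For the Chrome case (CLONE_FS without CLONE_FILES) this yields an independent fd table and shared path state.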
Conrad Meyer
ede4af47ae unix(4): Enhance LOCAL_CREDS_PERSISTENT ABI
As this ABI is still fresh (r367287), let's correct some mistakes now:

- Version the structure to allow for future changes
- Include sender's pid in control message structure
- Use a distinct control message type from the cmsgcred / sockcred mess

Discussed with:	kib, markj, trasz
Differential Revision:	https://reviews.freebsd.org/D27084
2020-11-17 20:01:21 +00:00
Conrad Meyer
de774e422e linux(4): Implement name_to_handle_at(), open_by_handle_at()
They are similar to our getfhat(2) and fhopen(2) syscalls.

Differential Revision:	https://reviews.freebsd.org/D27111
2020-11-17 19:51:47 +00:00
Edward Tomasz Napierala
e3b1c847a4 Make it possible to mount a fuse filesystem, such as squashfuse,
from a Linux binary.  Should come in handy for AppImages.

Reviewed by:	asomers
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26959
2020-11-09 08:53:15 +00:00
Alexander Leidinger
8ec6c4a38b - add more linux socket options (sorted by value)
 - map those IPv4 / IPv6 socket options which exist in FreeBSD
   + most of them visually verified to have the same type/layout of arguments
   + not tested with linux programs to behave as intended
 - be more human readable for known options which are not handled
 - be more verbose for unhandled socket message flags we know about
 - print the jail ID in linux_msg if run in a jail
 - add possibility to print debug message about known missing parts only once
 - add multiple levels of sysctl linux.debug:
   1: print debug messages, tell about unimplemented stuff (only once)
   2: like 1, but also print messages about implemented but not tested
      stuff (only once)
   3+: like 2, but no rate limiting of messages
 - increase default linux debug level from 1 to 3

We are a lot more verbose than we need to be (e.g. for some of the IP socket
options which are the same, share the same memory layout, and are
believed to work). The reason is that we have no good test suite to test those
linux bits. The LTP and other test suites, like the python one, are not fully
up to the task. Hence the excessive messages about emulated but not
tested socket options.

IMO any MFC (possible, but most probably not by me) should set the default
debug level to 1.

Discussed with:	trasz
2020-11-08 09:50:58 +00:00
Conrad Meyer
76b2bfeda4 linux(4): Fix loadable modules after r367395
Move dtrace SDT definitions into linux_common module code.  Also, build
linux_dummy.c into the linux_common kld -- we don't need separate
versions of these stubs for 32- and 64-bit emulation.

Reported by:	several
PR:		250897
Discussed with:	emaste, trasz
Tested by:	John Kennedy, Yasuhiro KIMURA, Oleg Sidorkin
X-MFC-With:	r367395
Differential Revision:	https://reviews.freebsd.org/D27124
2020-11-06 22:04:57 +00:00
Conrad Meyer
e9b13c6612 linux(4): Deduplicate unimpl/dummy syscall handlers
No functional change.

Reviewed by:	emaste, trasz
Differential Revision:	https://reviews.freebsd.org/D27099
2020-11-05 19:30:31 +00:00
Edward Tomasz Napierala
cdf6e4e922 Unbreak buildworld after r367339.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-11-04 21:39:04 +00:00
Edward Tomasz Napierala
2f927d87f9 Add linux_to_bsd_errtbl[], mapping Linux errnos to their BSD counterparts.
This will be used by fuse(4).

Reviewed by:	asomers
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26974
2020-11-04 19:54:18 +00:00
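
Note (illustrative, not taken from the commit): a reverse table can be pictured as the inverse of the BSD-to-Linux mapping. The excerpt below uses a handful of errno values quoted from memory of the two errno.h headers; treat them as assumptions. Inverting a dict only works because this excerpt is one-to-one, which the full table is not guaranteed to be, and the lookup helper with its fallback is hypothetical.

```python
# Excerpt: BSD errno -> Linux errno (values assumed from the two errno.h files).
BSD_TO_LINUX = {
    35: 11,  # EAGAIN
    11: 35,  # EDEADLK
    78: 38,  # ENOSYS
    62: 40,  # ELOOP
    63: 36,  # ENAMETOOLONG
}

# Reverse direction, valid only while the excerpt stays one-to-one.
LINUX_TO_BSD = {linux: bsd for bsd, linux in BSD_TO_LINUX.items()}


def linux_to_bsd_errno(linux_errno, fallback=None):
    """Look up the BSD errno for a Linux one; the fallback is a sketch choice."""
    return LINUX_TO_BSD.get(linux_errno, fallback)
```

EAGAIN and EDEADLK swap numbers between the two systems, so an untranslated value is silently misreported rather than rejected.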
Conrad Meyer
9e47480e94 linux(4): Improve netlink diagnostics
Add some missing netlink_family definitions and produce vaguely
human-readable error messages for those definitions, like we used to do for
just ROUTE and KOBJECT_UEVENTS.

Additionally, if we know it's a netfilter socket but didn't find it in the
table, fall back to printing that instead of the generic handler ("socket
domain 16, ...").

No change to the emulator correctness, just mildly improved diagnostics for
gaps.
2020-11-03 19:50:42 +00:00
Edward Tomasz Napierala
7abf30d339 Make linux_errtbl[] static.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27004
2020-11-03 19:12:33 +00:00
Edward Tomasz Napierala
939e5de8d4 Fix rookie mistake - it's nitems(), not sizeof().
Reported by:	xtouqh_icloud.com
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-11-03 14:44:33 +00:00
Conrad Meyer
eaa5afcefa linux(4) prctl(2): Implement PR_[GS]ET_DUMPABLE
Proxy the flag to the roughly analogous FreeBSD procctl 'TRACE'.

TRACE-disabled processes are not coredumped, and Linux !DUMPABLE processes
can not be ptraced.  There are some additional semantics around ownership of
files in the /proc/[pid] pseudo-filesystem, which we do not attempt to
emulate correctly at this time.

Reviewed by:	markj (earlier version)
Differential Revision:	https://reviews.freebsd.org/D27015
2020-11-03 02:10:54 +00:00
Conrad Meyer
443d8a07df linux(4): Emulate Linux SOL_SOCKET:SO_PASSCRED
This is required by some major linux applications, such as Chrome and
Firefox.  (As well as Electron-using applications, which are essentially
a bundled version of Chrome.)

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27012
2020-11-03 01:19:13 +00:00
Conrad Meyer
a98f03786e linux(4): style: Eliminate dead 'break' after 'return'
No functional change.
2020-11-03 01:10:27 +00:00
Conrad Meyer
7731194090 linux(4): Quiesce unrecognized ioctl warning for F2FS query
On Linux, sqlite probes for underlying F2FS filesystems that support
certain kinds of atomic update with this ioctl.  The expected result on
non-F2FS filesystem (i.e., all FreeBSD filesystems) is any error value.

Minimally implement the ioctl and avoid the warning message.

(This shows up in Linux Chrome, which embeds sqlite.)

Reviewed by:	emaste, trasz
Differential Revision:	https://reviews.freebsd.org/D27050
2020-11-02 18:45:43 +00:00
Conrad Meyer
53efdb55a8 linux(4): Deduplicate ioctl range construction with a helper macro
No functional change.

Reviewed by:	emaste, trasz
Differential Revision:	https://reviews.freebsd.org/D27049
2020-11-02 18:45:15 +00:00
Conrad Meyer
63ed2e3642 linux(4): Disambiguate identical ioctl errors in distinct paths
And stop truncating the full ioctl number in the error message.

Reviewed by:	emaste
Differential Revision:	https://reviews.freebsd.org/D27048
2020-11-02 06:16:11 +00:00
Conrad Meyer
76dfd556f1 linux(4): Add missing clone(2) flags
2020-10-31 01:12:35 +00:00
Conrad Meyer
ae9cafd919 linux(4): Quiesce warning about madvise(..., -1)
This API misuse is intended to produce an error value to detect certain
bogus stub implementations of MADV_WIPEONFORK.  We don't need to log a
warning about it.

Example:
https://boringssl.googlesource.com/boringssl/+/ad5582985cc6b89d0e7caf0d9cc7e301de61cf66%5E%21/

Reviewed by:	emaste, trasz
Differential Revision:	https://reviews.freebsd.org/D27017
2020-10-30 19:02:59 +00:00
Edward Tomasz Napierala
b60b81e643 Fix typo.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-10-29 14:42:51 +00:00
Edward Tomasz Napierala
1a8577fa68 Add defines for Linux errno values and use them to make linux_errtbl[]
more readable.  While here, add linux_check_errtbl() function to make
sure we don't leave holes.

No objections:	emaste (earlier version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26972
2020-10-29 14:23:52 +00:00
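
Note (illustrative, not taken from the commit): a "no holes" check over a table with named entries can be sketched as below. The table is a toy covering only the first five errnos, which have the same numbers on both systems; the LINUX_ names mirror the commit's idea of named constants, while the checker's name, signature, and return value are hypothetical.

```python
LINUX_EPERM = 1
LINUX_ENOENT = 2
LINUX_ESRCH = 3
LINUX_EINTR = 4
LINUX_EIO = 5

# Toy table indexed by BSD errno; the real table is much longer.
ERRTBL = {
    1: LINUX_EPERM,
    2: LINUX_ENOENT,
    3: LINUX_ESRCH,
    4: LINUX_EINTR,
    5: LINUX_EIO,
}


def find_holes(table, last_errno):
    """Return every errno in 1..last_errno that has no table entry."""
    return [e for e in range(1, last_errno + 1) if e not in table]
```

Running such a check once at startup turns a forgotten entry into an immediate, visible failure instead of a wrong errno at run time.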
Edward Tomasz Napierala
1701c69b6e Make linux_errtbl a bit more readable by using named initializers.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26970
2020-10-28 14:16:08 +00:00
Edward Tomasz Napierala
866b1f5147 Fix misnomer - linux_to_bsd_errno() does the exact opposite.
Reported by:	arichardson
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26965
2020-10-27 12:49:40 +00:00
Mateusz Guzik
fe76bef462 linux: silence renameat2 flags warning
Hogs the console while building the Linux kernel in an Ubuntu Focal jail.
2020-10-26 18:03:50 +00:00
Mateusz Guzik
1024de70f9 linux: add missing conversions for compat.linux.use_emul_path handling
2020-10-26 18:02:52 +00:00
Edward Tomasz Napierala
b3be0b4d0c Tweak linux(4) socket(2) debug messages.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26900
2020-10-24 14:25:38 +00:00
Edward Tomasz Napierala
62b1382ff3 Further improve prctl(2) debug.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26916
2020-10-24 14:23:44 +00:00
Edward Tomasz Napierala
1c7481377c Improve prctl(2) debug.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26899
2020-10-23 12:00:30 +00:00
Edward Tomasz Napierala
f4d91df5a0 Make linux(4) warn about unsupported socket(2) types.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25680
2020-10-21 18:45:48 +00:00
Edward Tomasz Napierala
1a34e9fad6 Fix potential race condition in linux stat(2).
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25618
2020-10-20 17:19:10 +00:00
Edward Tomasz Napierala
54669eb779 Add compat.linux.dummy_rlimits, and disable by default.
Turns out the dummy rlimits fix prlimit(1), but break su(8)
(login-1:4.5-1ubuntu2) - although not sudo(8), for some reason.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26814
2020-10-18 15:58:16 +00:00
Edward Tomasz Napierala
1c34dcb532 Set default stack size for Linux apps to 8MB. This matches Linux'
defaults, makes core files smaller, and fixes applications which use
pthread_join(3) in a wrong way, namely Steam.

This is based on a patch submitted by Jason Yang, which I've reworked
to set the limit instead of only changing the value reported (which
is enough to fix the bug for Linux pthreads, but could be confusing).

PR:		248225
Submitted by:	Jason_YH_Yang at wistron.com (earlier version)
Analyzed by:	Alex S <iwtcex@gmail.com>
Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26778
2020-10-16 11:23:30 +00:00
Edward Tomasz Napierala
139c09788b Make linux getrlimit(2) and prlimit(2) return something reasonable
for linux-specific limits.  Fixes prlimit (util-linux-2.31.1-0.4ubuntu3.7).

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26777
2020-10-16 10:10:09 +00:00
Konstantin Belousov
aaf78c16f5 Do not leak oldvmspace if image activation failed
and current address space is already destroyed, so kern_execve()
terminates the process.

While there, clean up some internals of post_execve() inlined in init_main.

Reported by:	Peter <pmc@citylink.dinoex.sub.org>
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26525
2020-09-23 18:03:07 +00:00
Edward Tomasz Napierala
106a784b35 Reduce code duplication by introducing linux_copyout_sockaddr()
helper function.  No functional changes.

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25804
2020-09-17 12:14:24 +00:00
Edward Tomasz Napierala
79e3da0602 Add support for SOUND_MIXER_WRITE_MONITOR ioctl. Fixes alsamixer(1)
on my x220.

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25806
2020-09-17 11:44:45 +00:00
Edward Tomasz Napierala
70890254b3 Get rid of sv_errtbl and SV_ABI_ERRNO().
Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26388
2020-09-17 11:39:33 +00:00
Mark Johnston
46888dedc9 Add emulation support for the Linux kcov(4) ioctl API.
This makes it possible to run an unmodified Linux syzkaller executor
against the Linuxulator, and have it gather code coverage information.

Sponsored by:	The FreeBSD Foundation
2020-09-04 00:12:28 +00:00
Mateusz Guzik
1a18003240 compat: clean up empty lines in .c and .h files
2020-09-01 21:24:33 +00:00
Mateusz Guzik
feabaaf995 cache: drop the always curthread argument from reverse lookup routines
Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by:	pho
2020-08-24 08:57:02 +00:00
Mateusz Guzik
7ad2a82da2 vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error
Most consumers pass NULL.
2020-08-19 02:51:17 +00:00
Mateusz Guzik
a125ed50a6 linux: add sysctl compat.linux.use_emul_path
This is a step towards facilitating jails with only Linux binaries.
Supporting emul_path adds path lookups which are completely spurious
if the binary at hand runs in a Linux-based root directory.

It defaults to on (== current behavior).

make -C /root/linux-5.3-rc8 -s -j 1 bzImage:

use_emul_path=1: 101.65s user 68.68s system 100% cpu 2:49.62 total
use_emul_path=0: 101.41s user 64.32s system 100% cpu 2:45.02 total
2020-08-18 22:04:22 +00:00
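
Note (illustrative, not taken from the commit): the extra lookups mentioned above come from trying each path under the emulation root first and falling back to the plain path. The model below counts lookups against a set of existing paths. "/compat/linux" is the customary emulation root; the function and its return shape are hypothetical, and the real lookup goes through the kernel's name resolution rather than a set.

```python
EMUL_PATH = "/compat/linux"


def resolve(path, existing, use_emul_path=True):
    """Return (resolved path or None, number of lookups performed)."""
    lookups = 0
    if use_emul_path:
        lookups += 1
        candidate = EMUL_PATH + path
        if candidate in existing:
            return candidate, lookups
    lookups += 1
    return (path if path in existing else None), lookups
```

In a jail whose root is already a Linux tree, the prefixed candidate does not exist in this model, so with the sysctl on every resolution costs two lookups instead of one.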
Mark Johnston
a7044c60a5 Fix handling of ancillary data on non-AF_UNIX Linux sockets.
After r340674, the "continue" would restart the loop without having
updated clen, resulting in an infinite loop.  Restore the old behaviour
of simply ignoring all control messages on such sockets, since we
currently only implement handling for AF_UNIX-specific messages.

Reported by:	syzkaller
Reviewed by:	tijl
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26093
2020-08-18 14:17:14 +00:00
Mark Johnston
d9565182fd Remove "emulation" of clone(CLONE_PARENT | CLONE_THREAD).
On Linux this is supposed to result in EINVAL.

Reported by:	syzkaller
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-08-17 21:30:49 +00:00
Mark Johnston
74a796e0fc Fix a lock leak when emulating futex(FUTEX_WAIT_BITSET).
Reported by:	syzkaller
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-08-17 21:30:15 +00:00
Mark Johnston
30dcce2709 Skip Linux madvise(MADV_DONTNEED) on unmanaged objects.
vm_object_madvise() is a no-op for unmanaged objects, but we should also
limit the scope of mappings on which pmap_remove() is called.  In
particular, with the WIP largepage shm objects patch the kernel must
remove mappings of such objects along superpage boundaries, and without
this check Linux madvise(MADV_DONTNEED) could violate that requirement.

Reviewed by:	alc, kib
MFC with:	r362631
Sponsored by:	Juniper Networks, Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D26084
2020-08-17 17:14:56 +00:00
Edward Tomasz Napierala
aa75412146 Make linux(4) support the BLKPBSZGET ioctl. Oracle uses it.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25694
2020-07-19 12:25:03 +00:00
Edward Tomasz Napierala
d5c5b4b382 Make linux fallocate(2) return EOPNOTSUPP, not ENOSYS, on unsupported mode,
as documented in the man page.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-07-18 12:21:08 +00:00
Edward Tomasz Napierala
eb6ae7576d Bump the default linux version from 3.2.0 to 3.10.0, which corresponds
to RHEL 7.  Required for DB2.

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25656
2020-07-18 11:37:30 +00:00
Edward Tomasz Napierala
8d1d017175 Add a trivial linux(4) splice(2) implementation, which simply
returns EINVAL.  Fixes grep (grep-3.1-2build1).

PR:		kern/218699
Reported by:	avos
Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25636
2020-07-18 11:28:40 +00:00
Alexander Leidinger
e37db348c1 Fix r363125 (Implement CLOCK_MONOTONIC_RAW (linux >= 2.6.28)),
by really using the MONOTONIC version and not the REALTIME version.

Noticed by:	myfreeweb at github
2020-07-12 14:57:29 +00:00
Alexander Leidinger
8c2602f30f Implement CLOCK_MONOTONIC_RAW (linux >= 2.6.28).
It is documented as a raw hardware-based clock not subject to NTP or
incremental adjustments. With this "not as precise as CLOCK_MONOTONIC"
description in mind, map it to our CLOCK_MONOTONIC_FAST (the same
mapping as for the linux CLOCK_MONOTONIC_COARSE).

This is needed for the webcomponent of steam (chromium) and some
other steam component or game.

The linux-steam-utils port contains a LD_PRELOAD based fix for this.
There this is mapped to CLOCK_MONOTONIC.
As an untrained ear/eye (= the majority of people) does not normally
notice a difference of jitter in the 10-20 ms range, especially
if you don't pay attention like for example in a browser session
while watching a video stream, the mapping to CLOCK_MONOTONIC_FAST
seems more appropriate than to CLOCK_MONOTONIC.
2020-07-12 09:51:09 +00:00
Edward Tomasz Napierala
17f701a3fb Make linux stat(2) return the same st_dev for every devfs instance.
The reason for this is to work around an idiosyncrasy of glibc
getttynam(3) implementation: it checks whether st_dev returned for
fd 0 is the same as st_dev returned for the target of /proc/self/fd/0
symlink, and with linux chroots having their own devfs instance,
the check will fail if you chrooted into it.

PR:		kern/240767
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25559
2020-07-11 13:08:16 +00:00
Edward Tomasz Napierala
09c4e43d18 Don't emit warnings on MADV_HUGEPAGE; Firefox uses it a lot.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-07-10 21:41:09 +00:00
Kyle Evans
423a033ba7 memfd_create: turn on SHM_GROW_ON_WRITE
memfd_create fds will no longer require an ftruncate(2) to set the size;
they'll grow (to the extent that it's possible) upon write(2)-like syscalls.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D25502
2020-07-10 00:45:16 +00:00
Edward Tomasz Napierala
4d2b7be54a Fix Linux recvmsg(2) when msg_namelen returned is 0. Previously
it would fail with EINVAL, breaking some of the Python regression
tests.

While here, cap the user-controlled message length.

Note that the code doesn't seem to be copying out the new length
in either (success or failure) case. This will be addressed separately.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25392
2020-07-05 10:57:28 +00:00
Mateusz Guzik
d6d9ddd41f linux: fix ioctl performance for termios
TCGETS et al are frequently issued by Linux binaries while the previous code
avoidably ping-pongs a global sx lock and serializes on Giant.

Note that even with the fix the common case will serialize on a per-tty lock.
2020-07-04 06:25:41 +00:00
Edward Tomasz Napierala
6d76adbb6d Rework linux accept(2). This makes the code flow easier to follow,
and fixes a bug where calling accept(2) could result in closing fd 0.

Note that the code still contains a number of problems: it makes
assumptions about l_sockaddr_in being the same as sockaddr_in,
the EFAULT-related code looks like it doesn't work at all, and the
socket type check is racy.  Those will be addressed later on;
I'm trying to work in small steps to avoid breaking one thing while
fixing another.

It fixes Redis, among other things.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25461
2020-07-01 10:37:08 +00:00
Edward Tomasz Napierala
9bc42c18cb Make linux(4) ignore SA_INTERRUPT. The zsh(1) binary from Bionic uses it.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25499
2020-06-30 16:18:09 +00:00
Kyle Evans
97ce5033a8 linux: reposition the comment for bsd_to_linux_bits/linux_to_bsd_bits
rpokala notes that splitting the definitions like this is kind of silly,
since the comment applies to both.  Move the comment up (or the definition
down, depending on your perspective on life) accordingly.

Reported by:	rpokala
2020-06-29 17:47:00 +00:00
Kyle Evans
5403f186a7 linuxolator: implement memfd_create syscall
This effectively mirrors our libc implementation, but with minor fudging --
name needs to be copied in from userspace, so we just copy it straight into
stack-allocated memfd_name into the correct position rather than allocating
memory that needs to be cleaned up.

The sealing-related fcntl(2) commands, F_GET_SEALS and F_ADD_SEALS, have
also been implemented now that we support them.

Note that this implementation is still not quite at feature parity w.r.t.
the actual Linux version; some caveats, from my foggy memory:

- Need to implement SHM_GROW_ON_WRITE, default for memfd (in progress)
- LTP wants the memfd name exposed to fdescfs
- Linux allows open() of an fdescfs fd with O_TRUNC to truncate after dup.
  (?)

Interested parties can install and run LTP from ports (devel/linux-ltp) to
confirm any fixes.

PR:		240874
Reviewed by:	kib, trasz
Differential Revision:	https://reviews.freebsd.org/D21845
2020-06-29 03:09:14 +00:00
Edward Tomasz Napierala
4fe5361cbe Make linux(4) support SO_PROTOCOL. Running Python test suite
with python3.8 from Focal triggers those.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25491
2020-06-28 18:56:32 +00:00
Edward Tomasz Napierala
d5629eb216 Make linux(4) warn about unsupported SA_ flags.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25453
2020-06-27 15:50:35 +00:00
Mark Johnston
f4134e3d87 Implement an approximation of Linux MADV_DONTNEED semantics.
Linux MADV_DONTNEED is not advisory: it has side effects for anonymous
memory, and some system software depends on that.  In particular,
MADV_DONTNEED causes anonymous pages to be discarded.  If the mapping is
a private mapping of a named object then subsequent faults are to
repopulate the range from that object, otherwise pages will be
zero-filled.  For mappings of non-anonymous objects, Linux MADV_DONTNEED
can be implemented in the same way as our MADV_DONTNEED.

This implementation differs from Linux semantics in its handling of
private mappings, inherited through fork(), of non-anonymous objects.
After applying MADV_DONTNEED, subsequent faults will repopulate the
mapping from the parent object rather than the root of the shadow chain.

PR:		230160
Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D25330
2020-06-25 20:30:30 +00:00
Edward Tomasz Napierala
52c81be11a Add linux_madvise(2) instead of having Linux apps call the native
FreeBSD madvise(2) directly.  While some of the flag values match,
most don't.

PR:		kern/230160
Reported by:	markj
Reviewed by:	markj
Discussed with:	brooks, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25272
2020-06-20 18:29:22 +00:00
Edward Tomasz Napierala
4afe4fae1b Add warnings for unsupported Linux clockids.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25322
2020-06-19 19:33:06 +00:00
Konstantin Belousov
8a15ac8378 Fix execution of linux binary from multithreaded non-Linux process.
If a multithreaded non-Linux process execs a Linux binary, the non-Linux
threads other than the one that is execing are cleared by
single-threading at the boundary, and then terminated in
post_execve(). Since at that time the process has already been switched to
the Linux ABI, the linuxolator is involved in clearing the threads at the
boundary, but cannot find the emul data.

Handle it by pre-creating emuldata for all threads in the execing process.

Also remove the code in the linux_proc_exec() handler that cleared emul data
for other threads when execing from multithreaded Linux process. It is
excessive.

PR:	247020
Reported by:	Martin FIlla <freebsd@sysctl.cz>
Reported by:	Henrique L. Amorim, Independent Security Researcher
Reported by:	Rodrigo Rubira Branco (BSDaemon), Amazon Web Services
Reviewed by:	markj
Tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25293
2020-06-18 20:49:56 +00:00
Edward Tomasz Napierala
3d8dd98381 Make Linux uname(2) return x86_64 to 32-bit apps. This helps Steam.
PR:		kern/240432
Analyzed by:	Alex S <iwtcex@gmail.com>
Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25248
2020-06-15 20:12:10 +00:00
Edward Tomasz Napierala
889cd28520 Make linux(4) warn about unsupported CMSG level/type.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25255
2020-06-14 14:38:40 +00:00
Edward Tomasz Napierala
462171d9aa Add compat.linux.debug sysctl, to make it possible to silence
the debug messages. While here, clean up some variable naming.

Reviewed by:	bcr (manpages), emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25230
2020-06-12 14:37:50 +00:00
Edward Tomasz Napierala
599dadca55 Fix naming clash.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-06-12 14:31:19 +00:00
Edward Tomasz Napierala
34ff0c0e6a Make linux(4) warn about unsupported fcntls.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25231
2020-06-12 14:25:32 +00:00
Edward Tomasz Napierala
4beacc3b1d Minor code cleanup; no functional changes.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25232
2020-06-12 14:23:10 +00:00
Edward Tomasz Napierala
86e794eb65 Don't use newlines with linux_msg(). No functional changes.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-06-11 14:57:30 +00:00
Edward Tomasz Napierala
bc8e281082 Replace LINUX_FASYNC with LINUX_O_ASYNC; no functional changes.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25218
2020-06-11 14:09:43 +00:00
Edward Tomasz Napierala
433d61a573 Improve the warnings.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-06-11 12:35:00 +00:00
Edward Tomasz Napierala
3bc69ad9b3 Make linux(4) handle SO_REUSEPORT.
Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25216
2020-06-11 12:25:49 +00:00
Mark Johnston
479f70ef24 Fix a couple of nits in Linux sysinfo(2) emulation.
- Use the same definition of free memory as Linux.
- Rename the totalbig and freebig fields to match the corresponding
  names on Linux.

Discussed with:	alc
MFC after:	1 week
2020-06-10 23:52:50 +00:00
Mark Johnston
27e4374dd4 Add a comment reflecting the commit log for r361945.
Suggested by:	alc
Reviewed by:	alc
MFC with:	r361945
2020-06-10 23:52:39 +00:00
Edward Tomasz Napierala
8c5059e9ea Make linux(4) set the openfiles soft resource limit to 1024 for Linux
applications, which often depend on this being the case.  There's a new
sysctl, compat.linux.default_openfiles, to control this behaviour.

Reviewed by:	kevans, emaste, bcr (manpages)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25177
2020-06-10 18:50:46 +00:00
Edward Tomasz Napierala
c31a6a6612 Support SO_SNDBUFFORCE/SO_RCVBUFFORCE by aliasing them to the
standard SO_SNDBUF/SO_RCVBUF.  Mostly cosmetics, to get rid
of the warning during 'apt upgrade'.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25173
2020-06-10 18:43:43 +00:00
Mark Johnston
3e5fae34fc Stop computing a "sharedram" value when emulating Linux sysinfo(2).
The previous code was computing an incorrect value in a very expensive
manner.  "sharedram" is supposed to be the amount of memory used by
named swap objects, which on FreeBSD basically corresponds to memory
usage by shared memory objects (including, for example, GEM objects) and
tmpfs.  We currently have no cheap way to count such pages.  The
previous code tried to determine the number of copy-on-write pages
shared between processes.

Just replace the computed value with 0.  illumos reportedly does the
same thing.  Linux itself did not populate this field until a 2014
commit, "mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces".

Reported by:	mjg
MFC after:	1 week
2020-06-08 22:29:52 +00:00
Brooks Davis
562894f0dc Centralize compatibility translation macros.
Copy the CP, PTRIN, etc macros from freebsd32.h into a sys/abi_compat.h
and replace existing definitions with includes where required. This
eliminates duplicate code and allows Linux and FreeBSD compatibility
headers to be included in the same files.

Input from:	cem, jhb
Obtained from:	CheriBSD
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D24275
2020-04-14 20:30:48 +00:00
Mark Johnston
4596ac234e compat/linux/linux.h depends on queue.h since r353725.
Sponsored by:	The FreeBSD Foundation
2020-03-26 17:12:55 +00:00
Hans Petter Selasky
d845d3dc9a Add support for the device statistics IOCTL, needed by the coming
linux_libusb upgrade.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2020-03-10 15:56:49 +00:00
Tijl Coosemans
b4147bf6b4 Move compat.linux.map_sched_prio sysctl definition to linux_mib.c so it is
only defined by linux_common kernel module and not both linux and linux64
modules.

Reported by:	Yuri Pankov <ypankov@fastmail.com>
2020-03-05 14:41:27 +00:00
Brooks Davis
d718de812f Introduce kern_mmap_req().
This presents an extensible interface to the generic mmap(2)
implementation via a struct pointer intended to use a designated
initializer or compound literal.  We take advantage of the mandatory
zeroing of fields not listed in the initializer.

Remove kern_mmap_fpcheck() and use kern_mmap_req().

The motivation for this change is a desire to keep the core
implementation from growing an ever-increasing number of arguments
that must be specified in the correct order for the lowest-level
implementations.  In CheriBSD we have already added two more arguments.

Reviewed by:	kib
Discussed with:	kevans
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D23164
2020-03-04 21:27:12 +00:00
Tijl Coosemans
f8b9b299a2 linuxulator: Map scheduler priorities to Linux priorities.
On Linux the valid range of priorities for the SCHED_FIFO and SCHED_RR
scheduling policies is [1,99].  For SCHED_OTHER the single valid priority is
0.  On FreeBSD it is [0,31] for all policies.  Programs are supposed to
query the valid range using sched_get_priority_(min|max), but of course some
programs assume the Linux values are valid.

This commit adds a tunable compat.linux.map_sched_prio.  When enabled
sched_get_priority_(min|max) return the Linux values and sched_setscheduler
and sched_(get|set)param translate between FreeBSD and Linux values.

Because there are more Linux levels than FreeBSD levels, multiple Linux
levels map to a single FreeBSD level, which means pre-emption might not
happen as it does on Linux, so the tunable allows this behaviour to be disabled.
It is enabled by default because I think it is unlikely that anyone runs
real-time software under Linux emulation on FreeBSD that critically relies
on correct pre-emption.

This fixes FMOD, a commercial sound library used by several games.

PR:		240043
Tested by:	Alex S <iwtcex@gmail.com>
Reviewed by:	dchagin
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D23790
2020-03-01 13:12:04 +00:00
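The priority translation described in f8b9b299a2 can be illustrated with a small sketch. This is not the code from the commit: the function names `linux_to_bsd_prio` and `bsd_to_linux_prio`, the macro names, and the exact rounding are assumptions made for illustration. Only the ranges [1,99] and [0,31] come from the commit message.

```c
/* Level counts derived from the ranges in the commit message; names are hypothetical. */
#define LINUX_RT_LEVELS	99	/* Linux SCHED_FIFO/SCHED_RR priorities 1..99 */
#define BSD_RT_LEVELS	32	/* FreeBSD priorities 0..31 */

/*
 * Scale a Linux real-time priority in [1,99] down to [0,31].
 * Integer division truncates, so several Linux levels share one
 * FreeBSD level.
 */
int
linux_to_bsd_prio(int lprio)
{
	return ((lprio - 1) * BSD_RT_LEVELS / LINUX_RT_LEVELS);
}

/*
 * Map a FreeBSD priority in [0,31] back to the lowest Linux priority
 * that linux_to_bsd_prio() would translate to it (rounding up), so
 * that a value read back and set again does not drift.
 */
int
bsd_to_linux_prio(int bprio)
{
	return ((bprio * LINUX_RT_LEVELS + BSD_RT_LEVELS - 1) /
	    BSD_RT_LEVELS + 1);
}
```

In this sketch Linux priorities 1 through 4 all collapse onto FreeBSD priority 0, which is the loss of distinct pre-emption levels the commit message warns about.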
Edward Tomasz Napierala
5d481ad8df Make linuxulator warn about unsupported getsockopt/setsockopt flags.
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D23791
2020-02-27 19:40:20 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Edward Tomasz Napierala
0b40dcbe32 Make linux(4) use kern_socketpair(9) instead of going through
sys_socketpair().  It's a cleanup; no functional changes.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22814
2020-02-10 13:24:14 +00:00
Konstantin Belousov
8e3d7caee5 linux futex_put(): do not touch futex after dropping our reference.
Reported and tested by:	Steve Roome <me@stephenroome.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-02-07 22:21:44 +00:00
Ed Maste
fc7510aef7 linuxulator: implement sendfile
Submitted by:	Bora Özarslan <borako.ozarslan@gmail.com>
Submitted by:	Yang Wang <2333@outlook.jp>
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19917
2020-02-05 16:53:02 +00:00
Dmitry Chagin
7bc05ae6bb Fix clock_gettime() and clock_getres() for cpu clocks:
- handle the CLOCK_{PROCESS,THREAD}_CPUTIME_ID specified directly;
- fix thread id calculation as in the Linuxulator we should
  convert the user supplied thread id to struct thread * by linux_tdfind();
- fix CPUCLOCK_SCHED case by using kern_{process,thread}_cputime()
  directly as native get_cputime() used by kern_clock_gettime() uses
  native tdfind()/pfind() to find process/thread.

PR:			240990
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D23341
MFC after:		2 weeks
2020-02-04 05:27:05 +00:00
Dmitry Chagin
2506c76121 linux_to_native_clockid() properly initializes nwhich variable (or returns an error),
so don't initialize nwhich in declaration and remove stale comment from r161304.

Reviewed by:		emaste
Differential Revision:	https://reviews.freebsd.org/D23339
MFC after:		2 weeks
2020-02-04 05:23:34 +00:00
Mateusz Guzik
7739d92766 cache: replace kern___getcwd with vn_getcwd
The previous routine was resulting in extra data copies most notably in
linux_getcwd.
2020-02-01 20:38:38 +00:00
Edward Tomasz Napierala
c2d4745705 Add TCP_CORK support to linux(4). This fixes one of the things Nginx
trips over.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23171
2020-01-28 13:57:24 +00:00
Edward Tomasz Napierala
da6d8ae6d8 Add compat.linux.ignore_ip_recverr sysctl. This is a workaround
for missing IP_RECVERR setsockopt(2) support. Without it, DNS
resolution is broken for glibc >= 2.30 (glibc BZ #24047).

From the user point of view this fixes "yum update" on recent
CentOS 8.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23234
2020-01-28 13:51:53 +00:00
Edward Tomasz Napierala
618b55c2e2 Make linux(4) handle MAP_32BIT.
This unbreaks Mono (mono-devel-4.6.2.7+dfsg-1ubuntu1 from Ubuntu Bionic);
previously would crash on "amd64_is_imm32" assert.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23306
2020-01-24 12:08:23 +00:00
Edward Tomasz Napierala
b3fb13eb55 Add kern_unmount() and use in Linuxulator. No functional changes.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22646
2020-01-24 11:57:55 +00:00
Edward Tomasz Napierala
10f2d3f857 Revert r356948; breaks build somehow.
2020-01-21 20:32:49 +00:00
Edward Tomasz Napierala
c5f4e26e7d Make linux(4) handle MAP_32BIT.
This unbreaks Mono (mono-devel-4.6.2.7+dfsg-1ubuntu1 from Ubuntu Bionic);
previously would crash on "amd64_is_imm32" assert.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-01-21 19:19:02 +00:00
Mark Johnston
149afbf3ba Fix 64-bit syscall argument fetching in 32-bit Linux syscall handlers.
The Linux32 system call argument fetcher places each argument (passed in
registers in the Linux x86 system call convention) into an entry in the
generic system call args array.  Each member of this array is 8 bytes
wide, so this approach is broken for system calls that take off_t
arguments.

Fix the problem by splitting l_loff_t arguments in the 32-bit system
call descriptions, the same as we do for FreeBSD32.  Change entry points
to handle this using the PAIR32TO64 macro.

Move linux_ftruncate64() into compat/linux.

PR:		243155
Reported by:	Alex S <iwtcex@gmail.com>
Reviewed by:	kib (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23210
2020-01-21 17:28:22 +00:00
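The argument splitting described in 149afbf3ba can be sketched as follows. A 32-bit caller passes a 64-bit offset as two 32-bit register values, and the entry point has to glue them back together. The helper `pair32_to_64` is a hypothetical stand-in written for this illustration; it is not the kernel's PAIR32TO64 macro, and the order in which the halves arrive is an assumption.

```c
#include <stdint.h>

/*
 * Hypothetical helper: rebuild a 64-bit offset from the low and high
 * 32-bit halves that a 32-bit caller passes as separate arguments.
 */
int64_t
pair32_to_64(uint32_t lo, uint32_t hi)
{
	/* Widen before shifting so the high half is not lost. */
	return ((int64_t)(((uint64_t)hi << 32) | lo));
}
```

Reading only one 32-bit argument slot would silently drop the other half of the offset, which is the failure mode the commit describes for system calls that take off_t arguments.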
Edward Tomasz Napierala
66632fe7bb Properly translate MNT_FORCE flag to Linux umount2(2). Previously
it worked by accident.

MFC after:	2 weeks
Sponsored by:	DARPA
2020-01-20 12:16:32 +00:00
Edward Tomasz Napierala
9c6eb0f92f Make linux(4) use kern_setsockopt(9) instead of going through
sys_setsockopt.  Just a cleanup; no functional changes.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22812
2020-01-14 11:33:07 +00:00
Edward Tomasz Napierala
dfd060c0b6 Make linux(4) use kern_getsockopt(9) instead of going through
sys_getsockopt().  It's a cleanup; no functional changes.

Reviewed by:	kib (earlier version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22813
2020-01-14 11:30:30 +00:00
Edward Tomasz Napierala
46209ceae5 Make linux getcpu(2) report the domain.
Submitted by:	markj
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23144
2020-01-14 11:24:06 +00:00
Edward Tomasz Napierala
ca603bb1ee Add kern_getpriority(), make Linuxulator use it.
Reviewed by:	kib, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22842
2020-01-12 14:25:44 +00:00
Edward Tomasz Napierala
7a0ef283e6 Add kern_setpriority(), use it in Linuxulator.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22841
2020-01-12 13:38:51 +00:00
Mateusz Guzik
c8b3463dd0 vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT)
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.

Reviewed by:	kib (previous version)
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D23036
2020-01-07 15:56:24 +00:00
Kyle Evans
18348a2369 kern_mmap: add a variant that allows caller to inspect fp
Linux mmap rejects mmap() on a write-only file with EACCES.
linux_mmap_common currently does a fun dance to grab the fp associated with
the passed in fd, validates it, then drops the reference and calls into
kern_mmap(). Doing so is perhaps both fragile and premature; there's still
plenty of chance for the request to get rejected with a more appropriate
error, and it's prone to a race where the file we ultimately mmap has
changed after it drops its reference.

This change alleviates the need to do this by providing a kern_mmap variant
that allows the caller to inspect the fp just before calling into the fileop
layer. The callback takes flags, prot, and maxprot as one could imagine
scenarios where any of these, in conjunction with the file itself, may
influence a caller's decision.

The file type check in the linux compat layer has been removed; EINVAL is
seemingly not an appropriate response to the file not being a vnode or
device. The fileop layer will reject the operation with ENODEV if it's not
supported, which more closely matches the common linux description of
mmap(2) return values.

If we discover that we're allowing an mmap() on a file type that Linux
normally wouldn't, we should restrict those explicitly.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22977
2020-01-04 23:39:58 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Edward Tomasz Napierala
cc50333011 Add basic getcpu(2) support to linuxulator. The purpose of this
syscall is to query the CPU number and the NUMA domain the calling
thread is currently running on.  The third argument is ignored.
It doesn't do anything regarding scheduling - it's literally
just a way to query the current state, without any guarantees
you won't get rescheduled an opcode later.

This unbreaks Java from CentOS 8
(java-11-openjdk-11.0.5.10-0.el8_0.x86_64).

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22972
2019-12-31 22:01:08 +00:00
Pawel Biernacki
54666dffa8 linux(4): implement copy_file_range(2)
copy_file_range(2) is implemented natively since r350315, make it available
for Linux binaries too.

Reviewed by:	kib (mentor), trasz (previous version)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D22959
2019-12-30 18:11:06 +00:00
Edward Tomasz Napierala
ee0fe82ee2 Implement Linux syslog(2) syscall; just enough to make Linux dmesg(8)
utility work.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22465
2019-12-29 15:53:55 +00:00
Edward Tomasz Napierala
273ce4ae1b Make Linux stat(2) et al distinguish between block and character
devices.  It's required for LTP, among other things.  It's not
complete, but good enough for now.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22950
2019-12-29 13:54:02 +00:00
Edward Tomasz Napierala
b9594cd92d Implement Linux BLKGETSIZE64 ioctl.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-12-29 12:22:11 +00:00
Edward Tomasz Napierala
135b723803 Make linux mount(2) tolerate NULL 'from' argument, and fix flag
handling.

This should unbreak access04, acct01, chmod06, creat06,
and fchmod06 LTP tests.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-12-28 13:35:54 +00:00
Edward Tomasz Napierala
2cf9eb6c4e Add a hack to make ^T work for Linux binaries, enabled with
'compat.linux.preserve_vstatus=1' sysctl.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21967
2019-12-16 20:15:44 +00:00
Edward Tomasz Napierala
b5f20658ee Add compat.linux.emul_path, so it can be set to something other
than "/compat/linux".  Useful when you have several compat directories
with different Linux versions and you don't want to clash with files
installed by linux-c7 packages.

Reviewed by:	bcr (manpages)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22574
2019-12-16 20:07:04 +00:00
Edward Tomasz Napierala
518cce0274 Don't use K&R definitions. No functional changes.
Reported by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-12-16 17:45:15 +00:00
Jeff Roberson
61a74c5ccd schedlock 1/4
Eliminate recursion from most thread_lock consumers.  Return from
sched_add() without the thread_lock held.  This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks.  This will eventually allow for lockless remote adds.

Discussed with:	kib
Reviewed by:	jhb
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22626
2019-12-15 21:11:15 +00:00
Edward Tomasz Napierala
cf69fe66d4 Add sync_file_range(2) implementation to linux(4); it's a thin wrapper
over the usual fsync(2).

This silences some warnings when running "apt-get upgrade".

Reviewed by:	brooks, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22371
2019-12-14 13:37:17 +00:00
Edward Tomasz Napierala
34ad5ac242 Add kern_kill() and use it in Linuxulator. It's just a cleanup,
no functional changes.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22645
2019-12-13 18:44:02 +00:00
Edward Tomasz Napierala
be2cfdbc86 Add kern_getsid() and use it in Linuxulator; no functional changes.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22647
2019-12-13 18:39:36 +00:00
Jeff Roberson
4504268a1b Fix the last few cases that grab without busy or valid. The grab functions must
return the page in some held state for consistency elsewhere.

Reviewed by:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22610
2019-12-02 22:38:25 +00:00
Vladimir Kondratyev
71b8e362c5 Linux epoll: Allow passing of any negative timeout value to epoll_wait
Linux epoll allows passing any negative timeout value to epoll_wait()
to cause unbounded blocking.

Reviewed by:	emaste
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22517
2019-11-24 20:51:09 +00:00
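The timeout rule from 71b8e362c5 can be sketched as a small conversion helper. The function name `epoll_timeout_to_ts` and the convention that a NULL result means "no deadline" are assumptions made for this example; the commit message only states that any negative timeout must block indefinitely.

```c
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper: convert an epoll_wait() timeout in milliseconds
 * into a timespec.  Any negative value, not just -1, yields NULL, which
 * the caller is assumed to treat as "block without a deadline".
 */
struct timespec *
epoll_timeout_to_ts(int timeout_ms, struct timespec *ts)
{
	if (timeout_ms < 0)
		return (NULL);
	ts->tv_sec = timeout_ms / 1000;
	ts->tv_nsec = (long)(timeout_ms % 1000) * 1000000L;
	return (ts);
}
```

Testing for `timeout_ms < 0` rather than `timeout_ms == -1` is the whole point of the sketch: a check against -1 alone would reject other negative values that Linux accepts.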
Vladimir Kondratyev
335fe0afb8 Linux epoll: Register events with zero event mask
Such events are legal and should be interpreted as EPOLLERR | EPOLLHUP.
Register a disabled kqueue event in that case as we do not support EPOLLHUP yet.

Required by Linux Steam client.

PR:		240590
Reported by:	Alex S <iwtcex@gmail.com>
Reviewed by:	emaste
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22516
2019-11-24 20:47:40 +00:00
Vladimir Kondratyev
461120b834 Linux epoll: Check both read and write kqueue events existence in EPOLL_CTL_ADD
Linux epoll EPOLL_CTL_ADD op handler should always check registration
of both EVFILT_READ and EVFILT_WRITE kevents to decide if the supplied
file descriptor fd is already registered with epoll instance.

Reviewed by:	emaste
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22515
2019-11-24 20:44:14 +00:00
Vladimir Kondratyev
896a4c279d Linux epoll: Don't deregister file descriptor after EPOLLONESHOT is fired
Linux epoll does not remove descriptor after one-shot event has been triggered.
Set the EV_DISPATCH kqueue flag rather than EV_ONESHOT to get the same behavior.

Required by Linux Steam client.

PR:		240590
Reported by:	Alex S <iwtcex@gmail.com>
Reviewed by:	emaste, imp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22513
2019-11-24 20:41:47 +00:00
Mateusz Guzik
4de1818baf linux: avoid overhead of P_CONTROLT checks if possible
Sponsored by:	The FreeBSD Foundation
2019-11-20 12:06:29 +00:00
Edward Tomasz Napierala
dfe91e5e34 Make linux(4) open(2)/openat(2) return ELOOP instead of EMLINK,
when being passed O_NOFOLLOW.  This fixes LTP testcase openat02:5.

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22384
2019-11-18 10:19:16 +00:00
Edward Tomasz Napierala
299cb52a80 Support O_CLOEXEC in linux(4) open(2) and openat(2).
Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21966
2019-11-15 16:21:46 +00:00
Ed Maste
01b9ee4c50 linux_renameat2: improve flag checks
In the cases where Linux returns an error (e.g. passing in an undefined
flag) there's no need for us to emit a message.  (The target of this
message is a developer working on the linuxulator, not the author of
presumably broken Linux software).

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21606
2019-11-07 15:51:44 +00:00
Edward Tomasz Napierala
044ab55e41 Make linux(4) create /dev/shm. Linux applications often expect
a tmpfs to be mounted there, and because they like to verify it's
actually a mountpoint, a symlink won't do.

Reviewed by:	dchagin (earlier version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20333
2019-11-06 20:53:33 +00:00
Yuri Pankov
a161fba992 linux: futex_mtx should follow futex_list
Move futex_mtx to linux_common.ko for amd64 and aarch64 along
with respective list/mutex init/destroy.

PR:		240989
Reported by:	Alex S <iwtcex@gmail.com>
2019-10-18 12:25:33 +00:00
Yuri Pankov
b9d3556a34 linux: provide just one instance of futex_list
Move futex_list definition to linux.c which is included once
in linux.ko (i386) and in linux_common.ko (amd64 and aarch64)
allowing 32/64 bit linux programs to access the same futexes
in the latter case.

PR:		240989
Reviewed by:	dchagin
Differential Revision:	https://reviews.freebsd.org/D22073
2019-10-18 10:28:08 +00:00
Pawel Biernacki
ea2609a490 linux_renameat2: don't add extra \n on error.
linux_msg() already adds \n at the end of all messages.

Reported by:	emaste, kib (mentor), mjg (mentor)
Reviewed by:	kib (mentor), mjg (mentor)
Differential Revision:	https://reviews.freebsd.org/D21852
2019-09-30 19:05:14 +00:00
Ed Maste
2eb6ef203a linux: add trivial renameat2 implementation
Just return EINVAL if flags != 0.  The Linux man page documents one
case of EINVAL as "The filesystem does not support one of the flags in
flags."

After r351723 userland binaries will try using new system calls.

Reported by:	mjg
Reviewed by:	mjg, trasz
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21590
2019-09-11 13:01:59 +00:00
Edward Tomasz Napierala
1d3a302b4a Bump Linux version to 3.2.0. Without it, binaries linked against
glibc 2.24 and up (eg Ubuntu 19.04) fail with "FATAL: kernel too old".

This alone is not enough to make newer binaries actually work;
fix/hack/workaround is pending review at https://reviews.freebsd.org/D20687.

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20757
2019-09-02 18:10:35 +00:00
Edward Tomasz Napierala
7a8cbc5288 Relax compat.linux.osrelease checks. This way one can do eg
'compat.linux.osrelease=3.10.0-957.12.1.el7.x86_64', which
corresponds to CentOS 7.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20685
2019-09-02 16:57:42 +00:00
Konstantin Belousov
bb9e2184f0 Change locking requirements for VOP_UNSET_TEXT().
Require the vnode to be locked for the VOP_UNSET_TEXT() call.  This
will be used by the following bug fix for a tmpfs issue.

Tested by:	sbruno, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-18 20:24:52 +00:00
Konstantin Belousov
62375ca79c compat/linux: Remove obsoleted and somewhat confusing comments related to COMPAT_43.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21200
2019-08-11 19:17:29 +00:00
Andriy Gapon
c66f5b079d linuxcommon: add module version
MFC after:	2 weeks
2019-07-10 13:47:10 +00:00
Tijl Coosemans
e2fba140a8 Let linuxulator mprotect mask unsupported bits before calling kern_mprotect.
After r349240 kern_mprotect returns EINVAL for unsupported bits in the prot
argument.  Linux rtld uses PROT_GROWSDOWN and PROT_GROWSUP when marking the
stack executable.  Mask these bits like kern_mprotect used to do.  For other
unsupported bits EINVAL is returned like Linux does.
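
The masking rule described above can be sketched roughly as follows. This is
an illustration only, not the committed code: the helper name
linux_prot_filter() and the L_PROT_* names are hypothetical, and the constants
are the Linux values defined locally so the sketch builds on any system.

```c
#include <errno.h>

/* Linux protection bits, defined locally for this sketch (assumed names). */
#define	L_PROT_READ		0x1
#define	L_PROT_WRITE		0x2
#define	L_PROT_EXEC		0x4
#define	L_PROT_GROWSDOWN	0x01000000
#define	L_PROT_GROWSUP		0x02000000

/*
 * Hypothetical helper: drop the two GROWS* bits silently, reject any other
 * unknown bit with EINVAL, and hand back the cleaned value through *out.
 */
static inline int
linux_prot_filter(int prot, int *out)
{
	prot &= ~(L_PROT_GROWSDOWN | L_PROT_GROWSUP);
	if ((prot & ~(L_PROT_READ | L_PROT_WRITE | L_PROT_EXEC)) != 0)
		return (EINVAL);
	*out = prot;
	return (0);
}
```

A caller would pass the cleaned value on to the native mprotect path only when
the helper returns 0.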

Reviewed by:	trasz, brooks
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20864
2019-07-10 08:19:33 +00:00
Ed Maste
b97ebbbf72 Update Linux compat version to 2.6.36
New system calls between 2.6.32 and 2.6.36 are already implemented.

This should be mostly NFC as far as contemporary Linux applications are
concerned though, as Linux kernel 3.2 is the oldest supported by a
number of popular distros today; work is in progress by others to enable
support for those applications.

Discussed with:	trasz
MFC after:	1 month
2019-07-04 20:42:08 +00:00
Edward Tomasz Napierala
0fabd7b5cc Return ENOTSUP for Linux FS_IOC_FIEMAP ioctl.
Linux man(1) calls it for no good reason; this avoids the console spam
(eg '(man): ioctl fd=4, cmd=0x660b ('f',11) is not implemented').

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20690
2019-07-04 20:16:04 +00:00
Edward Tomasz Napierala
2478d444d1 Fix linuxulator prlimit64(2) with pid == 0. This makes 'ulimit -a'
return something reasonable, and helps linux binaries which attempt
to close all the files, eg apt(8).

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20692
2019-07-04 19:40:01 +00:00
Mark Johnston
fc795c25d4 Remove the CDIOCREADSUBCHANNEL_SYSSPACE ioctl.
This was added for emulation of Linux's CDROMSUBCHNL, but allows
users with read access to a cd(4) device to overwrite kernel memory
provided that the driver detects some media present.

Reimplement CDROMSUBCHNL by bouncing the data from CDIOCREADSUBCHANNEL
through the linux_cdrom_subchnl structure passed from userspace.

admbugs:	768
Reported by:	Alex Fortune
Security:	CVE-2019-5602
Security:	FreeBSD-SA-19:11.cd_ioctl
2019-07-03 00:10:01 +00:00
Dmitry Chagin
c5afec6e89 Complete LOCAL_PEERCRED support. Cache pid of the remote process in the
struct xucred. Do not bump XUCRED_VERSION as struct layout is not changed.

PR:		215202
Reviewed by:	tijl
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20415
2019-05-30 14:24:26 +00:00
Dmitry Chagin
1410bfe142 Linux does not support MSG_OOB for unix(4) or non-stream-oriented sockets;
return EOPNOTSUPP as Linux does.

Reviewed by:	tijl
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20409
2019-05-30 14:21:51 +00:00
Dmitry Chagin
8128cfc59e Do not leak sa in linux_recvmsg() call if kern_recvit() fails.
MFC after:	1 week
2019-05-21 18:08:19 +00:00
Dmitry Chagin
57cb29a73e Do not use uninitialised sa.
Reported by:	tijl@
MFC after:	1 week
2019-05-21 18:05:57 +00:00
Dmitry Chagin
dcd6241868 Do not leak sa in linux_recvfrom() call if kern_recvit() fails.
MFC after:	1 week
2019-05-21 18:03:58 +00:00
Edward Tomasz Napierala
d49fb289c8 Implement PTRACE_O_TRACESYSGOOD. This makes Linux strace(1) work.
Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20200
2019-05-19 12:58:44 +00:00
Dmitry Chagin
aa28871254 Linux send() call returns EAGAIN instead of ENOTCONN in case when the
socket is non-blocking and connect() is not finished yet.

Initial patch developed by Steven Hartland in 2008 and adopted by me.

PR:		129169
Reported by:	smh@
MFC after:	2 weeks
2019-05-19 09:23:20 +00:00
Dmitry Chagin
c5156c7785 Linuxulator depends on fundamental kernel settings such as SMP. Many
of them are listed in opt_global.h, which is not generated when building
modules outside of a kernel, so such modules never match the real configured
kernel.

So, we should prevent our users from building obviously defective modules.

Therefore, remove the root cause of building modules outside of a
kernel: the possibility of building modules with DEBUG or KTR flags.
Also remove all of the DEBUG printfs, as they are incomplete and not
informative in threaded programs; moreover, half of the system calls do not
have a DEBUG printf. For debugging Linux programs we have dtrace, ktr and ktrace.

PR:		222861
Reviewed by:	trasz
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20178
2019-05-13 18:24:29 +00:00
Dmitry Chagin
caaad8736e Linuxulator getpeername() returns EINVAL when namelen is less than 0.
MFC after:	2 weeks
2019-05-13 18:14:20 +00:00
Dmitry Chagin
d5368bf3df Our bsd_to_linux_sockaddr() and linux_to_bsd_sockaddr() functions
alter the userspace sockaddr to convert the format between Linux and BSD versions.
That's a minimum of 3 copyin/copyout operations for one syscall.

Also, some syscalls use linux_sa_put() and linux_getsockaddr() when copying a
sockaddr to userspace or from userspace, respectively.

To avoid this chaos, especially converting the sockaddr in userspace,
rewrite these 4 functions to convert the sockaddr only in the kernel and keep
only 2 of these functions.

Also, in order to reduce duplication between MD parts of the Linuxulator, move
the struct sockaddr conversion functions that are MI into the linux_common module.

PR:		232920
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20157
2019-05-13 17:48:16 +00:00
Dmitry Chagin
bbac65c772 Rewrite linux_ifflags() in more readable Linuxulator style.
Reviewed by:	emaste
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20146
2019-05-06 19:57:51 +00:00
Dmitry Chagin
9c1437ae57 Complete r347052 (https://reviews.freebsd.org/D20137) as it was not
a final revision.

Fix style issues and change bool-like variables from int to bool.

Reviewed by:	emaste
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20141
2019-05-06 19:56:13 +00:00
Konstantin Belousov
78022527bb Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal.

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuits the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
Dmitry Chagin
d151344dbf In order to reduce duplication between MD parts of the Linuxulator
move bits that are MI out into the headers in compat/linux.
For that remove bogus _packed attribute from struct l_sockaddr
and use MI types for struct members.

And continue to move into the linux_common module code that is
intended for both Linuxulator modules (both instruction sets, 32 & 64 bit)
or for external modules like linsysfs or linprocfs.

To avoid header pollution introduce new sys/compat/linux_common.h header.

Reviewed by:	emaste
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20137
2019-05-03 08:42:49 +00:00
Dmitry Chagin
5d520d7fab Follow FreeBSD and implement PDEATH_SIG prctl ops in the Linuxulator.
It was first introduced in r163734 and missed by me in r283383.

MFC after:	1 week
2019-04-30 17:18:05 +00:00
Mariusz Zaborski
a1304030b8 Introduce funlinkat syscall that allows us to check if we are removing
the file associated with the given file descriptor.

Reviewed by:	kib, asomers
Reviewed by:	cem, jilles, brooks (they reviewed previous version)
Discussed with:	pjd, and many others
Differential Revision:	https://reviews.freebsd.org/D14567
2019-04-06 09:34:26 +00:00
Dmitry Chagin
803fff9065 Whitespace cleanup (annoying).
MFC after:	1 month
2019-03-24 15:08:30 +00:00
Dmitry Chagin
f730d606d5 Update syscall.master to 5.0.
For 32-bit Linuxulator, ipc() syscall was historically
the entry point for the IPC API. Starting in Linux 4.18, direct
syscalls are provided for the IPC. Enable it.

MFC after:	1 month
2019-03-24 14:50:02 +00:00
Dmitry Chagin
7dabf89bcf Linux between 4.18 and 5.0 split IPC system calls.
In preparation for doing this in the Linuxulator modify our linux_shmat()
to match actual Linux shmat() system call.

MFC after:	1 month
2019-03-24 14:44:35 +00:00
Ed Maste
347a8ed1bf linuxulator: fix stack memory disclosure in linux_sigaltstack
Most siginfo_to_lsiginfo callers already zeroed the l_siginfo_t before
calling it, but linux_waitid did not.  Instead of zeroing in the called
function to address linux_waitid (as in commit 2e6ebe70), just do it in
linux_waitid.

admbugs:	765
Reported by:	Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by:	Andrew
MFC after:	1 day
Security:	Kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
2019-01-21 17:12:16 +00:00
Ed Maste
9866e7bbae linuxulator: fix stack memory disclosure in linux_ioctl_termio
admbugs:	765
Reported by:	Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by:	andrew
MFC after:	1 day
Security:	Kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
2019-01-21 16:21:03 +00:00
Ed Maste
4308a37410 linuxulator: fix stack memory disclosure in linux_ioctl_v4l
admbugs:	765
Reported by:	Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by:	andrew
MFC after:	1 day
Security:	Kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
2019-01-21 16:19:02 +00:00
Kirk McKusick
88640c0e8b Create new EINTEGRITY error with message "Integrity check failed".
An integrity check such as a check-hash or a cross-correlation failed.
The integrity error falls between EINVAL that identifies errors in
parameters to a system call and EIO that identifies errors with the
underlying storage media. EINTEGRITY is typically raised by intermediate
kernel layers such as a filesystem or an in-kernel GEOM subsystem when
they detect inconsistencies. Uses include allowing the mount(8) command
to return a different exit value to automate the running of fsck(8)
during a system boot.

These changes make no use of the new error, they just add it. Later
commits will be made for the use of the new error number and it will
be added to additional manual pages as appropriate.

Reviewed by:    gnn, dim, brueffer, imp
Discussed with: kib, cem, emaste, ed, jilles
Differential Revision: https://reviews.freebsd.org/D18765
2019-01-17 06:35:45 +00:00
Mark Johnston
bb376a990c Specify the correct option level when emulating SO_PEERCRED.
Our equivalent to SO_PEERCRED, LOCAL_PEERCRED, is implemented at
socket option level 0, not SOL_SOCKET.

PR:		234722
Submitted by:	Dániel Bakai <bakaidl@gmail.com>
MFC after:	2 weeks
2019-01-08 17:21:59 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with coccinelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mark Johnston
792843c38f Pass malloc flags directly through kevent(2) subroutines.
Some kevent functions have a boolean "waitok" parameter for use when
calling malloc(9).  Replace them with the corresponding malloc() flags:
the desired behaviour is known at compile-time, so this eliminates a
couple of conditional branches, and makes the code easier to read.

No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18318
2018-11-24 17:06:01 +00:00
Tijl Coosemans
7df0e7beb7 Fix another user address dereference in linux_sendmsg syscall.
This was hidden behind the LINUX_CMSG_NXTHDR macro which dereferences its
second argument.  Stop using the macro as well as LINUX_CMSG_FIRSTHDR.  Use
the size field of the kernel copy of the control message header to obtain
the next control message.

PR:		217901
MFC after:	2 days
X-MFC-With:	r340631
2018-11-20 14:18:57 +00:00
Tijl Coosemans
e3b385fc95 Do proper copyin of control message data in the Linux sendmsg syscall.
Instead of calling m_append with a user address, allocate an mbuf cluster
and copy data into it using copyin.  For the SCM_CREDS case, instead of
zeroing a stack variable and appending that to the mbuf, zero part of the
mbuf cluster directly.  One mbuf cluster is also the size limit used by
the FreeBSD sendmsg syscall (uipc_syscalls.c:sockargs()).

PR:		217901
Reviewed by:	kib
MFC after:	3 days
2018-11-19 15:31:54 +00:00
Mateusz Guzik
2c054ce924 proc: always store parent pid in p_oppid
Doing so removes the dependency on proctree lock from sysctl process list
export which further reduces contention during poudriere -j 128 runs.

Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17825
2018-11-16 17:07:54 +00:00
Brooks Davis
318f0d7720 Use declared types for caddr_t arguments.
Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17852
2018-11-06 18:46:38 +00:00
Tijl Coosemans
8fc08087a1 On amd64 both Linux compat modules, linux.ko and linux64.ko, provide
linux_ioctl_(un)register_handler that allows other driver modules to
register ioctl handlers.  The ioctl syscall implementation in each Linux
compat module iterates over the list of handlers and forwards the call to
the appropriate driver.  Because the registration functions have the same
name in each module it is not possible for a driver to support both 32 and
64 bit linux compatibility.

Move the list of ioctl handlers to linux_common.ko so it is shared by
both Linux modules and all drivers receive both 32 and 64 bit ioctl calls
with one registration.  These ioctl handlers normally forward the call
to the FreeBSD ioctl handler which can handle both 32 and 64 bit.

Keep the special COMPAT_LINUX32 ioctl handlers in linux.ko in a separate
list for now and let the ioctl syscall iterate over that list first.
Later, COMPAT_LINUX32 support can be added to the 64 bit ioctl handlers
via a runtime check for ILP32 like is done for COMPAT_FREEBSD32 and then
this separate list would disappear again.  That is a much bigger effort
however and this commit is meant to be MFCable.

This enables linux64 support in x11/nvidia-driver*.

PR:		206711
Reviewed by:	kib
MFC after:	3 days
2018-11-06 13:51:08 +00:00
Konstantin Belousov
4f77f48884 Implement O_BENEATH and AT_BENEATH.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory.  Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by:	jilles (code, previous version of manpages), 0mp (manpages)
Discussed with:	allanjude, emaste, jonathan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17547
2018-10-25 22:16:34 +00:00
Mark Johnston
c7902fbeae Improve handling of control message truncation.
If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages.  In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table.  Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient.  To simplify cleanup, control message truncation
is now only performed at control message boundaries.

The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
  while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
  message boundary, e.g., if the caller received two control messages
  and provided only the exact amount of buffer space needed for the
  first.

PR:		131876
Reviewed by:	ed (previous version)
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16561
2018-08-07 16:36:48 +00:00
Alan Somers
6040822c4e Make timespecadd(3) and friends public
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.

Our kernel currently defines two-argument versions of timespecadd and
timespecsub.  NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions.  Solaris also defines a three-argument version, but
only in its kernel.  This revision changes our definition to match the
common three-argument version.
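
For illustration, a three-argument addition in the style described above might
look like the sketch below. The macro name ts_add3 is hypothetical, chosen so
it does not collide with the real timespecadd() on systems that already
provide it; the actual definition in sys/time.h may differ in detail.

```c
#include <time.h>

/*
 * Sketch of a three-argument add: *vsp = *tsp + *usp.
 * Assumes both inputs are normalized (0 <= tv_nsec < 1000000000), so a
 * single carry step is enough to normalize the result.
 */
#define	ts_add3(tsp, usp, vsp)						\
	do {								\
		(vsp)->tv_sec = (tsp)->tv_sec + (usp)->tv_sec;		\
		(vsp)->tv_nsec = (tsp)->tv_nsec + (usp)->tv_nsec;	\
		if ((vsp)->tv_nsec >= 1000000000L) {			\
			(vsp)->tv_sec++;				\
			(vsp)->tv_nsec -= 1000000000L;			\
		}							\
	} while (0)
```

Unlike the two-argument form, which accumulates into its first operand, this
form leaves both inputs untouched and writes the sum to a separate result.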

Bump _FreeBSD_version due to the breaking KPI change.

Discussed with:	cem, jilles, ian, bde
Differential Revision:	https://reviews.freebsd.org/D14725
2018-07-30 15:46:40 +00:00
Ed Maste
e8a1ec3e05 Split kern_break from sys_break and use it in linuxulator
Previously the linuxulator's linux_brk invoked the FreeBSD sys_break
syscall implementation directly.  Instead, move the bulk of the existing
implementation to kern_break, and call that from both sys_break and
linux_brk.

This also addresses a minor bug in linux_brk in that we now return the
actual (rounded up) break address, rather than the requested value.

Reviewed by:	brooks (earlier version)
Sponsored by:	Turing Robotic Industries
Differential Revision:	https://reviews.freebsd.org/D16019
2018-06-27 14:45:13 +00:00
Ed Maste
9c42fa94a6 Quiet unused fn warning for linuxulator w/o legacy syscalls
Sponsored by:	Turing Robotic Industries
2018-06-25 19:24:50 +00:00
Chuck Tuffli
3575504976 Fix the Linux kernel version number calculation
The Linux compatibility code was converting the version number (e.g.
2.6.32) in two different ways and then comparing the results.

The linux_map_osrel() function converted MAJOR.MINOR.PATCH similar to
what FreeBSD does natively. I.e. where major=v0, minor=v1, and patch=v2
    v = v0 * 1000000 + v1 * 1000 + v2;

The LINUX_KERNVER() macro, on the other hand, converted the value with
bit shifts. I.e. where major=a, minor=b, and patch=c
    v = (((a) << 16) + ((b) << 8) + (c))

The Linux kernel uses the latter format via the KERNEL_VERSION() macro in
include/generated/uapi/linux/version.h

Fix is to use the LINUX_KERNVER() macro in linux_map_osrel() as well as
in the .trans_osrel functions.
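
To see why comparing the two encodings went wrong, the two formulas quoted
above can be put side by side. The function names below are hypothetical and
exist only for this sketch; the arithmetic is taken directly from the text.

```c
#include <stdint.h>

/* Decimal packing, as described for the old linux_map_osrel() conversion. */
static inline uint32_t
osrel_decimal(uint32_t major, uint32_t minor, uint32_t patch)
{
	return (major * 1000000u + minor * 1000u + patch);
}

/* Bit-shift packing, following the formula quoted for LINUX_KERNVER(). */
static inline uint32_t
osrel_shifted(uint32_t major, uint32_t minor, uint32_t patch)
{
	return ((major << 16) + (minor << 8) + patch);
}
```

For 2.6.32 the decimal form yields 2006032 while the shifted form yields
0x020620, so a comparison across the two encodings cannot be meaningful.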

PR: 229209
Reviewed by: emaste, cem, imp (mentor)
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15952
2018-06-22 00:02:03 +00:00
Konstantin Belousov
31665c1abb linux_clone_thread: mark new thread as TDB_BORN.
So that the ptrace code will catch it and report it to attached
debugger.  Enables debugging of threaded Linux binaries with FreeBSD
debugger.

Submitted by:	Yanko Yankulov <yanko.yankulov@gmail.com>
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15880
2018-06-21 21:15:04 +00:00
Ed Maste
aec1e6d390 linuxulator: handle V3 capget/capset
Linux 2.6.26 introduced 64-bit capability sets.  Extend our stub
implementation to handle both 32- and 64-bit.  (We still report no
capabilities in capget, and disallow any in capset.)

Reviewed by:	chuck
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D15887
2018-06-19 21:26:23 +00:00
Ed Maste
645f3d4345 linuxulator: add debugging for invalid capget/capset version
Sponsored by:	Turing Robotic Industries Inc.
2018-06-18 18:43:45 +00:00
Ed Maste
931e2a1a6e linuxulator: do not include legacy syscalls on arm64
Existing linuxulator platforms (i386, amd64) support legacy syscalls,
such as non-*at ones like open, but arm64 and other new platforms do
not.

Wrap these in #ifdef LINUX_LEGACY_SYSCALLS, #defined in the MD linux.h
files.  We may need finer grained control in the future but this is
sufficient for now.

Reviewed by:	andrew
Sponsored by:	Turing Robotic Industries
Differential Revision:	https://reviews.freebsd.org/D15237
2018-06-15 14:41:51 +00:00
Ed Maste
4b842782da Correct debug control for linuxulator faccessat
The Linuxulator provides per-syscall debug control via the
compat.linux.debug sysctl.  There's generally a 1:1 mapping between
sysctl setting and syscall, but faccessat was controlled by the access
setting, perhaps due to copy-paste.

Sponsored by:	Turing Robotic Industries
2018-06-15 14:29:41 +00:00
Brooks Davis
9da5364ed9 Name the implementation of brk and sbrk sys_break().
The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestige of a need for it to be called something else (eg obreak)
was removed in r225617 which consistently prefixed all syscall
implementations.

Reviewed by:	emaste, kib (older version)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15638
2018-06-14 21:27:25 +00:00
Bruce Evans
407a812657 Oops, r335053 had an old version of the comment about 16-bit linux dev_t
translation.
2018-06-13 12:44:45 +00:00
Bruce Evans
ab35e1c71b Fix the encoding of major and minor numbers in 64-bit dev_t by restoring
the old encodings for the lower 16 and 32 bits and only using the
higher 32 bits for unusually large major and minor numbers.  This
change breaks compatibility with the previous encoding (which was only
used in -current).

Fix truncation to (essentially) 16-bit dev_t in newnfs v3.

Any encoding of device numbers gives an ABI, so it can't be changed
without translations for compatibility.  Extra bits give the much
larger complication that the translations need to compress into fewer
bits.  Fortunately, more than 32 bits are rarely needed, so
compression is rarely needed except for 16-bit linux dev_t where it
was always needed but never done.

The previous encoding moved the major number into the top 32 bits.
Almost no translation code handled this, so the major number was blindly
truncated away in most 32-bit encodings.  E.g., for ffs, mknod(8) with
major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent
this and blindly truncated it to 2.  But if this mknod was run on any
released version of FreeBSD, it gives dev_t = 0x102.  ffs can represent
this, but in the previous encoding it was not decoded, giving major = 0,
minor = 0x102.

The presence of bugs was most obvious for exporting dev_t's from an
old system to -current, since bugs in newnfs augment them.  I fixed
oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed
to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then
further in -current.  E.g., old ad0 with major = 234, minor = 0x10002
had the correct (major, minor) number on the wire, but newnfs truncated
this to (234, 2) and then the previous encoding shifted the major
number into oblivion as seen by ffs or old applications.

I first tried to fix this by translating on every ABI/API boundary, but
there are too many boundaries and too many sloppy translations by blind
truncation.  So use the old encoding for the low 32 bits so that sloppy
translations work no worse than before provided the high 32 bits are
not set.  Add some error checking for when bits are lost.  Keep not
doing any error checking for translations for almost everything in
compat/linux.

compat/freebsd32/freebsd32_misc.c:
Optionally check for losing bits after possibly-truncating assignments as
before.

compat/linux/linux_stats.c:
Depend on the representation being compatible with Linux's (or just with
itself for local use) and spell some of the translations as assignments in
a macro that hides the details.

fs/nfsclient/nfs_clcomsubs.c:
Essentially the same fix as in 1996, except there is now no possible
truncation in makedev() itself.  Also fix nearby style bugs.

kern/vfs_syscalls.c:
As for freebsd32.  Also update the sysctl description to include file
numbers, and change it to describe device ids as device numbers.

sys/types.h:
Use inline functions (wrapped by macros) since the expressions are now
a bit too complicated for plain macros.  Describe the encoding and
some of the reasons for it.  16-bit compatibility didn't leave many
reasonable choices for the 32-bit encoding, and 32-bit compatibility
doesn't leave many reasonable choices for the 64-bit encoding.  My
choice is to put the 8 new minor bits in the low 8 bits of the top 32
bits.  This minimizes discontiguities.
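
A sketch of an encoding consistent with this description follows. It is an
assumption-laden illustration rather than a copy of sys/types.h: the names
dev_make(), dev_major() and dev_minor() are hypothetical, and a plain uint64_t
stands in for dev_t.

```c
#include <stdint.h>

/*
 * Layout assumed by this sketch:
 *   bits  0-7   minor bits 0-7     (old 16-bit encoding)
 *   bits  8-15  major bits 0-7     (old 16-bit encoding)
 *   bits 16-31  minor bits 16-31   (old 32-bit encoding)
 *   bits 32-39  minor bits 8-15    (the 8 new minor bits)
 *   bits 40-63  major bits 8-31
 */
static inline uint64_t
dev_make(uint32_t major, uint32_t minor)
{
	return (((uint64_t)(major & 0xffffff00u) << 32) |
	    ((major & 0xffu) << 8) |
	    ((uint64_t)(minor & 0xff00u) << 24) |
	    (minor & 0xffff00ffu));
}

static inline uint32_t
dev_major(uint64_t d)
{
	return ((uint32_t)(((d >> 32) & 0xffffff00u) | ((d >> 8) & 0xffu)));
}

static inline uint32_t
dev_minor(uint64_t d)
{
	return ((uint32_t)(((d >> 24) & 0xff00u) | (d & 0xffff00ffu)));
}
```

With this layout, major = 1 and minor = 2 encode to 0x102, matching the value
the text gives for released versions, and the high 32 bits stay zero unless a
major or minor number is unusually large.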

Reviewed by:	kib (except for rewrite of the comment in linux_stats.c)
2018-06-13 12:22:00 +00:00
Bruce Evans
372639f944 Fix some bugs found while fixing the representation and translation
of 64-bit dev_t's (but not ones involving dev_t's).

st_size was supposed to be clamped in cvtstat() and linux's copy_stat(),
but the clamping code wasn't aware that st_size is signed, and also had
an obfuscated off-by-1 value for the unsigned limit, so its effect was
to produce a bizarre negative size instead of clamping.
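
As an illustration of the intended behavior, clamping a signed 64-bit size
into a narrower signed field could be sketched as below. The helper name
clamp_size32() and the 32-bit target width are assumptions made for this
example; the real cvtstat() and copy_stat() code is not reproduced here.

```c
#include <stdint.h>

/*
 * Hypothetical helper: squeeze a signed 64-bit size into a signed 32-bit
 * field.  Values beyond the signed range are pinned to the nearest limit
 * instead of being truncated, which would otherwise wrap to a bogus
 * (possibly negative) size.
 */
static inline int32_t
clamp_size32(int64_t size)
{
	if (size > INT32_MAX)
		return (INT32_MAX);
	if (size < INT32_MIN)
		return (INT32_MIN);
	return ((int32_t)size);
}
```

The key point is that the comparison uses the signed maximum of the
destination type, not an unsigned limit.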

Change freebsd32's copy_ostat() to be no worse than cvtstat().  It was
missing clamping and bzero()ing of padding.

Reviewed by:	kib (except a final fix of the clamp to the signed maximum)
2018-06-13 08:50:43 +00:00
Matt Macy
4f6c66cc9c UDP: further performance improvements on tx
Cumulative throughput while running 64
  netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15409
2018-05-23 21:02:14 +00:00
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host is receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch:

InMpps OMpps InGbs OGbs     err TCP Est  %CPU syscalls     csw  irq GBfree
  5.43  0.00  4.20 0.00 3313143      33 84.96  5434214 1900162 2656 245.51
  5.43  0.00  4.20 0.00 3308527      33 85.24  5439695 1809382 2521 245.51
  5.42  0.00  4.19 0.00 3316778      33 87.54  5416028 1805835 2256 245.51
  5.42  0.00  4.19 0.00 3317673      33 90.44  5426044 1763056 2332 245.51
  5.42  0.00  4.19 0.00 3314839      33 88.11  5435732 1792218 2499 245.52
  5.44  0.00  4.19 0.00 3293228      33 91.84  5426301 1668597 2121 245.52

Similarly, netperf reports 230Mb/s before the patch and 270Mb/s after the patch.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
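The idea behind the cap_rights change above can be sketched in userland C. This is only an illustration under stated assumptions: `struct rights`, `rights_init`, `RIGHTS_INITIALIZER`, and the `RIGHT_*` bits are hypothetical stand-ins, not the kernel's cap_rights_t API, and the real layout differs. It contrasts initializing a rights object on every call with reusing one that was built once.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for cap_rights_t; the real layout differs. */
struct rights {
	uint64_t bits[2];
};

#define	RIGHT_READ	UINT64_C(0x1)	/* hypothetical right bit */
#define	RIGHT_FSTAT	UINT64_C(0x2)	/* hypothetical right bit */

/* Runtime initialization: this work is repeated on every call. */
static void
rights_init(struct rights *r, uint64_t mask)
{
	memset(r, 0, sizeof(*r));
	r->bits[0] = mask;
}

/* Compile-time initializer macro (hypothetical name). */
#define	RIGHTS_INITIALIZER(mask)	{ .bits = { (mask), 0 } }

/* Built once; hot paths only take its address. */
static const struct rights fstat_rights = RIGHTS_INITIALIZER(RIGHT_FSTAT);

static int
rights_has(const struct rights *r, uint64_t mask)
{
	return ((r->bits[0] & mask) == mask);
}

/* Before: a fresh object is initialized on each call. */
static int
check_fstat_slow(void)
{
	struct rights r;

	rights_init(&r, RIGHT_FSTAT);
	return (rights_has(&r, RIGHT_FSTAT));
}

/* After: the preinitialized object is reused. */
static int
check_fstat_fast(void)
{
	return (rights_has(&fstat_rights, RIGHT_FSTAT));
}
```

Both functions return the same answer; the second simply skips the per-call zeroing and bit setup, which is the kind of overhead the commit describes removing.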
Konstantin Belousov
0cde66af78 Fix futexes on i386 after the 4/4G split.
Use the proper method to access userspace.  For now, only the slow copyout
path is implemented.

Reported and tested by:	tijl (previous version)
Sponsored by:	The FreeBSD Foundation
2018-04-24 12:50:21 +00:00
Ed Maste
38be312706 Map FreeBSD EDOOFUS to Linux EINVAL
Previously EDOOFUS mapped to EBUSY.  EINVAL seems more appropriate.

Discussed with:	cem
MFC after:	1 week
Sponsored by:	Turing Robotic Industries Inc.
2018-04-23 18:33:26 +00:00
Ed Maste
b267239d4b linuxulator: deduplicate linux_exec_imgact_try
Previously linuxulator had three identical copies of
linux_exec_imgact_try.  Deduplicate before adding another arch to
linuxulator.

Sponsored by:	Turing Robotic Industries Inc
Differential Revision:	https://reviews.freebsd.org/D14856
2018-04-09 17:24:01 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Ed Maste
19406511e5 Fix kernel memory disclosure in linux_ioctl_socket
strlcpy is used to copy a string into a buffer to be copied to userland,
previously leaving uninitialized data after the terminating NUL.  Zero
the buffer first to avoid a kernel memory disclosure.

admbugs:	765, 811
MFC after:	1 day
Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reported by:	Vlad Tsyrklevich
Sponsored by:	The FreeBSD Foundation
2018-04-04 19:58:25 +00:00
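A minimal userland sketch of the pattern described in the linux_ioctl_socket fix above, assuming a fixed-size name buffer that is later copied out in full. `NAMESZ`, `bounded_copy`, and `fill_name` are hypothetical names; `bounded_copy` is a local helper written in the spirit of strlcpy so the example does not depend on the platform providing it.

```c
#include <stddef.h>
#include <string.h>

#define	NAMESZ	16	/* hypothetical size of the buffer sent to userland */

/*
 * Bounded copy that always NUL-terminates.  Like strlcpy, it writes
 * nothing past the terminating NUL, so the tail of dst is untouched.
 */
static void
bounded_copy(char *dst, const char *src, size_t dstsize)
{
	size_t n;

	if (dstsize == 0)
		return;
	n = strlen(src);
	if (n >= dstsize)
		n = dstsize - 1;
	memcpy(dst, src, n);
	dst[n] = '\0';
}

/*
 * Zero the whole buffer first so the bytes after the NUL cannot carry
 * stale data when all NAMESZ bytes are copied out.
 */
static void
fill_name(char out[NAMESZ], const char *name)
{
	memset(out, 0, NAMESZ);
	bounded_copy(out, name, NAMESZ);
}
```

Without the memset, a caller that copies all NAMESZ bytes out would also copy whatever happened to be in the buffer after the terminator.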
Ed Maste
d851b216eb linux_ioctl_hdio: fix kernel memory disclosure
Stack-allocated struct linux_hd_big_geometry has undeclared padding
copied to userland.

admbugs:	765
Reported by:	Vlad Tsyrklevich
MFC after:	1 day
Security:	Kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
2018-04-04 14:41:48 +00:00
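The padding problem described in the linux_ioctl_hdio fix above can be illustrated with a small sketch. `struct geom_sketch` and `geom_fill` are hypothetical; the field layout is an assumption chosen so that a typical ABI inserts padding after the two one-byte members, and it is not claimed to match struct linux_hd_big_geometry.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical structure with implicit padding on common ABIs. */
struct geom_sketch {
	unsigned char	heads;
	unsigned char	sectors;
	/* The compiler typically pads here to align the next member. */
	unsigned int	cylinders;
	unsigned long	start;
};

/*
 * Zero the whole object before assigning members, so the padding
 * bytes do not carry stale stack contents when the structure is
 * later copied out byte for byte.
 */
static void
geom_fill(struct geom_sketch *g, unsigned char heads, unsigned char sectors,
    unsigned int cylinders, unsigned long start)
{
	memset(g, 0, sizeof(*g));
	g->heads = heads;
	g->sectors = sectors;
	g->cylinders = cylinders;
	g->start = start;
}
```

Assigning each member individually is not enough on its own, because the padding bytes are never written by member assignments.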
Ed Maste
b1ed95fd9d Rationalize license text on Linuxolator files
Many licenses on Linuxolator files contained small variations from the
standard FreeBSD license text.  To avoid license proliferation switch to
the standard 2-Clause FreeBSD license for those files where I have
permission from each of the listed copyright holders.

Approved by:	rdivacky, marcel
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-03-23 14:39:34 +00:00
Ed Maste
1ac2776bbb Share Linux errno table with libsysdecode
Requested by:	jhb
Reviewed by:	jhb
Sponsored by:	Turing Robotic Industries Inc.
2018-03-22 12:58:49 +00:00
Ed Maste
d595c5c0d6 linux_errno.c: add newer errno values
Also introduce a static assert to ensure the list is kept up to date.

Sponsored by:	Turing Robotic Industries Inc.
2018-03-16 14:51:47 +00:00
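One way a static assert can keep a translation table in step with the list of native errno values is sketched below. The table contents, `SKETCH_ELAST`, `NELEM`, and `sketch_translate` are all hypothetical, and the condition actually asserted in linux_errno.c may be different.

```c
#include <stddef.h>

#define	SKETCH_ELAST	4	/* hypothetical: highest native errno value */
#define	NELEM(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical translation table, indexed by native errno. */
static const int sketch_errtbl[] = {
	0,	/* 0: no error */
	10,	/* 1: made-up translated value */
	20,	/* 2: made-up translated value */
	30,	/* 3: made-up translated value */
	40,	/* 4: made-up translated value */
};

/* The build fails if a new errno is added without a table entry. */
_Static_assert(NELEM(sketch_errtbl) == SKETCH_ELAST + 1,
    "sketch_errtbl is missing entries; update it when SKETCH_ELAST grows");

/* Returns -1 (a sentinel chosen for this sketch) for out-of-range input. */
static int
sketch_translate(int error)
{
	if (error < 0 || (size_t)error >= NELEM(sketch_errtbl))
		return (-1);
	return (sketch_errtbl[error]);
}
```

Raising `SKETCH_ELAST` to 5 without adding a sixth table entry turns the omission into a compile-time error instead of an out-of-bounds read at run time.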