Commit Graph

327 Commits

Author SHA1 Message Date
Edward Tomasz Napierala
2478d444d1 Fix linuxulator prlimit64(2) with pid == 0. This makes 'ulimit -a'
return something reasonable, and helps linux binaries which attempt
to close all the files, eg apt(8).

Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20692
2019-07-04 19:40:01 +00:00
Edward Tomasz Napierala
d49fb289c8 Implement PTRACE_O_TRACESYSGOOD. This makes Linux strace(1) work.
Reviewed by:	dchagin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20200
2019-05-19 12:58:44 +00:00
Dmitry Chagin
c5156c7785 Linuxulator depends on a fundamental kernel settings such as SMP. Many
of them listed in opt_global.h which is not generated while building
modules outside of a kernel and such modules never match real cofigured
kernel.

So, we should prevent our users from building obviously defective modules.

Therefore, remove the root cause of the building of modules outside of a
kernel - the possibility of building modules with DEBUG or KTR flags.
And remove all of DEBUG printfs as it is incomplete and in threaded
programms not informative, also a half of system call does not have DEBUG
printf. For debuging Linux programms we have dtrace, ktr and ktrace ability.

PR:		222861
Reviewed by:	trasz
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20178
2019-05-13 18:24:29 +00:00
Konstantin Belousov
78022527bb Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
Dmitry Chagin
5d520d7fab Follow the FreeBSD and implement PDEATH_SIG prctl ops in the Linuxulator.
It was first introduced in r163734 and missied by me in r283383.

MFC after:	1 week
2019-04-30 17:18:05 +00:00
Ed Maste
347a8ed1bf linuxulator: fix stack memory disclosure in linux_sigaltstack
Most siginfo_to_lsiginfo callers already zeroed the l_siginfo_t before
callit it, but linux_waitid did not.  Instead of zeroing in the called
function to address linux_waitid (as in commit 2e6ebe70), just do it in
linux_waitid.

admbugs:	765
Reported by:	Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by:	Andrew
MFC after:	1 day
Security:	Kernel stack memory disclosure
Sponsored by:	The FreeBSD Foundation
2019-01-21 17:12:16 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Alan Somers
6040822c4e Make timespecadd(3) and friends public
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.

Our kernel currently defines two-argument versions of timespecadd and
timespecsub.  NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions.  Solaris also defines a three-argument version, but
only in its kernel.  This revision changes our definition to match the
common three-argument version.

Bump _FreeBSD_version due to the breaking KPI change.

Discussed with:	cem, jilles, ian, bde
Differential Revision:	https://reviews.freebsd.org/D14725
2018-07-30 15:46:40 +00:00
Ed Maste
e8a1ec3e05 Split kern_break from sys_break and use it in linuxulator
Previously the linuxulator's linux_brk invoked the FreeBSD sys_break
syscall implementation directly.  Instead, move the bulk of the existing
implementation to kern_break, and call that from both sys_break and
linux_brk.

This also addresses a minor bug in linux_brk in that we now return the
actual (rounded up) break address, rather than the requested value.

Reviewed by:	brooks (earlier version)
Sponsored by:	Turing Robotic Industries
Differential Revision:	https://reviews.freebsd.org/D16019
2018-06-27 14:45:13 +00:00
Ed Maste
aec1e6d390 linuxulator: handle V3 capget/capset
Linux 2.6.26 introduced 64-bit capability sets.  Extend our stub
implementation to handle both 32- and 64-bit.  (We still report no
capabilities in capget, and disallow any in capset.)

Reviewed by:	chuck
Sponsored by:	Turing Robotic Industries Inc.
Differential Revision:	https://reviews.freebsd.org/D15887
2018-06-19 21:26:23 +00:00
Ed Maste
645f3d4345 linuxulator: add debugging for invalid capget/capset version
Sponsored by:	Turing Robotic Industries Inc.
2018-06-18 18:43:45 +00:00
Ed Maste
931e2a1a6e linuxulator: do not include legacy syscalls on arm64
Existing linuxulator platforms (i386, amd64) support legacy syscalls,
such as non-*at ones like open, but arm64 and other new platforms do
not.

Wrap these in #ifdef LINUX_LEGACY_SYSCALLS, #defined in the MD linux.h
files.  We may need finer grained control in the future but this is
sufficient for now.

Reviewed by:	andrew
Sponsored by:	Turing Robotic Industries
Differential Revision:	https://reviews.freebsd.org/D15237
2018-06-15 14:41:51 +00:00
Brooks Davis
9da5364ed9 Name the implementation of brk and sbrk sys_break().
The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestage of a need for it to be called something else (eg obreak)
was removed in r225617 which consistantly prefixed all syscall
implementations.

Reviewed by:	emaste, kib (older version)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15638
2018-06-14 21:27:25 +00:00
Ed Maste
eae594f7d5 Correct proper nouns in the Linuxulator
- Capitalize Linux
- Spell FreeBSD out in full
- Address some style(9) on changed lines

Sponsored by:	Turing Robotic Industries Inc.
2018-02-22 02:24:17 +00:00
Jeff Roberson
e958ad4cf3 Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc().  Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by:	markj
Discussed with:	kib, glebius
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14273
2018-02-12 22:53:00 +00:00
Ed Maste
132f90c660 Linuxolator whitespace cleanup
A version of each of the MD files by necessity exists for each CPU
architecture supported by the Linuxolator.  Clean these up so that new
architectures do not inherit whitespace issues.

Clean up shared Linuxolator files while here.

Sponsored by:	Turing Robotic Industries Inc.
2018-02-05 17:29:12 +00:00
Pedro F. Giffuni
7f2d13d607 sys/compat: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:13:23 +00:00
Dmitry Chagin
51d93426d9 On success, getrandom() Linux system call returns the number of bytes that
were copied to the buffer supplied by the user.

PR:           219464
Submitted by: Maciej Pasternacki
Reported by:  Maciej Pasternacki
MFC after:    1 week
2017-06-04 18:35:30 +00:00
Dmitry Chagin
e2e6a2a1b6 Revert r319053 due to lack of sence. As pointed out by kib@ opt_global.h
contains such fundamental settings as e.g. SMP option and fake
opt_global.h almost never match real configured kernels.

Reported by:	kib@
2017-06-04 18:24:41 +00:00
Dmitry Chagin
9ecc1abca3 On success, getrandom() Linux system call returns the number of bytes that
were copied to the buffer supplied by the user.

Also fix getrandom() if Linuxulator modules are built without the kernel.

PR:		219464
Submitted by:	Maciej Pasternacki
Reported by:	Maciej Pasternacki
MFC after:	1 week
2017-05-28 07:40:09 +00:00
Dmitry Chagin
faa9679e24 Use kern_mincore() helper instead of abusing syscall entry.
Suggested by:	kib@
Reviewed by:	kib@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D10143
2017-03-30 19:45:07 +00:00
Dmitry Chagin
6c2a934b79 Implement Linux mincore() system call.
This is necessary for the upcoming drm-next.

Suggested by:	hselasky@
MFC after:	1 month
2017-03-25 15:47:29 +00:00
Dmitry Chagin
b1ba0846f1 Implement getrandom() syscall.
Note. GRND_RANDOM option is not supported for now.

MFC after:	1 month
2017-03-18 18:34:29 +00:00
Dmitry Chagin
0670e972e3 Return EOVERFLOW error in case then the size of tv_sec field of struct timespec
in COMPAT_LINUX32 Linuxulator's not equal to the size of native tv_sec.

MFC after:	1 month
2017-02-26 09:40:42 +00:00
Konstantin Belousov
496ab0532d Rework r313352.
Rename kern_vm_* functions to kern_*.  Move the prototypes to
syscallsubr.h.  Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by:	alc, jhb
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9535
2017-02-13 09:04:38 +00:00
Edward Tomasz Napierala
69cdfcef2e Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.

Reviewed by:	ed, dchagin, kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9378
2017-02-06 20:57:12 +00:00
Edward Tomasz Napierala
96ee43103d Add kern_cpuset_getaffinity() and kern_cpuset_getaffinity(),
and use it in compats instead of their sys_*() counterparts.

Reviewed by:	kib, jhb, dchagin
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9383
2017-02-05 13:24:54 +00:00
Edward Tomasz Napierala
5db72ef2e4 Fix linux_getppid() to debug the actual parent, even it was reparented
by debugger.

Reviewed by:	dchagin@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9361
2017-01-31 15:22:51 +00:00
Dmitry Chagin
23e8912c60 Implement Linux personality() system call mainly due to READ_IMPLIES_EXEC flag.
In Linux if this flag is set, PROT_READ implies PROT_EXEC for mmap().
Linux/i386 set this flag automatically if the binary requires executable stack.

READ_IMPLIES_EXEC flag will be used in the next Linux mmap() commit.
2016-07-10 08:15:50 +00:00
Gleb Smirnoff
34e05ebe72 Fix kernel stack disclosures in the Linux and 4.3BSD compat layers.
Submitted by:	CTurt
Security:	SA-16:20
Security:	SA-16:21
2016-05-31 16:56:30 +00:00
Pedro F. Giffuni
1ce4275dd2 sys/compat/linux*: spelling fixes.
Mostly on comments but there are some user-visible messages as well.

MFC after: 2 weeks
2016-04-30 00:53:10 +00:00
Pedro F. Giffuni
ae26eab161 Fix indentation oops. 2016-04-03 14:40:54 +00:00
Dmitry Chagin
8bc21bafba Move Linux specific times tests up to guarantee the values are defined.
CID:		1305178
Submitted by:	pfg@
MFC after:	1 week
2016-04-03 06:33:16 +00:00
Dmitry Chagin
4525bb829f Rework r296543:
1. Limit secs to INT32_MAX / 2 to avoid errors from kern_setitimer().
   Assert that kern_setitimer() returns 0.
   Remove bogus cast of secs.
   Fix style(9) issues.

2. Increment the return value if the remaining tv_usec value more than 500000 as a Linux does.

Pointed out by: [1] Bruce Evans

MFC after:	1 week
2016-03-20 11:40:52 +00:00
Dmitry Chagin
a87488d1e4 Better english.
Submitted by:	Kevin P. Neal
MFC after:	1 week
2016-03-08 19:40:01 +00:00
Dmitry Chagin
649ca5e9dc Put a commit message from r296502 about Linux alarm() system call
behaviour to the source.

Suggested by:	emaste@

MFC after:	1 week
2016-03-08 19:20:57 +00:00
Dmitry Chagin
15c3b371e2 According to POSIX and Linux implementation the alarm() system call
is always successfull.
So, ignore any errors and return 0 as a Linux do.

XXX. Unlike POSIX, Linux in case when the invalid seconds value specified
always return 0, so in that case Linux does not return proper remining time.

MFC after:	1 week
2016-03-08 15:12:49 +00:00
Gleb Smirnoff
c8358c6e0d Call crextend() before copying old credentials to the new credentials
and replace crcopysafe by crcopy as crcopysafe is is not intended to be
safe in a threaded environment, it drops PROC_LOCK() in while() that
can lead to unexpected results, such as overwrite kernel memory.

In my POV crcopysafe() needs special attention. For now I do not see
any problems with this function, but who knows.

Submitted by:	dchagin
Found by:	trinity
Security:	SA-16:04.linux
2016-01-14 10:16:25 +00:00
Konstantin Belousov
cd0a26c53f Fix build for the KTR-enabled kernels.
Sponsored by:	The FreeBSD Foundation
2015-10-23 11:41:55 +00:00
Konstantin Belousov
b4490c6e93 The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig.  Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig.  p_xexit contains complete status
and copied out into si_status.

Requested by:	Joerg Schilling
Reviewed by:	jilles (previous version), pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-07-18 09:02:50 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
Dmitry Chagin
d9cbe8f0ef Properly check tv_nsec value. The tv_nsec field can also be one
of the special value UTIME_NOW or UTIME_OMIT.
2015-05-24 18:06:46 +00:00
Dmitry Chagin
19d8b461f4 Add utimensat() system call.
The patch developed by Jilles Tjoelker and Andrew Wilcox and
adopted for lemul branch by me.
2015-05-24 17:57:07 +00:00
Dmitry Chagin
4ab7403bbd Rework signal code to allow using it by other modules, like linprocfs:
1. Linux sigset always 64 bit on all platforms. In order to move Linux
sigset code to the linux_common module define it as 64 bit int. Move
Linux sigset manipulation routines to the MI path.

2. Move Linux signal number definitions to the MI path. In general, they
are the same on all platforms except for a few signals.

3. Map Linux RT signals to the FreeBSD RT signals and hide signal conversion
tables to avoid conversion errors.

4. Emulate Linux SIGPWR signal via FreeBSD SIGRTMIN signal which is outside
of allowed on Linux signal numbers.

PR:		197216
2015-05-24 17:47:20 +00:00
Dmitry Chagin
680982281b Do not use struct l_timespec without conversion. While here move
args->timeout handling before acquiring the futex key at FUTEX_WAIT path.

Differential Revision:	https://reviews.freebsd.org/D1520
Reviewed by:	trasz
2015-05-24 17:29:18 +00:00
Dmitry Chagin
a6b40812ec Implement ppoll() system call.
Differential Revision:	https://reviews.freebsd.org/D1105
Reviewed by:	trasz
2015-05-24 16:59:25 +00:00
Dmitry Chagin
3e89b64168 Put the correct value for the abi_nfdbits parameter of kern_select() for
all supported Linuxulators.

Differential Revision:	https://reviews.freebsd.org/D1093
Reviewed by:	trasz
2015-05-24 16:47:13 +00:00
Dmitry Chagin
2245df381a Convert Linux wait options to the FreeBSD.
Check wait options as a Linux do.
Linux always set WEXITED option not a WUNTRACED|WNOHANG
which is a strange bug.

Differential Revision:	https://reviews.freebsd.org/D1085
Reviewed by:	trasz
2015-05-24 16:28:58 +00:00
Dmitry Chagin
7a7a6efc25 Set WIFCONTINUED to the wait status if needed.
Differential Revision:	https://reviews.freebsd.org/D1083
Reviewed by:	trasz
2015-05-24 16:27:38 +00:00
Dmitry Chagin
e0d3ea8c65 Where possible we will use M_LINUX malloc(9) type.
Move M_FUTEX defines to the linux_common.ko.

Differential Revision:	https://reviews.freebsd.org/D1077
Reviewed by:	emaste
2015-05-24 16:14:41 +00:00