The approach taken by the stack gap implementation was to insert a
random gap between the top of the fixed stack mapping and the true top
of the main process stack. This approach was chosen so as to avoid
randomizing the previously fixed address of certain process metadata
stored at the top of the stack, but had some shortcomings. In
particular, mlockall(2) calls would wire the gap, bloating the process'
memory usage, and RLIMIT_STACK included the size of the gap so small
(< several MB) limits could not be used.
There is little value in storing each process' ps_strings at a fixed
location, as only very old programs hard-code this address; consumers
were converted decades ago to use a sysctl-based interface for this
purpose. Thus, this change re-implements stack address randomization by
simply breaking the convention of storing ps_strings at a fixed
location, and randomizing the location of the entire stack mapping.
This implementation is simpler and avoids the problems mentioned above,
while being unlikely to break compatibility anywhere the default ASLR
settings are used.
The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack,
and is re-enabled by default.
PR: 260303
Reviewed by: kib
Discussed with: emaste, mw
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
ASLR stack randomization will reappear in a forthcoming commit. Rather
than inserting a random gap into the stack mapping, the entire stack
mapping itself will be randomized in the same way that other mappings
are when ASLR is enabled.
No functional change intended, as the stack gap implementation is
currently disabled by default.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro. With stack address randomization the
ps_strings address is no longer fixed.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
The current ASLR stack gap feature will be removed, and with that the
need for the kern.stacktop sysctl is gone. All consumers have been
removed.
This reverts commit a97d697122da2bfb0baae5f0939d118d119dae33.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
Simplify control flow around handling of the execpath length and signal
trampoline. Cache the sysentvec pointer in a local variable.
No functional change intended.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33703
vn_fullpath() call was not converted to pass newtextvp, instead it used
imgp->vp which is still NULL there. As result vn_fullpath() always
returned EINVAL and execpath was recorded from the value of arg0.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
This allows the pmap_remove(min, max) call to see empty pmap and exploit
empty pmap optimization.
Reviewed by: markj
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32569
While doing it, also move all the code to resolve pathnames and obtain
text vp and dvp, into single place. Besides simplifying the code, it
avoids spurious vnode relocks and validates the explanation why
a transient text reference on the script vnode is not harmful.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
For this, use vn_fullpath_hardlink() to resolve executable name for
execve(2).
This should provide the right hardlink name, used for execution, instead
of random hardlink pointing to this binary. Also this should make the
AT_EXECNAME reliable for execve(2), since kernel only needs to resolve
parent directory path, which should always succeed (except pathological
cases like unlinking a directory).
PR: 248184
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
Also re-align comments, and group booleans and char members together.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
On kernels compiled with 'options TSLOG', record for each process ID:
* The timestamp of the fork() which creates it and the parent
process ID,
* The first path passed to execve(), if any,
* The first path resolved by namei, if any, and
* The timestamp of the exit() which terminates the process.
Expose this information via a new sysctl, debug.tslog_user.
On kernels lacking 'options TSLOG' (the default), no information is
recorded and the sysctl does not exist.
Note that recording namei is needed in order to obtain the names of
rc.d scripts being launched, as the rc system sources them in a
subshell rather than execing the scripts.
With this commit it is now possible to generate flamecharts of the
entire boot process from the start of the loader to the end of
/etc/rc. The code needed to perform this processing is currently
found in github: https://github.com/cperciva/freebsd-boot-profiling
Reviewed by: mhorne
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D32493
With stack gap enabled top of the stack is moved down by a random
amount of bytes. Because of that some multithreaded applications
which use kern.usrstack sysctl to calculate address of stacks for
their threads can fail. Add kern.stacktop sysctl, which can be used
to retrieve address of the stack after stack gap is applied to it.
Returns value identical to kern.usrstack for processes which have
no stack gap.
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31897
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.
Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.
PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516
This can be disabled by sysctl kern.core_dump_can_intr
Reported and tested by: pho
Reviewed by: imp, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32313
A core segment is bounded in size only by memory size. On 64-bit
architectures this means a segment can be much larger than 4GB.
However, compress_chunk() takes only a u_int, clamping segment size to
4GB-1, resulting in a truncated core. Everything else, including the
compressor internally, uses size_t, so use size_t at the boundary here.
This dates back to the original refactor back in 2015 (r279801 /
aa14e9b7).
MFC after: 1 week
Sponsored by: Juniper Networks, Inc.
Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2).
Reviewed by: imp, markj
Sponsored by: DARPA, AFRL (original work)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19830
To prevent umtx.h polluting by future changes split it on two headers:
umtx.h - ABI header for userspace;
umtxvar.h - the kernel staff.
While here fix umtx_key_match style.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31248
MFC after: 2 weeks
Temporary add stubs to the Linux emulation layer which calls the existing hook.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D30911
MFC after: 2 weeks
For future use in the Linux emulation layer call sv_onexec hook right after
the new process address space is created. It's safe, as sv_onexec used only
by Linux abi and linux_on_exec() does not depend on a state of process VA.
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D30899
MFC after: 2 weeks
Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle
robust mutexes, for native FreeBSD ABI. Similarly, there is no sense
in calling sigfastblock_clear() for non-native ABIs.
Requested by: dchagin
Reviewed by: dchagin, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure.
The old vmspace for the process is still intact during the call.
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared. The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they not set.
The main purpose of the flag is implementation of Linux
PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged
chroot.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30939
To allow the ABI to make a dicision based on the Brandinfo add a link
to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits
of Elf_Brandinfo flags is private to the ABI.
Note to MFC: it breaks KBI.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D30918
MFC after: 2 weeks
This makes it possible to use core_write(), core_output(),
and sbuf_drain_core_output(), in Linux coredump code. Moving
them out of imgact_elf.c is necessary because of the weird way
it's being built.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30369
We cache mapped execve argument buffers to avoid the overhead of TLB
shootdowns. Mark them invalid when they are freed to the cache.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29460
After old vmspace is destroyed during execve(2), but before the new space
is fully constructed, an error during image activation cannot be returned
because there is no executing program to receive it.
In the relatively common case of failure to map stack, print some hints
on the control terminal. Note that user has enough knobs to cause stack
mapping error, and this is the most common reason for execve(2) aborting
the process.
Requested by: jhb
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().
PR: 239873
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Exec and exit are same as corresponding eventhandler hooks.
Thread exit hook is called somewhat earlier, while thread is still
owned by the process and enough context is available. Note that the
process lock is owned when the hook is called.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27309
While there, do some minor cleanup for kclocks. They are only
registered from kern_time.c, make registration function static.
Remove event hooks, they are not used by both registered kclocks.
Add some consts.
Perhaps we can stop registering kclocks at all and statically
initialize them.
Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27305
There is no point in dynamic registration, umtx hook is there always.
Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27303
No functional change intended.
Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).
__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.
Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037
This is mostly mechanical except for vmspace_exit(). There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace. In that case, upon switching to
vmspace0 we can unconditionally release the reference.
Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.
Reviewed by: alc, kib, mmel
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27057
r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.
Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.
PR: 249179
Reported by: Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by: csjp, kib
MFC after: 1 week
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26922
and current address space is already destroyed, so kern_execve()
terminates the process.
While there, clean up some internals of post_execve() inlined in init_main.
Reported by: Peter <pmc@citylink.dinoex.sub.org>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26525
It keeps recalculated way more often than it is needed.
Provide a routine (fdlastfile) to get it if necessary.
Consumers may be better off with a bitmap iterator instead.