Address Space Layout Randomization (ASLR) is an exploit mitigation
technique implemented in the majority of modern operating systems.
It involves randomly positioning the base address of an executable
and the position of libraries, heap, and stack, in a process's address
space. Although ASLR has proven over the years not to guarantee full OS
security on its own, it can make exploitation more difficult.
Tests on the tier 1 64-bit architectures demonstrated that ASLR is
stable and does not result in noticeable performance degradation, so
it should be safe to enable this mechanism by default.
Moreover, its effectiveness is increased for PIE (Position Independent
Executable) binaries. Thanks to commit 9a227a2fd642 ("Enable PIE by
default on 64-bit architectures"), building from src is not necessary
to have PIE binaries. Setting the appropriate sysctls is enough to
control the use of ASLR in the OS.
This patch toggles the kernel settings to use address map randomization
for PIE & non-PIE 64-bit binaries. It also disables SBRK, in order
to allow utilization of the bss grow region for mappings. The latter
has no effect if ASLR is disabled, so apply it to all architectures.
As for the drawbacks, ASLR causes more significant VM fragmentation,
so issues may be encountered on systems with a limited address space
in high memory consumption cases, such as buildworld. As a result,
although the tests on 32-bit architectures with ASLR enabled were
mostly on par with what was observed on 64-bit ones, the defaults for
the former are not changed at this time. Also, for the sake of safety,
keep the feature disabled for 32-bit executables on 64-bit machines.
The committed change affects the overall OS operation, so the
following should be taken into consideration:
* Address space fragmentation.
* A changed ABI due to modified layout of address space.
* More complicated debugging due to:
  * Non-reproducible address space layout between runs.
  * Some debuggers automatically disable ASLR for spawned processes,
    making target's environment different between debug and
    non-debug runs.
To confirm or rule out the dependency of any encountered issue on
ASLR, it is strongly advised to re-run the test with the feature
disabled. This can be done by setting the following sysctls
in the /etc/sysctl.conf file:
kern.elf64.aslr.enable=0
kern.elf64.aslr.pie_enable=0
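When comparing runs with the feature enabled and disabled, it can help
to print where a few regions of the process ended up. The following is
a minimal sketch in portable C, not part of this change; the names
layout_sample and layout_sample_take() are hypothetical. With
randomization active the printed values are expected to differ between
runs of a PIE binary, and to stay the same when it is disabled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record of where three regions of the process landed. */
struct layout_sample {
	uintptr_t text;		/* address of a function in the executable */
	uintptr_t stack;	/* address of a local variable */
	uintptr_t heap;		/* address returned by malloc() */
};

/* Fill *s with sample addresses; returns 0 on success, -1 on failure. */
int
layout_sample_take(struct layout_sample *s)
{
	int local = 0;
	void *p;

	assert(s != NULL);
	p = malloc(16);
	if (p == NULL)
		return (-1);
	s->text = (uintptr_t)&layout_sample_take;
	s->stack = (uintptr_t)&local;
	s->heap = (uintptr_t)p;
	free(p);
	return (0);
}

/* Print the sample so that two runs can be compared by eye or by diff. */
void
layout_sample_print(const struct layout_sample *s)
{
	printf("text  %#jx\nstack %#jx\nheap  %#jx\n",
	    (uintmax_t)s->text, (uintmax_t)s->stack, (uintmax_t)s->heap);
}
```

Calling layout_sample_take() followed by layout_sample_print() from
main() and running the program several times is enough to see the
effect of the two sysctls.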
Co-developed by: Dawid Gorecki <dgr@semihalf.com>
Reviewed by: emaste, kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D27666
Some upcoming changes will modify software checksum routines like
in_cksum() to operate using m_apply(), which uses the direct map to
access packet data for unmapped mbufs. This approach of course does not
work on platforms without a direct map, so we have to disallow the use
of unmapped mbufs on such platforms.
I believe this is the right tradeoff: we only configure KTLS on amd64
and arm64 today (and one KTLS consumer, NFS TLS, requires a direct map
already), and the use of unmapped mbufs with plain sendfile is a recent
optimization. If need be, m_apply() could be modified to create
CPU-private mappings of extpg mbuf pages as a fallback.
So, change mb_use_ext_pgs to be hard-wired to zero on systems without a
direct map. Note that PMAP_HAS_DMAP is not a compile-time constant on
some systems, so the default value of mb_use_ext_pgs has to be
determined during boot.
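The boot-time default can be pictured with the userspace sketch below.
It is an illustration only: ext_pgs_init(), ext_pgs_set() and the
static variables are hypothetical stand-ins for PMAP_HAS_DMAP and
mb_use_ext_pgs, and it assumes that the default is "enabled" when a
direct map exists and that an attempt to enable the feature without one
is rejected with EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for PMAP_HAS_DMAP where it is only known at run time. */
static bool has_direct_map;

/* Stand-in for the mb_use_ext_pgs setting. */
static bool use_ext_pgs;

/* Boot-time initialization: the default follows the direct map. */
void
ext_pgs_init(bool dmap_available)
{
	has_direct_map = dmap_available;
	use_ext_pgs = dmap_available;
}

/* Hypothetical setter: refuse to enable without a direct map. */
int
ext_pgs_set(bool enable)
{
	if (enable && !has_direct_map)
		return (EINVAL);
	use_ext_pgs = enable;
	return (0);
}

bool
ext_pgs_enabled(void)
{
	return (use_ext_pgs);
}
```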
Reviewed by: jhb
Discussed with: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32940
When doing an initial mount(8) with its -f (force) flag, the MNT_FORCE
flag is not passed through to the underlying filesystem mount routine.
MNT_FORCE is only passed through on later updates to an existing
mount. With this commit the MNT_FORCE flag is now passed through on the
initial mount.
Sanity check: kib
Sponsored by: Netflix
This is how most SYSINITs are defined. Also annotate the dummy
parameter with __unused. No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Rename to match the naming of syscalls and allow 32 to be appended
without making an ugly name like kevent_freebsd1132.
While here, make the kevent changelist argument const.
Reviewed by: kib
This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks. In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().
Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.
PR: 258208
Reviewed by: avg, kib, sef
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32931
- Return an errno value upon failure, instead of 1.
- Provide a bus_translate_resource() wrapper.
- Implement the generic version, which traverses the hierarchy until a
bus driver with a non-trivial implementation is found, in subr_bus.c
like other similar default implementations.
- Make ofw_pcib_translate_resource() return an error if a matching PCI
address range is not found.
- Make generic_pcie_translate_resource_common() return an int instead of
a bool. Fix up callers.
No functional change intended.
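The hierarchy traversal and the errno-style return can be sketched in
userspace as follows. This is not the subr_bus.c code: struct dev,
generic_translate() and window_translate() are hypothetical names, the
identity translation at the root is an assumption, and ENOENT is chosen
only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct dev;
typedef int translate_fn(struct dev *, uint64_t, uint64_t *);

struct dev {
	struct dev *parent;
	translate_fn *translate;	/* NULL: no bus-specific method */
	uint64_t base;			/* start of the bus window */
	uint64_t size;			/* length of the bus window */
	uint64_t offset;		/* where the window maps to */
};

/*
 * Walk up the hierarchy until a device with a non-trivial method is
 * found.  Returns 0 or an errno value, never a bare 1.
 */
int
generic_translate(struct dev *d, uint64_t start, uint64_t *newstart)
{
	for (; d != NULL; d = d->parent) {
		if (d->translate != NULL)
			return (d->translate(d, start, newstart));
	}
	*newstart = start;	/* assumption: identity at the root */
	return (0);
}

/* Example bus method: fail if no matching address range is found. */
int
window_translate(struct dev *d, uint64_t start, uint64_t *newstart)
{
	if (start < d->base || start - d->base >= d->size)
		return (ENOENT);
	*newstart = start - d->base + d->offset;
	return (0);
}
```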
Reviewed by: imp, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32855
We do not require the devvp vnode to be locked for metadata I/O. The
lock is typically not needed, since the correctness of the file system
using the corresponding block device ensures that there are no
incorrect or racy manipulations.
But right now the DEBUG_VFS_LOCKS option excludes both character device
vnodes and completely destroyed (VBAD) vnodes from asserts. This is not
too bad since WITNESS still ensures that we do not leak locks. On the
other hand, the asserts do not mean what a reader would expect, and
reliance on them being enforced might result in wrong code.
Note that ASSERT_VOP_LOCKED() still silently accepts NULLVP; I think
this is worth fixing as well, in the next round.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D32761
For devfs vnodes, it is fine to not lock the vnode for VOP_FSYNC().
Otherwise the vnode must be locked exclusively, except for
MNT_SHARED_WRITES() mounts, where a shared lock is enough.
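The rule reads as a small predicate. The sketch below is a userspace
model, not the kernel assertion itself; struct vnode_info,
enum lock_state and fsync_lock_ok() are hypothetical names introduced
only to spell out the three cases.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_state { LOCK_NONE, LOCK_SHARED, LOCK_EXCLUSIVE };

/* Hypothetical summary of the facts the check depends on. */
struct vnode_info {
	bool is_devfs;			/* vnode belongs to devfs */
	bool mnt_shared_writes;		/* mount allows shared-lock writes */
	enum lock_state lock;		/* lock held by the caller */
};

/* Return true if the caller's lock state is acceptable for fsync. */
bool
fsync_lock_ok(const struct vnode_info *vp)
{
	if (vp->is_devfs)
		return (true);		/* no lock required */
	if (vp->lock == LOCK_EXCLUSIVE)
		return (true);		/* always sufficient */
	/* A shared lock is enough only with shared writes. */
	return (vp->lock == LOCK_SHARED && vp->mnt_shared_writes);
}
```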
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761
We were not including the requested starting offset in the page offset.
Reviewed by: jhb
Fixes: 3c7a01d773ac ("Extend m_apply() to support unmapped mbufs.")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32922
for compatibility with Linux.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32901
Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.
Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.
The sorele() macro now uses refcount_release_if_not_last() to try to
drop the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().
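The fast path can be modelled in userspace with C11 atomics. This is a
sketch under stated assumptions rather than the uipc_socket.c code:
struct sock, release_if_not_last(), sock_release() and
sock_release_locked() are hypothetical names, and the socket lock is
modelled by a flag and a counter so that the number of lock
acquisitions is observable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sock {
	atomic_uint refs;
	bool locked;			/* models the socket lock */
	unsigned int lock_acquisitions;	/* how often it was taken */
	bool freed;
};

void
sock_init(struct sock *s, unsigned int refs)
{
	atomic_init(&s->refs, refs);
	s->locked = false;
	s->lock_acquisitions = 0;
	s->freed = false;
}

/* Drop one reference unless it is the last; true if it was dropped. */
bool
release_if_not_last(atomic_uint *count)
{
	unsigned int old = atomic_load(count);

	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return (true);
		/* The failed exchange reloaded old; try again. */
	}
	return (false);
}

/* Slow path: the caller holds the lock; it is released on return. */
void
sock_release_locked(struct sock *s)
{
	assert(s->locked);
	if (atomic_fetch_sub(&s->refs, 1) == 1)
		s->freed = true;	/* last reference: tear down */
	s->locked = false;
}

/* Fast path first; take the lock only for a possible last reference. */
void
sock_release(struct sock *s)
{
	if (release_if_not_last(&s->refs))
		return;
	s->locked = true;
	s->lock_acquisitions++;
	sock_release_locked(s);
}
```

With three references, the first two calls to sock_release() never
touch the lock; only the final one does.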
Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741
Some syscalls checked for invalid AT_* flags in sys_* and others in
kern_*.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D32864
When the NFSv4.2 server does a VOP_ALLOCATE(), it needs
the operation to be done for the RPC's credential and not
td_ucred. It also needs the writing to be done synchronously.
This patch adds "ioflag" and "cred" arguments to VOP_ALLOCATE()
and modifies vop_stdallocate() to use these arguments.
The VOP_ALLOCATE.9 man page will be patched separately.
Reviewed by: khng, kib
Differential Revision: https://reviews.freebsd.org/D32865
sched_throw() can no longer take a NULL thread; APs enter through
sched_ap_entry() instead. This completely removes branching in the
common case and cleans up both paths. No functional change intended.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D32829
This define will be used by upcoming TLS RX hardware offload patches.
No functional change intended.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: NVIDIA Networking
Normally setting kern.ipc.maxsockets returns EINVAL if the new value
is not greater than the previous value. This can cause spurious
error messages when sysctl.conf is processed multiple times, or when
automation systems try to ensure the sysctl is set to the correct
value. If the value is unchanged, then just do nothing.
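The resulting behaviour can be sketched as below. This is a userspace
model with hypothetical names (maxsockets_set(), maxsockets_get()); the
real sysctl handler may apply further limits that are not shown here.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the current kern.ipc.maxsockets value. */
static int maxsockets = 1024;

int
maxsockets_get(void)
{
	return (maxsockets);
}

/* Returns 0 on success or when nothing changes, EINVAL on a decrease. */
int
maxsockets_set(int newval)
{
	if (newval == maxsockets)
		return (0);		/* unchanged: do nothing */
	if (newval < maxsockets)
		return (EINVAL);	/* the limit may only grow */
	maxsockets = newval;
	return (0);
}
```

Processing the same sysctl.conf line twice therefore succeeds both
times, while an attempt to lower the limit is still rejected.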
PR: 243532
Reviewed by: markj
MFC after: 3 days
Sponsored by: Modirum MDPay
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D32775
Change the 'period' argument to 'duration' and change its type to
sbintime_t so we can more easily express different durations.
Reviewed by: tsoome, glebius
Differential Revision: https://reviews.freebsd.org/D32619
schedinit_ap() sets up an AP for a later call to sched_throw(NULL).
Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this results in a window where curthread is
set up in platforms' init_secondary(), but it has the wrong td_lock.
A typical platform AP startup procedure looks something like:
- Set up curthread
- ... other stuff, including cpu_initclocks_ap()
- Signal smp_started
- sched_throw(NULL) to enter the scheduler
cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because
of the noted violated assumption leading to locking heartburn in
sched_setpreempt().
Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier as part of setting up curthread.
Reviewed by: alfredo, kib, markj
Triage help from: andrew, markj
Smoke-tested by: alfredo (ppc), kevans (arm64, x86), mhorne (arm)
Differential Revision: https://reviews.freebsd.org/D32797
When working on the ports these functions were slightly different, but
now there's no reason for them to be separate.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
for the strange case where the queried process does not have text.
Reported by: Michael Butler <imb@protected-networks.net>
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
The vn_fullpath() call was not converted to pass newtextvp; instead it
used imgp->vp, which is still NULL there. As a result, vn_fullpath()
always returned EINVAL and execpath was recorded from the value of arg0.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
The existing logic didn't take into account newly inserted mappings
wholly contained by an existing region (or vice versa), nor did it
account for weird overlap scenarios. The latter is probably unlikely
to happen, but the former may happen in UEFI: BootServicesData allocated
within a large chunk of ConventionalMemory. This situation blows up vm
initialization.
While we're here, remove the "exact match" logic as it's likely wrong;
if an exact match exists with conflicting flags, for instance, then we
should probably be doing something else. The new logic takes into
account exact matches as part of the overlapping efforts.
Reviewed by: kib, mhorne (both earlier version)
Differential Revision: https://reviews.freebsd.org/D32701
In 6e66030c4c0, an additional ptracestop was added in order
to implement PTRACE_EVENT_EXEC. Make it only apply to cases
where the debugger is a Linux process; native FreeBSD
debuggers can trace Linux processes too, but they don't
expect that additional ptracestop.
Fixes: 6e66030c4c0
Reported By: kib
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32726
This change is a slight performance optimization for systems with a slow
64-bit division.
The th->th_scale and th->th_large_delta values only depend on the
timecounter frequency and the th->th_adjustment. The timecounter
frequency of a timehand only changes when a new timecounter is activated
for the timehand. The th->th_adjustment is only changed by the NTP
second update. The NTP second update is not done for every call of
tc_windup().
Move the code block to recalculate the scaling factor and
the large delta of a timehand to the new helper function
recalculate_scaling_factor_and_large_delta().
Call recalculate_scaling_factor_and_large_delta() when a new timecounter
is activated and when an NTP second update occurs.
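The idea of recomputing only when an input changes can be sketched in
userspace as follows. The names th_sketch, th_recalculate() and
th_windup() are hypothetical, and the arithmetic is only loosely
modelled on the timecounter code; the constants in it should be treated
as illustrative assumptions, not as the kernel's exact formula.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

struct th_sketch {
	uint64_t frequency;	/* timecounter frequency in Hz */
	int64_t adjustment;	/* NTP adjustment */
	uint64_t scale;		/* cached scaling factor */
	uint64_t large_delta;	/* cached large delta */
	unsigned int recalcs;	/* how often the division was done */
};

/* The expensive part: contains the 64-bit divisions. */
void
th_recalculate(struct th_sketch *th)
{
	uint64_t scale, delta;

	assert(th->frequency != 0);
	scale = (uint64_t)1 << 63;
	scale += (uint64_t)((th->adjustment / 1024) * 2199);
	scale /= th->frequency;
	th->scale = scale * 2;
	delta = (scale != 0) ? ((uint64_t)1 << 63) / scale : UINT_MAX;
	th->large_delta = (delta < UINT_MAX) ? delta : UINT_MAX;
	th->recalcs++;
}

/* Called on every tick; recalculates only when an input changed. */
void
th_windup(struct th_sketch *th, uint64_t frequency, int64_t adjustment)
{
	bool changed;

	changed = frequency != th->frequency ||
	    adjustment != th->adjustment;
	th->frequency = frequency;
	th->adjustment = adjustment;
	if (changed)
		th_recalculate(th);
}
```

Repeated th_windup() calls with the same frequency and adjustment leave
the cached values untouched, so the divisions are skipped.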
MFC after: 1 week
This allows the pmap_remove(min, max) call to see an empty pmap and
exploit the empty pmap optimization.
Reviewed by: markj
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32569
While doing it, also move all the code to resolve pathnames and obtain
text vp and dvp into a single place. Besides simplifying the code, it
avoids spurious vnode relocks and validates the explanation of why
a transient text reference on the script vnode is not harmful.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
For this, use vn_fullpath_hardlink() to resolve executable name for
execve(2).
This should provide the right hardlink name, used for execution, instead
of a random hardlink pointing to this binary. Also this should make
AT_EXECNAME reliable for execve(2), since the kernel only needs to
resolve the parent directory path, which should always succeed (except
in pathological cases like unlinking a directory).
PR: 248184
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
Also re-align comments, and group booleans and char members together.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611