r176215 corrected readlink(2)'s return type and the type of the last
argument. readlink(2) was introduced in r177788 after being developed
as part of Google Summer of Code 2007; it appears to have inherited the
wrong return type.
Man pages and header files were already ssize_t; update syscalls.master
to match.
PR: 197915
Submitted by: Henning Petersen <henning.petersen@t-online.de>
MFC after: 2 weeks
the file associated with the given file descriptor.
Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567
Each control message region must be aligned on a 4-byte boundary on 32-bit
architectures. The 32-bit compat shim for recvmsg() gets the actual layout
right, but doesn't pad the payload length when computing msg_controllen for
the output message header. If a control message contains an unaligned
payload, such as the 1-byte TTL field in the example attached to PR 236737,
this can produce control message payload boundaries that extend beyond
the boundary reported by msg_controllen.
PR: 236737
Reported by: Yuval Pavel Zholkover <paulzhol@gmail.com>
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19768
For 32-bit Linuxulator, ipc() syscall was historically
the entry point for the IPC API. Starting in Linux 4.18, direct
syscalls are provided for the IPC. Enable it.
MFC after: 1 month
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.
Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
Fix some style while at it.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
MFC after: 1 week
Sponsored by: Limelight Networks
Sponsored by: Mellanox Technologies
Research Unix, 7th Edition introduced TIMEZONE and DSTFLAG
compile-time constants in sys/param.h to communicate these values for
the machine. 4.2BSD moved from the compile-time to run-time and
introduced these variables and used for localtime() to return the
right offset from UTC (sometimes referred to as GMT, for this purpose
is the same). 4.4BSD migrated to using the tzdata code/database and
these variables were basically unused.
FreeBSD removed the real need for these with adjkerntz in
1995. However, some RTC clocks continued to use these variables,
though they were largely unused otherwise. Later, phk centeralized
most of the uses in utc_offset, but left it using both tz_minuteswest
and adjkerntz.
POSIX (IEEE Std 1003.1-2017) states in the gettimeofday specification
"If tzp is not a null pointer, the behavior is unspecified" so there's
no standards reason to retain it anymore. In fact, gettimeofday has
been marked as obsolecent, meaning it could be removed from a future
release of the standard. It is the only interface defined in POSIX
that references these two values. All other references come from the
tzdata database via tzset().
These were used to more faithfully implement early unix ABIs which
have been removed from FreeBSD. NetBSD has completely eliminated
these variables years ago. Linux has migrated to tzdata as well,
though these variables technically still exist for compatibility
with unspecified older programs.
So, there's no real reason to have them these days. They are a
historical vestige that's no longer used in any meaningful way.
Reviewed By: jhb@, brooks@
Differential Revision: https://reviews.freebsd.org/D19550
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.
Reviewed by: imp (who also contributed the commit message)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19280
When porting code once written for Linux we find not only uints but also ushort and ulong.
Provide central typedefs as part of the linuxkpi for those as well.
Reviewed by: hselasky, emaste
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19405
- offsets can be negative, loff_t needs to be signed, it also simplifies
interop with the rest of the code base to use off_t than the actual linux
definition "long long"
- don't rely on the defining "file" to "linux_file" in interface definitions
as that causes heartache with includes
Reviewed by: hps@
MFC after: 1 week
Sponsored by: iX Systems
Differential Revision: https://reviews.freebsd.org/D19274
Some consumers may be loosely coupled with the lkpi.
This allows them to call linux_alloc_current without
having a static dependency.
Reviewed by: hps@
MFC after: 1 week
Sponsored by: iX Systems
Differential Revision: https://reviews.freebsd.org/D19257
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.
While at it:
- Change the gt_cpu member in the grouptask structure to be of type
int as used elswhere for specifying CPUs (an int16_t may be too
narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
argument given that it actually operates on a struct gtask rather
than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
names.
Reported by: mmel
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D19139
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
It is currently re-declared in sys/sysent.h which is a wrong place for
MD variable. Which causes redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are included both.
Remove it from sys/sysent.h and instead include machine/md_var.h when
needed, under #ifdef for both i386 and amd64.
Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
amd64 miniport drivers are allowed to use FPU which triggers "Unregistered use
of FPU in kernel" panic.
Wrap all variants of MSCALL with fpu_kern_enter/fpu_kern_leave. To reduce
amount of allocations/deallocations done via
fpu_kern_alloc_ctx/fpu_kern_free_ctx maintain cache of fpu_kern_ctx elements.
Based on the patch by Paul B Mahol
PR: 165622
Submitted by: Vlad Movchan <vladislav.movchan@gmail.com>
MFC after: 1 month
Most siginfo_to_lsiginfo callers already zeroed the l_siginfo_t before
callit it, but linux_waitid did not. Instead of zeroing in the called
function to address linux_waitid (as in commit 2e6ebe70), just do it in
linux_waitid.
admbugs: 765
Reported by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by: Andrew
MFC after: 1 day
Security: Kernel stack memory disclosure
Sponsored by: The FreeBSD Foundation
An integrity check such as a check-hash or a cross-correlation failed.
The integrity error falls between EINVAL that identifies errors in
parameters to a system call and EIO that identifies errors with the
underlying storage media. EINTEGRITY is typically raised by intermediate
kernel layers such as a filesystem or an in-kernel GEOM subsystem when
they detect inconsistencies. Uses include allowing the mount(8) command
to return a different exit value to automate the running of fsck(8)
during a system boot.
These changes make no use of the new error, they just add it. Later
commits will be made for the use of the new error number and it will
be added to additional manual pages as appropriate.
Reviewed by: gnn, dim, brueffer, imp
Discussed with: kib, cem, emaste, ed, jilles
Differential Revision: https://reviews.freebsd.org/D18765
atomic updates and reduces amount of data protected by zone lock.
During startup point these fields to EARLY_COUNTER. After startup
allocate them for all early zones.
Tested by: pho