Linux requires that all IOCTL data resides in userspace. FreeBSD
always moves the main IOCTL structure into a kernel buffer before
invoking the IOCTL handler and then copies it back into userspace,
before returning. Hide this difference in the "linux_copyin()" and
"linux_copyout()" functions by remapping userspace addresses in the
range from 0x10000 to 0x20000, to the kernel IOCTL data buffer.
It is assumed that the userspace code, data and stack segments starts
no lower than memory address 0x400000, which is also stated by "man 1
ld", which means any valid userspace pointer can be passed to regular
LinuxKPI handled IOCTLs.
Bump the FreeBSD version to force recompilation of all kernel modules.
Discussed with: kmacy @
MFC after: 1 week
Sponsored by: Mellanox Technologies
"current" inside all LinuxKPI file operation callbacks. The "current"
is frequently used for various debug prints, printing the thread name
and thread ID for example.
Obtained from: kmacy @
MFC after: 1 week
Sponsored by: Mellanox Technologies
Ensure the actual poll result is returned by the "linux_file_poll()"
function instead of zero which means no data is available.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The "len" parameter is uint32_t, indexing it with an int may
end up in a signed integer overflow.
strlen(3) returns an integer of size_t so the corresponding index should
have that size.
MFC after: 1 week
This is a minor follow-up to r297422, prompted by a Coverity warning. (It's
not a real defect, just a code smell.) OSD slot array reservations are an
array of pointers (void **) but were cast to void* and back unnecessarily.
Keep the correct type from reservation to use.
osd.9 is updated to match, along with a few trivial igor fixes.
Reported by: Coverity
CID: 1353811
Sponsored by: EMC / Isilon Storage Division
undefined symbol svr4_delete_socket which was moved from streams to the svr4 module
in r160558 that created a two-way dependency between them.
PR: 208464
Submitted by: Kristoffer Eriksson
Reported by: Kristoffer Eriksson
MFC after: 2 week
We're currently seeing how hard it would be to run CloudABI binaries on
operating systems cannot be modified easily (Windows, Mac OS X). The
idea is that we want to just run them without any sandboxing. Now
that CloudABI executables are PIE, this is already a bit easier, but TLS
is still problematic:
- CloudABI executables want to write to the %fs, which typically
requires extra system calls by the emulator every time it needs to
switch between CloudABI's and its own TLS.
- If CloudABI executables overwrite the %fs base unconditionally, it
also becomes harder for the emulator to store a backup of the old
value of %fs. To solve this, let's no longer overwrite %fs, but just
%fs:0.
As CloudABI's C library does not use a TCB, this space can now be used
by an emulator to keep track of its internal state. The executable can
now safely overwrite %fs:0, as long as it makes sure that the TCB is
copied over to the new TLS area.
Ensure that there is an initial TLS area set up when the process starts,
only containing a bogus TCB. We don't really care about its contents on
FreeBSD.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5836
This is kinda critical to the performance when the CPU is slow and
network bandwidth is high, e.g. in the hypervisor.
Reviewed by: rrs, gallatin, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5765
- Set BI_CAN_EXEC_DYN, so we can execute ET_DYN ELF files in addition to
regular ET_EXECs.
- Provide an AT_BASE entry in the auxiliary vector, so the executable
knows at which address it got loaded and can apply relocations.
Some time ago I made a change to merge together the memory scope
definitions used by mmap (MAP_{PRIVATE,SHARED}) and lock objects
(PTHREAD_PROCESS_{PRIVATE,SHARED}). Though that sounded pretty smart
back then, it's backfiring. In the case of mmap it's used with other
flags in a bitmask, but for locking it's an enumeration. As our plan is
to automatically generate bindings for other languages, that looks a bit
sloppy.
Change all of the locking functions to use separate flags instead.
Obtained from: https://github.com/NuxiNL/cloudabi
data (headers). Historically the size of the headers was not checked
against the socket buffer space. Application could easily overcommit the
socket buffer space.
With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space. In case when size of headers is bigger that socket space,
KASSERT fires. Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.
o With this change, the headers copyin is moved down into the cycle, after
the sbspace() check. The uio size is trimmed by socket space there,
which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
up the stack to syscall wrappers. This required a copy and paste of the
code, but in turn this allowed to remove extra stack carried parameter
from fo_sendfile_t, and embrace entire compat code into #ifdef. If in
future we got more fo_sendfile_t function, the copy and paste level would
even reduce.
Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by: Vitalij Satanivskij <satan ukr.net>
Sponsored by: Netflix
The type definitions and constants that were used by COMPAT_CLOUDABI64
are a literal copy of some headers stored inside of CloudABI's C
library, cloudlibc. What is annoying is that we can't make use of
cloudlibc's system call list, as the format is completely different and
doesn't provide enough information. It had to be synced in manually.
We recently decided to solve this (and some other problems) by moving
the ABI definitions into a separate file:
https://github.com/NuxiNL/cloudabi/blob/master/cloudabi.txt
This file is processed by a pile of Python scripts to generate the
header files like before, documentation (markdown), but in our case more
importantly: a FreeBSD system call table.
This change discards the old files in sys/contrib/cloudabi and replaces
them by the latest copies, which requires some minor changes here and
there. Because cloudabi.txt also enforces consistent names of the system
call arguments, we have to patch up a small number of system call
implementations to use the new argument names.
The new header files can also be included directly in FreeBSD kernel
space without needing any includes/defines, so we can now remove
cloudabi_syscalldefs.h and cloudabi64_syscalldefs.h. Patch up the
sources to include the definitions directly from sys/contrib/cloudabi
instead.
1. Limit secs to INT32_MAX / 2 to avoid errors from kern_setitimer().
Assert that kern_setitimer() returns 0.
Remove bogus cast of secs.
Fix style(9) issues.
2. Increment the return value if the remaining tv_usec value more than 500000 as a Linux does.
Pointed out by: [1] Bruce Evans
MFC after: 1 week
On some architectures, u_long isn't large enough for resource definitions.
Particularly, powerpc and arm allow 36-bit (or larger) physical addresses, but
type `long' is only 32-bit. This extends rman's resources to uintmax_t. With
this change, any resource can feasibly be placed anywhere in physical memory
(within the constraints of the driver).
Why uintmax_t and not something machine dependent, or uint64_t? Though it's
possible for uintmax_t to grow, it's highly unlikely it will become 128-bit on
32-bit architectures. 64-bit architectures should have plenty of RAM to absorb
the increase on resource sizes if and when this occurs, and the number of
resources on memory-constrained systems should be sufficiently small as to not
pose a drastic overhead. That being said, uintmax_t was chosen for source
clarity. If it's specified as uint64_t, all printf()-like calls would either
need casts to uintmax_t, or be littered with PRI*64 macros. Casts to uintmax_t
aren't horrible, but it would also bake into the API for
resource_list_print_type() either a hidden assumption that entries get cast to
uintmax_t for printing, or these calls would need the PRI*64 macros. Since
source code is meant to be read more often than written, I chose the clearest
path of simply using uintmax_t.
Tested on a PowerPC p5020-based board, which places all device resources in
0xfxxxxxxxx, and has 8GB RAM.
Regression tested on qemu-system-i386
Regression tested on qemu-system-mips (malta profile)
Tested PAE and devinfo on virtualbox (live CD)
Special thanks to bz for his testing on ARM.
Reviewed By: bz, jhb (previous)
Relnotes: Yes
Sponsored by: Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D4544
- Mark AIO system calls as STD and remove the helpers to dynamically
register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
the pathconf() system call. fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5589
is always successfull.
So, ignore any errors and return 0 as a Linux do.
XXX. Unlike POSIX, Linux in case when the invalid seconds value specified
always return 0, so in that case Linux does not return proper remining time.
MFC after: 1 week
- Set td_errno so that ktrace and dtrace can obtain the syscall error
number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
not intended to be returned to userland.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425
- Make the system call fail if prot contains bits other than read, write
and exec.
- Similar to OpenBSD's W^X, don't allow write and exec to be set at the
same time. I'd like to see for now what happens if we enforce this
policy unconditionally. If it turns out that this is far too strict,
we'll loosen this requirement.
need to include it explicitly when <vm/vm_param.h> is already included.
Suggested by: alc
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D5379
fork1 required its callers to pass a pointer to struct proc * which would
be set to the new process (if any). procdesc and racct manipulation also
used said pointer.
However, the process could have exited prior to do_fork return and be
automatically reaped, thus making this a use-after-free.
Fix the problem by letting callers indicate whether they want the pid or
the struct proc, return the process in stopped state for the latter case.
Reviewed by: kib
- Add some new hlist macros.
- Update existing hlist macros removing the need for a temporary
iteration variable.
- Properly define the RCU hlist macros to be SMP safe with regard
to RCU.
- Safe list macro arguments by adding a pair of parentheses.
- Prefix the _list_add() and _list_splice() functions with "linux"
to reflect they are LinuxKPI internal functions.
Obtained from: Linux
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Fix implementation of atomic_add_unless(). The atomic_cmpset_int()
function returns a boolean and not the previous value of the atomic
variable.
- The atomic counters should be signed according to Linux.
- Some minor cosmetics and styling while at it.
Reviewed by: alfred @
MFC after: 1 week
Sponsored by: Mellanox Technologies
idr_alloc_cyclic() in the LinuxKPI. Bump the FreeBSD version to
force recompilation of all KLDs due to IDR structure size change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
If a driver's Linux mmap callback passed vm_page_prot through unchanged,
then linux_dev_mmap_single() would try to apply whatever VM_MEMATTR_xxx
value 0 is to the mapping. On x86, VM_MEMATTR_DEFAULT is the PAT value
for write-back (WB) which is 6, while 0 maps to the PAT value for
uncacheable (UC). Thus, any mmap request that did not explicitly set
page_prot was tried to map memory as UC triggering the warning in
sg_pager_getpages().
Tested by: np
Reported by: Krishnamraju Eraparaju @ Chelsio
MFC after: 3 days
Sponsored by: Chelsio Communications
LinuxKPI. Fix a few spaces to tabs. Bump the FreeBSD version to force
recompilation of existing KMODs.
MFC after: 1 week
Sponsored by: Mellanox Technologies
and replace crcopysafe by crcopy as crcopysafe is is not intended to be
safe in a threaded environment, it drops PROC_LOCK() in while() that
can lead to unexpected results, such as overwrite kernel memory.
In my POV crcopysafe() needs special attention. For now I do not see
any problems with this function, but who knows.
Submitted by: dchagin
Found by: trinity
Security: SA-16:04.linux
The set_robust_list system call request the kernel to record the head
of the list of robust futexes owned by the calling thread. The head
argument is the list head to record.
The get_robust_list system call should return the head of the robust
list of the thread whose thread id is specified in pid argument.
The list head should be stored in the location pointed to by head
argument.
In contrast, our implemenattion of get_robust_list system call copies
the known portion of memory pointed by recorded in set_robust_list
system call pointer to the head of the robust list to the location
pointed by head argument.
So, it is possible for a local attacker to read portions of kernel
memory, which may result in a privilege escalation.
Submitted by: mjg
Security: SA-16:03.linux
- Properly prefix internal functions with "linux_" instead of only a
single underscore to avoid future namespace collisions.
- Make some functions global instead of inline to ease debugging and
to avoid unnecessary code duplication.
- Remove no longer existing kthread_create() function's prototype.
MFC after: 1 week
Sponsored by: Mellanox Technologies