int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);
which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.
- Add manual pages for the new syscalls.
- Make the new syscalls available for processes in capability mode sandbox.
- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.
- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.
- Update procstat(1) to recognize the new capability rights.
- Document the new capability rights in cap_rights_limit(2).
Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
FreeBSD TCP-level socket options (only the first two are). Instead,
using a mapping function and fail unsupported options as we do for other
socket option levels.
MFC after: 2 weeks
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.
Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.
The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.
Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.
PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month
was still possible to open for write from the lower filesystem. There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.
Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.
Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.
Tested by: pho
MFC after: 3 weeks
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.
Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.
Tested by: pho (previous version)
MFC after: 2 weeks
drivers:
- Remove scsi_low_pisa.*, they were unused.
- Remove <compat/netbsd/physio_proc.h> and calls to the stubs in that
header. They were empty nops.
- Retire sl_xname and use device_get_nameunit() and device_printf() with
the underlying device_t instead.
- Remove unused {ct,ncv,nsp,stg}print() functions.
- Remove empty SOFT_INTR_REQUIRED() macro and the unused sl_irq member.
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().
Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.
The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).
Tested by: pho
No objections from: jhb
MFC after: 3 weeks
- DTrace scripts to check for errors, performance, ...
they serve mostly as examples of what you can do with the static probe;s
with moderate load the scripts may be overwhelmed, excessive lock-tracing
may influence program behavior (see the last design decission)
Design decissions:
- use "linuxulator" as the provider for the native bitsize; add the
bitsize for the non-native emulation (e.g. "linuxuator32" on amd64)
- Add probes only for locks which are acquired in one function and released
in another function. Locks which are aquired and released in the same
function should be easy to pair in the code, inter-function
locking is more easy to verify in DTrace.
- Probes for locks should be fired after locking and before releasing to
prevent races (to provide data/function stability in DTrace, see the
man-page of "dtrace -v ..." and the corresponding DTrace docs).
but GNU libc used it without checking its kernel version, e. g., Fedora 10.
- Move pipe(2) implementation for Linuxulator from MD files to MI file,
sys/compat/linux/linux_file.c. There is no MD code for this syscall at all.
- Correct an argument type for pipe() from l_ulong * to l_int *. Probably
this was the source of MI/MD confusion.
Reviewed by: emulation
reg.h with stubs.
The tREGISTER macros are only made visible on i386. These macros are
deprecated and should not be available on amd64.
The i386 and amd64 versions of struct reg have been renamed to struct
__reg32 and struct __reg64. During compilation either __reg32 or __reg64
is defined as reg depending on the machine architecture. On amd64 the i386
struct is also available as struct reg32 which is used in COMPAT_FREEBSD32
code.
Most of compat/ia32/ia32_reg.h is now IA64 only.
Reviewed by: kib (previous version)
Remove FPU types from compat/ia32/ia32_reg.h that are no longer needed.
Create machine/npx.h on amd64 to allow compiling i386 code that uses
this header.
The original npx.h and fpu.h define struct envxmm differently. Both
definitions have been included in the new x86 header as struct __envxmm32
and struct __envxmm64. During compilation either __envxmm32 or __envxmm64
is defined as envxmm depending on machine architecture. On amd64 the i386
struct is also available as struct envxmm32.
Reviewed by: kib
After getting the current irql, if the kthread gets preempted and
subsequently runs on a different CPU, the saved irql could be wrong.
Also, correct the panic string.
PR: kern/165630
Submitted by: Vladislav Movchan <vladislav.movchan at gmail.com>
not be ideal, but is the ABI we've shipped so far. Fix macros which reflect
the results of _ALIGN on 32-bit MIPS to use the right alignment.
This fixes sendmsg under COMPAT_FREEBSD32 on n64 MIPS kernels.
using the o32 ABI. This mostly follows nwhitehorn's lead in implementing
COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
32-bit ABI being used. Since the MIPS port is relatively-new, even the 32-bit
ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
with 32-bit compatibility, then, disable some code which assumes otherwise
wrongly when built for MIPS. A more general macro to check in this case would
seem like a good idea eventually. If someone adds support for using n32
userland with n64 kernels on MIPS, then they will have to add a variety of
flags related to each piece of the ABI that can vary. That's probably the
right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
freebsd32 compat code. Probably this should be generalized at some point.
Reviewed by: gonzo
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.
Discussed with: bde, das (previous versions)
MFC after: 1 month
Vnode-backed mappings cannot be put into the kernel map, since it is a
system map.
Use exec_map for transient mappings, and remove the mappings with
kmem_free_wakeup() to notify the waiters on available map space.
Do not map the whole executable into KVA at all to copy it out into
usermode. Directly use vn_rdwr() for the case of not page aligned
binary.
There is one place left where the potentially unbounded amount of data
is mapped into exec_map, namely, in the COFF image activator
enumeration of the needed shared libraries.
Reviewed by: alc
MFC after: 2 weeks
Code should just use the devtoname() function to obtain the name of a
character device. Also add const keywords to pieces of code that need it
to build properly.
MFC after: 2 weeks
64bit and 32bit ABIs. As a side-effect, it enables AVX on capable
CPUs.
In particular:
- Query the CPU support for XSAVE, list of the supported extensions
and the required size of FPU save area. The hw.use_xsave tunable is
provided for disabling XSAVE, and hw.xsave_mask may be used to
select the enabled extensions.
- Remove the FPU save area from PCB and dynamically allocate the
(run-time sized) user save area on the top of the kernel stack,
right above the PCB. Reorganize the thread0 PCB initialization to
postpone it after BSP is queried for save area size.
- The dumppcb, stoppcbs and susppcbs now do not carry the FPU state as
well. FPU state is only useful for suspend, where it is saved in
dynamically allocated suspfpusave area.
- Use XSAVE and XRSTOR to save/restore FPU state, if supported and
enabled.
- Define new mcontext_t flag _MC_HASFPXSTATE, indicating that
mcontext_t has a valid pointer to out-of-struct extended FPU
state. Signal handlers are supplied with stack-allocated fpu
state. The sigreturn(2) and setcontext(2) syscall honour the flag,
allowing the signal handlers to inspect and manipilate extended
state in the interrupted context.
- The getcontext(2) never returns extended state, since there is no
place in the fixed-sized mcontext_t to place variable-sized save
area. And, since mcontext_t is embedded into ucontext_t, makes it
impossible to fix in a reasonable way. Instead of extending
getcontext(2) syscall, provide a sysarch(2) facility to query
extended FPU state.
- Add ptrace(2) support for getting and setting extended state; while
there, implement missed PT_I386_{GET,SET}XMMREGS for 32bit binaries.
- Change fpu_kern KPI to not expose struct fpu_kern_ctx layout to
consumers, making it opaque. Internally, struct fpu_kern_ctx now
contains a space for the extended state. Convert in-kernel consumers
of fpu_kern KPI both on i386 and amd64.
First version of the support for AVX was submitted by Tim Bird
<tim.bird am sony com> on behalf of Sony. This version was written
from scratch.
Tested by: pho (previous version), Yamagi Burmeister <lists yamagi org>
MFC after: 1 month
to read strings completely to know the actual size.
As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.
Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted procces stack.
Suggested by: kib
MFC after: 3 days
is compared to an integer, by casting the pointer to l_uintptr_t. No
functional difference on both i386 and amd64.
Reviewed by: ed, jhb
MFC after: 1 week
definition from K&R to ANSI, to avoid a clang warning about the uint8_t
parameter being promoted to int, which is not compatible with the type
declared in the earlier prototype.
MFC after: 1 week
and proc_getenvv(), which were implemented using linprocfs_doargv() as
a reference.
Suggested by: kib
Reviewed by: kib
Approved by: des (linprocfs maintainer)
MFC after: 2 weeks
system calls to provide feed-forward clock management capabilities to
userspace processes. ffclock_getcounter() returns the current value of the
kernel's feed-forward clock counter. ffclock_getestimate() returns the current
feed-forward clock parameter estimates and ffclock_setestimate() updates the
feed-forward clock parameter estimates.
- Document the syscalls in the ffclock.2 man page.
- Regenerate the script-derived syscall related files.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
The current code mixes the use of `flags' and `mode'. This is a bit
confusing, since the faccessat() function as a `flag' parameter to store
the AT_ flag.
Make this less confusing by using the same name as used in the POSIX
specification -- `amode'.
it can be used by in-kernel consumers.
- Make kern_posix_fallocate() public.
- Use kern_posix_fadvise() and kern_posix_fallocate() to implement the
freebsd32 wrappers for the two system calls.
to match native struct timespec ABI on __LP32__.
This change is a prerequisite for upcoming futimens()/utimensat() in whose
implementations it is assumed that timespec32 can take a negative value.
MFC after: 1 week
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.
Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.
The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.
To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.
Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month
exposed by the security fix in FreeBSD-SA-11:05.unix.
Approved by: so (cperciva)
Approved by: re (kib)
Security: Related to FreeBSD-SA-11:05.unix, but not actually
a security fix.
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.
New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.
When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.
This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.
Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.
Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc
configured swap devices in the Linux-compatible format.
Based on the submission by: Robert Millan <rmh debian org>
PR: kern/159281
Reviewed by: bde
Approved by: re (kensmith)
MFC after: 2 weeks
Implement two previously-reserved Capsicum system calls:
- cap_new() creates a capability to wrap an existing file descriptor
- cap_getrights() queries the rights mask of a capability.
Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
option that is highly recommended to be adjusted in too much
documentation while doing nothing in FreeBSD since r2729 (rev 1.1).
ipcs(1) needs to be recompiled as it is accessing _KERNEL private
variables.
Reviewed by: jhb (before comment change on linux code)
Sponsored by: Sandvine Incorporated
space. This is consistent with the behavior in linux.
PR: kern/157871
Reported by: Petr Salinger <Petr Salinger att seznam cz>
Verified on: GNU/kFreeBSD debian 8.2-1-amd64 (by reporter)
Reviewed by: kib (some time ago)
MFC after: 2 weeks
native devices which support the v4l2 API from processes running within
the linuxulator, e.g. skype or flash can access the multimedia/pwcbsd
or multimedia/webcamd supplied drivers.
Submitted by: nox
MFC after: 1 month
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.
Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.
Also reserve space in the syscall table for posix_fadvise(2).
Reviewed by: -arch (previous version)
safer for i386 because it can be easily over 4 GHz now. More worse, it can
be easily changed by user with 'machdep.tsc_freq' tunable (directly) or
cpufreq(4) (indirectly). Note it is intentionally not used in performance
critical paths to avoid performance regression (but we should, in theory).
Alternatively, we may add "virtual TSC" with lower frequency if maximum
frequency overflows 32 bits (and ignore possible incoherency as we do now).
There is a generic problem with the shims for ioctls that receive
pointers to the usermode data areas in the data argument. We either have
to modify the handler to accept UIO_USERSPACE/UIO_SYSSPACE indicator, or
allocate and fill a usermode memory for data buffer in the host format.
The change goes the second route, in particular because we do not need
to modify the handler.
Submitted by: John Wehle <john feith com>
MFC after: 2 weeks
In particular:
- implement compat shims for old stat(2) variants and ogetdirentries(2);
- implement delivery of signals with ancient stack frame layout and
corresponding sigreturn(2);
- implement old getpagesize(2);
- provide a user-mode trampoline and LDT call gate for lcall $7,$0;
- port a.out image activator and connect it to the build as a module
on amd64.
The changes are hidden under COMPAT_43.
MFC after: 1 month
I have not properly thought through the commit. After r220031 (linux
compat: improve and fix sendmsg/recvmsg compatibility) the basic
handling for SO_PASSCRED is not sufficient as it breaks recvmsg
functionality for SCM_CREDS messages because now we would need to handle
sockcred data in addition to cmsgcred. And that is not implemented yet.
Pointyhat to: avg
on the set of rules it maintains and the current resource usage. It also
privides userland API to manage that ruleset.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)
This seems to have been a part of a bigger patch by dchagin that either
haven't been committed or committed partially.
Submitted by: dchagin, nox
MFC after: 2 weeks
- implement baseic stubs for capget, capset, prctl PR_GET_KEEPCAPS
and prctl PR_SET_KEEPCAPS.
- add SCM_CREDS support to sendmsg and recvmsg
- modify sendmsg to ignore control messages if not using UNIX
domain sockets
This should allow linux pulse audio daemon and client work on FreeBSD
and interoperate with native counter-parts modulo the differences in
pulseaudio versions.
PR: kern/149168
Submitted by: John Wehle <john@feith.com>
Reviewed by: netchild
MFC after: 2 weeks
Code inspection shows freebsd32_ioctl calls fget for a fd and calls
a subroutine to handle each specific ioctl. It is expected that the
subroutine will call fdrop when done. However many of the subroutines
will exit out early if copyin encounters an error resulting in fdrop
never being called.
Submitted by: John Wehle <john feith com>
MFC after: 3 days
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.
MFC after: 1 week
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.
Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.
While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.
Discussed with: kib
MFC after: 2 Week
it possible for the kernel to track login class the process is assigned to,
which is required for RCTL. This change also make setusercontext(3) call
setloginclass(2) and makes it possible to retrieve current login class using
id(1).
Reviewed by: kib (as part of a larger patch)
compiled conditionally on options CAPABILITIES:
Add a new credential flag, CRED_FLAG_CAPMODE, which indicates that a
subject (typically a process) is in capability mode.
Add two new system calls, cap_enter(2) and cap_getmode(2), which allow
setting and querying (but never clearing) the flag.
Export the capability mode flag via process information sysctls.
Sponsored by: Google, Inc.
Reviewed by: anderson
Discussed with: benl, kris, pjd
Obtained from: Capsicum Project
MFC after: 3 months
resource - the layout of cprd_port is identical but using cprd_mem
makes the code easier to understand.
PR: kern/118493
Submitted by: Weongyo Jeong <weongyo.jeong at gmail.com>
MFC after: 3 days
different processes that happen to use the same user address in the
separate processes will now be treated as distinct futexes rather than the
same futex. We can now honor shared futexes properly by mapping them to a
PROCESS_SHARED umtx_key. Private futexes use THREAD_SHARED umtx_key
objects.
In conjunction with: dchagin
Reviewed by: kib
MFC after: 1 week
1) Translate the native signal number in the appropriate Linux signal.
2) Remove bogus code, which can lead to a panic as it calls
kern_sigtimedwait with same ksiginfo.
3) Return the corresponding signal number.
to construct the full pathname. It starts to search at the default
mountpoint which is /dev/shm. If this fails it runs through fstab
and searches for shmfs and tmpfs. Whatever it finds will be
statfs()'ed to be checked for Linux' fs magic for shmfs (0x01021994).
Ideally our tmpfs should deliver this fs magic to Linux processes, but
as our tmpfs is considered to be an experimental feature we can not
assume that there is always a tmpfs available.
To make shared memory work in the Linuxulator, force the fs type of
/dev/shm (which can be a symlink) to match what Linux expects. The user
is responsible (info has to be added to the linux base ports and the docs)
to setup a suitable link for /dev/shm.
Noticed by: Andre Albsmeier <Andre.Albsmeier@siemens.com>
Submitted by: Andre Albsmeier <Andre.Albsmeier@siemens.com>
MFC after: 1 month
try to export as much information as we can match.
Requested on: Debian GNU/kFreeBSD list (debian-bsd lists.debian.org) 2010-12
Tested by: Mats Erik Andersson (mats.andersson gisladisker.se)
MFC after: 10 days
be used by linuxolator itself.
Move linux_wait4() to MD path as it requires native struct
rusage translation to struct l_rusage on linux32/amd64.
MFC after: 1 Month.
if no records where returned by VOP_READDIR(). Readdir implementations
allowed to return 0 records when first record is larger then supplied
buffer. In this case trying to execute VOP_READDIR() again causes the
syscall looping forewer.
The goto was there from the day 1, which goes back to 1995 year.
Reported and tested by: Beat G?tzi <beat chruetertee ch>
MFC after: 2 weeks
setting SV_SHP flag and providing pointer to the vm object and mapping
address. Provide simple allocator to carve space in the page, tailored
to put the code with alignment restrictions.
Enable shared page use for amd64, both native and 32bit FreeBSD
binaries. Page is private mapped at the top of the user address
space, moving a start of the stack one page down. Move signal
trampoline code from the top of the stack to the shared page.
Reviewed by: alc
information is set to FreeBSD. It had been falling through to the end
of linux_ioctl_sound() and returning ENOIOCTL. Noticed when running the
Linux ALSA amixer tool.
Add a LINUX_SOUND_MIXER_READ_CAPS ioctl which is used by the Skype
v2.1.0.81 binary.
Reviewed by: gavin
MFC after: 2 weeks
the original amd64 and i386 headers with stubs.
Rename (AMD64|I386)_BUS_SPACE_* to X86_BUS_SPACE_* everywhere.
Reviewed by: imp (previous version), jhb
Approved by: kib (mentor)
With that change the Atheros 9xxx driver is actually usable and does not
panic anymore.
Submitted by: Paul B Mahol <onemda at gmail.com>
MFC after: 2 weeks
driver for example requests the NetCfgInstanceId but doesn't check the
returned status code and will happily access random memory instead.
Submitted by: Paul B Mahol <onemda at gmail.com>
MFC after: 2 weeks
- Many drivers on amd64 are picking system uptime, interrupt time and ticks
via global data structure instead of calling functions for performance
reasons. For now just patch such address so driver will not trigger page
fault when trying to access such data. In future, additional callout may
be added to update data in periodic intervals.
- On amd64 we need to allocate "shadow space" on stack before calling any
function.
Submitted by: Paul B Mahol <onemda at gmail.com>
According to the comment for MmIsAddressValid() there are issues on PAE
kernels using pmap_kextract().
Submitted by: Paul B Mahol <onemda at gmail.com>
thread specific informations.
In order to do that, and in order to avoid KBI breakage with existing
infrastructure the following semantic is implemented:
- For live programs, a new member to the PT_LWPINFO is added (pl_tdname)
- For cores, a new ELF note is added (NT_THRMISC) that can be used for
storing thread specific, miscellaneous, informations. Right now it is
just popluated with a thread name.
GDB, then, retrieves the correct informations from the corefile via the
BFD interface, as it groks the ELF notes and create appropriate
pseudo-sections.
Sponsored by: Sandvine Incorporated
Tested by: gianni
Discussed with: dim, kan, kib
MFC after: 2 weeks
properly aligned structure is atomic on all supported architectures, and
the thread that should see side-effect of assignment is the same thread
that does assignment.
Use a more appropriate conditional to detect the linux ABI.
Suggested by: kib
X-MFC: together with r215664
on FreeBSD (amd64), invocations of "javac" (or "java") eventually
end with the output of "Killed" and exit code 137.
This is caused by:
1. After calling exec() in multithreaded linux program threads are not
destroyed and continue running. They get killed after program being
executed finishes.
2. linux_exit_group doesn't return correct exit code when called not
from group leader. Which happens regularly using sun jvm.
The submitters fix this in a similar way to how NetBSD handles this.
I took the PRs away from dchagin, who seems to be out of touch of
this since a while (no response from him).
The patches committed here are from [2], with some little modifications
from me to the style.
PR: 141439 [1], 144194 [2]
Submitted by: Stefan Schmidt <stefan.schmidt@stadtbuch.de>, gk
Reviewed by: rdivacky (in april 2010)
MFC after: 5 days
and MmAllocateContiguousMemorySpecifyCache().
Those two functions take 64-bit variable(s) for their arguments. On i386
that takes additional 32-bit variable per argument. This is required so
that windrv_wrap() can correctly wrap function that miniport driver calls
with stdcall convention. Similar explanation is provided in subr_ndis.c for
other functions.
Submitted by: Paul B Mahol <onemda at gmail.com>
unsupported futex operation
- for those futex operations which are known to be not supported,
print out which futex operation it is
- shortcut the error return of the unsupported FUTEX_CLOCK_REALTIME in
some cases:
FUTEX_CLOCK_REALTIME can be used to tell linux to use
CLOCK_REALTIME instead of CLOCK_MONOTONIC. FUTEX_CLOCK_REALTIME
however must only be set, if either FUTEX_WAIT_BITSET or
FUTEX_WAIT_REQUEUE_PI are set too. If that's not the case
we can die with ENOSYS right at the beginning.
Submitted by: arundel
Reviewed by: rdivacky (earlier iteration of the patch)
MFC after: 1 week
kernel of exactly the same __FreeBSD_version as the headers module was
compiled against.
Mark our in-tree ABI emulators with DECLARE_MODULE_TIED. The modules
use kernel interfaces that the Release Engineering Team feel are not
stable enough to guarantee they will not change during the life cycle
of a STABLE branch. In particular, the layout of struct sysentvec is
declared to be not part of the STABLE KBI.
Discussed with: bz, rwatson
Approved by: re (bz, kensmith)
MFC after: 2 weeks
- handle compat32 processes;
- remove the checks for copied in addresses to belong into valid
usermode range, proc_rwmem() does this;
- simplify loop reading single string, limit the total amount of strings
collected by ARG_MAX bytes;
- correctly add '\0' at the end of each copied string;
- fix style.
In linprocfs_doprocenviron():
- unlock the process before calling copyin code [1]. The process is held
by pseudofs.
In linprocfs_doproccmdline:
- use linprocfs_doargv() to handle !curproc case for which p_args is not cached.
Reported by: plulnet [1]
Tested by: pluknet
Approved by: des (linprocfs maintainer, previous
version of the patch)
MFC after: 3 weeks
code associated with overflow or with the drain function. While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.
Tested by: marius (sparc64)
MFC after: 1 month
Now we use a regular mutex instead of a spin mutex. When we enter and exit
the emulator, spinlock_enter() and spinlock_exit() are additionally used.
Move some page table related stuff from x86bios_init() and x86bios_uninit()
to x86bios_map_mem() and x86bios_unmap_mem().
VM86 calls instead of the real mode emulator as a backend. VM86 has been
proven reliable for very long time and it is actually few times faster than
emulation. Increase maximum number of page table entries per VM86 context
from 3 to 8 pages. It was (ridiculously) low and insufficient for new VM86
backend, which shares one context globally. Slighly rearrange and clean up
the emulator backend to accommodate new code. The only visible change here
is stack size, which is decreased from 64K to 4K bytes to sync. with VM86.
Actually, it seems there is no need for big stack in real mode.
MFC after: 1 month
details of the string buffer allocation in one place.
Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name. The pointer to the interpreter name can simply be
made to point to the appropriate argument string.
Reviewed by: kib
shell command are stored in exec*()'s demand-paged string buffer. For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%. It also eliminates about
330k page faults on the kernel address space.
Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings. Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings. While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.
Clean up the initialization of the exec map. In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated. The old size was one page too
large. (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)
Reviewed by: kib
- Rename tdsignal() to tdsendsignal() and make it private to kern_sig.c.
- Add tdsignal() and tdksignal() routines that mirror psignal() and
pksignal() except that they accept a thread as an argument instead of
a process. They send a signal to a specific thread rather than to an
individual process.
Reviewed by: kib
syscalls. On the dynamic syscall deregistration, wait until all
threads leave the syscall code. This somewhat increases the safety
of the loadable modules unloading.
Reviewed by: jhb
Tested by: pho
MFC after: 1 month
Clang generates the following warnings when building subr_usbd.c:
| subr_usbd.c:598:13: warning: promoted type 'int' of K&R function
| parameter is not compatible with the parameter type 'uint8_t' (aka
| 'unsigned char') declared in a previous prototype
| subr_usbd.c:627:13: warning: promoted type 'int' of K&R function
| parameter is not compatible with the parameter type 'uint8_t' (aka
| 'unsigned char') declared in a previous prototype
| subr_usbd.c:649:13: warning: promoted type 'int' of K&R function
| parameter is not compatible with the parameter type 'uint8_t' (aka
| 'unsigned char') declared in a previous prototype
Instead of just ANSIfying these three prototypes, do it for the entire
file.
Spotted by: clang
Intention of this commit is to let us take a full advantage
of libusb(8) ported to Linux. This decreases a possibility of getting
any collisions within ioctl() "command" space, especially with
relation to LINUX_SNDCTL_SEQ... stuff.
Basically, we provide commands, that will be mapped in the kernel
to correct ones and forward those to the USB layer. Port enabling
functionality brought with this patch is here:
http://www.freebsd.org/cgi/query-pr.cgi?pr=146895
Bump __FreeBSD_version to catch, since which version installing a
port makes sense.
This patch should bring no regressions. So far, only i386 is tested.
Tested by: thompsa@
Reviewed by: thompsa@
OKed by: netchild@
Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
usermode into struct syscall_args. The structure is machine-depended
(this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
from the syscall. It is a generalization of
cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
return value.
sv_syscallnames - the table of syscall names.
Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().
The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.
Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().
Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively. The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.
The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.
Reviewed by: jhb, marcel, marius, nwhitehorn, stas
Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
stas (mips)
MFC after: 1 month
variable and can cause problems, without the cliplist handling it works
without problems
- improve the cliplist error handling
- fix VIDIOCGTUNER and VIDIOCSMICROCODE (still no hardware available to test)
Submitted by: J.R. Oldroyd <jr@opal.com>
X-MFC after: soon (together with all the v4l stuff)
mostly work on 64bit host.
The work is based on an original patch submitted by emaste, obtained
from Sandvine's source tree.
Reviewed by: jhb
MFC after: 1 week
A nice thing about POSIX 2008 is that it finally standardizes a way to
obtain file access/modification/change times in sub-second precision,
namely using struct timespec, which we already have for a very long
time. Unfortunately POSIX uses different names.
This commit adds compatibility macros, so existing code should still
build properly. Also change all source code in the kernel to work
without any of the compatibility macros. This makes it all a less
ambiguous.
I am also renaming st_birthtime to st_birthtim, even though it was a
local extension anyway. It seems Cygwin also has a st_birthtim.
within the VBE interrupt handler. Unfortunately it was causing real mode
page faults because we were fetching instructions from bogus addresses.
Pass me the pointyhat, please.
PR: kern/144654
MFC after: 3 days
to the image_params struct instead of several members of that struct
individually. This makes it easier to expand its arguments in the future
without touching all platforms.
Reviewed by: jhb
According to POSIX open() must return ENOTDIR when the path name does
not refer to a path name. Change vn_open() to respect this flag. This
also simplifies the Linuxolator a bit.
- Print the initial memory map when bootverbose is set.
- Change the page fault address format from linear to %cs:%ip style.
- Move duplicate code into a newly added function.
- Add strictly aligned memory access for distant future. ;-)
sysv_{msg,sem,shm}.c files.
Mark SysV IPC freebsd32 syscalls as NOSTD and add required
SYSCALL_INIT_HELPER/SYSCALL32_INIT_HELPERs to provide auto
register/unregister on module load.
This makes COMPAT_FREEBSD32 functional with SysV IPC compiled and loaded
as modules.
Reviewed by: jhb
MFC after: 2 weeks
FOr SYSCALL32_MODULE_HELPER, use "sys32/<syscallname>" module name.
This avoids modules name conflict when compat32 syscall does not
need shims.
Note that SYSCALL_MODULE_HELPER is going to be unused in the tree by
several next commits.
Suggested by: jhb
MFC after: 2 weeks
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.
Reviewed by: kib, jhb
On Linux, /proc/<pid>/fd is comparable to fdescfs, where it allows you
to inspect the file descriptors used by each process. Glibc's ttyname()
works by performing a readlink() on these nodes, since all nodes in this
directory are symlinks.
It is a bit hard to implement this in linprocfs right now, so I am not
going to bother. Add a way to make ttyname(3) work, by adding a
/proc/<pid>/fd symlink, which points to /dev/fd only if the calling
process matches. When fdescfs is mounted, this will cause the
readlink() in ttyname() to fail, causing it to fall back on manually
finding a matching node in /dev.
Discussed on: emulation@
this matches the Linux behavior.
- Check if we have sufficient space allocated for socket structure, which
fixes a buffer overflow when wrong length is being passed into the
emulation layer. [1]
PR: kern/138860
Submitted by: Mateusz Guzik <mjguzik gmail com>
Reported by: Alexander Best [1]
MFC after: 2 weeks
With this change, Linux binaries can work with our libusb(3) when
it's compiled against our header files on GNU/Linux system -- this
solves the problem with differences between /dev layouts.
With ported libusb(3), I am able to use my USB JTAG cable with Linux
binaries that support it.
Reviewed by: thompsa
---snip---
Add video clipping support but with the caveats below.
Background info:
Video clipping allows the user to provide either a series of clip rectangles
or a clip bitmap to the driver and have the driver mask the video according
to the clipping specs provided.
Adding support for clipping to the FreeBSD Linux emulator is problematic
because it seems that this feature is not supported by many drivers and
therefore it is ignored by many applications. Unfortunately, when not
using it, rather than passing in a null clipping list, some apps leave the
clipping fields uninitialized, casuing random values to be passed in. In
the case where the driver does not use the clipping info, this is not a
problem (although it is bad form). But the Linux emulator does not know
which drivers will use this and which won't, so the Linux emulator must
try to handle this clip list, and deal gracefully with cases where the
values seem to be uninitialized.
Video clipping info is passed in using the VIDIOCSWIN ioctl in two fields
in the video_window structure: the integer clipcount and the pointer clips.
How the linuxulator handles this from this commit on:
* if (clipcount == VIDEO_CLIP_BITMAP)
The clips variable is a void * pointer to a 128*625 byte
(1024*625 bit) memory area containing a bitmap of the clipping area.
The pointer in the video_window structure is copied, but no
video_clip structures are copied.
* if (clipcount > 0 && clipcount <= 16384)
The clips variable is pointer to a list of video_clip structures. Up
to clipcount structures are copied and passed to the driver.
The upper limit of 16384 was imposed here so that user code that does
not properly initialize clipcount falls through below and no attempt
is made to copy an uninitialized list. This value was found by
examining Linux drivers that support the clip list.
* else
The clipcount is either negative (but not VIDEO_CLIP_BITMAP), zero or
positive (> 16384).
All these cases are treated as invalid data. Both the clipcount field
and clips pointer are forced to zero/NULL and passed to the driver.
It should be noted that, at the time of developing this V4L emulator code,
the pwc(4) V4L driver does not support clipping.
Submitted by: J.R. Oldroyd <fbsd@opal.com>
MFC after: 1 month
---snip---
native devices which support the v4l API from processes running within
the linuxulator, e.g. skype or flash can access the multimedia/pwcbsd driver.
Not tested is firmware upload, framebuffer stuff and video tuner stuff
due to lack of hardware.
The clipping part (VIDIOCSWIN) needs a little bit of further work (partly
in progress, but can not be tested due to lack of a suitable device).
The submitter tested this sucessfully with Skype and flash apps on amd64 and
i386 with the multimedia/pwcbsd driver.
Submitted by: J.R. Oldroyd <fbsd@opal.com>
kern.ngroups+1. kern.ngroups can range from NGROUPS_MAX=1023 to
INT_MAX-1. Given that the Windows group limit is 1024, this range
should be sufficient for most applications.
MFC after: 1 month
When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.
Solution:
This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:
setcwd(dirinode) - set the current directory to dirinode in the
filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
with nameptr in the current directory is oldvalue then unlink it.
As with all other fsck sysctls, these new ones may only be used by
processes with appropriate priviledge.
Reported by: jeff
Security issues: rwatson
target one. Since r184058, linux_do_tkill() calls tdsignal() instead of
kill(), without checking for validity of supplied signal number. Prevent
panic when supplied signal is 0 by finishing work after checks.
Found and tested by: scf
MFC after: 3 days
native devices which support the v4l API from processes running within
the linuxulator, e.g. skype or flash can access the multimedia/pwcbsd driver.
Not tested is firmware upload, framebuffer stuff and video tuner stuff
due to lack of hardware.
The clipping part (VIDIOCSWIN) needs a little bit of further work (partly
in progress, but can not be tested due to lack of a suitable device).
The submitter tested this sucessfully with Skype and flash apps on amd64 and
i386 with the multimedia/pwcbsd driver.
Submitted by: J.R. Oldroyd <fbsd@opal.com>
Right now <sys/termios.h> includes <sys/ttycom.h>, which provides the
TTY ioctls to the svr4 code. We need both struct termios and the ioctls,
so include <sys/tty.h> for now.
well-known race condition, which elimination was the reason for the
function appearance in first place. If sigmask supplied as argument to
pselect() enables a signal, the signal might be delivered before thread
called select(2), causing lost wakeup. Reimplement pselect() in kernel,
making change of sigmask and sleep atomic.
Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not gelivered during
syscall execution.
Reviewed by: davidxu
Tested by: pho
MFC after: 1 month
while in kernel mode, and later changing signal mask to block the
signal, was fixed for sigprocmask(2) and ptread_exit(3). The same race
exists for sigreturn(2), setcontext(2) and swapcontext(2) syscalls.
Use kern_sigprocmask() instead of direct manipulation of td_sigmask to
reschedule newly blocked signals, closing the race.
Reviewed by: davidxu
Tested by: pho
MFC after: 1 month
kern_sigprocmask() to properly notify other possible candidate threads
for signal delivery.
Since sigsuspend() shall only return to usermode after a signal was
delivered, do cursig/postsig loop immediately after waiting for
signal, repeating the wait if wakeup was spurious due to race with
other thread fetching signal from the process queue before us. Add
thread_suspend_check() call to allow the thread to be stopped or killed
while in loop.
Modify last argument of kern_sigprocmask() from boolean to flags,
allowing the function to be called with locked proc. Convertion of the
callers that supplied 1 to the old argument will be done in the next
commit, and due to SIGPROCMASK_OLD value equial to 1, code is formally
correct in between.
Reviewed by: davidxu
Tested by: pho
MFC after: 1 month
no matter whether we are compiled as module or if our default of the
net.inet6.ip6.v6only sysctl already matches what we would set.
This avoids unnecessary complications with modules, VIMAGES, INET6 and
the sysctl value, especially considering that most users will use
linux compat as a module.
Discussed with: kib, rwatson (weeks ago)
Reviewed by: rwatson
MFC after: 6 weeks
- Do not map entire real mode memory (1MB). Instead, we map IVT/BDA and
ROM area separately. Most notably, ROM area is mapped as device memory
(uncacheable) as it should be. User memory is dynamically allocated and
free'ed with contigmalloc(9) and contigfree(9). Remove now redundant and
potentially dangerous x86bios_alloc.c. If this emulator ever grows to
support non-PC hardware, we may implement it with rman(9) later.
- Move all host-specific initializations from x86emu_util.c to x86bios.c and
remove now unnecessary x86emu_util.c. Currently, non-PC hardware is not
supported. We may use bus_space(9) later when the KPI is fixed.
- Replace all bzero() calls for emulated registers with more obviously named
x86bios_init_regs(). This function also initializes DS and SS properly.
- Add x86bios_get_intr(). This function checks if the interrupt vector is
available for the platform. It is not necessary for PC-compatible hardware
but it may be needed later. ;-)
- Do not try turning off monitor if DPMS does not support the state.
- Allocate stable memory for VESA OEM strings instead of just holding
pointers to them. They may or may not be accessible always. Fix a memory
leak of video mode table while I am here.
- Add (experimental) BIOS POST call for vesa(4). This function calls VGA
BIOS POST code from the current VGA option ROM. Some video controllers
cannot save and restore the state properly even if it is claimed to be
supported. Usually the symptom is blank display after resuming from suspend
state. If the video mode does not match the previous mode after restoring,
we try BIOS POST and force the known good initial state. Some magic was
taken from NetBSD (and it was taken from vbetool, I believe.)
- Add a loader tunable for vgapci(4) to give a hint to dpms(4) and vesa(4)
to identify who owns the VESA BIOS. This is very useful for multi-display
adapter setup. By default, the POST video controller is automatically
probed and the tunable "hw.pci.default_vgapci_unit" is set to corresponding
vgapci unit number. You may override it from loader but it is very unlikely
to be necessary. Unfortunately only AGP/PCI/PCI-E controllers can be
matched because ISA controller does not have necessary device IDs.
- Fix a long standing bug in state save/restore function. The state buffer
pointer should be ES:BX, not ES:DI according to VBE 3.0. If it ever worked,
that's because BX was always zero. :-)
- Clean up register initializations more clearer per VBE 3.0.
- Fix a lot of style issues with vesa(4).
first and the native ia32 compat as middle (before other things).
o(ld)brandinfo as well as third party like linux, kfreebsd, etc.
stays on SI_ORDER_ANY coming last.
The reason for this is only to make sure that even in case we would
overflow the MAX_BRANDS sized array, the native FreeBSD brandinfo
would still be there and the system would be operational.
Reviewed by: kib
MFC after: 1 month
- Remove CS and IP registers from x86bios.h. They have no use for us.
- Adjust register dump to make it little bit more useful for debugging.
Submitted by: paradox (ddkprog yahoo com)[1] (initial version)
FreeBSD 8, the compatibility shims should be built not just when FreeBSD 7
compatibility is requested, but also when compatibility with any older
FreeBSD version where that feature was present is requested.o
Without this patch, a kernel config that sets COMPAT_FREEBSD6 but not *7
would fail to build due to inconsistencies between the declaration of the
compatibility shims and their use in the SysV code.
There are similar errors in other *proto.h headers in the tree.
MFC after: 3 weeks
longs. Since 32bit processes longs are 4 bytes, 64bit kernel may copy in
or out 4 bytes more then the process expected.
Calculate the amount of bytes to copy taking into account size of fd_set
for the current process ABI.
Diagnosed and tested by: Peter Jeremy <peterjeremy acm org>
Reviewed by: jhb
MFC after: 1 week
correctly and do not match a colliding Debian GNU/kFreeBSD
brandinfo statements.
For this mark the Debian GNU/kFreeBSD brandinfo that it must have
an .note.ABI-tag section and ignore the old EI_OSABI brandinfo
when comparing a possibly colliding set of options.
Due to SYSINIT we add the brandinfo in a non-deterministic order,
so native FreeBSD is not always first. We may want to consider
to force native FreeBSD to come first as well.
The only way a problem could currently be noticed is when running an
i386 binary without the .note.ABI-tag on amd64 and the Debian GNU/kFreeBSD
brandinfo was matched first, as the fallback to ld-elf32.so.1 does
not exist in that case.
Reported and tested by: ticso
In collaboration with: kib
MFC after: 3 days
Handle GNU/Linux according to LSB Core Specification 4.0,
Chapter 11. Object Format, 11.8. ABI note tag.
Also check the first word of desc, not only name, according to
glibc abi-tags specification to distinguish between Linux and
kFreeBSD.
Add explicit handling for Debian GNU/kFreeBSD, which runs
on our kernels as well [2].
In {amd64,i386}/trap.c, when checking osrel of the current process,
also check the ABI to not change the signal behaviour for Linux
binary processes, now that we save an osrel version for all three
from the lists above in struct proc [2].
These changes make it possible to run FreeBSD, Debian GNU/kFreeBSD
and Linux binaries on the same machine again for at least i386 and
amd64, and no longer break kFreeBSD which was detected as GNU(/Linux).
PR: kern/135468
Submitted by: dchagin [1] (initial patch)
Suggested by: kib [2]
Tested by: Petr Salinger (Petr.Salinger seznam.cz) for kFreeBSD
Reviewed by: kib
MFC after: 3 days
several critical bugs, including race conditions and lock order issues:
Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stablize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.
Reviewed by: bz, julian
MFC after: 3 days
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
old ABI versions of the relevant control system call (e.g.
freebsd7_freebsd32_msgctl() instead of freebsd32_msgctl() for msgsys()).
Approved by: re (kib)
restrictions) were found to be inadequately described by a boolean.
Define a new parameter type with three values (disable, new, inherit)
to handle these and future cases.
Approved by: re (kib), bz (mentor)
Discussed with: rwatson
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.
This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.
Reviewed by: rwatson (earlier version)
Approved by: re (kib)
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).
In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.
Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
(this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
int. This removes the need for the shm_bsegsz member in struct
shmid_kernel and should allow for complete support of SYSV SHM regions
>= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
short.
- The shm_internal member of struct shmid_ds is now gone. The internal
VM object pointer for SHM regions has been moved into struct
shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
now marked COMPAT7 and new versions of those system calls which support
the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc. The
FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
symbol versions has been added to libc. Version tags are added to
system calls by adding an appropriate __sym_compat() entry to
src/lib/libc/incldue/compat.h. [1]
PR: kern/16195 kern/113218 bin/129855
Reviewed by: arch@, rwatson
Discussed with: kan, kib [1]
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)
The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.
Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.
Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.
Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.
Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867
in the type field of system call tables. Specifically, one can now use
the 'NO*' types as flags in addition to the 'COMPAT*' types. For example,
to tag 'COMPAT*' system calls as living in a KLD via NOSTD. The COMPAT*
type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
makesyscalls.sh that return true if a requested flag is found in the
type field ($3). The flag() function checks all of the flags in the
field, but type() only checks the first flag. type() is meant to be
used in the top-level "switch" statement and flag() should be used
otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.
any open file descriptors >= 'lowfd'. It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD. One difference from other *BSD is that this closefrom() does not
fail with any errors. In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR. DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.
Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads. As such, it is not
multithread safe.
Submitted by: rwatson (initial version)
Reviewed by: rwatson
MFC after: 2 weeks
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.
Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex. Jails may
have their own host information, or they may inherit it from the
parent/system. The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL. The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.
The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.
Approved by: bz (mentor)
... by moving two ~2KB structures from stack to heap allocation.
I experienced stack overflow in linux emulation on i386 (8K stack)
when LINUX_DVD_READ_STRUCT ioctl was performed on atapicam cd
device and there was an error that resulted in additional quite
heavy stack use in cam layer.
Reviewed by: dchagin
Approved by: jhb (mentor)
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.
Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().
Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.
Approved by: bz (mentor)
Args argument is a pointer to the structure located in user space in
which the socketcall arguments are packed. The structure must be
copied to the kernel instead of direct dereferencing.
Approved by: kib (mentor)
MFC after: 1 week
SOCK_NONBLOCK flags, that allow to save fcntl() calls.
Implement a variation of the socket() syscall which takes a flags
in addition to the type argument.
Approved by: kib (mentor)
MFC after: 1 month
Temporarily use 0 for pid member as the FreeBSD does not cache remote
UNIX domain socket peer pid.
PR: kern/102956
Reviewed by: rwatson
Approved by: kib (mentor)
MFC after: 1 month
to 2.4.0, as it has appeared in the 2.4.0-rc7 first time.
Being exported, AT_CLKTCK is returned by sysconf(_SC_CLK_TCK),
glibc falls back to the hard-coded CLK_TCK value when aux entry
is not present.
Glibc versions prior to 2.2.1 always use hard-coded CLK_TCK value.
For older applications/libc's which depends on hard-coded CLK_TCK
value user should set compat.linux.osrelease less than 2.4.0.
Approved by: kib (mentor)
designation of the emulated kernel version.
linux_kernver() returns integer value formatted as 'VVVMMMIII' where
VVV - version, MMM - major revision, III - minor revision.
Approved by: kib (mentor)
The frequency of the statistics clock is given by stathz.
Use stathz if it is available, otherwise use hz.
Pointed out by: bde
Approved by: kib (mentor)
I really don't want any pieces of code to include ioctl_compat.h, so let
the ibcs2 and svr4 compat leave sgtty alone. If they want to support
sgtty, they should emulate it on top of termios, not sgtty.
The code has been marked with BURN_BRIDGES for a long time. ibcs2 and
svr4 are not really popular pieces of code anyway.
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg. Struct vprocg is likely to become replaced in the near future with
a new jail management API import.
As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.
Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.
Permit kldload / kldunload operations to be executed only from the default
vimage context.
This change should have no functional impact on nooptions VIMAGE kernel
builds.
Reviewed by: bz
Approved by: julian (mentor)
OSD-based jail extensions. This allows the Linux MIB to accessed via
jail_set and jail_get, and serves as a demonstration of adding jail support
to a module.
Reviewed by: dchagin, kib
Approved by: bz (mentor)
which is available for Glibc as sysconf(_SC_CLK_TCK). If AT_CLKTCK entry is
not exported, Glibc uses 100.
linux_times() shall use the value that is exported to user space.
Pointyhat to: dchagin
PR: kern/134251
Approved by: kib (mentor)
MFC after: 2 weeks
Use the protocol family constants for the domain argument validation.
Return EAFNOSUPPORT in case when the incorrect domain argument
is specified.
Return EPROTONOSUPPORT instead of passing values that are not 0
to the BSD layer.
Suggested by: rwatson
Approved by: kib (mentor)
MFC after: 1 month
Old implemention used Giant to protect the kernel data structures,
but at the same time called malloc(M_WAITOK), that could cause the
calling thread to sleep and lost Giant protection. User-visible
result was the missed wakeup.
New implementation uses one sx lock per futex. The sx protects
the futex structures and allows to sleep while copyin or copyout
are performed.
Unlike linux, we return EINVAL when FUTEX_CMP_REQUEUE operation
is requested and either caller specified futexes are equial or
second futex already exists. This is acceptable since the situation
can only occur from the application error, and glibc falls back to
old FUTEX_WAKE operation when FUTEX_CMP_REQUEUE returns an error.
Approved by: kib (mentor)
MFC after: 1 month
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).
Approved by: bz (mentor)
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.
Reviewed by: bz (an older version of the patch)
Approved by: julian (mentor)
Glibc does not use this operation since 2.3.3 version (Jun 2004),
as it is racy and replaced by FUTEX_CMP_REQUEUE operation.
Glibc versions prior to 2.3.3 fall back to FUTEX_WAKE when
FUTEX_REQUEUE returned EINVAL.
Any application directly using FUTEX_REQUEUE without return
value checking are definitely broken.
Limit quantity of messages per process about unsupported
operation.
Approved by: kib (mentor)
MFC after: 1 month
- make usb2_power_mask_t 16-bit
- remove "usb2_config_sub" structure from "usb2_config". To compensate for this
"usb2_config" has a new field called "usb_mode" which select for which mode
the current xfer entry is active. Options are: a) Device mode only b) Host
mode only (default-by-zero) c) Both modes. This change was scripted using
the following sed script: "s/\.mh\././g".
- the standard packet size table in "usb_transfer.c" is now a function, hence
the code for the function uses less memory than the table itself.
Submitted by: Hans Petter Selasky
1) Move the new field (brand_note) to the end of the Brandinfo structure.
2) Add a new flag BI_BRAND_NOTE that indicates that the brand_note pointer
is valid.
3) Use the brand_note field if the flag BI_BRAND_NOTE is set and as old
modules won't have the flag set, so the new field brand_note would be
ignored.
Suggested by: jhb
Reviewed by: jhb
Approved by: kib (mentor)
MFC after: 6 days
the arguments translations. Provide ABI-compatible definition of the
struct i386_ldt_args for freebsd32 compat layer.
In collaboration with: pho
Reviewed by: jhb
To keep these structures ABI-compatible, half the size of r_trapno,
r_err, mc_trapno, mc_flags.
Add fsbase and gsbase to mcontext on both amd64 and i386.
Add flags to amd64 mcontext to indicate that it contains valid segments
or bases.
In collaboration with: pho
Discussed with: peter
Reviewed by: jhb
It's quite strange that nobody reported this issue before. It turns out
functions like ttyname(), ptsname() and fdevname() don't work in
compat32. This means it't not even possible to run applications like
script(1) inside a 32-bit FreeBSD jail.
Fix this by converting 32-bit fiodgname_arg structures to their 64-bit
equivalent.
Reported by: kris
Tested by: kris
via the Linux tool.
- Add Linux shim to ipmi(4)
- Create a partitions file to linprocfs to make Linux fdisk see
disks. This file is dynamic so we can see disks come and go.
- Convert msdosfs to vfat in mtab since Linux uses that for
msdosfs.
- In the Linux mount path convert vfat passed in to msdosfs
so Linux mount works on FreeBSD. Note that tasting works
so that if da0 is a msdos file system
/compat/linux/bin/mount /dev/da0 /mnt
works.
- fix a 64it bug for l_off_t.
Grabing sh, mount, fdisk, df from Linux, creating a symlink of mtab to
/compat/linux/etc/mtab and then some careful unpacking of the Linux bmc
update tool and hacking makes it work on newer Dell boxes. Note, probably
if you can't figure out how to do this, then you probably shouldn't be
doing it :-)
or URB_FUNCTION_CLASS_xxx with HAL preemption lock that means it's
non-sleepable during USB requests though usb2_do_request() requires a
sleep so it needs to send queries to the default pipe without those
interfaces to avoid sleep.
whatever the IRP flag is because some drivers (eg. RTL8187L NDIS driver)
call IoCompleteRequest() without setting flags. It will prevent waiting
a event forever at attach.
".note.ABI-tag" section.
The search order of a brand is changed, now first of all the
".note.ABI-tag" is looked through.
Move code which fetch osreldate for ELF binary to check_note() handler.
PR: 118473
Approved by: kib (mentor)
Based on the HAL preemption lock there is a problem on SMP machines
and causes a panic.
o When a device detached the current tactic to detach NDIS USB driver is
to call SURPRISE_REMOVED event. So it don't need to call
ndis_halt_nic() again. This fixes some page faults when some drivers
work abnormal.
o it assumes now that URB_FUNCTION_BULK_OR_INTERRUPT_TRANSFER is in
DISPATCH_LEVEL (non-sleepable) and as further work
URB_FUNCTION_VENDOR_XXX and URB_FUNCTION_CLASS_XXX should be.
Reviewed by: Hans Petter Selasky <hselasky_at_freebsd.org>
Tested by: Paul B. Mahol <onemda_at_gmail.com>
o implement URB_FUNCTION_ABORT_PIPE handling.
o remove unused code related with canceling the timer list for USB
drivers.
o whitespace cleanup and style(9)
Obtained from: hps's original patch
ABIs:
- Store the FPU initial control word in the pcb for each thread.
- When first using the FPU, load the initial control word after restoring
the clean state if it is not the standard control word.
- Provide a correct control word for Linux/i386 binaries under
FreeBSD/amd64.
- Adjust the control word returned for fpugetregs()/npxgetregs() when a
thread hasn't used the FPU yet to reflect the real initial control
word for the current ABI.
- The Linux/i386 ABI for FreeBSD/i386 now properly sets the right control
word instead of trashing whatever the current state of the FPU is.
Reviewed by: bde
are used by glibc. This silents the message "2.4+ kernel w/o ELF notes?"
from some programs at start, among them are top and pkill.
Do the assignment of the vector entries in elf_linux_fixup()
as it is done in glibc.
Fix some minor style issues.
Submitted by: Marcin Cieslak <saper at SYSTEM PL>
Approved by: kib (mentor)
MFC after: 1 week
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.
Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.
While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.
Approved by: bz (mentor)
net/route.h.
Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.
We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.
This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
The existing code calls kern_open() to resolve the vnode of a pathname
right after a stat(). This is not correct, because it causes random
character devices to be opened in /dev. This means ls'ing a tape
streamer will cause it to rewind, for example. Changes I have made:
- Add kern_statat_vnhook() to allow binary emulators to `post-process'
struct stat, using the proper vnode.
- Remove unneeded printf's from stat() and statfs().
- Make the Linuxolator use kern_statat_vnhook(), replacing
translate_path_major_minor_at().
- Let translate_fd_major_minor() use vp->v_rdev instead of
vp->v_un.vu_cdev.
Result:
crw-rw-rw- 1 root root 0, 14 Feb 20 13:54 /dev/ptmx
crw--w---- 1 root adm 136, 0 Feb 20 14:03 /dev/pts/0
crw--w---- 1 root adm 136, 1 Feb 20 14:02 /dev/pts/1
crw--w---- 1 ed tty 136, 2 Feb 20 14:03 /dev/pts/2
Before this commit, ptmx also had a major number of 136, because it
silently allocated and deallocated a pseudo-terminal. Device nodes that
cannot be opened now have proper major/minor-numbers.
Reviewed by: kib, netchild, rdivacky (thanks!)
Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().
We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.
Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.
I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.
In the existing code we didn't really enforce that callers hold Giant
before calling userland_sysctl(), even though there is no guarantee it
is safe. Fix this by just placing Giant locks around the call to the oid
handler. This also means we only pick up Giant for a very short period
of time. Maybe we should add MPSAFE flags to sysctl or phase it out all
together.
I've also added SYSCTL_LOCK_ASSERT(). We have to make sure sysctl_root()
and name2oid() are called with the sysctl lock held.
Reviewed by: Jille Timmermans <jille quis cx>
compare map->timestamp with saved timestamp after map read lock is
reacquired, not with saved timestamp + 1. The only consequence of the +1
was unconditional lookup of the next map entry, though.
Tested by: pho
Approved by: des
MFC after: 2 weeks