In the mpsafetty branch, Linux sshd seems to work properly inside a
jail. Some small modifications had to be made to the Linux compatibility
layer.
The Linux PTY routines always expect the device major number to be 136
or higher. Our code always set the major/minor number pair to 136:0.
This makes routines like ttyname() and ptsname() fail, because we'll end
up having ambiguous device numbers.
The conversion was not performed on all *stat() routines, which meant in
some cases the numbers didn't get transformed. By pushing the conversion
into linux_driver_get_major_minor(), the transformation will take place
on all calls.
Approved by: philip (mentor), rdivacky
what Linux does. This is because robust futexes are mostly
userspace thing which we cannot alter. Two syscalls maintain
pointer to userspace list and when process exits a routine
walks this list waking up processes sleeping on futexes
from that list.
Reviewed by: kib (mentor)
MFC after: 1 month
and linux_openat(). Instead just pass AT_FDCWD into linux_common_open()
for the linux_open() case. This prevents passing -1 as a dirfd to
openat() from succeeding which is wrong.
Suggested by: rwatson, kib
Approved by: kib (mentor)
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.
Highlights include:
* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.
* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.
* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.
* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.
* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.
* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.
Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks
so the annoying message is not printed.
o Don't warn about FUTEX_FD not being implemented
and return ENOSYS instead of 0 (eg. success).
o Clear FUTEX_PRIVATE_FLAG as we actually implement
only private futexes so there is no reason to
return ENOSYS when app asks for a private futex.
We don't reject shared futexes because they worked
just fine with our implementation so far.
Approved by: kib (mentor)
Tested by: bsam
MFC after: 1 week
Check that only MREMAP_FIXED and MREMAP_MAYMOVE flags are specified.
Check for the page alignment of the addr argument.
Submitted by: rdivacky
MFC after: 1 week
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.
KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.
Manpage and FreeBSD_version will be updated through further commits.
As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
Without it, code has two problems:
- behaviour of the old and new [l]stat are different with regard of
the /compat/linux
- directly accessing the userspace data from the kernel asks for
the panics.
Reported and tested by: Peter Holm
Reviewed by: rdivacky
MFC after: 3 days
LINUX_SIOCGIFCOUNT just returns 0 since it is not implemented in the
Linux 2.6.16.
LINUX_SIOCGIFINDEX/LINUX_SIOGIFINDEX are mapped to the FreeBSD native
SIOCGIFINDEX.
Tested by: Peter Kostouros <kpeter@melbpc.org.au>
Reviewed by: brooks, rpaulo (on net@)
Submitted by: rdivacky
MFC after: 1 week
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
3 arguments, but we had forgotten the second argument. Also make the
Linux statfs64 struct depend on the architecture because it has an
extra 4 bytes padding on amd64 compared to i386.
The three argument fix is from David Taylor, the struct statfs64
stuff is my fault. With this patch I can install i386 Linux matlab
on an amd64 machine.
Submitted by: David Taylor <davidt_at_yadt.co.uk>
Approved by: re (kensmith)
with Linux 2.6 emulation. This shall be reimplemented once FreeBSD gets
native scheduler affinity syscalls.
Submitted by: rdivacky
Reviewed by: jkim
Sponsored by: Google Summer of Code 2007
Approved by: re (kensmith)
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.
While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.
Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
This patch fixes places where they should be called atomically changing
their locking requirements (both assume per-proc spinlock held) and
introducing rufetchcalc which wrappers both calls to be performed in
atomic way.
Reviewed by: jeff
Approved by: jeff (mentor)
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.
Requested by: alc
Approved by: jeff (mentor)
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.
Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)
Implement all futex atomic operations in assembler to not depend on the
fuword() that does not allow to distinguish between -1 and failure return.
Correctly return 0 from atomic operations on success.
In collaboration with: rdivacky
Tested by: Scot Hetzel <swhetzel gmail com>, Milos Vyletel <mvyletel mzm cz>
Sponsored by: Google SoC 2007
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
sendmsg() while using a 0-length msg_controllen. This isn't allowed in
the FreeBSD system call ABI, so detect this case and set msg_control to
NULL. This allows Linux ping to work.
Submitted by: rdivacky
Linux SCSI SG passthrough device API. The intention is to allow for both
running of Linux apps that want to talk to /dev/sg* nodes, and to facilitate
porting of apps from Linux to FreeBSD. As such, both native and linuxolator
entry points and definitions are provided.
Caveats:
- This does not support the procfs and sysfs nodes that the Linux SG
driver provides. Some Linux apps may rely on these for operation,
others may only use them for informational purposes.
- More ioctls need to be implemented.
- Linux uses a naming scheme of "sg[a-z]" for devices, while FreeBSD uses a
scheme of "sg[0-9]". Devfs aliasis (symlinks) are automatically created
to link the two together. However, tools like camcontrol only see the
native names.
- Some operations were originally designed to return byte counts or other
data directly as the syscall return value. The linuxolator doesn't appear
to support this well, so this driver just punts for these cases.
Now that the driver is in place, others are welcome to add missing
functionality. Thanks to Roman Divacky for pushing this work along.
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
semi-automatic style(9)
The futex stuff already differs a lot (only a small part does not differ)
from NetBSD, so we are already way off and can't apply changes from NetBSD
automatically. As we need to merge everything by hand already, we can even
make the files comply to our world order.
- Dont "return" in linux_clone() after we forked the new process in a case
of problems.
- Move the copyout of p2->p_pid outside the emul_lock coverage in
linux_clone().
- Cache the em->pdeath_signal in a local variable and move the copyout
out of the emul_lock coverage.
- Move the free() out of the emul_shared_lock coverage in a preparation
to switch emul_lock to non-sleepable lock (mutex).
Submitted by: rdivacky
cannot change (because its referenced by curthread). This fixes
a LOR caused by acquiring emul_shared_lock while holding emul_lock.
Fix typo in comment.
Submitted by: rdivacky
p->p_emuldata is properly initialized in the time when the child can run.
Do not set p->p_emuldata to NULL when the process is exiting.
It does not make any sense and only costs 2 mutex operations.
Do not lock emul_data to unlock it on the very next line.
Comment on possible race while there.
Reparent all procs that are part of a threading group but not its leaders
to init and SIGCHLD init to finish the zombies off. This fixes zombies
left after opera's exit. [1]
There is no need to lock p_em in the linux_proc_init CLONE_THREAD
case because the process cannot change the address of the p_em->shared
because its currently running this code path.
Move assigning of em->shared outside emul_shared_lock.
Noticed by: Scott Robbins <scottro@nyc.rr.com> [1]
Submitted by: rdivacky
Dont expose em->shared to the outside world before its properly
initialized. Might not affect anything but its at least a better
coding style.
Dont expose em via p->p_emuldata until its properly initialized.
This also enables us to get rid of some locking and simplify the
code because we are workin on a local copy.
In linux_fork and linux_vfork create the process in stopped state
to be sure that the new process runs with fully initialized emuldata
structure [1]. Also fix the vfork (both in linux_clone and linux_vfork)
race that could result in never woken up process [2].
Reported by: Scot Hetzel [1]
Suggested by: jhb [2]
Reviewed by: jhb (at least some important parts)
Submitted by: rdivacky
Tested by: Scot Hetzel (on amd64)
Change 2 comments (in the new code) to comply to style(9).
Suggested by: jhb
to open() [1].
Improve locking for accessing session control structures [2].
Try to document (most likely harmless) races in the code [3].
Based on submission by: Intron (intron at intron ac) [1]
Reviewed by: jhb [2]
Discussed with: netchild, rwatson, jhb [3]
Now (ok it's been a while...) that FreeBSD has RLIMIT_AS too, we can use
it in the linuxolator instead of ignoring it.
This fixes a LTP test.
Submitted by: rdivacky
No need to lock prison in a case of linux_use26 because the int
setting is atomic and process cannot leave jail.
Submitted by: kib
Reviewed by: jhb
Requested by: rdivacky
Dont lock em in a case of just using em->shared->group_pid because
the group_pid never changes.
Submitted by: rdivacky
Reviewed by: kib
Glanced at by: jhb
Redo the checking for 2.6 emulation. We now cache the value of
use26 and replace calls to linux_get_osrelease() + parsing with
a call to linux_use26(). Typical path is lockless now.
Pointed out by: kib
This allows to ship RELENG_7_0 with a default osrelease of 2.4.2 and the
possibility to enable 2.6.x emulation without the possible performance
impact of the previous version of the check.
Submitted by: rdivacky
- Move linux_nanosleep() from src/sys/amd64/linux32/linux32_machdep.c to
src/sys/compat/linux/linux_time.c.
- Validate timespec ranges before use as Linux kernel does.
- Fix l_timespec structure.
- Clean up style(9) nits.
Add rudimentary IPC_INFO/MSG_INFO command support for linux_msgctl()
to pacify Linux ipcs(1). While I am here, add more bound checks
for linux_msgsnd() and linux_msgrcv().
Fixes for 'blocking in fifoor state' problem of LTP tests.
linux_*stat*() functions were opening files with O_RDONLY to get
major/minor pair for char/block special files. Unfortunately,
when these functions are used against fifo, it is blocked forever
because there is no writer. Instead, we only open char/block special
files for major/minor conversion. We have to get rid of kern_open()
entirely from translate_path_major_minor() but today is not the day.
While I am here, add checks for errors before calling
translate_path_major_minor().
- Currently LINUX_MAX_COMM_LEN is smaller than MAXCOMLEN, but in case
this will change we have a buffer overflow. Apply some defensive
programming to DTRT when this should happen.
- Use copyinstr() instead of copyin where appropriate.
* Fallback to copyin() in case of ENAMETOOLONG. [1]
* Use the right source and destination (it was wrong before).
- Use strlcpy instead of strcpy.
- Properly lock the read case (PR_GET_NAME) like the write case.
Reviewed by: rwatson (except [1])
Suggested by: rwatson [1]
This fix lets clone02 LTP test pass with 2.6 emulation. In reality 99%
of the cases are that CLONE_VM and CLONE_THREAD are both set so it
seemed to work.
Submitted by: rdivacky
__WCLONE. This fixes it thus fixing skype/teamspeak to not keep zombies
after exit.
Submitted by: rdivacky
Reported by: Bakul Shah (bakul at bitblocks com)
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
dynamic nature (if no native aio code is available, the linux part
returns ENOSYS because of missing requisites) should be solved differently
than it is.
All this will be done in P4.
Not included in this commit is a backout of the changes to the native aio
code (removing static in some places). Those changes (and some more) will
also be needed when the reworked linux aio stuff will reenter the tree.
Requested by: rwatson
Discussed with: rwatson
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
Implement the linux_io_* syscalls (AIO). They are only enabled if the native
AIO code is available (either compiled in to the kernel or as a module) at
the time the functions are used. If the AIO stuff is not available there
will be a ENOSYS.
From the submitter:
---snip---
DESIGN NOTES:
1. Linux permits a process to own multiple AIO queues (distinguished by
"context"), but FreeBSD creates only one single AIO queue per process.
My code maintains a request queue (STAILQ of queue(3)) per "context",
and throws all AIO requests of all contexts owned by a process into
the single FreeBSD per-process AIO queue.
When the process calls io_destroy(2), io_getevents(2), io_submit(2) and
io_cancel(2), my code can pick out requests owned by the specified context
from the single FreeBSD per-process AIO queue according to the per-context
request queues maintained by my code.
2. The request queue maintained by my code stores contrast information between
Linux IO control blocks (struct linux_iocb) and FreeBSD IO control blocks
(struct aiocb). FreeBSD IO control block actually exists in userland memory
space, required by FreeBSD native aio_XXXXXX(2).
3. It is quite troubling that the function io_getevents() of libaio-0.3.105
needs to use Linux-specific "struct aio_ring", which is a partial mirror
of context in user space. I would rather take the address of context in
kernel as the context ID, but the io_getevents() of libaio forces me to
take the address of the "ring" in user space as the context ID.
To my surprise, one comment line in the file "io_getevents.c" of
libaio-0.3.105 reads:
Ben will hate me for this
REFERENCE:
1. Linux kernel source code: http://www.kernel.org/pub/linux/kernel/v2.6/
(include/linux/aio_abi.h, fs/aio.c)
2. Linux manual pages: http://www.kernel.org/pub/linux/docs/manpages/
(io_setup(2), io_destroy(2), io_getevents(2), io_submit(2), io_cancel(2))
3. Linux Scalability Effort: http://lse.sourceforge.net/io/aio.html
The design notes: http://lse.sourceforge.net/io/aionotes.txt
4. The package libaio, both source and binary:
http://rpmfind.net/linux/rpm2html/search.php?query=libaio
Simple transparent interface to Linux AIO system calls.
5. Libaio-oracle: http://oss.oracle.com/projects/libaio-oracle/
POSIX AIO implementation based on Linux AIO system calls (depending on
libaio).
---snip---
Submitted by: Li, Xiao <intron@intron.ac>
- Linux returns ENOPROTOOPT in a case of not supported opt to setsockopt.
- Return EISDIR in pread() when arg is a directory.
- Return EINVAL instead of EFAULT when namelen is not correct in accept().
- Return EINVAL instead of EACCESS if invalid access mode is entered in
access().
- Return EINVAL instead of EADDRNOTAVAIL in a case of bad salen param
to bind().
Submitted by: rdivacky
Tested with: LTP (vfork01 fails now, but it seems to be a race and
not caused by those changes)
MFC after: 1 week
- add support to power off the system [2]
- check the linux magic values [3]
Submitted by: Marcin Cieslak <saper@SYSTEM.PL> [1,2]
Modelled after: linux man page of the reboot() syscall [3]
Found by: LTP testcase "reboot02" [1]
Tested with: LTP testcase "reboot02" [1,3]
MFC after: 1 week
but further on -current (still not successful, but a step into the right
direction).
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Tested by: Paul Mather <paul@gromit.dlib.vt.edu>