some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
This patch fixes places where they should be called atomically changing
their locking requirements (both assume per-proc spinlock held) and
introducing rufetchcalc which wrappers both calls to be performed in
atomic way.
Reviewed by: jeff
Approved by: jeff (mentor)
- Unsafeness on ruadd() in thread_exit()
- Unatomicity of thread_exiit() in the exit1() operations
This patch addresses these problems allocating p_fd as part of the
process and modifying the way it is accessed.
A small chunk of this patch, resolves a race about p_state in kern_wait(),
since we have to be sure about the zombif-ing process.
Submitted by: jeff
Approved by: jeff (mentor)
implementing some of them using existing ones.
- Allow to compile ZFS on all archs and use atomic operations surrounded
by global mutex on archs we don't have or can't have all atomic
operations needed by ZFS.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.
Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.
In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported. In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.
1. Pass locking flags to VFS_ROOT().
2. Check v_mountedhere while the vnode is locked.
3. Always return locked vnode on success.
Change 1 fixes problem reported by Stephen M. Rumble - after
zfs_vfsops.c,1.9 change, zfs_root() no longer locks the vnode
unconditionally and traverse() didn't pass right lock type to
VFS_ROOT(). The result was that kernel paniced when .zfs/ directory
was accessed via NFS.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.
Requested by: alc
Approved by: jeff (mentor)
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.
Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)
debug is turned off, initialize locks with NOWITNESS flag.
At some point I'll get back to them, we would probably need BLESSING
functionality, which is currently turned off by default.
Implement all futex atomic operations in assembler to not depend on the
fuword() that does not allow to distinguish between -1 and failure return.
Correctly return 0 from atomic operations on success.
In collaboration with: rdivacky
Tested by: Scot Hetzel <swhetzel gmail com>, Milos Vyletel <mvyletel mzm cz>
Sponsored by: Google SoC 2007
same way it was enabled for Linux binares in linuxulator.
This allows binaries built with -pie. Many ports auto-detect -fPIE support
in GCC 4.2 and build binaries FreeBSD was unable to run.
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
processes under 64-bit kernels). Previously, each 32-bit process overwrote
its resource limits at exec() time. The problem with this approach is that
the new limits affect all child processes of the 32-bit process, including
if the child process forks and execs a 64-bit process. To fix this, don't
ovewrite the resource limits during exec(). Instead, sv_fixlimits() is
now replaced with a different function sv_fixlimit() which asks the ABI to
sanitize a single resource limit. We then use this when querying and
setting resource limits. Thus, if a 32-bit process sets a limit, then
that new limit will be inherited by future children. However, if the
32-bit process doesn't change a limit, then a future 64-bit child will
see the "full" 64-bit limit rather than the 32-bit limit.
MFC is tentative since it will break the ABI of old linux.ko modules (no
other modules are affected).
MFC after: 1 week
- Move FreeBSD-specific code to zfs_freebsd_*() functions in zfs_vnops.c
and keep original functions as similar to vendor's code as possible.
- Add various includes back, now that we have them.
@118370 Correct typo.
@118371 Integrate changes from vendor.
@118491 Show backtrace on unexpected code paths.
@118494 Integrate changes from vendor.
@118504 Fix sendfile(2). I had two ways of fixing it:
1. Fixing sendfile(2) itself to use VOP_GETPAGES() instead of
hacking around with vn_rdwr(UIO_NOCOPY), which was suggested
by ups.
2. Modify ZFS behaviour to handle this special case.
Although 1 is more correct, I've choosen 2, because hack from 1
have a side-effect of beeing faster - it reads ahead MAXBSIZE
bytes instead of reading page by page. This is not easy to implement
with VOP_GETPAGES(), at least not for me in this very moment.
Reported by: Andrey V. Elsukov <bu7cher@yandex.ru>
@118525 Reorganize the code to reduce diff.
@118526 This code path is expected. It is simply when file is opened with
O_FSYNC flag.
Reported by: kris
Reported by: Michal Suszko <dry@dry.pl>
on a snapshot directory:
- Remove PRIV_VFS_MOUNT check - regular users can mount snapshots
via lookups on snapshot directory.
- Reset mount credential to kcred, so user won't be able to unmount
the snapshot.
- Reset owner uid.
- Unlock vnode in case of a failure.
Reported by: simokawa
This fixes stange panics when listing .zfs/snapshot/ directory for me.
Reported by: simokawa
Reported by: Johan Hendriks <Johan@double-l.nl>
- Hide cache_purge() under FREEBSD_NAMECACHE like in other files.
- Protect mnt_flag with mount interlock.
sendmsg() while using a 0-length msg_controllen. This isn't allowed in
the FreeBSD system call ABI, so detect this case and set msg_control to
NULL. This allows Linux ping to work.
Submitted by: rdivacky
popular names. Hence:
- comment current index() and rindex() functions, as these serve the same
functionality as, respectively, strchr() and strrchr() from userland;
- add inlined version of strchr() and strrchr(), as we tend to use them more
often;
- remove str[r]chr() definitions from ZFS code;
Reviewed by: pjd
Approved by: cognet (mentor)
- Allow to shrink ARC down to 16MB (instead of 64MB).
- Set arc_max to 1/2 of kmem_map by default.
- Start freeing things earlier when low memory situation is detected.
- Serialize execution of arc_lowmem().
I decided to setup minimum ZFS memory requirements to 512MB of RAM and 256MB of
kmem_map size. If there is less RAM or kmem_map, a warning will be printed.
World is cruel, be no better. In other words: modern file system requires
modern hardware:)
From ZFS administration guide:
"Currently the minimum amount of memory recommended to install a Solaris
system is 512 Mbytes. However, for good ZFS performance, at least one
Gbyte or more of memory is recommended."
Linux SCSI SG passthrough device API. The intention is to allow for both
running of Linux apps that want to talk to /dev/sg* nodes, and to facilitate
porting of apps from Linux to FreeBSD. As such, both native and linuxolator
entry points and definitions are provided.
Caveats:
- This does not support the procfs and sysfs nodes that the Linux SG
driver provides. Some Linux apps may rely on these for operation,
others may only use them for informational purposes.
- More ioctls need to be implemented.
- Linux uses a naming scheme of "sg[a-z]" for devices, while FreeBSD uses a
scheme of "sg[0-9]". Devfs aliasis (symlinks) are automatically created
to link the two together. However, tools like camcontrol only see the
native names.
- Some operations were originally designed to return byte counts or other
data directly as the syscall return value. The linuxolator doesn't appear
to support this well, so this driver just punts for these cases.
Now that the driver is in place, others are welcome to add missing
functionality. Thanks to Roman Divacky for pushing this work along.
ZFS file system was ported from OpenSolaris operating system. The code in under
CDDL license.
I'd like to thank all SUN developers that created this great piece of software.
Supported by: Wheel LTD (http://www.wheel.pl/)
Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
Supported by: Sentex (http://www.sentex.net/)
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
seminfo because kernel_sysctlbyname() is slow. There is no dependency
problem since linux module depends on both sysvmsg and sysvsem and linprocfs
depends on it in turn.
Pointed out by: des
Reviewed by: des
function which is called from pfs_destroy() before the node is reclaimed.
Modify pfs_create_{dir,file,link}() to accept a pointer to a destructor
function in addition to the usual attr / fill / vis pointers.
This breaks both the programming and binary interfaces between pseudofs
and its consumers. It is believed that there are no pseudofs consumers
outside the source tree, so that the impact of this change is minimal.
Submitted by: Aniruddha Bohra <bohra@cs.rutgers.edu>
semi-automatic style(9)
The futex stuff already differs a lot (only a small part does not differ)
from NetBSD, so we are already way off and can't apply changes from NetBSD
automatically. As we need to merge everything by hand already, we can even
make the files comply to our world order.
- Dont "return" in linux_clone() after we forked the new process in a case
of problems.
- Move the copyout of p2->p_pid outside the emul_lock coverage in
linux_clone().
- Cache the em->pdeath_signal in a local variable and move the copyout
out of the emul_lock coverage.
- Move the free() out of the emul_shared_lock coverage in a preparation
to switch emul_lock to non-sleepable lock (mutex).
Submitted by: rdivacky
triggers a KASSERT) or local variables. In the case of kern_ndis, the
tsleep() actually used a common sleep address (curproc) making it
susceptible to a premature wakeup.
cannot change (because its referenced by curthread). This fixes
a LOR caused by acquiring emul_shared_lock while holding emul_lock.
Fix typo in comment.
Submitted by: rdivacky
p->p_emuldata is properly initialized in the time when the child can run.
Do not set p->p_emuldata to NULL when the process is exiting.
It does not make any sense and only costs 2 mutex operations.
Do not lock emul_data to unlock it on the very next line.
Comment on possible race while there.
Reparent all procs that are part of a threading group but not its leaders
to init and SIGCHLD init to finish the zombies off. This fixes zombies
left after opera's exit. [1]
There is no need to lock p_em in the linux_proc_init CLONE_THREAD
case because the process cannot change the address of the p_em->shared
because its currently running this code path.
Move assigning of em->shared outside emul_shared_lock.
Noticed by: Scott Robbins <scottro@nyc.rr.com> [1]
Submitted by: rdivacky
Dont expose em->shared to the outside world before its properly
initialized. Might not affect anything but its at least a better
coding style.
Dont expose em via p->p_emuldata until its properly initialized.
This also enables us to get rid of some locking and simplify the
code because we are workin on a local copy.
In linux_fork and linux_vfork create the process in stopped state
to be sure that the new process runs with fully initialized emuldata
structure [1]. Also fix the vfork (both in linux_clone and linux_vfork)
race that could result in never woken up process [2].
Reported by: Scot Hetzel [1]
Suggested by: jhb [2]
Reviewed by: jhb (at least some important parts)
Submitted by: rdivacky
Tested by: Scot Hetzel (on amd64)
Change 2 comments (in the new code) to comply to style(9).
Suggested by: jhb
to open() [1].
Improve locking for accessing session control structures [2].
Try to document (most likely harmless) races in the code [3].
Based on submission by: Intron (intron at intron ac) [1]
Reviewed by: jhb [2]
Discussed with: netchild, rwatson, jhb [3]
Now (ok it's been a while...) that FreeBSD has RLIMIT_AS too, we can use
it in the linuxolator instead of ignoring it.
This fixes a LTP test.
Submitted by: rdivacky
No need to lock prison in a case of linux_use26 because the int
setting is atomic and process cannot leave jail.
Submitted by: kib
Reviewed by: jhb
Requested by: rdivacky
Dont lock em in a case of just using em->shared->group_pid because
the group_pid never changes.
Submitted by: rdivacky
Reviewed by: kib
Glanced at by: jhb
Redo the checking for 2.6 emulation. We now cache the value of
use26 and replace calls to linux_get_osrelease() + parsing with
a call to linux_use26(). Typical path is lockless now.
Pointed out by: kib
This allows to ship RELENG_7_0 with a default osrelease of 2.4.2 and the
possibility to enable 2.6.x emulation without the possible performance
impact of the previous version of the check.
Submitted by: rdivacky
- Move linux_nanosleep() from src/sys/amd64/linux32/linux32_machdep.c to
src/sys/compat/linux/linux_time.c.
- Validate timespec ranges before use as Linux kernel does.
- Fix l_timespec structure.
- Clean up style(9) nits.
Add rudimentary IPC_INFO/MSG_INFO command support for linux_msgctl()
to pacify Linux ipcs(1). While I am here, add more bound checks
for linux_msgsnd() and linux_msgrcv().
Fixes for 'blocking in fifoor state' problem of LTP tests.
linux_*stat*() functions were opening files with O_RDONLY to get
major/minor pair for char/block special files. Unfortunately,
when these functions are used against fifo, it is blocked forever
because there is no writer. Instead, we only open char/block special
files for major/minor conversion. We have to get rid of kern_open()
entirely from translate_path_major_minor() but today is not the day.
While I am here, add checks for errors before calling
translate_path_major_minor().
Use TAILQ_FOREACH_SAFE instead of the unsafe one where an item is removed
from the queue.
This prevents a panic on kldunload.
Submitted by: rdivacky
Tested by: bsam
- Currently LINUX_MAX_COMM_LEN is smaller than MAXCOMLEN, but in case
this will change we have a buffer overflow. Apply some defensive
programming to DTRT when this should happen.
- Use copyinstr() instead of copyin where appropriate.
* Fallback to copyin() in case of ENAMETOOLONG. [1]
* Use the right source and destination (it was wrong before).
- Use strlcpy instead of strcpy.
- Properly lock the read case (PR_GET_NAME) like the write case.
Reviewed by: rwatson (except [1])
Suggested by: rwatson [1]
Add two linprocfs entries for Linux IPC:
/proc/sys/kernel/msgmni -> kern.ipc.msgmni
/proc/sys/kernel/sem -> kern.ipc.semmsl
kern.ipc.semmns
kern.ipc.semopm
kern.ipc.semmni
This fixes msgget03 and semget05 from Linux Test Project (LTP) test suite.
msgctl08 and msgctl09 also use /proc/sys/kernel/msgmni but another fix is
required from p4 (Change 110179).
Requested by: netchild
This fix lets clone02 LTP test pass with 2.6 emulation. In reality 99%
of the cases are that CLONE_VM and CLONE_THREAD are both set so it
seemed to work.
Submitted by: rdivacky
__WCLONE. This fixes it thus fixing skype/teamspeak to not keep zombies
after exit.
Submitted by: rdivacky
Reported by: Bakul Shah (bakul at bitblocks com)