Commit Graph

723 Commits

Author SHA1 Message Date
Dmitry Chagin
79262bf1f0 Reimplement futexes.
Old implemention used Giant to protect the kernel data structures,
but at the same time called malloc(M_WAITOK), that could cause the
calling thread to sleep and lost Giant protection. User-visible
result was the missed wakeup.

New implementation uses one sx lock per futex. The sx protects
the futex structures and allows to sleep while copyin or copyout
are performed.

Unlike linux, we return EINVAL when FUTEX_CMP_REQUEUE operation
is requested and either caller specified futexes are equial or
second futex already exists. This is acceptable since the situation
can only occur from the application error, and glibc falls back to
old FUTEX_WAKE operation when FUTEX_CMP_REQUEUE returns an error.

Approved by:	kib (mentor)
MFC after:	1 month
2009-05-01 15:36:02 +00:00
Marko Zec
093f25f8c8 In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by:	bz (an older version of the patch)
Approved by:	julian (mentor)
2009-04-26 22:06:42 +00:00
Dmitry Chagin
b1121623d2 Remove support for FUTEX_REQUEUE operation.
Glibc does not use this operation since 2.3.3 version (Jun 2004),
as it is racy and replaced by FUTEX_CMP_REQUEUE operation.
Glibc versions prior to 2.3.3 fall back to FUTEX_WAKE when
FUTEX_REQUEUE returned EINVAL.

Any application directly using FUTEX_REQUEUE without return
value checking are definitely broken.

Limit quantity of messages per process about unsupported
operation.

Approved by:	kib (mentor)
MFC after:	1 month
2009-04-19 13:48:42 +00:00
Doug Ambrisko
d2b2128a28 Add stuff to support upcoming BMC/IPMI flashing of newer Dell machine
via the Linux tool.
     -  Add Linux shim to ipmi(4)
     -  Create a partitions file to linprocfs to make Linux fdisk see
        disks.  This file is dynamic so we can see disks come and go.
     -  Convert msdosfs to vfat in mtab since Linux uses that for
        msdosfs.
     -  In the Linux mount path convert vfat passed in to msdosfs
        so Linux mount works on FreeBSD.  Note that tasting works
        so that if da0 is a msdos file system
                /compat/linux/bin/mount /dev/da0 /mnt
        works.
     -  fix a 64it bug for l_off_t.
Grabing sh, mount, fdisk, df from Linux, creating a symlink of mtab to
/compat/linux/etc/mtab and then some careful unpacking of the Linux bmc
update tool and hacking makes it work on newer Dell boxes.  Note, probably
if you can't figure out how to do this, then you probably shouldn't be
doing it :-)
2009-03-26 17:14:22 +00:00
Dmitry Chagin
731aded86d Sort include files in the alphabetical order.
Approved by:	kib (mentor)
MFC after:	2 weeks
2009-03-16 05:39:37 +00:00
Dmitry Chagin
b41a7787e1 Ignore FUTEX_FD op, as it is done by linux.
Approved by:	kib (mentor)
MFC after:	2 weeks
2009-03-15 19:38:34 +00:00
Dmitry Chagin
3b8cbbded3 Include linux_futex.h before linux_emul.h
Approved by:	kib (mentor)
MFC after:	6 days
2009-03-15 19:16:12 +00:00
John Baldwin
2ee8325f42 A better fix for handling different FPU initial control words for different
ABIs:
- Store the FPU initial control word in the pcb for each thread.
- When first using the FPU, load the initial control word after restoring
  the clean state if it is not the standard control word.
- Provide a correct control word for Linux/i386 binaries under
  FreeBSD/amd64.
- Adjust the control word returned for fpugetregs()/npxgetregs() when a
  thread hasn't used the FPU yet to reflect the real initial control
  word for the current ABI.
- The Linux/i386 ABI for FreeBSD/i386 now properly sets the right control
  word instead of trashing whatever the current state of the FPU is.

Reviewed by:	bde
2009-03-05 19:42:11 +00:00
Dmitry Chagin
4d7c2e8a48 Add AT_PLATFORM, AT_HWCAP and AT_CLKTCK auxiliary vector entries which
are used by glibc. This silents the message "2.4+ kernel w/o ELF notes?"
from some programs at start, among them are top and pkill.

Do the assignment of the vector entries in elf_linux_fixup()
as it is done in glibc.

Fix some minor style issues.

Submitted by:	Marcin Cieslak <saper at SYSTEM PL>
Approved by:	kib (mentor)
MFC after:	1 week
2009-03-04 12:14:33 +00:00
Bjoern A. Zeeb
33553d6e99 For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
2009-02-27 14:12:05 +00:00
Ed Schouten
0eee862a54 Don't make Linux stat() open character devices to resolve its name.
The existing code calls kern_open() to resolve the vnode of a pathname
right after a stat(). This is not correct, because it causes random
character devices to be opened in /dev. This means ls'ing a tape
streamer will cause it to rewind, for example. Changes I have made:

- Add kern_statat_vnhook() to allow binary emulators to `post-process'
  struct stat, using the proper vnode.

- Remove unneeded printf's from stat() and statfs().

- Make the Linuxolator use kern_statat_vnhook(), replacing
  translate_path_major_minor_at().

- Let translate_fd_major_minor() use vp->v_rdev instead of
  vp->v_un.vu_cdev.

Result:

	crw-rw-rw- 1 root root   0, 14 Feb 20 13:54 /dev/ptmx
	crw--w---- 1 root adm  136,  0 Feb 20 14:03 /dev/pts/0
	crw--w---- 1 root adm  136,  1 Feb 20 14:02 /dev/pts/1
	crw--w---- 1 ed   tty  136,  2 Feb 20 14:03 /dev/pts/2

Before this commit, ptmx also had a major number of 136, because it
silently allocated and deallocated a pseudo-terminal. Device nodes that
cannot be opened now have proper major/minor-numbers.

Reviewed by:	kib, netchild, rdivacky (thanks!)
2009-02-20 13:05:29 +00:00
John Baldwin
ea77ff0a15 Use shared vnode locks when invoking VOP_READDIR().
MFC after:	1 month
2009-02-13 18:18:14 +00:00
Alexander Leidinger
0134f12f7a Fix an edge-case of the linux readdir: We need the size of a linux dirent
structure, not the size of a pointer to it.

PR:		131099
Submitted by:	Andreas Kies <andikies@gmail.com>
MFC after:	2 weeks
2009-02-13 11:55:19 +00:00
Ed Schouten
a4611ab612 Last step of splitting up minor and unit numbers: remove minor().
Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().

We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.

Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.

I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.
2009-01-28 17:57:16 +00:00
Ed Schouten
ddf9d24349 Push down Giant inside sysctl. Also add some more assertions to the code.
In the existing code we didn't really enforce that callers hold Giant
before calling userland_sysctl(), even though there is no guarantee it
is safe. Fix this by just placing Giant locks around the call to the oid
handler. This also means we only pick up Giant for a very short period
of time. Maybe we should add MPSAFE flags to sysctl or phase it out all
together.

I've also added SYSCTL_LOCK_ASSERT(). We have to make sure sysctl_root()
and name2oid() are called with the sysctl lock held.

Reviewed by:	Jille Timmermans <jille quis cx>
2008-12-29 12:58:45 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Konstantin Belousov
74f5d68011 Make linux_sendmsg() and linux_recvmsg() work on linux32/amd64.
Change types used in the linux' struct msghdr and struct cmsghdr
definitions to the properly-sized architecture-specific types.
Move ancillary data handler from linux_sendit() to linux_sendmsg().

Submitted by:	dchagin
2008-11-29 17:14:06 +00:00
Roman Divacky
7356a43c88 Document that all the other commands are either
identical to the FreeBSD ones or rejected by
kern_msgctl().

Found with:	Coverity Prevent(tm)
CID:	3456
Approved by:	kib (mentor)
2008-11-26 16:38:43 +00:00
Konstantin Belousov
62162dfc94 In the robust futexes list head, futex_offset shall be signed,
and glibc actually supplies negative offsets. Change l_ulong to l_long.

Submitted by:	dchagin
2008-11-16 15:45:41 +00:00
Ed Schouten
a1b5a8955e Mark uname(), getdomainname() and setdomainname() with COMPAT_FREEBSD4.
Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.

Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().

I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.

Reviewed by:	rdivacky, kib
2008-11-09 10:45:13 +00:00
Konstantin Belousov
17b9edd35a The code in linux_proc_exit() contains a race when multiple linux based
processes exits at the same time.  The linux_emuldata structure is freed
but p->p_emuldata is left as a dangling pointer to the just freed memory.

The check for W_EXIT in the loop scanning the child processes isn't safe
since the state of the child process can change right afterwards. Lock
the process and check the W_EXIT before delivering signal.

Submitted by:	tegge
Reviewed by:	davidxu
MFC after:	1 week
2008-10-31 10:38:30 +00:00
Edward Tomasz Napierala
15bc6b2bd8 Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by:	rwatson (mentor)
2008-10-28 13:44:11 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Konstantin Belousov
aa8b201112 Correctly fill siginfo for the signals delivered by linux tkill/tgkill.
It is required for async cancellation to work.

Fix PROC_LOCK leak in linux_tgkill when signal delivery attempt is made
to not linux process.

Do not call em_find(p, ...) with p unlocked.

Move common code for linux_tkill() and linux_tgkill() into
linux_do_tkill().

Change linux siginfo_t definition to match actual linux one. Extend
uid fields to 4 bytes from 2. The extension does not change structure
layout and is binary compatible with previous definition, because i386
is little endian, and each uid field has 2 byte padding after it.

Reported by:	Nicolas Joly <njoly pasteur fr>
Submitted by:	dchangin
MFC after:	1 month
2008-10-19 10:02:26 +00:00
Konstantin Belousov
175c6c319b Make robust futexes work on linux32/amd64. Use PTRIN to read
user-mode pointers. Change types used in the structures definitions to
properly-sized architecture-specific types.

Submitted by:	dchagin
MFC after:	1 week
2008-10-14 07:59:23 +00:00
Konstantin Belousov
68da8b22d2 Current linux_fooaffinity() emulation fails, as the FreeBSD affinity
syscalls expect the bitmap size in the range from 32 to 128. Old glibc
always assumed size 1024, while newer glibc searches for approriate
size, starting from 1024 and going up.

For now, use FreeBSD size of cpuset_t for bitmap size parameter and
return EINVAL if length of user space bitmap less than our size of
cpuset_t.

Submitted by:	dchagin
MFC after:	1 week
	[This requires MFC of the actual linux affinity syscalls]
2008-10-04 19:23:30 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
Edward Tomasz Napierala
9545354ed5 Fix usage of mac_vnode_check_open() in linuxulator - last argument
should be VREAD, not FREAD.

Approved by:	rwatson (mentor)
2008-09-22 18:59:24 +00:00
Roman Divacky
0d62170990 The ERESTART to EINTR conversion is already done in
kern_select so there is no need to repeat it in
linux_select().

Submitted by:	Dmitry Chagin <dchagin@>
MFC after:	1 week
Approved by:	kib (mentor)
2008-09-11 15:28:28 +00:00
Roman Divacky
2963584278 Getdents requires padding with 2 bytes instead of 1 byte
as with getdents64. The last byte is used for storing
the d_type, add this to plain getdents case where it was
missing before. Also change the code to use strlcpy instead
of plain strcpy. This changes fix the getdents crash we
had reports about (hl2 server etc.)

PR:		kern/117010
MFC after:	1 week
Submitted by:	Dmitry Chagin (dchagin@)
Tested by:	MITA Yoshio <mita ee.t.u-tokyo.ac jp>
Approved by:	kib (mentor)
2008-09-09 16:00:17 +00:00
Konstantin Belousov
745aaef5b5 Remove superfluous copyin() of args, structures are already in kernel space.
Submitted by:	dchagin
MFC after:	1 week
2008-09-09 13:01:14 +00:00
Attilio Rao
0359a12ead Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
Julian Elischer
1d89fc4ebe All opt_x.h includes go at the top of other includes. 2008-08-25 04:55:29 +00:00
Ed Schouten
bc093719ca Integrate the new MPSAFE TTY layer to the FreeBSD operating system.
The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

  The old TTY layer has a driver model that is not abstract enough to
  make it friendly to use. A good example is the output path, where the
  device drivers directly access the output buffers. This means that an
  in-kernel PPP implementation must always convert network buffers into
  TTY buffers.

  If a PPP implementation would be built on top of the new TTY layer
  (still needs a hooks layer, though), it would allow the PPP
  implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

  With the old TTY layer, it isn't entirely safe to destroy TTY's from
  the system. This implementation has a two-step destructing design,
  where the driver first abandons the TTY. After all threads have left
  the TTY, the TTY layer calls a routine in the driver, which can be
  used to free resources (unit numbers, etc).

  The pts(4) driver also implements this feature, which means
  posix_openpt() will now return PTY's that are created on the fly.

- Improved performance:

  One of the major improvements is the per-TTY mutex, which is expected
  to improve scalability when compared to the old Giant locking.
  Another change is the unbuffered copying to userspace, which is both
  used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from:		//depot/projects/mpsafetty/...
Approved by:		philip (ex-mentor)
Discussed:		on the lists, at BSDCan, at the DevSummit
Sponsored by:		Snow B.V., the Netherlands
dcons(4) fixed by:	kan
2008-08-20 08:31:58 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Ed Schouten
b377be43a5 Add TIOCPKT and TIOCSPTLCK to the Linuxolator.
We're very lucky, because the flags used by our TIOCPKT implementation
are the same as flags used by Linux. We can safely enable TIOCPKT,
assuming EXTPROC is not used.

TIOCSPTLCK is used by unlockpt(). Because we don't need unlockpt() in
our implementation, make this ioctl a no-op.

Approved by:	philip (mentor, implicit), rdivacky
Obtained from:	P4 (//depot/projects/mpsafetty/...)
2008-07-23 17:47:44 +00:00
Roman Divacky
0864e2a4f1 Fix linux_alarm, the linux behaviour is to limit the
secs to INT_MAX when the passed in parameter is bigger
than INT_MAX.

Submitted by:	Dmitry Chagin <chagin.dmitry gmail com>
Approved by:	kib (mentor)
2008-07-23 17:19:02 +00:00
Robert Watson
4f7d1876d5 Introduce a new lock, hostname_mtx, and use it to synchronize access
to global hostname and domainname variables.  Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout().  A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.

Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.

MFC after:	3 weeks
2008-07-05 13:10:10 +00:00
Roman Divacky
2e1a489300 d_ino member of linux_dirent structure should be unsigned long.
Submitted by:	Chagin Dmitry <chagin.dmitry@gmail.com>
Approved by:	kib (mentor)
2008-06-08 11:09:25 +00:00
Roman Divacky
a47444d525 Switch to emulating Linux 2.6 on default.
Approved by:	kib (mentor)
2008-06-03 17:50:13 +00:00
Ed Schouten
a147e6cadf Push down the major/minor conversion for pts/%u to improve consistency.
In the mpsafetty branch, Linux sshd seems to work properly inside a
jail. Some small modifications had to be made to the Linux compatibility
layer.

The Linux PTY routines always expect the device major number to be 136
or higher. Our code always set the major/minor number pair to 136:0.
This makes routines like ttyname() and ptsname() fail, because we'll end
up having ambiguous device numbers.

The conversion was not performed on all *stat() routines, which meant in
some cases the numbers didn't get transformed. By pushing the conversion
into linux_driver_get_major_minor(), the transformation will take place
on all calls.

Approved by:	philip (mentor), rdivacky
2008-06-02 08:40:06 +00:00
Roman Divacky
4732e446fb Implement robust futexes. Most of the code is modelled after
what Linux does. This is because robust futexes are mostly
userspace thing which we cannot alter. Two syscalls maintain
pointer to userspace list and when process exits a routine
walks this list waking up processes sleeping on futexes
from that list.

Reviewed by:	kib (mentor)
MFC after:	1 month
2008-05-13 20:01:27 +00:00
Roman Divacky
a6d043e30d Implement linux_truncate64() syscall.
Tested by:	Aline de Freitas <aline@riseup.net>
Approved by:	kib (mentor)
2008-04-23 15:56:33 +00:00
Roman Divacky
872cbe6466 Remove using magic value of -1 to distinguish between linux_open()
and linux_openat(). Instead just pass AT_FDCWD into linux_common_open()
for the linux_open() case. This prevents passing -1 as a dirfd to
openat() from succeeding which is wrong.

Suggested by:	rwatson, kib
Approved by:	kib (mentor)
2008-04-09 16:42:50 +00:00
Konstantin Belousov
48b05c3f82 Implement the linux syscalls
openat, mkdirat, mknodat, fchownat, futimesat, fstatat, unlinkat,
    renameat, linkat, symlinkat, readlinkat, fchmodat, faccessat.

Submitted by:	rdivacky
Sponsored by:	Google Summer of Code 2007
Tested by:	pho
2008-04-08 09:45:49 +00:00
Konstantin Belousov
57b4252e45 Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:01:21 +00:00
Doug Rabson
dfdcada31e Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
  client handle safely with replies being de-multiplexed at the socket
  upcall (typically driven directly by the NIC interrupt) and handed
  off to whichever thread matches the reply. For UDP sockets, many RPC
  clients can share the same socket. This allows the use of a single
  privileged UDP port number to talk to an arbitrary number of remote
  hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
  server would be relatively straightforward and would follow
  approximately the Solaris KPI. A single thread should be sufficient
  for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
  callbacks. I've tested the NLM server reasonably extensively - it
  passes both my own tests and the NFS Connectathon locking tests
  running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
  support for the local NFS client's locking needs, it does have to
  field async replies and granted callbacks from remote NLMs that the
  local client has contacted. We relay these replies to the userland
  rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
  it will detect deadlocks caused by a lock request that covers more
  than one blocking request. As required by the NLM protocol, all
  deadlock detection happens synchronously - a user is guaranteed that
  if a lock request isn't rejected immediately, the lock will
  eventually be granted. The old system allowed for a 'deferred
  deadlock' condition where a blocked lock request could wake up and
  find that some other deadlock-causing lock owner had beaten them to
  the lock.

* Since both local and remote locks are managed by the same kernel
  locking code, local and remote processes can safely use file locks
  for mutual exclusion. Local processes have no fairness advantage
  compared to remote processes when contending to lock a region that
  has just been unlocked - the local lock manager enforces a strict
  first-come first-served model for both local and remote lockers.

Sponsored by:	Isilon Systems
PR:		95247 107555 115524 116679
MFC after:	2 weeks
2008-03-26 15:23:12 +00:00
Ruslan Ermilov
d7a38db650 Fix build.
Reported by:	ache, tinderbox
2008-03-25 13:20:52 +00:00
Roman Divacky
6af821237d o Add stub support for some new futex operations,
so the annoying message is not printed.

	o	Don't warn about FUTEX_FD not being implemented
		and return ENOSYS instead of 0 (eg. success).

	o	Clear FUTEX_PRIVATE_FLAG as we actually implement
		only private futexes so there is no reason to
		return ENOSYS when app asks for a private futex.
		We don't reject shared futexes because they worked
		just fine with our implementation so far.

Approved by:	kib (mentor)
Tested by:	bsam
MFC after:	1 week
2008-03-20 17:03:55 +00:00
Roman Divacky
5dfb688191 Implement sched_setaffinity and get_setaffinity using
real cpu affinity setting primitives.

Reviewed by:	jeff
Approved by:	kib (mentor)
2008-03-16 16:27:44 +00:00
Konstantin Belousov
a0b0d286bc Return ENOSYS instead of 0 for the unknown futex operations.
Submitted by: rdivacky
Reported and tested by: Gary Stanley <gary velocity-servers net>
2008-03-02 14:00:50 +00:00
Konstantin Belousov
cbd2c621f8 Sanitize arguments to linux_mremap().
Check that only MREMAP_FIXED and MREMAP_MAYMOVE flags are specified.
Check for the page alignment of the addr argument.

Submitted by:	rdivacky
MFC after:	1 week
2008-02-22 11:47:56 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Attilio Rao
cb05b60a89 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
Konstantin Belousov
d075105da0 After applying LCONVPATH() to the path, do use the converted path
instead of original user-mode string in the linux_stat() and
linux_lstat() syscalls.

Tested by:	Peter Holm
MFC after:	3 days
2008-01-05 12:36:35 +00:00
Konstantin Belousov
93eba2d50d Plug the leaks in the present (hopefully, soon to be replaced)
implementation of the linux_openat() for the quick MFC.

Reported and tested by: Peter Holm
MFC after:      3 days
2007-12-29 14:28:01 +00:00
Konstantin Belousov
15b78ac5d1 Apply the LCONVPATH() to the (old) linux_stat() and linux_lstat() syscalls.
Without it, code has two problems:
- behaviour of the old and new [l]stat are different with regard of
  the /compat/linux
- directly accessing the userspace data from the kernel asks for
  the panics.

Reported and tested by:	Peter Holm
Reviewed by:	rdivacky
MFC after:	3 days
2007-12-29 14:25:29 +00:00
Konstantin Belousov
d60f0a3d6a Implement LINUX_SIOCGIFCOUNT and LINUX_SIOCGIFINDEX/LINUX_SIOGIFINDEX.
LINUX_SIOCGIFCOUNT just returns 0 since it is not implemented in the
Linux 2.6.16.

LINUX_SIOCGIFINDEX/LINUX_SIOGIFINDEX are mapped to the FreeBSD native
SIOCGIFINDEX.

Tested by:	Peter Kostouros <kpeter@melbpc.org.au>
Reviewed by:	brooks, rpaulo (on net@)
Submitted by:	rdivacky
MFC after:	1 week
2007-11-07 16:42:52 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
David Malone
3ab8526963 The kernel version of Linux statfs64 is actually supposed to take
3 arguments, but we had forgotten the second argument. Also make the
Linux statfs64 struct depend on the architecture because it has an
extra 4 bytes padding on amd64 compared to i386.

The three argument fix is from David Taylor, the struct statfs64
stuff is my fault. With this patch I can install i386 Linux matlab
on an amd64 machine.

Submitted by: David Taylor <davidt_at_yadt.co.uk>
Approved by: re (kensmith)
2007-09-18 19:50:33 +00:00
Konstantin Belousov
b6e645c90f Implement fake linux sched_getaffinity() syscall to enable java to work
with Linux 2.6 emulation. This shall be reimplemented once FreeBSD gets
native scheduler affinity syscalls.

Submitted by:	rdivacky
Reviewed by:	jkim
Sponsored by:	Google Summer of Code 2007
Approved by:	re (kensmith)
2007-08-28 12:26:35 +00:00
Robert Watson
0bf686c125 Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet.  As that
has now been removed, they are no longer required.  Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option.  Clean up some related gotos for
consistency.

Reviewed by:	bz, csjp
Tested by:	kris
Approved by:	re (kensmith)
2007-08-06 14:26:03 +00:00
Peter Wemm
79d5bdcca5 Don't add the 'pad' argument to the mmap/truncate/etc syscalls.
Submitted by: kensmith
Approved by: re (kensmith)
2007-07-04 23:06:43 +00:00
Robert Watson
32f9753cfb Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths.  Do, however, move those prototypes to priv.h.

Reviewed by:	csjp
Obtained from:	TrustedBSD Project
2007-06-12 00:12:01 +00:00
Matt Jacob
2ba956ed13 Ensure that newpath is always initialized, even for the error case. 2007-06-10 04:37:22 +00:00
Attilio Rao
a1fe14bc33 rufetch and calcru sometimes should be called atomically together.
This patch fixes places where they should be called atomically changing
their locking requirements (both assume per-proc spinlock held) and
introducing rufetchcalc which wrappers both calls to be performed in
atomic way.

Reviewed by: jeff
Approved by: jeff (mentor)
2007-06-09 21:48:44 +00:00
Attilio Rao
2feb50bf7d Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
Konstantin Belousov
9e223287c0 Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by:	jhb
Reviewed by:	daichi (unionfs)
Approved by:	re (kensmith)
2007-05-31 11:51:53 +00:00
Konstantin Belousov
1c182de9a9 Move futex support code from <arch>/support.s into linux compat directory.
Implement all futex atomic operations in assembler to not depend on the
fuword() that does not allow to distinguish between -1 and failure return.
Correctly return 0 from atomic operations on success.

In collaboration with:	rdivacky
Tested by:	Scot Hetzel <swhetzel gmail com>, Milos Vyletel <mvyletel mzm cz>
Sponsored by:	Google SoC 2007
2007-05-23 08:33:06 +00:00
Jeff Roberson
222d01951f - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
Robert Watson
d72a615878 Some Linux applications (ping) pass a non-NULL msg_control argument to
sendmsg() while using a 0-length msg_controllen.  This isn't allowed in
the FreeBSD system call ABI, so detect this case and set msg_control to
NULL.  This allows Linux ping to work.

Submitted by:	rdivacky
2007-04-14 10:35:09 +00:00
Scott Long
6eef46be3b Whitespace fixes 2007-04-10 21:37:37 +00:00
Scott Long
1eba4c7948 Add the CAM 'SG' peripheral device. This device implements a subset of the
Linux SCSI SG passthrough device API.  The intention is to allow for both
running of Linux apps that want to talk to /dev/sg* nodes, and to facilitate
porting of apps from Linux to FreeBSD.  As such, both native and linuxolator
entry points and definitions are provided.

Caveats:
 - This does not support the procfs and sysfs nodes that the Linux SG
   driver provides.  Some Linux apps may rely on these for operation,
   others may only use them for informational purposes.
 - More ioctls need to be implemented.
 - Linux uses a naming scheme of "sg[a-z]" for devices, while FreeBSD uses a
   scheme of "sg[0-9]".  Devfs aliasis (symlinks) are automatically created
   to link the two together.  However, tools like camcontrol only see the
   native names.
 - Some operations were originally designed to return byte counts or other
   data directly as the syscall return value.  The linuxolator doesn't appear
   to support this well, so this driver just punts for these cases.

Now that the driver is in place, others are welcome to add missing
functionality.  Thanks to Roman Divacky for pushing this work along.
2007-04-07 19:40:58 +00:00
Robert Watson
5e3f7694b1 Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock.  This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention.  All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently.  Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
  acquisisition of the filedesc lock; the plan is that they will now all
  be fast.  Change all locking instances to either shared or exclusive
  locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
  was called without the mutex held; sx_sleep() is now always called with
  the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
  rather than the filedesc lock or no lock.  Always update the f_ops
  field last. A further memory barrier is required here in the future
  (discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
  properly acquire vnode references before using vnode pointers.  Annotate
  improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by:	kris
Discussed with:	jhb, kris, attilio, jeff
2007-04-04 09:11:34 +00:00
Jung-uk Kim
357afa7113 MFP4: Turn emul_lock into a mutex.
Submitted by:	rdivacky
2007-04-02 18:38:13 +00:00
Jung-uk Kim
a328699b34 MFP4: Linux futex support for amd64.
Initial patch was submitted by kib and additional work was done
by Divacky Roman.

Tested by:	emulation
2007-03-30 01:07:28 +00:00
Julian Elischer
6734f35eac Implement the openat() linux syscall
Submitted by:	Roman Divacky (rdivacky@)
MFC after:	2 weeks
2007-03-29 02:11:46 +00:00
Robert Watson
b77ad8fc3b In translate_path_major_minor(), do not calculate otherwise unused 'fp'
variable, avoiding an extra locking of the file descriptor array.
2007-03-06 07:39:12 +00:00
Jung-uk Kim
a4e3bad794 MFP4: 115220, 115222
- Fix style(9) and reduce diff between amd64 and i386.
- Prefix Linuxulator macros with LINUX_ to prevent future collision.
2007-03-02 00:08:47 +00:00
Alexander Leidinger
8cf5ee2e2a MFp4 (110541):
Sync with rev 1.7 in NetBSD.

	Obtained from:	NetBSD
2007-02-25 12:43:07 +00:00
Alexander Leidinger
f9dac96185 MFp4 (110523, parts which apply cleanly):
semi-automatic style(9)

The futex stuff already differs a lot (only a small part does not differ)
from NetBSD, so we are already way off and can't apply changes from NetBSD
automatically. As we need to merge everything by hand already, we can even
make the files comply to our world order.
2007-02-25 12:40:35 +00:00
Alexander Leidinger
802e08a360 Partial MFp4 of 114977:
Whitespace commit: Fix grammar, spelling and punctuation.

Submitted by:	"Scot Hetzel" <swhetzel@gmail.com>
2007-02-24 16:49:25 +00:00
Alexander Leidinger
1a26db0a3a MFp4 (114193 (i386 part), 114194, 114195, 114200):
- Dont "return" in linux_clone() after we forked the new process in a case
   of problems.
 - Move the copyout of p2->p_pid outside the emul_lock coverage in
   linux_clone().
 - Cache the em->pdeath_signal in a local variable and move the copyout
   out of the emul_lock coverage.
 - Move the free() out of the emul_shared_lock coverage in a preparation
   to switch emul_lock to non-sleepable lock (mutex).

Submitted by:	rdivacky
2007-02-23 22:39:26 +00:00
Alexander Leidinger
e8b8b834b4 MFp4 (part of 114132):
- Fix a LOR caused by holding emul_lock and proctree_lock at once.

Submitted by:	rdivacky
2007-02-23 22:29:24 +00:00
Konstantin Belousov
b4bb515484 Remove extern int hz; use proper include file instead. 2007-02-02 08:58:16 +00:00
Konstantin Belousov
d0b2365eec Introduce some more SO_ option equivalents from Linux to FreeBSD.
The msg variable in linux_recvmsg() was not initialized.
Copy it from userspace.

Submitted by: rdivacky
2007-02-01 13:36:19 +00:00
Konstantin Belousov
75ee4e5462 No need to lock emul_lock in exit_group() because em->shared
cannot change (because its referenced by curthread). This fixes
a LOR caused by acquiring emul_shared_lock while holding emul_lock.

Fix typo in comment.

Submitted by: rdivacky
2007-02-01 13:33:33 +00:00
Konstantin Belousov
25954d7430 No need to synchronize linux_schedtail with linux_proc_init.
p->p_emuldata is properly initialized in the time when the child can run.

Do not set p->p_emuldata to NULL when the process is exiting.
It does not make any sense and only costs 2 mutex operations.

Do not lock emul_data to unlock it on the very next line.
Comment on possible race while there.

Reparent all procs that are part of a threading group but not its leaders
to init and SIGCHLD init to finish the zombies off. This fixes zombies
left after opera's exit. [1]

There is no need to lock p_em in the linux_proc_init CLONE_THREAD
case because the process cannot change the address of the p_em->shared
because its currently running this code path.
Move assigning of em->shared outside emul_shared_lock.

Noticed by: Scott Robbins <scottro@nyc.rr.com> [1]
Submitted by: rdivacky
2007-02-01 13:29:27 +00:00
Alexander Leidinger
d071f5048c MFp4 (113077, 113083, 113103, 113124, 113097):
Dont expose em->shared to the outside world before its properly
	initialized. Might not affect anything but its at least a better
	coding style.

	Dont expose em via p->p_emuldata until its properly initialized.
	This also enables us to get rid of some locking and simplify the
	code because we are workin on a local copy.

	In linux_fork and linux_vfork create the process in stopped state
	to be sure that the new process runs with fully initialized emuldata
	structure [1]. Also fix the vfork (both in linux_clone and linux_vfork)
	race that could result in never woken up process [2].

Reported by:	Scot Hetzel	[1]
Suggested by:	jhb		[2]
Reviewed by:	jhb (at least some important parts)
Submitted by:	rdivacky
Tested by:	Scot Hetzel (on amd64)

Change 2 comments (in the new code) to comply to style(9).

Suggested by:	jhb
2007-01-20 14:58:59 +00:00
Konstantin Belousov
4349c6ba29 Add support for LINUX_O_DIRECT, LINUX_O_DIRECT and LINUX_O_NOFOLLOW flags
to open() [1].
Improve locking for accessing session control structures [2].
Try to document (most likely harmless) races in the code [3].

Based on submission by:	Intron (intron at intron ac) [1]
Reviewed by:		jhb [2]
Discussed with:		netchild, rwatson, jhb [3]
2007-01-18 09:32:08 +00:00
Alexander Leidinger
17011df1e1 MFp4 (112379):
Implement SETALL/GETALL IPC primitives. This fixes some LTP testcases and
LabView is able to proceed a little bit further.

Submitted by:	rdivacky
2007-01-14 16:34:43 +00:00
Alexander Leidinger
31becc7692 MFp4 (112705):
Inherit setting of the default emulation version to the jails.

Pointed out by:	jhb
Submitted by:	rdivacky
2007-01-14 16:07:01 +00:00
Alexander Leidinger
a849401985 MFp4 (112646):
Now (ok it's been a while...) that FreeBSD has RLIMIT_AS too, we can use
it in the linuxolator instead of ignoring it.

This fixes a LTP test.

Submitted by:	rdivacky
2007-01-07 19:30:19 +00:00
Alexander Leidinger
bb419e1b5b MFp4 (112535):
No need to lock prison in a case of linux_use26 because the int
setting is atomic and process cannot leave jail.

Submitted by:	kib
Reviewed by:	jhb
Requested by:	rdivacky
2007-01-07 19:20:17 +00:00
Alexander Leidinger
0ed6f09c4e MFp4 (112534):
Dont lock em in a case of just using em->shared->group_pid because
the group_pid never changes.

Submitted by:	rdivacky
Reviewed by:	kib
Glanced at by:	jhb
2007-01-07 19:14:06 +00:00
Alexander Leidinger
291081ce0a MFp4 (112499):
Protect em->shared with the lock in case of CLONE_THREAD.

Submitted by:	rdivacky
2007-01-07 19:09:20 +00:00
Alexander Leidinger
1c65504ca8 MFp4 (112498):
Rename the locking flags to EMUL_DOLOCK and EMUL_DONTLOCK to prevent confusion.

Submitted by:	rdivacky
2007-01-07 19:00:38 +00:00
Xin LI
59038483f5 Fix amd64 build.
Submitted by:	Divacky Roman <xdivac02 stud fit vutbr cz>
2007-01-01 14:47:45 +00:00
Alexander Leidinger
c9447c7551 MFp4 (111746, 108671, 108945, 112352):
- add linux utimes syscall [1]
 - add linux rt_sigtimedwait syscall [2]

Submitted by:	"Scot Hetzel" <swhetzel@gmail.com> [1]
Submitted by:	Bruce Becker <hostmaster@whois.gts.net> [2]
PR:		93199 [2]
2006-12-31 13:16:00 +00:00
Alexander Leidinger
a628609ee9 MFp4:
- semi-automatic style fixes
2006-12-31 12:42:55 +00:00