Commit Graph

835 Commits

Author SHA1 Message Date
Alexander Leidinger
0d7b5e545c Staticize functions which are not used somewhere else, move the
corresponding prototypes from the header to the code file.
2011-03-15 13:40:47 +00:00
Dmitry Chagin
31f7ad1545 Style(9) fixes. No functional changes.
MFC after:	2 Week
2011-03-12 07:47:05 +00:00
John Baldwin
c28a98e948 Remove now-obsolete comment.
Submitted by:	netchild
MFC after:	1 week
2011-03-10 19:50:12 +00:00
Dmitry Chagin
a2cd91cf28 Indeed, remove bogus since r219405 check of the Linux ABI.
Pointed out:	jhb

MFC after:	2 Week
2011-03-09 05:59:33 +00:00
Dmitry Chagin
e5d81ef1b5 Extend struct sysvec with new method sv_schedtail, which is used for an
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.

Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.

While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.

Discussed with:	kib

MFC after:	2 Week
2011-03-08 19:01:45 +00:00
Dmitry Chagin
3a4bc25691 Print out shared flag for debug purpose.
MFC after:	1 Week
2011-03-03 18:29:55 +00:00
Dmitry Chagin
815cb72a0c Switch PROCESS_SHARE to AUTO_SHARE (as umtx do). Even for SHARED,
if page mapped MAP_ANON linux uses private algorithm too.

Disscussed with:	jhb

MFC after:	3 Days
2011-03-03 18:19:10 +00:00
John Baldwin
21f8f506fb Use umtx_key objects to uniquely identify futexes. Private futexes in
different processes that happen to use the same user address in the
separate processes will now be treated as distinct futexes rather than the
same futex.  We can now honor shared futexes properly by mapping them to a
PROCESS_SHARED umtx_key.  Private futexes use THREAD_SHARED umtx_key
objects.

In conjunction with:	dchagin
Reviewed by:	kib
MFC after:	1 week
2011-02-23 13:23:28 +00:00
Dmitry Chagin
f9e66923e5 Do not clobber %rdx.
Before calling vfork() syscall the linux user-space stores the current PID
in the %rdx and restore it when the parent process will leave the kernel.
2011-02-20 07:58:30 +00:00
Dmitry Chagin
09d6cb0a23 For realtime signals fill the sigval value. 2011-02-15 21:46:36 +00:00
Dmitry Chagin
f3481dd9ab Make a linux_rt_sigtimedwait() system call is actually working.
1) Translate the native signal number in the appropriate Linux signal.
2) Remove bogus code, which can lead to a panic as it calls
   kern_sigtimedwait with same ksiginfo.
3) Return the corresponding signal number.
2011-02-15 21:42:48 +00:00
Dmitry Chagin
8c50c56206 Style(9) fix. Wrap long lines in linux_rt_sigtimedwait(). 2011-02-15 21:24:50 +00:00
Dmitry Chagin
d207e753da Put the macro declaration in the relevant include file for future use. 2011-02-15 21:22:09 +00:00
Dmitry Chagin
e2ef00a426 Style(9) fix. Do not initialize variables in the declarations. 2011-02-14 17:24:58 +00:00
Dmitry Chagin
49fa1a745e Sort include files in the alphabetical order. 2011-02-13 20:07:48 +00:00
Dmitry Chagin
4ca49f41ec Remove comment about 'ftlk' LOR. 2011-02-13 18:46:34 +00:00
Dmitry Chagin
890c582fe5 Stop printing the LOR, as this is expected behavior. 2011-02-13 18:41:40 +00:00
Dmitry Chagin
7c3b05b99c The bitset field of freshly created futex should be initialized explicity.
Otherwise, REQUEUE operations fails.
2011-02-13 17:56:22 +00:00
Dmitry Chagin
d14cc07d07 Rename used_requeue and use it as bitwise field to store more flags.
Reimplement used_requeue logic with LINUX_XDEPR_REQUEUEOP flag.
2011-02-12 20:58:59 +00:00
Dmitry Chagin
cfa57401b0 Slightly rewrite linux_fork:
1) Remove bogus error checking.
2) A new process exit from kernel through fork_trampoline(),
   so remove bogus check.
2011-02-12 20:16:25 +00:00
Dmitry Chagin
9588e04dde Remove bogus include <machine/frame.h> 2011-02-12 19:14:57 +00:00
Dmitry Chagin
222198ab0b Move linux_clone(), linux_fork(), linux_vfork() to a MI path. 2011-02-12 18:17:12 +00:00
Alexander Leidinger
529844c77c Linux' shm_open() fails because it wants to find some funky shmfs
to construct the full pathname. It starts to search at the default
mountpoint which is /dev/shm. If this fails it runs through fstab
and searches for shmfs and tmpfs. Whatever it finds will be
statfs()'ed to be checked for Linux' fs magic for shmfs (0x01021994).

Ideally our tmpfs should deliver this fs magic to Linux processes, but
as our tmpfs is considered to be an experimental feature we can not
assume that there is always a tmpfs available.

To make shared memory work in the Linuxulator, force the fs type of
/dev/shm (which can be a symlink) to match what Linux expects. The user
is responsible (info has to be added to the linux base ports and the docs)
to setup a suitable link for /dev/shm.

Noticed by:	Andre Albsmeier <Andre.Albsmeier@siemens.com>
Submitted by:	Andre Albsmeier <Andre.Albsmeier@siemens.com>
MFC after:	1 month
2011-02-09 20:23:22 +00:00
Dmitry Chagin
78ec1867a2 Yet another unimplemented futex operation, print out about.
Submitted by:	arundel
MFC after:	1 month.
2011-01-31 06:06:23 +00:00
Dmitry Chagin
5163762354 Implement a futex BITSET op.
Submitted by:	arundel
MFC after:	1 month.
2011-01-31 05:59:05 +00:00
Dmitry Chagin
596ba1bd95 Style(9) fixes.
MFC after:	1 Month.
2011-01-28 19:04:15 +00:00
Dmitry Chagin
adc7ece00a Implement a variation of the linux_common_wait() which should
be used by linuxolator itself.

Move linux_wait4() to MD path as it requires native struct
rusage translation to struct l_rusage on linux32/amd64.

MFC after:	1 Month.
2011-01-28 18:47:07 +00:00
Dmitry Chagin
d908c2d2a2 Style(9) fix.
MFC after:	1 month.
2011-01-28 05:42:14 +00:00
Dmitry Chagin
9a6a64d3c4 Style(9) fix.
Approved by:	kib(mentor)
MFC after:	1 month
2011-01-23 09:50:39 +00:00
Konstantin Belousov
e225428f27 In linuxolator getdents_common(), it seems there is no reason to loop
if no records where returned by VOP_READDIR(). Readdir implementations
allowed to return 0 records when first record is larger then supplied
buffer. In this case trying to execute VOP_READDIR() again causes the
syscall looping forewer.

The goto was there from the day 1, which goes back to 1995 year.

Reported and tested by:	Beat G?tzi <beat chruetertee ch>
MFC after:   2 weeks
2011-01-19 12:19:25 +00:00
Sean Farley
506e9a3a87 Fix the LINUX_SOUND_MIXER_INFO ioctl to return success after the
information is set to FreeBSD.  It had been falling through to the end
of linux_ioctl_sound() and returning ENOIOCTL.  Noticed when running the
Linux ALSA amixer tool.

Add a LINUX_SOUND_MIXER_READ_CAPS ioctl which is used by the Skype
v2.1.0.81 binary.

Reviewed by:	gavin
MFC after:	2 weeks
2010-12-30 02:18:04 +00:00
Dimitry Andric
95353459ae Fix linux kernel module breakage introduced in r215675, by including
<sys/sysent.h>.

Noticed by:	many
Pointy hat to:	netchild
2010-11-22 20:23:18 +00:00
Alexander Leidinger
526384ecf2 Do not take the process lock. The assignment to u_short inside the
properly aligned structure is atomic on all supported architectures, and
the thread that should see side-effect of assignment is the same thread
that does assignment.

Use a more appropriate conditional to detect the linux ABI.

Suggested by:	kib
X-MFC:		together with r215664
2010-11-22 12:42:32 +00:00
Alexander Leidinger
5706ce8b58 Remove trailing dot from the unimplemented futex messages to make
them consistent with the syscall and ipc messages.

Submitted by:	arundel
MFC after:	3 days
2010-11-22 09:25:32 +00:00
Alexander Leidinger
bb63fdde6d By using the 32-bit Linux version of Sun's Java Development Kit 1.6
on FreeBSD (amd64), invocations of "javac" (or "java") eventually
end with the output of "Killed" and exit code 137.

This is caused by:
1. After calling exec() in multithreaded linux program threads are not
   destroyed and continue running. They get killed after program being
   executed finishes.

2. linux_exit_group doesn't return correct exit code when called not
   from group leader. Which happens regularly using sun jvm.

The submitters fix this in a similar way to how NetBSD handles this.

I took the PRs away from dchagin, who seems to be out of touch of
this since a while (no response from him).

The patches committed here are from [2], with some little modifications
from me to the style.

PR:		141439 [1], 144194 [2]
Submitted by:	Stefan Schmidt <stefan.schmidt@stadtbuch.de>, gk
Reviewed by:	rdivacky (in april 2010)
MFC after:	5 days
2010-11-22 09:06:59 +00:00
Alexander Leidinger
809290db9e Some style(9) fixes.
Submitted by:	arundel
MFC after:	1 week
2010-11-15 13:07:10 +00:00
Alexander Leidinger
be44a97cd9 - print out the PID and program name of the program trying to use an
unsupported futex operation
- for those futex operations which are known to be not supported,
  print out which futex operation it is
- shortcut the error return of the unsupported FUTEX_CLOCK_REALTIME in
  some cases:
    FUTEX_CLOCK_REALTIME can be used to tell linux to use
    CLOCK_REALTIME instead of CLOCK_MONOTONIC. FUTEX_CLOCK_REALTIME
    however must only be set, if either FUTEX_WAIT_BITSET or
    FUTEX_WAIT_REQUEUE_PI are set too. If that's not the case
    we can die with ENOSYS right at the beginning.

Submitted by:	arundel
Reviewed by:	rdivacky (earlier iteration of the patch)
MFC after:	1 week
2010-11-15 13:03:35 +00:00
Konstantin Belousov
113801819a Remove stale comment.
Submitted by:	arundel
MFC after:	3 days
2010-10-14 19:30:44 +00:00
Jung-uk Kim
2a9479393a Simplify timeout check in futex_wait() using itimerfix() and return error
if the given timeout is invalid.  Consistently use int type for timeout and
correct a format string in futex_sleep().
2010-10-06 18:51:22 +00:00
Alexander Leidinger
5e82f12aca Fix a comparision of an uninitialised pointer.
Submitted by:	arundel
Found by:	clang analysis (automatic service by uqs@)
Reviewed by:	rdivacky
2010-10-06 07:34:41 +00:00
Matthew D Fleming
4d369413e1 Replace sbuf_overflowed() with sbuf_error(), which returns any error
code associated with overflow or with the drain function.  While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.
2010-09-10 16:42:16 +00:00
John Baldwin
ad6eec7b9e Tweak the in-kernel API for sending signals to threads:
- Rename tdsignal() to tdsendsignal() and make it private to kern_sig.c.
- Add tdsignal() and tdksignal() routines that mirror psignal() and
  pksignal() except that they accept a thread as an argument instead of
  a process.  They send a signal to a specific thread rather than to an
  individual process.

Reviewed by:	kib
2010-06-29 20:41:52 +00:00
Wojciech A. Koszek
eedfc35c5c Bring USB fixes for linux(4).
Intention of this commit is to let us take a full advantage
of libusb(8) ported to Linux. This decreases a possibility of getting
any collisions within ioctl() "command" space, especially with
relation to  LINUX_SNDCTL_SEQ... stuff.

Basically, we provide commands, that will be mapped in the kernel
to correct ones and forward those to the USB layer. Port enabling
functionality brought with this patch is here:

	http://www.freebsd.org/cgi/query-pr.cgi?pr=146895

Bump __FreeBSD_version to catch, since which version installing a
port makes sense.

This patch should bring no regressions. So far, only i386 is tested.

Tested by:	thompsa@
Reviewed by:	thompsa@
OKed by:	netchild@
2010-05-24 07:04:00 +00:00
Alexander Leidinger
eddc400373 - #ifdef out the cliplist part, skype seems like using an uninitialized
variable and can cause problems, without the cliplist handling it works
  without problems
- improve the cliplist error handling
- fix VIDIOCGTUNER and VIDIOCSMICROCODE (still no hardware available to test)

Submitted by:	J.R. Oldroyd <jr@opal.com>
X-MFC after:	soon (together with all the v4l stuff)
2010-05-03 14:19:58 +00:00
Ed Schouten
510ea843ba Rename st_*timespec fields to st_*tim for POSIX 2008 compliance.
A nice thing about POSIX 2008 is that it finally standardizes a way to
obtain file access/modification/change times in sub-second precision,
namely using struct timespec, which we already have for a very long
time. Unfortunately POSIX uses different names.

This commit adds compatibility macros, so existing code should still
build properly. Also change all source code in the kernel to work
without any of the compatibility macros. This makes it all a less
ambiguous.

I am also renaming st_birthtime to st_birthtim, even though it was a
local extension anyway. It seems Cygwin also has a st_birthtim.
2010-03-28 13:13:22 +00:00
Alexander Leidinger
90782c0a14 Fix some problems which may lead to a panic:
- right order of src and dst in memcpy
 - NULL out the clips after freeing to prevent an accident

Noticed by:	hselasky
2010-03-26 08:42:11 +00:00
Ed Schouten
0fef797f4a Actually make O_DIRECTORY work.
According to POSIX open() must return ENOTDIR when the path name does
not refer to a path name. Change vn_open() to respect this flag. This
also simplifies the Linuxolator a bit.
2010-03-21 20:43:23 +00:00
Joel Dahl
2f7bcda248 The NetBSD Foundation has granted permission to remove clause 3 and 4 from
their software.

Obtained from:	NetBSD
2010-03-01 17:20:04 +00:00
Pawel Jakub Dawidek
957d68dd91 No need to include security/mac/mac_framework.h here. 2010-02-18 22:26:01 +00:00
Xin LI
5cb9c68cc9 - Return EAFNOSUPPORT instead of EINVAL for unsupported address family,
this matches the Linux behavior.
 - Check if we have sufficient space allocated for socket structure, which
   fixes a buffer overflow when wrong length is being passed into the
   emulation layer. [1]

PR:		kern/138860
Submitted by:	Mateusz Guzik <mjguzik gmail com>
Reported by:	Alexander Best [1]
MFC after:	2 weeks
2010-02-09 22:30:51 +00:00
Wojciech A. Koszek
edfe497ed4 Let us to use our libusb(3) in Linuxolator.
With this change, Linux binaries can work with our libusb(3) when
it's compiled against our header files on GNU/Linux system -- this
solves the problem with differences between /dev layouts.

With ported libusb(3), I am able to use my USB JTAG cable with Linux
binaries that support it.

Reviewed by:	thompsa
2010-01-18 22:46:06 +00:00
Alexander Leidinger
2883eb1ce1 Whitespace change to be able to provide the correct commit log for r202364:
---snip---
Add video clipping support but with the caveats below.

Background info:

Video clipping allows the user to provide either a series of clip rectangles
or a clip bitmap to the driver and have the driver mask the video according
to the clipping specs provided.

Adding support for clipping to the FreeBSD Linux emulator is problematic
because it seems that this feature is not supported by many drivers and
therefore it is ignored by many applications. Unfortunately, when not
using it, rather than passing in a null clipping list, some apps leave the
clipping fields uninitialized, casuing random values to be passed in. In
the case where the driver does not use the clipping info, this is not a
problem (although it is bad form). But the Linux emulator does not know
which drivers will use this and which won't, so the Linux emulator must
try to handle this clip list, and deal gracefully with cases where the
values seem to be uninitialized.

Video clipping info is passed in using the VIDIOCSWIN ioctl in two fields
in the video_window structure: the integer clipcount and the pointer clips.

How the linuxulator handles this from this commit on:

    * if (clipcount == VIDEO_CLIP_BITMAP)
      The clips variable is a void * pointer to a 128*625 byte
      (1024*625 bit) memory area containing a bitmap of the clipping area.
      The pointer in the video_window structure is copied, but no
      video_clip structures are copied.
    * if (clipcount > 0 && clipcount <= 16384)
      The clips variable is pointer to a list of video_clip structures. Up
      to clipcount structures are copied and passed to the driver.
      The upper limit of 16384 was imposed here so that user code that does
      not properly initialize clipcount falls through below and no attempt
      is made to copy an uninitialized list. This value was found by
      examining Linux drivers that support the clip list.
    * else
      The clipcount is either negative (but not VIDEO_CLIP_BITMAP), zero or
      positive (> 16384).
      All these cases are treated as invalid data. Both the clipcount field
      and clips pointer are forced to zero/NULL and passed to the driver.

It should be noted that, at the time of developing this V4L emulator code,
the pwc(4) V4L driver does not support clipping.

Submitted by:	J.R. Oldroyd <fbsd@opal.com>
MFC after:	1 month
---snip---
2010-01-15 15:38:31 +00:00
Alexander Leidinger
0f6800b944 This is v4l support for the linuxulator. This allows to access FreeBSD
native devices which support the v4l API from processes running within
the linuxulator, e.g. skype or flash can access the multimedia/pwcbsd driver.

Not tested is firmware upload, framebuffer stuff and video tuner stuff
due to lack of hardware.
The clipping part (VIDIOCSWIN) needs a little bit of further work (partly
in progress, but can not be tested due to lack of a suitable device).

The submitter tested this sucessfully with Skype and flash apps on amd64 and
i386 with the multimedia/pwcbsd driver.

Submitted by:	J.R. Oldroyd <fbsd@opal.com>
2010-01-15 14:58:19 +00:00
Brooks Davis
3ef5ae2dde Since all other comparisons involving ngroups_max use
"ngroups_max + 1", use ">= ngroups_max+1" instead of the equivalent
"> ngroups_max" to reduce confusion.
2010-01-15 07:05:00 +00:00
Brooks Davis
412f9500e2 Replace the static NGROUPS=NGROUPS_MAX+1=1024 with a dynamic
kern.ngroups+1.  kern.ngroups can range from NGROUPS_MAX=1023 to
INT_MAX-1.  Given that the Windows group limit is 1024, this range
should be sufficient for most applications.

MFC after:	1 month
2010-01-12 07:49:34 +00:00
Kirk McKusick
e268f54cb4 Background:
When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
    filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
    in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
    with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate priviledge.

Reported by:    	jeff
Security issues:	rwatson
2010-01-11 20:44:05 +00:00
Martin Blapp
c2ede4b379 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
Konstantin Belousov
9ae781dfcf Signal 0 is used to check the permission for current process to signal
target one. Since r184058, linux_do_tkill() calls tdsignal() instead of
kill(), without checking for validity of supplied signal number. Prevent
panic when supplied signal is 0 by finishing work after checks.

Found and tested by:	scf
MFC after:	3 days
2009-12-18 14:27:18 +00:00
Alexander Leidinger
7b6bedd3a7 This is v4l support for the linuxulator. This allows to access FreeBSD
native devices which support the v4l API from processes running within
the linuxulator, e.g. skype or flash can access the multimedia/pwcbsd driver.

Not tested is firmware upload, framebuffer stuff and video tuner stuff
due to lack of hardware.
The clipping part (VIDIOCSWIN) needs a little bit of further work (partly
in progress, but can not be tested due to lack of a suitable device).

The submitter tested this sucessfully with Skype and flash apps on amd64 and
i386 with the multimedia/pwcbsd driver.

Submitted by:	J.R. Oldroyd <fbsd@opal.com>
2009-12-04 21:06:54 +00:00
Alexander Leidinger
63f743fb25 Import the unchanged v4l videodev.h from the vendor branch. 2009-12-04 20:46:45 +00:00
Alexander Leidinger
f3d62ac43d Fix typo in kernel message. The fix is based upon the patch in the PR.
PR:		kern/140279
Submitted by:	Alexander Best <alexbestms@math.uni-muenster.de>
MFC after:	1 week
2009-11-05 07:37:48 +00:00
Bjoern A. Zeeb
d97bee3e7e Unconditionally call the setsockopt for IPV6_V6ONLY for v6 linux sockets
no matter whether we are compiled as module or if our default of the
net.inet6.ip6.v6only sysctl already matches what we would set.

This avoids unnecessary complications with modules, VIMAGES, INET6 and
the sysctl value, especially considering that most users will use
linux compat as a module.

Discussed with:	kib, rwatson (weeks ago)
Reviewed by:	rwatson
MFC after:	6 weeks
2009-10-25 09:58:56 +00:00
Marko Zec
ed539ef656 Lock the ifnet list while iterating over it.
Submitted by:	julian
MFC after:	3 days
2009-09-13 21:30:18 +00:00
Konstantin Belousov
b55ef216fe kern_select(9) copies fd_set in and out of userspace in quantities of
longs. Since 32bit processes longs are 4 bytes, 64bit kernel may copy in
or out 4 bytes more then the process expected.

Calculate the amount of bytes to copy taking into account size of fd_set
for the current process ABI.

Diagnosed and tested by:	Peter Jeremy <peterjeremy acm org>
Reviewed by:	jhb
MFC after:	1 week
2009-09-09 20:59:01 +00:00
Marko Zec
a26f987f5d Fix a few panics in linuxulator + VIMAGE due to curvnet not being set.
This change affects only options VIMAGE builds.

Reviewed by:	julian
MFC after:	3 days
2009-08-28 22:51:07 +00:00
Robert Watson
77dfcdc445 Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock.  Either can be held to stablize the lists and indexes, but both
are required to write.  This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions.  As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by:	bz, julian
MFC after:	3 days
2009-08-23 20:40:19 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Jamie Gritton
7cbf72137f Some jail parameters (in particular, "ip4" and "ip6" for IP address
restrictions) were found to be inadequately described by a boolean.
Define a new parameter type with three values (disable, new, inherit)
to handle these and future cases.

Approved by:	re (kib), bz (mentor)
Discussed with:	rwatson
2009-07-25 14:48:57 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Robert Watson
14961ba789 Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type.  This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by:	brooks
Approved by:	re (kib)
Obtained from:	TrustedBSD Project
MFC after:	1 week
2009-06-27 13:58:44 +00:00
John Baldwin
b648d4806b Change the ABI of some of the structures used by the SYSV IPC API:
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
  short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
  short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
  (this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
  removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
  int.  This removes the need for the shm_bsegsz member in struct
  shmid_kernel and should allow for complete support of SYSV SHM regions
  >= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
  short.
- The shm_internal member of struct shmid_ds is now gone.  The internal
  VM object pointer for SHM regions has been moved into struct
  shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
  now marked COMPAT7 and new versions of those system calls which support
  the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc.  The
  FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
  symbol versions has been added to libc.  Version tags are added to
  system calls by adding an appropriate __sym_compat() entry to
  src/lib/libc/incldue/compat.h. [1]

PR:		kern/16195 kern/113218 bin/129855
Reviewed by:	arch@, rwatson
Discussed with:	kan, kib [1]
2009-06-24 21:10:52 +00:00
Bjoern A. Zeeb
5736e6fb9d After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.
2009-06-23 17:03:45 +00:00
Brooks Davis
838d985825 Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively.  (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer.  Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively.  Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary.  In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups.  When feasible, truncate
the group list rather than generating an error.

Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since
    they are immutable once referenced by more than one entity.

Submitted by:	Isilon Systems (initial implementation)
X-MFC after:	never
PR:		bin/113398 kern/133867
2009-06-19 17:10:35 +00:00
Bjoern A. Zeeb
ebd8672cc3 Add explicit includes for jail.h to the files that need them and
remove the "hidden" one from vimage.h.
2009-06-17 15:01:01 +00:00
Jamie Gritton
9ed47d01eb Get vnets from creds instead of threads where they're available, and from
passed threads instead of curthread.

Reviewed by:	zec, julian
Approved by:	bz (mentor)
2009-06-15 19:01:53 +00:00
Dmitry Chagin
0046fd5dd9 Unlock process lock when return error from getrobustlist call.
Tested by:	Alexander Best <alexbestms at math uni-muenster de>
Approved by:	kib (mentor)
MFC after:	3 days
2009-06-14 17:53:55 +00:00
Jamie Gritton
7455b100af Add counterparts to getcredhostname:
getcreddomainname, getcredhostuuid, getcredhostid

Suggested by:	rmacklem
Approved by:	bz
2009-06-13 00:12:02 +00:00
Bjoern A. Zeeb
8d8bc0182e After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
2009-06-08 19:57:35 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Dmitry Chagin
f83427b833 Add forgotten in previous commit flags argument.
Approved by:	kib (mentor)
MFC after:	1 month
2009-06-01 20:54:41 +00:00
Dmitry Chagin
f8cd0af232 Implement accept4 syscall.
Approved by:	kib (mentor)
MFC after:	1 month
2009-06-01 20:48:39 +00:00
Dmitry Chagin
93e694c9df Implement a variation of the accept_common() which takes
a flags argument.

Do not preserve td_retval before kern_fcntl(F_SETFL) as it does not
changed.

Approved by:	kib (mentor)
MFC after:	1 month
2009-06-01 20:44:58 +00:00
Dmitry Chagin
c8f37d612d Split linux_accept() syscall onto linux_accept_common() which should
be used by linuxulator and linux_accept() itself.

Approved by:	kib (mentor)
MFC after:	1 month
2009-06-01 20:42:27 +00:00
Dmitry Chagin
39253cf9bb Implement a variation of the socketpair() syscall which takes a flags
in addition to the type argument.

Approved by:	kib (mentor)
MFC after:	1 month
2009-05-31 12:16:31 +00:00
Dmitry Chagin
38a18e9760 Move new socket flags handling into a separate function as Linux
introduced more syscalls which uses these flags.

Approved by:	kib (mentor)
MFC after:	1 month
2009-05-31 12:04:01 +00:00
Dmitry Chagin
20a4ff27b0 Remove empty lines.
Approved by:	kib (mentor)
MFC after:	1 month
2009-05-31 12:00:16 +00:00
Jamie Gritton
76ca6f88da Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex.  Jails may
have their own host information, or they may inherit it from the
parent/system.  The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL.  The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by:	bz (mentor)
2009-05-29 21:27:12 +00:00
Andriy Gapon
93f0eafde3 linux_ioctl_cdrom: reduce stack usage
... by moving two ~2KB structures from stack to heap allocation.
I experienced stack overflow in linux emulation on i386 (8K stack)
when LINUX_DVD_READ_STRUCT ioctl was performed on atapicam cd
device and there was an error that resulted in additional quite
heavy stack use in cam layer.

Reviewed by:	dchagin
Approved by:	jhb (mentor)
2009-05-27 15:23:12 +00:00
Jamie Gritton
0304c73163 Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails.  Child jails may be restricted more than their parents,
but never less.  Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system.  Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings.  The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by:	bz (mentor)
2009-05-27 14:11:23 +00:00
Dmitry Chagin
ea7b81d2bd Validate user-supplied arguments values.
Args argument is a pointer to the structure located in user space in
which the socketcall arguments are packed. The structure must be
copied to the kernel instead of direct dereferencing.

Approved by:	kib (mentor)
MFC after:	1 week
2009-05-19 09:10:53 +00:00
Dmitry Chagin
3a72bf04c4 Implement MSG_CMSG_CLOEXEC flag for linux_recvmsg().
Approved by:	kib (mentor)
MFC after:	1 month
2009-05-18 04:07:46 +00:00
Dmitry Chagin
3933bde22e Somewhere between 2.6.23 and 2.6.27, Linux added SOCK_CLOEXEC and
SOCK_NONBLOCK flags, that allow to save fcntl() calls.

Implement a variation of the socket() syscall which takes a flags
in addition to the type argument.

Approved by:	kib (mentor)
MFC after:	1 month
2009-05-16 18:48:41 +00:00
Dmitry Chagin
eeb63e515f Return EINVAL in case when the incorrect or unsupported
type argument is specified.

Do not map type argument value as its Linux values are
identical to FreeBSD values.

Approved by:	kib (mentor)
2009-05-16 18:46:51 +00:00
Dmitry Chagin
6994ea543f Use the protocol family constants for the domain argument validation.
Return immediately when the socket() failed.

Approved by:	kib (mentor)
MFC after:	1 month
2009-05-16 18:44:56 +00:00
Dmitry Chagin
d4dd69c46c Emulate SO_PEERCRED socket option.
Temporarily use 0 for pid member as the FreeBSD does not cache remote
UNIX domain socket peer pid.

PR:		kern/102956
Reviewed by:	rwatson
Approved by:	kib (mentor)
MFC after:	1 month
2009-05-16 18:42:18 +00:00
Dmitry Chagin
03cc95d21a Translate l_timeval arg to native struct timeval in
linux_setsockopt()/linux_getsockopt() for SO_RCVTIMEO,
SO_SNDTIMEO opts as l_timeval has MD members.

Remove bogus __packed attribute from l_timeval struct on __amd64__.

PR:		kern/134276
Submitted by:	Thomas Mueller <tmueller sysgo com>
Approved by:	kib (mentor)
MFC after:	2 weeks
2009-05-11 13:50:42 +00:00
Dmitry Chagin
3980a435a2 Add forgotten linux to bsd flags argument mapping into the linux_recv().
PR:		kern/134276
Submitted by:	Thomas Mueller <tmueller sysgo com>
Approved by:	kib (mentor)
MFC after:	2 weeks
2009-05-11 13:42:40 +00:00
Dmitry Chagin
8d30f381ef Do not export AT_CLKTCK when emulating Linux kernel prior
to 2.4.0, as it has appeared in the 2.4.0-rc7 first time.
Being exported, AT_CLKTCK is returned by sysconf(_SC_CLK_TCK),
glibc falls back to the hard-coded CLK_TCK value when aux entry
is not present.

Glibc versions prior to 2.2.1 always use hard-coded CLK_TCK value.

For older applications/libc's which depends on hard-coded CLK_TCK
value user should set compat.linux.osrelease less than 2.4.0.

Approved by:	kib (mentor)
2009-05-10 18:43:43 +00:00
Dmitry Chagin
580dd797fd Introduce linux_kernver() interface which is intended for an exact
designation of the emulated kernel version.

linux_kernver() returns integer value formatted as 'VVVMMMIII' where
VVV - version, MMM - major revision, III - minor revision.

Approved by:	kib (mentor)
2009-05-10 18:27:20 +00:00
Dmitry Chagin
1ca16454b3 Rework r189362, r191883.
The frequency of the statistics clock is given by stathz.
Use stathz if it is available, otherwise use hz.

Pointed out by:	bde

Approved by:	kib (mentor)
2009-05-10 18:16:07 +00:00