10796 Commits

Author SHA1 Message Date
Alexander Kabaev
6ee7dd87ba Shared memory objects that have size which is not necessarily equal to
exact multiple of system page size should still be allowed to be mapped
in their entirety to match the regular vnode backed file behavior.

Reported by: ed
Reviewed by: jhb
2008-12-01 22:33:50 +00:00
Ken Smith
f34015d47b Catch up with the disappearance of sys/dev/hfa. 2008-12-01 14:34:42 +00:00
Attilio Rao
ccc55b33b7 Fix an inverted check introduced in r184554.
Submitted by:	tegge
Pointy hat to:	me
2008-12-01 03:00:26 +00:00
David Xu
6d9b63d6c8 Revision 184199 had not been fully reverted, add missing piece.
Reported by: phk
2008-12-01 01:54:55 +00:00
Bjoern A. Zeeb
d465a41d3b Unbreak the no-networks (no INET/6) build that I broke with
the commit in r185435.

Pointyhat:	no, but I could need a ski cap for the winter
2008-11-29 16:17:39 +00:00
Bjoern A. Zeeb
413628a7e3 MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
  and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
  help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
  suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
  on cluster machines as well as all the testers and people
  who provided feedback the last months on freebsd-jail and
  other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by:	(see above)
MFC after:	3 months (this is just so that I get the mail)
X-MFC Before:   7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
Konstantin Belousov
6179164448 In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.

Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.

Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.

Noted by:	tegge
Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2008-11-29 13:34:59 +00:00
Pawel Jakub Dawidek
0c6a80e78d Improve KASSERT() call a bit:
- Print flags in hex.
- Note that flags can be fine and panic can be due unexpected error condition.
- Remove redundant new line character.

Eventhough panic message excess 80 characters keep it in one line so it is
easier to grep.
2008-11-29 12:40:14 +00:00
Bjoern A. Zeeb
b9f0b66c75 With the permissions of phk@ change the license on kern_jail.c
to a 2 clause BSD license.
2008-11-28 19:23:46 +00:00
Ed Schouten
1cbae70533 Fix matching of message queues by name.
The mqfs_search() routine uses strncmp() to match message queue objects
by name. This is because it can be called from environments where the
file name is not null terminated (the VFS for example).

Unfortunately it doesn't compare the lengths of the message queue names,
which means if a system has "Queue12345", the name "Queue" will also
match.

I noticed this when a student of mine handed in an exercise using
message queues with names "Queue2" and "Queue".

Reviewed by:	rink
2008-11-28 14:53:18 +00:00
Konstantin Belousov
b7a813fc21 Explicitely note that destroy_dev() sleeps.
Requested by:	ed (some time ago), Jaakko Heinonen <jh saunalahti fi>
2008-11-27 16:47:25 +00:00
Ganbold Tsagaankhuu
559b717f5e Remove unused variable.
Found with:     Coverity Prevent(tm)
CID: 3664

Approved by: kib
2008-11-27 04:40:37 +00:00
Marko Zec
97021c2464 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
Joe Marcus Clarke
ef61995ebd Move vn_fullpath1() outside of FILEDESC locking. This is being done in
advance of teaching vn_fullpath1() how to query file systems for
vnode-to-name mappings when cache lookups fail.

Thanks to kib for guidance and patience on this process.

Reviewed by:	kib
Approved by:	kib
2008-11-25 15:36:15 +00:00
Ed Maste
6ffb78d173 Correct typo in comment: thier -> their 2008-11-24 19:28:52 +00:00
David Malone
27d68f904f It's possible that the dump device has gone away after it was
configured, change the message to let people know this is a
possibility. I've slightly changed the message from the one
submitted by Pekka to keep the printf on one line.

Submitted by:	Pekka Savola <pekkas@netcore.fi>
2008-11-23 21:05:22 +00:00
Konstantin Belousov
b4cf0e62f4 Add sv_flags field to struct sysentvec with intention to provide description
of the ABI of the currently executing image. Change some places to test
the flags instead of explicit comparing with address of known sysentvec
structures to determine ABI features.

Discussed with:	dchagin, imp, jhb, peter
2008-11-22 12:36:15 +00:00
Kip Macy
db7f0b974f - bump __FreeBSD version to reflect added buf_ring, memory barriers,
and ifnet functions

- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own

- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers

- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
  allow drivers to efficiently manage multiple hardware queues
  (i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues

This work was supported by Bitgravity Inc. and Chelsio Inc.
2008-11-22 05:55:56 +00:00
Julian Elischer
bc97ba5100 Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from:	Ironport
MFC after:	3 days
2008-11-19 19:19:30 +00:00
John Baldwin
0d484d249f Allow device hints to wire the unit numbers of devices.
- An "at" hint now reserves a device name.
- A new BUS_HINT_DEVICE_UNIT method is added to the bus interface.  When
  determining the unit number of a device, this method is invoked to
  let the bus driver specify the unit of a device given a specific
  devclass.  This is the only way a device can be given a name reserved
  via an "at" hint.
- Implement BUS_HINT_DEVICE_UNIT() for the acpi(4) and isa(4) bus drivers.
  Both of these busses implement this by comparing the resources for a
  given hint device with the resources enumerated by ACPI/PnPBIOS and
  wire a unit if the hint resources are a subset of the "real" resources.
- Use bus_hinted_children() for adding hinted devices on isa(4) busses
  now instead of doing it by hand.
- Remove the unit kludging from sio(4) as it is no longer necessary.

Prodding from:	peter, imp
OK'd by:	marcel
MFC after:	1 month
2008-11-18 21:01:54 +00:00
John Baldwin
02f0ff6d92 When checking to see if another CPU is running its idle thread, examine
the thread running on the other CPU instead of the thread being placed on
the run queue.

Reported by:	Ravi Murty @ Intel
Reviewed by:	jeff
2008-11-18 05:41:34 +00:00
Xin LI
e1088cdca3 Obey signedness flag in %z case.
MFC after:	2 months
2008-11-17 23:57:40 +00:00
Pawel Jakub Dawidek
1ba4a712dd Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.
This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

	Allows regular users to perform ZFS operations, like file system
	creation, snapshot creation, etc.

- L2ARC

	Level 2 cache for ZFS - allows to use additional disks for cache.
	Huge performance improvements mostly for random read of mostly
	static content.

- slog

	Allow to use additional disks for ZFS Intent Log to speed up
	operations like fsync(2).

- vfs.zfs.super_owner

	Allows regular users to perform privileged operations on files stored
	on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

	Not all the flags are supported. This still needs work.

- ZFSBoot

	Support to boot off of ZFS pool. Not finished, AFAIK.

	Submitted by:	dfr

- Snapshot properties

- New failure modes

	Before if write requested failed, system paniced. Now one
	can select from one of three failure modes:
	- panic - panic on write error
	- wait - wait for disk to reappear
	- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

	Just quota and reservation properties, but don't count space consumed
	by children file systems, clones and snapshots.

- Sparse volumes

	ZVOLs that don't reserve space in the pool.

- External attributes

	Compatible with extattr(2).

- NFSv4-ACLs

	Not sure about the status, might not be complete yet.

	Submitted by:	trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from:	OpenSolaris
2008-11-17 20:49:29 +00:00
Konstantin Belousov
c5f77bf986 Revert r184118. There is actually a code in the kernel, for instance in
kern_unlinkat(), that expects that vn_start_write() actually fills the mp
even when the call failed.

As Tor noted, that pattern relies on the the type stability of the mount
points, as well as that suspended mount points are never freed and
V_XSLEEP is always passed to vn_start_write() when called on a freed
mount point.

Reported by:	stass
Reviewed by:	tegge
PR:		123768
2008-11-16 21:56:29 +00:00
Nick Hibma
eaa5bb21ca Silence detach messages if the device has marked itself quiet (u3g).
MFC after:	3 weeks
2008-11-13 21:46:19 +00:00
Ed Schouten
87fe0fa84f Don't forget to relock the TTY after uiomove() returns an error.
Peter Holm just discovered this funny bug inside the TTY code: if
uiomove() in ttydisc_write() returns an error, we forget to relock the
TTY before jumping out of ttydisc_write(). Fix it by placing
tty_unlock() and tty_lock() around uiomove().

Submitted by:	pho
2008-11-12 09:04:44 +00:00
Ed Schouten
ab0d10f68e Several cleanups related to pipe(2).
- Use `fildes[2]' instead of `*fildes' to make more clear that pipe(2)
  fills an array with two descriptors.

- Remove EFAULT from the manual page. Because of the current calling
  convention, pipe(2) raises a segmentation fault when an invalid
  address is passed.

- Introduce kern_pipe() to make it easier for binary emulations to
  implement pipe(2).

- Make Linux binary emulation use kern_pipe(), which means we don't have
  to recover td_retval after calling the FreeBSD system call.

Approved by:	rdivacky
Discussed on:	arch
2008-11-11 14:55:59 +00:00
Andrew Gallatin
528fb798ad Avoid scheduling firmware taskqs when cold.
This prevents a panic which occurs when a driver attempts to load
firmware at boot via firmware_get() when the firmware module has not
been preloaded.  firmware_get() will  enqueue a task using a struct
taskqueue allocated on the stack, and the machine will crash much
later in the firmware taskq thread when taskqs are started and the
struct taskqueue is garbage.

Not objected to by: sam
2008-11-11 12:25:08 +00:00
Ed Schouten
ebb45b0620 Regenerate system call tables for r184789. 2008-11-09 10:48:06 +00:00
Ed Schouten
a1b5a8955e Mark uname(), getdomainname() and setdomainname() with COMPAT_FREEBSD4.
Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.

Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().

I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.

Reviewed by:	rdivacky, kib
2008-11-09 10:45:13 +00:00
Kip Macy
8aa7a58108 make kern.ipc.nmbclusters actually have a useful effect on nmbclusters et al.
initialize pkthdr in field order
2008-11-09 01:53:06 +00:00
Ed Schouten
5bbae50149 Reduce the default baud rate of PTY's to 9600.
On RELENG_6 (and probably RELENG_7) we see our syscons windows and
pseudo-terminals have the following buffer sizes:

| LINE RAW CAN OUT IHIWT ILOWT OHWT LWT     COL STATE  SESS      PGID DISC
| ttyv0  0   0   0  7680  6720 2052 256       7 OCcl       1146  1146 term
| ttyp0  0   0   0  7680  6720 1296 256       0 OCc       82033 82033 term

These buffer sizes make no sense, because we often have much more output
than input, but I guess having higher input buffer sizes improves
guarantees of the system.

On MPSAFE TTY I just sent both the input and output buffer sizes to 7
KB, which is pretty big on a standard FreeBSD install with 8 syscons
windows and some PTY's. Reduce the baud rate to 9600 baud, which means
we now have the following buffer sizes:

|  LINE   INQ  CAN  LIN  LOW  OUTQ  USE  LOW   COL  SESS  PGID STATE
| ttyv0  1920    0    0  192  1984    0  199     7  2401  2401 Oil
| pts/0  1920    0    0  192  1984    0  199  5631  1305  2526 Oi

This is a lot smaller, but for pseudo-devices this should be good
enough. You need to do a lot of punching to fill up a 7.5 KB input
buffer. If it turns out things don't work out this way, we'll just
switch to 19200 baud.
2008-11-08 20:40:39 +00:00
Craig Rodrigues
e506f34b24 Merge latest DTrace changes from Perforce.
Approved by:	jb
2008-11-05 19:40:36 +00:00
David Xu
7b4a950a7d Revert rev 184216 and 184199, due to the way the thread_lock works,
it may cause a lockup.

Noticed by: peter, jhb
2008-11-05 03:01:23 +00:00
John Baldwin
927edcc9ba Use shared vnode locks for auditing vnode arguments as auditing only
does a VOP_GETATTR() which does not require an exclusive lock.

Reviewed by:	csjp, rwatson
2008-11-04 22:31:04 +00:00
John Baldwin
0caf1ab15a Don't bother calling setrunnable() and clearing the sleeping flag in
sleepq_resume_thread() if the thread isn't asleep.
2008-11-04 19:13:53 +00:00
John Baldwin
2ff47c5f18 Remove unnecessary locking around vn_fullpath(). The vnode lock for the
vnode in question does not need to be held.  All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.

In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).

For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.

MFC after:	1 month
2008-11-04 19:04:01 +00:00
Ed Schouten
394e94079c Remove redundant return value tests.
There is no need to test whether the return value is non-zero here. Just
return the error number directly.
2008-11-04 10:58:02 +00:00
John Baldwin
4482f952b1 Adjust the license statement to more closely match a standard 3-clause BSD
license.

MFC after:	3 days
2008-11-03 21:17:02 +00:00
John Baldwin
21fc02d271 Use shared vnode locks instead of exclusive vnode locks for the access(),
chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek()
(when figuring out the current size of the file in the SEEK_END case),
pathconf(), readlink(), and statfs() system calls.

Submitted by:	ups (mostly)
Tested by:	pho
MFC after:	1 month
2008-11-03 20:31:00 +00:00
Attilio Rao
30f60d8c31 Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless.
Really, the concept of holdcnt in the struct mount is rappresented by
the mnt_ref (which prevents the type-stable structure from being
"recycled) handled through vfs_ref() and vfs_rel().
On this optic, switch the holdcnt acquisition into an emulated vfs_ref()
(and subsequent release into vfs_rel()).

Discussed with:	kib
Tested by:	pho
2008-11-03 20:00:35 +00:00
John Baldwin
0f54f8c2b3 A few style nits. 2008-11-03 19:33:20 +00:00
Doug Rabson
45e6ab7f81 Regen. 2008-11-03 10:39:35 +00:00
Doug Rabson
a9148abd9d Implement support for RPCSEC_GSS authentication to both the NFS client
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager.  I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.

The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.

To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.

As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.

Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.

The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.

Sponsored by:	Isilon Systems
MFC after:	1 month
2008-11-03 10:38:00 +00:00
Ivan Voras
aa880b9018 Increase the initial sbuf size for CPU topology dump to something more
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.

Approved by:	gnn (mentor) (older version)
Pointed out by:	rdivacky
2008-11-02 23:11:20 +00:00
Attilio Rao
83b3bdbc8a Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
  mnt_lock and using instead a refcount in order to keep track of lock
  requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
  useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
  Change so its KPI by removing the interlock argument and defining 2 new
  flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
  old version (which was unlinked from the lockmgr alredy) and
  MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
  once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
  mnt_ref if running for more than 3 seconds, make it totally useless.
  Remove it as it was thought to work into older versions.
  If a problem of "refcount held never going away" should appear, we will
  need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
  leaving the MNTK_MWAIT flag on even if it the waiters were actually
  woken up. Just a place in vfs_mount_destroy() is left because it is
  going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with:	kib
Tested by:	pho
2008-11-02 10:15:42 +00:00
Ed Schouten
37a9f58275 Clamp the values of t_column to 5 digits in pstat -t' and show all ttys'.
We often run into these very high column numbers when we run curses
applications, because they don't print any newlines. This messes up the
table output of `pstat -t'. If these numbers get really high, they
aren't of any use to the reader anyway. Convert them to `99999' when
they run out of bounds.
2008-11-01 13:40:46 +00:00
Ed Schouten
c9dba40cc8 Reimplement the /dev/console device node.
One of the pieces of code that I had left alone during the development
of the MPSAFE TTY layer, was tty_cons.c. This file actually has two
different functions:

- It contains low-level console input/output routines (cnputc(), etc).

- It creates /dev/console and wraps all its cdevsw calls to the
  appropriate TTY.

This commit reimplements the second set of functions by moving it
directly into the TTY layer. /dev/console is now a character device node
that's basically a regular TTY, but does a lookup of `si_drv1' each time
you open it. d_write has also been changed to call log_console().
d_close() is not present, because we must make sure we don't revoke the
TTY after writing a log message to it.

Even though I'm not convinced this is in line with the future directions
of our console code, it is a good move for now. It removes recursive
locking from the top half of the TTY layer. The previous implementation
called into the TTY layer with Giant held.

I'm renaming tty_cons.c to kern_cons.c now. The code hardly contains any
TTY related bits, so we'd better give it a less misleading name.

Tested by:	Andrzej Tobola <ato iem pw edu pl>,
		Carlos A.M. dos Santos <unixmania gmail com>,
		Eygene Ryabinkin <rea-fbsd codelabs ru>
2008-11-01 08:35:28 +00:00
Peter Wemm
7a9c4d2409 Add three extra to the kinfo_proc_vmmap data. kve_offset - the offset
within an object that a mapping refers to.  fileid and fsid are inode/dev
for vnodes.  (Linux procfs has these and valgrind is really unhappy
without them.)  I believe I didn't change the size of the struct.
2008-10-31 05:43:19 +00:00
Maxim Sobolev
b0606bd11a Make it possible to compile kernel with KTR but without DDB. 2008-10-30 21:48:28 +00:00