Commit Graph

11169 Commits

Author SHA1 Message Date
ed
35bb1e1a73 Remove slush space from clists.
Right now we only have a very small amount of drivers that use clists,
but we still allocate 50 cblocks as slush space, which allows drivers to
temporarily overcommit their storage. Most of the drivers don't allow
this anyway.

I've performed the following changes:

- We don't allocate any cblocks on startup.

- I've removed the DDB command, because it has nothing useful to print
  now. You can obtain the amount of allocated blocks by running `vmstat
  -m | grep clist'.

- I've removed cfreecount, which is now unused.

- The old code first tries to allocate using M_NOWAIT, followed by
  M_WAITOK. This doesn't make any sense, so just remove this logic. It
  seems the drivers allow us to sleep anyway.

We can even remove ccmax from clist_alloc_cblocks and c_cbmax from
struct clist, but this breaks binary compatibility.

This reduces the amount of allocated cblocks on my system from 54 to 4.
2009-02-04 17:10:01 +00:00
ed
85ebf97341 Slightly improve the design of the TTY buffer.
The TTY buffers used the standard <sys/queue.h> lists. Unfortunately
they have a big shortcoming. If you want to have a double linked list,
but no tail pointer, it's still not possible to obtain the previous
element in the list. Inside the buffers we don't need them. This is why
I switched to custom linked list macros. The macros will also keep track
of the amount of items in the list. Because it doesn't use a sentinel,
we can just initialize the queues with zero.

In its simplest form (the output queue), we will only keep two
references to blocks in the queue, namely the head of the list and the
last block in use. All free blocks are stored behind the last block in
use.

I noticed there was a very subtle bug in the previous code: in a very
uncommon corner case, it would uma_zfree() a block in the queue before
calling memcpy() to extract the data from the block.
2009-02-03 19:58:28 +00:00
imp
b5abf9646f Use NULL in preference to 0 in pointer contexts. 2009-02-03 07:54:42 +00:00
imp
02f34b7ded Make bioq_disksort have a ANSI-C definition rather than a K&R definition. 2009-02-03 07:53:51 +00:00
imp
35e4f2b272 rman_debug should be static, so make it static. 2009-02-03 07:53:08 +00:00
imp
b5c4f1a094 Use ANSI function definition for profil. 2009-02-03 07:52:36 +00:00
imp
10cf8131b7 Prefer ANSI function definitions to K&R ones. 2009-02-03 07:52:07 +00:00
imp
5a073bfc7c Use NULL in preference to 0 for pointers. 2009-02-03 07:51:41 +00:00
imp
82f181ca79 Use NULL in preference to 0 for pointers. 2009-02-03 07:51:11 +00:00
imp
54f6c3e35c o Use unsigned for bit fields.
o Use NULL for pointers in preference to 0.
2009-02-03 07:50:41 +00:00
imp
b033fcf7e9 int foo(void) is the proper ANSI function definition when there's no
parameters.  Use it for resettodr().
2009-02-03 07:50:01 +00:00
imp
0821a484fd Declare bus_data_devices to be static: it isn't used elsewhere.
Use NULL in a couple of places rather than 0 in the context of
pointers to be consistent with the rest of the file.
2009-02-03 00:10:21 +00:00
sepotvin
9d78a7fce3 Fix select on platforms where sizeof(long) != sizeof(int). This used
to work by accident before the cleanup done in revision 187693.

Approved by:	kan (mentor)
2009-02-02 03:34:40 +00:00
rwatson
7147753438 If a process is a zombie and we couldn't identify another useful state,
print out the state as "zombine" in preference to "unknown" when ^T is
pressed.

MFC after:	3 days
Sponsored by:	Google, Inc.
2009-01-29 09:32:56 +00:00
ed
50efccc9f0 Mark most often used sysctl's as MPSAFE.
After running a `make buildkernel', I noticed most of the Giant locks in
sysctl are only caused by a very small amount of sysctl's:

- sysctl.name2oid. This one is locked by SYSCTL_LOCK, just like
  sysctl.oidfmt.

- kern.ident, kern.osrelease, kern.version, etc. These are just constant
  strings.

- kern.arandom, used by the stack protector. It is already protected by
  arc4_mtx.

I also saw the following sysctl's show up. Not as often as the ones
above, but still quite often:

- security.jail.jailed. Also mark security.jail.list as MPSAFE. They
  don't need locking or already use allprison_lock.

- kern.devname, used by devname(3), ttyname(3), etc.

This seems to reduce Giant locking inside sysctl by ~75% in my primitive
test setup.
2009-01-28 19:58:05 +00:00
jhb
04d889be20 Convert the global mutex protecting the directory lookup name cache from a
mutex to a reader/writer lock.  Lookup operations first grab a read lock and
perform the lookup.  If the operation results in a need to modify the cache,
then it tries to do an upgrade.  If that fails, it drops the read lock,
obtains a write lock, and redoes the lookup.
2009-01-28 19:05:18 +00:00
ed
9b5c4f4d39 Use the proper flag to let kern.ttys be executed without Giant.
Pointed out by:	jhb
2009-01-26 16:43:18 +00:00
jhb
b8c520ae3c Whitespace tweak. 2009-01-26 15:32:39 +00:00
jeff
687eba74e7 - bit has to be fd_mask to work properly on 64bit platforms. Constants
must also be cast even though the result ultimately is promoted
   to 64bit.
 - Correct a loop index upper bound in selscan().
2009-01-25 18:38:42 +00:00
rwatson
97295d8b75 When a statically linked binary is executed (or at least, one without
an interpreter definition in its program header), set the auxiliary
ELF argument AT_BASE to 0 rather than to the address that we would
have mapped the interpreter at if there had been one.

The ELF ABI specifications appear to be ambiguous as to the desired
behavior in this situation, as they define AT_BASE as the base address
of the interpreter, but do not mention what to do if there is none.
On Solaris, AT_BASE will be set to the base address of the static
binary if there is no interpreter, and on Linux, AT_BASE is set to 0.
We go with the Linux semantics as they are of more immediate utility
and allow the early runtime environment to know that the kernel has
not mapped an interpreter, but because AT_PHDR points at the ELF
header for the running binary, it is still possible to retrieve all
required mapping information when the process starts should it be
required.  Either approach would be preferable to our current behavior
of passing a pointer to an unmapped region of user memory as AT_BASE.

MFC after:	3 weeks
2009-01-25 12:07:43 +00:00
bz
6dddd78341 For consistency with prison_{local,remote,check}_ipN rename
prison_getipN to prison_get_ipN.

Submitted by:	jamie (as part of a larger patch)
MFC after:	1 week
2009-01-25 10:11:58 +00:00
jeff
dcd94957aa - Correct a typo in a comment.
Noticed by:	danger
2009-01-25 09:17:16 +00:00
jeff
69d1bd8670 - Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
   This is useful for cases such as NUMA or different layouts for the same
   memory type.
 - Provide a new api for adding new backend kegs to secondary zones.
 - Provide a new flag for adjusting the layout of zones to stagger
   allocations better across cache lines.

Sponsored by:	Nokia
2009-01-25 09:11:24 +00:00
ed
ce1034ac57 Remove unneeded use of device unit numbers from pty(4).
A much more simple approach to generate the slave device name, is to
obtain the device name of the master and replace 'p' by 't'.
2009-01-25 08:27:11 +00:00
jeff
d4c94410f6 - Use __XSTRING where I want the define to be expanded. This resulted in
sizeof("MAXCPU") being used to calculate a string length rather than
   something more reasonable such as sizeof("32").  This shouldn't have
   caused any ill effect until we run on machines with 1000000 or more
   cpus.
2009-01-25 07:35:10 +00:00
jeff
3688ae7441 Fix errors introduced when I rewrote select.
- Restructure selscan() and selrescan() to avoid producing extra selfps
   when we have a fd in multiple sets.  As described below multiple selfps
   may still exist for other reasons.
 - Make selrescan() tolerate multiple selfds for a given descriptor
   set since sockets use two selinfos per fd.  If an event on each selinfo
   fires selrescan() will see the descriptor twice.  This could result in
   select() returning 2x the number of fds actually existing in fd sets.

Reported by:	mgleason@ncftp.com
2009-01-25 07:24:34 +00:00
ed
e3697d4040 Mark kern.ttys as MPSAFE.
sysctl now allows Giantless calls, so make kern.ttys use this. If it
needs Giant, it locks the proper TTY anyway.
2009-01-24 18:20:15 +00:00
rwatson
aaaff3620b Add explicit static DTrace tracing to the callout mechanism, capturing
pointers to the callout handler just before and just after the callout
it invoked.  I attempted to do this in a manner congruent to tracing in
Solaris's callout mechanism, but couldn't quite use the same names due
to convention and syntax differences.

Example DTrace script to generate a distribution graph of callout
execution times:

callout_execute:::callout_start
{
        self->cstart = timestamp;
}

callout_execute:::callout_end
{

        @length = quantize(timestamp - self->cstart);
}

Reviewed by:	jb
MFC after:	3 days
2009-01-24 10:22:49 +00:00
jhb
a9601871c9 - Mark all standalone INT/LONG/QUAD sysctl's MPSAFE. This is done
inside the SYSCTL() macros and thus does not need to be done for
  all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
  has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
2009-01-23 22:49:23 +00:00
jhb
0245a3370f - Add conditional Giant locking around the vrele() in
sysctl_kern_proc_pathname().
- Mark all the kern.proc.* sysctls as MPSAFE.

Submitted by:	csjp (2)
2009-01-23 22:46:45 +00:00
jhb
b6d1e3ceff Add a flag to tag individual sysctl leaf nodes as MPSAFE and thus not
needing Giant.

Submitted by:	csjp (an older version)
2009-01-23 22:40:35 +00:00
jhb
d94da54d95 Use shared vnode locks for fchdir().
Submitted by:	ups
2009-01-23 22:13:30 +00:00
jhb
d7c8a44c0d Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not
updating va_atime via VOP_SETATTR() isn't always synchronous.  For some
filesystems it is asynchronous.

Suggested by:  bde
2009-01-23 22:13:00 +00:00
jhb
4efa7c83e1 Push down Giant in the vlnru kproc main loop so that it is only acquired
around calls to vlrureclaim() on non-MPSAFE filesystems.  Specifically,
vnlru no longer needs Giant for the common case of waking up and deciding
there is nothing for it to do.

MFC after:	2 weeks
2009-01-23 22:08:54 +00:00
jhb
d2c61e641d Use the correct type for the timeout parameter to the 32-bit
compat version aio_waitcomplete().

Reminded by:	bz
Submitted by:	jamie
MFC after:	3 days
2009-01-23 13:23:17 +00:00
jhb
2939c2f76f Fix a few style bogons.
Submitted by:	bde
2009-01-21 20:08:17 +00:00
kib
44eef9d9bb Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by:	jhb
Tested by:	pho
MFC after:	1 month
2009-01-21 14:51:38 +00:00
jhb
47455a7b41 Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.

Inspired by:	ups
Tested by:	pho
2009-01-21 14:42:00 +00:00
thompsa
50e14c608e Add functions WITNESS so it can be asserted that the lock is not released for a
section of code, this uses WITNESS_NORELEASE() and WITNESS_RELEASEOK() to mark
the boundaries. Both functions require the lock to be held when calling.

This is intended for scenarios like a bus asserting that the bus lock is not
dropped during a driver call. There doesn't appear to be a man page to
document this in.

Reviewed by:	jhb
2009-01-21 04:19:18 +00:00
kib
cbb8defa10 FFS puts the extended attributes blocks at the negative blocks for the
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.

Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitely
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.

Reported by:	csjp
Reviewed by:	ups
MFC after:	3 weeks
2009-01-20 11:27:45 +00:00
mckay
7620a1118d Add a limit on namecache entries.
In normal operation, the number of cache entries is roughly equal to the
number of active vnodes.  However, when most of the recently accessed
vnodes have many hard links, the number of cache entries can be 32000
times as large, exhausting kernel memory and provoking a panic in
kmem_malloc().

MFC after: 2 weeks
2009-01-20 04:21:21 +00:00
mav
dcb9508070 Teach m_copyback() to use trailing space of the last mbuf in chain. 2009-01-18 20:19:55 +00:00
jeff
3d8d825555 - Implement generic macros for producing KTR records that are compatible
with src/tools/sched/schedgraph.py.  This allows developers to quickly
   create a graphical view of ktr data for any resource in the system.
 - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
   identifying records associated with a thread or cpu.
 - Reimplement the KTR_SCHED traces using the new generic facility.

Obtained from:	attilio
Discussed with:	jhb
Sponsored by:	Nokia
2009-01-17 07:17:57 +00:00
kib
92e5c3777e Lock the semaphore identifier lock during semaphore initialization to
guarantee atomicity of the operation for other semaphore consumers.
In particular, this should guard against access to the semaphore with
not done or partially done MAC label assignment.

Reviewed by:	rwatson
MFC after:	1 month
2009-01-15 12:15:46 +00:00
kib
f068898e84 It seems that there are at least three issues with IPC_RMID operation
on SysV semaphores.

  The squeeze of the semaphore array in the kern_semctl() modifies
  sem_base for the semaphores with sem_base greater then sem_base of
  the removed semaphore, as well as the values of the semaphores,
  without locking their mutex. This can lead to (killable) hangs or
  unexpected behaviour of the processes performing any sem operations
  while other process does IPC_RMID.

  The semexit_myhook() eventhandler unlocks SEMUNDO_LOCK() while
  accessing *suptr. This allows for IPC_RMID for the sem id to be
  performed in parallel with undo hook referenced by the current undo
  structure. This leads to the panic("semexit - semid not allocated") [1].

  The semaphore creation is protected by Giant, while IPC_RMID is done
  while only semaphore mutex is held. This seems to result in invalid
  values for semtot, causing random ENOSPC error returns [2].

Redo the locking of the semaphores lifetime cycle. Delegate the
sem_mtx to the sole purpose of protecting semget() and
semctl(IPC_RMID). Introduce new sem_undo_mtx to protect SEM_UNDO
handling. Remove the Giant remnants from the code.
Note that  mac_sysvsem_check_semget() and mac_sysvsem_create() are
now called while sem_mtx is held, as well as mac_sysvsem_cleanup() [3].

When semaphore is removed, acquire semaphore locks for all semaphores
with sem_base that is going to be changed by squeeze of the sema
array. The lock order is not important there, because the region is
protected by sem_mtx.

Organize both used and free sem_undo structures into the lists,
protected by sem_undo_mtx. In semexit_myhook(), remove sem_undo
structure that is being processed, from used list, without putting it
onto the free to prevent modifications by other threads. This allows
for sem_undo_lock to be dropped to acquire individial semaphore locks
without violating lock order. Since IPC_RMID may no longer find this
sem_undo, do tolerate references to unallocated semaphores in undo
structure, and check sequential number to not undo unrelated semaphore
with the same id.

While there, convert functions definitions to ANSI C and fix small
style(9) glitches.

Reported by:	Omer Faruk Sen <omerfsen gmail com> [1], pho [2]
Reviewed by:	rwatson [3]
Tested by:	pho
MFC after:	1 month
2009-01-14 15:20:13 +00:00
jhb
a20bdb255f Add a new KTR tracepoint in the KTR_CALLOUT class to note when a callout
routine finishes executing.

MFC after:	1 week
2009-01-13 15:56:53 +00:00
kib
97fa304203 Do not call namei() while having another user-controlled vnode
locked. Lookup could attempt to recursively lock that vnode.

Do not call vn_start_write(V_WAIT) while vnode is locked, this may
result in a deadlock with suspension.

vfs_busy() the mountpoint before dropping vnode lock for vnode
that was used to look up the mountpoint, to prevent unmount in
between.

Reported and tested by:	pho
Reviewed by:	rwatson
MFC after:	3 weeks
2009-01-08 12:47:30 +00:00
ed
2f8c83e018 Remove Giant locking from domains list.
During boot, the domain list is locked with Giant. It is not possible to
register any protocols after the system has booted, so the lock is only
used to protect insertion of entries.

There is already a mutex in uipc_domain.c called dom_mtx. Use this mutex
to lock the list, instead of using Giant. It won't matter anything with
respect to performance, but we'll never get rid of Giant if we don't
remove from places where we don't need it.

Approved by:	rwatson
MFC after:	3 weeks
2009-01-04 19:22:53 +00:00
rwatson
af9e0de76f Remove two further uses (debugging and NULLing) of pr_ousrreq, missed due
to svn commit in the wrong directory.

Spotted by:	bz
2009-01-04 19:16:36 +00:00
bz
903df7883d Back out r186615; the sanitizing of the pointers in the error case
is not needed and seems that it will not be needed either.

Pointy hat:	mine, mine, mine and not pho's
2009-01-04 12:18:18 +00:00
kib
ac1b596fda Extend the struct vm_page wire_count to u_int to avoid the overflow
of the counter, that may happen when too many sendfile(2) calls are
being executed with this vnode [1].

To keep the size of the struct vm_page and offsets of the fields
accessed by out-of-tree modules, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]

Reported by:	Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by:	alc [2]
Reviewed by:	alc
MFC after:	2 weeks
2009-01-03 13:24:08 +00:00
ed
fa1537b948 Fix a corner case in my previous commit.
Even though there are not many setups that have absolutely no console
device, make sure a close() on a TTY doesn't dereference a null pointer.
2009-01-02 23:39:29 +00:00
ed
8244396e79 Don't let /dev/console be revoked if the TTY below is being closed.
During startup some of the syscons TTY's are used to set attributes like
the screensaver and mouse options. These actions cause /dev/console to
be rendered unusable.

Fix the issue by leaving the TTY opened when it is used as the console
device.

Reported by:	imp
2009-01-02 23:32:43 +00:00
rwatson
96b82a80dd White space and comment tweaks.
MFC after:	3 weeks
2009-01-01 20:03:22 +00:00
rwatson
c049b2b5fe Temporary workaround for the limitations of the mbuf flowid field: zero
the field in the mbuf constructors, since otherwise we have no way to
tell if they are valid.  In the future, Kip has plans to add a flag
specifically to indicate validity, which is the preferred model.
2009-01-01 20:03:01 +00:00
ed
05475b9543 Don't clobber sysctl_root()'s error number.
When sysctl() is being called with a buffer that is too small, it will
return ENOMEM. Unfortunately the changes I made the other day sets the
error number to 0, because it just returns the error number of the
copyout(). Revert this part of the change.
2009-01-01 00:19:51 +00:00
ivoras
4136fd8892 Document the relationship between enum VM_GUEST and the vm_guest_sysctl_names
array.

Approved by:	gnn (original version)
2008-12-30 23:49:54 +00:00
pho
2f6e82c78d Added missing second part of cleaning j->ip[46] as requested by bz
Approved by:	kib (mentor)
Pointy hat:	pho
2008-12-30 20:39:47 +00:00
pho
6e2644b311 Make sure that unused j->ip[46] are cleared
Reviewed by:	bz
Approved by:	kib (mentor)
2008-12-30 17:54:25 +00:00
rwatson
0c0b8926ba Rename mbcnt to mbcnt_delta in uipc_send() -- unlike other local
variables named mbcnt in uipc_usrreq.c, this instance is a delta
rather than a cache of sb_mbcnt.

MFC after:	3 weeks
2008-12-30 16:09:57 +00:00
kib
2349a65923 Clear the pointers to the file in the struct filedesc before file is closed
in fdfree. Otherwise, sysctl_kern_proc_filedesc may dereference stale
struct file * values.

Reported and tested by:	pho
MFC after:	1 month
2008-12-30 12:51:56 +00:00
kib
c81ec4dc0c In r185557, the check for existing negative entry for the given name
did not compared nc_dvp with supplied parent directory vnode pointer.
Add the check and note that now branches for vp != NULL and vp == NULL
are the same, thus can be merged.

Reported and reviewed by:	kan
Tested by:	pho
MFC after:	2 weeks
2008-12-30 12:51:14 +00:00
ed
3f319ef66d Fix compilation. Also move ogetkerninfo() to kern_xxx.c.
It seems I forgot to remove `int error' from a single piece of code. I'm
also moving ogetkerninfo() to kern_xxx.c, because it belongs to the
class of compat system information system calls, not the generic sysctl
code.
2008-12-29 19:24:00 +00:00
ed
f3a9a195cb Push down Giant inside sysctl. Also add some more assertions to the code.
In the existing code we didn't really enforce that callers hold Giant
before calling userland_sysctl(), even though there is no guarantee it
is safe. Fix this by just placing Giant locks around the call to the oid
handler. This also means we only pick up Giant for a very short period
of time. Maybe we should add MPSAFE flags to sysctl or phase it out all
together.

I've also added SYSCTL_LOCK_ASSERT(). We have to make sure sysctl_root()
and name2oid() are called with the sysctl lock held.

Reviewed by:	Jille Timmermans <jille quis cx>
2008-12-29 12:58:45 +00:00
kib
bd5d614be8 vm_map_lock_read() does not increment map->timestamp, so we should
compare map->timestamp with saved timestamp after map read lock is
reacquired, not with saved timestamp + 1. The only consequence of the +1
was unconditional lookup of the next map entry, though.

Tested by:	pho
Approved by:	des
MFC after:	2 weeks
2008-12-29 12:45:11 +00:00
kmacy
208a4373c4 drop rnh lock before destroying it 2008-12-28 14:32:27 +00:00
bz
7d22a18291 Hide detect_virtual() along with the accompanying string
arrays under #ifndef XEN to make XEN config compile again.
In case of Xen vm_guest is hard coded.

Move the list for the vm_guest sysctl out of the restictive
bounds as the sysctl is there in either case.
2008-12-27 17:19:16 +00:00
pho
fbacc4af83 Prevent overflow of uio_resid.
Approved by:	kib
2008-12-27 10:13:43 +00:00
rwatson
13abb9545e Following the recent security advisory, add a comment describing our
invariants and approach for protocol switch methods in protsw_init(),
and also some KASSERT's for non-domain init entries in protocol
switch tables: pru_abort and pru_send must both be implemented.

For now, leave those assertions #if 0'd, since there are a few
protocols that violate them in non-harmful ways.  Whether or not we
should enforce pru_abort being implemented for non-stream protocols
is an interesting question: currently abort is only invoked on stream
sockets in situations where un-accepted sockets must be abruptly
closed (i.e., close() on a listen socket with pending connections),
but in principle it is useful for datagram sockets and most datagram
socket types implement it.

MFC after:	3 weeks
2008-12-25 11:32:38 +00:00
marcus
09a2776e69 Do not KASSERT when vp->v_dd is NULL. Only directories which have had ".."
looked up would have v_dd set to a non-NULL value.  This fixes a panic
seen when running installworld on a diskless system with a separate /usr
file system.

Submitted by:	cracauer
Approved by:	kib
2008-12-23 20:43:42 +00:00
kib
87478ed893 Keep the hold on the vnode during VOP_VPTOCNP() call, allowing the vop
implementation to drop vnode lock, if needed.

Reported and tested by:	pho
2008-12-23 20:04:31 +00:00
ivoras
97219f9ae7 Add missing newlines to flags tags of CPU topology, for prettier
output.

Reviewed by:	jeff (original version)
Approved by:	gnn (mentor) (original version)
2008-12-23 16:19:59 +00:00
cperciva
87e5b5b6cc Prevent cross-site forgery attacks on ftpd(8) due to splitting
long commands into multiple requests. [08:12]

Avoid calling uninitialized function pointers in protocol switch
code. [08:13]

Merry Christmas everybody...

Approved by:	so (cperciva)
Approved by:	re (kensmith)
Security:	FreeBSD-SA-08:12.ftpd, FreeBSD-SA-08:13.protosw
2008-12-23 01:23:09 +00:00
ed
c48b61a389 Revert r185891.
In r185891 I removed the newlines from messages written to /dev/console,
because it made startup messages from rc-scripts harder to read. This,
unfortunately, causes the kernel message that is printed after a
non-terminated log message to be concatenated.

This could be fixed, but on short term it's better to just revert the
change.

Reported by:	Jaakko Heinonen <jh saunalahti fi>
2008-12-21 21:54:01 +00:00
ed
8cd1b93741 Set PTS_FINISHED before waking up any threads.
Inside ptsdrv_{in,out}wakeup() we call KNOTE_LOCKED() to wake up any
kevent(2) users. Because the kqueue handlers are executed synchronously,
we must set PTS_FINISHED before calling ptsdrv_{in,out}wakeup().

Discovered by:	nork
2008-12-21 21:16:57 +00:00
ed
1bd8105e62 Let wchan names more closely match pre-MPSAFE TTY behaviour.
Right now the wchan strings "ttyinp" and "ttybgw" only differ one
character from the strings we used prior to MPSAFE TTY. Just rename them
back to their pre-MPSAFE TTY counterparts.

Also rename "ttylck" to "ttymtx", which should make it more clear that a
process is blocked on the TTY mutex, not some other form of locking.
2008-12-20 09:36:40 +00:00
nwhitehorn
3fcef6d9c2 Modularize the Open Firmware client interface to allow run-time switching
of OFW access semantics, in order to allow future support for real-mode
OF access and flattened device frees. OF client interface modules are
implemented using KOBJ, in a similar way to the PPC PMAP modules.

Because we need Open Firmware to be available before mutexes can be used on
sparc64, changes are also included to allow KOBJ to be used very early in
the boot process by only using the mutex once we know it has been initialized.

Reviewed by:    marius, grehan
2008-12-20 00:33:10 +00:00
ivoras
9b61ff858b Further beautify the lock strings to be more pleasing to the eye and
self documenting within 6 characters.

Reviewed by:	ed (older version)
Approved by:	gnn (older version)
2008-12-19 14:49:14 +00:00
ru
021bbbd29f Removed a comment made obsolete by revisions 157927 and 174292. 2008-12-18 15:56:12 +00:00
ivoras
c6f6eeca99 By popular request, stringify kern.vm_guest sysctl. Now it returns a
short, self-documenting string describing the detected virtual
environment.

Approved by:	gnn (mentor) (earlier version)
2008-12-18 15:34:38 +00:00
ivoras
a6186ea60d Remove spaces in wait object names to make top (1) output prettier and
unbreak scripts that examine ps (1) output.

Reviewed by:	ed
Approved by:	gnn (mentor)
2008-12-18 15:25:33 +00:00
kib
5b3918fe07 The quotactl, statfs and fstatfs syscall implementations may dereference
NULL pointer to struct mount if the looked up vnode is reclaimed. Also,
these syscalls only mnt_ref() the mp, still allowing it to be unmounted;
only struct mount memory is kept from being reused.

Lock the vnode when doing name lookup, then reference its mount point,
unlock the vnode and vfs_busy the mountpoint. This sequence shall take
care of both races.

Reported and tested by:	pho
Discussed with:	attilio
MFC after:	1 month
2008-12-18 12:01:19 +00:00
kib
fe785ac856 Do not return success and doomed vnode from lookup. LK_UPGRADE allows
the vnode to be reclaimed.

Tested by:	pho
MFC after:	1 month
2008-12-18 11:58:12 +00:00
ivoras
b769de9274 Introduce a sysctl kern.vm_guest that reflects what the kernel knows about
it running under a virtual environment. This also introduces a globally
accessible variable vm_guest that can be used where appropriate in the
kernel to inspect this environment.

To make it easier for the long run, an enum VM_GUEST is also introduced,
which could possibly be factored out in a header somewhere (but the
question is where - vm/vm_param.h? sys/param.h?) so it eventually becomes
a part of the standard KPI. In any case, it's a start.

The purpose of all this isn't to absolutely detect that the OS is running
under a virtual environment (cf. "redpill") but to allow the parts of the
kernel and the userland that care about this particular aspect and can do
something useful depending on it to have a standardised interface. Reducing
kern.hz is one example but there are other things that could be done like
avoiding context switches, not using CPU instructions that are known to be
slow in emulation, possibly different strategies in VM (memory) allocation,
CPU scheduling, etc.

It isn't clear if the JAILS/VIMAGE functionality should also be exposed
by this particular mechanism (probably not since they're not "full"
virtual hardware environments). Sometime in the future another sysctl and
a variable could be introduced to reflect if the kernel supports any kind
of virtual hosting (e.g. VMWare VMI, Xen dom0).

Reviewed by:	silence from src-commiters@, virtualization@, kmacy@
Approved by:	gnn (mentor)
Security:	Obscurity doesn't help.
2008-12-17 19:57:12 +00:00
peter
35932c6d7c Remove sysctl debug.elf_trace and the trace field in auxargs. They go
nowhere.  It used to be the equivalent of $LD_DEBUG in rtld-elf.
Elf_Auxargs is an internal structure.
2008-12-17 16:54:29 +00:00
imp
4ad1824222 Minor style(9) nit. 2008-12-17 16:25:20 +00:00
kib
ce7791f58d Remove two remnant uses of AT_DEBUG. 2008-12-17 13:13:35 +00:00
attilio
697e2a94e4 1) Fix a deadlock in the VFS:
- threadA runs vfs_rel(mp1)
- threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK()
- threadA runs vfs_busy(mp1) and, as long as, MNTK_UNMOUNT is set, sleeps
  waiting for threadB to complete the unmount
- threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting
  for the refcount to expire.

Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the
unmounter is waiting for mnt_ref to expire.
The vfs_busy contenders got awake, fails, and if they retry the
MNTK_REFEXPIRE won't allow them to sleep again.

2) Simplify significantly the code of vfs_mount_destroy() trimming
   unnecessary codes:
   - as long as any reference exited, it is no-more possible to have
     write-op (primarty and secondary) in progress.
   - it is no needed to drop and reacquire the mount lock.
   - filling the structures with dummy values is unuseful as long as
     it is going to be freed.

Tested by:	pho, Andrea Barberio <insomniac at slackware dot it>
Discussed with:	kib
2008-12-16 23:16:10 +00:00
mav
325d9e1d23 If possible, try to obtain max_mhz on cpufreq attach instead of first request.
On HyperThreading CPUs logical cores have same frequency, so setting it
on any core will change the other's one. In most cases first request
to the second core will be the "set" request, done after setting frequency
of the first core. In such case second CPU will obtain throttled frequency
of the first core as it's max_mhz making cpufreq broken due to different
frequency sets.
2008-12-16 01:24:05 +00:00
mav
37aff7daa7 Change ttyhook_register() second argument from thread to process pointer.
Thread was not really needed there, while previous ng_tty implementation
that used thread pointer had locking issues (using sx while holding mutex).
2008-12-13 21:17:46 +00:00
jkoshy
57939c399f - Bug fix: prevent a thread from migrating between CPUs between the
time it is marked for user space callchain capture in the NMI
  handler and the time the callchain capture callback runs.

- Improve code and control flow clarity by invoking hwpmc(4)'s user
  space callchain capture callback directly from low-level code.

Reviewed by:	jhb (kern/subr_trap.c)
Testing (various patch revisions): gnn,
		Fabien Thomas <fabien dot thomas at netasq dot com>,
		Artem Belevich <artemb at gmail dot com>
2008-12-13 13:07:12 +00:00
ed
7397703a4c Add FIONREAD to pseudo-terminal master devices.
All ioctl()'s that aren't implemented by pts(4) are forwarded to the TTY
itself. Unfortunately this is not correct for FIONREAD, because it will
give the wrong amount of bytes that are available to read.

Tested by:	keramida
Reminded by:	keramida
2008-12-13 07:23:55 +00:00
kib
ba6c22347a Uio_yield() already does DROP_GIANT/PICKUP_GIANT, no need to repeat this
around the call.

Noted by:  bde
2008-12-12 14:03:04 +00:00
kib
e747469903 Reference the vmspace of the process being inspected by procfs, linprocfs
and sysctl kern_proc_vmmap handlers.

Reported and tested by:	pho
Reviewed by:	rwatson, des
MFC after:	1 week
2008-12-12 12:12:36 +00:00
kib
2ef4ea7ee8 The userland_sysctl() function retries sysctl_root() until returned
error is not EAGAIN. Several sysctls that inspect another process use
p_candebug() for checking access right for the curproc. p_candebug()
returns EAGAIN for some reasons, in particular, for the process doing
exec() now. If execing process tries to lock Giant, we get a livelock,
because sysctl handlers are covered by Giant, and often do not sleep.

Break the livelock by dropping Giant and allowing other threads to
execute in the EAGAIN loop.

Also, do not return EAGAIN from p_candebug() when process is executing,
use more appropriate EBUSY error [1].

Reported and tested by:	pho
Suggested by:	rwatson [1]
Reviewed by:	rwatson, des
MFC after:	1 week
2008-12-12 12:06:28 +00:00
marcus
91e684d7f9 Add a new VOP, VOP_VPTOCNP, which translates a vnode to its component name
on a best-effort basis.  Teach vn_fullpath to use this new VOP if a
regular VFS cache lookup fails.  This VOP is designed to supplement the
VFS cache to provide a better chance that a vnode-to-name lookup will
succeed.

Currently, an implementation for devfs is being committed.  The default
implementation is to return ENOENT.

A big thanks to kib for the mentorship on this, and to pho for running it
through his stress test suite.

Reviewed by:	arch
Approved by:	kib
2008-12-12 00:57:38 +00:00
ed
9afe297ee5 Add kqueue()-support to pseudo-terminal master devices.
One thing I didn't expect many applications to use, was kqueue() on
pseudo-terminal master devices. There are applications that use kqueue()
on the TTY itself (rtorrent, etc). That doesn't mean we shouldn't
implement this. Libraries like libevent use kqueue() by default, which
means they wouldn't be able to use kqueue().

The old TTY layer implements a very broken version of kqueue() by
performing the actual polling on the TTY device.

Discussed with:	peter
2008-12-11 21:44:02 +00:00
bz
9a73283b1f Order #includes - also to reduce diffs with vimage branches in p4.
Sponsored by:	The FreeBSD Foundation
2008-12-11 16:09:31 +00:00
bz
e65de9d982 Correctly check the number of prison states to not access anything
outside the prison_states array.
When checking if there is a name configured for the prison, check the
first character to not be '\0' instead of checking if the char array
is present, which it always is. Note, that this is different for the
*jailname in the syscall.

Found with:	Coverity Prevent(tm)
CID:		4156, 4155
MFC after:	4 weeks (just that I get the mail)
2008-12-11 01:04:25 +00:00
zec
7b573d1496 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
bz
f30a0a94fe Make sure nmbclusters are initialized before maxsockets
by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE
before the init_maxsockets() SYSINT at SI_ORDER_ANY.

Reviewed by:		rwatson, zec
Sponsored by:		The FreeBSD Foundation
MFC after:		4 weeks
2008-12-10 22:17:09 +00:00
bz
72852bcf84 Style changes only. Put the return type on an extra line[1] and
add an empty line at the beginning as we do not have any local
variables.

Submitted by:	rwatson [1]
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-10 22:10:37 +00:00
ed
4400e3e134 Remove added newlines from logged messages written to /dev/console.
The /dev/console device node logs all strings that are written to it.
When the string does not contain a trailing newline, it appends one. I
can imagine this was useful a long time ago, but with our current
rc-scripts, it generates a whole bunch of messages that look like:

| Configuring syscons:
|  blanktime
| .

By not appending the newlines, the output of `dmesg -a' is now (almost?)
exactly the same as what the user will see on the console device
(syscons, uart).
2008-12-10 21:48:05 +00:00
jhb
f3dcc2d9e0 - Add 32-bit compat system calls for VFS_AIO. The system calls live in the
aio code and are registered via the recently added SYSCALL32_*() helpers.
- Since the aio code likes to invoke fuword and suword a lot down in the
  "bowels" of system calls, add a structure holding a set of operations for
  things like storing errors, copying in the aiocb structure, storing
  status, etc.  The 32-bit system calls use a separate operations vector to
  handle fuword32 vs fuword, etc.  Also, the oldsigevent handling is now
  done by having seperate operation vectors with different aiocb copyin
  routines.
- Split out kern_foo() functions for the various AIO system calls so the
  32-bit front ends can manage things like copying in and converting
  timespec structures, etc.
- For both the native and 32-bit aio_suspend() and lio_listio() calls,
  just use copyin() to read the array of aiocb pointers instead of using
  a for loop that iterated over fuword/fuword32.  The error handling in
  the old case was incomplete (lio_listio() just ignored any aiocb's that
  it got an EFAULT trying to read rather than reporting an error), and
  possibly slower.

MFC after:	1 month
2008-12-10 20:56:19 +00:00
kmacy
c510d681c9 add RW_SYSINIT_FLAGS macro and rw_sysinit_flags initialization function 2008-12-08 21:46:55 +00:00
jkim
bc7e5e240b - Detect Bochs BIOS variants and use HZ_VM as well.
- Free kernel environment variable after its use.
- Fix style(9) nits.
2008-12-08 18:39:59 +00:00
kib
8324189f53 Do drop vm map lock earlier in the sysctl_kern_proc_vmmap(), to avoid
locking a vnode while having vm map locked.

Reported and tested by:	pho
MFC after:	1 week
2008-12-08 12:29:30 +00:00
kmacy
598b522b42 - convert radix node head lock from mutex to rwlock
- make radix node head lock not recursive
 - fix LOR in rtexpunge
 - fix LOR in rtredirect

Reviewed by:	sam
2008-12-07 21:15:43 +00:00
kib
ccad2ebfb2 Several threads in a process may do vfork() simultaneously. Then, all
parent threads sleep on the parent' struct proc until corresponding
child releases the vmspace. Each sleep is interlocked with proc mutex of
the child, that triggers assertion in the sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.

Silent the assertion by using conditional variable allocated in the
child. Broadcast the variable event on exec() and exit().

Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().

Reported and tested by:	ganbold
Suggested and reviewed by:	jhb
MFC after:	2 week
2008-12-05 20:50:24 +00:00
jhb
0f1a8aa011 When the SYSINIT() to load a module invokes the MOD_LOAD event successfully,
move that module to the head of the associated linker file's list of modules.
The end result is that once all the modules are loaded, they are sorted in
the reverse of their load order.  This causes the kernel linker to invoke
the MOD_QUIESCE and MOD_UNLOAD events in the reverse of the order that
MOD_LOAD was invoked.  This means that the ordering of MOD_LOAD events that
is set by the SI_* paramters to DECLARE_MODULE() are now honored in the same
order they would be for SYSUNINIT() for the MOD_QUIESCE and MOD_UNLOAD
events.

MFC after:	1 month
2008-12-05 16:47:30 +00:00
jhb
0ac14961cc - Invoke MOD_QUIESCE on all modules in a linker file (kld) before
unloading any modules.  As a result, if any module veto's an unload
  request via MOD_QUIESCE, the entire set of modules for that linker
  file will remain loaded and active now rather than leaving the kld
  in a weird state where some modules are loaded and some are unloaded.
- This also moves the logic for handling the "forced" unload flag out of
  kern_module.c and into kern_linker.c which is a bit cleaner.
- Add a module_name() routine that returns the name of a module and use that
  instead of printing pointer values in debug messages when a module fails
  MOD_QUIESCE or MOD_UNLOAD.

MFC after:	1 month
2008-12-05 13:40:25 +00:00
bz
965b4db5a5 Fix a credential reference leak. [1]
Close subtle but relatively unlikely race conditions when
propagating the vnode write error to other active sessions
tracing to the same vnode, without holding a reference on
the vnode anymore. [2]

PR:		kern/126368 [1]
Submitted by:	rwatson [2]
Reviewed by:	kib, rwatson
MFC after:	4 weeks
2008-12-03 15:54:35 +00:00
bz
604d89458a Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
kib
ade687809e Shared lookup makes it possible to create several negative cache
entries for one name. Then, creating inode with that name would remove
one entry, leaving others dormant. Reclaiming the vnode would uncover
negative entries, causing false return of ENOENT from the calls like
stat, that do not create inode.

Prevent creation of the duplicated negative entries.

Reported and debugged with:	pho
Reviewed by:	jhb
X-MFC:	after shared lookup changes
2008-12-02 11:14:16 +00:00
peter
76037b082e Merge user/peter/kinfo branch as of r185547 into head.
This changes struct kinfo_filedesc and kinfo_vmentry such that they are
same on both 32 and 64 bit platforms like i386/amd64 and won't require
sysctl wrapping.

Two new OIDs are assigned.  The old ones are available under
COMPAT_FREEBSD7 - but it isn't that simple.  The superceded interface
was never actually released on 7.x.

The other main change is to pack the data passed to userland via the
sysctl.  kf_structsize and kve_structsize are reduced for the copyout.
If you have a process with 100,000+ sockets open, the unpacked records
require a 132MB+ copyout.  With packing, it is "only" ~35MB.  (Still
seriously unpleasant, but not quite as devastating).  A similar problem
exists for the vmentry structure - have lots and lots of shared libraries
and small mmaps and its copyout gets expensive too.

My immediate problem is valgrind.  It traditionally achieves this
functionality by parsing procfs output, in a packed format.  Secondly, when
tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled
32 bit binary which ran directly into the differing data structures in 32
vs 64 bit mode.  (valgrind uses this to track file descriptor operations
and this therefore affected every single 32 bit binary)

I've added two utility functions to libutil to unpack the structures into
a fixed record length and to make it a little more convenient to use.
2008-12-02 06:50:26 +00:00
peter
0cd59a18e3 Prune some whining. 2008-12-02 02:32:13 +00:00
kan
6f76d481f0 Shared memory objects that have size which is not necessarily equal to
exact multiple of system page size should still be allowed to be mapped
in their entirety to match the regular vnode backed file behavior.

Reported by: ed
Reviewed by: jhb
2008-12-01 22:33:50 +00:00
kensmith
b8203d2b22 Catch up with the disappearance of sys/dev/hfa. 2008-12-01 14:34:42 +00:00
attilio
2f606e171c Fix an inverted check introduced in r184554.
Submitted by:	tegge
Pointy hat to:	me
2008-12-01 03:00:26 +00:00
peter
cd7b78c33f Duplicate another few hundred lines of code in order to be compatible
with unreleased binaries.
2008-12-01 02:13:32 +00:00
davidxu
25ef38d002 Revision 184199 had not been fully reverted, add missing piece.
Reported by: phk
2008-12-01 01:54:55 +00:00
peter
2b1f03929a Properly wrap this giant block of duplicate code inside COMPAT_FREEBSD7 2008-11-30 21:04:53 +00:00
peter
343bde9706 Implement copyout packing more along the lines of what I had in mind.
Create a temporary duplicate implementation of old filedesc struct for
pre-7.1 libgtop package.
Todo: specific fd or addr request
2008-11-30 00:18:21 +00:00
peter
83dc2280ce WIP kinfo_file/kinfo_vmmentry tweaks. The idea:
1) to get the 32 and 64 bit versions in sync so that no shims are needed,
Valgrind in particular excercises this.  and:
2) reduce the size of the copyout.  On large processes this turns out to
be a huge problem.  Valgrind also suffers from this since it needs to do
this in a context that can't malloc. I want to pack the records.
3) Add new types..  'tell me about fd N' and 'tell me about addr N'.
2008-11-29 20:55:11 +00:00
bz
817305efd5 Unbreak the no-networks (no INET/6) build that I broke with
the commit in r185435.

Pointyhat:	no, but I could need a ski cap for the winter
2008-11-29 16:17:39 +00:00
bz
d2730d5b27 MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
  and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
  help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
  suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
  on cluster machines as well as all the testers and people
  who provided feedback the last months on freebsd-jail and
  other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by:	(see above)
MFC after:	3 months (this is just so that I get the mail)
X-MFC Before:   7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
kib
bf74bb2e16 In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.

Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.

Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.

Noted by:	tegge
Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2008-11-29 13:34:59 +00:00
pjd
881f5f6bef Improve KASSERT() call a bit:
- Print flags in hex.
- Note that flags can be fine and panic can be due unexpected error condition.
- Remove redundant new line character.

Eventhough panic message excess 80 characters keep it in one line so it is
easier to grep.
2008-11-29 12:40:14 +00:00
bz
3139d98062 With the permissions of phk@ change the license on kern_jail.c
to a 2 clause BSD license.
2008-11-28 19:23:46 +00:00
ed
355c41a8e4 Fix matching of message queues by name.
The mqfs_search() routine uses strncmp() to match message queue objects
by name. This is because it can be called from environments where the
file name is not null terminated (the VFS for example).

Unfortunately it doesn't compare the lengths of the message queue names,
which means if a system has "Queue12345", the name "Queue" will also
match.

I noticed this when a student of mine handed in an exercise using
message queues with names "Queue2" and "Queue".

Reviewed by:	rink
2008-11-28 14:53:18 +00:00
kib
c0925f09e7 Explicitely note that destroy_dev() sleeps.
Requested by:	ed (some time ago), Jaakko Heinonen <jh saunalahti fi>
2008-11-27 16:47:25 +00:00
ganbold
93f46a969a Remove unused variable.
Found with:     Coverity Prevent(tm)
CID: 3664

Approved by: kib
2008-11-27 04:40:37 +00:00
zec
95a15f5c84 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
marcus
8b6bc38b7c Move vn_fullpath1() outside of FILEDESC locking. This is being done in
advance of teaching vn_fullpath1() how to query file systems for
vnode-to-name mappings when cache lookups fail.

Thanks to kib for guidance and patience on this process.

Reviewed by:	kib
Approved by:	kib
2008-11-25 15:36:15 +00:00
emaste
abb4c5f368 Correct typo in comment: thier -> their 2008-11-24 19:28:52 +00:00
dwmalone
e3dcd91abe It's possible that the dump device has gone away after it was
configured, change the message to let people know this is a
possibility. I've slightly changed the message from the one
submitted by Pekka to keep the printf on one line.

Submitted by:	Pekka Savola <pekkas@netcore.fi>
2008-11-23 21:05:22 +00:00
kib
8fad2283b3 Add sv_flags field to struct sysentvec with intention to provide description
of the ABI of the currently executing image. Change some places to test
the flags instead of explicit comparing with address of known sysentvec
structures to determine ABI features.

Discussed with:	dchagin, imp, jhb, peter
2008-11-22 12:36:15 +00:00
kmacy
9d3bb599b1 - bump __FreeBSD version to reflect added buf_ring, memory barriers,
and ifnet functions

- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own

- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers

- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
  allow drivers to efficiently manage multiple hardware queues
  (i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues

This work was supported by Bitgravity Inc. and Chelsio Inc.
2008-11-22 05:55:56 +00:00
julian
cf07f793f2 Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from:	Ironport
MFC after:	3 days
2008-11-19 19:19:30 +00:00
jhb
c2251260be Allow device hints to wire the unit numbers of devices.
- An "at" hint now reserves a device name.
- A new BUS_HINT_DEVICE_UNIT method is added to the bus interface.  When
  determining the unit number of a device, this method is invoked to
  let the bus driver specify the unit of a device given a specific
  devclass.  This is the only way a device can be given a name reserved
  via an "at" hint.
- Implement BUS_HINT_DEVICE_UNIT() for the acpi(4) and isa(4) bus drivers.
  Both of these busses implement this by comparing the resources for a
  given hint device with the resources enumerated by ACPI/PnPBIOS and
  wire a unit if the hint resources are a subset of the "real" resources.
- Use bus_hinted_children() for adding hinted devices on isa(4) busses
  now instead of doing it by hand.
- Remove the unit kludging from sio(4) as it is no longer necessary.

Prodding from:	peter, imp
OK'd by:	marcel
MFC after:	1 month
2008-11-18 21:01:54 +00:00
jhb
b08d457fbe When checking to see if another CPU is running its idle thread, examine
the thread running on the other CPU instead of the thread being placed on
the run queue.

Reported by:	Ravi Murty @ Intel
Reviewed by:	jeff
2008-11-18 05:41:34 +00:00
delphij
39ade9204d Obey signedness flag in %z case.
MFC after:	2 months
2008-11-17 23:57:40 +00:00
pjd
bbe899b96e Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.
This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

	Allows regular users to perform ZFS operations, like file system
	creation, snapshot creation, etc.

- L2ARC

	Level 2 cache for ZFS - allows to use additional disks for cache.
	Huge performance improvements mostly for random read of mostly
	static content.

- slog

	Allow to use additional disks for ZFS Intent Log to speed up
	operations like fsync(2).

- vfs.zfs.super_owner

	Allows regular users to perform privileged operations on files stored
	on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

	Not all the flags are supported. This still needs work.

- ZFSBoot

	Support to boot off of ZFS pool. Not finished, AFAIK.

	Submitted by:	dfr

- Snapshot properties

- New failure modes

	Before if write requested failed, system paniced. Now one
	can select from one of three failure modes:
	- panic - panic on write error
	- wait - wait for disk to reappear
	- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

	Just quota and reservation properties, but don't count space consumed
	by children file systems, clones and snapshots.

- Sparse volumes

	ZVOLs that don't reserve space in the pool.

- External attributes

	Compatible with extattr(2).

- NFSv4-ACLs

	Not sure about the status, might not be complete yet.

	Submitted by:	trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from:	OpenSolaris
2008-11-17 20:49:29 +00:00
kib
ec4c199459 Revert r184118. There is actually a code in the kernel, for instance in
kern_unlinkat(), that expects that vn_start_write() actually fills the mp
even when the call failed.

As Tor noted, that pattern relies on the the type stability of the mount
points, as well as that suspended mount points are never freed and
V_XSLEEP is always passed to vn_start_write() when called on a freed
mount point.

Reported by:	stass
Reviewed by:	tegge
PR:		123768
2008-11-16 21:56:29 +00:00
n_hibma
1e9b4258a4 Silence detach messages if the device has marked itself quiet (u3g).
MFC after:	3 weeks
2008-11-13 21:46:19 +00:00
ed
1722724523 Don't forget to relock the TTY after uiomove() returns an error.
Peter Holm just discovered this funny bug inside the TTY code: if
uiomove() in ttydisc_write() returns an error, we forget to relock the
TTY before jumping out of ttydisc_write(). Fix it by placing
tty_unlock() and tty_lock() around uiomove().

Submitted by:	pho
2008-11-12 09:04:44 +00:00
ed
8d12469978 Several cleanups related to pipe(2).
- Use `fildes[2]' instead of `*fildes' to make more clear that pipe(2)
  fills an array with two descriptors.

- Remove EFAULT from the manual page. Because of the current calling
  convention, pipe(2) raises a segmentation fault when an invalid
  address is passed.

- Introduce kern_pipe() to make it easier for binary emulations to
  implement pipe(2).

- Make Linux binary emulation use kern_pipe(), which means we don't have
  to recover td_retval after calling the FreeBSD system call.

Approved by:	rdivacky
Discussed on:	arch
2008-11-11 14:55:59 +00:00
gallatin
5b802b524b Avoid scheduling firmware taskqs when cold.
This prevents a panic which occurs when a driver attempts to load
firmware at boot via firmware_get() when the firmware module has not
been preloaded.  firmware_get() will  enqueue a task using a struct
taskqueue allocated on the stack, and the machine will crash much
later in the firmware taskq thread when taskqs are started and the
struct taskqueue is garbage.

Not objected to by: sam
2008-11-11 12:25:08 +00:00
ed
7baae41248 Regenerate system call tables for r184789. 2008-11-09 10:48:06 +00:00
ed
9d3703b842 Mark uname(), getdomainname() and setdomainname() with COMPAT_FREEBSD4.
Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.

Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().

I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.

Reviewed by:	rdivacky, kib
2008-11-09 10:45:13 +00:00
kmacy
7f1b49e5c3 make kern.ipc.nmbclusters actually have a useful effect on nmbclusters et al.
initialize pkthdr in field order
2008-11-09 01:53:06 +00:00
ed
c3eca8b8cc Reduce the default baud rate of PTY's to 9600.
On RELENG_6 (and probably RELENG_7) we see our syscons windows and
pseudo-terminals have the following buffer sizes:

| LINE RAW CAN OUT IHIWT ILOWT OHWT LWT     COL STATE  SESS      PGID DISC
| ttyv0  0   0   0  7680  6720 2052 256       7 OCcl       1146  1146 term
| ttyp0  0   0   0  7680  6720 1296 256       0 OCc       82033 82033 term

These buffer sizes make no sense, because we often have much more output
than input, but I guess having higher input buffer sizes improves
guarantees of the system.

On MPSAFE TTY I just sent both the input and output buffer sizes to 7
KB, which is pretty big on a standard FreeBSD install with 8 syscons
windows and some PTY's. Reduce the baud rate to 9600 baud, which means
we now have the following buffer sizes:

|  LINE   INQ  CAN  LIN  LOW  OUTQ  USE  LOW   COL  SESS  PGID STATE
| ttyv0  1920    0    0  192  1984    0  199     7  2401  2401 Oil
| pts/0  1920    0    0  192  1984    0  199  5631  1305  2526 Oi

This is a lot smaller, but for pseudo-devices this should be good
enough. You need to do a lot of punching to fill up a 7.5 KB input
buffer. If it turns out things don't work out this way, we'll just
switch to 19200 baud.
2008-11-08 20:40:39 +00:00
rodrigc
ca625c199f Merge latest DTrace changes from Perforce.
Approved by:	jb
2008-11-05 19:40:36 +00:00
davidxu
1ebf3ee9a3 Revert rev 184216 and 184199, due to the way the thread_lock works,
it may cause a lockup.

Noticed by: peter, jhb
2008-11-05 03:01:23 +00:00
jhb
fb34cedca4 Use shared vnode locks for auditing vnode arguments as auditing only
does a VOP_GETATTR() which does not require an exclusive lock.

Reviewed by:	csjp, rwatson
2008-11-04 22:31:04 +00:00
jhb
bfd6c884ce Don't bother calling setrunnable() and clearing the sleeping flag in
sleepq_resume_thread() if the thread isn't asleep.
2008-11-04 19:13:53 +00:00
jhb
6c6f8c89e8 Remove unnecessary locking around vn_fullpath(). The vnode lock for the
vnode in question does not need to be held.  All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.

In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).

For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.

MFC after:	1 month
2008-11-04 19:04:01 +00:00
ed
27790fa127 Remove redundant return value tests.
There is no need to test whether the return value is non-zero here. Just
return the error number directly.
2008-11-04 10:58:02 +00:00
jhb
9757361d22 Adjust the license statement to more closely match a standard 3-clause BSD
license.

MFC after:	3 days
2008-11-03 21:17:02 +00:00
jhb
ee8312c8bb Use shared vnode locks instead of exclusive vnode locks for the access(),
chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek()
(when figuring out the current size of the file in the SEEK_END case),
pathconf(), readlink(), and statfs() system calls.

Submitted by:	ups (mostly)
Tested by:	pho
MFC after:	1 month
2008-11-03 20:31:00 +00:00
attilio
26a604f3bc Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless.
Really, the concept of holdcnt in the struct mount is rappresented by
the mnt_ref (which prevents the type-stable structure from being
"recycled) handled through vfs_ref() and vfs_rel().
On this optic, switch the holdcnt acquisition into an emulated vfs_ref()
(and subsequent release into vfs_rel()).

Discussed with:	kib
Tested by:	pho
2008-11-03 20:00:35 +00:00
jhb
24139401dd A few style nits. 2008-11-03 19:33:20 +00:00
dfr
6929a6d99b Regen. 2008-11-03 10:39:35 +00:00
dfr
2fb03513fc Implement support for RPCSEC_GSS authentication to both the NFS client
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager.  I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.

The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.

To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.

As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.

Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.

The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.

Sponsored by:	Isilon Systems
MFC after:	1 month
2008-11-03 10:38:00 +00:00
ivoras
d819bb20f8 Increase the initial sbuf size for CPU topology dump to something more
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.

Approved by:	gnn (mentor) (older version)
Pointed out by:	rdivacky
2008-11-02 23:11:20 +00:00
attilio
e1f493235e Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
  mnt_lock and using instead a refcount in order to keep track of lock
  requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
  useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
  Change so its KPI by removing the interlock argument and defining 2 new
  flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
  old version (which was unlinked from the lockmgr alredy) and
  MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
  once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
  mnt_ref if running for more than 3 seconds, make it totally useless.
  Remove it as it was thought to work into older versions.
  If a problem of "refcount held never going away" should appear, we will
  need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
  leaving the MNTK_MWAIT flag on even if it the waiters were actually
  woken up. Just a place in vfs_mount_destroy() is left because it is
  going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with:	kib
Tested by:	pho
2008-11-02 10:15:42 +00:00
ed
57b4089c20 Clamp the values of t_column to 5 digits in pstat -t' and show all ttys'.
We often run into these very high column numbers when we run curses
applications, because they don't print any newlines. This messes up the
table output of `pstat -t'. If these numbers get really high, they
aren't of any use to the reader anyway. Convert them to `99999' when
they run out of bounds.
2008-11-01 13:40:46 +00:00
ed
c2c324d379 Reimplement the /dev/console device node.
One of the pieces of code that I had left alone during the development
of the MPSAFE TTY layer, was tty_cons.c. This file actually has two
different functions:

- It contains low-level console input/output routines (cnputc(), etc).

- It creates /dev/console and wraps all its cdevsw calls to the
  appropriate TTY.

This commit reimplements the second set of functions by moving it
directly into the TTY layer. /dev/console is now a character device node
that's basically a regular TTY, but does a lookup of `si_drv1' each time
you open it. d_write has also been changed to call log_console().
d_close() is not present, because we must make sure we don't revoke the
TTY after writing a log message to it.

Even though I'm not convinced this is in line with the future directions
of our console code, it is a good move for now. It removes recursive
locking from the top half of the TTY layer. The previous implementation
called into the TTY layer with Giant held.

I'm renaming tty_cons.c to kern_cons.c now. The code hardly contains any
TTY related bits, so we'd better give it a less misleading name.

Tested by:	Andrzej Tobola <ato iem pw edu pl>,
		Carlos A.M. dos Santos <unixmania gmail com>,
		Eygene Ryabinkin <rea-fbsd codelabs ru>
2008-11-01 08:35:28 +00:00
peter
1f7fd22cbb Add three extra to the kinfo_proc_vmmap data. kve_offset - the offset
within an object that a mapping refers to.  fileid and fsid are inode/dev
for vnodes.  (Linux procfs has these and valgrind is really unhappy
without them.)  I believe I didn't change the size of the struct.
2008-10-31 05:43:19 +00:00
sobomax
dafc63cd43 Make it possible to compile kernel with KTR but without DDB. 2008-10-30 21:48:28 +00:00
ivoras
483637ae39 Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.

An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
   <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
   <flags></flags>
   <children>
     <group level="2" cache-level="0">
       <cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
       <flags></flags>
     </group>
     <group level="2" cache-level="0">
       <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
       <flags></flags>
     </group>
   </children>
 </group>
</groups>

Reviewed by:	jeff
Approved by:	gnn (mentor)
2008-10-29 13:36:23 +00:00
davidxu
11aa09b488 If threads limit is exceeded, increase the totoal number
of failures.
2008-10-29 12:11:48 +00:00
trasz
4e57a80147 Rename a variable missed in previous accmode_t-related commits.
Approved by:	rwatson (mentor)
2008-10-28 21:58:48 +00:00
trasz
0ad8692247 Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by:	rwatson (mentor)
2008-10-28 13:44:11 +00:00
kib
b9b0d2c54c Style return statements in vn_pollrecord(). 2008-10-28 12:22:33 +00:00
kib
86b5e61ab2 Protect check for v_pollinfo == NULL and assignment of the newly allocated
vpollinfo with vnode interlock. Fully initialize vpollinfo before putting
pointer to it into vp->v_pollinfo.

Discussed with:	dwhite
Tested by:	pho
MFC after:	1 week
2008-10-28 12:08:36 +00:00
rwatson
a2129bd144 Rename three MAC entry points from _proc_ to _cred_ to reflect the fact
that they operate directly on credentials: mac_proc_create_swapper(),
mac_proc_create_init(), and mac_proc_associate_nfsd().  Update policies.

Obtained from:	TrustedBSD Project
2008-10-28 11:33:06 +00:00
peter
b5b26198a7 After a machine has been up for a bit more than 20 days with HZ=1000,
"ticks" goes negative.  This breaks the signed comparison in softclock.
This causes sleep() to never wake up, tcp to stop, etc etc.  This is
bad(TM).  Use the SEQ_LT() method from tcp's sequence number comparisons.
2008-10-28 03:26:25 +00:00
jhb
c343bee743 - Whitespace fix for vop_poll.
- Use the right label for vop_vptofh lock assertions so they are enforced.
2008-10-27 21:41:55 +00:00
sobomax
2bddeb51d2 vm_pnames should be "const char *const[]".
Submitted by:	Christoph Mallon
2008-10-27 08:09:05 +00:00
sobomax
c9fd562aa0 vm_pnames has no reason to be global.
MFC after:	2 weeks
2008-10-27 06:34:41 +00:00
sobomax
6b076dc603 Default HZ value (1,000) on i386/amd64 is not very virtual machine friendly.
Due to the nature of the beast it causes lot of unproductive overhead. This
is especially bad when running SMP kernel on VMWare with several virtual
processors - idle FreeBSD guest with SMP kernel takes 150% host CPU time on my
dual-core MacBook Pro when I am enabling two virtual CPUs, making even host
not very usable. Detect when we are running in the sandbox and reduce HZ
to 10 (can be adjusted via VM_HZ in the kernel config) in such cases. This
brings host CPU usage of idle FreeBSD/SMP on two virtual processors down
to 10%.

Detect most popular VM platforms out there - VMWare, Parallels, VirtualBox
and VirtualPC.

MFC after:	2 weeks
2008-10-27 06:25:02 +00:00
dfr
f98e1f1bbf Don't rely on the value of *statep without first taking the vnode interlock.
Reviewed by:	Mike Tancsa
MFC after:	2 weeks
2008-10-24 16:04:10 +00:00
davidxu
238f3ee5f4 Don't rearm callout if the process is exiting, it may leak a callout
because callout_drain() only waits for running callout, but not disable
it if it is rearmed.
2008-10-24 01:09:24 +00:00
davidxu
e66e7ee6bb partly revert revision 184199, because TDF_NEEDSIGCHK is persitent
when thread is in kernel mode, it can cause dead loop, now unlock
process lock after acquired sleep queue lock and thread lock to
avoid the problem. This means TDF_NEEDSIGCHK and TDF_NEEDSUSPCHK must
be set with process lock and thread lock being hold at same time.
2008-10-24 01:03:31 +00:00
jhb
2e4682de75 Whitespace fix. 2008-10-23 21:50:16 +00:00
des
a1e1ad22e0 Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.
2008-10-23 20:26:15 +00:00
des
66f807ed8b Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
davidxu
2062caca24 Actually, for signal and thread suspension, extra process spin lock is
unnecessary, the normal process lock and thread lock are enough. The
spin lock is still needed for process and thread exiting to mimic
single sched_lock.
2008-10-23 07:55:38 +00:00
jhb
327ae6eb3a Split the copyout of *base at the end of getdirentries() out leaving the
rest in kern_getdirentries().  Use kern_getdirentries() to implement
freebsd32_getdirentries().  This fixes a bug where calls to getdirentries()
in 32-bit binaries would trash the 4 bytes after the 'long base' in
userland.

Submitted by:	ups
MFC after:	1 week
2008-10-22 21:55:48 +00:00
marcel
7de1858d0c Trivially avoid a null pointer dereference when drivers
don't set the rman description. While drivers should set
it, a kernel panic is not the right behaviour when faced
without one.
2008-10-22 18:20:45 +00:00
thompsa
0fcb99be5e Fix spelling mistake in the last rev. 2008-10-21 14:44:25 +00:00
thompsa
8ee58ba9e6 If we have getc_inject hooked then the outq buffer is inaccessible to the
driver so skip the drain rather than waiting indefinitely.

Reviewed by:	ed
2008-10-21 14:18:45 +00:00
kib
cc3d7dc928 Change vn_start_write() to clear *mpp on all failures when non-NULL vp
is supplied, since vm_pageout_scan() expects it to be cleared on error.

Submitted by:	tegge
PR:	123768
MFC after:	1 week
2008-10-21 09:55:49 +00:00
attilio
42c5b05453 In the actual code for witness_warn:
- If there aren't spinlocks held, but there are problems with old
  sleeplocks, they are not reported.
- If the spinlock found is not the only one, problems are not reported.

Fix these 2 problems.

Reported by:	tegge
2008-10-20 19:22:16 +00:00
kib
e4785f6af4 Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock
and ffs_lock. This cannot catch situations where holdcnt is incremented
not by curthread, but I think it is useful.

Reviewed by:	tegge, attilio
Tested by:	pho
MFC after:	2 weeks
2008-10-20 10:11:33 +00:00
kib
015479d466 In vfs_busy(), lockmgr() cannot legitimately sleep, because code checked
MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always
tries to obtain a shared lock on mnt_lock, the other user is unmount who
tries to drain it, setting MNTK_UNMOUNT before.

Reviewed by:	tegge, attilio
Tested by:	pho
MFC after:	2 weeks
2008-10-20 10:07:28 +00:00
davidxu
57a7a67ea5 In realtimer_delete(), clear timer's value and interval to tell
realtimer_expire() to not rearm the timer, otherwise there is a chance
that a callout will be left there and be tiggered in future unexpectly.

Bug reported by: tegge@
2008-10-20 02:37:53 +00:00
kib
e8c0b1746f Ktr(9) stores format string and arguments in the event circular buffer,
not the string formatted at the time of CTRX() call. Stack_ktr(9) uses
an on-stack buffer for the symbol name, that is supplied as an argument
to ktr. As result, stack_ktr() traces show garbage or cause page faults.

Fix stack_ktr() by using pointer to module symbol table that is supposed
to have a longer lifetime.

Tested by:	pho
MFC after:	1 week
2008-10-19 11:13:49 +00:00
kmacy
4ceda2abba - Forward port flush of page table updates on context switch or userret
- Forward port vfork XEN hack
2008-10-19 01:35:27 +00:00
bz
4d4d2d367d Add cr_canseeinpcb() doing checks using the cached socket
credentials from inp_cred which is also available after the
socket is gone.
Switch cr_canseesocket consumers to cr_canseeinpcb.
This removes an extra acquisition of the socket lock.

Reviewed by:	rwatson
MFC after:	3 months (set timer; decide then)
2008-10-17 16:26:16 +00:00
kmacy
f9a07efdb6 make sure that SO_NO_DDP and SO_NO_OFFLOAD get passed in correctly
PR:		127360
MFC after:	3 days
2008-10-17 01:25:45 +00:00
attilio
708fbd2d50 - Fix a race in witness_checkorder() where, between the PCPU_GET() and
PCPU_PTR() curthread can migrate on another CPU and get incorrect
  results.
- Fix a similar race into witness_warn().
- Fix the interlock's checks bypassing by correctly using the appropriate
  children even when the lock_list chunk to be explored is not the first
  one.
- Allow witness_warn() to work with spinlocks too.

Bugs found by:	tegge
Submitted by:	jhb, tegge
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-16 12:42:56 +00:00
davidxu
3f5ab59cf2 Restore code wrongly removed in SVN revision 173004, it causes threaded
process to be stuck in execv().

Noticed by: delphij
2008-10-16 04:17:17 +00:00
ed
48c0c8f51a Import some improvements to the TTY code from the MPSAFE TTY branch.
- Change the ddb(4) commands to be more useful (by thompsa@):
  - `show ttys' is now called `show all ttys'. This command will now
    also display the address where the TTY data structure resides.
  - Add `show tty <addr>', which dumps the TTY in a readable form.

- Place an upper bound on the TTY buffer sizes. Some drivers do not want
  to care about baud rates. Protect these drivers by preventing the TTY
  buffers from getting enormous. Right now we'll just clamp it to 64K,
  which is pretty high, taking into account that these buffers are only
  used by the built-in discipline.

- Only call ttydev_leave() when needed. Back in April/May the TTY
  reference counting mechanism was a little different, which required us
  to call ttydev_leave() each time we finished a cdev operation.
  Nowadays we only need to call ttydev_leave() when we really mark it as
  being closed.

- Improve return codes of read() and write() on TTY device nodes.

- Make sure we really wake up all blocked threads when the driver calls
  tty_rel_gone(). There were some possible code paths where we didn't
  properly wake up any readers/writers.

- Add extra assertions to prevent sleeping on a TTY that has been
  abandoned by the driver.

- Use ttydev_cdevsw as a more reliable method to figure out whether a
  device node is a real TTY device node.

Obtained from:	//depot/projects/mpsafetty/...
Reviewed by:	thompsa
2008-10-15 16:58:35 +00:00
davidxu
5068f6dcf0 Move per-thread userland debugging flags into seperated field,
this eliminates some problems of locking, e.g, a thread lock is needed
but can not be used at that time. Only the process lock is needed now
for new field.
2008-10-15 06:31:37 +00:00
rdivacky
ead773b051 Check the result of copyin and in a case of error
return one. This prevents setting wrong priority
or (more likely) returning EINVAL.

Approved by:	kib (mentor)
2008-10-13 21:04:52 +00:00
rwatson
ef6dfc27c4 Downgrade XXX to a Note for fgetsock() and fputsock().
MFC after:	3 days
2008-10-12 20:03:17 +00:00
rwatson
f2c33837dd Remove stale comment: while uipc_connect2() was, until recently, not
static so it could be used by fifofs (actually portalfs), it is now
static.

Submitted by:	kensmith
2008-10-11 17:28:22 +00:00
attilio
b8bf37e585 Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by:	kib
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-10 21:23:50 +00:00
imp
793aee6634 Close, but not eliminate, a race condition. It is one that properly
designed drivers would never hit, but was exposed in diving into
another problem...

When expanding the devclass array, free the old memory after updating
the pointer to the new memory.  For the following single race case,
this helps:

	allocate new memory
	copy to new memory
	free old memory
<interrupt>				read pointer to freed memory
	update pointer to new memory

Now we do
	allocate new memory
	copy to new memory
	update pointer to new memory
	free old memory

Which closes this problem, but doesn't even begin to address the
multicpu races, which all should be covered by Giant at the moment,
but likely aren't completely.

Note: reviewers were ok with this fix, but suggested the use case
wasn't one we wanted to encourage.

Reviewed by:	jhb, scottl.
2008-10-10 17:49:47 +00:00
kib
997f16fb43 If the ABI-overriden interpreter was not loaded, do not set
have_interp to TRUE. This allows the code in image activator to try
/libexec/ld-elf.so.1 as interpreter when newinterp is not found to
execute.

Reviewed by:	peter
MFC after:	2 weeks (together with r175105)
2008-10-08 11:11:36 +00:00
rwatson
8315016284 Remove stale comment (and XXX saying so) about why we zero the file
descriptor pointer in unp_freerights: we can no longer recurse into
unp_gc due to unp_gc being invoked in a deferred way, but it's still
a good idea.

MFC after:	3 days
2008-10-08 06:26:51 +00:00
rwatson
82c89c763f Differentiate pr_usrreqs for stream and datagram UNIX domain sockets, and
employ soreceive_dgram for the datagram case.

MFC after:	3 months
2008-10-08 06:19:49 +00:00
rwatson
72d39e41ec In soreceive_dgram, when a 0-length buffer is passed into recv(2) and
no data is ready, return 0 rather than blocking or returning EAGAIN.
This is consistent with the behavior of soreceive_generic (soreceive)
in earlier versions of FreeBSD, and restores this behavior for UDP.

Discussed with:	jhb, sam
MFC after:	3 days
2008-10-07 20:57:55 +00:00
rwatson
494f70982b Remove temporary debugging KASSERT's introduced to detect protocols
improperly invoking sosend(), soreceive(), and sopoll() instead of
attach either specialized or _generic() versions of those functions
to their pru_sosend, pru_soreceive, and pru_sopoll protosw methods.

MFC after:	3 days
2008-10-07 09:57:03 +00:00
rwatson
73c76af492 Rewrite sbreserve_locked()'s comment on NULL thread pointers, eliminating
an XXXRW about the comment being stale.

MFC after:	3 days
2008-10-07 09:51:39 +00:00
rwatson
064d14f0bc Lock receive socket buffer in soo_stat() rather than commenting that we
should lock it, which may marginally improve the consistency of the
results.  Remove comment.

MFC after:	3 days
2008-10-07 07:10:28 +00:00
rwatson
730ae9451a Now that portalfs doesn't directly invoke uipc_connect2(), make it a
static symbol.

MFC after:	3 days
2008-10-06 18:43:11 +00:00
sam
5a99959acc dynamically allocate the task structure in firmware_mountroot: when
booting from an MFS root (e.g. from an install CD) firmware_mountroot
can be called twice with the second call happening before the task
callback occurs; this results in the task structure contents being
corrupted because it was declared static.

Submitted by:	marius (original version)
2008-10-04 23:58:02 +00:00
jhb
c07f87c6a9 Oops, missed updating a place with with 's/lock1/plock/' when adding
interlock support to WITNESS.  Specifically, the printf listing the
first location when duplicate locks of the same type are acquired.

Reported by:	pho
2008-10-03 18:13:05 +00:00
rwatson
bbe0e18165 Further minor cleanups to UNIX domain sockets:
- Staticize and locally prototype functions uipc_ctloutput(), unp_dispose(),
  unp_init(), and unp_externalize(), none of which have been required
  outside of uipc_usrreq.c since uipc_proto.c was removed.
- Remove stale prototype for uipc_usrreq(), which has not existed in the
  code since 1997
- Forward declare and staticize uipc_usrreqs structure in uipc_usrreq.c and
  not un.h.
- Comment on why uipc_connect2() is still non-static -- it is used directly
  by fifofs.
- Remove stale comments, tidy up whitespace.

MFC after:	3 days (where applicable)
2008-10-03 13:01:56 +00:00
rwatson
6ee0280c0f Remove or update several stale comments.
A bit of whitespace/style cleanup.

Update copyright.

MFC after:	3 days (applicable changes)
2008-10-03 09:01:55 +00:00
zec
8797d4caec Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
peter
ed8d07f232 Collect N identical (or near identical) mkdumpheader() implementations into
one, as threatened in the comment.  Textdump magic can be passed in.
2008-10-01 22:08:53 +00:00
jhb
f99df3bfe4 Enable shared locks for path name lookups on supported filesystems (NFS
client, UFS, and ZFS) by default.
2008-10-01 19:25:37 +00:00
jhb
d705b32c33 Remove the LOOKUP_SHARED kernel option. Instead, make vfs.lookup_shared
a loader tunable (it was already a sysctl).
2008-10-01 19:24:16 +00:00
jhb
ee566dffaa Wait until after dropping the receive socket buffer lock to allocate space
to store the socket address stored in the first mbuf in a packet chain.
This reduces contention on the lock and CPU system time in certain UDP
workloads.

Tested by:	ps
Reviewed by:	rwatson
MFC after:	1 week
2008-10-01 19:14:05 +00:00
rwatson
2d03779951 Various cleanups for soreceive_dgram():
- Update or remove comments that were left over from the original
  soreceive_generic() implementation.  Quite a few were misleading in the
  context of the new code.
- Since soreceive_dgram() has a simpler structure, replace several gotos
  with a while loop making the invariants more clear.
- In the blocking while loop, don't try to handle cases incompatible with
  the loop invariant (since m is always NULL, don't check for and handle
  non-NULL).
- Don't drop and re-acquire the socket buffer lock unnecessarily after
  sbwait() returns, which may help reduce lock contention (etc).
- Assume PR_ATOMIC since we assert it at the top of the function.

MFC after:	3 days
2008-10-01 13:26:52 +00:00
jhb
3eb652b1c7 Update the function name in several assertions in soreceive_dgram().
Approved by:	rwatson
MFC after:	3 days
2008-09-30 18:44:26 +00:00
kib
1fb31bd167 If the panic thread is preempted after setting panicstr but before
setting TDF_INPANIC then it will never be rescheduled again. Wrap
setting the panic condition with the critical section.

Noted and reviewed by:	tegge
MFC after:	1 week
2008-09-27 15:45:54 +00:00
ed
e40b7c4704 Move uminor() and umajor() to the same place as userspace minor() and major().
The uminor() and umajor() functions have the same use in kernel space as
the minor() and major() functions in userspace. If we ever get rid of
the minor() function in kernel space, we could decide to just expose
minor() and major() to kernel space, making uminor() and umajor()
redundant.

There are two reasons why we want to have uminor() and umajor() in
<sys/types.h>:

- Having them close together prevents them from diverting. Even though
  it's unlikely the definitions will change, it's a good habit to have
  them at the same place.

- They don't really belong in kern_conf.c. kern_conf.c has been
  liberated from dealing with device major and minor number handling.

The device_ids(9) manpage now lists the wrong #include's, because it
should only list <sys/types.h> now. I'm leaving it as it is now, because
I wonder if we should document them anyway. We're probably better off
documenting minor(3) and major(3).
2008-09-27 13:19:09 +00:00
ed
4efdef565f Replace all calls to minor() with dev2unit().
After I removed all the unit2minor()/minor2unit() calls from the kernel
yesterday, I realised calling minor() everywhere is quite confusing.
Character devices now only have the ability to store a unit number, not
a minor number. Remove the confusion by using dev2unit() everywhere.

This commit could also be considered as a bug fix. A lot of drivers call
minor(), while they should actually be calling dev2unit(). In -CURRENT
this isn't a problem, but it turns out we never had any problem reports
related to that issue in the past. I suspect not many people connect
more than 256 pieces of the same hardware.

Reviewed by:	kib
2008-09-27 08:51:18 +00:00
ed
2e07c6d916 Don't forget to initialize `int error' in ttydev_open().
I've had some reports in the past that opening an already opened TTY
through, for example, /dev/tty can fail with random error codes. Looking
at ttydev_open(), I can see there is a way `error' is returned without
initialising it. Even though I haven't had any confirmation this fixes
the bug, I'll fix it anyway.

Reported by:	Andrzej Tobola <ato iem pw edu pl>
2008-09-26 18:17:04 +00:00
ed
d421caabf9 Rename the minor' argument of make_dev(9) to unit'.
To prevent any further confusion about device minor and unit numbers,
we'd better just refer to device unit numbers. Many people still think
the numbers we show inside devfs have any relation to the numbers passed
to make_dev(9), which is not the case.

Discussed with:	kib
2008-09-26 14:31:24 +00:00
ed
4212d51a7d Remove unit2minor() use from kernel code.
When I changed kern_conf.c three months ago I made device unit numbers
equal to (unneeded) device minor numbers. We used to require
bitshifting, because there were eight bits in the middle that were
reserved for a device major number. Not very long after I turned
dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
The unit2minor() and minor2unit() macro's were no-ops.

We'd better not remove these four macro's from the kernel, because there
is a lot of (external) code that may still depend on them. For now it's
harmless to remove all invocations of unit2minor() and minor2unit().

Reviewed by:	kib
2008-09-26 14:19:52 +00:00
jhb
6ccb676bf2 Regen. 2008-09-25 20:08:36 +00:00
jhb
00776aeb58 Tidy up a few things with syscall generation:
- Instead of using a syscall slot (370) just to get a function prototype
  for lkmressys(), add an explicit function prototype to <sys/sysent.h>.
  This also removes unused special case checks for 'lkmressys' from
  makesyscalls.sh.
- Instead of having magic logic in makesyscalls.sh to only generate a
  function prototype the first time 'lkmnosys' is seen, make 'NODEF'
  always not generate a function prototype and include an explicit
  prototype for 'lkmnosys' in <sys/sysent.h>.
- As a result of the fix in (2), update the LKM syscall entries in
  the freebsd32 syscall table to use 'lkmnosys' rather than 'nosys'.
- Use NOPROTO for the __syscall() entry (198) in the native ABI.  This
  avoids the need for magic logic in makesyscalls.h to only generate
  a function prototype the first time 'nosys' is encountered.
2008-09-25 20:07:42 +00:00
jhb
9c5408c4f9 - Don't do a WITNESS_SAVE() on the interlock if it is Giant in the condition
variable wait routines.  DROP_GIANT() already manages that state in the
  Giant interlock case.
- Assert that Giant is held when it is passed as a sleep interlock.
2008-09-25 13:42:19 +00:00
jhb
1161006cb6 Part 1 of making shared lookups more resilient with respect to forced
unmounts.  When we upgrade a vnode lock from shared to exclusive during
a name cache lookup, fail the lookup with EBADF if the vnode is invalidated
while we are waiting for the exclusive lock.

Also, for correctness (though I'm not sure it can occur in practice),
downgrade an exclusively locked vnode if it should be share locked.

Tested by:	pho
2008-09-24 18:51:33 +00:00
jhb
8d01b3e526 Update description of witness_watch. 2008-09-24 18:47:24 +00:00
ed
7993f2b835 Fix a crash when calling tty_rel_free() while draining during closure.
Yesterday I got two reports of potential crashes, related to TTY
deallocation during device closure. When a thread is in TF_OPENCLOSE,
draining its output upon closure, we should not allow calls to
tty_rel_free() to happen at the same time. This could cause the TTY to
be torn down twice.

PR:		kern/127561
Reported by:	KOIE Hidetaka <koie suri co jp>
Discussed with:	thompsa
2008-09-24 11:16:09 +00:00
kib
c500808674 Change the static struct sysentvec and struct Elf_Brandinfo initializers
to the C99 style. At least, it is easier to read sysent definitions
that way, and search for the actual instances of sigcode etc.

Explicitely initialize sysentvec.sv_maxssiz that was missed in most
sysvecs.

No objection from:	jhb
MFC after:	1 month
2008-09-24 10:14:37 +00:00
ed
94258be421 Track state to determine if the associated TTY device node has been used.
It turns out our old TTY layer (and other implementations) block when
you read() on a PTY master device of which the slave device node has not
been opened yet. Our new implementation just returned 0. This caused
applications like telnetd to die in a very subtle way (when child
processes would open the TTY later than the first call to select()).

Introduce a new flag called PTS_FINISHED, which indicates whether we
should block or bail out of a read() or write() occurs.

Reported by:	Claude Buisson <clbuisson orange fr>
2008-09-23 17:12:25 +00:00
obrien
219d6d1626 style(9) 2008-09-23 14:25:56 +00:00
obrien
8cb3aed24c Reverse if() logic to improve readability.
Reviewed by:	ru
2008-09-23 14:25:38 +00:00
ed
1475e942ed Introduce a hooks layer for the MPSAFE TTY layer.
One of the features that prevented us from fixing some of the TTY
consumers to work once again, was an interface that allowed consumers to
do the following:

- `Sniff' incoming data, which is used by the snp(4) driver.

- Take direct control of the input and output paths of a TTY, which is
  used by ng_tty(4), ppp(4), sl(4), etc.

There's no practical advantage in committing a hooks layer without
having any consumers. In P4 there is a preliminary port of snp(4) and
thompsa@ is busy porting ng_tty(4) to this interface. I already want to
have it in the tree, because this may stimulate others to work on the
remaining modules.

Discussed with:	thompsa
Obtained from:	//depot/projects/mpsafetty/...
2008-09-22 19:25:14 +00:00
ed
d49d04e133 Fix style(9) issue in TTY header files: document function argument names.
According to style(9), function argument names should only be omitted
for prototypes that are exported to userspace. This means we should
document the function arguments in the TTY header files, because they
are only used in userspace.

While there, change the type of the buffer argument of
ttydisc_rint_bypass() to `const void *' instead of `char *'.

Requested by:	attilio
Obtained from:	//depot/projects/mpsafetty/...
2008-09-22 18:44:09 +00:00
jkoshy
9d661b5bf6 Support sparsely numbered CPUs.
Requested by:	obrien, alfred (long ago)
2008-09-22 10:37:02 +00:00
ed
acdad30e01 Make fstat() on a pseudo-terminal master return sane timestamps.
Because pseudo-terminal master file descriptors no longer have a vnode
underneath, we have to fill in fstat() values ourselves. Make our
implementation somewhat sane by returning the timestamps of the TTY
device node that corresponds with our file descriptor.

Obtained from:	//depot/projects/mpsafettty/...
2008-09-21 19:24:15 +00:00
ed
3c13ffd7b5 Now that the number of clist consumers have dropped massively, trim down
the code to prevent useless waste of space.

- Remove support for quote bits. There is not a single driver that needs
  these bits anymore. This means putc() now accepts a char instead of an
  int.

- Remove the unneeded catq() and nextc() routines. They were only used
  by the old TTY layer.

- Convert the clist code to use ANSI C prototypes.
2008-09-21 18:12:18 +00:00
kib
a127656dea fdescfs, devfs, mqueuefs, nfs, portalfs, pseudofs, tmpfs and xfs
initialize the vattr structure in VOP_GETATTR() with VATTR_NULL(),
vattr_null() or by zeroing it. Remove these to allow preinitialization
of fields work in vn_stat(). This is needed to get birthtime initialized
correctly.

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:50:52 +00:00
kib
ef4f1dc9c7 Initialize va_rdev to NODEV instead of 0 or VNOVAL in VOP_GETATTR().
NODEV is more appropriate when va_rdev doesn't have a meaningful value.

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Suggested by:   bde
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:49:15 +00:00
kib
570463af80 Initialize va_rdev to NODEV and va_fsid to VNOVAL before the
VOP_GETATTR() call in vn_stat(). Thus if a file system doesn't
initialize those fields in VOP_GETATTR() they will have a sane default
value.

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:48:24 +00:00
kib
81d455e702 Initialize va_flags and va_filerev properly in VOP_GETATTR(). Don't
initialize va_vaflags and va_spare because they are not part of the
VOP_GETATTR() API. Also don't initialize birthtime to ctime or zero.

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Reviewed by:	bde
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:46:45 +00:00
kib
c6232cabbe Initialize birthtime fields in vn_stat() to prevent stat(2) from
returning uninitialized birthtime. Most file systems don't initialize
birthtime properly in their VOP_GETTATTR().

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Reviewed by:	bde
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:43:22 +00:00
obrien
0c0da6bba7 Add freebsd32 compat shim for nmount(2).
(and quiet some compiler warnings for vfs_donmount)
2008-09-19 15:17:32 +00:00
jhb
a0cba7a231 Various style fixes. 7 space indent is just odd. 2008-09-18 20:10:11 +00:00
jhb
2ba0911fd8 Sort includes. 2008-09-18 20:04:22 +00:00
attilio
23ff3dbeb8 Remove the suser(9) interface from the kernel. It has been replaced from
years by the priv_check(9) interface and just very few places are left.
Note that compatibility stub with older FreeBSD version
(all above the 8 limit though) are left in order to reduce diffs against
old versions. It is responsibility of the maintainers for any module, if
they think it is the case, to axe out such cases.

This patch breaks KPI so __FreeBSD_version will be bumped into a later
commit.

This patch needs to be credited 50-50 with rwatson@ as he found time to
explain me how the priv_check() works in detail and to review patches.

Tested by:      Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
Reviewed by:    rwatson
2008-09-17 15:49:44 +00:00
ed
0f8f4f624b Fix minor TTY API inconsistency.
Unlike tty_rel_gone() and tty_rel_sess(), the tty_rel_pgrp() routine
does not unlock the TTY. I once had the idea to make the code call
tty_rel_pgrp() and tty_rel_sess(), picking up the TTY lock once. This
turned out a little harder than I expected, so this is how it works now.

It's a lot easier if we just let tty_rel_pgrp() unlock the TTY, because
the other routines do this anyway.
2008-09-16 14:57:23 +00:00
kib
f6863a9ef7 When attempt is made to suspend a filesystem that is already syspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 11:51:06 +00:00
kib
0488506405 Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.

Reviewed by:	tegge
Tested and used by:   pho
MFC after:   1 month
2008-09-16 11:19:38 +00:00
kib
039c5da1b2 Garbage-collect vn_write_suspend_wait().
Suggested and reviewed by:	tegge
Tested by:	pho
MFC after:	1 month
2008-09-16 11:09:26 +00:00
sam
05a7094fc1 Make ddb command registration dynamic so modules can extend
the command set (only so long as the module is present):
o add db_command_register and db_command_unregister to add and remove
  commands, respectively
o replace linker sets with SYSINIT's (and SYSUINIT's) that register
  commands
o expose 3 list heads: db_cmd_table, db_show_table, and db_show_all_table
  for registering top-level commands, show operands, and show all operands,
  respectively

While here also:
o sort command lists
o add DB_ALIAS, DB_SHOW_ALIAS, and DB_SHOW_ALL_ALIAS to add aliases
  for existing commands
o add "show all trace" as an alias for "show alltrace"
o add "show all locks" as an alias for "show alllocks"

Submitted by:	Guillaume Ballet <gballet@gmail.com> (original version)
Reviewed by:	jhb
MFC after:	1 month
2008-09-15 22:45:14 +00:00
jhb
a55e334c2b Expose a new public routine intr_event_execute_handlers() which executes
all the non-filter handlers attached to an interrupt event.  This can be
used by device drivers which multiplex their interrupt onto the interrupt
handlers for child devices.
2008-09-15 22:19:44 +00:00
attilio
00ea27d0c3 - For any lock list we hold the head in order to reduce allocation from
the free list and in this way avoid contention on the w_mtx.
  In order to make the code simple, we rely on the rule that when the head
  has not a child it also doesn't have other subsequent entries.
  Actually this assertion is broken because we can free all the head
  children and quit witness_unlock() with the head still allocated, with no
  children and subsequent entries present.
  Fix this by shifting the head if other entries are present and still
  freeing the object, but leaving always an head.
- Fix witness_thread_has_locks() in order to report, correctly, if the
  lock list linked to a specific thread has children or not based on the
  above explained rule.
- Fix a printout into DDB's "show alllocks" command in order to show,
  correctly, the process name that is really what we want.
- Fix style(9) for a comment.

Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
Reported by:	Marko Kiiskila <marko dot kiiskila at nokia dot com>
Sponsored by:	Nokia
2008-09-12 21:44:01 +00:00
csjp
1fa65beb80 Make sure the TTY has not disappeared out from under us before calling
ttydevsw_outwakeup().  This should fix panics which occur after remote
login sessions timeout during moderate TTY activity.  An example of
where this might occur is where a pending write to the terminal is
occurring while sshd(8) is shutting down the TTY after a TCP timeout.

Submitted by:	ed
2008-09-10 20:12:10 +00:00
jhb
af0471aaec Teach WITNESS about the interlocks used with lockmgr. This removes a bunch
of spurious witness warnings since lockmgr grew witness support.  Before
this, every time you passed an interlock to a lockmgr lock WITNESS treated
it as a LOR.

Reviewed by:	attilio
2008-09-10 19:13:30 +00:00
jhb
fd768740df Various whitespace fixes. 2008-09-10 17:59:21 +00:00
trasz
9303940ffe Remove VSVTX, VSGID and VSUID. This should be a no-op,
as VSVTX == S_ISVTX, VSGID == S_ISGID and VSUID == S_ISUID.

Approved by:	rwatson (mentor)
2008-09-10 13:16:41 +00:00
jhb
7851995759 - Reduce scope of #ifdef's in uma_zcreate() call in init_turnstile0().
- Set UMA_ZONE_NOFREE so that the per-turnstile spin locks are type stable
  to avoid a race where one thread might dereference a lock in a free'd
  turnstile that was previously used by another thread.

Theorized by:	tegge (2)
MFC after:	1 week
2008-09-08 21:40:15 +00:00
jhb
f147e876b7 Close a race in sleepq_broadcast() where the sleepq could be reused after
it had been assigned to the last sleeping thread.  That thread might have
started running on another CPU and have reused that sleep queue.  Fix it
by just walking the thread queue using TAILQ_FOREACH_SAFE() rather than
a while loop.

PR:		amd64/124200
Discovered by:	tegge
Tested by:	benjsc
MFC after:	1 week
2008-09-08 19:44:57 +00:00
bz
cb1cd5ee09 Catch a possible NULL pointer deref in case the offsets got mangled
somehow.
As a consequence we may now get an unexpected result(*).
Catch that error cases with a well defined panic giving appropriate
pointers to ease debugging.

(*) While the concensus was that the case should never happen unless
    there was a bug, noone was definitively sure.

Discussed with:		kmacy (about 8 months back)
Reviewed by:		silby (as part of a larger patch in March)
MFC after:		2 months
2008-09-07 13:09:04 +00:00
ed
e4b90a03d0 Make TIOCCONS use priv_check() instead of checking /dev/console permissions.
As discussed with Robert on IRC, checking the permissions on
/dev/console to see if we can call TIOCCONS could be unreliable. When we
run a chroot() without a devfs instance mounted inside, it won't
actually check the permissions on the device node inside the devfs
instance.

Using the already existing PRIV_TTY_CONSOLE for this seems like a better
idea.

Approved by:	rwatson
2008-09-06 14:43:32 +00:00
ed
bf17d6c233 Fix a small typo in a comment in calcru1().
The word "happene" should read "happened".

Submitted by:	Jille Timmermans <jille quis cx>
2008-09-05 15:55:06 +00:00
davidxu
40259861b6 Fix LOR between vnode lock and internal mqueue locks. 2008-09-05 07:32:57 +00:00
thompsa
6beefd0e39 Remove the alignment of the align parameter. This is up to the caller to pass
in and it breaks tap(4) on strict alignment machines as m_uiotombuf is called
with ETHER_ALIGN.

Found by:	Jared Go
Reviewed by:	emax
MFC after:	3 days
2008-09-05 04:05:31 +00:00
davidxu
74474a30a7 Fix lock name conflict.
PR:	kern/127040
2008-09-05 02:07:25 +00:00
ed
fa5c2849f2 Implement pts(4) packet mode.
As reported by several users on the mailing lists, applications like
screen(1) fail to properly handle ^S and ^Q characters. This was because
MPSAFE TTY didn't implement packet mode (TIOCPKT) yet. Add basic packet
mode support to make these applications work again.

Obtained from:	//depot/projects/mpsafetty/...
2008-09-04 16:39:02 +00:00
ed
28aa9d1022 Fix an awful bug inside our COMPAT_43TTY code.
When I migrated tty_compat.c to MPSAFE TTY, I just hooked it up to the
build and fixed it until it compiled and somewhat worked. It turns out
this was not the smartest thing, because the old TTY layer also had a
field called t_flags, which contained a set of sgtty flags.

This means our current COMPAT_43TTY code overwrites the TTY flags,
causing all strange problems to occur. Fix this code to use a new struct
member called t_compatflags. This commit may cause kern/127054 to be
fixed, but this still has to be tested/confirmed by the originator. It
has to be fixed anyway.

PR:		kern/127054
2008-09-04 16:30:53 +00:00
kevlo
9f7bbf786b If the process id specified is invalid, the system call returns ESRCH 2008-09-04 10:44:33 +00:00
simon
6bb93e188c - Fix amd64 local privilege escalation. [08:07]
- Fix nmount(2) local privilege escalation. [08:08]
- Fix IPv6 remote kernel panics. [08:09]

Fix for [08:07] is merge of r181823.

Submitted by:	kib [08:07], csjp [08:08], bz [08:09]
Reviewed by:	peter [08:07], jhb [08:07]
Reviewed by:	jinmei [08:09], rwatson [08:09]
Approved by:	re (SA blanket)
Approved by:	so (simon)
Security:	FreeBSD-SA-08:07.amd64
Security:	FreeBSD-SA-08:08.nmount
Security:	FreeBSD-SA-08:09.icmp6
2008-09-03 19:09:47 +00:00
ed
8496708649 Use size_t to store the return value of ttydisc_getc().
The ttydisc_getc() routine obtains a read length from ttyoutq_read().
For no valid reason, the current code stores this value in an int, and
returns a size_t. There is no need to perform this useless conversion.

Obtained from:	//depot/projects/mpsafetty/...
2008-09-02 17:13:11 +00:00
rwatson
34b4039202 Remove XXXRW in soreceive_dgram that proves unnecessary.
Remove unused orig_resid variable in soreceive_dgram.

Submitted by:	alfred
X-MFC with:	soreceive_dgram (r180198, r180211)
2008-09-02 16:55:21 +00:00
pjd
9f3f074340 When setting error to EINVAL in 'fvp == tdvp' case, jump to out label,
because if not, the error will be later overwritten by
mac_vnode_check_rename_to() call.

Reviewed by:	rwatson
2008-09-01 10:11:39 +00:00
attilio
e2ca413d09 Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.
Manpages are updated accordingly.

Tested by:	Diego Sardina <siarodx at gmail dot com>
2008-08-31 14:26:08 +00:00
attilio
d6333c0f4d - Improve some witness_watch operability in code which does perform both
lock tracking and checks, doing just the former ones.
- Fix a bug where sysctl utility was printing crazy values when setting a
  new value for debug.witness.watch [0]

[0] Reported by:	yongari
2008-08-30 13:20:35 +00:00
ed
fa61dcef0f Fix some edge cases in the TTY queues:
- In the current design, when a TTY decreases its baud rate, it tries to
  shrink the queues. This may not always be possible, because it will
  not free any blocks that are still filled with data.

  Change the TTY queues to store a `quota' value as well, which means it
  will not free any blocks when changing the baud rate, but when placing
  blocks back into the queue. When the amount of blocks exceeds the
  quota, they get freed.

  It also fixes some edge cases, where TIOCSETA during read()/
  write()-calls could actually make the queue a tiny bit bigger than in
  normal cases.

- Don't leak blocks of memory when calling TIOCSETA when the device
  driver abandons the TTY while allocating memory.

- Create ttyoutq_init() and ttyinq_init() to initialize the queues,
  instead of initializing them by hand. The new TTY snoop driver also
  creates an outq, so it's good to have a proper interface to do this.

Obtained from:	//depot/projects/mpsafetty/...
2008-08-30 09:18:27 +00:00
attilio
8f53106a9e - Make witness_watch a 3 state value.
1 means that witness is up and running.
  0 means that witness is disabled but that it can be established later
    again in effective way.
  -1 means that witness is disabled permanently
- Fix a bug causing kernel to panic on witness disabling through
  witness_watch.  lock lists queues were still full of entries and this was
  causing throubles with debugging stubs (like witness_thread_exit()).

Reported by:	kris, yongari
Sponsored by:	Nokia
2008-08-29 15:47:53 +00:00
ed
e9104ac4da Backport two small fixes from the MPSAFE TTY branch in Perforce:
- Implement IMAXBEL. It turned out the IMAXBEL termios switch was marked
  as supported, while it had not been implemented.

- Don't go into the high watermark when in canonical mode, no data has
  been canonicalized and the input buffer is full. This caused the
  terminal to lock up. This prevented users from pressing
  backspace/^U/etc in such cases.

  This could easily be simulated by pasting a very big amount of data in
  a shell with sh(1) in canonical mode.

Obtained from:	//depot/projects/mpsafetty/...
2008-08-29 15:02:50 +00:00
davidxu
b08de3bf2f Don't remove queued SIGCHLD if options contain WNOWAIT, so other
threads still can be notified by the signal.
2008-08-29 01:34:05 +00:00
trhodes
4076a3a80f Fix a typo in r180291
"NAme of the current YP/NIS domain" -> "Name of the current YP/NIS domain"
2008-08-28 23:52:34 +00:00
ed
92daa6b410 Make ureadc() warn when holding any locks, just like uiomove().
A couple of months ago I was quite impressed, because when I was writing
code, I discovered that uiomove() would not allow any locks to be held,
while ureadc() did, mainly because ureadc() is implemented using the
same building blocks as uiomove().

Let's see if this triggers any aditional witness warnings on our source
tree.

Reviewed by:	atillio
2008-08-28 19:34:58 +00:00
attilio
dbf35e279f Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
kib
dd53532f93 Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmnque() function
to ignore the unmounting and forces insertion of the vnode into the mount
vnode list.

Change insmntque() to fail when forced unmount is in progress and
VV_FORCEINSMQ is not specified.

Add an assertion to the insmntque(), requiring the vnode to be
exclusively locked for mp-safe filesystems.

Use the VV_FORCEINSMQ for the creation of the syncvnode.

Tested by:	pho
Reviewed by:	tegge
MFC after:	1 month
2008-08-28 09:08:15 +00:00
ed
17e49589f8 Properly unlock the init/lock-state devices when invoking TIOCSETA.
For some reason a return-statement crept into this code, where it
shouldn't belong. This means we didn't properly unlock the TTY before
returning to userspace.

Submitted by:	Tor Egge <tor egge cvsup no freebsd org>
2008-08-27 19:37:21 +00:00
jhb
8011873cf5 - Only count the number of CPUs in the rendezvous map once rather than
doing it on every CPU.
- Use CPU_ABSENT() rather than pcpu_find() to determine if a CPU is not
  present.
- Count up to mp_maxid rather than MAXCPU when iterating over CPUs to
  match the rest of the code in the kernel.

MFC after:	1 week
2008-08-27 18:23:55 +00:00
kib
05dac85e4b Implement WNOWAIT flag for wait4(2). It specifies that process whose status
is returned shall be kept in the waitable state.
Add WSTOPPED as an alias for WUNTRACED.

Submitted by:	Jukka Ukkonen <jau at iki fi>
PR:	standards/116221
MFC after:	2 weeks
2008-08-26 12:37:16 +00:00
kib
2d990eae05 When calculating arguments to the interpreter for the shebang script
executed by fexecve(2), imgp->args->fname is NULL. Moreover, there is
no way to recover the path to the script being executed.
Do what some other U*ixes do unconditionally, namely supply /dev/fd/n
as the script path when called from fexecve(). Document requirement of
having fdescfs mounted as caveat.
2008-08-26 10:53:32 +00:00