Commit Graph

8845 Commits

Author SHA1 Message Date
Don Lewis
34ea500bea Add missing word to comment. 2005-10-04 04:02:33 +00:00
Gleb Smirnoff
e113edf30a o Move a lot of parameter checking from netisr_poll() to
dedicated sysctl handlers. Protect manipulations with
  poll_mtx. The affected sysctls are:
  - kern.polling.burst_max
  - kern.polling.each_burst
  - kern.polling.user_frac
  - kern.polling.reg_frac
o Use CTLFLAG_RD on MIBs that supposed to be read-only.
o u_int32t -> uint32_t
o Remove unneeded locking from poll_switch().
2005-10-03 14:15:26 +00:00
Colin Percival
33812c066d If sufficiently bad things happen during a call to kern_execve(), it is
possible for do_execve() to call exit1() rather than returning.  As a
result, the sequence "allocate memory; call kern_execve; free memory"
can end up leaking memory.

This commit documents this astonishing behaviour and adds a call to
exec_free_args() before the exit1() call in do_execve().  Since all
the users of kern_execve() in the tree use exec_free_args() to free
the command-line arguments after kern_execve() returns, this should
be safe, and it fixes the memory leak which can otherwise occur.

Submitted by:	Peter Holm
MFC after:	3 days
Security:	Local denial of service
2005-10-03 12:49:54 +00:00
Hajimu UMEMOTO
56e5a87a55 make saved cpu level stackable. 2005-10-03 06:57:29 +00:00
Don Lewis
5032ff8197 Always wire the sysctl output buffer in sysctl_kern_proc() before
calling sysctl_out_proc().  -- fix from jhb

Move the code in fill_kinfo_thread() that gathers data from struct proc
into the new function fill_kinfo_proc_only().

Change all callers of fill_kinfo_thread() to call both
fill_kinfo_proc_only() and fill_kinfo() thread.  When gathering
data from a multi-threaded process, fill_kinfo_proc_only() only needs
to be called once.

Grab sched_lock before accessing the process thread list or calling
fill_kinfo_thread().

PR:		kern/84684
MFC after:	3 days
2005-10-02 23:27:56 +00:00
Robert Watson
c30bf5c317 Include kdb.h so that kdb_active is declared regardless of KDB being
included in the kernel.

MFC after:	0 days
2005-10-02 10:03:51 +00:00
Poul-Henning Kamp
7bbb3a2690 Make sure the clone lists are sorted in the right order.
Explosion triggered by:	pjd
MFC:	3 days
2005-10-01 19:21:03 +00:00
Gleb Smirnoff
4092996774 Big polling(4) cleanup.
o Axe poll in trap.

o Axe IFF_POLLING flag from if_flags.

o Rework revision 1.21 (Giant removal), in such a way that
  poll_mtx is not dropped during call to polling handler.
  This fixes problem with idle polling.

o Make registration and deregistration from polling in a
  functional way, insted of next tick/interrupt.

o Obsolete kern.polling.enable. Polling is turned on/off
  with ifconfig.

Detailed kern_poll.c changes:
  - Remove polling handler flags, introduced in 1.21. The are not
    needed now.
  - Forget and do not check if_flags, if_capenable and if_drv_flags.
  - Call all registered polling handlers unconditionally.
  - Do not drop poll_mtx, when entering polling handlers.
  - In ether_poll() NET_LOCK_GIANT prior to locking poll_mtx.
  - In netisr_poll() axe the block, where polling code asks drivers
    to unregister.
  - In netisr_poll() and ether_poll() do polling always, if any
    handlers are present.
  - In ether_poll_[de]register() remove a lot of error hiding code. Assert
    that arguments are correct, instead.
  - In ether_poll_[de]register() use standard return values in case of
    error or success.
  - Introduce poll_switch() that is a sysctl handler for kern.polling.enable.
    poll_switch() goes through interface list and enabled/disables polling.
    A message that kern.polling.enable is deprecated is printed.

Detailed driver changes:
  - On attach driver announces IFCAP_POLLING in if_capabilities, but
    not in if_capenable.
  - On detach driver calls ether_poll_deregister() if polling is enabled.
  - In polling handler driver obtains its lock and checks IFF_DRV_RUNNING
    flag. If there is no, then unlocks and returns.
  - In ioctl handler driver checks for IFCAP_POLLING flag requested to
    be set or cleared. Driver first calls ether_poll_[de]register(), then
    obtains driver lock and [dis/en]ables interrupts.
  - In interrupt handler driver checks IFCAP_POLLING flag in if_capenable.
    If present, then returns.This is important to protect from spurious
    interrupts.

Reviewed by:	ru, sam, jhb
2005-10-01 18:56:19 +00:00
Don Lewis
5997cae9a4 Copy new process argument list in do_execve() before grabbing PROC_LOCK
to avoid touching pageable memory while holding a mutex.

Simplify argument list replacement logic.

PR:		kern/84935
Submitted by:	"Antoine Pelisse" apelisse AT gmail.com (in a different form)
MFC after:	3 days
2005-10-01 08:33:56 +00:00
Don Lewis
bd3c2d867d Un-staticize waitrunningbufspace() and call it before returning from
ffs_copyonwrite() if any async writes were launched.

Restore the threads previous TDP_NORUNNINGBUF state before returning
from ffs_copyonwrite().
2005-09-30 18:07:41 +00:00
David Xu
763a429571 Fox a LOR of sleep and sched_lock by using a timeout wait
when process reaches maximum number of threads.

MFC after: 3 days
2005-09-30 06:09:41 +00:00
Don Lewis
6c8b634f1d Un-staticize runningbufwakeup() and staticize updateproc.
Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than than
checking the proc pointer vs. the known proc pointers for these two
threads.  A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk.  This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk.  The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>
2005-09-30 01:30:01 +00:00
John Baldwin
b65089ccb5 Trim a couple of unneeded includes. 2005-09-29 19:13:52 +00:00
Peter Edwards
d41c4674c2 Close a race in biodone(), whereby the bio_done field of the passed
bio may have been freed and reassigned by the wakeup before being
tested after releasing the bdonelock.

There's a non-zero chance this is the cause of a few of the crashes
knocking around with biodone() sitting in the stack backtrace.

Reviewed By: phk@
2005-09-29 10:37:20 +00:00
Poul-Henning Kamp
64fd97df54 puc(4) does strange things to resources in order to fool the
subdrivers to hook up.

It should probably be rewritten to implement a simple bus to which
the sub drivers attach using some kind of hint.

Until then, provide a couple of crutch functions with big warning
signs so it can survive the recent changes to struct resource.
2005-09-28 18:06:25 +00:00
Robert Watson
5f419982c2 Back out alpha/alpha/trap.c:1.124, osf1_ioctl.c:1.14, osf1_misc.c:1.57,
osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60,
svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81,
svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55,
svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10,
ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58,
unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133:

Now that Giant is acquired in uprintf() and tprintf(), the caller no
longer leads to acquire Giant unless it also holds another mutex that
would generate a lock order reversal when calling into these functions.
Specifically not backed out is the acquisition of Giant in nfs_socket.c
and rpcclnt.c, where local mutexes are held and would otherwise violate
the lock order with Giant.

This aligns this code more with the eventual locking of ttys.

Suggested by:	bde
2005-09-28 07:03:03 +00:00
Christian S.J. Peron
453f7d5369 Push Giant down in jails. Pass the MPSAFE flag to NDINIT, and keep track
of whether or not Giant was picked up by the filesystem. Add VFS_LOCK_GIANT
macros around vrele as it's possible that this can call in the VOP_INACTIVE
filesystem specific code. Also while we are here, remove the Giant assertion.
from the sysctl handler,  we do not actually require Giant here so we
shouldn't assert it. Doing so will just complicate things when Giant is removed
from the sysctl framework.
2005-09-28 00:30:56 +00:00
Robert Watson
667285c4e3 If KDB_STOP_NMI is compiled into the kernel, default
debug.kdb.stop_cpus_with_nmi to 1 rather than 0.

MFC after:	3 days
2005-09-27 21:12:05 +00:00
Robert Watson
2b59d50cfb In lockstatus(), don't lock and unlock the interlock when testing the
sleep lock status while kdb_active, or we risk contending with the
mutex on another CPU, resulting in a panic when using "show
lockedvnods" while in DDB.

MFC after:	3 days
Reviewed by:	jhb
Reported by:	kris
2005-09-27 21:02:59 +00:00
Robert Watson
32a6bd9510 No longer maintain mbstat statistics for the mbuf allocator, UMA
statistics and libmemstat(3) are now used to track mbuf statistics.

MFC after:	1 month
2005-09-27 20:28:43 +00:00
John Baldwin
7e9e371f2d Use the refcount API to manage the reference count for user credentials
rather than using pool mutexes.

Tested on:	i386, alpha, sparc64
2005-09-27 18:09:42 +00:00
John Baldwin
b2149bde1f Use the reference count API to manage the reference counts for process
limit structures rather than using pool mutexes to protect the reference
counts.

Tested on:	i386, alpha, sparc64
2005-09-27 18:07:05 +00:00
John Baldwin
55b4a5ae0d Use the refcount API to implement reference counts on process argument
structures rather than using a global mutex to protect the reference
counts.

Tested on:	i386, alpha, sparc64
2005-09-27 18:03:15 +00:00
Christian S.J. Peron
6acd4b6189 Update the "created from" section to reflect the most recent version of
syscalls.master

Requested by:	jhb
2005-09-27 14:36:59 +00:00
Christian S.J. Peron
7f300b47dd Mark the extended attribute syscalls as being MP safe.
Requested by:	jhb
2005-09-27 14:32:04 +00:00
John Baldwin
d27acf445e Add the spin lock used by the binary nvidia driver to the static lock
order list so that WITNESS and the driver play together nicely.

Tested by:	Harald Schmalzbauer
MFC after:	3 days
2005-09-26 18:30:12 +00:00
Robert Watson
9b7915859d Add "show allpcpu" to DDB, which prints the current CPU id followed by
the per-cpu data for all CPUs.  This is easier to ask users to do than
"figure out how many CPUs you have, now run show pcpu, then run it
once for each CPU you have".

MFC after:	3 days
2005-09-26 16:55:11 +00:00
David Xu
2b7182c6b7 Reorder statements to avoid accessing unknown memory.
In theory, invoking kenv with very long string can panic
kernel.
2005-09-26 14:14:55 +00:00
Robert Watson
329c75a730 Acquire Giant in uprintf() and tprintf() rather than asserting it. In
the vast majority of cases, these functions are called without mutexes
held, meaning that in all but two cases, there will be no ordering
issues with doing this, and it will eliminate the need for changes in
the caller.  In two cases, mutexes are held, so Giant must be acquired
before those mutexes such that uprintf() and tprintf() recurse Giant
rather than generating a lock order reversal.

Suggested by:	bde
2005-09-26 08:02:24 +00:00
Poul-Henning Kamp
2b35175c8a Add rman_is_region_manager() for the benefit of an alpha hack. 2005-09-25 20:10:10 +00:00
Christian S.J. Peron
c47a4d1c9f Implement new world order in VFS locking for extended attributes. This will
remove the unconditional acquisition of Giant for extended attribute related
operations. If the file system is set as being MP safe and debug.mpsafevfs is
1, do not pickup Giant.

Mark the following system calls as being MP safe so we no longer pickup Giant
in the system call handler:

o extattrctl
o extattr_set_file
o extattr_get_file
o extattr_delete_file
o extattr_set_fd
o extattr_get_fd
o extattr_delete_fd
o extattr_set_link
o extattr_get_link
o extattr_delete_link
o extattr_list_file
o extattr_list_link
o extattr_list_fd

-Pass MPSAFE flags to namei(9) lookup and introduce vfslocked variable which
 will keep track of any Giant acquisitions.
-Wrap any fd operations which manipulate vnodes in VFS_{UN}LOCK_GIANT
-Drop VFS_ASSERT_GIANT into function which operate on vnodes to ensure that
 we are sufficiently protected.

I've tested these changes with various TrustedBSD MAC policies which use
extended attribute a lot on SMP and UP systems (thanks to Scott Long for
making some SMP hardware available to me for testing).

Discussed with:	jeff
Requested by:	jhb, rwatson
2005-09-24 23:47:04 +00:00
Poul-Henning Kamp
ae7ff71f63 Split struct resource in an external and internal part.
The external part is still called 'struct resource' but the contents
is now visible to drivers etc.  This makes it part of the device
driver ABI so it not be changed lightly.  A comment to this effect
is in place.

The internal part is called 'struct resource_i' and contain its external
counterpart as one field.

Move the bus_space tag+handle into the external struct resource, this
removes the need for device drivers to even know about these fields
in order to use bus_space to access hardware. (More in following commit).
2005-09-24 20:07:03 +00:00
Poul-Henning Kamp
a778923149 Add two convenience functions for device drivers: bus_alloc_resources()
and bus_free_resources().  These functions take a list of resources
and handle them all in one go.  A flag makes it possible to mark
a resource as optional.

A typical device driver can save 10-30 lines of code by using these.

Usage examples will follow RSN.

MFC:	A good idea, eventually.
2005-09-24 19:31:10 +00:00
Robert Watson
e1ac28e239 Canonicalize the UNIX domain socket copyright layout: original holders
before more recent holders.

MFC after:	3 days
2005-09-23 12:41:06 +00:00
Stephan Uphoff
3fafa27b27 Don't pretend to be thread0 when calling sync().
It confuses the lock manager since in some places thread0 is
then used for vnode locking while curthread is used for vnode unlocking.

Found by:	Yahoo!
Reviewed by:	ps@,jhb@
MFC after:	3 days
2005-09-22 15:34:15 +00:00
David Xu
a861574011 Temporarily disable nice threshold detection code, as it can starve
a thread holding critical resource, e.g mutex or other implicit
synchronous flags. Give thread which exceeds nice threshold a minimum
time slice.

PR: kern/86087
2005-09-22 01:19:37 +00:00
John Baldwin
e12560dd4b Use correct VFS locking rather than unconditionally grabbing Giant around
namei() calls in kern_alternate_path().

Reviewed by:	csjp
MFC after:	1 week
2005-09-21 19:49:42 +00:00
Robert Watson
87328e07e0 Pass 'curthread' into VFS_STATFS() from acctwatch(), rather than passing
NULL.  The NFS client expects that a thread will always be present for a
VOP so that it can check for signal conditions, and will dereference a
NULL pointer if one isn't present.

MFC after:	3 days
2005-09-21 15:28:07 +00:00
Robert Watson
5580b0b157 Correct an incorrect comment from the dawn of time: neither tprintf()
nor uprintf() is believed to perform tsleep() or msleep() as written,
as ttycheckoutq() is called with '0' as its sleep argument.

Remove recently added WITNESS warnings for sleep as the comment was
incorrect.  This should silence a warning from the nfs_timer() code.

Discussed with:	bde
2005-09-20 09:55:36 +00:00
Andre Oppermann
e452573df7 Start time_uptime with 1 instead of 0.
Discussed with:		phk
2005-09-19 22:16:31 +00:00
Poul-Henning Kamp
e606a3c63e Rewamp DEVFS internals pretty severely [1].
Give DEVFS a proper inode called struct cdev_priv.  It is important
to keep in mind that this "inode" is shared between all DEVFS
mountpoints, therefore it is protected by the global device mutex.

Link the cdev_priv's into a list, protected by the global device
mutex.  Keep track of each cdev_priv's state with a flag bit and
of references from mountpoints with a dedicated usecount.

Reap the benefits of much improved kernel memory allocator and the
generally better defined device driver APIs to get rid of the tables
of pointers + serial numbers, their overflow tables,  the atomics
to muck about in them and all the trouble that resulted in.

This makes RAM the only limit on how many devices we can have.

The cdev_priv is actually a super struct containing the normal cdev
as the "public" part, and therefore allocation and freeing has moved
to devfs_devs.c from kern_conf.c.

The overall responsibility is (to be) split such that kern/kern_conf.c
is the stuff that deals with drivers and struct cdev and fs/devfs
handles filesystems and struct cdev_priv and their private liason
exposed only in devfs_int.h.

Move the inode number from cdev to cdev_priv and allocate inode
numbers properly with unr.  Local dirents in the mountpoints
(directories, symlinks) allocate inodes from the same pool to
guarantee against overlaps.

Various other fields are going to migrate from cdev to cdev_priv
in the future in order to hide them.  A few fields may migrate
from devfs_dirent to cdev_priv as well.

Protect the DEVFS mountpoint with an sx lock instead of lockmgr,
this lock also protects the directory tree of the mountpoint.

Give each mountpoint a unique integer index, allocated with unr.
Use it into an array of devfs_dirent pointers in each cdev_priv.
Initially the array points to a single element also inside cdev_priv,
but as more devfs instances are mounted, the array is extended with
malloc(9) as necessary when the filesystem populates its directory
tree.

Retire the cdev alias lists, the cdev_priv now know about all the
relevant devfs_dirents (and their vnodes) and devfs_revoke() will
pick them up from there.  We still spelunk into other mountpoints
and fondle their data without 100% good locking.  It may make better
sense to vector the revoke event into the tty code and there do a
destroy_dev/make_dev on the tty's devices, but that's for further
study.

Lots of shuffling of stuff and churn of bits for no good reason[2].

XXX: There is still nothing preventing the dev_clone EVENTHANDLER
from being invoked at the same time in two devfs mountpoints.  It
is not obvious what the best course of action is here.

XXX: comment out an if statement that lost its body, until I can
find out what should go there so it doesn't do damage in the meantime.

XXX: Leave in a few extra malloc types and KASSERTS to help track
down any remaining issues.

Much testing provided by:		Kris
Much confusion caused by (races in):	md(4)

[1] You are not supposed to understand anything past this point.

[2] This line should simplify life for the peanut gallery.
2005-09-19 19:56:48 +00:00
Robert Watson
84d2b7df26 Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions.  In most cases, this means adding an
acquisition of Giant immediately around the function.  In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset.  This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having
non-MPSAFE tty code.

MFC after:	1 week
2005-09-19 16:51:43 +00:00
Robert Watson
223aaaecb0 Remove mac_create_root_mount() and mpo_create_root_mount(), which
provided access to the root file system before the start of the
init process.  This was used briefly by SEBSD before it knew about
preloading data in the loader, and using that method to gain
access to data earlier results in fewer inconsistencies in the
approach.  Policy modules still have access to the root file system
creation event through the mac_create_mount() entry point.

Removed now, and will be removed from RELENG_6, in order to gain
third party policy dependencies on the entry point for the lifetime
of the 6.x branch.

MFC after:	3 days
Submitted by:	Chris Vance <Christopher dot Vance at SPARTA dot com>
Sponsored by:	SPARTA
2005-09-19 13:59:57 +00:00
Marcel Moolenaar
73130b2224 Move the UUID generator into its own function, called kern_uuidgen(),
so that UUIDs can be generated from within the kernel. The uuidgen(2)
syscall now allocates kernel memory, calls the generator, and does a
copyout() for the whole UUID store. This change is in support of GPT.
2005-09-18 21:40:15 +00:00
Robert Watson
8434c29b28 Add three new read-only socket options, which allow regression tests
and other applications to query the state of the stack regarding the
accept queue on a listen socket:

SO_LISTENQLIMIT    Return the value of so_qlimit (socket backlog)
SO_LISTENQLEN      Return the value of so_qlen (complete sockets)
SO_LISTENINCQLEN   Return the value of so_incqlen (incomplete sockets)

Minor white space tweaks to existing socket options to make them
consistent.

Discussed with:	andre
MFC after:	1 week
2005-09-18 21:08:03 +00:00
Robert Watson
bc6b8b5d64 Fix spelling in a comment.
MFC after:	3 days
2005-09-18 10:46:34 +00:00
Robert Watson
7da7362b95 Re-comment sbcompress() to explain what it is it does; it took me
quite a bit of reading to figure it out, and I want to avoid figuring
it out again.

Convert an if (foo) else printf("this is almost a panic") into a
KASSERT.

MFC after:	3 days
2005-09-18 10:30:10 +00:00
Warner Losh
fe0519b171 MFp4: Expose device_probe_child() 2005-09-18 01:32:09 +00:00
Christian S.J. Peron
42e7197fba Implement new world order in VFS locking for ACLs. This will remove the
unconditional acquisition of Giant for ACL related operations. If the file
system is set as being MP safe and debug.mpsafevfs is 1, do not pickup
giant.

For any operations which require namei(9) lookups:

__acl_get_file
__acl_get_link
__acl_set_file
__acl_set_link
__acl_delete_file
__acl_delete_link
__acl_aclcheck_file
__acl_aclcheck_link

-Set the MPSAFE flag in NDINIT
-Initialize vfslocked variable using the NDHASGIANT macro

For functions which operate on fds, make sure the operations are locked:

__acl_get_fd
__acl_set_fd
__acl_delete_fd
__acl_aclcheck_fd

-Initialize vfslocked using VFS_LOCK_GIANT before we manipulate the vnode

Discussed with:	jeff
2005-09-17 22:01:14 +00:00
Tor Egge
61ac14dab6 Break out of loop if next buffer pointer has become invalid while flushing
current buffer.

Reviewed by:	kan
2005-09-16 18:28:12 +00:00
Stephan Uphoff
19b2dff7b0 Fix race condition that caused activation of an event to
be ignored immediately after it was deactivated.

Found by: 	Yahoo!
MFC after:	3 days
2005-09-15 21:10:12 +00:00
John Baldwin
21f9e816cd Oops, missed adding the required include.
Pointy hat to:	jhb
2005-09-15 20:20:36 +00:00
John Baldwin
53c0e1ff7d Replace the dont_sleep_in_callout mutex hack (similar to g_x{up,down})
with the disallow sleeping facility.
2005-09-15 20:09:08 +00:00
John Baldwin
10f508d9a3 Don't disallow sleeping for handlers on swi's since some swi handlers
(like CAM) do sleep in their handlers.

Requested by:	scottl
2005-09-15 20:08:21 +00:00
John Baldwin
b27dbfbf4a - Enforce an implicit lock order that Giant cannot be locked while holding
any other non-sleepable lock.  In plain English: Giant comes before all
  other mutexes.
- Add some extra description to the lock order reversal printf's to indicate
  when a reversal is triggered by a hard-coded implicit rule.

Requested by:	truckman (2)
MFC after:	1 week
2005-09-15 19:07:14 +00:00
John Baldwin
51460da87f - Add a new simple facility for marking the current thread as being in a
state where sleeping on a sleep queue is not allowed.  The facility
  doesn't support recursion but uses a simple private per-thread flag
  (TDP_NOSLEEPING).  The sleepq_add() function will panic if the flag is
  set and INVARIANTS is enabled.
- Use this new facility to replace the g_xup and g_xdown mutexes that were
  (ab)used to achieve similar behavior.
- Disallow sleeping in interrupt threads when invoking interrupt handlers.

MFC after:	1 week
Reviewed by:	phk
2005-09-15 19:05:37 +00:00
Christian S.J. Peron
68ff2a4397 Improve the MP safeness associated with the creation of symbolic
links and the execution of ELF binaries. Two problems were found:

1) The link path wasn't tagged as being MP safe and thus was not properly
   protected.
2) The ELF interpreter vnode wasnt being locked in namei(9) and thus was
   insufficiently protected.

This commit makes the following changes:

-Sets the MPSAFE flag in NDINIT for symbolic link paths
-Sets the MPSAFE flag in NDINIT and introduce a vfslocked variable which
 will be used to instruct VFS_UNLOCK_GIANT to unlock Giant if it has been
 picked up.
-Drop in an assertion into vfs_lookup which ensures that if the MPSAFE
 flag is NOT set, that we have picked up giant. If not panic (if WITNESS
 compiled into the kernel). This should help us find conditions where vnode
 operations are in-sufficiently protected.

This is a RELENG_6 candidate.

Discussed with:	jeff
MFC after:	4 days
2005-09-15 15:03:48 +00:00
Maxim Konovalov
aada5cccd8 Backout rev. 1.246, it breaks code uses shutdown(2) on non-connected
sockets.

Pointed out by:	rwatson
2005-09-15 13:18:05 +00:00
Ralf S. Engelschall
724447ac41 Fix system shutdown timeout handling by again supporting longer running
shutdown procedures (which have a duration of more than 120 seconds).

We have two user-space affecting shutdown timeouts: a "soft" one in
/etc/rc.shutdown and a "hard" one in init(8). The first one can be
configured via /etc/rc.conf variable "rcshutdown_timeout" and defaults
to 30 seconds. The second one was originally (in 1998) intended to be
configured via sysctl(8) variable "kern.shutdown_timeout" and defaults
to 120 seconds.

Unfortunately, the "kern.shutdown_timeout" was declared "unused" in 1999
(as it obviously is actually not used within the kernel itself) and
hence was intentionally but misleadingly removed in revision 1.107 from
init_main.c. Kernel sysctl(8) variables are certainly a wrong way to
control user-space processes in general, but in this particular case the
sysctl(8) variable should have remained as it supports init(8), which
isn't passed command line flags (which in turn could have been set via
/etc/rc.conf), etc.

As there is already a similar "kern.init_path" sysctl(8) variable which
directly affects init(8), resurrect the init(8) shutdown timeout under
sysctl(8) variable "kern.init_shutdown_timeout". But this time document
it as being intentionally unused within the kernel and used by init(8).
Also document it in the manpages init(8) and rc.conf(5).

Reviewed by: phk
MFC after: 2 weeks
2005-09-15 13:16:07 +00:00
Maxim Konovalov
c5cff17017 o Return ENOTCONN when shutdown(2) on non-connected socket.
PR:		kern/84761
Submitted by:	James Juran
R-test:		tools/regression/sockets/shutdown
MFC after:	1 month
2005-09-15 11:45:36 +00:00
Poul-Henning Kamp
74f46f19aa Retire unused dev_named() function. 2005-09-15 08:01:57 +00:00
Robert Watson
fd1a469ba5 In vfs_kqfilter(), return EINVAL instead of 1 (EPERM) when an unsupported
kqueue filter type is requested on a vnode.

MFC after:	3 days
2005-09-12 19:22:37 +00:00
Jung-uk Kim
9ed448b20c use monotonic time_uptime' instead of time_second'
Approved by:	anholt (mentor)
Discussed on:	arch
2005-09-12 15:31:28 +00:00
Poul-Henning Kamp
2883ba6668 Introduce vfs_read_dirent() which can help VOP_READDIR() implementations
by handling all the cookie stuff.
2005-09-12 08:46:07 +00:00
Tor Egge
6ff5e2db45 Don't retry when vget() returns ENOENT in the nonblocking case due to the
vnode being doomed.  It causes a livelock.
2005-09-12 01:48:57 +00:00
Don Lewis
908b3deb2b Relocate witness_levelall(), witness_leveldescendents(), and
witness_displaydescendants() so that they are protected by
"#ifdef DDB/#endif" to unbreak kernels not using "option DDB".

MFC after:	3 weeks
2005-09-11 07:57:06 +00:00
Gleb Smirnoff
d04304d155 Make callout_reset() return a non-zero value if a pending callout
was rescheduled. If there was no pending callout, then return 0.

Reviewed by:	iedowse, cperciva
2005-09-08 14:20:39 +00:00
Don Lewis
d07f87a218 Add a new struct buf flag bit, B_PERSISTENT, and use it to tag
struct bufs that are persistently held by ext2fs.  Ignore any buffers
with this flag in the code in boot() that counts "busy" and dirty
buffers and attempts to sync the dirty buffers, which is done before
attempting to unmount all the file systems during shutdown.

This fixes the problem caused by any ext2fs file systems that are
mounted at system shutdown time, which caused boot() to give up on
a non-zero number of buffers and skip the call to vfs_unmountall().
This left all the mounted file systems in a dirty state and caused
them to all require cleanup by fsck on reboot.

Move the two separate copies of the "busy" buffer test in boot()
to a separate function.

Nuke the useless spl() stuff in the ext2fs ULCK_BUF() macro.

Bring the PRINT_BUF_FLAGS definition in sys/buf.h up to date with
this and previous flag changes.

PR:		kern/56675, kern/85163
Tested by:	"Matthias Andree" matthias.andree at gmx.de
Reviewed by:	bde
MFC after:	3 days
2005-09-08 06:30:05 +00:00
David E. O'Brien
5b1c0294e4 Forward declaring static variables as extern is invalid ISO-C. Now that
GCC can properly handle forward static declarations, do this properly.
2005-09-07 10:06:14 +00:00
Gleb Smirnoff
016e62123a In soreceive(), when a first mbuf is removed from socket buffer use
sockbuf_pushsync(). Previous manipulation could lead to an inconsistent
mbuf.

Reviewed by:	rwatson
2005-09-06 17:05:11 +00:00
Gleb Smirnoff
f46ab10c02 Document flags of a pollrec. 2005-09-06 11:09:18 +00:00
Christian S.J. Peron
d1dfd92177 Convert the primary ACL allocator from malloc(9) to using a UMA zone instead.
Also introduce an aclinit function which will be used to create the UMA zone
for use by file systems at system start up.

MFC after:	1 month
Discussed with:	rwatson
2005-09-06 00:06:30 +00:00
Gleb Smirnoff
16901c0186 Remove Giant mutex from polling(4) and use a separate poll_mtx(4)
instead. Detailed changelist:

o Add flags field to struct pollrec, to indicate that
  are particular entry is being worked on.
o Define a macro PR_VALID() to check that a pollrec
  is valid and pollable.
o Mark ISRs as mpsafe.

o ether_poll()
  - Acquire poll_mtx while traversing pollrec array.
  - Skip pollrecs, that are being worked on.
  - Conditionally acquire Giant when entering handler.

o netisr_pollmore()
  - Conditionally assert Giant.
  - Acquire poll_mtx while working with statistics.

o netisr_poll()
  - Conditionally assert Giant.
  - Acquire poll_mtx while working with statistics
    and traversing pollrec array.

o ether_poll_register(), ether_poll_deregister()
  - Conditionally assert Giant.
  - Acquire poll_mtx while working with pollrec array.

o poll_idle()
  - Remove all strange manipulations with Giant.

In collaboration with:	ru, pjd
In collaboration with:	Oleg Bulyzhin <oleg rinet.ru>
In collaboration with:	dima <_pppp mail.ru>
2005-09-05 16:02:11 +00:00
Xin LI
5248ef8a3c When padding with zero, do pad after prefixes rather than padding
before prefixes.

Use cases:
	printf("%05d", -42);   -->   "00-42"   (should be "-0042")
	printf("%#05x", 12);   -->   "000xc"   (should be "0x00c")

Submitted by:	Oliver Fromme
PR:		kern/85520
MFC After:	1 week
2005-09-04 18:03:45 +00:00
Poul-Henning Kamp
1e7d2c4763 If we ignore an unknown % sequence, we must stop interpreting the
remaining % arguments because the varargs are now out of sync and
there is a risk that we might for instance dereference an integer
in a %s argument.

Sponsored by: Napatech.com
2005-09-03 10:28:08 +00:00
John Baldwin
acc0265cc2 - Add some comments to some of the static lock orders. Don't explicitly
link proctree and allproc to Giant since that order is already implicitly
  enforced.
- Use a goto to handle the case where we want to enforce a reversal before
  calling isitmydescendant() in witness_checkorder() so that the logic is
  easier to follow and so that it is easier to add more forced-reversal
  cases in the future.

MFC after:	 3 days
2005-09-02 20:23:49 +00:00
John Baldwin
83cece6fa1 - Add an assertion to panic if one tries to call mtx_trylock() on a spin
mutex.
- Don't panic if a spin lock is held too long inside _mtx_lock_spin() if
  panicstr is set (meaning that we are already in a panic).  Just keep
  spinning forever instead.
2005-09-02 20:21:49 +00:00
John Baldwin
83de502d59 Add witness warnings to panic if a thread tries to exit while holding any
locks.

Requested by:	jeff
MFC after:	3 days
2005-09-02 20:20:01 +00:00
Nate Lawson
9000b91eb9 Break out the checks for duplicates and absolute settings being too high
instead of trying to do them all at once.  This should fix the level sorting
problems from the previous revision.

Testing help:	ume
2005-09-02 16:32:43 +00:00
Suleiman Souhlal
1f71de49e1 Print out a warning and a backtrace if we try to unlock a lockmgr that
we do not hold.

Glanced at by:	phk
MFC after:	3 days
2005-09-02 15:56:01 +00:00
Suleiman Souhlal
2611e5a6a9 Don't unbusy the devfs mount in vfs_mountroot_try() as it gets accessed
and unbusied in devfs_fixup(), which assumes that the devfs mount is
still locked.

Granced at by:	phk
MFC after:	3 days
2005-09-02 13:37:54 +00:00
Pawel Jakub Dawidek
d8b464e51e In case of mac_check_vnode_rename_from() or vn_start_write() failure,
vn_finished_write() should not be called.

Reviewed by:	ssouhlal
MFC after:	3 days
2005-09-01 21:46:33 +00:00
Andre Oppermann
fdcc028d11 Changes and cleanups to m_sanity():
o for() instead of while() looping  over mbuf chain
o paren's around all flag checks
o more verbose function and purpose description
o some more style changes

Based on feedback from:	sam
2005-08-30 21:31:42 +00:00
Andre Oppermann
e0068c3a69 Unbreak m_demote() and put back the 'all' flag. Without it we cannot
correctly test for m_nextpkt in an mbuf chain.
2005-08-30 21:14:30 +00:00
Andre Oppermann
fbe816384a o Remove the 'all' flag from m_demote(). Users can simply call it with
m_demote(m->m_next) if they wish to start at the second mbuf in chain.
o Test m_type with == instead of &.
o Check m_nextpkt against NULL instead of implicit 0.

Based on feedback from:	sam
2005-08-30 20:07:49 +00:00
Nate Lawson
5308b2a64e Eliminate cpufreq levels for two cases that are less than optimal:
1. Walk the absolute list in reverse to prefer duplicated levels that have
a lower absolute setting, i.e. 800 Mhz/50% is better than 1600 Mhz/25% even
though both have the same actual frequency.  This also removes the need to
check for already-modified levels since by definition, those will be added
later in the sorted list.

2. Compare the absolute settings for derived levels and don't use the new
level if it's higher.  For example, a level of 800 Mhz/75% is preferable to
1600 Mhz/25% even though the latter has a lower total frequency.

This work is based on a patch from the submitter but reworked by myself.

Submitted by:	Tijl Coosemans (tijl/ulyssis.org)
2005-08-30 04:45:32 +00:00
Andre Oppermann
4da8443133 Add m_copymdata(struct mbuf *m, struct mbuf *n, int off, int len,
int prep, int how).

Copies the data portion of mbuf (chain) n starting from offset off
for length len to mbuf (chain) m.  Depending on prep the copied
data will be appended or prepended.  The function ensures that the
mbuf (chain) m will be fully writeable by making real (not refcnt)
copies of mbuf clusters.  For the prepending the function returns
a pointer to the new start of mbuf chain m and leaves as much
leading space as possible in the new first mbuf.

Reviewed by:	glebius
2005-08-29 20:15:33 +00:00
Andre Oppermann
a048affba5 Add m_sanity(struct mbuf *m, int sanitize) to do some heavy sanity
checking on mbuf's and mbuf chains.  Set sanitize to 1 to garble
illegal things and have them blow up later when used/accessed.

m_sanity()'s main purpose is for KASSERT()'s and debugging of non-
kosher mbuf manipulation (of which we have a number of).

Reviewed by:	glebius
2005-08-29 19:58:56 +00:00
Andre Oppermann
ed111688e9 Add m_demote(struct mbuf *m, int all) to clean up mbuf (chain) from
any tags and packet headers.  If "all" is set then the first mbuf
in the chain will be cleaned too.

This function is used before an mbuf, that arrived as packet with
m->flags & M_PKTHDR, is appended to an mbuf chain using m->m_next
(not m->m_nextpkt).

Reviewed by:	glebius
2005-08-29 19:45:39 +00:00
Pawel Jakub Dawidek
e37a499443 Add 'depth' argument to CTRSTACK() macro, which allows to reduce number
of ktr slots used. If 'depth' is equal to 0, the whole stack will be
logged, just like before.
2005-08-29 11:34:08 +00:00
Suleiman Souhlal
a6c109d658 Fix a typo in vop_rename_pre() where we ended up using vholdl()
instead of vhold(), even though the vnode interlock is unlocked.

MFC after:	3 days
2005-08-28 23:00:11 +00:00
Alan Cox
7f1ef325d7 Handle vm_map_wire()'s failure. 2005-08-28 05:38:40 +00:00
Alan Cox
5d3043ce9a Correctly handle vm_map_wire()'s failure. (See also revisions 1.81 and
1.82.)

Reviewed by:	tegge
2005-08-28 04:50:11 +00:00
Alan Cox
45e31b6034 Eliminate an unneeded reference on a vm object. If, in fact, the nearby
vm_map_find() fails, then the excess reference causes the vm object to be
leaked.

Reviewed by:	tegge
2005-08-28 00:24:58 +00:00
Alan Cox
4167396552 Revert the previous change for two reasons: (1) If vm_map_find() succeeds
but vm_map_wire() fails, then a vm object, vm map entries, and kernel_map
free space is leaked and (2) unwiring is handled automatically by
vm_map_remove().

Suggested by:   tegge
2005-08-28 00:19:54 +00:00
Dag-Erling Smørgrav
d09dfa2bfd Two minor optimizations of fdalloc():
- if minfd < fd_freefile (as is most often the case, since minfd is
   usually 0), set it to fd_freefile.

 - remove a call to fd_first_free() which duplicates work already done
   by fdused().

This change results in a small but measurable speedup for processes
with large numbers (several thousands) of open files.

PR:		kern/85176
Submitted by:	Divacky Roman <xdivac02@stud.fit.vutbr.cz>
MFC after:	3 weeks
2005-08-26 11:16:39 +00:00
Don Lewis
4053cae340 Track all lock relationships instead of pruning direct relationships
if an indirect relationship exists (keep both A->B->C and A->C).
This allows witness_checkorder() to use isitmychild() instead of
the much more expensive isitmydescendant() to check for valid lock
ordering.

Don't do an expensive tree walk to update the w_level values when
the tree is updated.  Only update the w_level values when using the
debugger to display the tree.

Nuke the experimental "witness_watch > 1" mode that only compared
w_level for the two locks.  This information is no longer maintained
at run time, and the use of isitmychild() in witness_checkorder
should bring performance close enough to the acceptable level that
this hack is not needed.

Report witness data structure allocation statistics under the
debug.witness sysctl.

Reviewed by:	jhb
MFC after:	30 days
2005-08-25 03:47:37 +00:00
Don Lewis
ad9f180121 Back out the removal of LK_NOWAIT from the VOP_LOCK() call in
vlrureclaim() in vfs_subr.c 1.636  because waiting for the vnode
lock aggravates an existing race condition.  It is also undesirable
according to the commit log for 1.631.

Fix the tiny race condition that remains by rechecking the vnode
state after grabbing the vnode lock and grabbing the vnode interlock.

Fix the problem of other threads being starved (which 1.636 attempted
to fix by removing LK_NOWAIT) by calling uio_yield() periodically
in vlrureclaim().  This should be more deterministic than hoping
that VOP_LOCK() without LK_NOWAIT will block, which may not happen
in this loop.

Reviewed by:	kan
MFC after:	5 days
2005-08-23 03:44:06 +00:00
Pawel Jakub Dawidek
4e4aa37e75 mp_ncpus is always (properly) initialized, even on UP kernels, so just use it. 2005-08-21 18:03:31 +00:00
Robert Watson
6cd8dee3c5 Silence "busy" warnings when unmounting devfs at system shutdown. This
is a workaround for non-symetric teardown of the file systems at
shutdown with respect to the mount order at boot.  The proper long term
fix is to properly detach devfs from the root mount before unmounting
each, and should be implemented, but since the problem is non-harmful,
this temporary band-aid will prevent false positive bug reports and
unnecessary error output for 6.0-RELEASE.

MFC after:	3 days
Tested by:	pav, pjd
2005-08-20 17:12:47 +00:00
Poul-Henning Kamp
1d45c50ec3 Properly un-giant-trick the cdevsw in fini_cdevsw()
Tripped over by:	Huang wen hui <huang@gddsn.org.cn>
2005-08-20 12:13:51 +00:00
David Xu
86ef8e2671 Add missing brackets.
Noticed by: stefanf@
2005-08-19 22:30:13 +00:00
David Xu
8c6d7a8db8 Fix a LOR between sched_lock and sleep queue lock. 2005-08-19 13:35:34 +00:00
David Xu
f8ec133ed0 Move up code for testing KEF_HOLD to avoid ke_cpu being changed unexpectly
for PRI_ITHD and PRI_REALTIME threads.
2005-08-19 11:51:41 +00:00
Hajimu UMEMOTO
1fea6ce7dd - don't forget to save freqency when priority is raised.
- nuke redundant variable initialization.
2005-08-18 16:41:25 +00:00
Hajimu UMEMOTO
5f36393468 don't forget to update curr_priority. even when frequency is
not changed, priority may be changed.
2005-08-18 16:08:56 +00:00
Poul-Henning Kamp
516ad423b1 Handle device drivers with D_NEEDGIANT in a way which does not
penalize the 'good' drivers:  Allocate a shadow cdevsw and populate
it with wrapper functions which grab Giant
2005-08-17 08:19:52 +00:00
Poul-Henning Kamp
a07b0febaa In vop_stdpathconf(ap) also default for _PC_NAME_MAX and _PC_PATH_MAX. 2005-08-17 06:59:23 +00:00
Hajimu UMEMOTO
961f7f911f Save cpu level only when priority is greater than PRIO_USER
to make CPUFREQ_SET(NULL, prio) work.
TODO: implement saved_level as stack.

Reviewed by:	njl
2005-08-16 20:03:08 +00:00
Poul-Henning Kamp
b3740d656f Remove stale comment. 2005-08-16 19:47:42 +00:00
Poul-Henning Kamp
31cc57cdbd Collect the devfs related sysctls in one place 2005-08-16 19:25:02 +00:00
Poul-Henning Kamp
9c0af1310c Create a new internal .h file to communicate very private stuff
from kern_conf.c to devfs.

For now just two prototypes, more to come.
2005-08-16 19:08:01 +00:00
Alexander Kabaev
0c207975f2 Do not keep parent directory locked while calling VFS_ROOT to traverse mount
points in lookup(). The lock can be dropped safely around VFS_ROOT because
LOCKPARENT semantics with child and perent vnodes coming from different FSes
does not really have any meaningful use. On the other hard, this prevents
easily triggered deadlock on systems using automounter daemon.
2005-08-14 18:10:04 +00:00
Alexander Kabaev
857b66d505 Do not use vm_pager_init() to initialize vnode_pbuf_freecnt variable.
vm_pager_init() is run before required nswbuf variable has been set
to correct value. This caused system to run with single pbuf available
for vnode_pager. Handle both cluster_pbuf_freecnt and vnode_pbuf_freecnt
variable in the same way.

Reported by:	ade
Obtained from:	alc
MFC after:	2 days
2005-08-13 20:21:33 +00:00
Marcel Moolenaar
fd65baf8e2 Make mpsafe_vfs=1 the default on ia64. 2005-08-13 20:07:50 +00:00
Nate Lawson
da8a77c1f1 The "lowest" sysctl setting makes more sense as the lowest one to use, so
discard all levels less than this setting, not less than/equal to.

MFC after:	1 day
2005-08-11 18:40:58 +00:00
Alexander Kabaev
45a0d1ed7a Do not drop the vnode interlock if vdropl is called on already doomed vnode.
vdropl callers expect it to return with interlock still being held.

MFC after:	2 days
2005-08-10 11:46:03 +00:00
Robert Watson
ae018704a1 Add an order between UDP inpcb locks and the IPv4 multicast address
list lock, as there has been a report that an alternative lock order
is getting introduced.  This should help ferret it out.

Reported by:	Ed Maste <emaste at phaedrus dot sandvine dot ca>
2005-08-09 13:27:50 +00:00
Robert Watson
13f4c340ae Propagate rename of IFF_OACTIVE and IFF_RUNNING to IFF_DRV_OACTIVE and
IFF_DRV_RUNNING, as well as the move from ifnet.if_flags to
ifnet.if_drv_flags.  Device drivers are now responsible for
synchronizing access to these flags, as they are in if_drv_flags.  This
helps prevent races between the network stack and device driver in
maintaining the interface flags field.

Many __FreeBSD__ and __FreeBSD_version checks maintained and continued;
some less so.

Reviewed by:	pjd, bz
MFC after:	7 days
2005-08-09 10:20:02 +00:00
Christian S.J. Peron
d8339a2616 Drop in a WITNESS_WARN into SYSCTL_IN to make sure that we are
not holding any non-sleep-able-locks locks when copyin is called.
This gets executed un-conditionally since we have no function
to wire the buffer in this direction.

Pointed out by:	truckman
MFC after:	1 week
2005-08-08 21:06:42 +00:00
Robert Watson
6a113b3de7 Merge the dev_clone and dev_clone_cred event handlers into a single
event handler, dev_clone, which accepts a credential argument.
Implementors of the event can ignore it if they're not interested,
and most do.  This avoids having multiple event handler types and
fall-back/precedence logic in devfs.

This changes the kernel API for /dev cloning, and may affect third
party packages containg cloning kernel modules.

Requested by:	phk
MFC after:	3 days
2005-08-08 19:55:32 +00:00
Christian S.J. Peron
417ab24f78 Check to see if we wired the user-supplied buffers in SYSCTL_OUT, if
the buffer has not been wired and we are holding any non-sleep-able locks,
drop a witness warning. If the buffer has not been wired, it is possible
that the writing of the data can sleep, especially if the page is not in
memory. This can result in a number of different locking issues, including
dead locks.

MFC after:	1 week
Discussed with:	rwatson
Reviewed by:	jhb
2005-08-08 18:54:35 +00:00
David Xu
1278181c6c Try best to keep a preempted thread at front of run queue, this seems
improved performance a bit for some workloads, but still seeing interactive
lagging unless cpu idling race is fixed.
2005-08-08 14:20:10 +00:00
Peter Grehan
e000e00118 Export a routine, kobj_machdep_init(), that allows platforms
to use the kobj subsystem as soon at mutex_init() has been called
instead of having to wait for the SI_SUB_LOCK sysinit.

Reviewed by:	dfr
2005-08-07 02:20:35 +00:00
Christian S.J. Peron
9baea4b4b4 Change the data type of the upper shared memory limits from a signed
integer to an unsigned long. This lifts variables like the maximum
number of pages available for shared memory from 2^31 to 2^32 on 32
bit architectures, and from 2^31 to 2^64 on 64 bit architectures.

It should be noted that this changes breaks ABI on 64 bit architectures
because the size of the shmmax, shmmin, shmmni, shmseg and shmall members
of the shminfo structure has changed.

Silence on:	current@
2005-08-06 07:20:18 +00:00
Suleiman Souhlal
34cc826ae8 Holding a vnode doesn't prevent v_mount from disappearing (when the
vnode is inactivated), possibly leading to a NULL dereference when
checking if the mount wants knotes to be activated in the VOP hooks.
So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(),
if necessary, and check it when activating knotes.
Since the flags are not erased when a vnode is being held, we can safely
read them.

Reviewed by:	kris@
MFC after:	3 days
2005-08-06 01:42:04 +00:00
Robert Watson
dd5a318ba3 Introduce in_multi_mtx, which will protect IPv4-layer multicast address
lists, as well as accessor macros.  For now, this is a recursive mutex
due code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.

For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.

Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().

Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.

Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.

Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.

Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.

Problem reported by:	Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after:		10 days
2005-08-03 19:29:47 +00:00
Jeff Roberson
40a495853a - Unlock before we call mac_destroy_vnode to prevent a lock order reversal.
Found by:	trhodes
2005-08-03 05:36:50 +00:00
Jeff Roberson
9e2aaec1e3 - Use lockmgr_printinfo rather than rolling our own. This introduces a
slight problem by using printf instead of db_printf however
   'show lockedvnods' does the same so I believe it is ok for now.
2005-08-03 05:02:08 +00:00
Jeff Roberson
7499fd8de9 - Fix a problem that slipped through review; the stack member of the lockmgr
structure should have the lk_ prefix.
 - Add stack_print(lkp->lk_stack) to the information printed with
   lockmgr_printinfo().
2005-08-03 04:59:07 +00:00
Jeff Roberson
e8ddb61d38 - Replace the series of DEBUG_LOCKS hacks which tried to save the vn_lock
caller by saving the stack of the last locker/unlocker in lockmgr.  We
   also put the stack in KTR at the moment.

Contributed by:		Antoine Brodin <antoine.brodin@laposte.net>
2005-08-03 04:48:22 +00:00
Jeff Roberson
8d511e2a05 - Add support for saving stack traces and displaying them via printf(9)
and KTR.

Contributed by:		Antoine Brodin <antoine.brodin@laposte.net>
Concept code from:	Neal Fachan <neal@isilon.com>
2005-08-03 04:27:40 +00:00
David Xu
3c424d1447 In adjustrunqueue(), add code to handle thread migrating case for
ULE scheduler. In original code, local run queue of threaded ksegrp
is corrupted if adjustrunqueue() is called while thread is migrating.
2005-08-03 01:23:45 +00:00
Ruslan Ermilov
2319835713 Long overdue, keep up with mbuf.h,v 1.148. 2005-08-02 20:03:23 +00:00
Kelly Yancey
dcb5fef5db Make getsockopt(..., SOL_SOCKET, SO_ACCEPTCONN, ...) work per IEEE Std
1003.1 (POSIX).
2005-08-01 21:15:09 +00:00
David Xu
3d16f519b6 If a thread was removed from system run queue, kse_assign shouldn't
add it again.
2005-07-31 15:11:21 +00:00
Alexander Leidinger
32069af652 The resource_xxx routines in subr_hints.c are called before and after the
kenv environment in kern_environment.c switches to dynamic kenv. The prior
call sets the static variable hintp to the static hints in subr_hints.c
(hintmode==0).

However, changes to the environment are not detected by the resource_xxx
lookups after the change to dynamic kernel environment, so the lookup
routines only report the old stuff of hintmode==0, even after the change to
the dynamic kenv. This causes kenv users to see a different environment than
the kernel routines.

This is a problem in the mixer.c code that looks up initial mixer volume
settings from the hints: If the hints are dynamic and not from the
device.hints file, mixer.c doesn't see them, but kenv does.

The patch from the PR (modified to comply to the style of the function)
solves this.

PR:		83686
Submitted by:	Harry Coin <harrycoin@qconline.com>
2005-07-31 10:46:55 +00:00
Alexander Leidinger
3904769ba8 Add bounds checking to the setenv part of the kernel environment.
This has no security implications since only root is allowed to use
kenv(1) (and corrupt the kernel memory after adding too much variables
previous to this commit).

This is based upon the PR [1] mentioned below, but extended to check both
bounds (in case of an overflow of the counting variable) and to comply
to the style of the function. An overflow of the counting variable
shouldn't happen after adding the check for the upper bound, but better
safe than sorry (in case some other function in the kernel overwrites
random memory).

An interested soul may want to add a printf to notify root in case the
bounds are hit.

Also allocate KENV_SIZE+1 entries (the array is NULL-terminated), since
the comment for KENV_SIZE says it's the maximum number of environment
strings. [2]

PR:		83687 [1]
Submitted by:	Harry Coin <harrycoin@qconline.com> [1]
Submitted by:	Ariff Abdullah <skywizard@MyBSD.org.my> [2]
2005-07-31 10:28:35 +00:00
Joseph Koshy
fadcc6e201 Fail the module loading process if the currently executing kernel
was not compiled with 'options HWPMC_HOOKS' or if the compiled-in
version numbers of the kernel and module are out of sync.

Reported by:	cracauer
MFC after:	3 days
2005-07-30 09:02:42 +00:00
Paul Saab
1126349ae7 Ignore mutex asserts when we're dumping as well. This allows me
to panic a system from DDB when INVARIANTS is compiled into the
kernel on a scsi system.
2005-07-30 05:54:30 +00:00
Sam Leffler
ab8ab90c5b add m_align, a function to align any type of mbuf (i.e. it
is a superset of M_ALIGN and MH_ALIGN)

Reviewed by:	several
2005-07-30 01:32:16 +00:00
R. Imura
080e3a63b3 Change API of mb_copy_t in libmchain so that netsmb can handle
multibyte character share name correctly.

Reviewed by:	bp
2005-07-29 13:22:37 +00:00
George V. Neville-Neil
0d52d7b01a Fix for PR 83885.
Make sure that there actually is a next packet before setting
nextrecord to that field.

PR: 83885
Submitted by: hirose@comm.yamaha.co.jp
Obtained from:	Patch suggested in the PR
MFC after: 1 week
2005-07-28 10:10:01 +00:00
Pawel Jakub Dawidek
73864adbd4 Fix the way how "InUse" column in 'vmstat -m' output works:
- increase number of allocations count only on successfull malloc(9),
  so it doesn't confuse people;
- because we need to check if 'size > 0', hide 'mtsp->mts_memalloced += size;'
  under the check as well, as for size=0 it is of course a no-op;
- avoid critical_enter()/critical_exit() in case of failure in
  malloc_type_allocated() as there will be nothing to do.

OK'ed by:	rwatson
MFC after:	2 days
2005-07-27 23:17:31 +00:00
Xin LI
05a6b7ad62 Cast to uintptr_t when the compiler complains. This unbreaks ULE
scheduler breakage accompanied by the recent atomic_ptr() change.
2005-07-25 10:21:49 +00:00
Alan Cox
ec9c9e7363 Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function.  Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks
2005-07-20 19:06:06 +00:00
Jeff Roberson
39b2406838 - Allow vnlru to drop giant if the filesystem does not require it. The
vnlru proc is extremely inefficient, potentially iteration over tens of
   thousands of vnodes without blocking.  Droping Giant allows other threads
   to preempt us although we should revisit the algorithm to fix the runtime
   problems especially since this may hold up all vnode allocations.
 - Remove the LK_NOWAIT from the VOP_LOCK in vlrureclaim.  This provides
   a natural blocking point to help alleviate the situation described above
   although it may not technically be desirable.
 - yield after we make a pass on all mount points to prevent us from
   blocking other threads which require Giant.

MFC after:	2 weeks
2005-07-20 01:43:27 +00:00
John Baldwin
ddf9c4f771 - Slightly reorder the events around the setting of PRS_ZOMBIE to be less
hokie and much more readable and expand the comment to explain why it is
  the way that it is.
- Close a race where one CPU could free the process belonging to a thread
  on another CPU that hasn't quite finished exiting yet but is beyond the
  point of setting the process state as PRS_ZOMBIE.

Reported and tested by:	ps (2)
MFC after:	3 days
2005-07-18 20:08:14 +00:00
Robert Watson
68352adfe7 Define four constants, MBUF_{,MEM,CLUSTER,PACKET,TAG}_MEM_NAME, which
are string names for their respective UMA zones and malloc types, and
are passed into uma_zcreate() and MALLOC_DEFINE().  Export them
outside of _KERNEL in mbuf.h so that netstat can reference them.

Change the names to improve consistency, with each zone/type
associated with the mbuf allocator being prefixed mbuf_.

MFC after:	1 week
2005-07-17 14:04:03 +00:00
John Baldwin
122eceef61 Convert the atomic_ptr() operations over to operating on uintptr_t
variables rather than void * variables.  This makes it easier and simpler
to get asm constraints and volatile keywords correct.

MFC after:	3 days
Tested on:	i386, alpha, sparc64
Compiled on:	ia64, powerpc, amd64
Kernel toolchain busted on:	arm
2005-07-15 18:17:59 +00:00