10847 Commits

Author SHA1 Message Date
Konstantin Belousov
22a448c4d9 vm_map_lock_read() does not increment map->timestamp, so we should
compare map->timestamp with saved timestamp after map read lock is
reacquired, not with saved timestamp + 1. The only consequence of the +1
was unconditional lookup of the next map entry, though.

Tested by:	pho
Approved by:	des
MFC after:	2 weeks
2008-12-29 12:45:11 +00:00
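The timestamp comparison described in the entry above can be sketched in userland C. This is a toy model only: `toy_map`, `toy_read_lock()`, `toy_write_lock()` and `toy_map_changed()` are hypothetical names, not the kernel's vm_map API, and the only behaviour assumed is what the commit message states (taking the read lock leaves the timestamp alone).

```c
#include <stdbool.h>

/* Toy stand-in for a VM map; NOT the kernel's struct vm_map. */
struct toy_map {
	unsigned int timestamp;	/* bumped only when a write lock is taken */
};

/* Assumption for this sketch: only the write lock bumps the timestamp. */
void
toy_write_lock(struct toy_map *m)
{
	m->timestamp++;
}

/* Per the commit message, the read lock does not touch the timestamp. */
void
toy_read_lock(struct toy_map *m)
{
	(void)m;
}

/*
 * After reacquiring the read lock, compare against the saved value
 * itself.  Comparing against saved + 1 would report a change even
 * when nothing happened, forcing a needless entry lookup.
 */
bool
toy_map_changed(const struct toy_map *m, unsigned int saved)
{
	return (m->timestamp != saved);
}
```

With `saved` taken before the lock is dropped, `toy_map_changed()` stays false across a plain read-lock reacquisition and becomes true only after an intervening write lock.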
Kip Macy
08a2459ee1 drop rnh lock before destroying it 2008-12-28 14:32:27 +00:00
Bjoern A. Zeeb
34820bbf06 Hide detect_virtual() along with the accompanying string
arrays under #ifndef XEN to make XEN config compile again.
In case of Xen vm_guest is hard coded.

Move the list for the vm_guest sysctl out of the restrictive
bounds as the sysctl is there in either case.
2008-12-27 17:19:16 +00:00
Peter Holm
ab62a2d023 Prevent overflow of uio_resid.
Approved by:	kib
2008-12-27 10:13:43 +00:00
Robert Watson
9c232f86ca Following the recent security advisory, add a comment describing our
invariants and approach for protocol switch methods in protsw_init(),
and also some KASSERT's for non-domain init entries in protocol
switch tables: pru_abort and pru_send must both be implemented.

For now, leave those assertions #if 0'd, since there are a few
protocols that violate them in non-harmful ways.  Whether or not we
should enforce pru_abort being implemented for non-stream protocols
is an interesting question: currently abort is only invoked on stream
sockets in situations where un-accepted sockets must be abruptly
closed (i.e., close() on a listen socket with pending connections),
but in principle it is useful for datagram sockets and most datagram
socket types implement it.

MFC after:	3 weeks
2008-12-25 11:32:38 +00:00
Joe Marcus Clarke
4769218f4b Do not KASSERT when vp->v_dd is NULL. Only directories which have had ".."
looked up would have v_dd set to a non-NULL value.  This fixes a panic
seen when running installworld on a diskless system with a separate /usr
file system.

Submitted by:	cracauer
Approved by:	kib
2008-12-23 20:43:42 +00:00
Konstantin Belousov
86dcb537c9 Keep the hold on the vnode during VOP_VPTOCNP() call, allowing the vop
implementation to drop vnode lock, if needed.

Reported and tested by:	pho
2008-12-23 20:04:31 +00:00
Ivan Voras
59d9578919 Add missing newlines to flags tags of CPU topology, for prettier
output.

Reviewed by:	jeff (original version)
Approved by:	gnn (mentor) (original version)
2008-12-23 16:19:59 +00:00
Colin Percival
f0b40b1c97 Prevent cross-site forgery attacks on ftpd(8) due to splitting
long commands into multiple requests. [08:12]

Avoid calling uninitialized function pointers in protocol switch
code. [08:13]

Merry Christmas everybody...

Approved by:	so (cperciva)
Approved by:	re (kensmith)
Security:	FreeBSD-SA-08:12.ftpd, FreeBSD-SA-08:13.protosw
2008-12-23 01:23:09 +00:00
Ed Schouten
3a4d0c86aa Revert r185891.
In r185891 I removed the newlines from messages written to /dev/console,
because it made startup messages from rc-scripts harder to read. This,
unfortunately, causes the kernel message that is printed after a
non-terminated log message to be concatenated.

This could be fixed, but in the short term it's better to just revert the
change.

Reported by:	Jaakko Heinonen <jh saunalahti fi>
2008-12-21 21:54:01 +00:00
Ed Schouten
67dd0ccbee Set PTS_FINISHED before waking up any threads.
Inside ptsdrv_{in,out}wakeup() we call KNOTE_LOCKED() to wake up any
kevent(2) users. Because the kqueue handlers are executed synchronously,
we must set PTS_FINISHED before calling ptsdrv_{in,out}wakeup().

Discovered by:	nork
2008-12-21 21:16:57 +00:00
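The ordering issue in the entry above comes down to a handler that runs synchronously inside the wakeup and therefore sees whatever state exists at that instant. A minimal sketch, assuming a hypothetical `TOY_FINISHED` flag and `toy_*` helpers that merely imitate the shape of the real driver:

```c
#include <stdbool.h>

#define TOY_FINISHED	0x1	/* hypothetical flag, modelled on PTS_FINISHED */

struct toy_pts {
	int  flags;
	bool handler_saw_finished;	/* what the event handler observed */
};

/* Runs synchronously inside the wakeup, like a kqueue filter would. */
void
toy_wakeup(struct toy_pts *p)
{
	p->handler_saw_finished = (p->flags & TOY_FINISHED) != 0;
}

/* Wrong order: the handler runs before the flag is visible. */
void
toy_close_wrong(struct toy_pts *p)
{
	toy_wakeup(p);
	p->flags |= TOY_FINISHED;
}

/* Right order: set the flag first, then wake up the listeners. */
void
toy_close_right(struct toy_pts *p)
{
	p->flags |= TOY_FINISHED;
	toy_wakeup(p);
}
```

In the wrong order the handler never learns that the device is finished, because no further wakeup follows.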
Ed Schouten
9d34a1338c Let wchan names more closely match pre-MPSAFE TTY behaviour.
Right now the wchan strings "ttyinp" and "ttybgw" differ by only one
character from the strings we used prior to MPSAFE TTY. Just rename them
back to their pre-MPSAFE TTY counterparts.

Also rename "ttylck" to "ttymtx", which should make it more clear that a
process is blocked on the TTY mutex, not some other form of locking.
2008-12-20 09:36:40 +00:00
Nathan Whitehorn
91416fb268 Modularize the Open Firmware client interface to allow run-time switching
of OFW access semantics, in order to allow future support for real-mode
OF access and flattened device trees. OF client interface modules are
implemented using KOBJ, in a similar way to the PPC PMAP modules.

Because we need Open Firmware to be available before mutexes can be used on
sparc64, changes are also included to allow KOBJ to be used very early in
the boot process by only using the mutex once we know it has been initialized.

Reviewed by:    marius, grehan
2008-12-20 00:33:10 +00:00
Ivan Voras
bb501b18e8 Further beautify the lock strings to be more pleasing to the eye and
self documenting within 6 characters.

Reviewed by:	ed (older version)
Approved by:	gnn (older version)
2008-12-19 14:49:14 +00:00
Ruslan Ermilov
98bd6a1982 Removed a comment made obsolete by revisions 157927 and 174292. 2008-12-18 15:56:12 +00:00
Ivan Voras
3610a2260b By popular request, stringify kern.vm_guest sysctl. Now it returns a
short, self-documenting string describing the detected virtual
environment.

Approved by:	gnn (mentor) (earlier version)
2008-12-18 15:34:38 +00:00
Ivan Voras
0e469db660 Remove spaces in wait object names to make top(1) output prettier and
unbreak scripts that examine ps(1) output.

Reviewed by:	ed
Approved by:	gnn (mentor)
2008-12-18 15:25:33 +00:00
Konstantin Belousov
548066ea66 The quotactl, statfs and fstatfs syscall implementations may dereference
NULL pointer to struct mount if the looked up vnode is reclaimed. Also,
these syscalls only mnt_ref() the mp, still allowing it to be unmounted;
only struct mount memory is kept from being reused.

Lock the vnode when doing name lookup, then reference its mount point,
unlock the vnode and vfs_busy the mountpoint. This sequence shall take
care of both races.

Reported and tested by:	pho
Discussed with:	attilio
MFC after:	1 month
2008-12-18 12:01:19 +00:00
Konstantin Belousov
2cfddad734 Do not return success and doomed vnode from lookup. LK_UPGRADE allows
the vnode to be reclaimed.

Tested by:	pho
MFC after:	1 month
2008-12-18 11:58:12 +00:00
Ivan Voras
3dc309114a Introduce a sysctl kern.vm_guest that reflects what the kernel knows about
it running under a virtual environment. This also introduces a globally
accessible variable vm_guest that can be used where appropriate in the
kernel to inspect this environment.

To make it easier for the long run, an enum VM_GUEST is also introduced,
which could possibly be factored out in a header somewhere (but the
question is where - vm/vm_param.h? sys/param.h?) so it eventually becomes
a part of the standard KPI. In any case, it's a start.

The purpose of all this isn't to absolutely detect that the OS is running
under a virtual environment (cf. "redpill") but to allow the parts of the
kernel and the userland that care about this particular aspect and can do
something useful depending on it to have a standardised interface. Reducing
kern.hz is one example but there are other things that could be done like
avoiding context switches, not using CPU instructions that are known to be
slow in emulation, possibly different strategies in VM (memory) allocation,
CPU scheduling, etc.

It isn't clear if the JAILS/VIMAGE functionality should also be exposed
by this particular mechanism (probably not since they're not "full"
virtual hardware environments). Sometime in the future another sysctl and
a variable could be introduced to reflect if the kernel supports any kind
of virtual hosting (e.g. VMWare VMI, Xen dom0).

Reviewed by:	silence from src-committers@, virtualization@, kmacy@
Approved by:	gnn (mentor)
Security:	Obscurity doesn't help.
2008-12-17 19:57:12 +00:00
Peter Wemm
a3ac8c94cf Remove sysctl debug.elf_trace and the trace field in auxargs. They go
nowhere.  It used to be the equivalent of $LD_DEBUG in rtld-elf.
Elf_Auxargs is an internal structure.
2008-12-17 16:54:29 +00:00
Warner Losh
35c2a5a852 Minor style(9) nit. 2008-12-17 16:25:20 +00:00
Konstantin Belousov
6f3475454e Remove two remnant uses of AT_DEBUG. 2008-12-17 13:13:35 +00:00
Attilio Rao
4a0f807602 1) Fix a deadlock in the VFS:
- threadA runs vfs_rel(mp1)
- threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK()
- threadA runs vfs_busy(mp1) and, since MNTK_UNMOUNT is set, sleeps
  waiting for threadB to complete the unmount
- threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting
  for the refcount to expire.

Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the
unmounter is waiting for mnt_ref to expire.
The vfs_busy contenders are woken up and fail, and if they retry,
MNTK_REFEXPIRE won't allow them to sleep again.

2) Significantly simplify vfs_mount_destroy() by trimming
   unnecessary code:
   - once all references have expired, no write operation (primary
     or secondary) can still be in progress.
   - there is no need to drop and reacquire the mount lock.
   - filling the structure with dummy values is pointless since
     it is about to be freed.

Tested by:	pho, Andrea Barberio <insomniac at slackware dot it>
Discussed with:	kib
2008-12-16 23:16:10 +00:00
Alexander Motin
d288bcc4df If possible, try to obtain max_mhz on cpufreq attach instead of first request.
On HyperThreading CPUs logical cores have same frequency, so setting it
on any core will change the other's one. In most cases first request
to the second core will be the "set" request, done after setting frequency
of the first core. In such case second CPU will obtain throttled frequency
of the first core as its max_mhz, making cpufreq broken due to different
frequency sets.
2008-12-16 01:24:05 +00:00
Alexander Motin
a9385ad10f Change ttyhook_register() second argument from thread to process pointer.
Thread was not really needed there, while previous ng_tty implementation
that used thread pointer had locking issues (using sx while holding mutex).
2008-12-13 21:17:46 +00:00
Joseph Koshy
6fe00c7876 - Bug fix: prevent a thread from migrating between CPUs between the
  time it is marked for user space callchain capture in the NMI
  handler and the time the callchain capture callback runs.

- Improve code and control flow clarity by invoking hwpmc(4)'s user
  space callchain capture callback directly from low-level code.

Reviewed by:	jhb (kern/subr_trap.c)
Testing (various patch revisions): gnn,
		Fabien Thomas <fabien dot thomas at netasq dot com>,
		Artem Belevich <artemb at gmail dot com>
2008-12-13 13:07:12 +00:00
Ed Schouten
d4892ee51e Add FIONREAD to pseudo-terminal master devices.
All ioctl()'s that aren't implemented by pts(4) are forwarded to the TTY
itself. Unfortunately this is not correct for FIONREAD, because it will
give the wrong amount of bytes that are available to read.

Tested by:	keramida
Reminded by:	keramida
2008-12-13 07:23:55 +00:00
Konstantin Belousov
cd2983ca71 uio_yield() already does DROP_GIANT/PICKUP_GIANT, no need to repeat this
around the call.

Noted by:  bde
2008-12-12 14:03:04 +00:00
Konstantin Belousov
c7462f4387 Reference the vmspace of the process being inspected by procfs, linprocfs
and sysctl kern_proc_vmmap handlers.

Reported and tested by:	pho
Reviewed by:	rwatson, des
MFC after:	1 week
2008-12-12 12:12:36 +00:00
Konstantin Belousov
af80b2c901 The userland_sysctl() function retries sysctl_root() until returned
error is not EAGAIN. Several sysctls that inspect another process use
p_candebug() to check the access rights of curproc. p_candebug()
returns EAGAIN in some cases, in particular for a process that is
currently doing exec(). If the execing process tries to lock Giant, we get
a livelock, because sysctl handlers are covered by Giant, and often do not sleep.

Break the livelock by dropping Giant and allowing other threads to
execute in the EAGAIN loop.

Also, do not return EAGAIN from p_candebug() when process is executing,
use more appropriate EBUSY error [1].

Reported and tested by:	pho
Suggested by:	rwatson [1]
Reviewed by:	rwatson, des
MFC after:	1 week
2008-12-12 12:06:28 +00:00
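The retry loop from the entry above can be approximated in userland. This is a sketch under stated assumptions: `toy_retry()` and `toy_busy_handler()` are hypothetical, and POSIX `sched_yield()` stands in for dropping Giant so that other threads get a chance to make progress between attempts.

```c
#include <errno.h>
#include <sched.h>

/* Hypothetical handler type: returns 0 or an errno value. */
typedef int (*toy_handler_t)(void *arg);

/*
 * Call the handler until it returns something other than EAGAIN.
 * Yielding between attempts is the userland analogue of letting
 * other threads run instead of spinning on the same condition.
 */
int
toy_retry(toy_handler_t handler, void *arg, int *attempts)
{
	int error;

	*attempts = 0;
	for (;;) {
		error = handler(arg);
		(*attempts)++;
		if (error != EAGAIN)
			return (error);
		sched_yield();
	}
}

/* Example handler: reports EAGAIN until its counter reaches zero. */
int
toy_busy_handler(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		return (EAGAIN);
	}
	return (0);
}
```

A handler that instead returned EBUSY would leave the loop on the first attempt, which is the effect of the second change in the commit.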
Joe Marcus Clarke
b9022449b3 Add a new VOP, VOP_VPTOCNP, which translates a vnode to its component name
on a best-effort basis.  Teach vn_fullpath to use this new VOP if a
regular VFS cache lookup fails.  This VOP is designed to supplement the
VFS cache to provide a better chance that a vnode-to-name lookup will
succeed.

Currently, an implementation for devfs is being committed.  The default
implementation is to return ENOENT.

A big thanks to kib for the mentorship on this, and to pho for running it
through his stress test suite.

Reviewed by:	arch
Approved by:	kib
2008-12-12 00:57:38 +00:00
Ed Schouten
1ff90be789 Add kqueue()-support to pseudo-terminal master devices.
One thing I didn't expect many applications to use, was kqueue() on
pseudo-terminal master devices. There are applications that use kqueue()
on the TTY itself (rtorrent, etc). That doesn't mean we shouldn't
implement this. Libraries like libevent use kqueue() by default, which
means they wouldn't be able to use kqueue().

The old TTY layer implements a very broken version of kqueue() by
performing the actual polling on the TTY device.

Discussed with:	peter
2008-12-11 21:44:02 +00:00
Bjoern A. Zeeb
9ea9ef7e89 Order #includes - also to reduce diffs with vimage branches in p4.
Sponsored by:	The FreeBSD Foundation
2008-12-11 16:09:31 +00:00
Bjoern A. Zeeb
0f1fe22db5 Correctly check the number of prison states to not access anything
outside the prison_states array.
When checking if there is a name configured for the prison, check the
first character to not be '\0' instead of checking if the char array
is present, which it always is. Note, that this is different for the
*jailname in the syscall.

Found with:	Coverity Prevent(tm)
CID:		4156, 4155
MFC after:	4 weeks (just that I get the mail)
2008-12-11 01:04:25 +00:00
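Both fixes in the entry above are common C pitfalls and can be shown in isolation. The names below (`toy_prison`, `toy_states`, `toy_has_name()`, `toy_state_name()`) are hypothetical and are not taken from the jail code.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NAMELEN	16

struct toy_prison {
	int  state;
	char name[TOY_NAMELEN];	/* embedded array: its address is never NULL */
};

static const char *toy_states[] = { "INVALID", "ALIVE", "DYING" };
#define TOY_NSTATES	(sizeof(toy_states) / sizeof(toy_states[0]))

/*
 * "pr->name != NULL" is always true for an embedded array, so test
 * the first character to find out whether a name was configured.
 */
bool
toy_has_name(const struct toy_prison *pr)
{
	return (pr->name[0] != '\0');
}

/* Validate the index against the element count before using it. */
const char *
toy_state_name(int state)
{
	if (state < 0 || (size_t)state >= TOY_NSTATES)
		return ("UNKNOWN");
	return (toy_states[state]);
}
```

The pointer form of the check remains valid where the name really is a pointer, which is the distinction the commit draws for `*jailname` in the syscall.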
Marko Zec
385195c062 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instantiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
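The macro switch quoted in the entry above can be tried out as a standalone file. This is an illustrative sketch: `TOY_VIMAGE_GLOBALS`, `toy_vnet_net`, `toy_vnet_net_0` and `V_rt_count` are made-up names that only mirror the pattern of the `V_rt_tables` example.

```c
/* Hypothetical container, modelled on the commit's vnet_net_0 idea. */
struct toy_vnet_net {
	int _rt_count;
};

#ifdef TOY_VIMAGE_GLOBALS
int rt_count;				/* plain global variable */
#define	V_rt_count	rt_count
#else
struct toy_vnet_net toy_vnet_net_0;	/* single container instance */
#define	V_rt_count	toy_vnet_net_0._rt_count
#endif

/*
 * Code written against V_rt_count compiles unchanged in both
 * configurations; only the storage behind the macro differs.
 */
int
toy_rt_add(void)
{
	return (++V_rt_count);
}
```

Building once with `-DTOY_VIMAGE_GLOBALS` and once without exercises both branches with the same calling code.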
Bjoern A. Zeeb
629386598e Make sure nmbclusters are initialized before maxsockets
by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE
before the init_maxsockets() SYSINIT at SI_ORDER_ANY.

Reviewed by:		rwatson, zec
Sponsored by:		The FreeBSD Foundation
MFC after:		4 weeks
2008-12-10 22:17:09 +00:00
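Why the relative order matters can be seen in a small model of ordered initializers. Everything here is an assumption made for illustration: the `TOY_ORDER_*` values, the `toy_*` functions, and in particular the formula that derives one value from the other are invented and do not reproduce the kernel's SYSINIT machinery.

```c
#include <stdlib.h>

/* Hypothetical order values; only their relative order matters. */
enum { TOY_ORDER_MIDDLE = 100, TOY_ORDER_ANY = 1000 };

struct toy_init {
	int   order;
	void (*func)(void);
};

int toy_nmbclusters;	/* must be set before toy_maxsockets is derived */
int toy_maxsockets;

void
toy_mbinit(void)
{
	toy_nmbclusters = 1024;
}

/* Invented dependency: derive one tunable from the other. */
void
toy_init_maxsockets(void)
{
	toy_maxsockets = toy_nmbclusters / 2;
}

static int
toy_cmp(const void *a, const void *b)
{
	const struct toy_init *x = a, *y = b;

	return ((x->order > y->order) - (x->order < y->order));
}

/* Sort by order value, then run each initializer once. */
void
toy_run_inits(struct toy_init *v, size_t n)
{
	qsort(v, n, sizeof(*v), toy_cmp);
	for (size_t i = 0; i < n; i++)
		v[i].func();
}
```

If both initializers shared the same order value, the result would depend on how the sort happens to arrange them, which is the kind of ambiguity an explicit earlier order removes.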
Bjoern A. Zeeb
36b5ba0c49 Style changes only. Put the return type on an extra line[1] and
add an empty line at the beginning as we do not have any local
variables.

Submitted by:	rwatson [1]
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-10 22:10:37 +00:00
Ed Schouten
d16ebcd4fe Remove added newlines from logged messages written to /dev/console.
The /dev/console device node logs all strings that are written to it.
When the string does not contain a trailing newline, it appends one. I
can imagine this was useful a long time ago, but with our current
rc-scripts, it generates a whole bunch of messages that look like:

| Configuring syscons:
|  blanktime
| .

By not appending the newlines, the output of `dmesg -a' is now (almost?)
exactly the same as what the user will see on the console device
(syscons, uart).
2008-12-10 21:48:05 +00:00
John Baldwin
3858a1f4f5 - Add 32-bit compat system calls for VFS_AIO. The system calls live in the
  aio code and are registered via the recently added SYSCALL32_*() helpers.
- Since the aio code likes to invoke fuword and suword a lot down in the
  "bowels" of system calls, add a structure holding a set of operations for
  things like storing errors, copying in the aiocb structure, storing
  status, etc.  The 32-bit system calls use a separate operations vector to
  handle fuword32 vs fuword, etc.  Also, the oldsigevent handling is now
  done by having separate operation vectors with different aiocb copyin
  routines.
- Split out kern_foo() functions for the various AIO system calls so the
  32-bit front ends can manage things like copying in and converting
  timespec structures, etc.
- For both the native and 32-bit aio_suspend() and lio_listio() calls,
  just use copyin() to read the array of aiocb pointers instead of using
  a for loop that iterated over fuword/fuword32.  The error handling in
  the old case was incomplete (lio_listio() just ignored any aiocb's that
  it got an EFAULT trying to read rather than reporting an error), and
  possibly slower.

MFC after:	1 month
2008-12-10 20:56:19 +00:00
Kip Macy
e1d881ba31 add RW_SYSINIT_FLAGS macro and rw_sysinit_flags initialization function 2008-12-08 21:46:55 +00:00
Jung-uk Kim
9bd2cbe43f - Detect Bochs BIOS variants and use HZ_VM as well.
- Free kernel environment variable after its use.
- Fix style(9) nits.
2008-12-08 18:39:59 +00:00
Konstantin Belousov
118d0afa28 Do drop vm map lock earlier in the sysctl_kern_proc_vmmap(), to avoid
locking a vnode while having vm map locked.

Reported and tested by:	pho
MFC after:	1 week
2008-12-08 12:29:30 +00:00
Kip Macy
3120b9d428 - convert radix node head lock from mutex to rwlock
- make radix node head lock not recursive
 - fix LOR in rtexpunge
 - fix LOR in rtredirect

Reviewed by:	sam
2008-12-07 21:15:43 +00:00
Konstantin Belousov
aeb325719a Several threads in a process may do vfork() simultaneously. Then, all
parent threads sleep on the parent's struct proc until the corresponding
child releases the vmspace. Each sleep is interlocked with the proc mutex of
the child, which triggers an assertion in sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.

Silence the assertion by using a condition variable allocated in the
child. Broadcast the variable event on exec() and exit().

Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().

Reported and tested by:	ganbold
Suggested and reviewed by:	jhb
MFC after:	2 weeks
2008-12-05 20:50:24 +00:00
John Baldwin
75444a8590 When the SYSINIT() to load a module invokes the MOD_LOAD event successfully,
move that module to the head of the associated linker file's list of modules.
The end result is that once all the modules are loaded, they are sorted in
the reverse of their load order.  This causes the kernel linker to invoke
the MOD_QUIESCE and MOD_UNLOAD events in the reverse of the order that
MOD_LOAD was invoked.  This means that the ordering of MOD_LOAD events that
is set by the SI_* parameters to DECLARE_MODULE() is now honored for the
MOD_QUIESCE and MOD_UNLOAD events, in the same reversed order that
SYSUNINIT() would use.

MFC after:	1 month
2008-12-05 16:47:30 +00:00
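The list trick in the entry above is that placing each successfully loaded module at the head yields a list in reverse load order, so a plain head-to-tail walk visits modules in unload order. A simplified sketch with hypothetical `toy_*` types follows; it inserts at the head rather than moving an existing element, which is enough to show the resulting order.

```c
#include <stddef.h>

struct toy_module {
	const char        *name;
	struct toy_module *next;
};

struct toy_file {
	struct toy_module *head;
};

/* Called after a successful load: put the module at the list head. */
void
toy_loaded(struct toy_file *f, struct toy_module *m)
{
	m->next = f->head;
	f->head = m;
}

/*
 * Walk from the head and record the names in the order in which
 * quiesce/unload events would be delivered.  Returns the count.
 */
size_t
toy_unload_order(const struct toy_file *f, const char **out, size_t max)
{
	size_t n = 0;

	for (const struct toy_module *m = f->head; m != NULL && n < max;
	    m = m->next)
		out[n++] = m->name;
	return (n);
}
```

Loading a, b, c in that order produces the walk c, b, a.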
John Baldwin
b4824b48b4 - Invoke MOD_QUIESCE on all modules in a linker file (kld) before
  unloading any modules.  As a result, if any module vetoes an unload
  request via MOD_QUIESCE, the entire set of modules for that linker
  file will remain loaded and active now rather than leaving the kld
  in a weird state where some modules are loaded and some are unloaded.
- This also moves the logic for handling the "forced" unload flag out of
  kern_module.c and into kern_linker.c which is a bit cleaner.
- Add a module_name() routine that returns the name of a module and use that
  instead of printing pointer values in debug messages when a module fails
  MOD_QUIESCE or MOD_UNLOAD.

MFC after:	1 month
2008-12-05 13:40:25 +00:00
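The all-or-nothing behaviour from the entry above amounts to a two-phase loop: first ask every module, then act only if none objected. A minimal sketch with a hypothetical `toy_mod` structure, where a `busy` flag plays the role of a veto:

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_mod {
	bool busy;	/* a busy module vetoes the quiesce request */
	bool loaded;
};

/*
 * Returns 0 on success, or -1 if any module vetoed.  On a veto no
 * module has been touched, so the file stays fully loaded.
 */
int
toy_unload_file(struct toy_mod *mods, size_t n)
{
	/* Phase 1: ask every module before unloading any of them. */
	for (size_t i = 0; i < n; i++)
		if (mods[i].busy)
			return (-1);

	/* Phase 2: nobody objected, so unload them all. */
	for (size_t i = 0; i < n; i++)
		mods[i].loaded = false;
	return (0);
}
```

Interleaving the two phases in one loop would recreate the half-unloaded state the commit describes, since earlier modules would already be gone when a later one vetoes.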
Bjoern A. Zeeb
118258f5c2 Fix a credential reference leak. [1]
Close subtle but relatively unlikely race conditions when
propagating the vnode write error to other active sessions
tracing to the same vnode, without holding a reference on
the vnode anymore. [2]

PR:		kern/126368 [1]
Submitted by:	rwatson [2]
Reviewed by:	kib, rwatson
MFC after:	4 weeks
2008-12-03 15:54:35 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with circular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Konstantin Belousov
d6568724e1 Shared lookup makes it possible to create several negative cache
entries for one name. Then, creating an inode with that name would remove
one entry, leaving the others dormant. Reclaiming the vnode would uncover
the negative entries, causing a false return of ENOENT from calls like
stat that do not create an inode.

Prevent creation of the duplicated negative entries.

Reported and debugged with:	pho
Reviewed by:	jhb
X-MFC:	after shared lookup changes
2008-12-02 11:14:16 +00:00