Commit Graph

11230 Commits

Author SHA1 Message Date
rwatson
f775cac272 Add SDT DTrace probes for namei():
vfs:namei:lookup:entry takes parent directory vnode pointer, path to
    look up, and lookup flags.
  vfs:namei:lookup:return takes an error value, and if successful, the
    returned vnode pointer.

MFC after:      1 month
2009-04-06 10:32:40 +00:00
dchagin
01bf63c9fb Fix KBI breakage by r190520 which affects older linux.ko binaries:
1) Move the new field (brand_note) to the end of the Brandinfo structure.
2) Add a new flag BI_BRAND_NOTE that indicates that the brand_note pointer
   is valid.
3) Use the brand_note field if the flag BI_BRAND_NOTE is set and as old
   modules won't have the flag set, so the new field brand_note would be
   ignored.

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	kib (mentor)
MFC after:	6 days
2009-04-05 09:27:19 +00:00
kan
51a543ed33 Revert change 190655 temporarily. It breaks many setups where nullfs is
used and needs to be revisited.
2009-04-04 17:48:38 +00:00
marcel
c9498bd9af PowerPC, meet kernel core dumps. The support is based
on a generic dumper that creates an ELF core file and
uses PMAP functions to scan and iterate over memory
chunks, as well as handle memory mappings used during
dumping.
the PMAP layer can choose to return physical memory
chunks or virtual memory chunks. For minidumps, the
chunks should be virtual.

The default MMU I/F implementation for the scan_md()
method returns NULL. Thus, when a PMAP implementation
does not implement the required methods, an empty
core file is created. Here, empty means having an ELF
header only.

Obtained from:	Juniper Networks
2009-04-04 02:12:37 +00:00
thompsa
fe5458f665 Add a how argument to root_mount_hold() so it can be passed NOWAIT and be called
in situations where sleeping isnt allowed.
2009-04-03 19:46:12 +00:00
peter
77d3ac29e9 vn_vptocnp() unlocks the name cache and forgets to re-lock it before
returning in one error case, and mistakenly unlocks it for the
umount -f case.
2009-04-02 21:16:20 +00:00
brueffer
f970c13637 Fix memory leak in semunload().
PR:		133064
Submitted by:	Mateusz Guzik <mjguzik@gmail.com>
MFC after:	1 week
2009-03-30 15:01:29 +00:00
thompsa
84a2f3fa57 Further rate limit the root wait status, it will be printed once per
root_mount_rel() wakeup.
2009-03-30 05:57:55 +00:00
kan
ca739eae4a Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stoppped being type stable.

Reviewed by:	kib
2009-03-29 21:25:40 +00:00
jamie
5a5a677581 Whitespace/spelling fixes in advance of upcoming functional changes.
Approved by:	bz (mentor)
2009-03-27 13:13:59 +00:00
thompsa
8caa2c2d5c Skip the allocation of the root hold token if the mount already happened. 2009-03-27 03:52:08 +00:00
jhb
7f1bee5227 When looking up the parent devclass of a new devclass, create the parent
devclass if it doesn't already exist.
2009-03-25 17:02:05 +00:00
jhb
503529db8f When a file lookup fails due to encountering a doomed vnode from a forced
unmount, consistently return ENOENT rather than EBADF.

Reviewed by:	kib
MFC after:	1 month
2009-03-24 18:16:42 +00:00
jkim
effc077868 Clean up MI inittodr(9) and kill noop code.
It was derived from i386 version long ago but never resync'ed again.
Originally, i386 version compared the current time from realtime clock
with time_second (which was just `time' in the old days).  When this MI
version was written, it was wrongly compared against `base' AND never
used because of a bug (typo?) in the code.  This check was killed
in i386 version when home-rolled calendaric calculation was removed.
Now, we just remove the code here as well to make the code simpler.
2009-03-23 21:16:21 +00:00
jhb
ffcd13a80f Improve the description of a few sysctls.
Submitted by:	bde (partially)
MFC after:	3 days
2009-03-23 20:18:06 +00:00
kan
ab12866f38 Add safety check that does not allow empty strings to be queued
to the devctl notification queue. Empty strings cause devctl read
call to return 0 and result in devd exiting prematurely.

The actual offender (ugen notes for root hubs) will be fixed
by separate commit.
2009-03-23 01:13:34 +00:00
cperciva
b7238dced4 Correctly sanity-check timer IDs. [SA-09:06]
Limit the size of malloced buffer when dumping environment
variables. [EN-09:01]

Approved by:	so (cperciva)
Approved by:	re (kensmith)
Security:	FreeBSD-SA-09:06.ktimer
Errata:		FreeBSD-EN-09:01.kenv
2009-03-23 00:00:50 +00:00
kib
4c3e8a8b03 Fix several issues with parsing the notes for ELF objects.
Badly formed ELF note may cause the caclulated pointer to the next note
to point both after the note region, that was checked in the code, but
also to point before the region, that was not checked [1]. Remember the
first note location in note0 and leap out if the note is not between
note0 and note_end.

In the similar way, badly formed note may cause infinite loop by
pointing next note into the same or previous note. Guard against this by
limiting amount of loop iterations by arbitrary choosen big number.

For clarity, check the calculated note alignment in each iteration.

Reported by:	Chris Palmer <chris noncombatant org> [1]
PR:	kern/132886
Reviewed and tested by:	dchagin
MFC after:	3 days
2009-03-22 13:42:41 +00:00
kib
c8b4c76eed Do not underflow the buffer and then report the problem. Check for the
condition before the buffer write.
Also, since buflen is unsigned, previous check was ignored.

Reviewed by:	marcus
Tested by:	pho
2009-03-20 11:08:57 +00:00
kib
2e1bae397f Remove unneeded braces to reduce used vertical screen space.
The location was missed in r190140.
2009-03-20 11:03:55 +00:00
kib
bbdc41384b Do not forget to adjust buflen for the first resolution of the path
from namecache.
While there, compare pointers for equiality.

Reviewed by:	marcus
Tested by:	pho
2009-03-20 11:00:39 +00:00
kib
6c64d83946 The nc_nlen member of the struct namecache contains the length of the cached
name, not the length + 1.

PR:	132620, 132542
Reported by:	bf2006a yahoo com
Tested by:	bf2006a, pho
Reviewed by:	marcus
2009-03-20 10:59:06 +00:00
kib
0b55c1725c When ktracing namei operations, log a result of the __getcwd().
MFC after:	1 week
2009-03-20 10:47:16 +00:00
kib
de298453c7 Remove unneeded braces to reduce used vertical screen space. 2009-03-20 10:04:00 +00:00
attilio
9c2a3fc781 Fix an old-standing bug that crept in along the several revisions:
B_DELWRI cleanup and vnode disassociation should happen just before to
assign the buffer to a queue.

Reported by:	miwi, Volker <volker at vwsoft dot com>,
		Ben Kaduk <minimarmot at gmail dot com>,
		Christopher Mallon <christoph dot mallon at gmx dot de>
Tested by:	lulf, miwi
2009-03-17 16:30:49 +00:00
kib
e905171fbe Supply AT_EXECPATH auxinfo entry to the interpreter, both for native and
compat32 binaries.

Tested by:	pho
Reviewed by:	kan
2009-03-17 12:53:28 +00:00
kib
37de637d31 Use the properly sized types for ELF object header and program headers.
This fixes osrel fetching from the FreeBSD branding note for the 64bit
platforms.

Reported by:	swell.k gmail com
Reviewed by:	dchagin
Tested by:	dchagin, swell.k gmail com
2009-03-17 09:50:40 +00:00
jkim
3eda4741da Initial suspend/resume support for amd64.
This code is heavily inspired by Takanori Watanabe's experimental SMP patch
for i386 and large portion was shamelessly cut and pasted from Peter Wemm's
AP boot code.
2009-03-17 00:48:11 +00:00
kib
7ad4235008 Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to buffers shortage, requesting the buffer deadlock bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.

Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf().  The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget.  Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.

In collaboration with:	pho
Reviewed by:	 tegge (previous version)
Tested by:	 glebius, yandex ...
MFC after:	 3 weeks
2009-03-16 15:39:46 +00:00
rwatson
70b6a8119c Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free.  This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.

Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile.  They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:

        if_ar
        if_axe
        if_aue
        if_cdce
        if_cue
        if_kue
        if_ray
        if_rue
        if_rum
        if_sr
        if_udav
        if_ural
        if_zyd

Drivers that were already disabled because of tty changes:

        if_ppp
        if_sl

Discussed on:	arch@
2009-03-15 14:21:05 +00:00
jeff
9fedeedb8d - Wrap lock profiling state variables in #ifdef LOCK_PROFILING blocks. 2009-03-15 08:03:54 +00:00
jeff
ee1ec823f6 - Implement a new mechanism for resetting lock profiling. We now
guarantee that all cpus have acknowledged the cleared enable int by
   scheduling the resetting thread on each cpu in succession.  Since all
   lock profiling happens within a critical section this guarantees that
   all cpus have left lock profiling before we clear the datastructures.
 - Assert that the per-thread queue of locks lock profiling is aware of
   is clear on thread exit.  There were several cases where this was not
   true that slows lock profiling and leaks information.
 - Remove all objects from all lists before clearing any per-cpu
   information in reset.  Lock profiling objects can migrate between
   per-cpu caches and previously these migrated objects could be zero'd
   before they'd been removed

Discussed with:	attilio
Sponsored by:	Nokia
2009-03-15 06:41:47 +00:00
jeff
7cd47702cc - When a mutex is destroyed while locked we need to inform lock profiling
that it has been released.
2009-03-14 11:43:38 +00:00
jeff
4e3c0111d8 - Call lock_profile_release when we're transitioning a lock to be owned by
LK_KERNPROC.

Discussed with:		attilio
2009-03-14 11:43:02 +00:00
jeff
96eaa9ff52 - Fix an error that occurs when mp_ncpu is an odd number. steal_thresh
is calculated as 0 which causes errors elsewhere.

Submitted by:	KOIE Hidetaka <koie@suri.co.jp>

 - When sched_affinity() is called with a thread that is not curthread we
   need to handle the ON_RUNQ() case by adding the thread to the correct
   run queue.

Submitted by:	Justin Teller <justin.teller@gmail.com>

MFC after:	1 Week
2009-03-14 11:41:36 +00:00
dchagin
2408b715a0 Implement new way of branding ELF binaries by looking to a
".note.ABI-tag" section.

The search order of a brand is changed, now first of all the
".note.ABI-tag" is looked through.

Move code which fetch osreldate for ELF binary to check_note() handler.

PR:		118473
Approved by:	kib (mentor)
2009-03-13 16:40:51 +00:00
davidxu
c5b41352df 1) Check NULL pointer before calling umtx_pi_adjust_locked(), this avoids
a PANIC.
2) Rework locking for POSIX priority-mutex, this fixes a
   race where a thread may wait there forever even if the mutex is unlocked.
2009-03-13 06:06:20 +00:00
jhb
db47507f01 Change the sysctls for maxbcache and maxswzone from int to long. I missed
this earlier since these sysctls don't exist in 7.x yet.
2009-03-12 17:23:02 +00:00
jhb
192cd27cf3 Export the current values of nbuf, ncallout, and nswbuf via read-only
sysctls that match the tunable names.

MFC after:	3 days
2009-03-12 17:21:58 +00:00
bms
33b0309eb2 Ensure that the semaphore value is re-checked after sem_lock
is re-acquired, after the condition variable is signalled.

PR:             http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/127545
MFC after:      5 days
Reviewed by:    attilio
2009-03-12 10:36:39 +00:00
bms
6049578f97 Make semaphore debugging output more useful.
PR:             http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/127545
MFC after:      5 days
Submitted by:   Philip Semanchuk
2009-03-12 10:34:16 +00:00
rwatson
843cb67ec0 When writing out updated pollfd records when returning from
poll(), only copy out the revents field, not the whole pollfd
structure.  Otherwise, if the events field is updated
concurrently by another thread, that update may be lost.

This issue apparently causes problems for the JDK on FreeBSD,
which expects the Linux behavior of not updating all fields
(somewhat oddly, Solaris does not implement the required
behavior, but presumably our adaptation of the JDK is based
on the Linux port?).

MFC after:      2 weeks
PR:		kern/130924
Submitted by:   Kurt Miller <kurt @ intricatesoftware.com>
Discussed with:	kib
2009-03-11 22:00:03 +00:00
jhb
2548f05c6c Add a new type of KTRACE record for sysctl(3) invocations. It uses the
internal sysctl_sysctl_name() handler to map the MIB array to a string
name and logs this name in the trace log.  This can be useful to see
exactly which sysctls a thread is invoking.

MFC after:	1 month
2009-03-11 21:48:36 +00:00
jhb
62e368b0b7 Gah, fix the code to match the comment. For non-open lookups use a
shared vnode lock for the leaf vnode if LOCKSHARED is set.

Submitted by:	rdivacky
2009-03-11 14:39:55 +00:00
jhb
520acdaf69 Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
  shared vnode lock for the leaf vnode only if the mount point has the
  extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
  not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
  O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
  vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
  FIFO's require exclusive vnode locks for their open() and close()
  routines.  (My recent MPSAFE patches for UDF and cd9660 already included
  this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by:	ups
Reviewed by:	pjd (ZFS bits)
MFC after:	1 month
2009-03-11 14:13:47 +00:00
imp
61d6068fc6 Minor nits notice by jhb@ 2009-03-11 08:19:31 +00:00
jhb
50289fd1c1 - Make maxpipekva a signed long rather than an unsigned long as overflow
is more likely to be noticed with signed types.
- Make amountpipekva a long as well to match maxpipekva.

Discussed with:	bde
2009-03-10 21:28:43 +00:00
jhb
c8dd604fc2 In the ABI shim for vfs.bufspace, rather than truncating values larger than
INT_MAX to INT_MAX, just go ahead and write out the full long to give an
error of ENOMEM to the user process.

Requested by:	bde
2009-03-10 21:27:15 +00:00
jhb
0849234614 - Remove a recently added comment from kernel_sysctlbyname() that isn't
needed.
- Move the release of the sysctl sx lock after the vsunlock() in
  userland_sysctl() to restore the original memlock behavior of
  minimizing the amount of memory wired to handle sysctl requests.

MFC after:	1 week
2009-03-10 17:00:28 +00:00
jhb
0372a61a28 Add an ABI compat shim for the vfs.bufspace sysctl for sysctl requests that
try to fetch it as an int rather than a long.  If the current value is
greater than INT_MAX it reports a value of INT_MAX.
2009-03-10 15:26:50 +00:00
jhb
80d9458a56 Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints.  Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva.  Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers.  Now one has to overflow a long to see
such problems.  There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf.  I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after:	1 month
2009-03-09 19:35:20 +00:00
jhb
b27007efcb Move the debug.hashstat sysctl tree under DIAGNOSTIC. I measured the
debug.hashstat.rawnchash sysctl in particular as taking 7 milliseconds on
a 3GHz Intel Xeon (4x2) running 7.1.  It accounted for almost a quarter of
the total runtime of 'sysctl -a'.  It also performs lots of copyout's while
holding the namecache lock (this does not attempt to fix that).

MFC after:	2 weeks
2009-03-09 19:04:53 +00:00
imp
ee8ea37871 Fix a long-standing bug in newbus. It was introduced when subclassing
was introduced.  If you have a bus, say cardbus, that is derived from
a base-bus (say PCI), then ordinarily all PCI drivers would attach to
cardbus devices.  However, there had been one exception: kldload
wouldn't work.

The problem is in devclass_add_driver.  In this routine, all we did
was call to the pci device's BUS_DRIVER_ADDED routine.  However, since
cardbus bus instances had a different devclass, none of them were
called.

The solution is to call all subclass devclasses, recursively down the
tree, of the class that was loaded.  Since we don't have a 'children
class' pointer, we search the whole list of devclasses for a class
whose parent matches.  Since just done a kldload time, this isn't as
bad as it sounds.  In addition, we short-circuit the whole process by
marking those classes with subclasses with a flag.  We'll likely have
to reevaluate this method the number of devclasses with subclasses
gets large.

This means we can remove the "cardbus" lines from all the PCI drivers
since we have no cardbus specific attach device attachments in the
tree.

# Also: minor tweak to an error message
2009-03-09 13:20:23 +00:00
rwatson
417d444e25 By default, don't compile in counters of calls to various time
query functions in the kernel, as these effectively serialize
parallel calls to the gettimeofday(2) system call, as well as
other kernel services that use timestamps.

Use the NetBSD version of the fix (kern_tc.c:1.32 by ad@) as
they have picked up our timecounter code and also ran into the
same problem.

Reported by:	kris
Obtained from:	NetBSD
MFC after:	3 days
2009-03-08 22:19:28 +00:00
rwatson
4aae6c65e1 Decompose the global UNIX domain sockets rwlock into two different
locks: a global list/counter/generation counter protected by a new
mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock,
which protects the connections between UNIX domain sockets.

This eliminates conditional lock acquisition that was previously a
property of the global lock being held over sonewconn() leading to a
call to uipc_attach(), which also required the global lock, but
couldn't rely on it as other paths existed to uipc_attach() that
didn't hold it: now uipc_attach() uses only the list lock, which
follows the linkage lock in the lock order.  It may also reduce
contention on the global lock for some workloads.

Add global UNIX domain socket locks to hard-coded witness lock
order.

MFC after:	1 week
Discussed with:	kris
2009-03-08 21:48:29 +00:00
marcus
399996bc30 Add a default implementation for VOP_VPTOCNP(9) which scans the parent
directory of a vnode to find a dirent with a matching file number.  The
name from that dirent is then used to provide the component name.

Note: if the initial vnode argument is not a directory itself, then
the default VOP_VPTOCNP(9) implementation still returns ENOENT.

Reviewed by:	kib
Approved by:	kib
Tested by:	pho
2009-03-08 19:05:53 +00:00
rwatson
ae492d67e0 Remove 'uio' argument from MAC Framework and MAC policy entry points for
extended attribute get/set; in the case of get an uninitialized user
buffer was passed before the EA was retrieved, making it of relatively
little use; the latter was simply unused by any policies.

Obtained from:	TrustedBSD Project
Sponsored by:	Google, Inc.
2009-03-08 12:32:06 +00:00
rwatson
5e4df03210 Improve the consistency of MAC Framework and MAC policy entry point
naming by renaming certain "proc" entry points to "cred" entry points,
reflecting their manipulation of credentials.  For some entry points,
the process was passed into the framework but not into policies; in
these cases, stop passing in the process since we don't need it.

  mac_proc_check_setaudit -> mac_cred_check_setaudit
  mac_proc_check_setaudit_addr -> mac_cred_check_setaudit_addr
  mac_proc_check_setauid -> mac_cred_check_setauid
  mac_proc_check_setegid -> mac_cred_check_setegid
  mac_proc_check_seteuid -> mac_cred_check_seteuid
  mac_proc_check_setgid -> mac_cred_check_setgid
  mac_proc_check_setgroups -> mac_cred_ceck_setgroups
  mac_proc_check_setregid -> mac_cred_check_setregid
  mac_proc_check_setresgid -> mac_cred_check_setresgid
  mac_proc_check_setresuid -> mac_cred_check_setresuid
  mac_proc_check_setreuid -> mac_cred_check_setreuid
  mac_proc_check_setuid -> mac_cred_check_setuid

Obtained from:	TrustedBSD Project
Sponsored by:	Google, Inc.
2009-03-08 10:58:37 +00:00
kib
84bab6f2a5 Extract the no_poll() and vop_nopoll() code into the common routine
poll_no_poll().
Return a poll_no_poll() result from devfs_poll_f() when
filedescriptor does not reference the live cdev, instead of ENXIO.

Noted and tested by:	hps
MFC after:	1 week
2009-03-06 15:35:37 +00:00
kib
ed62a96e43 Systematically use vm_size_t to specify the size of the segment for VM KPI.
Do not overload the local variable size in kern_shmat() due to vm_size_t
change.
Fix style bug by adding explicit comparision with 0.

Discussed with:	bde
MFC after:	1 week
2009-03-05 11:45:42 +00:00
dchagin
ab0c175b17 as suggested by jhb@, panic in case the ncpus == 0.
it helps to catch bugs in the callers.

Approved by:	kib (mentor)
MFC after:	5 days
2009-03-03 17:34:09 +00:00
rwatson
0e49e1ce16 Reduce the verbosity of SDT trace points for DTrace by defining several
wrapper macros that allow trace points and arguments to be declared
using a single macro rather than several.  This means a lot less
repetition and vertical space for each trace point.

Use these macros when defining privilege and MAC Framework trace points.

Reviewed by:	jb
MFC after:	1 week
2009-03-03 17:15:05 +00:00
jamie
63f98fcc6a Extend the "vfsopt" mount options for more general use. Make struct
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.

Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.

While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.

Approved by:	bz (mentor)
2009-03-02 23:26:30 +00:00
kan
e17295cf6f Change vfs_busy to wait until an outcome of pending unmount
operation is known and to retry or fail accordingly to that
outcome. This fixes the problem with namespace traversing
programs failing with random ENOENT errors if someone just
happened to try to unmount that same filesystem at the same
time.

Reported by:	dhw
Reviewed by:	kib, attilio
Sponsored by:	Juniper Networks, Inc.
2009-03-02 20:51:39 +00:00
kib
453adb14fb Correct types of variables used to track amount of allocated SysV shared
memory from int to size_t. Implement a workaround for current ABI not
allowing to properly save size for and report more then 2Gb sized segment
of shared memory.

This makes it possible to use > 2 Gb shared memory segments on 64bit
architectures. Please note the new BUGS section in shmctl(2) and
UPDATING note for limitations of this temporal solution.

Reviewed by:	csjp
Tested by:	Nikolay Dzham <i levsha org ua>
MFC after:	2 weeks
2009-03-02 18:53:30 +00:00
kib
c672211541 Use the p_sysent->sv_flags flag SV_ILP32 to detect 32bit process
executing on 64bit kernel. This eliminates the direct comparisions
of p_sysent with &ia32_freebsd_sysvec, that were left intact after
r185169.
2009-03-02 18:43:50 +00:00
dchagin
718161215b Fix range-check error introduced in r182292. Also do not do anything
if all processors in the map are not available, simply return.

Approved by:	kib (mentor)
MFC after:	1 week
2009-03-01 14:26:24 +00:00
ed
45be9ed433 Improve my previous changes to the TTY code: also remove memcpy().
It's better to just use internal language constructs, because it is
likely the compiler has a better opinion on whether to perform inlining,
which is very likely to happen to struct winsize.

Submitted by:	Christoph Mallon <christoph mallon gmx de>
2009-03-01 09:50:13 +00:00
thompsa
4eafa084fe Move the NORELEASE check to after the recurse count decrement and bailout, this
is not counted as actually releasing the lock.
2009-02-28 19:10:43 +00:00
ed
51d425ac0e Replace bcopy() calls inside the TTY layer with memcpy()/strlcpy().
In all these cases the buffers never overlap. Program names are also
likely to be shorter, so use a regular strlcpy() to copy p_comm.
2009-02-28 14:20:26 +00:00
bz
df2be82cec For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
2009-02-27 14:12:05 +00:00
ed
f02ef8e872 Remove redundant code in printf() and vprintf().
printf() and vprintf() are exactly the same, except the way arguments
are passed. Just like we see in other pieces of code (i.e. libc's
printf()), implement printf() using vprintf().

Submitted by:	Christoph Mallon <christoph mallon gmx de>
2009-02-27 13:28:54 +00:00
ed
2078a09b34 Revert previous commit to subr_prf.c and make it more tidy.
As mentioned by bz and bde, the change I made wasn't the proper way to
fix. Inspired by bde's patch, perform some small cleanups to uprintf().

Reviewed by:	bz
2009-02-27 12:50:25 +00:00
ed
9d9e2a90b2 Remove unneeded pointer `ndp'.
Inside do_execve(), we have a pointer `ndp', which always points to
`&nd'. I can imagine a primitive (non-optimizing) compiler to really
reserve space for such a pointer, so just remove the variable and use
`&nd' directly.
2009-02-26 16:32:48 +00:00
ed
b3ddcfe1f7 Remove even more unneeded variable assignments.
kern_time.c:
- Unused variable `p'.

kern_thr.c:
- Variable `error' is always caught immediately, so no reason to
  initialize it. There is no way that error != 0 at the end of
  create_thread().

kern_sig.c:
- Unused variable `code'.

kern_synch.c:
- `rval' is always assigned in all different cases.

kern_rwlock.c:
- `v' is always overwritten with RW_UNLOCKED further on.

kern_malloc.c:
- `size' is always initialized with the proper value before being used.

kern_exit.c:
- `error' is always caught and returned immediately. abort2() never
  returns a non-zero value.

kern_exec.c:
- `len' is always assigned inside the if-statement right below it.

tty_info.c:
- `td' is always overwritten by FOREACH_THREAD_IN_PROC().

Found by:	LLVM's scan-build
2009-02-26 15:51:54 +00:00
ed
f854be19d0 Remove unneeded variable `ocn_mute'.
Found by:	LLVM's scan-build
2009-02-26 13:01:45 +00:00
ed
4f11d3e937 Remove unused variables p' and unneeded assignments of rval'.
Found by:	LLVM's scan-build
2009-02-26 13:00:13 +00:00
ed
280ef3dd23 Remove redundant assignment of `p'.
`p' is already initialized with `td->td_proc'. Because td is always
curthread, it is safe to initialize it without any locks.

Found by:	LLVM's scan-build
2009-02-26 12:12:34 +00:00
rwatson
bf80a0a378 Add static tracing for privilege checking:
priv:kernel:priv_check:priv_ok fires for granted privileges
  priv:kernel:priv_check:priv_errr fires for denied privileges

The first argument is the requested privilege number.  The naming
convention is a little different from the OpenSolaris equivilent
because we can't have '-' in probefunc names, and our privilege
namespace is different.

MFC after:	1 week
2009-02-26 10:56:13 +00:00
ed
226255f0b6 Silence compiler warning inside our ^T handler.
It turns out we're casting fixpt_t* to int*.

Spotted by:	clang
2009-02-26 10:38:19 +00:00
ed
516ad9be6d Use unsigned longs for the TTY's sysctl stats.
Spotted by:	clang
2009-02-26 10:28:32 +00:00
ed
1c7d1f084a Don't use PTY name as format string, even though it isn't insecure here.
It's guaranteed that the `name' variable always contains a string of the
form pty[l‐sL‐S][0‐9a‐v], but I'd rather keep the compiler happy (LLVM).
2009-02-26 10:14:10 +00:00
jamie
1631f0aa0a Add support for methods to the OSD subsystem. Each object type has a
predefined set of methods, which are set in osd_register() and called
via osd_call().  Currently, no methods are defined, though prison
objects will have some in the future.

Expand the locking from a single per-type mutex to three different kinds
of locks (four if you include the requirement that the container
(e.g. prison) be locked when getting/setting data).  This clears up one
existing issue, as well as others added by the method support.

Approved by:	bz (mentor)
2009-02-21 11:15:38 +00:00
ed
72727e8d9f Don't make Linux stat() open character devices to resolve its name.
The existing code calls kern_open() to resolve the vnode of a pathname
right after a stat(). This is not correct, because it causes random
character devices to be opened in /dev. This means ls'ing a tape
streamer will cause it to rewind, for example. Changes I have made:

- Add kern_statat_vnhook() to allow binary emulators to `post-process'
  struct stat, using the proper vnode.

- Remove unneeded printf's from stat() and statfs().

- Make the Linuxolator use kern_statat_vnhook(), replacing
  translate_path_major_minor_at().

- Let translate_fd_major_minor() use vp->v_rdev instead of
  vp->v_un.vu_cdev.

Result:

	crw-rw-rw- 1 root root   0, 14 Feb 20 13:54 /dev/ptmx
	crw--w---- 1 root adm  136,  0 Feb 20 14:03 /dev/pts/0
	crw--w---- 1 root adm  136,  1 Feb 20 14:02 /dev/pts/1
	crw--w---- 1 ed   tty  136,  2 Feb 20 14:03 /dev/pts/2

Before this commit, ptmx also had a major number of 136, because it
silently allocated and deallocated a pseudo-terminal. Device nodes that
cannot be opened now have proper major/minor-numbers.

Reviewed by:	kib, netchild, rdivacky (thanks!)
2009-02-20 13:05:29 +00:00
jhb
5dc7ef7e69 Enable caching of negative pathname lookups in the NFS client. To avoid
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked.  If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC.  This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.

Submitted by:	mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by:	mohans
2009-02-19 22:28:48 +00:00
ed
092e753861 Squash some small bugs in pts(4).
- Don't return a negative errno when using an unknown ioctl() on a
  pseudo-terminal master device. Be sure to convert ENOIOCTL to ENOTTY,
  just like the TTY layer does.

- Even though we should return st_rdev of the master device node when
  emulating pty(4) devices, FIODGNAME should still return the name of
  the slave device. Otherwise ptsname(3) and ttyname(3) return an
  invalid device name.
2009-02-19 17:54:42 +00:00
attilio
bf75b4612a - Add a function (fill_kinfo_aggregate()) which aggregates relevant
members for a kinfo entry on a process-wide system.
- Use the newly introduced function in order to fix cases like
  KERN_PROC_PROC where aggregating stats are broken because they just
  consider the first thread in the pool for each process.
  (Note, additively, that KERN_PROC_PROC is rather inaccurate on
  thread-wide informations like the 'state' of the process.  Such
  informations should maybe be invalidated and being forceably discarded
  by the consumers?).
- Simplify the logic of sysctl_out_proc() and adjust the
  fill_kinfo_thread() accordingly.
- Remove checks on the FIRST_THREAD_IN_PROC() being NULL but add
  assertives.

This patch should fix aggregate statistics for KERN_PROC_PROC.
This is one of the reasons why top doesn't use this option and now it
can be use it safely.
ps, when launched in order to display just processes, now should report
correct cpu utilization percentages and times (as opposed by the old
code).

Reviewed by:	jhb, emaste
Sponsored by:	Sandvine Incorporated
2009-02-18 21:52:13 +00:00
marcus
60038f21cf Remove the printf's when the vnode to be exported for procstat is not a VDIR.
If the file system backing a process' cwd is removed, and procstat -f PID
is called, then these messages would have been printed.  The extra verbosity is
not required in this situation.

Requested by:	kib
Approved by:	kib
2009-02-14 21:55:09 +00:00
marcus
130b8c14ad Change two KASSERTS to printfs and simple returns. Stress testing has
revealed that a process' current working directory can be VBAD if the
directory is removed.  This can trigger a panic when procstat -f PID is
run.

Tested by:	pho
Discovered by:	phobot
Reviewed by:	kib
Approved by:	kib
2009-02-14 21:12:24 +00:00
thompsa
b04fb61e93 Remove semicolon left in the last commit
Spotted by:	csjp
2009-02-13 18:51:39 +00:00
jhb
26e338d6fc Use shared vnode locks when invoking VOP_READDIR().
MFC after:	1 month
2009-02-13 18:18:14 +00:00
luigi
8faaf7bcde Clarify and reimplement the bioq API so that bioq_disksort() has
the correct behaviour (sorting by distance from the current head position
in the scan direction) and bioq_insert_head() and bioq_insert_tail()
have a well defined (and useful) behaviour, especially when intermixed
with calls to bioq_disksort().

In particular:
- fix a bug in the existing bioq_disksort() that did not use the
  current head position correctly;
- redefine semantics of bioq_insert_head() and bioq_insert_tail().
  bioq_insert_tail() can now be used as a barrier
  between previous and subsequent calls to bioq_disksort().

The code is heavily documented in the source code so please refer
to that for the details.

Much of this code comes from Fabio Checconi. Also thanks to Kirk
for feedback on the (re)definition of bioq_insert_tail().

NOTE: in the current tree there is only a handful of files which
intermix calls to bioq_disksort() with bioq_insert_head() and
bioq_insert_tail(). The ordering of the queue in these situation
was not specified (nor easy to figure out) before, so I doubt any
of that code could be affected by the specification of the API.

Also note that the current implementation is significantly simpler
than the previous one (also used in ata_sort_queue()).
It would be useful to reimplement ata_sort_queue() using
the same code used in bioq_disksort().

MFC after:	1 week
2009-02-13 11:36:32 +00:00
thompsa
618fc50d6b Check the exit flag at the start of the taskqueue loop rather than the end. It
is possible to tear down the taskqueue before the thread has run and the
taskqueue loop would sleep forever.

Reviewed by:	sam
MFC after:	1 week
2009-02-13 01:16:51 +00:00
ed
5f2edd80fc Serialize write() calls on TTYs.
Just like the old TTY layer, the current MPSAFE TTY layer does not make
any attempt to serialize calls of write(). Data is copied into the
kernel in 256 (TTY_STACKBUF) byte chunks. If a write() call occurs at
the same time, the data may interleave. This is especially likely when
the TTY starts blocking, because the output queue reaches the high
watermark.

I've implemented this by adding a new flag, TTY_BUSY_OUT, which is used
to mark a TTY as having a thread stuck in write(). Because I don't want
non-blocking processes to be possibly blocked by a sleeping thread, I'm
still allowing it to bypass the protection. According to this message,
the Linux kernel returns EAGAIN in such cases, but I think that's a
little too restrictive:

	http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2007/5/2/85418/thread

PR:		kern/118287
2009-02-11 16:28:49 +00:00
rwatson
ced47d0a8e Modify fdcopy() so that, during fork(2), it won't copy file descriptors
from the parent to the child process if they have an operation vector
of &badfileops.  This narrows a set of races involving system calls that
allocate a new file descriptor, potentially block for some extended
period, and then return the file descriptor, when invoked by a threaded
program that concurrently invokes fork(2).  Similar approches are used
in both Solaris and Linux, and the wideness of this race was introduced
in FreeBSD when we moved to a more optimistic implementation of
accept(2) in order to simplify locking.

A small race necessarily remains because the fork(2) might occur after
the finit() in accept(2) but before the system call has returned, but
that appears unavoidable using current APIs.  However, this race is
vastly narrower.

The fix can be validated using the newfileops_on_fork regression test.

PR:		kern/130348
Reported by:	Ivan Shcheklein <shcheklein at gmail dot com>
Reviewed by:	jhb, kib
MFC after:	1 week
2009-02-11 15:22:01 +00:00
imp
dac94dc031 o Use NULL in pereference to 0 in pointer contexts.
o Use newly minted KOBJMETHOD_END as appropriate
o fix prototype for root_setup_intr.
2009-02-11 04:54:02 +00:00
mav
26e8dd306f Check for device_set_devclass() errors and skip driver probe/attach if any.
Attach call without devclass set crashes the system.

On resume AHCI driver sometimes tries to create duplicate adX device.
It is surely his own problem, but IMHO it is not a reason to crash here.
Other reasons are also possible.
2009-02-10 23:22:29 +00:00
attilio
5c1db5bb2b Scanning all the formats for binary translation of modules loading can
result in errors for a format loading but subsequent correct recognizing
for another format.

File format loading functions should avoid printing any additional
informations but just returning appropriate (and different between each
other) error condition, characterizing different informations.
Additively, the linker should handle appropriately different format
loading errors.

While a general mechanism is desired, fix a simple and common case on
amd64: file type is not recognized for link elf and confuses the linker.
Printout an error if all the registered linker classes can't recognize
and load the module.

Reviewed by:	jhb
Sponsored by:	Sandvine Incorporated
2009-02-10 15:50:19 +00:00
rwatson
c6f2b81096 Remove extra 'comma = 0' in socket state printing code, which otherwise
could lead to an extra comma in output.

Submitted by:   Christoph Mallon <christoph dot mallon at gmx dot de>
2009-02-09 18:19:58 +00:00
mbr
b47ed35e3f s/SS_FDREF/SS_NOFDREF/ 2009-02-09 13:29:01 +00:00
ed
ce1349c810 Remove a stale comment from the clists code.
We don't support quote bits.
2009-02-09 11:27:56 +00:00
jhb
f856c6d618 Tweak the output of VOP_PRINT/vn_printf() some.
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
  in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
  from vn_printf().
2009-02-06 20:06:48 +00:00
trasz
d102122bd0 Add KASSERTs to make it easier to debug problems like the one fixed
in r188141.

Reviewed by:	kib,attilio
Approved by:	rwatson (mentor)
Tested by:	pho
Sponsored by:	FreeBSD Foundation
2009-02-06 18:16:01 +00:00
jhb
5dff890984 Expand the scope of the sysctllock sx lock to protect the sysctl tree itself.
Back in 1.1 of kern_sysctl.c the sysctl() routine wired the "old" userland
buffer for most sysctls (everything except kern.vnode.*).  I think to prevent
issues with wiring too much memory it used a 'memlock' to serialize all
sysctl(2) invocations, meaning that only one user buffer could be wired at
a time.  In 5.0 the 'memlock' was converted to an sx lock and renamed to
'sysctl lock'.  However, it still only served the purpose of serializing
sysctls to avoid wiring too much memory and didn't actually protect the
sysctl tree as its name suggested.  These changes expand the lock to actually
protect the tree.

Later on in 5.0, sysctl was changed to not wire buffers for requests by
default (sysctl_handle_opaque() will still wire buffers larger than a single
page, however).  As a result, user buffers are no longer wired as often.
However, many sysctl handlers still wire user buffers, so it is still
desirable to serialize userland sysctl requests.  Kernel sysctl requests
are allowed to run in parallel, however.

- Expose sysctl_lock()/sysctl_unlock() routines to exclusively lock the
  sysctl tree for a few places outside of kern_sysctl.c that manipulate
  the sysctl tree directly including the kernel linker and vfs_register().
- sysctl_register() and sysctl_unregister() require the caller to lock
  the sysctl lock using sysctl_lock() and sysctl_unlock().  The rest of
  the public sysctl API manage the locking internally.
- Add a locked variant of sysctl_remove_oid() for internal use so that
  external uses of the API do not need to be aware of locking requirements.
- The kernel linker no longer needs Giant when manipulating the sysctl
  tree.
- Add a missing break to the loop in vfs_register() so that we stop looking
  at the sysctl MIB once we have changed it.

MFC after:	1 month
2009-02-06 14:51:32 +00:00
jhb
fa654c58a4 Drop the kernel linker lock while running SYSUNINIT routines and removing
sysctls during a linker file unload.  We drop the lock when doing similar
operations during a linker file load.  To close races, clear the LINKED
flag before dropping the lock so that the linker file is no longer visible
to userland.

MFC after:	1 week
2009-02-05 23:01:36 +00:00
attilio
beddfe59b0 Add more KTR_VFS logging point in order to have a more effective tracing.
Reviewed by:	brueffer, kib
Tested by:	Gianni Trematerra <giovanni D trematerra A gmail D com>
2009-02-05 15:03:35 +00:00
ed
e085cfc485 Don't leave the console TTY constantly open.
When we leave the console TTY constantly open, we never reset the
termios attributes. This causes output processing, echoing, etc. not to
be reset to the proper values when going into single user mode after the
system has booted. It also causes nl-to-crnl-conversion not to take
place during shutdown, which causes a `staircase effect'.

This patch adds a new TTY flag, TF_OPENED_CONS, which is set when the
TTY is opened through /dev/console. Because the flags are only used by
the kernel and the pstat(8) utility, I've decided to renumber the TTY
flags. This shouldn't be an issue, because the TTY layer is not yet part
of a stable release.

Reported by:	Mark Atkinson <atkin901 yahoo com>
Tested by:	sepotvin
2009-02-05 14:21:09 +00:00
jamie
8f639d4b9a Don't allow creating a socket with a protocol family that the current
jail doesn't support.  This involves a new function prison_check_af,
like prison_check_ip[46] but that checks only the family.

With this change, most of the errors generated by jailed sockets
shouldn't ever occur, at least until jails are changeable.

Approved by:	bz (mentor)
2009-02-05 14:15:18 +00:00
jamie
12bbe1869f Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise.  The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family.  For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes).  Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by:	bz (mentor)
2009-02-05 14:06:09 +00:00
trasz
a4e8c3ba99 In some situations, mnt_lockref could go negative due to vfs_unbusy() being
called without calling vfs_busy() first.  This made umount(8) hang waiting
for mnt_lockref to become zero, which would never happen.

Reviewed by:	kib
Approved by:	rwatson (mentor)
Reported by:	pho
Found with:	stress2
Sponsored by:	FreeBSD Foundation
2009-02-05 08:46:18 +00:00
rwatson
d686641019 Remove written-to but never read local variable 'offset' from
soreceive_dgram().

Submitted by:	Christoph Mallon <christoph dot mallon at gmx dot de>
MFC after:	1 week
2009-02-04 20:00:17 +00:00
ed
35bb1e1a73 Remove slush space from clists.
Right now we only have a very small amount of drivers that use clists,
but we still allocate 50 cblocks as slush space, which allows drivers to
temporarily overcommit their storage. Most of the drivers don't allow
this anyway.

I've performed the following changes:

- We don't allocate any cblocks on startup.

- I've removed the DDB command, because it has nothing useful to print
  now. You can obtain the amount of allocated blocks by running `vmstat
  -m | grep clist'.

- I've removed cfreecount, which is now unused.

- The old code first tries to allocate using M_NOWAIT, followed by
  M_WAITOK. This doesn't make any sense, so just remove this logic. It
  seems the drivers allow us to sleep anyway.

We can even remove ccmax from clist_alloc_cblocks and c_cbmax from
struct clist, but this breaks binary compatibility.

This reduces the amount of allocated cblocks on my system from 54 to 4.
2009-02-04 17:10:01 +00:00
ed
85ebf97341 Slightly improve the design of the TTY buffer.
The TTY buffers used the standard <sys/queue.h> lists. Unfortunately
they have a big shortcoming. If you want to have a double linked list,
but no tail pointer, it's still not possible to obtain the previous
element in the list. Inside the buffers we don't need them. This is why
I switched to custom linked list macros. The macros will also keep track
of the amount of items in the list. Because it doesn't use a sentinel,
we can just initialize the queues with zero.

In its simplest form (the output queue), we will only keep two
references to blocks in the queue, namely the head of the list and the
last block in use. All free blocks are stored behind the last block in
use.

I noticed there was a very subtle bug in the previous code: in a very
uncommon corner case, it would uma_zfree() a block in the queue before
calling memcpy() to extract the data from the block.
2009-02-03 19:58:28 +00:00
imp
b5abf9646f Use NULL in preference to 0 in pointer contexts. 2009-02-03 07:54:42 +00:00
imp
02f34b7ded Make bioq_disksort have a ANSI-C definition rather than a K&R definition. 2009-02-03 07:53:51 +00:00
imp
35e4f2b272 rman_debug should be static, so make it static. 2009-02-03 07:53:08 +00:00
imp
b5c4f1a094 Use ANSI function definition for profil. 2009-02-03 07:52:36 +00:00
imp
10cf8131b7 Prefer ANSI function definitions to K&R ones. 2009-02-03 07:52:07 +00:00
imp
5a073bfc7c Use NULL in preference to 0 for pointers. 2009-02-03 07:51:41 +00:00
imp
82f181ca79 Use NULL in preference to 0 for pointers. 2009-02-03 07:51:11 +00:00
imp
54f6c3e35c o Use unsigned for bit fields.
o Use NULL for pointers in preference to 0.
2009-02-03 07:50:41 +00:00
imp
b033fcf7e9 int foo(void) is the proper ANSI function definition when there's no
parameters.  Use it for resettodr().
2009-02-03 07:50:01 +00:00
imp
0821a484fd Declare bus_data_devices to be static: it isn't used elsewhere.
Use NULL in a couple of places rather than 0 in the context of
pointers to be consistent with the rest of the file.
2009-02-03 00:10:21 +00:00
sepotvin
9d78a7fce3 Fix select on platforms where sizeof(long) != sizeof(int). This used
to work by accident before the cleanup done in revision 187693.

Approved by:	kan (mentor)
2009-02-02 03:34:40 +00:00
rwatson
7147753438 If a process is a zombie and we couldn't identify another useful state,
print out the state as "zombine" in preference to "unknown" when ^T is
pressed.

MFC after:	3 days
Sponsored by:	Google, Inc.
2009-01-29 09:32:56 +00:00
ed
50efccc9f0 Mark most often used sysctl's as MPSAFE.
After running a `make buildkernel', I noticed most of the Giant locks in
sysctl are only caused by a very small amount of sysctl's:

- sysctl.name2oid. This one is locked by SYSCTL_LOCK, just like
  sysctl.oidfmt.

- kern.ident, kern.osrelease, kern.version, etc. These are just constant
  strings.

- kern.arandom, used by the stack protector. It is already protected by
  arc4_mtx.

I also saw the following sysctl's show up. Not as often as the ones
above, but still quite often:

- security.jail.jailed. Also mark security.jail.list as MPSAFE. They
  don't need locking or already use allprison_lock.

- kern.devname, used by devname(3), ttyname(3), etc.

This seems to reduce Giant locking inside sysctl by ~75% in my primitive
test setup.
2009-01-28 19:58:05 +00:00
jhb
04d889be20 Convert the global mutex protecting the directory lookup name cache from a
mutex to a reader/writer lock.  Lookup operations first grab a read lock and
perform the lookup.  If the operation results in a need to modify the cache,
then it tries to do an upgrade.  If that fails, it drops the read lock,
obtains a write lock, and redoes the lookup.
2009-01-28 19:05:18 +00:00
ed
9b5c4f4d39 Use the proper flag to let kern.ttys be executed without Giant.
Pointed out by:	jhb
2009-01-26 16:43:18 +00:00
jhb
b8c520ae3c Whitespace tweak. 2009-01-26 15:32:39 +00:00
jeff
687eba74e7 - bit has to be fd_mask to work properly on 64bit platforms. Constants
must also be cast even though the result ultimately is promoted
   to 64bit.
 - Correct a loop index upper bound in selscan().
2009-01-25 18:38:42 +00:00
rwatson
97295d8b75 When a statically linked binary is executed (or at least, one without
an interpreter definition in its program header), set the auxiliary
ELF argument AT_BASE to 0 rather than to the address that we would
have mapped the interpreter at if there had been one.

The ELF ABI specifications appear to be ambiguous as to the desired
behavior in this situation, as they define AT_BASE as the base address
of the interpreter, but do not mention what to do if there is none.
On Solaris, AT_BASE will be set to the base address of the static
binary if there is no interpreter, and on Linux, AT_BASE is set to 0.
We go with the Linux semantics as they are of more immediate utility
and allow the early runtime environment to know that the kernel has
not mapped an interpreter, but because AT_PHDR points at the ELF
header for the running binary, it is still possible to retrieve all
required mapping information when the process starts should it be
required.  Either approach would be preferable to our current behavior
of passing a pointer to an unmapped region of user memory as AT_BASE.

MFC after:	3 weeks
2009-01-25 12:07:43 +00:00
bz
6dddd78341 For consistency with prison_{local,remote,check}_ipN rename
prison_getipN to prison_get_ipN.

Submitted by:	jamie (as part of a larger patch)
MFC after:	1 week
2009-01-25 10:11:58 +00:00
jeff
dcd94957aa - Correct a typo in a comment.
Noticed by:	danger
2009-01-25 09:17:16 +00:00
jeff
69d1bd8670 - Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
   This is useful for cases such as NUMA or different layouts for the same
   memory type.
 - Provide a new api for adding new backend kegs to secondary zones.
 - Provide a new flag for adjusting the layout of zones to stagger
   allocations better across cache lines.

Sponsored by:	Nokia
2009-01-25 09:11:24 +00:00
ed
ce1034ac57 Remove unneeded use of device unit numbers from pty(4).
A much more simple approach to generate the slave device name, is to
obtain the device name of the master and replace 'p' by 't'.
2009-01-25 08:27:11 +00:00
jeff
d4c94410f6 - Use __XSTRING where I want the define to be expanded. This resulted in
sizeof("MAXCPU") being used to calculate a string length rather than
   something more reasonable such as sizeof("32").  This shouldn't have
   caused any ill effect until we run on machines with 1000000 or more
   cpus.
2009-01-25 07:35:10 +00:00
jeff
3688ae7441 Fix errors introduced when I rewrote select.
- Restructure selscan() and selrescan() to avoid producing extra selfps
   when we have a fd in multiple sets.  As described below multiple selfps
   may still exist for other reasons.
 - Make selrescan() tolerate multiple selfds for a given descriptor
   set since sockets use two selinfos per fd.  If an event on each selinfo
   fires selrescan() will see the descriptor twice.  This could result in
   select() returning 2x the number of fds actually existing in fd sets.

Reported by:	mgleason@ncftp.com
2009-01-25 07:24:34 +00:00
ed
e3697d4040 Mark kern.ttys as MPSAFE.
sysctl now allows Giantless calls, so make kern.ttys use this. If it
needs Giant, it locks the proper TTY anyway.
2009-01-24 18:20:15 +00:00
rwatson
aaaff3620b Add explicit static DTrace tracing to the callout mechanism, capturing
pointers to the callout handler just before and just after the callout
it invoked.  I attempted to do this in a manner congruent to tracing in
Solaris's callout mechanism, but couldn't quite use the same names due
to convention and syntax differences.

Example DTrace script to generate a distribution graph of callout
execution times:

callout_execute:::callout_start
{
        self->cstart = timestamp;
}

callout_execute:::callout_end
{

        @length = quantize(timestamp - self->cstart);
}

Reviewed by:	jb
MFC after:	3 days
2009-01-24 10:22:49 +00:00
jhb
a9601871c9 - Mark all standalone INT/LONG/QUAD sysctl's MPSAFE. This is done
inside the SYSCTL() macros and thus does not need to be done for
  all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
  has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
2009-01-23 22:49:23 +00:00
jhb
0245a3370f - Add conditional Giant locking around the vrele() in
sysctl_kern_proc_pathname().
- Mark all the kern.proc.* sysctls as MPSAFE.

Submitted by:	csjp (2)
2009-01-23 22:46:45 +00:00
jhb
b6d1e3ceff Add a flag to tag individual sysctl leaf nodes as MPSAFE and thus not
needing Giant.

Submitted by:	csjp (an older version)
2009-01-23 22:40:35 +00:00
jhb
d94da54d95 Use shared vnode locks for fchdir().
Submitted by:	ups
2009-01-23 22:13:30 +00:00
jhb
d7c8a44c0d Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not
updating va_atime via VOP_SETATTR() isn't always synchronous.  For some
filesystems it is asynchronous.

Suggested by:  bde
2009-01-23 22:13:00 +00:00
jhb
4efa7c83e1 Push down Giant in the vlnru kproc main loop so that it is only acquired
around calls to vlrureclaim() on non-MPSAFE filesystems.  Specifically,
vnlru no longer needs Giant for the common case of waking up and deciding
there is nothing for it to do.

MFC after:	2 weeks
2009-01-23 22:08:54 +00:00
jhb
d2c61e641d Use the correct type for the timeout parameter to the 32-bit
compat version aio_waitcomplete().

Reminded by:	bz
Submitted by:	jamie
MFC after:	3 days
2009-01-23 13:23:17 +00:00
jhb
2939c2f76f Fix a few style bogons.
Submitted by:	bde
2009-01-21 20:08:17 +00:00
kib
44eef9d9bb Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by:	jhb
Tested by:	pho
MFC after:	1 month
2009-01-21 14:51:38 +00:00
jhb
47455a7b41 Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.

Inspired by:	ups
Tested by:	pho
2009-01-21 14:42:00 +00:00
thompsa
50e14c608e Add functions WITNESS so it can be asserted that the lock is not released for a
section of code, this uses WITNESS_NORELEASE() and WITNESS_RELEASEOK() to mark
the boundaries. Both functions require the lock to be held when calling.

This is intended for scenarios like a bus asserting that the bus lock is not
dropped during a driver call. There doesn't appear to be a man page to
document this in.

Reviewed by:	jhb
2009-01-21 04:19:18 +00:00
kib
cbb8defa10 FFS puts the extended attributes blocks at the negative blocks for the
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.

Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitely
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.

Reported by:	csjp
Reviewed by:	ups
MFC after:	3 weeks
2009-01-20 11:27:45 +00:00
mckay
7620a1118d Add a limit on namecache entries.
In normal operation, the number of cache entries is roughly equal to the
number of active vnodes.  However, when most of the recently accessed
vnodes have many hard links, the number of cache entries can be 32000
times as large, exhausting kernel memory and provoking a panic in
kmem_malloc().

MFC after: 2 weeks
2009-01-20 04:21:21 +00:00
mav
dcb9508070 Teach m_copyback() to use trailing space of the last mbuf in chain. 2009-01-18 20:19:55 +00:00
jeff
3d8d825555 - Implement generic macros for producing KTR records that are compatible
with src/tools/sched/schedgraph.py.  This allows developers to quickly
   create a graphical view of ktr data for any resource in the system.
 - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
   identifying records associated with a thread or cpu.
 - Reimplement the KTR_SCHED traces using the new generic facility.

Obtained from:	attilio
Discussed with:	jhb
Sponsored by:	Nokia
2009-01-17 07:17:57 +00:00
kib
92e5c3777e Lock the semaphore identifier lock during semaphore initialization to
guarantee atomicity of the operation for other semaphore consumers.
In particular, this should guard against access to the semaphore with
not done or partially done MAC label assignment.

Reviewed by:	rwatson
MFC after:	1 month
2009-01-15 12:15:46 +00:00
kib
f068898e84 It seems that there are at least three issues with IPC_RMID operation
on SysV semaphores.

  The squeeze of the semaphore array in the kern_semctl() modifies
  sem_base for the semaphores with sem_base greater then sem_base of
  the removed semaphore, as well as the values of the semaphores,
  without locking their mutex. This can lead to (killable) hangs or
  unexpected behaviour of the processes performing any sem operations
  while other process does IPC_RMID.

  The semexit_myhook() eventhandler unlocks SEMUNDO_LOCK() while
  accessing *suptr. This allows for IPC_RMID for the sem id to be
  performed in parallel with undo hook referenced by the current undo
  structure. This leads to the panic("semexit - semid not allocated") [1].

  The semaphore creation is protected by Giant, while IPC_RMID is done
  while only semaphore mutex is held. This seems to result in invalid
  values for semtot, causing random ENOSPC error returns [2].

Redo the locking of the semaphores lifetime cycle. Delegate the
sem_mtx to the sole purpose of protecting semget() and
semctl(IPC_RMID). Introduce new sem_undo_mtx to protect SEM_UNDO
handling. Remove the Giant remnants from the code.
Note that  mac_sysvsem_check_semget() and mac_sysvsem_create() are
now called while sem_mtx is held, as well as mac_sysvsem_cleanup() [3].

When semaphore is removed, acquire semaphore locks for all semaphores
with sem_base that is going to be changed by squeeze of the sema
array. The lock order is not important there, because the region is
protected by sem_mtx.

Organize both used and free sem_undo structures into the lists,
protected by sem_undo_mtx. In semexit_myhook(), remove sem_undo
structure that is being processed, from used list, without putting it
onto the free to prevent modifications by other threads. This allows
for sem_undo_lock to be dropped to acquire individial semaphore locks
without violating lock order. Since IPC_RMID may no longer find this
sem_undo, do tolerate references to unallocated semaphores in undo
structure, and check sequential number to not undo unrelated semaphore
with the same id.

While there, convert functions definitions to ANSI C and fix small
style(9) glitches.

Reported by:	Omer Faruk Sen <omerfsen gmail com> [1], pho [2]
Reviewed by:	rwatson [3]
Tested by:	pho
MFC after:	1 month
2009-01-14 15:20:13 +00:00
jhb
a20bdb255f Add a new KTR tracepoint in the KTR_CALLOUT class to note when a callout
routine finishes executing.

MFC after:	1 week
2009-01-13 15:56:53 +00:00
kib
97fa304203 Do not call namei() while having another user-controlled vnode
locked. Lookup could attempt to recursively lock that vnode.

Do not call vn_start_write(V_WAIT) while vnode is locked, this may
result in a deadlock with suspension.

vfs_busy() the mountpoint before dropping vnode lock for vnode
that was used to look up the mountpoint, to prevent unmount in
between.

Reported and tested by:	pho
Reviewed by:	rwatson
MFC after:	3 weeks
2009-01-08 12:47:30 +00:00
ed
2f8c83e018 Remove Giant locking from domains list.
During boot, the domain list is locked with Giant. It is not possible to
register any protocols after the system has booted, so the lock is only
used to protect insertion of entries.

There is already a mutex in uipc_domain.c called dom_mtx. Use this mutex
to lock the list, instead of using Giant. It won't matter anything with
respect to performance, but we'll never get rid of Giant if we don't
remove from places where we don't need it.

Approved by:	rwatson
MFC after:	3 weeks
2009-01-04 19:22:53 +00:00
rwatson
af9e0de76f Remove two further uses (debugging and NULLing) of pr_ousrreq, missed due
to svn commit in the wrong directory.

Spotted by:	bz
2009-01-04 19:16:36 +00:00
bz
903df7883d Back out r186615; the sanitizing of the pointers in the error case
is not needed and seems that it will not be needed either.

Pointy hat:	mine, mine, mine and not pho's
2009-01-04 12:18:18 +00:00
kib
ac1b596fda Extend the struct vm_page wire_count to u_int to avoid the overflow
of the counter, that may happen when too many sendfile(2) calls are
being executed with this vnode [1].

To keep the size of the struct vm_page and offsets of the fields
accessed by out-of-tree modules, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]

Reported by:	Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by:	alc [2]
Reviewed by:	alc
MFC after:	2 weeks
2009-01-03 13:24:08 +00:00
ed
fa1537b948 Fix a corner case in my previous commit.
Even though there are not many setups that have absolutely no console
device, make sure a close() on a TTY doesn't dereference a null pointer.
2009-01-02 23:39:29 +00:00
ed
8244396e79 Don't let /dev/console be revoked if the TTY below is being closed.
During startup some of the syscons TTY's are used to set attributes like
the screensaver and mouse options. These actions cause /dev/console to
be rendered unusable.

Fix the issue by leaving the TTY opened when it is used as the console
device.

Reported by:	imp
2009-01-02 23:32:43 +00:00
rwatson
96b82a80dd White space and comment tweaks.
MFC after:	3 weeks
2009-01-01 20:03:22 +00:00
rwatson
c049b2b5fe Temporary workaround for the limitations of the mbuf flowid field: zero
the field in the mbuf constructors, since otherwise we have no way to
tell if they are valid.  In the future, Kip has plans to add a flag
specifically to indicate validity, which is the preferred model.
2009-01-01 20:03:01 +00:00
ed
05475b9543 Don't clobber sysctl_root()'s error number.
When sysctl() is being called with a buffer that is too small, it will
return ENOMEM. Unfortunately the changes I made the other day sets the
error number to 0, because it just returns the error number of the
copyout(). Revert this part of the change.
2009-01-01 00:19:51 +00:00
ivoras
4136fd8892 Document the relationship between enum VM_GUEST and the vm_guest_sysctl_names
array.

Approved by:	gnn (original version)
2008-12-30 23:49:54 +00:00
pho
2f6e82c78d Added missing second part of cleaning j->ip[46] as requested by bz
Approved by:	kib (mentor)
Pointy hat:	pho
2008-12-30 20:39:47 +00:00
pho
6e2644b311 Make sure that unused j->ip[46] are cleared
Reviewed by:	bz
Approved by:	kib (mentor)
2008-12-30 17:54:25 +00:00
rwatson
0c0b8926ba Rename mbcnt to mbcnt_delta in uipc_send() -- unlike other local
variables named mbcnt in uipc_usrreq.c, this instance is a delta
rather than a cache of sb_mbcnt.

MFC after:	3 weeks
2008-12-30 16:09:57 +00:00
kib
2349a65923 Clear the pointers to the file in the struct filedesc before file is closed
in fdfree. Otherwise, sysctl_kern_proc_filedesc may dereference stale
struct file * values.

Reported and tested by:	pho
MFC after:	1 month
2008-12-30 12:51:56 +00:00
kib
c81ec4dc0c In r185557, the check for existing negative entry for the given name
did not compared nc_dvp with supplied parent directory vnode pointer.
Add the check and note that now branches for vp != NULL and vp == NULL
are the same, thus can be merged.

Reported and reviewed by:	kan
Tested by:	pho
MFC after:	2 weeks
2008-12-30 12:51:14 +00:00
ed
3f319ef66d Fix compilation. Also move ogetkerninfo() to kern_xxx.c.
It seems I forgot to remove `int error' from a single piece of code. I'm
also moving ogetkerninfo() to kern_xxx.c, because it belongs to the
class of compat system information system calls, not the generic sysctl
code.
2008-12-29 19:24:00 +00:00
ed
f3a9a195cb Push down Giant inside sysctl. Also add some more assertions to the code.
In the existing code we didn't really enforce that callers hold Giant
before calling userland_sysctl(), even though there is no guarantee it
is safe. Fix this by just placing Giant locks around the call to the oid
handler. This also means we only pick up Giant for a very short period
of time. Maybe we should add MPSAFE flags to sysctl or phase it out all
together.

I've also added SYSCTL_LOCK_ASSERT(). We have to make sure sysctl_root()
and name2oid() are called with the sysctl lock held.

Reviewed by:	Jille Timmermans <jille quis cx>
2008-12-29 12:58:45 +00:00
kib
bd5d614be8 vm_map_lock_read() does not increment map->timestamp, so we should
compare map->timestamp with saved timestamp after map read lock is
reacquired, not with saved timestamp + 1. The only consequence of the +1
was unconditional lookup of the next map entry, though.

Tested by:	pho
Approved by:	des
MFC after:	2 weeks
2008-12-29 12:45:11 +00:00
kmacy
208a4373c4 drop rnh lock before destroying it 2008-12-28 14:32:27 +00:00
bz
7d22a18291 Hide detect_virtual() along with the accompanying string
arrays under #ifndef XEN to make XEN config compile again.
In case of Xen vm_guest is hard coded.

Move the list for the vm_guest sysctl out of the restictive
bounds as the sysctl is there in either case.
2008-12-27 17:19:16 +00:00
pho
fbacc4af83 Prevent overflow of uio_resid.
Approved by:	kib
2008-12-27 10:13:43 +00:00
rwatson
13abb9545e Following the recent security advisory, add a comment describing our
invariants and approach for protocol switch methods in protsw_init(),
and also some KASSERT's for non-domain init entries in protocol
switch tables: pru_abort and pru_send must both be implemented.

For now, leave those assertions #if 0'd, since there are a few
protocols that violate them in non-harmful ways.  Whether or not we
should enforce pru_abort being implemented for non-stream protocols
is an interesting question: currently abort is only invoked on stream
sockets in situations where un-accepted sockets must be abruptly
closed (i.e., close() on a listen socket with pending connections),
but in principle it is useful for datagram sockets and most datagram
socket types implement it.

MFC after:	3 weeks
2008-12-25 11:32:38 +00:00
marcus
09a2776e69 Do not KASSERT when vp->v_dd is NULL. Only directories which have had ".."
looked up would have v_dd set to a non-NULL value.  This fixes a panic
seen when running installworld on a diskless system with a separate /usr
file system.

Submitted by:	cracauer
Approved by:	kib
2008-12-23 20:43:42 +00:00
kib
87478ed893 Keep the hold on the vnode during VOP_VPTOCNP() call, allowing the vop
implementation to drop vnode lock, if needed.

Reported and tested by:	pho
2008-12-23 20:04:31 +00:00
ivoras
97219f9ae7 Add missing newlines to flags tags of CPU topology, for prettier
output.

Reviewed by:	jeff (original version)
Approved by:	gnn (mentor) (original version)
2008-12-23 16:19:59 +00:00
cperciva
87e5b5b6cc Prevent cross-site forgery attacks on ftpd(8) due to splitting
long commands into multiple requests. [08:12]

Avoid calling uninitialized function pointers in protocol switch
code. [08:13]

Merry Christmas everybody...

Approved by:	so (cperciva)
Approved by:	re (kensmith)
Security:	FreeBSD-SA-08:12.ftpd, FreeBSD-SA-08:13.protosw
2008-12-23 01:23:09 +00:00
ed
c48b61a389 Revert r185891.
In r185891 I removed the newlines from messages written to /dev/console,
because it made startup messages from rc-scripts harder to read. This,
unfortunately, causes the kernel message that is printed after a
non-terminated log message to be concatenated.

This could be fixed, but on short term it's better to just revert the
change.

Reported by:	Jaakko Heinonen <jh saunalahti fi>
2008-12-21 21:54:01 +00:00
ed
8cd1b93741 Set PTS_FINISHED before waking up any threads.
Inside ptsdrv_{in,out}wakeup() we call KNOTE_LOCKED() to wake up any
kevent(2) users. Because the kqueue handlers are executed synchronously,
we must set PTS_FINISHED before calling ptsdrv_{in,out}wakeup().

Discovered by:	nork
2008-12-21 21:16:57 +00:00
ed
1bd8105e62 Let wchan names more closely match pre-MPSAFE TTY behaviour.
Right now the wchan strings "ttyinp" and "ttybgw" only differ one
character from the strings we used prior to MPSAFE TTY. Just rename them
back to their pre-MPSAFE TTY counterparts.

Also rename "ttylck" to "ttymtx", which should make it more clear that a
process is blocked on the TTY mutex, not some other form of locking.
2008-12-20 09:36:40 +00:00
nwhitehorn
3fcef6d9c2 Modularize the Open Firmware client interface to allow run-time switching
of OFW access semantics, in order to allow future support for real-mode
OF access and flattened device frees. OF client interface modules are
implemented using KOBJ, in a similar way to the PPC PMAP modules.

Because we need Open Firmware to be available before mutexes can be used on
sparc64, changes are also included to allow KOBJ to be used very early in
the boot process by only using the mutex once we know it has been initialized.

Reviewed by:    marius, grehan
2008-12-20 00:33:10 +00:00
ivoras
9b61ff858b Further beautify the lock strings to be more pleasing to the eye and
self documenting within 6 characters.

Reviewed by:	ed (older version)
Approved by:	gnn (older version)
2008-12-19 14:49:14 +00:00
ru
021bbbd29f Removed a comment made obsolete by revisions 157927 and 174292. 2008-12-18 15:56:12 +00:00
ivoras
c6f6eeca99 By popular request, stringify kern.vm_guest sysctl. Now it returns a
short, self-documenting string describing the detected virtual
environment.

Approved by:	gnn (mentor) (earlier version)
2008-12-18 15:34:38 +00:00
ivoras
a6186ea60d Remove spaces in wait object names to make top (1) output prettier and
unbreak scripts that examine ps (1) output.

Reviewed by:	ed
Approved by:	gnn (mentor)
2008-12-18 15:25:33 +00:00
kib
5b3918fe07 The quotactl, statfs and fstatfs syscall implementations may dereference
NULL pointer to struct mount if the looked up vnode is reclaimed. Also,
these syscalls only mnt_ref() the mp, still allowing it to be unmounted;
only struct mount memory is kept from being reused.

Lock the vnode when doing name lookup, then reference its mount point,
unlock the vnode and vfs_busy the mountpoint. This sequence shall take
care of both races.

Reported and tested by:	pho
Discussed with:	attilio
MFC after:	1 month
2008-12-18 12:01:19 +00:00
kib
fe785ac856 Do not return success and doomed vnode from lookup. LK_UPGRADE allows
the vnode to be reclaimed.

Tested by:	pho
MFC after:	1 month
2008-12-18 11:58:12 +00:00
ivoras
b769de9274 Introduce a sysctl kern.vm_guest that reflects what the kernel knows about
it running under a virtual environment. This also introduces a globally
accessible variable vm_guest that can be used where appropriate in the
kernel to inspect this environment.

To make it easier for the long run, an enum VM_GUEST is also introduced,
which could possibly be factored out in a header somewhere (but the
question is where - vm/vm_param.h? sys/param.h?) so it eventually becomes
a part of the standard KPI. In any case, it's a start.

The purpose of all this isn't to absolutely detect that the OS is running
under a virtual environment (cf. "redpill") but to allow the parts of the
kernel and the userland that care about this particular aspect and can do
something useful depending on it to have a standardised interface. Reducing
kern.hz is one example but there are other things that could be done like
avoiding context switches, not using CPU instructions that are known to be
slow in emulation, possibly different strategies in VM (memory) allocation,
CPU scheduling, etc.

It isn't clear if the JAILS/VIMAGE functionality should also be exposed
by this particular mechanism (probably not since they're not "full"
virtual hardware environments). Sometime in the future another sysctl and
a variable could be introduced to reflect if the kernel supports any kind
of virtual hosting (e.g. VMWare VMI, Xen dom0).

Reviewed by:	silence from src-commiters@, virtualization@, kmacy@
Approved by:	gnn (mentor)
Security:	Obscurity doesn't help.
2008-12-17 19:57:12 +00:00
peter
35932c6d7c Remove sysctl debug.elf_trace and the trace field in auxargs. They go
nowhere.  It used to be the equivalent of $LD_DEBUG in rtld-elf.
Elf_Auxargs is an internal structure.
2008-12-17 16:54:29 +00:00
imp
4ad1824222 Minor style(9) nit. 2008-12-17 16:25:20 +00:00
kib
ce7791f58d Remove two remnant uses of AT_DEBUG. 2008-12-17 13:13:35 +00:00
attilio
697e2a94e4 1) Fix a deadlock in the VFS:
- threadA runs vfs_rel(mp1)
- threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK()
- threadA runs vfs_busy(mp1) and, as long as, MNTK_UNMOUNT is set, sleeps
  waiting for threadB to complete the unmount
- threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting
  for the refcount to expire.

Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the
unmounter is waiting for mnt_ref to expire.
The vfs_busy contenders got awake, fails, and if they retry the
MNTK_REFEXPIRE won't allow them to sleep again.

2) Simplify significantly the code of vfs_mount_destroy() trimming
   unnecessary codes:
   - as long as any reference exited, it is no-more possible to have
     write-op (primarty and secondary) in progress.
   - it is no needed to drop and reacquire the mount lock.
   - filling the structures with dummy values is unuseful as long as
     it is going to be freed.

Tested by:	pho, Andrea Barberio <insomniac at slackware dot it>
Discussed with:	kib
2008-12-16 23:16:10 +00:00
mav
325d9e1d23 If possible, try to obtain max_mhz on cpufreq attach instead of first request.
On HyperThreading CPUs logical cores have same frequency, so setting it
on any core will change the other's one. In most cases first request
to the second core will be the "set" request, done after setting frequency
of the first core. In such case second CPU will obtain throttled frequency
of the first core as it's max_mhz making cpufreq broken due to different
frequency sets.
2008-12-16 01:24:05 +00:00
mav
37aff7daa7 Change ttyhook_register() second argument from thread to process pointer.
Thread was not really needed there, while previous ng_tty implementation
that used thread pointer had locking issues (using sx while holding mutex).
2008-12-13 21:17:46 +00:00
jkoshy
57939c399f - Bug fix: prevent a thread from migrating between CPUs between the
time it is marked for user space callchain capture in the NMI
  handler and the time the callchain capture callback runs.

- Improve code and control flow clarity by invoking hwpmc(4)'s user
  space callchain capture callback directly from low-level code.

Reviewed by:	jhb (kern/subr_trap.c)
Testing (various patch revisions): gnn,
		Fabien Thomas <fabien dot thomas at netasq dot com>,
		Artem Belevich <artemb at gmail dot com>
2008-12-13 13:07:12 +00:00
ed
7397703a4c Add FIONREAD to pseudo-terminal master devices.
All ioctl()'s that aren't implemented by pts(4) are forwarded to the TTY
itself. Unfortunately this is not correct for FIONREAD, because it will
give the wrong amount of bytes that are available to read.

Tested by:	keramida
Reminded by:	keramida
2008-12-13 07:23:55 +00:00
kib
ba6c22347a Uio_yield() already does DROP_GIANT/PICKUP_GIANT, no need to repeat this
around the call.

Noted by:  bde
2008-12-12 14:03:04 +00:00
kib
e747469903 Reference the vmspace of the process being inspected by procfs, linprocfs
and sysctl kern_proc_vmmap handlers.

Reported and tested by:	pho
Reviewed by:	rwatson, des
MFC after:	1 week
2008-12-12 12:12:36 +00:00
kib
2ef4ea7ee8 The userland_sysctl() function retries sysctl_root() until returned
error is not EAGAIN. Several sysctls that inspect another process use
p_candebug() for checking access right for the curproc. p_candebug()
returns EAGAIN for some reasons, in particular, for the process doing
exec() now. If execing process tries to lock Giant, we get a livelock,
because sysctl handlers are covered by Giant, and often do not sleep.

Break the livelock by dropping Giant and allowing other threads to
execute in the EAGAIN loop.

Also, do not return EAGAIN from p_candebug() when process is executing,
use more appropriate EBUSY error [1].

Reported and tested by:	pho
Suggested by:	rwatson [1]
Reviewed by:	rwatson, des
MFC after:	1 week
2008-12-12 12:06:28 +00:00
marcus
91e684d7f9 Add a new VOP, VOP_VPTOCNP, which translates a vnode to its component name
on a best-effort basis.  Teach vn_fullpath to use this new VOP if a
regular VFS cache lookup fails.  This VOP is designed to supplement the
VFS cache to provide a better chance that a vnode-to-name lookup will
succeed.

Currently, an implementation for devfs is being committed.  The default
implementation is to return ENOENT.

A big thanks to kib for the mentorship on this, and to pho for running it
through his stress test suite.

Reviewed by:	arch
Approved by:	kib
2008-12-12 00:57:38 +00:00
ed
9afe297ee5 Add kqueue()-support to pseudo-terminal master devices.
One thing I didn't expect many applications to use, was kqueue() on
pseudo-terminal master devices. There are applications that use kqueue()
on the TTY itself (rtorrent, etc). That doesn't mean we shouldn't
implement this. Libraries like libevent use kqueue() by default, which
means they wouldn't be able to use kqueue().

The old TTY layer implements a very broken version of kqueue() by
performing the actual polling on the TTY device.

Discussed with:	peter
2008-12-11 21:44:02 +00:00
bz
9a73283b1f Order #includes - also to reduce diffs with vimage branches in p4.
Sponsored by:	The FreeBSD Foundation
2008-12-11 16:09:31 +00:00
bz
e65de9d982 Correctly check the number of prison states to not access anything
outside the prison_states array.
When checking if there is a name configured for the prison, check the
first character to not be '\0' instead of checking if the char array
is present, which it always is. Note, that this is different for the
*jailname in the syscall.

Found with:	Coverity Prevent(tm)
CID:		4156, 4155
MFC after:	4 weeks (just that I get the mail)
2008-12-11 01:04:25 +00:00
zec
7b573d1496 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
bz
f30a0a94fe Make sure nmbclusters are initialized before maxsockets
by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE
before the init_maxsockets() SYSINT at SI_ORDER_ANY.

Reviewed by:		rwatson, zec
Sponsored by:		The FreeBSD Foundation
MFC after:		4 weeks
2008-12-10 22:17:09 +00:00
bz
72852bcf84 Style changes only. Put the return type on an extra line[1] and
add an empty line at the beginning as we do not have any local
variables.

Submitted by:	rwatson [1]
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-10 22:10:37 +00:00
ed
4400e3e134 Remove added newlines from logged messages written to /dev/console.
The /dev/console device node logs all strings that are written to it.
When the string does not contain a trailing newline, it appends one. I
can imagine this was useful a long time ago, but with our current
rc-scripts, it generates a whole bunch of messages that look like:

| Configuring syscons:
|  blanktime
| .

By not appending the newlines, the output of `dmesg -a' is now (almost?)
exactly the same as what the user will see on the console device
(syscons, uart).
2008-12-10 21:48:05 +00:00
jhb
f3dcc2d9e0 - Add 32-bit compat system calls for VFS_AIO. The system calls live in the
aio code and are registered via the recently added SYSCALL32_*() helpers.
- Since the aio code likes to invoke fuword and suword a lot down in the
  "bowels" of system calls, add a structure holding a set of operations for
  things like storing errors, copying in the aiocb structure, storing
  status, etc.  The 32-bit system calls use a separate operations vector to
  handle fuword32 vs fuword, etc.  Also, the oldsigevent handling is now
  done by having seperate operation vectors with different aiocb copyin
  routines.
- Split out kern_foo() functions for the various AIO system calls so the
  32-bit front ends can manage things like copying in and converting
  timespec structures, etc.
- For both the native and 32-bit aio_suspend() and lio_listio() calls,
  just use copyin() to read the array of aiocb pointers instead of using
  a for loop that iterated over fuword/fuword32.  The error handling in
  the old case was incomplete (lio_listio() just ignored any aiocb's that
  it got an EFAULT trying to read rather than reporting an error), and
  possibly slower.

MFC after:	1 month
2008-12-10 20:56:19 +00:00
kmacy
c510d681c9 add RW_SYSINIT_FLAGS macro and rw_sysinit_flags initialization function 2008-12-08 21:46:55 +00:00
jkim
bc7e5e240b - Detect Bochs BIOS variants and use HZ_VM as well.
- Free kernel environment variable after its use.
- Fix style(9) nits.
2008-12-08 18:39:59 +00:00
kib
8324189f53 Do drop vm map lock earlier in the sysctl_kern_proc_vmmap(), to avoid
locking a vnode while having vm map locked.

Reported and tested by:	pho
MFC after:	1 week
2008-12-08 12:29:30 +00:00
kmacy
598b522b42 - convert radix node head lock from mutex to rwlock
- make radix node head lock not recursive
 - fix LOR in rtexpunge
 - fix LOR in rtredirect

Reviewed by:	sam
2008-12-07 21:15:43 +00:00
kib
ccad2ebfb2 Several threads in a process may do vfork() simultaneously. Then, all
parent threads sleep on the parent' struct proc until corresponding
child releases the vmspace. Each sleep is interlocked with proc mutex of
the child, that triggers assertion in the sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.

Silent the assertion by using conditional variable allocated in the
child. Broadcast the variable event on exec() and exit().

Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().

Reported and tested by:	ganbold
Suggested and reviewed by:	jhb
MFC after:	2 week
2008-12-05 20:50:24 +00:00
jhb
0f1a8aa011 When the SYSINIT() to load a module invokes the MOD_LOAD event successfully,
move that module to the head of the associated linker file's list of modules.
The end result is that once all the modules are loaded, they are sorted in
the reverse of their load order.  This causes the kernel linker to invoke
the MOD_QUIESCE and MOD_UNLOAD events in the reverse of the order that
MOD_LOAD was invoked.  This means that the ordering of MOD_LOAD events that
is set by the SI_* paramters to DECLARE_MODULE() are now honored in the same
order they would be for SYSUNINIT() for the MOD_QUIESCE and MOD_UNLOAD
events.

MFC after:	1 month
2008-12-05 16:47:30 +00:00
jhb
0ac14961cc - Invoke MOD_QUIESCE on all modules in a linker file (kld) before
unloading any modules.  As a result, if any module veto's an unload
  request via MOD_QUIESCE, the entire set of modules for that linker
  file will remain loaded and active now rather than leaving the kld
  in a weird state where some modules are loaded and some are unloaded.
- This also moves the logic for handling the "forced" unload flag out of
  kern_module.c and into kern_linker.c which is a bit cleaner.
- Add a module_name() routine that returns the name of a module and use that
  instead of printing pointer values in debug messages when a module fails
  MOD_QUIESCE or MOD_UNLOAD.

MFC after:	1 month
2008-12-05 13:40:25 +00:00
bz
965b4db5a5 Fix a credential reference leak. [1]
Close subtle but relatively unlikely race conditions when
propagating the vnode write error to other active sessions
tracing to the same vnode, without holding a reference on
the vnode anymore. [2]

PR:		kern/126368 [1]
Submitted by:	rwatson [2]
Reviewed by:	kib, rwatson
MFC after:	4 weeks
2008-12-03 15:54:35 +00:00
bz
604d89458a Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
kib
ade687809e Shared lookup makes it possible to create several negative cache
entries for one name. Then, creating inode with that name would remove
one entry, leaving others dormant. Reclaiming the vnode would uncover
negative entries, causing false return of ENOENT from the calls like
stat, that do not create inode.

Prevent creation of the duplicated negative entries.

Reported and debugged with:	pho
Reviewed by:	jhb
X-MFC:	after shared lookup changes
2008-12-02 11:14:16 +00:00
peter
76037b082e Merge user/peter/kinfo branch as of r185547 into head.
This changes struct kinfo_filedesc and kinfo_vmentry such that they are
same on both 32 and 64 bit platforms like i386/amd64 and won't require
sysctl wrapping.

Two new OIDs are assigned.  The old ones are available under
COMPAT_FREEBSD7 - but it isn't that simple.  The superceded interface
was never actually released on 7.x.

The other main change is to pack the data passed to userland via the
sysctl.  kf_structsize and kve_structsize are reduced for the copyout.
If you have a process with 100,000+ sockets open, the unpacked records
require a 132MB+ copyout.  With packing, it is "only" ~35MB.  (Still
seriously unpleasant, but not quite as devastating).  A similar problem
exists for the vmentry structure - have lots and lots of shared libraries
and small mmaps and its copyout gets expensive too.

My immediate problem is valgrind.  It traditionally achieves this
functionality by parsing procfs output, in a packed format.  Secondly, when
tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled
32 bit binary which ran directly into the differing data structures in 32
vs 64 bit mode.  (valgrind uses this to track file descriptor operations
and this therefore affected every single 32 bit binary)

I've added two utility functions to libutil to unpack the structures into
a fixed record length and to make it a little more convenient to use.
2008-12-02 06:50:26 +00:00
peter
0cd59a18e3 Prune some whining. 2008-12-02 02:32:13 +00:00
kan
6f76d481f0 Shared memory objects that have size which is not necessarily equal to
exact multiple of system page size should still be allowed to be mapped
in their entirety to match the regular vnode backed file behavior.

Reported by: ed
Reviewed by: jhb
2008-12-01 22:33:50 +00:00
kensmith
b8203d2b22 Catch up with the disappearance of sys/dev/hfa. 2008-12-01 14:34:42 +00:00
attilio
2f606e171c Fix an inverted check introduced in r184554.
Submitted by:	tegge
Pointy hat to:	me
2008-12-01 03:00:26 +00:00
peter
cd7b78c33f Duplicate another few hundred lines of code in order to be compatible
with unreleased binaries.
2008-12-01 02:13:32 +00:00
davidxu
25ef38d002 Revision 184199 had not been fully reverted, add missing piece.
Reported by: phk
2008-12-01 01:54:55 +00:00
peter
2b1f03929a Properly wrap this giant block of duplicate code inside COMPAT_FREEBSD7 2008-11-30 21:04:53 +00:00
peter
343bde9706 Implement copyout packing more along the lines of what I had in mind.
Create a temporary duplicate implementation of old filedesc struct for
pre-7.1 libgtop package.
Todo: specific fd or addr request
2008-11-30 00:18:21 +00:00
peter
83dc2280ce WIP kinfo_file/kinfo_vmmentry tweaks. The idea:
1) to get the 32 and 64 bit versions in sync so that no shims are needed,
Valgrind in particular excercises this.  and:
2) reduce the size of the copyout.  On large processes this turns out to
be a huge problem.  Valgrind also suffers from this since it needs to do
this in a context that can't malloc. I want to pack the records.
3) Add new types..  'tell me about fd N' and 'tell me about addr N'.
2008-11-29 20:55:11 +00:00
bz
817305efd5 Unbreak the no-networks (no INET/6) build that I broke with
the commit in r185435.

Pointyhat:	no, but I could need a ski cap for the winter
2008-11-29 16:17:39 +00:00
bz
d2730d5b27 MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
  and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
  help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
  suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
  on cluster machines as well as all the testers and people
  who provided feedback the last months on freebsd-jail and
  other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by:	(see above)
MFC after:	3 months (this is just so that I get the mail)
X-MFC Before:   7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
kib
bf74bb2e16 In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.

Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.

Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.

Noted by:	tegge
Reviewed by:	dfr
Tested by:	pho
MFC after:	1 month
2008-11-29 13:34:59 +00:00
pjd
881f5f6bef Improve KASSERT() call a bit:
- Print flags in hex.
- Note that flags can be fine and panic can be due unexpected error condition.
- Remove redundant new line character.

Eventhough panic message excess 80 characters keep it in one line so it is
easier to grep.
2008-11-29 12:40:14 +00:00
bz
3139d98062 With the permissions of phk@ change the license on kern_jail.c
to a 2 clause BSD license.
2008-11-28 19:23:46 +00:00
ed
355c41a8e4 Fix matching of message queues by name.
The mqfs_search() routine uses strncmp() to match message queue objects
by name. This is because it can be called from environments where the
file name is not null terminated (the VFS for example).

Unfortunately it doesn't compare the lengths of the message queue names,
which means if a system has "Queue12345", the name "Queue" will also
match.

I noticed this when a student of mine handed in an exercise using
message queues with names "Queue2" and "Queue".

Reviewed by:	rink
2008-11-28 14:53:18 +00:00
kib
c0925f09e7 Explicitely note that destroy_dev() sleeps.
Requested by:	ed (some time ago), Jaakko Heinonen <jh saunalahti fi>
2008-11-27 16:47:25 +00:00
ganbold
93f46a969a Remove unused variable.
Found with:     Coverity Prevent(tm)
CID: 3664

Approved by: kib
2008-11-27 04:40:37 +00:00
zec
95a15f5c84 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
marcus
8b6bc38b7c Move vn_fullpath1() outside of FILEDESC locking. This is being done in
advance of teaching vn_fullpath1() how to query file systems for
vnode-to-name mappings when cache lookups fail.

Thanks to kib for guidance and patience on this process.

Reviewed by:	kib
Approved by:	kib
2008-11-25 15:36:15 +00:00
emaste
abb4c5f368 Correct typo in comment: thier -> their 2008-11-24 19:28:52 +00:00
dwmalone
e3dcd91abe It's possible that the dump device has gone away after it was
configured, change the message to let people know this is a
possibility. I've slightly changed the message from the one
submitted by Pekka to keep the printf on one line.

Submitted by:	Pekka Savola <pekkas@netcore.fi>
2008-11-23 21:05:22 +00:00
kib
8fad2283b3 Add sv_flags field to struct sysentvec with intention to provide description
of the ABI of the currently executing image. Change some places to test
the flags instead of explicit comparing with address of known sysentvec
structures to determine ABI features.

Discussed with:	dchagin, imp, jhb, peter
2008-11-22 12:36:15 +00:00
kmacy
9d3bb599b1 - bump __FreeBSD version to reflect added buf_ring, memory barriers,
and ifnet functions

- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own

- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers

- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
  allow drivers to efficiently manage multiple hardware queues
  (i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues

This work was supported by Bitgravity Inc. and Chelsio Inc.
2008-11-22 05:55:56 +00:00
julian
cf07f793f2 Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from:	Ironport
MFC after:	3 days
2008-11-19 19:19:30 +00:00