Commit Graph

11068 Commits

Author SHA1 Message Date
Andrew Thompson
626fc9fe3d Add a how argument to root_mount_hold() so it can be passed NOWAIT and be called
in situations where sleeping isn't allowed.
2009-04-03 19:46:12 +00:00
Peter Wemm
0e875ecafe vn_vptocnp() unlocks the name cache and forgets to re-lock it before
returning in one error case, and mistakenly unlocks it for the
umount -f case.
2009-04-02 21:16:20 +00:00
Christian Brueffer
1fa80eb15c Fix memory leak in semunload().
PR:		133064
Submitted by:	Mateusz Guzik <mjguzik@gmail.com>
MFC after:	1 week
2009-03-30 15:01:29 +00:00
Andrew Thompson
46b70f07bc Further rate limit the root wait status, it will be printed once per
root_mount_rel() wakeup.
2009-03-30 05:57:55 +00:00
Alexander Kabaev
607fc40b04 Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stopped being type stable.

Reviewed by:	kib
2009-03-29 21:25:40 +00:00
Jamie Gritton
8571af59e5 Whitespace/spelling fixes in advance of upcoming functional changes.
Approved by:	bz (mentor)
2009-03-27 13:13:59 +00:00
Andrew Thompson
d24d45d9a9 Skip the allocation of the root hold token if the mount already happened.
2009-03-27 03:52:08 +00:00
John Baldwin
2401c73637 When looking up the parent devclass of a new devclass, create the parent
devclass if it doesn't already exist.
2009-03-25 17:02:05 +00:00
John Baldwin
049ce0934f When a file lookup fails due to encountering a doomed vnode from a forced
unmount, consistently return ENOENT rather than EBADF.

Reviewed by:	kib
MFC after:	1 month
2009-03-24 18:16:42 +00:00
Jung-uk Kim
eae44ae03d Clean up MI inittodr(9) and kill noop code.
It was derived from i386 version long ago but never resync'ed again.
Originally, i386 version compared the current time from realtime clock
with time_second (which was just `time' in the old days).  When this MI
version was written, it was wrongly compared against `base' AND never
used because of a bug (typo?) in the code.  This check was killed
in i386 version when home-rolled calendaric calculation was removed.
Now, we just remove the code here as well to make the code simpler.
2009-03-23 21:16:21 +00:00
John Baldwin
9b84ba1cbb Improve the description of a few sysctls.
Submitted by:	bde (partially)
MFC after:	3 days
2009-03-23 20:18:06 +00:00
Alexander Kabaev
9999864a87 Add safety check that does not allow empty strings to be queued
to the devctl notification queue. Empty strings cause devctl read
call to return 0 and result in devd exiting prematurely.

The actual offender (ugen notes for root hubs) will be fixed
by separate commit.
2009-03-23 01:13:34 +00:00
Colin Percival
3f935cf342 Correctly sanity-check timer IDs. [SA-09:06]
Limit the size of malloced buffer when dumping environment
variables. [EN-09:01]

Approved by:	so (cperciva)
Approved by:	re (kensmith)
Security:	FreeBSD-SA-09:06.ktimer
Errata:		FreeBSD-EN-09:01.kenv
2009-03-23 00:00:50 +00:00
Konstantin Belousov
267c52fc98 Fix several issues with parsing the notes for ELF objects.
A badly formed ELF note may cause the calculated pointer to the next note
to point past the note region, which the code checked, but
also to point before the region, which was not checked [1]. Remember the
first note location in note0 and leap out if the note is not between
note0 and note_end.

Similarly, a badly formed note may cause an infinite loop by
pointing the next note into the same or a previous note. Guard against this by
limiting the number of loop iterations to an arbitrarily chosen large number.

For clarity, check the calculated note alignment in each iteration.

Reported by:	Chris Palmer <chris noncombatant org> [1]
PR:	kern/132886
Reviewed and tested by:	dchagin
MFC after:	3 days
2009-03-22 13:42:41 +00:00
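The checks described in 267c52fc98 can be sketched in userland C as follows. This is only an illustration under stated assumptions, not the kernel code: `struct note_hdr`, `count_notes()` and the `MAX_NOTES` cap are hypothetical names, and the walk uses offsets instead of the note0/note_end pointers mentioned in the message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ELF note header: three 32-bit words, followed by name and descriptor. */
struct note_hdr {
	uint32_t n_namesz;
	uint32_t n_descsz;
	uint32_t n_type;
};

#define NOTE_ALIGN	4u
#define MAX_NOTES	100	/* hypothetical cap on loop iterations */

static uint64_t
roundup4(uint64_t n)
{
	return ((n + NOTE_ALIGN - 1) & ~(uint64_t)(NOTE_ALIGN - 1));
}

/*
 * Count the notes in buf[0..len).  Returns -1 if a note is misaligned,
 * runs past the end of the region, or if there are more than MAX_NOTES.
 */
int
count_notes(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int i, count = 0;

	for (i = 0; i < MAX_NOTES && off < len; i++) {
		struct note_hdr h;
		uint64_t size;

		if (off % NOTE_ALIGN != 0)	/* alignment check per iteration */
			return (-1);
		if (len - off < sizeof(h))	/* header must fit in the region */
			return (-1);
		memcpy(&h, buf + off, sizeof(h));

		/* 64-bit arithmetic so huge sizes cannot wrap around. */
		size = sizeof(h) + roundup4(h.n_namesz) + roundup4(h.n_descsz);
		if (size > (uint64_t)(len - off))
			return (-1);
		off += (size_t)size;	/* always advances by at least sizeof(h) */
		count++;
	}
	return (off == len ? count : -1);
}
```

Because the offset only ever moves forward and is compared against the remaining length before it is used, the "next note" can never land before the start of the region or inside the current note; the iteration cap additionally bounds the work done on hostile input.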
Konstantin Belousov
15fb32c07d Do not underflow the buffer and then report the problem. Check for the
condition before the buffer write.
Also, since buflen is unsigned, previous check was ignored.

Reviewed by:	marcus
Tested by:	pho
2009-03-20 11:08:57 +00:00
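The pattern fixed in 15fb32c07d, checking the remaining space before writing and not relying on a "less than zero" test of an unsigned length, might look like this in a small userland sketch. `prepend_component()` is a hypothetical helper for illustration, not a function from the tree.

```c
#include <stddef.h>
#include <string.h>

/*
 * Prepend "/name" in front of the text already stored at buf + *buflen.
 * *buflen is the number of still-unused bytes at the start of buf.
 * Returns 0 on success, -1 if the component does not fit.
 */
int
prepend_component(char *buf, size_t *buflen, const char *name)
{
	size_t need = strlen(name) + 1;	/* name plus the leading '/' */

	/*
	 * Check BEFORE subtracting and writing.  *buflen is unsigned, so a
	 * test such as "*buflen - need < 0" can never be true.
	 */
	if (need > *buflen)
		return (-1);
	*buflen -= need;
	buf[*buflen] = '/';
	memcpy(buf + *buflen + 1, name, need - 1);
	return (0);
}
```

A caller fills the buffer from the end toward the start, as a reverse path lookup would, and reads the finished string at `buf + *buflen`.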
Konstantin Belousov
83817ce3b1 Remove unneeded braces to reduce used vertical screen space.
The location was missed in r190140.
2009-03-20 11:03:55 +00:00
Konstantin Belousov
9194007261 Do not forget to adjust buflen for the first resolution of the path
from namecache.
While there, compare pointers for equality.

Reviewed by:	marcus
Tested by:	pho
2009-03-20 11:00:39 +00:00
Konstantin Belousov
065fc451f8 The nc_nlen member of the struct namecache contains the length of the cached
name, not the length + 1.

PR:	132620, 132542
Reported by:	bf2006a yahoo com
Tested by:	bf2006a, pho
Reviewed by:	marcus
2009-03-20 10:59:06 +00:00
Konstantin Belousov
c4a8c2ee24 When ktracing namei operations, log a result of the __getcwd().
MFC after:	1 week
2009-03-20 10:47:16 +00:00
Konstantin Belousov
bf5c835e1c Remove unneeded braces to reduce used vertical screen space.
2009-03-20 10:04:00 +00:00
Attilio Rao
76ed3c71f1 Fix a long-standing bug that crept in over several revisions:
B_DELWRI cleanup and vnode disassociation should happen just before
assigning the buffer to a queue.

Reported by:	miwi, Volker <volker at vwsoft dot com>,
		Ben Kaduk <minimarmot at gmail dot com>,
		Christopher Mallon <christoph dot mallon at gmx dot de>
Tested by:	lulf, miwi
2009-03-17 16:30:49 +00:00
Konstantin Belousov
3ff063577b Supply AT_EXECPATH auxinfo entry to the interpreter, both for native and
compat32 binaries.

Tested by:	pho
Reviewed by:	kan
2009-03-17 12:53:28 +00:00
Konstantin Belousov
429f5a589b Use the properly sized types for ELF object header and program headers.
This fixes osrel fetching from the FreeBSD branding note for the 64bit
platforms.

Reported by:	swell.k gmail com
Reviewed by:	dchagin
Tested by:	dchagin, swell.k gmail com
2009-03-17 09:50:40 +00:00
Jung-uk Kim
c66d2b38c8 Initial suspend/resume support for amd64.
This code is heavily inspired by Takanori Watanabe's experimental SMP patch
for i386 and large portion was shamelessly cut and pasted from Peter Wemm's
AP boot code.
2009-03-17 00:48:11 +00:00
Konstantin Belousov
c1d8b5e82c Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, the ffs background cg block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to a buffer shortage, requesting the buffer deadlocks the bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flags argument to geteblk() so that the flags can be
passed to getblk(). Do not repeat the getnewbuf() call from geteblk() if
buffer allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to a synchronous cg block write if shadow
block allocation failed.

Since r107847, buffer write assumes that the vnode owning the buffer is
locked. The second problem is that the buffer cache may accumulate many
buffers belonging to a limited number of vnodes. With such a workload,
threads that own the locks of those vnodes quite often try to
read another block from the vnodes and, due to buffer cache
exhaustion, ask bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
a flush pass over the buffer cache before getnewbuf() goes into an
uninterruptible sleep. Move the flushing code from buf_daemon() to a new
helper function, buf_do_flush(), that is called from getnewbuf().  The
number of buffers flushed by a single call to buf_do_flush() from
getnewbuf() is limited by the new sysctl vfs.flushbufqtarget.  Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon with the TDP_BUFNEED flag.

In collaboration with:	pho
Reviewed by:	 tegge (previous version)
Tested by:	 glebius, yandex ...
MFC after:	 3 weeks
2009-03-16 15:39:46 +00:00
Robert Watson
e5adda3d51 Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free.  This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.

Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile.  They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:

        if_ar
        if_axe
        if_aue
        if_cdce
        if_cue
        if_kue
        if_ray
        if_rue
        if_rum
        if_sr
        if_udav
        if_ural
        if_zyd

Drivers that were already disabled because of tty changes:

        if_ppp
        if_sl

Discussed on:	arch@
2009-03-15 14:21:05 +00:00
Jeff Roberson
1723a06485 - Wrap lock profiling state variables in #ifdef LOCK_PROFILING blocks.
2009-03-15 08:03:54 +00:00
Jeff Roberson
2e6b8de462 - Implement a new mechanism for resetting lock profiling. We now
   guarantee that all cpus have acknowledged the cleared enable int by
   scheduling the resetting thread on each cpu in succession.  Since all
   lock profiling happens within a critical section this guarantees that
   all cpus have left lock profiling before we clear the datastructures.
 - Assert that the per-thread queue of locks lock profiling is aware of
   is clear on thread exit.  There were several cases where this was not
   true, which slows lock profiling and leaks information.
 - Remove all objects from all lists before clearing any per-cpu
   information in reset.  Lock profiling objects can migrate between
   per-cpu caches and previously these migrated objects could be zero'd
   before they'd been removed.

Discussed with:	attilio
Sponsored by:	Nokia
2009-03-15 06:41:47 +00:00
Jeff Roberson
d3df4af368 - When a mutex is destroyed while locked we need to inform lock profiling
that it has been released.
2009-03-14 11:43:38 +00:00
Jeff Roberson
04a2868980 - Call lock_profile_release when we're transitioning a lock to be owned by
LK_KERNPROC.

Discussed with:		attilio
2009-03-14 11:43:02 +00:00
Jeff Roberson
53a6c8b3ac - Fix an error that occurs when mp_ncpu is an odd number. steal_thresh
is calculated as 0 which causes errors elsewhere.

Submitted by:	KOIE Hidetaka <koie@suri.co.jp>

 - When sched_affinity() is called with a thread that is not curthread we
   need to handle the ON_RUNQ() case by adding the thread to the correct
   run queue.

Submitted by:	Justin Teller <justin.teller@gmail.com>

MFC after:	1 Week
2009-03-14 11:41:36 +00:00
Dmitry Chagin
32c01de21c Implement a new way of branding ELF binaries by looking at a
".note.ABI-tag" section.

The search order of a brand is changed; the ".note.ABI-tag" section is
now looked through first.

Move the code which fetches osreldate for an ELF binary to the check_note() handler.

PR:		118473
Approved by:	kib (mentor)
2009-03-13 16:40:51 +00:00
David Xu
326bf9493d 1) Check NULL pointer before calling umtx_pi_adjust_locked(), this avoids
a PANIC.
2) Rework locking for POSIX priority-mutex, this fixes a
   race where a thread may wait there forever even if the mutex is unlocked.
2009-03-13 06:06:20 +00:00
John Baldwin
42dd14bada Change the sysctls for maxbcache and maxswzone from int to long. I missed
this earlier since these sysctls don't exist in 7.x yet.
2009-03-12 17:23:02 +00:00
John Baldwin
b9f2a7da58 Export the current values of nbuf, ncallout, and nswbuf via read-only
sysctls that match the tunable names.

MFC after:	3 days
2009-03-12 17:21:58 +00:00
Bruce M Simpson
77d8bf9cc7 Ensure that the semaphore value is re-checked after sem_lock
is re-acquired, after the condition variable is signalled.

PR:             http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/127545
MFC after:      5 days
Reviewed by:    attilio
2009-03-12 10:36:39 +00:00
Bruce M Simpson
b2966a5a2f Make semaphore debugging output more useful.
PR:             http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/127545
MFC after:      5 days
Submitted by:   Philip Semanchuk
2009-03-12 10:34:16 +00:00
Robert Watson
ae81968fd1 When writing out updated pollfd records when returning from
poll(), only copy out the revents field, not the whole pollfd
structure.  Otherwise, if the events field is updated
concurrently by another thread, that update may be lost.

This issue apparently causes problems for the JDK on FreeBSD,
which expects the Linux behavior of not updating all fields
(somewhat oddly, Solaris does not implement the required
behavior, but presumably our adaptation of the JDK is based
on the Linux port?).

MFC after:      2 weeks
PR:		kern/130924
Submitted by:   Kurt Miller <kurt @ intricatesoftware.com>
Discussed with:	kib
2009-03-11 22:00:03 +00:00
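The idea behind ae81968fd1 can be illustrated with a userland sketch in which both arrays live in the same address space. `copyout_revents()` is a hypothetical stand-in for the kernel's copy-out step, not the actual sys_poll() code.

```c
#include <poll.h>
#include <stddef.h>

/*
 * Publish poll results: copy only the revents field from the kernel's
 * private copy (kern) into the caller's array (user).  The fd and events
 * fields in user are left alone, so a concurrent update to events made by
 * another thread is not overwritten with stale data.
 */
void
copyout_revents(struct pollfd *user, const struct pollfd *kern, size_t nfds)
{
	size_t i;

	for (i = 0; i < nfds; i++)
		user[i].revents = kern[i].revents;
}
```

Copying whole `struct pollfd` records back would instead restore the events value that was captured when poll() was entered, which is the lost update the commit message describes.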
John Baldwin
a56be37e68 Add a new type of KTRACE record for sysctl(3) invocations. It uses the
internal sysctl_sysctl_name() handler to map the MIB array to a string
name and logs this name in the trace log.  This can be useful to see
exactly which sysctls a thread is invoking.

MFC after:	1 month
2009-03-11 21:48:36 +00:00
John Baldwin
a6b6eb6b6b Gah, fix the code to match the comment. For non-open lookups use a
shared vnode lock for the leaf vnode if LOCKSHARED is set.

Submitted by:	rdivacky
2009-03-11 14:39:55 +00:00
John Baldwin
33fc362512 Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
  shared vnode lock for the leaf vnode only if the mount point has the
  extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
  not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
  O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
  vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
  FIFO's require exclusive vnode locks for their open() and close()
  routines.  (My recent MPSAFE patches for UDF and cd9660 already included
  this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by:	ups
Reviewed by:	pjd (ZFS bits)
MFC after:	1 month
2009-03-11 14:13:47 +00:00
Warner Losh
4782ea6768 Minor nits noticed by jhb@
2009-03-11 08:19:31 +00:00
John Baldwin
64ecd1399f - Make maxpipekva a signed long rather than an unsigned long as overflow
is more likely to be noticed with signed types.
- Make amountpipekva a long as well to match maxpipekva.

Discussed with:	bde
2009-03-10 21:28:43 +00:00
John Baldwin
060e911cf4 In the ABI shim for vfs.bufspace, rather than truncating values larger than
INT_MAX to INT_MAX, just go ahead and write out the full long to give an
error of ENOMEM to the user process.

Requested by:	bde
2009-03-10 21:27:15 +00:00
John Baldwin
72150f4cf7 - Remove a recently added comment from kernel_sysctlbyname() that isn't
needed.
- Move the release of the sysctl sx lock after the vsunlock() in
  userland_sysctl() to restore the original memlock behavior of
  minimizing the amount of memory wired to handle sysctl requests.

MFC after:	1 week
2009-03-10 17:00:28 +00:00
John Baldwin
38cce81ab3 Add an ABI compat shim for the vfs.bufspace sysctl for sysctl requests that
try to fetch it as an int rather than a long.  If the current value is
greater than INT_MAX it reports a value of INT_MAX.
2009-03-10 15:26:50 +00:00
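A compat shim of the kind described in 38cce81ab3 could be sketched like this. The function name `fetch_bufspace()` and its calling convention are assumptions made for illustration; the real handler works through the sysctl request structure.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/*
 * Hand the current (non-negative) long value to a caller whose buffer is
 * oldlen bytes.  Callers that ask for a long get the full value; older
 * callers that still ask for an int get the value clamped to INT_MAX.
 * Returns 0 on success or ENOMEM if the buffer size matches neither type.
 */
int
fetch_bufspace(long cur, void *oldp, size_t oldlen)
{
	if (oldlen >= sizeof(long)) {
		memcpy(oldp, &cur, sizeof(cur));
		return (0);
	}
	if (oldlen == sizeof(int)) {
		int ival = cur > INT_MAX ? INT_MAX : (int)cur;

		memcpy(oldp, &ival, sizeof(ival));
		return (0);
	}
	return (ENOMEM);
}
```

Note that the later commit 060e911cf4 in this log replaces the clamping with an ENOMEM error for values above INT_MAX; the sketch shows only the behavior described here.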
John Baldwin
5bd65606f4 Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints.  Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva.  Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers.  Now one has to overflow a long to see
such problems.  There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf.  I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.

MFC after:	1 month
2009-03-09 19:35:20 +00:00
John Baldwin
4ab2a9a022 Move the debug.hashstat sysctl tree under DIAGNOSTIC. I measured the
debug.hashstat.rawnchash sysctl in particular as taking 7 milliseconds on
a 3GHz Intel Xeon (4x2) running 7.1.  It accounted for almost a quarter of
the total runtime of 'sysctl -a'.  It also performs lots of copyout's while
holding the namecache lock (this does not attempt to fix that).

MFC after:	2 weeks
2009-03-09 19:04:53 +00:00
Warner Losh
2a7e13e5ad Fix a long-standing bug in newbus. It was introduced when subclassing
was introduced.  If you have a bus, say cardbus, that is derived from
a base-bus (say PCI), then ordinarily all PCI drivers would attach to
cardbus devices.  However, there had been one exception: kldload
wouldn't work.

The problem is in devclass_add_driver.  In this routine, all we did
was call to the pci device's BUS_DRIVER_ADDED routine.  However, since
cardbus bus instances had a different devclass, none of them were
called.

The solution is to call all subclass devclasses, recursively down the
tree, of the class that was loaded.  Since we don't have a 'children
class' pointer, we search the whole list of devclasses for a class
whose parent matches.  Since this is only done at kldload time, this isn't
as bad as it sounds.  In addition, we short-circuit the whole process by
marking those classes with subclasses with a flag.  We'll likely have
to reevaluate this method if the number of devclasses with subclasses
gets large.

This means we can remove the "cardbus" lines from all the PCI drivers
since we have no cardbus-specific device attachments in the
tree.

# Also: minor tweak to an error message
2009-03-09 13:20:23 +00:00
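The recursive search described in 2a7e13e5ad can be sketched as below. The `struct devclass` here is a simplified, hypothetical stand-in (the real structure and devclass_add_driver() differ), and the `notified` field merely records that the class would have had its BUS_DRIVER_ADDED routine called.

```c
#include <stddef.h>

/* Simplified stand-in for a devclass; field names are assumptions. */
struct devclass {
	const char	*name;
	struct devclass	*parent;	/* base class, or NULL */
	int		 has_children;	/* flag: some class derives from this one */
	int		 notified;	/* set when the class is told about the driver */
};

/*
 * Notify dc and, recursively, every class derived from it.  There is no
 * child pointer, so each level scans the whole table for classes whose
 * parent is dc; the has_children flag skips that scan for leaf classes.
 * Assumes the parent links form a tree.  Returns the number notified.
 */
int
driver_added(struct devclass *all, size_t n, struct devclass *dc)
{
	size_t i;
	int count = 1;

	dc->notified = 1;
	if (!dc->has_children)
		return (count);
	for (i = 0; i < n; i++)
		if (all[i].parent == dc)
			count += driver_added(all, n, &all[i]);
	return (count);
}
```

With a table holding pci, cardbus (derived from pci) and isa, adding a driver to pci reaches cardbus as well, while isa is untouched.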
Robert Watson
83160d1408 By default, don't compile in counters of calls to various time
query functions in the kernel, as these effectively serialize
parallel calls to the gettimeofday(2) system call, as well as
other kernel services that use timestamps.

Use the NetBSD version of the fix (kern_tc.c:1.32 by ad@) as
they have picked up our timecounter code and also ran into the
same problem.

Reported by:	kris
Obtained from:	NetBSD
MFC after:	3 days
2009-03-08 22:19:28 +00:00
Robert Watson
3dab55bc86 Decompose the global UNIX domain sockets rwlock into two different
locks: a global list/counter/generation counter protected by a new
mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock,
which protects the connections between UNIX domain sockets.

This eliminates conditional lock acquisition that was previously a
property of the global lock being held over sonewconn() leading to a
call to uipc_attach(), which also required the global lock, but
couldn't rely on it as other paths existed to uipc_attach() that
didn't hold it: now uipc_attach() uses only the list lock, which
follows the linkage lock in the lock order.  It may also reduce
contention on the global lock for some workloads.

Add global UNIX domain socket locks to hard-coded witness lock
order.

MFC after:	1 week
Discussed with:	kris
2009-03-08 21:48:29 +00:00
Joe Marcus Clarke
f8ecc40737 Add a default implementation for VOP_VPTOCNP(9) which scans the parent
directory of a vnode to find a dirent with a matching file number.  The
name from that dirent is then used to provide the component name.

Note: if the initial vnode argument is not a directory itself, then
the default VOP_VPTOCNP(9) implementation still returns ENOENT.

Reviewed by:	kib
Approved by:	kib
Tested by:	pho
2009-03-08 19:05:53 +00:00
Robert Watson
fefd0ac8a9 Remove 'uio' argument from MAC Framework and MAC policy entry points for
extended attribute get/set; in the case of get an uninitialized user
buffer was passed before the EA was retrieved, making it of relatively
little use; the latter was simply unused by any policies.

Obtained from:	TrustedBSD Project
Sponsored by:	Google, Inc.
2009-03-08 12:32:06 +00:00
Robert Watson
6f6174a762 Improve the consistency of MAC Framework and MAC policy entry point
naming by renaming certain "proc" entry points to "cred" entry points,
reflecting their manipulation of credentials.  For some entry points,
the process was passed into the framework but not into policies; in
these cases, stop passing in the process since we don't need it.

  mac_proc_check_setaudit -> mac_cred_check_setaudit
  mac_proc_check_setaudit_addr -> mac_cred_check_setaudit_addr
  mac_proc_check_setauid -> mac_cred_check_setauid
  mac_proc_check_setegid -> mac_cred_check_setegid
  mac_proc_check_seteuid -> mac_cred_check_seteuid
  mac_proc_check_setgid -> mac_cred_check_setgid
  mac_proc_check_setgroups -> mac_cred_check_setgroups
  mac_proc_check_setregid -> mac_cred_check_setregid
  mac_proc_check_setresgid -> mac_cred_check_setresgid
  mac_proc_check_setresuid -> mac_cred_check_setresuid
  mac_proc_check_setreuid -> mac_cred_check_setreuid
  mac_proc_check_setuid -> mac_cred_check_setuid

Obtained from:	TrustedBSD Project
Sponsored by:	Google, Inc.
2009-03-08 10:58:37 +00:00
Konstantin Belousov
125dcf8c7d Extract the no_poll() and vop_nopoll() code into the common routine
poll_no_poll().
Return a poll_no_poll() result from devfs_poll_f() when the
file descriptor does not reference a live cdev, instead of ENXIO.

Noted and tested by:	hps
MFC after:	1 week
2009-03-06 15:35:37 +00:00
Konstantin Belousov
45329b60da Systematically use vm_size_t to specify the size of the segment for VM KPI.
Do not overload the local variable size in kern_shmat() due to vm_size_t
change.
Fix style bug by adding explicit comparison with 0.

Discussed with:	bde
MFC after:	1 week
2009-03-05 11:45:42 +00:00
Dmitry Chagin
b2421c29f6 As suggested by jhb@, panic in case ncpus == 0.
This helps to catch bugs in the callers.

Approved by:	kib (mentor)
MFC after:	5 days
2009-03-03 17:34:09 +00:00
Robert Watson
73e416e35d Reduce the verbosity of SDT trace points for DTrace by defining several
wrapper macros that allow trace points and arguments to be declared
using a single macro rather than several.  This means a lot less
repetition and vertical space for each trace point.

Use these macros when defining privilege and MAC Framework trace points.

Reviewed by:	jb
MFC after:	1 week
2009-03-03 17:15:05 +00:00
Jamie Gritton
f86bce5ed0 Extend the "vfsopt" mount options for more general use. Make struct
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.

Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.

While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.

Approved by:	bz (mentor)
2009-03-02 23:26:30 +00:00
Alexander Kabaev
5ab4bb35fb Change vfs_busy to wait until an outcome of pending unmount
operation is known and to retry or fail accordingly to that
outcome. This fixes the problem with namespace traversing
programs failing with random ENOENT errors if someone just
happened to try to unmount that same filesystem at the same
time.

Reported by:	dhw
Reviewed by:	kib, attilio
Sponsored by:	Juniper Networks, Inc.
2009-03-02 20:51:39 +00:00
Konstantin Belousov
65067cc8b0 Correct types of variables used to track the amount of allocated SysV
shared memory from int to size_t. Implement a workaround for the current ABI,
which cannot properly store and report the size of a shared memory segment
larger than 2 GB.

This makes it possible to use > 2 GB shared memory segments on 64-bit
architectures. Please note the new BUGS section in shmctl(2) and the
UPDATING note for limitations of this temporary solution.

Reviewed by:	csjp
Tested by:	Nikolay Dzham <i levsha org ua>
MFC after:	2 weeks
2009-03-02 18:53:30 +00:00
Konstantin Belousov
2883703e00 Use the p_sysent->sv_flags flag SV_ILP32 to detect 32bit process
executing on 64bit kernel. This eliminates the direct comparisons
of p_sysent with &ia32_freebsd_sysvec, that were left intact after
r185169.
2009-03-02 18:43:50 +00:00
Dmitry Chagin
6485a22ccb Fix range-check error introduced in r182292. Also do not do anything
if all processors in the map are not available, simply return.

Approved by:	kib (mentor)
MFC after:	1 week
2009-03-01 14:26:24 +00:00
Ed Schouten
c4d4bcdaf6 Improve my previous changes to the TTY code: also remove memcpy().
It's better to just use internal language constructs, because it is
likely the compiler has a better opinion on whether to perform inlining,
which is very likely to happen to struct winsize.

Submitted by:	Christoph Mallon <christoph mallon gmx de>
2009-03-01 09:50:13 +00:00
Andrew Thompson
fef11cb704 Move the NORELEASE check to after the recurse count decrement and bailout, this
is not counted as actually releasing the lock.
2009-02-28 19:10:43 +00:00
Ed Schouten
4b2d6aaf4b Replace bcopy() calls inside the TTY layer with memcpy()/strlcpy().
In all these cases the buffers never overlap. Program names are also
likely to be shorter, so use a regular strlcpy() to copy p_comm.
2009-02-28 14:20:26 +00:00
Bjoern A. Zeeb
33553d6e99 For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
2009-02-27 14:12:05 +00:00
Ed Schouten
91c3cbfe1f Remove redundant code in printf() and vprintf().
printf() and vprintf() are exactly the same, except the way arguments
are passed. Just like we see in other pieces of code (i.e. libc's
printf()), implement printf() using vprintf().

Submitted by:	Christoph Mallon <christoph mallon gmx de>
2009-02-27 13:28:54 +00:00
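The structure described in 91c3cbfe1f, where the variadic function is a thin wrapper around the va_list version, looks like this in a userland sketch. `my_printf()` and `my_vprintf()` are hypothetical names, and the sketch forwards to the C library's vfprintf() rather than to the kernel's own formatter.

```c
#include <stdarg.h>
#include <stdio.h>

/* All of the real work lives in the va_list variant. */
int
my_vprintf(const char *fmt, va_list ap)
{
	return (vfprintf(stdout, fmt, ap));
}

/* The variadic front end only packages its arguments and forwards them. */
int
my_printf(const char *fmt, ...)
{
	va_list ap;
	int ret;

	va_start(ap, fmt);
	ret = my_vprintf(fmt, ap);
	va_end(ap);
	return (ret);
}
```

Keeping a single implementation means any later fix to the formatting path applies to both entry points.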
Ed Schouten
ff7b7d9039 Revert previous commit to subr_prf.c and make it more tidy.
As mentioned by bz and bde, the change I made wasn't the proper way to
fix. Inspired by bde's patch, perform some small cleanups to uprintf().

Reviewed by:	bz
2009-02-27 12:50:25 +00:00
Ed Schouten
69c9eff894 Remove unneeded pointer `ndp'.
Inside do_execve(), we have a pointer `ndp', which always points to
`&nd'. I can imagine a primitive (non-optimizing) compiler to really
reserve space for such a pointer, so just remove the variable and use
`&nd' directly.
2009-02-26 16:32:48 +00:00
Ed Schouten
c90c9021e9 Remove even more unneeded variable assignments.
kern_time.c:
- Unused variable `p'.

kern_thr.c:
- Variable `error' is always caught immediately, so no reason to
  initialize it. There is no way that error != 0 at the end of
  create_thread().

kern_sig.c:
- Unused variable `code'.

kern_synch.c:
- `rval' is always assigned in all different cases.

kern_rwlock.c:
- `v' is always overwritten with RW_UNLOCKED further on.

kern_malloc.c:
- `size' is always initialized with the proper value before being used.

kern_exit.c:
- `error' is always caught and returned immediately. abort2() never
  returns a non-zero value.

kern_exec.c:
- `len' is always assigned inside the if-statement right below it.

tty_info.c:
- `td' is always overwritten by FOREACH_THREAD_IN_PROC().

Found by:	LLVM's scan-build
2009-02-26 15:51:54 +00:00
Ed Schouten
318b1c3fd0 Remove unneeded variable `ocn_mute'.
Found by:	LLVM's scan-build
2009-02-26 13:01:45 +00:00
Ed Schouten
5225593633 Remove unused variables `p' and unneeded assignments of `rval'.
Found by:	LLVM's scan-build
2009-02-26 13:00:13 +00:00
Ed Schouten
2bbada90c8 Remove redundant assignment of `p'.
`p' is already initialized with `td->td_proc'. Because td is always
curthread, it is safe to initialize it without any locks.

Found by:	LLVM's scan-build
2009-02-26 12:12:34 +00:00
Robert Watson
6efcc2f26a Add static tracing for privilege checking:
  priv:kernel:priv_check:priv_ok fires for granted privileges
  priv:kernel:priv_check:priv_err fires for denied privileges

The first argument is the requested privilege number.  The naming
convention is a little different from the OpenSolaris equivalent
because we can't have '-' in probefunc names, and our privilege
namespace is different.

MFC after:	1 week
2009-02-26 10:56:13 +00:00
Ed Schouten
9e5775857d Silence compiler warning inside our ^T handler.
It turns out we're casting fixpt_t* to int*.

Spotted by:	clang
2009-02-26 10:38:19 +00:00
Ed Schouten
1d952ed28c Use unsigned longs for the TTY's sysctl stats.
Spotted by:	clang
2009-02-26 10:28:32 +00:00
Ed Schouten
1e737f33a0 Don't use PTY name as format string, even though it isn't insecure here.
It's guaranteed that the `name' variable always contains a string of the
form pty[l-sL-S][0-9a-v], but I'd rather keep the compiler happy (LLVM).
2009-02-26 10:14:10 +00:00
Jamie Gritton
613042491b Add support for methods to the OSD subsystem. Each object type has a
predefined set of methods, which are set in osd_register() and called
via osd_call().  Currently, no methods are defined, though prison
objects will have some in the future.

Expand the locking from a single per-type mutex to three different kinds
of locks (four if you include the requirement that the container
(e.g. prison) be locked when getting/setting data).  This clears up one
existing issue, as well as others added by the method support.

Approved by:	bz (mentor)
2009-02-21 11:15:38 +00:00
Ed Schouten
0eee862a54 Don't make Linux stat() open character devices to resolve its name.
The existing code calls kern_open() to resolve the vnode of a pathname
right after a stat(). This is not correct, because it causes random
character devices to be opened in /dev. This means ls'ing a tape
streamer will cause it to rewind, for example. Changes I have made:

- Add kern_statat_vnhook() to allow binary emulators to `post-process'
  struct stat, using the proper vnode.

- Remove unneeded printf's from stat() and statfs().

- Make the Linuxolator use kern_statat_vnhook(), replacing
  translate_path_major_minor_at().

- Let translate_fd_major_minor() use vp->v_rdev instead of
  vp->v_un.vu_cdev.

Result:

	crw-rw-rw- 1 root root   0, 14 Feb 20 13:54 /dev/ptmx
	crw--w---- 1 root adm  136,  0 Feb 20 14:03 /dev/pts/0
	crw--w---- 1 root adm  136,  1 Feb 20 14:02 /dev/pts/1
	crw--w---- 1 ed   tty  136,  2 Feb 20 14:03 /dev/pts/2

Before this commit, ptmx also had a major number of 136, because it
silently allocated and deallocated a pseudo-terminal. Device nodes that
cannot be opened now have proper major/minor-numbers.

Reviewed by:	kib, netchild, rdivacky (thanks!)
2009-02-20 13:05:29 +00:00
John Baldwin
03964c8e09 Enable caching of negative pathname lookups in the NFS client. To avoid
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked.  If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC.  This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.

Submitted by:	mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by:	mohans
2009-02-19 22:28:48 +00:00
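The validation scheme in the commit above can be sketched in user-space C as follows. The structure and function names (demo_dir, demo_enter_negative(), demo_lookup_negative(), demo_purge_negative()) are hypothetical stand-ins, not the NFS client or name cache code; the fixed-size array is only there to keep the example short.

```c
#include <string.h>
#include <time.h>

enum { DEMO_MAX_NEG = 16, DEMO_NAME_MAX = 32 };

/* Hypothetical per-directory state. */
struct demo_dir {
	time_t mtime;		/* directory's current modification time */
	time_t neg_mtime;	/* mtime saved with the first negative entry */
	int nneg;		/* number of cached negative entries */
	char neg[DEMO_MAX_NEG][DEMO_NAME_MAX];
};

/* Drop only the negative entries of this directory. */
static void
demo_purge_negative(struct demo_dir *d)
{
	d->nneg = 0;
}

/* Remember that "name" does not exist in the directory. */
static void
demo_enter_negative(struct demo_dir *d, const char *name)
{
	if (d->nneg == DEMO_MAX_NEG || strlen(name) >= DEMO_NAME_MAX)
		return;
	if (d->nneg == 0)
		d->neg_mtime = d->mtime;	/* first entry: save the mtime */
	strcpy(d->neg[d->nneg++], name);
}

/*
 * Returns 1 when the cache can answer "no such file".  Returns 0 when
 * the caller has to fall back to the RPC, either because the name is
 * not cached or because the directory changed and the negative entries
 * were purged.
 */
static int
demo_lookup_negative(struct demo_dir *d, const char *name)
{
	for (int i = 0; i < d->nneg; i++) {
		if (strcmp(d->neg[i], name) != 0)
			continue;
		if (d->neg_mtime != d->mtime) {
			demo_purge_negative(d);
			return (0);
		}
		return (1);
	}
	return (0);
}
```

Saving the mtime only with the first entry means one comparison covers every negative entry in the directory.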
Ed Schouten
40d05103d8 Squash some small bugs in pts(4).
- Don't return a negative errno when using an unknown ioctl() on a
  pseudo-terminal master device. Be sure to convert ENOIOCTL to ENOTTY,
  just like the TTY layer does.

- Even though we should return st_rdev of the master device node when
  emulating pty(4) devices, FIODGNAME should still return the name of
  the slave device. Otherwise ptsname(3) and ttyname(3) return an
  invalid device name.
2009-02-19 17:54:42 +00:00
Attilio Rao
f8d9048018 - Add a function (fill_kinfo_aggregate()) which aggregates the relevant
members of a kinfo entry on a process-wide basis.
- Use the newly introduced function to fix cases like
  KERN_PROC_PROC where aggregate stats are broken because they only
  consider the first thread in the pool for each process.
  (Note, additionally, that KERN_PROC_PROC is rather inaccurate for
  thread-wide information like the 'state' of the process.  Such
  information should maybe be invalidated and forcibly discarded
  by the consumers?).
- Simplify the logic of sysctl_out_proc() and adjust
  fill_kinfo_thread() accordingly.
- Remove checks on FIRST_THREAD_IN_PROC() being NULL but add
  assertions.

This patch should fix aggregate statistics for KERN_PROC_PROC.
This is one of the reasons why top doesn't use this option; now it
can use it safely.
ps, when launched to display just processes, should now report
correct cpu utilization percentages and times (unlike the old
code).

Reviewed by:	jhb, emaste
Sponsored by:	Sandvine Incorporated
2009-02-18 21:52:13 +00:00
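To illustrate the difference between reporting the first thread and aggregating over all threads, here is a small user-space sketch. The types and helpers (demo_thread, demo_proc, demo_kinfo, demo_fill_first_thread(), demo_fill_aggregate()) are hypothetical and only model the idea; they are not the kernel's kinfo_proc layout or fill_kinfo_aggregate() itself.

```c
#include <stddef.h>

/* Hypothetical per-thread counters. */
struct demo_thread {
	unsigned long runtime;	/* accumulated run time */
	unsigned pctcpu;	/* cpu utilization estimate */
};

struct demo_proc {
	const struct demo_thread *threads;
	size_t nthreads;
};

/* Hypothetical per-process report. */
struct demo_kinfo {
	unsigned long runtime;
	unsigned pctcpu;
};

/* Old behaviour: the process-wide report reflects the first thread only. */
static void
demo_fill_first_thread(const struct demo_proc *p, struct demo_kinfo *ki)
{
	ki->runtime = p->threads[0].runtime;
	ki->pctcpu = p->threads[0].pctcpu;
}

/* Fixed behaviour: sum the counters over every thread in the process. */
static void
demo_fill_aggregate(const struct demo_proc *p, struct demo_kinfo *ki)
{
	ki->runtime = 0;
	ki->pctcpu = 0;
	for (size_t i = 0; i < p->nthreads; i++) {
		ki->runtime += p->threads[i].runtime;
		ki->pctcpu += p->threads[i].pctcpu;
	}
}
```

For a process with two threads, the first helper reports only the first thread's numbers, while the second reports the totals a per-process view is expected to show.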
Joe Marcus Clarke
0618630015 Remove the printf's when the vnode to be exported for procstat is not a VDIR.
If the file system backing a process' cwd is removed, and procstat -f PID
is called, then these messages would have been printed.  The extra verbosity is
not required in this situation.

Requested by:	kib
Approved by:	kib
2009-02-14 21:55:09 +00:00
Joe Marcus Clarke
03fd9c2092 Change two KASSERTS to printfs and simple returns. Stress testing has
revealed that a process' current working directory can be VBAD if the
directory is removed.  This can trigger a panic when procstat -f PID is
run.

Tested by:	pho
Discovered by:	phobot
Reviewed by:	kib
Approved by:	kib
2009-02-14 21:12:24 +00:00
Andrew Thompson
a1797ef6c8 Remove semicolon left in the last commit
Spotted by:	csjp
2009-02-13 18:51:39 +00:00
John Baldwin
ea77ff0a15 Use shared vnode locks when invoking VOP_READDIR().
MFC after:	1 month
2009-02-13 18:18:14 +00:00
Luigi Rizzo
d4619572b4 Clarify and reimplement the bioq API so that bioq_disksort() has
the correct behaviour (sorting by distance from the current head position
in the scan direction) and bioq_insert_head() and bioq_insert_tail()
have a well defined (and useful) behaviour, especially when intermixed
with calls to bioq_disksort().

In particular:
- fix a bug in the existing bioq_disksort() that did not use the
  current head position correctly;
- redefine semantics of bioq_insert_head() and bioq_insert_tail().
  bioq_insert_tail() can now be used as a barrier
  between previous and subsequent calls to bioq_disksort().

The code is heavily documented in the source code so please refer
to that for the details.

Much of this code comes from Fabio Checconi. Also thanks to Kirk
for feedback on the (re)definition of bioq_insert_tail().

NOTE: in the current tree there is only a handful of files which
intermix calls to bioq_disksort() with bioq_insert_head() and
bioq_insert_tail(). The ordering of the queue in these situation
was not specified (nor easy to figure out) before, so I doubt any
of that code could be affected by the specification of the API.

Also note that the current implementation is significantly simpler
than the previous one (also used in ata_sort_queue()).
It would be useful to reimplement ata_sort_queue() using
the same code used in bioq_disksort().

MFC after:	1 week
2009-02-13 11:36:32 +00:00
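A rough user-space sketch of the semantics described in the commit above: sorted inserts are ordered by distance from the head position in the scan direction, and a tail insert acts as a barrier for later sorted inserts. The array-based queue and all demo_* names are assumptions made for illustration; the real bioq code is list-based, and updating the head position as requests are removed is not modelled here.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_QMAX = 16 };

struct demo_bioq {
	uint64_t offs[DEMO_QMAX];	/* queued request offsets, in service order */
	size_t len;
	size_t barrier;			/* sorted inserts never land before this index */
	uint64_t head_pos;		/* current head position */
};

/*
 * Distance from the head in the scan direction.  Unsigned wraparound
 * makes offsets behind the head sort after everything still ahead of it.
 */
static uint64_t
demo_key(const struct demo_bioq *q, uint64_t off)
{
	return (off - q->head_pos);
}

/* Sorted insert, restricted to the region after the last barrier. */
static int
demo_disksort(struct demo_bioq *q, uint64_t off)
{
	size_t i;

	if (q->len == DEMO_QMAX)
		return (-1);
	i = q->barrier;
	while (i < q->len && demo_key(q, q->offs[i]) <= demo_key(q, off))
		i++;
	for (size_t j = q->len; j > i; j--)
		q->offs[j] = q->offs[j - 1];
	q->offs[i] = off;
	q->len++;
	return (0);
}

/* Append at the tail and make it a barrier for later sorted inserts. */
static int
demo_insert_tail(struct demo_bioq *q, uint64_t off)
{
	if (q->len == DEMO_QMAX)
		return (-1);
	q->offs[q->len++] = off;
	q->barrier = q->len;
	return (0);
}
```

With the head at 50, sorted inserts of 70, 20 and 60 are serviced as 60, 70, 20; after a tail insert, later sorted requests stay behind that entry no matter how close to the head they are.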
Andrew Thompson
24ef070126 Check the exit flag at the start of the taskqueue loop rather than the end. It
is possible to tear down the taskqueue before the thread has run and the
taskqueue loop would sleep forever.

Reviewed by:	sam
MFC after:	1 week
2009-02-13 01:16:51 +00:00
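The ordering problem in the commit above can be modelled without threads by looking at the first pass through the worker loop. This is only a sketch: demo_taskqueue and both demo_first_pass_* helpers are hypothetical, and the sleep is represented by a return value instead of a real blocking call.

```c
/* Hypothetical outcome of the first pass through the worker loop. */
enum demo_outcome { DEMO_EXITED, DEMO_WOULD_SLEEP };

struct demo_taskqueue {
	int exiting;	/* set by teardown */
	int pending;	/* queued tasks */
	int ran;	/* tasks executed so far */
};

/* Run everything that is currently queued. */
static void
demo_run(struct demo_taskqueue *q)
{
	q->ran += q->pending;
	q->pending = 0;
}

/*
 * Old ordering: run, sleep, and only then test the exit flag.  If
 * teardown finished before the thread first ran, nobody is left to
 * wake the sleeper.
 */
static enum demo_outcome
demo_first_pass_check_last(struct demo_taskqueue *q)
{
	demo_run(q);
	return (DEMO_WOULD_SLEEP);	/* exit flag not consulted yet */
}

/* New ordering: test the exit flag before running or sleeping. */
static enum demo_outcome
demo_first_pass_check_first(struct demo_taskqueue *q)
{
	if (q->exiting)
		return (DEMO_EXITED);
	demo_run(q);
	return (DEMO_WOULD_SLEEP);
}
```

When exiting is already set on entry, the first helper still reaches the sleep while the second returns immediately.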
Ed Schouten
c0086bf202 Serialize write() calls on TTYs.
Just like the old TTY layer, the current MPSAFE TTY layer does not make
any attempt to serialize calls of write(). Data is copied into the
kernel in 256 (TTY_STACKBUF) byte chunks. If a write() call occurs at
the same time, the data may interleave. This is especially likely when
the TTY starts blocking, because the output queue reaches the high
watermark.

I've implemented this by adding a new flag, TTY_BUSY_OUT, which is used
to mark a TTY as having a thread stuck in write(). Because I don't want
non-blocking processes to be possibly blocked by a sleeping thread, I'm
still allowing it to bypass the protection. According to this message,
the Linux kernel returns EAGAIN in such cases, but I think that's a
little too restrictive:

	http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2007/5/2/85418/thread

PR:		kern/118287
2009-02-11 16:28:49 +00:00
Robert Watson
54fffe2d67 Modify fdcopy() so that, during fork(2), it won't copy file descriptors
from the parent to the child process if they have an operation vector
of &badfileops.  This narrows a set of races involving system calls that
allocate a new file descriptor, potentially block for some extended
period, and then return the file descriptor, when invoked by a threaded
program that concurrently invokes fork(2).  Similar approaches are used
in both Solaris and Linux, and the wideness of this race was introduced
in FreeBSD when we moved to a more optimistic implementation of
accept(2) in order to simplify locking.

A small race necessarily remains because the fork(2) might occur after
the finit() in accept(2) but before the system call has returned, but
that appears unavoidable using current APIs.  However, this race is
vastly narrower.

The fix can be validated using the newfileops_on_fork regression test.

PR:		kern/130348
Reported by:	Ivan Shcheklein <shcheklein at gmail dot com>
Reviewed by:	jhb, kib
MFC after:	1 week
2009-02-11 15:22:01 +00:00
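A simplified user-space sketch of the filtering rule described in the commit above: descriptors whose operation vector is still the placeholder are not copied to the child. The demo_* types and the flat array are hypothetical; the real fdcopy() works on struct filedesc under the appropriate locks and takes file references.

```c
#include <stddef.h>

/* Hypothetical operation vectors, compared by address only. */
struct demo_fileops {
	const char *name;
};

static const struct demo_fileops demo_badfileops = { "bad" };
static const struct demo_fileops demo_socketops = { "socket" };

struct demo_file {
	const struct demo_fileops *ops;
};

/*
 * Copy descriptor slots from parent to child.  Slots that are empty,
 * or whose file is only half set up (ops still points at the bad
 * placeholder), are left empty in the child.  Returns the number of
 * slots copied.
 */
static size_t
demo_fdcopy(struct demo_file *const *parent, struct demo_file **child,
    size_t n)
{
	size_t copied = 0;

	for (size_t i = 0; i < n; i++) {
		if (parent[i] != NULL && parent[i]->ops != &demo_badfileops) {
			child[i] = parent[i];
			copied++;
		} else {
			child[i] = NULL;
		}
	}
	return (copied);
}
```

A descriptor allocated by a blocked accept(2)-style call keeps the placeholder vector until the call completes, so the child never inherits the half-built entry.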
Warner Losh
c9584ebe61 o Use NULL in preference to 0 in pointer contexts.
o Use newly minted KOBJMETHOD_END as appropriate
o fix prototype for root_setup_intr.
2009-02-11 04:54:02 +00:00
Alexander Motin
e05e00bcae Check for device_set_devclass() errors and skip driver probe/attach if any.
Attach call without devclass set crashes the system.

On resume, the AHCI driver sometimes tries to create a duplicate adX device.
That is surely its own problem, but IMHO it is not a reason to crash here.
Other reasons are also possible.
2009-02-10 23:22:29 +00:00
Attilio Rao
a1d7ce03ea Scanning all the formats for binary translation when loading modules can
result in a loading error for one format followed by correct recognition
of another format.

File format loading functions should avoid printing any additional
information and should instead return an appropriate error code, distinct
for each condition.
Additionally, the linker should handle the different format
loading errors appropriately.

While a general mechanism is desired, fix a simple and common case on
amd64: the file type is not recognized for link elf and confuses the linker.
Print an error if none of the registered linker classes can recognize
and load the module.

Reviewed by:	jhb
Sponsored by:	Sandvine Incorporated
2009-02-10 15:50:19 +00:00
Robert Watson
e2757609ec Remove extra 'comma = 0' in socket state printing code, which otherwise
could lead to an extra comma in output.

Submitted by:   Christoph Mallon <christoph dot mallon at gmx dot de>
2009-02-09 18:19:58 +00:00
Martin Blapp
37e399b26e s/SS_FDREF/SS_NOFDREF/
2009-02-09 13:29:01 +00:00
Ed Schouten
89d647cb30 Remove a stale comment from the clists code.
We don't support quote bits.
2009-02-09 11:27:56 +00:00
John Baldwin
8941aad19b Tweak the output of VOP_PRINT/vn_printf() some.
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
  in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
  from vn_printf().
2009-02-06 20:06:48 +00:00
Edward Tomasz Napierala
ec48c16f14 Add KASSERTs to make it easier to debug problems like the one fixed
in r188141.

Reviewed by:	kib,attilio
Approved by:	rwatson (mentor)
Tested by:	pho
Sponsored by:	FreeBSD Foundation
2009-02-06 18:16:01 +00:00
John Baldwin
875b66a05b Expand the scope of the sysctllock sx lock to protect the sysctl tree itself.
Back in 1.1 of kern_sysctl.c the sysctl() routine wired the "old" userland
buffer for most sysctls (everything except kern.vnode.*).  I think to prevent
issues with wiring too much memory it used a 'memlock' to serialize all
sysctl(2) invocations, meaning that only one user buffer could be wired at
a time.  In 5.0 the 'memlock' was converted to an sx lock and renamed to
'sysctl lock'.  However, it still only served the purpose of serializing
sysctls to avoid wiring too much memory and didn't actually protect the
sysctl tree as its name suggested.  These changes expand the lock to actually
protect the tree.

Later on in 5.0, sysctl was changed to not wire buffers for requests by
default (sysctl_handle_opaque() will still wire buffers larger than a single
page, however).  As a result, user buffers are no longer wired as often.
However, many sysctl handlers still wire user buffers, so it is still
desirable to serialize userland sysctl requests.  Kernel sysctl requests
are allowed to run in parallel, however.

- Expose sysctl_lock()/sysctl_unlock() routines to exclusively lock the
  sysctl tree for a few places outside of kern_sysctl.c that manipulate
  the sysctl tree directly including the kernel linker and vfs_register().
- sysctl_register() and sysctl_unregister() require the caller to lock
  the sysctl lock using sysctl_lock() and sysctl_unlock().  The rest of
  the public sysctl API manages the locking internally.
- Add a locked variant of sysctl_remove_oid() for internal use so that
  external uses of the API do not need to be aware of locking requirements.
- The kernel linker no longer needs Giant when manipulating the sysctl
  tree.
- Add a missing break to the loop in vfs_register() so that we stop looking
  at the sysctl MIB once we have changed it.

MFC after:	1 month
2009-02-06 14:51:32 +00:00