Commit Graph

10234 Commits

Author SHA1 Message Date
Attilio Rao
d7a7e17968 Remove explicit calling of lockmgr() with the NULL argument.
Now, lockmgr() function can only be called passing curthread and the
KASSERT() is upgraded according with this.

In order to support on-the-fly owner switching, the new function
lockmgr_disown() has been introduced and gets used in BUF_KERNPROC().
KPI, so, results changed and FreeBSD version will be bumped soon.
Differently from previous code, we assume idle thread cannot try to
acquire the lockmgr as it cannot sleep, so loose the relative check[1]
in BUF_KERNPROC().

Tested by: kris

[1] kib asked for a KASSERT in the lockmgr_disown() about this
condition, but after thinking at it, as this is a well known general
rule, I found it not really necessary.
2008-01-08 23:48:31 +00:00
John Baldwin
4ad6d200d6 Regen for shm_open(2) and shm_unlink(2). 2008-01-08 22:01:26 +00:00
John Baldwin
8e38aeff17 Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
  object which provides the backing store.  Each descriptor starts off with
  a size of zero, but the size can be altered via ftruncate(2).  The shared
  memory file descriptors also support fstat(2).  read(2), write(2),
  ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
  memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
  manage shared memory file descriptors.  The virtual namespace that maps
  pathnames to shared memory file descriptors is implemented as a hash
  table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
  of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
  path argument to shm_open(2).  In this case, an unnamed shared memory
  file descriptor will be created similar to the IPC_PRIVATE key for
  shmget(2).  Note that the shared memory object can still be shared among
  processes by sharing the file descriptor via fork(2) or sendmsg(2), but
  it is unnamed.  This effectively serves to implement the getmemfd() idea
  bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
  collected when they are not referenced by any open file descriptors or
  the shm_open(2) virtual namespace.

Submitted by:	dillon, peter (previous versions)
Submitted by:	rwatson (I based this on his version)
Reviewed by:	alc (suggested converting getmemfd() to shm_open())
2008-01-08 21:58:16 +00:00
John Baldwin
39033470fe Close a race in the kern.ttys sysctl handler that resulted in panics in
dev2udev() when a tty was being detached concurrently with the sysctl
handler:
- Hold the 'tty_list_mutex' lock while we read all the fields out of the
  struct tty for copying out later.  Previously the pty(4) and pts(4)
  destroy routines could set t_dev to NULL, drop their reference on the
  tty and destroy the cdev while the sysctl handler was attempting to
  invoke dev2udev() on the cdev being destroyed.  This happened when the
  sysctl handler read the value of t_dev prior to it being set to NULL
  either due to it being stale or due to timing races.  By holding the
  list lock we guarantee that the destroy routines will block in ttyrel()
  in that case and not destroy the cdev until after we've copied all of our
  data.  We may see a NULL cdev pointer or we may see the previous value,
  but the previous value will no longer point to a destroyed cdev if we
  see it.
- Fix the ttyfree() routine used by tty device drivers in their detach
  methods to use ttyrel() on the tty so we don't leak them.  Also, fix it
  to use the same order of operations as pty/pts destruction (set t_dev
  NULL, ttyrel(), destroy_dev()) so it cooperates with the sysctl handler.

MFC after:	3 days
Tested by:	avatar
2008-01-08 04:53:28 +00:00
Kris Kennaway
357911ce77 Fix logic in skipcount handling (used to sample every 1/N lock operations
to reduce profiling overhead)
2008-01-08 01:11:40 +00:00
Robert Watson
57d7e86b65 Free MAC label on a POSIX semaphore when the semaphore is freed.
MFC after:	3 days
Submitted by:	jhb
2008-01-07 22:03:19 +00:00
John Baldwin
e46502943a Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
  a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
  object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
  vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by:	rwatson
2008-01-07 20:05:19 +00:00
Bruce Evans
9283848511 In sequential_heuristic():
- spell 16384 as 16384 and not as BKVASIZE.  16384 is (not quite) just a
  magic size that works well in practice.  BKVASIZE should be MAXBSIZE
  (65536), but is 16384 because i386's don't have enough kva for it to
  be MAXBSIZE; 16384 works (not so well) for it for much the same reasons
  that it works well in the heuristic.
- expand and/or add comments about this and other details.
- don't explicitly inline this function.
- fix some other style bugs.
2008-01-05 08:54:51 +00:00
Peter Wemm
4113f8d741 Fall back to the binary-specified interpreter (ld-elf.so.1) if the
ABI override binary isn't found.  This could probably be smoother, but
it is what I did in p4 change #126891 on 2007/09/27.  It should solve
the "ld-elf32.so.1"-in-chroot problem.
2008-01-05 08:35:56 +00:00
Jeff Roberson
fd0b8c783d - Restore timeslicing code for all bit SCHED_FIFO priority classes.
Reported by:	Peter Jeremy <peterjeremy@optushome.com.au>
2008-01-05 04:47:31 +00:00
Bjoern A. Zeeb
a82be55d42 Add missing sb_sndptr* fields to db_print_sockbuf().
While here change %d to %u for u_ints.

Discussed with:	rwatson, kmacy
2008-01-03 15:19:31 +00:00
Jeff Roberson
a57decdf32 - In sysctl_kern_file skip fdps with negative lastfiles. This can
happen if there are no files open.  Accounting for these can
   eventually return a negative value for olenp causing sysctl to
   crash with a bad malloc.

Reported by:	Pawel Worach <pawel.worach@gmail.com>
2008-01-03 01:26:59 +00:00
David E. O'Brien
bedff79a00 Note what is too {short,long}. 2008-01-02 18:48:27 +00:00
John Baldwin
c0cfd9d113 A few whitespace fixes. 2008-01-02 17:09:15 +00:00
Jeff Roberson
41e0f66d41 - Place the fhold() in unp_internalize_fp to be more consistent with refs.
- Clear all of the gc flags before doing a run.  Stale flags were causing
   us to skip some descriptors.
 - If a unp socket has been marked REF in a gc pass it can't be dead.

Found by:	rwatson's test tool.
2008-01-01 01:46:42 +00:00
Craig Rodrigues
450ea867c5 In vfs_scanopt(), make sure that the mount option value is not NULL
before calling vsscanf().

PR:		118531
Submitted by:	Jaakko Heinonen <jh saunalahti fi>
MFC after:	3 days
2007-12-31 23:44:53 +00:00
John Baldwin
0deabe7e53 Actually declare the kern.features sysctl node.
Pointy hat to:	jhb
2007-12-31 22:03:57 +00:00
Jeff Roberson
0c66dc6758 - Pause a while after disabling lock profiling and before resetting it
to be sure that all participating CPUs have stopped updating it.
 - Restore the behavior of printing the name of the lock type in the output.
2007-12-31 03:45:51 +00:00
Jeff Roberson
6f552cb098 - Check the correct variable against NULL in two places.
- If the unp_file is NULL that means it has never been internalized and it
   must be reachable.
2007-12-31 03:44:54 +00:00
Warner Losh
c94a7cac1f Rather than not redirting the bp when we get ENXIO, only redirty it
when the error is EIO.  This catches a much larger class of errors
that are unlikely to succeed if retried.

Submitted by: bde
2007-12-30 05:53:45 +00:00
Jeff Roberson
397c19d175 Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
   in such a way that the ops vector is only valid after the data, type,
   and flags are valid.
 - Protect f_flag and f_count with atomic operations.
 - Remove the global list of all files and associated accounting.
 - Rewrite the unp garbage collection such that it no longer requires
   the global list of all files and instead uses a list of all unp sockets.
 - Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by:	kris, pho
2007-12-30 01:42:15 +00:00
Alan Cox
f8a47341fe Add the superpage reservation system. This is "part 2 of 2" of the
machine-independent support for superpages.  (The earlier part was
the rewrite of the physical memory allocator.)  The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.

Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64.  (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)
2007-12-29 19:53:04 +00:00
Robert Watson
c5f1beb02a In "show lockedvnods" DDB command, use db_printf() rather than printf()
so that the results end up in the DDB output stream rather than the
console output stream.

This should likely also be done for the vprint() function it calls.

MFC after:	3 months
2007-12-28 00:47:31 +00:00
Attilio Rao
100f241571 Trimm out now unused option LK_EXCLUPGRADE from the lockmgr namespace.
This option just adds complexity and the new implementation no longer
will support it, so axing it now that it is unused is probabilly the
better idea.

FreeBSD version is bumped in order to reflect the KPI breakage introduced
by this patch.

In the ports tree, kris found that only old OSKit code uses it, but as
it is thought to work only on 2.x kernels serie, version bumping will
solve any problem.
2007-12-28 00:38:13 +00:00
Attilio Rao
7a1d78fa3f In order to avoid a huge class of deadlocks (in particular in interactions
with the interlock), owner of the lock should be only curthread or at
least, for its limited usage, NULL which identifies LK_KERNPROC.

The thread "extra argument" for the lockmgr interface is going to be
removed in the near future, but for the moment, just let kernel run for
some days with this check on in order to find potential deadlocking
places around the kernel and fix them.
2007-12-27 22:56:57 +00:00
Robert Watson
0417fe5421 Return ESRCH when a kernel stack is queried on a process in execve() --
p_candebug() will return EAGAIN which, if the other process never
leaves execve(), will result in the sysctl spinning and never returning
to userspace.  Processes should always eventually leave execve(), but
spinning in kernel while we wait is bad for countless reasons, and
particularly harmful if execve() itself is deadlocked.

Possibly we should return another error, or return a marker indicating
the thread is in execve() so it can be reported that way in userspace.

Reported by:	kris
2007-12-27 22:44:01 +00:00
Attilio Rao
98e4f2e2bf As LK_EXCLUPGRADE is used in conjuction with LK_NOWAIT, LK_UPGRADE becames
equivalent with this and so operate the switch.

That call is the only one remaining LK_EXCLUPGRADE consumer and removing
it will prepare the ground for LK_EXCLUPGRADE axing and further
lockmgr improvements.

Discussed with: jeff, ups
2007-12-27 20:52:05 +00:00
Warner Losh
b27aa20e8d A partial solution to some of the 'pull the umass device with a
mounted FS' problems.  These are more along the lines of 'avoiding an
avoidable panic' than a complete solution to removable devices.  We
now close the barn door after the horse has gotten lose and has been
hit by a truck, as it were.  The barn no longer catches fire in this
case, but the horse is still dead :-).

The vfs_bio.c fix causes us not to put a failed write back into the
dirty pool if the error returned was ENXIO.  In that case, the buffer
is treated like any other clean buffer that's being retured.  ENXIO
means the device isn't there anymore and will never be there again in
the future, so retrying is futile.

The vfs_mount.c fix treats 'ENXIO' as success for unmounting a file
system.  If the device is gone, retrying later won't help and we'll
never be able to unmount the device.

These two are part of a larger patch set submitted by the author.  The
other patches will be forth coming.  I added comments to these two
patches.

Submitted by: Henrik Gulbrandsen
Reviewed by: phk@
PR: usb/46176 (partial)
2007-12-27 16:38:28 +00:00
Robert Watson
618c7db30a Add textdump(4) facility, which provides an alternative form of kernel
dump using mechanically generated/extracted debugging output rather than
a simple memory dump.  Current sources of debugging output are:

- DDB output capture buffer, if there is captured output to save
- Kernel message buffer
- Kernel configuration, if included in kernel
- Kernel version string
- Panic message

Textdumps are stored in swap/dump partitions as with regular dumps, but
are laid out as ustar files in order to allow multiple parts to be stored
as a stream of sequentially written blocks.  Blocks are written out in
reverse order, as the size of a textdump isn't known a priori.  As with
regular dumps, they will be extracted using savecore(8).

One new DDB(4) command is added, "textdump", which accepts "set",
"unset", and "status" arguments.  By default, normal kernel dumps are
generated unless "textdump set" is run in order to schedule a textdump.
It can be canceled using "textdump unset" to restore generation of a
normal kernel dump.

Several sysctls exist to configure aspects of textdumps;
debug.ddb.textdump.pending can be set to check whether a textdump is
pending, or set/unset in order to control whether the next kernel dump
will be a textdump from userspace.

While textdumps don't have to be generated as a result of a DDB script
run automatically as part of a kernel panic, this is a particular useful
way to use them, as instead of generating a complete memory dump, a
simple transcript of an automated DDB session can be captured using the
DDB output capture and textdump facilities.  This can be used to
generate quite brief kernel bug reports rich in debugging information
but not dependent on kernel symbol tables or precisely synchronized
source code.  Most textdumps I generate are less than 100k including
the full message buffer.  Using textdumps with an interactive debugging
session is also useful, with capture being enabled/disabled in order to
record some but not all of the DDB session.

MFC after:	3 months
2007-12-26 11:32:33 +00:00
Wojciech A. Koszek
4ffcc89aa6 Rewrite kern.console handling in sbuf(9). My intention is to leave
kern.console format as is. Thus, no difference in output format should
appear after this commit.

Reviewed by:	cognet@ (mentor)
Approved by:	cognet@ (mentor)
2007-12-25 21:17:34 +00:00
Robert Watson
3de213cc00 Add a new 'why' argument to kdb_enter(), and a set of constants to use
for that argument.  This will allow DDB to detect the broad category of
reason why the debugger has been entered, which it can use for the
purposes of deciding which DDB script to run.

Assign approximate why values to all current consumers of the
kdb_enter() interface.
2007-12-25 17:52:02 +00:00
Julian Elischer
6829a5c59e give thread0 the tid 100000 and bumpt the others to start at 100001
MFC after:	1 week
2007-12-22 04:56:48 +00:00
Wojciech A. Koszek
731016fe36 Make SCHED_ULE buildable with gcc3.
Reviewed by:	cognet (mentor), jeffr
Approved by:	cognet (mentor), jeffr
2007-12-21 23:30:18 +00:00
Warner Losh
d4277fef7b When devclass_get_maxunit is passed a NULL, return -1 to indicate that
there's nothing allocated at all yet.
2007-12-19 22:05:07 +00:00
David E. O'Brien
10c2b8e128 Be more exact with sigaction SA_SIGINFO handling.
Reviewed by:	marcel
2007-12-18 20:39:13 +00:00
Kip Macy
5e0f5cfaed Add SB_NOCOALESCE flag to disable socket buffer update in place 2007-12-17 10:02:01 +00:00
David Xu
7fab871d8c Check NULL pointer. 2007-12-17 08:09:37 +00:00
David Xu
9514dcc041 Add missing changes for fixing LOR of umtx lock and thread lock, follow
the committing of files:
	kern_resource.c revision 1.181
	sched_4bsd.c	revision 1.111
	sched_ule.c	revision 1.218
2007-12-17 05:55:07 +00:00
Jeff Roberson
ace8398da0 Refactor select to reduce contention and hide internal implementation
details from consumers.

 - Track individual selecters on a per-descriptor basis such that there
   are no longer collisions and after sleeping for events only those
   descriptors which triggered events must be rescaned.
 - Protect the selinfo (per descriptor) structure with a mtx pool mutex.
   mtx pool mutexes were chosen to preserve api compatibility with
   existing code which does nothing but bzero() to setup selinfo
   structures.
 - Use a per-thread wait channel rather than a global wait channel.
 - Hide select implementation details in a seltd structure which is
   opaque to the rest of the kernel.
 - Provide a 'selsocket' interface for those kernel consumers who wish to
   select on a socket when they have no fd so they no longer have to
   be aware of select implementation details.

Tested by:	kris
Reviewed on:	arch
2007-12-16 06:21:20 +00:00
Randall Stewart
6b4959bf2f - fix tab to space issue, hmm maybe I should use vi. 2007-12-15 23:14:53 +00:00
Jeff Roberson
eea4f254fe - Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled.  There is no longer an embedded lock_profile_object
   in each lock.  Instead a list of lock_profile_objects is kept per-thread
   for each lock it may own.  The cnt_hold statistic is now always 0 to
   facilitate this.
 - Support shared locking by tracking individual lock instances and
   statistics in the per-thread per-instance lock_profile_object.
 - Make the lock profiling hash table a per-cpu singly linked list with a
   per-cpu static lock_prof allocator.  This removes the need for an array
   of spinlocks and reduces cache contention between cores.
 - Use a seperate hash for spinlocks and other locks so that only a
   critical_enter() is required and not a spinlock_enter() to modify the
   per-cpu tables.
 - Count time spent spinning in the lock statistics.
 - Remove the LOCK_PROFILE_SHARED option as it is always supported now.
 - Specifically drop and release the scheduler locks in both schedulers
   since we track owners now.

In collaboration with:	Kip Macy
Sponsored by:	Nokia
2007-12-15 23:13:31 +00:00
David E. O'Brien
d08aed068a style.Makefile(5) 2007-12-14 21:30:51 +00:00
David Xu
435806d31b Fix LOR of thread lock and umtx's priority propagation mutex due
to the reworking of scheduler lock.

MFC: after 3 days
2007-12-11 08:25:36 +00:00
Robert Watson
63d79c4fd6 Check for P_WEXIT before PHOLD() on a process in kstack and vm query
sysctls, as PHOLD() asserts !P_WEXIT.

Reported by:	Michael Plass <mfp49_freebsd at plass-family dot net>
2007-12-09 17:22:27 +00:00
Joseph Koshy
d07f36b075 Kernel and hwpmc(4) support for callchain capture.
Sponsored by:	FreeBSD Foundation and Google Inc.
2007-12-07 08:20:17 +00:00
John Baldwin
74427aa423 Move several data structure definitions out of freebsd32_misc.c and into
freebsd32.h instead.

MFC after:	1 week
2007-12-06 23:11:27 +00:00
Randall Stewart
cf70a46b47 - Puts default limits on 4k/9k and 16k zones for mbufs all based
on 1/2 of each of the successive limits tied to the limit for
  2k clusters.
- Adds real functionality in so that doing a sysctl to change these
  actually changes them :-)

MFC after:	1 week
2007-12-05 15:29:44 +00:00
Konstantin Belousov
973bdaa06f Use curthread instead of the FIRST_THREAD_IN_PROC for vnlru and syncer,
when applicable.

Aquire Giant slightly later for vnlru.

In the syncer, aquire the Giant only when a vnode belongs to the
non-MPsafe fs.

In both speedup_syncer() and syncer_shutdown(), remove the syncer thread from
the lbolt sleep queue after the syncer state is modified, not before.

Herded by:	attilio
Tested by:	Peter Holm
Reviewed by:	ups
MFC after:	1 week
2007-12-05 09:34:04 +00:00
Craig Rodrigues
62bdb328bb In nmount(), internally convert the mount option: "rdonly" to "ro".
This makes updates mounts such as:
 "mount -u -o rdonly" work more like, "mount -u -o ro".

References to "-o rdonly" were changed to "-o ro" in revision 1.60 of
the mount(8) man page,
but some people still like to use "-o rdonly" since it was documented
in earlier versions of FreeBSD.

Requested by:	rwatson
MFC after:	1 week
2007-12-05 03:26:14 +00:00
Andrew Thompson
22cf347586 Apply a workaround for the unkillable jail problem where some devices created
within the jail are never freed. si_cred is only used by the MAC framework so
make the cred reference conditional on it being compiled in, this is not a fix
and will need to be reviewed for any new consumers of si_cred.

This will quell some user complaint when using jails with a default kernel.

Reviewed by:	rwatson
MFC after:	3 days
2007-12-05 01:22:03 +00:00