Commit Graph

8942 Commits

Author SHA1 Message Date
Craig Rodrigues
92f44a3f3c Contributions from XFS for FreeBSD project:
- Implement cv_wait_unlock() method which has semantics compatible
  with the sv_wait() method in IRIX.  For cv_wait_unlock(), the lock
  must be held before entering the function, but is not held when the
  function is exited.

- Implement the existing cv_wait() function in terms of cv_wait_unlock().

Submitted by:	kan
Feedback from:	jhb, trhodes, Christoph Hellwig <hch at infradead dot org>
2005-12-12 00:02:22 +00:00
Alan Cox
05406e6f33 Remove unneeded calls to pmap_remove_all(). The given page is not mapped.
Reviewed by: tegge
2005-12-11 22:06:57 +00:00
Andre Oppermann
36ae3fd3c3 Hide the 4k mbuf clusters if the normal clusters are defined to be
4k already.

This unbreaks tinderbox.

Submitted by:	ru
2005-12-10 15:21:04 +00:00
David Xu
3e70c6f047 Fix compiling warning on 64 bits system. 2005-12-09 13:16:48 +00:00
David Xu
f71a882f15 Add a sysctl to force a process to sigexit if a trap signal is
being hold by current thread or ignored by current process,
otherwise, it is very possible the thread will enter an infinite loop
and lead to an administrator's nightmare.
2005-12-09 08:29:29 +00:00
David Xu
d26b1a1fb9 Register itimers_event_hook as a kernel event handler, so I don't
have to duplicate code to call it in exec() and exit1().
2005-12-09 05:43:26 +00:00
David Xu
102178d0df Comment out mqfs_create_link. Inline some small functions. 2005-12-09 02:38:29 +00:00
David Xu
5c4745177f Now SIGCHLD is always queued. 2005-12-09 02:27:55 +00:00
David Xu
761a4d9423 Cleanup sigqueue sysctl. 2005-12-09 02:26:44 +00:00
Andre Oppermann
d5269a636b Add an API for jumbo mbuf cluster allocation and also provide
4k clusters in addition to 9k and 16k ones.

 struct mbuf *m_getjcl(int how, short type, int flags, int size)
 void *m_cljget(struct mbuf *m, int how, int size)

m_getjcl() returns an mbuf with a cluster of the specified size attached
like m_getcl() does for 2k clusters.

m_cljget() is different from m_clget() as it can allocate clusters
without attaching them to an mbuf.  In that case the return value
is the pointer to the cluster of the requested size.  If an mbuf was
specified, it gets the cluster attached to it and the return value
can be safely ignored.

For size both take MCLBYTES, MJUM4BYTES, MJUM9BYTES, MJUM16BYTES.

Reviewed by:	glebius
Tested by:	glebius
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-12-08 13:13:06 +00:00
Craig Rodrigues
d5989f64cf In devfs_first(), set mp->mnt_opt to a valid empty list of mount options
instead of leaving it NULL.  This eliminates a kernel panic
when trying to do a mount -o update of /dev.

Noticed by:	cjsp
Reviewed by:	phk
2005-12-08 04:27:53 +00:00
Craig Rodrigues
8539ca4cde Add "errmsg" to list of global mount options. 2005-12-08 04:09:29 +00:00
Craig Rodrigues
6951bea6c8 Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
    - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
    - b_pin_count, count of pinned buffer

- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions

Patches provided by:	kan
Reviewed by:		phk
Silence on:		arch@
2005-12-07 03:39:08 +00:00
Alan Cox
8ad398d089 Reduce the scope of the page queues lock in exec_map_first_page(). The vm
object lock is sufficient for reading a page's PG_BUSY and busy flags.

MFC after: 1 week
2005-12-06 07:39:36 +00:00
David Xu
9da8a32aae o Turn on MPSAFE flag for mqueuefs.
o Reuse si_mqd field in siginfo_t, this also gives userland
  information about which descriptor is notified.
2005-12-06 06:22:12 +00:00
David Xu
027f760408 Fix a lock leak in childproc_continued(). 2005-12-06 05:30:13 +00:00
John Baldwin
5d2162b2f8 Tweak witness handling of lock object to shave 2 pointers off of each
lock object (and thus off of each mutex and sx lock):
- Rename the all_locks list to pending_locks and only put locks initialized
  before SI_SUB_WITNESS on the list so that the SI_SUB_WITNESS can add them
  to witness once it starts up.
- Now that pending_locks is only used during early startup, change it from
  a TAILQ to an STAILQ.  This removes a pointer from the STAILQ_ENTRY in
  struct lock_object.
- Since the pending_locks list is only used during the single-threaded
  early boot it no longer needs to be protected by a mutex, so remove
  all_mtx.
- Since the lo_list member of struct lock_object is now only used during
  early boot before witness is running, collapse lo_list and lo_witness
  into a union.  This shaves the second pointer off of struct lock_object.
- Axe lock_cur_cnt and lock_max_cnt.

With these changes, struct mtx shrinks from 36 to 28 bytes on 32-bit
platforms and from 72 to 56 bytes on 64-bit platforms.  Note that this
commit will completely and utterly destroy the kernel ABI, so no MFC.

Tested on:	alpha, amd64, i386, sparc64
2005-12-05 20:45:24 +00:00
David Xu
052ea11c71 After reading some documents, I realized SIGEV_NONE != NULL, also
fix code in mqueue_send_notification to handle SIGEV_NONE.
2005-12-05 04:41:32 +00:00
David Xu
9947b45978 Handle SIGEV_NONE, if notification is SIGEV_NONE, error status and
return status will be set, but no notification will be registered.
Increase hard limit of maxmsg to 100, so posixtestsuite ports can run.
2005-12-05 03:23:27 +00:00
Ruslan Ermilov
f4e9888107 Fix -Wundef. 2005-12-04 02:12:43 +00:00
Craig Rodrigues
1245b3433e Add "rdonly" to global_opts, and parse it in vfs_donmount().
Requested by:	rwatson
2005-12-03 12:04:20 +00:00
Craig Rodrigues
ec528a3472 - Add "rw" mount option to global_opts.
- In vfs_donmount(), parse "ro", "noro", and "rw", in order to set or
  unset the MNT_RDONLY filesystem flag.
2005-12-03 01:26:27 +00:00
David Xu
5ee2d4ac5a 1. Cleanup including.
2. Set configuration value for CTL_P1003_1B_MESSAGE_PASSING.
2005-12-02 14:09:32 +00:00
David Xu
a6de716d7e 1. Check if message priority is less than MQ_PRIO_MAX.
2. Use getnanotime instead of getnanouptime.
3. Don't free message in _mqueue_send, mqueue_send will free it.
2005-12-02 08:23:49 +00:00
David Xu
77e718f773 1. Set timer configuration values for sysconf().
2. Set overrun limit to INT_MAX, report ERANGE error if overrun will be
   greater than INT_MAX.
2005-12-01 07:56:15 +00:00
David Xu
b51d237a67 set signal queue values for sysconf(). 2005-12-01 00:25:50 +00:00
David Xu
b2f92ef96b Last step to make mq_notify conform to POSIX standard, If the process
has successfully attached a notification request to the message queue
via a queue descriptor, file closing should remove the attachment.
2005-11-30 05:12:03 +00:00
John Baldwin
398293a8de Fix snderr() to not leak the socket buffer lock if an error occurs in
sosend().  Robert accidentally changed the snderr() macro to jump to the
out label which assumes the lock is already released rather than the
release label which drops the lock in his previous change to sosend().
This should fix the recent panics about returning from write(2) with the
socket lock held and the most recent LOR on current@.
2005-11-29 23:07:14 +00:00
Robert Watson
66dd8a6f99 Move zero copy statistics structure before sosend_copyin().
MFC after:	1 month
Reported by:	tinderbox, sam
2005-11-28 21:45:36 +00:00
John Baldwin
ef627e7da0 When checking to see if a process has exceeded its time limit, flag the
process as over the limit when its time is >= to the limit rather than >
the limit.  Technically, if p->p_rux.rux_runtime.sec == p->p_pcpulimit
and p->p_rux.rux_runtime.frac == 0, the process hasn't exceeded the limit
yet.  However, having the fraction exactly equal to 0 is rather rare, and
it is not worth the overhead to handle that edge case.  With just the >
comparison, the process would have to exceed its limit by almost a second
before it was killed.

PR:		kern/83192
Submitted by:	Maciej Zawadzinski mzawadzinski at gmail dot com
Reviewed by:	bde
MFC after:	1 week
2005-11-28 19:09:08 +00:00
Robert Watson
a725629cf8 Break out functionality in sosend() responsible for building mbuf
chains and copying in mbufs from the body of the send logic, creating
a new function sosend_copyin().  This changes makes sosend() almost
readable, and will allow the same logic to be used by tailored socket
send routines.

MFC after:	1 month
Reviewed by:	andre, glebius
2005-11-28 18:09:03 +00:00
David Xu
f72b11a40c Fix a stupid compiler warining, remove a redundant line. 2005-11-27 22:59:47 +00:00
David Xu
47bf2cf9fe Change filesystem name from mqueue to mqueuefs for style consistent.
Suggested by: rwatson
2005-11-27 08:30:12 +00:00
David Xu
6829585c43 Regen. 2005-11-27 01:23:31 +00:00
David Xu
94e1294b06 Don't use OpenBSD syscall numbers, instead, use new syscall numbers
for POSIX message queue.

Suggested by: rwatson
2005-11-27 01:13:00 +00:00
Robert Watson
5e758b9561 Add several aliases for existing clockid_t names to indicate that the
application wishes to request high precision time stamps be returned:

Alias                           Existing

CLOCK_REALTIME_PRECISE          CLOCK_REALTIME
CLOCK_MONOTONIC_PRECISE         CLOCK_MONOTONIC
CLOCK_UPTIME_PRECISE            CLOCK_UPTIME

Add experimental low-precision clockid_t names corresponding to these
clocks, but implemented using cached timestamps in kernel rather than
a full time counter query.  This offers a minimum update rate of 1/HZ,
but in practice will often be more frequent due to the frequency of
time stamping in the kernel:

New clockid_t name              Approximates existing clockid_t

CLOCK_REALTIME_FAST             CLOCK_REALTIME
CLOCK_MONOTONIC_FAST            CLOCK_MONOTONIC
CLOCK_UPTIME_FAST               CLOCK_UPTIME

Add one additional new clockid_t, CLOCK_SECOND, which returns the
current second without performing a full time counter query or cache
lookup overhead to make sure the cached timestamp is stable.  This is
intended to support very low granularity consumers, such as time(3).

The names, visibility, and implementation of the above are subject
to change, and will not be MFC'd any time soon.  The goal is to
expose lower quality time measurement to applications willing to
sacrifice accuracy in performance critical paths, such as when taking
time stamps for the purpose of rescheduling select() and poll()
timeouts.  Future changes might include retrofitting the time counter
infrastructure to allow the "fast" time query mechanisms to use a
different time counter, rather than a cached time counter (i.e.,
TSC).

NOTE: With different underlying time mechanisms exposed, using
different time query mechanisms in the same application may result in
relative non-monoticity or the appearance of clock stalling for a
single clockid_t, as a cached time stamp queried after a precision
time stamp lookup may be "before" the time returned by the earlier
live time counter query.
2005-11-27 00:55:18 +00:00
David Xu
7023331e59 Regen. 2005-11-26 12:45:22 +00:00
David Xu
655291f2ae Bring in experimental kernel support for POSIX message queue. 2005-11-26 12:42:35 +00:00
Craig Rodrigues
5e6b93a014 In nmount() and vfs_donmount(), do not strcmp() the options in the iovec
directly.  We need to copyin() the strings in the iovec before
we can strcmp() them.  Also, when we want to send the errmsg back
to userspace, we need to copyout()/copystr() the string.

Add a small helper function vfs_getopt_pos() which takes in the
name of an option, and returns the array index of the name in the iovec,
or -1 if not found.  This allows us to locate an option in
the iovec without actually manipulating the iovec members. directly via
strcmp().

Noticed by:	kris on sparc64
2005-11-23 20:51:15 +00:00
John Polstra
ba3612cd5c Fix a bug in the loop in sonewconn that makes room on the incomplete
connection queue for a new connection.  It was removing connections
from the wrong list.

Submitted by:	Paul Mikesell
Sponsored by:	Isilon Systems
MFC after:	1 week
2005-11-22 01:55:29 +00:00
Marcel Moolenaar
60b7823989 Fix bug introduced in revision 1.186:
When all file systems have a time stamp of zero, which is the case
for example when the root file system is on a read-only medium, we
ended up not calling inittodr() at all.  A potential uncleanliness
existed as well. If multiple file systems had a non-zero time stamp,
we would call inittodr() multiple times. While this should not be
harmful, it's definitely not ideal.
Fix both issues by iterating over the mounted file systems to find
the largest time stamp and call inittodr() exactly once with that
time stamp. This could of course be a zero time stamp if none of the
mounted file systems have a non-zero time stamp. In that case the
annoying errors mentioned in the commit log for revision 1.186 still
haven't been avoided. The bottom line is that inittodr() should not
complain when it gets a time base of zero. At the time of this
commit only alpha seems to have that problem.

Reported by: Dario Freni (saturnero at freesbie dot org)
MFC after: 1 week
2005-11-19 21:51:45 +00:00
Craig Rodrigues
425e5b6268 Parse more mount options in vfs_donmount(), before vfs_domount()
is called.  It looks like there are lots of different mount flags checked
in vfs_domount(), so we need to do the parsing for these particular
mount flags earlier on.  The new flags parsed are:
async, force, multilabel, noasync, noatime, noclusterr, noclusterw,
noexec, nosuid, nosymfollow, snapshot, suiddir, sync, union.

Existing code which uses mount() to mount UFS filesystems is not
affected, but new code which uses nmount() to mount UFS filesystems
should behave better.
2005-11-19 21:22:21 +00:00
Andre Oppermann
5eefd88949 Add CLOCK_UPTIME to clock_gettime(2) reporting the current
uptime measured in SI seconds.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-18 16:51:13 +00:00
Craig Rodrigues
8fd860cfa1 In vfs_nmount(), check to see if "update" mount option was passed
in, and if so, set MNT_UPDATE filesystem flag.
vfs_nmount() calls vfs_domount(), and there is special logic
inside vfs_domount() if MNT_UPDATE is set.  This is very important
when we want to do an update mount of the root filesystem, using nmount().
2005-11-18 01:31:10 +00:00
Pyun YongHyeon
dc22aef2ae Prefer NULL to 0.
Add missing lock/unlock in sysctl handler.
Protect accessing NULL pointer when resource allocation was failed.
style(9)

Reviewed by:	scottl
MFC after:	1 week
2005-11-17 08:56:21 +00:00
Olivier Houchard
481a1fe19e Add a new sysctl, kern.elf[32|64].can_exec_dyn. When set to 1, one can
execute a ET_DYN binary (shared object).
This does not make much sense, but some linux scripts expect to be able to
execute /lib/ld-linux.so.2 (ldd comes to mind).
The sysctl defaults to 0.

MFC after:	3 days
2005-11-14 22:24:00 +00:00
Robert Watson
c5c9bd5b72 In ktr_getrequest(), acquire ktrace_mtx earlier -- while the race
currently present is minor and offers no real semantic issues, it also
doesn't make sense since an earlier lockless check has already
occurred.  Also hold the mutex longer, over a manipulation of
per-process ktrace state, which requires synchronization.

MFC after:	1 month
Pointed out by:	jhb
2005-11-14 19:30:09 +00:00
Robert Watson
2c255e9df6 Moderate rewrite of kernel ktrace code to attempt to generally improve
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueuing and dropped records.
Record loss was previously possible when the global pool of records
become depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.

These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler).  Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.

- Eliminate the ktrace worker thread and global record queue, as they
  are no longer used.  Keep the global free record list, as records
  are still used.

- Add a per-process record queue, which will hold any asynchronously
  generated records, such as from context switches.  This replaces the
  global queue as the place to submit asynchronous records to.

- When a record is committed asynchronously, simply queue it to the
  process.

- When a record is committed synchronously, first drain any pending
  per-process records in order to maintain ordering as best we can.
  Currently ordering between competing threads is provided via a global
  ktrace_sx, but a per-process flag or lock may be desirable in the
  future.

- When a process returns to user space following a system call, trap,
  signal delivery, etc, flush any pending records.

- When a process exits, flush any pending records.

- Assert on process tear-down that there are no pending records.

- Slightly abstract the notion of being "in ktrace", which is used to
  prevent the recursive generation of records, as well as generating
  traces for ktrace events.

Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on.  I.e., performing a
drain every (n) records.

MFC after:	1 month
Discussed with:	jhb
Requested by:	Marc Olzheim <marcolz at stack dot nl>
2005-11-13 13:27:44 +00:00
Craig Rodrigues
d5328381f1 style(9) cleanups.
Spotted by:	njl, bde
2005-11-12 14:41:44 +00:00
Robert Watson
71909edec8 Significant refactoring of the accounting code to improve locking and VFS
happiness, as well as correct other bugs:

- Replace notion of current and saved accounting credential/vnode with a
  single credential/vnode and an acct_suspended flag.  This simplifies the
  accounting logic substantially.

- Replace acct_mtx with acct_sx, a sleepable lock held exclusively during
  reconfiguration and space polling, but shared during log entry
  generation.  This avoids holding a mutex over sleepable VFS operations.

- Hold the sx lock over the duration of the I/O so that the vnode I/O
  cannot occur after vnode close, which could occur previously if
  accounting was disabled as a process exited.

- Write the accounting log entry with Giant conditionally acquired based
  on the file system where the log is stored.  Previously, the accounting
  code relied on the caller acquiring Giant.

- Acquire Giant conditionally in the accounting callout based on the file
  system where the accounting log is stored.  Run the callout MPSAFE.

- Expose acct_suspended via a read-only sysctl so it is possibly to
  programmatically determine whether accounting is suspended or not without
  attempting to parse logs.

- Check both acct_vp and acct_suspended lock-free before entering the
  accounting sx lock in acct().

- When accounting is disabled due to a VBAD vnode (i.e., forceable unmount),
  generate a log message indicating accounting has been disabled.

- Correct a long-standing bug in how free space is calculated and compared
  to the required space: generate and compare signed results, not unsigned
  results, or negative free space will cause accounting to not be suspended
  when required, or worse, incorrectly resumed once negative free space is
  reached.

MFC after:	2 weeks
2005-11-12 10:45:13 +00:00
David Xu
413cf3bbe1 Make sure only remove one signal by debugger. 2005-11-12 04:22:16 +00:00
Robert Watson
a0ec558af0 Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received.  The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism.  In order to
fix these problems, make the following changes:

- When there are in-flight sockets and a UNIX domain socket is destroyed,
  asynchronously schedule the garbage collector, rather than running it
  synchronously in the current context.  This avoids lock order issues
  when the garbage collection code reenters the UNIX domain socket code,
  avoiding lock order reversals, deadlocks, etc.  Run the code
  asynchronously in a task queue.

- In the garbage collector, when skipping file descriptors that have
  entered a closing state (i.e., have f_count == 0), re-test the FDEFER
  flag, and decrement unp_defer.  As file descriptors can now transition
  to a closed state, while the garbage collector is running, it is no
  longer the case that unp_defer will remain an accurate count of
  deferred sockets in the mark portion of the GC algorithm.  Otherwise,
  the garbage collector will loop waiting waiting for unp_defer to reach
  zero, which it will never do as it is skipping file descriptors that
  were marked in an earlier pass, but now closed.

- Acquire the UNIX domain socket subsystem lock in unp_discard() when
  modifying the unp_rights counter, or a read/write race is risked with
  other threads also manipulating the counter.

While here:

- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
  the garbage collector, this is not required as we are able to use the
  socket buffer receive lock to protect scanning the receive buffer for
  in-flight file descriptors on the socket buffer.

- Annotate that the description of the garbage collector implementation
  is increasingly inaccurate and needs to be updated.

- Add counters of the number of deferred garbage collections and recycled
  file descriptors.  This will be removed and is here temporarily for
  debugging purposes.

With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.

Reported by:	Philip Kizer <pckizer at nostrum dot com>
MFC after:	2 weeks
2005-11-10 16:06:04 +00:00
Robert Watson
742be7821c Add the f_msgcount field to the set of struct file fields printed in show
files.

MFC after:	1 week
2005-11-10 13:26:29 +00:00
Robert Watson
2be165c93e Expanet of details printed for each file descriptor to include it's
garbage collection flags.  Reformat generally to make this fit and
leave some room for future expansion.

MFC after:	1 week
2005-11-10 11:35:59 +00:00
Robert Watson
b4e507aafa Add a DDB "show files" command to list the current open file list, some
state about each open file, and identify the first process in the process
table that references the file.  This is helpful in debugging leaks of
file descriptors.

MFC after:	1 week
2005-11-10 10:42:50 +00:00
Doug White
16e35dcc39 This is a workaround for a complicated issue involving VFS cookies and devfs.
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the
linuxulator.

This issue affects HEAD and RELENG_6 only.

PR:		88249
Submitted by:	"Devon H. O'Dell" <dodell@ixsystems.com>
MFC after:	3 days
2005-11-09 22:03:50 +00:00
Robert Watson
f8a9ed1fa7 Fix typo in recent comment tweak.
Submitted by:	jkim
MFC after:	1 week
2005-11-09 22:02:02 +00:00
Robert Watson
923633b4b5 In closef(), remove the assumption that there is a thread associated
with the file descriptor.  When a file descriptor is closed as a result
of garbage collecting a UNIX domain socket, the file descriptor will
not have any associated thread, so the logic to identify advisory locks
held by that thread is not appropriate.  Check the thread for NULL to
avoid this scenario.  Expand an existing comment to say a bit more about
this.

MFC after:	1 week
2005-11-09 20:54:25 +00:00
Warner Losh
5d56add2ba General consensus is that it would be even better to run this in a
thread context.  While it doesn't matter too much at the moment, in
the future we could be back in the same boat if/when more restrictions
are placed (or enforced) in a SWI.

Suggested by: njl, bde, jhb, scottl
2005-11-09 16:22:56 +00:00
John Baldwin
26b7bd707c Use intptr_t casts to convert void * <--> int to make 64-bit archs happy. 2005-11-09 15:15:59 +00:00
Ruslan Ermilov
303989a2f3 Use sparse initializers for "struct domain" and "struct protosw",
so they are easier to follow for the human being.
2005-11-09 13:29:16 +00:00
David Xu
f4d8522334 WIFxxx macros requires an int type but p_xstat is short, convert it
to int before using the macros.

Bug reported by : Pyun YongHyeon pyunyh at gmail dot com
2005-11-09 07:58:16 +00:00
Warner Losh
161604d863 Kick off the suspend sequence from the keyboard in a SWI rather than
in the hardware interrupt context (even if it is likely just an
ithread).  We don't document that suspend/resume routines are run from
such a context and some of the things that happen in those routines
aren't interrupt safe.  Since there's no real need to run from that
context, this restores assumptions that suspend routines have made.

This fixes Thierry Herbelot's 'Trying to sleep while sleeping is
prohibited' problem.
2005-11-09 07:32:01 +00:00
Warner Losh
2002eaadb7 Clarify panic message, I parsed the old one 'trying to sleep while sleeping' 2005-11-09 07:28:52 +00:00
Craig Rodrigues
4560dfb5b1 For nmount(), allow a text string error message to be propagated back
to user-space if a parameter named "errmsg" is passed into the iovec.
Used in conjunction with vfs_mount_error(), more useful error messages
than errno can be passed back to userspace when mounting a filesystem
fails.

Discussed with:		phk, pjd
2005-11-09 02:26:38 +00:00
David Xu
323fe56580 In aio_waitcomplete, do not return EAGAIN if no other threads
have started aio, instead, initialize aio management structure
if it hasn't been done, the reason to adjust this behavior is
to make it a bit friendly for threaded program, consider two
threads, one submits aio_write, and another just calls
aio_waitcomplete to wait any I/O to be completed and recycle the
aio requests, before submitter doing any I/O, the recycler wants
to wait in kernel. This also fixes inconsistency with other aio
syscalls.
2005-11-08 23:48:32 +00:00
David Xu
c20cedbfc9 Make sure pending SIGCHLD is removed from previous parent when process
is attached or detached.
2005-11-08 23:28:12 +00:00
John Baldwin
2a522eb9d3 Various and sundry cleanups:
- Use curthread for calls to knlist_delete() and add a big comment
  explaining why as well as appropriate assertions.
- Use TAILQ_FOREACH and TAILQ_FOREACH_SAFE instead of handrolling them.
- Use fget() family of functions to lookup file objects instead of
  grovelling around in file descriptor tables.
- Destroy the aio_freeproc mutex if we are unloaded.

Tested on:	i386
2005-11-08 17:43:05 +00:00
Christian S.J. Peron
576068804d Giant clean up for exit(2)
-Change unconditional aquisition of Giant to only pickup Giant if the vnode
 for the controlling tty resides on a non-mpsafe file system.
-Pickup Giant around executable vnode reference counting operations only if
 the executable resides on a non-mpsafe file system.
-If this process is being traced, pickup Giant for trace file reference count
 operations only if it resides on a non-mpsafe file system.

Discussed with:	jhb
Tested by:	kris
2005-11-08 17:11:03 +00:00
David Xu
ebceaf6dc7 Add support for queueing SIGCHLD same as other UNIX systems did.
For each child process whose status has been changed, a SIGCHLD instance
is queued, if the signal is stilling pending, and process changed status
several times, signal information is updated to reflect latest process
status. If wait() returns because the status of a child process is
available, pending SIGCHLD signal associated with the child process is
discarded. Any other pending SIGCHLD signals remain pending.

The signal information is allocated at the same time when proc structure
is allocated, if process signal queue is fully filled or there is a memory
shortage, it can still send the signal to process.

There is a booting time tunable kern.sigqueue.queue_sigchild which
can control the behavior, setting it to zero disables the SIGCHLD queueing
feature, the tunable will be removed if the function is proved that it is
stable enough.

Tested on: i386 (SMP and UP)
2005-11-08 09:09:26 +00:00
Craig Rodrigues
84e69560b6 Add utility function to propagate mount errors as text string messages.
Discussed with:		phk
2005-11-08 04:13:39 +00:00
Gleb Smirnoff
49d46b616e Fix panic string in last revision. 2005-11-06 16:47:59 +00:00
Andre Oppermann
cd5bb63b3d Free only those mbuf+clusters back to the packet zone that were allocated
from there.  All others get broken up and free'd individually to the mbuf
and cluster zones.

The packet zone is a secondary zone to the mbuf zone.  There is currently
a limitation in UMA which prevents decreasing the packet zone stock when
the mbuf and cluster zone are drained and all their members are part of
packets.  When this is fixed this change may be reverted.
2005-11-05 19:43:55 +00:00
Andre Oppermann
a5f7708723 Fix a logic error introduced with mandatory mbuf cluster refcounting and
freeing of mbufs+clusters back to the packet zone.
2005-11-04 17:20:53 +00:00
David Xu
8f0371f19d Fix name compatible problem with POSIX standard. the sigval_ptr and
sigval_int really should be sival_ptr and sival_int.
Also sigev_notify_function accepts a union sigval value but not a
pointer.
2005-11-04 09:41:00 +00:00
John Baldwin
091e8307d0 Add stoppcbs[] arrays on Alpha and sparc64 and have each CPU save its
current context in the IPI_STOP handler so that we can get accurate stack
traces of threads on other CPUs on these two archs like we do now on i386
and amd64.

Tested on:	alpha, sparc64
2005-11-03 21:08:20 +00:00
John Baldwin
55de4dcab6 Fix 'show allpcpu' ddb command on non-x86. CPU IDs are in the range 0 ..
mp_maxid, not 0 .. mp_maxid - 1.  The result was that the highest numbered
CPU was skipped on Alpha and sparc64.

MFC after:	1 week
2005-11-03 21:06:29 +00:00
Pawel Jakub Dawidek
2a143d5bf5 Detect memory leaks when memory type is being destroyed.
This is very helpful for detecting kernel modules memory leaks on unload.

Discussed and reviewed by:	rwatson
2005-11-03 13:48:59 +00:00
David Xu
4c0fb2cfff Support sending realtime signal information via signal queue, realtime
signal memory is pre-allocated, so kernel can always notify user code.
2005-11-03 05:25:26 +00:00
David Xu
6d7b314b14 Cleanup some signal interfaces. Now the tdsignal function accepts
both proc pointer and thread pointer, if thread pointer is NULL,
tdsignal automatically finds a thread, otherwise it sends signal
to given thread.
Add utility function psignal_event to send a realtime sigevent
to a process according to the delivery requirement specified in
struct sigevent.
2005-11-03 04:49:16 +00:00
David Xu
d1f16b4d2e Oops, don't change tdsignal call. 2005-11-03 01:38:49 +00:00
David Xu
44355392b4 Add thread_find() function to search a thread by lwpid. 2005-11-03 01:34:08 +00:00
Paul Saab
1471f287e1 Calling setrlimit from 32bit apps could potentially increase certain
limits beyond what should be capiable in a 32bit process, so we
must fixup the limits.

Reviewed by:	jhb
2005-11-02 21:18:07 +00:00
Andre Oppermann
56a4e45ab3 Mandatory mbuf cluster reference counting and groundwork for UMA
based jumbo 9k and jumbo 16k cluster support.

All mbuf's with external storage attached are mandatory reference
counted.  For clusters and jumbo clusters UMA provides the refcnt
storage directly.  It does not have to be separatly allocated.  Any
other type of external storage gets its own refcnt allocated from
an UMA mbuf refcnt zone instead of normal kernel malloc.

The refcount API MEXT_ADD_REF() and MEXT_REM_REF() is no longer
publically accessible.  The proper m_* functions have to be used.

mb_ctor_clust() and mb_dtor_clust() both handle normal 2K as well
as 9k and 16k clusters.

Clusters and jumbo clusters may be obtained without attaching it
immideatly to an mbuf.  This is for high performance cluster
allocation in network drivers where mbufs are attached after the
cluster has been filled.

Tested by:	rwatson
Sponsored by:	TCP/IP Optimizations Fundraise 2005
2005-11-02 16:20:36 +00:00
Andre Oppermann
34333b16cd Retire MT_HEADER mbuf type and change its users to use MT_DATA.
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it.  It only adds a layer of confusion.  The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit.  For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-02 13:46:32 +00:00
John Baldwin
68a17869c1 Push down Giant into fdfree() and remove it from two of the callers.
Other callers such as some rfork() cases weren't locking Giant anyway.

Reviewed by:	csjp
MFC after:	1 week
2005-11-01 17:13:05 +00:00
Robert Watson
2bdeb3f9f2 Reuse ktr_unused field in ktr_header structure as ktr_tid; populate
ktr_tid as part of gathering of ktr header data for new ktrace
records.  The continued use of intptr_t is required for file layout
reasons, and cannot be changed to lwpid_t at this point.

MFC after:	1 month
Reviewed by:	davidxu
2005-11-01 14:46:37 +00:00
Robert Watson
d977a58307 Replace ktr_buffer pointer in struct ktr_header with a ktr_unused
intptr_t.  The buffer length needs to be written to disk as part
of the trace log, but the kernel pointer for the buffer does not.
Add a new ktr_buffer pointer to the kernel-only ktrace request
structure to hold that pointer.  This frees up an integer in the
ktrace record format that can be used to hold the threadid,
although older ktrace files will have a garbage ktr_buffer field
(or more accurately, a kernel pointer value).

MFC after:		2 weeks
Space requested by:	davidxu
2005-11-01 12:36:19 +00:00
Paul Saab
ecc44de7a2 Reformat socket control messages on input/output for 32bit compatibility
on 64bit systems.

Submitted by:	ps, ups
Reviewed by:	jhb
2005-10-31 21:09:56 +00:00
John Baldwin
f6494f2e13 Check to see if the hash table is present in link_elf_lookup_symbol()
before dereferencing it.  Certain corrupt kernel modules might not have
a valid hash table, and would cause a kernel panic when they were loaded.
Instead of panic'ing, the kernel now prints out a warning that it is
missing the symbol hash table.

Tested by:	Benjamin Close Benjamin dot Close at clearchain dot com
MFC after:	1 week
2005-10-31 19:17:32 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Robert Watson
d374e81efd Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN.  This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit.  This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface.  This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
2005-10-30 19:44:40 +00:00
David Xu
56c06c4b67 Let itimer store itimerspec instead of itimerval, so I don't have to
convert to or from timeval frequently.

Introduce function itimer_accept() to ack a timer signal in signal
acceptance code, this allows us to return more fresh overrun counter
than at signal generating time. while POSIX says:
"the value returned by timer_getoverrun() shall apply to the most
recent expiration signal delivery or acceptance for the timer,.."
I prefer returning it at acceptance time.

Introduce SIGEV_THREAD_ID notification mode, it is used by thread
libary to request kernel to deliver signal to a specified thread,
and in turn, the thread library may use the mechanism to implement
SIGEV_THREAD which is required by POSIX.

Timer signal is managed by timer code, so it can not fail even if
signal queue is full filled by sigqueue syscall.
2005-10-30 02:56:08 +00:00
David Xu
01790d850d Regen. 2005-10-30 02:14:37 +00:00
David Xu
0972628aff Fix sigevent's POSIX incompatible problem by adding member fields
sigev_notify_function and sigev_notify_attributes. AIO syscalls
use sigevent, so they have to be adjusted.

Reviewed by:	alc
2005-10-30 02:12:49 +00:00
Ed Maste
5d89e1d0af In watchdog_config enable the software watchdog iff the WD_ACTIVE flag is
set.  When watchdogd(1) is terminated intentionally it clears the bit,
which should then disable it in the kernel.

PR:		kern/74386
Submitted by:	Alex Hoff <ahoff at sandvine dot com>
Approved by:	phk, rwatson (mentor)
2005-10-27 17:22:47 +00:00
John Baldwin
2851f51eb1 Revert most of revision 1.235 and fix the problem a different way. We
can't acquire an sx lock in ttyinfo() because ttyinfo() can be called
from interrupt handlers (such as atkbd_intr()).  Instead, go back to
locking the process group while we pick a thread to display information for
and hold that lock until after we drop sched_lock to make sure the
process doesn't exit out from under us.  sched_lock ensures that the
specific thread from that process doesn't go away.  To protect against
the process exiting after we drop the proc lock but before we dereference
it to lookup the pid and p_comm in the call to ttyprintf(), we now copy
the pid and p_comm to local variables while holding the proc lock.

This problem was found by the recently added TD_NO_SLEEPING assertions for
interrupt handlers.

Tested by:	emaste
MFC after:	1 week
2005-10-27 16:47:28 +00:00
Paul Saab
53f5742d33 Allow 32bit get/setsockopt with SO_SNDTIMEO or SO_RECVTIMEO to work. 2005-10-27 04:26:35 +00:00
Peter Wemm
40c9966a37 Commit something we found useful at work at one point. Add sysctls for
debug.kdb.panic and debug.kdb.trap alongside the existing debug.kdb.enter
sysctl.  'panic' causes a panic, and 'trap' causes a page fault.  We used
these to ensure that crash dumps succeed from those two common failure
modes.  This avoids the need for creating a 'panic' kld module.
2005-10-26 22:40:07 +00:00
John Baldwin
fe486a370a Add a swi_remove() function to teardown software interrupt handlers. For
now it just calls intr_event_remove_handler(), but at some point it might
also be responsible for tearing down interrupt events created via swi_add.
2005-10-26 15:51:05 +00:00