freebsd-skq

Author	SHA1	Message	Date
davidxu	5d50adf57d	Last step to make mq_notify conform to the POSIX standard: if the process has successfully attached a notification request to the message queue via a queue descriptor, closing the file should remove the attachment.	2005-11-30 05:12:03 +00:00
jhb	4b322c88f2	Fix snderr() to not leak the socket buffer lock if an error occurs in sosend(). Robert accidentally changed the snderr() macro to jump to the out label which assumes the lock is already released rather than the release label which drops the lock in his previous change to sosend(). This should fix the recent panics about returning from write(2) with the socket lock held and the most recent LOR on current@.	2005-11-29 23:07:14 +00:00
rwatson	079403d5b7	Move zero copy statistics structure before sosend_copyin(). MFC after: 1 month Reported by: tinderbox, sam	2005-11-28 21:45:36 +00:00
jhb	76c1ae2002	When checking to see if a process has exceeded its time limit, flag the process as over the limit when its time is >= to the limit rather than > the limit. Technically, if p->p_rux.rux_runtime.sec == p->p_pcpulimit and p->p_rux.rux_runtime.frac == 0, the process hasn't exceeded the limit yet. However, having the fraction exactly equal to 0 is rather rare, and it is not worth the overhead to handle that edge case. With just the > comparison, the process would have to exceed its limit by almost a second before it was killed. PR: kern/83192 Submitted by: Maciej Zawadzinski mzawadzinski at gmail dot com Reviewed by: bde MFC after: 1 week	2005-11-28 19:09:08 +00:00
rwatson	45b44c73b7	Break out functionality in sosend() responsible for building mbuf chains and copying in mbufs from the body of the send logic, creating a new function sosend_copyin(). This change makes sosend() almost readable, and will allow the same logic to be used by tailored socket send routines. MFC after: 1 month Reviewed by: andre, glebius	2005-11-28 18:09:03 +00:00
davidxu	0b4ce8e3e1	Fix a stupid compiler warning, remove a redundant line.	2005-11-27 22:59:47 +00:00
davidxu	d7421ba6b3	Change filesystem name from mqueue to mqueuefs for style consistency. Suggested by: rwatson	2005-11-27 08:30:12 +00:00
davidxu	d81e111959	Regen.	2005-11-27 01:23:31 +00:00
davidxu	d0fa8c77de	Don't use OpenBSD syscall numbers, instead, use new syscall numbers for POSIX message queue. Suggested by: rwatson	2005-11-27 01:13:00 +00:00
rwatson	76b544b4b3	Add several aliases for existing clockid_t names to indicate that the application wishes to request high precision time stamps be returned: CLOCK_REALTIME_PRECISE (alias for CLOCK_REALTIME), CLOCK_MONOTONIC_PRECISE (alias for CLOCK_MONOTONIC), and CLOCK_UPTIME_PRECISE (alias for CLOCK_UPTIME). Add experimental low-precision clockid_t names corresponding to these clocks, but implemented using cached timestamps in kernel rather than a full time counter query. This offers a minimum update rate of 1/HZ, but in practice will often be more frequent due to the frequency of time stamping in the kernel: CLOCK_REALTIME_FAST (approximates CLOCK_REALTIME), CLOCK_MONOTONIC_FAST (approximates CLOCK_MONOTONIC), and CLOCK_UPTIME_FAST (approximates CLOCK_UPTIME). Add one additional new clockid_t, CLOCK_SECOND, which returns the current second without performing a full time counter query or cache lookup overhead to make sure the cached timestamp is stable. This is intended to support very low granularity consumers, such as time(3). The names, visibility, and implementation of the above are subject to change, and will not be MFC'd any time soon. The goal is to expose lower quality time measurement to applications willing to sacrifice accuracy in performance critical paths, such as when taking time stamps for the purpose of rescheduling select() and poll() timeouts. Future changes might include retrofitting the time counter infrastructure to allow the "fast" time query mechanisms to use a different time counter, rather than a cached time counter (i.e., TSC). NOTE: With different underlying time mechanisms exposed, using different time query mechanisms in the same application may result in relative non-monotonicity or the appearance of clock stalling for a single clockid_t, as a cached time stamp queried after a precision time stamp lookup may be "before" the time returned by the earlier live time counter query.	2005-11-27 00:55:18 +00:00
davidxu	e674eb31f2	Regen.	2005-11-26 12:45:22 +00:00
davidxu	dac7c81b62	Bring in experimental kernel support for POSIX message queue.	2005-11-26 12:42:35 +00:00
rodrigc	dc0fe47898	In nmount() and vfs_donmount(), do not strcmp() the options in the iovec directly. We need to copyin() the strings in the iovec before we can strcmp() them. Also, when we want to send the errmsg back to userspace, we need to copyout()/copystr() the string. Add a small helper function vfs_getopt_pos() which takes in the name of an option, and returns the array index of the name in the iovec, or -1 if not found. This allows us to locate an option in the iovec without actually manipulating the iovec members directly via strcmp(). Noticed by: kris on sparc64	2005-11-23 20:51:15 +00:00
jdp	88e469fc50	Fix a bug in the loop in sonewconn that makes room on the incomplete connection queue for a new connection. It was removing connections from the wrong list. Submitted by: Paul Mikesell Sponsored by: Isilon Systems MFC after: 1 week	2005-11-22 01:55:29 +00:00
marcel	7fe698f697	Fix bug introduced in revision 1.186: When all file systems have a time stamp of zero, which is the case for example when the root file system is on a read-only medium, we ended up not calling inittodr() at all. A potential uncleanliness existed as well. If multiple file systems had a non-zero time stamp, we would call inittodr() multiple times. While this should not be harmful, it's definitely not ideal. Fix both issues by iterating over the mounted file systems to find the largest time stamp and call inittodr() exactly once with that time stamp. This could of course be a zero time stamp if none of the mounted file systems have a non-zero time stamp. In that case the annoying errors mentioned in the commit log for revision 1.186 still haven't been avoided. The bottom line is that inittodr() should not complain when it gets a time base of zero. At the time of this commit only alpha seems to have that problem. Reported by: Dario Freni (saturnero at freesbie dot org) MFC after: 1 week	2005-11-19 21:51:45 +00:00
rodrigc	9cf0eb5132	Parse more mount options in vfs_donmount(), before vfs_domount() is called. It looks like there are lots of different mount flags checked in vfs_domount(), so we need to do the parsing for these particular mount flags earlier on. The new flags parsed are: async, force, multilabel, noasync, noatime, noclusterr, noclusterw, noexec, nosuid, nosymfollow, snapshot, suiddir, sync, union. Existing code which uses mount() to mount UFS filesystems is not affected, but new code which uses nmount() to mount UFS filesystems should behave better.	2005-11-19 21:22:21 +00:00
andre	73d3dcb9b2	Add CLOCK_UPTIME to clock_gettime(2) reporting the current uptime measured in SI seconds. Sponsored by: TCP/IP Optimization Fundraise 2005	2005-11-18 16:51:13 +00:00
rodrigc	c677f67c67	In vfs_nmount(), check to see if "update" mount option was passed in, and if so, set MNT_UPDATE filesystem flag. vfs_nmount() calls vfs_domount(), and there is special logic inside vfs_domount() if MNT_UPDATE is set. This is very important when we want to do an update mount of the root filesystem, using nmount().	2005-11-18 01:31:10 +00:00
yongari	8b951cd641	Prefer NULL to 0. Add missing lock/unlock in sysctl handler. Protect against accessing a NULL pointer when resource allocation fails. style(9) Reviewed by: scottl MFC after: 1 week	2005-11-17 08:56:21 +00:00
cognet	48c06903ba	Add a new sysctl, kern.elf[32|64].can_exec_dyn. When set to 1, one can execute an ET_DYN binary (shared object). This does not make much sense, but some linux scripts expect to be able to execute /lib/ld-linux.so.2 (ldd comes to mind). The sysctl defaults to 0. MFC after: 3 days	2005-11-14 22:24:00 +00:00
rwatson	2fab30d9d4	In ktr_getrequest(), acquire ktrace_mtx earlier -- while the race currently present is minor and offers no real semantic issues, it also doesn't make sense since an earlier lockless check has already occurred. Also hold the mutex longer, over a manipulation of per-process ktrace state, which requires synchronization. MFC after: 1 month Pointed out by: jhb	2005-11-14 19:30:09 +00:00
rwatson	2a5785fb21	Moderate rewrite of kernel ktrace code to attempt to generally improve reliability when tracing fast-moving processes or writing traces to slow file systems by avoiding unbounded queueing and dropped records. Record loss was previously possible when the global pool of records became depleted as a result of record generation outstripping record commit, which occurred quickly in many common situations. These changes partially restore the 4.x model of committing ktrace records at the point of trace generation (synchronous), but maintain the 5.x deferred record commit behavior (asynchronous) for situations where entering VFS and sleeping is not possible (i.e., in the scheduler). Records are now queued per-process as opposed to globally, with processes responsible for committing records from their own context as required. - Eliminate the ktrace worker thread and global record queue, as they are no longer used. Keep the global free record list, as records are still used. - Add a per-process record queue, which will hold any asynchronously generated records, such as from context switches. This replaces the global queue as the place to submit asynchronous records to. - When a record is committed asynchronously, simply queue it to the process. - When a record is committed synchronously, first drain any pending per-process records in order to maintain ordering as best we can. Currently ordering between competing threads is provided via a global ktrace_sx, but a per-process flag or lock may be desirable in the future. - When a process returns to user space following a system call, trap, signal delivery, etc, flush any pending records. - When a process exits, flush any pending records. - Assert on process tear-down that there are no pending records. - Slightly abstract the notion of being "in ktrace", which is used to prevent the recursive generation of records, as well as generating traces for ktrace events. Future work here might look at changing the set of events marked for synchronous and asynchronous record generation, re-balancing queue depth, timeliness of commit to disk, and so on. I.e., performing a drain every (n) records. MFC after: 1 month Discussed with: jhb Requested by: Marc Olzheim <marcolz at stack dot nl>	2005-11-13 13:27:44 +00:00
rodrigc	2630cf9721	style(9) cleanups. Spotted by: njl, bde	2005-11-12 14:41:44 +00:00
rwatson	257af099d1	Significant refactoring of the accounting code to improve locking and VFS happiness, as well as correct other bugs: - Replace notion of current and saved accounting credential/vnode with a single credential/vnode and an acct_suspended flag. This simplifies the accounting logic substantially. - Replace acct_mtx with acct_sx, a sleepable lock held exclusively during reconfiguration and space polling, but shared during log entry generation. This avoids holding a mutex over sleepable VFS operations. - Hold the sx lock over the duration of the I/O so that the vnode I/O cannot occur after vnode close, which could occur previously if accounting was disabled as a process exited. - Write the accounting log entry with Giant conditionally acquired based on the file system where the log is stored. Previously, the accounting code relied on the caller acquiring Giant. - Acquire Giant conditionally in the accounting callout based on the file system where the accounting log is stored. Run the callout MPSAFE. - Expose acct_suspended via a read-only sysctl so it is possible to programmatically determine whether accounting is suspended or not without attempting to parse logs. - Check both acct_vp and acct_suspended lock-free before entering the accounting sx lock in acct(). - When accounting is disabled due to a VBAD vnode (i.e., forcible unmount), generate a log message indicating accounting has been disabled. - Correct a long-standing bug in how free space is calculated and compared to the required space: generate and compare signed results, not unsigned results, or negative free space will cause accounting to not be suspended when required, or worse, incorrectly resumed once negative free space is reached. MFC after: 2 weeks	2005-11-12 10:45:13 +00:00
davidxu	d5fcf7dfa0	Make sure the debugger removes only one signal.	2005-11-12 04:22:16 +00:00
rwatson	9487c057e2	Correct a number of serious and closely related bugs in the UNIX domain socket file descriptor garbage collection code, which is intended to detect and clear cycles of orphaned file descriptors that are "in-flight" in a socket when that socket is closed before they are received. The algorithm present was both run at poor times (resulting in recursion and reentrance), and also buggy in the presence of parallelism. In order to fix these problems, make the following changes: - When there are in-flight sockets and a UNIX domain socket is destroyed, asynchronously schedule the garbage collector, rather than running it synchronously in the current context. This avoids lock order issues when the garbage collection code reenters the UNIX domain socket code, avoiding lock order reversals, deadlocks, etc. Run the code asynchronously in a task queue. - In the garbage collector, when skipping file descriptors that have entered a closing state (i.e., have f_count == 0), re-test the FDEFER flag, and decrement unp_defer. As file descriptors can now transition to a closed state while the garbage collector is running, it is no longer the case that unp_defer will remain an accurate count of deferred sockets in the mark portion of the GC algorithm. Otherwise, the garbage collector will loop waiting for unp_defer to reach zero, which it will never do as it is skipping file descriptors that were marked in an earlier pass, but now closed. - Acquire the UNIX domain socket subsystem lock in unp_discard() when modifying the unp_rights counter, or a read/write race is risked with other threads also manipulating the counter. While here: - Remove #if 0'd code regarding acquiring the socket buffer sleep lock in the garbage collector; this is not required as we are able to use the socket buffer receive lock to protect scanning the receive buffer for in-flight file descriptors on the socket buffer. - Annotate that the description of the garbage collector implementation is increasingly inaccurate and needs to be updated. - Add counters of the number of deferred garbage collections and recycled file descriptors. This will be removed and is here temporarily for debugging purposes. With these changes in place, the unp_passfd regression test now appears to pass consistently on UP and SMP systems for extended runs, whereas before it hung quickly or panicked, depending on which bug was triggered. Reported by: Philip Kizer <pckizer at nostrum dot com> MFC after: 2 weeks	2005-11-10 16:06:04 +00:00
rwatson	3153d02ada	Add the f_msgcount field to the set of struct file fields printed in show files. MFC after: 1 week	2005-11-10 13:26:29 +00:00
rwatson	dcccc2e254	Expand the set of details printed for each file descriptor to include its garbage collection flags. Reformat generally to make this fit and leave some room for future expansion. MFC after: 1 week	2005-11-10 11:35:59 +00:00
rwatson	20a1214886	Add a DDB "show files" command to list the current open file list, some state about each open file, and identify the first process in the process table that references the file. This is helpful in debugging leaks of file descriptors. MFC after: 1 week	2005-11-10 10:42:50 +00:00
dwhite	0bcdf7c033	This is a workaround for a complicated issue involving VFS cookies and devfs. The PR and patch have the details. The ultimate fix requires architectural changes and clarifications to the VFS API, but this will prevent the system from panicking when someone does "ls /dev" while running in a shell under the linuxulator. This issue affects HEAD and RELENG_6 only. PR: 88249 Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com> MFC after: 3 days	2005-11-09 22:03:50 +00:00
rwatson	fc360a564f	Fix typo in recent comment tweak. Submitted by: jkim MFC after: 1 week	2005-11-09 22:02:02 +00:00
rwatson	6b8f490b77	In closef(), remove the assumption that there is a thread associated with the file descriptor. When a file descriptor is closed as a result of garbage collecting a UNIX domain socket, the file descriptor will not have any associated thread, so the logic to identify advisory locks held by that thread is not appropriate. Check the thread for NULL to avoid this scenario. Expand an existing comment to say a bit more about this. MFC after: 1 week	2005-11-09 20:54:25 +00:00
imp	53b73d2a31	General consensus is that it would be even better to run this in a thread context. While it doesn't matter too much at the moment, in the future we could be back in the same boat if/when more restrictions are placed (or enforced) in a SWI. Suggested by: njl, bde, jhb, scottl	2005-11-09 16:22:56 +00:00
jhb	e53f1ca06b	Use intptr_t casts to convert void * <--> int to make 64-bit archs happy.	2005-11-09 15:15:59 +00:00
ru	dcace5669d	Use sparse initializers for "struct domain" and "struct protosw", so they are easier to follow for the human being.	2005-11-09 13:29:16 +00:00
davidxu	f9da852761	WIFxxx macros require an int type but p_xstat is short; convert it to int before using the macros. Bug reported by: Pyun YongHyeon pyunyh at gmail dot com	2005-11-09 07:58:16 +00:00
imp	a528ef30b2	Kick off the suspend sequence from the keyboard in a SWI rather than in the hardware interrupt context (even if it is likely just an ithread). We don't document that suspend/resume routines are run from such a context and some of the things that happen in those routines aren't interrupt safe. Since there's no real need to run from that context, this restores assumptions that suspend routines have made. This fixes Thierry Herbelot's 'Trying to sleep while sleeping is prohibited' problem.	2005-11-09 07:32:01 +00:00
imp	2f1cffe264	Clarify panic message, I parsed the old one 'trying to sleep while sleeping'	2005-11-09 07:28:52 +00:00
rodrigc	2cbc12617e	For nmount(), allow a text string error message to be propagated back to user-space if a parameter named "errmsg" is passed into the iovec. Used in conjunction with vfs_mount_error(), more useful error messages than errno can be passed back to userspace when mounting a filesystem fails. Discussed with: phk, pjd	2005-11-09 02:26:38 +00:00
davidxu	ce1172e446	In aio_waitcomplete, do not return EAGAIN if no other threads have started aio; instead, initialize the aio management structure if that hasn't been done. The reason for adjusting this behavior is to make it friendlier to threaded programs. Consider two threads: one submits aio_write, and another just calls aio_waitcomplete to wait for any I/O to complete and recycle the aio requests. The recycler may want to wait in the kernel before the submitter has done any I/O. This also fixes an inconsistency with other aio syscalls.	2005-11-08 23:48:32 +00:00
davidxu	e4f3a40860	Make sure pending SIGCHLD is removed from previous parent when process is attached or detached.	2005-11-08 23:28:12 +00:00
jhb	7b0555d459	Various and sundry cleanups: - Use curthread for calls to knlist_delete() and add a big comment explaining why as well as appropriate assertions. - Use TAILQ_FOREACH and TAILQ_FOREACH_SAFE instead of handrolling them. - Use fget() family of functions to lookup file objects instead of grovelling around in file descriptor tables. - Destroy the aio_freeproc mutex if we are unloaded. Tested on: i386	2005-11-08 17:43:05 +00:00
csjp	62ab0fa062	Giant clean up for exit(2): - Change unconditional acquisition of Giant to only pick up Giant if the vnode for the controlling tty resides on a non-mpsafe file system. - Pick up Giant around executable vnode reference counting operations only if the executable resides on a non-mpsafe file system. - If this process is being traced, pick up Giant for trace file reference count operations only if it resides on a non-mpsafe file system. Discussed with: jhb Tested by: kris	2005-11-08 17:11:03 +00:00
davidxu	37bb483679	Add support for queueing SIGCHLD, as other UNIX systems do. For each child process whose status has changed, a SIGCHLD instance is queued; if the signal is still pending and the process changes status several times, the signal information is updated to reflect the latest process status. If wait() returns because the status of a child process is available, the pending SIGCHLD signal associated with that child process is discarded. Any other pending SIGCHLD signals remain pending. The signal information is allocated at the same time as the proc structure, so even if the process signal queue is full or there is a memory shortage, the signal can still be sent to the process. There is a boot-time tunable, kern.sigqueue.queue_sigchild, which controls the behavior: setting it to zero disables the SIGCHLD queueing feature. The tunable will be removed once the feature has proven stable enough. Tested on: i386 (SMP and UP)	2005-11-08 09:09:26 +00:00
rodrigc	ee849009e5	Add utility function to propagate mount errors as text string messages. Discussed with: phk	2005-11-08 04:13:39 +00:00
glebius	33ac50a107	Fix panic string in last revision.	2005-11-06 16:47:59 +00:00
andre	07bbeaa756	Free only those mbuf+clusters back to the packet zone that were allocated from there. All others get broken up and free'd individually to the mbuf and cluster zones. The packet zone is a secondary zone to the mbuf zone. There is currently a limitation in UMA which prevents decreasing the packet zone stock when the mbuf and cluster zone are drained and all their members are part of packets. When this is fixed this change may be reverted.	2005-11-05 19:43:55 +00:00
andre	438cb7bde7	Fix a logic error introduced with mandatory mbuf cluster refcounting and freeing of mbufs+clusters back to the packet zone.	2005-11-04 17:20:53 +00:00
davidxu	ae161ac239	Fix name compatibility problem with the POSIX standard: sigval_ptr and sigval_int really should be sival_ptr and sival_int. Also, sigev_notify_function accepts a union sigval value, not a pointer.	2005-11-04 09:41:00 +00:00
jhb	b2c57e8bc8	Add stoppcbs[] arrays on Alpha and sparc64 and have each CPU save its current context in the IPI_STOP handler so that we can get accurate stack traces of threads on other CPUs on these two archs like we do now on i386 and amd64. Tested on: alpha, sparc64	2005-11-03 21:08:20 +00:00
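
The time-limit change in commit 76c1ae2002 can be illustrated with a small userland sketch. This is not the kernel code: `struct runtime` and both function names are hypothetical stand-ins, and the limit is assumed to be expressed in whole seconds as the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a runtime value kept as whole seconds plus a
 * binary fraction of a second. Not the kernel's actual structure.
 */
struct runtime {
	int64_t  sec;
	uint64_t frac;
};

/* Old check: the process is flagged only once sec strictly exceeds the limit. */
static bool
over_limit_strict(const struct runtime *rt, int64_t limit_sec)
{
	assert(rt != NULL);
	return (rt->sec > limit_sec);
}

/*
 * New check: flag as soon as sec reaches the limit. The fraction is
 * deliberately ignored, so the rare case of frac == 0 is flagged slightly
 * early instead of paying for a second comparison.
 */
static bool
over_limit_inclusive(const struct runtime *rt, int64_t limit_sec)
{
	assert(rt != NULL);
	return (rt->sec >= limit_sec);
}
```

With a limit of 10 seconds and a runtime of 10 seconds plus any fraction, the strict check does not fire until the seconds field reaches 11, which is the "almost a second" overshoot the commit message mentions.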
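
The free-space bug described in commit 257af099d1 comes down to comparing a possibly negative quantity as unsigned. The sketch below is an assumption-laden illustration, not the accounting code itself: the function names, the percentage-based threshold, and the parameter names are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy variant: a negative available-block count converted to unsigned
 * becomes a huge positive value, so the "low on space" test never fires.
 */
static bool
should_suspend_unsigned(int64_t bavail, int64_t blocks, int suspend_pct)
{
	uint64_t avail = (uint64_t)bavail;
	uint64_t threshold = (uint64_t)blocks * (uint64_t)suspend_pct / 100;

	return (avail <= threshold);
}

/*
 * Corrected variant: keep both sides signed so that negative free space
 * compares as smaller than any non-negative threshold.
 */
static bool
should_suspend_signed(int64_t bavail, int64_t blocks, int suspend_pct)
{
	int64_t threshold = blocks * suspend_pct / 100;

	return (bavail <= threshold);
}
```

Both variants agree whenever the available count is non-negative; they diverge only in the negative case, which is exactly where suspension matters most.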
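
Commit dc0fe47898 describes a helper that returns the array index of an option name, or -1 if it is not found. A rough userland sketch of that lookup follows. It assumes the option strings have already been copied into accessible memory and that names and values alternate in the array; `opt_name_pos` is a hypothetical name, not the kernel's vfs_getopt_pos(), whose real signature is not shown in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical lookup over name/value pairs stored in an iovec array.
 * Names are assumed to sit at even indices, each including its
 * terminating NUL in iov_len. Returns the index of the matching name,
 * or -1 if no name matches.
 */
static int
opt_name_pos(const struct iovec *iov, int iovcnt, const char *name)
{
	size_t len;
	int i;

	assert(iov != NULL && name != NULL);
	len = strlen(name) + 1;		/* count the terminating NUL */

	for (i = 0; i + 1 < iovcnt; i += 2) {
		if (iov[i].iov_len == len &&
		    memcmp(iov[i].iov_base, name, len) == 0)
			return (i);
	}
	return (-1);
}
```

Returning an index rather than a pointer lets the caller find the matching value slot at `i + 1`, for example to locate the buffer supplied for "errmsg".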

8869 Commits