freebsd-skq

Author	SHA1	Message	Date
ed	7eb7818496	Allow the user to suppress the rate-limited pty(4) warning. The pty(4) driver raises up to warnings when an old BSD-style PTY is created. The reason why I added this warning, was to make it easier to spot applications that allocate BSD-style PTY's, while they should just use openpty() or posix_openpt(). Add a sysctl, which allows you to override the number of remaining messages, making it possible to suppress the warnings. Requested by: kib Reviewed by: kib	2008-08-23 16:03:00 +00:00
rwatson	78a117e6fa	Introduce two related changes to the TrustedBSD MAC Framework: (1) Abstract interpreter vnode labeling in execve(2) and mac_execve(2) so that the general exec code isn't aware of the details of allocating, copying, and freeing labels, rather, simply passes in a void pointer to start and stop functions that will be used by the framework. This change will be MFC'd. (2) Introduce a new flags field to the MAC_POLICY_SET(9) interface allowing policies to declare which types of objects require label allocation, initialization, and destruction, and define a set of flags covering various supported object types (MPC_OBJECT_PROC, MPC_OBJECT_VNODE, MPC_OBJECT_INPCB, ...). This change reduces the overhead of compiling the MAC Framework into the kernel if policies aren't loaded, or if policies require labels on only a small number or even no object types. Each time a policy is loaded or unloaded, we recalculate a mask of labeled object types across all policies present in the system. Eliminate MAC_ALWAYS_LABEL_MBUF option as it is no longer required. MFC after: 1 week ((1) only) Reviewed by: csjp Obtained from: TrustedBSD Project Sponsored by: Apple, Inc.	2008-08-23 15:26:36 +00:00
jhb	6c646c3b72	Fix a race condition with concurrent LOOKUP namecache operations for a vnode not in the namecache when shared lookups are enabled (vfs.lookup_shared=1, it is currently off by default) and the filesystem supports shared lookups (e.g. NFS client). Specifically, if multiple concurrent LOOKUPs miss in the name cache in parallel, each of the lookups may end up adding an entry to the namecache resulting in duplicate entries in the namecache for the same pathname. A subsequent removal of the mapping of that pathname to that vnode (via remove or rename) would only evict one of the entries from the name cache. As a result, subsequent lookups for that pathname would still return the old vnode. This race was observed with shared lookups over NFS where a file was updated by writing a new file out to a temporary file name and then renaming that temporary file to the "real" file to effect atomic updates of a file. Other processes on the same client that were periodically reading the file would occasionally receive an ESTALE error from open(2) because the VOP_GETATTR() in nfs_open() would receive that error when given the stale vnode. The fix here is to check for duplicates in cache_enter() and just return if an entry for this same directory and leaf file name for this vnode is already in the cache. The check for duplicates is done by walking the per-vnode list of name cache entries. It is expected that this list should be very small in the common case (usually 0 or 1 entries during a cache_enter() since most files only have 1 "leaf" name). Reviewed by: ups, scottl MFC after: 2 months	2008-08-23 15:13:39 +00:00
ed	b738ca88a2	Remove unused tty_gone() checks inside ttyoutq_read_uio(). When my earlier MPSAFE TTY prototypes still implemented line disciplines, we needed a mechanism to abort read()'s on PTY master devices when inside the line discipline. Because this is no longer the case, these checks have become unneeded.	2008-08-23 13:32:21 +00:00
rodrigc	bef9b4336c	In nmount(), when we see the "force" option, set the MNT_FORCE flag, but do not persist "force" in the options list, since it is a command, not a persistent property of a mount. Similarly, when we see "reload", set MNT_RELOAD, but delete "reload" from the options list. MFC after: 1 week	2008-08-23 01:16:09 +00:00
kmacy	df8989694a	Submit a band-aid for interrupt set up race. MFC after: 1 month	2008-08-22 23:24:53 +00:00
ed	a6b774bc3b	Fix two small bugs in tcsetattr(). - According to POSIX, tcsetattr() must not fail when any of the bits in the structure are unsupported, but it must leave the unsupported flags alone. - The CIGNORE flag (set by TCSASOFT, extension) was not cleared from c_cflag, which means using it would cause it to be applied during its entire lifespan. Eventually make sure we clear the flag. I don't really like CIGNORE, but I think we must keep it alive right now. With our new TTY layer, we don't actually need this mechanism, because if you leave c_cflag, c_ispeed and c_ospeed alone, we won't make a call into the device driver anyway. Reported by: naddy Tested by: naddy	2008-08-22 21:27:37 +00:00
jhb	b054f3f992	A suspended thread can, in fact, be swapped out. Thus, thread_unsuspend_one() needs to optionally wakeup the swapper. Since we hold the thread lock for that entire function, however, we have to push that requirement up into the caller. Found by: rwatson	2008-08-22 16:15:58 +00:00
jhb	b908d9aa36	Use |= rather than += when aggregating requests to wake up the swapper. What we really want is an inclusive or of all the requests, and += can in theory roll over to 0.	2008-08-22 16:14:23 +00:00
ed	7c4fe3955e	Fix pts(4) error codes when slave device is closed. Unlike pre-MPSAFE TTY, the pts(4) driver always returned ENXIO when a read() or write() was performed on a pseudo-terminal master device when the slave device was not opened. The old implementation had different semantics: - When the slave device had not been opened yet, read() and write() just blocked. - When the slave device had been closed, a read() call would return 0 bytes length. - When the slave device had been closed, a write() call would return EIO. Change the new implementation to return 0 and EIO as well. We don't implement the first rule, but I suspect this is not needed, because routines like openpty() also open the slave device node. posix_openpt() users also do similar things. Reported by: rink Tested by: rink	2008-08-22 10:40:21 +00:00
ed	deab1dbbf7	Prevent VSTART flooding when turning on software flow control. It turned out we transmitted VSTART after each successful read on a TTY when software flow control was turned on. This was because of a very evil bug where we tested the TF_HIWAT_IN flag the other way around. Reported by: Christian Weisgerber <naddy mips inka de>	2008-08-22 05:15:52 +00:00
obrien	3b12eba1b0	Add comments on NOARGS, NODEF, and NOPROTO.	2008-08-21 22:57:31 +00:00
ed	2be6ecbc22	Properly lock proctree_lock before locking the process while accounting. During the import of the MPSAFE TTY layer (r181905), I changed acct_process() to lock proctree_lock instead of SESS_LOCK, because s_ttyp is now locked using proctree_lock. One of the things I forgot, was to lock it before we PROC_LOCK. Commit this patch, written by kib@. To ensure we hold proctree_lock as short as possible, obtaining `ac_tty' has now been made the first step of filling `acct'. Reported by: Kevin <kevinxlinuz 163 com> Solved by: kib	2008-08-21 15:02:17 +00:00
ed	ae0c3320a7	Remove the now unused `lbolt' variable from the kernel. We used to have a single wait channel inside the kernel which could be used by threads that just wanted to sleep for some time (the next second). The old TTY layer was the only piece of code that still used lbolt, because I already removed the use of lbolt from the NFS clients and the VFS syncer. Approved by: philip	2008-08-20 12:20:22 +00:00
kmacy	37e5c521d0	remove scheduler_running as xenbus no longer needs it MFC after: 1 month	2008-08-20 09:21:24 +00:00
ed	4b93c9151b	Update system call tables. The previous commit also included changes to all the system call lists, but it is a tradition to update these lists in a second commit, so rerun make sysent to update the $FreeBSD$ tags inside these files to refer to the latest version of syscalls.master. Requested by: rwatson	2008-08-20 08:39:10 +00:00
ed	cc3116a938	Integrate the new MPSAFE TTY layer to the FreeBSD operating system. The last half year I've been working on a replacement TTY layer for the FreeBSD kernel. The new TTY layer was designed to improve the following: - Improved driver model: The old TTY layer has a driver model that is not abstract enough to make it friendly to use. A good example is the output path, where the device drivers directly access the output buffers. This means that an in-kernel PPP implementation must always convert network buffers into TTY buffers. If a PPP implementation would be built on top of the new TTY layer (still needs a hooks layer, though), it would allow the PPP implementation to directly hand the data to the TTY driver. - Improved hotplugging: With the old TTY layer, it isn't entirely safe to destroy TTY's from the system. This implementation has a two-step destructing design, where the driver first abandons the TTY. After all threads have left the TTY, the TTY layer calls a routine in the driver, which can be used to free resources (unit numbers, etc). The pts(4) driver also implements this feature, which means posix_openpt() will now return PTY's that are created on the fly. - Improved performance: One of the major improvements is the per-TTY mutex, which is expected to improve scalability when compared to the old Giant locking. Another change is the unbuffered copying to userspace, which is both used on TTY device nodes and PTY masters. Upgrading should be quite straightforward. Unlike previous versions, existing kernel configuration files do not need to be changed, except when they reference device drivers that are listed in UPDATING. Obtained from: //depot/projects/mpsafetty/... Approved by: philip (ex-mentor) Discussed: on the lists, at BSDCan, at the DevSummit Sponsored by: Snow B.V., the Netherlands dcons(4) fixed by: kan	2008-08-20 08:31:58 +00:00
kib	ee27d03e64	In brelse, put the B_NEEDSGIANT buffer on the QUEUE_DIRTY_GIANT queue, instead of QUEUE_DIRTY. Tested by: pho Reviewed by: attilio MFC after: 3 days	2008-08-19 11:31:49 +00:00
bz	1021d43b56	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch	2008-08-17 23:27:27 +00:00
alfred	f8f6317629	Prevent crashes due to unlocked access to hash buckets in two sysctls. Use CACHE_LOCK to prevent crashes. Sysctls fixed: debug.hashstat.nchash and debug.hashstat.rawnchash. Obtained from: Juniper Networks MFC After: 1 week	2008-08-16 21:48:10 +00:00
kmacy	716fc76367	Add flag to indicate to xen support code that threads are running (and thus we can block). MFC after: 1 month	2008-08-15 21:03:13 +00:00
attilio	ff459eb3cf	Introduce some WITNESS improvements: - Speed up the lock order lookup by changing the witness graph from a linked tree to a matrix. A table lookup caches the lock orderings in order to give O(1) access to them. Every witness object has a unique index within this lookup cache table. - Reduce the lock contention on w_mtx by acquiring it only when a LOR actually happens and not in the sane case. In order to do this, don't totally flush lock lists (per-CPU spinlocks list and per-thread sleeplocks list) but check ll_count any time we need to verify allocation sanity. - Introduce the function witness_thread_exit() in the witness namespace, which should verify that a thread doesn't hold any witness occurrence while exiting. - Rename the sysctl debug.witness.graphs to debug.witness.fullgraph and add debug.witness.badstacks, which prints out stacks for the LORs revealed. This is implemented using the stack(9) support, which makes WITNESS dependent on the STACK option or on the DDB (including STACK) option. - Fix style(9) for src/sys/kern/subr_witness.c The hash table approach has been developed by Ilya Maykov on behalf of Isilon Systems, which kindly released the patch. Jeff Roberson ported the patch to -CURRENT and fixed w_mtx contention, on behalf of Nokia. Submitted by: Ilya Maykov <ivmaykov at gmail dot com> (Isilon Systems), jeff Sponsored by: Nokia	2008-08-13 18:24:22 +00:00
csjp	0cdadff20e	Reduce the scope of the vnode lock such that it does not cover the various copyouts associated with initializing the process's argv/env data in userspace. It is possible that these copyout operations can fault under memory pressure, possibly resulting in deadlocks. This is believed to be safe since none of the copyout_strings() operations need to interact with the vnode here. Submitted by: Zhouyi Zhou PR: kern/111260 Discussed with: kib MFC after: 3 weeks	2008-08-12 21:27:48 +00:00
kib	ca3c43733a	Revert r181345. Move the NULL pointer check to the vfs_deleteopt() function. Discussed with: rodrigc MFC after: 3 days	2008-08-10 12:15:36 +00:00
ed	746d949d89	Remove unneeded D_NEEDGIANT from /dev/fd/{0,1,2}. There is no reason the fdopen() routine needs Giant. It only sets curthread->td_dupfd, based on the device unit number of the cdev. I guess we won't get massive performance improvements here, but still, I assume we eventually want to get rid of Giant.	2008-08-09 12:42:12 +00:00
des	c2c1c946ae	Add sbuf_new_auto as a shortcut for the very common case of creating a completely dynamic sbuf. Obtained from: Varnish MFC after: 2 weeks	2008-08-09 11:14:05 +00:00
des	50ef01bba1	Switch to simplified BSD license (with phk's approval), plus whitespace and style(9) cleanup.	2008-08-09 10:26:21 +00:00
jhb	e306c86e1b	Permit Giant to be passed as the explicit interlock either to msleep/mtx_sleep or the various cv_wait() routines. Currently, the "unlock" behavior of PDROP and cv_wait_unlock() with Giant is not permitted as it would be confusing since Giant is fully unrecursed and unlocked during a thread sleep. This is handy for subsystems which wish to allow unlocked drivers to continue to use Giant such as CAM, the new TTY layer, and the new USB stack. CAM currently uses a hack that I told Scott to use because I really didn't want to permit this behavior, and the TTY and USB patches both have various patches to permit this. MFC after: 2 weeks	2008-08-07 21:00:13 +00:00
jhb	8af56fb687	If a thread that is swapped out is made runnable, then the setrunnable() routine wakes up proc0 so that proc0 can swap the thread back in. Historically, this has been done by waking up proc0 directly from setrunnable() itself via a wakeup(). When waking up a sleeping thread that was swapped out (the usual case when waking proc0 since only sleeping threads are eligible to be swapped out), this resulted in a bit of recursion (e.g. wakeup() -> setrunnable() -> wakeup()). With sleep queues having separate locks in 6.x and later, this caused a spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock). An attempt was made to fix this in 7.0 by making the proc0 wakeup use the ithread mechanism for doing the wakeup. However, this required grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated into the same LOR since the thread lock would be some other sleepq lock. Fix this by deferring the wakeup of the swapper until after the sleepq lock held by the upper layer has been released. The setrunnable() routine now returns a boolean value to indicate whether or not proc0 needs to be woken up. The end result is that consumers of the sleepq API such as *sleep/wakeup, condition variables, sx locks, and lockmgr, have to wake up proc0 if they get a non-zero return value from sleepq_abort(), sleepq_broadcast(), or sleepq_signal(). Discussed with: jeff Glanced at by: sam Tested by: Jurgen Weber jurgen - ish com au MFC after: 2 weeks	2008-08-05 20:02:31 +00:00
jhb	11d83a7f89	Close two different races with concurrent opens of pty master devices that could result in leaked ttys or a leaked pty + tty pair. MFC after: 1 week	2008-08-04 19:51:23 +00:00
jhb	e6ae2f5414	- Close a race with concurrent opens of a pts master device which could result in leaked tty structures. - When constructing a new pty, allocate its tty structure before adding it to the list. MFC after: 1 week	2008-08-04 19:49:05 +00:00
antoine	4dc3acdf62	Kill a dead variable PR: 126223 Submitted by: Mateusz Guzik	2008-08-03 21:07:19 +00:00
rwatson	8883e5c019	Remove broken code to replace st_mode value with ACCESSPERMS when lstat(2) is called on symlinks -- this code appears never to have worked. The PR this addresses suggests that the intended original behavior is the right one, but as bde points out in the PR comments, we do actually support storing a mode on symlinks, so returning it seems reasonable. This is consistent with Mac OS X, which despite documentation to the contrary does return the mode set on a symlink, but not some other platforms. The Single Unix Spec requires only that the returned bits be "meaningful", which seems at best unhelpful as advice goes. PR: 25018 MFC after: 3 days	2008-08-03 15:44:56 +00:00
kib	52aa4f35d0	Calling linker_load_dependencies() while holding the module's vnode lock may cause a LOR between kld_sx lock and vnode lock. linker_load_dependencies() drops kld_sx, and another thread may attempt to load the same kld. Reported and tested by: pjd MFC after: 1 week	2008-08-03 13:33:45 +00:00
sam	f28149353a	add callout_schedule; besides being useful it also improves compatibility with other systems Reviewed by: ed, battlez	2008-08-02 17:42:38 +00:00
csjp	743d0edd92	Currently, BSM audit pathname token generation for chrooted or jailed processes is not producing absolute pathname tokens. It is required that audited pathnames are generated relative to the global root mount point. This modification changes our implementation of audit_canon_path(9) and introduces a new function: vn_fullpath_global(9) which performs a vnode -> pathname translation relative to the global mount point based on the contents of the name cache. Much like vn_fullpath, vn_fullpath_global is a wrapper function which calls vn_fullpath1. Further, the string parsing routines have been converted to use the sbuf(9) framework. This change also removes the conditional acquisition of Giant, since the vn_fullpath1 method will not dip into file system dependent code. The vnode locking was modified to use vhold()/vdrop() instead of vref() and vrele(). This will modify the hold count instead of modifying the user count. This makes more sense since it's the kernel that requires the reference to the vnode. This also makes sure that the vnode does not get recycled while we hold the reference to it. [1] Discussed with: rwatson Reviewed by: kib [1] MFC after: 2 weeks	2008-07-31 16:57:41 +00:00
ed	faa0cddcb0	Remove the use of lbolt from the VFS syncer. It seems we only use `lbolt' inside the VFS syncer and the TTY layer now. Because I'm planning to replace the TTY layer next month, there's no reason to keep `lbolt' if it's only used in a single thread inside the kernel. Because the syncer code wanted to wake up the syncer thread before the timeout, it called sleepq_remove(). Because we now just use a condvar(9) with a timeout value of `hz', we can wake it up using cv_broadcast() without waking up any unrelated threads. Reviewed by: phk	2008-07-30 12:39:18 +00:00
ed	870de37626	Don't make subr_clist.c depend on the TTY layer. After the import of the new TTY layer, the TTY_QUOTE definition will not be present anymore. To make sure clists will still work as expected, introduce an internal definition called QUOTEMASK. Maybe we can decide to remove the quote bits entirely, but we still have to look into this. There may be drivers that still use the quote bits. Obtained from: //depot/projects/mpsafetty	2008-07-30 12:32:42 +00:00
jhb	bed722e078	When choosing a CPU for a thread in a cpuset, prefer the last CPU that the thread ran on if there are no other CPUs in the set with a shorter per-CPU runqueue.	2008-07-28 20:39:21 +00:00
jhb	421b41fe8c	Really fix this.	2008-07-28 18:33:43 +00:00
pjd	642dbd51b0	Properly check if td_name is empty and if it is, print process name, instead of empty thread name. Reviewed by: jhb	2008-07-28 18:10:26 +00:00
jhb	68f0af82de	Implement support for cpusets in the 4BSD scheduler. - When a cpuset is applied to a thread, walk the cpuset to see if it is a "full" cpuset (includes all available CPUs). If not, set a new TDS_AFFINITY flag to indicate that this thread can't run on all CPUs. When inheriting a cpuset from another thread during thread creation, the new thread also inherits this flag. It is in a new ts_flags field in td_sched rather than using one of the TDF_SCHEDx flags because fork() clears td_flags after invoking sched_fork(). - When placing a thread on a runqueue via sched_add(), if the thread is not pinned or bound but has the TDS_AFFINITY flag set, then invoke a new routine (sched_pickcpu()) to pick a CPU for the thread to run on next. sched_pickcpu() walks the cpuset and picks the CPU with the shortest per-CPU runqueue length. Note that the reason for the TDS_AFFINITY flag is to avoid having to walk the cpuset and examine runq lengths in the common case. - To avoid walking the per-CPU runqueues in sched_pickcpu(), add an array of counters to hold the length of the per-CPU runqueues and update them when adding and removing threads to per-CPU runqueues. MFC after: 2 weeks	2008-07-28 17:25:24 +00:00
jhb	69cc3c8c8a	Various and sundry style and whitespace fixes.	2008-07-28 15:52:02 +00:00
kmacy	70741e0245	- track maximum wait time - resize columns based on actual observed numerical values MFC after: 3 days	2008-07-27 21:45:20 +00:00
pjd	3f1807709d	Assert for exclusive vnode lock in vinactive(), vrecycle() and vgonel() functions. Reviewed by: kib	2008-07-27 11:48:15 +00:00
pjd	4dd19696a7	- Move vp test for being NULL under IGNORE_LOCK(). - Check if panicstr isn't set, if it is ignore the lock. This helps to avoid confusion, because lockmgr is a no-op when panicstr isn't NULL, so asserting anything at this point doesn't make sense and can just race with other panic. Discussed with: kib	2008-07-27 11:46:42 +00:00
trhodes	56ab14a8ae	Fill in a few sysctl descriptions. Approved by: rwatson	2008-07-26 00:55:35 +00:00
ed	c9af5459f4	Move ttyinfo() into its own C file. The ttyinfo() routine generates the fancy output when pressing ^T. Right now it is stored in tty.c. In the MPSAFE TTY code it is already stored in tty_info.c. To make integration of the MPSAFE TTY code a little easier, take the same approach. This makes the TTY code a little bit more readable, because having the proc_/thread_ routines in tty.c is very distracting. Approved by: philip (mentor)	2008-07-25 14:31:00 +00:00
kib	e2333a32b6	Call pargs_drop() unconditionally in do_execve(), the function correctly handles the NULL argument. Make pargs_free() static. MFC after: 1 week	2008-07-25 11:55:32 +00:00
kib	a0a58ba099	s/alredy/already/ in the comments and the log message.	2008-07-25 11:22:25 +00:00
kib	42aeaf36b0	Do the pargs_hold() on the copy of the pointer to the p_args of the child process immediately after bulk bcopy() without dropping the process lock. Since process is not single-threaded when forking, dropping and reacquiring the lock allows an other thread to change the process title of the parent in between, and results in hold being done on the invalid pointer. The problem manifested itself as the double free of the old p_args. Reported by: kris Reviewed by: jhb MFC after: 1 week	2008-07-23 08:45:25 +00:00
attilio	823ce79a5b	- Disallow XFS mounting in write mode. The write support never really worked and there is no need to maintain it. - Fix vn_get() in order to let it call vget(9) with a valid locking request. vget(9) returns the vnode locked in order to prevent recycling, but in this case internal XFS locks already prevent it from happening, so it is safe to drop the vnode lock before returning from vn_get(). - Add a VNASSERT() in vget(9) in order to catch malformed locking requests. Discussed with: kan, kib Tested by: Lothar Braun <lothar at lobraun dot de>	2008-07-21 23:01:09 +00:00
rwatson	f3c6f1e959	If run_interrupt_driven_config_hooks() waits 360 seconds and INVARIANTS is compiled into the kernel, then panic. MFC after: 3 days Discussed with: scottl	2008-07-21 20:50:49 +00:00
pjd	9d11b5b5b3	Implement the following macros for completeness: SYSCTL_QUAD() SYSCTL_ADD_QUAD() TUNABLE_QUAD() TUNABLE_QUAD_FETCH() Now we can use 64bit tunables on 32bit systems.	2008-07-21 15:05:25 +00:00
kmacy	565bc001a5	Add accessor functions for socket fields. MFC after: 1 week	2008-07-21 00:49:34 +00:00
alc	08181df483	Eliminate dead code. (The commit message for revision 1.287 explains why this code is dead.)	2008-07-20 04:13:51 +00:00
rwatson	b53b96f01c	Rather than simply waiting silently and indefinitely for all interrupt-driven configuration handlers to complete, print out a diagnostic message every 60 seconds indicating which handlers are still running. Do this at most 5 times per run so as to avoid scrolling out any useful information from the kernel message buffer. The interval of 60 seconds was selected based on a best guess as to the nature of "long enough" and may want to be tuned higher or lower depending on real-world tolerances. MFC after: 3 days Discussed with: scottl	2008-07-19 19:08:35 +00:00
rwatson	2df3fcd0c6	witness_addgraph() is required even if DDB isn't compiled into the kernel, so exclude it from #ifdef DDB. Submitted by: attilio	2008-07-19 17:47:23 +00:00
rwatson	8fd5cf995c	Add DDB "show conifhk" command, which lists hooks currently waiting for completion in run_interrupt_driven_config_hooks(). This is helpful when trying to figure out which device drivers have gone into la-la land during boot-time autoconfiguration. MFC after: 3 days	2008-07-19 12:12:54 +00:00
jeff	7ff6e9903f	Fix a race which could result in some timeout buckets being skipped. - When a tick occurs on a cpu, iterate from cc_softticks until ticks. The per-cpu tick processing happens asynchronously with the actual adjustment of the 'ticks' variable. Sometimes the results may be visible before the local call and sometimes after. Previously this could cause a one tick window where we didn't evaluate the bucket. - In softclock fetch curticks before incrementing cc_softticks so we don't skip insertions which were made for the current time. Sponsored by: Nokia	2008-07-19 05:18:29 +00:00
jeff	b2f69d1b1e	- Check whether we've recorded this tick in ts_ticks on another cpu in sched_tick() to prevent multiple increments for one tick. This pushes the value out of range and breaks priority calculation. Reviewed by: kib Found by: pho/nokia Sponsored by: Nokia MFC after: 3 days	2008-07-19 05:13:47 +00:00
kmacy	6dfc39c2b6	revert local change	2008-07-18 07:10:33 +00:00
kmacy	eacfaa0e61	revert change from local tree	2008-07-18 07:07:57 +00:00
kmacy	c01ed5ad9b	import vendor fixes to cxgb	2008-07-18 06:12:31 +00:00
kib	eff9ee09b4	Pair the VOP_OPEN call from do_execve() with the reciprocal VOP_CLOSE. This was unnoticed because local filesystems usually do nothing non-trivial in the close vop. Reported and tested by: Rick Macklem MFC after: 2 weeks	2008-07-17 16:44:07 +00:00
antoine	89ca3c5933	Staticize M_STACK. Approved by: rwatson (mentor) MFC after: 1 month	2008-07-13 17:15:05 +00:00
rodrigc	f280e5ed8f	In nmount(), if we see "update" in the mount options, set MNT_UPDATE in fsflags, and delete the "update" option from the global mount options. MNT_UPDATE is a command, and not a property of a mount that should persist after the command is executed. We need to do similar things for MNT_FORCE and MNT_RELOAD. All mount flags are prefixed by MNT_..... it would be nice if flags which were commands were named differently from flags which are persistent properties of a mount. This was not such a big deal in the pre-nmount() days, but with nmount() it is more important. Requested by: yar MFC after: 2 weeks	2008-07-12 20:12:40 +00:00
obrien	fa9172e3f7	Improve readability and cscope searches a little bit by not using the same variable name in closely related (but not conflicting) contexts.	2008-07-11 14:48:28 +00:00
kib	da671c0533	Make it atomic for the devfs_populate_loop() to see the setting of SI_ALIAS flag and initialization of the si_parent when alias is created. Assert that supplied parent device is not NULL. Both situations could cause NULL dereference in the devfs_populate_loop() when creating a symlink for SI_ALIAS'ed device. Namely, cdp->cdp_c.si_parent may be NULL. Reported by: mav MFC after: 2 weeks	2008-07-11 11:22:19 +00:00
obrien	3b9db50b75	Revert r180431. r180431 broke the AMD64 build (the only arch using kern/link_elf_obj.c)	2008-07-11 01:10:40 +00:00
obrien	0bc4bc025d	Allow 'elf_file_t' to be used in a wider scope.	2008-07-10 16:35:57 +00:00
edwin	e80b338f3b	Improve the output of kldload(8) to show which module can't be loaded. Was: kldload: Unsupported file type Is now: kldload: /boot/modules/test.ko: Unsupported file type PR: kern/121276 Submitted by: Edwin Groothuis <edwin@mavetju.org> Approved by: bde (mentor) MFC after: 1 week	2008-07-08 23:51:38 +00:00
bz	f93b85c0df	Add a `show cpusets' DDB command to print numbered root and assigned CPU affinity sets. Reviewed by: brooks	2008-07-07 21:32:02 +00:00
bz	6988e35234	MFp4 144659: Plug a memory leak with jail services. PR: 125257 Submitted by: Mateusz Guzik <mjguzik gmail.com> MFC after: 6 days	2008-07-07 20:53:49 +00:00
bz	cf63123d06	Move cpuset_refroot and cpuset_refbase functions up, grouping the cpuset_ref* functions together. Will make it easier to read and add code without forward declarations. No functional changes.	2008-07-07 20:45:55 +00:00
kib	d39c6bcffb	The kqueue_register() function assumes that it is called from the top of the syscall code and acquires various event subsystem locks as needed. The handling of the NOTE_TRACK for EVFILT_PROC is currently done by calling the kqueue_register() from filt_proc() filter, causing recursive entrance of the kqueue code. This results in the LORs and recursive acquisition of the locks. Implement the variant of the knote() function designed to only handle the fork() event. It mostly copies the knote() body, but also handles the NOTE_TRACK, removing the handling from the filt_proc(), where it causes problems described above. The function is called from the fork1() instead of knote(). When encountering NOTE_TRACK knote, it marks the knote as influx and drops the knlist and kqueue lock. In this context call to kqueue_register is safe from the problems. An error from the kqueue_register() is reported to the observer as NOTE_TRACKERR fflag. PR: 108201 Reviewed by: jhb, Pramod Srinivasan <pramod juniper net> (previous version) Discussed with: jmg Tested by: pho MFC after: 2 weeks	2008-07-07 09:30:11 +00:00
kib	ea1979e3d2	In r178914 I erroneously put the setting of the KQ_FLUXWAIT flag before KQ_FLUX_WAKEUP(). Since the latter macro clears KQ_FLUXWAIT, the kqueue_scan() thread may not be woken up. Move the setting of KQ_FLUXWAIT after the wakeup to correct the issue. Reported and tested by: pho MFC after: 3 days	2008-07-07 09:15:29 +00:00
alc	c016906f4e	Enable the creation of a kmem map larger than 4GB. Submitted by: Tz-Huan Huang Make several variables related to kmem map auto-sizing static. Found by: CScout	2008-07-05 19:34:33 +00:00
rwatson	051819b847	Introduce a new lock, hostname_mtx, and use it to synchronize access to global hostname and domainname variables. Where necessary, copy to or from a stack-local buffer before performing copyin() or copyout(). A few uses, such as in cd9660 and daemon_saver, remain under-synchronized and will require further updates. Correct a bug in which a failed copyin() of domainname would leave domainname potentially corrupted. MFC after: 3 weeks	2008-07-05 13:10:10 +00:00
alc	b7d6153751	Correct an error in the comments for init_param3(). Discussed with: silby	2008-07-04 19:36:58 +00:00
rwatson	482bfeab47	Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific netisr handlers to always be dispatched via a queue (deferred). Mark the usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly acquire Giant in those handlers. Previously, any netisr handler not marked NETISR_MPSAFE would necessarily run deferred and with Giant acquired. This change removes Giant scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows non-MPSAFE handlers to continue to force deferred dispatch so as to avoid lock order reversals between their acquisition of Giant and any calling context. It is likely we will be able to remove NETISR_FORCEQUEUE once IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no longer be supported. Reviewed by: bz MFC after: 1 month X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons, but the rest can go back.	2008-07-04 00:21:38 +00:00
emaste	240825654b	Use bcopy instead of strlcpy in uipc_bind and unp_connect, since soun->sun_path isn't a null-terminated string. As UNIX(4) states, "the terminating NUL is not part of the address." Since strlcpy has to return "the total length of the string [it] tried to create," it walks off the end of soun->sun_path looking for a \0. This reverts r105332. Reported by: Ryan Stone	2008-07-03 23:26:10 +00:00
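The sun_path point above can be illustrated in userland. This is a minimal sketch, not the kernel code: `copy_sun_path` is a hypothetical helper, and it assumes the caller knows the total address length, as bind(2) and connect(2) callers do. The path length is derived from that length instead of from a search for a NUL byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: copy the path out of a sockaddr_un whose total
 * length is addrlen. The path length comes from addrlen, never from a
 * search for '\0', because sun_path need not be NUL-terminated.
 * Returns the number of path bytes copied, or -1 if it does not fit.
 */
static int
copy_sun_path(const struct sockaddr_un *soun, size_t addrlen,
    char *buf, size_t buflen)
{
	size_t hdr = offsetof(struct sockaddr_un, sun_path);
	size_t namelen;

	if (addrlen < hdr || addrlen > sizeof(*soun))
		return (-1);
	namelen = addrlen - hdr;
	if (namelen >= buflen)
		return (-1);
	memcpy(buf, soun->sun_path, namelen);	/* bounded copy */
	buf[namelen] = '\0';			/* terminate our own copy */
	return ((int)namelen);
}
```

A string function that must report the full source length, such as strlcpy(), cannot be used here because it keeps reading until it finds a terminator.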
julian	7b11deb4f4	Change a variable name to not shadow a global Obtained from: vimage	2008-07-03 08:35:59 +00:00
rwatson	108da791bb	Update copyright date in light of soreceive_dgram(9).	2008-07-03 06:47:45 +00:00
rwatson	0c50a62527	Add soreceive_dgram(9), an optimized socket receive function for use by datagram-only protocols, such as UDP. This version removes use of sblock(), which is not required due to an inability to interlace data improperly with datagrams, as well as avoiding some of the larger loops and state management that don't apply on datagram sockets. This is experimental code, so hook it up only for UDPv4 for testing; if there are problems we may need to revise it or turn it off by default, but it offers significant performance improvements for threaded UDP applications such as BIND9, nsd, and memcached using UDP. Tested by: kris, ps	2008-07-02 23:23:27 +00:00
rdivacky	d3e39bd522	Use msleep_spin() instead of unlock/tsleep/lock. This was already committed but with a wrong msleep variant and then backed out. Note that this changes the semantics a little, as msleep_spin() does not let us specify the priority after wakeup. Approved by: wkoszek, cognet Approved by: kib (mentor)	2008-07-02 20:44:33 +00:00
bz	30064ea555	Remove an unneeded error variable to make clear that if reaching the end of the function we never return an error.	2008-06-29 18:26:07 +00:00
bz	103613ceb8	Add a new priv 'PRIV_SCHED_CPUSET' to check if manipulating cpusets is allowed and replace the suser() call. Do not allow it in jails. Reviewed by: rwatson	2008-06-29 17:58:16 +00:00
jhb	411d068395	Rework the lifetime management of the kernel implementation of POSIX semaphores. Specifically, semaphores are now represented as a new file descriptor type that is set to close on exec. This removes the need for all of the manual process reference counting (and fork, exec, and exit event handlers) as the normal file descriptor operations handle all of that for us nicely. It is also suggested as one possible implementation in the spec and at least one other OS (OS X) uses this approach. Some bugs that were fixed as a result include: - References to a named semaphore whose name is removed still work after the sem_unlink() operation. Prior to this patch, if a semaphore's name was removed, valid handles from sem_open() would get EINVAL errors from sem_getvalue(), sem_post(), etc. This fixes that. - Unnamed semaphores created with sem_init() were not cleaned up when a process exited or exec'd. They were only cleaned up if the process did an explicit sem_destroy(). This could result in a leak of semaphore objects that could never be cleaned up. - On the other hand, if another process guessed the id (kernel pointer to 'struct ksem') of an unnamed semaphore (created via sem_init()) and had write access to the semaphore based on UID/GID checks, then that other process could manipulate the semaphore via sem_destroy(), sem_post(), sem_wait(), etc. - As part of the permission check (UID/GID), the umask of the process creating the semaphore was not honored. Thus if your umask denied group read/write access but the explicit mode in the sem_init() call allowed it, the semaphore would be readable/writable by other users in the same group, for example. This includes access via the previous bug. - If the module refused to unload because there were active semaphores, then it might have deregistered one or more of the semaphore system calls before it noticed that there was a problem. 
I'm not sure if this actually happened as the order that modules are discovered by the kernel linker depends on how the actual .ko file is linked. One can make the order deterministic by using a single module with a mod_event handler that explicitly registers syscalls (and deregisters during unload after any checks). This also fixes a race where even if the sem_module unloaded first it would have destroyed locks that the syscalls might be trying to access if they are still executing when they are unloaded. XXX: By the way, deregistering system calls doesn't do any blocking to drain any threads from the calls. - Some minor fixes to errno values on error. For example, sem_init() isn't documented to return ENFILE or EMFILE if we run out of semaphores the way that sem_open() can. Instead, it should return ENOSPC in that case. Other changes: - Kernel semaphores now use a hash table to manage the namespace of named semaphores nearly in a similar fashion to the POSIX shared memory object file descriptors. Kernel semaphores can now also have names longer than 14 chars (up to MAXPATHLEN) and can include subdirectories in their pathname. - The UID/GID permission checks for access to a named semaphore are now done via vaccess() rather than a home-rolled set of checks. - Now that kernel semaphores have an associated file object, the various MAC checks for POSIX semaphores accept both a file credential and an active credential. There is also a new posixsem_check_stat() since it is possible to fstat() a semaphore file descriptor. - A small set of regression tests (using the ksem API directly) is present in src/tools/regression/posixsem. Reported by: kris (1) Tested by: kris Reviewed by: rwatson (lightly) MFC after: 1 month	2008-06-27 05:39:04 +00:00
julian	e62e072121	Someone cut and pasted a bunch of stuff here so lots of indents were spaces when they should have been tabs, screwing up diffs and patches.. Whitespace commit as my first SVN commit. (yay) MFC after: 1 week	2008-06-26 22:45:04 +00:00
dfr	41cea6d5ca	Re-implement the client side of rpc.lockd in the kernel. This implementation provides the correct semantics for flock(2) style locks which are used by the lockf(1) command line tool and the pidfile(3) library. It also implements recovery from server restarts and ensures that dirty cache blocks are written to the server before obtaining locks (allowing multiple clients to use file locking to safely share data). Sponsored by: Isilon Systems PR: 94256 MFC after: 2 weeks	2008-06-26 10:21:54 +00:00
ru	c878414354	Fix a chicken-and-egg problem: this file implements SSP support, so we cannot compile it with -fstack-protector[-all] flags (or it will self-recurse); this is ensured in sys/conf/files. This OTOH means that checking for defines __SSP__ and __SSP_ALL__ to determine if we should be compiling the support is impossible (which it was trying, resulting in an empty object file). Fix this by always compiling the symbols in this file. It's good because it allows us to always have SSP support, and then compile with SSP selectively. Reported by: tinderbox	2008-06-26 07:52:45 +00:00
ru	8735fdbd4c	Enable GCC stack protection (aka Propolice) for userland: - It is opt-out for now so as to give it maximum testing, but it may be turned opt-in for stable branches depending on the consensus. You can turn it off with WITHOUT_SSP. - WITHOUT_SSP was previously used to disable the build of GNU libssp. It is harmless to steal the knob as SSP symbols have been provided by libc for a long time, GNU libssp should not have been much used. - SSP is disabled in a few corners such as system bootstrap programs (sys/boot), process bootstrap code (rtld, csu) and SSP symbols themselves. - It should be safe to use -fstack-protector-all to build world, however libc will be automatically downgraded to -fstack-protector because it breaks rtld otherwise. - This option is unavailable on ia64. Enable GCC stack protection (aka Propolice) for kernel: - It is opt-out for now so as to give it maximum testing. - Do not compile your kernel with -fstack-protector-all, it won't work. Submitted by: Jeremie Le Hen <jeremie@le-hen.org>	2008-06-25 21:33:28 +00:00
davidxu	70dd244f26	Add two commands to the _umtx_op system call to allow a simple mutex to be locked and unlocked completely in userland. Locking and unlocking the mutex in userland reduces the total time a mutex is held by a thread. In some application code a mutex only protects a small piece of code whose execution time is less than that of a simple system call; if lock contention happens in the current implementation, however, the lock holder has to extend its locking time and enter the kernel to unlock it. This change avoids that disadvantage: it first sets the mutex to the free state and then enters the kernel to wake one waiter up. This improves performance dramatically in some sysbench mutex tests. Tested by: kris Sounds great: jeff	2008-06-24 07:32:12 +00:00
jhb	437891381c	Remove the posixsem_check_destroy() MAC check. It is semantically identical to doing a MAC check for close(), but no other types of close() (including close(2) and ksem_close(2)) have MAC checks. Discussed with: rwatson	2008-06-23 21:37:53 +00:00
rwatson	1e17e3cd45	If S_IFIFO is passed to mknod(2), invoke kern_mkfifoat(9) to create a FIFO, as required by SUSv3. No specific privilege check is performed in this case, as FIFOs may be created by unprivileged processes (subject to the normal file system name space restrictions that may be in place). Unlike the Apple implementation, we reject requests to create a FIFO using mknod(2) if there is a non-zero dev argument to the system call, which is permitted by the Open Group specification ("... undefined ..."). We might want to revise this if we find it causes compatibility problems for applications in practice. PR: kern/74242, kern/68459 Obtained from: Apple, Inc. MFC after: 3 weeks	2008-06-22 21:51:32 +00:00
gonzo	f0ffee5444	Use the minimum of max_aio_procs and target_aio_procs when spawning a new aiod, since there should be no more than max_aio_procs processes.	2008-06-21 11:34:34 +00:00
imp	bf94b8a5bf	Split out the probing magic of device_probe_and_attach into device_probe() so that it can be used by busses that may wish to do additional processing between probe and attach. Reviewed by: dfr@	2008-06-20 16:58:15 +00:00
alc	c5556f0762	Enforce the mapping of kernel loadable modules in the uppermost 2GB of the kernel virtual address space on amd64.	2008-06-20 06:24:34 +00:00
delphij	4f152d47fa	Revert rev. 178124 as requested by kris@. Having jail id not being reused too frequently is useful for script controlled environment.	2008-06-19 21:41:57 +00:00
gonzo	c5bc6314e2	Renew semaphore's pointer after wakeup since during msleep sem_base may have been modified by destroying one of semaphores and semptr would not be valid in this case. PR: kern/123731	2008-06-19 18:08:42 +00:00
kib	eecc60305f	Struct cdev is always the member of the struct cdev_priv. When devfs needed to promote cdev to cdev_priv, the si_priv pointer was followed. Use member2struct() to calculate address of the wrapping cdev_priv. Rename si_priv to __si_reserved. Tested by: pho Reviewed by: ed MFC after: 2 weeks	2008-06-16 17:34:59 +00:00
jb	567c5d727e	Remove code that isn't required. It actually breaks the case where KDTRACE_HOOKS is defined and KDB isn't. This is the case that it was intended for.	2008-06-16 04:44:29 +00:00
ed	4327eebef0	Turn dev2unit(), minor(), unit2minor() and minor2unit() into macros. Now that we got rid of the minor-to-unit conversion and the constraints on device minor numbers, we can convert the functions that operate on minor and unit numbers to simple macros. The unit2minor() and minor2unit() macros are now no-ops. The ZFS code also defined a macro named `minor'. Change the ZFS code to use umajor() and uminor() here, as it is the correct approach to do this. Also add $FreeBSD$ to keep SVN happy. Approved by: philip (mentor), pjd	2008-06-12 08:30:54 +00:00
ed	1bfc292986	Don't enforce unique device minor number policy anymore. Except for the case where we use the cloner library (clone_create() and friends), there is no reason to enforce a unique device minor number policy. There are various drivers in the source tree that allocate unr pools and such to provide minor numbers, without using them themselves. Because we still need to support unique device minor numbers for the cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's that are used in combination with the cloner library should be marked with this flag to make the cloning work. This means drivers can now freely use si_drv0 to store their own flags and state, making it effectively the same as si_drv1 and si_drv2. We still keep the minor() and dev2unit() routines around to make drivers happy. The NTFS code also used the minor number in its hash table. We should not do this anymore. If the si_drv0 field would be changed, it would no longer end up in the same list. Approved by: philip (mentor)	2008-06-11 18:55:19 +00:00
gonzo	4f61d04fd8	Keep proper track of nsegs counter: sem_free is called for all allocated semaphores, so it's wrong to increase it conditionally, in this case for every over-the-limit semaphore nsegs is decreased without being previously increased. PR: kern/123685 Approved by: cognet (mentor)	2008-06-10 20:55:10 +00:00
kib	926d12d0ea	Provide the mutual exclusion between the nfs export list modifications and nfs requests processing. Lockmgr lock provides the shared locking for nfs requests, while exclusive mode is used for modifications. The writer starvation is handled by lockmgr too. Reported by: kris, pho, many Based on the submission by: mohan Tested by: pho MFC after: 2 weeks	2008-06-09 10:31:38 +00:00
wkoszek	3183578270	Remove checks against DDB, which isn't used in this file. My intention is to bring no functional change. Discussion on: IRC Reviewed by: ed, kan, rink,	2008-06-08 20:43:27 +00:00
ed	be822a5885	Remove unneeded Giant locking of /dev/tty. The Giant lock is acquired in two places in tty_tty.c. In both places, it is unneeded. There is no reason to specify D_NEEDGIANT on this device node. The device node has only been designed to return ENXIO when opened. It doesn't make any sense to lock/unlock Giant, just to return this error. D_TTY is also unneeded. The unimplemented functions don't need to be patched by devfs. We don't need to lock Giant when we want to lookup the proper TTY vnode. s_ttyvp is already protected by proctree_lock (see devfs_vnops.c). Approved by: philip (mentor)	2008-06-03 12:38:00 +00:00
davidxu	d4f2094515	Use a separate hash table for mutex and rwlock, to avoid wasting time walking through idle threads sleeping on condition variables.	2008-05-30 02:18:54 +00:00
ed	5de6a45e07	Remove the distinction between device minor and unit numbers. Even though we got rid of device major numbers some time ago, device drivers still need to provide unique device minor numbers to make_dev(). These numbers are only used inside the kernel. They are not related to device major and minor numbers which are visible in devfs. These are actually based on the inode number of the device. It would eventually be nice to remove minor numbers entirely, but we don't want to be too aggressive here. Because the 8-15 bits of the device number field (si_drv0) are still reserved for the major number, there is no 1:1 mapping of the device minor and unit numbers. Because this is now unused, remove the restrictions on these numbers. The MAXMAJOR definition was actually used for two purposes. It was used to convert both the userspace and kernelspace device numbers to their major/minor pair, which is why it is now named UMINORMASK. minor2unit() and unit2minor() have now become useless. Both minor() and dev2unit() now serve the same purpose. We should eventually remove some of them, at least turning them into macros. If devfs would become completely minor number unaware, we could consider using si_drv0 directly, just like si_drv1 and si_drv2. Approved by: philip (mentor)	2008-05-29 12:50:46 +00:00
ed	83304da0e8	Remove redundant checks from fcntl()'s F_DUPFD. Right now we perform some of the checks inside the fcntl()'s F_DUPFD operation twice. We first validate the `fd' argument. When finished, we validate the `arg' argument. These checks are also performed inside do_dup(). The reason we need to do this, is because fcntl() should return different errno's when the `arg' argument is out of bounds (EINVAL instead of EBADF). To prevent the redundant locking of the PROC_LOCK and FILEDESC_SLOCK, patch do_dup() to support the error semantics required by fcntl(). Approved by: philip (mentor)	2008-05-28 20:25:19 +00:00
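The errno distinction that F_DUPFD has to preserve is visible from userland. A minimal sketch, assuming a POSIX system; `dupfd_errno` is a hypothetical helper that only reports which error fcntl(2) produced.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Run fcntl(fd, F_DUPFD, arg); return 0 on success, else errno. */
static int
dupfd_errno(int fd, int arg)
{
	int newfd;

	newfd = fcntl(fd, F_DUPFD, arg);
	if (newfd == -1)
		return (errno);
	close(newfd);		/* only the error code is of interest */
	return (0);
}
```

An invalid `fd` is expected to yield EBADF, while a valid `fd` with an out-of-range `arg` is expected to yield EINVAL, which is the case do_dup() had to learn to report.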
ed	00336df1bc	Rename `tty_subr.c' to `subr_clist.c'. Because clists are also used outside the TTY layer, rename the file containing the clist routines to something more accurate. The mpsafetty TTY layer doesn't use clists. It uses its own buffers, which also implement the unbuffered copying to userspace. We cannot simply remove the clist routines then, because this would break various drivers that are present within the kernel. Approved by: philip (mentor)	2008-05-27 06:41:50 +00:00
attilio	e089ccfc1b	Improve a comment which, in the actual CVS stock, doesn't completely explain the logic of the code chunk.	2008-05-27 00:27:50 +00:00
kib	5941eb2619	Take into account possible overflow when multiplying. The casualty is the malloc call later, panicking the kernel due to the oversized allocation. Reported by: pho Reviewed by: jeff	2008-05-26 10:01:13 +00:00
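The general technique of guarding a size multiplication before an allocation can be sketched in userland C. This is an illustration of the idea only, not the change made in this commit; `alloc_array_checked` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate an array of nmemb elements of the
 * given size, refusing the request if nmemb * size would wrap around
 * SIZE_MAX.
 */
static void *
alloc_array_checked(size_t nmemb, size_t size)
{
	if (size != 0 && nmemb > SIZE_MAX / size)
		return (NULL);		/* product would overflow */
	return (malloc(nmemb * size));
}
```

The division-based test runs before the multiplication, so the wrapped product is never computed or handed to the allocator.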
rwatson	a3623cb733	Remove netatm from HEAD as it is not MPSAFE and relies on the now removed NET_NEEDS_GIANT. netatm has been disconnected from the build for ten months in HEAD/RELENG_7. Specifics: - netatm include files - netatm command line management tools - libatm - ATM parts in rescue and sysinstall - sample configuration files and documents - kernel support as a module or in NOTES - netgraph wrapper nodes for netatm - ctags data for netatm. - netatm-specific device drivers. MFC after: 3 weeks Reviewed by: bz Discussed with: bms, bz, harti	2008-05-25 22:11:40 +00:00
attilio	4755d96541	The "if" semantic is not needed, just fix this.	2008-05-25 16:11:27 +00:00
attilio	4d240aa98e	Replace direct atomic operations for the file refcount with the refcount interface. It also introduces the correct usage of memory barriers, as sometimes fdrop() and fhold() are used with shared locks, which don't use any release barrier.	2008-05-25 14:57:43 +00:00
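The barrier requirement for a reference count can be sketched with C11 atomics. This is a userland illustration, not the kernel's refcount implementation; `ref_acquire` and `ref_release` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference; no ordering is needed to bump the count. */
static void
ref_acquire(atomic_uint *count)
{
	atomic_fetch_add_explicit(count, 1, memory_order_relaxed);
}

/*
 * Drop a reference. Returns true when the caller dropped the last one
 * and may free the object.
 */
static bool
ref_release(atomic_uint *count)
{
	unsigned int old;

	/* Release: publish this thread's writes before the drop. */
	old = atomic_fetch_sub_explicit(count, 1, memory_order_release);
	assert(old > 0);
	if (old != 1)
		return (false);
	/* Acquire: observe other threads' writes before freeing. */
	atomic_thread_fence(memory_order_acquire);
	return (true);
}
```

The release ordering on the decrement matters when the surrounding lock is only held shared, since a shared lock by itself does not order one holder's writes against another holder's final drop.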
jb	1c6ecc547f	Add the vtime (virtual time) hooks for DTrace.	2008-05-25 01:44:58 +00:00
jb	c4443570b6	Add DTrace 'proc' provider probes using the Statically Defined Trace (sdt) mechanism.	2008-05-24 06:22:16 +00:00
rodrigc	a9cd468083	Do not convert the "snapshot" string to the MNT_SNAPSHOT flag here, since we do it further down in ffs_vfsops.c MFC after: 1 month	2008-05-23 23:33:07 +00:00
kib	797c3188c0	Rev. 1.274 put the ttyrel() call before the destroy_dev() in the ttyfree(), freeing the tty. Since destroy_dev() may call d_purge() cdevsw method, that is the ttypurge() for the tty, the code ends up accessing freed tty structure. Put the ttyrel() after destroy_dev() in the ttyfree. To prevent the panic the rev. 1.274 provided fix for, check the TS_GONE in sysctl handler and refuse to provide information on such tty. Reported, debugging help and tested by: pho Discussed with and reviewed by: jhb MFC after: 1 week	2008-05-23 16:47:55 +00:00
kib	90775e30db	The dev_refthread() in the tty_gettp() may fail, because Giant is taken in the giant_trick routines after the dev_refthread increments the si_threadcount. Remove assert, do not perform dev_relthread() for failed dev_refthread(), and handle failure in the tty_gettp() callers (cdevsw tty methods). Before kern_conf.c 1.210 and 1.211, the kernel usually panicked in the giant_trick routines dereferencing NULL cdevsw, not taking this fault. Reported by: Vince Hoffman <jhary unsane co uk> Debugging help and tested by: pho Reviewed by: jhb MFC after: 1 week	2008-05-23 16:46:27 +00:00
kib	a0dac34fa6	Use the t_state for the TS_GONE test. Submitted by: jhb MFC after: 3 days	2008-05-23 16:43:59 +00:00
kib	c1c2996ed2	Assert that si_threadcount > 0 before decrementing it. This helps catching the improper use of the dev_refthread/dev_relthread. Tested by: pho MFC after: 1 week	2008-05-23 16:38:38 +00:00
ed	bdc5be605f	Move TTY unrelated bits out of <sys/tty.h>. For some reason, the <sys/tty.h> header file also contains routines of the clists and console that are used inside the TTY layer. Because the clists are not only used by the TTY layer (example: various input drivers), we'd better move the entire clist programming interface into <sys/clist.h>. Also remove a declaration of nonexistent variable. The <sys/tty.h> header also contains various definitions for the console code (tty_cons.c). Also move these to <sys/cons.h>, because they are not implemented inside the TTY layer. While there, create separate malloc pools for the clist and console code. Approved by: philip (mentor)	2008-05-23 16:06:35 +00:00
kib	bb95365b8c	Another problem caused by the knlist_cleardel() potentially dropping PIPE_MTX(). Since the pipe_present is cleared before (potentially) sleeping, the second thread may enter the pipeclose() for the reciprocal pipe end. The test at the end of the pipeclose() for the pipe_present == 0 would succeed, allowing the second thread to free the pipe memory. The first thread then accesses the freed memory after being woken up. Properly track the closing state of the pipe in the pipe_present. Introduce the intermediate state that marks the pipe as mostly dismantled but might be sleeping waiting for the knote list to be cleared. Free the pipe pair memory only when both ends pass that point. Debugging help and tested by: pho Discussed with: jmg MFC after: 2 weeks	2008-05-23 11:14:03 +00:00
kib	c106911b42	Destruction of the pipe calls knlist_cleardel() to remove the knotes monitoring the pipe. The code sets pipe_present = 0 and enters knlist_cleardel(), where the PIPE_MTX might be dropped when knl->kl_list cannot be cleared due to influx knotes. If the following often encountered code fragment if (!(kn->kn_status & KN_DETACHED)) kn->kn_fop->f_detach(kn); knote_drop(kn, td); [1] is executed while the knlist lock is dropped, then the knote memory is freed by the knote_drop() without knote being removed from the knlist, since the filt_pipedetach() contains the following: if (kn->kn_filter == EVFILT_WRITE) { if (!cpipe->pipe_peer->pipe_present) { PIPE_UNLOCK(cpipe); return; Now, the memory may be reused in the zone, causing the access to the freed memory. I got the panics caused by the marker knote appearing on the knlist, which, I believe, is a manifestation of the issue. In the Peter Holm test scenarios, we got unkillable processes too. The pipe_peer that has the knote for write shall be present. Ignore the pipe_present value for EVFILT_WRITE in filt_pipedetach(). Debugging help and tested by: pho Discussed with: jmg MFC after: 2 weeks	2008-05-23 11:09:50 +00:00
jb	6a077c58b8	Add the ctf_get function and update the args to linker_file_function_listall.	2008-05-23 07:08:59 +00:00
jb	1ebf94be7d	Add the ctf_get method.	2008-05-23 04:06:49 +00:00
jb	e922b9b976	Allow a rendezvous with just a specified CPU too. Make the API work in the non-smp case too so that a kernel module can work the same regardless of whether or not it is loaded on a SMP kernel or not.	2008-05-23 04:05:26 +00:00
jb	858f2ace1b	Add the CTF source file which gets shared with link_elf.c and link_elf_obj.c.	2008-05-23 03:04:27 +00:00
jb	090fe643c2	Add hooks for the Compact C Type Format (CTF) data to be attached to the elf files. This is complicated by the fact that the actual CTF parsing has to be done in CDDL'd code, so the BSD licensed code only knows about the opaque data which it must be able to free.	2008-05-23 00:49:39 +00:00
jb	8c4eed9aad	Add support for the DTrace malloc provider which can enable probes on a per-malloc type basis.	2008-05-23 00:43:36 +00:00
rwatson	60b4eaf522	When sendto(2) is called with an explicit destination address argument, call mac_socket_check_connect() on that address before proceeding with the send. Otherwise policies instrumenting the connect entry point for the purposes of checking destination addresses will not have the opportunity to check implicit connect requests. MFC after: 3 weeks Sponsored by: nCircle Network Security, Inc.	2008-05-22 07:18:54 +00:00
kib	5971791c18	Implement the per-open file data for the cdev. The patch does not change the cdevsw KBI. Management of the data is provided by the functions int devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr); int devfs_get_cdevpriv(void **datap); void devfs_clear_cdevpriv(void); All of the functions are supposed to be called from the cdevsw method contexts. - devfs_set_cdevpriv assigns the priv as private data for the file descriptor which is used to initiate currently performed driver operation. dtr is the function that will be called when either the last reference to the file goes away, the device is destroyed or devfs_clear_cdevpriv is called. - devfs_get_cdevpriv is the obvious accessor. - devfs_clear_cdevpriv allows to clear the private data for the still open file. Implementation keeps the driver-supplied pointers in the struct cdev_privdata, that is referenced both from the struct file and struct cdev, and cannot outlive any of the referee. Man pages will be provided after the KPI stabilizes. Reviewed by: jhb Useful suggestions from: jeff, antoine Debugging help and tested by: pho MFC after: 1 month	2008-05-21 09:31:44 +00:00
pjd	a1af6d977b	Be more friendly for DDB pager. Educated by: jhb's BSDCan presentation	2008-05-18 21:08:12 +00:00
jb	52f46ad538	Add support for the DTrace struct proc and struct thread extended data via ctor and dtor event handlers. The size of the extra data is allocated opaquely and this file contains a function which the dtrace module can call to check that the kernel supports at least the amount of data that it needs. This file is optionally compiled into the kernel if the KDTRACE_HOOKS kernel option is defined.	2008-05-18 19:43:52 +00:00
jb	456cdd0179	Add kernel support for the Statically Defined Trace provider. This is BSD licensed code written specifically for FreeBSD. It initialises using SYSINIT so that the SDT provider, probe and argument description linkage is done whenever a module is loaded, regardless of whether the DTrace modules are loaded or not. This file is optionally compiled into the kernel if the KDTRACE_HOOKS option is defined.	2008-05-18 19:32:36 +00:00
rpaulo	2670a520c1	devctl_process_running(): Check for devsoftc.inuse == 1 instead of devsoftc.async_proc != NULL because the latter might not be true sometimes. This way /etc/rc.suspend gets executed. Reviewed by: njl Submitted by: Mitsuru IWASAKI <iwasaki at jp.FreeBSD.org> Tested also by: Andreas Wetzel <mickey242 at gmx.net> MFC after: 1 week	2008-05-18 13:55:51 +00:00
rwatson	14ceaad756	Attempt to improve convergence of POSIX semaphore code with style(9). MFC after: 3 days	2008-05-16 18:10:07 +00:00
gnn	368bdf05e9	Update the kernel to count the number of mbufs and clusters (all types) used per socket buffer. Add support to netstat to print out all of the socket buffer statistics. Update the netstat manual page to describe the new -x flag which gives the extended output. Reviewed by: rwatson, julian	2008-05-15 20:18:44 +00:00
attilio	f7f31164f1	- Embed the recursion counter for any locking primitive directly in the lock_object, using a unified field called lo_data. - Replace lo_type usage with the w_name usage and at init time pass the lock "type" directly to witness_init() from the parent lock init function. Handle delayed initialization before witness_initialize() is called through the witness_pendhelp structure. - Axe out LO_ENROLLPEND as it is not really needed. The case where a mutex with delayed init wants to be destroyed can't happen because witness_destroy() checks for witness_cold and panics in that case. - In enroll(), if we cannot allocate a new object from the freelist, notify that to userspace through a printf(). - Modify the depart function in order to return nothing as in the current CVS version it always returns true and adjust callers accordingly. - Fix the witness_addgraph() argument name prototype. - Remove useless code from itismychild(). This commit leads to a smaller struct lock_object and so smaller locks, in particular on amd64 where 2 uintptr_t (16 bytes per-primitive) are gained. Reviewed by: jhb	2008-05-15 20:10:06 +00:00
jhb	ec0d9f9d00	Go back to using the process command name (p_comm) for the file name and command line arguments stored in the note at the beginning of a core dump instead of the current thread name. Reviewed by: julian	2008-05-15 03:07:34 +00:00
kib	592c22cb14	Add the devctl notifications for the cdev create/destroy events. Based on the submission by: Andriy Gapon <avg icyb net ua> MFC after: 2 weeks	2008-05-14 14:29:54 +00:00
julian	27367de06f	fix typo in runz_fuzz noticed by:Elijah Buck	2008-05-12 06:42:06 +00:00
alc	c251140c26	Introduce a new parameter "superpage_align" to kmem_suballoc() that is used to request superpage alignment for the submap. Request superpage alignment for the kmem_map. Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find(). (They are currently equivalent but VMFS_ANY_SPACE is the new preferred spelling.) Remove a stale comment from kmem_malloc().	2008-05-10 21:46:20 +00:00
kib	9a39931e9b	Kqueue_scan() may sleep when encountering the influx knotes. On the other hand, it may cause other threads to sleep since kqueue_scan() may mark some knotes as influx. This could lead to a deadlock. Before kqueue_scan() sleeps, wake up the threads that are waiting for the influx knotes produced by this thread. Tested by: pho (previous version) Reviewed by: jmg MFC after: 2 weeks	2008-05-10 11:37:05 +00:00
kib	0f388c4977	The kqueue_close() encountering the KN_INFLUX knotes on the kq being closed is the legitimate situation. For instance, filedescriptor with registered events may be closed in parallel with closing the kqueue. Properly handle the case instead of asserting that this cannot happen. Reported and tested by: pho Reviewed by: jmg MFC after: 2 weeks	2008-05-10 11:35:32 +00:00
julian	1dfc5c98a4	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). 
The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPv4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (one per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extend that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X], where for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code, called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4, forcing the action to row 0, or to the appropriate row if it IS IPv4 (and that info is available). 
These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us to how the correct FIB is selected for an outgoing IPv4 packet. Firstly, all packets have a FIB associated with them. If nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility called setfib that acts a bit like nice: setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0 (or possibly a number settable in a sysctl (not yet)), but prior to routing the firewall can inspect them (see below). (Possibly in the future you may be able to associate a FIB with packets received on an interface, via an ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. 
A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being responded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect with the process that set up the tunnel. Thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. Messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, it can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly. ipfw has grown 2 new keywords: setfib N ip from any to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. 
I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. It will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. This will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. Instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibility of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the destination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco	2008-05-09 23:03:00 +00:00
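
The rt_tables[Y][X] layout described in the multiple routing tables commit (1dfc5c98a4) above can be illustrated with a small user-space sketch. This is not the kernel code: every name prefixed with sk_ or SK_ is hypothetical, the address family numbers are made up, and the table count of 16 is taken only from the limit mentioned in the commit message. The sketch shows the one idea the message spells out: legacy entry points always read row 0, while a FIB-aware entry point picks row Y for IPv4 and forces row 0 for every other family.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address family numbers, for illustration only. */
enum { SK_AF_UNSPEC = 0, SK_AF_INET = 1, SK_AF_OTHER = 2, SK_AF_MAX = 3 };

/* The commit message says the first version is limited to 16 tables. */
#define SK_NUM_FIBS 16

/* Stand-in for a per-family routing table head (hypothetical). */
struct sk_rt_head {
    int fibnum;
    int family;
};

static struct sk_rt_head sk_heads[SK_NUM_FIBS][SK_AF_MAX];

/* Two-dimensional table: row = FIB number (Y), column = family (X). */
static struct sk_rt_head *sk_rt_tables[SK_NUM_FIBS][SK_AF_MAX];

/* Populate row 0 for every family, and the extra rows for IPv4 only. */
static void sk_init(void)
{
    for (int fib = 0; fib < SK_NUM_FIBS; fib++) {
        for (int af = 0; af < SK_AF_MAX; af++) {
            if (fib > 0 && af != SK_AF_INET) {
                sk_rt_tables[fib][af] = NULL;
                continue;
            }
            sk_heads[fib][af].fibnum = fib;
            sk_heads[fib][af].family = af;
            sk_rt_tables[fib][af] = &sk_heads[fib][af];
        }
    }
}

/* Legacy-style lookup: unaware of FIBs, so it only ever sees row 0. */
static struct sk_rt_head *sk_lookup_legacy(int af)
{
    if (af < 0 || af >= SK_AF_MAX)
        return NULL;
    return sk_rt_tables[0][af];
}

/* FIB-aware lookup: honours fibnum for IPv4, forces row 0 otherwise. */
static struct sk_rt_head *sk_lookup_fib(int af, int fibnum)
{
    if (af < 0 || af >= SK_AF_MAX)
        return NULL;
    if (af != SK_AF_INET || fibnum < 0 || fibnum >= SK_NUM_FIBS)
        fibnum = 0;
    return sk_rt_tables[fibnum][af];
}
```

After sk_init(), sk_lookup_fib(SK_AF_INET, 3) returns the IPv4 head in row 3, whereas the same call for any other family returns exactly what sk_lookup_legacy() returns. That is the sense in which code unaware of the change keeps seeing what looks like the old one-dimensional array.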

10714 Commits