freebsd-skq

Author	SHA1	Message	Date
rwatson	0e49e1ce16	Reduce the verbosity of SDT trace points for DTrace by defining several wrapper macros that allow trace points and arguments to be declared using a single macro rather than several. This means a lot less repetition and vertical space for each trace point. Use these macros when defining privilege and MAC Framework trace points. Reviewed by: jb MFC after: 1 week	2009-03-03 17:15:05 +00:00
jamie	63f98fcc6a	Extend the "vfsopt" mount options for more general use. Make struct vfsopt and the vfs_buildopts function public, and add some new fields to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and vfs_opterror. Further extend the interface to allow reading options from the kernel in addition to sending them to the kernel, with vfs_setopt and related functions. While this allows the "name=value" option interface to be used for more than just FS mounts (planned use is for jails), it retains the current "vfsopt" name and <sys/mount.h> requirement. Approved by: bz (mentor)	2009-03-02 23:26:30 +00:00
kan	e17295cf6f	Change vfs_busy to wait until the outcome of a pending unmount operation is known and to retry or fail according to that outcome. This fixes the problem with namespace traversing programs failing with random ENOENT errors if someone just happened to try to unmount that same filesystem at the same time. Reported by: dhw Reviewed by: kib, attilio Sponsored by: Juniper Networks, Inc.	2009-03-02 20:51:39 +00:00
kib	453adb14fb	Correct types of variables used to track amount of allocated SysV shared memory from int to size_t. Implement a workaround for the current ABI, which cannot properly store and report the size of shared memory segments larger than 2 GB. This makes it possible to use > 2 GB shared memory segments on 64-bit architectures. Please note the new BUGS section in shmctl(2) and the UPDATING note for limitations of this temporary solution. Reviewed by: csjp Tested by: Nikolay Dzham <i levsha org ua> MFC after: 2 weeks	2009-03-02 18:53:30 +00:00
kib	c672211541	Use the p_sysent->sv_flags flag SV_ILP32 to detect a 32-bit process executing on a 64-bit kernel. This eliminates the direct comparisons of p_sysent with &ia32_freebsd_sysvec that were left intact after r185169.	2009-03-02 18:43:50 +00:00
dchagin	718161215b	Fix range-check error introduced in r182292. Also do not do anything if all processors in the map are not available, simply return. Approved by: kib (mentor) MFC after: 1 week	2009-03-01 14:26:24 +00:00
ed	45be9ed433	Improve my previous changes to the TTY code: also remove memcpy(). It's better to just use internal language constructs, because it is likely the compiler has a better opinion on whether to perform inlining, which is very likely to happen to struct winsize. Submitted by: Christoph Mallon <christoph mallon gmx de>	2009-03-01 09:50:13 +00:00
thompsa	4eafa084fe	Move the NORELEASE check to after the recurse count decrement and bailout, this is not counted as actually releasing the lock.	2009-02-28 19:10:43 +00:00
ed	51d425ac0e	Replace bcopy() calls inside the TTY layer with memcpy()/strlcpy(). In all these cases the buffers never overlap. Program names are also likely to be shorter, so use a regular strlcpy() to copy p_comm.	2009-02-28 14:20:26 +00:00
bz	df2be82cec	For all files including net/vnet.h directly include opt_route.h and net/route.h. Remove the hidden include of opt_route.h and net/route.h from net/vnet.h. We need to make sure that both opt_route.h and net/route.h are included before net/vnet.h because of the way MRT figures out the number of FIBs from the kernel option. If we do not, we end up with the default number of 1 when including net/vnet.h and array sizes are wrong. This does not change the list of files which depend on opt_route.h but we can identify them now more easily.	2009-02-27 14:12:05 +00:00
ed	f02ef8e872	Remove redundant code in printf() and vprintf(). printf() and vprintf() are exactly the same, except the way arguments are passed. Just like we see in other pieces of code (i.e. libc's printf()), implement printf() using vprintf(). Submitted by: Christoph Mallon <christoph mallon gmx de>	2009-02-27 13:28:54 +00:00
ed	2078a09b34	Revert previous commit to subr_prf.c and make it more tidy. As mentioned by bz and bde, the change I made wasn't the proper way to fix. Inspired by bde's patch, perform some small cleanups to uprintf(). Reviewed by: bz	2009-02-27 12:50:25 +00:00
ed	9d9e2a90b2	Remove unneeded pointer `ndp'. Inside do_execve(), we have a pointer `ndp', which always points to `&nd'. I can imagine a primitive (non-optimizing) compiler to really reserve space for such a pointer, so just remove the variable and use `&nd' directly.	2009-02-26 16:32:48 +00:00
ed	b3ddcfe1f7	Remove even more unneeded variable assignments. kern_time.c: - Unused variable `p'. kern_thr.c: - Variable `error' is always caught immediately, so no reason to initialize it. There is no way that error != 0 at the end of create_thread(). kern_sig.c: - Unused variable `code'. kern_synch.c: - `rval' is always assigned in all different cases. kern_rwlock.c: - `v' is always overwritten with RW_UNLOCKED further on. kern_malloc.c: - `size' is always initialized with the proper value before being used. kern_exit.c: - `error' is always caught and returned immediately. abort2() never returns a non-zero value. kern_exec.c: - `len' is always assigned inside the if-statement right below it. tty_info.c: - `td' is always overwritten by FOREACH_THREAD_IN_PROC(). Found by: LLVM's scan-build	2009-02-26 15:51:54 +00:00
ed	f854be19d0	Remove unneeded variable `ocn_mute'. Found by: LLVM's scan-build	2009-02-26 13:01:45 +00:00
ed	4f11d3e937	Remove unused variables `p' and unneeded assignments of `rval'. Found by: LLVM's scan-build	2009-02-26 13:00:13 +00:00
ed	280ef3dd23	Remove redundant assignment of `p'. `p' is already initialized with `td->td_proc'. Because td is always curthread, it is safe to initialize it without any locks. Found by: LLVM's scan-build	2009-02-26 12:12:34 +00:00
rwatson	bf80a0a378	Add static tracing for privilege checking: priv:kernel:priv_check:priv_ok fires for granted privileges; priv:kernel:priv_check:priv_err fires for denied privileges. The first argument is the requested privilege number. The naming convention is a little different from the OpenSolaris equivalent because we can't have '-' in probefunc names, and our privilege namespace is different. MFC after: 1 week	2009-02-26 10:56:13 +00:00
ed	226255f0b6	Silence compiler warning inside our ^T handler. It turns out we're casting fixpt_t* to int*. Spotted by: clang	2009-02-26 10:38:19 +00:00
ed	516ad9be6d	Use unsigned longs for the TTY's sysctl stats. Spotted by: clang	2009-02-26 10:28:32 +00:00
ed	1c7d1f084a	Don't use PTY name as format string, even though it isn't insecure here. It's guaranteed that the `name' variable always contains a string of the form pty[l‐sL‐S][0‐9a‐v], but I'd rather keep the compiler happy (LLVM).	2009-02-26 10:14:10 +00:00
jamie	1631f0aa0a	Add support for methods to the OSD subsystem. Each object type has a predefined set of methods, which are set in osd_register() and called via osd_call(). Currently, no methods are defined, though prison objects will have some in the future. Expand the locking from a single per-type mutex to three different kinds of locks (four if you include the requirement that the container (e.g. prison) be locked when getting/setting data). This clears up one existing issue, as well as others added by the method support. Approved by: bz (mentor)	2009-02-21 11:15:38 +00:00
ed	72727e8d9f	Don't make Linux stat() open character devices to resolve its name. The existing code calls kern_open() to resolve the vnode of a pathname right after a stat(). This is not correct, because it causes random character devices to be opened in /dev. This means ls'ing a tape streamer will cause it to rewind, for example. Changes I have made: - Add kern_statat_vnhook() to allow binary emulators to `post-process' struct stat, using the proper vnode. - Remove unneeded printf's from stat() and statfs(). - Make the Linuxolator use kern_statat_vnhook(), replacing translate_path_major_minor_at(). - Let translate_fd_major_minor() use vp->v_rdev instead of vp->v_un.vu_cdev. Result: crw-rw-rw- 1 root root 0, 14 Feb 20 13:54 /dev/ptmx crw--w---- 1 root adm 136, 0 Feb 20 14:03 /dev/pts/0 crw--w---- 1 root adm 136, 1 Feb 20 14:02 /dev/pts/1 crw--w---- 1 ed tty 136, 2 Feb 20 14:03 /dev/pts/2 Before this commit, ptmx also had a major number of 136, because it silently allocated and deallocated a pseudo-terminal. Device nodes that cannot be opened now have proper major/minor-numbers. Reviewed by: kib, netchild, rdivacky (thanks!)	2009-02-20 13:05:29 +00:00
jhb	5dc7ef7e69	Enable caching of negative pathname lookups in the NFS client. To avoid stale entries, we save a copy of the directory's modification time when the first negative cache entry was added in the directory's NFS node. When a negative cache entry is hit during a pathname lookup, the parent directory's modification time is checked. If it has changed, all of the negative cache entries for that parent are purged and the lookup falls back to using the RPC. This required adding a new cache_purge_negative() method to the name cache to purge only negative cache entries for a given directory. Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp Reviewed by: mohans	2009-02-19 22:28:48 +00:00
ed	092e753861	Squash some small bugs in pts(4). - Don't return a negative errno when using an unknown ioctl() on a pseudo-terminal master device. Be sure to convert ENOIOCTL to ENOTTY, just like the TTY layer does. - Even though we should return st_rdev of the master device node when emulating pty(4) devices, FIODGNAME should still return the name of the slave device. Otherwise ptsname(3) and ttyname(3) return an invalid device name.	2009-02-19 17:54:42 +00:00
attilio	bf75b4612a	- Add a function (fill_kinfo_aggregate()) which aggregates relevant members for a kinfo entry on a process-wide basis. - Use the newly introduced function to fix cases like KERN_PROC_PROC where aggregate stats are broken because they only consider the first thread in the pool for each process. (Note, additionally, that KERN_PROC_PROC is rather inaccurate for per-thread information like the 'state' of the process. Perhaps such information should be invalidated and forcibly discarded by the consumers?) - Simplify the logic of sysctl_out_proc() and adjust fill_kinfo_thread() accordingly. - Remove checks on FIRST_THREAD_IN_PROC() being NULL but add assertions. This patch should fix aggregate statistics for KERN_PROC_PROC. This is one of the reasons why top doesn't use this option, and now it can use it safely. ps, when launched to display just processes, should now report correct CPU utilization percentages and times (as opposed to the old code). Reviewed by: jhb, emaste Sponsored by: Sandvine Incorporated	2009-02-18 21:52:13 +00:00
marcus	60038f21cf	Remove the printf's when the vnode to be exported for procstat is not a VDIR. If the file system backing a process' cwd is removed, and procstat -f PID is called, then these messages would have been printed. The extra verbosity is not required in this situation. Requested by: kib Approved by: kib	2009-02-14 21:55:09 +00:00
marcus	130b8c14ad	Change two KASSERTS to printfs and simple returns. Stress testing has revealed that a process' current working directory can be VBAD if the directory is removed. This can trigger a panic when procstat -f PID is run. Tested by: pho Discovered by: phobot Reviewed by: kib Approved by: kib	2009-02-14 21:12:24 +00:00
thompsa	b04fb61e93	Remove semicolon left in the last commit Spotted by: csjp	2009-02-13 18:51:39 +00:00
jhb	26e338d6fc	Use shared vnode locks when invoking VOP_READDIR(). MFC after: 1 month	2009-02-13 18:18:14 +00:00
luigi	8faaf7bcde	Clarify and reimplement the bioq API so that bioq_disksort() has the correct behaviour (sorting by distance from the current head position in the scan direction) and bioq_insert_head() and bioq_insert_tail() have a well defined (and useful) behaviour, especially when intermixed with calls to bioq_disksort(). In particular: - fix a bug in the existing bioq_disksort() that did not use the current head position correctly; - redefine semantics of bioq_insert_head() and bioq_insert_tail(). bioq_insert_tail() can now be used as a barrier between previous and subsequent calls to bioq_disksort(). The code is heavily documented in the source code so please refer to that for the details. Much of this code comes from Fabio Checconi. Also thanks to Kirk for feedback on the (re)definition of bioq_insert_tail(). NOTE: in the current tree there is only a handful of files which intermix calls to bioq_disksort() with bioq_insert_head() and bioq_insert_tail(). The ordering of the queue in these situation was not specified (nor easy to figure out) before, so I doubt any of that code could be affected by the specification of the API. Also note that the current implementation is significantly simpler than the previous one (also used in ata_sort_queue()). It would be useful to reimplement ata_sort_queue() using the same code used in bioq_disksort(). MFC after: 1 week	2009-02-13 11:36:32 +00:00
thompsa	618fc50d6b	Check the exit flag at the start of the taskqueue loop rather than the end. It is possible to tear down the taskqueue before the thread has run and the taskqueue loop would sleep forever. Reviewed by: sam MFC after: 1 week	2009-02-13 01:16:51 +00:00
ed	5f2edd80fc	Serialize write() calls on TTYs. Just like the old TTY layer, the current MPSAFE TTY layer does not make any attempt to serialize calls of write(). Data is copied into the kernel in 256 (TTY_STACKBUF) byte chunks. If a write() call occurs at the same time, the data may interleave. This is especially likely when the TTY starts blocking, because the output queue reaches the high watermark. I've implemented this by adding a new flag, TTY_BUSY_OUT, which is used to mark a TTY as having a thread stuck in write(). Because I don't want non-blocking processes to be possibly blocked by a sleeping thread, I'm still allowing it to bypass the protection. According to this message, the Linux kernel returns EAGAIN in such cases, but I think that's a little too restrictive: http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2007/5/2/85418/thread PR: kern/118287	2009-02-11 16:28:49 +00:00
rwatson	ced47d0a8e	Modify fdcopy() so that, during fork(2), it won't copy file descriptors from the parent to the child process if they have an operation vector of &badfileops. This narrows a set of races involving system calls that allocate a new file descriptor, potentially block for some extended period, and then return the file descriptor, when invoked by a threaded program that concurrently invokes fork(2). Similar approaches are used in both Solaris and Linux, and the wideness of this race was introduced in FreeBSD when we moved to a more optimistic implementation of accept(2) in order to simplify locking. A small race necessarily remains because the fork(2) might occur after the finit() in accept(2) but before the system call has returned, but that appears unavoidable using current APIs. However, this race is vastly narrower. The fix can be validated using the newfileops_on_fork regression test. PR: kern/130348 Reported by: Ivan Shcheklein <shcheklein at gmail dot com> Reviewed by: jhb, kib MFC after: 1 week	2009-02-11 15:22:01 +00:00
imp	dac94dc031	o Use NULL in preference to 0 in pointer contexts. o Use newly minted KOBJMETHOD_END as appropriate. o Fix prototype for root_setup_intr.	2009-02-11 04:54:02 +00:00
mav	26e8dd306f	Check for device_set_devclass() errors and skip driver probe/attach if any. Attach call without devclass set crashes the system. On resume AHCI driver sometimes tries to create duplicate adX device. It is surely his own problem, but IMHO it is not a reason to crash here. Other reasons are also possible.	2009-02-10 23:22:29 +00:00
attilio	5c1db5bb2b	Scanning all the formats for binary translation during module loading can result in an error from one format's loader followed by correct recognition by another format. File format loading functions should avoid printing any additional information and should just return an appropriate error code, distinct for each condition. Additionally, the linker should handle the different format loading errors appropriately. While a general mechanism is desired, fix a simple and common case on amd64: the file type is not recognized for link elf and confuses the linker. Print an error if none of the registered linker classes can recognize and load the module. Reviewed by: jhb Sponsored by: Sandvine Incorporated	2009-02-10 15:50:19 +00:00
rwatson	c6f2b81096	Remove extra 'comma = 0' in socket state printing code, which otherwise could lead to an extra comma in output. Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de>	2009-02-09 18:19:58 +00:00
mbr	b47ed35e3f	s/SS_FDREF/SS_NOFDREF/	2009-02-09 13:29:01 +00:00
ed	ce1349c810	Remove a stale comment from the clists code. We don't support quote bits.	2009-02-09 11:27:56 +00:00
jhb	f856c6d618	Tweak the output of VOP_PRINT/vn_printf() some. - Align the fifo output in fifo_print() with other vn_printf() output. - Remove the leading space from lockmgr_printinfo() so its output lines up in vn_printf(). - lockmgr_printinfo() now ends with a newline, so remove an extra newline from vn_printf().	2009-02-06 20:06:48 +00:00
trasz	d102122bd0	Add KASSERTs to make it easier to debug problems like the one fixed in r188141. Reviewed by: kib,attilio Approved by: rwatson (mentor) Tested by: pho Sponsored by: FreeBSD Foundation	2009-02-06 18:16:01 +00:00
jhb	5dff890984	Expand the scope of the sysctllock sx lock to protect the sysctl tree itself. Back in 1.1 of kern_sysctl.c the sysctl() routine wired the "old" userland buffer for most sysctls (everything except kern.vnode.*). I think to prevent issues with wiring too much memory it used a 'memlock' to serialize all sysctl(2) invocations, meaning that only one user buffer could be wired at a time. In 5.0 the 'memlock' was converted to an sx lock and renamed to 'sysctl lock'. However, it still only served the purpose of serializing sysctls to avoid wiring too much memory and didn't actually protect the sysctl tree as its name suggested. These changes expand the lock to actually protect the tree. Later on in 5.0, sysctl was changed to not wire buffers for requests by default (sysctl_handle_opaque() will still wire buffers larger than a single page, however). As a result, user buffers are no longer wired as often. However, many sysctl handlers still wire user buffers, so it is still desirable to serialize userland sysctl requests. Kernel sysctl requests are allowed to run in parallel, however. - Expose sysctl_lock()/sysctl_unlock() routines to exclusively lock the sysctl tree for a few places outside of kern_sysctl.c that manipulate the sysctl tree directly including the kernel linker and vfs_register(). - sysctl_register() and sysctl_unregister() require the caller to lock the sysctl lock using sysctl_lock() and sysctl_unlock(). The rest of the public sysctl API manage the locking internally. - Add a locked variant of sysctl_remove_oid() for internal use so that external uses of the API do not need to be aware of locking requirements. - The kernel linker no longer needs Giant when manipulating the sysctl tree. - Add a missing break to the loop in vfs_register() so that we stop looking at the sysctl MIB once we have changed it. MFC after: 1 month	2009-02-06 14:51:32 +00:00
jhb	fa654c58a4	Drop the kernel linker lock while running SYSUNINIT routines and removing sysctls during a linker file unload. We drop the lock when doing similar operations during a linker file load. To close races, clear the LINKED flag before dropping the lock so that the linker file is no longer visible to userland. MFC after: 1 week	2009-02-05 23:01:36 +00:00
attilio	beddfe59b0	Add more KTR_VFS logging points in order to have more effective tracing. Reviewed by: brueffer, kib Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>	2009-02-05 15:03:35 +00:00
ed	e085cfc485	Don't leave the console TTY constantly open. When we leave the console TTY constantly open, we never reset the termios attributes. This causes output processing, echoing, etc. not to be reset to the proper values when going into single user mode after the system has booted. It also causes nl-to-crnl-conversion not to take place during shutdown, which causes a `staircase effect'. This patch adds a new TTY flag, TF_OPENED_CONS, which is set when the TTY is opened through /dev/console. Because the flags are only used by the kernel and the pstat(8) utility, I've decided to renumber the TTY flags. This shouldn't be an issue, because the TTY layer is not yet part of a stable release. Reported by: Mark Atkinson <atkin901 yahoo com> Tested by: sepotvin	2009-02-05 14:21:09 +00:00
jamie	8f639d4b9a	Don't allow creating a socket with a protocol family that the current jail doesn't support. This involves a new function prison_check_af, like prison_check_ip[46] but that checks only the family. With this change, most of the errors generated by jailed sockets shouldn't ever occur, at least until jails are changeable. Approved by: bz (mentor)	2009-02-05 14:15:18 +00:00
jamie	12bbe1869f	Standardize the various prison_foo_ip[46] functions and prison_if to return zero on success and an error code otherwise. The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL. Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls. Approved by: bz (mentor)	2009-02-05 14:06:09 +00:00
trasz	a4e8c3ba99	In some situations, mnt_lockref could go negative due to vfs_unbusy() being called without calling vfs_busy() first. This made umount(8) hang waiting for mnt_lockref to become zero, which would never happen. Reviewed by: kib Approved by: rwatson (mentor) Reported by: pho Found with: stress2 Sponsored by: FreeBSD Foundation	2009-02-05 08:46:18 +00:00
rwatson	d686641019	Remove written-to but never read local variable 'offset' from soreceive_dgram(). Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de> MFC after: 1 week	2009-02-04 20:00:17 +00:00
ed	35bb1e1a73	Remove slush space from clists. Right now we only have a very small amount of drivers that use clists, but we still allocate 50 cblocks as slush space, which allows drivers to temporarily overcommit their storage. Most of the drivers don't allow this anyway. I've performed the following changes: - We don't allocate any cblocks on startup. - I've removed the DDB command, because it has nothing useful to print now. You can obtain the amount of allocated blocks by running `vmstat -m | grep clist'. - I've removed cfreecount, which is now unused. - The old code first tries to allocate using M_NOWAIT, followed by M_WAITOK. This doesn't make any sense, so just remove this logic. It seems the drivers allow us to sleep anyway. We can even remove ccmax from clist_alloc_cblocks and c_cbmax from struct clist, but this breaks binary compatibility. This reduces the amount of allocated cblocks on my system from 54 to 4.	2009-02-04 17:10:01 +00:00
ed	85ebf97341	Slightly improve the design of the TTY buffer. The TTY buffers used the standard <sys/queue.h> lists. Unfortunately they have a big shortcoming. If you want to have a double linked list, but no tail pointer, it's still not possible to obtain the previous element in the list. Inside the buffers we don't need them. This is why I switched to custom linked list macros. The macros will also keep track of the amount of items in the list. Because it doesn't use a sentinel, we can just initialize the queues with zero. In its simplest form (the output queue), we will only keep two references to blocks in the queue, namely the head of the list and the last block in use. All free blocks are stored behind the last block in use. I noticed there was a very subtle bug in the previous code: in a very uncommon corner case, it would uma_zfree() a block in the queue before calling memcpy() to extract the data from the block.	2009-02-03 19:58:28 +00:00
imp	b5abf9646f	Use NULL in preference to 0 in pointer contexts.	2009-02-03 07:54:42 +00:00
imp	02f34b7ded	Make bioq_disksort have an ANSI-C definition rather than a K&R definition.	2009-02-03 07:53:51 +00:00
imp	35e4f2b272	rman_debug should be static, so make it static.	2009-02-03 07:53:08 +00:00
imp	b5c4f1a094	Use ANSI function definition for profil.	2009-02-03 07:52:36 +00:00
imp	10cf8131b7	Prefer ANSI function definitions to K&R ones.	2009-02-03 07:52:07 +00:00
imp	5a073bfc7c	Use NULL in preference to 0 for pointers.	2009-02-03 07:51:41 +00:00
imp	82f181ca79	Use NULL in preference to 0 for pointers.	2009-02-03 07:51:11 +00:00
imp	54f6c3e35c	o Use unsigned for bit fields. o Use NULL for pointers in preference to 0.	2009-02-03 07:50:41 +00:00
imp	b033fcf7e9	int foo(void) is the proper ANSI function definition when there's no parameters. Use it for resettodr().	2009-02-03 07:50:01 +00:00
imp	0821a484fd	Declare bus_data_devices to be static: it isn't used elsewhere. Use NULL in a couple of places rather than 0 in the context of pointers to be consistent with the rest of the file.	2009-02-03 00:10:21 +00:00
sepotvin	9d78a7fce3	Fix select on platforms where sizeof(long) != sizeof(int). This used to work by accident before the cleanup done in revision 187693. Approved by: kan (mentor)	2009-02-02 03:34:40 +00:00
rwatson	7147753438	If a process is a zombie and we couldn't identify another useful state, print out the state as "zombie" in preference to "unknown" when ^T is pressed. MFC after: 3 days Sponsored by: Google, Inc.	2009-01-29 09:32:56 +00:00
ed	50efccc9f0	Mark most often used sysctl's as MPSAFE. After running a `make buildkernel', I noticed most of the Giant locks in sysctl are only caused by a very small amount of sysctl's: - sysctl.name2oid. This one is locked by SYSCTL_LOCK, just like sysctl.oidfmt. - kern.ident, kern.osrelease, kern.version, etc. These are just constant strings. - kern.arandom, used by the stack protector. It is already protected by arc4_mtx. I also saw the following sysctl's show up. Not as often as the ones above, but still quite often: - security.jail.jailed. Also mark security.jail.list as MPSAFE. They don't need locking or already use allprison_lock. - kern.devname, used by devname(3), ttyname(3), etc. This seems to reduce Giant locking inside sysctl by ~75% in my primitive test setup.	2009-01-28 19:58:05 +00:00
jhb	04d889be20	Convert the global mutex protecting the directory lookup name cache from a mutex to a reader/writer lock. Lookup operations first grab a read lock and perform the lookup. If the operation results in a need to modify the cache, then it tries to do an upgrade. If that fails, it drops the read lock, obtains a write lock, and redoes the lookup.	2009-01-28 19:05:18 +00:00
ed	9b5c4f4d39	Use the proper flag to let kern.ttys be executed without Giant. Pointed out by: jhb	2009-01-26 16:43:18 +00:00
jhb	b8c520ae3c	Whitespace tweak.	2009-01-26 15:32:39 +00:00
jeff	687eba74e7	- bit has to be fd_mask to work properly on 64bit platforms. Constants must also be cast even though the result ultimately is promoted to 64bit. - Correct a loop index upper bound in selscan().	2009-01-25 18:38:42 +00:00
rwatson	97295d8b75	When a statically linked binary is executed (or at least, one without an interpreter definition in its program header), set the auxiliary ELF argument AT_BASE to 0 rather than to the address that we would have mapped the interpreter at if there had been one. The ELF ABI specifications appear to be ambiguous as to the desired behavior in this situation, as they define AT_BASE as the base address of the interpreter, but do not mention what to do if there is none. On Solaris, AT_BASE will be set to the base address of the static binary if there is no interpreter, and on Linux, AT_BASE is set to 0. We go with the Linux semantics as they are of more immediate utility and allow the early runtime environment to know that the kernel has not mapped an interpreter, but because AT_PHDR points at the ELF header for the running binary, it is still possible to retrieve all required mapping information when the process starts should it be required. Either approach would be preferable to our current behavior of passing a pointer to an unmapped region of user memory as AT_BASE. MFC after: 3 weeks	2009-01-25 12:07:43 +00:00
bz	6dddd78341	For consistency with prison_{local,remote,check}_ipN rename prison_getipN to prison_get_ipN. Submitted by: jamie (as part of a larger patch) MFC after: 1 week	2009-01-25 10:11:58 +00:00
jeff	dcd94957aa	- Correct a typo in a comment. Noticed by: danger	2009-01-25 09:17:16 +00:00
jeff	69d1bd8670	- Make the keg abstraction more complete. Permit a zone to have multiple backend kegs so it may source compatible memory from multiple backends. This is useful for cases such as NUMA or different layouts for the same memory type. - Provide a new api for adding new backend kegs to secondary zones. - Provide a new flag for adjusting the layout of zones to stagger allocations better across cache lines. Sponsored by: Nokia	2009-01-25 09:11:24 +00:00
ed	ce1034ac57	Remove unneeded use of device unit numbers from pty(4). A much more simple approach to generate the slave device name, is to obtain the device name of the master and replace 'p' by 't'.	2009-01-25 08:27:11 +00:00
jeff	d4c94410f6	- Use __XSTRING where I want the define to be expanded. This resulted in sizeof("MAXCPU") being used to calculate a string length rather than something more reasonable such as sizeof("32"). This shouldn't have caused any ill effect until we run on machines with 1000000 or more cpus.	2009-01-25 07:35:10 +00:00
jeff	3688ae7441	Fix errors introduced when I rewrote select. - Restructure selscan() and selrescan() to avoid producing extra selfds when we have a fd in multiple sets. As described below multiple selfds may still exist for other reasons. - Make selrescan() tolerate multiple selfds for a given descriptor set since sockets use two selinfos per fd. If an event on each selinfo fires selrescan() will see the descriptor twice. This could result in select() returning 2x the number of fds actually existing in fd sets. Reported by: mgleason@ncftp.com	2009-01-25 07:24:34 +00:00
ed	e3697d4040	Mark kern.ttys as MPSAFE. sysctl now allows Giantless calls, so make kern.ttys use this. If it needs Giant, it locks the proper TTY anyway.	2009-01-24 18:20:15 +00:00
rwatson	aaaff3620b	Add explicit static DTrace tracing to the callout mechanism, capturing pointers to the callout handler just before and just after the callout is invoked. I attempted to do this in a manner congruent to tracing in Solaris's callout mechanism, but couldn't quite use the same names due to convention and syntax differences. Example DTrace script to generate a distribution graph of callout execution times: callout_execute:::callout_start { self->cstart = timestamp; } callout_execute:::callout_end { @length = quantize(timestamp - self->cstart); } Reviewed by: jb MFC after: 3 days	2009-01-24 10:22:49 +00:00
jhb	a9601871c9	- Mark all standalone INT/LONG/QUAD sysctl's MPSAFE. This is done inside the SYSCTL() macros and thus does not need to be done for all of the nodes scattered across the source tree. - Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE. - Mark vm.loadavg MPSAFE. - Remove GIANT_REQUIRED from vmtotal() (everything in this routine already has sufficient locking) and mark vm.vmtotal MPSAFE. - Mark the vm.stats.(sys|vm).* sysctls MPSAFE.	2009-01-23 22:49:23 +00:00
jhb	0245a3370f	- Add conditional Giant locking around the vrele() in sysctl_kern_proc_pathname(). - Mark all the kern.proc.* sysctls as MPSAFE. Submitted by: csjp (2)	2009-01-23 22:46:45 +00:00
jhb	b6d1e3ceff	Add a flag to tag individual sysctl leaf nodes as MPSAFE and thus not needing Giant. Submitted by: csjp (an older version)	2009-01-23 22:40:35 +00:00
jhb	d94da54d95	Use shared vnode locks for fchdir(). Submitted by: ups	2009-01-23 22:13:30 +00:00
jhb	d7c8a44c0d	Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not updating va_atime via VOP_SETATTR() isn't always synchronous. For some filesystems it is asynchronous. Suggested by: bde	2009-01-23 22:13:00 +00:00
jhb	4efa7c83e1	Push down Giant in the vnlru kproc main loop so that it is only acquired around calls to vlrureclaim() on non-MPSAFE filesystems. Specifically, vnlru no longer needs Giant for the common case of waking up and deciding there is nothing for it to do. MFC after: 2 weeks	2009-01-23 22:08:54 +00:00
jhb	d2c61e641d	Use the correct type for the timeout parameter to the 32-bit compat version of aio_waitcomplete(). Reminded by: bz Submitted by: jamie MFC after: 3 days	2009-01-23 13:23:17 +00:00
jhb	2939c2f76f	Fix a few style bogons. Submitted by: bde	2009-01-21 20:08:17 +00:00
kib	44eef9d9bb	Move the code from ufs_lookup.c used to do dotdot lookup, into the helper function. It is supposed to be useful for any filesystem that has to unlock dvp to walk to the ".." entry in lookup routine. Requested by: jhb Tested by: pho MFC after: 1 month	2009-01-21 14:51:38 +00:00
jhb	47455a7b41	Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP: VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME can be performed while holding a shared vnode lock (the same functionality is done internally by VOP_READ which can run with a shared vnode lock). Add missing locking of the vnode interlock to the ufs implementation and remove a special note and test from the NFS client about not supporting the feature. Inspired by: ups Tested by: pho	2009-01-21 14:42:00 +00:00
thompsa	50e14c608e	Add functions to WITNESS so it can be asserted that the lock is not released for a section of code; this uses WITNESS_NORELEASE() and WITNESS_RELEASEOK() to mark the boundaries. Both functions require the lock to be held when calling. This is intended for scenarios like a bus asserting that the bus lock is not dropped during a driver call. There doesn't appear to be a man page to document this in. Reviewed by: jhb	2009-01-21 04:19:18 +00:00
kib	cbb8defa10	FFS puts the extended attributes blocks at the negative blocks for the vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it incorrectly does vm_object_page_remove(0, 0), removing all pages from the underlying vm object, not only the pages that back the extended attributes data. Change vinvalbuf() to not remove any pages from the object when V_NORMAL or V_ALT are specified. Instead, the only in-tree caller in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitly removes the corresponding page range. The V_NORMAL caller does vnode_pager_setsize(vp, 0) immediately after the call to vinvalbuf(V_NORMAL) already. Reported by: csjp Reviewed by: ups MFC after: 3 weeks	2009-01-20 11:27:45 +00:00
mckay	7620a1118d	Add a limit on namecache entries. In normal operation, the number of cache entries is roughly equal to the number of active vnodes. However, when most of the recently accessed vnodes have many hard links, the number of cache entries can be 32000 times as large, exhausting kernel memory and provoking a panic in kmem_malloc(). MFC after: 2 weeks	2009-01-20 04:21:21 +00:00
mav	dcb9508070	Teach m_copyback() to use trailing space of the last mbuf in chain.	2009-01-18 20:19:55 +00:00
jeff	3d8d825555	- Implement generic macros for producing KTR records that are compatible with src/tools/sched/schedgraph.py. This allows developers to quickly create a graphical view of ktr data for any resource in the system. - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly identifying records associated with a thread or cpu. - Reimplement the KTR_SCHED traces using the new generic facility. Obtained from: attilio Discussed with: jhb Sponsored by: Nokia	2009-01-17 07:17:57 +00:00
kib	92e5c3777e	Lock the semaphore identifier lock during semaphore initialization to guarantee atomicity of the operation for other semaphore consumers. In particular, this should guard against access to the semaphore with not done or partially done MAC label assignment. Reviewed by: rwatson MFC after: 1 month	2009-01-15 12:15:46 +00:00
kib	f068898e84	It seems that there are at least three issues with IPC_RMID operation on SysV semaphores. The squeeze of the semaphore array in the kern_semctl() modifies sem_base for the semaphores with sem_base greater than sem_base of the removed semaphore, as well as the values of the semaphores, without locking their mutex. This can lead to (killable) hangs or unexpected behaviour of the processes performing any sem operations while other process does IPC_RMID. The semexit_myhook() eventhandler unlocks SEMUNDO_LOCK() while accessing *suptr. This allows for IPC_RMID for the sem id to be performed in parallel with undo hook referenced by the current undo structure. This leads to the panic("semexit - semid not allocated") [1]. The semaphore creation is protected by Giant, while IPC_RMID is done while only semaphore mutex is held. This seems to result in invalid values for semtot, causing random ENOSPC error returns [2]. Redo the locking of the semaphores lifetime cycle. Delegate the sem_mtx to the sole purpose of protecting semget() and semctl(IPC_RMID). Introduce new sem_undo_mtx to protect SEM_UNDO handling. Remove the Giant remnants from the code. Note that mac_sysvsem_check_semget() and mac_sysvsem_create() are now called while sem_mtx is held, as well as mac_sysvsem_cleanup() [3]. When semaphore is removed, acquire semaphore locks for all semaphores with sem_base that is going to be changed by squeeze of the sema array. The lock order is not important there, because the region is protected by sem_mtx. Organize both used and free sem_undo structures into the lists, protected by sem_undo_mtx. In semexit_myhook(), remove sem_undo structure that is being processed, from used list, without putting it onto the free to prevent modifications by other threads. This allows for sem_undo_lock to be dropped to acquire individual semaphore locks without violating lock order. 
Since IPC_RMID may no longer find this sem_undo, do tolerate references to unallocated semaphores in undo structure, and check sequential number to not undo unrelated semaphore with the same id. While there, convert functions definitions to ANSI C and fix small style(9) glitches. Reported by: Omer Faruk Sen <omerfsen gmail com> [1], pho [2] Reviewed by: rwatson [3] Tested by: pho MFC after: 1 month	2009-01-14 15:20:13 +00:00
jhb	a20bdb255f	Add a new KTR tracepoint in the KTR_CALLOUT class to note when a callout routine finishes executing. MFC after: 1 week	2009-01-13 15:56:53 +00:00
kib	97fa304203	Do not call namei() while having another user-controlled vnode locked. Lookup could attempt to recursively lock that vnode. Do not call vn_start_write(V_WAIT) while vnode is locked, this may result in a deadlock with suspension. vfs_busy() the mountpoint before dropping vnode lock for vnode that was used to look up the mountpoint, to prevent unmount in between. Reported and tested by: pho Reviewed by: rwatson MFC after: 3 weeks	2009-01-08 12:47:30 +00:00
ed	2f8c83e018	Remove Giant locking from domains list. During boot, the domain list is locked with Giant. It is not possible to register any protocols after the system has booted, so the lock is only used to protect insertion of entries. There is already a mutex in uipc_domain.c called dom_mtx. Use this mutex to lock the list, instead of using Giant. It won't matter anything with respect to performance, but we'll never get rid of Giant if we don't remove from places where we don't need it. Approved by: rwatson MFC after: 3 weeks	2009-01-04 19:22:53 +00:00
rwatson	af9e0de76f	Remove two further uses (debugging and NULLing) of pr_ousrreq, missed due to svn commit in the wrong directory. Spotted by: bz	2009-01-04 19:16:36 +00:00
bz	903df7883d	Back out r186615; the sanitizing of the pointers in the error case is not needed and seems that it will not be needed either. Pointy hat: mine, mine, mine and not pho's	2009-01-04 12:18:18 +00:00
kib	ac1b596fda	Extend the struct vm_page wire_count to u_int to avoid the overflow of the counter, that may happen when too many sendfile(2) calls are being executed with this vnode [1]. To keep the size of the struct vm_page and offsets of the fields accessed by out-of-tree modules, swap the types and locations of the wire_count and cow fields. Add safety checks to detect cow overflow and force fallback to the normal copy code for zero-copy sockets. [2] Reported by: Anton Yuzhaninov <citrin citrin ru> [1] Suggested by: alc [2] Reviewed by: alc MFC after: 2 weeks	2009-01-03 13:24:08 +00:00
ed	fa1537b948	Fix a corner case in my previous commit. Even though there are not many setups that have absolutely no console device, make sure a close() on a TTY doesn't dereference a null pointer.	2009-01-02 23:39:29 +00:00
ed	8244396e79	Don't let /dev/console be revoked if the TTY below is being closed. During startup some of the syscons TTY's are used to set attributes like the screensaver and mouse options. These actions cause /dev/console to be rendered unusable. Fix the issue by leaving the TTY opened when it is used as the console device. Reported by: imp	2009-01-02 23:32:43 +00:00
rwatson	96b82a80dd	White space and comment tweaks. MFC after: 3 weeks	2009-01-01 20:03:22 +00:00
rwatson	c049b2b5fe	Temporary workaround for the limitations of the mbuf flowid field: zero the field in the mbuf constructors, since otherwise we have no way to tell if they are valid. In the future, Kip has plans to add a flag specifically to indicate validity, which is the preferred model.	2009-01-01 20:03:01 +00:00
ed	05475b9543	Don't clobber sysctl_root()'s error number. When sysctl() is being called with a buffer that is too small, it will return ENOMEM. Unfortunately the changes I made the other day sets the error number to 0, because it just returns the error number of the copyout(). Revert this part of the change.	2009-01-01 00:19:51 +00:00
ivoras	4136fd8892	Document the relationship between enum VM_GUEST and the vm_guest_sysctl_names array. Approved by: gnn (original version)	2008-12-30 23:49:54 +00:00
pho	2f6e82c78d	Added missing second part of cleaning j->ip[46] as requested by bz Approved by: kib (mentor) Pointy hat: pho	2008-12-30 20:39:47 +00:00
pho	6e2644b311	Make sure that unused j->ip[46] are cleared Reviewed by: bz Approved by: kib (mentor)	2008-12-30 17:54:25 +00:00
rwatson	0c0b8926ba	Rename mbcnt to mbcnt_delta in uipc_send() -- unlike other local variables named mbcnt in uipc_usrreq.c, this instance is a delta rather than a cache of sb_mbcnt. MFC after: 3 weeks	2008-12-30 16:09:57 +00:00
kib	2349a65923	Clear the pointers to the file in the struct filedesc before file is closed in fdfree. Otherwise, sysctl_kern_proc_filedesc may dereference stale struct file * values. Reported and tested by: pho MFC after: 1 month	2008-12-30 12:51:56 +00:00
kib	c81ec4dc0c	In r185557, the check for existing negative entry for the given name did not compare nc_dvp with supplied parent directory vnode pointer. Add the check and note that now branches for vp != NULL and vp == NULL are the same, thus can be merged. Reported and reviewed by: kan Tested by: pho MFC after: 2 weeks	2008-12-30 12:51:14 +00:00
ed	3f319ef66d	Fix compilation. Also move ogetkerninfo() to kern_xxx.c. It seems I forgot to remove `int error' from a single piece of code. I'm also moving ogetkerninfo() to kern_xxx.c, because it belongs to the class of compat system information system calls, not the generic sysctl code.	2008-12-29 19:24:00 +00:00
ed	f3a9a195cb	Push down Giant inside sysctl. Also add some more assertions to the code. In the existing code we didn't really enforce that callers hold Giant before calling userland_sysctl(), even though there is no guarantee it is safe. Fix this by just placing Giant locks around the call to the oid handler. This also means we only pick up Giant for a very short period of time. Maybe we should add MPSAFE flags to sysctl or phase it out all together. I've also added SYSCTL_LOCK_ASSERT(). We have to make sure sysctl_root() and name2oid() are called with the sysctl lock held. Reviewed by: Jille Timmermans <jille quis cx>	2008-12-29 12:58:45 +00:00
kib	bd5d614be8	vm_map_lock_read() does not increment map->timestamp, so we should compare map->timestamp with saved timestamp after map read lock is reacquired, not with saved timestamp + 1. The only consequence of the +1 was unconditional lookup of the next map entry, though. Tested by: pho Approved by: des MFC after: 2 weeks	2008-12-29 12:45:11 +00:00
kmacy	208a4373c4	drop rnh lock before destroying it	2008-12-28 14:32:27 +00:00
bz	7d22a18291	Hide detect_virtual() along with the accompanying string arrays under #ifndef XEN to make XEN config compile again. In case of Xen vm_guest is hard coded. Move the list for the vm_guest sysctl out of the restrictive bounds as the sysctl is there in either case.	2008-12-27 17:19:16 +00:00
pho	fbacc4af83	Prevent overflow of uio_resid. Approved by: kib	2008-12-27 10:13:43 +00:00
rwatson	13abb9545e	Following the recent security advisory, add a comment describing our invariants and approach for protocol switch methods in protsw_init(), and also some KASSERT's for non-domain init entries in protocol switch tables: pru_abort and pru_send must both be implemented. For now, leave those assertions #if 0'd, since there are a few protocols that violate them in non-harmful ways. Whether or not we should enforce pru_abort being implemented for non-stream protocols is an interesting question: currently abort is only invoked on stream sockets in situations where un-accepted sockets must be abruptly closed (i.e., close() on a listen socket with pending connections), but in principle it is useful for datagram sockets and most datagram socket types implement it. MFC after: 3 weeks	2008-12-25 11:32:38 +00:00
marcus	09a2776e69	Do not KASSERT when vp->v_dd is NULL. Only directories which have had ".." looked up would have v_dd set to a non-NULL value. This fixes a panic seen when running installworld on a diskless system with a separate /usr file system. Submitted by: cracauer Approved by: kib	2008-12-23 20:43:42 +00:00
kib	87478ed893	Keep the hold on the vnode during VOP_VPTOCNP() call, allowing the vop implementation to drop vnode lock, if needed. Reported and tested by: pho	2008-12-23 20:04:31 +00:00
ivoras	97219f9ae7	Add missing newlines to flags tags of CPU topology, for prettier output. Reviewed by: jeff (original version) Approved by: gnn (mentor) (original version)	2008-12-23 16:19:59 +00:00
cperciva	87e5b5b6cc	Prevent cross-site forgery attacks on ftpd(8) due to splitting long commands into multiple requests. [08:12] Avoid calling uninitialized function pointers in protocol switch code. [08:13] Merry Christmas everybody... Approved by: so (cperciva) Approved by: re (kensmith) Security: FreeBSD-SA-08:12.ftpd, FreeBSD-SA-08:13.protosw	2008-12-23 01:23:09 +00:00
ed	c48b61a389	Revert r185891. In r185891 I removed the newlines from messages written to /dev/console, because it made startup messages from rc-scripts harder to read. This, unfortunately, causes the kernel message that is printed after a non-terminated log message to be concatenated. This could be fixed, but on short term it's better to just revert the change. Reported by: Jaakko Heinonen <jh saunalahti fi>	2008-12-21 21:54:01 +00:00
ed	8cd1b93741	Set PTS_FINISHED before waking up any threads. Inside ptsdrv_{in,out}wakeup() we call KNOTE_LOCKED() to wake up any kevent(2) users. Because the kqueue handlers are executed synchronously, we must set PTS_FINISHED before calling ptsdrv_{in,out}wakeup(). Discovered by: nork	2008-12-21 21:16:57 +00:00
ed	1bd8105e62	Let wchan names more closely match pre-MPSAFE TTY behaviour. Right now the wchan strings "ttyinp" and "ttybgw" only differ one character from the strings we used prior to MPSAFE TTY. Just rename them back to their pre-MPSAFE TTY counterparts. Also rename "ttylck" to "ttymtx", which should make it more clear that a process is blocked on the TTY mutex, not some other form of locking.	2008-12-20 09:36:40 +00:00
nwhitehorn	3fcef6d9c2	Modularize the Open Firmware client interface to allow run-time switching of OFW access semantics, in order to allow future support for real-mode OF access and flattened device trees. OF client interface modules are implemented using KOBJ, in a similar way to the PPC PMAP modules. Because we need Open Firmware to be available before mutexes can be used on sparc64, changes are also included to allow KOBJ to be used very early in the boot process by only using the mutex once we know it has been initialized. Reviewed by: marius, grehan	2008-12-20 00:33:10 +00:00
ivoras	9b61ff858b	Further beautify the lock strings to be more pleasing to the eye and self documenting within 6 characters. Reviewed by: ed (older version) Approved by: gnn (older version)	2008-12-19 14:49:14 +00:00
ru	021bbbd29f	Removed a comment made obsolete by revisions 157927 and 174292.	2008-12-18 15:56:12 +00:00
ivoras	c6f6eeca99	By popular request, stringify kern.vm_guest sysctl. Now it returns a short, self-documenting string describing the detected virtual environment. Approved by: gnn (mentor) (earlier version)	2008-12-18 15:34:38 +00:00
ivoras	a6186ea60d	Remove spaces in wait object names to make top (1) output prettier and unbreak scripts that examine ps (1) output. Reviewed by: ed Approved by: gnn (mentor)	2008-12-18 15:25:33 +00:00
kib	5b3918fe07	The quotactl, statfs and fstatfs syscall implementations may dereference NULL pointer to struct mount if the looked up vnode is reclaimed. Also, these syscalls only mnt_ref() the mp, still allowing it to be unmounted; only struct mount memory is kept from being reused. Lock the vnode when doing name lookup, then reference its mount point, unlock the vnode and vfs_busy the mountpoint. This sequence shall take care of both races. Reported and tested by: pho Discussed with: attilio MFC after: 1 month	2008-12-18 12:01:19 +00:00
kib	fe785ac856	Do not return success and doomed vnode from lookup. LK_UPGRADE allows the vnode to be reclaimed. Tested by: pho MFC after: 1 month	2008-12-18 11:58:12 +00:00
ivoras	b769de9274	Introduce a sysctl kern.vm_guest that reflects what the kernel knows about it running under a virtual environment. This also introduces a globally accessible variable vm_guest that can be used where appropriate in the kernel to inspect this environment. To make it easier for the long run, an enum VM_GUEST is also introduced, which could possibly be factored out in a header somewhere (but the question is where - vm/vm_param.h? sys/param.h?) so it eventually becomes a part of the standard KPI. In any case, it's a start. The purpose of all this isn't to absolutely detect that the OS is running under a virtual environment (cf. "redpill") but to allow the parts of the kernel and the userland that care about this particular aspect and can do something useful depending on it to have a standardised interface. Reducing kern.hz is one example but there are other things that could be done like avoiding context switches, not using CPU instructions that are known to be slow in emulation, possibly different strategies in VM (memory) allocation, CPU scheduling, etc. It isn't clear if the JAILS/VIMAGE functionality should also be exposed by this particular mechanism (probably not since they're not "full" virtual hardware environments). Sometime in the future another sysctl and a variable could be introduced to reflect if the kernel supports any kind of virtual hosting (e.g. VMWare VMI, Xen dom0). Reviewed by: silence from src-commiters@, virtualization@, kmacy@ Approved by: gnn (mentor) Security: Obscurity doesn't help.	2008-12-17 19:57:12 +00:00
peter	35932c6d7c	Remove sysctl debug.elf_trace and the trace field in auxargs. They go nowhere. It used to be the equivalent of $LD_DEBUG in rtld-elf. Elf_Auxargs is an internal structure.	2008-12-17 16:54:29 +00:00
imp	4ad1824222	Minor style(9) nit.	2008-12-17 16:25:20 +00:00
kib	ce7791f58d	Remove two remnant uses of AT_DEBUG.	2008-12-17 13:13:35 +00:00
attilio	697e2a94e4	1) Fix a deadlock in the VFS: - threadA runs vfs_rel(mp1) - threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK() - threadA runs vfs_busy(mp1) and, as long as, MNTK_UNMOUNT is set, sleeps waiting for threadB to complete the unmount - threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting for the refcount to expire. Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the unmounter is waiting for mnt_ref to expire. The vfs_busy contenders are awakened and fail, and if they retry, MNTK_REFEXPIRE won't allow them to sleep again. 2) Simplify significantly the code of vfs_mount_destroy() trimming unnecessary code: - as long as any reference exited, it is no-more possible to have write-op (primary and secondary) in progress. - it is not needed to drop and reacquire the mount lock. - filling the structures with dummy values is useless as long as it is going to be freed. Tested by: pho, Andrea Barberio <insomniac at slackware dot it> Discussed with: kib	2008-12-16 23:16:10 +00:00
mav	325d9e1d23	If possible, try to obtain max_mhz on cpufreq attach instead of first request. On HyperThreading CPUs logical cores have same frequency, so setting it on any core will change the other's one. In most cases first request to the second core will be the "set" request, done after setting frequency of the first core. In such case second CPU will obtain throttled frequency of the first core as it's max_mhz making cpufreq broken due to different frequency sets.	2008-12-16 01:24:05 +00:00
mav	37aff7daa7	Change ttyhook_register() second argument from thread to process pointer. Thread was not really needed there, while previous ng_tty implementation that used thread pointer had locking issues (using sx while holding mutex).	2008-12-13 21:17:46 +00:00
jkoshy	57939c399f	- Bug fix: prevent a thread from migrating between CPUs between the time it is marked for user space callchain capture in the NMI handler and the time the callchain capture callback runs. - Improve code and control flow clarity by invoking hwpmc(4)'s user space callchain capture callback directly from low-level code. Reviewed by: jhb (kern/subr_trap.c) Testing (various patch revisions): gnn, Fabien Thomas <fabien dot thomas at netasq dot com>, Artem Belevich <artemb at gmail dot com>	2008-12-13 13:07:12 +00:00
ed	7397703a4c	Add FIONREAD to pseudo-terminal master devices. All ioctl()'s that aren't implemented by pts(4) are forwarded to the TTY itself. Unfortunately this is not correct for FIONREAD, because it will give the wrong amount of bytes that are available to read. Tested by: keramida Reminded by: keramida	2008-12-13 07:23:55 +00:00
kib	ba6c22347a	Uio_yield() already does DROP_GIANT/PICKUP_GIANT, no need to repeat this around the call. Noted by: bde	2008-12-12 14:03:04 +00:00
kib	e747469903	Reference the vmspace of the process being inspected by procfs, linprocfs and sysctl kern_proc_vmmap handlers. Reported and tested by: pho Reviewed by: rwatson, des MFC after: 1 week	2008-12-12 12:12:36 +00:00
kib	2ef4ea7ee8	The userland_sysctl() function retries sysctl_root() until returned error is not EAGAIN. Several sysctls that inspect another process use p_candebug() for checking access right for the curproc. p_candebug() returns EAGAIN for some reasons, in particular, for the process doing exec() now. If execing process tries to lock Giant, we get a livelock, because sysctl handlers are covered by Giant, and often do not sleep. Break the livelock by dropping Giant and allowing other threads to execute in the EAGAIN loop. Also, do not return EAGAIN from p_candebug() when process is executing, use more appropriate EBUSY error [1]. Reported and tested by: pho Suggested by: rwatson [1] Reviewed by: rwatson, des MFC after: 1 week	2008-12-12 12:06:28 +00:00
marcus	91e684d7f9	Add a new VOP, VOP_VPTOCNP, which translates a vnode to its component name on a best-effort basis. Teach vn_fullpath to use this new VOP if a regular VFS cache lookup fails. This VOP is designed to supplement the VFS cache to provide a better chance that a vnode-to-name lookup will succeed. Currently, an implementation for devfs is being committed. The default implementation is to return ENOENT. A big thanks to kib for the mentorship on this, and to pho for running it through his stress test suite. Reviewed by: arch Approved by: kib	2008-12-12 00:57:38 +00:00
ed	9afe297ee5	Add kqueue()-support to pseudo-terminal master devices. One thing I didn't expect many applications to use, was kqueue() on pseudo-terminal master devices. There are applications that use kqueue() on the TTY itself (rtorrent, etc). That doesn't mean we shouldn't implement this. Libraries like libevent use kqueue() by default, which means they wouldn't be able to use kqueue(). The old TTY layer implements a very broken version of kqueue() by performing the actual polling on the TTY device. Discussed with: peter	2008-12-11 21:44:02 +00:00
bz	9a73283b1f	Order #includes - also to reduce diffs with vimage branches in p4. Sponsored by: The FreeBSD Foundation	2008-12-11 16:09:31 +00:00
bz	e65de9d982	Correctly check the number of prison states to not access anything outside the prison_states array. When checking if there is a name configured for the prison, check the first character to not be '\0' instead of checking if the char array is present, which it always is. Note, that this is different for the *jailname in the syscall. Found with: Coverity Prevent(tm) CID: 4156, 4155 MFC after: 4 weeks (just that I get the mail)	2008-12-11 01:04:25 +00:00
zec	7b573d1496	Conditionally compile out V_ globals while instantiating the appropriate container structures, depending on VIMAGE_GLOBALS compile time option. Make VIMAGE_GLOBALS a new compile-time option, which by default will not be defined, resulting in instantiations of global variables selected for V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be effectively compiled out. Instantiate new global container structures to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0, vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0. Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_ macros resolve either to the original globals, or to fields inside container structures, i.e. effectively #ifdef VIMAGE_GLOBALS #define V_rt_tables rt_tables #else #define V_rt_tables vnet_net_0._rt_tables #endif Update SYSCTL_V_*() macros to operate either on globals or on fields inside container structs. Extend the internal kldsym() lookups with the ability to resolve selected fields inside the virtualization container structs. This applies only to the fields which are explicitly registered for kldsym() visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently this is done only in sys/net/if.c. Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code, and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in turn result in proper code being generated depending on VIMAGE_GLOBALS. De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c which were prematurely V_irtualized by automated V_ prepending scripts during earlier merging steps. PF virtualization will be done separately, most probably after next PF import. Convert a few variable initializations at instantiation to initialization in init functions, most notably in ipfw. Also convert TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in initializer functions. 
Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation	2008-12-10 23:12:39 +00:00
bz	f30a0a94fe	Make sure nmbclusters are initialized before maxsockets by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE before the init_maxsockets() SYSINIT at SI_ORDER_ANY. Reviewed by: rwatson, zec Sponsored by: The FreeBSD Foundation MFC after: 4 weeks	2008-12-10 22:17:09 +00:00
bz	72852bcf84	Style changes only. Put the return type on an extra line[1] and add an empty line at the beginning as we do not have any local variables. Submitted by: rwatson [1] Reviewed by: rwatson MFC after: 4 weeks	2008-12-10 22:10:37 +00:00
ed	4400e3e134	Remove added newlines from logged messages written to /dev/console. The /dev/console device node logs all strings that are written to it. When the string does not contain a trailing newline, it appends one. I can imagine this was useful a long time ago, but with our current rc-scripts, it generates a whole bunch of messages that look like: | Configuring syscons: | blanktime | . By not appending the newlines, the output of `dmesg -a' is now (almost?) exactly the same as what the user will see on the console device (syscons, uart).	2008-12-10 21:48:05 +00:00
jhb	f3dcc2d9e0	- Add 32-bit compat system calls for VFS_AIO. The system calls live in the aio code and are registered via the recently added SYSCALL32_*() helpers. - Since the aio code likes to invoke fuword and suword a lot down in the "bowels" of system calls, add a structure holding a set of operations for things like storing errors, copying in the aiocb structure, storing status, etc. The 32-bit system calls use a separate operations vector to handle fuword32 vs fuword, etc. Also, the oldsigevent handling is now done by having separate operation vectors with different aiocb copyin routines. - Split out kern_foo() functions for the various AIO system calls so the 32-bit front ends can manage things like copying in and converting timespec structures, etc. - For both the native and 32-bit aio_suspend() and lio_listio() calls, just use copyin() to read the array of aiocb pointers instead of using a for loop that iterated over fuword/fuword32. The error handling in the old case was incomplete (lio_listio() just ignored any aiocb's that it got an EFAULT trying to read rather than reporting an error), and possibly slower. MFC after: 1 month	2008-12-10 20:56:19 +00:00
kmacy	c510d681c9	add RW_SYSINIT_FLAGS macro and rw_sysinit_flags initialization function	2008-12-08 21:46:55 +00:00
jkim	bc7e5e240b	- Detect Bochs BIOS variants and use HZ_VM as well. - Free kernel environment variable after its use. - Fix style(9) nits.	2008-12-08 18:39:59 +00:00
kib	8324189f53	Do drop vm map lock earlier in the sysctl_kern_proc_vmmap(), to avoid locking a vnode while having vm map locked. Reported and tested by: pho MFC after: 1 week	2008-12-08 12:29:30 +00:00
kmacy	598b522b42	- convert radix node head lock from mutex to rwlock - make radix node head lock not recursive - fix LOR in rtexpunge - fix LOR in rtredirect Reviewed by: sam	2008-12-07 21:15:43 +00:00
kib	ccad2ebfb2	Several threads in a process may do vfork() simultaneously. Then, all parent threads sleep on the parent' struct proc until corresponding child releases the vmspace. Each sleep is interlocked with proc mutex of the child, that triggers assertion in the sleepq_add(). The assertion requires that at any time, all simultaneous sleepers for the channel use the same interlock. Silent the assertion by using conditional variable allocated in the child. Broadcast the variable event on exec() and exit(). Since struct proc * sleep wait channel is overloaded for several unrelated events, I was unable to remove wakeups from the places where cv_broadcast() is added, except exec(). Reported and tested by: ganbold Suggested and reviewed by: jhb MFC after: 2 week	2008-12-05 20:50:24 +00:00
jhb	0f1a8aa011	When the SYSINIT() to load a module invokes the MOD_LOAD event successfully, move that module to the head of the associated linker file's list of modules. The end result is that once all the modules are loaded, they are sorted in the reverse of their load order. This causes the kernel linker to invoke the MOD_QUIESCE and MOD_UNLOAD events in the reverse of the order that MOD_LOAD was invoked. This means that the ordering of MOD_LOAD events that is set by the SI_* parameters to DECLARE_MODULE() are now honored in the same order they would be for SYSUNINIT() for the MOD_QUIESCE and MOD_UNLOAD events. MFC after: 1 month	2008-12-05 16:47:30 +00:00
jhb	0ac14961cc	- Invoke MOD_QUIESCE on all modules in a linker file (kld) before unloading any modules. As a result, if any module vetoes an unload request via MOD_QUIESCE, the entire set of modules for that linker file will remain loaded and active now rather than leaving the kld in a weird state where some modules are loaded and some are unloaded. - This also moves the logic for handling the "forced" unload flag out of kern_module.c and into kern_linker.c which is a bit cleaner. - Add a module_name() routine that returns the name of a module and use that instead of printing pointer values in debug messages when a module fails MOD_QUIESCE or MOD_UNLOAD. MFC after: 1 month	2008-12-05 13:40:25 +00:00
bz	965b4db5a5	Fix a credential reference leak. [1] Close subtle but relatively unlikely race conditions when propagating the vnode write error to other active sessions tracing to the same vnode, without holding a reference on the vnode anymore. [2] PR: kern/126368 [1] Submitted by: rwatson [2] Reviewed by: kib, rwatson MFC after: 4 weeks	2008-12-03 15:54:35 +00:00
bz	604d89458a	Rather than using hidden includes (with circular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation	2008-12-02 21:37:28 +00:00
kib	ade687809e	Shared lookup makes it possible to create several negative cache entries for one name. Then, creating inode with that name would remove one entry, leaving others dormant. Reclaiming the vnode would uncover negative entries, causing false return of ENOENT from the calls like stat, that do not create inode. Prevent creation of the duplicated negative entries. Reported and debugged with: pho Reviewed by: jhb X-MFC: after shared lookup changes	2008-12-02 11:14:16 +00:00
peter	76037b082e	Merge user/peter/kinfo branch as of r185547 into head. This changes struct kinfo_filedesc and kinfo_vmentry such that they are same on both 32 and 64 bit platforms like i386/amd64 and won't require sysctl wrapping. Two new OIDs are assigned. The old ones are available under COMPAT_FREEBSD7 - but it isn't that simple. The superseded interface was never actually released on 7.x. The other main change is to pack the data passed to userland via the sysctl. kf_structsize and kve_structsize are reduced for the copyout. If you have a process with 100,000+ sockets open, the unpacked records require a 132MB+ copyout. With packing, it is "only" ~35MB. (Still seriously unpleasant, but not quite as devastating). A similar problem exists for the vmentry structure - have lots and lots of shared libraries and small mmaps and its copyout gets expensive too. My immediate problem is valgrind. It traditionally achieves this functionality by parsing procfs output, in a packed format. Secondly, when tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled 32 bit binary which ran directly into the differing data structures in 32 vs 64 bit mode. (valgrind uses this to track file descriptor operations and this therefore affected every single 32 bit binary) I've added two utility functions to libutil to unpack the structures into a fixed record length and to make it a little more convenient to use.	2008-12-02 06:50:26 +00:00
peter	0cd59a18e3	Prune some whining.	2008-12-02 02:32:13 +00:00
kan	6f76d481f0	Shared memory objects that have size which is not necessarily equal to exact multiple of system page size should still be allowed to be mapped in their entirety to match the regular vnode backed file behavior. Reported by: ed Reviewed by: jhb	2008-12-01 22:33:50 +00:00
kensmith	b8203d2b22	Catch up with the disappearance of sys/dev/hfa.	2008-12-01 14:34:42 +00:00
attilio	2f606e171c	Fix an inverted check introduced in r184554. Submitted by: tegge Pointy hat to: me	2008-12-01 03:00:26 +00:00
peter	cd7b78c33f	Duplicate another few hundred lines of code in order to be compatible with unreleased binaries.	2008-12-01 02:13:32 +00:00
davidxu	25ef38d002	Revision 184199 had not been fully reverted, add missing piece. Reported by: phk	2008-12-01 01:54:55 +00:00
peter	2b1f03929a	Properly wrap this giant block of duplicate code inside COMPAT_FREEBSD7	2008-11-30 21:04:53 +00:00
peter	343bde9706	Implement copyout packing more along the lines of what I had in mind. Create a temporary duplicate implementation of old filedesc struct for pre-7.1 libgtop package. Todo: specific fd or addr request	2008-11-30 00:18:21 +00:00
peter	83dc2280ce	WIP kinfo_file/kinfo_vmentry tweaks. The idea: 1) to get the 32 and 64 bit versions in sync so that no shims are needed, Valgrind in particular exercises this. and: 2) reduce the size of the copyout. On large processes this turns out to be a huge problem. Valgrind also suffers from this since it needs to do this in a context that can't malloc. I want to pack the records. 3) Add new types.. 'tell me about fd N' and 'tell me about addr N'.	2008-11-29 20:55:11 +00:00
bz	817305efd5	Unbreak the no-networks (no INET/6) build that I broke with the commit in r185435. Pointyhat: no, but I could need a ski cap for the winter	2008-11-29 16:17:39 +00:00
bz	d2730d5b27	MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addition to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the aforementioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible	2008-11-29 14:32:14 +00:00
kib	bf74bb2e16	In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer to the fs, but before a vnode on the fs is locked, unmount may free fs structures, causing access to destroyed data and freed memory. Introduce a vfs_busymp() function that looks up and busies found fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the implementation of the handle syscalls. Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular, sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl handler, that prevents Giant-locked unmount code to interfere with it. Noted by: tegge Reviewed by: dfr Tested by: pho MFC after: 1 month	2008-11-29 13:34:59 +00:00
pjd	881f5f6bef	Improve KASSERT() call a bit: - Print flags in hex. - Note that flags can be fine and panic can be due to an unexpected error condition. - Remove redundant new line character. Even though the panic message exceeds 80 characters, keep it on one line so it is easier to grep.	2008-11-29 12:40:14 +00:00
bz	3139d98062	With the permissions of phk@ change the license on kern_jail.c to a 2 clause BSD license.	2008-11-28 19:23:46 +00:00
ed	355c41a8e4	Fix matching of message queues by name. The mqfs_search() routine uses strncmp() to match message queue objects by name. This is because it can be called from environments where the file name is not null terminated (the VFS for example). Unfortunately it doesn't compare the lengths of the message queue names, which means if a system has "Queue12345", the name "Queue" will also match. I noticed this when a student of mine handed in an exercise using message queues with names "Queue2" and "Queue". Reviewed by: rink	2008-11-28 14:53:18 +00:00
kib	c0925f09e7	Explicitly note that destroy_dev() sleeps. Requested by: ed (some time ago), Jaakko Heinonen <jh saunalahti fi>	2008-11-27 16:47:25 +00:00
ganbold	93f46a969a	Remove unused variable. Found with: Coverity Prevent(tm) CID: 3664 Approved by: kib	2008-11-27 04:40:37 +00:00
zec	95a15f5c84	Merge more of currently non-functional (i.e. resolving to whitespace) macros from p4/vimage branch. Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks. De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation	2008-11-26 22:32:07 +00:00
marcus	8b6bc38b7c	Move vn_fullpath1() outside of FILEDESC locking. This is being done in advance of teaching vn_fullpath1() how to query file systems for vnode-to-name mappings when cache lookups fail. Thanks to kib for guidance and patience on this process. Reviewed by: kib Approved by: kib	2008-11-25 15:36:15 +00:00
emaste	abb4c5f368	Correct typo in comment: thier -> their	2008-11-24 19:28:52 +00:00
dwmalone	e3dcd91abe	It's possible that the dump device has gone away after it was configured, change the message to let people know this is a possibility. I've slightly changed the message from the one submitted by Pekka to keep the printf on one line. Submitted by: Pekka Savola <pekkas@netcore.fi>	2008-11-23 21:05:22 +00:00
kib	8fad2283b3	Add sv_flags field to struct sysentvec with intention to provide description of the ABI of the currently executing image. Change some places to test the flags instead of explicit comparing with address of known sysentvec structures to determine ABI features. Discussed with: dchagin, imp, jhb, peter	2008-11-22 12:36:15 +00:00
kmacy	9d3bb599b1	- bump __FreeBSD version to reflect added buf_ring, memory barriers, and ifnet functions - add memory barriers to <machine/atomic.h> - update drivers to only conditionally define their own - add lockless producer / consumer ring buffer - remove ring buffer implementation from cxgb and update its callers - add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to allow drivers to efficiently manage multiple hardware queues (i.e. not serialize all packets through one ifq) - expose if_qflush to allow drivers to flush any driver managed queues This work was supported by Bitgravity Inc. and Chelsio Inc.	2008-11-22 05:55:56 +00:00
julian	cf07f793f2	Fix a scope problem in the multiple routing table code that stopped the SO_SETFIB socket option from working correctly. Obtained from: Ironport MFC after: 3 days	2008-11-19 19:19:30 +00:00
jhb	c2251260be	Allow device hints to wire the unit numbers of devices. - An "at" hint now reserves a device name. - A new BUS_HINT_DEVICE_UNIT method is added to the bus interface. When determining the unit number of a device, this method is invoked to let the bus driver specify the unit of a device given a specific devclass. This is the only way a device can be given a name reserved via an "at" hint. - Implement BUS_HINT_DEVICE_UNIT() for the acpi(4) and isa(4) bus drivers. Both of these busses implement this by comparing the resources for a given hint device with the resources enumerated by ACPI/PnPBIOS and wire a unit if the hint resources are a subset of the "real" resources. - Use bus_hinted_children() for adding hinted devices on isa(4) busses now instead of doing it by hand. - Remove the unit kludging from sio(4) as it is no longer necessary. Prodding from: peter, imp OK'd by: marcel MFC after: 1 month	2008-11-18 21:01:54 +00:00
jhb	b08d457fbe	When checking to see if another CPU is running its idle thread, examine the thread running on the other CPU instead of the thread being placed on the run queue. Reported by: Ravi Murty @ Intel Reviewed by: jeff	2008-11-18 05:41:34 +00:00
delphij	39ade9204d	Obey signedness flag in %z case. MFC after: 2 months	2008-11-17 23:57:40 +00:00
pjd	bbe899b96e	Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes. This brings a huge amount of changes, I'll enumerate only user-visible changes: - Delegated Administration Allows regular users to perform ZFS operations, like file system creation, snapshot creation, etc. - L2ARC Level 2 cache for ZFS - allows to use additional disks for cache. Huge performance improvements mostly for random read of mostly static content. - slog Allow to use additional disks for ZFS Intent Log to speed up operations like fsync(2). - vfs.zfs.super_owner Allows regular users to perform privileged operations on files stored on ZFS file systems owned by him. Very careful with this one. - chflags(2) Not all the flags are supported. This still needs work. - ZFSBoot Support to boot off of ZFS pool. Not finished, AFAIK. Submitted by: dfr - Snapshot properties - New failure modes Previously, if a write request failed, the system panicked. Now one can select from one of three failure modes: - panic - panic on write error - wait - wait for disk to reappear - continue - serve read requests if possible, block write requests - Refquota, refreservation properties Just quota and reservation properties, but don't count space consumed by children file systems, clones and snapshots. - Sparse volumes ZVOLs that don't reserve space in the pool. - External attributes Compatible with extattr(2). - NFSv4-ACLs Not sure about the status, might not be complete yet. Submitted by: trasz - Creation-time properties - Regression tests for zpool(8) command. Obtained from: OpenSolaris	2008-11-17 20:49:29 +00:00
kib	ec4c199459	Revert r184118. There is actually a code in the kernel, for instance in kern_unlinkat(), that expects that vn_start_write() actually fills the mp even when the call failed. As Tor noted, that pattern relies on the type stability of the mount points, as well as that suspended mount points are never freed and V_XSLEEP is always passed to vn_start_write() when called on a freed mount point. Reported by: stass Reviewed by: tegge PR: 123768	2008-11-16 21:56:29 +00:00
n_hibma	1e9b4258a4	Silence detach messages if the device has marked itself quiet (u3g). MFC after: 3 weeks	2008-11-13 21:46:19 +00:00
ed	1722724523	Don't forget to relock the TTY after uiomove() returns an error. Peter Holm just discovered this funny bug inside the TTY code: if uiomove() in ttydisc_write() returns an error, we forget to relock the TTY before jumping out of ttydisc_write(). Fix it by placing tty_unlock() and tty_lock() around uiomove(). Submitted by: pho	2008-11-12 09:04:44 +00:00
ed	8d12469978	Several cleanups related to pipe(2). - Use `fildes[2]' instead of `*fildes' to make more clear that pipe(2) fills an array with two descriptors. - Remove EFAULT from the manual page. Because of the current calling convention, pipe(2) raises a segmentation fault when an invalid address is passed. - Introduce kern_pipe() to make it easier for binary emulations to implement pipe(2). - Make Linux binary emulation use kern_pipe(), which means we don't have to recover td_retval after calling the FreeBSD system call. Approved by: rdivacky Discussed on: arch	2008-11-11 14:55:59 +00:00
gallatin	5b802b524b	Avoid scheduling firmware taskqs when cold. This prevents a panic which occurs when a driver attempts to load firmware at boot via firmware_get() when the firmware module has not been preloaded. firmware_get() will enqueue a task using a struct taskqueue allocated on the stack, and the machine will crash much later in the firmware taskq thread when taskqs are started and the struct taskqueue is garbage. Not objected to by: sam	2008-11-11 12:25:08 +00:00
ed	7baae41248	Regenerate system call tables for r184789.	2008-11-09 10:48:06 +00:00
ed	9d3703b842	Mark uname(), getdomainname() and setdomainname() with COMPAT_FREEBSD4. Looking at our source code history, it seems the uname(), getdomainname() and setdomainname() system calls got deprecated somewhere after FreeBSD 1.1, but they have never been phased out properly. Because we don't have a COMPAT_FREEBSD1, just use COMPAT_FREEBSD4. Also fix the Linuxolator to build without the setdomainname() routine by just making it call userland_sysctl on kern.domainname. Also replace the setdomainname()'s implementation to use this approach, because we're duplicating code with sysctl_domainname(). I wasn't able to keep these three routines working in our COMPAT_FREEBSD32, because that would require yet another keyword for syscalls.master (COMPAT4+NOPROTO). Because this routine is probably unused already, this won't be a problem in practice. If it turns out to be a problem, we'll just restore this functionality. Reviewed by: rdivacky, kib	2008-11-09 10:45:13 +00:00
kmacy	7f1b49e5c3	make kern.ipc.nmbclusters actually have a useful effect on nmbclusters et al. initialize pkthdr in field order	2008-11-09 01:53:06 +00:00
ed	c3eca8b8cc	Reduce the default baud rate of PTY's to 9600. On RELENG_6 (and probably RELENG_7) we see our syscons windows and pseudo-terminals have the following buffer sizes: \| LINE RAW CAN OUT IHIWT ILOWT OHWT LWT COL STATE SESS PGID DISC \| ttyv0 0 0 0 7680 6720 2052 256 7 OCcl 1146 1146 term \| ttyp0 0 0 0 7680 6720 1296 256 0 OCc 82033 82033 term These buffer sizes make no sense, because we often have much more output than input, but I guess having higher input buffer sizes improves guarantees of the system. On MPSAFE TTY I just sent both the input and output buffer sizes to 7 KB, which is pretty big on a standard FreeBSD install with 8 syscons windows and some PTY's. Reduce the baud rate to 9600 baud, which means we now have the following buffer sizes: \| LINE INQ CAN LIN LOW OUTQ USE LOW COL SESS PGID STATE \| ttyv0 1920 0 0 192 1984 0 199 7 2401 2401 Oil \| pts/0 1920 0 0 192 1984 0 199 5631 1305 2526 Oi This is a lot smaller, but for pseudo-devices this should be good enough. You need to do a lot of punching to fill up a 7.5 KB input buffer. If it turns out things don't work out this way, we'll just switch to 19200 baud.	2008-11-08 20:40:39 +00:00
rodrigc	ca625c199f	Merge latest DTrace changes from Perforce. Approved by: jb	2008-11-05 19:40:36 +00:00
davidxu	1ebf3ee9a3	Revert rev 184216 and 184199, due to the way the thread_lock works, it may cause a lockup. Noticed by: peter, jhb	2008-11-05 03:01:23 +00:00
jhb	fb34cedca4	Use shared vnode locks for auditing vnode arguments as auditing only does a VOP_GETATTR() which does not require an exclusive lock. Reviewed by: csjp, rwatson	2008-11-04 22:31:04 +00:00
jhb	bfd6c884ce	Don't bother calling setrunnable() and clearing the sleeping flag in sleepq_resume_thread() if the thread isn't asleep.	2008-11-04 19:13:53 +00:00
jhb	6c6f8c89e8	Remove unnecessary locking around vn_fullpath(). The vnode lock for the vnode in question does not need to be held. All the data structures used during the name lookup are protected by the global name cache lock. Instead, the caller merely needs to ensure a reference is held on the vnode (such as vhold()) to keep it from being freed. In the case of procfs' <pid>/file entry, grab the process lock while we gain a new reference (via vhold()) on p_textvp to fully close races with execve(2). For the kern.proc.vmmap sysctl handler, use a shared vnode lock around the call to VOP_GETATTR() rather than an exclusive lock. MFC after: 1 month	2008-11-04 19:04:01 +00:00
ed	27790fa127	Remove redundant return value tests. There is no need to test whether the return value is non-zero here. Just return the error number directly.	2008-11-04 10:58:02 +00:00
jhb	9757361d22	Adjust the license statement to more closely match a standard 3-clause BSD license. MFC after: 3 days	2008-11-03 21:17:02 +00:00
jhb	ee8312c8bb	Use shared vnode locks instead of exclusive vnode locks for the access(), chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek() (when figuring out the current size of the file in the SEEK_END case), pathconf(), readlink(), and statfs() system calls. Submitted by: ups (mostly) Tested by: pho MFC after: 1 month	2008-11-03 20:31:00 +00:00
attilio	26a604f3bc	Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless. Really, the concept of holdcnt in the struct mount is represented by the mnt_ref (which prevents the type-stable structure from being "recycled") handled through vfs_ref() and vfs_rel(). On this optic, switch the holdcnt acquisition into an emulated vfs_ref() (and subsequent release into vfs_rel()). Discussed with: kib Tested by: pho	2008-11-03 20:00:35 +00:00
jhb	24139401dd	A few style nits.	2008-11-03 19:33:20 +00:00
dfr	6929a6d99b	Regen.	2008-11-03 10:39:35 +00:00
dfr	2fb03513fc	Implement support for RPCSEC_GSS authentication to both the NFS client and server. This replaces the RPC implementation of the NFS client and server with the newer RPC implementation originally developed (actually ported from the userland sunrpc code) to support the NFS Lock Manager. I have tested this code extensively and I believe it is stable and that performance is at least equal to the legacy RPC implementation. The NFS code currently contains support for both the new RPC implementation and the older legacy implementation inherited from the original NFS codebase. The default is to use the new implementation - add the NFS_LEGACYRPC option to fall back to the old code. When I merge this support back to RELENG_7, I will probably change this so that users have to 'opt in' to get the new code. To use RPCSEC_GSS on either client or server, you must build a kernel which includes the KGSSAPI option and the crypto device. On the userland side, you must build at least a new libc, mountd, mount_nfs and gssd. You must install new versions of /etc/rc.d/gssd and /etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf. As long as gssd is running, you should be able to mount an NFS filesystem from a server that requires RPCSEC_GSS authentication. The mount itself can happen without any kerberos credentials but all access to the filesystem will be denied unless the accessing user has a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There is currently no support for situations where the ticket file is in a different place, such as when the user logged in via SSH and has delegated credentials from that login. This restriction is also present in Solaris and Linux. In theory, we could improve this in future, possibly using Brooks Davis' implementation of variant symlinks. Supporting RPCSEC_GSS on a server is nearly as simple. You must create service creds for the server in the form 'nfs/<fqdn>@<REALM>' and install them in /etc/krb5.keytab. The standard heimdal utility ktutil makes this fairly easy. After the service creds have been created, you can add a '-sec=krb5' option to /etc/exports and restart both mountd and nfsd. The only other difference an administrator should notice is that nfsd doesn't fork to create service threads any more. In normal operation, there will be two nfsd processes, one in userland waiting for TCP connections and one in the kernel handling requests. The latter process will create as many kthreads as required - these should be visible via 'top -H'. The code has some support for varying the number of service threads according to load but initially at least, nfsd uses a fixed number of threads according to the value supplied to its '-n' option. Sponsored by: Isilon Systems MFC after: 1 month	2008-11-03 10:38:00 +00:00
ivoras	d819bb20f8	Increase the initial sbuf size for CPU topology dump to something more usable for newer CPUs. The new value allows 2 x quad core configuration dumps to fit within the initial buffer without reallocations. Approved by: gnn (mentor) (older version) Pointed out by: rdivacky	2008-11-02 23:11:20 +00:00
attilio	e1f493235e	Improve VFS locking: - Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters. - Due to the change above, remove the mnt_lock lockmgr because it is now useless. - Due to the change above, vfs_busy() is no more linked to a lockmgr. Change so its KPI by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the old version (which was unlinked from the lockmgr already) and MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx once the mnt interlock is held (ability still desired by most consumers). - The stub used into vfs_mount_destroy(), that allows to override the mnt_ref if running for more than 3 seconds, make it totally useless. Remove it as it was thought to work into older versions. If a problem of "refcount held never going away" should appear, we will need to fix properly instead than trust on such hackish solution. - Fix a bug where returning (with an error) from dounmount() was still leaving the MNTK_MWAIT flag on even if the waiters were actually woken up. Just a place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter. - Remove the markercnt refcount as it is useless. This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly. Discussed with: kib Tested by: pho	2008-11-02 10:15:42 +00:00
ed	57b4089c20	Clamp the values of t_column to 5 digits in `pstat -t' and `show all ttys'. We often run into these very high column numbers when we run curses applications, because they don't print any newlines. This messes up the table output of `pstat -t'. If these numbers get really high, they aren't of any use to the reader anyway. Convert them to `99999' when they run out of bounds.	2008-11-01 13:40:46 +00:00
ed	c2c324d379	Reimplement the /dev/console device node. One of the pieces of code that I had left alone during the development of the MPSAFE TTY layer, was tty_cons.c. This file actually has two different functions: - It contains low-level console input/output routines (cnputc(), etc). - It creates /dev/console and wraps all its cdevsw calls to the appropriate TTY. This commit reimplements the second set of functions by moving it directly into the TTY layer. /dev/console is now a character device node that's basically a regular TTY, but does a lookup of `si_drv1' each time you open it. d_write has also been changed to call log_console(). d_close() is not present, because we must make sure we don't revoke the TTY after writing a log message to it. Even though I'm not convinced this is in line with the future directions of our console code, it is a good move for now. It removes recursive locking from the top half of the TTY layer. The previous implementation called into the TTY layer with Giant held. I'm renaming tty_cons.c to kern_cons.c now. The code hardly contains any TTY related bits, so we'd better give it a less misleading name. Tested by: Andrzej Tobola <ato iem pw edu pl>, Carlos A.M. dos Santos <unixmania gmail com>, Eygene Ryabinkin <rea-fbsd codelabs ru>	2008-11-01 08:35:28 +00:00
peter	1f7fd22cbb	Add three extra to the kinfo_proc_vmmap data. kve_offset - the offset within an object that a mapping refers to. fileid and fsid are inode/dev for vnodes. (Linux procfs has these and valgrind is really unhappy without them.) I believe I didn't change the size of the struct.	2008-10-31 05:43:19 +00:00
sobomax	dafc63cd43	Make it possible to compile kernel with KTR but without DDB.	2008-10-30 21:48:28 +00:00
ivoras	483637ae39	Introduce a new sysctl, kern.sched.topology_spec, that returns an XML dump of detected ULE CPU topology. This dump can be used to check the topology detection and for general system information. An example of CPU topology dump is: kern.sched.topology_spec: <groups> <group level="1" cache-level="0"> <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu> <flags></flags> <children> <group level="2" cache-level="0"> <cpu count="4" mask="0xf">0, 1, 2, 3</cpu> <flags></flags> </group> <group level="2" cache-level="0"> <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu> <flags></flags> </group> </children> </group> </groups> Reviewed by: jeff Approved by: gnn (mentor)	2008-10-29 13:36:23 +00:00
davidxu	11aa09b488	If threads limit is exceeded, increase the total number of failures.	2008-10-29 12:11:48 +00:00
trasz	4e57a80147	Rename a variable missed in previous accmode_t-related commits. Approved by: rwatson (mentor)	2008-10-28 21:58:48 +00:00
trasz	0ad8692247	Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)	2008-10-28 13:44:11 +00:00
kib	b9b0d2c54c	Style return statements in vn_pollrecord().	2008-10-28 12:22:33 +00:00
kib	86b5e61ab2	Protect check for v_pollinfo == NULL and assignment of the newly allocated vpollinfo with vnode interlock. Fully initialize vpollinfo before putting pointer to it into vp->v_pollinfo. Discussed with: dwhite Tested by: pho MFC after: 1 week	2008-10-28 12:08:36 +00:00
rwatson	a2129bd144	Rename three MAC entry points from _proc_ to _cred_ to reflect the fact that they operate directly on credentials: mac_proc_create_swapper(), mac_proc_create_init(), and mac_proc_associate_nfsd(). Update policies. Obtained from: TrustedBSD Project	2008-10-28 11:33:06 +00:00
peter	b5b26198a7	After a machine has been up for a bit more than 20 days with HZ=1000, "ticks" goes negative. This breaks the signed comparison in softclock. This causes sleep() to never wake up, tcp to stop, etc etc. This is bad(TM). Use the SEQ_LT() method from tcp's sequence number comparisons.	2008-10-28 03:26:25 +00:00
jhb	c343bee743	- Whitespace fix for vop_poll. - Use the right label for vop_vptofh lock assertions so they are enforced.	2008-10-27 21:41:55 +00:00
sobomax	2bddeb51d2	vm_pnames should be "const char *const[]". Submitted by: Christoph Mallon	2008-10-27 08:09:05 +00:00
sobomax	c9fd562aa0	vm_pnames has no reason to be global. MFC after: 2 weeks	2008-10-27 06:34:41 +00:00
sobomax	6b076dc603	Default HZ value (1,000) on i386/amd64 is not very virtual machine friendly. Due to the nature of the beast it causes lot of unproductive overhead. This is especially bad when running SMP kernel on VMWare with several virtual processors - idle FreeBSD guest with SMP kernel takes 150% host CPU time on my dual-core MacBook Pro when I am enabling two virtual CPUs, making even host not very usable. Detect when we are running in the sandbox and reduce HZ to 10 (can be adjusted via VM_HZ in the kernel config) in such cases. This brings host CPU usage of idle FreeBSD/SMP on two virtual processors down to 10%. Detect most popular VM platforms out there - VMWare, Parallels, VirtualBox and VirtualPC. MFC after: 2 weeks	2008-10-27 06:25:02 +00:00
dfr	f98e1f1bbf	Don't rely on the value of *statep without first taking the vnode interlock. Reviewed by: Mike Tancsa MFC after: 2 weeks	2008-10-24 16:04:10 +00:00
davidxu	238f3ee5f4	Don't rearm the callout if the process is exiting; it may leak a callout because callout_drain() only waits for a running callout and does not disable it if it is rearmed.	2008-10-24 01:09:24 +00:00
davidxu	e66e7ee6bb	Partly revert revision 184199: because TDF_NEEDSIGCHK is persistent while a thread is in kernel mode, it can cause an infinite loop. Now unlock the process lock after acquiring the sleep queue lock and thread lock to avoid the problem. This means TDF_NEEDSIGCHK and TDF_NEEDSUSPCHK must be set with the process lock and thread lock held at the same time.	2008-10-24 01:03:31 +00:00
jhb	2e4682de75	Whitespace fix.	2008-10-23 21:50:16 +00:00
des	a1e1ad22e0	Fix a number of style issues in the MALLOC / FREE commit. I've tried to be careful not to fix anything that was already broken; the NFSv4 code is particularly bad in this respect.	2008-10-23 20:26:15 +00:00
des	66f807ed8b	Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months	2008-10-23 15:53:51 +00:00
davidxu	2062caca24	Actually, for signal and thread suspension, extra process spin lock is unnecessary, the normal process lock and thread lock are enough. The spin lock is still needed for process and thread exiting to mimic single sched_lock.	2008-10-23 07:55:38 +00:00
jhb	327ae6eb3a	Split the copyout of *base at the end of getdirentries() out leaving the rest in kern_getdirentries(). Use kern_getdirentries() to implement freebsd32_getdirentries(). This fixes a bug where calls to getdirentries() in 32-bit binaries would trash the 4 bytes after the 'long base' in userland. Submitted by: ups MFC after: 1 week	2008-10-22 21:55:48 +00:00
marcel	7de1858d0c	Trivially avoid a null pointer dereference when drivers don't set the rman description. While drivers should set it, a kernel panic is not the right behaviour when faced without one.	2008-10-22 18:20:45 +00:00
thompsa	0fcb99be5e	Fix spelling mistake in the last rev.	2008-10-21 14:44:25 +00:00
thompsa	8ee58ba9e6	If we have getc_inject hooked then the outq buffer is inaccessible to the driver so skip the drain rather than waiting indefinitely. Reviewed by: ed	2008-10-21 14:18:45 +00:00
kib	cc3d7dc928	Change vn_start_write() to clear *mpp on all failures when non-NULL vp is supplied, since vm_pageout_scan() expects it to be cleared on error. Submitted by: tegge PR: 123768 MFC after: 1 week	2008-10-21 09:55:49 +00:00
attilio	42c5b05453	In the actual code for witness_warn: - If there aren't spinlocks held, but there are problems with old sleeplocks, they are not reported. - If the spinlock found is not the only one, problems are not reported. Fix these 2 problems. Reported by: tegge	2008-10-20 19:22:16 +00:00
kib	e4785f6af4	Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock and ffs_lock. This cannot catch situations where holdcnt is incremented not by curthread, but I think it is useful. Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks	2008-10-20 10:11:33 +00:00
kib	015479d466	In vfs_busy(), lockmgr() cannot legitimately sleep, because code checked MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always tries to obtain a shared lock on mnt_lock, the other user is unmount who tries to drain it, setting MNTK_UNMOUNT before. Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks	2008-10-20 10:07:28 +00:00
davidxu	57a7a67ea5	In realtimer_delete(), clear timer's value and interval to tell realtimer_expire() to not rearm the timer, otherwise there is a chance that a callout will be left there and be triggered in future unexpectedly. Bug reported by: tegge@	2008-10-20 02:37:53 +00:00
kib	e8c0b1746f	Ktr(9) stores format string and arguments in the event circular buffer, not the string formatted at the time of CTRX() call. Stack_ktr(9) uses an on-stack buffer for the symbol name, that is supplied as an argument to ktr. As result, stack_ktr() traces show garbage or cause page faults. Fix stack_ktr() by using pointer to module symbol table that is supposed to have a longer lifetime. Tested by: pho MFC after: 1 week	2008-10-19 11:13:49 +00:00
kmacy	4ceda2abba	- Forward port flush of page table updates on context switch or userret - Forward port vfork XEN hack	2008-10-19 01:35:27 +00:00
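
The wraparound problem described in b5b26198a7 (a signed "ticks" counter going negative, which at HZ=1000 happens after 2^31 ticks, roughly 24.8 days) can be illustrated with a small sketch. This is not the code from that commit: the function names below are hypothetical, and the wrap-safe version is only modelled on the style of comparison that SEQ_LT() performs for TCP sequence numbers, assuming 32-bit counters and two's-complement conversion.

```c
#include <stdint.h>

/*
 * Illustrative sketch only; tick_before_naive() and tick_before_wrapsafe()
 * are hypothetical names, not functions from the FreeBSD tree.
 */

/* Plain signed comparison: gives the wrong answer once the counter wraps. */
static int
tick_before_naive(int32_t a, int32_t b)
{
	return (a < b);
}

/*
 * Wrap-safe comparison: subtract in unsigned arithmetic (well defined on
 * overflow), then look at the sign of the difference. The result is correct
 * as long as the two values are less than 2^31 ticks apart. Converting the
 * out-of-range unsigned difference back to int32_t is assumed to wrap in the
 * usual two's-complement way.
 */
static int
tick_before_wrapsafe(int32_t a, int32_t b)
{
	return ((int32_t)((uint32_t)a - (uint32_t)b) < 0);
}
```

With a = INT32_MAX (the last tick before the wrap) and b = INT32_MIN (the first tick after it), the naive comparison says a is not before b, while the wrap-safe one correctly reports that it is.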

11169 Commits