freebsd-skq

Author	SHA1	Message	Date
lstewart	15250e8f2c	Revise the sysctl handling code and restructure the hierarchy of sysctls introduced when feed-forward clock support is enabled in the kernel: - Rename the "choice" variable to "available". - Streamline the implementation of the "active" variable's sysctl handler function. - Create a kern.sysclock sysctl node for general sysclock related configuration options. Place the "available" and "active" variables under this node. - Create a kern.sysclock.ffclock sysctl node for feed-forward clock specific configuration options. Place the "version" and "ffcounter_bypass" variables under this node. - Tweak some of the description strings. Discussed with: Julien Ridoux (jridoux at unimelb edu au)	2011-12-01 07:19:13 +00:00
kib	d326d5565d	Rename vm_page_set_valid() to vm_page_set_valid_range(). The vm_page_set_valid() is the most reasonable name for the m->valid accessor. Reviewed by: attilio, alc	2011-11-30 17:39:00 +00:00
lstewart	81cb526f0e	Make sysclock_active publicly available to external consumers. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Discussed with: Julien Ridoux (jridoux at unimelb edu au) Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-29 08:43:04 +00:00
lstewart	58cb09352f	Do away with the somewhat clunky sysclock_ops structure and associated code, reimplementing the [get]{bin,nano,micro}[up]time() wrapper functions in terms of the new "fromclock" API instead. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Discussed with: Julien Ridoux (jridoux at unimelb edu au) Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-29 08:33:40 +00:00
lstewart	d76140d56b	Make the fbclock_[get]{bin,nano,micro}[up]time() function prototypes public so that new APIs with some performance sensitivity can be built on top of them. These functions should not be called directly except in special circumstances. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Discussed with: Julien Ridoux (jridoux at unimelb edu au) Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-29 06:53:36 +00:00
lstewart	f039559048	Fix an oversight in r227747 by calling fbclock_bin{up}time() directly from the fbclock_{nanouptime|microuptime|bintime|nanotime|microtime}() functions to avoid indirecting through a sysclock_ops wrapper function. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-29 06:12:19 +00:00
trociny	a95ae73e49	Add sysctl to retrieve ps_strings structure location of another process. Suggested by: kib Reviewed by: kib	2011-11-27 17:05:26 +00:00
trociny	c4d555a317	In sysctl_kern_proc_auxv the process was released too early: we still need to hold it when checking process sv_flags. MFC after: 2 weeks	2011-11-27 16:56:01 +00:00
lstewart	ba968938a8	Export the "ffclock" feature for kernels compiled with feed-forward clock support. Suggested by: netchild Reviewed by: netchild	2011-11-26 01:44:37 +00:00
trociny	7ca3e358b8	Add sysctl to get process resource limits. Reviewed by: kib MFC after: 2 weeks	2011-11-24 20:43:37 +00:00
kib	6ecd4a2bb2	Fix a race between getvnode() dereferencing half-constructed file and dupfdopen(). Reported and tested by: pho MFC after: 3 days	2011-11-24 20:34:06 +00:00
trociny	878f4f16e9	Fix build without INVARIANTS. Discussed with: kib	2011-11-23 08:11:04 +00:00
hselasky	53a216b722	Rename device_delete_all_children() into device_delete_children(). Suggested by: jhb @ and marius @ MFC after: 1 week	2011-11-22 21:56:55 +00:00
hselasky	9eef52e077	Style change. Suggested by: jhb @ and marius @ MFC after: 1 week	2011-11-22 21:53:19 +00:00
trociny	ce852d7df6	Add new sysctls, KERN_PROC_ENV and KERN_PROC_AUXV, to return environment strings and ELF auxiliary vectors from a process stack. Make sysctl_kern_proc_args read non-cached arguments from the process stack. Export proc_getargv() and proc_getenvv() so they can be reused by procfs and linprocfs. Suggested by: kib Reviewed by: kib Discussed with: kib, rwatson, jilles Tested by: pho MFC after: 2 weeks	2011-11-22 20:40:18 +00:00
lstewart	09887e1dc5	- Add Pulse-Per-Second timestamping using raw ffcounter and corresponding ffclock time in seconds. - Add IOCTL to retrieve ffclock timestamps from userland. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-21 13:34:29 +00:00
attilio	b95134ea01	Introduce the same mutex-wise fix in r227758 for sx locks. The functions that offer file and line specifications are: - sx_assert_ - sx_downgrade_ - sx_slock_ - sx_slock_sig_ - sx_sunlock_ - sx_try_slock_ - sx_try_xlock_ - sx_try_upgrade_ - sx_unlock_ - sx_xlock_ - sx_xlock_sig_ - sx_xunlock_ Now vm_map locking is fully converted and can avoid knowing specifics about locking procedures. Reviewed by: kib MFC after: 1 month	2011-11-21 12:59:52 +00:00
pluknet	819239d461	Remove no-longer-relevant XXXRW comments since accessing the vmspace is now properly done with the acquired vmspace reference. Pointed out by: kib	2011-11-21 12:21:00 +00:00
pluknet	b652a652ff	Use the acquired reference to the vmspace instead of direct dereferencing of p->p_vmspace like it is done in sysctl_kern_proc_vmmap().	2011-11-21 10:36:57 +00:00
lstewart	cca3084242	- Add the ffclock_getcounter(), ffclock_getestimate() and ffclock_setestimate() system calls to provide feed-forward clock management capabilities to userspace processes. ffclock_getcounter() returns the current value of the kernel's feed-forward clock counter. ffclock_getestimate() returns the current feed-forward clock parameter estimates and ffclock_setestimate() updates the feed-forward clock parameter estimates. - Document the syscalls in the ffclock.2 man page. - Regenerate the script-derived syscall related files. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-21 01:26:10 +00:00
attilio	6a69e947d3	Introduce macro stubs in the mutex implementation that will be always defined and will allow consumers, willing to provide options, file and line to locking requests, to not worry about options redefining the interfaces. This is typically useful when there is the need to build another locking interface on top of the mutex one. The introduced functions that consumers can use are: - mtx_lock_flags_ - mtx_unlock_flags_ - mtx_lock_spin_flags_ - mtx_unlock_spin_flags_ - mtx_assert_ - thread_lock_flags_ Spare notes: - Likely we can get rid of all the 'INVARIANTS' specification in the ppbus code by using the same macro as done in this patch (but this is left to the ppbus maintainer) - all the other locking interfaces may require a similar cleanup, where the most notable case is sx which will allow a further cleanup of vm_map locking facilities - The patch should be fully compatible with older branches, thus a MFC is previewed (in fact it uses all the underlying mechanisms already present). Comments review by: eadler, Ben Kaduk Discussed with: kib, jhb MFC after: 1 month	2011-11-20 16:33:09 +00:00
hselasky	bcd3918fb3	Given that the typical usage of pause() is pause("zzz", hz / N), where N can be greater than hz in some cases, simply ignore a timeout value of zero. Suggested by: Bruce Evans MFC after: 1 week	2011-11-20 08:36:18 +00:00
hselasky	09cd8e0f19	Minor style change: Simplify the description of pause() and shorten the KASSERT message in pause. Also add a clamp for the timo argument in the non-KASSERT case. Suggested by: Bruce Evans MFC after: 1 week	2011-11-20 08:29:23 +00:00
lstewart	1c7932d870	- Provide a sysctl interface to change the active system clock at runtime. - Wrap [get]{bin,nano,micro}[up]time() functions of sys/time.h to allow requesting time from either the feedback or the feed-forward clock. If a feedback (e.g. ntpd) and feed-forward (e.g. radclock) daemon are both running on the system, both kernel clocks are updated but only one serves time. - Add similar wrappers for the feed-forward difference clock. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-20 05:32:12 +00:00
lstewart	603d3fe159	Provide high-level functions to access the feed-forward absolute and difference clocks. Each routine can output an upper bound on the absolute time or time interval requested. Different flavours of absolute time can be requested, for example with or without leap seconds, monotonic or not, etc. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-20 01:20:50 +00:00
lstewart	751092ac03	Core structure and functions to support a feed-forward clock within the kernel. Implement ffcounter, a monotonically increasing cumulative counter on top of the active timecounter. Provide low-level functions to read the ffcounter and convert it to absolute time or a time interval in seconds using the current ffclock estimates, which track the drift of the oscillator. Add a ring of fftimehands to track passing of time on each kernel tick and pick up updates of ffclock estimates. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ Submitted by: Julien Ridoux (jridoux at unimelb edu au)	2011-11-19 14:10:16 +00:00
hselasky	1b8ad7ed8e	Simplify the usb_pause_mtx() function by factoring out the generic parts to the kernel's pause() function. The pause() function can now be used when cold != 0. Also assert that the timeout in system ticks must be positive. Suggested by: Bruce Evans MFC after: 1 week	2011-11-19 11:17:27 +00:00
hselasky	3bcdb8772a	Move the device_delete_all_children() function from usb_util.c to kern/subr_bus.c. Simplify this function so that it no longer depends on malloc() to execute. Identify a few other places where it makes sense to use device_delete_all_children(). MFC after: 1 week	2011-11-19 10:11:50 +00:00
kib	36fd8d0106	Existing VOP_VPTOCNP() interface has a fatal flaw that is critical for nullfs. The problem is that resulting vnode is only required to be held on return from the successful call to vop, instead of being referenced. Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination with the VOP_VPTOCNP() interface means that the directory vnode returned from VOP_VPTOCNP() is reclaimed in advance, causing vn_fullpath() to error with EBADF or like. Change the interface for VOP_VPTOCNP(), now the dvp must be referenced. Convert all in-tree implementations of VOP_VPTOCNP(), which is trivial, because vhold(9) and vref(9) are similar in the locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(), if any, should have no trouble with the fix. Tested by: pho Reviewed by: mckusick MFC after: 3 weeks (subject of re approval)	2011-11-19 07:50:49 +00:00
ed	914a7cfed1	Regenerate system call tables.	2011-11-19 06:36:11 +00:00
ed	9cedd4d52c	Improve access() parameter name consistency. The current code mixes the use of `flags' and `mode'. This is a bit confusing, since the faccessat() function has a `flag' parameter to store the AT_ flag. Make this less confusing by using the same name as used in the POSIX specification -- `amode'.	2011-11-19 06:35:15 +00:00
np	2fa1481795	Do not increment the parent firmware's reference count when any other firmware image in the module is registered. Instead, do it when the other image is itself referenced. This allows a module with multiple firmware images to be automatically unloaded when none of the firmware images are in use. Discussed with: jhb@ (on -hackers)	2011-11-19 00:20:28 +00:00
pho	97e40483cb	Added check for negative seconds value. Found by syscall() fuzzing. MFC after: 1 week	2011-11-18 19:14:42 +00:00
kib	b49a656854	Consistently use process spin lock for protection of the p->p_boundary_count. Race could cause the execve(2) from the threaded process to hang since thread boundary counter was incorrect and single-threading never finished. Reported by: pluknet, pho Tested by: pho MFC after: 1 week	2011-11-18 09:12:26 +00:00
kevlo	1a26b28a9b	Add unicode support to msdosfs and smbfs; original patches from imura, bug fixes by Kuan-Chung Chiu <buganini at gmail dot com>. Tested by me in production for several days at work.	2011-11-18 03:05:20 +00:00
pjd	a3e664d830	Constify arguments for locking KPIs where possible. This enables locking consumers to pass their own structures around as const and be able to assert locks embedded into those structures. Reviewed by: ed, kib, jhb	2011-11-16 21:51:17 +00:00
pjd	f01187d1ea	Constify stack argument for functions that don't modify it. Reviewed by: ed, kib, jhb	2011-11-16 19:06:55 +00:00
marius	4b6458192f	As it turns out, r186347 actually is insufficient to avoid the use of the curthread-accessing part of mtx_{,un}lock(9) when using a r210623-style curthread implementation on sparc64, crashing the kernel in its early cycles as PCPU isn't set up, yet (and can't be set up as OFW is one of the things we need for that, which leads to a chicken-and-egg problem). What happens is that due to the fact that the idea of r210623 actually is to allow the compiler to cache invocations of curthread, it factors out obtaining curthread needed for both mtx_lock(9) and mtx_unlock(9) to before the branch based on kobj_mutex_inited when compiling the kernel without the debugging options. So change kobj_class_compile_static(9) to just never acquire kobj_mtx, effectively restricting it to its documented use, and add a kobj_init_static(9) for initializing objects using a class compiled with the former and that also avoids using mutex(9) (and malloc(9)). Also assert in both of these functions that they are used in their intended way only. While at it, inline kobj_register_method() and kobj_unregister_method() as there wasn't much point for factoring them out in the first place and so that a reader of the code has to figure out the locking for fewer functions missing a KOBJ_ASSERT. Tested on powerpc{,64} by andreast. Reviewed by: nwhitehorn (earlier version), jhb MFC after: 3 days	2011-11-15 20:11:03 +00:00
obrien	def2613c2b	Reformat comment to be more readable in standard Xterm. (while I'm here, wrap other long lines)	2011-11-15 01:48:53 +00:00
rmh	ff5c11fefd	Remove a few bits of FreeBSD 2.x compatibility code. Approved by: kib (mentor)	2011-11-14 18:21:27 +00:00
jhb	9315d9d42c	- Split out a kern_posix_fadvise() from the posix_fadvise() system call so it can be used by in-kernel consumers. - Make kern_posix_fallocate() public. - Use kern_posix_fadvise() and kern_posix_fallocate() to implement the freebsd32 wrappers for the two system calls.	2011-11-14 18:00:15 +00:00
alfred	fe73343e27	Constify args to copyiniov and copyinuio.	2011-11-14 07:12:10 +00:00
kib	ca682488f3	To limit amount of the kernel memory allocated, and to optimize the iteration over the fdsets, kern_select() limits the length of the fdsets copied in by the last valid file descriptor index. If any bit is set in a mask above the limit, current implementation ignores the filedescriptor, instead of returning EBADF. Fix the issue by scanning the tails of fdset before entering the select loop and returning EBADF if any bit above last valid filedescriptor index is set. The performance impact of the additional check is only imposed on the (somewhat) buggy applications that pass bad file descriptors to select(2) or pselect(2). PR: kern/155606, kern/162379 Discussed with: cognet, glebius Tested by: andreast (powerpc, all 64/32bit ABI combinations, big-endian), marius (sparc64, big-endian) MFC after: 2 weeks	2011-11-13 10:28:01 +00:00
kib	a8b2cc359c	Style. MFC after: 1 week	2011-11-11 04:13:47 +00:00
kib	f3c70bd299	Guard against the unlikely case of the alias path containing the '%' symbols. Reported by: arundel MFC after: 1 week	2011-11-11 04:12:58 +00:00
rstone	f6710005c7	Correct the types of the arguments to return probes of the syscall provider. Previously we were erroneously supplying the argument types of the corresponding entry probe. Reviewed by: rpaulo MFC after: 1 week	2011-11-11 03:49:42 +00:00
ed	3d017846c9	Simplify the code emitted by makeobjops.awk slightly. Just place the default kobj_method inside the kobjop_desc structure. There's no need to give these kobj_methods their own symbol. This shaves off 10 KB of a GENERIC kernel binary.	2011-11-09 11:00:29 +00:00
ed	85414e79bd	Make kobj_methods constant. These structures hold no information that is modified during runtime. By marking this constant, we see approximately 600 symbols become read-only (amd64 GENERIC). While there, also mark the kobj_method structures generated by makeobjops.awk static. They are only referenced by the kobjop_desc structures within the same file. Before: $ ls -l kernel -rwxr-xr-x 1 ed wheel 15937309 Nov 8 16:29 kernel* $ size kernel text data bss dec hex filename 12260854 1358468 2848832 16468154 fb48ba kernel $ nm kernel | fgrep -c ' r ' 8240 After: $ ls -l kernel -rwxr-xr-x 1 ed wheel 15922469 Nov 8 16:25 kernel* $ size kernel text data bss dec hex filename 12302869 1302660 2848704 16454233 fb1259 kernel $ nm kernel | fgrep -c ' r ' 8838	2011-11-08 15:38:21 +00:00
rstone	ae7b6414d5	The in-kernel CTF parser caches the result of its first attempt to parse CTF data from a module. On subsequent attempts to retrieve CTF data for a module, return an error if there is no CTF data. This fixes a panic if you try to enable fbt probes on a module with CTF data twice. Submitted by: Paul Ambrose (ambrosehua AT gmail DOT com) MFC after: 3 days	2011-11-08 15:17:54 +00:00
attilio	8e918ec439	Introduce the option VFS_ALLOW_NONMPSAFE and turn it on by default on all the architectures. The option allows mounting non-MPSAFE filesystems. Without it, the kernel will refuse to mount a non-MPSAFE filesystem. This patch is part of the effort of killing non-MPSAFE filesystems from the tree. No MFC is expected for this patch. Tested by: gianni Reviewed by: kib	2011-11-08 10:18:07 +00:00
trociny	9cd1f9add2	Add KVME_FLAG_SUPER and use it in sysctl_kern_proc_vmmap for marking entries with superpages. Submitted by: Mel Flynn <mel.flynn+fbsd.hackers@mailing.thruhere.net> Reviewed by: alc, rwatson	2011-11-07 21:13:19 +00:00
trociny	f25b803473	In lim_fork() assert that processes locks are held. Suggested by: kib	2011-11-07 21:09:04 +00:00
ed	0c56cf839d	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.	2011-11-07 15:43:11 +00:00
ed	e97eae1577	Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs. This means that their use is restricted to a single C file.	2011-11-07 06:44:47 +00:00
fjoe	afccaaff1c	Add KLD_DEBUG option.	2011-11-06 08:10:41 +00:00
jhb	767a02dc14	Regen.	2011-11-04 04:06:31 +00:00
jhb	78c075174e	Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month	2011-11-04 04:02:50 +00:00
jhb	1e2d8c9d67	Move the cleanup of f_cdevpriv when the reference count of a devfs file descriptor drops to zero out of _fdrop() and into devfs_close_f() as it is only relevant for devfs file descriptors. Reviewed by: kib MFC after: 1 week	2011-11-04 03:39:31 +00:00
attilio	90682e14db	Disable interrupt and preemption for smp_rendezvous() also in the UP/!SMP case. The callbacks may be relying on this feature and having 2 different ways to deal with them is not correct. Reported by: rstone Reviewed by: jhb MFC after: 2 weeks	2011-11-03 14:36:56 +00:00
marcel	11d8234b97	Revert rev. 226893: subr_syscall.c is being included from C files and on amd64 with FREEBSD32 enabled, this means that systrace_probe_func gets defined twice.	2011-10-30 02:19:39 +00:00
marcel	ac13f9cbdb	Define systrace_probe_func in subr_syscall.c where it's used, instead of defining it in MD code. This eliminates porting to other architectures.	2011-10-29 01:26:36 +00:00
pluknet	f9aa1bdb23	Fix arguments list for proc:::signal-discard DTrace probe. Reported by: Anton Yuzhaninov <citrin citrin ru> MFC after: 1 week	2011-10-28 15:22:51 +00:00
jhb	b4486349bd	Whitespace fix.	2011-10-27 17:43:36 +00:00
alc	841afea2d9	Eliminate vestiges of page coloring in VM_ALLOC_NOOBJ calls to vm_page_alloc(). While I'm here, for the sake of consistency, always specify the allocation class, such as VM_ALLOC_NORMAL, as the first of the flags.	2011-10-27 16:39:17 +00:00
pluknet	41ec9d2d93	Remove the long deprecated ``/stand/sysinstall'' from the init_path. It can be put back using the INIT_PATH config option or init_path loader variable, if still needed (which I doubt). MFC after: 1 week	2011-10-27 10:25:11 +00:00
alc	0a7d6450d6	contigmalloc(9) and contigfree(9) are now implemented in terms of other more general VM system interfaces. So, their implementation can now reside in kern_malloc.c alongside the other functions that are declared in malloc.h.	2011-10-27 02:52:24 +00:00
jhb	1c4fa2aabc	- Fixup filenames in a few more places where they are used. - Some whitespace fixes.	2011-10-26 15:17:42 +00:00
pjd	b49c6b382f	The v_data field is a pointer, so set it to NULL, not 0. MFC after: 3 days	2011-10-25 14:01:17 +00:00
marcel	26b45aef71	Don't terminate the interactive root mount prompt on mount failure. This restores the previous behaviour. While here, match '?' and '.' inputs exactly and improve the error message. Requested by: avg@ Derived from a patch by: Arnaud Lacombe <lacombar@gmail.com>	2011-10-23 20:03:33 +00:00
des	1b405df8ba	Revisit the capability failure trace points. The initial implementation only logged instances where an operation on a file descriptor required capabilities which the file descriptor did not have. By adding a type enum to struct ktr_cap_fail, we can catch other types of capability failures as well, such as disallowed system calls or attempts to wrap a file descriptor with more capabilities than it had to begin with.	2011-10-18 07:28:58 +00:00
marcel	494f14b933	Fix double vision syndrome (read: double output) when in the debugger without a panic.	2011-10-16 14:16:46 +00:00
kib	8e118d38cf	Control the execution permission of the readable segments for i386 binaries on the amd64 and ia64 with the sysctl, instead of unconditionally enabling it. Reviewed by: marcel	2011-10-15 12:35:18 +00:00
marcel	92e552423d	In elf32_trans_prot() and when compiling for amd64 or ia64, add PROT_EXECUTE when PROT_READ is needed. By default i386 allows execution when reading is allowed and JDK 1.4.x depends on that.	2011-10-13 16:16:46 +00:00
glebius	2522c42334	Make memguard(9) capable to guard uma(9) allocations.	2011-10-12 18:08:28 +00:00
rwatson	0a6da59b61	Correct a bug in export of capability-related information from the sysctls supporting procstat -f: properly provide capability rights information to userspace. The bug resulted from a merge-o during upstreaming (or rather, a failure to properly merge FreeBSD-side changes downstream). Spotted by: des, kibab MFC after: 3 days	2011-10-12 12:08:03 +00:00
adrian	7c88ef5a97	Don't call fixup_filename() on each witness lock call. This has been irking me for a while. This causes significant CPU use on bottlenecked CPUs (eg my older EEEPC w/ an earlier Celeron CPU and my MIPS24k boards) when they're passing a lot of traffic. Since the file/line values are only used for printing, this should only affect display. It should have no operational change on the code, besides reducing CPU use.	2011-10-12 09:21:02 +00:00
des	9b8d9b3ed1	Add a new trace point, KTRFAC_CAPFAIL, which traces capability check failures. It is included in the default set for ktrace(1) and kdump(1).	2011-10-11 20:37:10 +00:00
mckusick	b6c294da1b	When unmounting a filesystem always wait for the vfs_busy lock to clear so that if no vnodes in the filesystem are actively in use the unmount will succeed rather than failing with EBUSY. Reported by: Garrett Cooper Reviewed by: Attilio Rao and Kostik Belousov Tested by: Garrett Cooper PR: kern/161016 MFC after: 3 weeks	2011-10-11 18:46:41 +00:00
marius	22104b1021	In device_get_children() avoid malloc(0) in order to increase portability to other operating systems. PR: 154287	2011-10-09 21:21:37 +00:00
alc	e9db5a5a50	Fix the handling of an empty kmem map by sysctl_kmem_map_free(). In the unlikely event that sysctl_kmem_map_free() was performed on an empty kmem map, it would incorrectly report the free space as zero. Discussed with: avg MFC after: 1 week	2011-10-08 18:29:30 +00:00
jonathan	9c782bcdf6	Change one printf() to log(). As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console. Approved by: mentor (rwatson) Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru> MFC after: 5 days	2011-10-07 09:51:12 +00:00
obrien	4b04845b06	Disallow various debug.kdb sysctl's when securelevel is raised. PR: 161350	2011-10-07 05:47:30 +00:00
delphij	3774d99430	Return proper errno when we hit error when doing sanity check. This fixes dtrace crashes when module is not compiled with CTF data. Submitted by: Paul Ambrose ambrosehua at gmail.com MFC after: 1 week	2011-10-07 01:37:58 +00:00
marius	1b1d84970a	- Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to itself, which sparc64 hardware doesn't support. One way to solve this would be to directly call sched_preempt() instead of issuing a self-IPI. However, quoting jhb@: "On the other hand, you can probably just skip the IPI entirely if we are going to send it to the current CPU. Presumably, once this routine finishes, the current CPU will exit softclock (or will do so "soon") and will then pick the next thread to run based on the adjustments made in this routine, so there's no need to IPI the CPU running this routine anyway. I think this is the better solution. Right now what is probably happening on other platforms is as soon as this routine finishes the CPU processes its self-IPI and causes mi_switch() which will just switch back to the softclock thread it is already running." - With r226054 and the above change in place, sparc64 now no longer is incompatible with ULE and vice versa. However, powerpc/E500 still is. Submitted by: jhb Reviewed by: jeff	2011-10-06 11:48:13 +00:00
trasz	2276ee2733	Remove assertion against empty NFSv4 ACLs. An empty ACL is not exactly valid - we don't allow for setting it on a file, for example - but it's not something we should assert on. For STABLE kernel, it changes nothing, because it's not compiled with INVARIANTS. If it was, it would fix crashes. It also fixes an assert in libc encountered with NFSv4 without nfsuserd(8) running. Submitted by: Yuri Pankov (earlier version) MFC after: 1 month	2011-10-05 17:29:49 +00:00
kib	87f923aeea	Supply unique (st_dev, st_ino) value pair for the fstat(2) done on the pipes. Reviewed by: jhb, Peter Jeremy <peterjeremy acm org> MFC after: 2 weeks	2011-10-05 16:56:06 +00:00
kib	cc9fd3ad53	Move parts of the commit log for r166167, where Tor explained the interaction between vnode locks and vfs_busy(), into comment. MFC after: 1 week	2011-10-04 18:45:29 +00:00
trasz	09473dab33	Actually enforce limit for inheritable resources on fork. MFC after: 3 days	2011-10-04 14:56:33 +00:00
trasz	76ddb474d9	Move some code inside the racct_proc_fork(); it spares a few lock operations and it's more logical this way. MFC after: 3 days	2011-10-03 17:40:55 +00:00
kib	b0ea63c19a	Assert that exiting process does not return to usermode. Reviewed by: avg, jhb MFC after: 1 week	2011-10-03 16:58:58 +00:00
trasz	82df9bbbc4	Fix another bug introduced in r225641, which caused rctl to access certain fields in 'struct proc' before they got initialized in do_fork(). MFC after: 3 days	2011-10-03 16:23:20 +00:00
trasz	138a535460	Fix bug introduced in r225641, which would cause panic if racct_proc_fork() returned error -- the racct_destroy_locked() would get called twice. MFC after: 3 days	2011-10-03 15:32:15 +00:00
kib	71b9169fe9	The sigwait(3) function shall not return EINTR, according to the POSIX/SUSvN. The sigwait(2) syscall does return EINTR, and libc.so.7 contains the wrapper sigwait(3) which hides EINTR from callers. The EINTR return is used by libthr to handle required cancellation point in the sigwait(3). To help the binaries linked against pre-libc.so.7, i.e. RELENG_6 and earlier, to have right ABI for sigwait(3), transform EINTR return from sigwait(2) into ERESTART. Discussed with: davidxu MFC after: 1 week	2011-10-01 10:18:55 +00:00
bz	2e8e044c2f	Fix handling of corrupt compress(1)ed data. [11:04] Add missing length checks on unix socket addresses. [11:05] Approved by: so (cperciva) Approved by: re (kensmith) Security: FreeBSD-SA-11:04.compress Security: CVE-2011-2895 [11:04] Security: FreeBSD-SA-11:05.unix	2011-09-28 08:47:17 +00:00
attilio	f5da64d92a	Revert r225372: wdog_kern_pat() acquires eventhandler mutex, thus it cannot work in kernel context (from where kdb_trap() runs). The right way to fix this is both offering the cpu-stop-on-panic-and-skip-locking logic and also a context for KDB to officially run. We can re-enable this (or a similar) improvement when these 2 patches hit the tree. Sponsored by: Sandvine Incorporated Discussed with: emaste, rstone MFC after: immediately	2011-09-27 13:42:11 +00:00
kib	f15c4ba986	Do not deliver SIGTRAP on exec as the normal signal, use ptracestop() on syscall exit path. Otherwise, if SIGTRAP is ignored, tdsendsignal() does not deliver the signal, and the debugger never gets a notification of exec. Found and tested by: Anton Yuzhaninov <citrin citrin ru> Discussed with: jhb MFC after: 2 weeks	2011-09-27 13:17:02 +00:00
mav	12fe030c74	Fix interrupt counters dumping on SW_WATCHDOG fire.	2011-09-27 09:30:20 +00:00
trasz	18eab55455	Fix error handling bug that would prevent MAC structures from getting freed properly if resource limit got exceeded. Approved by: re (kib)	2011-09-17 20:48:49 +00:00
trasz	885c397429	Fix long-standing thinko regarding maxproc accounting. Basically, we were accounting the newly created process to its parent instead of the child itself. This caused problems later, when the child changed its credentials - the per-uid, per-jail etc counters were not properly updated, because the maxproc counter in the child process was 0. Approved by: re (kib)	2011-09-17 19:55:32 +00:00
kmacy	8bc0044e86	Auto-generated code from sys_ prefixing makesyscalls.sh change Approved by: re(bz)	2011-09-16 14:04:14 +00:00
kmacy	99851f359e	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)	2011-09-16 13:58:51 +00:00
adrian	f23b1f625d	Ensure that ta_pending doesn't overflow u_short by capping its value at USHRT_MAX. If it overflows before the taskqueue can run, the task will be re-added to the taskqueue and cause a loop in the task list. Reported by: Arnaud Lacombe <lacombar@gmail.com> Submitted by: Ryan Stone <rysto32@gmail.com> Reviewed by: jhb Approved by: re (kib) MFC after: 1 day	2011-09-15 08:42:06 +00:00
rmacklem	99f390a4e8	Modify vfs_register() to use a hash calculation on vfc_name to set vfc_typenum, so that vfc_typenum doesn't change when file systems are loaded in different orders. This keeps NFS file handles from changing, for file systems that use vfc_typenum in their fsid. This change is controlled via a loader.conf variable called vfs.typenumhash, since vfc_typenum will change once when this is enabled. It defaults to 1 for 9.0, but will default to 0 when MFC'd to stable/8. Tested by: hrs Reviewed by: jhb, pjd (earlier version) Approved by: re (kib) MFC after: 1 month	2011-09-13 21:01:26 +00:00
attilio	2267bc9430	dump_write() returns ENXIO if the dump is trying to be written outside of the device boundary. While this is generally ok, the problem is that all the consumers handle similar cases with (and expect to catch) ENOSPC (for reference, look at the minidumpsys() and dumpsys() constructions). That ends up in consumers not recognizing the issue and amd64 failing to retry if the number of pages grows during minidump. Fix this by returning ENOSPC in dump_write() and, while here, add some more diagnostics on the values involved. Sponsored by: Sandvine Incorporated In collaboration with: emaste Approved by: re (kib) MFC after: 10 days	2011-09-12 20:39:31 +00:00
ed	5ccb03a60f	Fix error return codes for ioctls on init/lock state devices. In revision 223722 we introduced support for driver ioctls on init/lock state devices. Unfortunately the call to ttydevsw_cioctl() clobbers the value of the error variable, meaning that in many cases ioctl() will now return ENOTTY, even though the ioctl() was processed properly. Reported by: Boris Samorodov <bsam ipt ru> Patch by: jilles@ Approved by: re@ (kib@)	2011-09-12 10:07:21 +00:00
kib	55d0a85118	Inline the syscallenter() and syscallret(). This reduces the time measured by the syscall entry speed microbenchmarks by ~10% on amd64. Submitted by: jhb Approved by: re (bz) MFC after: 2 weeks	2011-09-11 16:05:09 +00:00
attilio	5494aebd97	Improve the information reported in case of busy buffers during shutdown: - Axe the SHOW_BUSYBUFS option and use a tunable to selectively enable/disable it, which defaults to not printing anything (0 value) but can be changed to print (1 value) and be verbose (2 value) - Improve the information output: right now, there is no trace of the actual struct buf object or vnode referenced by the shutdown process; instead the related struct bufobj object is printed, which is not really helpful - Add more verbosity about the state of the struct buf lock and the vnode information, with the latter to be activated separately by the sysctl Sponsored by: Sandvine Incorporated Reviewed by: emaste, kib Approved by: re (ksmith) MFC after: 10 days	2011-09-08 12:56:26 +00:00
trasz	284bd48a44	Fix whitespace. Submitted by: amdmi3 Approved by: re (rwatson)	2011-09-07 07:52:45 +00:00
trasz	a8bcc12be3	Work around a kernel panic triggered by forkbomb with an rctl rule such as j:name:maxproc:sigkill=100. Proper fix - deferring psignal to a taskqueue - is somewhat complicated and thus will happen after 9.0. Approved by: re (kib)	2011-09-06 17:22:40 +00:00
attilio	ef6fc882ac	Interrupts are disabled/enabled when entering and exiting the KDB context. While this is generally good, it brings along a series of problems, like clocks going out of sync and, in the presence of SW_WATCHDOG, watchdogs firing without a good reason (missed hardclock wdog ticks update). Fix the latter by kicking the watchdog just before re-enabling interrupts. Also, while here, do not rely on users to stop the watchdog manually when entering DDB, but do that when entering the KDB context. Sponsored by: Sandvine Incorporated Reviewed by: emaste, rstone Approved by: re (kib) MFC after: 1 week	2011-09-04 13:07:02 +00:00
trasz	b1c63cf9da	Since r224036 the cputime and wallclock are supposed to be in seconds, not microseconds. Make it so. Approved by: re (kib)	2011-09-04 05:04:34 +00:00
trasz	a6059bf016	Fix panic that happens when fork(2) fails due to a limit other than the rctl one - for example, it happens when someone reaches maximum number of processes in the system. Approved by: re (kib)	2011-09-03 08:08:24 +00:00
rwatson	3c6157dcec	Correct several issues in the integration of POSIX shared memory objects and the new setmode and setowner fileops in FreeBSD 9.0: - Add new MAC Framework entry point mac_posixshm_check_create() to allow MAC policies to authorise shared memory use. Provide a stub policy and test policy templates. - Add missing Biba and MLS implementations of mac_posixshm_check_setmode() and mac_posixshm_check_setowner(). - Add 'accmode' argument to mac_posixshm_check_open() -- unlike the mac_posixsem_check_open() entry point it was modeled on, the access mode is required as shared memory access can be read-only as well as writable; this isn't true of POSIX semaphores. - Implement full range of POSIX shared memory entry points for Biba and MLS. Sponsored by: Google Inc. Obtained from: TrustedBSD Project Approved by: re (kib)	2011-09-02 17:40:39 +00:00
rwatson	3f14675cff	Attempt to make break-to-debugger and alternative break-to-debugger more accessible: (1) Always compile in support for breaking into the debugger if options KDB is present in the kernel. (2) Disable both by default, but allow them to be enabled via tunables and sysctls debug.kdb.break_to_debugger and debug.kdb.alt_break_to_debugger. (3) options BREAK_TO_DEBUGGER and options ALT_BREAK_TO_DEBUGGER continue to behave as before -- only now instead of compiling in break-to-debugger support, they change the default values of the above sysctls to enable those features by default. Current kernel configurations should, therefore, continue to behave as expected. (4) Migrate alternative break-to-debugger state machine logic out of individual device drivers into centralised KDB code. This has a number of upsides, but also one downside: it's now tricky to release sio spin locks when entering the debugger, so we don't. However, similar logic does not exist in other device drivers, including uart. (5) dcons requires some special handling; unlike other console types, it allows overriding KDB's own debugger selection, so we need a new interface to KDB to allow that to work. GENERIC kernels in -CURRENT will now support break-to-debugger as long as appropriate boot/run-time options are set, which should improve the debuggability of BETA kernels significantly. MFC after: 3 weeks Reviewed by: kib, nwhitehorn Approved by: re (bz)	2011-08-26 21:46:36 +00:00
delphij	f687a7bf6f	Fix format strings for KTR_STATE in 4BSD and ULE schedulers. Submitted by: Ivan Klymenko <fidaj@ukr.net> PR: kern/159904, kern/159905 MFC after: 2 weeks Approved by: re (kib)	2011-08-26 18:00:07 +00:00
jamie	faaa2dc21f	Delay the recursive decrement of pr_uref when jails are made invisible but not removed; decrement it instead when the child jail actually goes away. This avoids letting the counter go below zero in the case where dying (pr_uref==0) jails are "resurrected", and an associated KASSERT panic. Submitted by: Steven Hartland Approved by: re (bz) MFC after: 1 week	2011-08-26 16:03:34 +00:00
attilio	683d7a54ce	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before destroying the registered selinfo object. This race is quite rare in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine, which should be called before destroying selinfo objects (in order to avoid such a case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, as happens in device drivers which install selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the file descriptors should already be closed, thus there cannot be a race. For this case, the mfi(4) device driver can be taken as an example, as it implements fully correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks	2011-08-25 15:51:54 +00:00
bz	860d2aa85d	Increase the defaults for the maximum socket buffer limit, and the maximum TCP send and receive buffer limits from 256kB to 2MB. For sb_max_adj we need to add the cast as already used in the sysctl handler to not overflow the type doing the maths. Note that this is just the defaults. They will allow more memory to be consumed per socket/connection if needed but not change the default "idle" memory consumption. All values are still tunable by sysctls. Suggested by: gnn Discussed on: arch (Mar and Aug 2011) MFC after: 3 weeks Approved by: re (kib)	2011-08-25 09:20:13 +00:00
mm	e104c96f01	Generalize ffs_pages_remove() into vn_pages_remove(). Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F. PR: kern/160035, kern/156933 Reviewed by: kib, pjd Approved by: re (kib) MFC after: 1 week	2011-08-25 08:17:39 +00:00
attilio	a5ccee99f7	callout_cpu_switch() allows preemption when dropping the outgoing callout cpu lock (and after having dropped it). If the newly scheduled thread wants to acquire the old queue it will just spin forever. Fix this by disabling preemption and interrupts entirely (because fast interrupt handlers may run into the same problem too) while switching locks. Reported by: hrs, Mike Tancsa <mike AT sentex DOT net>, Chip Camden <sterling AT camdensoftware DOT com> Tested by: hrs, Mike Tancsa <mike AT sentex DOT net>, Chip Camden <sterling AT camdensoftware DOT com>, Nicholas Esborn <nick AT desert DOT net> Approved by: re (kib) MFC after: 10 days	2011-08-21 10:52:50 +00:00
kib	1d5badd36f	Prevent the hiwatermark for the unix domain socket from becoming effectively negative. Often seen as upstream fastcgi connection timeouts in nginx when using sendfile over unix domain sockets for communication. Sendfile(2) may send more bytes than currently allowed by the hiwatermark of the socket, e.g. because the so_snd sockbuf lock is dropped after sbspace() call in the kern_sendfile() loop. In this case, recalculated hiwatermark will overflow. Since lowatermark is renewed as half of the hiwatermark by sendfile code, and both are unsigned, the send buffer never reaches the free space requested by lowatermark, causing indefinite wait in sendfile. Reviewed by: rwatson Approved by: re (bz) MFC after: 2 weeks	2011-08-20 16:12:29 +00:00
rwatson	b8fd2dd0fd	r222015 introduced a new assertion that the size of a fixed-length sbuf buffer is greater than 1. This triggered panics in at least one spot in the kernel (the MAC Framework) which passes non-negative, rather than >1 buffer sizes based on the size of a user buffer passed into a system call. While 0-size buffers aren't particularly useful, they also aren't strictly incorrect, so loosen the assertion. Discussed with: phk (fears I might be EDOOFUS but willing to go along) Spotted by: pho + stress2 Approved by: re (kib)	2011-08-19 08:29:10 +00:00
jonathan	9c3c6695d8	Auto-generated system call code based on r224987. Approved by: re (implicit)	2011-08-18 23:08:52 +00:00
jonathan	5ecd1c9d40	Add experimental support for process descriptors A "process descriptor" file descriptor is used to manage processes without using the PID namespace. This is required for Capsicum's Capability Mode, where the PID namespace is unavailable. New system calls pdfork(2) and pdkill(2) offer the functional equivalents of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote process for debugging purposes. The currently-unimplemented pdwait(2) will, in the future, allow querying rusage/exit status. In the interim, poll(2) may be used to check (and wait for) process termination. When a process is referenced by a process descriptor, it does not issue SIGCHLD to the parent, making it suitable for use in libraries---a common scenario when using library compartmentalisation from within large applications (such as web browsers). Some observers may note a similarity to Mach task ports; process descriptors provide a subset of this behaviour, but in a UNIX style. This feature is enabled by "options PROCDESC", but as with several other Capsicum kernel features, is not enabled by default in GENERIC 9.0. Reviewed by: jhb, kib Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-18 22:51:30 +00:00
jhb	c902e65610	One of the general principles of the sysctl(3) API is that a user can query the needed size for a sysctl result by passing in a NULL old pointer and a valid oldsize. The kern.proc.args sysctl handler broke this assumption by not calling SYSCTL_OUT() if the old pointer was NULL. Approved by: re (kib) MFC after: 3 days	2011-08-18 22:20:45 +00:00
kib	324611138f	Fix build breakage. Initialize error variables explicitly for !MAC case. Pointy hat to: kib Approved by: re (bz)	2011-08-17 12:37:14 +00:00
kib	011f42054d	Add the fo_chown and fo_chmod methods to struct fileops and use them to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores. Based on the submission by: glebius Reviewed by: rwatson Approved by: re (bz)	2011-08-16 20:07:47 +00:00
jonathan	a76ca2eae7	poll(2) implementation for capabilities. When calling poll(2) on a capability, unwrap first and then poll the underlying object. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-16 14:14:56 +00:00
rwatson	9343c74977	Trim some warnings and notes from capabilities.conf -- these are left over from Capsicum development, and no longer apply. Approved by: re (kib) Sponsored by: Google Inc	2011-08-13 17:22:16 +00:00
rwatson	11c783ae3f	When falloc() was broken into separate falloc_noinstall() and finstall(), a bug was introduced in kern_openat() such that the error from the vnode open operation was overwritten before it was passed as an argument to dupfdopen(). This broke operations on /dev/{stdin,stdout,stderr}. Fix by preserving the original error number across finstall() so that it is still available. Approved by: re (kib) Reported by: cognet	2011-08-13 16:03:40 +00:00
rwatson	f40628d534	Update use of the FEATURE() macro in sys_capability.c to reflect the move to two different kernel options for capability mode vs. capabilities. Approved by: re (bz)	2011-08-13 13:34:01 +00:00
rwatson	87507c16d7	Now that capability support has been committed, update and expand the comment at the top of sys_capability.c to describe its new contents. Approved by: re (xxx)	2011-08-13 13:26:40 +00:00
rwatson	614cc9631e	Regenerate system call files following r224812 changes to capabilities.conf. A no-op for non-Capsicum kernels; for Capsicum kernels, completes the enabling of fooat(2) system calls using capabilities. With this change, and subject to bug fixes, Capsicum capability support is now complete for 9.0. Approved by: re (kib) Submitted by: jonathan Sponsored by: Google Inc	2011-08-13 12:14:40 +00:00
jonathan	09f5070c50	Allow openat(2), fstatat(2), etc. in capability mode. namei() and lookup() can now perform "strictly relative" lookups. Such lookups, performed when in capability mode or when looking up relative to a directory capability, enforce two policies: - absolute paths are disallowed (including symlinks to absolute paths) - paths containing '..' components are disallowed These constraints make it safe to enable openat() and friends. These system calls are instrumental in supporting Capsicum components such as the capability-mode-aware runtime linker. Finally, adjust comments in capabilities.conf to reflect the actual state of the world (e.g. shm_open(2) already has the appropriate constraints, getdents(2) already requires CAP_SEEK). Approved by: re (bz), mentor (rwatson) Sponsored by: Google Inc.	2011-08-13 10:43:21 +00:00
jonathan	f63d2e9205	Allow Capsicum capabilities to delegate constrained access to file system subtrees to sandboxed processes. - Use of absolute paths and '..' are limited in capability mode. - Use of absolute paths and '..' are limited when looking up relative to a capability. - When a name lookup is performed, identify what operation is to be performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP. With these constraints, openat() and friends are now safe in capability mode, and can then be used by code such as the capability-mode runtime linker. Approved by: re (bz), mentor (rwatson) Sponsored by: Google Inc	2011-08-13 09:21:16 +00:00
jonathan	c33150a2f0	Rename CAP_*_KEVENT to CAP_*_EVENT. Change the names of a couple of capability rights to be less FreeBSD-specific. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-12 14:26:47 +00:00
jonathan	8cb2c0bf4d	Only call fdclose() on successfully-opened FDs. Since kern_openat() now uses falloc_noinstall() and finstall() separately, there are cases where we could get to cleanup code without ever creating a file descriptor. In those cases, we should not call fdclose() on FD -1. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-08-11 13:29:59 +00:00
rwatson	4af919b491	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc	2011-08-11 12:30:23 +00:00
mm	ad65ea8db8	Revert r224655 and r224614 because vn_fullpath* does not always work on nullfs mounts. Change shall be reconsidered after 9.0 is released. Requested by: re (kib) Approved by: re (kib)	2011-08-08 14:02:08 +00:00
mm	8396492933	The change in r224615 didn't take into account that vn_fullpath_global() doesn't operate on locked vnode. This could cause a panic. Fix by unlocking vnode, re-locking afterwards and verifying that it wasn't renamed or deleted. To improve readability and reduce code size, move code to a new static function vfs_verify_global_path(). In addition, fix missing giant unlock in unmount(). Reported by: David Wolfskill <david@catwhisker.org> Reviewed by: kib Approved by: re (bz) MFC after: 2 weeks	2011-08-05 11:12:50 +00:00
mm	2c26b14138	Always disable mount and unmount for jails with enforce_statfs==2. A working statfs(2) is required for umount(8) in jail. Reviewed by: pjd, kib Approved by: re (kib) MFC after: 2 weeks	2011-08-02 19:44:40 +00:00
mm	a1639c8fd4	For mount, discover f_mntonname from supplied path argument using vn_fullpath_global(). This fixes f_mntonname if mounting inside chroot, jail or with relative path as argument. For unmount in jail, use vn_fullpath_global() to discover global path from supplied path argument. This fixes unmount in jail. Reviewed by: pjd, kib Approved by: re (kib) MFC after: 2 weeks	2011-08-02 19:13:56 +00:00
kib	7db6a1f506	Fix the LK_NOSHARE lockmgr flag interaction with LK_UPGRADE and LK_DOWNGRADE lock ops. Namely, the ops should be NOP since LK_NOSHARE locks are always exclusive. Reported by: rmacklem Reviewed by: attilio Tested by: pho Approved by: re (kensmith) MFC after: 1 week	2011-08-01 19:07:03 +00:00
glebius	3d867252f7	Don't leak kld_sx lock in kldunloadf(). Approved by: re (kib)	2011-07-31 13:49:15 +00:00
avg	24506a8064	smp_rendezvous: master cpu should wait until all slaves are fully done This is a followup to r222032 and a reimplementation of it. While that revision fixed the race for the smp_rv_waiters[2] exit sentinel, it still left a possibility for a target CPU to access stale or wrong smp_rv_func_arg in smp_rv_teardown_func. To fix this race the slave CPUs signal when they are really fully done with the rendezvous and the master CPU waits until all slaves are done. Diagnosed by: kib Reviewed by: jhb, mlaier, neel Approved by: re (kib) MFC after: 2 weeks	2011-07-30 20:29:39 +00:00
kib	2c49d0dddd	Fix the devmtx lock leak from make_dev(9) when the old device cloning failed due to invalid or duplicated path being generated. Reviewed by: jh Approved by: re (kensmith) MFC after: 1 week	2011-07-30 14:12:37 +00:00
avg	50b05401d3	remove RESTARTABLE_PANICS option This is done per request/suggestion from John Baldwin who introduced the option. Trying to resume normal system operation after a panic is very unpredictable and dangerous. It will become even more dangerous when we allow a thread in panic(9) to penetrate all lock contexts. I understand that the only purpose of this option was for testing scenarios potentially resulting in panic. Suggested by: jhb Reviewed by: attilio, jhb X-MFC-After: never Approved by: re (kib)	2011-07-25 09:12:48 +00:00
mckusick	ffeefed9fc	Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag so that it is visible to userland programs. This change enables the `mount' command with no arguments to be able to show if a filesystem is mounted using journaled soft updates as opposed to just normal soft updates. Approved by: re (bz)	2011-07-24 18:27:09 +00:00
mckusick	64e0ba1afe	This update changes the mnt_flag field in the mount structure from 32 bits to 64 bits and eliminates the unused mnt_xflag field. The existing mnt_flag field is completely out of bits, so this update gives us room to expand. Note that the f_flags field in the statfs structure is already 64 bits, so the expanded mnt_flag field can be exported without having to make any changes in the statfs structure. Approved by: re (bz)	2011-07-24 17:43:09 +00:00
jonathan	c488a33edf	Turn on AUDIT_ARG_RIGHTS() for cap_new(2). Now that the code is in place to audit capability method rights, start using it to audit the 'rights' argument to cap_new(2). Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-07-22 12:50:21 +00:00
jonathan	8bec41f41d	Export capability information via sysctls. When reporting on a capability, flag the fact that it is a capability, but also unwrap to report all of the usual information about the underlying file. Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc	2011-07-20 09:53:35 +00:00
attilio	750fc68e27	Remove explicit MAXCPU usage from sys/pcpu.h avoiding a namespace pollution. That is a step further in the direction of building correct policies for userland and modules on how to deal with the number of maxcpus at runtime. Reported by: jhb Reviewed and tested by: pluknet Approved by: re (kib)	2011-07-19 16:50:55 +00:00
attilio	5dc25961e9	Remove pc_name member of struct pcpu. pc_name is only included when the KTR option is enabled, and it introduces a subtle KBI breakage that totally breaks vmstat when world and kernel are not in sync. Besides, it is not used anywhere. In collaboration with: pluknet Reviewed by: jhb Approved by: re (kib)	2011-07-19 14:57:59 +00:00
bz	1a8cc2bad9	Rename ki_ocomm to ki_tdname and OCOMMLEN to TDNAMLEN. Provide backward compatibility defines under BURN_BRIDGES. Suggested by: jhb Reviewed by: emaste Sponsored by: Sandvine Incorporated Approved by: re (kib)	2011-07-18 20:06:15 +00:00
jhb	4e1a6d0e67	- Export each thread's individual resource usage in struct kinfo_proc's ki_rusage member when KERN_PROC_INC_THREAD is passed to one of the process sysctls. - Correctly account for the current thread's cputime in the thread when doing the runtime fixup in calcru(). - Use TIDs as the key to lookup the previous thread to compute IO stat deltas in IO mode in top when thread display is enabled. Reviewed by: kib Approved by: re (kib)	2011-07-18 17:33:08 +00:00
attilio	9a6ff5ad37	- Remove the eintrcnt/eintrnames usage and introduce the concept of sintrcnt/sintrnames, which are symbols containing the size of the 2 tables. - For amd64/i386 remove the storage of intr* stuff from assembly files. This area can be widely improved by applying the same to other architectures and likely finding a unified approach among them and moving the whole code to be MI. More work in this area is expected to happen fairly soon. No MFC is planned for this patch. Tested by: pluknet Reviewed by: jhb Approved by: re (kib)	2011-07-18 15:19:40 +00:00
rwatson	7c21db8ed3	Define two new sysctl node flags: CTLFLAG_CAPRD and CTLFLAG_CAPWR, which may be jointly referenced via the mask CTLFLAG_CAPRW. Sysctls with these flags are available in Capsicum's capability mode; other sysctl nodes are not. Flag several useful sysctls as available in capability mode, such as memory layout sysctls required by the run-time linker and malloc(3). Also expose access to randomness and available kernel features. A few sysctls are enabled to support name->MIB conversion; these may leak information to capability mode by virtue of providing resolution on names not flagged for access in capability mode. This is, generally, not a huge problem, but might be something to resolve in the future. Flag these cases with XXX comments. Submitted by: jonathan Sponsored by: Google, Inc.	2011-07-17 23:05:24 +00:00
rstone	83ed8794e4	Fix a LOR between hwpmc and the kernel linker. When a system-wide sampling mode PMC is allocated, hwpmc calls linker_hwpmc_list_objects() while already holding an exclusive lock on pmc-sx lock. list_objects() tries to acquire an exclusive lock on the kld_sx lock. When a KLD module is loaded or unloaded successfully, kern_kld(un)load calls into the pmc hook while already holding an exclusive lock on the kld_sx lock. Calling the pmc hook requires acquiring a shared lock on the pmc-sx lock. Fix this by only acquiring a shared lock on the kld_sx lock in linker_hwpmc_list_objects(), and also downgrading to a shared lock on the kld_sx lock in kern_kld(un)load before calling into the pmc hook. In kern_kldload this required moving some modifications of the linker_file_t to happen before calling into the pmc hook. This fixes the deadlock by ensuring that the hwpmc -> list_objects() case is always able to proceed. Without this patch, I was able to deadlock a multicore system within minutes by constantly loading and unloading a KLD module while I simultaneously started a sampling mode PMC in a loop. MFC after: 1 month	2011-07-17 21:53:42 +00:00
jonathan	5132d7b9f3	Auto-generated system call code with cap_new(), cap_getrights(). Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-15 18:33:12 +00:00
jonathan	4ec3aaddb5	Add cap_new() and cap_getrights() system calls. Implement two previously-reserved Capsicum system calls: - cap_new() creates a capability to wrap an existing file descriptor - cap_getrights() queries the rights mask of a capability. Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-15 18:26:19 +00:00
jonathan	70f535313a	Add implementation for capabilities. Code to actually implement Capsicum capabilities, including fileops and kern_capwrap(), which creates a capability to wrap an existing file descriptor. We also modify kern_close() and closef() to handle capabilities. Finally, remove cap_filelist from struct capability, since we don't actually need it. Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-15 09:37:14 +00:00
jkim	96d6cc9832	If TSC stops ticking in C3, disable deep sleep when the user forcefully selects TSC as timecounter hardware. Tested by: Fabian Keil (freebsd-listen at fabiankeil dot de)	2011-07-14 21:00:26 +00:00
trasz	13232d13fa	Rename resource names to match these in login.conf.	2011-07-14 19:18:17 +00:00
bz	8448ba638c	Remove semaphore map entry count "semmap" field and its tuning option that is highly recommended to be adjusted in too much documentation while doing nothing in FreeBSD since r2729 (rev 1.1). ipcs(1) needs to be recompiled as it is accessing _KERNEL private variables. Reviewed by: jhb (before comment change on linux code) Sponsored by: Sandvine Incorporated	2011-07-14 14:18:14 +00:00
kib	09ee3c95a5	Implement an RFTSIGZMB flag to rfork(2) to specify a signal that is delivered to parent when the child exits. Submitted by: Petr Salinger <Petr.Salinger seznam cz> (Debian/kFreeBSD) MFC after: 1 week X-MFC-note: bump __FreeBSD_version	2011-07-12 20:37:18 +00:00
ae	4b5a09bf21	Include sys/sbuf.h directly.	2011-07-11 05:17:46 +00:00
mckusick	fd05bc36dc	Update tags build script	2011-07-10 00:53:04 +00:00
kib	61e3fec296	Add a facility to disable processing page faults. When activated, uiomove generates EFAULT if any accessed address is not mapped, as opposed to handling the fault. Sponsored by: The FreeBSD Foundation Reviewed by: alc (previous version)	2011-07-09 15:21:10 +00:00
mdf	d825d95c9f	Add an option to have a fail point term only execute when run by a specified pid. This is helpful for automated testing involving a global knob that would otherwise be executed by many other threads. MFC after: 1 week	2011-07-08 20:41:12 +00:00
mdf	def1e657f3	style(9) and cleanup fixes. MFC after: 1 week	2011-07-08 20:41:07 +00:00
jonathan	7c2c726167	Fix the "passability" test in fdcopy(). Rather than checking to see if a descriptor is a kqueue, check to see if its fileops flags include DFLAG_PASSABLE. At the moment, these two tests are equivalent, but this will change with the addition of capabilities that wrap kqueues but are themselves of type DTYPE_CAPABILITY. We already have the DFLAG_PASSABLE abstraction, so let's use it. This change has been tested with [the newly improved] tools/regression/kqueue. Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-08 12:19:25 +00:00
andre	41fc3214df	In the experimental soreceive_stream(): o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF is correctly returned on a remote connection close. o In the non-blocking socket test compare SS_NBIO against the so->so_state field instead of the incorrect sb->sb_state field. o Simplify the ENOTCONN test by removing cases that can't occur. Submitted by: trociny (with some further tweaks by committer) Tested by: trociny	2011-07-08 10:50:13 +00:00
trasz	b468f23f53	Style fix - macros are supposed to be uppercase.	2011-07-07 17:44:42 +00:00
andre	19bc1179a6	Remove the TCP_SORECEIVE_STREAM compile time option. The use of soreceive_stream() for TCP still has to be enabled with the loader tuneable net.inet.tcp.soreceive_stream. Suggested by: trociny and others	2011-07-07 10:37:14 +00:00
trasz	4a17b24427	All the racct_*() calls need to happen with the proc locked. Fixing this won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".	2011-07-06 20:06:44 +00:00
marius	97f9011cd8	Call pmap_qremove() before freeing or unwiring the pages, otherwise there's a window during which a page can be re-used before its previous mapping is removed. Reviewed by: alc MFC after: 1 week	2011-07-05 18:40:37 +00:00
jonathan	6abbb93d5f	Rework _fget to accept capability parameters. This new version of _fget() requires new parameters: - cap_rights_t needrights the rights that we expect the capability's rights mask to include (e.g. CAP_READ if we are going to read from the file) - cap_rights_t haverights used to return the capability's rights mask (ignored if NULL) - u_char maxprotp the maximum mmap() rights (e.g. VM_PROT_READ) that can be permitted (only used if we are going to mmap the file; ignored if NULL) - int fget_flags FGET_GETCAP if we want to return the capability itself, rather than the underlying object which it wraps Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-05 13:45:10 +00:00
jonathan	bf3c575ea1	Add kernel functions to unwrap capabilities. cap_funwrap() and cap_funwrap_mmap() unwrap capabilities, exposing the underlying object. Attempting to unwrap a capability with an inadequate rights mask (e.g. calling cap_funwrap(fp, CAP_WRITE | CAP_MMAP, &result) on a capability whose rights mask is CAP_READ | CAP_MMAP) will result in ENOTCAPABLE. Unwrapping a non-capability is effectively a no-op. These functions will be used by Capsicum-aware versions of _fget(), etc. Approved by: mentor (rwatson), re (Capsicum blanket) Sponsored by: Google Inc	2011-07-04 14:40:32 +00:00
attilio	364d0522f7	With retirement of cpumask_t and usage of cpuset_t for representing a mask of CPUs, pc_other_cpus and pc_cpumask become highly inefficient. Remove them and replace their usage with custom pc_cpuid magic (as, atm, pc_cpumask can be easily represented by (1 << pc_cpuid) and pc_other_cpus by (all_cpus & ~(1 << pc_cpuid))). This change is not targeted for MFC because of struct pcpu members removal and dependency by cpumask_t retirement. MD review by: marcel, marius, alc Tested by: pluknet MD testing by: marcel, marius, gonzo, andreast	2011-07-04 12:04:52 +00:00
bz	9cad5bfef3	Add infrastructure to allow all frames/packets received on an interface to be assigned to a non-default FIB instance. You may need to recompile world or ports due to the change of struct ifnet. Submitted by: cjsp Submitted by: Alexander V. Chernikov (melifaro ipfw.ru) (original versions) Reviewed by: julian Reviewed by: Alexander V. Chernikov (melifaro ipfw.ru) MFC after: 2 weeks X-MFC: use spare in struct ifnet	2011-07-03 12:22:02 +00:00
ed	49a6dac46f	Reintroduce the cioctl() hook in the TTY layer for digi(4). The cioctl() hook can be used by drivers to add ioctls to the .init and .lock devices. This commit breaks the ttydevsw ABI, since this structure didn't provide any padding. To prevent ABI breakage in the future, add a tsw_spare. Submitted by: Peter Jeremy <peter jeremy alcatel lucent com> Obtained from: kern/152254 (slightly modified)	2011-07-02 13:54:20 +00:00
jonathan	4d4c5b3285	When Capsicum starts creating capabilities to wrap existing file descriptors, we will want to allocate a new descriptor without installing it in the FD array. Split falloc() into falloc_noinstall() and finstall(), and rewrite falloc() to call them with appropriate atomicity. Approved by: mentor (rwatson), re (bz)	2011-06-30 15:22:49 +00:00
jonathan	8c932faae4	Add some checks to ensure that Capsicum is behaving correctly, and add some more explicit comments about what's going on and what future maintainers need to do when e.g. adding a new operation to a sys_machdep.c. Approved by: mentor(rwatson), re(bz)	2011-06-30 10:56:02 +00:00
alc	21902be08c	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib	2011-06-29 16:40:41 +00:00
jonathan	624e733467	We may split today's CAPABILITIES into CAPABILITY_MODE (which has to do with global namespaces) and CAPABILITIES (which has to do with constraining file descriptors). Just in case, and because it's a better name anyway, let's move CAPABILITIES out of the way. Also, change opt_capabilities.h to opt_capsicum.h; for now, this will only hold CAPABILITY_MODE, but it will probably also hold the new CAPABILITIES (implying constrained file descriptors) in the future. Approved by: rwatson Sponsored by: Google UK Ltd	2011-06-29 13:03:05 +00:00
ed	4ef034d1ea	Fix whitespace inconsistencies in the TTY layer and its drivers owned by me.	2011-06-26 18:26:20 +00:00
jonathan	7571a34069	Remove redundant Capsicum sysctl. Since we're now declaring FEATURE(security_capabilities), there's no need for an explicit SYSCTL_NODE. Approved by: rwatson	2011-06-25 12:37:06 +00:00
avg	74e5eef4ea	unconditionally stop other cpus when entering kdb in smp system ... and thus retire debug.kdb.stop_cpus tunable/sysctl. The knob was to work around CPU stopping issues, which since have been either fixed or greatly reduced. kdb should really operate in a special environment with scheduler stopped and interrupts disabled to provide deterministic debugging. Discussed with: attilio, rwatson X-MFC after: 2 months or never	2011-06-25 10:28:16 +00:00
avg	57d68644db	generic_stop_cpus: pull timeout logic from under DIAGNOSTIC ... and also increase the timeout. It's better to try to proceed somehow despite stuck CPUs than to hang indefinitely. Especially so during shutdown and when entering kdb or panic. Timeout value is still an arbitrary value. Timeout diagnostic is just a printf; the work on something more debuggable is planned by attilio. Need to be careful here as stop_cpus_hard is called very early while entering kdb and soon(-ish) it may become called very early when entering panic. Reviewed by: attilio MFC after: 2 months	2011-06-25 10:01:43 +00:00
jonathan	d77535edb6	Tidy up a capabilities-related comment. This comment refers to an #ifdef that hasn't been merged [yet?]; remove it. Approved by: rwatson	2011-06-24 14:40:22 +00:00
jkim	6da60ac39e	Set negative quality to TSC timecounter when C3 state is enabled for Intel processors unless the invariant TSC bit of CPUID is set. Intel processors may stop incrementing TSC when DPSLP# pin is asserted, according to Intel processor manuals, i. e., TSC timecounter is useless if the processor can enter deep sleep state (C3/C4). This problem was accidentally uncovered by r222869, which increased timecounter quality of P-state invariant TSC, e.g., for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c). Reported by: Fabian Keil (freebsd-listen at fabiankeil dot de) Ian FREISLICH (ianf at clue dot co dot za) Tested by: Fabian Keil (freebsd-listen at fabiankeil dot de) - Core2 Duo T5870 (C3 state available/enabled) jkim - Xeon X5150 (C3 state unavailable)	2011-06-22 16:40:45 +00:00
obrien	d17521adb2	Add comment from CSRG rev 7.27 (1992/06/23 19:56:55; author: mckusick)	2011-06-17 21:44:13 +00:00
kib	6e0462eab2	Do not trash the argv[0] pointer for an a.out process on amd64. Found with the binary provided by joerg.	2011-06-16 22:00:59 +00:00
kib	f13e1b44c6	Fix silly typo that resulted in the a.out process stack to end at ~200MB instead of 3GB on amd64.	2011-06-16 21:59:16 +00:00
marcel	c7e47a0b81	Even if the loaded module has no symbols, we still need to notify MD code about it and update the link map for GDB's use.	2011-06-16 17:41:21 +00:00
gibbs	6c5518eb45	sys/kern/subr_kdb.c: Modify the "alternate break sequence" detecting state machine so that only a contiguous invocation of the break sequence is accepted. The old implementation did not reset the state machine when detecting an unexpected character. While here, use an enum for the states of the machine instead of magic numbers. Sponsored by: Spectra Logic Corporation	2011-06-14 21:37:25 +00:00
obrien	f797e31a8d	We should not return ECHILD when debugging a child and the parent does a "wait4(-1, ..., WNOHANG, ...)". Instead wait(2) should behave as if the child does not wish to report status at this time. Reviewed by: jhb	2011-06-14 17:09:30 +00:00
gibbs	9d45c190c8	sys/sys/conf.h: sys/kern/kern_conf.c: Add make_dev_physpath_alias(). This interface takes the parent cdev of the alias, an old alias cdev (if any) to replace with the newly created alias, and the physical path string. The alias is visible as a symlink to the parent, with the same name as the parent, rooted at physpath in devfs. Note: make_dev_physpath_alias() has hard coded knowledge of the Solaris style prefix convention for physical path data, "id1,". In the future, I expect the convention to change to allow "physical path quality" to be reported in the prefix. For example, a physical path based on NewBus topology would be of "lower quality" than a physical path reported by a device enclosure. Sponsored by: Spectra Logic Corporation	2011-06-14 16:29:43 +00:00
ken	b8dcfe0228	Instead of using an atomic operation to determine whether the devstat(9) device node has been created, pass MAKEDEV_CHECKNAME in so that the devfs code will do the check. Use a regular static variable as before, that's good enough to keep us from calling into devfs most of the time. Suggested by: kib MFC after: 1 week Sponsored by: Spectra Logic Corporation	2011-06-13 22:08:24 +00:00
gibbs	f3f56007b6	Fix a couple of race conditions in devstat(9) initialization. In devstat_new_entry(), there is no need to initialize the queue and the mutex in this function. There are ways to do static initialization on both, so use STAILQ_HEAD_INITIALIZER and MTX_SYSINIT to initialize the queue and the mutex. In devstat_alloc(), use an atomic test and set routine to guard making our entry in /dev. Using just a plain static variable creates a race condition on multiprocessor machines. If you attempt to create a second entry in devfs, the kernel will panic. Submitted by: kdm Reviewed by: gibbs Sponsored by: Spectra Logic Corporation MFC after: 1 week.	2011-06-13 21:21:02 +00:00
jeff	02186557c4	- When printing bufs with show buf the lblkno is often more useful than the blkno. Print them both.	2011-06-10 22:15:36 +00:00
attilio	2514230a6b	In the current code, a double panic condition may lead to dumps interleaving. Signal dumping to happen only for the first panic which should be the most important. Sponsored by: Sandvine Incorporated Submitted by: Nima Misaghian (nmisaghian AT sandvine DOT com) MFC after: 2 weeks	2011-06-08 19:28:59 +00:00
jhb	b5ffa4ca36	Log the socket address passed as the destination to sendto() and sendmsg() via ktrace. MFC after: 1 week	2011-06-07 17:40:33 +00:00
attilio	6ed3ca2c5b	MFC	2011-06-07 08:24:29 +00:00
ken	048adb69c7	Set pca.p_bufr to NULL when we haven't allocated a buffer. Otherwise, p_bufr is set to garbage on the stack, and if that garbage happens to be non-NULL, and the TOLOG or TOCONS flag is set, putbuf() will get called and attempt to fill the non-existent buffer. This is really only relevant for tprintf() (and only when the priority is not -1), but set it in uprintf() and ttyprintf() for completeness. The next step, to avoid log buffer scrambling, would be to add the PRINTF_BUFR_SIZE code to tprintf(), but this should prevent panics. Submitted by: rmacklem Found by: pho	2011-06-07 05:04:37 +00:00
davidxu	fc6d16c51a	Use p4prio_to_tsprio to calculate TS priority instead of using p4prio_to_rtpprio which is for RT priority. PR: kern/157657 Submitted by: krivenok.dmitry at gmail dot com MFC after: 3 days	2011-06-07 02:50:14 +00:00
marcel	36b8c5d486	Fix making kernel dumps from the debugger by creating a command for it. Do not expect a developer to call doadump(). Calling doadump does not necessarily work when it's declared static. Nor does it necessarily do what was intended in the context of text dumps. The dump command always creates a core dump. Move printing of error messages from doadump to the dump command, now that we don't have to worry about being called from DDB.	2011-06-07 01:28:12 +00:00
attilio	fcefe479fe	MFC	2011-06-06 21:38:39 +00:00
jhb	aa8ddae280	Clear the device_t pointer in 'struct resource' when releasing a device as otherwise the sysctl to export rman info can dereference a stale pointer. PR: kern/115371 Submitted by: Arthur Hartwig MFC after: 1 week	2011-06-06 13:12:56 +00:00
attilio	9f19c1c64d	MFC	2011-06-01 16:54:33 +00:00
ken	9237f32b34	Fix a bug introduced in revision 222537. In msgbuf_reinit() and msgbuf_init(), we weren't initializing the mutex. Depending on the contents of memory, the LO_INITIALIZED flag might be set on the mutex (either due to a warm reboot, and the message buffer remaining in place, or due to garbage in memory) and in that case, with INVARIANTS turned on, we would trigger an assertion that the mutex had already been initialized. Fix this by bzeroing the message buffer mutex for the _init() and _reinit() paths. Reported by: mdf	2011-05-31 22:39:32 +00:00
attilio	bc4d32e80b	MFC	2011-05-31 21:22:44 +00:00
attilio	a924571ff7	Fix KTR_CPUMASK in order to accept a string representing a cpuset_t. This introduces all the underlying support for making this possible (via the function cpusetobj_strscan()) and keeps ktr_cpumask exported. sparc64 implements its own assembly primitives for tracing events and needs to properly check it. Anyway the sparc64 logic is not implemented yet due to lack of knowledge (by me) and time (by marius), but it is just a matter of using ktr_cpumask when possible. Tested and fixed by: pluknet Reviewed by: marius	2011-05-31 20:48:58 +00:00
attilio	066c7ac96c	Revert a change that crept in during MFC.	2011-05-31 20:23:33 +00:00
ken	0febb6df5e	Fix apparent garbage in the message buffer. While we have had a fix in place (options PRINTF_BUFR_SIZE=128) to fix scrambled console output, the message buffer and syslog were still getting log messages one character at a time. While all of the characters still made it into the log (courtesy of atomic operations), they were often interleaved when there were multiple threads writing to the buffer at the same time. This fixes message buffer accesses to use buffering logic as well, so that strings that are less than PRINTF_BUFR_SIZE will be put into the message buffer atomically. So now dmesg output should look the same as console output. subr_msgbuf.c: Convert most message buffer calls to use a new spin lock instead of atomic variables in some places. Add a new routine, msgbuf_addstr(), that adds a NUL-terminated string to a message buffer. This takes a priority argument, which allows us to eliminate some races (at least in the string at a time case) that are present in the implementation of msglogchar(). (dangling and lastpri are static variables, and are subject to races when multiple callers are present.) msgbuf_addstr() also allows the caller to request that carriage returns be stripped out of the string. This matches the behavior of msglogchar(), but in testing so far it doesn't appear that any newlines are being stripped out. So the carriage return removal functionality may be a candidate for removal later on if further analysis shows that it isn't necessary. subr_prf.c: Add a new msglogstr() routine that calls msgbuf_logstr(). Rename putcons() to putbuf(). This now handles buffered output to the message log as well as the console. Also, remove the logic in putcons() (now putbuf()) that added a carriage return before a newline. The console path was the only path that needed it, and cnputc() (called by cnputs()) already adds a carriage return. So this duplication resulted in kernel-generated console output lines ending in '\r''\r''\n'. Refactor putchar() to handle the new buffering scheme. Add buffering to log(). Change log_console() to use msglogstr() instead of msglogchar(). Don't add extra newlines by default in log_console(). Hide that behavior behind a tunable/sysctl (kern.log_console_add_linefeed) for those who would like the old behavior. The old behavior led to the insertion of extra newlines for log output for programs that print out a string, and then a trailing newline on a separate write. (This is visible with dmesg -a.) msgbuf.h: Add a prototype for msgbuf_addstr(). Add three new fields to struct msgbuf, msg_needsnl, msg_lastpri and msg_lock. The first two are needed for log message functionality previously handled by msglogchar(). (Which is still active if buffering isn't enabled.) Include sys/lock.h and sys/mutex.h for the new mutex. Reviewed by: gibbs	2011-05-31 17:29:58 +00:00
nwhitehorn	a69e106b2f	On multi-core, multi-threaded PPC systems, it is important that the threads be brought up in the order they are enumerated in the device tree (in particular, that thread 0 on each core be brought up first). The SLIST through which we loop to start the CPUs has all of its entries added with SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration and so AP startup would always fail in such situations (causing a machine check or RTAS failure). Fix this by changing the SLIST into an STAILQ, and inserting new CPUs at the end. Reviewed by: jhb	2011-05-31 15:11:43 +00:00
attilio	b1bf71d3c5	MFC	2011-05-31 14:18:10 +00:00
attilio	8dd6262cd3	MFC	2011-05-29 18:33:13 +00:00
trociny	1dfa9ab873	In soreceive_generic(), if MSG_WAITALL is set but the request is larger than the receive buffer, we have to receive in sections. When notifying the protocol that some data has been drained the lock is released for a moment. Returning we block waiting for the rest of data. There is a race, when data could arrive while the lock was released and then the connection stalls in sbwait. Fix this by checking for data before blocking and skip blocking if there are some. PR: kern/154504 Reported by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Tested by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Reviewed by: rwatson Approved by: kib (co-mentor) MFC after: 2 weeks	2011-05-29 18:00:50 +00:00
attilio	55a3bf38a5	MFC	2011-05-29 00:59:38 +00:00
trasz	5499b0b9d5	Remove definitions for RACCT_FSIZE and RACCT_SBSIZE - these two are rather performance-sensitive and not that useful, so I won't be merging them before 9.0.	2011-05-27 19:57:58 +00:00
attilio	eefddaeed6	MFC	2011-05-27 16:09:10 +00:00
trasz	6a13eaa4d1	Fix support for RACCT_CORE by merging forgotten file.	2011-05-26 18:54:07 +00:00
attilio	867c6223e7	MFC	2011-05-26 17:38:00 +00:00
jhb	3c1a24d701	Silly spelling typos. Submitted by: "b. f."	2011-05-24 19:55:57 +00:00
jhb	7028e129fd	Fix an issue with critical sections and SMP rendezvous handlers. Specifically, a critical_exit() call that drops the nesting level to zero has a brief window where the pending preemption flag is set and the nesting level is set to zero. This is done purposefully to avoid races where a preemption scheduled by an interrupt could be lost otherwise (see revision 144777). However, this does mean that if an interrupt fires during this window and enters and exits a critical section, it may preempt from the interrupt context. This is generally fine as the interrupt code is careful to arrange critical sections so that they are not exited until it is safe to preempt (e.g. interrupts EOI'd and masked if necessary). However, the SMP rendezvous IPI handler does not quite follow this rule, and in general a rendezvous can never be preempted. Rendezvous handlers are also not permitted to schedule threads to execute, so they will not typically trigger preemptions. SMP rendezvous handlers may use spinlocks (carefully) such as the rm_cleanIPI() handler used in rmlocks, but using a spinlock also enters and exits a critical section. If the interrupted top-half code is in the brief window of critical_exit() where the nesting level is zero but a preemption is pending, then releasing the spinlock can trigger a preemption. Because we know that SMP rendezvous handlers can never schedule a thread, we know that a critical_exit() in an SMP rendezvous handler will only preempt in this edge case. We also know that the top-half thread will happily handle the deferred preemption once the SMP rendezvous has completed, so the preemption will not be lost. This makes it safe to employ a workaround where we use a nested critical section in the SMP rendezvous code itself around rendezvous action routines to prevent any preemptions during an SMP rendezvous. The workaround intentionally avoids checking for a deferred preemption when leaving the critical section on the assumption that if there is a pending preemption it will be handled by the interrupted top-half code. Submitted by: mlaier (variation specific to rm_cleanIPI()) Obtained from: Isilon MFC after: 1 week	2011-05-24 13:36:41 +00:00
jhb	4d0fe668f7	Update comments for DEVICE_PROBE() to reflect that BUS_PROBE_DEFAULT is now the preferred typical return value from a probe routine. Discourage the use of 0 (BUS_PROBE_SPECIFIC) as it should be used very rarely. Point the reader to the DEVICE_PROBE(9) manpage for more detailed notes on possible probe return values. Submitted by: Philip Soeberg philip-dev of soeberg net	2011-05-24 13:22:40 +00:00
jhb	d73862793b	Simplify a stale assertion. We have not called mi_switch() from a nested critical section during a preemption for several years. MFC after: 1 week	2011-05-24 13:17:08 +00:00
attilio	9879530ca1	MFC	2011-05-23 23:58:02 +00:00
attilio	66305282ac	Revert a patch that involuntarily sneaked in while I was MFCing.	2011-05-23 23:50:21 +00:00
ru	5a5a985b61	BKVASIZE was bumped to 16k more than a decade ago.	2011-05-23 19:59:01 +00:00
jh	fbe30c6e5c	In init_dynamic_kenv(), ignore environment strings exceeding the KENV_MNAMELEN + 1 + KENV_MVALLEN + 1 length limit to avoid buffer overflow in getenv(). Currently loader(8) doesn't limit the length of environment strings. PR: kern/132104 MFC after: 1 month	2011-05-23 16:40:44 +00:00
attilio	6d7371f950	MFC	2011-05-23 01:17:30 +00:00
attilio	a8b367d89d	Merge r221912 from largeSMP project branch: Fix a long-standing bug in cpuset_thread0() where only the first part of cs_mask is set full. Submitted by: anonymous MFC after: 1 week	2011-05-22 21:35:03 +00:00
attilio	627bd73cdb	MFC	2011-05-22 20:41:10 +00:00
attilio	08bcb681d2	Make cpusetobj_strprint() prepare the string so that the least significant cpuset_t word is printed at the rightmost part of the string (farthest from its beginning). This follows the natural representation of bits in the words.	2011-05-22 20:29:47 +00:00
rmacklem	fbb8a5e8ec	Add a lock flags argument to the VFS_FHTOVP() file system method, so that callers can indicate the minimum vnode locking requirement. This will allow some file systems to choose to return a LK_SHARED locked vnode when LK_SHARED is specified for the flags argument. This patch only adds the flag. It does not change any file system to use it and all callers specify LK_EXCLUSIVE, so file system semantics are not changed. Reviewed by: kib	2011-05-22 01:07:54 +00:00
attilio	0372174d48	MFC	2011-05-19 22:55:37 +00:00
kib	8407c0a698	The CDP_ACTIVE flag is cleared at the beginning of destroy_devl(), and destroy_devl() drops dev_mtx. The protection against the race with dev_rel(), introduced in r163328, should be extended to cover destroy_devl() calls for the children of the destroyed dev. Reported and tested by: joerg MFC after: 1 week	2011-05-18 22:36:58 +00:00
attilio	0828d417d4	Fix mismerge. Reported by: pluknet	2011-05-18 15:50:12 +00:00
attilio	01e90e3193	Merge r221285 from largeSMP project: - Remove the following sysctl: kern.sched.ipiwakeup.onecpu kern.sched.ipiwakeup.htt2 Because they are absolutely obsolete. Probably the whole wakeup forward mechanism should be revisited for a better fitting in modern hw, in the future. - As map2 variable is no longer used rename map3 to map2 - Fix a string by making the msg more informative and removing the argument passing. Reviewed by: julian Tested by: several	2011-05-17 22:14:00 +00:00
attilio	2cdf500faf	MFC	2011-05-17 22:03:01 +00:00
jhb	8d84cd707e	Fix a race in the SMP rendezvous code. Specifically, the write by the last CPU to finish the rendezvous action may become visible to different CPUs at different times. As a result, the CPU that initiated the rendezvous may exit the rendezvous and drop the lock allowing another rendezvous to be initiated on the same CPU or a different CPU. In that case the exit sentinel may be cleared before all CPUs have noticed causing those CPUs to hang forever. Work around this by using a generation count to notice when this race occurs and to exit the rendezvous in that case. The problem was independently diagnosed by mlaier@ and avg@ as well. Submitted by: neel Reviewed by: avg, mlaier Obtained from: NetApp MFC after: 1 week	2011-05-17 16:39:08 +00:00
phk	c0026c6642	Use memset() instead of bzero() and memcpy() instead of bcopy(), there is no relevant difference for sbufs, and it increases portability of the source code. Split the actual initialization of the sbuf into a separate local function, so that certain static code checkers can understand what sbuf_new() does, thus eliminating one silly annoyance of MISRA compliance testing. Contributed by: An anonymous company in the last business I expected sbufs to invade.	2011-05-17 11:04:50 +00:00
phk	99b5f98226	Don't expect PAGE_SIZE to exist on all platforms (It is a pretty arbitrary choice of default size in the first place) Reverse the order of arguments to the internal static sbuf_put_byte() function to match everything else in this file. Move sbuf_putc_func() inside the kernel version of sbuf_vprintf where it belongs. sbuf_putc() incorrectly used sbuf_putc_func() which suppresses NUL characters, it should use sbuf_put_byte(). Make sbuf_finish() return -1 on error. Minor stylistic nits fixed.	2011-05-17 06:36:32 +00:00
attilio	fd96a5afd1	Merge r221278 from largeSMP project: idle_cpus_mask is just used in sched_4bsd, thus make it private for it. Tested by: several	2011-05-16 23:20:12 +00:00
attilio	d57a3c7c06	MFC	2011-05-16 16:34:03 +00:00
phk	9ed2621ed9	Change the length quantities of sbufs to be ssize_t rather than int. Constify a couple of arguments.	2011-05-16 16:18:40 +00:00
avg	576b51ab8f	better integrate cyclic module with clocksource/eventtimer subsystem Now in the case when one-shot timers are used cyclic events should fire closer to their scheduled times. As the cyclic is currently used only to drive DTrace profile provider, this is the area where the change makes a difference. Reviewed by: mav (earlier version, a while ago) X-MFC after: clocksource/eventtimer subsystem	2011-05-16 15:29:59 +00:00
attilio	c5a5c48e70	Fix a longstanding bug where only the first part of the cpumask was correctly set full. Submitted by: anonymous	2011-05-14 19:36:12 +00:00


12602 Commits