freebsd-nq

Author	SHA1	Message	Date
Bjoern A. Zeeb	592bcae802	Add ip4.saddrsel/ip4.nosaddrsel (and equivalent for ip6) to control whether to use source address selection (default) or the primary jail address for unbound outgoing connections. This is intended to be used by people upgrading from single-IP jails to multi-IP jails but not having to change firewall rules, application ACLs, ... but to force their connections (unless otherwise changed) to the primary jail IP they had been using for years, as well as for people preferring to implement similar policies. Note that for IPv6, if configured incorrectly, this might lead to scope violations, which single-IPv6 jails could as well, as by the design of jails. [1] Reviewed by: jamie, hrs (ipv6 part) Pointed out by: hrs [1] MFC After: 2 weeks Asked for by: Jase Thew (bazerka beardz.net)	2010-01-17 12:57:11 +00:00
Brooks Davis	9126964cdb	Only allocate the space we need before calling kern_getgroups instead of allocating whatever the user asks for up to "ngroups_max + 1". On systems with large values of kern.ngroups this will be more efficient. The now redundant check that the array is large enough in kern_getgroups() is deliberate to allow this change to be merged to stable/8 without breaking potential third party consumers of the API. Reported by: bde MFC after: 28 days	2010-01-15 07:18:46 +00:00
Ed Schouten	d3e4b91f9c	Remove the 1000 pseudo terminal limit from pts(4). Even with the old utmp format, we could in fact go to pts/9999, because ut_line wasn't guaranteed to be null terminated there.	2010-01-13 21:22:23 +00:00
Brooks Davis	93833c1db6	Declare the kern.ngroups sysctl to be read-only, but tunable at boot for better error reporting. Submitted by: Matthew Fleming <matthew dot fleming at isilon dot com> MFC After: 1 month	2010-01-12 18:20:20 +00:00
Brooks Davis	412f9500e2	Replace the static NGROUPS=NGROUPS_MAX+1=1024 with a dynamic kern.ngroups+1. kern.ngroups can range from NGROUPS_MAX=1023 to INT_MAX-1. Given that the Windows group limit is 1024, this range should be sufficient for most applications. MFC after: 1 month	2010-01-12 07:49:34 +00:00
Bjoern A. Zeeb	fe0518e9ee	Change DDB show prison: - name some columns more closely to the user space variables, as we do for host.* or allow.* (in the listing) already. - print pr_childmax (children.max). - prefix hex values with 0x. MFC after: 3 weeks	2010-01-11 22:34:25 +00:00
Bjoern A. Zeeb	bef916f99c	Adjust a comment to reflect reality, as we have proper source address selection, even for IPv4, since r183571. Pointed out by: Jase Thew (bazerka beardz.net) MFC after: 3 days	2010-01-11 21:21:30 +00:00
Kirk McKusick	e268f54cb4	Background: When renaming a directory it passes through several intermediate states. First its new name will be created causing it to have two names (from possibly different parents). Next, if it has different parents, its value of ".." will be changed from pointing to the old parent to pointing to the new parent. Concurrently, its old name will be removed bringing it back into a consistent state. When fsck encounters an extra name for a directory, it offers to remove the "extraneous hard link"; when it finds that the names have been changed but the update to ".." has not happened, it offers to rewrite ".." to point at the correct parent. Both of these changes were considered unexpected so would cause fsck in preen mode or fsck in background mode to fail with the need to run fsck manually to fix these problems. Fsck running in preen mode or background mode now corrects these expected inconsistencies that arise during directory rename. The functionality added with this update is used by fsck running in background mode to make these fixes. Solution: This update adds three new fsck sysctl commands to support background fsck in correcting expected inconsistencies that arise from incomplete directory rename operations. They are: setcwd(dirinode) - set the current directory to dirinode in the filesystem associated with the snapshot. setdotdot(oldvalue, newvalue) - Verify that the inode number for ".." in the current directory is oldvalue then change it to newvalue. unlink(nameptr, oldvalue) - Verify that the inode number associated with nameptr in the current directory is oldvalue then unlink it. As with all other fsck sysctls, these new ones may only be used by processes with appropriate privilege. Reported by: jeff Security issues: rwatson	2010-01-11 20:44:05 +00:00
Warner Losh	eae8e367c5	Merge change r198561 from projects/mips to head: r198561 | thompsa | 2009-10-28 15:25:22 -0600 (Wed, 28 Oct 2009) | 4 lines Allow a scratch buffer to be set in order to be able to use setenv() while booting, before dynamic kenv is running. A few platforms implement their own scratch+sprintf handling to save data from the boot environment.	2010-01-10 22:34:18 +00:00
David Xu	a4b0b4b062	Make a chain be a list of queues, and make threads waiting for same key coalesce to same queue, this makes searching path shorter and improves performance. Also fix comments about shared PI-mutex.	2010-01-10 09:31:57 +00:00
Brooks Davis	5feedc2575	Correct the explanation text for kern.ngroups. It reflects the number of supplemental groups, not the total number of groups. MFC after: 3 days	2010-01-09 23:22:31 +00:00
David Xu	73532aa78c	Use enum to define key types. Suggested by: jmallett	2010-01-09 06:30:40 +00:00
David Xu	4904f91fe0	put semaphore waiter in long term list.	2010-01-09 06:12:44 +00:00
David Xu	2c3b3fef36	Add key type TYPE_SEM.	2010-01-09 06:05:31 +00:00
Attilio Rao	f7829d0d5c	Introduce the new kernel thread called "deadlock resolver". While the name is pretentious, a good explanation of its targets is reported in this 17 months old presentation e-mail: http://lists.freebsd.org/pipermail/freebsd-arch/2008-August/008452.html In order to implement it, the sq_type in sleepqueues is mandatory and not only compiled along with INVARIANTS option. Additionally, a new sleepqueue function, sleepq_type() is added, returning the type of the sleepqueue linked to a wchan. Three new sysctls are added in order to configure the thread: debug.deadlkres.slptime_threshold debug.deadlkres.blktime_threshold debug.deadlkres.sleepfreq representing the thresholds for sleep and block time that will lead to a deadlock matching (when exceeded), while the sleepfreq represents the number of seconds between 2 consecutive thread runnings. In order to enable the deadlock resolver thread recompile your kernel with the option DEADLKRES. Reviewed by: jeff Tested by: pho, Giovanni Trematerra Sponsored by: Nokia Incorporated, Sandvine Incorporated MFC after: 2 weeks	2010-01-09 01:46:38 +00:00
Christian Brueffer	30215f483e	Free allocated sbufs before returning ENOMEM. PR: 128335 Submitted by: Mateusz Guzik <mjguzik@gmail.com> MFC after: 2 week	2010-01-08 22:58:50 +00:00
Attilio Rao	6eac7e5752	- Fix a bug in sched_4bsd where the timestamp for the sleeping operation is not cleaned up on the wakeup but reset. This is harmless mostly because td_slptick (and ki_slptime from userland) should be analyzed only with the assumption that the thread is actually sleeping (thus while the td_slptick is correctly set) but without this invariant the number is no longer consistent. - Move td_slptick from u_int to int in order to follow 'ticks' signedness and wrap up accordingly [0] [0] Submitted by: emaste Sponsored by: Sandvine Incorporated MFC 1 week	2010-01-08 14:55:11 +00:00
Martin Blapp	c2ede4b379	Remove extraneous semicolons, no functional changes. Submitted by: Marc Balmer <marc@msys.ch> MFC after: 1 week	2010-01-07 21:01:37 +00:00
Attilio Rao	aab9c8c28d	Fix typos.	2010-01-07 01:24:09 +00:00
Attilio Rao	c636ba831a	Tweak comments.	2010-01-07 01:19:01 +00:00
Attilio Rao	9dbf7a62f4	Exclusive waiters sleeping with LK_SLEEPFAIL on and using interruptible sleeps/timeout may have left spurious lk_exslpfail counts on, so clean it up even when accessing a shared queue acquisition, giving to lk_exslpfail the value of 'upper limit'. In the worst case scenario, in fact (mixed interruptible sleep / LK_SLEEPFAIL waiters) what may happen is that both queues are awakened even if that's not necessary, but still no harm. Reported by: Lucius Windschuh <lwindschuh at googlemail dot com> Reviewed by: kib Tested by: pho, Lucius Windschuh <lwindschuh at googlemail dot com>	2010-01-07 00:47:50 +00:00
David Xu	9b0f1823b5	Use umtx to implement process sharable semaphore, to make this work, now type sema_t is a structure which can be put in a shared memory area, and multiple processes can operate it concurrently. User can either use mmap(MAP_SHARED) + sem_init(pshared=1) or use sem_open() to initialize a shared semaphore. Named semaphore uses file system and is located in /tmp directory, and its file name is prefixed with 'SEMD', so now it is chroot or jail friendly. In simplest cases, both for named and un-named semaphore, userland code does not have to enter kernel to reduce/increase semaphore's count. The semaphore is designed to be crash-safe, it means even if an application is crashed in the middle of operating semaphore, the semaphore state is still safely recovered by later use, there is no waiter counter maintained by userland code. The main semaphore code is in libc and libthr only has some necessary stubs, this makes it possible that a non-threaded application can use semaphore without linking to thread library. Old semaphore implementation is kept in libc to maintain binary compatibility. The kernel ksem API is no longer used in the new implementation. Discussed on: threads@	2010-01-05 02:37:59 +00:00
Ed Schouten	328d9d2c96	Make TIOCSTI work again. It looks like I didn't implement this when I imported MPSAFE TTY. Applications like mail(1) still use this. I think it's conceptually bad. Tested by: Pete French <petefrench ticketswitch com> MFC after: 2 weeks	2010-01-04 20:59:52 +00:00
Edward Tomasz Napierala	922ec47140	Fix comments.	2010-01-04 12:39:42 +00:00
David Xu	79f8b61995	Add user-level semaphore synchronous type, this change allows multiple processes to share semaphore by using shared memory area, in simplest case, only one atomic operation is needed in userland, waiter flag is maintained by kernel and userland only checks the flag, if the flag is set, user code enters kernel and does a wakeup() call. Move type definitions into file _umtx.h to minimize compiling time. Also type names need to be prefixed with underline character, this would reduce name conflict (still in progress).	2010-01-04 05:27:49 +00:00
Brooks Davis	7eb5db5179	If a filter has already been added, actually return EEXIST when trying to add it again. MFC after: 1 week	2009-12-31 20:56:28 +00:00
Brooks Davis	a6fffd6cb0	The devices that supported EVFILT_NETDEV kqueue filters were removed in r195175. Remove all definitions, documentation, and usage. fifo_misc.c: Remove all kqueue tests as fifo_io.c performs all those that would have remained. Reviewed by: rwatson MFC after: 3 weeks X-MFC note: don't change vlan_link_state() function signature	2009-12-31 20:29:58 +00:00
Konstantin Belousov	17c4c3563c	Allow swap out of the kernel stack for the thread with priority greater or equal than PSOCK, not less or equal. Higher priority has lesser numerical value. Existing test does not allow for swapout of the thread waiting for advisory lock, for exiting child or sleeping for timeout. On the other hand, high-priority waiters of VFS/VM events can be swapped out. Tested by: pho Reviewed by: jhb MFC after: 1 week	2009-12-31 18:52:58 +00:00
John Baldwin	6a41dc1043	Actually set RLE_ALLOCATED when allocating a reserved resource so that resource_list_release() will later release the resource instead of failing.	2009-12-30 22:37:28 +00:00
John Baldwin	0eb9893a80	- Assert that a reserved resource returned via resource_list_alloc() is not active. - Fix bus_generic_rl_(alloc|release)_resource() to not attempt to fetch a resource list for grandchild devices, but just pass those requests up to the parent directly. This worked by accident previously, but it is better to not let bus drivers try to operate on devices they do not manage.	2009-12-30 19:44:31 +00:00
Robert Noland	cfd7bacef2	Update d_mmap() to accept vm_ooffset_t and vm_memattr_t. This replaces d_mmap() with the d_mmap2() implementation and also changes the type of offset to vm_ooffset_t. Purge d_mmap2(). All driver modules will need to be rebuilt since D_VERSION is also bumped. Reviewed by: jhb@ MFC after: Not in this lifetime...	2009-12-29 21:51:28 +00:00
Edward Tomasz Napierala	6cb02977e2	SLIP is gone; remove its mutex from witness.	2009-12-29 08:45:27 +00:00
Ed Schouten	62375ca8c1	Don't forget to use `void' for sched_balance(). It has no arguments.	2009-12-28 23:12:12 +00:00
Antoine Brodin	13e403fdea	(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument. Fix some wrong usages. Note: this does not affect generated binaries as this argument is not used. PR: 137213 Submitted by: Eygene Ryabinkin (initial version) MFC after: 1 month	2009-12-28 22:56:30 +00:00
Konstantin Belousov	a411786576	Add a knob to allow reclaim of the directory vnodes that are source of the namecache records. The reclamation is not enabled by default because for typical workload it would make namecache unusable, but large nested directory tree easily puts any process that accesses filesystem into 1 second wait for vlru. Reported by: yar (long time ago) MFC after: 3 days	2009-12-28 15:35:39 +00:00
Edward Tomasz Napierala	558e9b5c95	Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND is not being used improperly.	2009-12-26 11:36:10 +00:00
Bjoern A. Zeeb	e4ff598ec6	Remove extra spaces (no functional change). MFC after: 3 days	2009-12-25 21:14:05 +00:00
Bjoern A. Zeeb	095809b084	Remove an unused global. MFC after: 3 days	2009-12-25 20:03:03 +00:00
Robert Watson	c7ca33d138	Minor comment tweaks in rmlocks. MFC after: 3 days	2009-12-25 01:16:24 +00:00
Konstantin Belousov	49e3050e6c	VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object flag. Besides providing the redundant information, need to update both vnode and object flags causes more acquisition of vnode interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects. Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects. Suggested and reviewed by: alc Tested by: pho MFC after: 3 weeks	2009-12-21 12:29:38 +00:00
Ed Schouten	907b48bc05	Fix indentation.	2009-12-20 22:55:27 +00:00
Ed Schouten	8dc9b4cf04	Let access overriding to TTYs depend on the cdev_priv, not the vnode. Basically this commit changes two things, which improves access to TTYs in exceptional conditions. Basically the problem was that when you ran jexec(8) to attach to a jail, you couldn't use /dev/tty (well, also the node of the actual TTY, e.g. /dev/pts/X). This is very inconvenient if you want to attach to screens quickly, use ssh(1), etc. The fixes: - Cache the cdev_priv of the controlling TTY in struct session. Change devfs_access() to compare against the cdev_priv instead of the vnode. This allows you to bypass UNIX permissions, even across different mounts of devfs. - Extend devfs_prison_check() to unconditionally expose the device node of the controlling TTY, even if normal prison nesting rules normally don't allow this. This actually allows you to interact with this device node. To be honest, I'm not really happy with this solution. We now have to store three pointers to a controlling TTY (s_ttyp, s_ttyvp, s_ttydp). In an ideal world, we should just get rid of the latter two and only use s_ttyp, but this makes certain pieces of code very impractical (e.g. devfs, kern_exit.c). Reported by: Many people	2009-12-19 18:42:12 +00:00
Edward Tomasz Napierala	28d3fd007e	Interpret VAPPEND correctly in vaccess_acl_nfs4(9).	2009-12-19 11:41:52 +00:00
Ed Schouten	e6d84d057a	Make the wchan names of pts(4) fit in top(1). Just like a similar change we made to the TTY code about half a year ago, make these strings look similar. Suggested by: Jille Timmermans <jille@quis.cx>	2009-12-18 20:11:29 +00:00
Andrew Thompson	2b54315009	If the runcount is non-zero in eventhandler_deregister() then one or more threads are executing the eventhandler, sleep in this case to make it safe for module unload. If the runcount was up then an entry would have been marked EHE_DEAD_PRIORITY so use this as a trigger to do the wakeup in eventhandler_prune_list(). Reviewed by: jhb	2009-12-17 21:17:13 +00:00
Matt Jacob	e7d829a46c	Fix argument order in a call to mtx_init. MFC after: 1 week	2009-12-17 00:22:56 +00:00
Luigi Rizzo	20c510f826	Properly fix callout handling by putting all the per-cpu info in struct callout_cpu. From the comment in the file: + * There is one struct callout_cpu per cpu, holding all relevant + * state for the callout processing thread on the individual CPU. + * In particular: + * cc_ticks is incremented once per tick in callout_cpu(). + * It tracks the global 'ticks' but in a way that the individual + * threads should not worry about races in the order in which + * hardclock() and hardclock_cpu() run on the various CPUs. + * cc_softclock is advanced in callout_cpu() to point to the + * first entry in cc_callwheel that may need handling. In turn, + * a softclock() is scheduled so it can serve the various entries i + * such that cc_softclock <= i <= cc_ticks . Together with a smaller patch committed in september, this fixes a bug that affects 8.0 with apps that rely on callouts to fire exactly in the number of ticks specified (qemu among them). Right now, callouts in 8.0 fire one tick late. This was discussed in september with JeffR and jhb MFC after: 3 days	2009-12-14 12:23:46 +00:00
Bjoern A. Zeeb	de0bd6f76b	Throughout the network stack we have a few places of if (jailed(cred)) left. If you are running with a vnet (virtual network stack) those will return true and defer you to classic IP-jails handling and thus things will be "denied" or returned with an error. Work around this problem by introducing another "jailed()" function, jailed_without_vnet(), that also takes vnets into account, and permits the calls, should the jail from the given cred have its own virtual network stack. We cannot change the classic jailed() call to do that, as it is used outside the network stack as well. Discussed with: julian, zec, jamie, rwatson (back in Sept) MFC after: 5 days	2009-12-13 13:57:32 +00:00
Attilio Rao	2028867def	In current code, threads performing an interruptible sleep (on both sxlock, via the sx_{s, x}lock_sig() interface, or plain lockmgr), will leave the waiters flag on forcing the owner to do a wakeup even if the waiter queue is empty. That operation may lead to a deadlock in the case of doing a fake wakeup on the "preferred" (based on the wakeup algorithm) queue while the other queue has real waiters on it, because nobody is going to wakeup the 2nd queue waiters and they will sleep indefinitely. A similar bug is present for lockmgr in the case the waiters are sleeping with LK_SLEEPFAIL on. In this case, even if the waiters queue is not empty, the waiters won't progress after being awake but they will just fail, still not taking care of the 2nd queue waiters (as instead the lock owned doing the wakeup would expect). In order to fix this bug in a cheap way (without adding too much locking and complicating too much the semantic) add a sleepqueue interface which does report the actual number of waiters on a specified queue of a waitchannel (sleepq_sleepcnt()) and use it in order to determine if the exclusive waiters (or shared waiters) are actually present on the lockmgr (or sx) before to give them precedence in the wakeup algorithm. This fix alone, however doesn't solve the LK_SLEEPFAIL bug. In order to cope with it, add the tracking of how many exclusive LK_SLEEPFAIL waiters a lockmgr has and if all the waiters on the exclusive waiters queue are LK_SLEEPFAIL just wake both queues. The sleepq_sleepcnt() introduction and ABI breakage require __FreeBSD_version bumping. Reported by: avg, kib, pho Reviewed by: kib Tested by: pho	2009-12-12 21:31:07 +00:00
John Baldwin	42a346fa63	For some buses, devices may have active resources assigned even though they are not allocated by the device driver. These resources should still appear allocated from the system's perspective so that their assigned ranges are not reused by other resource requests. The PCI bus driver has used a hack to effect this for a while now where it uses rman_set_device() to assign devices to the PCI bus when they are first encountered and later assigns them to the actual device when a driver allocates a BAR. A few downsides of this approach is that it results in somewhat confusing devinfo -r output as well as not being very easily portable to other bus drivers. This commit adds generic support for "reserved" resources to the resource list API used by many bus drivers to manage the resources of child devices. A resource may be reserved via resource_list_reserve(). This will allocate the resource from the bus' parent without activating it. resource_list_alloc() recognizes an attempt to allocate a reserved resource. When this happens it activates the resource (if requested) and then returns the reserved resource. Similarly, when a reserved resource is released via resource_list_release(), it is deactivated (if it is active) and the resource is then marked reserved again, but is left allocated from the bus' parent. To completely remove a reserved resource, a bus driver may use resource_list_unreserve(). A bus driver may use resource_list_busy() to determine if a reserved resource is allocated by a child device or if it can be unreserved. The PCI bus driver has been changed to use this framework instead of abusing rman_set_device() to keep track of reserved vs allocated resources. Submitted by: imp (an older version many moons ago) MFC after: 1 month	2009-12-09 21:52:53 +00:00
Edward Tomasz Napierala	9d7031a6d6	Don't add VAPPEND if the file is not being opened for writing. Note that this only affects cases where open(2) is being used improperly - i.e. when the user specifies O_APPEND without O_WRONLY or O_RDWR. Reviewed by: rwatson	2009-12-08 20:47:10 +00:00
Konstantin Belousov	4f17d481ed	Remove wrong assertion. Debugee is allowed to lose a signal. Reported and tested by: jh MFC after: 2 weeks	2009-12-03 20:16:59 +00:00
Edward Tomasz Napierala	6bb58cdd0f	Add change that was somehow missed in r192586. It could manifest by incorrectly returning EINVAL from acl_valid(3) for applications linked against pre-8.0 libc.	2009-12-03 13:29:24 +00:00
Ed Schouten	6eaf04022b	Don't allocate an input buffer for a TTY when the receiver is turned off. When the termios CREAD flag is not set, it makes little sense to allocate an input buffer. Just set the size to 0 in this case to reduce memory footprint. Disallow CREAD to be disabled for pseudo-devices to prevent foot-shooting.	2009-12-01 19:14:57 +00:00
Alan Cox	a6d42a0d62	Replace VM_PROT_OVERRIDE_WRITE by VM_PROT_COPY. VM_PROT_OVERRIDE_WRITE has represented a write access that is allowed to override write protection. Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into text pages. Text pages are not just write protected but they are also copy-on-write. VM_PROT_OVERRIDE_WRITE overrides the write protection on the text page and triggers the replication of the page so that the breakpoint will be written to a private copy. However, here is where things become confused. It is the debugger, not the process being debugged that requires write access to the copied page. Nonetheless, the copied page is being mapped into the process with write access enabled. In other words, once the debugger sets a breakpoint within a text page, the program can write to its private copy of that text page. Whereas prior to setting the breakpoint, a SIGSEGV would have occurred upon a write access. VM_PROT_COPY addresses this problem. The combination of VM_PROT_READ and VM_PROT_COPY forces the replication of a copy-on-write page even though the access is only for read. Moreover, the replicated page is only mapped into the process with read access, and not write access. Reviewed by: kib MFC after: 4 weeks	2009-11-26 05:16:07 +00:00
Ivan Voras	cbc4ea28e2	Make ULE process usage (%CPU) accounting usable again by keeping track of the last tick we incremented on. Submitted by: matthew.fleming/at/isilon.com, is/at/rambler-co.ru Reviewed by: jeff (who thinks there should be a better way in the future) Approved by: gnn (mentor) MFC after: 3 weeks	2009-11-24 19:57:41 +00:00
Konstantin Belousov	080136212f	On the return path from F_RDAHEAD and F_READAHEAD fcntls, do not unlock Giant twice. While there, bring conditions in the do/while loops closer to style, that also makes the lines fit into 80 columns. Reported and tested by: dougb	2009-11-20 22:22:53 +00:00
Jaakko Heinonen	10d843a446	Extend ddb(4) "show mount" command to print active string mount options. Note that only option names are printed, not values. Reviewed by: pjd Approved by: trasz (mentor) MFC after: 2 weeks	2009-11-19 14:33:03 +00:00
Oleksandr Tymoshenko	3ea6157e6b	- Unbreak build with KLD_DEBUG defined - Add debug.kld_debug sysctl to control KLD debugging level - Print information about KLD dependencies with debug enabled	2009-11-17 21:56:12 +00:00
Konstantin Belousov	a3de221dbe	Among signal generation syscalls, only sigqueue(2) is allowed by POSIX to fail due to lack of resources to queue siginfo. Add KSI_SIGQ flag that allows sigqueue_add() to fail while trying to allocate memory for new siginfo. When the flag is not set, behaviour is the same as for KSI_TRAP: if memory cannot be allocated, set bit in sq_kill. KSI_TRAP is kept to preserve KBI. Add SI_KERNEL si_code, to be used in siginfo.si_code when signal is generated by kernel. Deliver siginfo when signal is generated by kill(2) family of syscalls (SI_USER with properly filled si_uid and si_pid), or by kernel (SI_KERNEL, mostly job control or SIGIO). Since KSI_SIGQ flag is not set for the ksi, low memory condition cause old behaviour. Keep psignal(9) KBI intact, but modify it to generate SI_KERNEL si_code. Pgsignal(9) and gsignal(9) now take ksi explicitly. Add pksignal(9) that behaves like psignal but takes ksi, and ddb kill command implemented as pksignal(..., ksi = NULL) to not do allocation while in debugger. While there, remove some register specifiers and use ANSI C prototypes. Reviewed by: davidxu MFC after: 1 month	2009-11-17 11:39:15 +00:00
Xin LI	1a9d4dda9b	Revert revision 199201 for now as it has introduced a kernel vulnerability and requires more polishing.	2009-11-12 19:02:10 +00:00
Attilio Rao	d113304956	Add the possibility for vfs.root.mountfrom tunable to accept a list of items rather than a single one. The list is a space separated collection of items defined as the current one accepted. While there fix also a nit in a comment. Obtained from: Sandvine Incorporated Reviewed by: emaste Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com> Sponsored by: Sandvine Incorporated MFC: 2 weeks	2009-11-12 15:59:05 +00:00
Attilio Rao	023c800576	Building the dev nameunit string in devclass_add_device() is based on the assumption that the unit linked with the device is invariant but that can change when calling devclass_alloc_unit() (because -1 is passed or, more simply, because the unit chosen is beyond the table limits). This results in a completely bogus string building. Fix this by reserving the necessary room for all the possible characters printable by a positive integer (we do not allow for negative unit number). Reported by: Sandvine Incorporated Reviewed by: emaste Sponsored by: Sandvine Incorporated MFC: 1 week	2009-11-12 00:52:14 +00:00
Xin LI	41c8c6e876	Add interface description capability as inspired by OpenBSD. MFC after: 3 months	2009-11-11 21:30:58 +00:00
Edward Tomasz Napierala	23e62e7654	Revert r198873. Having different VAPPEND semantics for VOP_ACCESS(9) and VOP_ACCESSX(9) is not a good idea.	2009-11-11 13:49:22 +00:00
Konstantin Belousov	88e6f61a19	When rename("a", "b/.") is performed, target namei() call returns dvp == vp. Rename syscall does not check for the case, and at least ufs_rename() cannot deal with it. POSIX explicitly requires that both rename(2) and rmdir(2) return EINVAL when any of the paths end in "/.". Detect the slashdot lookup for RENAME or REMOVE in lookup(), and return EINVAL. Reported by: Jim Meyering <jim meyering net> Tested by: simon, pho MFC after: 1 week	2009-11-10 11:50:37 +00:00
Konstantin Belousov	75c586a4c8	In r198506, kern_sigsuspend() started doing cursig/postsig loop to make sure that a signal was delivered to the thread before returning from syscall. Signal delivery puts new return frame on the user stack, and modifies trap frame to enter signal handler. As a consequence, syscall return code sets EINTR as error return for signal frame, instead of the syscall return. Also, for ia64, due to different registers layout for those two kinds of frames, usermode sigsegfaulted when returned from signal handler. Use newly-introduced cpu_set_syscall_retval(9) to set syscall result, and return EJUSTRETURN from kern_sigsuspend() to prevent syscall return code from modifying this frame [1]. Another issue is that pending SIGCONT might be cancelled by SIGSTOP, causing postsig() not to deliver any caught signal [2]. Modify postsig() to return 1 if signal was posted, and 0 otherwise, and use this in the kern_sigsuspend loop. Proposed by: marcel [1] Noted by: davidxu [2] Reviewed by: marcel, davidxu MFC after: 1 month	2009-11-10 11:46:53 +00:00
Edward Tomasz Napierala	7ba889b8f4	Add suggestion for zfs root.	2009-11-08 09:54:25 +00:00
Attilio Rao	337c5ff4fc	Save the stack when doing a lockmgr_disown() call. Requested by: kib MFC: 3 days	2009-11-06 22:33:03 +00:00
Edward Tomasz Napierala	7a172809b5	Fix build. Submitted by: Andrius Morkūnas <hinokind at gmail.com>	2009-11-04 08:25:58 +00:00
Edward Tomasz Napierala	3b1b09809f	Revert r198874, pending further discussion.	2009-11-04 07:14:16 +00:00
Edward Tomasz Napierala	da9ce28ecb	Style fixes.	2009-11-04 07:04:15 +00:00
Edward Tomasz Napierala	8fafa5cecf	Make sure we don't end up with VAPPEND without VWRITE, if someone calls open(2) like this: open(..., O_APPEND).	2009-11-04 06:48:34 +00:00
Edward Tomasz Napierala	597954c813	While VAPPEND without VWRITE makes sense for VOP_ACCESSX(9) (e.g. to check for the permission to create subdirectory (ACE4_ADD_SUBDIRECTORY)), it doesn't really make sense for VOP_ACCESS(9). Also, many VOP_ACCESS(9) implementations don't expect that. Make sure we don't confuse them.	2009-11-04 06:47:14 +00:00
Ed Schouten	ca1d2f657a	Make /dev/klog and kern.msgbuf* MPSAFE. Normally msgbufp is locked using Giant. Switch it to use the msgbuf_lock. Instead of changing the tsleep() calls to msleep(), just convert it to condvar(9). In my opinion the locking around msgbuf_peekbytes() still remains questionable. It looks like locks are dropped while performing copies of multiple blocks to userspace, which may cause the msgbuf to be reset in the mean time. At least getting it underneath from Giant should make it a little easier for us to figure out how to solve that. Reminded by: rdivacky	2009-11-03 21:06:19 +00:00
Attilio Rao	1b9d701fee	Split P_NOLOAD into a per-thread flag (TDF_NOLOAD). This improvement aims at avoiding further cache-misses in scheduler specific functions which need to keep track of average thread running time and further locking in places setting for this flag. Reported by: jeff (originally), kris (currently) Reviewed by: jhb Tested by: Giuseppe Cocomazzi <sbudella at email dot it>	2009-11-03 16:46:52 +00:00
Konstantin Belousov	1c89fc757a	If socket buffer space appears to be lower than sum of count of already prepared bytes and next portion of transfer, inner loop of kern_sendfile() aborts, not preparing next mbuf for socket buffer, and not modifying any outer loop invariants. The thread loops in the outer loop forever. Instead of breaking from inner loop, prepare only bytes that fit into the socket buffer space. In collaboration with: pho Reviewed by: bz PR: kern/138999 MFC after: 2 weeks	2009-11-03 12:52:35 +00:00
Konstantin Belousov	80a8b0f3bf	Trapsignal() and postsig() call kern_sigprocmask() with both process lock and curproc->p_sigacts->ps_mtx. Reschedule_signals may need to have ps_mtx locked to decide and wakeup a thread, causing recursion on the mutex. Inform kern_sigprocmask() and reschedule_signals() about lock state of the ps_mtx by new flag SIGPROCMASK_PS_LOCKED to avoid recursion. Reported and tested by: keramida MFC after: 1 month	2009-10-30 10:10:39 +00:00
Konstantin Belousov	8084540253	Trapsignal() calls kern_sigprocmask() when delivering a caught signal with the proc lock held. Reported and tested by: Mykola Dzham freebsd at levsha org ua MFC after: 1 month	2009-10-29 14:34:24 +00:00
Konstantin Belousov	7415a41f4a	Fix style issue.	2009-10-29 10:03:08 +00:00
Konstantin Belousov	17c974499c	Regenerate	2009-10-27 11:01:15 +00:00
Konstantin Belousov	066d836b02	Current pselect(3) is implemented in usermode and thus vulnerable to a well-known race condition, whose elimination was the reason the function was introduced in the first place. If the sigmask supplied as an argument to pselect() enables a signal, the signal might be delivered before the thread calls select(2), causing a lost wakeup. Reimplement pselect() in the kernel, making the change of sigmask and the sleep atomic. Since the signal shall be delivered to usermode, but the sigmask restored, set TDP_OLDMASK and save the old mask in td_oldsigmask. The TDP_OLDMASK should be cleared by ast() in case the signal was not delivered during syscall execution. Reviewed by: davidxu Tested by: pho MFC after: 1 month	2009-10-27 10:55:34 +00:00
Konstantin Belousov	d6e029adbe	In r197963, a race with a thread being selected for signal delivery while in kernel mode, and later changing its signal mask to block the signal, was fixed for sigprocmask(2) and pthread_exit(3). The same race exists for the sigreturn(2), setcontext(2) and swapcontext(2) syscalls. Use kern_sigprocmask() instead of direct manipulation of td_sigmask to reschedule newly blocked signals, closing the race. Reviewed by: davidxu Tested by: pho MFC after: 1 month	2009-10-27 10:47:58 +00:00
Konstantin Belousov	84440afb54	In kern_sigsuspend(), better manipulate the thread signal mask using kern_sigprocmask() to properly notify other possible candidate threads for signal delivery. Since sigsuspend() shall only return to usermode after a signal was delivered, do the cursig/postsig loop immediately after waiting for a signal, repeating the wait if the wakeup was spurious due to a race with another thread fetching the signal from the process queue before us. Add a thread_suspend_check() call to allow the thread to be stopped or killed while in the loop. Modify the last argument of kern_sigprocmask() from boolean to flags, allowing the function to be called with a locked proc. Conversion of the callers that supplied 1 to the old argument will be done in the next commit, and due to the SIGPROCMASK_OLD value being equal to 1, the code is formally correct in between. Reviewed by: davidxu Tested by: pho MFC after: 1 month	2009-10-27 10:42:24 +00:00
John Baldwin	86855bf549	Another nit that both I and ispell missed. Submitted by: Ben Kaduk minimarmot of gmail	2009-10-26 18:32:06 +00:00
John Baldwin	9390262576	Fix some spelling nits.	2009-10-26 17:42:03 +00:00
Joseph Koshy	16d95d4f92	Inform hwpmc(4) of a thread's impending demise prior to invoking sched_throw(). Debugging help: fabient Review and testing by: fabient	2009-10-25 04:34:47 +00:00
Alan Cox	a0c703bf21	Update a comment to reflect the previous change.	2009-10-25 02:48:29 +00:00
Ruslan Ermilov	4d9d1e823c	- Rename tunable kern.ipc.shmmaxpgs to kern.ipc.shmall. - Explain the fuss when initializing shmmax. PR: 75542 (mistakenly closed instead of PR 75541)	2009-10-24 19:00:58 +00:00
John Baldwin	5ca4819ddf	- Fix several off-by-one errors when using MAXCOMLEN. The p_comm[] and td_name[] arrays are actually MAXCOMLEN + 1 in size and a few places that created shadow copies of these arrays were just using MAXCOMLEN. - Prefer using sizeof() of an array type to explicit constants for the array length in a few places. - Ensure that all of p_comm[] and td_name[] is always zero'd during execve() to guard against any possible information leaks. Previously trailing garbage in p_comm[] could be leaked to userland in ktrace record headers via td_name[]. Reviewed by: bde	2009-10-23 15:14:54 +00:00
John Baldwin	4f9d48e478	Don't bother copying the name of a kproc or kthread out into a temporary array just to pass that array to printf(). kproc and kthread names are NUL-terminated and can be printed using printf() directly. Reviewed by: bde	2009-10-23 15:09:51 +00:00
John Baldwin	3eec6f034a	Set the devclass_t pointer specified in the DRIVER_MODULE() macro sooner so it is always valid when a driver's identify routine is called. Previously, new-bus would attempt to create the devclass for a newly loaded driver in two separate places, once in devclass_add_driver(), and again after devclass_add_driver() returned in driver_module_handler(). Only the second lookup attempted to set a device class' parent and set the devclass_t pointer specified in the DRIVER_MODULE() macro. However, by the time it was executed, the driver was already added to existing instances of the parent driver at which point in time the new driver's identify routine would have been invoked. The fix is to merge the two attempts and only create the devclass once in devclass_add_driver() including setting the devclass_t pointer passed to DRIVER_MODULE() before the driver is added to any existing bus devices. Reported by: avg Reviewed by: imp MFC after: 2 weeks	2009-10-22 14:53:44 +00:00
Marcel Moolenaar	1a4fcaebe3	o Introduce vm_sync_icache() for making the I-cache coherent with the memory or D-cache, depending on the semantics of the platform. vm_sync_icache() is basically a wrapper around pmap_sync_icache() that translates the vm_map_t argument to pmap_t. o Introduce pmap_sync_icache() to all PMAP implementations. For powerpc it replaces the pmap_page_executable() function, added to solve the I-cache problem in uiomove_fromphys(). o In proc_rwmem() call vm_sync_icache() when writing to a page that has execute permissions. This assures that when breakpoints are written, the I-cache will be coherent and the process will actually hit the breakpoint. o This also fixes the Book-E PMAP implementation that was missing necessary locking while trying to deal with the I-cache coherency in pmap_enter() (read: mmu_booke_enter_locked). The key property of this change is that the I-cache is made coherent after writes have been done. Doing it in the PMAP layer when adding or changing a mapping means that the I-cache is made coherent before any writes happen. The difference is key when the I-cache prefetches.	2009-10-21 18:38:02 +00:00
Ruslan Ermilov	e64585bdc2	Random number generator initialization cleanup: - Introduce new SI_SUB_RANDOM point in boot sequence to make it clear from where one may start using random(9). It should be as early as possible, so place it just after SI_SUB_CPU where we have some randomness on most platforms via get_cyclecount(). - Move stack protector initialization to be after SI_SUB_RANDOM as before this point we have no randomness at all. This fixes stack protector to actually protect stack with some random guard value instead of a well-known one. Note that this patch doesn't try to address arc4random(9) issues. With current code, it will be implicitly seeded by stack protector and hence will get the same entropy as random(9). It will be securely reseeded once /dev/random is fed some entropy from userland. Submitted by: Maxim Dounin <mdounin@mdounin.ru> MFC after: 3 days	2009-10-20 16:36:51 +00:00
Ed Schouten	6015f6f35a	Properly set the low watermarks when reducing the baud rate. Now that buffers are deallocated lazily, we should not use tty*q_getsize() to obtain the buffer size to calculate the low watermarks. Doing this may cause the watermark to be placed outside the typical buffer size. This caused some regressions after my previous commit to the TTY code, which allows pseudo-devices to resize the buffers as well. Reported by: yongari, dougb MFC after: 1 week	2009-10-19 07:17:37 +00:00
Ed Schouten	5ed8d12443	Allow the buffer size to be configured for pseudo-like TTY devices. Devices that don't implement param() (which means they don't support hardware parameters such as flow control, baud rate) hardcode the baud rate to TTYDEF_SPEED. This means the buffer size cannot be configured, which is a little inconvenient when using canonical mode with big lines of input, etc. Make it adjustable, but do clamp it between B50 and B115200 to prevent awkward buffer sizes. Remove the baud rate assignment from /etc/gettytab. Trust the kernel to fill in a proper value. Reported by: Mikolaj Golub <to my trociny gmail com> MFC after: 1 month	2009-10-18 19:48:53 +00:00
Ed Schouten	99087885be	Make lock devices work properly. It turned out I did add the code to use the init state devices to set the termios structure when opening the device, but it seems I totally forgot to add the bits required to force the actual locking of flags through the lock state devices. Reported by: ru MFC after: 1 week (to be discussed)	2009-10-18 19:45:44 +00:00
Konstantin Belousov	7564c4ad9a	If an ET_DYN binary has a non-zero base address for some reason, honour it and do not relocate the binary to ET_DYN_LOAD_ADDR. This allows the binary author to influence the address map of the process. In particular, when the binary is actually an interpreter, this allows it to have an almost usual process address map. Communicate the relocation bias of the mapping for an interpreter-less ET_DYN binary, that is, the interpreter itself, in the AT_BASE aux entry. This way, rtld is able to find its dynamic structure and relocate itself. Note that mapbase in the rtld is still wrong and requires further fixing. Reported and tested by: rwatson Discussed with: kan MFC after: 3 days	2009-10-18 12:57:48 +00:00
Ed Schouten	39410373b3	Print backspaces after echoing an EOF. Applications like shells expect EOF to give no graphical output, while our implementation prints ^D by default (tunable with stty echoctl). Make the new implementation behave like the old TTY code. Print two backspaces afterwards. Reported by: koitsu MFC after: 1 month	2009-10-17 08:59:41 +00:00
John Baldwin	d0c9a29169	Use language more closely resembling English in a panic message. Pointy hat to: jhb Submitted by: pluknet	2009-10-15 18:51:19 +00:00


11554 Commits