freebsd-nq

Author	SHA1	Message	Date
Kirk McKusick	71469bb38f	Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL. The primary changes are that the user of the interface no longer needs to manage the mount-mutex locking and that the vnode that is returned has its mutex locked (thus avoiding the need to check to see if it is DOOMED or other possible end of life scenarios). To minimize compatibility issues for third-party developers, the old MNT_VNODE_FOREACH interface will remain available so that this change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH will be removed in head. The reason for this update is to prepare for the addition of the MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2012-04-17 16:28:22 +00:00
Edward Tomasz Napierala	9e21ef395a	Fix bug where NFSv4 ACL enforcement code wouldn't unconditionally allow the owner to read and write ACL and file attributes when there was no entry with subject matching the owner. In other words, 'getfacl meh' shouldn't fail for the owner if the ACL looks like this: # file: meh # owner: trasz # group: wheel user:root:------a-------:------:allow Reported by: kientzle	2012-04-17 14:54:00 +00:00
Edward Tomasz Napierala	0b18eb6d74	Stop treating system processes as special. This fixes panics like the one triggered by this: # kldload geom_vinum # pwait `pgrep -S gv_worker` & # kldunload geom_vinum or this: GEOM_JOURNAL: Shutting down geom gjournal 3464572051. panic: destroying non-empty racct: 1 allocated for resource 6 which were tracked by jh@ to be caused by checking p->p_flag, while it wasn't initialised yet. Basically, during fork, the code checked p_flag, concluded the process isn't marked as P_SYSTEM, incremented the counter, and later on, when exiting, checked that the process was marked as P_SYSTEM, and thus didn't decrement it. Also, I believe there wasn't any good reason for checking P_SYSTEM in the first place. Tested by: jh	2012-04-17 14:31:02 +00:00
Edward Tomasz Napierala	47f6635cc1	Fix panic, triggered like this: "int main() { thr_exit(); }" Submitted by: Mateusz Guzik	2012-04-17 13:44:40 +00:00
Edward Tomasz Napierala	786813aa1f	Enforce upper bound on the input buffer length. Reported by: Mateusz Guzik	2012-04-17 13:28:14 +00:00
Jung-uk Kim	d69a426fce	- Implement pipe2 syscall for Linuxulator. This syscall appeared in 2.6.27 but GNU libc used it without checking its kernel version, e. g., Fedora 10. - Move pipe(2) implementation for Linuxulator from MD files to MI file, sys/compat/linux/linux_file.c. There is no MD code for this syscall at all. - Correct an argument type for pipe() from l_ulong * to l_int *. Probably this was the source of MI/MD confusion. Reviewed by: emulation	2012-04-16 21:22:02 +00:00
Davide Italiano	99006d44f8	Fix a typo. Approved by: gnn (mentor) MFC after: 2 days	2012-04-14 23:59:58 +00:00
Davide Italiano	331805a5d3	Fix some style bugs introduced in a previous commit (r233045) Reported by: glebius, jmallet Reviewed by: jmallet Approved by: gnn (mentor) MFC after: 2 days	2012-04-14 23:53:31 +00:00
Marius Strobl	91849f349c	Fix !DDB build after r234190.	2012-04-14 11:21:24 +00:00
Adrian Chadd	676c1784cb	Use strdup() on the name (and free it when it's done) so non-static names can be used in firmware_register().	2012-04-13 04:22:42 +00:00
John Baldwin	0cc457b000	- Extend the KDB interface to add a per-debugger callback to print a backtrace for an arbitrary thread (rather than the calling thread). A kdb_backtrace_thread() wrapper function uses the configured debugger if possible, otherwise it falls back to using stack(9) if that is available. - Replace a direct call to db_trace_thread() in propagate_priority() with a call to kdb_backtrace_thread() instead. MFC after: 1 week	2012-04-12 17:43:59 +00:00
John Baldwin	7582954e34	If a linker file contains at least one module, but all of the modules fail to load (the MOD_LOAD event fails) during a kldload(2), unload the linker file and fail the kldload(2) with ENOEXEC. Reported by: gcooper MFC after: 1 week	2012-04-12 14:49:25 +00:00
Konstantin Belousov	2dd9ea6f70	Add thread-private flag to indicate that error value is already placed in td_errno. Flag is supposed to be used by syscalls returning EJUSTRETURN because errno was already placed into the usermode frame by a call to set_syscall_retval(9). Both ktrace and dtrace get errno value from td_errno if the flag is set. Use the flag to fix sigsuspend(2) error return ktrace records. Requested by: bde MFC after: 1 week	2012-04-12 10:48:43 +00:00
Kirk McKusick	ecb6e528c5	Export vinactive() from kern/vfs_subr.c (e.g., make it no longer static and declare its prototype in sys/vnode.h) so that it can be called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c) instead of the body of vinactive() being cut and pasted into process_deferred_inactive(). Reviewed by: kib MFC after: 2 weeks	2012-04-11 23:01:11 +00:00
John Baldwin	77b479e644	Allow device_busy() and device_unbusy() to be invoked while a device is being attached. This is implemented by adding a new DS_ATTACHING state while a device's DEVICE_ATTACH() method is being invoked. A driver is required to not fail an attach of a busy device. The device's state will be promoted to DS_BUSY rather than DS_ACTIVE() if the device was marked busy during DEVICE_ATTACH(). Reviewed by: kib MFC after: 1 week	2012-04-11 20:57:41 +00:00
Eitan Adler	847d0034e3	Return EBADF instead of EMFILE from dup2 when the second argument is outside the range of valid file descriptors PR: kern/164970 Submitted by: Peter Jeremy <peterjeremy@acm.org> Reviewed by: jilles Approved by: cperciva MFC after: 1 week	2012-04-11 14:08:09 +00:00
Jilles Tjoelker	8a8be77610	Remove unused and wrong SA_PROC internal signal property. The SA_PROC signal property indicated whether each signal number is directed at a specific thread or at the process in general. However, that depends on how the signal was generated and not on the signal number. SA_PROC was not used.	2012-04-09 21:58:58 +00:00
Alexander Motin	70801abe8f	Microoptimize cpu_search(). According to profiling, it now takes 6% of CPU time on hackbench with its millions of context switches per second, instead of 8% before.	2012-04-09 18:24:58 +00:00
Gleb Kurtsou	0ff93c48da	Add vfs_getopt_size. Support human readable file system options in tmpfs. Increase maximum tmpfs file system size to 4GB*PAGE_SIZE on 32 bit archs. Discussed with: delphij MFC after: 2 weeks	2012-04-07 15:27:34 +00:00
Alexander V. Chernikov	e4b3229aa5	- Improve BPF locking model. Interface locks and descriptor locks are converted from mutex(9) to rwlock(9). This greatly improves performance: in the most common case we need to acquire 1 reader lock instead of 2 mutexes. - Remove filter(descriptor) (reader) lock in bpf_mtap[2] This was suggested by glebius@. We protect filter by requesting interface writer lock on filter change. - Cover struct bpf_if under BPF_INTERNAL define. This permits including bpf.h without including rwlock stuff. However, this is a temporary solution, struct bpf_if should be made opaque for any external caller. Found by: Dmitrij Tejblum <tejblum@yandex-team.ru> Sponsored by: Yandex LLC Reviewed by: glebius (previous version) Reviewed by: silence on -net@ Approved by: (mentor) MFC after: 3 weeks	2012-04-06 06:53:58 +00:00
John Baldwin	35818d2e94	Add new ktrace records for the start and end of VM faults. This gives a pair of records similar to syscall entry and return that a user can use to determine how long page faults take. The new ktrace records are enabled via the 'p' trace type, and are enabled in the default set of trace points. Reviewed by: kib MFC after: 2 weeks	2012-04-05 17:13:14 +00:00
David Xu	8931e524bf	In sem_post, the field _has_waiters is no longer used, because some applications destroy the semaphore after sem_wait returns. Just enter kernel to wake up sleeping threads, only update _has_waiters if it is safe. While here, check if the value exceeds SEM_VALUE_MAX and return EOVERFLOW if this is true.	2012-04-05 03:05:02 +00:00
David Xu	17ce606321	umtx operation UMTX_OP_MUTEX_WAKE has a side-effect that it accesses a mutex after a thread has unlocked it, it even writes data to the mutex memory to clear contention bit, there is a race that other threads can lock it and unlock it, then destroy it, so it should not write data to the mutex memory if there isn't any waiter. The new operation UMTX_OP_MUTEX_WAKE2 tries to fix the problem. It requires thread library to clear the lock word entirely, then call the WAKE2 operation to check if there is any waiter in kernel, and try to wake up a thread, if necessary, the contention bit is set again by the operation. This also mitigates the chance that other threads find the contention bit and try to enter kernel to compete with each other to wake up sleeping thread, this is unnecessary. With this change, the mutex owner is no longer holding the mutex until it reaches a point where kernel umtx queue is locked, it releases the mutex as soon as possible. Performance is improved when the mutex is contended heavily. On Intel i3-2310M, the runtime of a benchmark program is reduced from 26.87 seconds to 2.39 seconds, it even is better than UMTX_OP_MUTEX_WAKE which is deprecated now. http://people.freebsd.org/~davidxu/bench/mutex_perf.c	2012-04-05 02:24:08 +00:00
Navdeep Parhar	60a305887a	- Remove redundant call to pr_ctloutput from code that handles SO_SETFIB. - Add a check for errors during copyin while here. Reviewed by: julian, bz MFC after: 2 weeks	2012-04-03 18:38:00 +00:00
Konstantin Belousov	5085ecb75a	When process exits, not only the children shall be reparented to init, but also the orphans shall be removed from the orphan list, because the list header is destroyed. Reported and tested by: pho MFC after: 3 days	2012-04-02 19:35:36 +00:00
Konstantin Belousov	2e39e24f64	Add helper function to remove the process from the orphans list and use it instead of inlined code. Tested by: pho MFC after: 3 days	2012-04-02 19:34:56 +00:00
John Baldwin	e506e182dd	Export some more useful info about shared memory objects to userland via procstat(1) and fstat(1): - Change shm file descriptors to track the pathname they are associated with and add a shm_path() method to copy the path out to a caller-supplied buffer. - Use the fo_stat() method of shared memory objects and shm_path() to export the path, mode, and size of a shared memory object via struct kinfo_file. - Add a struct shmstat to the libprocstat(3) interface along with a procstat_get_shm_info() to export the mode and size of a shared memory object. - Change procstat to always print out the path for a given object if it is valid. - Teach fstat about shared memory objects and to display their path, mode, and size. MFC after: 2 weeks	2012-04-01 18:22:48 +00:00
David Xu	8b1eafa723	Remove stale comments.	2012-03-31 06:48:41 +00:00
David Xu	b29d7d9b60	Remove trailing semicolon, it is a typo.	2012-03-30 12:57:14 +00:00
David Xu	0cf573e989	Fix COMPAT_FREEBSD32 build. Submitted by: Andreas Tobler < andreast at fgznet dot ch >	2012-03-30 09:03:53 +00:00
David Xu	4ed8858df0	Remove trailing space.	2012-03-30 05:49:32 +00:00
David Xu	e05171d939	Merge umtxq_sleep and umtxq_nanosleep into a single function by using an abs_timeout structure which describes timeout info.	2012-03-30 05:40:26 +00:00
David Xu	d31f470d15	Reduce code size by creating common timed sleeping function.	2012-03-29 02:46:43 +00:00
Fabien Thomas	f5f9340b98	Add software PMC support. New kernel events can be added at various location for sampling or counting. This will for example allow easy system profiling whatever the processor is with known tools like pmcstat(8). Simultaneous usage of software PMC and hardware PMC is possible, for example looking at the lock acquire failure, page fault while sampling on instructions. Sponsored by: NETASQ MFC after: 1 month	2012-03-28 20:58:30 +00:00
Ryan Stone	9742410797	Instead of only iterating over the set of known SDT probes when sdt.ko is loaded and unloaded, also have sdt.ko register callbacks with kern_sdt.c that will be called when a newly loaded KLD module adds more probes or a module with probes is unloaded. This fixes two issues: first, if a module with SDT probes was loaded after sdt.ko was loaded, those new probes would not be available in DTrace. Second, if a module with SDT probes was unloaded while sdt.ko was loaded, the kernel would panic the next time DTrace had cause to try and do anything with the no-longer-existent probes. This makes it possible to create SDT probes in KLD modules, although there are still two caveats: first, any SDT probes in a KLD module must be part of a DTrace provider that is defined in that module. At present DTrace only destroys probes when the provider is destroyed, so you can still panic the system if a KLD module creates new probes in a provider from a different module (including the kernel) and then you unload the first module. Second, the system will panic if you unload a module containing SDT probes while there is an active D script that has enabled those probes. MFC after: 1 month	2012-03-27 15:07:43 +00:00
Alexander V. Chernikov	b25711e6b0	- Add knlist_init_rw_reader() function to kqueue(9). Function acquired reader lock if needed. Assert check for reader or writer lock (RA_LOCKED / RA_UNLOCKED) - While here, add knlist_init_mtx.9 to MLINKS and fix some style(9) issues Reviewed by: glebius Approved by: ae(mentor) MFC after: 2 weeks	2012-03-26 09:34:17 +00:00
Mikolaj Golub	903712c99c	Add a sysctl to set and retrieve binary osreldate of another process. Suggested by: kib Reviewed by: kib MFC after: 2 weeks	2012-03-23 20:05:41 +00:00
Andrey V. Elsukov	5b0da85a41	Correct debug message.	2012-03-22 09:29:07 +00:00
Alan Cox	5730afc9b6	Handle spurious page faults that may occur in no-fault sections of the kernel. When access restrictions are added to a page table entry, we flush the corresponding virtual address mapping from the TLB. In contrast, when access restrictions are removed from a page table entry, we do not flush the virtual address mapping from the TLB. This is exactly as recommended in AMD's documentation. In effect, when access restrictions are removed from a page table entry, AMD's MMUs will transparently refresh a stale TLB entry. In short, this saves us from having to perform potentially costly TLB flushes. In contrast, Intel's MMUs are allowed to generate a spurious page fault based upon the stale TLB entry. Usually, such spurious page faults are handled by vm_fault() without incident. However, when we are executing no-fault sections of the kernel, we are not allowed to execute vm_fault(). This change introduces special-case handling for spurious page faults that occur in no-fault sections of the kernel. In collaboration with: kib Tested by: gibbs (an earlier version) I would also like to acknowledge Hiroki Sato's assistance in diagnosing this problem. MFC after: 1 week	2012-03-22 04:52:51 +00:00
Andrey V. Elsukov	c5e7f0649a	Acquire modules lock before call module_getname() in the KLD_DEBUG case. MFC after: 1 week	2012-03-21 09:48:32 +00:00
Eitan Adler	24c10828e4	- Clean up timestamps in msgbuf code. The timestamps should now be inserted after the priority token thus cleaning up the output. - Remove the needless double internal do_add_char function. - Resolve a possible deadlock if interrupts are disabled and getnanotime is called Reviewed by: bde kmacy, avg, sbruno (various versions) Approved by: cperciva MFC after: 2 weeks	2012-03-19 00:36:32 +00:00
Jaakko Heinonen	59f513cd09	Cast wallclock.tv_sec to uint64_t to avoid overflow in the calculation. PR: kern/161552 Reviewed by: trasz Tested by: Nikos Vassiliadis MFC after: 1 week	2012-03-18 19:13:32 +00:00
Davide Italiano	c6111de55d	Add rudimentary profiling of the hash table used in the umtx code to hold active lock queues. Reviewed by: attilio Approved by: davidxu, gnn (mentor) MFC after: 3 weeks	2012-03-16 20:32:11 +00:00
Michael Tuexen	99f293a20e	Fix bugs which can result in a panic when a non-SCTP socket is used with an sctp_ system-call which expects an SCTP socket. MFC after: 3 days.	2012-03-15 14:13:38 +00:00
Andrey V. Elsukov	b26a09848a	Add CTLFLAG_TUN to the sysctl definition and fix style. Pointed by: Garrett Cooper MFC after: 2 weeks	2012-03-15 06:01:21 +00:00
Andrey V. Elsukov	199aa9756b	Add debug.kld_debug loader tunable. MFC after: 2 weeks	2012-03-15 05:11:29 +00:00
Jaakko Heinonen	db62ced238	Add an assert for proctree_lock to proc_to_reap(). Discussed with: kib MFC after: 1 week	2012-03-14 15:52:23 +00:00
Konstantin Belousov	7335ed90a0	Lock the process around manipulations with p_flag. Reported and reviewed by: jh MFC after: 3 days	2012-03-13 22:00:46 +00:00
Adrian Chadd	a9a282f672	Add module load/unload stubs.	2012-03-13 20:27:48 +00:00
Alexander Motin	fd053fae73	Add kern.eventtimer.activetick tunable/sysctl, specifying whether each hardclock() tick should be run on every active CPU, or on only one. On my tests, avoiding extra interrupts because of this on 8-CPU Core i7 system with HZ=10000 saves about 2% of performance. At this moment option implemented only for global timers, as reprogramming per-CPU timers is too expensive now to be compensated by this benefit, especially since we still have to regularly run hardclock() on at least one active CPU to update system uptime. For global timer it is quite trivial: timer runs always, but we just skip IPIs to other CPUs when possible. Option is enabled by default now, keeping previous behavior, as periodic hardclock() calls are still used at least to implement setitimer(2) with ITIMER_VIRTUAL and ITIMER_PROF arguments. But since default schedulers don't depend on it since r232917, we are much more free to experiment with it. MFC after: 1 month	2012-03-13 10:21:08 +00:00
Alexander Motin	7295465e33	Rewrite thread CPU usage percentage math to not depend on periodic calls with HZ rate through the sched_tick() calls from hardclock(). Potentially it can be used to improve precision, but now it is just minus one more reason to call hardclock() for every HZ tick on every active CPU. SCHED_4BSD never used sched_tick(), but keep it in place for now, as at least SCHED_FBFS existing in patches out of the tree depends on it. MFC after: 1 month	2012-03-13 08:18:54 +00:00
Peter Holm	62a9fc76df	Always call fdrop().	2012-03-12 11:56:57 +00:00
Konstantin Belousov	1a9c7dec1f	ELF image can have several PT_NOTE program headers. Look for the ELF brand note in each header, instead of using only first one. Reviewed by: kan Tested by: andrew (arm), flo (sparc64) MFC after: 3 weeks	2012-03-11 19:38:49 +00:00
Konstantin Belousov	b80dcb55aa	Remove fifo.h. The only used function declaration from the header is migrated to sys/vnode.h. Submitted by: gianni	2012-03-11 12:19:58 +00:00
Alexander Motin	5f3818a56e	Revert r175376 and tune cpufreq(4) frequency comparison logic instead. Instead of using 25MHz equality threshold, look for the nearest value when handling dev.cpu.0.freq sysctl and for exact match when it is expected. ACPI may report extra level with frequency 1MHz above the nominal to control Intel Turbo Boost operation. It is not a bug, but feature: dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ... In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz. I've found that my Core i7-870 based system has Intel Turbo Boost disabled by default and without this change it was absolutely invisible and hard to control. MFC after: 2 weeks	2012-03-10 18:56:16 +00:00
Alexander Motin	bcfd016cff	Idle ticks optimization: - Pass number of events to the statclock() and profclock() functions same as to hardclock() before to not call them many times in a loop. - Rename them into statclock_cnt() and profclock_cnt(). - Turn statclock() and profclock() into compatibility wrappers, still needed for arm. - Rename hardclock_anycpu() into hardclock_cnt() for unification. MFC after: 1 week	2012-03-10 14:57:21 +00:00
Edward Tomasz Napierala	0a53cd5742	Remove useless thread_{lock,unlock}() in racctd.	2012-03-10 14:38:49 +00:00
Juli Mallett	85729c2c44	Export intrcnt correctly when running under 32-bit compatibility. Reviewed by: gonzo, nwhitehorn	2012-03-09 22:30:54 +00:00
Peter Holm	39e77c4c50	Perform the parameter validation before assigning it to a signed int variable. This fixes the problem seen with readdir(3) fuzzing. Submitted by: bde MFC after: 1 week	2012-03-09 21:31:12 +00:00
Alexander Motin	b3f40a4107	Make kern.sched.idlespinthresh default value adaptive depending on HZ. Otherwise with HZ above 8000 CPU may never skip timer ticks on idle.	2012-03-09 19:09:08 +00:00
Alexander Motin	55c71d634f	Be more polite when setting state->nextevent inside cpu_new_callout(). Hardclock is not the only one that wakes an idle CPU since kdtrace cyclic addition. MFC after: 2 weeks	2012-03-09 07:30:48 +00:00
Konstantin Belousov	38ddb5725b	Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which allows a filesystem to request VFS to not allow MNTK_ASYNC. MFC after: 1 week	2012-03-09 00:12:05 +00:00
Peter Holm	ffae9d4d7c	Free up allocated memory used by posix_fadvise(2).	2012-03-08 20:34:13 +00:00
John Baldwin	b47f624183	Add KTR_VFS traces to track modifications to a vnode's writecount.	2012-03-08 20:27:20 +00:00
John Baldwin	44ad547522	Add a new sched_clear_name() method to the scheduler interface to clear the cached name used for KTR_SCHED traces when a thread's name changes. This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's name changes, most commonly via execve(). MFC after: 2 weeks	2012-03-08 19:41:05 +00:00
Konstantin Belousov	f950879e16	The pipe_poll() performs lockless access to the vnode to test fifo_iseof() condition, allowing the v_fifoinfo to be reset and freed by fifo_cleanup(). Precalculate EOF at the places where fo_wgen is changed, and cache the state in a new pipe state flag PIPE_SAMEWGEN. Reported and tested by: bf Submitted by: gianni MFC after: 1 week (a backport)	2012-03-07 07:31:50 +00:00
Edward Tomasz Napierala	c34bbd2ada	Make racct and rctl correctly handle jail renaming. Previously they would continue using old name, the one jail was created with. PR: bin/165207	2012-03-06 11:05:50 +00:00
Ivan Voras	2573ea5f76	Print out process name and thread id in the debugging message. This is useful because the message can end up in system logs in non-debugging operation. Reviewed by: attilio (earlier version)	2012-03-05 14:19:43 +00:00
Konstantin Belousov	e7f19c3d81	pipe_read(): change the type of size to int, and remove signed clamp. pipe_write(): change the type of desiredsize back to int, its value fits. Requested by: bde MFC after: 3 weeks	2012-03-04 15:09:01 +00:00
Konstantin Belousov	8bb9a904d5	Instead of incomplete handling of read(2)/write(2) return values that do not fit into registers, declare that we do not support this case using CTASSERT(), and remove endianness-unsafe code to split return value into td_retval. While there, change the style of the sysctl debug.iosize_max_clamp definition. Requested by: bde MFC after: 3 weeks	2012-03-04 14:55:37 +00:00
Mikolaj Golub	e0fcf639d2	Make kern.proc.umask sysctl readonly. Requested by: src MFC after: 1 week	2012-03-03 11:53:35 +00:00
Alexander Motin	6022f0bcb3	Fix bug of r232207, when cpu_search() could prefer CPU group with best load, but with no CPU matching given limitations. It caused kernel panics in some cases when thread was bound to specific CPUs with cpuset(1).	2012-03-03 11:50:48 +00:00
Juli Mallett	9624d94701	o) Add COMPAT_FREEBSD32 support for MIPS kernels using the n64 ABI with userlands using the o32 ABI. This mostly follows nwhitehorn's lead in implementing COMPAT_FREEBSD32 on powerpc64. o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the 32-bit ABI being used. Since the MIPS port is relatively-new, even the 32-bit ABIs use a 64-bit time_t. o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS with 32-bit compatibility, then, disable some code which assumes otherwise wrongly when built for MIPS. A more general macro to check in this case would seem like a good idea eventually. If someone adds support for using n32 userland with n64 kernels on MIPS, then they will have to add a variety of flags related to each piece of the ABI that can vary. That's probably the right time to generalize further. o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the freebsd32 compat code. Probably this should be generalized at some point. Reviewed by: gonzo	2012-03-03 08:19:18 +00:00
Rick Macklem	5e99212d36	Post r230394, the Lookup RPC counts for both NFS clients increased significantly. Upon investigation this was caused by name cache misses for lookups of "..". For name cache entries for non-".." directories, the cache entry serves double duty. It maps both the named directory plus ".." for the parent of the directory. As such, two ctime values (one for each of the directory and its parent) need to be saved in the name cache entry. This patch adds an entry for ctime of the parent directory to the name cache. It also adds an additional uma zone for large entries with this time value, in order to minimize memory wastage. As well, it fixes a couple of cases where the mtime of the parent directory was being saved instead of ctime for positive name cache entries. With this patch, Lookup RPC counts return to values similar to pre-r230394 kernels. Reported by: bde Discussed with: kib Reviewed by: jhb MFC after: 2 weeks	2012-03-03 01:06:54 +00:00
John Baldwin	831ce4cb3d	- Change contigmalloc() to use the vm_paddr_t type instead of an unsigned long for specifying a boundary constraint. - Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary constraints. These allow boundary constraints to be fully expressed for cases where sizeof(bus_addr_t) != sizeof(bus_size_t). Specifically, it allows a driver to properly specify a 4GB boundary in a PAE kernel. Note that this cannot be safely MFC'd without a lot of compat shims due to KBI changes, so I do not intend to merge it. Reviewed by: scottl	2012-03-01 19:58:34 +00:00
Kirk McKusick	35338e6091	This change avoids a kernel deadlock on "snaplk" when using snapshots on UFS filesystems running with journaled soft updates. This is the first of several bugs that need to be fixed before removing the restriction added in -r230250 to prevent the use of snapshots on filesystems running with journaled soft updates. The deadlock occurs when holding the snapshot lock (snaplk) and then trying to flush an inode via ffs_update(). We become blocked by another process trying to flush a different inode contained in the same inode block that we need. It holds the inode block for which we are waiting locked. When it tries to write the inode block, it gets blocked waiting for our snaplk when it calls ffs_copyonwrite() to see if the inode block needs to be copied in our snapshot. The most obvious place that this deadlock arises is in the ffs_copyonwrite() routine when it updates critical metadata in a snapshot and tries to write it out before proceeding. The fix here is to write the data and indirect block pointer for the snapshot, but to skip the call to ffs_update() to write the snapshot inode. To ensure that we will never have to update a pointer in the inode itself, the ffs_snapshot() routine that creates the snapshot has to ensure that all the direct blocks are allocated as part of the creation of the snapshot. A less obvious place that this deadlock occurs is when we hold the snaplk because we are deleting a snapshot. In the course of doing the deletion, we need to allocate various soft update dependency structures and allocate some journal space. If we hit a resource limit while doing this we decrease the resources in use by flushing out an existing dirty file to get it to give up the soft dependency resources that it holds.
The flush can cause an ffs_update() to be done on the inode for the file that we have selected to flush resulting in the same deadlock as described above when the inode that we have chosen to flush resides in the same inode block as the snapshot inode that we hold. The fix is to defer cleaning up any time that the inode on which we are operating is a snapshot. Help and review by: Jeff Roberson Tested by: Peter Holm MFC (to 9 only) after: 2 weeks	2012-03-01 18:45:25 +00:00
Mikolaj Golub	c7e41c8b50	Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH() operations for setting and accessing vnode's v_socket field. The operations are necessary to implement proper unix socket handling on layered file systems like nullfs(5). This change fixes the long standing issue with nullfs(5) in that unix sockets did not work between lower and upper layers: if we bound to a socket on the lower layer we could connect only to the lower path; if we bound to the upper layer we could connect only to the upper path. The new behavior is that one can connect to both the lower and the upper paths regardless of what layer path one binds to. PR: kern/51583, kern/159663 Suggested by: kib Reviewed by: arch MFC after: 2 weeks	2012-02-29 21:38:31 +00:00
David Xu	c9b01ed581	initialize clock ID and flags only when copying timespec, a _umtx_time copy already contains these fields.	2012-02-29 02:01:48 +00:00
Martin Matuska	41c0675e6e	Add procfs to jail-mountable filesystems. Reviewed by: jamie MFC after: 1 week	2012-02-29 00:30:18 +00:00
Dimitry Andric	ac382cb7f1	Change definition of pipe_chmod() from K&R to C99, to avoid the following clang warning: sys/kern/sys_pipe.c:1556:10: error: promoted type 'int' of K&R function parameter is not compatible with the parameter type 'mode_t' (aka 'unsigned short') declared in a previous prototype [-Werror] mode_t mode; ^ sys/kern/sys_pipe.c:155:19: note: previous declaration is here static fo_chmod_t pipe_chmod; ^	2012-02-28 21:45:21 +00:00
John Baldwin	0428cbe4df	Properly clear a device's devclass if DEVICE_ATTACH() fails if the device does not have a fixed devclass. Reviewed by: imp MFC after: 2 weeks	2012-02-28 19:16:02 +00:00
Konstantin Belousov	1d7ca9bb8e	Currently, the debugger attached to the process executing vfork() does not get syscall exit notification until the child performed exec or exit. Swap the order of doing ptracestop() and waiting for P_PPWAIT clearing, by postponing the wait into syscallret after ptracestop() notification is done. Reported, tested and reviewed by: Dmitry Mikulin <dmitrym juniper net> MFC after: 2 weeks	2012-02-27 21:10:10 +00:00
John Baldwin	ef7b427562	Clear a device's description string anytime its driver changes. Descriptions are specific to drivers and we don't change drivers on attached devices. This fixes a few places where we were not clearing the description when detaching a driver (e.g. when device_attach() failed). While here, fix a few other nits: - Remove spurious call to remove a device's driver from devclass_driver_deleted(). device_detach() removes it already. - Fix a typo.	2012-02-27 16:08:18 +00:00
David Xu	24c209494a	Follow changes made in revision 232144, pass absolute timeout to kernel, this eliminates a clock_gettime() syscall.	2012-02-27 13:38:52 +00:00
Alexander Motin	36acfc6507	Rework CPU load balancing in SCHED_ULE: - In sched_pickcpu() be more careful taking previous CPU on SMT systems. Do it only if all other logical CPUs of that physical one are idle to avoid extra resource sharing. - In sched_pickcpu() change general logic of CPU selection. First look for idle CPU, sharing last level cache with previously used one, skipping SMT CPU groups. If none found, search all CPUs for the least loaded one, where the thread with its priority can run now. If none found, search just for the least loaded CPU. - Make cpu_search() compare lowest/highest CPU load when comparing CPU groups with equal load. That allows to differentiate 1+1 and 2+0 loads. - Make cpu_search() to prefer specified (previous) CPU or group if load is equal. This improves cache affinity for more complicated topologies. - Randomize CPU selection if above factors are equal. Previous code tend to prefer CPUs with lower IDs, causing unneeded collisions. - Rework periodic balancer in sched_balance_group(). With cpu_search() more intelligent now, make balansing process flat, removing recursion over the topology tree. That fixes double swap problem and makes load distribution more even and predictable. All together this gives 10-15% performance improvement in many tests on CPUs with SMT, such as Core i7, for number of threads is less then number of logical CPUs. In some tests it also gives positive effect to systems without SMT. Reviewed by: jeff Tested by: flo, hackers@ MFC after: 1 month Sponsored by: iXsystems, Inc.	2012-02-27 10:31:54 +00:00
Poul-Henning Kamp	f9a61f7dcb	Also call the low-level driver if ->c_iflag & (IXON|IXOFF|IXANY) changes. Uftdi(4) examines (c_iflag & (IXON|IXOFF)) to control hw XON-XOFF support. This is obviously no good, if changes to those bits are not communicated down the stack.	2012-02-26 20:56:49 +00:00
Alan Cox	e5c92401dd	Fix typo. MFC after: 1 week	2012-02-26 19:10:14 +00:00
Martin Matuska	e7af90ab00	Analogous to r232059, add a parameter for the ZFS file system: allow.mount.zfs: allow mounting the zfs filesystem inside a jail This way the permissions for mounting all current VFCF_JAIL filesystems inside a jail are controlled via allow.mount.* jail parameters. Update sysctl descriptions. Update jail(8) and zfs(8) manpages. TODO: document the connection of allow.mount.* and VFCF_JAIL for kernel developers MFC after: 10 days	2012-02-26 16:30:39 +00:00
Jilles Tjoelker	581400dfed	Fix fchmod() and fchown() on fifos. The new fifo implementation in r232055 broke fchmod() and fchown() on fifos. Postfix needs this. Submitted by: gianni Reported by: dougb	2012-02-26 15:14:29 +00:00
Mikolaj Golub	6ce13747dc	Add sysctl to retrieve or set umask of another process. Submitted by: Dmitry Banschikov <me ubique spb ru> Discussed with: kib, rwatson Reviewed by: kib MFC after: 2 weeks	2012-02-26 14:25:48 +00:00
Konstantin Belousov	747d2fa178	Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the socket protocol number. This is useful since the socket type can be implemented by different protocols in the same protocol family, e.g. SOCK_STREAM may be provided by both TCP and SCTP. Submitted by: Jukka A. Ukkonen <jau iki fi> PR: kern/162352 Discussed with: bz Reviewed by: glebius MFC after: 2 weeks	2012-02-26 13:55:43 +00:00
Konstantin Belousov	9493639e35	Remove apparently redundant checks for socket so_proto being non-NULL from sosetopt() and sogetopt(). No exposed sockets may have so_proto invalid. Discussed with: bz, rwatson Reviewed by: glebius MFC after: 2 weeks	2012-02-26 13:51:05 +00:00
Maxim Konovalov	7dfdd83d56	o Reduce chances for integer overflow. o More verbose sysctl description added. MFC after: 2 weeks Sponsored by: Nginx, Inc.	2012-02-25 12:06:40 +00:00
Mikolaj Golub	662c901c54	When detaching a unix domain socket, uipc_detach() checks unp->unp_vnode pointer to detect if there is a vnode associated with (bound to) this socket and does necessary cleanup if there is. The issue is that after forced unmount this check may be too late as the unp_vnode is reclaimed and the reference is stale. To fix this provide a helper function that is called on a socket vnode reclamation to do necessary cleanup. Pointed by: kib Reviewed by: kib MFC after: 2 weeks	2012-02-25 10:15:41 +00:00
David Xu	df1f1bae9e	In revision 231989, we pass a 16-bit clock ID into kernel, however according to POSIX document, the clock ID may be dynamically allocated, so it is unlikely to stay within 64K forever. To make it future compatible, we pack all timeout information into a new structure called _umtx_time, and use fourth argument as a size indication, a zero means it is old code using timespec as timeout value, but the new structure also includes flags and a clock ID, so the size argument is different than before, and it is non-zero. With this change, it is possible that a thread can sleep on any supported clock, though current kernel code does not have such a POSIX clock driver system.	2012-02-25 02:12:17 +00:00
Konstantin Belousov	dcdc6c361b	Restore the return statement erroneously removed in r232048. Submitted by: cognet Pointy hat to: kib (reuse the one I already got today) MFC after: 13 days	2012-02-24 11:02:35 +00:00
Martin Matuska	bf3db8aa65	To improve control over the use of mount(8) inside a jail(8), introduce a new jail parameter node with the following parameters: allow.mount.devfs: allow mounting the devfs filesystem inside a jail allow.mount.nullfs: allow mounting the nullfs filesystem inside a jail Both parameters are disabled by default (equals the behavior before devfs and nullfs in jails). Administrators have to explicitly allow mounting devfs and nullfs for each jail. The value "-1" of the devfs_ruleset parameter is removed in favor of the new allow setting. Reviewed by: jamie Suggested by: pjd MFC after: 2 weeks	2012-02-23 18:51:24 +00:00
Kip Macy	11ac7ec076	merge pipe and fifo implementations Also reviewed by: jhb, jilles (initial revision) Tested by: pho, jilles Submitted by: gianni Reviewed by: bde	2012-02-23 18:37:30 +00:00
Christian Brueffer	6bdc1841a9	Catch up with r195837 (2.5 years ago) which renamed net_add_domain() to domain_add(). PR: 165424 Submitted by: Lachlan Kang MFC after: 1 week	2012-02-23 17:47:19 +00:00
Konstantin Belousov	dcd432817e	Allow the parent to gather the exit status of the children reparented to the debugger. When reparenting for debugging, keep the child in the new orphan list of old parent. When looping over the children in kern_wait(), iterate over both children list and orphan list to search for the process by pid. Submitted by: Dmitry Mikulin <dmitrym juniper.net> MFC after: 2 weeks	2012-02-23 11:50:23 +00:00
David Xu	f911d9fa4d	Fix typo.	2012-02-22 07:34:23 +00:00
David Xu	b13a8fa78f	Use unused fourth argument of umtx_op to pass flags to kernel for operation UMTX_OP_WAIT. Upper 16bits is enough to hold a clock id, and lower 16bits is used to pass flags. The change saves a clock_gettime() syscall from libthr.	2012-02-22 03:22:49 +00:00
Mikolaj Golub	a95852edf3	unp_connect() may use a shared lock on the vnode to fetch the socket. Suggested by: jhb Reviewed by: jhb, kib, rwatson MFC after: 2 weeks	2012-02-21 19:40:13 +00:00
Konstantin Belousov	526d0bd547	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month	2012-02-21 01:05:12 +00:00
Xin LI	fcdd3d322b	Revert r231923 for now. Further work is needed to make sure that the behavior is consistent.	2012-02-20 09:32:32 +00:00
Xin LI	5bfbb59851	Use uprintf instead of printf for the reason why a kernel module can not be loaded. This way, the administrator can get response immediately from the shell session rather than relying on dmesg. MFC after: 1 month	2012-02-20 01:05:17 +00:00
Alan Cox	7dc0ace10e	Close a race due to dropping of the map lock between creating a map entry for a shared mapping and marking the entry for inheritance. Reviewed by: kib X-MFC after: r231526	2012-02-19 00:28:49 +00:00
Konstantin Belousov	3494f31ad2	Fix misuse of the kernel map in miscellaneous image activators. Vnode-backed mappings cannot be put into the kernel map, since it is a system map. Use exec_map for transient mappings, and remove the mappings with kmem_free_wakeup() to notify the waiters on available map space. Do not map the whole executable into KVA at all to copy it out into usermode. Directly use vn_rdwr() for the case of not page aligned binary. There is one place left where the potentially unbounded amount of data is mapped into exec_map, namely, in the COFF image activator enumeration of the needed shared libraries. Reviewed by: alc MFC after: 2 weeks	2012-02-17 23:47:16 +00:00
Bjoern A. Zeeb	9dba179d5e	IFC @231845 Sponsored by: Cisco Systems, Inc.	2012-02-17 00:27:48 +00:00
Eitan Adler	f17a6f1b17	Add a timestamp to the msgbuf output in order to determine when messages were printed. This can be enabled with the kern.msgbuf_show_timestamp sysctl PR: kern/161553 Reviewed by: avg Submitted by: Arnaud Lacombe <lacombar@gmail.com> Approved by: cperciva MFC after: 1 month	2012-02-16 05:11:35 +00:00
Konstantin Belousov	343b391f20	The PTRACESTOP() macro is used only once. Inline the only use and remove the macro. MFC after: 1 week	2012-02-11 14:49:25 +00:00
Ed Schouten	852b05c5b5	Remove unneeded newline. It fits in 80 columns now. Pointed out by: jh	2012-02-10 14:55:47 +00:00
Ed Schouten	8fac9b7b7d	Merge si_name and __si_namebuf. The si_name pointer always points to the __si_namebuf member inside the same object. Remove it and rename __si_namebuf to si_name.	2012-02-10 12:40:50 +00:00
Kevin Lo	de02885a7b	Add a missing break. This bug was introduced in r228856.	2012-02-10 06:30:52 +00:00
Konstantin Belousov	db3273398b	Mark the automatically attached child with PL_FLAG_CHILD in struct lwpinfo flags, for PT_FOLLOWFORK auto-attachment. In collaboration with: Dmitry Mikulin <dmitrym juniper net> MFC after: 1 week	2012-02-10 00:02:13 +00:00
Martin Matuska	0cc207a6f5	Add support for mounting devfs inside jails. A new jail(8) option "devfs_ruleset" defines the ruleset enforcement for mounting devfs inside jails. A value of -1 disables mounting devfs in jails, a value of zero means no restrictions. Nested jails can only have mounting devfs disabled or inherit parent's enforcement as jails are not allowed to view or manipulate devfs(8) rules. Utilizes new functions introduced in r231265. Reviewed by: jamie MFC after: 1 month	2012-02-09 10:22:08 +00:00
Konstantin Belousov	9cfb2326bc	Unbreak detection of the async mode for clustered writes after r231075. Submitted by: bde MFC after: 12 days	2012-02-08 15:07:19 +00:00
Pawel Jakub Dawidek	12075c0936	Allow to set kern.ipc.shmmax from /boot/loader.conf. MFC after: 1 week	2012-02-08 09:18:22 +00:00
Ed Schouten	cd864a19a5	Fix whitespace inconsistencies in TTY code.	2012-02-06 18:15:46 +00:00
John Baldwin	bf40d24a3f	Rename cache_lookup_times() to cache_lookup() and retire the old API and ABI stub for cache_lookup().	2012-02-06 17:00:28 +00:00
Konstantin Belousov	c480f781ea	Current implementations of sync(2) and syncer vnode fsync() VOP use mnt_noasync counter to temporarily remove MNTK_ASYNC mount option, which is needed to guarantee a synchronous completion of the initiated i/o before syscall or VOP return. Global removal of MNTK_ASYNC option is harmful because not only i/o started from corresponding thread becomes synchronous, but all i/o is synchronous on the filesystem which is initiated during sync(2) or syncer activity. Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local thread flag to disable async i/o for current thread only. Use the opportunity to move DOINGASYNC() macro into sys/vnode.h and consistently use it through places which tested for MNTK_ASYNC. Some testing demonstrated 60-70% improvements in run time for the metadata-intensive operations on async-mounted UFS volumes, but still with great deviation due to other reasons. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks	2012-02-06 11:04:36 +00:00
Kevin Lo	0ca2381d9b	- Use uint8_t for the variable x and spell the size of the variable as sizeof(x) - Capitalized comment - Parentheses around return value Requested by: bde	2012-02-06 06:03:16 +00:00
Martin Matuska	a91d2201f9	Analogous to r230407 a separate path buffer in vfs_mount.c is required for r230129. Fixes an out of bounds write to fspath. MFC after: 10 days	2012-02-05 10:59:50 +00:00
David Xu	d56e058a79	Add 32-bit compat code for AIO kevent flags introduced in revision 230857.	2012-02-05 04:49:31 +00:00
Ryan Stone	312ac3a23a	Whenever a new kernel thread is spawned, explicitly clear any CPU affinity set on the new thread. This prevents the thread from inadvertently inheriting affinity from a random sibling. Submitted by: attilio Tested by: pho MFC after: 1 week	2012-02-04 16:49:29 +00:00
Hiroki Sato	cf8b832511	Fix input validation in SO_SETFIB. Reviewed by: bz MFC after: 1 day	2012-02-04 15:00:26 +00:00
Bjoern A. Zeeb	ee799639e8	Add SO_SETFIB option support on PF_INET6 sockets and allow inheriting the FIB number from the process, as set by setfib(2), on socket creation. Sponsored by: Cisco Systems, Inc.	2012-02-03 11:00:53 +00:00
Konstantin Belousov	6af519cf18	Add kqueue support to /dev/klog. Submitted by: Mateusz Guzik <mjguzik gmail com> PR: kern/156423 MFC after: 1 week	2012-02-01 14:34:52 +00:00
David Xu	fde809356a	If multiple threads call kevent() to get AIO events on same kqueue fd, it is possible that a single AIO event will be reported to multiple threads, it is not threading friendly, and the existing API can not control this behavior. Allocate a kevent flags field sigev_notify_kevent_flags for AIO event notification in sigevent, and allow user to pass EV_CLEAR, EV_DISPATCH or EV_ONESHOT to AIO kernel code, user can control whether the event should be cleared once it is retrieved by a thread. This change should be compatible with existing application, because the field should have already been zero-filled, and no additional action will be taken by kernel. PR: kern/156567	2012-02-01 02:53:06 +00:00
Konstantin Belousov	6ad1ff09cc	A debugger which requested PT_FOLLOW_FORK should get the notification about new child not only when doing PT_TO_SCX, but also for PT_CONTINUE. If TDB_FORK flag is set, always issue a stop, the same as is done for TDB_EXEC. Reported by: Dmitry Mikulin <dmitrym juniper net> MFC after: 1 week	2012-01-30 20:00:29 +00:00
John Baldwin	2bd3e4c2c2	Refine the implementation of POSIX_FADV_NOREUSE for the read(2) case such that instead of using direct I/O it allows read-ahead similar to POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the read(2) has completed to purge just-read data. The write(2) path continues to use direct I/O for POSIX_FADV_NOREUSE for now. Note that NOREUSE works optimally if an application reads and writes full fs blocks.	2012-01-30 19:35:15 +00:00
Doug Ambrisko	8e9fc27818	When detaching AIO or LIO requests, grab the lock and tell knlist_remove that we have the lock now. This cleans up a locking panic ASSERT when knlist_empty is called without a lock when INVARIANTS etc. are turned on. Reviewed by: kib jhb MFC after: 1 week	2012-01-30 19:19:22 +00:00
Konstantin Belousov	62c625fdd2	Finally, try to enable the nxstacks on amd64 and powerpc64 for both 64bit and 32bit ABIs. Also try to enable nxstacks for PAE/i386 when supported, and some variants of powerpc32. MFC after: 2 months (if ever)	2012-01-30 07:56:00 +00:00
Attilio Rao	5d7380f8e3	Avoid to check the same cache line/variable from all the locking primitives by breaking stop_scheduler into a per-thread variable. Also, store the new td_stopsched very close to td_*locks members as they will be accessed mostly in the same codepaths as td_stopsched and this results in avoiding a further cache-line pollution, possibly. STOP_SCHEDULER() was pondered to use a new 'thread' argument, in order to take advantage of already cached curthread, but in the end there should not really be a performance benefit, while introducing a KPI breakage. In collaboration with: flo Reviewed by: avg MFC after: 3 months (or never) X-MFC: r228424	2012-01-28 14:00:21 +00:00
Gleb Smirnoff	94fce84763	Fix size check, that prevents getting negative after casting to a signed type Reviewed by: bde	2012-01-27 08:58:58 +00:00
Kenneth D. Merry	7e949c467c	Xen netback driver rewrite. share/man/man4/Makefile, share/man/man4/xnb.4, sys/dev/xen/netback/netback.c, sys/dev/xen/netback/netback_unit_tests.c: Rewrote the netback driver for xen to attach properly via newbus and work properly in both HVM and PVM mode (only HVM is tested). Works with the in-tree FreeBSD netfront driver or the Windows netfront driver from SuSE. Has not been extensively tested with a Linux netfront driver. Does not implement LRO, TSO, or polling. Includes unit tests that may be run through sysctl after compiling with XNB_DEBUG defined. sys/dev/xen/blkback/blkback.c, sys/xen/interface/io/netif.h: Comment elaboration. sys/kern/uipc_mbuf.c: Fix page fault in kernel mode when calling m_print() on a null mbuf. Since m_print() is only used for debugging, there are no performance concerns for extra error checking code. sys/kern/subr_scanf.c: Add the "hh" and "ll" width specifiers from C99 to scanf(). A few callers were already using "ll" even though scanf() was handling it as "l". Submitted by: Alan Somers <alans@spectralogic.com> Submitted by: John Suykerbuyk <johns@spectralogic.com> Sponsored by: Spectra Logic MFC after: 1 week Reviewed by: ken	2012-01-26 16:35:09 +00:00
Gleb Smirnoff	434ea137cc	Although aio_nbytes is size_t, later it is cast to signed types: to ssize_t in filesystem code and to int in buf code, thus supplying a negative argument leads to kernel panic later. To fix that, check the user-supplied argument at the beginning of the syscall. Submitted by: Maxim Dounin <mdounin mdounin.ru>, maxim@	2012-01-26 11:59:48 +00:00
Konstantin Belousov	abc942b56c	When doing vflush(WRITECLOSE), clean vnode pages. Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is still a race allowing a process to dirty pages after msync finished. Remounts rw->ro just left dirty pages in system. Reviewed by: alc, tegge (long time ago) Tested by: pho MFC after: 2 weeks	2012-01-25 20:54:09 +00:00
Konstantin Belousov	d5210589b7	Fix remaining calls to cache_enter() in both NFS clients to provide appropriate timestamps. Restore the assertions which verify that NCF_TS is set when timestamp is asked for. Reviewed by: jhb (previous version) MFC after: 2 weeks	2012-01-25 20:48:20 +00:00
Mikolaj Golub	45efc9b4aa	Fix CTL flags in the declarations of KERN_PROC_ENV, AUXV and PS_STRINGS sysctls: they are read only. MFC after: 1 week	2012-01-25 20:15:58 +00:00
Konstantin Belousov	7a7e609a32	Apparently, both nfs clients do not use cache_enter_time() consistently, creating some namecache entries without NCF_TS flag. This causes panic due to failed assertion. As a temporary relief, remove the assert. Return epoch timestamp for the entries without timestamp if asked. While there, consolidate the code which returns timestamps, into a helper cache_out_ts(). Discussed with: jhb MFC after: 2 weeks	2012-01-23 17:09:23 +00:00
Gleb Smirnoff	93a1b4c4cf	Convert panic()s to KASSERT()s. This is an optimisation for hashdestroy() since in absence of INVARIANTS a compiler will drop the entire for() cycle.	2012-01-23 16:31:46 +00:00
Mikolaj Golub	8854fe3915	Change kern.proc.rlimit sysctl to: - retrieve only one, specified limit for a process, not the whole array, as it was previously (the sysctl has been added recently and has not been backported to stable yet, so this change is ok); - allow to set a resource limit for another process. Submitted by: Andrey Zonov <andrey at zonov.org> Discussed with: kib Reviewed by: kib MFC after: 2 weeks	2012-01-22 20:25:00 +00:00
Pawel Jakub Dawidek	9b9a01792d	TDF_* flags should be used with td_flags field and TDP_* flags should be used with td_pflags field. Correct two places where it was not the case. Discussed with: kib MFC after: 1 week	2012-01-22 11:01:36 +00:00
Konstantin Belousov	c2b396f294	Remove the nc_time and nc_ticks elements from struct namecache, and provide struct namecache_ts which is the old struct namecache. Only allocate struct namecache_ts if non-null struct timespec *tsp was passed to cache_enter_time, otherwise use struct namecache. Change struct namecache allocation and deallocation macros into static functions, since logic becomes somewhat twisty. Provide accessor for the nc_name member of struct namecache to hide difference between struct namecache and namecache_ts. The aim of the change is to not waste 20 bytes per small namecache entry. Reviewed by: jhb MFC after: 2 weeks X-MFC-note: after r230394	2012-01-22 01:11:06 +00:00
Martin Matuska	6dfe0a3dc2	Use separate buffer for global path to avoid overflow of path buffer. Reviewed by: jamie@ MFC after: 3 weeks	2012-01-21 00:06:21 +00:00
John Baldwin	5aefb4cbbf	Close a race in NFS lookup processing that could result in stale name cache entries on one client when a directory was renamed on another client. The root cause for the stale entry being trusted is that each per-vnode nfsnode structure has a single 'n_ctime' timestamp used to validate positive name cache entries. However, if there are multiple entries for a single vnode, they all share a single timestamp. To fix this, extend the name cache to allow filesystems to optionally store a timestamp value in each name cache entry. The NFS clients now fetch the timestamp associated with each name cache entry and use that to validate cache hits instead of the timestamps previously stored in the nfsnode. Another part of the fix is that the NFS clients now use timestamps from the post-op attributes of RPCs when adding name cache entries rather than pulling the timestamps out of the file's attribute cache. The latter is subject to races with other lookups updating the attribute cache concurrently. Some more details: - Add a variant of nfsm_postop_attr() to the old NFS client that can return a vattr structure with a copy of the post-op attributes. - Handle lookups of "." as a special case in the NFS clients since the name cache does not store name cache entries for ".", so we cannot get a useful timestamp. It didn't really make much sense to recheck the attributes on the directory to validate the namecache hit for "." anyway. - ABI compat shims for the name cache routines are present in this commit so that it is safe to MFC. MFC after: 2 weeks	2012-01-20 20:02:01 +00:00
Konstantin Belousov	2974cc36f7	Use shared lock for the executable vnode in the exec path after the VV_TEXT changes are handled. Assert that vnode is exclusively locked at the places that modify VV_TEXT. Discussed with: alc MFC after: 3 weeks	2012-01-19 23:03:31 +00:00
Alan Cox	1dfab8025e	Explain why it is safe to unlock the vnode. Requested by: kib	2012-01-17 16:20:50 +00:00
Kirk McKusick	cc672d3599	Make sure all intermediate variables holding mount flags (mnt_flag) and that all internal kernel calls passing mount flags are declared as uint64_t so that flags in the top 32-bits are not lost. MFC after: 2 weeks	2012-01-17 01:08:01 +00:00
Alan Cox	292177e67a	Improve abstraction. Eliminate direct access by elf_load_section() to an OBJT_VNODE-specific field of the vm object. The same information can be just as easily obtained from the struct vattr that is in struct image_params if the latter is passed to elf_load_section(). Moreover, by replacing the vmspace and vm object parameters to elf*_load_section() with a struct image_params parameter, we actually reduce the size of the object code. In collaboration with: kib	2012-01-17 00:27:32 +00:00
Sergey Kandaurov	037f43d3ef	Be pedantic and change // comment to C-style one. Noticed by: Bruce Evans	2012-01-16 20:42:56 +00:00
Kevin Lo	575cabed9e	Fix a style bug Spotted by: avg	2012-01-16 14:54:48 +00:00
David Xu	29a06690ca	Eliminate branch and insert an explicit reader memory barrier to ensure that waiter bit is set before reading semaphore count.	2012-01-16 04:39:10 +00:00
Mikolaj Golub	fe7f89b71a	Abrogate nchr argument in proc_getargv() and proc_getenvv(): we always want to read strings completely to know the actual size. As a side effect it fixes the issue with kern.proc.args and kern.proc.env sysctls, which didn't return the size of available data when calling sysctl(3) with the NULL argument for oldp. Note, in get_ps_strings(), which does actual work for proc_getargv() and proc_getenvv(), we still have a safety limit on the size of data read in case of a corrupted process stack. Suggested by: kib MFC after: 3 days	2012-01-15 18:47:24 +00:00
Martin Matuska	9cbe30e1d5	Fix missing in r230129: kern_jail.c: initialize fullpath_disabled to zero vfs_cache.c: add missing dot in comment Reported by: kib MFC after: 1 month	2012-01-15 18:08:15 +00:00
Ulrich Spörlein	9a14aa017b	Convert files to UTF-8	2012-01-15 13:23:18 +00:00
Martin Matuska	f6e633a9e1	Introduce vn_path_to_global_path() This function updates path string to vnode's full global path and checks the size of the new path string against the pathlen argument. In vfs_domount(), sys_unmount() and kern_jail_set() this new function is used to update the supplied path argument to the respective global path. Unbreaks jailed zfs(8) with enforce_statfs set to 1. Reviewed by: kib MFC after: 1 month	2012-01-15 12:08:20 +00:00
Eitan Adler	886e862866	- Fix undefined behavior when device_get_name is null - Make error message more informative PR: kern/149800 Submitted by: olgeni Approved by: cperciva MFC after: 1 week	2012-01-15 07:09:18 +00:00
Oleksandr Tymoshenko	4104e83567	Fix kernel modules loading for MIPS64 kernel: On amd64, link_elf_obj.c must specify KERNBASE rather than VM_MIN_KERNEL_ADDRESS to vm_map_find() because kernel loadable modules must be mapped for execution in the same upper region of the kernel map as the kernel code and data segments. For MIPS32 KERNBASE lies below KVA area (it's less than VM_MIN_KERNEL_ADDRESS) so basically vm_map_find got whole KVA to look through. On MIPS64 it's not the case because KERNBASE is set to the very end of XKSEG, well out of KVA bounds, so vm_map_find always fails. We should use VM_MIN_KERNEL_ADDRESS as a base for vm_map_find. Details obtained from: alc@	2012-01-14 00:36:07 +00:00
John Baldwin	fbcebf7f71	Convert the per-interface address list lock from a mutex to a reader/writer lock. Reviewed by: bz	2012-01-09 19:34:12 +00:00
Andriy Gapon	90d8265326	enable stop_scheduler_on_panic by default My plan is to make this behavior unconditional before 10.0 release. X-MFC after: r228424 (if ever)	2012-01-09 12:06:09 +00:00
Konstantin Belousov	3ab0160340	Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon(). The vfs_busy() is after covered vnode lock in the global lock order, but since quotaon() does recursive VFS call to open quota file, we usually end up locking covered vnode after mp is busied in sys_quotactl(). Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon, esp. if vfs_busy cannot succeed due to unmount being performed. Reported and tested by: pho MFC after: 1 week	2012-01-08 23:06:53 +00:00
Alan Cox	2971897d51	Correct an error of omission in the implementation of the truncation operation on POSIX shared memory objects and tmpfs. Previously, neither of these modules correctly handled the case in which the new size of the object or file was not a multiple of the page size. Specifically, they did not handle partial page truncation of data stored on swap. As a result, stale data might later be returned to an application. Interestingly, a data inconsistency was less likely to occur under tmpfs than POSIX shared memory objects. The reason being that a different mistake by the tmpfs truncation operation helped avoid a data inconsistency. If the data was still resident in memory in a PG_CACHED page, then the tmpfs truncation operation would reactivate that page, zero the truncated portion, and leave the page pinned in memory. More precisely, the benevolent error was that the truncation operation didn't add the reactivated page to any of the paging queues, effectively pinning the page. This page would remain pinned until the file was destroyed or the page was read or written. With this change, the page is now added to the inactive queue. Discussed with: jhb Reviewed by: kib (an earlier version) MFC after: 3 weeks	2012-01-08 20:09:26 +00:00
Hiroki Sato	ca54e1aee3	Fix a typo. (s/nessesary/necessary/)	2012-01-08 18:48:36 +00:00
John Baldwin	71eeeaf256	Add 5 spare VOPs as placeholders to avoid breaking the KBI in the future when new VOPs are MFC'd to a branch. Reviewed by: kib, bz MFC after: 3 days	2012-01-06 20:06:45 +00:00
John Baldwin	908cac07ce	Use proper argument structure types for the extattr post-VOP hooks. The wrong structure happened to work since the only argument used was the vnode which is in the same place in both VOP_SETATTR() and the two extattr VOPs. MFC after: 3 days	2012-01-06 20:05:48 +00:00
John Baldwin	948c460971	Fix a logic bug in change 228207 in the check for a thread's new user priority being a realtime priority. MFC after: 3 days	2012-01-05 19:02:52 +00:00
John Baldwin	137f91e80f	Convert all users of IF_ADDR_LOCK to use new locking macros that specify either a read lock or write lock. Reviewed by: bz MFC after: 2 weeks	2012-01-05 19:00:36 +00:00
John Baldwin	7e3a96ea37	Some small fixes to CPU accounting for threads: - Only initialize the per-cpu switchticks and switchtime in sched_throw() for the very first context switch on APs during boot. This avoids a small gap between the middle of thread_exit() and sched_throw() where time is not accounted to any thread. - In thread_exit(), update the timestamp bookkeeping to track the changes to mi_switch() introduced by td_rux so that the code once again matches the comment claiming it is mimicking mi_switch(). Specifically, only update the per-thread stats directly and depend on ruxagg() to update p_rux rather than adjusting p_rux directly. While here, move the timestamp bookkeeping as late in the function as possible. Reviewed by: bde, kib MFC after: 1 week	2012-01-03 21:03:28 +00:00
Ed Schouten	dc15eac046	Use strchr() and strrchr(). It seems strchr() and strrchr() are used more often than index() and rindex(). Therefore, simply migrate all kernel code to use it. For the XFS code, remove an empty line to make the code identical to the code in the Linux kernel.	2012-01-02 12:12:10 +00:00
Konstantin Belousov	cdb7a43117	Avoid double-unlock or double unreference for ndp->ni_dvp when the vnode dp lock upgrade right after the 'success' label fails. In collaboration with: pho MFC after: 1 week	2012-01-01 18:45:59 +00:00
John Baldwin	0c0d27d5dd	Cap the priority calculated from the current thread's running tick count at SCHED_PRI_RANGE to prevent overflows in the priority value. This can happen due to irregularities with clock interrupts under certain virtualization environments. Tested by: Larry Rosenman ler lerctr org MFC after: 2 weeks	2011-12-29 16:17:16 +00:00
Lawrence Stewart	6cedd609b7	Introduce the sysclock_getsnapshot() and sysclock_snap2bintime() KPIs. The sysclock_getsnapshot() function allows the caller to obtain a snapshot of all the system clock and timecounter state required to create time stamps at a later point. The sysclock_snap2bintime() function converts a previously obtained snapshot into a bintime time stamp according to the specified flags e.g. which system clock, uptime vs absolute time, etc. These KPIs enable useful functionality, including direct comparison of the feedback and feed-forward system clocks and generation of multiple time stamps with different formats from a single timecounter read. Committed on behalf of Julien Ridoux and Darryl Veitch from the University of Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward Clock Synchronization Algorithms" project. For more information, see http://www.synclab.org/radclock/ In collaboration with: Julien Ridoux (jridoux at unimelb edu au)	2011-12-24 01:32:01 +00:00
John Baldwin	f0d6c5caf0	Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended attributes of a vnode are changed. Note that OS X already implements this behavior. Reviewed by: rwatson MFC after: 2 weeks	2011-12-23 20:11:37 +00:00
John Baldwin	268e76d86e	Use TASK_INITIALIZER() for dev_dtr_task rather than a dedicated SYSINIT().	2011-12-22 16:01:10 +00:00
Andriy Gapon	167057914b	ule: ensure that batch timeshare threads are scheduled fairly With the previous code, if the range of priorities for timeshare batch threads was greater than RQ_NQS, then the threads with low priorities in the part of the range above RQ_NQS would be scheduled to the run-queues as if they had high priorities at the beginning of the range. In other words, threads with a nice level of +N could be scheduled as if they had a nice level of -M. Reported by: George Mitchell <george@m5p.com> Reviewed by: jhb Tested by: George Mitchell <george@m5p.com> (earlier version) MFC after: 1 week	2011-12-19 20:01:21 +00:00
Mikolaj Golub	547b155eb1	Fix style and white spaces. MFC after: 1 week	2011-12-17 22:18:26 +00:00
Mikolaj Golub	fa3935bcea	On start most of sysctl_kern_proc functions use the same pattern: locate a process calling pfind() and do some additional checks like p_candebug(). To reduce this code duplication a new function pget() is introduced and used. As the function may be useful not only in kern_proc.c it is in the kernel name space. Suggested by: kib Reviewed by: kib MFC after: 2 weeks	2011-12-17 16:59:22 +00:00
Andriy Gapon	f389bc9585	belatedly transfer copyrights from libkern/gets.c to kern_cons.c MFC after: 2 months MFC with: r228642	2011-12-17 15:50:45 +00:00
Andriy Gapon	f6ce353e58	replace uses of libkern gets with cngets MFC after: 2 months	2011-12-17 15:26:34 +00:00
Andriy Gapon	8e62854265	introduce cngets, a method for kernel to read a string from console This is intended as a replacement for libkern's gets and mostly borrows its implementation. It uses cngrab/cnungrab to delimit kernel's access to console input. Note: libkern's gets obviously doesn't share any bits of implementation with libc's gets. They also have different APIs and the former doesn't have the overflow problems of the latter. Inspired by: bde MFC after: 2 months	2011-12-17 15:16:54 +00:00
Andriy Gapon	bf8696b408	introduce cngrab/cnungrab stub calls in some places where they make sense MFC after: 2 months	2011-12-17 15:11:22 +00:00
Andriy Gapon	9976156f12	kern cons: introduce infrastructure for console grabbing by kernel At the moment grab and ungrab methods of all console drivers are no-ops. Current intended meaning of the calls is that the kernel takes control of console input. In the future the semantics may be extended to mean that the calling thread takes full ownership of the console (e.g. console output from other threads could be suspended). Inspired by: bde MFC after: 2 months	2011-12-17 15:08:43 +00:00
John Baldwin	f427c78b19	Fire a kevent if necessary after seeking on a regular file. This fixes a case where a kevent would not fire on a regular file if an application read to EOF and then seeked backwards into the file. Reviewed by: kib MFC after: 2 weeks	2011-12-16 20:10:00 +00:00
John Baldwin	338e7cf235	Use vm_mmap_to_errno(). Submitted by: kib	2011-12-15 15:17:19 +00:00
Jilles Tjoelker	6d1c58f8a2	Fix select/poll/kqueue for write on reverse direction before first write. The reverse direction of a pipe is lazily allocated on the first write in that direction (because pipes are usually used in one direction only). A special case is needed to ensure the pipe appears writable before the first write because there are 0 bytes of pending data in 0 bytes of buffer space at that point, leaving 0 bytes of data that can be written with the normal code. Note that the first write returns [ENOMEM] if kern.ipc.maxpipekva is exceeded and does not block or return [EAGAIN], so selecting true for write is correct even in that case. PR: kern/93685 Submitted by: gianni MFC after: 2 weeks	2011-12-14 22:26:39 +00:00
John Baldwin	fb680e16f4	Add a helper API to allow in-kernel code to map portions of shared memory objects created by shm_open(2) into the kernel's address space. This provides a convenient way for creating shared memory buffers between userland and the kernel without requiring custom character devices.	2011-12-14 22:22:19 +00:00
David E. O'Brien	1c5151f3f8	Match other formatting.	2011-12-14 02:31:32 +00:00
David E. O'Brien	3d7618d8bf	Disallow various debug.kdb sysctl's when securelevel is raised. PR: 161350	2011-12-13 17:59:16 +00:00
Eitan Adler	9910b854c6	- Add a sysctl to allow non-root users the ability to set idle priorities. - While here fix up some style nits. Discussed with: cperciva (briefly) Reviewed by: pjd (earlier version) Reviewed by: bde Approved by: jhb MFC after: 1 month	2011-12-13 14:00:27 +00:00
Eitan Adler	3eb9ab5255	Document a large number of currently undocumented sysctls. While here fix some style(9) issues and reduce redundancy. PR: kern/155491 PR: kern/155490 PR: kern/155489 Submitted by: Galimov Albert <wtfcrap@mail.ru> Approved by: bde Reviewed by: jhb MFC after: 1 week	2011-12-13 00:38:50 +00:00
Andriy Gapon	7a7ce668ef	put sys/systm.h at its proper place or add it if missing Reported by: lstewart, tinderbox Pointyhat to: avg, attilio MFC after: 1 week MFC with: r228430	2011-12-12 10:05:13 +00:00
Andriy Gapon	0e225211a0	kern_racct: move sys/systm.h inclusion to its proper place This should fix the build failure introduced with r228424. Also remove duplicate inclusion of sys/param.h. Pointyhat to: avg MFC after: 1 week	2011-12-12 07:46:10 +00:00
Andriy Gapon	353705930f	panic: add a switch and infrastructure for stopping other CPUs in SMP case Historical behavior of letting other CPUs merrily go on is a default for the time being. The new behavior can be switched on via kern.stop_scheduler_on_panic tunable and sysctl. Stopping of the CPUs has (at least) the following benefits: - more of the system state at panic time is preserved intact - threads and interrupts do not interfere with dumping of the system state Only one thread runs uninterrupted after panic if stop_scheduler_on_panic is set. That thread might call code that is also used in normal context and that code might use locks to prevent concurrent execution of certain parts. Those locks might be held by the stopped threads and would never be released. To work around this issue, it was decided that instead of explicit checks for panic context, we would rather put those checks inside the locking primitives. This change has substantial portions written and re-written by attilio and kib at various times. Other changes are heavily based on the ideas and patches submitted by jhb and mdf. bde has provided many insights into the details and history of the current code. The new behavior may cause problems for systems that use a USB keyboard for interfacing with system console. This is because of some unusual locking patterns in the ukbd code which have to be used because on one hand ukbd is below syscons, but on the other hand it has to interface with other usb code that uses regular mutexes/Giant for its concurrency protection. Dumping to USB-connected disks may also be affected. PR: amd64/139614 (at least) In cooperation with: attilio, jhb, kib, mdf Discussed with: arch@, bde Tested by: Eugene Grosbein <eugen@grosbein.net>, gnn, Steven Hartland <killing@multiplay.co.uk>, glebius, Andrew Boyer <aboyer@averesystems.com> (various versions of the patch) MFC after: 3 months (or never)	2011-12-11 21:02:01 +00:00
Peter Holm	cdea31e305	Move cpu_set_upcall(newtd, td) up before the first call of thread_free(newtd). This is to avoid a possible page fault in cpu_thread_clean() as seen on amd64 with syscall fuzzing. Reviewed by: kib MFC after: 1 week	2011-12-09 17:19:41 +00:00
Eitan Adler	5a01b72672	- Fix ktrace leakage if error is set PR: kern/163098 Submitted by: Loganaden Velvindron <loganaden@devio.us> Approved by: sbruno@ MFC after: 1 month	2011-12-08 03:20:38 +00:00
Alan Cox	ea3f07d3a0	Eliminate stale numbers from a comment.	2011-12-07 16:27:23 +00:00
Alan Cox	c749c003b8	Eliminate the possibility of 32-bit arithmetic overflow in the calculation of vm_kmem_size that may occur if the system administrator has specified a vm.vm_kmem_size tunable value that exceeds the hard cap. PR: 162741 Submitted by: Adam McDougall Reviewed by: bde@ MFC after: 3 weeks	2011-12-07 07:03:14 +00:00
Konstantin Belousov	93c26de0ad	Most users of pipe(2) do not call fstat(2) on the returned pipe descriptors. Optimize for this case by lazily allocating the pipe inode number at fstat(2) time. If alloc_unr(9) returns failure, do not fail fstat(2), since uses of inode numbers are even rarer than fstat(2) calls, but provide zero inode forever. Note that alloc_unr() failure is unlikely because the total number of pipes in the system is limited by the number of file descriptors. Based on the submission by: gianni MFC after: 2 weeks	2011-12-06 11:24:03 +00:00

12768 Commits