freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	f7e50ea722	Fix a race between kern_setitimer() and realitexpire(), where the callout is started before kern_setitimer() acquires process mutex, but loses a race and kern_setitimer() gets the process mutex before the callout. Then, assuming that new specified struct itimerval has it_interval zero, but it_value non-zero, the callout, after it starts executing again, clears p->p_realtimer.it_value, but kern_setitimer() already rescheduled the callout. As a result of the race, both p_realtimer is zero, and the callout is rescheduled. Then, in the exit1(), the exit code sees that it_value is zero and does not even try to stop the callout. This allows the struct proc to be reused and eventually the armed callout is re-initialized. The consequence is the corrupted callwheel tailq. Use process mutex to interlock the callout start, which fixes the race. Reported and tested by: pho Reviewed by: jhb MFC after: 2 weeks	2012-12-04 20:49:39 +00:00
Konstantin Belousov	9bdf6ccab3	Do not allocate buffer of the 255 bytes length on the stack. Reported and tested by: sig6247@gmail.com MFC after: 1 week	2012-12-04 20:49:04 +00:00
Alfred Perlstein	922314f018	replace bit shifting loop with 1<<fls(n), improve comments. Reviewed by: davide	2012-12-04 05:28:20 +00:00
Konstantin Belousov	07840861b1	The vnode_free_list_mtx is required unconditionally when iterating over the active list. The mount interlock is not enough to guarantee the validity of the tailq link pointers. The __mnt_vnode_next_active() and __mnt_vnode_first_active() active list iterator helper functions did not provide the necessary stability for the list, allowing the iterators to pick garbage. This was uncovered after the r243599 made the active list iterators non-nop. Since a vnode interlock is before the vnode_free_list_mtx, obtain the vnode ilock in the non-blocking manner when under vnode_free_list_mtx, and restart iteration after the yield if the lock attempt failed. Assert that a vnode found on the list is active, and assert that the helpers return the vnode with interlock owned. Reported and tested by: pho MFC after: 1 week	2012-12-03 22:15:16 +00:00
Pawel Jakub Dawidek	8909f88d28	Fix one more compilation issue.	2012-12-01 08:59:36 +00:00
Pawel Jakub Dawidek	499f0f4d55	IFp4 @208451: Fix path handling for *at() syscalls. Before the change directory descriptor was totally ignored, so the relative path argument was appended to current working directory path and not to the path provided by descriptor, thus wrong paths were stored in audit logs. Now that we use directory descriptor in vfs_lookup, move AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where we hold file descriptors table lock, so we are sure paths will be resolved according to the same directory in audit record and in actual operation. Sponsored by: FreeBSD Foundation (auditdistd) Reviewed by: rwatson MFC after: 2 weeks	2012-11-30 23:18:49 +00:00
Pawel Jakub Dawidek	e1216d1335	IFp4 @208450: Remove redundant call to AUDIT_ARG_UPATH1(). Path will be remembered by the following NDINIT(AUDITVNODE1) call. Sponsored by: FreeBSD Foundation (auditdistd) MFC after: 2 weeks	2012-11-30 22:49:28 +00:00
Andre Oppermann	df905a2bd3	Using a long is the wrong type to represent the realmem and maxmbufmem variable as they may overflow on i386/PAE and i386 with > 2GB RAM. Use 64bit quad_t instead. It has broader kernel infrastructure support with TUNABLE_QUAD_FETCH() and qmin/qmax() than other available types. Pointed out by: alc, bde	2012-11-29 07:30:42 +00:00
Andre Oppermann	416a434cd0	Complete r243631 by applying the remainder of kern_mbuf.c that got lost while merging into the commit tree. MFC after: 1 month X-MFC-with: r243631	2012-11-27 23:16:56 +00:00
Andre Oppermann	358c7f47da	Fix r243627 by testing against the head socket instead of the socket just created. MFC after: 1 week X-MFC-with: r243627	2012-11-27 22:35:48 +00:00
Andre Oppermann	ead46972a4	Base the mbuf related limits on the available physical memory or kernel memory, whichever is lower. The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM. The limit is set to half of the physical RAM or KVM (whichever is lower) as the baseline. In any normal scenario we want to leave at least half of the physmem/kvm for other kernel functions and userspace to prevent it from swapping too easily. Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm. At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets. Tidy up ordering in init_param2() and check up on some users of those values calculated here. Out of the overall mbuf memory limit 2K clusters and 4K (page size) clusters get 1/4 each because these are the most heavily used mbuf sizes. 2K clusters are used for MTU 1500 ethernet inbound packets. 4K clusters are used whenever possible for sends on sockets and thus outbound packets. The larger cluster sizes of 9K and 16K are limited to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used these large clusters will end up only on the inbound path. They are not used on outbound, there it's still 4K. Yes, that will stay that way because otherwise we run into lots of complications in the stack. And it really isn't a problem, so don't make a scene. Normal mbufs (256B) weren't limited at all previously. This was problematic as there are certain places in the kernel that on allocation failure of clusters try to piece together their packet from smaller mbufs. The mbuf limit is the number of all other mbuf sizes together plus some more to allow for standalone mbufs (ACK for example) and to send off a copy of a cluster. Unfortunately there isn't a way to set an overall limit for all mbuf memory together as UMA doesn't support such a limiting. NB: Every cluster also has an mbuf associated with it. Two examples on the revised mbuf sizing limits: 1GB KVM: 512MB limit for mbufs; 419,430 mbufs; 65,536 2K mbuf clusters; 32,768 4K mbuf clusters; 9,709 9K mbuf clusters; 5,461 16K mbuf clusters. 16GB RAM: 8GB limit for mbufs; 33,554,432 mbufs; 1,048,576 2K mbuf clusters; 524,288 4K mbuf clusters; 155,344 9K mbuf clusters; 87,381 16K mbuf clusters. These defaults should be sufficient for even the most demanding network loads. MFC after: 1 month	2012-11-27 21:19:58 +00:00
Andre Oppermann	2c3142c82c	Fix a race on listen socket teardown where while draining the accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK. The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's. Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com> Found by: His team. MFC after: 1 week	2012-11-27 20:04:52 +00:00
Pawel Jakub Dawidek	b0c9d4d70e	Add kern.capmode_coredump sysctl/tunable to allow processes in capability mode to dump core. Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks	2012-11-27 10:38:11 +00:00
Pawel Jakub Dawidek	f121e3e81d	- Add NOCAPCHECK flag to namei that allows lookup to work even if the process is in capability mode. - Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() that will be converted into NOCAPCHECK namei flag. This functionality will be used to enable core dumps for sandboxed processes. Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks	2012-11-27 10:32:35 +00:00
Pawel Jakub Dawidek	90b2202145	Regenerate after r243610.	2012-11-27 10:25:03 +00:00
Pawel Jakub Dawidek	8890f5d020	Allow use of kill(2) in capability mode, but a process can send a signal only to itself. For example abort(3) at first tries to do kill(getpid(), SIGABRT) which was failing in capability mode, so the code was falling back to exit(1). Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks	2012-11-27 10:22:40 +00:00
Pawel Jakub Dawidek	b62d05fcf9	Allow to modify kern.sugid_coredump and kern.corefile from loader.conf. Obtained from: WHEEL Systems	2012-11-27 10:16:48 +00:00
Pawel Jakub Dawidek	c320984687	More style fixes.	2012-11-27 10:15:58 +00:00
Pawel Jakub Dawidek	23c6445a4b	Style fixes (mostly whitespaces).	2012-11-27 10:11:54 +00:00
David Xu	3da9ab75f4	Take first active vnode correctly. Reviewed by: kib MFC after: 3 days	2012-11-27 06:07:58 +00:00
Pawel Jakub Dawidek	4f66641749	Look for zombie process only if we were given process id. Reviewed by: kib MFC after: 2 weeks X-MFC-after-or-with: 243142	2012-11-25 19:31:42 +00:00
Andriy Gapon	6898bee9a9	remove stop_scheduler_on_panic knob. There have not been any complaints about the default behavior, so there is no need to keep a knob that enables the worse alternative. Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu spinlock-like logic can be dropped, because only a single CPU is supposed to win stop_cpus_hard(other_cpus) race and proceed past that call. MFC after: 1 month	2012-11-25 14:22:08 +00:00
Andriy Gapon	6b991098a7	assert_vop_locked: make the assertion race-free and more efficient. This is really a minor improvement for the sake of correctness. MFC after: 6 days	2012-11-24 13:11:47 +00:00
Andriy Gapon	4f15bb6730	remove vop_lookup_pre and vop_lookup_post Suggested by: kib MFC after: 5 days	2012-11-22 10:36:10 +00:00
Konstantin Belousov	daee0f0b0b	Schedule garbage collection run for the in-flight rights passed over the unix domain sockets to the next tick, coalescing the serial calls until the collection fires. The thought is that more work for the collector could arise in the near time, allowing to clean more and not spend too much CPU on repeated collection when there is no garbage. Currently the collection task is fired immediately upon unix domain socket close if there are any rights in flight, which caused excessive CPU usage and too long blocking of the threads waiting for unp_list_lock and unp_link_rwlock in write mode. Robert noted that it would be nice if we could find some heuristic by which we decide whether to run GC a bit more quickly. E.g., if the number of UNIX domain sockets is close to its resource limit, but not quite. Reported and tested by: Markus Gebert <markus.gebert@hostpoint.ch> Reviewed by: rwatson MFC after: 2 weeks	2012-11-20 15:45:48 +00:00
Konstantin Belousov	b7c8d2f2f5	Add a special meaning to the negative ticks argument for taskqueue_enqueue_timeout(). Do not rearm the callout if it is already armed and the ticks is negative. Otherwise rearm it to fire in abs(ticks) ticks in the future. The intended use is to call taskqueue_enqueue_timeout() for the given timeout_task with the same negative ticks argument. As a result, the task is scheduled to execute not further than abs(ticks) ticks in the future, and the subsequent enqueues are coalesced until the already scheduled task is finished. Reviewed by: rwatson Tested by: Markus Gebert <markus.gebert@hostpoint.ch> MFC after: 2 weeks	2012-11-20 15:33:48 +00:00
Attilio Rao	973b795b64	insmntque() is always called with the lock held in exclusive mode, then: - assume the lock is held in exclusive mode and remove a moot check about the lock acquisition. - in the destructor remove !MPSAFE specific chunk. Reviewed by: kib MFC after: 2 weeks	2012-11-19 20:43:19 +00:00
Andriy Gapon	ab49c952d9	assert_vop_locked should treat LK_EXCLOTHER as the not locked case ... from a perspective of the current thread. Spotted by: mjg Discussed with: kib MFC after: 18 days	2012-11-19 11:35:56 +00:00
Andriy Gapon	c496727c54	vnode_if: fix locking protocol description for lookup and cachedlookup. Also remove the checks from vop_lookup_pre and vop_lookup_post, which are now completely redundant (before this change they were partially redundant). Discussed with: kib MFC after: 10 days	2012-11-19 11:32:56 +00:00
Mateusz Guzik	dd103d4d06	Fix possible fp reference leak in posix_openpt Reviewed by: ed Approved by: trasz (mentor) MFC after: 3 days	2012-11-18 15:48:34 +00:00
Gleb Smirnoff	716963cb5d	Update comment.	2012-11-16 14:00:54 +00:00
Konstantin Belousov	134eb42e24	In pget(9), if PGET_NOTWEXIT flag is not specified, also search the zombie list for the pid. This allows several kern.proc sysctls to report useful information for zombies. Hold the allproc_lock around all searches instead of relocking it. Remove private pfind_locked() from the new nfs client code. Requested and reviewed by: pjd Tested by: pho MFC after: 3 weeks	2012-11-16 08:25:06 +00:00
Konstantin Belousov	ea293f3f1d	Restore the proper handling of the pid 0 for waitpid(2). Fix the style around. Reported and reviewed by: bde (previous version) MFC after: 28 days	2012-11-16 06:32:38 +00:00
Konstantin Belousov	a2a8559624	Style fixes for r242958. Reported and reviewed by: bde MFC after: 28 days	2012-11-16 06:22:14 +00:00
Edward Tomasz Napierala	baf85d0a22	Improve KASSERT messages in racct, to make it clear which resource caused the problem. Submitted by: mjg	2012-11-15 15:55:49 +00:00
Edward Tomasz Napierala	84c9193ba0	Fix kassert that's not really valid for %CPU accounting. The problem here is race between decaying the resource usage in containers, and updating per-process usage; basically, the former may cause per-container usage to get smaller than per-process usage. Submitted by: Rudo Tomori	2012-11-15 14:11:34 +00:00
Alexander Motin	2fd4047f32	Fix bug in r242852 that prevented CPU from becoming idle if kernel built without SMP support.	2012-11-15 14:10:51 +00:00
Jeff Roberson	28d91af30f	- Implement run-time expansion of the KTR buffer via sysctl. - Implement a function to ensure that all preempted threads have switched back out at least once. Use this to make sure there are no stale references to the old ktr_buf or the lock profiling buffers before updating them. Reviewed by: marius (sparc64 parts), attilio (earlier patch) Sponsored by: EMC / Isilon Storage Division	2012-11-15 00:51:57 +00:00
Baptiste Daroussin	6f0a5dea71	Style fix MFC after: 1 day	2012-11-14 10:33:12 +00:00
Baptiste Daroussin	6f68699fbd	return ERANGE if the buffer is too small to contain the login as documented in the manpage Reviewed by: cognet, kib MFC after: 1 month	2012-11-14 10:32:12 +00:00
Mateusz Guzik	4419a8a88c	enterpgrp: get rid of pgrp2 variable and use KASSERT directly on pgfind result. pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels without INVARIANTS Approved by: trasz (mentor) MFC after: 1 week	2012-11-13 22:01:25 +00:00
Konstantin Belousov	552e993580	Regen	2012-11-13 12:53:41 +00:00
Konstantin Belousov	f13b5a0f01	Add the wait6(2) system call. It takes POSIX waitid()-like process designator to select a process which is waited for. The system call optionally returns siginfo_t which would be otherwise provided to SIGCHLD handler, as well as extended structure accounting for child and cumulative grandchild resource usage. Allow to get the current rusage information for non-exited processes as well, similar to Solaris. The explicit WEXITED flag is required to wait for exited processes, allowing for more fine-grained control of the events the waiter is interested in. Fix the handling of siginfo for WNOWAIT option for all wait*(2) family, by not removing the queued signal state. PR: standards/170346 Submitted by: "Jukka A. Ukkonen" <jau@iki.fi> MFC after: 1 month	2012-11-13 12:52:31 +00:00
Edward Tomasz Napierala	84590fd8e5	Don't divide by zero. Tested by: swills	2012-11-13 11:29:08 +00:00
Alexander Motin	2c27cb3a34	Several optimizations to sched_idletd(): - Do not try to steal load from other CPUs if there were no context switches on this CPU (i.e. it was idle all the time and woke up just for bus mastering or TLB shootdown). If current CPU was idle, then it is quite unlikely that some other CPU has load to steal. Under high I/O rate, when TLB shootdowns cause numerous CPU wakeups, on 24-CPU system load stealing code may consume up to 25% of all CPU time without giving any benefits. - Change code that implements spinning for load to restart spin in case of context switch. Previous code periodically called cpu_idle() even under high interrupt/context switch rate. - Raise spinning threshold to 10KHz, where it gives at least some effect that may be worth the consumed power. Reviewed by: jeff@	2012-11-10 07:02:57 +00:00
Alfred Perlstein	79f62ed690	Allow maxusers to scale on machines with large address space. Some hooks are added to clamp down maxusers and nmbclusters for small address space systems. VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on physical memory. VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory. These are set to the old values on i386 to preserve the clamping that was being done to all arches. Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override for the calculation on a MD basis. Currently no arch defines this. Reviewed by: peter MFC after: 2 weeks	2012-11-10 02:08:40 +00:00
Attilio Rao	bc2258da88	Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.	2012-11-09 18:02:25 +00:00
Marius Strobl	c882264c95	Make r242655 build on sparc64. While at it, make vm_{max,min}_kernel_address vm_offset_t as they should be.	2012-11-08 08:10:32 +00:00
Jeff Roberson	5e5c387373	- Change ULE to use dynamic slice sizes for the timeshare queue in order to further reduce latency for threads in this queue. This should help as threads transition from realtime to timeshare. The latency is bound to a max of sched_slice until we have more than sched_slice / 6 threads runnable. Then the min slice is allotted to all threads and latency becomes (nthreads - 1) * min_slice. Discussed with: mav	2012-11-08 01:46:47 +00:00
Kevin Lo	0f5e7edc14	Fix typo; s/ouput/output	2012-11-07 07:00:59 +00:00
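
The cluster shares described in commit ead46972a4 above can be checked with a little arithmetic. The following is a minimal sketch, not the kernel code: it assumes only what the commit message states (2K and 4K clusters get 1/4 of the overall mbuf memory limit each, 9K and 16K clusters get 1/6 each) and uses integer division. The function name `cluster_limits` and the dictionary keys are made up for illustration. The plain mbuf count is left out because the message does not give an exact formula for it.

```python
# Sketch of the cluster-count arithmetic described in commit ead46972a4.
# The shares (1/4 and 1/6) and the cluster sizes come from the commit message;
# the function name and dictionary keys are illustrative assumptions.

KIB = 1024


def cluster_limits(mbuf_mem_limit):
    """Return per-size cluster counts for an overall mbuf memory limit in bytes."""
    return {
        "2k": mbuf_mem_limit // 4 // (2 * KIB),    # 1/4 of the limit, 2K clusters
        "4k": mbuf_mem_limit // 4 // (4 * KIB),    # 1/4 of the limit, 4K clusters
        "9k": mbuf_mem_limit // 6 // (9 * KIB),    # 1/6 of the limit, 9K clusters
        "16k": mbuf_mem_limit // 6 // (16 * KIB),  # 1/6 of the limit, 16K clusters
    }


if __name__ == "__main__":
    # The two limits used as examples in the commit message: 512MB and 8GB.
    for label, limit in (("512MB", 512 * KIB**2), ("8GB", 8 * KIB**3)):
        print(label, cluster_limits(limit))
```

Under these assumptions the computed cluster counts match the two examples quoted in the commit message (for instance 65,536 2K clusters and 9,709 9K clusters for a 512MB limit).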


12936 Commits