freebsd-dev

Author	SHA1	Message	Date
Jaakko Heinonen	b1e1f725e7	Reject spaces and double quotation marks in device names. devctl(4) and devd(8) can't handle names with such characters properly. PR: bin/144736, kern/161912 Discussed with: imp, kib, pjd	2012-12-22 13:33:28 +00:00
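The rule in the commit above can be illustrated with a small userland sketch. The helper name `sketch_devname_acceptable` is hypothetical and this is not the kernel's actual check; it only assumes that a name is rejected when it contains a space or a double quotation mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not the kernel code): returns true when the
 * device name contains neither a space nor a double quotation mark.
 */
static bool
sketch_devname_acceptable(const char *name)
{
	assert(name != NULL);
	/* strpbrk() returns NULL when none of the listed characters occur. */
	return (strpbrk(name, " \"") == NULL);
}
```

A caller would reject the name (for example with an error return) whenever the helper yields false.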
Attilio Rao	cd2fe4e632	Fixup r240424: On entering KDB backends, the hijacked thread to run interrupt context can still be idlethread. At that point, without the panic condition, it can still happen that idlethread then will try to acquire some locks to carry on some operations. Skip the idlethread check on block/sleep lock operations when KDB is active. Reported by: jh Tested by: jh MFC after: 1 week	2012-12-22 09:37:34 +00:00
Attilio Rao	b1308d72c2	Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There is no evidence that consumers could cope with such a change in semantics, and this situation could eventually lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved. Tested by: pho Reviewed by: kib, mdf MFC after: 1 week	2012-12-21 13:14:12 +00:00
Dag-Erling Smørgrav	b5471c918f	Rewrite fdgrowtable() so common mortals can actually understand what it does and how, and add comments describing the data structures and explaining how they are managed.	2012-12-20 20:18:27 +00:00
Olivier Houchard	05d9035003	Create an architecture-agnostic buffer pool manager that uses uma(9) to manage a set of power-of-2 sized buffers for bus_dmamem_alloc(). This allows the caller to provide the back-end allocator uma allocator, allowing full control of the memory pages backing the pool. For convenience, it provides an optional builtin allocator that provides pages allocated with the VM_MEMATTR_UNCACHEABLE attribute, for managing pools of DMA buffers for BUS_DMA_COHERENT or BUS_DMA_NOCACHE. This also allows the caller to specify a minimum alignment, and it ensures that all buffers start on a boundary and have a length that's a multiple of that value, to avoid using buffers that trigger partial cache line flushes. Submitted by: Ian Lepore <freebsd@damnhippie.dyndns.org>	2012-12-20 00:34:54 +00:00
Pawel Jakub Dawidek	c345faea5a	Replace expand_name() function with corefile_open() function, which not only returns the name, but also the vnode of the corefile to use. This simplifies the code and closes a few races, especially in %I handling. Reviewed by: kib Obtained from: WHEEL Systems	2012-12-19 23:59:48 +00:00
Pawel Jakub Dawidek	22a5d85aa9	Use correct file permissions when looking for available core file if kern.corefile contains %I. Obtained from: WHEEL Systems	2012-12-19 23:40:02 +00:00
Jeff Roberson	4c44811c9d	- Add new machine parsable KTR macros for timing events. - Use this new format to automatically handle syscalls and VOPs. This changes the earlier format but is still human readable. Sponsored by: EMC / Isilon Storage Division	2012-12-19 20:10:00 +00:00
Jeff Roberson	5b39d5c739	- Correctly handle EWOULDBLOCK in quiesce_cpus Discussed with: mav	2012-12-19 20:08:06 +00:00
Pawel Jakub Dawidek	07a8e07896	The 'flags' argument can be modified in vn_open_cred(), so we need to set it for every loop iteration. Pointed out by: kib	2012-12-19 12:14:08 +00:00
Pawel Jakub Dawidek	cc58032c44	Do not audit paths we try when kern.corefile contains %I. Obtained from: WHEEL Systems	2012-12-19 12:12:53 +00:00
Pawel Jakub Dawidek	29146f1a7a	Style cleanups.	2012-12-19 12:10:14 +00:00
Pawel Jakub Dawidek	086053a370	The expand_name() function isn't called with the process lock held anymore, so we can safely use malloc(M_WAITOK) now. Pointed out by: kib	2012-12-19 12:00:09 +00:00
Mateusz Guzik	af3c786c47	prison_racct_detach can be called for not fully initialized jail, so make it check that the jail has racct before doing anything PR: kern/174436 Reviewed by: trasz MFC after: 3 days	2012-12-18 18:34:36 +00:00
Andrey Zonov	5eb0d2838c	- Add sysctl to allow unprivileged users to call mlock(2)-family system calls and turn it on. - Do not allow to call them inside jail. [1] Pointed out by: trasz [1] Reviewed by: avg Approved by: kib (mentor) MFC after: 1 week	2012-12-18 07:36:45 +00:00
Pawel Jakub Dawidek	f06f465db7	Minor style tweaks. Obtained from: WHEEL Systems	2012-12-17 10:51:22 +00:00
Pawel Jakub Dawidek	c52ff61196	Better variables naming in expand_name() to be more consistent with coredump(). Obtained from: WHEEL Systems	2012-12-17 10:48:10 +00:00
Pawel Jakub Dawidek	dd57ce87eb	Move expand_name() after process lock is released. This fixed panic where we hold mutex (process lock) and try to obtain sleepable lock (vnode lock in expand_name()). The panic could occur when %I was used in kern.corefile. Additionally we avoid expand_name() overhead when coredumps are disabled. Obtained from: WHEEL Systems	2012-12-16 14:53:27 +00:00
Pawel Jakub Dawidek	2ce1b32df2	Don't add audit record when coredumps are disabled or name cannot be expanded. Discussed with: rwatson Obtained from: WHEEL Systems	2012-12-16 14:24:59 +00:00
Pawel Jakub Dawidek	7e73ee85ab	Make the check easier to read. Obtained from: WHEEL Systems	2012-12-16 14:14:18 +00:00
Pawel Jakub Dawidek	b039f8c2aa	Use 'cred' variable. Obtained from: WHEEL Systems	2012-12-16 13:56:38 +00:00
Konstantin Belousov	14df601e47	When mnt_vnode_next_active iterator cannot lock the next vnode and yields, specify the user priority for the yield. Otherwise, a higher-priority (kernel) thread could fall into the priority-inversion with the thread owning the mutex lock. On single-processor machines or UP kernels, do not loop adaptively when the next vnode cannot be locked, instead yield unconditionally. Restructure the iteration initializer and the iterator to remove code duplication. Put the code to fetch and lock a vnode next to the current marker, into the mnt_vnode_next_active() function, and use it instead of repeating the loop. Reported by: hrs, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2012-12-15 02:04:46 +00:00
Konstantin Belousov	d4015944e7	Remove a special case for XEN, which is erroneous and makes vfork(2) behaviour differ from the documented one, only on XEN. If there are any issues with XEN pmap left, they should be fixed in pmap. MFC after: 2 weeks	2012-12-15 02:02:11 +00:00
Rick Macklem	f1c4014cd5	The group list for a non-default export entry (a host/subnet one) was being copied from the wrong place. This patch fixes that. This could cause access failures for mapped users, when the group permissions were needed. PR: 147998 Submitted by: Christopher Key (cjk32 at cam.ac.uk) MFC after: 2 weeks	2012-12-14 21:49:06 +00:00
Alfred Perlstein	15d32bd543	Cleanup more of the kassert_panic. fix compile warnings on !amd64 and NULL derefs that would happen if kassert_panic() would return.	2012-12-11 07:08:14 +00:00
Alfred Perlstein	c2c5ede903	Fix WITNESS when INVARIANT_SUPPORT is defined. This fixes tinderbox breakage from r244105. Pointed out by: adrian	2012-12-11 05:59:16 +00:00
Alfred Perlstein	6b6bd3b704	Switch the hardwired WITNESS panics to kassert_panic. This is an ongoing effort to provide runtime debug information useful in the field that does not panic existing installations. This gives us the flexibility needed when shipping images to a potentially large audience with WITNESS enabled without worrying about formerly non-fatal LORs hurting a release. Sponsored by: iXsystems	2012-12-11 01:23:50 +00:00
Alfred Perlstein	d3bfafb4f6	back out half of 244098. kern.bootfile needs to be rw for installkernel. Pointed out by: kib, flo	2012-12-11 00:10:20 +00:00
Alfred Perlstein	a94053ba39	allow KASSERT to enter KDB.	2012-12-10 23:11:26 +00:00
Alfred Perlstein	d06cadae1e	make sysctls kern.{bootfile,conftxt} read-only MFC after: 1 month	2012-12-10 23:09:55 +00:00
Konstantin Belousov	686ffcaceb	Do not yield while owning a mutex. The Giant reacquire in the kern_yield() is problematic then. The owned mutex is the mount interlock, and it is in fact not needed to guarantee the stability of the mount list of active vnodes, so fix the issue by only taking the mount interlock for MNT_REF and MNT_REL operations. While there, augment the unconditional yield by some amount of spinning [1]. Reported and tested by: pho Reviewed by: attilio Submitted by: attilio [1] MFC after: 3 days	2012-12-10 20:44:09 +00:00
Andre Oppermann	0060bab556	Prevent long type overflow of realmem calculation on ILP32 by forcing calculation to be in quad_t space. Fix style issue with second parameter to qmin(). Reported by: alc Reviewed by: bde, alc	2012-12-10 12:19:03 +00:00
Konstantin Belousov	5d439a2957	Do not ignore zero address, possibly returned by the vm_map_find() call. The function indicates a failure by the TRUE return value. To be extra safe, assert that the return value from the following vm_map_insert() indicates success. Fix style issues in the nearby lines, reformulate the comment. Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2012-12-10 05:14:04 +00:00
Konstantin Belousov	17cb8cfc31	Remove useless comment. MFC after: 3 days	2012-12-09 20:34:11 +00:00
Konstantin Belousov	796fa4fb86	Fix typo. MFC after: 3 days	2012-12-09 20:26:51 +00:00
Attilio Rao	e68ccbe85e	Add a comment on why inlining critical_enter() may not be a good idea for the general case. Reviewed by: bde MFC after: 1 week	2012-12-09 04:54:22 +00:00
Pawel Jakub Dawidek	6e0b674628	Configure UMA warnings for the following zones: - unp_zone: kern.ipc.maxsockets limit reached - socket_zone: kern.ipc.maxsockets limit reached - zone_mbuf: kern.ipc.nmbufs limit reached - zone_clust: kern.ipc.nmbclusters limit reached - zone_jumbop: kern.ipc.nmbjumbop limit reached - zone_jumbo9: kern.ipc.nmbjumbo9 limit reached - zone_jumbo16: kern.ipc.nmbjumbo16 limit reached Note that those warnings are printed no more often than every five minutes and can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0. Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks	2012-12-07 22:30:30 +00:00
Pawel Jakub Dawidek	45fe0bf7e4	Make use of the fact that uma_zone_set_max(9) already returns actual limit set.	2012-12-07 22:23:53 +00:00
Pawel Jakub Dawidek	4007b61cde	More style cleanups.	2012-12-07 22:22:04 +00:00
Pawel Jakub Dawidek	b0b1402537	Style cleanups.	2012-12-07 22:19:41 +00:00
Pawel Jakub Dawidek	94b0ae5d62	- Make socket_zone static - it is used only in this file. - Update maxsockets on uma_zone_set_max(). Obtained from: WHEEL Systems	2012-12-07 22:15:51 +00:00
Pawel Jakub Dawidek	68412f4179	Style cleanups.	2012-12-07 22:13:33 +00:00
Pawel Jakub Dawidek	0b746181a2	There is no need anymore to include vm/uma.h after r241726. Obtained from: WHEEL Systems	2012-12-07 22:05:42 +00:00
Alfred Perlstein	3945a96431	Allow KASSERT to log instead of panic. This is to allow debug images to be used without taking down the system when non-fatal asserts are hit. The following sysctls are added: debug.kassert.warn_only: 1 = log, 0 = panic debug.kassert.do_ktr: set to a ktr mask for logging via KTR debug.kassert.do_log: 1 = log, 0 = quiet debug.kassert.warnings: stats, number of kasserts hit debug.kassert.log_panic_at: number of kasserts before we actually panic, 0 = never debug.kassert.log_pps_limit: pps limit for log messages debug.kassert.log_mute_at: stop warning after N kasserts, 0 = never stop debug.kassert.kassert: set this sysctl to trigger a kassert Discussed with: scottl, gnn, marcel Sponsored by: iXsystems	2012-12-07 08:25:08 +00:00
Alfred Perlstein	3356d129ad	Use uint instead of int for flags exported via sysctl.	2012-12-07 05:55:48 +00:00
Kevin Lo	b08d12d9be	- according to POSIX, make socket(2) return EAFNOSUPPORT rather than EPROTONOSUPPORT if the address family is not supported. - introduce pffinddomain() to find a domain by family and use it as appropriate. Reviewed by: glebius	2012-12-07 02:22:48 +00:00
David Xu	3f6bad0181	Eliminate superfluous code.	2012-12-06 06:29:08 +00:00
Attilio Rao	bdf9120c16	Fixup r243901: - As the comment reports, CALLOUT_LOCAL_ALLOC cannot be checked directly from the callout flags but might be checked by a cached value. Hence, do so before actually removing the callout, when needed, in softclock_call_cc(). - In softclock_call_cc() also add a comment in the waiting and deferred migration case explaining that the dereference should be safe because of the migration dereference invariants. Additionally: - In softclock_call_cc(), for the deferred migration case, move all the accesses to callout structure after the comment stating the callout must not be destroyed. - For consistency with this last tweak, use cached c_flags for the KASSERT() in the deferred migration case. It is not strictly necessary but this way all the callout accesses happen after the above mentioned comment, improving consistency. Pointy hat to: me Sponsored by: Isilon Systems / EMC Corporation Reviewed by: kib MFC after: 2 weeks X-MFC: 243901	2012-12-05 22:32:12 +00:00
Konstantin Belousov	eb8a718686	The softclock_call_cc() is executing with the callout already removed from the callwheel. Calculate the cc->cc_next before removing the callout, otherwise the code followed the invalid tailq links. After this, make softclock_call_cc() return void, since it always return cc->cc_next, which is immediately available to the softclock() anyway. This also allows to eliminate a label under #ifdef SMP. Remove the assignment of cc->cc_next from callout_cc_del(), since the function is called with the callout already removed from callwheel. If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag. Postpone the free of the timeout(9) allocated callouts after the migration checks are done. Add some more strict asserts about the state of the callout in callout_call_cc(). Reviewed by: attilio Reported and tested by: pho (previous version) MFC after: 2 weeks	2012-12-05 19:02:22 +00:00
Attilio Rao	1c7d98d0df	Check for lockmgr recursion in case of disown and downgrade and panic also in !debugging kernel rather than having "undefined" behaviour. Tested by: avg MFC after: 1 week	2012-12-05 15:11:01 +00:00
Gleb Smirnoff	eb1b1807af	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually	2012-12-05 08:04:20 +00:00
Konstantin Belousov	f7e50ea722	Fix a race between kern_setitimer() and realitexpire(), where the callout is started before kern_setitimer() acquires process mutex, but loses a race and kern_setitimer() gets the process mutex before the callout. Then, assuming that new specified struct itimerval has it_interval zero, but it_value non-zero, the callout, after it starts executing again, clears p->p_realtimer.it_value, but kern_setitimer() already rescheduled the callout. As the result of the race, both p_realtimer is zero, and the callout is rescheduled. Then, in the exit1(), the exit code sees that it_value is zero and does not even try to stop the callout. This allows the struct proc to be reused and eventually the armed callout is re-initialized. The consequence is the corrupted callwheel tailq. Use process mutex to interlock the callout start, which fixes the race. Reported and tested by: pho Reviewed by: jhb MFC after: 2 weeks	2012-12-04 20:49:39 +00:00
Konstantin Belousov	9bdf6ccab3	Do not allocate buffer of the 255 bytes length on the stack. Reported and tested by: sig6247@gmail.com MFC after: 1 week	2012-12-04 20:49:04 +00:00
Alfred Perlstein	922314f018	replace bit shifting loop with 1<<fls(n), improve comments. Reviewed by: davide	2012-12-04 05:28:20 +00:00
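A sketch of the replacement described above, under stated assumptions: the original loop is not shown in the log, so `pow2_loop` is a guess at its shape, and `sketch_fls` is a portable stand-in for `fls(3)` written here so the example does not depend on the platform's headers.

```c
#include <assert.h>
#include <limits.h>

/*
 * Portable stand-in for fls(3): 1-based index of the most significant
 * set bit, or 0 when no bit is set.
 */
static int
sketch_fls(unsigned int n)
{
	int bit = 0;

	while (n != 0) {
		bit++;
		n >>= 1;
	}
	return (bit);
}

/* Assumed loop form: smallest power of two strictly greater than n. */
static unsigned int
pow2_loop(unsigned int n)
{
	unsigned int p = 1;

	assert(n <= UINT_MAX / 2);	/* keep the shift from overflowing */
	while (p <= n)
		p <<= 1;
	return (p);
}

/* Same result computed in one step with 1 << fls(n). */
static unsigned int
pow2_fls(unsigned int n)
{
	assert(n <= UINT_MAX / 2);	/* 1u << 32 would be undefined */
	return (1u << sketch_fls(n));
}
```

Both forms agree for every input within the asserted range; the second simply avoids the loop.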
Konstantin Belousov	07840861b1	The vnode_free_list_mtx is required unconditionally when iterating over the active list. The mount interlock is not enough to guarantee the validity of the tailq link pointers. The __mnt_vnode_next_active() and __mnt_vnode_first_active() active list iterator helper functions did not provide the necessary stability for the list, allowing the iterators to pick garbage. This was uncovered after the r243599 made the active list iterators non-nop. Since a vnode interlock is before the vnode_free_list_mtx, obtain the vnode ilock in the non-blocking manner when under vnode_free_list_mtx, and restart iteration after the yield if the lock attempt failed. Assert that a vnode found on the list is active, and assert that the helpers return the vnode with interlock owned. Reported and tested by: pho MFC after: 1 week	2012-12-03 22:15:16 +00:00
Pawel Jakub Dawidek	8909f88d28	Fix one more compilation issue.	2012-12-01 08:59:36 +00:00
Pawel Jakub Dawidek	499f0f4d55	IFp4 @208451: Fix path handling for *at() syscalls. Before the change directory descriptor was totally ignored, so the relative path argument was appended to current working directory path and not to the path provided by descriptor, thus wrong paths were stored in audit logs. Now that we use directory descriptor in vfs_lookup, move AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where we hold file descriptors table lock, so we are sure paths will be resolved according to the same directory in audit record and in actual operation. Sponsored by: FreeBSD Foundation (auditdistd) Reviewed by: rwatson MFC after: 2 weeks	2012-11-30 23:18:49 +00:00
Pawel Jakub Dawidek	e1216d1335	IFp4 @208450: Remove redundant call to AUDIT_ARG_UPATH1(). Path will be remembered by the following NDINIT(AUDITVNODE1) call. Sponsored by: FreeBSD Foundation (auditdistd) MFC after: 2 weeks	2012-11-30 22:49:28 +00:00
Andre Oppermann	df905a2bd3	Using a long is the wrong type to represent the realmem and maxmbufmem variable as they may overflow on i386/PAE and i386 with > 2GB RAM. Use 64bit quad_t instead. It has broader kernel infrastructure support with TUNABLE_QUAD_FETCH() and qmin/qmax() than other available types. Pointed out by: alc, bde	2012-11-29 07:30:42 +00:00
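The overflow concern above can be made concrete with a small sketch. It uses `int64_t` in place of the kernel's `quad_t` and `INT32_MAX` as the limit of a signed `long` on ILP32; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte count computed in 64-bit space, standing in for quad_t. */
static int64_t
sketch_realmem_bytes(int64_t pages, int64_t page_size)
{
	assert(pages >= 0 && page_size > 0);
	return (pages * page_size);
}

/* Would this value still fit in a 32-bit signed long (ILP32)? */
static bool
sketch_fits_ilp32_long(int64_t bytes)
{
	return (bytes <= INT32_MAX);
}
```

With 4096-byte pages, 1,048,576 pages (4 GB) gives 4,294,967,296 bytes, which does not fit in a 32-bit `long`, while 262,144 pages (1 GB) does.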
Andre Oppermann	416a434cd0	Complete r243631 by applying the remainder of kern_mbuf.c that got lost while merging into the commit tree. MFC after: 1 month X-MFC-with: r243631	2012-11-27 23:16:56 +00:00
Andre Oppermann	358c7f47da	Fix r243627 by testing against the head socket instead of the socket just created. MFC after: 1 week X-MFC-with: r243627	2012-11-27 22:35:48 +00:00
Andre Oppermann	ead46972a4	Base the mbuf related limits on the available physical memory or kernel memory, whichever is lower. The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM. The limit is set to half of the physical RAM or KVM (whichever is lower) as the baseline. In any normal scenario we want to leave at least half of the physmem/kvm for other kernel functions and userspace to prevent it from swapping too easily. Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm. At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets. Tidy up ordering in init_param2() and check up on some users of those values calculated here. Out of the overall mbuf memory limit 2K clusters and 4K (page size) clusters to get 1/4 each because these are the most heavily used mbuf sizes. 2K clusters are used for MTU 1500 ethernet inbound packets. 4K clusters are used whenever possible for sends on sockets and thus outbound packets. The larger cluster sizes of 9K and 16K are limited to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used these large clusters will end up only on the inbound path. They are not used on outbound, there it's still 4K. Yes, that will stay that way because otherwise we run into lots of complications in the stack. And it really isn't a problem, so don't make a scene. Normal mbufs (256B) weren't limited at all previously. This was problematic as there are certain places in the kernel that on allocation failure of clusters try to piece together their packet from smaller mbufs. The mbuf limit is the number of all other mbuf sizes together plus some more to allow for standalone mbufs (ACK for example) and to send off a copy of a cluster. 
Unfortunately there isn't a way to set an overall limit for all mbuf memory together as UMA doesn't support such a limiting. NB: Every cluster also has an mbuf associated with it. Two examples on the revised mbuf sizing limits: 1GB KVM: 512MB limit for mbufs 419,430 mbufs 65,536 2K mbuf clusters 32,768 4K mbuf clusters 9,709 9K mbuf clusters 5,461 16K mbuf clusters 16GB RAM: 8GB limit for mbufs 33,554,432 mbufs 1,048,576 2K mbuf clusters 524,288 4K mbuf clusters 155,344 9K mbuf clusters 87,381 16K mbuf clusters These defaults should be sufficient for even the most demanding network loads. MFC after: 1 month	2012-11-27 21:19:58 +00:00
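The cluster figures in the two examples above can be reproduced from the stated fractions. This is a sketch only: the struct and function names are hypothetical, the cluster sizes are taken as 2048, 4096, 9216 and 16384 bytes, and the plain mbuf count is left out because the message does not give a single formula for it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the per-size cluster limits. */
struct sketch_cluster_limits {
	int64_t clusters_2k;
	int64_t clusters_4k;
	int64_t clusters_9k;
	int64_t clusters_16k;
};

/* Baseline: half of physical RAM or KVM, whichever is lower. */
static int64_t
sketch_maxmbufmem(int64_t physmem, int64_t kvm)
{
	int64_t lower = physmem < kvm ? physmem : kvm;

	return (lower / 2);
}

/* 2K and 4K clusters get 1/4 of the limit each, 9K and 16K get 1/6 each. */
static struct sketch_cluster_limits
sketch_limits(int64_t maxmbufmem)
{
	struct sketch_cluster_limits l;

	assert(maxmbufmem >= 0);
	l.clusters_2k = maxmbufmem / 4 / 2048;
	l.clusters_4k = maxmbufmem / 4 / 4096;
	l.clusters_9k = maxmbufmem / 6 / 9216;
	l.clusters_16k = maxmbufmem / 6 / 16384;
	return (l);
}
```

For a 512 MB limit this yields 65,536, 32,768, 9,709 and 5,461 clusters, matching the 1GB KVM example; for an 8 GB limit it yields 1,048,576, 524,288, 155,344 and 87,381.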
Andre Oppermann	2c3142c82c	Fix a race on listen socket teardown where while draining the accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK. The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's. Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com> Found by: His team. MFC after: 1 week	2012-11-27 20:04:52 +00:00
Pawel Jakub Dawidek	b0c9d4d70e	Add kern.capmode_coredump sysctl/tunable to allow processes in capability mode to dump core. Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks	2012-11-27 10:38:11 +00:00
Pawel Jakub Dawidek	f121e3e81d	- Add NOCAPCHECK flag to namei that allows lookup to work even if the process is in capability mode. - Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() that will be converted into NOCAPCHECK namei flag. This functionality will be used to enable core dumps for sandboxed processes. Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks	2012-11-27 10:32:35 +00:00
Pawel Jakub Dawidek	90b2202145	Regenerate after r243610.	2012-11-27 10:25:03 +00:00
Pawel Jakub Dawidek	8890f5d020	Allow use of kill(2) in capability mode, but a process can send a signal only to itself. For example abort(3) at first tries to do kill(getpid(), SIGABRT) which was failing in capability mode, so the code was falling back to exit(1). Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks	2012-11-27 10:22:40 +00:00
Pawel Jakub Dawidek	b62d05fcf9	Allow to modify kern.sugid_coredump and kern.corefile from loader.conf. Obtained from: WHEEL Systems	2012-11-27 10:16:48 +00:00
Pawel Jakub Dawidek	c320984687	More style fixes.	2012-11-27 10:15:58 +00:00
Pawel Jakub Dawidek	23c6445a4b	Style fixes (mostly whitespaces).	2012-11-27 10:11:54 +00:00
David Xu	3da9ab75f4	Take first active vnode correctly. Reviewed by: kib MFC after: 3 days	2012-11-27 06:07:58 +00:00
Pawel Jakub Dawidek	4f66641749	Look for zombie process only if we were given process id. Reviewed by: kib MFC after: 2 weeks X-MFC-after-or-with: 243142	2012-11-25 19:31:42 +00:00
Andriy Gapon	6898bee9a9	remove stop_scheduler_on_panic knob There has not been any complaints about the default behavior, so there is no need to keep a knob that enables the worse alternative. Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu spinlock-like logic can be dropped, because only a single CPU is supposed to win stop_cpus_hard(other_cpus) race and proceed past that call. MFC after: 1 month	2012-11-25 14:22:08 +00:00
Andriy Gapon	6b991098a7	assert_vop_locked: make the assertion race-free and more efficient this is really a minor improvement for the sake of correctness MFC after: 6 days	2012-11-24 13:11:47 +00:00
Andriy Gapon	4f15bb6730	remove vop_lookup_pre and vop_lookup_post Suggested by: kib MFC after: 5 days	2012-11-22 10:36:10 +00:00
Konstantin Belousov	daee0f0b0b	Schedule garbage collection run for the in-flight rights passed over the unix domain sockets to the next tick, coalescing the serial calls until the collection fires. The thought is that more work for the collector could arise in the near time, allowing to clean more and not spend too much CPU on repeated collection when there is no garbage. Currently the collection task is fired immediately upon unix domain socket close if there are any rights in flight, which caused excessive CPU usage and too long blocking of the threads waiting for unp_list_lock and unp_link_rwlock in write mode. Robert noted that it would be nice if we could find some heuristic by which we decide whether to run GC a bit more quickly. E.g., if the number of UNIX domain sockets is close to its resource limit, but not quite. Reported and tested by: Markus Gebert <markus.gebert@hostpoint.ch> Reviewed by: rwatson MFC after: 2 weeks	2012-11-20 15:45:48 +00:00
Konstantin Belousov	b7c8d2f2f5	Add a special meaning to the negative ticks argument for taskqueue_enqueue_timeout(). Do not rearm the callout if it is already armed and the ticks is negative. Otherwise rearm it to fire in abs(ticks) ticks in the future. The intended use is to call taskqueue_enqueue_timeout() for the given timeout_task with the same negative ticks argument. As result, the task is scheduled to execute not further than abs(ticks) ticks in future, and the consequent enqueues are coalesced until the already scheduled task is finished. Reviewed by: rwatson Tested by: Markus Gebert <markus.gebert@hostpoint.ch> MFC after: 2 weeks	2012-11-20 15:33:48 +00:00
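The negative-ticks rule described above can be modelled as a pure function. This is a userland model with hypothetical names, not the taskqueue(9) implementation: it only tracks whether a timeout is armed and when it would fire.

```c
#include <stdbool.h>

/*
 * Model of the rule: returns the tick at which the task fires after an
 * enqueue request made at time `now`.
 *  - ticks < 0 and already armed: keep the existing deadline.
 *  - otherwise: (re)arm to fire abs(ticks) ticks from now.
 */
static int
sketch_next_deadline(bool armed, int current_deadline, int now, int ticks)
{
	if (ticks < 0) {
		if (armed)
			return (current_deadline);	/* coalesce */
		ticks = -ticks;
	}
	return (now + ticks);
}
```

Calling it repeatedly with the same negative value therefore never pushes the deadline later, which is the coalescing behaviour the commit describes.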
Attilio Rao	973b795b64	insmntque() is always called with the lock held in exclusive mode, then: - assume the lock is held in exclusive mode and remove a moot check about the lock acquisition. - in the destructor remove !MPSAFE specific chunk. Reviewed by: kib MFC after: 2 weeks	2012-11-19 20:43:19 +00:00
Andriy Gapon	ab49c952d9	assert_vop_locked should treat LK_EXCLOTHER as the not locked case ... from a perspective of the current thread. Spotted by: mjg Discussed with: kib MFC after: 18 days	2012-11-19 11:35:56 +00:00
Andriy Gapon	c496727c54	vnode_if: fix locking protocol description for lookup and cachedlookup Also remove the checks from vop_lookup_pre and vop_lookup_post, which are now completely redundant (before this change they were partially redundant). Discussed with: kib MFC after: 10 days	2012-11-19 11:32:56 +00:00
Mateusz Guzik	dd103d4d06	Fix possible fp reference leak in posix_openpt Reviewed by: ed Approved by: trasz (mentor) MFC after: 3 days	2012-11-18 15:48:34 +00:00
Gleb Smirnoff	716963cb5d	Update comment.	2012-11-16 14:00:54 +00:00
Konstantin Belousov	134eb42e24	In pget(9), if PGET_NOTWEXIT flag is not specified, also search the zombie list for the pid. This allows several kern.proc sysctls to report useful information for zombies. Hold the allproc_lock around all searches instead of relocking it. Remove private pfind_locked() from the new nfs client code. Requested and reviewed by: pjd Tested by: pho MFC after: 3 weeks	2012-11-16 08:25:06 +00:00
Konstantin Belousov	ea293f3f1d	Restore the proper handling of the pid 0 for waitpid(2). Fix the style around. Reported and reviewed by: bde (previous version) MFC after: 28 days	2012-11-16 06:32:38 +00:00
Konstantin Belousov	a2a8559624	Style fixes for r242958. Reported and reviewed by: bde MFC after: 28 days	2012-11-16 06:22:14 +00:00
Edward Tomasz Napierala	baf85d0a22	Improve KASSERT messages in racct, to make it clear which resource caused the problem. Submitted by: mjg	2012-11-15 15:55:49 +00:00
Edward Tomasz Napierala	84c9193ba0	Fix kassert that's not really valid for %CPU accounting. The problem here is race between decaying the resource usage in containers, and updating per-process usage; basically, the former may cause per-container usage to get smaller than per-process usage. Submitted by: Rudo Tomori	2012-11-15 14:11:34 +00:00
Alexander Motin	2fd4047f32	Fix bug in r242852 that prevented CPU from becoming idle if kernel built without SMP support.	2012-11-15 14:10:51 +00:00
Jeff Roberson	28d91af30f	- Implement run-time expansion of the KTR buffer via sysctl. - Implement a function to ensure that all preempted threads have switched back out at least once. Use this to make sure there are no stale references to the old ktr_buf or the lock profiling buffers before updating them. Reviewed by: marius (sparc64 parts), attilio (earlier patch) Sponsored by: EMC / Isilon Storage Division	2012-11-15 00:51:57 +00:00
Baptiste Daroussin	6f0a5dea71	Style fix MFC after: 1 day	2012-11-14 10:33:12 +00:00
Baptiste Daroussin	6f68699fbd	return ERANGE if the buffer is too small to contain the login as documented in the manpage Reviewed by: cognet, kib MFC after: 1 month	2012-11-14 10:32:12 +00:00
Mateusz Guzik	4419a8a88c	enterpgrp: get rid of pgrp2 variable and use KASSERT directly on pgfind result. pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels without INVARIANTS Approved by: trasz (mentor) MFC after: 1 week	2012-11-13 22:01:25 +00:00
Konstantin Belousov	552e993580	Regen	2012-11-13 12:53:41 +00:00
Konstantin Belousov	f13b5a0f01	Add the wait6(2) system call. It takes POSIX waitid()-like process designator to select a process which is waited for. The system call optionally returns siginfo_t which would be otherwise provided to SIGCHLD handler, as well as extended structure accounting for child and cumulative grandchild resource usage. Allow to get the current rusage information for non-exited processes as well, similar to Solaris. The explicit WEXITED flag is required to wait for exited processes, allowing for more fine-grained control of the events the waiter is interested in. Fix the handling of siginfo for WNOWAIT option for all wait*(2) family, by not removing the queued signal state. PR: standards/170346 Submitted by: "Jukka A. Ukkonen" <jau@iki.fi> MFC after: 1 month	2012-11-13 12:52:31 +00:00
Edward Tomasz Napierala	84590fd8e5	Don't divide by zero. Tested by: swills	2012-11-13 11:29:08 +00:00
Alexander Motin	2c27cb3a34	Several optimizations to sched_idletd(): - Do not try to steal load from other CPUs if there were no context switches on this CPU (i.e. it was idle all the time and woke up just for bus mastering or TLB shutdown). If current CPU was idle, then it is quite unlikely that some other CPU has load to steal. Under high I/O rate, when TLB shutdowns cause numerous CPU wakeups, on 24-CPU system load stealing code may consume up to 25% of all CPU time without giving any benefits. - Change code that implements spinning for load to restart spin in case of context switch. Previous code periodically called cpu_idle() even under high interrupt/context switch rate. - Raise spinning threshold to 10KHz, where it gives at least some effect that may be worth the consumed power. Reviewed by: jeff@	2012-11-10 07:02:57 +00:00
Alfred Perlstein	79f62ed690	Allow maxusers to scale on machines with large address space. Some hooks are added to clamp down maxusers and nmbclusters for small address space systems. VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on physical memory. VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory. These are set to the old values on i386 to preserve the clamping that was being done to all arches. Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override for the calculation on a MD basis. Currently no arch defines this. Reviewed by: peter MFC after: 2 weeks	2012-11-10 02:08:40 +00:00
Attilio Rao	bc2258da88	Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.	2012-11-09 18:02:25 +00:00
Marius Strobl	c882264c95	Make r242655 build on sparc64. While at it, make vm_{max,min}_kernel_address vm_offset_t as they should be.	2012-11-08 08:10:32 +00:00
Jeff Roberson	5e5c387373	- Change ULE to use dynamic slice sizes for the timeshare queue in order to further reduce latency for threads in this queue. This should help as threads transition from realtime to timeshare. The latency is bound to a max of sched_slice until we have more than sched_slice / 6 threads runnable. Then the min slice is allotted to all threads and latency becomes (nthreads - 1) * min_slice. Discussed with: mav	2012-11-08 01:46:47 +00:00
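One way to read the slice rule above is sketched below. It rests on assumptions not spelled out in the message: the minimum slice is taken to be `sched_slice / 6`, the slice is taken to shrink in proportion to the number of other runnable threads until that floor is reached, and all names are hypothetical.

```c
#include <assert.h>

#define SKETCH_MIN_DIVISOR 6	/* assumed: min slice = sched_slice / 6 */

/* Slice given to a thread when `load` other threads are runnable. */
static int
sketch_slice(int sched_slice, int load)
{
	int min_slice = sched_slice / SKETCH_MIN_DIVISOR;

	assert(sched_slice > 0 && load >= 0);
	if (load >= SKETCH_MIN_DIVISOR)
		return (min_slice);
	if (load <= 1)
		return (sched_slice);
	return (sched_slice / load);
}

/* Worst case wait: each of the other threads runs one slice first. */
static int
sketch_worst_latency(int sched_slice, int nthreads)
{
	assert(nthreads >= 1);
	return ((nthreads - 1) * sketch_slice(sched_slice, nthreads - 1));
}
```

With `sched_slice` set to 12 ticks, four runnable threads wait at most 12 ticks, while ten threads wait (10 - 1) * 2 = 18 ticks, in line with the (nthreads - 1) * min_slice figure.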
Kevin Lo	0f5e7edc14	Fix typo; s/ouput/output	2012-11-07 07:00:59 +00:00
Alfred Perlstein	fc6874bcbb	Export VM_MIN_KERNEL_ADDRESS and VM_MAX_KERNEL_ADDRESS via sysctl. On several platforms they are determined by too many nested #defines to be easily discernible. This will aid in development of auto-tuning.	2012-11-06 04:10:32 +00:00
Konstantin Belousov	76fd782cd9	A clarification to the behaviour of the active vnode list management regarding the vnode page cleaning. In collaboration with: pho MFC after: 1 week	2012-11-05 16:40:42 +00:00
Konstantin Belousov	90af57930c	Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command. MFC after: 3 weeks	2012-11-04 13:33:13 +00:00
Konstantin Belousov	fb81941575	Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command. MFC after: 3 days	2012-11-04 13:32:45 +00:00
Konstantin Belousov	df3161c7df	Order the enumeration of the MNT_ flags to be the same as the order of their definitions. MFC after: 3 days	2012-11-04 13:31:41 +00:00
Ed Schouten	305921c48e	Add tty_set_winsize(). This removes some of the signalling magic from the Syscons driver and puts it in the TTY layer, where it belongs.	2012-11-03 22:21:37 +00:00
Attilio Rao	19d4153329	Merge r242395,242483 from mutex implementation: give rwlock(9) the ability to crunch different types of structures, with the only constraint that they have a lock cookie named rw_lock. This name, then, becomes reserved for the struct that wants to use the rwlock(9) KPI and other locking primitives cannot reuse it for their members. Namely such structs are the current struct rwlock and the new struct rwlock_padalign. The new structure will define an object which has the same layout as a struct rwlock but will be allocated in areas aligned to the cache line size and will be as big as a cache line. For further details check comments on above mentioned revisions. Reviewed by: jimharris, jeff	2012-11-03 15:57:37 +00:00
Alfred Perlstein	5a3a8ec037	Merge 242488, better use of strlcpy. Submitted by: Eric van Gyzen <eric@vangyzen.net>	2012-11-02 18:57:38 +00:00
Konstantin Belousov	140dedb81c	The r241025 fixed the case when a binary, executed from nullfs mount, was still possible to open for write from the lower filesystem. There is a symmetric situation where the binary could already have file descriptors opened for write, but it can be executed from the nullfs overlay. Handle the issue by passing one v_writecount reference to the lower vnode if nullfs vnode has non-zero v_writecount. Note that only one write reference can be donated, since nullfs only keeps one use reference on the lower vnode. Always use the lower vnode v_writecount for the checks. Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT to manipulate the v_writecount value, which manages a single bypass reference to the lower vnode. Calling the VOPs instead of directly accessing v_writecount provides the fix described in the previous paragraph. Tested by: pho MFC after: 3 weeks	2012-11-02 13:56:36 +00:00
Alfred Perlstein	bad7e7f3dd	Provide a device name in the sysctl tree for programs to query the state of crashdump target devices. This will be used to add a "-l" (ell) flag to dumpon(8) to list the currently configured dumpdev. Reviewed by: phk	2012-11-01 17:01:05 +00:00
Attilio Rao	4ceaf45de5	Rework the known mutexes to benefit from staying on their own cache line in order to avoid manual frobbing but using struct mtx_padalign. The sole exception being nvme and sxfge drivers, where the author redefined CACHE_LINE_SIZE manually, so they need to be analyzed and dealt with separately. Reviewed by: jimharris, alc	2012-10-31 18:07:18 +00:00
Jim Harris	84e7a2ebb7	Pad and align the callout_cpu mtx to its own cacheline to reduce false sharing especially on the default CPU 0 callout_cpu structure. This will be followed up by attilio@ with a conversion to the new struct mtx_padalign but doing this manual conversion first gives an easy MFC candidate since mtx_padalign is a more extensive system change. Sponsored by: Intel Reviewed by: jeff, attilio MFC after: 1 week	2012-10-31 17:12:12 +00:00
Attilio Rao	7f44c61839	Give mtx(9) the ability to crunch different types of structures, with the only constraint that they have a lock cookie named mtx_lock. This name, then, becomes reserved for the struct that wants to use the mtx(9) KPI and other locking primitives cannot reuse it for their members. Namely such structs are the current struct mtx and the new struct mtx_padalign. The new structure will define an object which has the same layout as a struct mtx but will be allocated in areas aligned to the cache line size and will be as big as a cache line. This is supposed to give higher performance for highly contended mutexes both spin or sleep (because of the adaptive spinning), where the cache line contention results in too much traffic on the system bus. The struct mtx_padalign can be used in a completely transparent way with the mtx(9) KPI. At the moment, a possibility to MFC the patch should be carefully evaluated because this patch breaks the low level KPI (not its representation though). Discussed with: jhb Reviewed by: jeff, andre Reviewed by: mdf (earlier version) Tested by: jimharris	2012-10-31 13:38:56 +00:00
Attilio Rao	5584e91718	Fixup r240246: hwpmc needs to retain the pinning until ASTs are not executed. This means past the point where userret() is generally executed. Skip the td_pinned check if a callchain tracing is currently happening and add a more robust check to pmc_capture_user_callchain() in order to catch td_pinned leak past ast() in hwpmc case. Reported and tested by: fabient MFC after: 1 week X-MFC: r240246	2012-10-30 15:10:50 +00:00
Attilio Rao	a049aa05c9	tdq_lock_pair() already does spinlock_enter() so migration is not possible in sched_balance_pair(). Remove redundant sched_pin(). Reviewed by: marius, jeff	2012-10-30 12:25:52 +00:00
Andre Oppermann	e8ad36aba4	In soreceive_stream() don't drop an already dequeued mbuf chain by overwriting the return mbuf pointer with newly received data after a loop. Instead append the new mbuf chain to the existing one. Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream() doesn't get confused. For the remainder copy case in the mbuf delivery part deduct the copied length len instead of the whole mbuf length. Additionally don't depend on 'n' being available which isn't true in the case of MSG_PEEK. Fix the MSG_WAITALL case by comparing against sb_hiwat. Before it was looping for every receive as sb_lowat normally is zero. Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't properly handled. Submitted by: trociny (except for the change in last paragraph)	2012-10-29 12:31:12 +00:00
Andre Oppermann	fdd1b7f52a	Add logging for socket attach failures in sonewconn() during accept(2). Include the pointer to the PCB so it can be attributed to a particular application by corresponding it to "netstat -A" output. MFC after: 2 weeks	2012-10-29 12:14:57 +00:00
Kevin Lo	a2c36a0234	Since the macro dtom() has been removed, fix comments about the dtom. Reviewed by: glebius	2012-10-29 10:04:28 +00:00
Andre Oppermann	14d7c5b11c	Improve m_cat() by being able to also merge contents from M_EXT mbuf's by doing proper testing with M_WRITABLE(). In m_collapse() replace an incomplete manual check for M_RDONLY with the M_WRITABLE() macro that also tests for shared buffers and other cases that make a particular mbuf immutable. MFC after: 2 weeks	2012-10-28 18:38:51 +00:00
Davide Italiano	ba4be2110a	The fields of struct timespec32 should be int32_t and not uint32_t. Make this change. Reviewed by: bde, davidxu Tested by: pho MFC after: 1 week	2012-10-27 23:42:41 +00:00
Edward Tomasz Napierala	36af98697d	Add CPU percentage limit enforcement to RCTL. The resource name is "pcpu". It was implemented by Rudolf Tomori during Google Summer of Code 2012.	2012-10-26 16:01:08 +00:00
Ed Schouten	1da7bb41ed	Correct SIGTTIN handling. In the old TTY layer, SIGTTIN was correctly handled like this: while (data should be read) { send SIGTTIN if not foreground process group read data } In the new TTY layer, however, this behaviour was changed, based on a false interpretation of the standard: send SIGTTIN if not foreground process group while (data should be read) { read data } Correct this by pushing tty_wait_background() into the ttydisc_read_*() functions. Reported by: koitsu PR: kern/173010 MFC after: 2 weeks	2012-10-25 09:05:21 +00:00
Alfred Perlstein	7b6d92c0a0	Allow autotune maxusers > 384 on 64 bit machines. A default install on large memory machines with multiple 10gigE interfaces was not being given enough mbufs to do full bandwidth TCP or NFS traffic. To keep the value somewhat reasonable, we scale back the number of maxusers by 1/6 past the 384 point. This gives us enough mbufs for most of our pretty basic 10gigE line-speed tests to complete.	2012-10-25 01:46:20 +00:00
Jim Harris	39f819e2fc	Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle. This enables CPU searches (which read tdq_load) to operate independently of any contention on the spinlock. Some scheduler-intensive workloads running on an 8C single-socket SNB Xeon show considerable improvement with this change (2-3% perf improvement, 5-6% decrease in CPU util). Sponsored by: Intel Reviewed by: jeff	2012-10-24 18:36:41 +00:00
Andre Oppermann	e37e60c379	Replace the ill-named ZERO_COPY_SOCKET kernel option with two more appropriately named kernel options for the very distinct send and receive path. "options SOCKET_SEND_COW" enables VM page copy-on-write based sending of data on an outbound socket. NB: The COW based send mechanism is not safe and may result in kernel crashes. "options SOCKET_RECV_PFLIP" enables VM kernel/userspace page flipping for special disposable pages attached as external storage to mbufs. Only the naming of the kernel options is changed and their corresponding #ifdef sections are adjusted. No functionality is added or removed. Discussed with: alc (mechanism and limitations of send side COW)	2012-10-23 14:19:44 +00:00
Ed Schouten	d7259a57bd	Remove unused `vfslocked' variable. I have no idea what this `vfslocked' thing means. I wonder how it ended up here.	2012-10-22 21:14:26 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Eitan Adler	3d74f47b90	Correct the killpg(2) return values: Return EPERM if processes were found but they were unable to be signaled. Return the first error from p_cansignal if no signal was successful. Reviewed by: jilles Approved by: cperciva MFC after: 1 week	2012-10-22 03:43:02 +00:00
Eitan Adler	10950e4651	Colin acked the wrong diff originally. fixed version coming soon. Approved by: cperciva (implicit)	2012-10-22 03:36:44 +00:00
Eitan Adler	2a1c0e4d4e	Correct the killpg(2) return values: Return EPERM if processes were found but they were unable to be signaled. Return the first error from p_cansignal if no signal was successful. Reviewed by: jilles Approved by: cperciva MFC after: 1 week	2012-10-22 03:34:43 +00:00
Eitan Adler	db702c59cf	remove duplicate semicolons where possible. Approved by: cperciva MFC after: 1 week	2012-10-22 03:00:37 +00:00
Andre Oppermann	dc00208ec4	Grammar fixes to r241781. Submitted by: alc	2012-10-20 19:38:22 +00:00
Andre Oppermann	2bdf61ca29	Hide the unfortunately named sysctl kern.ipc.somaxconn from sysctl -a output and replace it with a new visible sysctl kern.ipc.acceptqueue of the same functionality. It specifies the maximum length of the accept queue on a listen socket. The old kern.ipc.somaxconn remains available for reading and writing for compatibility reasons so that existing programs, scripts and configurations continue to work. There are no plans to ever remove the original and now hidden kern.ipc.somaxconn.	2012-10-20 12:53:14 +00:00
Andre Oppermann	1490de00a8	Tidy up somaxconn (accept queue limit) and related functions and move it together into one place.	2012-10-20 10:51:32 +00:00
Andre Oppermann	4b62fe5b0b	Move socket UMA zone initialization functionality together into one place.	2012-10-19 12:16:29 +00:00
Andre Oppermann	cf8e6069e8	Move UMA socket zone initialization from uipc_domain.c to uipc_socket.c into one place next to its other related functions to avoid confusion.	2012-10-19 10:15:32 +00:00
Andre Oppermann	d10733a8da	Remove unnecessary includes from sosend_copyin() and fix a couple of style issues.	2012-10-18 21:04:30 +00:00
Andre Oppermann	1d147759db	Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS within zero copy specialized sosend_copyin() helper function.	2012-10-18 20:22:17 +00:00
Attilio Rao	2e564269d0	Disconnect non-MPSAFE SMBFS from the build in preparation for dropping GIANT from VFS. In addition, disconnect also netsmb, which is a base requirement for SMBFS. In the meantime SMBFS regular users can use FUSE interface and smbnetfs port to work with their SMBFS partitions. Also, there are ongoing efforts by vendor to support in-kernel smbfs, so there are good chances that it will get relinked once properly locked. This is not targeted for MFC.	2012-10-18 12:04:56 +00:00
Attilio Rao	a42ac676f5	Disconnect non-MPSAFE NTFS from the build in preparation for dropping GIANT from VFS. This code is particularly broken and fragile and other in-kernel implementations around, found in other operating systems, don't really seem clean and solid enough to be imported at all. If someone wants to reconsider in-kernel NTFS implementation for inclusion again, a fair effort for completely fixing and cleaning it up is expected. In the meantime NTFS regular users can use FUSE interface and ntfs-3g port to work with their NTFS partitions. This is not targeted for MFC.	2012-10-17 11:30:00 +00:00
Attilio Rao	e6116d5b8e	Disconnect non-MPSAFE NWFS from the build in preparation for dropping GIANT from VFS. In addition, disconnect also netncp, which is a base requirement for NWFS. In the possibility of a future maintenance of the code and later readd to the FreeBSD base, maybe we should think about a better location for netncp. I'm not entirely sure the / top location is actually right, however I will let network people to comment on that more specifically. This is not targeted for MFC.	2012-10-17 11:16:17 +00:00
Attilio Rao	55793cdccf	Disconnect non-MPSAFE PORTALFS from the build in preparation for dropping GIANT from VFS. This is not targeted for MFC.	2012-10-16 09:59:10 +00:00
Attilio Rao	05e009c443	Disconnect non-MPSAFE HPFS from the build in preparation for dropping GIANT from VFS. This is not targeted for MFC.	2012-10-16 09:55:31 +00:00
Konstantin Belousov	36c6f3aaae	Acquire the rangelock for truncate(2) as well. Reported and reviewed by: avg Tested by: pho MFC after: 1 week	2012-10-15 18:15:18 +00:00
Konstantin Belousov	9b233e2307	Add a KPI to allow to reserve some amount of space in the numvnodes counter, without actually allocating the vnodes. The supposed use of the getnewvnode_reserve(9) is to reclaim enough free vnodes while the code still does not hold any resources that might be needed during the reclamation, and to consume the slack later for getnewvnode() calls made from the innards. After the critical block is finished, the caller shall free any reserve left, by getnewvnode_drop_reserve(9). Reviewed by: avg Tested by: pho MFC after: 1 week	2012-10-14 19:43:37 +00:00
Alexander Motin	803a9b3efd	panic() with reasonable message instead of returning zero frequency causing division by zero later if event timer's minimal period is above one second. For now it is just a theoretical possibility. Found by: Clang Static Analyzer	2012-10-10 19:46:46 +00:00
Attilio Rao	3a4730256a	Add a unified macro to deny the compiler the ability to reorder instruction loads/stores at will. The macro __compiler_membar() is currently supported for both gcc and clang, but kernel compilation will fail otherwise. Reviewed by: bde, kib Discussed with: dim, theraven MFC after: 2 weeks	2012-10-09 14:32:30 +00:00
Andriy Gapon	298fbd1605	cngetc: use cpu_spinwait to ease the cncheckc loop a tiny bit Reviewed by: julian MFC after: 10 days	2012-10-06 19:50:23 +00:00
Andriy Gapon	c331c9703c	ktrace/kern_exec: check p_tracecred instead of p_cred .. when deciding whether to continue tracing across suid/sgid exec. Otherwise if root ktrace-d an unprivileged process and the process exec-ed a suid program, then tracing didn't continue across exec. Reviewed by: bde, kib MFC after: 22 days	2012-10-06 19:23:44 +00:00
Ed Schouten	6b1b791da6	Fix faulty error code handling in read(2) on TTYs. When performing a non-blocking read(2), on a TTY while no data is available, we should return EAGAIN. But if there's a modem disconnect, we should return 0. Right now we only return 0 when doing a blocking read, which is wrong. MFC after: 1 month	2012-10-03 13:51:03 +00:00
Garrett Wollman	48b5c7410f	Fix spelling of the function name in two assertion messages.	2012-10-02 18:38:05 +00:00
Eitan Adler	8dbce2a343	Provide a generic way to disable devices at boot time PR: kern/119202 Requested by: peterj Reviewed by: sbruno, jhb Approved by: cperciva MFC after: 1 week	2012-10-02 03:33:41 +00:00
Pawel Jakub Dawidek	55711729f3	- Enforce CAP_MKFIFO on mkfifoat(2), not on mknodat(2). Without this change mkfifoat(2) was not restricted. - Introduce CAP_MKNOD and enforce it on mknodat(2). Sponsored by: FreeBSD Foundation MFC after: 2 weeks	2012-10-01 05:43:24 +00:00
Konstantin Belousov	877d24ac8a	Fix the mis-handling of the VV_TEXT on the nullfs vnodes. If you have a binary on a filesystem which is also mounted over by nullfs, you could execute the binary from the lower filesystem, or from the nullfs mount. When executed from lower filesystem, the lower vnode gets VV_TEXT flag set, and the file cannot be modified while the binary is active. But, if executed as the nullfs alias, only the nullfs vnode gets VV_TEXT set, and you still can open the lower vnode for write. Add a set of VOPs for the VV_TEXT query, set and clear operations, which are correctly bypassed to lower vnode. Tested by: pho (previous version) MFC after: 2 weeks	2012-09-28 11:25:02 +00:00
Matthew D Fleming	fc8fdae0df	Fix up kernel sources to be ready for a 64-bit ino_t. Original code by: Gleb Kurtsou	2012-09-27 23:30:49 +00:00
Pawel Jakub Dawidek	c8e781f6e0	Revert r240931, as the previous comment was actually in sync with POSIX. I have to note that POSIX is simply stupid in how it describes O_EXEC/fexecve and friends. Yes, not only inconsistent, but stupid. In the open(2) description, O_RDONLY flag is described as: O_RDONLY Open for reading only. Taken from: http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html Note "for reading only". Not "for reading or executing"! In the fexecve(2) description you can find: The fexecve() function shall fail if: [EBADF] The fd argument is not a valid file descriptor open for executing. Taken from: http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html As you can see the function shall fail if the file was not open with O_EXEC! And yet, if you look closer you can find this mess in the exec.html: Since execute permission is checked by fexecve(), the file description fd need not have been opened with the O_EXEC flag. Yes, O_EXEC flag doesn't have to be specified after all. You can open a file with O_RDONLY and you will still be able to fexecve(2) it.	2012-09-27 16:43:23 +00:00
Mikolaj Golub	47813f5d94	Kernel and modules have "set_vnet" linker set, where virtualized global variables are placed. When a module is loaded by link_elf linker its variables from "set_vnet" linker set are copied to the kernel "set_vnet" ("modspace") and all references to these variables inside the module are relocated accordingly. The issue is when a module is loaded that has references to global variables from another, previously loaded module: these references are not relocated so an invalid address is used when the module tries to access the variable. The example is V_layer3_chain, defined in ipfw module and accessed from ipfw_nat. The same issue is with DPCPU variables, which use "set_pcpu" linker set. Fix this making the link_elf linker on a module load recognize "external" DPCPU/VNET variables defined in the previously loaded modules and relocate them accordingly. For this set_pcpu_list and set_vnet_list are used, where the addresses of modules' "set_pcpu" and "set_vnet" linker sets are stored. Note, archs that use link_elf_obj (amd64) were not affected by this issue. Reviewed by: jhb, julian, zec (initial version) MFC after: 1 month	2012-09-27 14:55:15 +00:00
Konstantin Belousov	94cb35459d	Make the updates of the tid ring buffer' head and tail pointers explicit by moving them into separate statements from the buffer element accesses. Requested by: jhb MFC after: 3 days	2012-09-26 09:25:11 +00:00
Pawel Jakub Dawidek	28f865b0b1	Fix freebsd32_kmq_timedreceive() and freebsd32_kmq_timedsend() to use getmq_read() and getmq_write() respectively, just like sys_kmq_timedreceive() and sys_kmq_timedsend(). Sponsored by: FreeBSD Foundation MFC after: 2 weeks	2012-09-25 22:15:59 +00:00
Pawel Jakub Dawidek	8c706ce0d0	vn_write() always expects FOF_OFFSET flag, which is asserted at the beginning, so there is no need to check for it. Sponsored by: FreeBSD Foundation MFC after: 2 weeks	2012-09-25 21:31:17 +00:00
Pawel Jakub Dawidek	3a038c4d68	We cannot open file for reading and executing (O_RDONLY | O_EXEC). Well, in theory we can pass those two flags, because O_RDONLY is 0, but we won't be able to read from a descriptor opened with O_EXEC. Update the comment. Sponsored by: FreeBSD Foundation MFC after: 2 weeks	2012-09-25 21:11:40 +00:00
Pawel Jakub Dawidek	5c3e5c7f03	Require CAP_DELETE on directory descriptor for unlinkat(2). Sponsored by: FreeBSD Foundation MFC after: 2 weeks	2012-09-25 21:00:36 +00:00
Pawel Jakub Dawidek	cffcbad2bf	Require CAP_CREATE on directory descriptor for symlinkat(2). Sponsored by: FreeBSD Foundation MFC after: 2 weeks	2012-09-25 20:59:12 +00:00
Pawel Jakub Dawidek	d2e166e654	Require CAP_CREATE on directory descriptor for linkat(2). Sponsored by: FreeBSD Foundation MFC after: 2 weeks	2012-09-25 20:58:15 +00:00
Pawel Jakub Dawidek	1159429db8	O_EXEC flag is not part of the O_ACCMODE mask, check it separately. If O_EXEC is provided don't require CAP_READ/CAP_WRITE, as O_EXEC is mutually exclusive to O_RDONLY/O_WRONLY/O_RDWR. Without this change CAP_FEXECVE capability right is not enforced. Sponsored by: FreeBSD Foundation MFC after: 3 days	2012-09-25 20:48:49 +00:00
George V. Neville-Neil	0bf9cb917c	Change the module name for the I/O provider to "kernel" from "genunix". This will require us to modify externally created DTrace scripts but makes logical sense for FreeBSD. Requested by: rpaulo MFC after: 2 weeks	2012-09-25 19:16:28 +00:00
John Baldwin	d95dca1d08	Add optional entropy harvesting for software interrupts in swi_sched() as controlled by kern.random.sys.harvest.swi. SWI harvesting feeds into the interrupt FIFO and each event is estimated as providing a single bit of entropy. Reviewed by: markm, obrien MFC after: 2 weeks	2012-09-25 14:55:46 +00:00
Konstantin Belousov	787a64ddd2	Do not skip two elements of the tid_buffer when reusing the buffer slot. This eventually results in exhaustion of the tid space, causing new threads to get tid -1 as identifier. The bad effect of having the thread id equal to -1 is that UMTX_OP_UMUTEX_WAIT returns EFAULT for a lock owned by such thread, because casuword cannot distinguish between literal value -1 read from the address and -1 returned as an indication of faulted access. _thr_umutex_lock() helper from libthr does not check for errors from _umtx_op_err(2), causing an infinite loop in mutex_lock_sleep(). We observed the JVM processes hanging and consuming an enormous amount of system time on machines with approximately 100 days uptime. Reported by: Mykola Dzham <freebsd levsha org ua> MFC after: 1 week	2012-09-22 12:17:09 +00:00
Eitan Adler	96240c89f0	Correct double "the the" Approved by: cperciva MFC after: 3 days	2012-09-14 21:28:56 +00:00
Andriy Gapon	e87fc7cf7b	sched_ule: fix inverted condition in reporting of priority lending via ktr Reviewed by: kan MFC after: 1 week	2012-09-14 19:55:28 +00:00
Attilio Rao	0a15e5d30d	Remove all the checks on curthread != NULL with the exception of some MD trap checks (eg. printtrap()). Generally this check is not needed anymore, as there is not a legitimate case where curthread == NULL, after pcpu 0 area has been properly initialized. Reviewed by: bde, jhb MFC after: 1 week	2012-09-13 22:26:22 +00:00
John Baldwin	0f14f15b62	Ignore stop and continue signals sent to an exiting process. Stop signals set p_xstat to the signal that triggered the stop, but p_xstat is also used to hold the exit status of an exiting process. Without this change, a stop signal that arrived after a process was marked P_WEXIT but before it was marked a zombie would overwrite the exit status with the stop signal number. Reviewed by: kib MFC after: 1 week	2012-09-13 15:51:18 +00:00
Attilio Rao	e3ae0dfe69	Improve check coverage about idle threads. Idle threads are not allowed to acquire any lock but spinlocks. Deny any attempt to do so by panicking at the locking operation when INVARIANTS is on. Then, remove the check on blocking on a turnstile. The check in sleepqueues is left because they are not allowed to use tsleep() either which could happen still. Reviewed by: bde, jhb, kib MFC after: 1 week	2012-09-12 22:10:53 +00:00
Attilio Rao	faa1082aa2	Tweak the panic message for sleeping from threads with TDP_NOSLEEPING on. The current message has no information on the thread and wchan involved, which may be useful in case where dumps have mangled dwarf information. Reported by: kib Reviewed by: bde, jhb, kib MFC after: 1 week	2012-09-12 22:05:54 +00:00
Konstantin Belousov	bcd5bb8e57	Add a facility for vgone() to inform the set of subscribed mounts about vnode reclamation. Typical use is for the bypass mounts like nullfs to get a notification about lower vnode going away. Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument lowervp which is reclaimed. It is possible to register several reclamation event listeners, to correctly handle the case of several nullfs mounts over the same directory. For the filesystem not having nullfs mounts over it, the overhead added is a single mount interlock lock/unlock in the vnode reclamation path. In collaboration with: pho MFC after: 3 weeks	2012-09-09 19:17:15 +00:00
Konstantin Belousov	84c3cd4f19	Add MNTK_LOOKUP_EXCL_DOTDOT struct mount flag, which specifies to the lookup code that dotdot lookups shall override any shared lock requests with the exclusive one. The flag is useful for filesystems which sometimes need to upgrade shared lock to exclusive inside the VOP_LOOKUP or later, which cannot be done safely for dotdot, due to dvp also locked and causing LOR. In collaboration with: pho MFC after: 3 weeks	2012-09-09 19:11:52 +00:00
Attilio Rao	16cbf13b53	Move the checks for td_pinned, td_critnest, TDP_NOFAULTING and TDP_NOSLEEPING leaking from syscallret() to userret() so that also trap handling is covered. Also, the check on td_locks is not duplicated between the two functions. Reported by: avg Reviewed by: kib MFC after: 1 week	2012-09-08 18:35:15 +00:00
Attilio Rao	fbe18392a1	Move PT_UPDATED_FLUSH() before td_locks check in order to have more coverage also in the XEN case. Reviewed by: kib MFC after: 1 week	2012-09-08 18:29:53 +00:00
Attilio Rao	324e57150d	userret() already checks for td_locks when INVARIANTS is enabled, so there is no need to check if Giant is acquired after it. Reviewed by: kib MFC after: 1 week	2012-09-08 18:27:11 +00:00
Gleb Smirnoff	aaf6343576	Supply the pr_ctloutput method for local datagram sockets, so that setsockopt() and getsockopt() work on them. This makes 'tools/regression/sockets/unix_cmsg -t dgram' more successful.	2012-09-07 21:06:54 +00:00
John Baldwin	773e3b7dda	A few whitespace and comment fixes.	2012-09-07 15:10:46 +00:00
Aleksandr Rybalko	1bccd8638e	Style fixes. Suggested by: mdf Approved by: adrian (mentor)	2012-09-04 23:16:55 +00:00
Aleksandr Rybalko	6a8dada257	Add missing braces. Approved by: bschmidt (while mentor offline) Pointed by: gcooper Pointy hat to: ray	2012-09-03 09:46:46 +00:00
Andrey Zonov	ceb0f71506	- Mark some sysctls with CTLFLAG_TUN flag instead of CTLFLAG_RDTUN. Pointed out by: avg Approved by: kib (mentor) MFC after: 1 week	2012-09-03 09:26:56 +00:00
Aleksandr Rybalko	70da14c4bb	Add kern.hintmode sysctl variable to show current state of hints: 0 - loader hints in environment only; 1 - static hints only 2 - fallback mode (Dynamic KENV with fallback to kernel environment) Add kern.hintmode write handler, accept only value 2. That will switch static KENV to dynamic. So it will be possible to change device hints. Approved by: adrian (mentor)	2012-09-03 08:52:05 +00:00
Andrey Zonov	c3927cd956	- Make kern.maxtsiz, kern.dfldsiz, kern.maxdsiz, kern.dflssiz, kern.maxssiz and kern.sgrowsiz sysctls writable. Approved by: kib (mentor)	2012-09-02 17:39:02 +00:00
Mikolaj Golub	bb9f214f64	In soreceive_generic() remove the optimization for the case when MSG_WAITALL is set, and it is possible to do the entire receive operation at once if we block (resid <= hiwat). Actually it might make the recv(2) with MSG_WAITALL flag get stuck when there is enough space in the receiver buffer to satisfy the request but not enough to open the window closed previously due to the buffer being full. The issue can be reproduced using the following scenario: On the sender side do 2 send(2) requests: 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10); 2) data of size equal to SOBUF_SIZE. On the receiver side do 2 recv(2) requests with MSG_WAITALL flag set: 1) recv() data of SOBUF_SIZE / 10 size; 2) recv() data of SOBUF_SIZE size; We totally fill the receiver buffer with one SOBUF_SIZE/10 size request and partial SOBUF_SIZE request. When the first request is processed we get SOBUF_SIZE/10 free space. It is just enough to receive the rest of bytes for the second request, and soreceive_generic() blocks in the part that is a subject of this change waiting for the rest. But the window was closed when the buffer was filled and to avoid silly window syndrome it opens only when available space is larger than sb_hiwat/4 or maxseg. So it is stuck and pending data is only sent via TCP window probes. Discussed with: kib (long ago) MFC after: 2 weeks	2012-09-02 07:33:52 +00:00
Mikolaj Golub	2ad099fcb1	In soreceive_generic() when checking if the type of mbuf has changed check it for MT_CONTROL type too, otherwise the assertion "m->m_type == MT_DATA" below may be triggered by the following scenario: - the sender sends some data (MT_DATA) and then a file descriptor (MT_CONTROL); - the receiver calls recv(2) with a MSG_WAITALL asking for data larger than the receive buffer (uio_resid > hiwat). MFC after: 2 weeks	2012-09-02 07:29:37 +00:00
Pawel Jakub Dawidek	707641ec28	Fix panic in procdesc that can be triggered in the following scenario: 1. Process A pdfork(2)s process B. 2. Process A passes process descriptor of B to unrelated process C. 3. Hit CTRL+C to terminate process A. Process B is also terminated with SIGINT. 4. init(8) collects status of process B. 5. Process C closes process descriptor associated with process B. When we have such order of events, init(8), by collecting status of process B, will call procdesc_reap(). This function sets pd_proc to NULL. Now when process C calls close on this process descriptor, procdesc_close() is called. Unfortunately procdesc_close() assumes that pd_proc points at a valid proc structure, but it was set to NULL earlier, so the kernel panics. The patch also adds setting 'p->p_procdesc' to NULL in procdesc_reap(), which I think should be done. MFC after: 1 week	2012-09-01 11:21:56 +00:00
Attilio Rao	d4a2ab8c07	Post r222812 KTR_CPUMASK started being initialized only as a tunable handler and no longer statically. Unfortunately, it seems that this is not ideal for new platform bringup and boot low level development (which needs ktr_cpumask to be effective before tunables can be setup). Because of this, add a way to statically initialize cpusets, by passing a list of initializers, divided by commas. Also, provide a way to enforce an all-set mask, for above mentioned initializers. This imposes some differences on how KTR_CPUMASK is setup now as a kernel option, and in particular this makes the words specifications backward wrt. what is currently in -CURRENT. In order to avoid mismatches between KTR_CPUMASK definition and other way to setup the mask (tunable, sysctl) and to print it, change the ordering how cpusetobj_print() and cpusetobj_scan() acquire the words belonging to the set. Please give a look to sys/conf/NOTES in order to understand how the new format is supposed to work. Also, ktr manpages will be updated shortly by gjb which volunteered for this. This patch won't be merged because it changes a POLA (at least from the theoretical standpoint) and this is however a patch that proves to be effective only in development environments. Requested by: rpaulo Reviewed by: jeff, rpaulo	2012-08-30 21:22:47 +00:00
Marius Strobl	bf38cf8ab3	- Unlike cache invalidation and TLB demapping IPIs, reading registers from other CPUs doesn't require locking so get rid of it. As the latter is used for the timecounter on certain machine models, using a spin lock in this case can lead to a deadlock with the upcoming callout(9) rework. - Merge r134227/r167250 from x86: Avoid cross-IPI SMP deadlock by using the smp_ipi_mtx spin lock not only for smp_rendezvous_cpus() but also for the MD cache invalidation and TLB demapping IPIs. - Mark some unused function arguments as such. MFC after: 1 week	2012-08-29 16:56:50 +00:00
Ed Schouten	fa4dd27847	Remove unused SI_* flags. The SI_DEVOPEN, SI_CONSOPEN and SI_CANDELETE flags are not used by any piece of code in the tree.	2012-08-28 19:30:29 +00:00
John Baldwin	10f0ab3933	Shorten the name of the fast SWI taskqueue to "fast taskq" so that it fits. Reported by: lev MFC after: 1 week	2012-08-28 13:35:37 +00:00
Navdeep Parhar	812302c3eb	Allow nmbjumbop, nmbjumbo9, and nmbjumbo16 to be set directly via loader tunables. MFC after: 1 month	2012-08-23 21:32:02 +00:00
Konstantin Belousov	258f94423b	Provide some compat32 shims for sysctl vfs.conflist. It is required for getvfsbyname(3) operation when called from 32bit process, and getvfsbyname(3) is used by recent bsdtar import. Reported by: many Tested by: David Naylor <naylor.b.david@gmail.com> MFC after: 5 days	2012-08-22 20:05:34 +00:00
John Baldwin	7e690c1f79	Assert that system calls do not leak a pinned thread (via sched_pin()) to userland.	2012-08-22 20:02:42 +00:00
John Baldwin	dda66918e6	Fix a typo.	2012-08-22 20:01:57 +00:00
John Baldwin	ba96d2d816	Mark the idle threads as non-sleepable and also assert that an idle thread never blocks on a turnstile.	2012-08-22 20:01:38 +00:00
John Baldwin	e6bdd477fc	Fix the 'show witness' DDB command to honor db_pager_quit.	2012-08-22 20:00:41 +00:00
John Baldwin	6f7d0018b0	Add a BUS_CHILD_DELETED() method that a bus can hook to allow it to cleanup any bus-specific state (such as ivars) when a child device is deleted. Requested by: kan MFC after: 1 month	2012-08-21 18:13:09 +00:00
Konstantin Belousov	888aefef89	Deliver SIGSYS to the guilty thread, not to the process. MFC after: 1 week	2012-08-18 18:17:10 +00:00
David Xu	e31eb35c3f	regen.	2012-08-17 02:47:16 +00:00
David Xu	d65f1abca7	Implement syscall clock_getcpuclockid2, so we can get a clock id for process, thread or others we want to support. Use the syscall to implement POSIX API clock_getcpuclockid and pthread_getcpuclockid. PR: 168417	2012-08-17 02:26:31 +00:00
John Baldwin	f39f73f47c	Remove D_NEEDGIANT from dead_devsw. biofinish() (and thus dead_strategy) does not need Giant. MFC after: 1 month	2012-08-16 18:04:33 +00:00
Konstantin Belousov	3fa615bc11	As a safety measure, disable lowering pid_max too much. Requested by: Peter Jeremy <peter@rulingia.com> MFC after: 1 week	2012-08-16 13:04:21 +00:00
Konstantin Belousov	abce621c3a	Fix grammar. Submitted by: jh MFC after: 1 week	2012-08-16 13:01:56 +00:00
Warner Losh	79f1fdb83b	Limit popcorn limit to something sane (either 2ns or 2 ticks if that's longer). PR: 156481 Submitted by: Ian Lepore	2012-08-16 02:35:44 +00:00
Alan Cox	33327b9e9b	Correct a KASSERT message. Submitted by: bde	2012-08-15 22:12:01 +00:00
Konstantin Belousov	02c6fc2114	Add a sysctl kern.pid_max, which limits the maximum pid the system is allowed to allocate, and corresponding tunable with the same name. Note that existing processes with higher pids are left intact. MFC after: 1 week	2012-08-15 15:56:21 +00:00
Hans Petter Selasky	c01fc06ee9	Revert r239178 and implement two new functions, namely "device_free_softc()" and "device_claim_softc()", to allow USB serial drivers refcounting the softc. These functions are used to grab the softc from auto-free and to free the softc back to the correct malloc type, respectively. Discussed with: jhb MFC after: 2 weeks	2012-08-15 15:42:57 +00:00
David E. O'Brien	60ee433881	Don't include opt_ddb.h & <ddb/ddb.h> twice.	2012-08-15 14:18:54 +00:00
Jaakko Heinonen	2f0ac2593b	Reserve room for the terminating NUL when setting or getting kernel environment variables. KENV_MNAMELEN and KENV_MVALLEN don't include space for the terminating NUL.	2012-08-14 19:16:30 +00:00
David Xu	d7f97db7bd	Some style fixes inspired by @bde.	2012-08-11 23:48:39 +00:00
Alexander Motin	37f4e0254f	Some more minor tunings inspired by bde@.	2012-08-11 20:24:39 +00:00
Alexander Motin	bf89d544d0	Allow idle threads to steal second threads from other cores on systems with 8 or more cores to improve utilization. None of my tests on a 2xXeon (2x6x2) system showed any slowdown from the mentioned "excess thrashing". At the same time, in a pbzip2 test with more threads than CPUs I see up to 10% speedup with SMT disabled and up to 5% with SMT enabled. Thinking about thrashing I was trying to limit that stealing within the same last level cache, but got only worse results. The present code prefers to steal threads from topologically closer cores anyway. Sponsored by: iXsystems, Inc.	2012-08-11 15:08:19 +00:00
David Xu	e8afbca2bc	tvtohz will print out an error message if a negative value is given to it, avoid this problem by detecting timeout earlier. Reported by: pho	2012-08-11 00:06:56 +00:00
Alexander Motin	579895df01	Some minor tunings/cleanups inspired by bde@ after previous commits: - remove extra dynamic variable initializations; - restore (4BSD) and implement (ULE) hogticks variable setting; - make sched_rr_interval() more tolerant to options; - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more user-friendly wrapper for sched_slice; - tune some sysctl descriptions; - make some style fixes.	2012-08-10 19:02:49 +00:00
Alexander Motin	9000aabf3b	sched_rr_interval() seems to have always returned a period in hz ticks, but it was always used as a rate. Fix the units on the use side to a period in hz ticks.	2012-08-10 18:19:57 +00:00
Hans Petter Selasky	ea1bd564ac	Add new device method to free the automatically allocated softc structure which is returned by device_get_softc(). This method can be used to easily implement softc refcounting. This can be desirable when the softc has memory references which are controlled by userspace handles for example. This solves the problem of blocking the caller of device_detach() for a non-deterministic time. Discussed with: kib, ed MFC after: 2 weeks	2012-08-10 15:02:49 +00:00
Alexander Motin	3d7f41175d	Rework r220198 change (by fabient). I believe it solves the problem from the wrong direction. Before it, if preemption and end of time slice happen same time, thread was put to the head of the queue as for only preemption. It could cause single thread to run for indefinitely long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that causes delayed context switch every time preemption happens, even when not needed. Solve problem by introducing scheduler-specific thread flag TDF_SLICEEND, set when thread's time slice is over and it should be put to the tail of queue. Using SW_PREEMPT flag for that purpose as it was before just not enough informative to work correctly. On my tests this by 2-3 times reduces run time deviation (improves fairness) in cases when several threads share one CPU. Reviewed by: fabient MFC after: 2 months Sponsored by: iXsystems, Inc.	2012-08-09 19:26:13 +00:00
Alexander Motin	48317e9e27	SCHED_4BSD scheduling quantum mechanism appears to be broken for some time. With switchticks variable being reset each time thread preempted (that is done regularly by interrupt threads) scheduling quantum may never expire. It was not noticed in time because several other factors still regularly trigger context switches. Handle the problem by replacing that mechanism with its equivalent from SCHED_ULE called time slice. It is effectively the same, just measured in context of stathz instead of hz. Some unification is probably not bad.	2012-08-09 18:09:59 +00:00
Konstantin Belousov	c0c6e95f7f	Always initialize pl_event. Submitted by: Andrey Zonov <andrey@zonov.org> MFC after: 3 days	2012-08-08 00:20:30 +00:00
Alexander Kabaev	c9516c94b4	Do not add handler to event handlers list until ithread is created. In rare event when fast and ithread interrupts share the same vector and the fast handler was registered first, we can end up trying to schedule the ithread that is not created yet. The kernel built with INVARIANTS then triggers an assertion. Change the order to create the ithread first and only then add the handler that needs it to the interrupt event handlers list. Reviewed by: jhb	2012-08-06 16:37:43 +00:00
Konstantin Belousov	1c771f9222	After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull vm_param.h was removed. Other big dependency of vm_page.h on vm_param.h are PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages. Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly for the kernel code which needs it. Suggested and reviewed by: alc MFC after: 2 weeks	2012-08-05 14:11:42 +00:00
Alexander Motin	2038943013	Partially MFcalloutng r238425 (by davide): Fix an issue related to old periodic timers. The code in kern_clocksource.c uses interrupt to keep track of time, and this time may not match with binuptime(). In order to address such incoherency, switch periodic timers to binuptime(). Except further calloutng it is needed for already present cyclic subsystem.	2012-08-04 08:06:37 +00:00
Alexander Motin	9b71c63a8b	Partially MFcalloutng r236894 (by davide): ... While here, Bruce Evans told me that "unsigned int" is spelled "u_int" in KNF, so replace it where needed.	2012-08-04 07:46:58 +00:00
Alexander Motin	c0722d20d3	Microoptimize time math. Since our event periods are always below one second, we can skip adding the integer parts by using bintime_addx() instead of bintime_add(). Profiling shows a 15% reduction in handleevents() time.	2012-08-03 09:08:20 +00:00
John Baldwin	e838f09cd0	Reorder the management of advisory locks on open files so that the advisory lock is obtained before the write count is increased during open() and the lock is released after the write count is decreased during close(). The first change closes a race where an open() that will block with O_SHLOCK or O_EXLOCK can increase the write count while it waits. If the process holding the current lock on the file then tries to call exec() on the file it has locked, it can fail with ETXTBUSY even though the advisory lock is preventing other threads from successfully completing a writable open(). The second change closes a race where a read-only open() with O_SHLOCK or O_EXLOCK may return successfully while the write count is non-zero due to another descriptor that had the advisory lock and was blocking the open() still being in the process of closing. If the process that completed the open() then attempts to call exec() on the file it locked, it can fail with ETXTBUSY even though the other process that held a write lock has closed the file and released the lock. Reviewed by: kib MFC after: 1 month	2012-07-31 18:25:00 +00:00
David Xu	5ff2bb52cc	I am comparing current pipe code with the one in 8.3-STABLE r236165, I found 8.3 is a history BSD version using socket to implement FIFO pipe, it uses per-file seqcount to compare with writer generation stored in per-pipe object. The concept is after all writers are gone, the pipe enters next generation, all old readers that have not closed the pipe should get the indication that the pipe is disconnected, result is they should get EPIPE, SIGPIPE or get POLLHUP in poll(). But newcomer should not know that previous writers were gone, it should treat it as a fresh session. I am trying to bring back FIFO pipe to history behavior. It is still unclear that if single EOF flag can represent SBS_CANTSENDMORE and SBS_CANTRCVMORE which socket-based version is using, but I have run the poll regression test in tool directory, output is same as the one on 8.3-STABLE now. I think the output "not ok 18 FIFO state 6b: poll result 0 expected 1. expected POLLHUP; got 0" might be bogus, because newcomer should not know that old writers were gone. I got the same behavior on Linux. Our implementation always return POLLIN for disconnected pipe even it should return POLLHUP, but I think it is not wise to remove POLLIN for compatible reason, this is our history behavior. Regression test: /usr/src/tools/regression/poll	2012-07-31 05:48:35 +00:00
David Xu	12a480fa41	When a thread is blocked in direct write state, it only sets PIPE_DIRECTW flag but not PIPE_WANTW, but FIFO pipe code does not understand this internal state, when a FIFO peer reader closes the pipe, it wants to notify the writer, it checks PIPE_WANTW, if not set, it skips calling wakeup(), so blocked writer never noticed the case, but in general, the writer should return from the syscall with EPIPE error code and may get SIGPIPE signal. Setting the PIPE_WANTW fixed problem, or you can turn off direct write, it should fix the problem too. This bug is found by PR/170203. Another bug in FIFO pipe code is when peer closes the pipe, another end which is being blocked in select() or poll() is not notified, it missed to call pipeselwakeup(). Third problem is found in poll regression test, the existing code can not pass 6b,6c,6d tests, but FreeBSD-4 works. This commit does not fix the problem, I still need to study more to find the cause. PR: 170203 Tested by: Garrett Copper < yanegomi at gmail dot com >	2012-07-31 02:00:37 +00:00
Davide Italiano	6e465ac7ce	Until now KTR_ENTRIES, which defines the size of circular buffer used in ktr(4), was constrained to be a power of two. Remove this constraint and update sys/conf/NOTES accordingly. Reviewed by: jhb Approved by: gnn (mentor) Sponsored by: Google Summer of Code 2012	2012-07-30 22:46:42 +00:00
Konstantin Belousov	d8c1da8b90	Add F_DUP2FD_CLOEXEC. Apparently Solaris 11 already did this. Submitted by: Jukka A. Ukkonen <jau iki fi> PR: standards/169962 MFC after: 1 week	2012-07-27 10:41:10 +00:00
Konstantin Belousov	481af8b933	Cosmetics: define FREEBSD32_MINUSER and AOUT32_MINUSER for struct sysentvec .sv_minuser. Also improve style. Submitted by: Oliver Pinter <oliver.pinter@gmail.com> MFC after: 1 week	2012-07-22 13:41:45 +00:00
Konstantin Belousov	a53cab2c6c	(Incomplete) fixes for symbols visibility issues and style in fcntl.h. Append '__' prefix to the tag of struct oflock, and put it under BSD namespace. Structure is needed both by libc and kernel, thus cannot be hidden under #ifdef _KERNEL. Move a set of non-standard F_* and O_* constants into BSD namespace. SUSv4 explicitly allows implementation to pollute F_* and O_* names after fcntl.h is included, but it costs us nothing to adhere to the specification if exact POSIX compliance level is requested by user code. Change some spaces after #define to tabs. Noted by and discussed with: bde MFC after: 1 week	2012-07-21 13:02:11 +00:00
Konstantin Belousov	eb3d975443	Remove line which was accidentally kept in r238614. Submitted by: pjd Pointy hat to: kib MFC after: 1 week	2012-07-19 20:38:03 +00:00
Konstantin Belousov	d1ae5c8337	Fix several reads beyond the mapped first page of the binary in the ELF parser. Specifically, do not allow note reader and interpreter path comparison in the brandelf code to read past end of the page. This may happen if specially crafted ELF image is activated. Submitted by: Lukasz Wojcik <lukasz.wojcik zoho com> MFC after: 3 days	2012-07-19 11:15:53 +00:00
Konstantin Belousov	49d02b13bc	Implement F_DUPFD_CLOEXEC command for fcntl(2), specified by SUSv4. PR: standards/169962 Submitted by: Jukka A. Ukkonen <jau iki fi> MFC after: 1 week	2012-07-19 10:22:54 +00:00
George V. Neville-Neil	57d025c338	Add support for walltimestamp in DTrace. Submitted by: Fabian Keil MFC after: 2 weeks	2012-07-16 20:17:19 +00:00
Gabor Pali	599fc82b06	- Add support for displaying process stack memory regions. Approved by: rwatson MFC after: 3 days	2012-07-16 09:38:19 +00:00
Matthew D Fleming	f806cdcf99	Fix a bug with memguard(9) on 32-bit architectures without a VM_KMEM_MAX_SIZE. The code was not taking into account the size of the kernel_map, which the kmem_map is allocated from, so it could produce a sub-map size too large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely and base the memguard map's size off the kernel_map's size, since this is always relevant and always smaller. Found by: Justin Hibbits	2012-07-15 20:29:48 +00:00
John Baldwin	2919668490	Make the interval timings for EVFILT_TIMER more accurate. tvtohz() always adds an extra tick to account for the current partial clock tick. However, that is not appropriate for a repeating timer when the exact tvtohz() value should be used for subsequent intervals. Fix repeating callouts for EVFILT_TIMER by subtracting 1 tick from the tvtohz() result similar to the fix used in realitexpire() for interval timers. While here, update a few comments to note that if the EVFILT_TIMER code were to move out of kern_event.c, it should move to kern_time.c (where the interval timer code it mimics lives) rather than kern_timeout.c. MFC after: 1 month	2012-07-13 13:24:33 +00:00
Konstantin Belousov	92a9c65b06	Fix build for kernels with dtrace hooks. MFC after: 1 month	2012-07-11 18:50:50 +00:00
George V. Neville-Neil	3fac94ba94	Initial commit of an I/O provider for DTrace on FreeBSD. These probes are most useful when looking into the structures they provide, which are listed in io.d. For example: dtrace -n 'io:genunix::start { printf("%d\n", args[0]->bio_bcount); }' Note that the I/O systems in FreeBSD and Solaris/Illumos are sufficiently different that there is not a 1:1 mapping from scripts that work with one to the other. MFC after: 1 month	2012-07-11 16:27:02 +00:00
David Xu	7ce60f6013	Always clear p_xthread if current thread no longer needs it, in theory, if debugger exited without calling ptrace(PT_DETACH), there is a time window that the p_xthread may be pointing to non-existing thread, in practice, this is not a problem because child process soon will be killed by parent process.	2012-07-10 05:45:13 +00:00
David Xu	5985d61556	If you have pressed CTRL+Z and a process is suspended, then you use gdb to attach to the process, it is surprising that the process is resumed without inputting any gdb commands, however ptrace manual said: The tracing process will see the newly-traced process stop and may then control it as if it had been traced all along. But the current code does not work in this way, unless traced process received a signal later, it will continue to run as a background task. To fix this problem, just send signal SIGSTOP to the traced process after we resumed it, this works like that you are attaching to a running process, it is not perfect but better than nothing.	2012-07-09 09:24:46 +00:00
Mateusz Guzik	4fd85c4b5d	Follow-up commit to r238220: Pass only FEXEC (instead of FREAD|FEXEC) in fgetvp_exec. _fget has to check for !FWRITE anyway and may as well know about FREAD. Make _fget code a bit more readable by converting permission checking from if() to switch(). Assert that correct permission flags are passed. In collaboration with: kib Approved by: trasz (mentor) MFC after: 6 days X-MFC: with r238220	2012-07-09 05:39:31 +00:00
Mateusz Guzik	28a7f60741	Unbreak handling of descriptors opened with O_EXEC by fexecve(2). While here return EBADF for descriptors opened for writing (previously it was ETXTBSY). Add fgetvp_exec function which performs appropriate checks. PR: kern/169651 In collaboration with: kib Approved by: trasz (mentor) MFC after: 1 week	2012-07-08 00:51:38 +00:00
Mikolaj Golub	e71a7957bd	Fix KASSERT message. MFC after: 3 days	2012-07-03 19:08:02 +00:00
Konstantin Belousov	c5c1199c83	Extend the KPI to lock and unlock f_offset member of struct file. It now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is 64bit quantity, is always read and written under the mtxpool protection. This fixes an apparently easy-to-trigger race when parallel lseek()s or lseek() and read/write could destroy file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks	2012-07-02 21:01:03 +00:00
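
Two of the commits listed above (49d02b13bc and d8c1da8b90) add the F_DUPFD_CLOEXEC and F_DUP2FD_CLOEXEC commands to fcntl(2). As a rough userland illustration of the first one only, the sketch below duplicates a descriptor with close-on-exec set in the same call. The helper names dup_cloexec() and has_cloexec() are hypothetical and do not come from the commits; the sketch assumes a system whose fcntl.h exposes F_DUPFD_CLOEXEC as specified by SUSv4.

```c
#define _POSIX_C_SOURCE 200809L	/* assumption: needed on some libcs for F_DUPFD_CLOEXEC */

#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the commits above): duplicate fd onto
 * the lowest free descriptor >= minfd with FD_CLOEXEC set by the same
 * fcntl() call, so there is no window between dup and setting the flag.
 * Returns the new descriptor, or -1 with errno set.
 */
int
dup_cloexec(int fd, int minfd)
{
	return (fcntl(fd, F_DUPFD_CLOEXEC, minfd));
}

/*
 * Hypothetical helper: returns 1 if FD_CLOEXEC is set on fd, 0 if it is
 * not, and -1 if fcntl() fails.
 */
int
has_cloexec(int fd)
{
	int flags;

	flags = fcntl(fd, F_GETFD);
	if (flags == -1)
		return (-1);
	return ((flags & FD_CLOEXEC) != 0);
}
```

A caller would typically use dup_cloexec(fd, 0) where it previously used dup(fd) followed by fcntl(newfd, F_SETFD, FD_CLOEXEC); doing both steps in one call avoids leaking the descriptor if another thread forks and execs in between.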


13187 Commits