freebsd-nq

Author	SHA1	Message	Date
Jeff Roberson	f0393f063a	- Remove setrunqueue and replace it with direct calls to sched_add(). setrunqueue() was mostly empty. The few asserts and thread state setting were moved to the individual schedulers. sched_add() was chosen to displace it for naming consistency reasons. - Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be different on all three schedulers where it was only called in one place each. - Remove the long ifdef'd out remrunqueue code. - Remove the now redundant ts_state. Inspect the thread state directly. - Don't set TSF_* flags from kern_switch.c, we were only doing this to support a feature in one scheduler. - Change sched_choose() to return a thread rather than a td_sched. Also, rely on the schedulers to return the idlethread. This simplifies the logic in choosethread(). Aside from the run queue links kern_switch.c mostly does not care about the contents of td_sched. Discussed with: julian - Move the idle thread loop into the per scheduler area. ULE wants to do something different from the other schedulers. Suggested by: jhb Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.	2007-01-23 08:46:51 +00:00
Craig Rodrigues	61e323a2fa	When exiting vfs_export(), delete the "export" option from the mount options list with vfs_deleteopt(). At this point, the export information is saved in mp->mnt_export, so we can delete the "export" mount option from mp->mnt_optnew and mp->mnt_opt. This fixes read-write/read-only update mounts (mount -u -o rw, mount -u -o ro) of NFS exported directories. For some reason, I could only reproduce the problem with a configuration supplied by Andre: - "options QUOTA" enabled in kernel config - "/ -maproot=root 10.0.1.105" in /etc/exports Reported by: kris, Andre Guibert de Bruet <andy siliconlandmark com>, Andrzej Tobola <ato iem pw edu pl> Tested by: Andre Guibert de Bruet	2007-01-23 06:19:16 +00:00
Andre Oppermann	7c32173ba8	Unbreak writes of 0 bytes. Zero byte writes happen when only ancillary control data but no payload data is passed. Change m_uiotombuf() to return at least one empty mbuf if the requested length was zero. Add comment to sosend_dgram and sosend_generic(). Diagnoses by: jhb Regression test by: rwatson Pointy hat to. andre	2007-01-22 14:50:28 +00:00
Konstantin Belousov	7f92c4ee02	Below is slightly edited description of the LOR by Tor Egge: -------------------------- [Deadlock] is caused by a lock order reversal in vfs_lookup(), where [some] process is trying to lock a directory vnode, that is the parent directory of covered vnode) while holding an exclusive vnode lock on covering vnode. A simplified scenario: root fs var fs / A / (/var) D /var B /log (/var/log) E vfs lock C vfs lock F Within each file system, the lock order is clear: C->A->B and F->D->E When traversing across mounts, the system can choose between two lock orders, but everything must then follow that lock order: L1: C->A->B \| +->F->D->E L2: F->D->E \| +->C->A->B The lookup() process for namei("/var") mixes those two lock orders: VOP_LOOKUP() obtains B while A is held vfs_busy() obtains a shared lock on F while A and B are held (follows L1, violates L2) vput() releases lock on B VOP_UNLOCK() releases lock on A VFS_ROOT() obtains lock on D while shared lock on F is held vfs_unbusy() releases shared lock on F vn_lock() obtains lock on A while D is held (violates L1, follows L2) dounmount() follows L1 (B is locked while F is drained). Without unmount activity, vfs_busy() will always succeed without blocking and the deadlock isn't triggered (the system behaves as if L2 is followed). With unmount, you can get 4 processes in a deadlock: p1: holds D, want A (in lookup()) p2: holds shared lock on F, want D (in VFS_ROOT()) p3: holds B, want drain lock on F (in dounmount()) p4: holds A, want B (in VOP_LOOKUP()) You can have more than one instance of p2. The reversal was introduced in revision 1.81 of src/sys/kern/vfs_lookup.c and MFCed to revision 1.80.2.1, probably to avoid a cascade of vnode locks when nfs servers are dead (VFS_ROOT() just hangs) spreading to the root fs root vnode. - Tor Egge To fix the LOR, ups@ noted that when crossing the mount point, ni_dvp is actually not used by the callers of namei. Thus, placeholder deadfs vnode vp_crossmp is introduced that is filled into ni_dvp. Idea by: ups Reviewed by: tegge, ups, jeff, rwatson (mac interaction) Tested by: Peter Holm MFC after: 2 weeks	2007-01-22 11:25:22 +00:00
Jeff Roberson	5cea64d54f	- Disable the long-term load balancer. I believe that steal_busy works better and gives more predictable results.	2007-01-20 21:24:05 +00:00
Jeff Roberson	c95d2db298	- We do need to IPI the idlethread on some systems. It may be stuck in a power saving mode otherwise. - If the thread is already bound in sched_bind() unbind it before re-binding it to a new cpu. I don't like these semantics but they are expected by some code in the tree. Patch by jkoshy.	2007-01-20 17:03:33 +00:00
Jeff Roberson	6b2f763f7c	- In tdq_transfer() always set NEEDRESCHED when necessary regardless of the ipi settings. If NEEDRESCHED is set and an ipi is later delivered it will clear it rather than cause extra context switches. However, if we miss setting it we can have terrible latency. - In sched_bind() correctly implement bind. Also be slightly more tolerant of code which calls bind multiple times. However, we don't change binding if another call is made with a different cpu. This does not presently work with hwpmc which I believe should be changed.	2007-01-20 09:03:43 +00:00
Jeff Roberson	7b8bfa0de9	Major revamp of ULE's cpu load balancing: - Switch back to direct modification of remote CPU run queues. This added a lot of complexity with questionable gain. It's easy enough to reimplement if it's shown to help on huge machines. - Re-implement the old tdq_transfer() call as tdq_pickidle(). Change sched_add() so we have selectable cpu choosers and simplify the logic a bit here. - Implement tdq_pickpri() as the new default cpu chooser. This algorithm is similar to Solaris in that it tries to always run the threads with the best priorities. It is actually slightly more complex than solaris's algorithm because we also tend to favor the local cpu over other cpus which has a boost in latency but also potentially enables cache sharing between the waking thread and the woken thread. - Add a bunch of tunables that can be used to measure effects of different load balancing strategies. Most of these will go away once the algorithm is more definite. - Add a new mechanism to steal threads from busy cpus when we idle. This is enabled with kern.sched.steal_busy and kern.sched.busy_thresh. The threshold is the required length of a tdq's run queue before another cpu will be able to steal runnable threads. This prevents most queue imbalances that contribute the long latencies.	2007-01-19 21:56:08 +00:00
Xin LI	4f506694bb	Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.	2007-01-17 14:58:53 +00:00
Suleiman Souhlal	e8ac01c56a	Remove hptlock from the static witness table, now that it's a regular sleep mutex.	2007-01-16 22:56:28 +00:00
Randall Stewart	9b3386570c	Removes useless (flags \| ) KASSERT. The ^ one that actually does what we want. Submitted by: Li Xin delphij@delphij.net Reviewed by: rrs Approved by: gnn	2007-01-16 11:40:55 +00:00
Kip Macy	e440d8fff5	Fix warning by adding extra parentheses	2007-01-16 00:09:58 +00:00
Randall Stewart	b939bb368a	Reviewed by: rwatson Approved by: gnn Add a new function hashinit_flags() which allows NOT-waiting for memory (or waiting). The old hashinit() function now calls hashinit_flags(..., HASH_WAITOK);	2007-01-15 15:06:28 +00:00
Robert Watson	b0c521e29c	Re-wrap comments to wider margins now that they have been relocated from within functions.	2007-01-12 22:01:03 +00:00
Warner Losh	fe18f3853e	When ntp_gettime() was converted from a sysctl + wrapper to a system call, its semantics were unintentionally changed. It went from returning the time state to returning 0 or -1. Since 0 means time normal, and non-zero effectively only shows up around leap seconds, this went unnoticed until now. At least unnoticed until someone was trying to run a binary they didn't have source for and it was misbehaving... Submitted by: Judah Levine MFC After: 2 weeks	2007-01-12 07:40:30 +00:00
John Baldwin	19c80b2652	Wrap propagate_priority() in a critical section to prevent unwanted preemptions when adjusting the priority of a thread that is on a run queue. This was only observed when FULL_PREEMPTION was enabled. Reported by: kris Diagnosed by: ups MFC after: 1 week	2007-01-11 19:13:27 +00:00
Robert Watson	ef08c42034	Sort copyrights together. MFC after: 3 days	2007-01-08 20:37:02 +00:00
Robert Watson	fcdc50ebc1	Resort copyrights and licenses in kern_acct.c: per UCB letter, the UCB license now excludes the advertising clause. I'm not interested in it either, so move my copyright. This leaves only a CGD copyright with the advertising clause. MFC after: 3 days	2007-01-08 20:35:13 +00:00
Robert Watson	abdeb3b01f	Canonicalize copyrights in some files I hold copyrights on: - Sort by date in license blocks, oldest copyright first. - All rights reserved after all copyrights, not just the first. - Use (c) to be consistent with other entries. MFC after: 3 days	2007-01-08 17:49:59 +00:00
Jeff Roberson	eddb4efacd	- Don't let SCHED_TICK_TOTAL() return less than hz. This can cause integer divide faults in roundup() later if it is able to return 0. For some reason this bug only shows up on my laptop and not my testboxes.	2007-01-06 12:33:43 +00:00
Jeff Roberson	1e516cf534	- Fix the sched_priority() invalid priority bugs. Use roundup() instead of max() when computing the divisor in SCHED_TICK_PRI(). This prevents cases where rounding down would allow the quotient to exceed SCHED_PRI_RANGE. - Garbage collect some unused flags and fields. - Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply duplicated this functionality. - Re-enable the rebalancer by default and fix the sysctl so it can be modified.	2007-01-06 08:44:13 +00:00
Jeff Roberson	9330bbbb61	- Don't IPI unless we're going to interrupt something exiting in the kernel. otherwise we can afford the latency. This makes a significant performance improvement.	2007-01-06 02:34:23 +00:00
Jeff Roberson	155b6ca12b	- Fix a comparison in sched_choose() that caused cpus to be constantly marked idle, thus breaking cpu load balancing. - Change sched_interact_update() to fix cases where the stored history has expanded significantly rather than handling them in the callers. This fixes a case where sched_priority() could compute a bad value. - Add a sysctl to disable the global load balancer for experimentation.	2007-01-05 23:45:38 +00:00
John Baldwin	9ae328fc8f	- Close a race between enumerating UNIX domain socket pcb structures via sysctl and socket teardown by adding a reference count to the UNIX domain pcb object and fixing the sysctl that enumerates unpcbs to grab a reference on each unpcb while it builds the list to copy out to userland. - Close a race between UNIX domain pcb garbage collection (unp_gc()) and file descriptor teardown (fdrop()) by adding a new garbage collection flag FWAIT. unp_gc() sets FWAIT while it walks the message buffers in a UNIX domain socket looking for nested file descriptor references and clears the flag when it is finished. fdrop() checks to see if the flag is set on a file descriptor whose refcount just dropped to 0 and waits for unp_gc() to clear the flag before completely destroying the file descriptor. MFC after: 1 week Reviewed by: rwatson Submitted by: ups Hopefully makes the panics go away: mx1	2007-01-05 19:59:46 +00:00
Jeff Roberson	8ab80cf009	- ftick was initialized to -1 for init and any of it's children. Fix this by setting ftick = ltick = ticks in schedinit(). - Update the priority when we are pulled off of the run queue and when we are inserted onto the run queue so that it more accurately reflects our present status. This is important for efficient priority propagation functioning. - Move the frequency test into sched_pctcpu_update() so we don't repeat it each time we'd like to call it. - Put some temporary work-around code in sched_priority() in case the tick mechanism produces a bad priority. Eventually this should revert to an assert again.	2007-01-05 08:50:38 +00:00
Jeff Roberson	3f872f85d2	- Only allow the tdq_idx to increase by one each tick rather than up to the most recently chosen index. This significantly improves nice behavior. This allows a lower priority thread to run some multiple of times before the higher priority thread makes it to the front of the queue. A nice +20 cpu hog now only gets ~5% of the cpu when running with a nice 0 cpu hog and about 1.5% with a nice -20 hog. A nice difference of 1 makes a 4% difference in cpu usage between two hogs. - Track a seperate insert and removal index. When the removal index is empty it is updated to point at the current insert index. - Don't remove and re-add a thread to the runq when it is being adjusted down in priority. - Pull some conditional code out of sched_tick(). It's looking a bit large now.	2007-01-04 12:16:19 +00:00
Jeff Roberson	cd49bb7047	- Don't pass a pointer into runq_choose_from(). The caller can adjust the index if it chooses to.	2007-01-04 12:10:58 +00:00
Jeff Roberson	e7d50326de	ULE 2.0: - Remove the double queue mechanism for timeshare threads. It was slow due to excess cache lines in play, caused suboptimal scheduling behavior with niced and other non-interactive processes, complicated priority lending, etc. - Use a circular queue with a floating starting index for timeshare threads. Enforces fairness by moving the insertion point closer to threads with worse priorities over time. - Give interactive timeshare threads real-time user-space priorities and place them on the realtime/ithd queue. - Select non-interactive timeshare thread priorities based on their cpu utilization over the last 10 seconds combined with the nice value. This gives us more sane priorities and behavior in a loaded system as compared to the old method of using the interactivity score. The interactive score quickly hit a ceiling if threads were non-interactive and penalized new hog threads. - Use one slice size for all threads. The slice is not currently dynamically set to adjust scheduling behavior of different threads. - Add some new sysctls for scheduling parameters. Bug fixes/Clean up: - Fix zeroing of td_sched after initialization in sched_fork_thread() caused by recent ksegrp removal. - Fix KSE interactivity issues related to frequent forking and exiting of kse threads. We simply disable the penalty for thread creation and exit for kse threads. - Cleanup the cpu estimator by using tickincr here as well. Keep ticks and ltick/ftick in the same frequency. Previously ticks were stathz and others were hz. - Lots of new and updated comments. - Many many others. Tested on: up x86/amd64, 8way amd64.	2007-01-04 08:56:25 +00:00
Jeff Roberson	3fed7d239a	- Add three new functions to support circular run queues. - runq_add_pri allows the caller to position the thread at any rqindex regardless of priority. - runq_choose_from() chooses the lowest priority thread starting from a given index. The index is updated with the rqindex of the chosen thread. This routine is used to pick the lowest priority relative to a given index. - runq_remove_idx() updates the index if the run queue that held the removed thread is now empty.	2007-01-04 08:39:58 +00:00
Jeff Roberson	b1c00b13d6	- Fix schedgraph output with KSE threads. Call thread_switchout() after calling CTR() so we don't confuse a new kse thread with a real preemption.	2007-01-03 02:38:41 +00:00
David Xu	fe1a9506fa	Fix compiling.	2007-01-02 04:14:01 +00:00
Robert Watson	2da78e3862	Prefer a more traditional spelling of inhibited in comments and panic messages.	2006-12-31 15:56:04 +00:00
Jeff Roberson	c02bbb43a0	- More search and replace prettying.	2006-12-29 12:55:32 +00:00
Jeff Roberson	d2ad694caa	- Clean up a bit after the most recent KSE restructuring.	2006-12-29 10:37:07 +00:00
Robert Watson	224a974b9b	Break contents of kern_mac.c out into two files following a repo-copy: mac_framework.c Contains basic MAC Framework functions, policy registration, sysinits, etc. mac_syscalls.c Contains implementations of various MAC system calls, including ENOSYS stubs when compiling without options MAC. Obtained from: TrustedBSD Project	2006-12-28 20:52:02 +00:00
Robert Watson	471e5756ad	Update MAC Framework general comments, referencing various interfaces it consumes and implements, as well as the location of the framework and policy modules. Refactor MAC Framework versioning a bit so that the current ABI version can be exported via a read-only sysctl. Further update comments relating to locking/synchronization. Update copyright to take into account these and other recent changes. Obtained from: TrustedBSD Project	2006-12-28 17:25:57 +00:00
David Xu	016fa30228	break loop early if we know that there are at least two signals.	2006-12-25 03:00:15 +00:00
David Xu	34e1241b9d	Fix typo, p_slptime should be td_slptime.	2006-12-24 01:52:27 +00:00
Bruce M Simpson	a86ec33820	Drop all received data mbufs from a socket's queue if the MT_SONAME mbuf is dropped, to preserve the invariant in the PR_ADDR case. Add a regression test to detect this condition, but do not hook it up to the build for now. PR: kern/38495 Submitted by: James Juran Reviewed by: sam, rwatson Obtained from: NetBSD MFC after: 2 weeks	2006-12-23 21:07:07 +00:00
Robert Watson	afae638809	Update comments to reflect changes in the extattrctl() code. Clean up comment formatting. Obtained from: TrustedBSD Project	2006-12-23 00:30:03 +00:00
Robert Watson	168d0553a3	Following a repo-copy of vfs_syscalls.c to vfs_extattr.c, remove non-extattr functions from vfs_extattr.c, and extattr functions from vfs_syscalls.c. Change copyright/license on vfs_extattr.c to my copyright/license on the extended attribute implementation (from extattr.h). Clean up includes a bit. Obtained from: TrustedBSD Project	2006-12-23 00:10:36 +00:00
Robert Watson	0efd6615cd	Move src/sys/sys/mac_policy.h, the kernel interface between the MAC Framework and security modules, to src/sys/security/mac/mac_policy.h, completing the removal of kernel-only MAC Framework include files from src/sys/sys. Update the MAC Framework and MAC policy modules. Delete the old mac_policy.h. Third party policy modules will need similar updating. Obtained from: TrustedBSD Project	2006-12-22 23:34:47 +00:00
Randall Stewart	5288989fac	The prepend function did not handle non-pkthdr's correctly. It always called MH_ALIGN for small lengths being prepended (less than MHLEN). This meant that if you did a prepend on a non M_PKTHDR the system would panic with the KASSERT in MH_ALIGN. Instead we are not aware of this and do a MH_ALIGN or M_ALIGN as appropriate. Reviewed by: andre Approved by: gnn	2006-12-21 19:58:04 +00:00
Robert Watson	e66fe0e1db	Remove mac_enforce_subsystem debugging sysctls. Enforcement on subsystems will be a property of policy modules, which may require access control check entry points to be invoked even when not actively enforcing (i.e., to track information flow without providing protection). Obtained from: TrustedBSD Project Suggested by: Christopher dot Vance at sparta dot com	2006-12-21 09:51:34 +00:00
Robert Watson	17041e6708	Expand commenting on label slots, justification for the MAC Framework locking model, interactions between locking and policy init/destroy methods. Rewrap some comments to 77 character line wrap. Obtained from: TrustedBSD Project	2006-12-20 20:38:44 +00:00
Jung-uk Kim	4e4de5e43c	MFP4: (part of) 110058 copyin()/copyout() for message type is separated from msgsnd()/msgrcv() and it is done from its wrapper functions to support 32-bit emulations. After I implemented this, I have briefly referenced NetBSD and Darwin. NetBSD passes copyin()/copyout() function pointers from wrappers. Darwin passes size of message type as an argument, which is actually similar to my first implementation (P4 109706). We may revisit these implementations later.	2006-12-20 19:26:30 +00:00
Konstantin Belousov	cc570216bb	In rev. 1.514, iodone on async buffer may happen before code checks the vnode v_flag. For cluster buffers this would result in dereferencing NULL b_vp. To prevent the panic, cache relevant vnode flag before calling bstrategy. Reported by: Peter Holm, kris Tested by: Peter Holm Reviewed by: tegge Pointy hat to: kib	2006-12-20 09:22:31 +00:00
David Xu	4e32b7b3cc	Add a lwpid field into per-cpu structure, the lwpid represents current running thread's id on each cpu. This allow us to add in-kernel adaptive spin for user level mutex. While spinning in user space is possible, without correct thread running state exported from kernel, it hardly can be implemented efficiently without wasting cpu cycles, however exporting thread running state unlikely will be implemented soon as it has to design and stablize interfaces. This implementation is transparent to user space, it can be disabled dynamically. With this change, mutex ping-pong program's performance is improved massively on SMP machine. performance of mysql super-smack select benchmark is increased about 7% on Intel dual dual-core2 Xeon machine, it indicates on systems which have bunch of cpus and system-call overhead is low (athlon64, opteron, and core-2 are known to be fast), the adaptive spin does help performance. Added sysctls: kern.threads.umtx_dflt_spins if the sysctl value is non-zero, a zero umutex.m_spincount will cause the sysctl value to be used a spin cycle count. kern.threads.umtx_max_spins the sysctl sets upper limit of spin cycle count. Tested on: Athlon64 X2 3800+, Dual Xeon 5130	2006-12-20 04:40:39 +00:00
Martin Blapp	cd1b20d58a	Back out rev. 1.266. The real cause for the recent panics has been fixed in rev. 1.267 and there is no need to keep this test.	2006-12-20 02:49:59 +00:00
Martin Blapp	b472f371b2	Giant might have been temporarily dropped while waiting for proctree_lock, allowing for an intervening tty_close() that cleared tp->t_session. Submitted by: tegge MFC: 1 day	2006-12-19 22:34:32 +00:00

1 2 3 4 5 ...

9819 Commits