While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries; static binaries will be
broken.
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when kmem_alloc_nofault() failed to allocate address space. Both
functions now return an error instead of panicking or dereferencing
NULL. As a consequence, vmspace_exec() and vmspace_unshare() now
return an errno int. A struct vmspace argument was added to
vm_forkproc() to avoid dealing with a failed allocation when most of
the fork1() job is already done. The kernel stack for the thread is
now set up in thread_alloc(), which itself may return NULL. Also,
allocation of the first process thread is performed in fork1() to
properly deal with stack allocation failure. proc_linkup() is
separated into proc_linkup(), called from fork1(), and proc_linkup0(),
which is used to set up the kernel process (formerly known as the
swapper).
In collaboration with: Peter Holm
Reviewed by: jhb
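A minimal sketch of the resulting failure handling, assuming a
hypothetical helper name (the real fork1() does considerably more
work before and after this point):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/proc.h>

    /*
     * Hypothetical helper: thread_alloc() sets up the kernel stack
     * and may now return NULL, so the fork path can unwind instead
     * of panicking.
     */
    static int
    fork_alloc_first_thread(struct thread **tdp)
    {
        struct thread *td;

        td = thread_alloc();        /* may fail to get a stack */
        if (td == NULL)
            return (ENOMEM);        /* caller backs out cleanly */
        *tdp = td;
        return (0);
    }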
Put in a little comment explaining why it went away.
Re-enable it in the case where an existing process is just splitting
off its address space and file descriptors.
(I don't think anything uses that code, but it needs some sort of
locking and this does the job.)
Reviewed by: davidxu, alc, others
MFC after: 3 days
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
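To illustrate the pattern, a few representative renames (examples
only, not an exhaustive list of the changed entry points):

    /*
     * Old scheme (verb first)        New scheme (object first)
     * mac_init_vnode()           ->  mac_vnode_init()
     * mac_check_vnode_write()    ->  mac_vnode_check_write()
     * mac_check_posix_sem_wait() ->  mac_posixsem_check_wait()
     */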
After discussions with jeff, alc, (various Ironport people), David Xu,
and mostly Alfred (who found the problem), it has been demonstrated
that this is not needed for our implementations of threads and
represents a real (as in we've seen it happen a lot) deadlock danger.
Several points:
Since forking multiple threads is not allowed, and POSIX states that
any mutexes owned by other threads will be owned in the child by
phantom threads, and threads shouldn't be accessing shared structures
without protection, it can be proved that if this leads to the child
process accessing inconsistent data, it's a programming error.
The mode of thread_single() being used in fork() is the wrong one.
It is using SINGLE_NO_EXIT when it should be using SINGLE_BOUNDARY.
Even if this were used, system processes have no need to do it, as
they have no userland to get inconsistent.
This commit first fixes the above bugs to get them correct in CVS,
then removes the code with #ifdef.
This is so that history contains the corrected version should it
be needed in the future.
This code may be needed if we implement the forkall() syscall from
Solaris. It may be needed for other non-posix thread libraries
at some time in the future, so let the code sit for a short while
while I do some work on it anyhow.
This removes a reproducible lockup in NFS.
It may be argued that doing a fork while holding a vnode lock may
not be the best idea in the first place, but it shouldn't cause a
deadlock.
The removal has been running under soak test for several days now.
This removal should be seriously considered for 7.0 and RELENG_6.
Note. There is code in the core-dumping code that may have a similar problem
with coredumping threaded processes
MFC after: 4 days
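For reference, a sketch of the corrected single-threading call as it
would sit inside fork1() before the removal (error handling
simplified; this is illustrative, not the exact committed code):

    if ((p1->p_flag & P_HADTHREADS) != 0) {
        PROC_LOCK(p1);
        if (thread_single(SINGLE_BOUNDARY)) {   /* was SINGLE_NO_EXIT */
            PROC_UNLOCK(p1);
            return (ERESTART);
        }
        PROC_UNLOCK(p1);
    }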
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.
I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.
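The caller-visible change is purely the name; the signature stays the
same ("mydaemon" below is a hypothetical example, not real code):

    /* Before: despite the name, this created a whole process. */
    error = kthread_create(mydaemon, NULL, &p, 0, 0, "mydaemon");

    /* After: the honest name. */
    error = kproc_create(mydaemon, NULL, &p, 0, 0, "mydaemon");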
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.
Approved by: re
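A sketch of the conversion described above (the p_swtick field name
is an assumption for illustration):

    /* Record 'ticks' at swap-in/out; derive seconds only on demand. */
    int secs_swapped = (ticks - p->p_swtick) / hz;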
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.
Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)
a privilege is checked against the real uid rather than the effective
uid, instead decide which uid to use in priv_check_cred() based on the
privilege passed in. We use the real uid for PRIV_MAXFILES,
PRIV_MAXPROC, and PRIV_PROC_LIMIT. Remove the definition of
SUSER_RUID; there are now no flags defined for priv_check_cred().
Obtained from: TrustedBSD Project
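A simplified sketch of the uid selection inside priv_check_cred()
(the real function also consults jail and MAC policy):

    switch (priv) {
    case PRIV_MAXFILES:
    case PRIV_MAXPROC:
    case PRIV_PROC_LIMIT:
        if (cred->cr_ruid == 0)     /* checked against the real uid */
            return (0);
        break;
    default:
        if (cred->cr_uid == 0)      /* checked against the effective uid */
            return (0);
        break;
    }
    return (EPERM);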
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
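Typical caller-side conversion (PRIV_FOO stands in for whichever
specific privilege applies at the call site):

    /* Before: generic superuser check. */
    error = suser(td);

    /* After: a specific, named privilege. */
    error = priv_check(td, PRIV_FOO);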
In particular:
- Add an explicative table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some useless comments
Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)
embedded storage in struct ucred. This allows audit state to be cached
with the thread, avoiding locking operations with each system call, and
makes it available in asynchronous execution contexts, such as deep in
the network stack or VFS.
Reviewed by: csjp
Approved by: re (kensmith)
Obtained from: TrustedBSD Project
- Use thread_lock() rather than sched_lock for per-thread scheduling
synchronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
- Replace the tail-end of fork_exit() with a scheduler specific routine
which can do the appropriate lock manipulations.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
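The resulting caller-visible pattern, sketched (the particular field
updates are illustrative):

    /* Per-thread scheduling state: the thread lock, not sched_lock. */
    thread_lock(td);
    td->td_flags |= TDF_NEEDRESCHED;    /* example per-thread update */
    thread_unlock(td);

    /* Per-process scheduling state: the process spinlock. */
    PROC_SLOCK(p);
    /* ... process-wide scheduling bookkeeping ... */
    PROC_SUNLOCK(p);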
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. Reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to its exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.
Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)
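Consumers that need a live process's rusage now aggregate on demand,
roughly:

    #include <sys/resourcevar.h>

    struct rusage ru;

    rufetch(p, &ru);    /* sums td_ru across all threads in p */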
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.
Requested by: alc
Approved by: jeff (mentor)
vmcnts. This can be used to abstract away pcpu details, but it also
changes all counters to use atomics now. This means sched lock is no
longer responsible for protecting counts in the switch routines.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are important for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
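The basic locking pattern after the conversion, sketched:

    /* Shared lock suffices for read-mostly lookups. */
    FILEDESC_SLOCK(fdp);
    fp = fget_locked(fdp, fd);      /* NULL if fd is not open */
    FILEDESC_SUNLOCK(fdp);

    /* Exclusive lock when the table is modified. */
    FILEDESC_XLOCK(fdp);
    /* ... install or remove a descriptor ... */
    FILEDESC_XUNLOCK(fdp);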
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.
Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
PRIO_USER case, possibly also other places that dereference
p_ucred.
In the past, we inserted a new process into the allproc list right
after PID allocation and released the allproc_lock sx. Because
most content in the new proc's structure was not yet initialized,
this could lead to undefined results if we did not handle PRS_NEW
with care.
The problem with the PRS_NEW state is that it does not provide
fine-grained information about how much initialization is done for a
new process. By definition, after a PRIO_USER setpriority() call, all
processes that belong to the given user should have their nice value
set to the specified value. Therefore, if the p_{start,end}copy
section was done for a PRS_NEW process, we cannot safely ignore
it because p_nice is in this area. On the other hand, we should
be careful with PRS_NEW processes because we do not allow non-root
users to lower their nice values, and without a successful copy
of the copy section, we can get stale values that are inherited
from the uninitialized area of the process structure.
This commit tries to close the race condition by grabbing the proc
mutex *before* we release the allproc_lock xlock, and doing the copy
as well as the zeroing immediately after the allproc_lock xunlock.
This guarantees that the new process has its p_copy and p_zero
sections, as well as its user credential information, initialized. In
the getpriority() case, instead of grabbing PROC_LOCK for a PRS_NEW
process, we just skip the process in question, because it does
not affect the final result of the call: the p_nice value
would be copied from its parent, and we will see it during the
allproc traversal.
Other potential solutions are still under evaluation.
Discussed with: davidxu, jhb, rwatson
PR: kern/108071
MFC after: 2 weeks
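A simplified sketch of the reordering in fork1() (details such as the
pid hash insertion are omitted; this is illustrative, not the exact
committed code):

    sx_xlock(&allproc_lock);
    LIST_INSERT_HEAD(&allproc, p2, p_list);
    PROC_LOCK(p2);          /* taken before allproc_lock is dropped */
    PROC_LOCK(p1);
    sx_xunlock(&allproc_lock);

    /*
     * Copy and zero immediately, so PRS_NEW is never observed with
     * uninitialized p_{start,end}copy / p_{start,end}zero sections.
     */
    bzero(&p2->p_startzero,
        (char *)&p2->p_endzero - (char *)&p2->p_startzero);
    bcopy(&p1->p_startcopy, &p2->p_startcopy,
        (char *)&p2->p_endcopy - (char *)&p2->p_startcopy);
    PROC_UNLOCK(p1);
    PROC_UNLOCK(p2);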
setrunqueue() was mostly empty. The few asserts and thread state
setting were moved to the individual schedulers. sched_add() was
chosen to displace it for naming consistency reasons.
- Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
different on all three schedulers where it was only called in one place
each.
- Remove the long ifdef'd out remrunqueue code.
- Remove the now redundant ts_state. Inspect the thread state directly.
- Don't set TSF_* flags from kern_switch.c, we were only doing this to
support a feature in one scheduler.
- Change sched_choose() to return a thread rather than a td_sched. Also,
rely on the schedulers to return the idlethread. This simplifies the
logic in choosethread(). Aside from the run queue links kern_switch.c
mostly does not care about the contents of td_sched.
Discussed with: julian
- Move the idle thread loop into the per scheduler area. ULE wants to
do something different from the other schedulers.
Suggested by: jhb
Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.
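Caller-side change, sketched:

    /* Before: */
    setrunqueue(td, SRQ_BORING);

    /* After: call the scheduler directly. */
    sched_add(td, SRQ_BORING);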
Make part of John Birrell's KSE patch permanent.
Specifically, remove:
Any reference to the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.
Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.
The ULE scheduler compiles again but I have no idea if it works.
The 4bsd scheduler still requires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.
Tested by David Xu and Dan Eischen using libthr and libpthread.
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
image_params arg.
- Change struct image_params to include struct sysentvec pointer and
initialize it.
- Change all consumers of process_exit/process_exec eventhandlers to
new prototypes (includes splitting up into distinct exec/exit functions).
- Add eventhandler to userret.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Parts suggested by: jhb (on hackers@)
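A hypothetical consumer of the new prototypes (handler name is an
example only):

    #include <sys/param.h>
    #include <sys/eventhandler.h>
    #include <sys/imgact.h>
    #include <sys/proc.h>

    /* Exec handlers now receive the image_params as well. */
    static void
    my_exec_hook(void *arg, struct proc *p, struct image_params *imgp)
    {
        /* The sysentvec pointer now carried in imgp is usable here. */
    }

registered in the usual way with EVENTHANDLER_REGISTER(process_exec,
my_exec_hook, NULL, EVENTHANDLER_PRI_ANY).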
- For privileged processes save two mutex operations.
We may want to consider whether it is a good idea to use SUSER_ALLOWJAIL
here, but for now I didn't want to change the original behaviour.
Reviewed by: rwatson
Rename struct thread's td_sticks to td_pticks, we will need the
other name for more appropriately named use shortly. Reduce it
from uint64_t to u_int.
Clear td_pticks whenever we enter the kernel instead of recording
its value as a reference for userret(). Use the absolute value of
td->td_pticks in userret() and eliminate the third argument.
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queuing and dropped records.
Record loss was previously possible when the global pool of records
became depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.
These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler). Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.
- Eliminate the ktrace worker thread and global record queue, as they
are no longer used. Keep the global free record list, as records
are still used.
- Add a per-process record queue, which will hold any asynchronously
generated records, such as from context switches. This replaces the
global queue as the place to submit asynchronous records to.
- When a record is committed asynchronously, simply queue it to the
process.
- When a record is committed synchronously, first drain any pending
per-process records in order to maintain ordering as best we can.
Currently ordering between competing threads is provided via a global
ktrace_sx, but a per-process flag or lock may be desirable in the
future.
- When a process returns to user space following a system call, trap,
signal delivery, etc, flush any pending records.
- When a process exits, flush any pending records.
- Assert on process tear-down that there are no pending records.
- Slightly abstract the notion of being "in ktrace", which is used to
prevent the recursive generation of records, as well as generating
traces for ktrace events.
Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on. I.e., performing a
drain every (n) records.
MFC after: 1 month
Discussed with: jhb
Requested by: Marc Olzheim <marcolz at stack dot nl>
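A sketch of the flush-on-return step described above (the helper
names are assumptions about the internal shape, not a spec):

    void
    ktruserret(struct thread *td)
    {
        ktrace_enter(td);       /* guard against recursive tracing */
        sx_xlock(&ktrace_sx);
        ktr_drain(td);          /* commit this process's pending queue */
        sx_xunlock(&ktrace_sx);
        ktrace_exit(td);
    }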
- Introducing the possibility of using locks other than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.
- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.
Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone
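The extended call for the vnode case looks roughly like this (the
vfs_knl* helper names are assumptions):

    /* NULL callbacks fall back to mtx_lock/mtx_unlock/mtx_owned. */
    knlist_init(&vp->v_pollinfo->vpi_selinfo.si_note, vp,
        vfs_knllock, vfs_knlunlock, vfs_knllocked);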
critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any effect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.
Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.
This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).
Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
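Usage after the split, sketched:

    /* Preemption control only: cheap nested counter operations. */
    critical_enter();
    /* ... touch per-CPU data without being preempted ... */
    critical_exit();

    /* MD spinlock section: currently also blocks interrupts. */
    spinlock_enter();
    /* ... region a spin lock holder must not be preempted in ... */
    spinlock_exit();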
for ensuring that a process' filedesc is not shared with anybody.
Use it in the two places which previously had private implementations.
This collects all fd_refcnt handling in kern_descrip.c.
Use this in all the places where sleeping with the lock held is not
an issue.
The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.