maxsockets limit, not maxfiles limit. The question remains why those
limits are handled differently (with error code for maxfiles but with
sleep for maxsockets), but that will be addressed in a separate commit
if necessary.
Requested by: rwatson, jeff
before doing the very expensive cursig() and related locking. NEEDSIGCHK
is updated whenever our signal mask changes or when a signal is delivered, and
should be sufficient to avoid the more expensive tests. This eliminates
another source of PROC_LOCK contention in multithreaded programs.
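A minimal sketch of the fast path this enables (the helper name and surrounding code are illustrative assumptions, not the committed code):

    #include <sys/param.h>
    #include <sys/proc.h>

    /*
     * Cheap, unlocked test: TDF_NEEDSIGCHK is set whenever the signal
     * mask changes or a signal is delivered, so when it is clear the
     * caller can skip PROC_LOCK() and cursig() entirely.
     */
    static int
    sleepq_sigchk_needed(struct thread *td)
    {

        return ((td->td_flags & TDF_NEEDSIGCHK) != 0);
    }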
- In the last revision the code was changed to use maxfilesperproc rather than
the per-process file limit to restrict the size of the poll array. This
eliminates a significant source of process lock contention in multithreaded
programs and is cheaper. This had been committed with the wrong batch of
changes.
a simple (wmesg, count) tuple in a hash to keep track of how many times
we sleep at each wait message. We hash on message and not channel. No
line number information is given as typically wait messages are not used in
more than one place. Identical strings defined at different addresses will
show up with separate counters.
- Use debug.sleepq.enable to enable, .reset to reset, and .stats to dump stats.
- Do an unsynchronized check in sleepq_switch() prior to switching, before
  calling sleepq_profile(), which uses a global lock to synchronize the hash.
Only sleeps which actually cause a context switch are counted.
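A sketch of the per-wmesg counter such a hash might keep (field and type names are assumptions, not the committed layout):

    #include <sys/queue.h>

    struct sleepq_prof {
        LIST_ENTRY(sleepq_prof) sp_link;    /* hash chain, keyed on wmesg */
        const char             *sp_wmesg;   /* wait message string */
        long                    sp_count;   /* sleeps that context switched */
    };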
1.38 in 2001. Break out of the FOREACH_THREAD_IN_PROC loop when we've
discovered a new proc in the chain.
- Increment i and check for maxlockdepth once per matching process, not
  once per thread. Previously this did not properly terminate the loop.
- Fix a bug which has existed potentially since rev 1.1. waitblock->lf_next
can be NULL when a thread has been woken up but not yet scheduled. Check
for this condition rather than blindly dereferencing.
Found by: libMicro
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.
- Always include the ie_disable and ie_eoi methods in 'struct intr_event'
and collapse down to one intr_event_create() routine. The disable and
eoi hooks simply aren't used currently in the !INTR_FILTER case.
- Expand 'disab' to 'disable' in a few places.
- Use function casts for arm and i386:intr_eoi_src() instead of wrapper
routines to trim one extra indirection.
Compiled on: {arm,amd64,i386,ia64,ppc,sparc64} x {FILTER, !FILTER}
Tested on: {amd64,i386} x {FILTER, !FILTER}
drivers.
In the giant_XXX wrappers for the device methods of D_NEEDGIANT
drivers, do not dereference cdev->si_devsw; it races with
destroy_devl() clearing si_devsw. Instead, use dev_refthread() and
return ENXIO for a destroyed device. [1]
The check for D_INIT in prep_cdevsw() was not synchronized with the
call to fini_cdevsw() in destroy_devl(), which, under rapid device
creation/destruction, may result in the use of an uninitialized cdevsw [2].
Change the protocol for the prep_cdevsw(), requiring it to be called
under dev_mtx, where the check for D_INIT is done.
Do not free the memory allocated for the gianttrick cdevsw while holding
the dev_mtx, put it into the free list to be freed later. Reuse the
d_gianttrick pointer to keep the size and layout of the struct cdevsw
(requested by phk). Free the memory in dev_unlock_and_free(), and do
all the freeing after the dev_mtx is dropped (suggested by jhb).
Reported by: bsdimp + many [1], pho [2]
Reviewed by: phk, jhb
Tested by: pho
MFC after: 1 week
uidinfo structure. This entirely removes contention observed on the
ui_mtxp mutex (as it is now gone).
- Convert the uihashtbl_mtx mutex to a rwlock, as most of the time we just
need to read-lock it.
Reviewed by: jhb, jeff, kris & others
Tested by: kris
a jail, etc. by simply calling setpriority(PRIO_PROCESS, <PID>, 0) and
checking the return value: 0 means that the process exists and -1 that
it doesn't exist.
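A minimal userland sketch of the probe described above (the helper name is arbitrary):

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdbool.h>

    /* Returns true when the kernel reports that the target process exists. */
    static bool
    pid_visible(pid_t pid)
    {
        return (setpriority(PRIO_PROCESS, pid, 0) == 0);
    }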
Reviewed by: rwatson
MFC after: 1 week
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
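For illustration, a hypothetical SYSINIT() user with the trailing semicolon these parsers expect (the names and subsystem are arbitrary):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    static void
    example_init(void *arg __unused)
    {
        printf("example initialized\n");
    }
    /* Note the explicit semicolon after the macro invocation. */
    SYSINIT(example, SI_SUB_DRIVERS, SI_ORDER_ANY, example_init, NULL);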
MFC after: 1 month
Discussed with: imp, rink
Otherwise the parameter is a no-op, since the zone by default limits the
number of descriptors to some 12K entries. Attempts to allocate more end
up sleeping on zonelimit.
MFC after: 2 weeks
- Add a new intr_event method ie_assign_cpu() that is invoked when the MI
code wishes to bind an interrupt source to an individual CPU. The MD
code may reject the binding with an error. If an assign_cpu function
is not provided, then the kernel assumes the platform does not support
binding interrupts to CPUs and fails all requests to do so.
- Bind ithreads to CPUs on their next execution loop once an interrupt
event is bound to a CPU. Only shared ithreads are bound. We currently
leave private ithreads for drivers using filters + ithreads in the
INTR_FILTER case unbound.
- A new intr_event_bind() routine is used to bind an interrupt event to
a CPU.
- Implement binding on amd64 and i386 by way of the existing pic_assign_cpu
PIC method.
- For x86, provide an 'intr_bind(IRQ, cpu)' wrapper routine that looks up
an interrupt source and binds its interrupt event to the specified CPU.
MI code can currently (ab)use this by doing:
intr_bind(rman_get_start(irq_res), cpu);
however, I plan to add a truly MI interface (probably a bus_bind_intr(9))
where the implementation in the x86 nexus(4) driver would end up calling
intr_bind() internally.
Requested by: kmacy, gallatin, jeff
Tested on: {amd64, i386} x {regular, INTR_FILTER}
- Close a sleepqueue signal race by interlocking with the per-process
spinlock. This was mistakenly omitted from the thread_lock patch and
has been a race ever since.
MFC after: 1 week
PR: bin/117603
Reported by: Danny Braniss <danny@cs.huji.ac.il>
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries, and static binaries will
be broken.
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.
Reviewed by: jhb, peter
tdq_runq_add to select the runq rather than hoping we set it properly
when we adjusted the priority. This involves the same number of
branches as before so should perform identically without the extra
fragility.
Tested by: bz
Reviewed by: bz
- Only calculate timeshare priorities once per tick or when a thread is woken
from sleeping.
- Keep the ts_runq pointer valid after all priority changes.
- Call tdq_runq_add() directly from sched_switch() without passing in via
tdq_add(). We don't need to adjust loads or runqs anymore.
- Sort tdq and ts_sched according to utilization to improve cache behavior.
Sponsored by: Nokia
- Normalize the preemption/ipi setting code by introducing sched_shouldpreempt()
so the logic is identical and not repeated between tdq_notify() and
sched_setpreempt().
- In tdq_notify() don't set NEEDRESCHED, as we may not actually own the thread
  lock; this could have caused us to lose td_flags settings.
- Garbage collect some tunables that are no longer relevant.
know if it has siblings that need an actual probe. Introduce a special
return value called BUS_PROBE_NOWILDCARD. If the driver returns
this, the probe is only successful for devices that have had a
specific devclass set for them.
Reviewed by: current@, jhb@, grehan@
Solaris and AIX.
fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionally equivalent.
Document it.
Add some regression tests (identical to the dup2(2) regression tests).
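A small sketch of the documented equivalence (the wrapper function is hypothetical):

    #include <fcntl.h>
    #include <unistd.h>

    /* Both calls make newfd refer to the same open file as oldfd. */
    static int
    dup_to(int oldfd, int newfd)
    {
        return (fcntl(oldfd, F_DUP2FD, newfd));  /* same as dup2(oldfd, newfd) */
    }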
PR: 120233
Submitted by: Jukka Ukkonen
Approved by: rwatson (mentor)
MFC after: 1 month
process lock leading to a hang. This bug was introduced in
kern_sig.c:1.351, when the call to expand_name() was moved earlier
but this particular error case was not updated.
passing it to cpuset_which(). Pass in 'set' instead. This argument
is not used, but for convenience cpuset_which() nulls all incoming
parameters.
Submitted by: davidxu
mask none of the upper bits are set.
- Be more careful about enforcing the boundaries of masks and child sets.
- Introduce a few more CPU_* macros for implementing these tests.
- Change the cpusetsize argument to be bytes rather than bits to match
other APIs.
Sponsored by: Nokia
The PQ3 is a high performance integrated communications processing system
based on the e500 core, which is an embedded RISC processor that implements
the 32-bit Book E definition of the PowerPC architecture. For details refer
to: http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8555E
This port was tested and successfully run on the following members of the PQ3
family: MPC8533, MPC8541, MPC8548, MPC8555.
The following major integrated peripherals are supported:
* On-chip peripherals bus
* OpenPIC interrupt controller
* UART
* Ethernet (TSEC)
* Host/PCI bridge
* QUICC engine (SCC functionality)
This commit brings the main functionality and will be followed by individual
drivers that are logically separate from this base.
Approved by: cognet (mentor)
Obtained from: Juniper, Semihalf
MFp4: e500
- When searching for affinity, search backwards in the tree from the last
  cpu we ran on, while the thread still has affinity for the group. This
can take advantage of knowledge of shared L2 or L3 caches among a
group of cores.
- When searching for the least loaded cpu, find it via
  the least loaded path through the tree. This load balances system bus
links, individual cache levels, and hyper-threaded/SMT cores.
- Make the periodic balancer recursively balance the highest and lowest
loaded cpu across each link.
Add support for cpusets:
- Convert the cpuset to a simple native cpumask_t while the kernel still
only supports cpumask.
- Pass the derived cpumask down through the cpu_search functions to
restrict the result cpus.
- Make the various steal functions resilient to failure since all threads
  can no longer run on all cpus.
General improvements:
- Precisely track the lowest priority thread on every runq with
tdq_setlowpri(). Before, it was more advisory, but this ended up having
pathological behaviors.
- Remove many #ifdef SMP conditions to simplify the code.
- Get rid of the old cumbersome tdq_group. This is more naturally
expressed via the cpu_group tree.
Sponsored by: Nokia
Testing by: kris
tree structure that encodes the level of cache sharing and other
properties.
- Provide several convenience functions for creating one and two level
cpu trees as well as a default flat topology. The system now always
has some topology.
- On i386 and amd64 create a separate level in the hierarchy for HTT
and multi-core cpus. This will allow the scheduler to intelligently
load balance non-uniform cores. Presently we don't detect what level
of the cache hierarchy is shared at each level in the topology.
- Add a mechanism for testing common topologies that have more information
than the MD code is able to provide via the kern.smp.topology tunable.
This should be considered a debugging tool only and not a stable api.
Sponsored by: Nokia
and assignment.
- Add a reference to a struct cpuset in each thread that is inherited from
the thread that created it.
- Release the reference when the thread is destroyed.
- Add prototypes for syscalls and macros for manipulating cpusets in
sys/cpuset.h
- Add syscalls to create, get, and set new numbered cpusets:
cpuset(), cpuset_{get,set}id()
- Add syscalls for getting and setting affinity masks for cpusets or
  individual threads: cpuset_{get,set}affinity() (see the usage sketch below)
- Add types for the 'level' and 'which' parameters for the cpuset. This
will permit expansion of the api to cover cpu masks for other objects
identifiable with an id_t integer. For example, IRQs and Jails may be
coming soon.
- The root set 0 contains all valid cpus. All threads initially belong to
cpuset 1. This permits migrating all threads off of certain cpus to
reserve them for special applications.
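A userland sketch of the new interfaces (assuming the prototypes described above; error handling is minimal and the helper is hypothetical):

    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <stdio.h>

    static void
    show_cpuset(void)
    {
        cpusetid_t setid;
        cpuset_t mask;
        int cpu;

        /* Which numbered set does the current process belong to? */
        if (cpuset_getid(CPU_LEVEL_CPUSET, CPU_WHICH_PID, -1, &setid) == 0)
            printf("cpuset id: %d\n", (int)setid);
        /* Which cpus may the current thread run on? */
        if (cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) == 0)
            for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &mask))
                    printf("may run on cpu %d\n", cpu);
    }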
Sponsored by: Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by: antoine
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() rather than BUF_LOCKWAITERS() in order to check
the waiters count
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify the flags accepted by lockinit() (see the sketch after this list):
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that they
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used
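A sketch of a consumer using the new per-instance flags (the lock name and priority are illustrative):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/lockmgr.h>
    #include <sys/priority.h>

    static struct lock example_lock;

    static void
    example_lock_init(void)
    {
        /* Opt this particular lock out of profiling and ktr tracing. */
        lockinit(&example_lock, PVFS, "example", 0, LK_NOPROFILE | LK_QUIET);
    }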
This patch breaks the KPI, so __FreeBSD_version will be bumped and manpages
updated by further commits. Additionally, the 'struct buf' changes result
in a disturbed ABI as well.
[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.
[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
the vnode interlock is not held. vn_printf() already correctly handles
locked and unlocked vnode interlocks, and all the in-tree vop_print
methods are interlock-agnostic.
Some code calls vprintf() with the vnode interlock held, which causes
unjustified panics with INVARIANTS (ffs_syncvnode(), for example).
Reported by: Peter Holm
always curthread.
As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.
Tested by: Andrea Barberio <insomniac at slackware dot it>
namespace in order to handle lockmgr fields in a controlled way instead
of spreading bogus stubs all around:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode
In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock
Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as the "sole" exception.
the provided trailers. This has been broken since revision 1.240.
Submitted by: Dan Nelson
PR: kern/120948
"sounds ok to me" from: phk
MFC after: 3 days
consists of the null-terminated name and the contents of any structure
you wish to record. A new ktrstruct() function constructs and emits a
KTR_STRUCT record. It is accompanied by convenience macros for struct
stat and struct sockaddr.
In kdump(1), KTR_STRUCT records are handled by a dispatcher function
that runs stringent sanity checks on its contents before handing it
over to individual decoding functions for each type of structure.
Currently supported structures are struct stat and struct sockaddr for
the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK
and AF_IPX is present but disabled, as I am unable to test it properly.
Since 's' was already taken, the letter 't' is used by ktrace(1) to
enable KTR_STRUCT trace points, and in kdump(1) to enable their
decoding.
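The convenience macros might look roughly like this (a sketch; the exact definitions in <sys/ktrace.h> may differ):

    #define ktrstat(s) \
        ktrstruct("stat", (s), sizeof(struct stat))
    #define ktrsockaddr(s) \
        ktrstruct("sockaddr", (s), ((struct sockaddr *)(s))->sa_len)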
Derived from patches by Andrew Li <andrew2.li@citi.com>.
PR: kern/117836
MFC after: 3 weeks
the same operation as lockmgr() but accepting a custom wmesg, prio and
timo for the particular lock instance, overriding default values
lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo.
- Use lockmgr_args() in order to implement BUF_TIMELOCK()
- Cleanup BUF_LOCK()
- Remove LK_INTERNAL as it is no longer used in the lockmgr namespace
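A sketch of a lockmgr_args() call with per-call overrides (the argument order is assumed from the description above; names and values are illustrative):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/lockmgr.h>
    #include <sys/priority.h>

    static struct lock example_lock;

    static int
    example_acquire(void)
    {
        /* Per-call wmesg, prio and timo instead of the lockinit() defaults. */
        return (lockmgr_args(&example_lock, LK_EXCLUSIVE, NULL,
            "examplewait", PRIBIO, 5 * hz));
    }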
Tested by: Andrea Barberio <insomniac at slackware dot it>
is to be requested via a "ro" option. At the same time, MNT_RDONLY
is gradually becoming an indicator of the current state of the FS
instead of a command flag. Today passing MNT_RDONLY alone to the
kernel's mount machinery will lead to various glitches. (See the
PRs for examples.)
Therefore mount the root FS with a "ro" option instead of the
MNT_RDONLY flag. (Note that MNT_RDONLY still is added to the mount
flags internally, by vfs_donmount(), if "ro" was specified.)
To be able to pass "ro" cleanly to kernel_vmount(), teach the latter
function to accept options with NULL values.
Also correct the comment explaining how mount_arg() handles a length
of -1.
PR: bin/106636 kern/120319
Submitted by: Jaakko Heinonen <see PR kern/120319 for email> (originally)
modules using invalid ABI versions (e.g. a 7.x module with an 8.x kernel)
for a given kernel:
- Add a 'kernel' module version whose value is __FreeBSD_version.
- Add a version dependency on 'kernel' in every module that has an
acceptable version range of __FreeBSD_version up to the end of the
branch __FreeBSD_version is part of. E.g. a module compiled on 701000
would work on kernels with versions between 701000 and 799999 inclusive.
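Roughly, the mechanism amounts to the following (shown expanded for illustration only; the real dependency is generated by the module macros and may differ in detail):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/module.h>

    /* In the kernel proper: advertise its ABI version. */
    MODULE_VERSION(kernel, __FreeBSD_version);

    /* In each module: accept kernels from the build version up to the
     * end of that branch, e.g. 701000 .. 799999. */
    MODULE_DEPEND(example, kernel, __FreeBSD_version,
        __FreeBSD_version, (__FreeBSD_version / 100000) * 100000 + 99999);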
Discussed on: arch@
MFC after: 1 week
A couple of notes for this:
* WITNESS support, when enabled, is only used for shared locks in order
to avoid problems with the "disowned" locks
* KA_HELD and KA_UNHELD only exist in the lockmgr namespace in order
to assert that a generic thread (not curthread) does or does not own
the lock. Really, this kind of check is bogus, but it seems very
widespread in consumer code. So, for the moment, we cater to this
untrusted behaviour until the consumers are fixed and the options
can be removed (hopefully during the 8.0-CURRENT lifecycle)
* Implementing KA_HELD and KA_UNHELD (not supported natively by
WITNESS) made it necessary to introduce LA_MASKASSERT, which
specifies the range for default lock assertion flags
* In other respects, lockmgr_assert() follows exactly what other
locking primitives offer for this operation.
- Build real assertions for buffer cache locks on top of
lockmgr_assert(). They can be used with the BUF_ASSERT_*(bp)
paradigm.
- Add checks at lock destruction time and use a cookie for verifying
lock integrity at any operation.
- Redefine BUF_LOCKFREE() in order to not use a direct assert but
let it rely on the aforementioned destruction time check.
The KPI is evidently broken, so a __FreeBSD_version bump and manpage
updates are necessary and will be committed soon.
Side note: lockmgr_assert() will be used soon in order to implement
real assertions in the vnode namespace replacing the legacy and still
bogus "VOP_ISLOCKED()" way.
Tested by: kris (earlier version)
Reviewed by: jhb
through the FreeBSD ABI. IPC_INFO, SHM_INFO, SHM_STAT were added
specifically for Linux binary support. They are not documented
as being part of the FreeBSD ABI; also, the structures necessary
for them have been hidden away from users for a long time.
The Linux ABI layer uses its own structures to populate the
responses back to the user to ensure that the ABI is consistent.
I think there is a bit more separation work that needs to happen.
Reviewed by: jhb
Discussed with: jhb
Discussed on: freebsd-arch@ (very briefly)
MFC after: 1 month
directory, and jail directory within procstat. While this functionality
is available already in fstat, encapsulating it in the kern.proc.filedesc
sysctl makes it accessible without using kvm and thus without needing
elevated permissions.
The new procstat output looks like:
PID COMM FD T V FLAGS REF OFFSET PRO NAME
76792 tcsh cwd v d -------- - - - /usr/src
76792 tcsh root v d -------- - - - /
76792 tcsh 15 v c rw------ 16 9130 - -
76792 tcsh 16 v c rw------ 16 9130 - -
76792 tcsh 17 v c rw------ 16 9130 - -
76792 tcsh 18 v c rw------ 16 9130 - -
76792 tcsh 19 v c rw------ 16 9130 - -
I am also bumping __FreeBSD_version for this as this new feature will be
used in at least one port.
Reviewed by: rwatson
Approved by: rwatson
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only accept curthread as an argument; this will lead to axing the additional
argument from both functions, making the code cleaner.
Reviewed by: jeff, kib
the provided lock or &blocked_lock. The thread may be temporarily
assigned to the blocked_lock by the scheduler, so a direct comparison
cannot always be made.
- Use THREAD_LOCKPTR_ASSERT() in the primary consumers of the scheduling
interfaces. The schedulers themselves still use more explicit asserts.
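A sketch of what the assertion checks (the committed macro may differ in detail):

    #define THREAD_LOCKPTR_ASSERT(td, lock)                                 \
        KASSERT((td)->td_lock == (lock) || (td)->td_lock == &blocked_lock,  \
            ("thread %p lock %p is not %p", (td), (td)->td_lock, (lock)))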
Sponsored by: Nokia
- Move recursion checking into rwlock inlines to free a bit for use with
adaptive spinners.
- Clear the RW_LOCK_WRITE_SPINNERS flag whenever the lock state changes
causing write spinners to restart their loop.
- Write spinners are limited by a count while readers hold the lock as
there is no way to know for certain whether readers are still running.
- In the read path block if there are write waiters or spinners to avoid
starving writers. Use a new per-thread count, td_rw_rlocks, to skip
starvation avoidance if it might cause a deadlock.
- Remove or change invalid assertions in turnstiles.
Reviewed by: attilio (developed parts of the patch as well)
Sponsored by: Nokia
This support tries to be as parallel as possible with other locking
primitives, but there are differences; more specifically:
- The base witness support is already equipped to allow duplicate lock
acquisition, as lockmgr relies on this.
- In the case of lockmgr_disown() the lock is treated as unlocked by witness
even if it is still held by the "kernel context"
- In the case of upgrading we can have 3 different situations:
* Total unlocking of the shared lock and nothing else
* Real witness upgrade if the owner is the first upgrader
* Shared unlocking and exclusive locking if the owner is not the first
upgrader but is still allowed to upgrade
- LK_DRAIN is basically handled like an exclusive acquisition
Additionally, new options LK_NODUP and LK_NOWITNESS can now be used with
lockinit(): LK_NOWITNESS disables WITNESS for the specified lock while
LK_NODUP enables duplicated lock tracking. This will require manpage
updates and a __FreeBSD_version bump (addressed by further commits).
This patch also fixes a problem occurring if a lockmgr is held in
exclusive mode and the same owner tries to acquire it in shared mode:
currently there is a spurious shared locking acquisition while what
we really want is a lock downgrade. Probably, this situation can be
better served with an EDEADLK failing errno return.
Side note: first testing of this patch already revealed several reported
LORs, so please expect LOR cascades until resolved. NTFS also is
reported broken by WITNESS introduction. BTW, NTFS is exposing a lock
leak which needs to be fixed, and this patch can help it out if
rightly tweaked.
Tested by: kris, yar, Scot Hetzel <swhetzel at gmail dot com>
done in consumer code: using lock properties is much more appropriate.
Fix current code doing these bogus checks.
Note: Really, callouts are not usable by all !(LC_SPINLOCK | LC_SLEEPABLE)
primitives, as rmlocks don't implement the generic lock layer
functions, but they can be equipped for this, so the check is still
valid.
Tested by: matteo, kris (earlier version)
Reviewed by: jhb
- Expose sbrelease_internal(), a variant of sbrelease() with no
expectations about the validity of locks in the socket buffer.
- Use sbrelease_internal() in sorflush(), and as a result avoid initializing
  and destroying a socket buffer lock for the temporary stack copy of the
actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX since
things have gotten less ugly in sorflush() lately.
This makes socket close cleaner, and possibly also marginally faster.
MFC after: 3 weeks
referencing the file's VM pages are returned from the network stack,
making changes to the file safe.
This flag does not guarantee that the data has been transmitted to the
other end.
free function controllable, instead of passing the KVA of the buffer
storage as the first argument.
Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.
Likely break the only non-conventional user of the API, after informing
the relevant committer.
Update the mbuf(9) manual page, which was already out of sync on
this point.
Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.
This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.
This does not affect the memory layout or size of mbufs.
Parental oversight by: sam and rwatson.
No MFC is anticipated.
read socket buffers in shutdown() and close():
- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().
- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushing it.
To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.
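A sketch of the resulting sorflush()-style sequence (surrounding code omitted; the wrapper function is illustrative only):

    #include <sys/param.h>
    #include <sys/socket.h>
    #include <sys/socketvar.h>

    static void
    example_rcv_flush(struct socket *so)
    {
        /* Dislodge sleepers first, then take the lock uninterruptibly. */
        socantrcvmore(so);
        (void)sblock(&so->so_rcv, SBL_WAIT | SBL_NOINTR);
        /* ... flush the receive buffer ... */
        sbunlock(&so->so_rcv);
    }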
Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>
resulted in the argument to make_dev() being a unit number.
Correct this by supplying a minor number to make_dev(), and using
the unit number for the calculation of the slave tty name.
Reported and tested by: Peter Holm
Reviewed by: jhb
Yet another pointy hat to: kib
MFC after: 1 day
while the thread does not hold the thread lock would stop blocking for
subsequent interruptible sleeps and would always immediately fail the
sleep with EWOULDBLOCK instead (even sleeps that didn't have a timeout).
Some background:
- KSE has a facility for allowing one thread to interrupt another thread.
During this process, the target thread aborts any interruptible sleeps
much as if the target thread had a pending signal. Once the target
thread acknowledges the interrupt, normal sleep handling resumes. KSE
manages this via the TDF_INTERRUPTED flag. Specifically, it sets the
flag when it sends an interrupt to another thread and clears it when
the interrupt is acknowledged. (Note that this is purely a software
interrupt sort of thing and has no relation to hardware interrupts
or kernel interrupt threads.)
- The old code for handling the sleep timeout race handled the race
by setting the TDF_INTERRUPT flag and faking a KSE-style thread
interrupt to the thread in the process of going to sleep. It probably
should have just checked the TDF_TIMEOUT flag in sleepq_catch_signals()
instead.
- The bug was that the sleepq code would set TDF_INTERRUPT but it was
never cleared. The sleepq code couldn't safely clear it in case there
actually was a real KSE thread interrupt pending for the target thread
(in fact, the sleepq timeout actually stomped on said pending interrupt).
Thus, any future interruptible sleeps (*sleep(.. PCATCH ..) or
cv_*wait_sig()) would see the TDF_INTERRUPT flag set and immediately
fail with EWOULDBLOCK. The flag could be cleared if the thread belonged
to a KSE process and another thread posted an interrupt to the original
thread. However, in the more common case of a non-KSE process, the
thread would pretty much stop sleeping.
- Fix the bug by just setting TDF_TIMEOUT in the sleepq timeout code and
not messing with TDF_INTERRUPT and td_intrval. With yesterday's fix to
sleepq_switch() to check TDF_TIMEOUT, this is now sufficient.
MFC after: 3 days
being properly cancelled by a timeout. In general there is a race
when the sleepq timeout handler fires while the thread is still
in the process of going to sleep. In 6.x with sched_lock, the race was
largely protected by sched_lock. The only place it was "exposed" and had
to be handled was while checking for any pending signals in
sleepq_catch_signals().
With the thread lock changes, the thread lock is dropped in between
sleepq_add() and sleepq_*wait*() opening up a new window for this race.
Thus, if the timeout fired while the sleeping thread was in between
sleepq_add() and sleepq_*wait*(), the thread would be marked as timed
out, but the thread would not be dequeued and sleepq_switch() would
still block the thread until it was awakened via some other means. In
the case of pause(9) where there is no other wakeup, the thread would
never be awakened.
Fix this by teaching sleepq_switch() to check if the thread has had its
sleep canceled before blocking by checking the TDF_TIMEOUT flag and
aborting the sleep and dequeueing the thread if it is set.
MFC after: 3 days
Reported by: dwhite, peter
`kn_sdata' member of the newly registered knote. The problem is that
this member is overwritten by a call to kevent(2) with the EV_ADD flag,
targeted at the same kevent/knote. For instance, a userland application
may set the pointer to NULL, leading to a panic.
A testcase was provided by the submitter.
PR: kern/118911
Submitted by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 1 day
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is really bogus and not currently used.
  Hopefully this will soon be replaced by something suitable.
- Remove the prototype for dumplockinfo() as the function is no longer
present
Additionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
  curthread or NULL, as those are the only values that should be passed
- Do a little bit of style(9) cleanup on lockmgr.h
The KPI is heavily broken by this change, so manpages and
__FreeBSD_version will be modified accordingly by further commits.
Tested by: matteo
Introduce a new privilege allowing certain IP header options
(hop-by-hop, routing headers) to be set.
Leave a few comments to be addressed later.
Reviewed by: rwatson (older version, before addressing his comments)
a run-queue. If the priority is numerically raised, only change lowpri
if we're certain it will be correct. Some slop is allowed; however,
previously we could erroneously raise lowpri for an idle cpu that a
thread had recently run on, which led to errors in load balancing
decisions.
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj
BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of the bogus
BUF_REFCNT() in a more explicit and SMP-compliant way.
This allows us to axe BUF_REFCNT(), leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.
The KPI is obviously broken, so further commits will update manpages
and __FreeBSD_version.
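The new helpers amount to something like the following (assumed definitions; the committed macros in <sys/buf.h> may differ):

    #define BUF_RECURSED(bp)                                \
        lockmgr_recursed(&(bp)->b_lock)
    #define BUF_ISLOCKED(bp)                                \
        lockstatus(&(bp)->b_lock, curthread)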
Tested by: kris (on UFS and NFS)