freebsd-dev

Author	SHA1	Message	Date
Robert Watson	8e230e30b7	Attempt to improve convergence of POSIX semaphore code with style(9). MFC after: 3 days	2008-05-16 18:10:07 +00:00
George V. Neville-Neil	49f287f8c5	Update the kernel to count the number of mbufs and clusters (all types) used per socket buffer. Add support to netstat to print out all of the socket buffer statistics. Update the netstat manual page to describe the new -x flag which gives the extended output. Reviewed by: rwatson, julian	2008-05-15 20:18:44 +00:00
Attilio Rao	90356491d7	- Embed the recursion counter for any locking primitive directly in the lock_object, using an unified field called lo_data. - Replace lo_type usage with the w_name usage and at init time pass the lock "type" directly to witness_init() from the parent lock init function. Handle delayed initialization before than witness_initialize() is called through the witness_pendhelp structure. - Axe out LO_ENROLLPEND as it is not really needed. The case where the mutex init delayed wants to be destroyed can't happen because witness_destroy() checks for witness_cold and panic in case. - In enroll(), if we cannot allocate a new object from the freelist, notify that to userspace through a printf(). - Modify the depart function in order to return nothing as in the current CVS version it always returns true and adjust callers accordingly. - Fix the witness_addgraph() argument name prototype. - Remove unuseful code from itismychild(). This commit leads to a shrinked struct lock_object and so smaller locks, in particular on amd64 where 2 uintptr_t (16 bytes per-primitive) are gained. Reviewed by: jhb	2008-05-15 20:10:06 +00:00
John Baldwin	ccd3953e5f	Go back to using the process command name (p_comm) for the file name and command line arguments stored in the note at the beginning of a core dump instead of the current thread name. Reviewed by: julian	2008-05-15 03:07:34 +00:00
Konstantin Belousov	48504cc25b	Add the devctl notifications for the cdev create/destroy events. Based on the submission by: Andriy Gapon <avg icyb net ua> MFC after: 2 weeks	2008-05-14 14:29:54 +00:00
Julian Elischer	681e40627d	fix typo in runz_fuzz noticed by:Elijah Buck	2008-05-12 06:42:06 +00:00
Alan Cox	3202ed7523	Introduce a new parameter "superpage_align" to kmem_suballoc() that is used to request superpage alignment for the submap. Request superpage alignment for the kmem_map. Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find(). (They are currently equivalent but VMFS_ANY_SPACE is the new preferred spelling.) Remove a stale comment from kmem_malloc().	2008-05-10 21:46:20 +00:00
Konstantin Belousov	e15864efd8	Kqueue_scan() may sleep when encountered the influx knotes. On the other hand, it may cause other threads to sleep since kqueue_scan() may mark some knotes as infux. This could lead to the deadlock. Before kqueue_scan() sleeps, wakeup the threads that are waiting for the influx knotes produced by this thread. Tested by: pho (previous version) Reviewed by: jmg MFC after: 2 weeks	2008-05-10 11:37:05 +00:00
Konstantin Belousov	2e711e4d0d	The kqueue_close() encountering the KN_INFLUX knotes on the kq being closed is the legitimate situation. For instance, filedescriptor with registered events may be closed in parallel with closing the kqueue. Properly handle the case instead of asserting that this cannot happen. Reported and tested by: pho Reviewed by: jmg MFC after: 2 weeks	2008-05-10 11:35:32 +00:00
Julian Elischer	8b07e49a00	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extent that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] when for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us as to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. if nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility call setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being reponded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect withthe proces that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routes accordingly. ipfw has grown 2 new keywords: setfib N ip from anay to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. it will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. this will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing the data. instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibilty of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the desination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco	2008-05-09 23:03:00 +00:00
Doug Rabson	06c85cef9d	When blocking on an F_FLOCK style lock request which is upgrading a shared lock to exclusive, drop the shared lock before deadlock detection. MFC after: 2 days	2008-05-09 10:34:23 +00:00
Pawel Jakub Dawidek	b109dd74fe	- Export HZ value via kern.hz sysctl (this is the same name as for the loader tunable). - Document other sysctls in this file and also mark them as loader tunable via CTLFLAG_RDTUN flag. Reviewed by: roberto	2008-05-09 07:42:02 +00:00
Attilio Rao	688b98135c	Add a new witness sysctl which returns the relations between any lock and its children in the form: "parent","child" so that head and bottom of an oriented graph can be easilly detected and various form of diagrams can be build. The sysctl is called debug.witness.graphs and it is read-only; in order to get the list of relations, a simple: #sysctl debug.witness.graphs will do the trick. This approach has been choosen in order to support easilly things like the DOT format and such. Soon, an auto-explicative awk script, which filters simple informations returned by the sysctl and converts them into a real DOT script, will be committed to the repository between examples. Discussed with: rwatson	2008-05-07 21:41:36 +00:00
Kip Macy	c8c7ad9260	add malloc flag to blist so that it can be used in ithread context Reviewed by: alc, bsdimp	2008-05-05 19:48:54 +00:00
John Baldwin	be00f6053b	Fix a few edge cases with error handling in cpufreq(4)'s CPUFREQ_GET() method: - If the last of the child cpufreq drivers returns an error while trying to fetch its list of supported frequencies but an earlier driver found the requested frequency, don't return an error to the caller. - If all of the child cpufreq drivers fail and the attempt to match the frequency based on 'cpu_est_clockrate()' fails, return ENXIO rather than returning success and returning a frequency of CPUFREQ_VAL_UNKNOWN. MFC after: 3 days PR: kern/121433 Reported by: Eugene Grosbein eugen ! kuzbass dot ru	2008-05-05 19:13:52 +00:00
Peter Wemm	43d7128c14	Expand kdb_alt_break a little, most commonly used with the option ALT_BREAK_TO_DEBUGGER. In addition to "Enter ~ ctrl-B" (to enter the debugger), there is now "Enter ~ ctrl-P" (force panic) and "Enter ~ ctrl-R" (request clean reboot, ala ctrl-alt-del on syscons). We've used variations of this at work. The force panic sequence is best used with KDB_UNATTENDED for when you just want it to dump and get on with it. The reboot request is a safer way of getting into single user than a power cycle. eg: you've hosed the ability to log in (pam, rtld, etc). It gives init the reboot signal, which causes an orderly reboot. I've taken my best guess at what the !x86 and non-sio code changes should be. This also makes sio release its spinlock before calling KDB/DDB.	2008-05-04 23:29:38 +00:00
Attilio Rao	60e2edce55	sync_vnode() has some messy code about locking in order to deal with mount fs needing Giant to be held when processing bufobjs. Use a different subqueue for pending workitems on filesystems requiring Giant. This simplifies the code notably and also reduces the number of Giant acquisitions (and the whole processing cost). Suggested by: jeff Reviewed by: kib Tested by: pho	2008-05-04 13:54:55 +00:00
Julian Elischer	2182c0cfbf	Attempt to make the print types more friendly to other architectures. Prodded by: Max Laier Help from: BMS, jhb	2008-04-30 20:00:30 +00:00
Julian Elischer	c59b9a7659	Document the kproc_kthread_add() call and fix a small detail of its implementation. MFC after: 1 week	2008-04-29 22:43:15 +00:00
Roman Divacky	d6891277a4	Lock filedesc exclusively when modifying fd_[cr]dir. This is probably harmless but it's better to lock it correctly. Approved by: kib (mentor)	2008-04-29 21:40:11 +00:00
Julian Elischer	6eeac1d921	Add an option (compiled out by default) to profile outoing packets for a number of mbuf chain related parameters e.g. number of mbufs, wasted space. probably will do with further work later. Reviewed by: various	2008-04-29 21:23:21 +00:00
David Xu	3bba58f287	Fix compiling problem.	2008-04-29 05:48:05 +00:00
David Xu	727158f6f6	Introduce command UMTX_OP_WAIT_UINT_PRIVATE and UMTX_OP_WAKE_PRIVATE to allow userland to specify that an address is not shared by multiple processes.	2008-04-29 03:48:48 +00:00
Robert Watson	ae11a989e6	When writing trailers in sendfile(2), don't call kern_writev() while holding the socket buffer lock. These leads to an immediate panic due to recursing the socket buffer lock. This bug was introduced in uipc_syscalls.c:1.240, but masked by another bug until that was fixed in uipc_syscalls.c:1.269. Note that the current fix isn't perfect, but better than panicking: normally we guarantee that simultaneous invocations of a system call to write on a stream socket won't be interlaced, which is ensured by use of the socket buffer sleep lock. This is guaranteed for the sendfile headers, but not trailers. In practice, this is likely not a problem, but should be fixed. MFC after: 3 days Pointy hat to: andre (1.240), cperciva (1.269)	2008-04-27 15:50:00 +00:00
Kris Kennaway	5894445dad	* Correct a mis-merge that leaked the PROC_LOCK [1] * Return ENOENT on error instead of 0 [2] Submitted by: rdivacky [1], kib [2]	2008-04-26 13:16:55 +00:00
Pawel Jakub Dawidek	3800322fe2	Implement 'show mount' command in DDB. Without argument, it prints short info about all currently mounted file systems. When an address is given as an argument, prints detailed info about the given mount point. MFC after: 2 weeks	2008-04-26 13:04:48 +00:00
Jeff Roberson	6c47aaae12	- Add an integer argument to idle to indicate how likely we are to wake from idle over the next tick. - Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are suspended in cpu specific states. This function can fail and cause the scheduler to fall back to another mechanism (ipi). - Implement support for mwait in cpu_idle() on i386/amd64 machines that support it. mwait is a higher performance way to synchronize cpus as compared to hlt & ipis. - Allow selecting the idle routine by name via sysctl machdep.idle. This replaces machdep.cpu_idle_hlt. Only idle routines supported by the current machine are permitted. Sponsored by: Nokia	2008-04-25 05:18:50 +00:00
Kris Kennaway	b1ba81d948	fdhold can return NULL, so add the one remaining missing check for this condition. Reviewed by: attilio MFC after: 1 week	2008-04-24 22:08:36 +00:00
Konstantin Belousov	12e79a9bbc	Allow the vnode zone to return the unused memory. The vnode reference count is/shall be properly maintained for the long time, and VFS shall be safe against the vnode memory reclamation. Proposed by: jeff Tested by: pho	2008-04-24 09:58:33 +00:00
Poul-Henning Kamp	9b4a8ab7ba	Now that all platforms use genclock, shuffle things around slightly for better structure. Much of this is related to <sys/clock.h>, which should really have been called <sys/calendar.h>, but unless and until we need the name, the repocopy can wait. In general the kernel does not know about minutes, hours, days, timezones, daylight savings time, leap-years and such. All that is theoretically a matter for userland only. Parts of kernel code does however care: badly designed filesystems store timestamps in local time and RTC chips almost universally track time in a YY-MM-DD HH:MM:SS format, and sometimes in local timezone instead of UTC. For this we have <sys/clock.h> <sys/time.h> on the other hand, deals with time_t, timeval, timespec and so on. These know only seconds and fractions thereof. Move inittodr() and resettodr() prototypes to <sys/time.h>. Retain the names as it is one of the few surviving PDP/VAX references. Move startrtclock() to <machine/clock.h> on relevant platforms, it is a MD call between machdep.c/clock.c. Remove references to it elsewhere. Remove a lot of unnecessary <sys/clock.h> includes. Move the machdep.disable_rtc_set sysctl to subr_rtc.c where it belongs. XXX: should be kern.disable_rtc_set really, it's not MD.	2008-04-22 19:38:30 +00:00
Pawel Jakub Dawidek	d90d4eb28c	Back-out previous revision. For now I can use _ddb() variants of stack(9) KPI, as I use it for debugging only. Once someone will need it for more production features, the change should be reconsider. Requested by: rwatson	2008-04-21 17:22:35 +00:00
Robert Watson	8501a69cc9	Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock main exclusive, and all instances of inpcb lock acquisition are exclusive. This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code. MFC after: 3 months Tested by: kris (superset of committered patch)	2008-04-17 21:38:18 +00:00
Pawel Jakub Dawidek	f55f27f862	Allow linker_search_symbol_name() to be called with KLD lock held. The linker_search_symbol_name() function is used by stack_print() and stack_print() can be called from kernel module unload method. MFC after: 1 week	2008-04-17 19:19:40 +00:00
Jeff Roberson	1690c6c1be	- Add a metric to describe how busy a processor has been over the last two ticks by counting the number of switches and the load when sched_clock() is called. - If the busy metric exceeds a threshold allow the idle thread to spin waiting for new work for a brief period to avoid using IPIs. This reduces the cost on the sender and receiver as well as reducing wakeup latency considerably when it works. Sponsored by: Nokia	2008-04-17 09:56:01 +00:00
Jeff Roberson	8df78c41d6	- Make SCHED_STATS more generic by adding a wrapper to create the variables and sysctl nodes. - In reset walk the children of kern_sched_stats and reset the counters via the oid_arg1 pointer. This allows us to add arbitrary counters to the tree and still reset them properly. - Define a set of switch types to be passed with flags to mi_switch(). These types are named SWT_*. These types correspond to SCHED_STATS counters and are automatically handled in this way. - Make the new SWT_ types more specific than the older switch stats. There are now stats for idle switches, remote idle wakeups, remote preemption ithreads idling, etc. - Add switch statistics for ULE's pickcpu algorithm. These stats include how much migration there is, how often affinity was successful, how often threads were migrated to the local cpu on wakeup, etc. Sponsored by: Nokia	2008-04-17 04:20:10 +00:00
Doug Rabson	a365ea5fba	Fix compilation with LOCKF_DEBUG.	2008-04-16 14:08:12 +00:00
Konstantin Belousov	eab626f110	Move the head of byte-level advisory lock list from the filesystem-specific vnode data to the struct vnode. Provide the default implementation for the vop_advlock and vop_advlockasync. Purge the locks on the vnode reclaim by using the lf_purgelocks(). The default implementation is augmented for the nfs and smbfs. In the nfs_advlock, push the Giant inside the nfs_dolock. Before the change, the vop_advlock and vop_advlockasync have taken the unlocked vnode and dereferenced the fs-private inode data, racing with with the vnode reclamation due to forced unmount. Now, the vop_getattr under the shared vnode lock is used to obtain the inode size, and later, in the lf_advlockasync, after locking the vnode interlock, the VI_DOOMED flag is checked to prevent an operation on the doomed vnode. The implementation of the lf_purgelocks() is submitted by dfr. Reported by: kris Tested by: kris, pho Discussed with: jeff, dfr MFC after: 2 weeks	2008-04-16 11:33:32 +00:00
David Xu	d61f3de656	Implement POSIX function tcgetsid() which returns session id. PR: stand/107561	2008-04-15 08:33:32 +00:00
Marcel Moolenaar	495168ba8d	Support and switch to the ULE scheduler: o Implement IPI_PREEMPT, o Set td_lock for the thread being switched out, o For ULE & SMP, loop while td_lock points to blocked_lock for the thread being switched in, o Enable ULE by default in GENERIC and SKI,	2008-04-15 05:02:42 +00:00
Randall Stewart	cf71e4381a	Add pru_flush routine so a transport can flush itself during Shutdown MFC after: 1 week	2008-04-14 18:06:04 +00:00
Alan Cox	e384d8a89b	Initialize the vm object's flags to include OBJ_NOSPLIT, just like the vm objects that are used by System V shared memory segments.	2008-04-13 21:08:34 +00:00
Attilio Rao	22dd228d5d	Use a "rel" memory barrier for disowning the lock as it cames from an exclusive locking operation.	2008-04-13 01:21:56 +00:00
Attilio Rao	0b0100db88	struct lock_instance and struct lock_list_entry don't need to be in the public namespace for WITNESS as they are only used internally so just move them in the private namespace for the subsystem (with all related supporting definitions).	2008-04-13 01:20:47 +00:00
Poul-Henning Kamp	8d24f82310	fix printf type confusion on amd64	2008-04-12 21:51:54 +00:00
Poul-Henning Kamp	c9ad6040dd	Emit summaries of struct c(alender)t(ime) <-> struct timespec conversions under bootverbose. Struct ct is used for setting/reading real time clocks and I'm about to Do Things to some of those, so a bit of preemptive debugging is in order. Remove a pointless __inline.	2008-04-12 20:35:56 +00:00
Attilio Rao	e5f94314ad	- Re-introduce WITNESS support for lockmgr. About the old implementation the only one difference is that lockmgr*() functions now accept LK_NOWITNESS flag which skips ordering for the instanced calling. - Remove an unuseful stub in witness_checkorder() (because the above check doesn't allow ever happening) and allow witness_upgrade() to accept non-try operation too.	2008-04-12 19:57:30 +00:00
Attilio Rao	872b7289fd	- Remove a stale comment. - Add an extra assertion in order to catch malformed requested operations.	2008-04-12 13:56:17 +00:00
Attilio Rao	1859cffaef	Add missing stubs for spinlocks cpuset and intrcnt. Submitted by: kris	2008-04-12 13:51:18 +00:00
Xin LI	31c50f53da	Instead of rolling our own jail number allocation procedure, use alloc_unr() to do it. Submitted by: Ed Schouten <ed 80386 nl> PR: kern/122270 MFC after: 1 month	2008-04-11 21:31:15 +00:00
John Baldwin	03c7442d75	Use kthread_exit() to terminate a taskqueue thread rather than kproc_exit() now that the taskqueue threads are kthreads rather than kprocs. Reported by: kris	2008-04-11 17:35:54 +00:00
Jeff Roberson	9b33b154b5	- Add the interrupt vector number to intr_event_create so MI code can lookup hard interrupt events by number. Ignore the irq# for soft intrs. - Add support to cpuset for binding hardware interrupts. This has the side effect of binding any ithread associated with the hard interrupt. As per restrictions imposed by MD code we can only bind interrupts to a single cpu presently. Interrupts can be 'unbound' by binding them to all cpus. Reviewed by: jhb Sponsored by: Nokia	2008-04-11 03:26:41 +00:00
Pawel Jakub Dawidek	b03d720760	- Use LK_TYPE_MASK where needed. Actually after sys/sys/lockmgr.h:1.69 it is no longer needed, but for now we still want to be consistent with other similar checks in the tree. - Call ASSERT_VOP_ELOCKED() only when vget() returns 0. Reviewed by: jeff	2008-04-09 20:19:55 +00:00
Sam Leffler	6c6eaea6dd	Do image loading in a context known to have a root directory: o create a private task queue thread that sets up root and current directories (hooking mountroot event as needed); this is necessary because task queue threads are parented from proc0 and it does not have a reference to rootvnode (lost when / mounting moved to init) o bounce image load + unload requests through the private task q so we can load images even when the request is made from a thread that does not have sufficient context (e.g. task q thread) o add a check in the task q thread to fail requests before root is mounted (just in case) Reviewed by: jhb, mlaier, luigi (glance) MFC after: 1 month	2008-04-09 19:07:48 +00:00
Sam Leffler	00c71fb7c3	o add a mountroot event handler that fires when / is mounted; this information was lost when root started being mounted by init o remove SI_SUB_MOUNT_ROOT since it's no longer meaningful MFC after: 2 weeks	2008-04-08 17:53:33 +00:00
Sam Leffler	175611b668	change taskqueue_start_threads to create threads instead of proc's Reviewed by: jhb	2008-04-08 17:48:02 +00:00
Konstantin Belousov	48b05c3f82	Implement the linux syscalls openat, mkdirat, mknodat, fchownat, futimesat, fstatat, unlinkat, renameat, linkat, symlinkat, readlinkat, fchmodat, faccessat. Submitted by: rdivacky Sponsored by: Google Summer of Code 2007 Tested by: pho	2008-04-08 09:45:49 +00:00
Attilio Rao	e0f62984c1	- Use a different encoding for lockmgr options: make them encoded by bit in order to allow per-bit checks on the options flag, in particular in the consumers code [1] - Re-enable the check against TDP_DEADLKTREAT as the anti-waiters starvation patch allows exclusive waiters to override new shared requests. [1] Requested by: pjd, jeff	2008-04-07 14:46:38 +00:00
Don Lewis	8a3724388b	vfs_syscalls.c 1.452 mistakenly swapped the behavior of chown() and lchown().	2008-04-07 00:29:32 +00:00
Attilio Rao	047dd67e96	Optimize lockmgr in order to get rid of the pool mutex interlock, of the state transitioning flags and of msleep(9) callings. Use, instead, an algorithm very similar to what sx(9) and rwlock(9) alredy do and direct accesses to the sleepqueue(9) primitive. In order to avoid writer starvation a mechanism very similar to what rwlock(9) uses now is implemented, with the correspective per-thread shared lockmgrs counter. This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and lockmgr_args_rw(). These two are like the 2 "normal" versions, but they both accept a rwlock as interlock. In order to realize this, the general lockmgr manager function "__lockmgr_args()" has been implemented through the generic lock layer. It supports all the blocking primitives, but currently only these 2 mappers live. The patch drops the support for WITNESS atm, but it will be probabilly added soon. Also, there is a little race in the draining code which is also present in the current CVS stock implementation: if some sharers, once they wakeup, are in the runqueue they can contend the lock with the exclusive drainer. This is hard to be fixed but the now committed code mitigate this issue a lot better than the (past) CVS version. In addition assertive KA_HELD and KA_UNHELD have been made mute assertions because they are dangerous and they will be nomore supported soon. In order to avoid namespace pollution, stack.h is splitted into two parts: one which includes only the "struct stack" definition (_stack.h) and one defining the KPI. In this way, newly added _lockmgr.h can just include _stack.h. Kernel ABI results heavilly changed by this commit (the now committed version of "struct lock" is a lot smaller than the previous one) and KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction, so manpages and __FreeBSD_version will be updated accordingly. Tested by: kris, pho, jeff, danger Reviewed by: jeff Sponsored by: Google, Summer of Code program 2007	2008-04-06 20:08:51 +00:00
Jeff Roberson	ce62b59c88	- Correct a major error introduced in the per-cpu timeout commit. Sleep and wakeup require the same wait channel to function properly. Found by: kris Pointy hat: me	2008-04-06 11:08:49 +00:00
John Baldwin	8aa9e82e67	Move INTR_FILTER from opt_global.h to its own header.	2008-04-05 20:13:15 +00:00
John Baldwin	1ee1b68792	Add a MI intr_event_handle() routine for the non-INTR_FILTER case. This allows all the INTR_FILTER #ifdef's to be removed from the MD interrupt code. - Rename the intr_event 'eoi', 'disable', and 'enable' hooks to 'post_filter', 'pre_ithread', and 'post_ithread' to be less x86-centric. Also, add a comment describe what the MI code expects them to do. - On amd64, i386, and powerpc this is effectively a NOP. - On arm, don't bother masking the interrupt unless the ithread is scheduled in the non-INTR_FILTER case to match what INTR_FILTER did. Also, don't bother unmasking the interrupt in the post_filter case if we never masked it. The INTR_FILTER case had been doing this by having arm_unmask_irq for the post_filter (formerly 'eoi') hook. - On ia64, stray interrupts are now masked for the non-INTR_FILTER case. They were already masked in the INTR_FILTER case. - On sparc64, use the a NULL pre_ithread hook and use intr_enable_eoi() for both the 'post_filter' and 'post_ithread' hooks to match what the non-INTR_FILTER code did. - On sun4v, retire the ithread wrapper hack by using an appropriate 'post_ithread' hook instead (it's what 'post_ithread'/'enable' was designed to do even in 5.x). Glanced at by: piso Reviewed by: marius Requested by: marius [1], [5] Tested on: amd64, i386, arm, sparc64	2008-04-05 19:58:30 +00:00
Alan Cox	7630c26507	Reintroduce UMA_SLAB_KMAP; however, change its spelling to UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM. (UMA_SLAB_KMAP met its original demise in revision 1.30 of vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame allocators. Without it, UMA cannot correctly return pages from the jumbo frame zones to the VM system because it resets the pages' object field to NULL instead of the kernel object. In more detail, the jumbo frame zones are created with the option UMA_ZONE_REFCNT. This causes UMA to overwrite the pages' object field with the address of the slab. However, when UMA wants to release these pages, it doesn't know how to restore the object field, so it sets it to NULL. This change teaches UMA how to reset the object field to the kernel object. Crashes reported by: kris Fix tested by: kris Fix discussed with: jeff MFC after: 6 weeks	2008-04-04 18:41:12 +00:00
Jeff Roberson	00ca09449d	- Add sysctls at debug.rwlock to control the behavior of the speculative spinning when readers hold a lock. This spinning is speculative because, unlike the write case, we can not test whether the owners are running. - Add speculative read spinning for readers who are blocked by pending writers while a read lock is still held. This allows the thread to spin until the write lock succeeds after which it may spin until the writer has released the lock. This prevents excessive context switches when readers and writers both hold the lock for brief periods. Sponsored by: Nokia	2008-04-04 10:00:46 +00:00
Jeff Roberson	3bc8c68d9f	- Add a Nokia copyright to cpuset to reflect their generous contribution to this work.	2008-04-04 01:22:04 +00:00
Jeff Roberson	0502fe2e43	- Allow static_boost to specify no boost with '0', traditional kernel fixed pri boost with '1' or any priority less than the current thread's priority with a value greater than two. Default the boost to PRI_MIN_TIMESHARE to prevent regular user-space threads from starving threads in the kernel. This prevents these user-threads from also being scheduled as if they are high fixed-priority kernel threads. - Restore the setting of lowpri in tdq_choose(). It has to be either here or in sched_switch(). I accidentally removed it from both places. Tested by: kris	2008-04-04 01:16:18 +00:00
Jeff Roberson	03d17db7d5	- Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't do this either. Simply check P_NOLOAD. It'd be nice if this was in a thread flag so we didn't have an extra cache miss every time we add and remove a thread from the run-queue.	2008-04-04 01:04:43 +00:00
Jeff Roberson	e4b1aa6210	- Fix a mis-merge that crept in during the softclock changes. Spotted by: jhb	2008-04-04 01:03:23 +00:00
David Xu	44253336b6	let umtxq_busy() only spin on mp machine. make function name do_rwlock_unlock to be consistent with others.	2008-04-03 11:49:20 +00:00
Jeff Roberson	e8245292a7	- Convert two timeout users to the new callout_reset_curcpu() api. Sponsored by: Nokia	2008-04-02 11:21:42 +00:00
Jeff Roberson	8d809d5061	Implement per-cpu callout threads, wheels, and locks. - Move callout thread creation from kern_intr.c to kern_timeout.c - Call callout_tick() on every processor via hardclock_cpu() rather than inspecting callout internal details in kern_clock.c. - Remove callout implementation details from callout.h - Package up all of the global variables into a per-cpu callout structure. - Start one thread per-cpu. Threads are not strictly bound. They prefer to execute on the native cpu but may migrate temporarily if interrupts are starving callout processing. - Run all callouts by default in the thread for cpu0 to maintain current ordering and concurrency guarantees. Many consumers may not properly handle concurrent execution. - The new callout_reset_on() api allows specifying a particular cpu to execute the callout on. This may migrate a callout to a new cpu. callout_reset() schedules on the last assigned cpu while callout_reset_curcpu() schedules on the current cpu. Reviewed by: phk Sponsored by: Nokia	2008-04-02 11:20:30 +00:00
Konstantin Belousov	35b450291a	Add two missed chunks from the rev. 1.210, for the giant_read() and giant_ioctl(). PR: kern/122287 MFC after: 3 days	2008-04-02 11:11:58 +00:00
Jeff Roberson	1fd9b6a577	- Destroy the bo mtx when the vnode is destroyed.	2008-04-02 10:40:03 +00:00
David Xu	fadd84c58f	Fix compiling problem for amd64.	2008-04-02 05:54:41 +00:00
David Xu	11b1023b7d	Er, don't restart a timeout version.	2008-04-02 04:26:59 +00:00
David Xu	1a30511c61	Introduce kernel based userland rwlock. Each umtx chain now has two lists, one for readers and one for writers, other types of synchronization object just use first list. Asked by: jeff	2008-04-02 04:08:37 +00:00
Attilio Rao	b31a149bbb	Add rw_try_rlock() and rw_try_wlock() to rwlocks. These functions try the specified operation (rlocking and wlocking) and true is returned if the operation completes, false otherwise. The KPI is enriched by this commit, so __FreeBSD_version bumping and manpage updating will happen soon. Requested by: jeff, kris	2008-04-01 20:31:55 +00:00
Doug Rabson	60cdfde09f	Don't try to use an SX lock while holding the vnode interlock. Sponsored by: Isilon Systems	2008-04-01 16:07:01 +00:00
Konstantin Belousov	f2296b585e	Regen	2008-03-31 12:12:27 +00:00
Konstantin Belousov	7104518b07	Add the openat(), fexecve() and other *at() syscalls to the table. Based on the submission by rdivacky, sponsored by Google Summer of Code 2007 Reviewed by: rwatson, rdivacky Tested by: pho	2008-03-31 12:06:55 +00:00
Konstantin Belousov	632dbc19e2	Implement the fexecve(2) syscall. Based on the submission by rdivacky, sponsored by Google Summer of Code 2007 Reviewed by: rwatson, rdivacky Tested by: pho	2008-03-31 12:05:52 +00:00
Konstantin Belousov	e4193f25cb	Implement the openat(2), faccessat(2), fchmodat(2), fchownat(2), fstatat(2), futimesat(2), linkat(2), mkdirat(2), mkfifoat(2), mknodat(2), readlinkat(2), renameat(2), symlinkat(2) syscalls. Based on the submission by rdivacky, sponsored by Google Summer of Code 2007 Reviewed by: rwatson, rdivacky Tested by: pho	2008-03-31 12:04:20 +00:00
Konstantin Belousov	57b4252e45	Add the support for the AT_FDCWD and fd-relative name lookups to the namei(9). Based on the submission by rdivacky, sponsored by Google Summer of Code 2007 Reviewed by: rwatson, rdivacky Tested by: pho	2008-03-31 12:01:21 +00:00
Konstantin Belousov	e314f69fff	Add the support for the O_EXEC open(2) mode, as specified by the POSIX Extended API Set Part 2 extension specification. Reviewed by: rwatson, rdivacky Tested by: pho	2008-03-31 11:57:18 +00:00
Konstantin Belousov	0a3af16a75	Add the utility function vn_commname() to retrieve the command name from the vfs namecache, when available. Reviewed by: rwatson, rdivacky Tested by: pho	2008-03-31 11:53:03 +00:00
Jeff Roberson	a03ee0000e	- Consistently return EDEADLK when presented with a new set that is incompatible with existing bindings. - Try to copyout the setid in cpuset() before migrating the proc to the setid in case the user has supplied a bad buffer. - Rename cpuset_root() and cpuset_base() to cpuset_ref{root,base} to be more descriptive and free cpuset_root to be used as a different type of symbol. - Make cpuset_root the cpuset_t set of all cpus in the system. This should contain the same bitmask as all_cpus presently. - Add a CPU_CMP() macro to compare two sets.	2008-03-30 11:31:14 +00:00
Jeff Roberson	5634d48667	- Don't allow calls to vn_lock() with no lock type requested. Callers which simply want a reference should use vref(). Callers which want to check validity need to hold a lock while performing any action based on that validity. vn_lock() would always release the interlock before returning making any action synchronous with the validity check impossible.	2008-03-29 23:36:26 +00:00
Jeff Roberson	069c6953a0	- Use vget() to lock the vnode rather than refing without a lock and locking in separate steps.	2008-03-29 23:30:40 +00:00
Attilio Rao	71072af500	b_waiters cannot be adequately protected by the interlock because it is dropped after the call to lockmgr() so just revert this approach using something similar to the precedent one: BUF_LOCKWAITERS() just checks if there are waiters (not the actual number of them) and it is based on newly introduced lockmgr_waiters() which returns if the lockmgr has waiters or not. The name has been choosen differently by old lockwaiters() in order to not confuse them. KPI results enriched by this commit so __FreeBSD_version bumping and manpage update will be happening soon. 'struct buf' also changes, so kernel ABI is disturbed. Bug found by: jeff Approved by: jeff, kib	2008-03-28 12:30:12 +00:00
John Birrell	fc70c0bdd8	Regen after makesyscalls.sh change.	2008-03-27 01:55:06 +00:00
John Birrell	e994ea8e55	Generate another function for the DTrace syscall provider to specify the syscall argument types. This code is only compiled into the systrace kernel modul and has no effect otherwise.	2008-03-27 01:53:44 +00:00
Poul-Henning Kamp	e465985885	The "free-lance" timer in the i8254 is only used for the speaker these days, so de-generalize the acquire_timer/release_timer api to just deal with speakers. The new (optional) MD functions are: timer_spkr_acquire() timer_spkr_release() and timer_spkr_setfreq() the last of which configures the timer to generate a tone of a given frequency, in Hz instead of 1/1193182th of seconds. Drop entirely timer2 on pc98, it is not used anywhere at all. Move sysbeep() to kern/tty_cons.c and use the timer_spkr() if they exist, and do nothing otherwise. Remove prototypes and empty acquire-/release-timer() and sysbeep() functions from the non-beeping archs. This eliminate the need for the speaker driver to know about i8254frequency at all. In theory this makes the speaker driver MI, contingent on the timer_spkr_() functions existing but the driver does not know this yet and still attaches to the ISA bus. Syscons is more tricky, in one function, sc_tone(), it knows the hz and things are just fine. In the other function, sc_bell() it seems to get the period from the KDMKTONE ioctl in terms if 1/1193182th second, so we hardcode the 1193182 and leave it at that. It's probably not important. Change a few other sysbeep() uses which obviously knew that the argument was in terms of i8254 frequency, and leave alone those that look like people thought sysbeep() took frequency in hertz. This eliminates the knowledge of i8254_freq from all but the actual clock.c code and the prof_machdep.c on amd64 and i386, where I think it would be smart to ask for help from the timecounters anyway [TBD].	2008-03-26 20:09:21 +00:00
Doug Rabson	a7ac0db6cb	Regen.	2008-03-26 15:24:02 +00:00
Doug Rabson	dfdcada31e	Add the new kernel-mode NFS Lock Manager. To use it instead of the user-mode lock manager, build a kernel with the NFSLOCKD option and add '-k' to 'rpc_lockd_flags' in rc.conf. Highlights include: * Thread-safe kernel RPC client - many threads can use the same RPC client handle safely with replies being de-multiplexed at the socket upcall (typically driven directly by the NIC interrupt) and handed off to whichever thread matches the reply. For UDP sockets, many RPC clients can share the same socket. This allows the use of a single privileged UDP port number to talk to an arbitrary number of remote hosts. * Single-threaded kernel RPC server. Adding support for multi-threaded server would be relatively straightforward and would follow approximately the Solaris KPI. A single thread should be sufficient for the NLM since it should rarely block in normal operation. * Kernel mode NLM server supporting cancel requests and granted callbacks. I've tested the NLM server reasonably extensively - it passes both my own tests and the NFS Connectathon locking tests running on Solaris, Mac OS X and Ubuntu Linux. * Userland NLM client supported. While the NLM server doesn't have support for the local NFS client's locking needs, it does have to field async replies and granted callbacks from remote NLMs that the local client has contacted. We relay these replies to the userland rpc.lockd over a local domain RPC socket. * Robust deadlock detection for the local lock manager. In particular it will detect deadlocks caused by a lock request that covers more than one blocking request. As required by the NLM protocol, all deadlock detection happens synchronously - a user is guaranteed that if a lock request isn't rejected immediately, the lock will eventually be granted. The old system allowed for a 'deferred deadlock' condition where a blocked lock request could wake up and find that some other deadlock-causing lock owner had beaten them to the lock. * Since both local and remote locks are managed by the same kernel locking code, local and remote processes can safely use file locks for mutual exclusion. Local processes have no fairness advantage compared to remote processes when contending to lock a region that has just been unlocked - the local lock manager enforces a strict first-come first-served model for both local and remote lockers. Sponsored by: Isilon Systems PR: 95247 107555 115524 116679 MFC after: 2 weeks	2008-03-26 15:23:12 +00:00
Scott Long	478cfc7300	Implement taskqueue_block() and taskqueue_unblock(). These functions allow the owner of a queue to block and unblock execution of the tasks in the queue while allowing tasks to continue to be added queue. Combining this with taskqueue_drain() allows a queue to be safely disabled. The unblock function may run (or schedule to run) the queue when it is called, just as calling taskqueue_enqueue() would. Reviewed by: jhb, sam	2008-03-25 22:38:45 +00:00
Ruslan Ermilov	ea26d58729	Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT. Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true since the advent of MBUMA. Reviewed by: arch There are ongoing disputes as to whether we want to switch to directly using UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.	2008-03-25 09:39:02 +00:00
Ruslan Ermilov	b2798e2573	Regen after changing prototypes of cpuset_{get,set}affinity().	2008-03-25 09:14:17 +00:00
Ruslan Ermilov	7f64829a5e	Fixed type of the fourth argument of cpuset_{get,set}affinity(2) to be size_t. Prodded by: davidxu	2008-03-25 09:11:53 +00:00
Jeff Roberson	0ee6cecc9d	- Greatly simplify vget() by removing the guarantee that any new references to a vnode with VI_OWEINACT set will force the vinactive() call. The kernel makes no guarantees about which reference was the last to close a file or when the actual inactive processing will happen. The previous code was designed to preserve existing semantics in the face of shared locks, however, this was unnecessary. Discussed with: mckusick	2008-03-24 04:22:58 +00:00
Jeff Roberson	804e60d4cf	- Don't acquire the vnode interlock in _vn_lock() unless no lock type is requested. Handle this case specially before the while loop. - Use the held vnode lock to check for VI_DOOMED. The vnode lock and interlock must both be held to set VI_DOOMED so either one held, even shared, is sufficient to check it. No objection by: kib	2008-03-24 04:17:35 +00:00

1 2 3 4 5 ...

10521 Commits