Commit Graph

13459 Commits

John Baldwin
958aa57537 Similar to 233760 and 236717, export some more useful info about the
kernel-based POSIX semaphore descriptors to userland via procstat(1) and
fstat(1):
- Change sem file descriptors to track the pathname they are associated
  with and add a ksem_info() method to copy the path out to a
  caller-supplied buffer.
- Use the fo_stat() method of shared memory objects and ksem_info() to
  export the path, mode, and value of a semaphore via struct kinfo_file.
- Add a struct semstat to the libprocstat(3) interface along with a
  procstat_get_sem_info() to export the mode and value of a semaphore.
- Teach fstat about semaphores and to display their path, mode, and value.

MFC after:	2 weeks
2013-05-03 21:11:57 +00:00
John Baldwin
dfa66c01ae Fix FIONREAD on regular files. The computed result was being ignored and
it was being passed down to VOP_IOCTL() where it promptly resulted in
ENOTTY due to a missing else for the past 8 years.  While here, use a
shared vnode lock while fetching the current file's size.

MFC after:	1 week
2013-05-03 19:08:58 +00:00
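A minimal userland sketch of the ioctl this fixes; the file name is illustrative:

    #include <sys/ioctl.h>
    #include <sys/filio.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(void)
    {
            int fd, nbytes;

            fd = open("/etc/motd", O_RDONLY);       /* any regular file */
            if (fd == -1)
                    return (1);
            /* Bytes readable past the current offset; this failed with
             * ENOTTY on regular files before the fix above. */
            if (ioctl(fd, FIONREAD, &nbytes) == -1)
                    return (1);
            printf("%d bytes readable\n", nbytes);
            return (0);
    }
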
Jilles Tjoelker
b201f4a0dc Regenerate files for pipe2(). 2013-05-01 22:45:04 +00:00
Jilles Tjoelker
dc570d5e56 Add pipe2() system call.
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.

Reviewed by:	kan, kib
2013-05-01 22:42:42 +00:00
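A minimal usage sketch of the new call, following the semantics described above:

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            int p[2];

            /* Both ends are close-on-exec and non-blocking, set atomically. */
            if (pipe2(p, O_CLOEXEC | O_NONBLOCK) == -1)
                    err(1, "pipe2");
            return (0);
    }
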
Jilles Tjoelker
1bf6b724f1 Regenerate files for accept4(). 2013-05-01 20:12:58 +00:00
Jilles Tjoelker
da7d2afb6d Add accept4() system call.
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
2013-05-01 20:10:21 +00:00
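A minimal usage sketch; lsock is an assumed, already-listening socket:

    #include <sys/types.h>
    #include <sys/socket.h>

    /* The new descriptor is close-on-exec and non-blocking from the moment
     * it is created, with no window for a forked child to inherit it. */
    static int
    accept_cloexec(int lsock)
    {
            struct sockaddr_storage ss;
            socklen_t sslen = sizeof(ss);

            return (accept4(lsock, (struct sockaddr *)&ss, &sslen,
                SOCK_CLOEXEC | SOCK_NONBLOCK));
    }
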
Mikolaj Golub
1b8388cde9 Introduce a constant, ELF_NOTE_ROUNDSIZE, which explicitly declares our
intention to use 4-byte padding for ELF notes.

MFC after:	3 weeks
2013-05-01 14:59:16 +00:00
Jilles Tjoelker
cd31b6dd08 socket: Make shutdown() wake up a blocked accept().
A blocking accept (and some other operations) waits on &so->so_timeo. Once
it wakes up, it will detect the SBS_CANTRCVMORE bit.

The error from accept() is [ECONNABORTED] which is not the nicest one -- the
thread calling accept() needs to know out-of-band what is happening.

A spurious wakeup on so->so_timeo appears harmless (sleep retried) except
when lingering on close (SO_LINGER, and in that case there is no descriptor
to call shutdown() on) so this should be fairly safe.

A shutdown() already woke up a blocked accept() for TCP sockets, but not for
Unix domain sockets. This fix is generic for all domains.

This patch was sent to -hackers@ and -net@ on April 5.

MFC after:	2 weeks
2013-04-30 15:06:30 +00:00
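A sketch of the userland pattern this enables; the out-of-band state a real
program needs to tell a deliberate shutdown apart from an aborted connection
is omitted:

    #include <sys/socket.h>
    #include <errno.h>

    /* Runs in the accepting thread; another thread calls
     * shutdown(lsock, SHUT_RDWR) to wake it up. */
    static int
    accept_until_shutdown(int lsock)
    {
            int fd;

            for (;;) {
                    fd = accept(lsock, NULL, NULL);
                    if (fd >= 0)
                            return (fd);
                    if (errno == ECONNABORTED)
                            return (-1);    /* possibly woken by shutdown() */
            }
    }
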
Konstantin Belousov
3d31767952 Eliminate the layering violation in the kern_sendfile(). When querying
the file size, use VOP_GETATTR() instead of accessing vnode vm_object
un_pager.vnp.vnp_size.

Take the shared vnode lock earlier to cover the added VOP_GETATTR()
call and, as a consequence, the whole internal sendfile loop.  Reduce vm
object lock scope to not protect the local calculations.

Note that this is the last misuse of the vnp_size in the tree, the
others were removed from the ELF image activator by r230246.

Reviewed by:	alc
Tested by:	pho, bf (previous version)
MFC after:	1 week
2013-04-28 19:12:09 +00:00
Andre Oppermann
2ebcc8ac4a Base the calculation of maxmbufmem in part on kmem_map size
instead of kernel_map size to prevent kernel memory exhaustion
by mbufs and a subsequent panic on physical page allocation
failure.

On architectures without a direct map all mbuf memory (except
for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map.
Hence it is the limiting factor.

For architectures with a direct map, using the size of kmem_map
is a good proxy for available kernel memory as well.  If it is
much smaller, the mbuf limit may be sub-optimal but remains
reasonable, while avoiding panics under exhaustion.

The overall mbuf memory limit calculation may be reconsidered
again later, however due to the many different mbuf sizes and
different backing KVM maps it is a tricky subject.

Found by:	pho's new network stress test
Pointed out by:	alc (kmem_map instead of kernel_map)
Tested by:	pho
2013-04-24 13:54:55 +00:00
Jaakko Heinonen
a208417c41 Include PID in the error message which is printed when the maxproc limit
is exceeded. Improve formatting of the message while here.

PR:		kern/60550
Submitted by:	Lowell Gilbert, bde
2013-04-19 15:19:29 +00:00
Gleb Smirnoff
14658a80fe Don't compare unsigned socklen_t against < 0.
Reviewed by:	jhb
2013-04-19 13:40:13 +00:00
Jilles Tjoelker
1e367efa8b sem: Restart the POSIX sem_* calls after signals with SA_RESTART set.
Programs often do not expect an [EINTR] return from sem_wait() and POSIX
only allows it if the signal handler was installed without SA_RESTART. The timeout
in sem_timedwait() is absolute so it can be restarted normally.

The umtx call can be invoked with a relative timeout and in that case
[ERESTART] must be changed to [EINTR]. However, libc does not do this.

The old POSIX semaphore implementation did this correctly (before r249566),
unlike the new umtx one.

It may be desirable to avoid [EINTR] completely, which matches the pthread
functions and is explicitly permitted by POSIX. However, the kernel must
return [EINTR] at least for signals with SA_RESTART clear, otherwise pthread
cancellation will not abort a semaphore wait. In this commit, only restore
the 8.x behaviour which is also permitted by POSIX.

Discussed with:	jhb
MFC after:	1 week
2013-04-19 10:16:00 +00:00
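A userland sketch of the visible behaviour; the helper name is illustrative:

    #include <errno.h>
    #include <semaphore.h>

    /* With this change the kernel restarts the wait itself when the
     * interrupting signal's handler has SA_RESTART set; a loop like this
     * is still needed for handlers installed without SA_RESTART. */
    static int
    sem_wait_intr(sem_t *sem)
    {
            int error;

            while ((error = sem_wait(sem)) == -1 && errno == EINTR)
                    continue;
            return (error);
    }
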
Gleb Smirnoff
8f779cc541 On non-ACPI i386 mp_ncpus is initialized at SI_SUB_CPU, and this
prevents us from creating UMA_ZONE_PCPU zones earlier.

As a band-aid, shift initialization of the counter(9) zone later.

Reviewed by:		kib
Reported & tested by:	Lytochkin Boris <lytboris gmail.com>
2013-04-17 18:43:33 +00:00
Gabor Kovesdan
a2098fea6d - Correct misspellings of the word necessary
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:42:40 +00:00
Gabor Kovesdan
ab3f6b347e - Correct misspellings of the word occurrence
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:40:10 +00:00
Warner Losh
11c601447d r249408 and r249436 cause a NULL pointer dereference on the CUBIEBOARD
since it doesn't set the kernel environment at all. Work around this
by making sure kern_envp is non-NULL before dereferencing it.
2013-04-16 22:09:08 +00:00
John Baldwin
8916af883c - Document that sem_wait() can fail with EINTR if it is interrupted by a
signal.
- Fix the old ksem implementation for POSIX semaphores to not restart
  sem_wait() or sem_timedwait() if interrupted by a signal.

MFC after:	1 week
2013-04-16 20:26:31 +00:00
Mikolaj Golub
f1fca82ed5 Add a new set of notes to a process core dump to store procstat data.
The notes format is a header of sizeof(int), which stores the size of
the corresponding data structure to provide some versioning, and data
in the format as it is returned by a related sysctl call.

The userland tools (procstat(1)) will be taught to extract this data,
providing additional info for postmortem analysis.

PR:		kern/173723
Suggested by:	jhb
Discussed with:	jhb, kib
Reviewed by:	jhb (initial version), kib
MFC after:	1 month
2013-04-16 19:19:14 +00:00
Rick Macklem
64fa8df6e0 Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by:	pho
MFC after:	2 weeks
2013-04-16 14:22:16 +00:00
Konstantin Belousov
44d95698ba Some compilers issue a warning when a wider integer is cast to a narrower
pointer.  Silence the warning by casting through
uintptr_t.

Reported by:	ian
2013-04-16 07:11:52 +00:00
George V. Neville-Neil
8f2ba63493 Point args[0] not at the thread that is ending but at the one that
is starting.  This is in line with practice in OpenSolaris.

Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.

PR:		177706
Submitted by:	Tiwei Bie
MFC after:	1 month
2013-04-15 17:21:02 +00:00
Mikolaj Golub
5ea21e6904 Similarly to proc_getargv() and proc_getenvv(), export proc_getauxv()
to be able to reuse the code.

MFC after:	3 weeks
2013-04-14 20:03:48 +00:00
Mikolaj Golub
fe52cf5475 Re-factor the code to provide kern_proc_filedesc_out(), kern_proc_out(),
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.

The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.

Reviewed by:	kib
MFC after:	3 weeks
2013-04-14 20:01:36 +00:00
Mikolaj Golub
bd3902134c Re-factor coredump routines. For each type of note an output
function is provided, which is used either to calculate the note size
or output it to sbuf.  On the first pass the notes are registered in a
list and the resulting size is found, on the second pass the list is
traversed outputting notes to sbuf.  For the sbuf a drain routine is
provided that writes data to a core file.

The main goal of the change is to make coredump write notes
directly to the core file, without first preparing them all in a
memory buffer.  Storing notes in memory is not a problem for the
current, rather small, set of notes we write to the core, but it may
become an issue when we start to store procstat notes.

Reviewed by:	jhb (initial version), kib
Discussed with:	jhb, kib
MFC after:	3 weeks
2013-04-14 19:59:38 +00:00
Mateusz Guzik
db8f33fd32 Add fdallocn function and use it when passing fds over unix socket.
This gets rid of "unp_externalize fdalloc failed" panic.

Reviewed by:	pjd
MFC after:	1 week
2013-04-14 17:08:34 +00:00
Jayachandran C.
f46206c270 Fix changes made in r249408.
In some cases, kern_envp is set by the architecture code and env_pos does
not contain the length of the static kernel environment. In these cases
r249408 causes the kernel to discard the environment.

Fix this by updating the check for empty static env to *kern_envp != '\0'

Reported by:	np@
2013-04-13 07:23:37 +00:00
Jayachandran C.
15f9c9ed69 Fix kenv behavior when there is no static environment
In case where there are no static kernel environment entries, the
function init_dynamic_kenv() adds an incorrect entry at position 0 of
the dynamic kernel environment. This in turn causes kenv(1) to print
an empty list even though there are dynamic entries added later.

Fix this by checking env_pos in init_dynamic_kenv() and adding dynamic
entries only if there are static entries.
2013-04-12 15:58:53 +00:00
Mikolaj Golub
ddb9b61248 Add sbuf_start_section() and sbuf_end_section() functions, which can
be used for automatic section alignment.

Discussed with:	kib
Reviewed by:	kib
MFC after:	1 month
2013-04-11 19:49:18 +00:00
Jim Harris
d58a96538f Fix the build. 2013-04-10 00:35:08 +00:00
Andre Oppermann
e8b3186b6a Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
Attilio Rao
bc403f030d Switch some "low-hanging fruit" to acquire read lock on vmobjects
rather than write locks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-04-08 19:58:32 +00:00
Gleb Smirnoff
4e76af6a41 Merge from projects/counters: counter(9).
Introduce the counter(9) API, which implements fast and raceless counters,
provided for (but not limited to) gathering statistical data.

See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.

In collaboration with:	kib
Reviewed by:		luigi
Tested by:		ae, ray
Sponsored by:		Nginx, Inc.
2013-04-08 19:40:53 +00:00
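A minimal kernel-side sketch of the API; the counter name and helpers are
hypothetical:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/counter.h>

    static counter_u64_t pkt_count;

    static void
    pkt_count_init(void)
    {
            pkt_count = counter_u64_alloc(M_WAITOK);
    }

    /* Hot path: per-CPU, raceless increment with no lock or atomic op. */
    static void
    pkt_seen(void)
    {
            counter_u64_add(pkt_count, 1);
    }

    /* Slow path: fold the per-CPU values into a single total. */
    static uint64_t
    pkt_total(void)
    {
            return (counter_u64_fetch(pkt_count));
    }
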
Mikolaj Golub
c9d59a63e3 Use pget(9) to reduce code duplication.
MFC after:	1 week
2013-04-07 17:44:30 +00:00
Mikolaj Golub
fb5ea9d1c8 Fill the p_flags and p_align fields of the core dump note segment.
Reviewed by:	kib
MFC after:	2 weeks
2013-04-07 17:42:27 +00:00
Mikolaj Golub
27b056480e Use 4-byte padding for core dump notes on both 32 and 64bit archs.
Although native word padding (i.e. 8-byte on 64bit arch) looks to be
in agreement with standards, other parts of our code and other OSes
use 4-byte alignment.

This is not expected to change alignment for currently generated core
dump notes, as the notes appear to consist of structures with sizes
that are multiples of 8 on 64-bit archs. But there are plans to add additional
notes, where 4-byte vs 8-byte alignment makes a difference.

Discussed with:	kib
Reviewed by:	kib
MFC after:	2 weeks
2013-04-07 17:40:49 +00:00
Jilles Tjoelker
b68cf25fe6 mqueue,ksem,shm: Fix race condition with setting UF_EXCLOSE.
POSIX mqueue, compatibility ksem and POSIX shm create a file descriptor that
has close-on-exec set. However, they do this incorrectly, leaving a window
where a thread may fork and exec while the flag has not been set yet. The
race is easily reproduced on a multicore system with one thread doing
shm_open and close and another thread doing posix_spawnp and waitpid.

Set UF_EXCLOSE via falloc()'s flags argument instead. This also simplifies
the code.

MFC after:	1 week
2013-04-07 15:26:09 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper, it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Gleb Smirnoff
b9ce4f67ae Fix memory leak in coredump().
Reviewed by:	kib
2013-04-05 20:24:51 +00:00
Konstantin Belousov
b887a1555c If the filter of the interrupt event is not null, print it in addition to
the handler address.  Add a mark to distinguish between filter and
handler.

Note that the arguments for both filter and handler are same.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
MFC after:	1 week
2013-04-05 14:30:51 +00:00
Brooks Davis
56fddc5d8c MFP4 change 210763
Allow boothowto and bootverbose to be set via kernel options, which
is useful on architectures that are unable to rely on a boot loader
to pass configuration variables to the kernel.

Submitted by:	rwatson
2013-04-03 22:24:36 +00:00
Kenneth D. Merry
a358cf3aec Add support for XPT_CONT_TARGET_IO CCBs in _bus_dmamap_load_ccb().
Declare CCB types in their respective switch blocks.

Sponsored by:	Spectra Logic
2013-04-02 16:49:49 +00:00
Matthew D Fleming
b3e6bbc676 Regen.
MFC after:	1 week
2013-04-02 05:30:52 +00:00
Matthew D Fleming
e324bf91e8 Fix return type of extattr_set_* and fix rmextattr(8) utility.
extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*.  Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.

MFC after:	1 week
2013-04-02 05:30:41 +00:00
Konstantin Belousov
c686ee4685 Do not call the VOP_LOOKUP() for the doomed directory vnode. The
vnode could be reclaimed while lock upgrade was performed.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
Diagnosed and reviewed by:	rmacklem
MFC after:	1 week
2013-04-01 09:59:38 +00:00
Jilles Tjoelker
d289dc7b73 Rename do_pipe() to kern_pipe2() and declare it properly. 2013-03-31 17:42:54 +00:00
Matthew D Fleming
926cd204c7 Use a shared lock for VOP_GETEXTATTR, as it is a read-like operation.
MFC after:	1 week
2013-03-30 15:09:04 +00:00
Jim Harris
10a93479b9 Add bus_dmamap_load_bio for non-CAM disk drivers that wish to enable
unmapped I/O.

Sponsored by:	Intel
Reviewed by:	kib
2013-03-29 16:26:25 +00:00
Jim Harris
86675b5c0d Add CTR5() to bus_dmamap_load_ccb, similar to other bus_dmamap_load_*
functions.

Sponsored by:	Intel
2013-03-29 16:00:16 +00:00
Jim Harris
ab72998ef7 Do not add 1 to nsegs before passing to CTR5(), since nsegs
has already been incremented before these calls.

Sponsored by:	Intel
2013-03-29 15:54:12 +00:00
Jim Harris
b327350604 Pass correct parameter to CTR5() in bus_dmamap_load_uio.
Sponsored by:	Intel
2013-03-29 15:51:45 +00:00
Gleb Smirnoff
21f398487c Fix a bug in m_split() in the case when the split length matches the length
of the first mbuf, and the first mbuf is M_PKTHDR.

PR:		kern/176144
Submitted by:	Jacques Fourie <jacques.fourie gmail.com>
2013-03-29 14:10:40 +00:00
Gleb Smirnoff
844cacd17c Once ng_ksocket(4) is fixed, re-apply r194662. See this revision for
longer description.

Discussed with:	andre, rwatson
Sponsored by:	Nginx, Inc.
2013-03-29 14:06:04 +00:00
Gleb Smirnoff
a307eb26ed When soreceive_generic() hands off an mbuf from the buffer,
clear its pointer to the next record, since the next record
belongs to the buffer and shouldn't be leaked.

The ng_ksocket(4) used to clear this pointer itself,
but the correct place is here.

Sponsored by:	Nginx, Inc
2013-03-29 13:57:55 +00:00
Scott Long
07dbf2c768 Several fixes and improvements to sendfile()
1.  If we wanted to send exactly as many bytes as the socket buffer is
    sized for, the inner loop of kern_sendfile() would see that the
    socket is full before seeing that it had no more bytes left to send.
    This would cause it to return EAGAIN to the caller instead of
    success.  Fix by changing the order that these conditions are tested.
2.  Simplify the calculation for the bytes to send in each iteration of
    the inner loop of kern_sendfile()
3.  Fix some calls with bogus arguments to sf_buf_ext().  These would
    only trigger on mbuf allocation failure, but would be hilariously
    bad if they did trigger.

Submitted by:	gibbs(3), andre(2)
Reviewed by:	emax, andre
Obtained from:	Netflix
MFC after:	1 week
2013-03-28 14:14:28 +00:00
Jim Harris
47301c53ed deferal -> deferral 2013-03-27 23:07:43 +00:00
Konstantin Belousov
f3215a60fd Fix a race with the vnode reclamation in the aio_qphysio(). Obtain
the thread reference on the vp->v_rdev and use the returned struct
cdev *dev instead of using vp->v_rdev.  Call dev_strategy_csw()
instead of dev_strategy(), since we now own the reference.

Since the csw was already calculated, test d_flags to avoid mapping
the buffer if the driver supports unmapped requests [*].

Suggested by:	kan [*]
Reviewed by:	kan (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-27 11:47:52 +00:00
Konstantin Belousov
d1e99f43ed Add dev_strategy_csw() function, which is similar to dev_strategy()
but assumes that a thread reference was already obtained on the passed
device.  Use the function from physio(), to avoid two extra dev_mtx
lock and unlock.  Note that physio() is always used as the cdevsw
method, or is called from a cdevsw method, and the caller already owns
the reference.

dev_strategy() is left to keep KPI intact, but now it is implemented
as a wrapper around dev_strategy_csw().

Do some style cleanup in physio().

Requested and reviewed by:	kan (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-27 11:34:27 +00:00
Konstantin Belousov
88c8c0a70f On i386, double the default size of the bio transient map. With the
maxbcache size fixed, the auto-tuned transient map is too small for
real-world load on i386.

Tested by:	David Wolfskill
Sponsored by:	The FreeBSD Foundation
2013-03-27 10:56:15 +00:00
Alexander Kabaev
31932fae1e Do not pass unmapped buffers to drivers that cannot handle them
In physio, check if device can handle unmapped IO and pass an
appropriately mapped buffer to the driver strategy routine. The
only driver in the tree that can handle unmapped buffers is one
exposed by GEOM, so mark it as such with the new flag in the
driver cdevsw structure.

This fixes insta-panics on hosts running dconschat, as /dev/fwmem
is an example of a driver that makes use of the physio routine but
bypasses the g_down thread, where the buffer normally gets mapped.

Discussed with: kib (earlier version)
2013-03-26 01:17:06 +00:00
Davide Italiano
3f321a4eac Cache the callout precision argument as part of the information required
for migrating callouts to a new CPU. This value is passed to
callout_cc_add() in order to properly update the precision field in case of
rescheduling/migration.

Reviewed by:	mav
2013-03-25 09:43:50 +00:00
Will Andrews
fdbc71742b Extend taskqueue(9) to enable per-taskqueue callbacks.
The scope of these callbacks is primarily to support actions that affect the
taskqueue's thread environments.  They are entirely optional, and
consequently are introduced as a new API: taskqueue_set_callback().

This interface allows the caller to specify that a taskqueue requires a
callback and optional context pointer for a given callback type.

The callback types included in this commit can be used to register a
constructor and destructor for thread-local storage using osd(9).  This
allows a particular taskqueue to define that its threads require a specific
type of TLS, without the need for a specially-orchestrated task-based
mechanism for startup and shutdown in order to accomplish it.

Two callback types are supported at this point:

- TASKQUEUE_CALLBACK_TYPE_INIT, called by every thread when it starts, prior
  to processing any tasks.
- TASKQUEUE_CALLBACK_TYPE_SHUTDOWN, called by every thread when it exits,
  after it has processed its last task but before the taskqueue is
  reclaimed.

While I'm here:

- Add two new macros, TQ_ASSERT_LOCKED and TQ_ASSERT_UNLOCKED, and use them
  in appropriate locations.
- Fix taskqueue.9 to mention taskqueue_start_threads(), which is a required
  interface for all consumers of taskqueue(9).

Reviewed by:	kib (all), eadler (taskqueue.9), brd (taskqueue.9)
Approved by:	ken (mentor)
Sponsored by:	Spectra Logic
MFC after:	1 month
2013-03-23 15:11:53 +00:00
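A minimal kernel-side sketch of registering the two callback types; the names
are hypothetical and the callbacks are assumed to be registered before
taskqueue_start_threads():

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/queue.h>
    #include <sys/taskqueue.h>

    static void
    tq_thread_ctor(void *ctx)
    {
            /* Per-thread setup, e.g. attach thread-local storage via osd(9). */
    }

    static void
    tq_thread_dtor(void *ctx)
    {
            /* Per-thread teardown, run after the thread's last task. */
    }

    static void
    tq_setup(struct taskqueue *tq)
    {
            taskqueue_set_callback(tq, TASKQUEUE_CALLBACK_TYPE_INIT,
                tq_thread_ctor, NULL);
            taskqueue_set_callback(tq, TASKQUEUE_CALLBACK_TYPE_SHUTDOWN,
                tq_thread_dtor, NULL);
    }
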
Andriy Gapon
ca84e042a3 post the mountroot event after a real/final root is mounted,
not every time an intermediate root (including the first devfs) is
mounted.
This is also consistent with waking up via root_mount_complete.

Reviewed by:	jhb
MFC after:	13 days
2013-03-23 08:59:34 +00:00
Pawel Jakub Dawidek
051a23d4e8 - Constify local path variable for chflagsat().
- Use correct format characters (%lx) for u_long.

This fixes the build broken in r248599.
2013-03-22 07:40:34 +00:00
Pawel Jakub Dawidek
5d46382415 Regenerate after r248599.
Sponsored by:	The FreeBSD Foundation
2013-03-21 23:02:19 +00:00
Pawel Jakub Dawidek
e948704e4b Implement the chflagsat(2) system call, similar to fchmodat(2), but operating on
file flags.

Reviewed by:	kib, jilles
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:59:01 +00:00
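A minimal usage sketch, assuming the fchmodat(2)-like prototype described
above; the descriptor and file name are illustrative:

    #include <sys/stat.h>
    #include <unistd.h>

    /* Set the "do not dump" user flag on a file relative to an open
     * directory descriptor dfd. */
    static int
    mark_nodump(int dfd)
    {
            return (chflagsat(dfd, "scratch.dat", UF_NODUMP, 0));
    }
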
Pawel Jakub Dawidek
14cd1ffdf8 Regenerate after r248597.
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:47:03 +00:00
Pawel Jakub Dawidek
b4b2596b97 - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
  in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
  for lchflags(2)) stated that it was u_long. Now some related functions
  use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on:	arch
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:44:33 +00:00
Jilles Tjoelker
46f10cc265 Allow O_CLOEXEC in posix_openpt() flags.
PR:		kern/162374
Reviewed by:	ed
2013-03-21 21:39:15 +00:00
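A minimal usage sketch:

    #include <fcntl.h>
    #include <stdlib.h>

    /* The pty master descriptor is created close-on-exec atomically. */
    static int
    open_pty_master(void)
    {
            return (posix_openpt(O_RDWR | O_NOCTTY | O_CLOEXEC));
    }
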
Attilio Rao
d52d7aa871 Fix a bug in UMTX_PROFILING:
UMTX_PROFILING should really analyze the distribution of locks as they
index entries in the umtxq_chains hash table.
However, the current implementation adds/decrements the length counters
for *every* thread insert/removal, so it really measures userland
contention and not the hash distribution.

Fix this by adding/decrementing the length counters only at the points
where it is really needed.

Please note that this bug led us in the past to question the quality
of the umtx hash table distribution.
To date, with all the benchmarks I could try, I was not able to reproduce
any issue with the hash distribution of umtx.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff, davide
MFC after:	2 weeks
2013-03-21 19:58:25 +00:00
John Baldwin
d071a6fa33 Another NFS SIGSTOP related fix: Ignore thread suspend requests due to
SIGSTOP if stop signals are currently deferred.  This can occur if a
process is stopped via SIGSTOP while a thread is running or runnable
but before it has set TDF_SBDRY.

Tested by:	pho
Reviewed by:	kib
MFC after:	1 week
2013-03-21 14:06:27 +00:00
Konstantin Belousov
7db07e1c85 Only size and create the bio_transient_map when unmapped buffers are
enabled.  Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.

Sponsored by:	The FreeBSD Foundation
2013-03-21 07:28:15 +00:00
Konstantin Belousov
e3269b5096 In bufwrite(), a dirty buffer is moved to the clean queue before the
bufobj counter of the writes in progress is incremented.  Other thread
inspecting the bufobj would consider it clean.

For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and another thread doing syncing,
which prevents the situation.  On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.

Increment the write ref counter for the buffer object before calling
bundirty().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-20 21:08:00 +00:00
Konstantin Belousov
8d6884ce9c When the journaled FFS volume is suspended due to the journal space
becoming too low, the softdep flush thread processes the workitems,
which frees the space in journal, and then unsuspends the fs.  The
softdep_flush() and other workitem processing functions busy the
filesystem before iterating over the worklist, to prevent the parallel
unmount from freeing the mount data. The vfs_busy() is called with
MBF_NOWAIT flag.

Now, if the unmount is already started and the filesystem is suspended
due to low journal space, the journal is never flushed and filesystem
is never unsuspended, because vfs_busy(MBF_NOWAIT) call cannot succeed
for the unmounting fs, and softdep_flush() does not process the
workitems. Unmount needs to write metadata, where it hangs in the
"suspfs" state.

Move the vn_start_write() call in the dounmount() before setting the
MNTK_UNMOUNT flag. This practically ensures that softdep_flush()
processed the pending journal writes by making dounmount() wait for
the lift of the suspension.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
MFC after:	2 weeks
2013-03-20 21:07:49 +00:00
Kirk McKusick
3289d5877a When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   4 weeks
2013-03-20 17:57:00 +00:00
Jilles Tjoelker
c2e3c52e0d Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.
This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.

The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.

The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.

For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.

Reviewed by:	kib
2013-03-19 20:58:17 +00:00
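A minimal userland sketch of the new flags; descriptor and variable names are
illustrative:

    #include <sys/types.h>
    #include <sys/socket.h>

    /* socketpair() type flags: both descriptors are close-on-exec and
     * non-blocking from creation. */
    static int
    make_local_pair(int sv[2])
    {
            return (socketpair(PF_LOCAL,
                SOCK_STREAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0, sv));
    }

    /* recvmsg() flag: SCM_RIGHTS descriptors arrive close-on-exec. */
    static ssize_t
    recv_fds_cloexec(int s, struct msghdr *msg)
    {
            return (recvmsg(s, msg, MSG_CMSG_CLOEXEC));
    }
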
Konstantin Belousov
e81ff91e62 Do not remap usermode pages into KVA for physio.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:43:57 +00:00
Konstantin Belousov
7d5365c70b Add a helper function vfs_bio_bzero_buf() to zero the portion of the
buffer, transparently handling mapped or unmapped buffers.  Its intent
is to replace the use of bzero(bp->b_data) in cases where the buffer
might be unmapped, to avoid unneeded upgrades.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:27:14 +00:00
Konstantin Belousov
ee75e7de7b Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of
unmapped buffers eliminates the need to perform TLB shootdowns for
mapping on buffer creation and reuse, greatly reducing the number
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

An unmapped buffer should be explicitly requested by the consumer with
the GB_UNMAPPED flag.  For an unmapped buffer, no KVA reservation is
performed at all. The consumer might request an unmapped buffer which
does have a KVA reservation, to manually map it without recursing into
the buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

An unmapped buffer is translated into an unmapped bio in g_vfs_strategy().
An unmapped bio carries a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitly specify its readiness to accept unmapped bios;
otherwise the g_down geom thread performs a transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable; disabling it makes the buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is no longer subject to the
maxbufspace limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace limit is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.

At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
John Baldwin
1968f37bc9 Tweak some comments. 2013-03-18 18:04:09 +00:00
John Baldwin
3cf3b9f097 Partially revert r195702. Deferring stops is now implemented via a set of
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
  issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.
2013-03-18 17:23:58 +00:00
Gleb Smirnoff
4f67e14304 In m_align() add assertions that mbuf is virgin, similar to assertions
in M_ALIGN(), MH_ALIGN, MEXT_ALIGN() macros.
2013-03-17 07:41:14 +00:00
Pawel Jakub Dawidek
943c3bb968 Require CAP_SEEK if both O_APPEND and O_TRUNC flags are absent.
In other words we don't require CAP_SEEK if either O_APPEND or O_TRUNC flag is
given, because O_APPEND doesn't allow overwriting existing data and O_TRUNC
requires CAP_FTRUNCATE already.

Sponsored by:	The FreeBSD Foundation
2013-03-16 23:19:13 +00:00
Pawel Jakub Dawidek
d6b2bd0bc9 Style: Whitespace fixes. 2013-03-16 22:37:30 +00:00
Pawel Jakub Dawidek
1ea67dd9e5 Style: Remove redundant space. 2013-03-16 22:36:24 +00:00
Gleb Smirnoff
c95be8b536 - Replace compat macros with function calls.
- Remove superfluous cleaning of m_len after allocating.

Sponsored by:	Nginx, Inc.
2013-03-16 08:57:36 +00:00
Gleb Smirnoff
5368b81eb0 Contrary to what the deleted comment said, the m_move_pkthdr()
will not smash the M_EXT and data pointer, so it is safe to
pass an mbuf with external storage produced by m_getcl() to
m_move_pkthdr().

Reviewed by:	andre
Sponsored by:	Nginx, Inc.
2013-03-16 08:55:21 +00:00
Pawel Jakub Dawidek
c9cea47007 Sort syscalls properly. 2013-03-15 23:00:13 +00:00
Konstantin Belousov
aed5a114d7 Separate the copyright lines and the informational block by a blank line.
Requested by:	joel
MFC after:	2 weeks
2013-03-15 14:01:37 +00:00
Konstantin Belousov
5791cee883 Add my copyright for the 2012 year work, in particular vn_io_fault()
and f_offset locking.  Add required Foundation notice for r248319.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-15 12:57:30 +00:00
Konstantin Belousov
5f5f055441 Implement the helper function vn_io_fault_pgmove(), intended to be used by
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers.  The helper
provides a convenient wrapper over pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-15 11:16:12 +00:00
Gleb Smirnoff
c8b59ea750 Use m_get() and m_getcl() instead of compat macros. 2013-03-15 10:21:18 +00:00
Gleb Smirnoff
93cfe76349 - Use m_get2() instead of hand allocating.
- No need for u_int cast here.

Sponsored by:	Nginx, Inc.
2013-03-15 10:17:24 +00:00
Gleb Smirnoff
3112ae7644 Make m_get2() never use clusters that are bigger than PAGE_SIZE.
Requested by:	andre, jhb
Sponsored by:	Nginx, Inc.
2013-03-15 10:15:07 +00:00
Edward Tomasz Napierala
a8efb53478 When throttling a process to enforce RACCT limits, use neither
PBDRY (which simply doesn't make any sense) nor PCATCH (which could
be used by a malicious process to work around the PCPU limit).

Submitted by:	Rudo Tomori
Reviewed by:	kib
2013-03-14 23:25:42 +00:00
Edward Tomasz Napierala
16befafd16 Accessing td_state requires thread lock to be held.
Submitted by:	Rudo Tomori
Reviewed by:	kib
2013-03-14 23:20:18 +00:00
Konstantin Belousov
70e198dd07 Some style fixes.
Sponsored by:	The FreeBSD Foundation
2013-03-14 20:31:39 +00:00
Konstantin Belousov
c535690b33 Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions.  The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-14 20:28:26 +00:00
Konstantin Belousov
a1143a3ba8 Rewrite the vfs_bio_clrbuf(9) to not access the b_data for B_VMIO
buffers directly, use pmap_zero_page_area(9) for each zeroing page
region instead.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-14 19:48:25 +00:00
Tijl Coosemans
d19d5bf443 - Fix two possible overflows when testing if ELF program headers are on
the first page:
  1. Cast uint16_t operands in a multiplication to unsigned int because
     otherwise the implicit promotion to int results in a signed
     multiplication that can overflow and the behaviour on integer
     overflow is undefined.
  2. Replace (offset + size > PAGE_SIZE) with (size > PAGE_SIZE - offset)
     because the sum may overflow.
- Use the same tests to see if the path to the interpreter is on the first
  page. There's no overflow here because size is already limited by
  MAXPATHLEN, but the compiler optimises the new tests better. Also fix an
  off-by-one error.
- Simplify tests to see if an ELF note program header is on the first page.
  This also fixes an off-by-one error.

Reviewed by:	kib
MFC after:	1 week
2013-03-13 22:01:31 +00:00
Alexander Motin
ca9feb490c Fix an incorrect assertion that caused a panic when periodic-only timers are used. 2013-03-13 06:42:01 +00:00
Gleb Smirnoff
41a7572b26 Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.
2013-03-12 13:42:47 +00:00
Gleb Smirnoff
3b4a84e757 In kern_sendfile() use m_extadd() instead of MEXTADD() macro, supplying
appropriate wait argument and checking return value. Before this change
m_extadd() could fail, and kern_sendfile() ignored that.

Sponsored by:	Nginx, Inc.
2013-03-12 12:15:24 +00:00
Gleb Smirnoff
8c629bdf05 The m_extadd() can fail due to memory allocation failure, thus:
- Make it return int, not void.
- Add wait parameter.
- Update MEXTADD() macro appropriately, defaults to M_NOWAIT, as
  before this change.

Sponsored by:	Nginx, Inc.
2013-03-12 12:12:16 +00:00
Alexander Motin
0dbf17e6eb Make kern_nanosleep() and pause_sbt() use per-CPU sleep queues.
This removes significant sleep queue lock congestion on multithreaded
microbenchmarks, making them scale to multiple CPUs almost linearly.
2013-03-12 06:58:49 +00:00
Pawel Jakub Dawidek
be26ba7cd3 Fix a memory leak when one process sends a descriptor over a UNIX domain socket,
but the other process exits before receiving it.
2013-03-11 22:59:07 +00:00
Michael Tuexen
fbb3471022 Return an error if sctp_peeloff() fails because a socket can't be allocated.
MFC after: 3 days
2013-03-11 17:43:55 +00:00
Andre Oppermann
a7aea132cf Bring back the comment on the sizing of the callout array that got
lost in r248031.

Requested by:	alc, alfred
2013-03-10 22:55:35 +00:00
Davide Italiano
c5904471dc Fixup r248032:
Change size requested to malloc(9) now that callwheel buckets are
callout_list and not callout_tailq anymore. This change was already
there but it seems it got lost after code churn in r248032.

Reported by:	alc, kib
2013-03-09 20:03:10 +00:00
Attilio Rao
1fc8c346d5 Improve UMTX_PROFILING:
- Use u_int values for length and max_length values
- Add a way to reset the max_length heuristic in order to make it
  possible to reuse the mechanism repeatedly without rebooting
  the machine
- Add a way to quickly display the top 5 contended buckets in the system for
  the max_length value.
  This should give a quick overview of the quality of the hash table
  distribution.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff, davide
2013-03-09 15:31:19 +00:00
Konstantin Belousov
7a61281f22 Correct the lock class for the vm object lock.
Reported and tested by:	joel
2013-03-09 10:16:08 +00:00
Alexander Motin
21a37a7196 Rework the overflow checks of r247898 so that a too-"intelligent" compiler cannot
optimize them out.

Submitted by:	bde
2013-03-09 09:07:13 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable further
optimizations in the future, where the vm_object lock will be held
in read mode most of the time the page-cache-resident pool of pages
is accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers must now also include sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  For this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit.  Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Andre Oppermann
15ae0c9af9 Move the callout subsystem initialization to its own SYSINIT()
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.

Allocation of the callout array is changed to standard kernel malloc
from a slightly obscure direct kernel_map allocation.

kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.

Reviewed by:	davide
2013-03-08 10:37:17 +00:00
Andre Oppermann
f8ccf82a4c Move the auto-sizing of the callout array from init_param2() to
kern_timeout_callwheel_alloc() where it is actually used.

This is a mechanical move and no tuning parameters are changed.

The pre-allocated callout array is only used for legacy timeout(9)
calls and is only allocated and active on cpu0.  Eventually all
remaining users of timeout(9) should switch to the callout_* API.

Reviewed by:	davide
2013-03-08 10:14:58 +00:00
Alexander Motin
836972b877 Fix off-by-one error in nanoseconds validation.
Submitted by:	bde
2013-03-07 16:50:07 +00:00
Ian Lepore
9a2bff7ca6 Call sched_prio() to immediately change the priority of the thread in
response to an rtprio_thread() call, when the priority is different
than the old priority, and either the old or the new priority class is
not RTP_PRIO_NORMAL (timeshare).

The reasoning for the second half of the test is that if it's a change in
timeshare priority, then the scheduler is going to adjust that priority
in a way that completely wipes out the requested change anyway, so
what's the point?  (If that's not true, then allowing a thread to change
its own timeshare priority would subvert the scheduler's adjustments and
let a cpu-bound thread monopolize the cpu; if allowed at all, that
should require privileges.)

On the other hand, if either the old or new priority class is not
timeshare, then the scheduler doesn't make automatic adjustments, so we
should honor the request and make the priority change right away.  The
reason the old class gets caught up in this is the very reason for this
change:  when thread A changes the priority of its child thread B from
idle back to timeshare, thread B never actually gets moved to a
timeshare-range run queue unless there are some idle cycles available
to allow it to first get scheduled again as an idle thread.

Reviewed by:	jhb@
2013-03-07 02:53:29 +00:00
Alexander Motin
b5ea3779da Reduce the minimal time intervals of setitimer(2) from 1/HZ to 1/(16*HZ) by
using callout_reset_sbt() instead of callout_reset().  We can't remove the
lower limit completely in this case because of the significant processing
overhead, caused by the inability to use direct callout execution due to the use
of the process mutex in the callout handler for sending the SIGALRM signal.  With
support for periodic events that would allow an unprivileged user to abuse the system.

Reviewed by:	davide
2013-03-06 22:40:47 +00:00
Alexander Motin
980c545d76 Fix time math overflows and improve zero intervals handling in poll(),
select(), nanosleep() and kevent() functions after calloutng changes.

Reported by:	bde
2013-03-06 19:37:38 +00:00
Fabien Thomas
d49302aead Add a generic way to call per event allocate / release function.
Reviewed by:	mav
MFC after:	1 month
2013-03-05 10:18:48 +00:00
Davide Italiano
ac42a1726a Complete r247813:
Use true/false instead of TRUE/FALSE.

Reported by:	attilio
Requested by:	jhb
2013-03-04 21:52:12 +00:00
Davide Italiano
a4a3ce9919 Use C99 'bool' rather than Machish 'boolean_t'.
Requested by:	jhb
2013-03-04 21:09:22 +00:00
Davide Italiano
40e794ab19 MFcalloutng:
- Rewrite kevent() timeout implementation to allow sub-tick precision.
- Make the interval timings for EVFILT_TIMER more accurate. This also
removes a hack introduced in r238424.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:55:16 +00:00
Davide Italiano
cf5e4fe6bb MFcalloutng:
Fix kern_select() and sys_poll() so that they can handle sub-tick
precision for timeouts (in the same fashion it was done for nanosleep()
in r247797).

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:41:27 +00:00
Davide Italiano
4601bab1fb MFcalloutng (r244251 with minor changes):
Specify that precision of 0.5s is enough for resource limitation.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:25:12 +00:00
Davide Italiano
c38250c9b9 MFcalloutng (r244255 by mav, with minor changes):
Specify that syslog doesn't need exactly 5 wakeups per second.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:07:55 +00:00
Davide Italiano
098176f0d0 MFcalloutng:
kern_nanosleep() is now converted to use tsleep_sbt(). With this change
nanosleep() and usleep() can handle sub-tick precision for timeouts.
Also, try to help coalescing of events by passing as an argument to tsleep_sbt()
a precision value calculated as a percentage of the sleep time.
This percentage defaults to 5%, but it can be tuned according to users'
needs via the sysctl interface.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 15:57:41 +00:00
Davide Italiano
037637812d Fix build with DIAGNOSTIC/CALLOUT_PROFILING options turned on.
Reported by:	kib, David Wolfskill <david at catwhisker dot org>
Pointy-hat to:	davide
2013-03-04 15:03:52 +00:00
Davide Italiano
24e48c6d5b MFcalloutng:
Introduce sbt variants of msleep(), msleep_spin(), pause(), tsleep() in
the KPI, allowing the timeout to be specified in 'sbintime_t' rather than ticks.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 12:48:41 +00:00
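A minimal kernel-side sketch of one of the new variants, assuming the
pause_sbt(wmesg, sbt, precision, flags) form; the values are illustrative:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/time.h>

    /* Sleep for about 2ms, allowing 200us of slop so the wakeup can be
     * coalesced with nearby events. */
    static void
    short_nap(void)
    {
            pause_sbt("nap", 2 * SBT_1MS, 200 * SBT_1US, 0);
    }
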
Davide Italiano
461537356a MFcalloutng:
Extend the condvar(9) KPI, introducing an sbt variant of cv_timedwait. This
relies on the previously committed sleepq_set_timeout_sbt().

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 12:20:48 +00:00
Davide Italiano
965ac611ec MFcalloutng:
Convert sleepqueue(9) bits to the new callout KPI. Take advantage of
the possibility to run callback directly from hw interrupt context.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 11:51:46 +00:00
Davide Italiano
dbd2e1677f MFcalloutng (r244355):
Make the loadavg calculation callout direct. There are several reasons for it:
 - it is very simple and isn't worth a context switch to SWI;
 - since SWI is no longer used here, we can remove a twelve-year-old hack
excluding this SWI from the loadavg statistics;
 - it fixes a problem when the eventtimer (HPET) shares an interrupt with some other
device, and that interrupt thread counted as a permanent loadavg of 1; now
loadavg is accounted before that interrupt thread is scheduled.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, Fabian Keil, markj
2013-03-04 11:22:19 +00:00
Davide Italiano
5b999a6be0 - Make callout(9) tickless, relying on eventtimers(4) as backend for
precise time event generation. This greatly improves the granularity of
callouts, which are no longer constrained to wait for the next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather information about the callwheel, introducing
a new sysctl to obtain stats.

This change breaks the KBI. The struct callout fields have been changed; in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.

Together with:	mav
Reviewed by:	attilio, bde, luigi, phk
Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo (amd64, sparc64), marius (sparc64), ian (arm),
		markj (amd64), mav, Fabian Keil
2013-03-04 11:09:56 +00:00
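A minimal kernel-side sketch of the new KPI; all names and the chosen
interval/precision are illustrative:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/time.h>
    #include <sys/callout.h>

    static struct callout demo_co;

    static void
    demo_fire(void *arg)
    {
            /* Periodic work; reschedule here for a repeating timer. */
    }

    /* Arm the callout ~100ms out with 10ms of allowed slop, so the
     * framework may coalesce it with nearby events instead of programming
     * a dedicated hardware interrupt. */
    static void
    demo_start(void)
    {
            callout_init(&demo_co, 1);
            callout_reset_sbt(&demo_co, 100 * SBT_1MS, 10 * SBT_1MS,
                demo_fire, NULL, 0);
    }
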
Pawel Jakub Dawidek
8cb539f18f For some reason, when I started to pass filedescent structures instead of
pointers to the file structure, receiving descriptors stopped working when
at least a few kilobytes of data are also being sent. In the kernel the
soreceive_generic() function doesn't see the control mbuf as the first mbuf and
unp_externalize() is never called; the first 6(?) kilobytes of data are missing as
well on the receiving end.

This breaks for example tmux.

I don't know yet why going from 8 bytes to sizeof(struct filedescent) per
descriptor (or even to 16 bytes per descriptor) breaks things, but to
work around it for now, use 8 bytes per file descriptor at the cost of memory
allocation.

Reported by:	flo, Diane Bruce, Jan Beich <jbeich@tormail.org>
Simple testcase provided by:	mjg
2013-03-03 23:39:30 +00:00
Pawel Jakub Dawidek
5f39e56581 Use a dedicated malloc type for filecaps-related data, so we can detect any
memory leaks more easily.
2013-03-03 23:25:45 +00:00
Pawel Jakub Dawidek
a6157c3d61 Plug memory leaks in file descriptors passing. 2013-03-03 23:23:35 +00:00
Davide Italiano
3f555c45eb callwheelmask and callwheelsize are always greater than zero.
Switch their type to u_int.
2013-03-03 15:01:33 +00:00
Davide Italiano
0fb285b716 Remove a couple of unused includes. 2013-03-03 14:47:02 +00:00
Alexander Motin
4514d6fa18 MFcalloutng:
Some whitespace fixes.
2013-03-03 09:11:24 +00:00
Pawel Jakub Dawidek
378a73d1bd Regen after r247667. 2013-03-02 21:12:54 +00:00
Pawel Jakub Dawidek
7493f24ee6 - Implement two new system calls:
	int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
	int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

  which allow binding and connecting, respectively, to a UNIX domain socket with a
  path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that have to be present on
  the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle the path
  in the sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by:	The FreeBSD Foundation
Discussed with:	rwatson, jilles, kib, des
2013-03-02 21:11:30 +00:00
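A minimal usage sketch of bindat(), using the prototype given above; the path
and descriptor names are illustrative:

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <string.h>

    /* Bind the UNIX domain socket s to "ctl" relative to the directory
     * open at dfd. */
    static int
    bind_below(int dfd, int s)
    {
            struct sockaddr_un sun;

            memset(&sun, 0, sizeof(sun));
            sun.sun_family = AF_UNIX;
            strlcpy(sun.sun_path, "ctl", sizeof(sun.sun_path));
            return (bindat(dfd, s, (struct sockaddr *)&sun, sizeof(sun)));
    }
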
Pawel Jakub Dawidek
6d4e99aaef If the target file already exists, check for the CAP_UNLINKAT capability right
on the target directory descriptor, but only if this is renameat(2) and a real
target directory descriptor is given (not AT_FDCWD). Without this fix regular
rename(2) fails if the target file already exists.

Reported by:	Michael Butler <imb@protected-networks.net>
Reported by:	Larry Rosenman <ler@lerctr.org>
Sponsored by:	The FreeBSD Foundation
2013-03-02 09:58:47 +00:00
Pawel Jakub Dawidek
1dc31587bf Regen after r247602. 2013-03-02 00:55:09 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer a separate descriptor type. Now every descriptor
  has its own set of capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If the CAP_IOCTL capability right is present we can further reduce the allowed
  ioctl list with the new cap_ioctls_limit(2) syscall. The list of allowed
  ioctls can be retrieved with the cap_ioctls_get(2) syscall.

- If the CAP_FCNTL capability right is present we can further reduce the fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrieve
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavily modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and even though I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convenient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
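To make the revised model above concrete, a minimal userland sketch might look as follows, assuming the uint64_t rights-mask form of cap_rights_limit(2) described in this commit and the <sys/capability.h> header of that era; error handling is kept to the bare minimum. Unlike cap_new(2), the descriptor keeps its number and simply loses rights.

#include <sys/capability.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: restrict fd to read-only, seekable I/O, then enter the sandbox. */
int
open_sandboxed(const char *path)
{
	int fd;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);
	if (cap_rights_limit(fd, CAP_READ | CAP_SEEK) == -1 ||
	    cap_enter() == -1) {
		close(fd);
		return (-1);
	}
	return (fd);	/* read(2)/pread(2) still work; most other operations do not. */
}
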
John Baldwin
f9379dc411 Replace the TDP_NOSLEEPING flag with a counter so that the
THREAD_NO_SLEEPING() and THREAD_SLEEPING_OK() macros can nest.

Reviewed by:	attilio
2013-03-01 22:03:31 +00:00
Pawel Jakub Dawidek
71ac38e896 Remove unnecessary variables. 2013-03-01 21:58:56 +00:00
Pawel Jakub Dawidek
f4d0191b22 Reduce lock scope a little. 2013-03-01 21:57:02 +00:00
Marius Strobl
db9066f798 - Use strdup(9) instead of reimplementing it.
- Use __DECONST instead of strange casts.
- Reduce code duplication and simplify name2oid().

PR:		176373
Submitted by:	Christoph Mallon
MFC after:	1 week
2013-03-01 18:49:14 +00:00
Konstantin Belousov
58248e57ab Make the default implementation of the VOP_VPTOCNP() fail if the
directory entry, matched by the inode number, is ".".

The NFSv4 client might instantiate distinct vnodes which have the same
inode number, since a single v4 export can be combined from several
filesystems on the server.  For instance, when the nested server mount
point is exactly one directory below the top of the export, the
directory and its parent have the same inode number 2.  The
vop_stdvptocnp() algorithm then returns "." as the name of the lower
directory.

Filtering out the "." entry with ENOENT works around this behaviour;
the error forces getcwd(3) to fall back to the usermode implementation,
which compares both st_dev and st_ino.

Based on the submission by:	rmacklem
Tested by:	rmacklem
MFC after:	1 week
2013-03-01 18:40:14 +00:00
Davide Italiano
e234a588cb MFcalloutng:
Style fixes.
2013-02-28 16:22:49 +00:00
Alexander Motin
fdc5dd2d2f MFcalloutng:
Switch eventtimers(9) from using struct bintime to sbintime_t.
Even before this, not a single driver really supported the full dynamic range
of struct bintime even in theory, to say nothing of its practical
inexpediency.  This change legitimizes the status quo and cleans up the code.
2013-02-28 13:46:03 +00:00
Davide Italiano
acccf7d8b4 MFcalloutng:
When a CPU becomes idle, cpu_idleclock() calculates the time to the next timer
event in order to reprogram the hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more (quite precise) factor to estimate further sleep time and choose the
optimal sleep state. This is a preparatory change for further callout
improvements that will be committed in the next days.

The commit is not targeted for MFC.
2013-02-28 10:46:54 +00:00
Konstantin Belousov
20f4e3e158 Make recursive getblk() slightly more useful. Keep the buffer state
intact if getblk() is done on the already owned buffer.  Exit from
brelse() early when the lock recursion is detected, otherwise brelse()
might prematurely destroy the buffer under some circumstances.

Sponsored by:	The FreeBSD Foundation
Noted by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2013-02-27 07:34:09 +00:00
Alexander Motin
1af19ee4a2 Add support for good old 8192Hz profiling clock to software PMC.
Reviewed by:	fabient
2013-02-26 18:13:42 +00:00
Attilio Rao
590f9303e5 Merge from vmobj-rwlock branch:
Remove unused inclusion of vm/vm_pager.h and vm/vnode_pager.h.

Sponsored by:	EMC / Isilon storage division
Tested by:	pho
Reviewed by:	alc
2013-02-26 01:00:11 +00:00
Pawel Jakub Dawidek
1d59211b2e Style.
Suggested by:	kib
2013-02-25 20:51:29 +00:00
Pawel Jakub Dawidek
893365e42d After r237012, the fdgrowtable() doesn't drop the filedesc lock anymore,
so update a stale comment.

Reviewed by:	kib, keramida
2013-02-25 20:50:08 +00:00
John Baldwin
593efaf9f7 Further refine the handling of stop signals in the NFS client. The
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep().  In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing.  Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag.  Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed.  For now, only the NFS clients set this new
flag in VFS_SET().

A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
  the request as being on behalf of the thread performing the lookup
  (cnp_thread) rather than using a NULL thread pointer.  This causes
  NFS to properly handle signals during this VOP on an interruptible
  mount.

PR:		kern/176179
Reported by:	Russell Cattelan (sigdeferstop() recursion)
Reviewed by:	kib
MFC after:	1 month
2013-02-21 19:02:50 +00:00
Jamie Gritton
ffc72591b1 Don't worry if a module is already loaded when looking for an fstype to mount
(possible in a race condition).

Reviewed by:	kib
MFC after:	1 week
2013-02-21 02:41:37 +00:00
John Baldwin
353374b525 Fix a few typos. 2013-02-19 16:35:27 +00:00
Pawel Jakub Dawidek
b2e054b0d4 Update the comment: we do show the backtrace of misbehaving thread. 2013-02-17 21:37:32 +00:00
Pawel Jakub Dawidek
f0ad2ecb9c Style. 2013-02-17 11:56:36 +00:00
Pawel Jakub Dawidek
8e1d51ab40 - Require CAP_FSYNC capability right when opening a file with O_SYNC or O_FSYNC
flags.
- While here simplify check for locking flags.

Sponsored by:	The FreeBSD Foundation
2013-02-17 11:53:51 +00:00
Pawel Jakub Dawidek
11b0cfe3cd Remove redundant parenthesis. 2013-02-17 11:49:21 +00:00
Pawel Jakub Dawidek
49549b1894 Remove redundant space. 2013-02-17 11:48:16 +00:00
Pawel Jakub Dawidek
6c08be2b88 Add break to the default case. 2013-02-17 11:47:58 +00:00
Pawel Jakub Dawidek
4881a5950e Don't treat pointers as booleans. 2013-02-17 11:47:30 +00:00
Pawel Jakub Dawidek
de26549841 Remove redundant parenthesis. 2013-02-17 11:47:01 +00:00
Kirk McKusick
2bc1a1fe5c Add barrier write capability to the VFS buffer interface. A barrier
write is a disk write request that tells the disk that the buffer
being written must be committed to the media along with any writes
that preceded it before any future blocks may be written to the drive.

Barrier writes are provided by adding the functions bbarrierwrite
(bwrite with barrier) and babarrierwrite (bawrite with barrier).

Following a bbarrierwrite the client knows that the requested buffer
is on the media. It does not ensure that buffers written before that
buffer are on the media. It only ensure that buffers written before
that buffer will get to the media before any buffers written after
that buffer. A flush command must be sent to the disk to ensure that
all earlier written buffers are on the media.

Reviewed by: kib
Tested by:   Peter Holm
2013-02-16 14:51:30 +00:00
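A rough sketch of how a filesystem might use the new calls, assuming bbarrierwrite() and babarrierwrite() take a struct buf pointer like bwrite()/bawrite(); the function and buffer names here are hypothetical.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>

/*
 * Write a journal block with a barrier, then the commit block that
 * references it.  Because commit_bp is issued after the barrier'd write,
 * it cannot reach the media first; per the commit message, an explicit
 * disk flush is still required to know earlier writes are on the media.
 */
static int
journal_then_commit(struct buf *journal_bp, struct buf *commit_bp)
{
	int error;

	error = bbarrierwrite(journal_bp);	/* bwrite() with barrier */
	if (error != 0)
		return (error);
	bawrite(commit_bp);			/* ordered after the barrier */
	return (0);
}
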
Ian Lepore
a1137de941 Add PPS_CANWAIT support for time_pps_fetch(). This adds support for all three
blocking modes described in section 3.4.3 of RFC 2783, allowing the caller
to retrieve the most recent values without blocking, to block for a specified
time, or to block forever.

Reviewed by:	discussion on hackers@
2013-02-15 18:30:32 +00:00
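A userland sketch of the timed-blocking mode; the device path is hypothetical, time_pps_setparams() capture setup is omitted, and per the commit a zero timeout returns the most recent values without blocking while a NULL timeout blocks forever.

#include <sys/time.h>
#include <sys/timepps.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>

int
main(void)
{
	pps_handle_t handle;
	pps_info_t info;
	struct timespec timeout = { 2, 0 };	/* block for at most two seconds */
	int fd;

	fd = open("/dev/pps0", O_RDWR);
	if (fd == -1 || time_pps_create(fd, &handle) == -1)
		err(1, "pps setup");
	if (time_pps_fetch(handle, PPS_TSFMT_TSPEC, &info, &timeout) == -1)
		err(1, "time_pps_fetch");
	printf("assert sequence: %lu\n", (unsigned long)info.assert_sequence);
	return (0);
}
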
Sergey Kandaurov
d7ffa24831 vn_io_faults_cnt:
- use u_long consistently
- use SYSCTL_ULONG to match the type of variable

Reviewed by:	kib
MFC after:	1 week
2013-02-15 14:22:05 +00:00
Sergey Kandaurov
ab15d8039e Add support of passing SCM_BINTIME ancillary data object for PF_LOCAL
sockets.

PR:		kern/175883
Submitted by:	Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Discussed with:	glebius, phk
MFC after:	2 weeks
2013-02-15 13:00:20 +00:00
Ian Lepore
74938cbb7f Make the F_READAHEAD option to fcntl(2) work as documented: a value of zero
now disables read-ahead.  It used to effectively restore the system default
readahead heuristic if it had been changed; a negative value now restores
the default.

Reviewed by:	kib
2013-02-13 15:09:16 +00:00
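A small sketch of the revised semantics, assuming the F_READAHEAD argument is expressed in bytes and applies to an already open descriptor:

#include <fcntl.h>

static void
tune_readahead(int fd)
{
	(void)fcntl(fd, F_READAHEAD, 0);		/* 0 now disables read-ahead */
	(void)fcntl(fd, F_READAHEAD, -1);		/* a negative value restores the default heuristic */
	(void)fcntl(fd, F_READAHEAD, 128 * 1024);	/* explicit read-ahead amount */
}
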
Konstantin Belousov
dd0b4fb6d5 Reform the busdma API so that new types may be added without modifying
every architecture's busdma_machdep.c.  It is done by unifying the
bus_dmamap_load_buffer() routines so that they may be called from MI
code.  The MD busdma is then given a chance to do any final processing
in the complete() callback.

The cam changes unify the bus_dmamap_load* handling in cam drivers.

The arm and mips implementations are updated to track virtual
addresses for sync().  Previously this was done in a type specific
way.  Now it is done in a generic way by recording the list of
virtuals in the map.

Submitted by:	jeff (sponsored by EMC/Isilon)
Reviewed by:	kan (previous version), scottl,
	mjacob (isp(4), no objections for target mode changes)
Discussed with:	     ian (arm changes)
Tested by:	marius (sparc64), mips (jmallet), isci(4) on x86 (jharris),
	amd64 (Fabian Keil <freebsd-listen@fabiankeil.de>)
2013-02-12 16:57:20 +00:00
Marius Strobl
18716f9f4b Update comments to reflect r246689. 2013-02-11 23:05:10 +00:00
Marius Strobl
bdc5f0172e Make SYSCTL_{LONG,QUAD,ULONG,UQUAD}(9) work as advertised and also handle
constant values.

Reviewed by:	kib
MFC after:	3 days
2013-02-11 21:50:00 +00:00
Konstantin Belousov
2871baa49a Remove the ia64-specific code fragment, whose effect is more cleanly
achieved by the call to the trans_prot() function a line before.

Discussed with:	Oliver Pinter <oliver.pntr@gmail.com>
MFC after:	1 week
2013-02-10 20:08:33 +00:00
Andriy Gapon
c43b08dc6c ktr: correctly handle possible wrap-around in the boot buffer
Older entries should be 'before' newer entries in the new buffer too
and there should be no zero-filled gap between them.

Pointed out by:	jhb
MFC after:	3 days
X-MFC with:	r246282
2013-02-08 07:29:07 +00:00
Konstantin Belousov
888d4d4f86 When vforked child is traced, the debugging events are not generated
until child performs exec().  The behaviour is reasonable when a
debugger is the real parent, because the parent is stopped until
exec(), and sending a debugging event to the debugger would deadlock
both parent and child.

On the other hand, when debugger is not the parent of the vforked
child, not sending debugging signals makes it impossible to debug
across vfork.

Fix the issue by declining generating debug signals only when vfork()
was done and child called ptrace(PT_TRACEME).  Set a new process flag
P_PPTRACE from the attach code for PT_TRACEME, if P_PPWAIT flag is
set, which indicates that the process was created with vfork() and
has not yet execed. Check P_PPTRACE from issignal(), instead of
refusing the trace outright for the P_PPWAIT case.  The scope of
P_PPTRACE is exactly contained in the scope of P_PPWAIT.

Found and tested by:  zont
Reviewed by:	pluknet
MFC after:	2 weeks
2013-02-07 15:34:22 +00:00
Konstantin Belousov
2ca4998342 Stop translating the ERESTART error from open(2) into EINTR.
POSIX requires that open(2) be restartable for SA_RESTART.

For non-POSIX objects, in particular devfs nodes, still disable
automatic restart of the opens. The open call to a driver could have
significant side effects for the hardware.

Noted and reviewed by:	jilles
Discussed with:	bde
MFC after:	2 weeks
2013-02-07 14:53:33 +00:00
Neel Natu
dae3dc73f6 If an interrupt event's assign_cpu method fails, then restore the original
cpuset mask for the associated interrupt thread.

The text used above is verbatim from r195249 and the code should now be
in line with the intent of that commit.
2013-02-07 06:48:47 +00:00
Pawel Jakub Dawidek
fbda3d5dae Audit sockaddr argument for bind(2), connect(2), accept(2), sendto(2) and
recvfrom(2) syscalls.

Sponsored by:	The FreeBSD Foundation
2013-02-07 00:36:00 +00:00
Pawel Jakub Dawidek
82b316b377 Minor style tweaks. 2013-02-07 00:27:11 +00:00
John Baldwin
a120a7a3cd Rework the handling of stop signals in the NFS client. The changes in
195702, 195703, and 195821 prevented a thread from suspending while holding
locks inside of NFS by forcing the thread to fail sleeps with EINTR or
ERESTART but defer the thread suspension to the user boundary.  However,
this had the effect that stopping a process during an NFS request could
abort the request and trigger EINTR errors that were visible to userland
processes (previously the thread would have suspended and completed the
request once it was resumed).

This change instead effectively masks stop signals while in the NFS client.
It uses the existing TDF_SBDRY flag to effect this since SIGSTOP cannot
be masked directly.  Also, instead of setting PBDRY on individual sleeps,
the NFS client now sets the TDF_SBDRY flag around each NFS request and
stop signals are masked for all sleeps during that region (the previous
change missed sleeps in lockmgr locks).  The end result is that stop
signals sent to threads performing an NFS request are completely
ignored until after the NFS request has finished processing and the
thread prepares to return to userland.  This restores the behavior of
stop signals being transparent to userland processes while still
preventing threads from suspending while holding NFS locks.

Reviewed by:	kib
MFC after:	1 month
2013-02-06 17:06:51 +00:00
Sergey Kandaurov
23c053d6a2 Prezero the acl structure which is to be copied to usermode, to avoid
leakage of the previous content of padding and uninitialized fields.

Reported by:	Ilia Noskov <noskov@nic.ru>
Reviewed by:	kib
MFC after:	1 week
2013-02-06 15:18:46 +00:00
Sergey Kandaurov
51dc4fea4c Remove the reference to the rlist code from comments, and fix a typo visible
in the resulting change.

Reviewed by:	kib
MFC after:	1 week
2013-02-05 20:08:33 +00:00
Andriy Gapon
c8199bc955 ktr: prevent possible footshooting with KTR_ENTRIES and KTR_BOOT_ENTRIES
Suggested by:	adrian
MFC after:	14 days
X-MFC with:	r246282
2013-02-04 21:58:57 +00:00
Andriy Gapon
f85ed12497 ktr: copy content from the early static buffer if KTR_ENTRIES !=
KTR_BOOT_ENTRIES

Reported by:	glebius, jhb
Pointyhat to:	avg
MFC after:	14 days
X-MFC with:	r246282
2013-02-04 21:50:55 +00:00
Marius Strobl
94bfd5b1a0 Try to improve r242655 take III: move these SYSCTLs describing the kernel
map, which is defined and initialized in vm/vm_kern.c, to the latter.

Submitted by:	alc
2013-02-04 09:35:48 +00:00
Marius Strobl
e8cbe54bc4 Further improve r242655 and supply VM_{MIN,MAX}_KERNEL_ADDRESS as constant
values to SYSCTL_ULONG(9) where possible.

Submitted by:	bde
2013-02-03 21:43:55 +00:00
Andriy Gapon
36b7dde416 allow for large KTR_ENTRIES values by allocating ktr_buf using malloc(9)
Only during very early boot, before malloc(9) is functional (SI_SUB_KMEM),
is the static ktr_buf_init used.  The size of the static buffer is determined
by a new kernel option, KTR_BOOT_ENTRIES.  Its default value is 1024.

This commit builds on top of r243046.

Reviewed by:	alc
MFC after:	17 days
2013-02-03 09:57:39 +00:00
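A kernel configuration sketch combining the new option with the pre-existing KTR options; 262144 is an arbitrary power-of-two example value and KTR_BOOT_ENTRIES is left at its documented default.

options 	KTR
options 	KTR_ENTRIES=262144
options 	KTR_BOOT_ENTRIES=1024
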
Andriy Gapon
8eede5c4d9 fix some fat-fingering in r246246
Submitted by:	mjg
Pointyhat to:	avg
MFC after:	5 days
X-MFC with:	r246246
2013-02-02 14:19:50 +00:00
Andriy Gapon
bfdcb3bcba print compiler version in the kernel banner
And provide the kernel compiler version as a sysctl as well.
This is useful while we have gcc and clang coexisting.
It could be even more useful once we have support
for external toolchains.

In cooperation with:	mjg
MFC after:		13 days
2013-02-02 11:58:35 +00:00
Grzegorz Bernacki
2d7d16429c Get time of next event from other cores only if SMP is already started.
Reviewed by: mav
Obtained from: Semihalf
2013-02-01 11:39:03 +00:00
Pawel Jakub Dawidek
4bbe7b0c20 Now that MPSAFE flag is gone, we can arrange code a bit better. 2013-01-31 22:20:05 +00:00
Pawel Jakub Dawidek
b108953c6f Remove leftover label after Giant removal from VFS. 2013-01-31 22:15:41 +00:00
Pawel Jakub Dawidek
a2c496ebb9 Remove label that was accidentally moved during Giant removal from VFS. 2013-01-31 22:14:16 +00:00
Pawel Jakub Dawidek
9e2677fd6d Simplify code a bit. This is leftover after Giant removal from VFS. 2013-01-31 22:12:48 +00:00
Konstantin Belousov
538375d42e The case of pid == WAIT_MYPGRP for kern_wait() is already handled
in kern_wait6(), which is called by kern_wait().  Remove the redundant
check, introduced in r243136, and add a comment noting this, to make
the code less confusing.

The blank lines are added to properly delineate the scope of the
preceding comments.

Noted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 week
2013-01-30 13:14:15 +00:00
John Baldwin
a8df530ddc Mark 'ticks', 'time_second', and 'time_uptime' as volatile to prevent the
compiler from caching their values in tight loops.

Reviewed by:	bde
MFC after:	1 week
2013-01-28 19:38:13 +00:00
Gleb Smirnoff
29110f87a6 - Move large functions m_getjcl() and m_get2() to kern/uipc_mbuf.c
- style(9) fixes to mbuf.h

Reviewed by:	bde
2013-01-24 09:29:41 +00:00
John Baldwin
75d774e36a Fix a typo. 2013-01-23 14:37:05 +00:00
Andre Oppermann
371407162b Move the mbuf memory limit calculations from init_param2() to
tunable_mbinit() where it is next to where it is used later.

Change the sysinit level of tunable_mbinit() from SI_SUB_TUNABLES
to SI_SUB_KMEM after the VM is running.  This allows the use of better
methods to determine the physical and virtual memory effectively
available to the kernel.

Update comments.

In a second step it can be merged into mbuf_init().
2013-01-17 21:28:31 +00:00
Alfred Perlstein
17ebe960a6 Do not autotune ncallout to be greater than 18508.
When maxusers was unrestricted and maxfiles was allowed to autotune
much higher the result was that ncallout which was based on maxfiles
and maxproc grew much higher than was needed.

To fix this clip autotuning to the same number we would get with
the old maxusers algorithm which would stop scaling at 384
maxusers.

Growing ncallout higher is not likely to be needed since most consumers
of timeout(9) are gone and any higher value for ncallout causes the
callwheel hashes to be much larger than will ever be needed for
most applications.

MFC after:      1 month

Reviewed by:	mav
2013-01-15 19:26:17 +00:00
Andrey Zonov
0165f660c1 - Detect when we are in KVM.
Silence on:	emulation
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-15 14:05:59 +00:00
Konstantin Belousov
10b4bb0b33 Add a trivial comment to record the proper commit log for r245407:
Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address.  Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed).  For other filesystems, the
initialization allows the vfs_hash_index() to provide useful hash too.

Suggested, reviewed and tested by:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:52:23 +00:00
Konstantin Belousov
a41df84820 diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 7c243b6..0bdaf36 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
 #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)

+static int vnsz2log;

 /*
  * Initialize the vnode management data structures.
@@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 static void
 vntblinit(void *dummy __unused)
 {
+	u_int i;
 	int physvnodes, virtvnodes;

 	/*
@@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
 	syncer_maxdelay = syncer_mask + 1;
 	mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
 	cv_init(&sync_wakeup, "syncer");
+	for (i = 1; i <= sizeof(struct vnode); i <<= 1)
+		vnsz2log++;
+	vnsz2log--;
 }
 SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL);

@@ -1067,6 +1072,14 @@ alloc:
 	}
 	rangelock_init(&vp->v_rl);

+	/*
+	 * For the filesystems which do not use vfs_hash_insert(),
+	 * still initialize v_hash to have vfs_hash_index() useful.
+	 * E.g., nullfs uses vfs_hash_index() on the lower vnode for
+	 * its own hashing.
+	 */
+	vp->v_hash = (uintptr_t)vp >> vnsz2log;
+
 	*vpp = vp;
 	return (0);
 }
2013-01-14 05:42:54 +00:00
Konstantin Belousov
f6af8e375c Add exported vfs_hash_index() function, which calculates the canonical
pre-masked hash for the given vnode.  The function assumes that
vp->v_hash is initialized by the filesystem vnode instantiation
function.  At the moment, it is only done if filesystem uses
vfs_hash_insert().

Reviewed by:	peter
Tested by:	peter, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:41:40 +00:00
Konstantin Belousov
7b982bc831 Rename vfs_hash_index() to vfs_hash_bucket().
Reviewed by:	peter
Tested by:	peter, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:40:21 +00:00
Konstantin Belousov
ddd6b3fc33 Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by:	The FreeBSD Foundation
2013-01-11 06:08:32 +00:00
Mateusz Guzik
43287e2753 lockmgr: unlock interlock (if requested) when dealing with upgrade/downgrade
requests for LK_NOSHARE locks, just like for shared locks.

PR:		kern/174969
Reviewed by:	attilio
MFC after:	1 week
2013-01-06 21:47:59 +00:00
Konstantin Belousov
a545089ed5 Protect the p->p_pgrp dereference with the process lock.
MFC after:	3 days
2013-01-06 15:10:10 +00:00
Neel Natu
a09580e7a3 Teach the kernel to recognize that it is executing inside a bhyve virtual
machine.

Obtained from:	NetApp
2013-01-05 19:18:50 +00:00
Benjamin Kaduk
5e9723e271 Fix some minor inaccuracies introduced in r243251.
Also correct the comment in kern_synch.c which was the source of the
problematic text.

Reviewed by:	kib (previous version)
Approved by:	hrs (mentor)
2013-01-05 00:23:26 +00:00
David Xu
eea8d86d4d Revert revision 244760 because strncpy pads the trailing space with zeros,
which prevents kernel data from being leaked.

Noticed by: Joerg Sonnenberger <joerg at britannica dot bec dot de>
2013-01-04 11:11:12 +00:00
Konstantin Belousov
d1c5e3f8b0 Remove the deprecated MNT_VNODE_FOREACH interface. Use the
MNT_VNODE_FOREACH_ALL instead.
2013-01-03 19:02:52 +00:00
Konstantin Belousov
f99cb34c4f The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that it must not be called while the snaplock is
owned.  The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-01-01 16:14:48 +00:00
Konstantin Belousov
91e9474552 Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount.  The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock.  If the suspension intervened between the resume and
vn_start_write(), the deadlock occurred after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Reviewed by:	mckusick
MFC after:	2 weeks
X-MFC-note:	make the vfs_write_resume(9) function a macro after the MFC,
	in HEAD
2012-12-28 23:08:30 +00:00
Oleksandr Tymoshenko
7fc3ae51f3 Fix build on ARM (and probably other platforms) 2012-12-28 06:52:53 +00:00
David Xu
9d4bf0db7c Use strlcpy to NUL-terminate the error message even if the user provided a
short buffer.
2012-12-28 02:43:33 +00:00
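The trade-off behind this change and the later "Revert revision 244760" entry above can be sketched as follows; the helper names are hypothetical.

#include <string.h>

/* Always NUL-terminated, but bytes past the string keep their old contents. */
static void
copy_errmsg_terminated(char *dst, size_t dstlen, const char *src)
{
	strlcpy(dst, src, dstlen);
}

/* Zero-fills the rest of dst (no stale kernel data), but may not terminate. */
static void
copy_errmsg_padded(char *dst, size_t dstlen, const char *src)
{
	strncpy(dst, src, dstlen);
	dst[dstlen - 1] = '\0';		/* force termination explicitly */
}
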
Attilio Rao
c92c859b7b Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1
case. There is no point in further optimizing the code and using a TRUE
literal for a path that does heavyweight stuff anyway (like lock acquisition),
at the price of obfuscated code.

Use the appropriate check where necessary and remove a macro.

Sponsored by:	EMC / Isilon storage division
MFC after:	3 days
2012-12-26 15:20:32 +00:00
Konstantin Belousov
ad9789f6db Do not force a writer to the devfs file to drain the buffer writes.
Requested and tested by:	Ian Lepore <freebsd@damnhippie.dyndns.org>
MFC after:	2 weeks
2012-12-23 22:43:27 +00:00
Jaakko Heinonen
b1e1f725e7 Reject spaces and double quotation marks in device names. devctl(4)
and devd(8) can't handle names with such characters properly.

PR:		bin/144736, kern/161912
Discussed with:	imp, kib, pjd
2012-12-22 13:33:28 +00:00
Attilio Rao
cd2fe4e632 Fixup r240424: On entering KDB backends, the thread hijacked to run
interrupt context can still be the idlethread. At that point, without the
panic condition, it can still happen that the idlethread will then try to
acquire some locks to carry on some operations.

Skip the idlethread check on block/sleep lock operations when KDB is
active.

Reported by:	jh
Tested by:	jh
MFC after:	1 week
2012-12-22 09:37:34 +00:00
Attilio Rao
b1308d72c2 Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There is no evidence that consumers could bear such a change in
semantics, and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify the userland priority for the kthreads involved.

Tested by:	pho
Reviewed by:	kib, mdf
MFC after:	1 week
2012-12-21 13:14:12 +00:00
Dag-Erling Smørgrav
b5471c918f Rewrite fdgrowtable() so common mortals can actually understand what
it does and how, and add comments describing the data structures and
explaining how they are managed.
2012-12-20 20:18:27 +00:00
Olivier Houchard
05d9035003 Create an architecture-agnostic buffer pool manager that uses uma(9) to
manage a set of power-of-2 sized buffers for bus_dmamem_alloc().

This allows the caller to provide the back-end uma allocator,
allowing full control of the memory pages backing the pool.  For
convenience, it provides an optional builtin allocator that provides pages
allocated with the VM_MEMATTR_UNCACHEABLE attribute, for managing pools of
DMA buffers for BUS_DMA_COHERENT or BUS_DMA_NOCACHE.

This also allows the caller to specify a minimum alignment, and it ensures
that all buffers start on a boundary and have a length that's a multiple of
that value, to avoid using buffers that trigger partial cache line flushes.

Submitted by:	Ian Lepore <freebsd@damnhippie.dyndns.org>
2012-12-20 00:34:54 +00:00
Pawel Jakub Dawidek
c345faea5a Replace the expand_name() function with a corefile_open() function, which not
only returns the name, but also the vnode of the corefile to use.

This simplifies the code and closes a few races, especially in %I handling.

Reviewed by:	kib
Obtained from:	WHEEL Systems
2012-12-19 23:59:48 +00:00
Pawel Jakub Dawidek
22a5d85aa9 Use correct file permissions when looking for available core file if
kern.corefile contains %I.

Obtained from:	WHEEL Systems
2012-12-19 23:40:02 +00:00
Jeff Roberson
4c44811c9d - Add new machine parsable KTR macros for timing events.
- Use this new format to automatically handle syscalls and VOPs.  This
   changes the earlier format but is still human readable.

Sponsored by:	EMC / Isilon Storage Division
2012-12-19 20:10:00 +00:00
Jeff Roberson
5b39d5c739 - Correctly handle EWOULDBLOCK in quiesce_cpus
Discussed with:	mav
2012-12-19 20:08:06 +00:00
Pawel Jakub Dawidek
07a8e07896 The 'flags' argument can be modified in vn_open_cred(), so we need to
set it for every loop iteration.

Pointed out by:	kib
2012-12-19 12:14:08 +00:00
Pawel Jakub Dawidek
cc58032c44 Do not audit paths we try when kern.corefile contains %I.
Obtained from:	WHEEL Systems
2012-12-19 12:12:53 +00:00
Pawel Jakub Dawidek
29146f1a7a Style cleanups. 2012-12-19 12:10:14 +00:00
Pawel Jakub Dawidek
086053a370 The expand_name() function isn't called with the process lock held anymore,
so we can safely use malloc(M_WAITOK) now.

Pointed out by:	kib
2012-12-19 12:00:09 +00:00
Mateusz Guzik
af3c786c47 prison_racct_detach() can be called for a jail that is not fully initialized, so make it check that the jail has racct before doing anything
PR:		kern/174436
Reviewed by:	trasz
MFC after:	3 days
2012-12-18 18:34:36 +00:00
Andrey Zonov
5eb0d2838c - Add sysctl to allow unprivileged users to call mlock(2)-family system
calls and turn it on.
- Do not allow them to be called inside a jail. [1]

Pointed out by:	trasz [1]
Reviewed by:	avg
Approved by:	kib (mentor)
MFC after:	1 week
2012-12-18 07:36:45 +00:00
Pawel Jakub Dawidek
f06f465db7 Minor style tweaks.
Obtained from:	WHEEL Systems
2012-12-17 10:51:22 +00:00
Pawel Jakub Dawidek
c52ff61196 Better variable naming in expand_name() to be more consistent with coredump().
Obtained from:	WHEEL Systems
2012-12-17 10:48:10 +00:00
Pawel Jakub Dawidek
dd57ce87eb Move expand_name() to after the process lock is released.
This fixes a panic where we held a mutex (the process lock) and tried to obtain
a sleepable lock (the vnode lock in expand_name()). The panic could occur when
%I was used in kern.corefile.

Additionally, we avoid the expand_name() overhead when coredumps are disabled.

Obtained from:	WHEEL Systems
2012-12-16 14:53:27 +00:00
Pawel Jakub Dawidek
2ce1b32df2 Don't add audit record when coredumps are disabled or name cannot be expanded.
Discussed with:	rwatson
Obtained from:	WHEEL Systems
2012-12-16 14:24:59 +00:00
Pawel Jakub Dawidek
7e73ee85ab Make the check easier to read.
Obtained from:	WHEEL Systems
2012-12-16 14:14:18 +00:00
Pawel Jakub Dawidek
b039f8c2aa Use 'cred' variable.
Obtained from:	WHEEL Systems
2012-12-16 13:56:38 +00:00
Konstantin Belousov
14df601e47 When mnt_vnode_next_active iterator cannot lock the next vnode and
yields, specify the user priority for the yield.  Otherwise, a
higher-priority (kernel) thread could fall into the priority-inversion
with the thread owning the mutex lock.

On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked, instead yield unconditionally.

Restructure the iteration initializer and the iterator to remove code
duplication.  Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.

Reported by:	hrs, rmacklem
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2012-12-15 02:04:46 +00:00
Konstantin Belousov
d4015944e7 Remove a special case for XEN, which is erroneous and makes vfork(2)
behaviour differ from the documented one, only on XEN.  If there are
any issues with XEN pmap left, they should be fixed in pmap.

MFC after:	2 weeks
2012-12-15 02:02:11 +00:00
Rick Macklem
f1c4014cd5 The group list for a non-default export entry (a host/subnet one)
was being copied from the wrong place. This patch fixes that.
This could cause access failures for mapped users, when the group
permissions were needed.

PR:		147998
Submitted by:	Christopher Key (cjk32 at cam.ac.uk)
MFC after:	2 weeks
2012-12-14 21:49:06 +00:00
Alfred Perlstein
15d32bd543 Clean up more of kassert_panic().
Fix compile warnings on !amd64 and NULL derefs that would happen
if kassert_panic() were to return.
2012-12-11 07:08:14 +00:00
Alfred Perlstein
c2c5ede903 Fix WITNESS when INVARIANT_SUPPORT is defined.
This fixes tinderbox breakage from r244105.

Pointed out by: adrian
2012-12-11 05:59:16 +00:00
Alfred Perlstein
6b6bd3b704 Switch the hardwired WITNESS panics to kassert_panic.
This is an ongoing effort to provide runtime debug information
useful in the field that does not panic existing installations.

This gives us the flexibility needed when shipping images to a
potentially large audience with WITNESS enabled without worrying
about formerly non-fatal LORs hurting a release.

Sponsored by: iXsystems
2012-12-11 01:23:50 +00:00
Alfred Perlstein
d3bfafb4f6 back out half of 244098.
kern.bootfile needs to be rw for installkernel.

Pointed out by: kib, flo
2012-12-11 00:10:20 +00:00
Alfred Perlstein
a94053ba39 allow KASSERT to enter KDB. 2012-12-10 23:11:26 +00:00
Alfred Perlstein
d06cadae1e make sysctls kern.{bootfile,conftxt} read-only
MFC after:	1 month
2012-12-10 23:09:55 +00:00
Konstantin Belousov
686ffcaceb Do not yield while owning a mutex. The Giant reacquire in
kern_yield() is problematic then.

The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.

While there, augment the unconditional yield by some amount of
spinning [1].

Reported and tested by:	pho
Reviewed by:	attilio
Submitted by:	attilio [1]
MFC after:	3 days
2012-12-10 20:44:09 +00:00
Andre Oppermann
0060bab556 Prevent long type overflow of realmem calculation on ILP32 by forcing
calculation to be in quad_t space.  Fix style issue with second parameter
to qmin().

Reported by:	alc
Reviewed by:	bde, alc
2012-12-10 12:19:03 +00:00
Konstantin Belousov
5d439a2957 Do not ignore zero address, possibly returned by the vm_map_find()
call.  The function indicates a failure by the TRUE return value.  To
be extra safe, assert that the return value from the following
vm_map_insert() indicates success.

Fix style issues in the nearby lines, reformulate the comment.

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2012-12-10 05:14:04 +00:00
Konstantin Belousov
17cb8cfc31 Remove useless comment.
MFC after:	3 days
2012-12-09 20:34:11 +00:00
Konstantin Belousov
796fa4fb86 Fix typo.
MFC after:	3 days
2012-12-09 20:26:51 +00:00
Attilio Rao
e68ccbe85e Add a comment on why inlining critical_enter() may not be a good idea
for the general case.

Reviewed by:	bde
MFC after:	1 week
2012-12-09 04:54:22 +00:00
Pawel Jakub Dawidek
6e0b674628 Configure UMA warnings for the following zones:
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached

Note that those warnings are printed no more often than every five minutes and
can be globally turned off by setting the sysctl/tunable vm.zone_warnings to 0.

Discussed on:	arch
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-12-07 22:30:30 +00:00
Pawel Jakub Dawidek
45fe0bf7e4 Make use of the fact that uma_zone_set_max(9) already returns the actual limit set. 2012-12-07 22:23:53 +00:00
Pawel Jakub Dawidek
4007b61cde More style cleanups. 2012-12-07 22:22:04 +00:00
Pawel Jakub Dawidek
b0b1402537 Style cleanups. 2012-12-07 22:19:41 +00:00
Pawel Jakub Dawidek
94b0ae5d62 - Make socket_zone static - it is used only in this file.
- Update maxsockets on uma_zone_set_max().

Obtained from:	WHEEL Systems
2012-12-07 22:15:51 +00:00
Pawel Jakub Dawidek
68412f4179 Style cleanups. 2012-12-07 22:13:33 +00:00
Pawel Jakub Dawidek
0b746181a2 There is no need anymore to include vm/uma.h after r241726.
Obtained from:	WHEEL Systems
2012-12-07 22:05:42 +00:00
Alfred Perlstein
3945a96431 Allow KASSERT to log instead of panic.
This is to allow debug images to be used without taking down the
system when non-fatal asserts are hit.

The following sysctls are added:

debug.kassert.warn_only: 1 = log, 0 = panic

debug.kassert.do_ktr: set to a ktr mask for logging via KTR

debug.kassert.do_log: 1 = log, 0 = quiet

debug.kassert.warnings: stats, number of kasserts hit

debug.kassert.log_panic_at:
  number of kasserts before we actually panic, 0 = never

debug.kassert.log_pps_limit: pps limit for log messages

debug.kassert.log_mute_at: stop warning after N kasserts, 0 = never stop

debug.kassert.kassert: set this sysctl to trigger a kassert

Discussed with: scottl, gnn, marcel
Sponsored by: iXsystems
2012-12-07 08:25:08 +00:00
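For illustration, a debug image could be switched to log-only assertions at runtime with the sysctls listed above, for example:

# Log KASSERT failures instead of panicking, rate-limited to 4 per second.
sysctl debug.kassert.warn_only=1
sysctl debug.kassert.do_log=1
sysctl debug.kassert.log_pps_limit=4
# Still panic if 100 assertion failures accumulate.
sysctl debug.kassert.log_panic_at=100
# Inspect how many assertions have fired so far.
sysctl debug.kassert.warnings
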
Alfred Perlstein
3356d129ad Use uint instead of int for flags exported via sysctl. 2012-12-07 05:55:48 +00:00
Kevin Lo
b08d12d9be - according to POSIX, make socket(2) return EAFNOSUPPORT rather than
EPROTONOSUPPORT if the address family is not supported.
- introduce pffinddomain() to find a domain by family and use it as
  appropriate.

Reviewed by:	glebius
2012-12-07 02:22:48 +00:00
David Xu
3f6bad0181 Eliminate superfluous code. 2012-12-06 06:29:08 +00:00
Attilio Rao
bdf9120c16 Fixup r243901:
- As the comment reports, CALLOUT_LOCAL_ALLOC cannot be checked
  directly from the callout flags but might be checked via a cached
  value.  Hence, do so before actually removing the callout, when
  needed, in softclock_call_cc().
- In softclock_call_cc() also add a comment in the waiting and deferred
  migration case explaining that the dereference should be safe
  because of the migration dereference invariants.

Additively:
- In softclock_call_cc(), for the deferred migration case, move all the
  accesses to callout structure after the comment stating the callout
  must not be destroyed.
- For consistency with this last tweak, use cached c_flags for the
  KASSERT() in the deferred migration case.  It is not strictly necessary
  but this way all the callout accesses happen after the above mentioned
  comment, improving consistency.

Pointy hat to:	me
Sponsored by:	Isilon Systems / EMC Corporation
Reviewed by:	kib
MFC after:	2 weeks
X-MFC:		243901
2012-12-05 22:32:12 +00:00
Konstantin Belousov
eb8a718686 The softclock_call_cc() is executing with the callout already removed
from the callwheel. Calculate the cc->cc_next before removing the
callout, otherwise the code followed invalid tailq links.  After
this, make softclock_call_cc() return void, since it always returns
cc->cc_next, which is immediately available to the softclock()
anyway. This also allows to eliminate a label under #ifdef SMP.

Remove the assignment of cc->cc_next from callout_cc_del(), since the
function is called with the callout already removed from callwheel.

If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag.

Postpone the free of the timeout(9) allocated callouts after the
migration checks are done.

Add some more strict asserts about the state of the callout in
callout_call_cc().

Reviewed by:	attilio
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
2012-12-05 19:02:22 +00:00
Attilio Rao
1c7d98d0df Check for lockmgr recursion in case of disown and downgrade and panic
also in !debugging kernel rather than having "undefined" behaviour.

Tested by:	avg
MFC after:	1 week
2012-12-05 15:11:01 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
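A sketch of what the mechanical substitution looks like at a call site, assuming the historic M_DONTWAIT spelling mapped to the malloc(9) flag M_NOWAIT:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/* Allocate a packet-header mbuf with a cluster, never sleeping. */
static struct mbuf *
alloc_pkt(void)
{
	/* Historic spelling: m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR). */
	return (m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR));
}
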
Konstantin Belousov
f7e50ea722 Fix a race between kern_setitimer() and realitexpire(), where the
callout is started before kern_setitimer() acquires the process mutex, but
loses the race and kern_setitimer() gets the process mutex before the
callout.  Then, assuming that new specified struct itimerval has
it_interval zero, but it_value non-zero, the callout, after it starts
executing again, clears p->p_realtimer.it_value, but kern_setitimer()
already rescheduled the callout.

As a result of the race, both p_realtimer is zero and the callout
is rescheduled. Then, in exit1(), the exit code sees that it_value
is zero and does not even try to stop the callout. This allows the
struct proc to be reused and eventually the armed callout is
re-initialized.  The consequence is the corrupted callwheel tailq.

Use process mutex to interlock the callout start, which fixes the race.

Reported and tested by:	pho
Reviewed by:	jhb
MFC after:	2 weeks
2012-12-04 20:49:39 +00:00
Konstantin Belousov
9bdf6ccab3 Do not allocate buffer of the 255 bytes length on the stack.
Reported and tested by:	sig6247@gmail.com
MFC after:	1 week
2012-12-04 20:49:04 +00:00
Alfred Perlstein
922314f018 replace bit shifting loop with 1<<fls(n), improve comments.
Reviewed by: davide
2012-12-04 05:28:20 +00:00
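A small userland sketch of the identity being relied on; the loop in the comment is a generic reconstruction of the pattern being replaced, not the exact code.

#include <strings.h>
#include <stdio.h>

/*
 * For n > 0, 1 << fls(n) is the smallest power of two strictly greater
 * than n, i.e. the same result as the shift loop:
 *
 *	for (size = 1; size <= n; size <<= 1)
 *		;
 */
int
main(void)
{
	int n = 1000;

	printf("%d -> %d\n", n, 1 << fls(n));	/* prints "1000 -> 1024" */
	return (0);
}
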
Konstantin Belousov
07840861b1 The vnode_free_list_mtx is required unconditionally when iterating
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active list iterator helper functions
did not provide the necessary stability for the list, allowing the
iterators to pick garbage.

This was uncovered after the r243599 made the active list iterators
non-nop.

Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.

Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.

Reported and tested by:	pho
MFC after:	1 week
2012-12-03 22:15:16 +00:00
Pawel Jakub Dawidek
8909f88d28 Fix one more compilation issue. 2012-12-01 08:59:36 +00:00
Pawel Jakub Dawidek
499f0f4d55 IFp4 @208451:
Fix path handling for *at() syscalls.

Before the change, the directory descriptor was totally ignored,
so the relative path argument was appended to the current working
directory path and not to the path provided by the descriptor; thus
wrong paths were stored in audit logs.

Now that we use the directory descriptor in vfs_lookup, move the
AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where
we hold the file descriptor table lock, so we are sure paths will
be resolved against the same directory in the audit record and
in the actual operation.

Sponsored by:	FreeBSD Foundation (auditdistd)
Reviewed by:	rwatson
MFC after:	2 weeks
2012-11-30 23:18:49 +00:00
Pawel Jakub Dawidek
e1216d1335 IFp4 @208450:
Remove redundant call to AUDIT_ARG_UPATH1().
Path will be remembered by the following NDINIT(AUDITVNODE1) call.

Sponsored by:	FreeBSD Foundation (auditdistd)
MFC after:	2 weeks
2012-11-30 22:49:28 +00:00
Andre Oppermann
df905a2bd3 A long is the wrong type to represent the realmem and maxmbufmem
variables, as they may overflow on i386/PAE and i386 with > 2GB RAM.

Use the 64-bit quad_t instead.  It has broader kernel infrastructure support
with TUNABLE_QUAD_FETCH() and qmin/qmax() than other available types.

Pointed out by:	alc, bde
2012-11-29 07:30:42 +00:00
Andre Oppermann
416a434cd0 Complete r243631 by applying the remainder of kern_mbuf.c that got
lost while merging into the commit tree.

MFC after:	1 month
X-MFC-with:	r243631
2012-11-27 23:16:56 +00:00
Andre Oppermann
358c7f47da Fix r243627 by testing against the head socket instead of the socket
just created.

MFC after:	1 week
X-MFC-with:	r243627
2012-11-27 22:35:48 +00:00
Andre Oppermann
ead46972a4 Base the mbuf related limits on the available physical memory or
kernel memory, whichever is lower.  The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.

The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline.  In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily.  Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.

At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers.  This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.

Tidy up ordering in init_param2() and check up on some users of
those values calculated here.

Out of the overall mbuf memory limit, 2K clusters and 4K (page size)
clusters get 1/4 each because these are the most heavily used mbuf
sizes.  2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets.  The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit.  When jumbo MTUs are used
these large clusters will end up only on the inbound path.  They are
not used on outbound, there it's still 4K.  Yes, that will stay that
way because otherwise we run into lots of complications in the
stack.  And it really isn't a problem, so don't make a scene.

Normal mbufs (256B) weren't limited at all previously.  This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.

The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster.  Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such limiting.

NB: Every cluster also has an mbuf associated with it.

Two examples on the revised mbuf sizing limits:

1GB KVM:
 512MB limit for mbufs
 419,430 mbufs
  65,536 2K mbuf clusters
  32,768 4K mbuf clusters
   9,709 9K mbuf clusters
   5,461 16K mbuf clusters

16GB RAM:
 8GB limit for mbufs
 33,554,432 mbufs
  1,048,576 2K mbuf clusters
    524,288 4K mbuf clusters
    155,344 9K mbuf clusters
     87,381 16K mbuf clusters

These defaults should be sufficient for even the most demanding
network loads.

MFC after:	1 month
2012-11-27 21:19:58 +00:00
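As a tuning sketch for the 16GB example above, the kern.maxmbufmem tunable named in this commit could be raised toward the 3/4 ceiling from loader.conf (the value is assumed to be in bytes):

# /boot/loader.conf
kern.maxmbufmem="12884901888"	# 12 GB, i.e. 3/4 of 16 GB RAM
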
Andre Oppermann
2c3142c82c Fix a race on listen socket teardown where, while draining the
accept queues, a new socket/connection may be added to the queue
due to a race on the ACCEPT_LOCK.

The submitted patch is slightly changed in comments, teardown
and locking order, and extended with KASSERTs.

Submitted by:	Vijay Singh <vijju.singh-at-gmail-dot-com>
Found by:	His team.
MFC after:	1 week
2012-11-27 20:04:52 +00:00
Pawel Jakub Dawidek
b0c9d4d70e Add kern.capmode_coredump sysctl/tunable to allow processes in capability mode
to dump core.

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:38:11 +00:00
Pawel Jakub Dawidek
f121e3e81d - Add NOCAPCHECK flag to namei that allows lookup to work even if the process
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() that will be converted into
  the NOCAPCHECK namei flag.

This functionality will be used to enable core dumps for sandboxed processes.

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:32:35 +00:00
Pawel Jakub Dawidek
90b2202145 Regenerate after r243610. 2012-11-27 10:25:03 +00:00
Pawel Jakub Dawidek
8890f5d020 Allow kill(2) to be used in capability mode, but a process can send a signal
only to itself. For example, abort(3) first tries kill(getpid(), SIGABRT),
which was failing in capability mode, so the code was falling back to exit(1).

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:22:40 +00:00
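A minimal sketch of the behaviour this enables, assuming cap_enter(2) and the <sys/capability.h> header of that era:

#include <sys/capability.h>
#include <signal.h>
#include <unistd.h>

int
main(void)
{
	if (cap_enter() == -1)
		return (1);
	/* Allowed after this change: the target is the caller itself. */
	kill(getpid(), SIGABRT);
	/* Signalling any other process would still be rejected in capability mode. */
	return (0);
}
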
Pawel Jakub Dawidek
b62d05fcf9 Allow kern.sugid_coredump and kern.corefile to be modified from loader.conf.
Obtained from:	WHEEL Systems
2012-11-27 10:16:48 +00:00
Pawel Jakub Dawidek
c320984687 More style fixes. 2012-11-27 10:15:58 +00:00
Pawel Jakub Dawidek
23c6445a4b Style fixes (mostly whitespaces). 2012-11-27 10:11:54 +00:00
David Xu
3da9ab75f4 Take first active vnode correctly.
Reviewed by:	kib
MFC after:	3 days
2012-11-27 06:07:58 +00:00
Pawel Jakub Dawidek
4f66641749 Look for a zombie process only if we were given a process id.
Reviewed by:	kib
MFC after:	2 weeks
X-MFC-after-or-with:	243142
2012-11-25 19:31:42 +00:00
Andriy Gapon
6898bee9a9 remove stop_scheduler_on_panic knob
There has not been any complaints about the default behavior, so there
is no need to keep a knob that enables the worse alternative.

Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
spinlock-like logic can be dropped, because only a single CPU is
supposed to win stop_cpus_hard(other_cpus) race and proceed past that
call.

MFC after:	1 month
2012-11-25 14:22:08 +00:00
Andriy Gapon
6b991098a7 assert_vop_locked: make the assertion race-free and more efficient
this is really a minor improvement for the sake of correctness

MFC after:	6 days
2012-11-24 13:11:47 +00:00
Andriy Gapon
4f15bb6730 remove vop_lookup_pre and vop_lookup_post
Suggested by:	kib
MFC after:	5 days
2012-11-22 10:36:10 +00:00
Konstantin Belousov
daee0f0b0b Schedule the garbage collection run for the in-flight rights passed over
unix domain sockets to the next tick, coalescing the serial calls
until the collection fires.  The thought is that more work for the
collector could arise in the near term, allowing it to clean more and not
spend too much CPU on repeated collection when there is no garbage.

Currently the collection task is fired immediately upon unix domain
socket close if there are any rights in flight, which caused excessive
CPU usage and too long blocking of the threads waiting for
unp_list_lock and unp_link_rwlock in write mode.

Robert noted that it would be nice if we could find some heuristic by
which we decide whether to run GC a bit more quickly.  E.g., if the
number of UNIX domain sockets is close to its resource limit, but not
quite.

Reported and tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
Reviewed by:	rwatson
MFC after:	2 weeks
2012-11-20 15:45:48 +00:00
Konstantin Belousov
b7c8d2f2f5 Add a special meaning to the negative ticks argument for
taskqueue_enqueue_timeout().  Do not rearm the callout if it is
already armed and ticks is negative.  Otherwise rearm it to fire
in abs(ticks) ticks in the future.

The intended use is to call taskqueue_enqueue_timeout() for the given
timeout_task with the same negative ticks argument.  As result, the
task is scheduled to execute not further than abs(ticks) ticks in
future, and the consequent enqueues are coalesced until the already
scheduled task is finished.

Reviewed by:	rwatson
Tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
MFC after:	2 weeks
2012-11-20 15:33:48 +00:00
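A kernel-side sketch of the coalescing pattern this enables; the task and function names are hypothetical and the TIMEOUT_TASK_INIT() setup is omitted.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/taskqueue.h>

static struct timeout_task flush_task;	/* initialized elsewhere with TIMEOUT_TASK_INIT() */

/*
 * Called on every request.  With a negative ticks argument the task is
 * scheduled at most hz ticks in the future, and repeated calls while it
 * is pending do not push the deadline further out, so bursts coalesce
 * into a single run per second.
 */
static void
request_flush(void)
{
	(void)taskqueue_enqueue_timeout(taskqueue_thread, &flush_task, -hz);
}
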
Attilio Rao
973b795b64 insmntque() is always called with the lock held in exclusive mode,
so:
- assume the lock is held in exclusive mode and remove a moot check
  of the lock acquisition.
- in the destructor, remove the !MPSAFE-specific chunk.

Reviewed by:	kib
MFC after:	2 weeks
2012-11-19 20:43:19 +00:00