freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	6cebf7e2be	Move bufshutdown() out of the #ifdef INVARIANTS block.	2015-07-29 09:57:34 +00:00
Jeff Roberson	98082691bb	- Make 'struct buf *buf' private to vfs_bio.c. Having a global variable 'buf' is inconvenient and has lead me to some irritating to discover bugs over the years. It also makes it more challenging to refactor the buf allocation system. - Move swbuf and declare it as an extern in vfs_bio.c. This is still not perfect but better than it was before. - Eliminate the unused ffs function that relied on knowledge of the buf array. - Move the shutdown code that iterates over the buf array into vfs_bio.c. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-07-29 02:26:57 +00:00
Jeff Roberson	38750ada8f	- Eliminate the EMPTYKVA queue. It served as a cache of KVA allocations attached to bufs to avoid the overhead of the vm. This purposes is now better served by vmem. Freeing the kva immediately when a buf is destroyed leads to lower fragmentation and a much simpler scan algorithm. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-07-28 20:24:09 +00:00
Ed Schouten	b114aa7959	Make shutdown() return ENOTCONN as required by POSIX, part deux. Summary: Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right. Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries. I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seams reasonable, so let's do the same. Test Plan: This issue was uncovered while writing tests for shutdown() in CloudABI: https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26 Reviewers: glebius, rwatson, #manpages, gnn, #network Reviewed By: gnn, #network Subscribers: bms, mjg, imp Differential Revision: https://reviews.freebsd.org/D3039	2015-07-27 13:17:57 +00:00
Andrey V. Elsukov	41f5f69f96	Build debug version of rmlock's methods only when LOCK_DEBUG > 0. Currently LOCK_DEBUG is always defined in sys/lock.h (0 or 1). This means that debugging code always built. In addition the kernel modules have always defined LOCK_DEBUG as 1. So, debugging rmlock code is always used by kernel modules. MFC after: 1 week	2015-07-26 10:53:32 +00:00
Konstantin Belousov	6fd04eff66	With the removal of b_saveaddr in the r285819, b_data must be reset to b_kvabase when the buffer is reclaimed. Otherwise, if b_data for the mapped buffer was adjusted with the page-offset portion of b_offset, nothing would re-adjust the b_data, which breaks buffer management code which expects page-aligned b_data (see e.g. bpman_qenter(), which skips partial pages). Fix a minor issue with the GB_KVAALLOC requests, which could result in returning the mapped buffer if the reused buffer is mapped and have the right amount of KVA reserved. Improve assertion in the vfs_buf_check_mapped() to catch unmapped buffers which have their b_data incorrectly adjusted with offset. Reported and tested by: pho (previous version) Reviewed by: jeff (previous version) Sponsored by: The FreeBSD Foundation	2015-07-25 15:00:14 +00:00
Xin LI	1a7c14aec7	Fix a typo in comment. Submitted by: Yanhui Shen via twitter MFC after: 3 days	2015-07-24 22:13:39 +00:00
Marius Strobl	7815d3948c	o Revert the other functional half of r239864, i. e. the merge of r134227 from x86 to use smp_ipi_mtx spin lock not only for smp_rendezvous_cpus() but also for the MD cache invalidation, TLB demapping and remote register reading IPIs due to the following reasons: - The cross-IPI SMP deadlock x86 otherwise is subject to can't happen on sparc64. That's because on sparc64, spin locks don't disable interrupts completely but only raise the processor interrupt level to PIL_TICK. This means that IPIs still get delivered and direct dispatch IPIs such as the cache invalidation etc. IPIs in question are still executed. - In smp_rendezvous_cpus(), smp_ipi_mtx is held not only while sending an IPI_RENDEZVOUS, but until all CPUs have processed smp_rendezvous_action(). Consequently, smp_ipi_mtx may be locked for an extended amount of time as queued IPIs (as opposed to the direct ones) such as IPI_RENDEZVOUS are scheduled via a soft interrupt. Moreover, given that this soft interrupt is only delivered at PIL_RENDEZVOUS, processing of smp_rendezvous_action() on a target may be interrupted by f. e. a tick interrupt at PIL_TICK, in turn leading to the target in question trying to send an IPI by itself while IPI_RENDEZVOUS isn't fully handled, yet, and, thus, resulting in a deadlock. o As mentioned in the commit message of r245850, on least some sun4u platforms concurrent sending of IPIs by different CPUs is fatal. Therefore, hold the reintroduced MD ipi_mtx also while delivering cross-traps via MI helpers, i. e. ipi_{all_but_self,cpu,selected}(). o Akin to x86, let the last CPU to process cpu_mp_bootstrap() set smp_started instead of the BSP in cpu_mp_unleash(). This ensures that all APs actually are started, when smp_started is no longer 0. o In all MD and MI IPI helpers, check for smp_started == 1 rather than for smp_cpus > 1 or nothing at all. This avoids races during boot causing IPIs trying to be delivered to APs that in fact aren't up and running, yet. While at it, move setting of the cpu_ipi_{selected,single}() pointers to the appropriate delivery functions from mp_init() to cpu_mp_start() where it's better suited and allows to get rid of the global isjbus variable. o Given that now concurrent IPI delivery no longer is possible, also nuke the delays before completely disabling interrupts again in the CPU-specific cross-trap delivery functions, previously giving other CPUs a window for sending IPIs on their part. Actually, we now should be able to entirely get rid of completely disabling interrupts in these functions. Such a change needs more testing, though. o In {s,}tick_get_timecount_mp(), make the {s,}tick variable static. While not necessary for correctness, this avoids page faults when accessing the stack of a foreign CPU as {s,}tick now is locked into the TLBs as part of static kernel data. Hence, {s,}tick_get_timecount_mp() always execute as fast as possible, avoiding jitter. PR: 201245 MFC after: 3 days	2015-07-24 15:13:21 +00:00
Sergey Kandaurov	ef88ae77ea	Call ksem_get() with initialized 'rights'. ksem_get() consumes fget(), and it's mandatory there. Reported by: truckman Reviewed by: mjg	2015-07-23 23:18:03 +00:00
Jeff Roberson	fade8dd714	Refactor unmapped buffer address handling. - Use pointer assignment rather than a combination of pointers and flags to switch buffers between unmapped and mapped. This eliminates multiple flags and generally simplifies the logic. - Eliminate b_saveaddr since it is only used with pager bufs which have their b_data re-initialized on each allocation. - Gather up some convenience routines in the buffer cache for manipulating buf space and buf malloc space. - Add an inline, buf_mapped(), to standardize checks around unmapped buffers. In collaboration with: mlaier Reviewed by: kib Tested by: pho (many small revisions ago) Sponsored by: EMC / Isilon Storage Division	2015-07-23 19:13:41 +00:00
Jeff Roberson	1c1ddc0351	- Don't defeat the FIFO nature of the buffer cache by eliminating the most recently used buffer when we are under paging pressure. This is a perversion of the buffer and page replacement algorithms and recent improvements to the page daemon have rendered it unnecessary. In the event that low-memory deadlocks become an issue it would be possible to make a daemon or event handler that performs a similar action on the oldest buffers rather than the newest. Since the buf cache is analogous to the page cache and some minimum working set is desired another possibility is to simply shrink the minimum working set which has less downside now that file pages are not directly mapped. Sponsored by: EMC / Isilon Reviewed by: alc, kib (with some minor objection) Tested by: pho	2015-07-23 02:20:41 +00:00
Konstantin Belousov	e637a6e3f9	The smp_rendezvous_cpus() function should ensure that all accesses done by the functions called on other CPUs, are visible to the caller. Pair otherwise useless acquire on smp_rv_waiters[3] with a release add to ensure synchronized with relation, which guarantees visibility. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-07-21 22:56:46 +00:00
Konstantin Belousov	01f5e0866b	The part of r285680 which removed release semantic for two stores to it_need was wrong []. Restore the releases and add a comment explaining why it is needed. Noted by: alc [] Reviewed by: bde [*] Sponsored by: The FreeBSD Foundation	2015-07-21 14:39:34 +00:00
Sergey Kandaurov	94df6fad1d	Fix sb_state constant names as used e.g. to display in DDB ``show sockbuf''. MFC after: 1 week	2015-07-21 09:57:13 +00:00
Ed Schouten	5a170c1b0e	Add an API for easily creating userspace threads in kernelspace. This change refactors the existing create_thread() function to be more generic. It replaces almost all of its arguments by a callback that can be used to extract the thread ID and copy it out to the right place, but also to perform additional initialization steps, such as setting the trapframe. This also makes the difference between thr_new() and thr_create() more clear in my opinion. This function is going to be used by the CloudABI compatibility layer. It looks like the OpenSolaris compatibility framework already provides a function called thread_create(). Rename this function to do_thread_create() and use a macro to deal with the namespacing conflict. A similar approach is already used for thread_exit(). MFC after: 1 month	2015-07-20 10:20:04 +00:00
Alexander Motin	d3e2e28e74	Fix typo in comment. Submitted by: Masao Uebayashi	2015-07-20 09:37:42 +00:00
Mark Johnston	97cc6870f6	Don't increment the spin count until after the first attempt to acquire a rwlock read lock. Otherwise the lockstat:::rw-spin probe will fire spuriously. MFC after: 1 week	2015-07-19 22:26:02 +00:00
Kirk McKusick	1b79b9498b	Restructure code for readability improvement. No functional change. Reviewed by: kib	2015-07-19 22:25:16 +00:00
Mark Johnston	de2c95cc00	Consistently use a reader/writer flag for lockstat probes in rwlock(9) and sx(9), rather than using the probe function name to determine whether a given lock is a read lock or a write lock. Update lockstat(1) accordingly.	2015-07-19 22:24:33 +00:00
Mark Johnston	32cd0147fa	Implement the lockstat provider using SDT(9) instead of the custom provider in lockstat.ko. This means that lockstat probes now have typed arguments and will utilize SDT probe hot-patching support when it arrives. Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D2993	2015-07-19 22:14:09 +00:00
Marcelo Araujo	f19e47d691	Add support to the jail framework to be able to mount linsysfs(5) and linprocfs(5). Differential Revision: D2846 Submitted by: Nikolai Lifanov <lifanov@mail.lifanov.com> Reviewed by: jamie	2015-07-19 08:52:35 +00:00
Konstantin Belousov	283dfee925	Further cleanup after r285607. Remove useless release semantic for some stores to it_need. For stores where the release is needed, add a comment explaining why. Fence after the atomic_cmpset() op on the it_need should be acquire only, release is not needed (see above). The combination of atomic_cmpset() + fence_acq() is better expressed there as atomic_cmpset_acq(). Use atomic_cmpset() for swi' ih_need read and clear. Discussed with: alc, bde Reviewed by: bde Comments wording provided by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-18 19:59:29 +00:00
Konstantin Belousov	b4490c6e93	The si_status field of the siginfo_t, provided by the waitid(2) and SIGCHLD signal, should keep full 32 bits of the status passed to the _exit(2). Split the combined p_xstat of the struct proc into the separate exit status p_xexit for normal process exit, and signalled termination information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs old p_xstat from p_xexit and p_xsig. p_xexit contains complete status and copied out into si_status. Requested by: Joerg Schilling Reviewed by: jilles (previous version), pho Tested by: pho Sponsored by: The FreeBSD Foundation	2015-07-18 09:02:50 +00:00
Mark Johnston	c6d48c8752	Fix the !KDTRACE_HOOKS build. X-MFC-With: r285664	2015-07-18 04:38:11 +00:00
Mark Johnston	e2b25737ee	Pass the lock object to lockstat_nsecs() and return immediately if LO_NOPROFILE is set. Some timecounter handlers acquire a spin mutex, and we don't want to recurse if lockstat probes are enabled. PR: 201642 Reviewed by: avg MFC after: 3 days	2015-07-18 00:57:30 +00:00
Mark Johnston	efe8b26b82	Modify lockstat_nsecs() to just return unless lockstat probes are actually enabled. The cost of a timecounter read can be quite significant, and the problem became more apparent after r284297, since that change resulted in a call to lockstat_nsecs() for each acquisition of an rwlock read lock. PR: 201642 Reviewed by: avg Tested by: Jason Unovitch MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D3073	2015-07-18 00:22:00 +00:00
Ed Schouten	fd054c2df9	Undo r285656. It turns out that the CDDL sources already introduce a function called thread_create(). I'll investigate what we can do to make these functions coexist. Reported by: Ivan Klymenko	2015-07-17 22:26:45 +00:00
Ed Schouten	82a3d2cbfc	Add an API for easily creating userspace threads in kernelspace. This change refactors the existing create_thread() function to be more generic. It replaces almost all of its arguments by a callback that can be used to extract the thread ID and copy it out to the right place, but also to perform additional initialization steps, such as setting the trapframe. This also makes the difference between thr_new() and thr_create() more clear in my opinion. This function is going to be used by the CloudABI compatibility layer. Reviewed by: kib MFC after: 1 month	2015-07-17 16:34:01 +00:00
Mateusz Guzik	2919a0c5c1	fd: partially deduplicate fdescfree and fdescfree_remapped This also moves vrele of cdir/rdir/jdir vnodes earlier, which should not matter.	2015-07-16 15:26:37 +00:00
Mateusz Guzik	cd672ca60f	Get rid of lim_update_thread and cred_update_thread. Their primary use was in thread_cow_update to free up old resources. Freeing had to be done with proc lock held and _cow_ funcs already knew how to free old structs.	2015-07-16 14:30:11 +00:00
Mateusz Guzik	752fc07d33	vfs: implement v_holdcnt/v_usecount manipulation using atomic ops Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free list) of either counter are still guarded with vnode interlock. Reviewed by: kib (earlier version) Tested by: pho	2015-07-16 13:57:05 +00:00
Ed Schouten	457f7e23b1	Implement CloudABI's exec() call. Summary: In a runtime that is purely based on capability-based security, there is a strong emphasis on how programs start their execution. We need to make sure that we execute an new program with an exact set of file descriptors, ensuring that credentials are not leaked into the process accidentally. Providing the right file descriptors is just half the problem. There also needs to be a framework in place that gives meaning to these file descriptors. How does a CloudABI mail server know which of the file descriptors corresponds to the socket that receives incoming emails? Furthermore, how will this mail server acquire its configuration parameters, as it cannot open a configuration file from a global path on disk? CloudABI solves this problem by replacing traditional string command line arguments by tree-like data structure consisting of scalars, sequences and mappings (similar to YAML/JSON). In this structure, file descriptors are treated as a first-class citizen. When calling exec(), file descriptors are passed on to the new executable if and only if they are referenced from this tree structure. See the cloudabi-run(1) man page for more details and examples (sysutils/cloudabi-utils). Fortunately, the kernel does not need to care about this tree structure at all. The C library is responsible for serializing and deserializing, but also for extracting the list of referenced file descriptors. The system call only receives a copy of the serialized data and a layout of what the new file descriptor table should look like: int proc_exec(int execfd, const void data, size_t datalen, const int fds, size_t fdslen); This change introduces a set of fd*_remapped() functions: - fdcopy_remapped() pulls a copy of a file descriptor table, remapping all of the file descriptors according to the provided mapping table. - fdinstall_remapped() replaces the file descriptor table of the process by the copy created by fdcopy_remapped(). - fdescfree_remapped() frees the table in case we aborted before fdinstall_remapped(). We then add a function exec_copyin_data_fds() that builds on top these functions. It copies in the data and constructs a new remapped file descriptor. This is used by cloudabi_sys_proc_exec(). Test Plan: cloudabi-run(1) is capable of spawning processes successfully, providing it data and file descriptors. procstat -f seems to confirm all is good. Regular FreeBSD processes also work properly. Reviewers: kib, mjg Reviewed By: mjg Subscribers: imp Differential Revision: https://reviews.freebsd.org/D3079	2015-07-16 07:05:42 +00:00
Konstantin Belousov	70a3efc14f	Do not use atomic_swap_int(9), it is not available on all architectures. Atomic_cmpset_int(9) is a direct replacement, due to loop. The change fixes arm, arm64, mips an sparc64, which lack atomic_swap(). Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-15 21:44:16 +00:00
Konstantin Belousov	615b6ea2c8	Reset non-zero it_need indicator to zero atomically with fetching its current value. It is believed that the change is the real fix for the issue which was covered over by the r252683. With the current code, if the interrupt handler sets it_need between read and consequent reset, the update could be lost and ithread_execute_handlers() would not be called in response to the lost update. The r252683 could have hide the issue since at the moment of commit, atomic_load_acq_int() did locked cmpxchg on the variable, which puts the cache line into the exclusive owned state and clears store buffers. Then the immediate store of zero has very high chance of reusing the exclusive state of the cache line and make the load and store sequence operate as atomic swap. For now, add the acq+rel fence immediately after the swap, to not disturb current (but excessive) ordering. Acquire is needed for the ih_need reads after the load, while release does not serve a useful purpose []. Reviewed by: alc Noted by: alc [] Discussed with: bde Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-15 17:36:35 +00:00
Konstantin Belousov	03bbcb2f0c	Style. Remove excessive brackets. Compare non-boolean with zero. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-15 17:14:05 +00:00
Mark Johnston	02d131ad11	Fix some error-handling bugs when core dump compression is enabled: - Ensure that core dump parameters are initialized in the error path. - Don't call gzio_fini() on a NULL stream. Reported by: rpaulo	2015-07-14 18:24:05 +00:00
Conrad Meyer	0c40f3532d	Fix cleanup race between unp_dispose and unp_gc unp_dispose and unp_gc could race to teardown the same mbuf chains, which can lead to dereferencing freed filedesc pointers. This patch adds an IGNORE_RIGHTS flag on unpcbs marking the unpcb's RIGHTS as invalid/freed. The flag is protected by UNP_LIST_LOCK. To serialize against unp_gc, unp_dispose needs the socket object. Change the dom_dispose() KPI to take a socket object instead of an mbuf chain directly. PR: 194264 Differential Revision: https://reviews.freebsd.org/D3044 Reviewed by: mjg (earlier version) Approved by: markj (mentor) Obtained from: mjg MFC after: 1 month Sponsored by: EMC / Isilon Storage Division	2015-07-14 02:00:50 +00:00
Mateusz Guzik	6161705823	exec: textvp -> oldtextvp; binvp -> newtextvp This makes it consistent with the rest of the naming in do_execve. No functional changes.	2015-07-14 01:13:37 +00:00
Mateusz Guzik	853be5ffef	exec plug a redundant vref + vrele of the image vnode	2015-07-14 00:43:08 +00:00
Mateusz Guzik	e94e50af1d	racct: perform a lockless check for p_throttled This reduces proc lock contention. Reviewed by: trasz	2015-07-13 22:52:11 +00:00
Conrad Meyer	c578e0fb48	pipe_direct_write: Fix mismatched pipelock/unlock If a signal is caught in pipelock, causing it to fail, pipe_direct_write should not try to pipeunlock. Reported by: pho Differential Revision: https://reviews.freebsd.org/D3069 Reviewed by: kib Approved by: markj (mentor) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-07-13 17:45:22 +00:00
Ian Lepore	969fc29e0b	Use the monotonic (uptime) counter rather than time-of-day to measure elapsed time between ntp_adjtime() clock offset adjustments. This eliminates spurious frequency steering after a large clock step (such as a 1970->2015 step on a system with no battery-backed clock hardware). This problem was discovered after the import of ntpd 4.2.8, which does things in a slightly different (but still correct) order than the 4.2.4 we had previously. In particular, 4.2.4 would step the clock then immediately after use ntp_adjtime() to set the frequency and offset to zero, which captured the post-step time-of-day as a side effect. In 4.2.8, ntpd sets frequency and offset to zero before any initial clock step, capturing the time as 1970-ish, then when it next calls ntp_adjtime() it's with a non-zero offset measurement. This non-zero value gets multiplied by the apparent 45-year interval, which blows up into a completely bogus frequency steer. That gets clamped to 500ppm, but that's still enough to make the clock drift so fast that ntpd has to keep stepping it every few minutes to compensate.	2015-07-12 18:38:17 +00:00
Bjoern A. Zeeb	97fc027722	Try to unbreak the build after r285390 removing the obsolete static declaration.	2015-07-12 00:26:22 +00:00
Mateusz Guzik	c634b75204	vfs: always clear VI_OWEINACT in consumers bumping v_usecount Previously vputx would detect the condition and clear the flag. With this change it is invalid to have both v_usecount > 0 and the flag set. Assert the condition is met in all revlevant places. Reviewed by: kib	2015-07-11 16:28:55 +00:00
Mateusz Guzik	2d1ca3cdff	vfs: move si_usecount manipulation to dedicated functions Reviewed by: kib	2015-07-11 16:28:12 +00:00
Mateusz Guzik	8a08cec166	Create a dedicated function for ensuring that cdir and rdir are populated. Previously several places were doing it on its own, partially incorrectly (e.g. without the filedesc locked) or even actively harmful by populating jdir or assigning rootvnode without vrefing it. Reviewed by: kib	2015-07-11 16:22:48 +00:00
Mateusz Guzik	f0725a8e1e	Move chdir/chroot-related fdp manipulation to kern_descrip.c Prefix exported functions with pwd_. Deduplicate some code by adding a helper for setting fd_cdir. Reviewed by: kib	2015-07-11 16:19:11 +00:00
Adrian Chadd	871ef8b0d8	Regenerate syscalls.	2015-07-11 15:22:11 +00:00
Adrian Chadd	6520495abc	Add an initial NUMA affinity/policy configuration for threads and processes. This is based on work done by jeff@ and jhb@, as well as the numa.diff patch that has been circulating when someone asks for first-touch NUMA on -10 or -11. * Introduce a simple set of VM policy and iterator types. * tie the policy types into the vm_phys path for now, mirroring how the initial first-touch allocation work was enabled. * add syscalls to control changing thread and process defaults. * add a global NUMA VM domain policy. * implement a simple cascade policy order - if a thread policy exists, use it; if a process policy exists, use it; use the default policy. * processes inherit policies from their parent processes, threads inherit policies from their parent threads. * add a simple tool (numactl) to query and modify default thread/process policities. * add documentation for the new syscalls, for numa and for numactl. * re-enable first touch NUMA again by default, as now policies can be set in a variety of methods. This is only relevant for very specific workloads. This doesn't pretend to be a final NUMA solution. The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by 'sysctl vm.default_policy=rr'. This is only relevant if MAXMEMDOM is set to something other than 1. Ie, if you're using GENERIC or a modified kernel with non-NUMA, then this is a glorified no-op for you. Thank you to Norse Corp for giving me access to rather large (for FreeBSD!) NUMA machines in order to develop and verify this. Thank you to Dell for providing me with dual socket sandybridge and westmere v3 hardware to do NUMA development with. Thank you to Scott Long at Netflix for providing me with access to the two-socket, four-domain haswell v3 hardware. Thank you to Peter Holm for running the stress testing suite against the NUMA branch during various stages of development! Tested: * MIPS (regression testing; non-NUMA) * i386 (regression testing; non-NUMA GENERIC) * amd64 (regression testing; non-NUMA GENERIC) * westmere, 2 socket (thankyou norse!) * sandy bridge, 2 socket (thankyou dell!) * ivy bridge, 2 socket (thankyou norse!) * westmere-EX, 4 socket / 1TB RAM (thankyou norse!) * haswell, 2 socket (thankyou norse!) * haswell v3, 2 socket (thankyou dell) * haswell v3, 2x18 core (thankyou scott long / netflix!) * Peter Holm ran a stress test suite on this work and found one issue, but has not been able to verify it (it doesn't look NUMA related, and he only saw it once over many testing runs.) * I've tested bhyve instances running in fixed NUMA domains and cpusets; all seems to work correctly. Verified: * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different NUMA policies for processes under test. Review: This was reviewed through phabricator (https://reviews.freebsd.org/D2559) as well as privately and via emails to freebsd-arch@. The git history with specific attributes is available at https://github.com/erikarn/freebsd/ in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy). This has been reviewed by a number of people (stas, rpaulo, kib, ngie, wblock) but not achieved a clear consensus. My hope is that with further exposure and testing more functionality can be implemented and evaluated. Notes: * The VM doesn't handle unbalanced domains very well, and if you have an overly unbalanced memory setup whilst under high memory pressure, VM page allocation may fail leading to a kernel panic. This was a problem in the past, but it's much more easily triggered now with these tools. * This work only controls the path through vm_phys; it doesn't yet strongly/predictably affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory isn't really guaranteed in any way. That's next on my plate. Sponsored by: Norse Corp, Inc.; Dell	2015-07-11 15:21:37 +00:00
Konstantin Belousov	cf88021ab1	Do not allow creation of the dirty buffers for the dead buffer objects, i.e. for buffer objects which vnode was reclaimed. Buffer cache cannot write such buffers. Return the error and discard the buffer immediately on write attempt. BO_DIRTY now always set during vnode reclamation, since it is used not only for the INVARIANTS checks. Do allow placement of the clean buffers on dead bufobj list, otherwise filesystems cannot use bufcache at all after the devvp reclaim. Reported and tested by: trasz Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-11 11:21:56 +00:00
Ed Schouten	ea566832d7	Add missing const keyword to kern_sigaction()'s 'act' parameter. This structure is not modified by the function. Also add const to sigact_flag_test(), as it is called by kern_sigaction().	2015-07-10 14:39:46 +00:00
Mateusz Guzik	9a1ad66fb5	fd: further cleanup of kern_dup - make mode enum start from 0 so that the assertion covers all cases [1] - rename prefix _CLOEXEC flag with _FLAG - postpone fhold on the old file descriptor, which eliminates the need to fdrop in error cases. - fixup FDDUP_FCNTL check missed in the previous commit This removes 'fp == oldfde->fde_file' assertion which had little value. kern_dup only calls fd-related functions which cannot drop the lock or a whole lot of races would be introduced. Noted by: kib [1]	2015-07-10 13:54:03 +00:00
Mateusz Guzik	5fe97c20dc	fd: split kern_dup flags argument into actual flags and a mode Tidy up the code inside to switch on the mode.	2015-07-10 11:01:30 +00:00
Konstantin Belousov	e8677f3885	Change the mb() use in the sched_ult tdq_notify() and sched_idletd() to more C11-ish atomic_thread_fence_seq_cst(). Note that on PowerPC, which currently uses lwsync for mb(), the change actually fixes the missed store/load barrier, intended by r271604 []. Reviewed by: alc Noted by: alc [] Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-07-10 08:54:12 +00:00
Ed Schouten	47a84387ad	Let listen() return EDESTADDRREQ when not bound. We currently return EINVAL when calling listen() on a UNIX socket that has not been bound to a pathname. If my interpretation of POSIX is correct, we should return EDESTADDRREQ: "The socket is not bound to a local address, and the protocol does not support listening on an unbound socket." Return EDESTADDRREQ instead when not bound and not connected. Differential Revision: https://reviews.freebsd.org/D3038 Reviewed by: gnn, network	2015-07-10 06:47:14 +00:00
Mateusz Guzik	318b946321	vfs: cosmetic changes to namei and namei_handle_root - don't initialize cnp during declaration - don't test error/!error, compare to 0 instead	2015-07-09 17:17:26 +00:00
Mateusz Guzik	d177f49f6f	vfs: simplify error handling in namei The logic is reorganised so that there is one exit point prior to the lookup loop. This is an intermediate step to making audit logging functions use found vnode instead of translating ni_dirfd on their own. ni_startdir validation is removed. The only in-tree consumer is nfs which already makes sure it is a directory. Reviewed by: kib	2015-07-09 16:32:58 +00:00
Ed Schouten	2491302a04	Add implementations for some of the CloudABI file descriptor system calls. All of the CloudABI system calls that operate on file descriptors of an arbitrary type are prefixed with fd_. This change adds wrappers for most of these system calls around their FreeBSD equivalents. The dup2() system call present on CloudABI deviates from POSIX, in the sense that it can only be used to replace existing file descriptor. It cannot be used to create new ones. The reason for this is that this is inherently thread-unsafe. Furthermore, there is no need on CloudABI to use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no special meaning. This change exposes the kern_dup() through <sys/syscallsubr.h> and puts the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag, FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not allocated. Differential Revision: https://reviews.freebsd.org/D3035 Reviewed by: mjg	2015-07-09 16:07:01 +00:00
Mateusz Guzik	efdc25304c	fd: prepare do_dup for being exported - rename it to kern_dup. - prefix flags with FD - assert that correct flags were passed	2015-07-09 15:19:45 +00:00
Mateusz Guzik	d19ba50e12	vfs: avoid spurious vref/vrele for absolute lookups namei used to vref fd_cdir, which was immediatley vrele'd on entry to the loop. Check for absolute lookup and vref the right vnode the first time. Reviewed by: kib	2015-07-09 15:06:58 +00:00
Mateusz Guzik	a03f1b2970	vfs: plug a use-after-free of fd_rdir in namei fd_rdir vnode was stored in ni_rootdir without refing it in any way, after which the filedsc lock was being dropped. The vnode could have been freed by mountcheckdirs or another thread doing chroot. VREF the vnode while the lock is held. Reviewed by: kib MFC after: 1 week	2015-07-09 15:06:24 +00:00
Ed Schouten	3a41ec6af7	Don't clobber td->td_retval[0] in proc_reap(). While writing tests for CloudABI, I noticed that close() on process descriptors returns the process ID of the child process. This is interesting, as close() is only allowed to return 0 or -1. It turns out that we clobber td->td_retval[0] in proc_reap(), so that wait*() properly returns the process ID. Change proc_reap() to leave td->td_retval[0] alone. Set the return value in kern_wait6() instead, by keeping track of the PID before we (potentially) reap the process. Differential Revision: https://reviews.freebsd.org/D3032 Reviewed by: kib	2015-07-09 12:04:45 +00:00
Konstantin Belousov	fcb5b3a419	Cover a race between doselwakeup() and selfdfree(). If doselwakeup() loop finds the selfd entry and clears its sf_si pointer, which is handled by selfdfree() in parallel, NULL sf_si makes selfdfree() free the memory. The result is the race and accesses to the freed memory. Refcount the selfd ownership. One reference is for the sf_link linkage, which is unconditionally dereferenced by selfdfree(). Another reference is for sf_threads, both selfdfree() and doselwakeup() race to deref it, the winner unlinks and than frees the selfd entry. Reported by: Larry Rosenman <ler@lerctr.org> Tested by: Larry Rosenman <ler@lerctr.org>, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-09 09:22:21 +00:00
Ed Schouten	6d338f9a81	Import the CloudABI datatypes and create a system call table. CloudABI is a pure capability-based runtime environment for UNIX. It works similar to Capsicum, except that processes already run in capabilities mode on startup. All functionality that conflicts with this model has been omitted, making it a compact binary interface that can be supported by other operating systems without too much effort. CloudABI is 'secure by default'; the idea is that it should be safe to run arbitrary third-party binaries without requiring any explicit hardware virtualization (Bhyve) or namespace virtualization (Jails). The rights of an application are purely determined by the set of file descriptors that you grant it on startup. The datatypes and constants used by CloudABI's C library (cloudlibc) are defined in separate files called syscalldefs_mi.h (pointer size independent) and syscalldefs_md.h (pointer size dependent). We import these files in sys/contrib/cloudabi and wrap around them in cloudabi*_syscalldefs.h. We then add stubs for all of the system calls in sys/compat/cloudabi or sys/compat/cloudabi64, depending on whether the system call depends on the pointer size. We only have nine system calls that depend on the pointer size. If we ever want to support 32-bit binaries, we can simply add sys/compat/cloudabi32 and implement these nine system calls again. The next step is to send in code reviews for the individual system call implementations, but also add a sysentvec, to allow CloudABI executabled to be started through execve(). More information about CloudABI: - GitHub: https://github.com/NuxiNL/cloudlibc - Talk at BSDCan: https://www.youtube.com/watch?v=SVdF84x1EdA Differential Revision: https://reviews.freebsd.org/D2848 Reviewed by: emaste, brooks Obtained from: https://github.com/NuxiNL/freebsd	2015-07-09 07:20:15 +00:00
Konstantin Belousov	f4b5a9725a	Reimplement the ordering requirements for the timehands updates, and for timehands consumers, by using fences. Ensure that the timehands->th_generation reset to zero is visible before the data update is visible []. tc_setget() allowed data update writes to become visible before generation (but not on TSO architectures). Remove tc_setgen(), tc_getgen() helpers, use atomics inline []. Noted by: alc [] Requested by: bde [**] Reviewed by: alc, bde Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-07-08 18:42:08 +00:00
Konstantin Belousov	69d11def74	Handle copyout for the fcntl(F_OGETLK) using oflock structure. Otherwise, kernel overwrites a word past the destination. Submitted by: walter@pelissero.de PR: 196718 MFC after: 1 week	2015-07-08 13:19:13 +00:00
Mark Johnston	620711e033	Fix an incorrect assertion in witness. The number of available lock list entries for a thread is LOCK_CHILDCOUNT, and each entry can record up to LOCK_NCHILDREN locks. When iterating over the locks held by a thread, a bound on the loop index is therefore given by LOCK_CHILDCOUNT * LOCK_NCHILDREN; WITNESS_COUNT is an unrelated constant. Reviewed by: jhb MFC after: 1 week Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D2974	2015-07-07 19:29:18 +00:00
Pedro F. Giffuni	9129dd59be	Relocate sched_random() within the SMP section. Place sched_random nearer to where it's first used: moving the code nearer to where it is used makes the code easier to read and we can reduce the initial "#ifdef SMP" island. Reword a little the comment and clean some whitespaces while here.	2015-07-07 15:22:29 +00:00
Mateusz Guzik	aa0e2887f4	tty: replace several curthread->td_proc with stored curproc No functional changes.	2015-07-06 18:53:56 +00:00
Patrick Kelsey	6f99ea0520	Don't acquire sysctlmemlock in userland_sysctl() when the old value pointer is NULL, as in that case there are no userland pages that could potentially be wired. It is common for old to be NULL and oldlenp to be non-NULL in calls to userland_sysctl(), as this is used to probe for the length of a variable-length sysctl entry before retrieving a value. Note that it is typical for such calls to be made with an uninitialized value in oldlenp, so sysctlmemlock was essentially being acquired at random (depending on the uninitialized value in oldlenp being > PAGE_SIZE or not) for these calls prior to this patch. Differential Revision: https://reviews.freebsd.org/D2987 Reviewed by: mjg, kib Approved by: jmallett (mentor) MFC after: 1 month	2015-07-06 16:07:21 +00:00
Konstantin Belousov	9889bbac23	Mutex memory is not zeroed, add MTX_NEW. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-06 14:09:00 +00:00
Mark Johnston	947401dd50	Move the comment describing namei(9) back to namei()'s definition. MFC after: 3 days	2015-07-05 22:56:41 +00:00
Mark Johnston	8bbd1f25b1	Remove a stale descriptive comment for gbincore(). The splay trees referenced in the comment were converted to path-compressed tries in r250551. MFC after: 3 days	2015-07-05 22:44:41 +00:00
Mark Johnston	5f34e93c58	Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT. This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough filesystems like nullfs and unionfs no longer need to inherit this information from their lower layer(s). This change also restores the pre-r273336 behaviour of using the presence of a susp_clean VFS method to request suspension support. Reviewed by: kib, mjg Differential Revision: https://reviews.freebsd.org/D2937	2015-07-05 22:37:33 +00:00
Mateusz Guzik	f131759f54	fd: make 'rights' a manadatory argument to fget* functions	2015-07-05 19:05:16 +00:00
Mariusz Zaborski	54f98da930	Move the nvlist source and private includes from sys/kern to seperate directory sys/contrib/libnv. The goal of this operation is to NOT install header files which shouldn't be used outside the nvlist library. Approved by: pjd (mentor)	2015-07-04 16:33:37 +00:00
Mateusz Guzik	9ca30b0e06	vfs: use shared vnode locking when looking up ".." in vop_stdvptocnp Briefly discussed with: kib	2015-07-04 15:46:39 +00:00
Mateusz Guzik	dba0bec2bb	fd: de-k&r-ify functions + some whitespace fixes No functional changes.	2015-07-04 15:42:03 +00:00
Mateusz Guzik	ee5f66f820	sysctl: get rid of sysctl_lock/unlock Inline their contents into the only consumer.	2015-07-04 14:44:39 +00:00
Mateusz Guzik	d5fc115a1a	sysctl: remove a debugging printf which crept in with r285125	2015-07-04 07:01:43 +00:00
Mateusz Guzik	b8633775a8	sysctl: switch sysctllock to a sleepable rmlock The lock is almost never taken for writing.	2015-07-04 06:54:15 +00:00
Mateusz Guzik	e2f5418e73	sysvshm: fix up some whitespace issues and spurious initialisation	2015-07-02 19:14:30 +00:00
Mateusz Guzik	77a26248a3	sysvshm: don't lock proc when calculating attach_va vm_daddr is constant and RLIMIT_DATA can be obtained from thread's copy of rlimits.	2015-07-02 19:03:44 +00:00
Mateusz Guzik	0be3a191a4	sysvshm: fix shmrealloc The code was supposed to initialize new segs in newsegs array, but used the old pointer.	2015-07-02 19:00:22 +00:00
Konstantin Belousov	1965f86c72	Vnode is not referenced by the vfs_domount() at the point where asserts are made. Remove them, since we might dereference freed memory. Leaked locks are asserted by the syscall return code anyway. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-02 14:31:47 +00:00
Navdeep Parhar	9523d1bfc3	Fix leak in tcp_lro_rx. Simply clearing M_PKTHDR isn't enough, any tags hanging off the header need to be freed too. Differential Revision: https://reviews.freebsd.org/D2708 Reviewed by: ae@, hiren@	2015-06-30 17:19:58 +00:00
Mark Murray	d1b06863fb	Huge cleanup of random(4) code. * GENERAL - Update copyright. - Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set neither to ON, which means we want Fortuna - If there is no 'device random' in the kernel, there will be NO random(4) device in the kernel, and the KERN_ARND sysctl will return nothing. With RANDOM_DUMMY there will be a random(4) that always blocks. - Repair kern.arandom (KERN_ARND sysctl). The old version went through arc4random(9) and was a bit weird. - Adjust arc4random stirring a bit - the existing code looks a little suspect. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Redo read_random(9) so as to duplicate random(4)'s read internals. This makes it a first-class citizen rather than a hack. - Move stuff out of locked regions when it does not need to be there. - Trim RANDOM_DEBUG printfs. Some are excess to requirement, some behind boot verbose. - Use SYSINIT to sequence the startup. - Fix init/deinit sysctl stuff. - Make relevant sysctls also tunables. - Add different harvesting "styles" to allow for different requirements (direct, queue, fast). - Add harvesting of FFS atime events. This needs to be checked for weighing down the FS code. - Add harvesting of slab allocator events. This needs to be checked for weighing down the allocator code. - Fix the random(9) manpage. - Loadable modules are not present for now. These will be re-engineered when the dust settles. - Use macros for locks. - Fix comments. * src/share/man/... - Update the man pages. * src/etc/... - The startup/shutdown work is done in D2924. * src/UPDATING - Add UPDATING announcement. * src/sys/dev/random/build.sh - Add copyright. - Add libz for unit tests. * src/sys/dev/random/dummy.c - Remove; no longer needed. Functionality incorporated into randomdev.. live_entropy_sources.c live_entropy_sources.h - Remove; content moved. - move content to randomdev.[ch] and optimise. * src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h - Remove; plugability is no longer used. Compile-time algorithm selection is the way to go. * src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h - Add early (re)boot-time randomness caching. * src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h - Remove; no longer needed. * src/sys/dev/random/uint128.h - Provide a fake uint128_t; if a real one ever arrived, we can use that instead. All that is needed here is N=0, N++, N==0, and some localised trickery is used to manufacture a 128-bit 0ULLL. * src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h - Improve unit tests; previously the testing human needed clairvoyance; now the test will do a basic check of compressibility. Clairvoyant talent is still a good idea. - This is still a long way off a proper unit test. * src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct fortuna_start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) * src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h - Improve messy union to just uint128_t. - Remove unneeded 'staic struct start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) - Fix some magic numbers elsewhere used as FAST and SLOW. Differential Revision: https://reviews.freebsd.org/D2025 Reviewed by: vsevolod,delphij,rwatson,trasz,jmg Approved by: so (delphij)	2015-06-30 17:00:45 +00:00
Konstantin Belousov	6ef120027f	Do not calculate the stack's bottom address twice. Submitted by: Olivц╘r Pintц╘r Review: https://reviews.freebsd.org/D2953 MFC after: 1 week	2015-06-30 15:22:47 +00:00
Mark Murray	6687b6720b	Ansify another function. This is the last in the file, I hope.	2015-06-28 10:51:08 +00:00
Mark Murray	7233d3094d	ANSIfy the only function that uses K&R definition in this file.	2015-06-28 09:44:58 +00:00
Konstantin Belousov	b2c3df842b	Handle errors from background write of the cylinder group blocks. First, on the write error, bufdone() call from ffs_backgroundwrite() panics because pbrelvp() cleared bp->b_bufobj, while brelse() would try to re-dirty the copy of the cg buffer. Handle this by setting B_INVAL for the case of BIO_ERROR. Second, we must re-dirty the real buffer containing the cylinder group block data when background write failed. Real cg buffer was already marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan may reuse the buffer at any moment. The result is lost write, and if the write error was only transient, we get corrupted bitmaps. We cannot re-dirty the original cg buffer in the ffs_backgroundwritedone(), since the context is not sleepable, preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag (protected by the buffer object lock), which is converted into delayed write by brelse(), bqrelse() and buffer scan. In collaboration with: Conrad Meyer <cse.cem@gmail.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation (kib), EMC/Isilon storage division (Conrad) MFC after: 2 weeks	2015-06-27 09:44:14 +00:00
Adrian Chadd	5bbb2169d2	Un-static cpuset_which() - it's useful in other contexts, such as some CPU set operations in my upcoming NUMA work. Tested/compiled: * i386 (run) * amd64 (run) * mips (run) * mips64 (run) * armv6 (built) Sponsored by: Norse Corp, Inc.	2015-06-26 04:14:05 +00:00
Mateusz Guzik	7150ce743a	rlimit: deduplicate code in chg* functions	2015-06-25 00:15:37 +00:00
Sean Bruno	4e83b32a80	At the suggestion of jhb, replace atomic_set/clear calls with use of exclusive locks in the enable/disable interpreter path. Tested with WITNESS/INVARIANTS on and off. Reviewed by: sson davide	2015-06-24 15:52:26 +00:00
John-Mark Gurney	1977bd233a	zero this struct as it depends upon it... Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D2890	2015-06-23 18:40:20 +00:00
Konstantin Belousov	b05c401ff6	Only take previous buffer queue lock (olock) when needed for REMFREE in binsfree(). Submitted by: Conrad Meyer Sponsored by: EMC / Isilon Storage Division Review: https://reviews.freebsd.org/D2882 MFC after: 1 week	2015-06-23 06:12:14 +00:00
Sean Bruno	945afa7c25	Make imgact_binmisc_exec() static. Submitted by: kib Reviewed by: sson	2015-06-22 17:04:24 +00:00
Sean Bruno	602ec83516	Remove uneeded NULL check since malloc the malloc is now M_WAITOK Submitted by: mjg	2015-06-19 20:35:17 +00:00
Sean Bruno	e0ae213f63	Must have one of either M_WAITOK or M_NOWAIT, read the man page bruno. Submitted by: mjg	2015-06-19 19:57:39 +00:00
Sean Bruno	a7647ec444	Feedback from commit r284535 davide: imgact_binmisc_clear_entry() needs to use atomic ops to remove the enable bit. kib: M_NOWAIT is not warranted and comment is invalid.	2015-06-19 18:57:36 +00:00
Sean Bruno	5f98711d51	This change replaces the mutex with a sx lock for the interpreter list to avoid the problem of holding a non-sleep lock during a page fault as reported by witness. It also uses atomics where possible to avoid having to acquire the exclusive lock. In addition, it consistently uses memset()/memcpy() instead of bzero()/bcopy(). Differential Revision: https://reviews.freebsd.org/D1971 Submitted by: sson Reviewed by: jhb	2015-06-18 02:04:20 +00:00
Bjoern A. Zeeb	af10bf055f	Initialise pr_enforce_statfs from the "default" sysctl value and not from the compile time constant. The sysctl value is seeded from the compile time constant. MFC after: 2 weeks	2015-06-17 13:15:54 +00:00
Konstantin Belousov	1eabd96728	vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages. The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk not later than in 30 seconds after the write. Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-06-17 04:46:58 +00:00
Pedro F. Giffuni	708694f704	Use nitems() macro instead of __arraycount()	2015-06-16 20:19:00 +00:00
Mateusz Guzik	4da8456f0a	Replace struct filedesc argument in getvnode with struct thread This is is a step towards removal of spurious arguments.	2015-06-16 13:09:18 +00:00
Mateusz Guzik	9ef8328d52	fd: make rights a mandatory argument to fget_unlocked	2015-06-16 09:52:36 +00:00
Mateusz Guzik	80f3623f2f	fd: don't unnecessary copy capabilities in _fget	2015-06-16 09:08:30 +00:00
Mateusz Guzik	cedab3c72c	fd: reduce excessive zeroing on fd close fde_file as NULL is already an indicator of an unused fd. All other fields are populated when fp is installed.	2015-06-14 14:10:05 +00:00
Mateusz Guzik	ea31808c3b	fd: move out actual fp installation to _finstall Use it in fd passing functions as the first step towards fd code cleanup.	2015-06-14 14:08:52 +00:00
Jeremie Le Hen	3768a5dfb5	nit: Rename racct_alloc_resource to racct_adjust_resource. This is more accurate as the amount can be negative. MFC after: 2 weeks	2015-06-14 08:33:14 +00:00
Gleb Smirnoff	093c7f396d	Make KPI of vm_pager_get_pages() more strict: if a pager changes a page in the requested array, then it is responsible for disposition of previous page and is responsible for updating the entry in the requested array. Now consumers of KPI do not need to re-lookup the pages after call to vm_pager_get_pages(). Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-06-12 11:32:20 +00:00
Andriy Gapon	076dd8eb2e	several lockstat improvements 0. For spin events report time spent spinning, not a loop count. While loop count is much easier and cheaper to obtain it is hard to reason about the reported numbers, espcially for adaptive locks where both spinning and sleeping can happen. So, it's better to compare apples and apples. 1. Teach lockstat about FreeBSD rw locks. This is done in part by changing the corresponding probes and in part by changing what probes lockstat should expect. 2. Teach lockstat that rw locks are adaptive and can spin on FreeBSD. 3. Report lock acquisition events for successful rw try-lock operations. 4. Teach lockstat about FreeBSD sx locks. Reporting of events for those locks completely mirrors rw locks. 5. Report spin and block events before acquisition event. This is behavior documented for the upstream, so it makes sense to stick to it. Note that because of FreeBSD adaptive lock implementations both the spin and block events may be reported for the same acquisition while the upstream reports only one of them. Differential Revision: https://reviews.freebsd.org/D2727 Reviewed by: markj MFC after: 17 days Relnotes: yes Sponsored by: ClusterHQ	2015-06-12 10:01:24 +00:00
Mateusz Guzik	3331a33a42	ussreq: use saved fdp pointer insted of td->td_proc->p_fd No functional changes.	2015-06-12 06:28:22 +00:00
Konstantin Belousov	529c97886b	Tweaks for r284178: Do not include machine/atomic.h explicitely, the header is already included by sys/systm.h. Force inlining of tc_getgen() and tc_setgen(). The functions are used more than once, which causes compilers with non-aggressive inlining policies to generate calls. Suggested by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-06-11 04:41:54 +00:00
Mateusz Guzik	21de5aea6c	Fixup the build after r284215. Submitted by: Ivan Klymenko <fidaj ukr.net> [slighly modified]	2015-06-10 12:39:01 +00:00
Mateusz Guzik	f6f6d24062	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thred's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.	2015-06-10 10:48:12 +00:00
Mateusz Guzik	4ea6a9a28f	Generalised support for copy-on-write structures shared by threads. Thread credentials are maintained as follows: each thread has a pointer to creds and a reference on them. The pointer is compared with proc's creds on userspace<->kernel boundary and updated if needed. This patch introduces a counter which can be compared instead, so that more structures can use this scheme without adding more comparisons on the boundary.	2015-06-10 10:43:59 +00:00
Mateusz Guzik	3b3eb22ab6	fd: remove fdesc_mtx	2015-06-10 09:40:07 +00:00
Mateusz Guzik	153cc61b54	fd: use atomics to manage fd_refcnt and fd_holcnt This gets rid of fdesc_mtx.	2015-06-10 09:34:50 +00:00
Kenneth D. Merry	5672fac935	Add support for reading MAM attributes to camcontrol(8) and libcam(3). MAM is Medium Auxiliary Memory and is most commonly found as flash chips on tapes. This includes support for reading attributes and decoding most known attributes, but does not yet include support for writing attributes or reporting attributes in XML format. libsbuf/Makefile: Add subr_prf.c for the new sbuf_hexdump() function. This function is essentially the same function. libsbuf/Symbol.map: Add a new shared library minor version, and include the sbuf_hexdump() function. libsbuf/Version.def: Add version 1.4 of the libsbuf library. libutil/hexdump.3: Document sbuf_hexdump() alongside hexdump(3), since it is essentially the same function. camcontrol/Makefile: Add attrib.c. camcontrol/attrib.c: Implementation of READ ATTRIBUTE support for camcontrol(8). camcontrol/camcontrol.8: Document the new 'camcontrol attrib' subcommand. camcontrol/camcontrol.c: Add the new 'camcontrol attrib' subcommand. camcontrol/camcontrol.h: Add a function prototype for scsiattrib(). share/man/man9/sbuf.9: Document the existence of sbuf_hexdump() and point users to the hexdump(3) man page for more details. sys/cam/scsi/scsi_all.c: Add a table of known attributes, text descriptions and handler functions. Add a new scsi_attrib_sbuf() function along with a number of other related functions that help decode attributes. scsi_attrib_ascii_sbuf() decodes ASCII format attributes. scsi_attrib_int_sbuf() decodes binary format attributes, and will pass them off to scsi_attrib_hexdump_sbuf() if they're bigger than 8 bytes. scsi_attrib_vendser_sbuf() decodes the vendor and drive serial number attribute. scsi_attrib_volcoh_sbuf() decodes the Volume Coherency Information attribute that LTFS writes out. sys/cam/scsi/scsi_all.h: Add a number of attribute-related structure definitions and other defines. Add function prototypes for all of the functions added in scsi_all.c. sys/kern/subr_prf.c: Add a new function, sbuf_hexdump(). This is the same as the existing hexdump(9) function, except that it puts the result in an sbuf. This also changes subr_prf.c so that it can be compiled in userland for includsion in libsbuf. We should work to change this so that the kernel hexdump implementation is a wrapper around sbuf_hexdump() with a statically allocated sbuf with a drain. That will require a drain function that goes to the kernel printf() buffer that can take a non-NUL terminated string as input. That is because an sbuf isn't NUL-terminated until it is finished, and we don't want to finish it while we're still using it. We should also work to consolidate the userland hexdump and kernel hexdump implemenatations, which are currently separate. This would also mean making applications that currently link in libutil link in libsbuf. sys/sys/sbuf.h: Add the prototype for sbuf_hexdump(), and add another copy of the hexdump flag values if they aren't already defined. Ideally the flags should be defined in one place but the implemenation makes it difficult to do properly. (See above.) Sponsored by: Spectra Logic Corporation MFC after: 1 week	2015-06-09 21:39:38 +00:00
Konstantin Belousov	2c6946dca2	When updating/accessing the timehands, barriers are needed to ensure that: - th_generation update is visible after the parameters update is visible; - the read of parameters is not reordered before initial read of th_generation. On UP kernels, compiler barriers are enough. For SMP machines, CPU barriers must be used too, as was confirmed by submitter by testing on the Freescale T4240 platform with 24 PowerPC processors. Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 1 week	2015-06-09 11:49:56 +00:00
John Baldwin	15c2b30155	Revert r284153, as I believe it breaks the dtrace sdt module. I will fix the original issue a different way.	2015-06-08 18:06:00 +00:00
Ed Maste	6b16d66497	Add user facing errors for exceeding process memory limits Previously the process terminating with SIGABRT at startup was the only notification. PR: 200617 Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D2731	2015-06-08 16:07:07 +00:00
John Baldwin	69c5c774fe	Add an internal "locked" variant of linker_file_lookup_set() and change the public function to acquire the global linker lock directly. This permits linker_file_lookup_set() to be safely used from other modules.	2015-06-08 14:06:47 +00:00
Mark Johnston	8b84d791a0	witness: don't warn about matrix inconsistencies without holding the mutex Lock order checking is done without the witness mutex held, so multiple threads that are racing to establish a new lock order may read matrix entries that are in an inconsistent state. Don't print a warning in this case, but instead just redo the check after taking the witness lock. Differential Revision: https://reviews.freebsd.org/D2713 Reviewed by: jhb MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division	2015-06-07 18:59:47 +00:00
Sean Bruno	280b716943	Revert 284029, update imgact_binmisctl.c change mtx to reader count, at the request of the submitter. Will attempt to use an sx_lock for this fix to WITNESS crashes in a later revision. Submitted by: sson	2015-06-05 18:16:10 +00:00
Sean Bruno	8c8613a14f	This change uses a reader count instead of holding the mutex for the interpreter list to avoid the problem of holding a non-sleep lock during a page fault as reported by witness. In addition, it consistently uses memset()/memcpy() instead of bzero()/bcopy() except in the case where bcopy() is required (i.e. overlapping copy). Differential Revision: https://reviews.freebsd.org/D2123 Submitted by: sson MFC after: 2 weeks Relnotes: Yes	2015-06-05 16:21:43 +00:00
John Baldwin	7077c42623	Add a new file operations hook for mmap operations. File type-specific logic is now placed in the mmap hook implementation rather than requiring it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to support mmap() as well as potentially allowing mmap() for existing file types that do not currently support any mapping. The vm_mmap() function is now split up into two functions. A new vm_mmap_object() function handles the "back half" of vm_mmap() and accepts a referenced VM object to map rather than a (handle, handle_type) tuple. vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a a VM object and then calling vm_mmap_object() to handle the actual mapping. The vm_mmap() function remains for use by other parts of the kernel (e.g. device drivers and exec) but now only supports mapping vnodes, character devices, and anonymous memory. The mmap() system call invokes vm_mmap_object() directly with a NULL object for anonymous mappings. For mappings using a file descriptor, the descriptors fo_mmap() hook is invoked instead. The fo_mmap() hook is responsible for performing type-specific checks and adjustments to arguments as well as possibly modifying mapping parameters such as flags or the object offset. The fo_mmap() hook routines then call vm_mmap_object() to handle the actual mapping. The fo_mmap() hook is optional. If it is not set, then fo_mmap() will fail with ENODEV. A fo_mmap() hook is implemented for regular files, character devices, and shared memory objects (created via shm_open()). While here, consistently use the VM_PROT_* constants for the vm_prot_t type for the 'prot' variable passed to vm_mmap() and vm_mmap_object() as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines. Previously some places were using the mmap()-specific PROT_* constants instead. While this happens to work because PROT_xx == VM_PROT_xx, using VM_PROT_* is more correct. Differential Revision: https://reviews.freebsd.org/D2658 Reviewed by: alc (glanced over), kib MFC after: 1 month Sponsored by: Chelsio	2015-06-04 19:41:15 +00:00
Eric van Gyzen	63e4c6cdf9	Provide vnode in memory map info for files on tmpfs When providing memory map information to userland, populate the vnode pointer for tmpfs files. Set the memory mapping to appear as a vnode type, to match FreeBSD 9 behavior. This fixes the use of tmpfs files with the dtrace pid provider, procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY). Submitted by: Eric Badger <eric@badgerio.us> (initial revision) Obtained from: Dell Inc. PR: 198431 MFC after: 2 weeks Reviewed by: jhb Approved by: kib (mentor)	2015-06-02 18:37:04 +00:00
Xin LI	4c372ca254	Clear p_stops when doing PT_DETACH. Without this, if a process was being traced by truss(1), which uses different p_stops bits than gdb(1), the latter would misbehave because of the unexpected bits. Reported by: jceel Submitted by: sef Sponsored by: iXsystems, Inc. MFC after: 2 weeks	2015-06-01 18:15:45 +00:00
Konstantin Belousov	aef68c961a	When delivering a signal with default disposition to the thread, tdsigwakeup() increases the priority of the low-priority threads, to give them a chance to be terminated timely. Also, kernel allows user to signal kernel processes. The combined effect is that signalling idle process bump a priority of the selected delivery thread, which starts eating CPU. Check for the delivery thread be an idle thread and do not raise its priority then. The signal delivery to the kernel threads must be opt-in feature. Kernel thread should explicitely declare the ability to handle signals directed to it. E.g., nfsd threads check for signal as an indication of exit request. Most threads do not handle signals at all, and queuing the signal to them causes odd side-effects. Most innocent consequence is the memory leak due to queued ksiginfo, which is never deleted from the sigqueue. Code to prevent even queuing signals to the kernel threads is trivial, but it requires careful examination of each call to kproc/kthread creation to decide should the signalling be allowed. The commit is a stop-gap measure which fixes the immediate case for now. PR: 200493 Reported and tested by: trasz Discussed with: trasz, emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-29 16:26:08 +00:00
Konstantin Belousov	69baeadc31	Remove several write-only variables, all reported by the gcc 4.9 buildkernel run. Some of them were write-only under some kernel options, e.g. variables keeping values only used by CTR() macros. It costs nothing to the code readability and correctness to eliminate the warnings in those cases too by removing the local cached values used only for single-access. Review: https://reviews.freebsd.org/D2665 Reviewed by: rodrigc Looked at by: bjk Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-29 13:24:17 +00:00
Konstantin Belousov	780dca1b1e	Right now, dounmount() is called with unreferenced mount point. Nothing stops a parallel unmount to suceed before the given call to dounmount() checks and locks the covered vnode. Prevent dounmount() from acting on the freed (although type-stable) memory by changing the interface to require the mount point to be referenced. dounmount() consumes the reference on return, regardless of the sucessfull or erronous result. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:22:50 +00:00
Konstantin Belousov	2db0e1f50d	Add V_MNTREF flag to the vn_start_write(9) and vn_start_secondary_write(9) functions. The flag indicates that the caller already owns a reference on the mount point, and the functions can consume it. The reference is released by vn_finished_write(9) and vn_finished_secondary_write(9) in due course. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:21:47 +00:00
Konstantin Belousov	1bc93bb7b9	Currently, softupdate code detects overstepping on the workitems limits in the code which is deep in the call stack, and owns several critical system resources, like vnode locks. Attempt to wait while the per-mount softupdate thread cleans up the backlog may deadlock, because the thread might need to lock the same vnode which is owned by the waiting thread. Instead of synchronously waiting for the worker, perform the worker' tickle and pause until the backlog is cleaned, at the safe point during return from kernel to usermode. A new ast request to call softdep_ast_cleanup() is created, the SU code now only checks the size of queue and schedules ast. There is no ast delivery for the kernel threads, so they are exempted from the mechanism, except NFS daemon threads. NFS server loop explicitely checks for the request, and informs the schedule_cleanup() that it is capable of handling the requests by the process P2_AST_SU flag. This is needed because nfsd may be the sole cause of the SU workqueue overflow. But, to not cause nsfd to spawn additional threads just because we slow down existing workers, only tickle su threads, without waiting for the backlog cleanup. Reviewed by: jhb, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:20:42 +00:00
John Baldwin	57c74f5b64	Do not allow a process to reap an orphan (a child currently being traced by another process such as a debugger). The parent process does need to check for matching orphan pids to avoid returning ECHILD if an orphan has exited, but it should not return the exited status for the child until after the debugger has detached from the orphan process either explicitly or implicitly via wait(). Add two tests for for this case: one where the debugger is the direct child (thus the parent has a non-empty children list) and one where the debugger is not a direct child (so the only "child" of the parent is the orphan). Differential Revision: https://reviews.freebsd.org/D2644 Reviewed by: kib MFC after: 2 weeks	2015-05-26 10:29:37 +00:00
Xin LI	eb3d0c5d8c	MFuser/delphij/zfs-arc-rebase@r281754: In r256613, taskqueue_enqueue_locked() have been modified to release the task queue lock before returning. In r276665, taskqueue_drain_all() will call taskqueue_enqueue_locked() to insert the barrier task into the queue, but did not reacquire the lock after it but later code expects the lock still being held (e.g. TQ_SLEEP()). The barrier task is special and if we release then reacquire the lock, there would be a small race window where a high priority task could sneak into the queue. Looking more closely, the race seems to be tolerable but is undesirable from semantics standpoint. To solve this, in taskqueue_drain_tq_queue(), instead of directly calling taskqueue_enqueue_locked(), insert the barrier task directly without releasing the lock.	2015-05-26 01:40:33 +00:00
John Baldwin	515b7a0b97	Add KTR tracing for some MI ptrace events. Differential Revision: https://reviews.freebsd.org/D2643 Reviewed by: kib	2015-05-25 22:13:22 +00:00
Dmitry Chagin	7236f2c220	For future use in the Linuxulator: 1. Add a kern_kqueue() counterpart for kqueue() with flags parameter. 2. Be a bit secure. To avoid a double fp lookup add a kern_kevent_fp() counterpart for kern_kevent() with file pointer parameter instead of file descriptor an pass the buck to it. Suggested by: mjg [2] Differential Revision: https://reviews.freebsd.org/D1091 Reviewed by: trasz	2015-05-24 16:36:29 +00:00
Dmitry Chagin	91d1786f65	In preparation for switching linuxulator to the use the native 1:1 threads add a hook for cleaning thread resources before the thread die. Differential Revision: https://reviews.freebsd.org/D1038	2015-05-24 14:51:29 +00:00
Dmitry Chagin	a93e83c8d7	In preparation for switching linuxulator to the use the native 1:1 threads split sys_sched_getparam(), sys_sched_setparam(), sys_sched_getscheduler(), sys_sched_setscheduler() to their kern_* counterparts and add targettd parameter to allow specify the target thread directly by callee. Differential Revision: https://reviews.freebsd.org/D1034 Reviewed by: trasz	2015-05-24 14:44:06 +00:00
Dmitry Chagin	1aa90eca33	In preparation for switching linuxulator to the use the native 1:1 threads refactor kern_sched_rr_get_interval() and sys_sched_rr_get_interval(). Add a kern_sched_rr_get_interval() counterpart which takes a targettd parameter to allow specify target thread directly by callee (new Linuxulator). Linuxulator temporarily uses first thread in proc. Move linux_sched_rr_get_interval() to the MI part. Differential Revision: https://reviews.freebsd.org/D1032 Reviewed by: trasz	2015-05-24 14:39:26 +00:00
Dmitry Chagin	09baafb471	In preparation for switching linuxulator to the use the native 1:1 threads introduce kern_thr_alloc() which will be used later in the linux_clone(). Differential Revision: https://reviews.freebsd.org/D1029 Reviewed by: trasz	2015-05-24 14:37:45 +00:00
Dmitry Chagin	95be6d2b1f	In preparation for switching linuxulator to the use the native 1:1 threads split sys_thr_exit() up into sys_thr_exit() and kern_thr_exit(). Move Where the second will be used in linux_exit() system call later. Differential Revision: https://reviews.freebsd.org/D1028 Reviewed by: trasz	2015-05-24 14:36:33 +00:00
Konstantin Belousov	3077f938b4	If thread requested to not stop on non-boundary, then not only stopping signals should obey, but also all forms of single-threading. Otherwise, thread might sleep interruptible while owning some resources, and single-threading thread could try to access them. An example is owning vnode lock while dumping core. Submitted by: Conrad Meyer Review: https://reviews.freebsd.org/D2612 Tested by: pho MFC after: 1 week	2015-05-23 19:09:04 +00:00
Warner Losh	ee960398a0	Fix typo in symbol name. It helps to hit save in all your buffers before committing.	2015-05-22 21:10:14 +00:00
Warner Losh	d36eec691a	Export the eflags field from the elf header. This allows better discrimination between different subarch binaries, at least for mips and arm. Arm is implemented, mips is still tbd, so not currently exported. aarch64 does not export this because aarch64 binaries use different tags and flags than arm. Differential Revision: https://reviews.freebsd.org/D2611	2015-05-22 20:50:35 +00:00
Jung-uk Kim	fd90e2ed54	CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten years for head. However, it is continuously misused as the mpsafe argument for callout_init(9). Deprecate the flag and clean up callout_init() calls to make them more consistent. Differential Revision: https://reviews.freebsd.org/D2613 Reviewed by: jhb MFC after: 2 weeks	2015-05-22 17:05:21 +00:00
John Baldwin	21119d0641	Expand ktr_mask to be a 64-bit unsigned integer. The mask does not really need to be updated with atomic operations and the downside of losing races during transitions is not great (it is not marked volatile, so those races are pretty wide open as it is). Differential Revision: https://reviews.freebsd.org/D2595 Reviewed by: emaste, neel, rpaulo MFC after: 2 weeks	2015-05-22 11:09:41 +00:00
John Baldwin	c209e3e2e6	Only reparent a traced process to its old parent if the tracing process is not the old parent. Otherwise, proc_reap() will leave the zombie in place resulting in the process' status being returned twice to its parent. Add test cases for PT_TRACE_ME and PT_ATTACH which are fixed by this change. Differential Revision: https://reviews.freebsd.org/D2594 Reviewed by: kib MFC after: 2 weeks	2015-05-22 11:04:54 +00:00
John Baldwin	c636f94bd2	Revert r282971. It depends on condvar consumers not destroying condvars until all threads sleeping on a condvar have resumed execution after being awakened. However, there are cases where that guarantee is very hard to provide.	2015-05-21 16:43:26 +00:00
Pedro F. Giffuni	cd508278c1	ddb: finish converting boolean values. The replacement started at r283088 was necessarily incomplete without replacing boolean_t with bool. This also involved cleaning some type mismatches and ansifying old C function declarations. Pointed out by: bde Discussed with: bde, ian, jhb	2015-05-21 15:16:18 +00:00
Mariusz Zaborski	963bc7a03f	Fix memory leak. Approved by: pjd (mentor)	2015-05-20 17:48:22 +00:00
Mariusz Zaborski	1acd888ff5	Style. Approved by: pjd (mentor)	2015-05-20 17:47:01 +00:00
Mariusz Zaborski	823870acb8	Always use the nv_free function. Approved by: pjd (mentor)	2015-05-20 17:44:58 +00:00
Alan Somers	7a9c38e681	Properly null-terminate strings in a kernel dump header. A version string longer than 192 bytes will cause the version field of a dump header to overflow. strncpy doesn't null terminate it, so savecore will print a corrupted info file. Using strlcpy fixes the bug. Differential Revision: https://reviews.freebsd.org/D2560 Reviewed by: markj MFC after: 3 weeks Sponsored by: Spectra Logic	2015-05-19 16:23:47 +00:00
Mateusz Guzik	747c0dd67c	fd: fix imbalanced fdp unlock in F_SETLK and F_GETLK MFC after: 3 days	2015-05-18 14:27:04 +00:00
Mateusz Guzik	c3293b83c4	Tidy up sys_umask a little bit Consistently use saved fdp pointer as it cannot change. If it could change the code would be already incorrect. No functional changes.	2015-05-18 13:43:33 +00:00
John Baldwin	5c894ee2ef	Previously, cv_waiters was only updated by cv_signal or cv_wait. If a thread awakened due to a time out, then cv_waiters was not decremented. If INT_MAX threads timed out on a cv without an intervening cv_broadcast, then cv_waiters could overflow. To fix this, have each sleeping thread decrement cv_waiters when it resumes. Note that previously cv_waiters was protected by the sleepq chain lock. However, that lock is not held when threads resume from sleep. In addition, the interlock is also not always reacquired after resuming (cv_wait_unlock), nor is it always held by callers of cv_signal() or cv_broadcast(). Instead, use atomic ops to update cv_waiters. Since the sleepq chain lock is still held on every increment, it should still be safe to compare cv_waiters against zero while holding the lock in the wakeup routines as the only way the race should be lost would result in extra calls to sleepq_signal() or sleepq_broadcast(). Differential Revision: https://reviews.freebsd.org/D2427 Reviewed by: benno Reported by: benno (wrap of cv_waiters in the field) MFC after: 2 weeks	2015-05-15 13:50:37 +00:00
Konstantin Belousov	100ac78be1	On amd64, make proc0 pmap initialization slightly more correct. In particular, switch to the proc0 pmap to have expected %cr3 and PCID for the thread0 during initialization, and the up to date pm_active mask. pmap_pinit0() should be done after proc0->p_vmspace is assigned so that the amd64 pmap_activate() find the correct curproc pmap. Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-05-15 08:30:29 +00:00
Konstantin Belousov	84cdea97e5	Right now, the process' p_boundary_count counter is decremented by the suspended thread itself, on the return path from thread_suspend_check(). A consequence is that return from thread_single_end(SINGLE_BOUNDARY) may leave p_boundary_count non-zero, it might be even equal to the threads count. Now, assume that we have two threads in the process, both calling execve(2). Suppose that the first thread won the race to be the suspension thread, and that afterward its exec failed for any reason. After the first thread did thread_single_end(SINGLE_BOUNDARY), second thread becomes the process suspension thread and checks p_boundary_count. The non-zero value of the count allows the suspension loop to finish without actually suspending some threads. In other words, we enter exec code with some threads not suspended. Fix this by decrementing p_boundary_count in the thread_single_end()->thread_unsuspend_one() during marking the thread as runnable. This way, a return from thread_single_end() guarantees that the counter is cleared. We do not care whether the unsuspended thread has a chance to run. Add some asserts to ensure the state of the process when single boundary suspension is lifted. Also make thread_unuspend_one() static. In collaboration with: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-15 07:54:31 +00:00
Jonathan Anderson	60aa2c85fa	Allow sizeof(cpuset_t) to be queried in capability mode. This allows functions that retrieve and inspect pthread_attr_t objects to work correctly: querying the cpuset_t size is part of querying CPU affinity information, which is part of creating a complete pthread_attr_t. Approved by: rwatson (mentor) Reviewed by: pjd Sponsored by: NSERC	2015-05-14 15:14:03 +00:00
Edward Tomasz Napierala	ba8f0eb8fc	Build GENERIC with RACCT/RCTL support by default. Note that it still needs to be enabled by adding "kern.racct.enable=1" to /boot/loader.conf. Differential Revision: https://reviews.freebsd.org/D2407 Reviewed by: emaste@, wblock@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation	2015-05-14 14:03:55 +00:00
Konstantin Belousov	7b445033ff	On exec, single-threading must be enforced before arguments space is allocated from exec_map. If many threads try to perform execve(2) in parallel, the exec map is exhausted and some threads sleep uninterruptible waiting for the map space. Then, the thread which won the race for the space allocation, cannot single-thread the process, causing deadlock. Reported and tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-10 09:00:40 +00:00
Konstantin Belousov	44ec2b63c5	The vmem callback to reclaim kmem arena address space on low or fragmented conditions currently just wakes up the pagedaemon. The kmem arena is significantly smaller then the total available physical memory, which means that there are loads where kmem arena space could be exhausted, while there is a lot of pages available still. The woken up pagedaemon sees vm_pages_needed != 0, verifies the condition vm_paging_needed() which is false, clears the pass and returns back to sleep, not calling neither uma_reclaim() nor lowmem handler. To handle low kmem arena conditions, create additional pagedaemon thread which calls uma_reclaim() directly. The thread sleeps on the dedicated channel and kmem_reclaim() wakes the thread in addition to the pagedaemon. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-09 20:08:36 +00:00
Konstantin Belousov	ac437c0754	Do not return from thread_single(SINGLE_BOUNDARY) until all stopped thread are guarenteed to be removed from the processors. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-09 18:32:13 +00:00
Andrey V. Elsukov	089bb672c4	m_dup() is supposed to give a writable copy of an mbuf chain. It uses m_dup_pkthdr(), that uses M_COPYFLAGS mask to copy m_flags field. If original mbuf chain has M_RDONLY flag, its copy also will have it. Reset this flag explicitly. MFC after: 2 weeks	2015-05-07 18:35:01 +00:00
Mateusz Guzik	edf1796d3e	Fix up panics when fork fails due to hitting proc limit The function clearning credentials on failure asserts the process is a zombie, which is not true when fork fails. Changing creds to NULL is unnecessary, but is still being done for consistency with other code. Pointy hat: mjg Reported by: pho	2015-05-06 21:03:19 +00:00
Ian Lepore	28315e27a7	Implement a mechanism for making changes in the kernel<->driver PPS interface without breaking ABI or API compatibility with existing drivers. The existing data structures used to communicate between the kernel and driver portions of PPS processing contain no spare/padding fields and no flags field or other straightforward mechanism for communicating changes in the structures or behaviors of the code. This makes it difficult to MFC new features added to the PPS facility. ABI compatibility is important; out-of-tree drivers in module form are known to exist. (Note that the existing api_version field in the pps_params structure must contain the value mandated by RFC 2783 and any RFCs that come along after.) These changes introduce a pair of abi-version fields which are filled in by the driver and the kernel respectively to indicate the interface version. The driver sets its version field before calling the new pps_init_abi() function. That lets the kernel know how much of the pps_state structure is understood by the driver and it can avoid using newer fields at the end of the structure that it knows about if the driver is a lower version. The kernel fills in its version field during the init call, letting the driver know what features and data the kernel supports. To implement the new version information in a way that is backwards compatible with code from before these changes, the high bit of the lightly-used 'kcmode' field is repurposed as a flag bit that indicates the driver is aware of the abi versioning scheme. Basically if this bit is clear that indicates a "version 0" driver and if it is set the driver_abi field indicates the version. These changes also move the recently-added 'mtx' field of pps_state from the middle to the end of the structure, and make the kernel code that uses this field conditional on the driver being abi version 1 or higher. It changes the only driver currently supplying the mtx field, usb_serial, to use pps_init_abi(). Reviewed by: hselasky@	2015-05-04 17:59:39 +00:00
Mariusz Zaborski	a523fa069f	nv_malloc can fail in userland. Add check to prevent a NULL pointer dereference. Pointed out by: mjg Approved by: pjd (mentor)	2015-05-02 18:12:34 +00:00
Mariusz Zaborski	da20d06f84	Remove duplicated code using macro template for the nvlist_add_.* functions. Approved by: pjd (mentor)	2015-05-02 18:10:45 +00:00
Mariusz Zaborski	169c153b59	Introduce the NV_FLAG_NO_UNIQUE flag. When set, it allows to store multiple values using the same key in a nvlist. Approved by: pjd (mentor) Obtained from: WHEEL Systems (http://www.wheelsystems.com) Update man page. Reviewed by: AllanJude Approved by: pjd (mentor)	2015-05-02 18:03:47 +00:00
Mariusz Zaborski	bd1da0a002	Approved, oprócz użycie RESTORE_ERRNO() do ustawiania errno. Change the nvlist_recv() function to take additional argument that specifies flags expected on the received nvlist. Receiving a nvlist with different set of flags than the ones we expect might lead to undefined behaviour, which might be potentially dangerous. Update consumers of this and related functions and update the tests. Approved by: pjd (mentor) Update man page for nvlist_unpack, nvlist_recv, nvlist_xfer, cap_recv_nvlist and cap_xfer_nvlist. Reviewed by: AllanJude Approved by: pjd (mentor)	2015-05-02 17:45:52 +00:00
Bjoern A. Zeeb	44aa151e1b	Fix an off-by-one bug in string/array handling which lead to memory overwrite and follow-up assertion errors on at least ARM after r282257, with nvp_magic being 0x6e7600: Assertion failed: ((nvp)->nvp_magic == 0x6e7670), function nvpair_name, file .../subr_nvpair.c, line 713. Sponsored by: DARPA/AFRL	2015-05-02 08:31:16 +00:00
Mark Johnston	9ad64f27be	Remove a stale reference to the stop_scheduler_on_panic tunable, which itself was removed in r243515. MFC after: 1 week	2015-05-02 00:27:58 +00:00
Mariusz Zaborski	24f93ee714	Add nvlist_flags() function, which returns nvlist's public flags. Approved by: pjd (mentor)	2015-05-01 17:50:24 +00:00
Mariusz Zaborski	b7be86aca6	Mark local function as static as a result of removing recursion. Approved by: pjd (mentor)	2015-04-30 20:50:42 +00:00
Mariusz Zaborski	45fd5ced7b	Rename macros to use prefix ERRNO. Add macro ERRNO_SET. Now ERRNO_{RESTORE/SAVE} must by used together, additional variable is not needed. Always use ERRNO_{SAVE/RESTORE/SET} macros. Approved by: pjd (mentor)	2015-04-30 20:47:33 +00:00
John Baldwin	ed95805e90	Remove support for Xen PV domU kernels. Support for HVM domU kernels remains. Xen is planning to phase out support for PV upstream since it is harder to maintain and has more overhead. Modern x86 CPUs include virtualization extensions that support HVM guests instead of PV guests. In addition, the PV code was i386 only and not as well maintained recently as the HVM code. - Remove the i386-only NATIVE option that was used to disable certain components for PV kernels. These components are now standard as they are on amd64. - Remove !XENHVM bits from PV drivers. - Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3, etc.) - Remove duplicate copy of <xen/features.h>. - Remove unused, i386-only xenstored.h. Differential Revision: https://reviews.freebsd.org/D2362 Reviewed by: royger Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0) Relnotes: yes	2015-04-30 15:48:48 +00:00
Mariusz Zaborski	be73922fcd	Save errno from close override. Approved by: pjd (mentor)	2015-04-29 22:59:44 +00:00
Mariusz Zaborski	0bb5e6ef80	Remove the nvlist_.*[fv] functions. Those functions are problematic, because there is no way to report memory allocation problems without complicating the API, so we can either abort or potentially return invalid results. None of which is acceptable. In most cases the caller knows the size of the name, so he can allocate buffer on the stack and use snprintf(3) to prepare the name. After some discussion the conclusion is to removed those functions, which also simplifies the API. Discussed with: pjd, rstone Approved by: pjd (mentor)	2015-04-29 22:57:04 +00:00
Mariusz Zaborski	3cfb71c186	Remove recursion from descriptor-related functions. Approved by: pjd (mentor)	2015-04-29 22:15:02 +00:00
Mariusz Zaborski	003e3ea15b	Always use the nv_malloc macro instead of malloc(3). Approved by: pjd (mentor)	2015-04-29 21:54:34 +00:00
Mariusz Zaborski	906289c2ec	Style fixes. Approved by: pjd (mentor)	2015-04-29 21:50:04 +00:00
Edward Tomasz Napierala	4b5c9cf62f	Add kern.racct.enable tunable and RACCT_DISABLED config option. The point of this is to be able to add RACCT (with RACCT_DISABLED) to GENERIC, to avoid having to rebuild the kernel to use rctl(8). Differential Revision: https://reviews.freebsd.org/D2369 Reviewed by: kib@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation	2015-04-29 10:23:02 +00:00
Edward Tomasz Napierala	aa32b5e076	Make setproctitle(3) work in Capsicum capability mode. This makes ctld(8) child processes to indicate initiator address and name in their titles, similar to what iscsid(8) child processes do. PR: 181352 Differential Revision: https://reviews.freebsd.org/D2363 Reviewed by: rwatson@, mjg@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-04-27 11:18:16 +00:00
Konstantin Belousov	9a2c85350a	Partially revert r255986: do not call VOP_FSYNC() when helping bufdaemon in getnewbuf(), do use buf_flush(). The difference is that bufdaemon uses TRYLOCK to get buffer locks, which allows calls to getnewbuf() while another buffer is locked. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-27 11:13:19 +00:00
Konstantin Belousov	f16f8610bb	Fix locking for oshmctl() and shmsys(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-27 11:12:51 +00:00
Mateusz Guzik	8d0a4ab212	fd: plug an always overwritten initialization in fdalloc	2015-04-26 17:27:55 +00:00
Mateusz Guzik	203322f966	Consistently use p instead of td->td_proc in create_thread No functional changes.	2015-04-26 17:22:59 +00:00
Rick Macklem	7cfdc2a7bc	MAXBSIZE defines both the largest UFS block size and the largest size for a buffer in the buffer cache. This patch defines a new constant MAXBCACHEBUF, which is the largest size for a buffer in the buffer cache. Having a separate constant allows MAXBCACHEBUF to be set larger than MAXBSIZE on a per-architecture basis, so that NFS can do larger read/writes for these architectures. It modifies sys/param.h so that BKVASIZE can also be set on a per-architecture basis. A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE is fixed as well. Differential Revision: https://reviews.freebsd.org/D2330 Reviewed by: mav, kib MFC after: 2 weeks	2015-04-25 00:52:01 +00:00
Konstantin Belousov	b9062c934a	Use correct length for sparse uiomove(). It must be the clipped to the page size, len is the total transfer length, which may be larger than zero_region. Reported and tested by: clusteradm (gjb) Sponsored by: The FreeBSD Foundation X-MFC-With: r281442	2015-04-24 22:05:12 +00:00
Mark Johnston	da10a60340	Make vpanic() externally visible so that it can be called as part of the DTrace panic() action. Differential Revision: https://reviews.freebsd.org/D2349 Reviewed by: avg MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division	2015-04-24 03:17:21 +00:00
Konstantin Belousov	fe8a824ca6	Handle incorrect ELF images specifying size for PT_GNU_STACK not being multiple of page size. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2015-04-23 11:27:21 +00:00
Alexander Motin	f743d981f3	Make AIO to not allocate pbufs for unmapped I/O like r281825. While there, make few more performance optimizations. On 40-core system doing many 512-byte AIO reads from array of raw SSDs this change removes lock congestions inside pbuf allocator and devfs, and bottleneck on single AIO completion taskqueue thread. It improves peak AIO performance from ~600K to ~1.3M IOPS. MFC after: 2 weeks	2015-04-22 18:11:34 +00:00
Craig Rodrigues	d9db52256e	Move zlib.c from net to libkern. It is not network-specific code and would be better as part of libkern instead. Move zlib.h and zutil.h from net/ to sys/ Update includes to use sys/zlib.h and sys/zutil.h instead of net/ Submitted by: Steve Kiernan stevek@juniper.net Obtained from: Juniper Networks, Inc. GitHub Pull Request: https://github.com/freebsd/freebsd/pull/28 Relnotes: yes	2015-04-22 14:38:58 +00:00
Craig Rodrigues	d5fec48956	Support file verification in MAC. * Add VCREAT flag to indicate when a new file is being created * Add VVERIFY to indicate verification is required * Both VCREAT and VVERIFY are only passed on the MAC method vnode_check_open and are removed from the accmode after * Add O_VERIFY flag to rtld open of objects * Add 'v' flag to __sflags to set O_VERIFY flag. Submitted by: Steve Kiernan <stevek@juniper.net> Obtained from: Juniper Networks, Inc. GitHub Pull Request: https://github.com/freebsd/freebsd/pull/27 Relnotes: yes	2015-04-22 01:54:25 +00:00
Edward Tomasz Napierala	6289b482ec	Modify kern___getcwd() to take max pathlen limit as an additional argument. This will be used for the Linux emulation layer - for Linux, PATH_MAX is 4096 and not 1024. Differential Revision: https://reviews.freebsd.org/D2335 Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-04-21 13:55:24 +00:00
Alexander Motin	869fd29a7b	Rewrite physio() to not allocate pbufs for unmapped I/O. pbufs is a limited resource, and their allocator is not SMP-scalable. So instead of always allocating pbuf to immediately convert it to bio, allocate bio just here. If buffer needs kernel mapping, then pbuf is still allocated, but used only as a source of KVA and storage for a list of held pages. On 40-core system doing many 512-byte reads from user level to array of raw SSDs this change removes huge lock congestion inside pbuf allocator. It improves peak performance from ~300K to ~1.2M IOPS. On my previous 24-core system this problem also existed, but was less serious. Reviewed by: kib MFC after: 2 weeks	2015-04-21 10:55:53 +00:00
Eric van Gyzen	c207ff9319	Always send log(9) messages to the message buffer. It is truer to the semantics of logging for messages to always go to the message buffer, where they can eventually be collected and, in fact, be put into a log file. This restores the behavior prior to r70239, which seems to have changed it inadvertently. Submitted by: Eric Badger <eric@badgerio.us> Reviewed by: jhb Approved by: kib (mentor) Obtained from: Dell Inc. MFC after: 1 week	2015-04-20 20:03:26 +00:00
Konstantin Belousov	8103a8f608	Regen.	2015-04-18 21:50:53 +00:00
Konstantin Belousov	0538aafc41	The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x kernels which required padding before the off_t parameter. The fcntl(2) contains compatibility code to handle kernels before the struct flock was changed during the 8.x CURRENT development. The shims were reasonable to allow easier revert to the older kernel at that time. Now, two or three major releases later, shims do not serve any purpose. Such old kernels cannot handle current libc, so revert the compatibility code. Make padded syscalls support conditional under the COMPAT6 config option. For COMPAT32, the syscalls were under COMPAT6 already. Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to (partially) disable the removed shims. Reviewed by: jhb, imp (previous versions) Discussed with: peter Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-18 21:50:13 +00:00
Mark Johnston	38563f7c92	Remove unimplemented sched provider probes. They were added for compatibility with the sched provider in Solaris and illumos, but our sched provider is already incompatible since it uses native types, so there isn't much point in keeping them around. Differential Revision: https://reviews.freebsd.org/D2167 Reviewed by: rpaulo	2015-04-18 20:36:58 +00:00
Konstantin Belousov	ad8b1d857d	Initialize td_sel in the thread_init(). Struct thread is not zeroed on the initial allocation, but seltdinit() assumes that td_sel is NULL or a valid pointer. Note that thread_fini()/seltdfini() also relies on this, but correctly resets td_sel to NULL. Submitted by: luke.tw@gmail.com PR: 199518 MFC after: 1 week	2015-04-18 17:21:12 +00:00
Kirk McKusick	f351915514	More accurately collect name-cache statistics in sysctl functions sysctl_debug_hashstat_nchash() and sysctl_debug_hashstat_rawnchash(). These changes are in preparation for allowing changes in the size of the vnode hash tables driven by increases and decreases in the maximum number of vnodes in the system. Reviewed by: kib@ Phabric: D2265	2015-04-18 00:59:03 +00:00
Devin Teske	43d4f8c4c6	Add "GELI Passphrase:" prompt to boot loader. A new loader.conf(5) option of geom_eli_passphrase_prompt="YES" will now allow you to enter your geli(8) root-mount credentials prior to invoking the kernel. See check-password.4th(8) for details. Differential Revision: https://reviews.freebsd.org/D2105 Reviewed by: imp, kmoore Discussed on: -current MFC after: 3 days X-MFC-to: stable/10 Relnotes: yes	2015-04-16 20:53:15 +00:00
Rick Macklem	dda11d4ab9	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks	2015-04-15 20:16:31 +00:00
Neel Natu	1a688aa53e	Fix handling of BUS_PROBE_NOWILDCARD in 'device_probe_child()'. Device probe value of BUS_PROBE_NOWILDCARD should be treated specially only if the device has a fixed devclass. Otherwise it should be interpreted just as if the driver doesn't want to claim the device. Prior to this change a device that was not claimed explicitly by its driver would remain "attached" to the driver that returned BUS_PROBE_NOWILDCARD. This would bump up the reference on 'driver->refs' and its 'dev->ops' would point to the 'driver->ops'. When the driver is subsequently unloaded the 'dev->ops->cls' is left pointing to freed memory. This fixes an easily reproducible #GP fault caused by loading and unloading vmm.ko multiple times. Differential Revision: https://reviews.freebsd.org/D2294 Reviewed by: imp, jhb Discussed with: rstone Reported by: Leon Dang (ldang@nahannisys.com) MFC after: 2 weeks	2015-04-15 16:22:05 +00:00
Edward Tomasz Napierala	1c73bcab8e	Rewrite linprocfs_domtab() as a wrapper around kern_getfsstat(). This adds missing jail and MAC checks. Differential Revision: https://reviews.freebsd.org/D2193 Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-04-15 09:13:11 +00:00
Konstantin Belousov	316b384343	Implement support for binary to requesting specific stack size for the initial thread. It is read by the ELF image activator as the virtual size of the PT_GNU_STACK program header entry, and can be specified by the linker option -z stack-size in newer binutils. The soft RLIMIT_STACK is auto-increased if possible, to satisfy the binary' request. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-15 08:13:53 +00:00
George V. Neville-Neil	63bf240482	When a kernel has DEVICE_POLLING turned on but no drivers have the capability do not try to take the mutex at all. Replaces misbegotten attempt from reverted commit 281276 Pointed out by: glebius Sponsored by: Rubicon Communications (Netgate) Differential Revision: https://reviews.freebsd.org/D2262	2015-04-14 14:22:34 +00:00
Randall Stewart	b132edb56f	Fix my stupid restoral of old code.. must be c_iflags now. Thanks jhb for catching my stupidity... MFC after: 3 days	2015-04-14 00:02:39 +00:00
Randall Stewart	07a2df5d83	Restore the two lines accidentally deleted that allow CALLOUT_DIRECT to be specifed in the flags. Thanks Mark Johnston for noticing this ;-o MFC after: 3 days	2015-04-13 23:06:13 +00:00
Will Andrews	6311d7aaf0	uiomove_object_page(): Avoid instantiating pages in sparse regions on reads. Check whether the page being requested is either resident or on swap. If not, read from the zero_region instead of instantiating an unnecessary page. This avoids consuming memory for sparse files on tmpfs, when they are read by applications that do not use SEEK_HOLE/SEEK_DATA (which is most of them). Reviewed by: kib MFC after: 1 week Sponsored by: Spectra Logic	2015-04-11 18:51:41 +00:00
Mateusz Guzik	2574218578	Replace struct filedesc argument in getsock_cap with struct thread This is is a step towards removal of spurious arguments.	2015-04-11 16:00:33 +00:00
Mateusz Guzik	90f54cbfeb	fd: remove filedesc argument from fdclose Just accept a thread instead. This makes it consistent with fdalloc. No functional changes.	2015-04-11 15:40:28 +00:00
Alexander Motin	cdd09fea28	Add vmem locking to r281026. While races there are not fatal, they cause result underestimation, that cause unneeded ARC reclaims. MFC after: 1 month	2015-04-05 14:17:26 +00:00
Konstantin Belousov	4cfc037c30	Restore proper error from oshmctl(2), used by COMPAT_43, when the segment cannot be found. Broken by r280323. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2015-04-04 23:56:38 +00:00
Jilles Tjoelker	78d75aba77	utimensat: Correct Capsicum required capability rights.	2015-04-04 21:47:54 +00:00
Konstantin Belousov	0122d251bf	Remove useless initialization. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2015-04-04 08:44:20 +00:00
Alexander Motin	2e9ccb32a1	Make ZFS ARC track both KVA usage and fragmentation. Even on Illumos, with its much larger KVA, ZFS ARC steps back if KVA usage reaches certain threshold (3/4 on i386 or 16/17 otherwise). FreeBSD has even less KVA, but had no such limit on archs with direct map as amd64. As result, on machines with a lot of RAM, during load with very small user- space memory pressure, such as `zfs send`, it was possible to reach state, when there is enough both physical RAM and KVA (I've seen up to 25-30%), but no continuous KVA range to allocate even single 128KB I/O request. Address this situation from two sides: - restore KVA usage limitations in a way the most close to Illumos; - introduce new requirement for KVA fragmentation, specifying that we should have at least one sequential KVA range of zfs_max_recordsize bytes. Experiments show that first limitation done alone is not sufficient. On machine with 64GB of RAM it is sometimes needed to drop up to half of ARC size to get at leats one 1MB KVA chunk. Statically limiting ARC to half of KVA/RAM is too strict, so second limitation makes it to work in cycles: accumulate trash up to certain critical mass, do massive spring-cleaning, and then start littering again. :) MFC after: 1 month	2015-04-03 14:45:48 +00:00
Konstantin Belousov	2832cd544f	Speed up symbol lookup for the amd64 kernel modules. Amd64 uses relocatable object files as the modules format. It is good WRT not having unneeded overhead for PIC code, in particular, due to absence of useless GOT and PLT. But the cost is that the module linking process cannot use hash to speed up the symbol lookup, and that each reference to the symbol requiring a relocation, instead of single-place relocation in GOT. Cache the successfull symbol lookup results in the module symbol table, using the newly allocated SHN_FBSD_CACHED value from SHN_LOOS-HIOS range as an indicator. The SHN_FBSD_CACHED together with the non-existent definition of the found symbol are reverted after successfull relocations, which is done under kld_sx lock, so it should not be visible to other consumers of the symbol table. Submitted by: Conrad Meyer Differential Revision: https://reviews.freebsd.org/D1718 MFC after: 3 weeks	2015-04-02 20:14:51 +00:00
Ryan Stone	f2c2231e0c	Fix integer truncation bug in malloc(9) A couple of internal functions used by malloc(9) and uma truncated a size_t down to an int. This could cause any number of issues (e.g. indefinite sleeps, memory corruption) if any kernel subsystem tried to allocate 2GB or more through malloc. zfs would attempt such an allocation when run on a system with 2TB or more of RAM. Note to self: When this is MFCed, sparc64 needs the same fix. Differential revision: https://reviews.freebsd.org/D2106 Reviewed by: kib Reported by: Michael Fuckner <michael@fuckner.net> Tested by: Michael Fuckner <michael@fuckner.net> MFC after: 2 weeks	2015-04-01 12:42:26 +00:00
Randall Stewart	403df7a672	Adopt jhb's suggested changes, updated comments and callout_migration() moving to kern/kern_timeout.c This does not address his -1 -> NOCPU comment. Sponsored by: Netflix Inc.	2015-03-31 00:18:00 +00:00
Gleb Smirnoff	f6d6b5e262	Catch up on r271387 and remove unused parameter from VOP_GETPAGES_ASYNC().	2015-03-30 22:49:26 +00:00
Alexander Motin	43329ffcc8	Periodically wake up threads waiting for vmem(9) resources, so they could ask for resource reclamation again. This is kind of dirty hack, but as last resort this is better then stuck indefinitely because of KVA fragmentation, waiting until some random event free something sufficient. OpenSolaris also has this hack in its vmem(9). MFC after: 2 weeks	2015-03-30 13:30:53 +00:00
Alexander Motin	b308aaed27	Add four new DDB commands to display vmem(9) statistics. In particular, such DDB commands were added: show vmem <addr> show all vmem show vmemdump <addr> show all vmemdump As possible usage, that allows to see KVA usage and fragmentation.	2015-03-29 10:02:29 +00:00
Konstantin Belousov	eeb697c8e9	Make debug.vmem_check a tunable. It is useful to set it early. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-03-28 23:30:51 +00:00
Eric van Gyzen	e858b027db	Clean up some cosmetic nits in kern_umtx.c, found during recent work in this area and by the Clang static analyzer. Remove some dead assignments. Fix a typo in a panic string. Use umtx_pi_disown() instead of duplicate code. Use an existing variable instead of curthread. Approved by: kib (mentor) MFC after: 3 days Sponsored by: Dell Inc	2015-03-28 21:21:40 +00:00
Bjoern A. Zeeb	a04d412295	Try to unbreak !SMP kernels broken in r280785 by using the proper macros to access cc_cpu.	2015-03-28 15:07:19 +00:00
Randall Stewart	15b1eb142c	Change the callout to supply -1 to indicate we are not changing CPU, also add protection against invalid CPU's as well as split c_flags and c_iflags so that if a user plays with the active flag (the one expected to be played with by callers in MPSAFE) without a lock, it won't adversely affect the callout system by causing a corrupt list. This also means that all callers need to use the macros and not play with the falgs directly (like netgraph used to). Differential Revision: htts://reviews.freebsd.org/D1894 Reviewed by: .. timed out but looked at by jhb, imp, adrian hselasky tested by hiren and netflix. Sponsored by: Netflix Inc.	2015-03-28 12:50:24 +00:00
Hans Petter Selasky	38668c6044	Implement a simple OID number garbage collector. Given the increasing number of dynamically created and destroyed SYSCTLs during runtime it is very likely that the current new OID number limit of 0x7fffffff can be reached. Especially if dynamic OID creation and destruction results from automatic tests. Additional changes: - Optimize the typical use case by decrementing the next automatic OID sequence number instead of incrementing it. This saves searching time when inserting new OIDs into a fresh parent OID node. - Add simple check for duplicate non-automatic OID numbers. MFC after: 1 week	2015-03-25 08:55:34 +00:00
Hans Petter Selasky	502702c644	Make sure tunable sysctls are only fetched once. The existing code can re-register sysctls when destroying sysctl contexts or when moving sysctls from one tree to another.	2015-03-24 17:42:53 +00:00
Gleb Smirnoff	a2d4a7e456	Do not include if_var.h and in6_var.h into kern_jail.c. It is now possible after r280444. Sponsored by: Nginx, Inc.	2015-03-24 16:46:40 +00:00
Hans Petter Selasky	ab91c9a743	Correct string pointer offset for error printout.	2015-03-24 16:37:19 +00:00
Rui Paulo	0da9e11b7e	Disable coredump_devctl because it could lead to leaking paths to jails.	2015-03-24 02:17:17 +00:00
Mateusz Guzik	ea926658ff	filedesc: microoptimize fget_unlocked by getting rid of fd < 0 branch Casting fd to an unsigned type simplifies fd range coparison to mere checking if the result is bigger than the table.	2015-03-24 00:10:11 +00:00
Ian Lepore	296f235de0	The sysctls that return process argv and envv return binary data, so clear the SBUF_INCLUDENUL flag. Pointed out by: tijl@	2015-03-22 21:18:44 +00:00
Hans Petter Selasky	2793ea13aa	Fix for out of order device destruction notifications when using the delist_dev() function. In addition to this change: - add a proper description of this function - add a proper witness assert inside this function - switch a nearby line to use the "cdp" pointer instead of cdev2priv() MFC after: 3 days	2015-03-22 13:11:56 +00:00
Mateusz Guzik	f97af9706b	proc: use MTX_NEW flag in proc_init This allows us to get rid of bzero which was added specifically to make mtx_init on p_mtx reliable. This also fixes a potential problem where mtx_init on other mutexes could trip over on unitialized memory and fire an assertion. Reviewed by: kib	2015-03-21 20:25:34 +00:00
Mateusz Guzik	ffb34484ee	cred: add proc_set_cred_init helper proc_set_cred_init can be used to set first credentials of a new process. Update proc_set_cred assertions so that it only expects already used processes. This fixes panics where p_ucred of a new process happens to be non-NULL. Reviewed by: kib	2015-03-21 20:24:54 +00:00
Mateusz Guzik	12cec311e6	fork: assign refed credentials earlier Prior to this change the kernel would take p1's credentials and assign them tempororarily to p2. But p1 could change credentials at that time and in effect give us a use-after-free. No objections from: kib	2015-03-21 20:24:03 +00:00
Alan Cox	3d653db063	Introduce vm_object_color() and use it in mmap(2) to set the color of named objects to zero before the virtual address is selected. Previously, the color setting was delayed until after the virtual address was selected. In rtld, this delay effectively prevented the mapping of a shared library's code section using superpages. Now, for example, we see the first 1 MB of libc's code on armv6 mapped by a superpage after we've gotten through the initial cold misses that bring the first 1 MB of code into memory. (With the page clustering that we perform on read faults, this happens quickly.) Differential Revision: https://reviews.freebsd.org/D2013 Reviewed by: jhb, kib Tested by: Svatopluk Kraus (armv6) MFC after: 6 weeks	2015-03-21 17:56:55 +00:00
Olivier Houchard	d8d2f47629	error is only used if MAC is defined, so make its declaration conditional as well.	2015-03-21 16:16:17 +00:00
Konstantin Belousov	0555fb3523	Somewhat modernize the SysV shm code: - Use real locking, replace Giant with global sx protecting the subsystem. Since the subsystem' lock is no longer dropped during the sleepsk, remove not needed SHMSEG_WANTED segment flag, and revert r278963. - To do proper code simplification possible after the change of the lock, restructure several functions into _locked body and originally-named wrapper which calls into _locked variant. This allows to eliminate the 'goto done2' spread over the code. - Merge shm_find_segment_by_shmid() and shm_find_segment_by_shmidx(). - Consistently change all function prototypes to ANSI C. Reviewed by: mjg (who has earlier version of the similar patch to introduce real locking) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-03-21 15:01:19 +00:00
Mateusz Guzik	5bc0ff888a	coredump: protect corefilename access with a lock Previously format string traversal could happen while the string itself was being modified. Use allproc_lock as coredumping is a rare operation and as such we don't have to create a dedicated lock. Submitted by: Tiwei Bie <btw mail.ustc.edu.cn> Reviewed by: kib X-Additional: JuniorJobs project	2015-03-21 04:39:33 +00:00
Ian Lepore	612d9391a4	The minimum sbuf buffer size is 2 bytes (a byte plus a nulterm), assert that. Values smaller than two lead to strange asserts that have nothing to do with the actual problem (in the case of size=0), or to writing beyond the end of the allocated buffer in sbuf_finish() (in the case of size=1).	2015-03-17 21:00:31 +00:00
Ian Lepore	2834924513	In sbuf_new_for_sysctl(), default the buffer size to 64 bytes if the passed-in pointer is NULL and the length is zero.	2015-03-17 20:56:24 +00:00
Gleb Smirnoff	8c4df6296b	Reduce header pollution.	2015-03-17 14:16:50 +00:00
Benno Rice	43348dc2ad	Reset bp->bio_done to unmapped_buf when removing a transient map in biodone. Submitted by: Scott Ferris <scott.ferris@isilon.com> Sponsored by: EMC / Isilon Storage Division Reviewed by: kib	2015-03-16 20:00:09 +00:00

... 3 4 5 6 7 ...

14597 Commits