freebsd-skq

Author	SHA1	Message	Date
Edward Tomasz Napierala	57a73b26e0	Mark vgonel() as static. It was already declared static earlier; no idea why compilers don't warn about this. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-08-04 08:51:56 +00:00
Ed Schouten	dc4b532479	Fix bad arithmetic in umtx_key_get() to compute object offset. It looks like umtx_key_get() has the addition and subtraction the wrong way around, meaning that it fails to match in certain cases. This causes the cloudlibc unit tests to deadlock in certain cases. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3287	2015-08-04 06:01:13 +00:00
Ed Schouten	52942c1eae	Add missing const keyword to function parameter. The umtx_key_get() function does not dereference the address of the userspace object. The pointer can safely be const.	2015-08-03 21:11:33 +00:00
John Baldwin	92de34df2c	kgdb uses td_oncpu to determine if a thread is running and should use a pcb from stoppcbs[] rather than the thread's PCB. However, exited threads retained td_oncpu from the last time they ran, and newborn threads had their CPU fields cleared to zero during fork and thread creation since they are in the set of fields zeroed when threads are set up. To fix, explicitly update the CPU fields for exiting threads in sched_throw() to reflect the switch out and reset the CPU fields for new threads in sched_fork_thread() to NOCPU. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3193	2015-08-03 20:43:36 +00:00
Ed Schouten	39f5ebb774	Add sysent flag to switch to capabilities mode on startup. CloudABI processes should run in capabilities mode automatically. There is no need to switch manually (e.g., by calling cap_enter()). Add a flag, SV_CAPSICUM, that can be used to call into cap_enter() during execve(). Reviewed by: kib	2015-08-03 13:41:47 +00:00
Mark Johnston	ce1c953ee0	Don't modify curthread->td_locks unless INVARIANTS is enabled. This field is only used in a KASSERT that verifies that no locks are held when returning to user mode. Moreover, the td_locks accounting is only correct when LOCK_DEBUG > 0, which is implied by INVARIANTS. Reviewed by: jhb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3205	2015-08-02 00:03:08 +00:00
John Baldwin	98685dc8af	Clear P_TRACED before reparenting a detached process back to its original parent. Otherwise the debuggee will be set as an orphan of the debugger. Add tests for tracing forks via PT_FOLLOW_FORK. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D2809	2015-08-01 16:27:52 +00:00
Ed Schouten	7ee1b208c3	Add kern_shm_open(). This allows you to specify the capabilities that the new file descriptor should have. This allows us to create shared memory objects that only have the rights we're interested in. The idea behind restricting the rights is that it makes it a lot easier for CloudABI to get consistent behaviour across different operating systems. We only need to make sure that a shared memory implementation consistently implements the operations that are whitelisted. Approved by: kib Obtained from: https://github.com/NuxiNL/freebsd	2015-08-01 07:21:14 +00:00
Ed Schouten	6236e71bfe	Fix accidental line wrapping introduced in r286122.	2015-07-31 10:46:45 +00:00
Ed Schouten	367a13f905	Limit rights on process descriptors. On CloudABI, the rights bits returned by cap_rights_get() match up with the operations that you can actually perform on the file descriptor. Limiting the rights is good, because it makes it easier to get uniform behaviour across different operating systems. If process descriptors on FreeBSD would suddenly gain support for any new file operation, this wouldn't become exposed to CloudABI processes without first extending the rights. Extend fork1() to gain a 'struct filecaps' argument that allows you to construct process descriptors with custom rights. Use this in cloudabi_sys_proc_fork() to limit the rights to just fstat() and pdwait(). Obtained from: https://github.com/NuxiNL/freebsd	2015-07-31 10:21:58 +00:00
Konstantin Belousov	8917728875	vn_io_fault() handling of the LOR for i/o into the file-backed buffers has observable overhead when the buffer pages are not resident or not mapped. The overhead comes from at least two factors. One is the additional work needed to detect the situation, and to prepare and execute the rollbacks. Another is the consequence of the i/o splitting into the batches of the held pages, causing filesystems to see a series of smaller i/o requests instead of a single large request. Note that the expected case of the resident i/o buffer does not expose these issues. Provide prefaulting for the userspace i/o buffers, disabled by default. I am careful not to enable prefaulting by default for now, since it would be detrimental for applications which speculatively pass extra-large buffers of anonymous memory to avoid dealing with buffer sizing (if such apps exist). Found and tested by: bde, emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-31 04:12:51 +00:00
Mateusz Guzik	4ae1e3c752	Revert r285125 until rmlocks get fixed. Right now there is a chance that sysctl unregister will cause reader to block on the sx lock associated with sysctl rmlock, in which case kernels with debug enabled will panic.	2015-07-30 19:52:43 +00:00
Roger Pau Monné	c023d8234b	vfs: fix fallout from r286076 The right operator is >= not =>. Reported by: cem	2015-07-30 15:43:26 +00:00
Roger Pau Monné	8f89a299e2	vfs: fix off-by-one error in vfs_buf_check_mapped The check added in r285872 can trigger for valid buffers if the buffer space used happens to be just after unmapped_buf in KVA space. Discussed with: kib Sponsored by: Citrix Systems R&D	2015-07-30 15:28:06 +00:00
Ed Schouten	8328babdd0	Make pipes in CloudABI work. Summary: Pipes in CloudABI are unidirectional. The reason for this is that CloudABI attempts to provide a uniform runtime environment across different flavours of UNIX. Instead of implementing a custom pipe that is unidirectional, we can simply reuse Capsicum permission bits to support this. This is nice, because CloudABI already attempts to restrict permission bits to correspond with the operations that apply to a certain file descriptor. Replace kern_pipe() and kern_pipe2() by a single kern_pipe() that takes a pair of filecaps. These filecaps are passed to the newly introduced falloc_caps() function that creates the descriptors with rights in place. Test Plan: CloudABI pipes seem to be created with proper rights in place: https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/unistd/pipe_test.c#L44 Reviewers: jilles, mjg Reviewed By: mjg Subscribers: imp Differential Revision: https://reviews.freebsd.org/D3236	2015-07-29 17:18:27 +00:00
Ed Schouten	e555b4309c	Introduce falloc_caps() to create descriptors with capabilities in place. falloc_noinstall() followed by finstall() allows you to create and install file descriptors with custom capabilities. Add falloc_caps() that can do both of these actions in one go. This will be used by CloudABI to create pipes with custom capabilities. Reviewed by: mjg	2015-07-29 17:16:53 +00:00
Konstantin Belousov	6cebf7e2be	Move bufshutdown() out of the #ifdef INVARIANTS block.	2015-07-29 09:57:34 +00:00
Jeff Roberson	98082691bb	- Make 'struct buf *buf' private to vfs_bio.c. Having a global variable 'buf' is inconvenient and has led me to some irritating to discover bugs over the years. It also makes it more challenging to refactor the buf allocation system. - Move swbuf and declare it as an extern in vfs_bio.c. This is still not perfect but better than it was before. - Eliminate the unused ffs function that relied on knowledge of the buf array. - Move the shutdown code that iterates over the buf array into vfs_bio.c. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-07-29 02:26:57 +00:00
Jeff Roberson	38750ada8f	- Eliminate the EMPTYKVA queue. It served as a cache of KVA allocations attached to bufs to avoid the overhead of the vm. This purpose is now better served by vmem. Freeing the kva immediately when a buf is destroyed leads to lower fragmentation and a much simpler scan algorithm. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-07-28 20:24:09 +00:00
Ed Schouten	b114aa7959	Make shutdown() return ENOTCONN as required by POSIX, part deux. Summary: Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right. Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries. I took a look at the XNU sources and they seem to test against all of SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seems reasonable, so let's do the same. Test Plan: This issue was uncovered while writing tests for shutdown() in CloudABI: https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26 Reviewers: glebius, rwatson, #manpages, gnn, #network Reviewed By: gnn, #network Subscribers: bms, mjg, imp Differential Revision: https://reviews.freebsd.org/D3039	2015-07-27 13:17:57 +00:00
Andrey V. Elsukov	41f5f69f96	Build debug version of rmlock's methods only when LOCK_DEBUG > 0. Currently LOCK_DEBUG is always defined in sys/lock.h (0 or 1). This means that debugging code always built. In addition the kernel modules have always defined LOCK_DEBUG as 1. So, debugging rmlock code is always used by kernel modules. MFC after: 1 week	2015-07-26 10:53:32 +00:00
Konstantin Belousov	6fd04eff66	With the removal of b_saveaddr in the r285819, b_data must be reset to b_kvabase when the buffer is reclaimed. Otherwise, if b_data for the mapped buffer was adjusted with the page-offset portion of b_offset, nothing would re-adjust the b_data, which breaks buffer management code which expects page-aligned b_data (see e.g. bpman_qenter(), which skips partial pages). Fix a minor issue with the GB_KVAALLOC requests, which could result in returning the mapped buffer if the reused buffer is mapped and have the right amount of KVA reserved. Improve assertion in the vfs_buf_check_mapped() to catch unmapped buffers which have their b_data incorrectly adjusted with offset. Reported and tested by: pho (previous version) Reviewed by: jeff (previous version) Sponsored by: The FreeBSD Foundation	2015-07-25 15:00:14 +00:00
Xin LI	1a7c14aec7	Fix a typo in comment. Submitted by: Yanhui Shen via twitter MFC after: 3 days	2015-07-24 22:13:39 +00:00
Marius Strobl	7815d3948c	o Revert the other functional half of r239864, i. e. the merge of r134227 from x86 to use smp_ipi_mtx spin lock not only for smp_rendezvous_cpus() but also for the MD cache invalidation, TLB demapping and remote register reading IPIs due to the following reasons: - The cross-IPI SMP deadlock x86 otherwise is subject to can't happen on sparc64. That's because on sparc64, spin locks don't disable interrupts completely but only raise the processor interrupt level to PIL_TICK. This means that IPIs still get delivered and direct dispatch IPIs such as the cache invalidation etc. IPIs in question are still executed. - In smp_rendezvous_cpus(), smp_ipi_mtx is held not only while sending an IPI_RENDEZVOUS, but until all CPUs have processed smp_rendezvous_action(). Consequently, smp_ipi_mtx may be locked for an extended amount of time as queued IPIs (as opposed to the direct ones) such as IPI_RENDEZVOUS are scheduled via a soft interrupt. Moreover, given that this soft interrupt is only delivered at PIL_RENDEZVOUS, processing of smp_rendezvous_action() on a target may be interrupted by f. e. a tick interrupt at PIL_TICK, in turn leading to the target in question trying to send an IPI by itself while IPI_RENDEZVOUS isn't fully handled, yet, and, thus, resulting in a deadlock. o As mentioned in the commit message of r245850, on at least some sun4u platforms concurrent sending of IPIs by different CPUs is fatal. Therefore, hold the reintroduced MD ipi_mtx also while delivering cross-traps via MI helpers, i. e. ipi_{all_but_self,cpu,selected}(). o Akin to x86, let the last CPU to process cpu_mp_bootstrap() set smp_started instead of the BSP in cpu_mp_unleash(). This ensures that all APs actually are started, when smp_started is no longer 0. o In all MD and MI IPI helpers, check for smp_started == 1 rather than for smp_cpus > 1 or nothing at all. This avoids races during boot causing IPIs trying to be delivered to APs that in fact aren't up and running, yet. While at it, move setting of the cpu_ipi_{selected,single}() pointers to the appropriate delivery functions from mp_init() to cpu_mp_start() where it's better suited and allows to get rid of the global isjbus variable. o Given that now concurrent IPI delivery no longer is possible, also nuke the delays before completely disabling interrupts again in the CPU-specific cross-trap delivery functions, previously giving other CPUs a window for sending IPIs on their part. Actually, we now should be able to entirely get rid of completely disabling interrupts in these functions. Such a change needs more testing, though. o In {s,}tick_get_timecount_mp(), make the {s,}tick variable static. While not necessary for correctness, this avoids page faults when accessing the stack of a foreign CPU as {s,}tick now is locked into the TLBs as part of static kernel data. Hence, {s,}tick_get_timecount_mp() always execute as fast as possible, avoiding jitter. PR: 201245 MFC after: 3 days	2015-07-24 15:13:21 +00:00
Sergey Kandaurov	ef88ae77ea	Call ksem_get() with initialized 'rights'. ksem_get() consumes fget(), and it's mandatory there. Reported by: truckman Reviewed by: mjg	2015-07-23 23:18:03 +00:00
Jeff Roberson	fade8dd714	Refactor unmapped buffer address handling. - Use pointer assignment rather than a combination of pointers and flags to switch buffers between unmapped and mapped. This eliminates multiple flags and generally simplifies the logic. - Eliminate b_saveaddr since it is only used with pager bufs which have their b_data re-initialized on each allocation. - Gather up some convenience routines in the buffer cache for manipulating buf space and buf malloc space. - Add an inline, buf_mapped(), to standardize checks around unmapped buffers. In collaboration with: mlaier Reviewed by: kib Tested by: pho (many small revisions ago) Sponsored by: EMC / Isilon Storage Division	2015-07-23 19:13:41 +00:00
Jeff Roberson	1c1ddc0351	- Don't defeat the FIFO nature of the buffer cache by eliminating the most recently used buffer when we are under paging pressure. This is a perversion of the buffer and page replacement algorithms and recent improvements to the page daemon have rendered it unnecessary. In the event that low-memory deadlocks become an issue it would be possible to make a daemon or event handler that performs a similar action on the oldest buffers rather than the newest. Since the buf cache is analogous to the page cache and some minimum working set is desired another possibility is to simply shrink the minimum working set which has less downside now that file pages are not directly mapped. Sponsored by: EMC / Isilon Reviewed by: alc, kib (with some minor objection) Tested by: pho	2015-07-23 02:20:41 +00:00
Konstantin Belousov	e637a6e3f9	The smp_rendezvous_cpus() function should ensure that all accesses done by the functions called on other CPUs, are visible to the caller. Pair otherwise useless acquire on smp_rv_waiters[3] with a release add to ensure synchronized with relation, which guarantees visibility. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-07-21 22:56:46 +00:00
Konstantin Belousov	01f5e0866b	The part of r285680 which removed release semantic for two stores to it_need was wrong [*]. Restore the releases and add a comment explaining why it is needed. Noted by: alc [*] Reviewed by: bde [*] Sponsored by: The FreeBSD Foundation	2015-07-21 14:39:34 +00:00
Sergey Kandaurov	94df6fad1d	Fix sb_state constant names as used e.g. to display in DDB ``show sockbuf''. MFC after: 1 week	2015-07-21 09:57:13 +00:00
Ed Schouten	5a170c1b0e	Add an API for easily creating userspace threads in kernelspace. This change refactors the existing create_thread() function to be more generic. It replaces almost all of its arguments by a callback that can be used to extract the thread ID and copy it out to the right place, but also to perform additional initialization steps, such as setting the trapframe. This also makes the difference between thr_new() and thr_create() more clear in my opinion. This function is going to be used by the CloudABI compatibility layer. It looks like the OpenSolaris compatibility framework already provides a function called thread_create(). Rename this function to do_thread_create() and use a macro to deal with the namespacing conflict. A similar approach is already used for thread_exit(). MFC after: 1 month	2015-07-20 10:20:04 +00:00
Alexander Motin	d3e2e28e74	Fix typo in comment. Submitted by: Masao Uebayashi	2015-07-20 09:37:42 +00:00
Mark Johnston	97cc6870f6	Don't increment the spin count until after the first attempt to acquire a rwlock read lock. Otherwise the lockstat:::rw-spin probe will fire spuriously. MFC after: 1 week	2015-07-19 22:26:02 +00:00
Kirk McKusick	1b79b9498b	Restructure code for readability improvement. No functional change. Reviewed by: kib	2015-07-19 22:25:16 +00:00
Mark Johnston	de2c95cc00	Consistently use a reader/writer flag for lockstat probes in rwlock(9) and sx(9), rather than using the probe function name to determine whether a given lock is a read lock or a write lock. Update lockstat(1) accordingly.	2015-07-19 22:24:33 +00:00
Mark Johnston	32cd0147fa	Implement the lockstat provider using SDT(9) instead of the custom provider in lockstat.ko. This means that lockstat probes now have typed arguments and will utilize SDT probe hot-patching support when it arrives. Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D2993	2015-07-19 22:14:09 +00:00
Marcelo Araujo	f19e47d691	Add support to the jail framework to be able to mount linsysfs(5) and linprocfs(5). Differential Revision: D2846 Submitted by: Nikolai Lifanov <lifanov@mail.lifanov.com> Reviewed by: jamie	2015-07-19 08:52:35 +00:00
Konstantin Belousov	283dfee925	Further cleanup after r285607. Remove useless release semantic for some stores to it_need. For stores where the release is needed, add a comment explaining why. Fence after the atomic_cmpset() op on the it_need should be acquire only, release is not needed (see above). The combination of atomic_cmpset() + fence_acq() is better expressed there as atomic_cmpset_acq(). Use atomic_cmpset() for swi' ih_need read and clear. Discussed with: alc, bde Reviewed by: bde Comments wording provided by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-18 19:59:29 +00:00
Konstantin Belousov	b4490c6e93	The si_status field of the siginfo_t, provided by the waitid(2) and SIGCHLD signal, should keep full 32 bits of the status passed to the _exit(2). Split the combined p_xstat of the struct proc into the separate exit status p_xexit for normal process exit, and signalled termination information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs old p_xstat from p_xexit and p_xsig. p_xexit contains complete status and copied out into si_status. Requested by: Joerg Schilling Reviewed by: jilles (previous version), pho Tested by: pho Sponsored by: The FreeBSD Foundation	2015-07-18 09:02:50 +00:00
Mark Johnston	c6d48c8752	Fix the !KDTRACE_HOOKS build. X-MFC-With: r285664	2015-07-18 04:38:11 +00:00
Mark Johnston	e2b25737ee	Pass the lock object to lockstat_nsecs() and return immediately if LO_NOPROFILE is set. Some timecounter handlers acquire a spin mutex, and we don't want to recurse if lockstat probes are enabled. PR: 201642 Reviewed by: avg MFC after: 3 days	2015-07-18 00:57:30 +00:00
Mark Johnston	efe8b26b82	Modify lockstat_nsecs() to just return unless lockstat probes are actually enabled. The cost of a timecounter read can be quite significant, and the problem became more apparent after r284297, since that change resulted in a call to lockstat_nsecs() for each acquisition of an rwlock read lock. PR: 201642 Reviewed by: avg Tested by: Jason Unovitch MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D3073	2015-07-18 00:22:00 +00:00
Ed Schouten	fd054c2df9	Undo r285656. It turns out that the CDDL sources already introduce a function called thread_create(). I'll investigate what we can do to make these functions coexist. Reported by: Ivan Klymenko	2015-07-17 22:26:45 +00:00
Ed Schouten	82a3d2cbfc	Add an API for easily creating userspace threads in kernelspace. This change refactors the existing create_thread() function to be more generic. It replaces almost all of its arguments by a callback that can be used to extract the thread ID and copy it out to the right place, but also to perform additional initialization steps, such as setting the trapframe. This also makes the difference between thr_new() and thr_create() more clear in my opinion. This function is going to be used by the CloudABI compatibility layer. Reviewed by: kib MFC after: 1 month	2015-07-17 16:34:01 +00:00
Mateusz Guzik	2919a0c5c1	fd: partially deduplicate fdescfree and fdescfree_remapped This also moves vrele of cdir/rdir/jdir vnodes earlier, which should not matter.	2015-07-16 15:26:37 +00:00
Mateusz Guzik	cd672ca60f	Get rid of lim_update_thread and cred_update_thread. Their primary use was in thread_cow_update to free up old resources. Freeing had to be done with proc lock held and _cow_ funcs already knew how to free old structs.	2015-07-16 14:30:11 +00:00
Mateusz Guzik	752fc07d33	vfs: implement v_holdcnt/v_usecount manipulation using atomic ops Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free list) of either counter are still guarded with vnode interlock. Reviewed by: kib (earlier version) Tested by: pho	2015-07-16 13:57:05 +00:00
Ed Schouten	457f7e23b1	Implement CloudABI's exec() call. Summary: In a runtime that is purely based on capability-based security, there is a strong emphasis on how programs start their execution. We need to make sure that we execute a new program with an exact set of file descriptors, ensuring that credentials are not leaked into the process accidentally. Providing the right file descriptors is just half the problem. There also needs to be a framework in place that gives meaning to these file descriptors. How does a CloudABI mail server know which of the file descriptors corresponds to the socket that receives incoming emails? Furthermore, how will this mail server acquire its configuration parameters, as it cannot open a configuration file from a global path on disk? CloudABI solves this problem by replacing traditional string command line arguments by a tree-like data structure consisting of scalars, sequences and mappings (similar to YAML/JSON). In this structure, file descriptors are treated as a first-class citizen. When calling exec(), file descriptors are passed on to the new executable if and only if they are referenced from this tree structure. See the cloudabi-run(1) man page for more details and examples (sysutils/cloudabi-utils). Fortunately, the kernel does not need to care about this tree structure at all. The C library is responsible for serializing and deserializing, but also for extracting the list of referenced file descriptors. The system call only receives a copy of the serialized data and a layout of what the new file descriptor table should look like: int proc_exec(int execfd, const void *data, size_t datalen, const int *fds, size_t fdslen); This change introduces a set of fd*_remapped() functions: - fdcopy_remapped() pulls a copy of a file descriptor table, remapping all of the file descriptors according to the provided mapping table. - fdinstall_remapped() replaces the file descriptor table of the process by the copy created by fdcopy_remapped(). - fdescfree_remapped() frees the table in case we aborted before fdinstall_remapped(). We then add a function exec_copyin_data_fds() that builds on top of these functions. It copies in the data and constructs a new remapped file descriptor. This is used by cloudabi_sys_proc_exec(). Test Plan: cloudabi-run(1) is capable of spawning processes successfully, providing it data and file descriptors. procstat -f seems to confirm all is good. Regular FreeBSD processes also work properly. Reviewers: kib, mjg Reviewed By: mjg Subscribers: imp Differential Revision: https://reviews.freebsd.org/D3079	2015-07-16 07:05:42 +00:00
Konstantin Belousov	70a3efc14f	Do not use atomic_swap_int(9), it is not available on all architectures. Atomic_cmpset_int(9) is a direct replacement, due to loop. The change fixes arm, arm64, mips and sparc64, which lack atomic_swap(). Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-15 21:44:16 +00:00
Konstantin Belousov	615b6ea2c8	Reset non-zero it_need indicator to zero atomically with fetching its current value. It is believed that the change is the real fix for the issue which was covered over by the r252683. With the current code, if the interrupt handler sets it_need between read and consequent reset, the update could be lost and ithread_execute_handlers() would not be called in response to the lost update. The r252683 could have hidden the issue since at the moment of commit, atomic_load_acq_int() did locked cmpxchg on the variable, which puts the cache line into the exclusive owned state and clears store buffers. Then the immediate store of zero has very high chance of reusing the exclusive state of the cache line and make the load and store sequence operate as atomic swap. For now, add the acq+rel fence immediately after the swap, to not disturb current (but excessive) ordering. Acquire is needed for the ih_need reads after the load, while release does not serve a useful purpose [*]. Reviewed by: alc Noted by: alc [*] Discussed with: bde Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-15 17:36:35 +00:00
Konstantin Belousov	03bbcb2f0c	Style. Remove excessive brackets. Compare non-boolean with zero. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-15 17:14:05 +00:00
Mark Johnston	02d131ad11	Fix some error-handling bugs when core dump compression is enabled: - Ensure that core dump parameters are initialized in the error path. - Don't call gzio_fini() on a NULL stream. Reported by: rpaulo	2015-07-14 18:24:05 +00:00
Conrad Meyer	0c40f3532d	Fix cleanup race between unp_dispose and unp_gc unp_dispose and unp_gc could race to teardown the same mbuf chains, which can lead to dereferencing freed filedesc pointers. This patch adds an IGNORE_RIGHTS flag on unpcbs marking the unpcb's RIGHTS as invalid/freed. The flag is protected by UNP_LIST_LOCK. To serialize against unp_gc, unp_dispose needs the socket object. Change the dom_dispose() KPI to take a socket object instead of an mbuf chain directly. PR: 194264 Differential Revision: https://reviews.freebsd.org/D3044 Reviewed by: mjg (earlier version) Approved by: markj (mentor) Obtained from: mjg MFC after: 1 month Sponsored by: EMC / Isilon Storage Division	2015-07-14 02:00:50 +00:00
Mateusz Guzik	6161705823	exec: textvp -> oldtextvp; binvp -> newtextvp This makes it consistent with the rest of the naming in do_execve. No functional changes.	2015-07-14 01:13:37 +00:00
Mateusz Guzik	853be5ffef	exec plug a redundant vref + vrele of the image vnode	2015-07-14 00:43:08 +00:00
Mateusz Guzik	e94e50af1d	racct: perform a lockless check for p_throttled This reduces proc lock contention. Reviewed by: trasz	2015-07-13 22:52:11 +00:00
Conrad Meyer	c578e0fb48	pipe_direct_write: Fix mismatched pipelock/unlock If a signal is caught in pipelock, causing it to fail, pipe_direct_write should not try to pipeunlock. Reported by: pho Differential Revision: https://reviews.freebsd.org/D3069 Reviewed by: kib Approved by: markj (mentor) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-07-13 17:45:22 +00:00
Ian Lepore	969fc29e0b	Use the monotonic (uptime) counter rather than time-of-day to measure elapsed time between ntp_adjtime() clock offset adjustments. This eliminates spurious frequency steering after a large clock step (such as a 1970->2015 step on a system with no battery-backed clock hardware). This problem was discovered after the import of ntpd 4.2.8, which does things in a slightly different (but still correct) order than the 4.2.4 we had previously. In particular, 4.2.4 would step the clock then immediately after use ntp_adjtime() to set the frequency and offset to zero, which captured the post-step time-of-day as a side effect. In 4.2.8, ntpd sets frequency and offset to zero before any initial clock step, capturing the time as 1970-ish, then when it next calls ntp_adjtime() it's with a non-zero offset measurement. This non-zero value gets multiplied by the apparent 45-year interval, which blows up into a completely bogus frequency steer. That gets clamped to 500ppm, but that's still enough to make the clock drift so fast that ntpd has to keep stepping it every few minutes to compensate.	2015-07-12 18:38:17 +00:00
Bjoern A. Zeeb	97fc027722	Try to unbreak the build after r285390 removing the obsolete static declaration.	2015-07-12 00:26:22 +00:00
Mateusz Guzik	c634b75204	vfs: always clear VI_OWEINACT in consumers bumping v_usecount Previously vputx would detect the condition and clear the flag. With this change it is invalid to have both v_usecount > 0 and the flag set. Assert the condition is met in all relevant places. Reviewed by: kib	2015-07-11 16:28:55 +00:00
Mateusz Guzik	2d1ca3cdff	vfs: move si_usecount manipulation to dedicated functions Reviewed by: kib	2015-07-11 16:28:12 +00:00
Mateusz Guzik	8a08cec166	Create a dedicated function for ensuring that cdir and rdir are populated. Previously several places were doing it on its own, partially incorrectly (e.g. without the filedesc locked) or even actively harmful by populating jdir or assigning rootvnode without vrefing it. Reviewed by: kib	2015-07-11 16:22:48 +00:00
Mateusz Guzik	f0725a8e1e	Move chdir/chroot-related fdp manipulation to kern_descrip.c Prefix exported functions with pwd_. Deduplicate some code by adding a helper for setting fd_cdir. Reviewed by: kib	2015-07-11 16:19:11 +00:00
Adrian Chadd	871ef8b0d8	Regenerate syscalls.	2015-07-11 15:22:11 +00:00
Adrian Chadd	6520495abc	Add an initial NUMA affinity/policy configuration for threads and processes. This is based on work done by jeff@ and jhb@, as well as the numa.diff patch that has been circulating when someone asks for first-touch NUMA on -10 or -11. * Introduce a simple set of VM policy and iterator types. * tie the policy types into the vm_phys path for now, mirroring how the initial first-touch allocation work was enabled. * add syscalls to control changing thread and process defaults. * add a global NUMA VM domain policy. * implement a simple cascade policy order - if a thread policy exists, use it; if a process policy exists, use it; use the default policy. * processes inherit policies from their parent processes, threads inherit policies from their parent threads. * add a simple tool (numactl) to query and modify default thread/process policies. * add documentation for the new syscalls, for numa and for numactl. * re-enable first touch NUMA again by default, as now policies can be set in a variety of methods. This is only relevant for very specific workloads. This doesn't pretend to be a final NUMA solution. The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by 'sysctl vm.default_policy=rr'. This is only relevant if MAXMEMDOM is set to something other than 1. Ie, if you're using GENERIC or a modified kernel with non-NUMA, then this is a glorified no-op for you. Thank you to Norse Corp for giving me access to rather large (for FreeBSD!) NUMA machines in order to develop and verify this. Thank you to Dell for providing me with dual socket sandybridge and westmere v3 hardware to do NUMA development with. Thank you to Scott Long at Netflix for providing me with access to the two-socket, four-domain haswell v3 hardware. Thank you to Peter Holm for running the stress testing suite against the NUMA branch during various stages of development! Tested: * MIPS (regression testing; non-NUMA) * i386 (regression testing; non-NUMA GENERIC) * amd64 (regression testing; non-NUMA GENERIC) * westmere, 2 socket (thankyou norse!) * sandy bridge, 2 socket (thankyou dell!) * ivy bridge, 2 socket (thankyou norse!) * westmere-EX, 4 socket / 1TB RAM (thankyou norse!) * haswell, 2 socket (thankyou norse!) * haswell v3, 2 socket (thankyou dell) * haswell v3, 2x18 core (thankyou scott long / netflix!) * Peter Holm ran a stress test suite on this work and found one issue, but has not been able to verify it (it doesn't look NUMA related, and he only saw it once over many testing runs.) * I've tested bhyve instances running in fixed NUMA domains and cpusets; all seems to work correctly. Verified: * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different NUMA policies for processes under test. Review: This was reviewed through phabricator (https://reviews.freebsd.org/D2559) as well as privately and via emails to freebsd-arch@. The git history with specific attributes is available at https://github.com/erikarn/freebsd/ in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy). This has been reviewed by a number of people (stas, rpaulo, kib, ngie, wblock) but not achieved a clear consensus. My hope is that with further exposure and testing more functionality can be implemented and evaluated. Notes: * The VM doesn't handle unbalanced domains very well, and if you have an overly unbalanced memory setup whilst under high memory pressure, VM page allocation may fail leading to a kernel panic. This was a problem in the past, but it's much more easily triggered now with these tools. * This work only controls the path through vm_phys; it doesn't yet strongly/predictably affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory isn't really guaranteed in any way. That's next on my plate. Sponsored by: Norse Corp, Inc.; Dell	2015-07-11 15:21:37 +00:00
Konstantin Belousov	cf88021ab1	Do not allow creation of the dirty buffers for the dead buffer objects, i.e. for buffer objects which vnode was reclaimed. Buffer cache cannot write such buffers. Return the error and discard the buffer immediately on write attempt. BO_DIRTY now always set during vnode reclamation, since it is used not only for the INVARIANTS checks. Do allow placement of the clean buffers on dead bufobj list, otherwise filesystems cannot use bufcache at all after the devvp reclaim. Reported and tested by: trasz Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-11 11:21:56 +00:00
Ed Schouten	ea566832d7	Add missing const keyword to kern_sigaction()'s 'act' parameter. This structure is not modified by the function. Also add const to sigact_flag_test(), as it is called by kern_sigaction().	2015-07-10 14:39:46 +00:00
Mateusz Guzik	9a1ad66fb5	fd: further cleanup of kern_dup - make mode enum start from 0 so that the assertion covers all cases [1] - rename prefix _CLOEXEC flag with _FLAG - postpone fhold on the old file descriptor, which eliminates the need to fdrop in error cases. - fixup FDDUP_FCNTL check missed in the previous commit This removes 'fp == oldfde->fde_file' assertion which had little value. kern_dup only calls fd-related functions which cannot drop the lock or a whole lot of races would be introduced. Noted by: kib [1]	2015-07-10 13:54:03 +00:00
Mateusz Guzik	5fe97c20dc	fd: split kern_dup flags argument into actual flags and a mode Tidy up the code inside to switch on the mode.	2015-07-10 11:01:30 +00:00
Konstantin Belousov	e8677f3885	Change the mb() use in the sched_ule tdq_notify() and sched_idletd() to more C11-ish atomic_thread_fence_seq_cst(). Note that on PowerPC, which currently uses lwsync for mb(), the change actually fixes the missed store/load barrier, intended by r271604 [*]. Reviewed by: alc Noted by: alc [*] Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-07-10 08:54:12 +00:00
Ed Schouten	47a84387ad	Let listen() return EDESTADDRREQ when not bound. We currently return EINVAL when calling listen() on a UNIX socket that has not been bound to a pathname. If my interpretation of POSIX is correct, we should return EDESTADDRREQ: "The socket is not bound to a local address, and the protocol does not support listening on an unbound socket." Return EDESTADDRREQ instead when not bound and not connected. Differential Revision: https://reviews.freebsd.org/D3038 Reviewed by: gnn, network	2015-07-10 06:47:14 +00:00
Mateusz Guzik	318b946321	vfs: cosmetic changes to namei and namei_handle_root - don't initialize cnp during declaration - don't test error/!error, compare to 0 instead	2015-07-09 17:17:26 +00:00
Mateusz Guzik	d177f49f6f	vfs: simplify error handling in namei The logic is reorganised so that there is one exit point prior to the lookup loop. This is an intermediate step to making audit logging functions use found vnode instead of translating ni_dirfd on their own. ni_startdir validation is removed. The only in-tree consumer is nfs which already makes sure it is a directory. Reviewed by: kib	2015-07-09 16:32:58 +00:00
Ed Schouten	2491302a04	Add implementations for some of the CloudABI file descriptor system calls. All of the CloudABI system calls that operate on file descriptors of an arbitrary type are prefixed with fd_. This change adds wrappers for most of these system calls around their FreeBSD equivalents. The dup2() system call present on CloudABI deviates from POSIX, in the sense that it can only be used to replace an existing file descriptor. It cannot be used to create new ones. The reason for this is that this is inherently thread-unsafe. Furthermore, there is no need on CloudABI to use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no special meaning. This change exposes the kern_dup() through <sys/syscallsubr.h> and puts the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag, FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not allocated. Differential Revision: https://reviews.freebsd.org/D3035 Reviewed by: mjg	2015-07-09 16:07:01 +00:00
Mateusz Guzik	efdc25304c	fd: prepare do_dup for being exported - rename it to kern_dup. - prefix flags with FD - assert that correct flags were passed	2015-07-09 15:19:45 +00:00
Mateusz Guzik	d19ba50e12	vfs: avoid spurious vref/vrele for absolute lookups namei used to vref fd_cdir, which was immediately vrele'd on entry to the loop. Check for absolute lookup and vref the right vnode the first time. Reviewed by: kib	2015-07-09 15:06:58 +00:00
Mateusz Guzik	a03f1b2970	vfs: plug a use-after-free of fd_rdir in namei fd_rdir vnode was stored in ni_rootdir without refing it in any way, after which the filedsc lock was being dropped. The vnode could have been freed by mountcheckdirs or another thread doing chroot. VREF the vnode while the lock is held. Reviewed by: kib MFC after: 1 week	2015-07-09 15:06:24 +00:00
Ed Schouten	3a41ec6af7	Don't clobber td->td_retval[0] in proc_reap(). While writing tests for CloudABI, I noticed that close() on process descriptors returns the process ID of the child process. This is interesting, as close() is only allowed to return 0 or -1. It turns out that we clobber td->td_retval[0] in proc_reap(), so that wait*() properly returns the process ID. Change proc_reap() to leave td->td_retval[0] alone. Set the return value in kern_wait6() instead, by keeping track of the PID before we (potentially) reap the process. Differential Revision: https://reviews.freebsd.org/D3032 Reviewed by: kib	2015-07-09 12:04:45 +00:00
Konstantin Belousov	fcb5b3a419	Cover a race between doselwakeup() and selfdfree(). If doselwakeup() loop finds the selfd entry and clears its sf_si pointer, which is handled by selfdfree() in parallel, NULL sf_si makes selfdfree() free the memory. The result is the race and accesses to the freed memory. Refcount the selfd ownership. One reference is for the sf_link linkage, which is unconditionally dereferenced by selfdfree(). Another reference is for sf_threads, both selfdfree() and doselwakeup() race to deref it, the winner unlinks and then frees the selfd entry. Reported by: Larry Rosenman <ler@lerctr.org> Tested by: Larry Rosenman <ler@lerctr.org>, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-07-09 09:22:21 +00:00
Ed Schouten	6d338f9a81	Import the CloudABI datatypes and create a system call table. CloudABI is a pure capability-based runtime environment for UNIX. It works similarly to Capsicum, except that processes already run in capabilities mode on startup. All functionality that conflicts with this model has been omitted, making it a compact binary interface that can be supported by other operating systems without too much effort. CloudABI is 'secure by default'; the idea is that it should be safe to run arbitrary third-party binaries without requiring any explicit hardware virtualization (Bhyve) or namespace virtualization (Jails). The rights of an application are purely determined by the set of file descriptors that you grant it on startup. The datatypes and constants used by CloudABI's C library (cloudlibc) are defined in separate files called syscalldefs_mi.h (pointer size independent) and syscalldefs_md.h (pointer size dependent). We import these files in sys/contrib/cloudabi and wrap around them in cloudabi*_syscalldefs.h. We then add stubs for all of the system calls in sys/compat/cloudabi or sys/compat/cloudabi64, depending on whether the system call depends on the pointer size. We only have nine system calls that depend on the pointer size. If we ever want to support 32-bit binaries, we can simply add sys/compat/cloudabi32 and implement these nine system calls again. The next step is to send in code reviews for the individual system call implementations, but also add a sysentvec, to allow CloudABI executables to be started through execve(). More information about CloudABI: - GitHub: https://github.com/NuxiNL/cloudlibc - Talk at BSDCan: https://www.youtube.com/watch?v=SVdF84x1EdA Differential Revision: https://reviews.freebsd.org/D2848 Reviewed by: emaste, brooks Obtained from: https://github.com/NuxiNL/freebsd	2015-07-09 07:20:15 +00:00
Konstantin Belousov	f4b5a9725a	Reimplement the ordering requirements for the timehands updates, and for timehands consumers, by using fences. Ensure that the timehands->th_generation reset to zero is visible before the data update is visible [*]. tc_setgen() allowed data update writes to become visible before generation (but not on TSO architectures). Remove tc_setgen(), tc_getgen() helpers, use atomics inline [**]. Noted by: alc [*] Requested by: bde [**] Reviewed by: alc, bde Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2015-07-08 18:42:08 +00:00
Konstantin Belousov	69d11def74	Handle copyout for the fcntl(F_OGETLK) using oflock structure. Otherwise, kernel overwrites a word past the destination. Submitted by: walter@pelissero.de PR: 196718 MFC after: 1 week	2015-07-08 13:19:13 +00:00
Mark Johnston	620711e033	Fix an incorrect assertion in witness. The number of available lock list entries for a thread is LOCK_CHILDCOUNT, and each entry can record up to LOCK_NCHILDREN locks. When iterating over the locks held by a thread, a bound on the loop index is therefore given by LOCK_CHILDCOUNT * LOCK_NCHILDREN; WITNESS_COUNT is an unrelated constant. Reviewed by: jhb MFC after: 1 week Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D2974	2015-07-07 19:29:18 +00:00
Pedro F. Giffuni	9129dd59be	Relocate sched_random() within the SMP section. Place sched_random nearer to where it's first used: moving the code nearer to where it is used makes the code easier to read and we can reduce the initial "#ifdef SMP" island. Reword a little the comment and clean some whitespaces while here.	2015-07-07 15:22:29 +00:00
Mateusz Guzik	aa0e2887f4	tty: replace several curthread->td_proc with stored curproc No functional changes.	2015-07-06 18:53:56 +00:00
Patrick Kelsey	6f99ea0520	Don't acquire sysctlmemlock in userland_sysctl() when the old value pointer is NULL, as in that case there are no userland pages that could potentially be wired. It is common for old to be NULL and oldlenp to be non-NULL in calls to userland_sysctl(), as this is used to probe for the length of a variable-length sysctl entry before retrieving a value. Note that it is typical for such calls to be made with an uninitialized value in oldlenp, so sysctlmemlock was essentially being acquired at random (depending on the uninitialized value in oldlenp being > PAGE_SIZE or not) for these calls prior to this patch. Differential Revision: https://reviews.freebsd.org/D2987 Reviewed by: mjg, kib Approved by: jmallett (mentor) MFC after: 1 month	2015-07-06 16:07:21 +00:00
Konstantin Belousov	9889bbac23	Mutex memory is not zeroed, add MTX_NEW. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-06 14:09:00 +00:00
Mark Johnston	947401dd50	Move the comment describing namei(9) back to namei()'s definition. MFC after: 3 days	2015-07-05 22:56:41 +00:00
Mark Johnston	8bbd1f25b1	Remove a stale descriptive comment for gbincore(). The splay trees referenced in the comment were converted to path-compressed tries in r250551. MFC after: 3 days	2015-07-05 22:44:41 +00:00
Mark Johnston	5f34e93c58	Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT. This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough filesystems like nullfs and unionfs no longer need to inherit this information from their lower layer(s). This change also restores the pre-r273336 behaviour of using the presence of a susp_clean VFS method to request suspension support. Reviewed by: kib, mjg Differential Revision: https://reviews.freebsd.org/D2937	2015-07-05 22:37:33 +00:00
Mateusz Guzik	f131759f54	fd: make 'rights' a mandatory argument to fget* functions	2015-07-05 19:05:16 +00:00
Mariusz Zaborski	54f98da930	Move the nvlist source and private includes from sys/kern to a separate directory, sys/contrib/libnv. The goal of this operation is to NOT install header files which shouldn't be used outside the nvlist library. Approved by: pjd (mentor)	2015-07-04 16:33:37 +00:00
Mateusz Guzik	9ca30b0e06	vfs: use shared vnode locking when looking up ".." in vop_stdvptocnp Briefly discussed with: kib	2015-07-04 15:46:39 +00:00
Mateusz Guzik	dba0bec2bb	fd: de-k&r-ify functions + some whitespace fixes No functional changes.	2015-07-04 15:42:03 +00:00
Mateusz Guzik	ee5f66f820	sysctl: get rid of sysctl_lock/unlock Inline their contents into the only consumer.	2015-07-04 14:44:39 +00:00
Mateusz Guzik	d5fc115a1a	sysctl: remove a debugging printf which crept in with r285125	2015-07-04 07:01:43 +00:00
Mateusz Guzik	b8633775a8	sysctl: switch sysctllock to a sleepable rmlock The lock is almost never taken for writing.	2015-07-04 06:54:15 +00:00
Mateusz Guzik	e2f5418e73	sysvshm: fix up some whitespace issues and spurious initialisation	2015-07-02 19:14:30 +00:00
Mateusz Guzik	77a26248a3	sysvshm: don't lock proc when calculating attach_va vm_daddr is constant and RLIMIT_DATA can be obtained from thread's copy of rlimits.	2015-07-02 19:03:44 +00:00
Mateusz Guzik	0be3a191a4	sysvshm: fix shmrealloc The code was supposed to initialize new segs in newsegs array, but used the old pointer.	2015-07-02 19:00:22 +00:00
Konstantin Belousov	1965f86c72	Vnode is not referenced by the vfs_domount() at the point where asserts are made. Remove them, since we might dereference freed memory. Leaked locks are asserted by the syscall return code anyway. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-02 14:31:47 +00:00
Navdeep Parhar	9523d1bfc3	Fix leak in tcp_lro_rx. Simply clearing M_PKTHDR isn't enough, any tags hanging off the header need to be freed too. Differential Revision: https://reviews.freebsd.org/D2708 Reviewed by: ae@, hiren@	2015-06-30 17:19:58 +00:00
Mark Murray	d1b06863fb	Huge cleanup of random(4) code. * GENERAL - Update copyright. - Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set neither to ON, which means we want Fortuna - If there is no 'device random' in the kernel, there will be NO random(4) device in the kernel, and the KERN_ARND sysctl will return nothing. With RANDOM_DUMMY there will be a random(4) that always blocks. - Repair kern.arandom (KERN_ARND sysctl). The old version went through arc4random(9) and was a bit weird. - Adjust arc4random stirring a bit - the existing code looks a little suspect. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Redo read_random(9) so as to duplicate random(4)'s read internals. This makes it a first-class citizen rather than a hack. - Move stuff out of locked regions when it does not need to be there. - Trim RANDOM_DEBUG printfs. Some are excess to requirement, some behind boot verbose. - Use SYSINIT to sequence the startup. - Fix init/deinit sysctl stuff. - Make relevant sysctls also tunables. - Add different harvesting "styles" to allow for different requirements (direct, queue, fast). - Add harvesting of FFS atime events. This needs to be checked for weighing down the FS code. - Add harvesting of slab allocator events. This needs to be checked for weighing down the allocator code. - Fix the random(9) manpage. - Loadable modules are not present for now. These will be re-engineered when the dust settles. - Use macros for locks. - Fix comments. * src/share/man/... - Update the man pages. * src/etc/... - The startup/shutdown work is done in D2924. * src/UPDATING - Add UPDATING announcement. * src/sys/dev/random/build.sh - Add copyright. - Add libz for unit tests. * src/sys/dev/random/dummy.c - Remove; no longer needed. Functionality incorporated into randomdev.. live_entropy_sources.c live_entropy_sources.h - Remove; content moved. - move content to randomdev.[ch] and optimise. 
* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h - Remove; plugability is no longer used. Compile-time algorithm selection is the way to go. * src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h - Add early (re)boot-time randomness caching. * src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h - Remove; no longer needed. * src/sys/dev/random/uint128.h - Provide a fake uint128_t; if a real one ever arrived, we can use that instead. All that is needed here is N=0, N++, N==0, and some localised trickery is used to manufacture a 128-bit 0ULLL. * src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h - Improve unit tests; previously the testing human needed clairvoyance; now the test will do a basic check of compressibility. Clairvoyant talent is still a good idea. - This is still a long way off a proper unit test. * src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct fortuna_start_cache'. - Tighten up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) * src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct start_cache'. - Tighten up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) - Fix some magic numbers elsewhere used as FAST and SLOW. 
Differential Revision: https://reviews.freebsd.org/D2025 Reviewed by: vsevolod,delphij,rwatson,trasz,jmg Approved by: so (delphij)	2015-06-30 17:00:45 +00:00
Konstantin Belousov	6ef120027f	Do not calculate the stack's bottom address twice. Submitted by: Olivér Pintér Review: https://reviews.freebsd.org/D2953 MFC after: 1 week	2015-06-30 15:22:47 +00:00
Mark Murray	6687b6720b	Ansify another function. This is the last in the file, I hope.	2015-06-28 10:51:08 +00:00
Mark Murray	7233d3094d	ANSIfy the only function that uses K&R definition in this file.	2015-06-28 09:44:58 +00:00
Konstantin Belousov	b2c3df842b	Handle errors from background write of the cylinder group blocks. First, on the write error, bufdone() call from ffs_backgroundwrite() panics because pbrelvp() cleared bp->b_bufobj, while brelse() would try to re-dirty the copy of the cg buffer. Handle this by setting B_INVAL for the case of BIO_ERROR. Second, we must re-dirty the real buffer containing the cylinder group block data when background write failed. Real cg buffer was already marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan may reuse the buffer at any moment. The result is lost write, and if the write error was only transient, we get corrupted bitmaps. We cannot re-dirty the original cg buffer in the ffs_backgroundwritedone(), since the context is not sleepable, preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag (protected by the buffer object lock), which is converted into delayed write by brelse(), bqrelse() and buffer scan. In collaboration with: Conrad Meyer <cse.cem@gmail.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation (kib), EMC/Isilon storage division (Conrad) MFC after: 2 weeks	2015-06-27 09:44:14 +00:00
Adrian Chadd	5bbb2169d2	Un-static cpuset_which() - it's useful in other contexts, such as some CPU set operations in my upcoming NUMA work. Tested/compiled: * i386 (run) * amd64 (run) * mips (run) * mips64 (run) * armv6 (built) Sponsored by: Norse Corp, Inc.	2015-06-26 04:14:05 +00:00
Mateusz Guzik	7150ce743a	rlimit: deduplicate code in chg* functions	2015-06-25 00:15:37 +00:00
Sean Bruno	4e83b32a80	At the suggestion of jhb, replace atomic_set/clear calls with use of exclusive locks in the enable/disable interpreter path. Tested with WITNESS/INVARIANTS on and off. Reviewed by: sson davide	2015-06-24 15:52:26 +00:00
John-Mark Gurney	1977bd233a	zero this struct as it depends upon it... Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D2890	2015-06-23 18:40:20 +00:00
Konstantin Belousov	b05c401ff6	Only take previous buffer queue lock (olock) when needed for REMFREE in binsfree(). Submitted by: Conrad Meyer Sponsored by: EMC / Isilon Storage Division Review: https://reviews.freebsd.org/D2882 MFC after: 1 week	2015-06-23 06:12:14 +00:00
Sean Bruno	945afa7c25	Make imgact_binmisc_exec() static. Submitted by: kib Reviewed by: sson	2015-06-22 17:04:24 +00:00
Sean Bruno	602ec83516	Remove unneeded NULL check since the malloc is now M_WAITOK Submitted by: mjg	2015-06-19 20:35:17 +00:00
Sean Bruno	e0ae213f63	Must have one of either M_WAITOK or M_NOWAIT, read the man page bruno. Submitted by: mjg	2015-06-19 19:57:39 +00:00
Sean Bruno	a7647ec444	Feedback from commit r284535 davide: imgact_binmisc_clear_entry() needs to use atomic ops to remove the enable bit. kib: M_NOWAIT is not warranted and comment is invalid.	2015-06-19 18:57:36 +00:00
Sean Bruno	5f98711d51	This change replaces the mutex with a sx lock for the interpreter list to avoid the problem of holding a non-sleep lock during a page fault as reported by witness. It also uses atomics where possible to avoid having to acquire the exclusive lock. In addition, it consistently uses memset()/memcpy() instead of bzero()/bcopy(). Differential Revision: https://reviews.freebsd.org/D1971 Submitted by: sson Reviewed by: jhb	2015-06-18 02:04:20 +00:00
Bjoern A. Zeeb	af10bf055f	Initialise pr_enforce_statfs from the "default" sysctl value and not from the compile time constant. The sysctl value is seeded from the compile time constant. MFC after: 2 weeks	2015-06-17 13:15:54 +00:00
Konstantin Belousov	1eabd96728	vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages. The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk not later than in 30 seconds after the write. Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-06-17 04:46:58 +00:00
Pedro F. Giffuni	708694f704	Use nitems() macro instead of __arraycount()	2015-06-16 20:19:00 +00:00
Mateusz Guzik	4da8456f0a	Replace struct filedesc argument in getvnode with struct thread This is a step towards removal of spurious arguments.	2015-06-16 13:09:18 +00:00
Mateusz Guzik	9ef8328d52	fd: make rights a mandatory argument to fget_unlocked	2015-06-16 09:52:36 +00:00
Mateusz Guzik	80f3623f2f	fd: don't unnecessarily copy capabilities in _fget	2015-06-16 09:08:30 +00:00
Mateusz Guzik	cedab3c72c	fd: reduce excessive zeroing on fd close fde_file as NULL is already an indicator of an unused fd. All other fields are populated when fp is installed.	2015-06-14 14:10:05 +00:00
Mateusz Guzik	ea31808c3b	fd: move out actual fp installation to _finstall Use it in fd passing functions as the first step towards fd code cleanup.	2015-06-14 14:08:52 +00:00
Jeremie Le Hen	3768a5dfb5	nit: Rename racct_alloc_resource to racct_adjust_resource. This is more accurate as the amount can be negative. MFC after: 2 weeks	2015-06-14 08:33:14 +00:00
Gleb Smirnoff	093c7f396d	Make KPI of vm_pager_get_pages() more strict: if a pager changes a page in the requested array, then it is responsible for disposition of previous page and is responsible for updating the entry in the requested array. Now consumers of KPI do not need to re-lookup the pages after call to vm_pager_get_pages(). Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-06-12 11:32:20 +00:00
Andriy Gapon	076dd8eb2e	several lockstat improvements 0. For spin events report time spent spinning, not a loop count. While loop count is much easier and cheaper to obtain it is hard to reason about the reported numbers, especially for adaptive locks where both spinning and sleeping can happen. So, it's better to compare apples and apples. 1. Teach lockstat about FreeBSD rw locks. This is done in part by changing the corresponding probes and in part by changing what probes lockstat should expect. 2. Teach lockstat that rw locks are adaptive and can spin on FreeBSD. 3. Report lock acquisition events for successful rw try-lock operations. 4. Teach lockstat about FreeBSD sx locks. Reporting of events for those locks completely mirrors rw locks. 5. Report spin and block events before acquisition event. This is behavior documented for the upstream, so it makes sense to stick to it. Note that because of FreeBSD adaptive lock implementations both the spin and block events may be reported for the same acquisition while the upstream reports only one of them. Differential Revision: https://reviews.freebsd.org/D2727 Reviewed by: markj MFC after: 17 days Relnotes: yes Sponsored by: ClusterHQ	2015-06-12 10:01:24 +00:00
Mateusz Guzik	3331a33a42	ussreq: use saved fdp pointer instead of td->td_proc->p_fd No functional changes.	2015-06-12 06:28:22 +00:00
Konstantin Belousov	529c97886b	Tweaks for r284178: Do not include machine/atomic.h explicitly, the header is already included by sys/systm.h. Force inlining of tc_getgen() and tc_setgen(). The functions are used more than once, which causes compilers with non-aggressive inlining policies to generate calls. Suggested by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-06-11 04:41:54 +00:00
Mateusz Guzik	21de5aea6c	Fixup the build after r284215. Submitted by: Ivan Klymenko <fidaj ukr.net> [slightly modified]	2015-06-10 12:39:01 +00:00
Mateusz Guzik	f6f6d24062	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thread's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.	2015-06-10 10:48:12 +00:00
Mateusz Guzik	4ea6a9a28f	Generalised support for copy-on-write structures shared by threads. Thread credentials are maintained as follows: each thread has a pointer to creds and a reference on them. The pointer is compared with proc's creds on userspace<->kernel boundary and updated if needed. This patch introduces a counter which can be compared instead, so that more structures can use this scheme without adding more comparisons on the boundary.	2015-06-10 10:43:59 +00:00
Mateusz Guzik	3b3eb22ab6	fd: remove fdesc_mtx	2015-06-10 09:40:07 +00:00
Mateusz Guzik	153cc61b54	fd: use atomics to manage fd_refcnt and fd_holcnt This gets rid of fdesc_mtx.	2015-06-10 09:34:50 +00:00
Kenneth D. Merry	5672fac935	Add support for reading MAM attributes to camcontrol(8) and libcam(3). MAM is Medium Auxiliary Memory and is most commonly found as flash chips on tapes. This includes support for reading attributes and decoding most known attributes, but does not yet include support for writing attributes or reporting attributes in XML format. libsbuf/Makefile: Add subr_prf.c for the new sbuf_hexdump() function. This function is essentially the same function. libsbuf/Symbol.map: Add a new shared library minor version, and include the sbuf_hexdump() function. libsbuf/Version.def: Add version 1.4 of the libsbuf library. libutil/hexdump.3: Document sbuf_hexdump() alongside hexdump(3), since it is essentially the same function. camcontrol/Makefile: Add attrib.c. camcontrol/attrib.c: Implementation of READ ATTRIBUTE support for camcontrol(8). camcontrol/camcontrol.8: Document the new 'camcontrol attrib' subcommand. camcontrol/camcontrol.c: Add the new 'camcontrol attrib' subcommand. camcontrol/camcontrol.h: Add a function prototype for scsiattrib(). share/man/man9/sbuf.9: Document the existence of sbuf_hexdump() and point users to the hexdump(3) man page for more details. sys/cam/scsi/scsi_all.c: Add a table of known attributes, text descriptions and handler functions. Add a new scsi_attrib_sbuf() function along with a number of other related functions that help decode attributes. scsi_attrib_ascii_sbuf() decodes ASCII format attributes. scsi_attrib_int_sbuf() decodes binary format attributes, and will pass them off to scsi_attrib_hexdump_sbuf() if they're bigger than 8 bytes. scsi_attrib_vendser_sbuf() decodes the vendor and drive serial number attribute. scsi_attrib_volcoh_sbuf() decodes the Volume Coherency Information attribute that LTFS writes out. sys/cam/scsi/scsi_all.h: Add a number of attribute-related structure definitions and other defines. Add function prototypes for all of the functions added in scsi_all.c. 
sys/kern/subr_prf.c: Add a new function, sbuf_hexdump(). This is the same as the existing hexdump(9) function, except that it puts the result in an sbuf. This also changes subr_prf.c so that it can be compiled in userland for inclusion in libsbuf. We should work to change this so that the kernel hexdump implementation is a wrapper around sbuf_hexdump() with a statically allocated sbuf with a drain. That will require a drain function that goes to the kernel printf() buffer that can take a non-NUL terminated string as input. That is because an sbuf isn't NUL-terminated until it is finished, and we don't want to finish it while we're still using it. We should also work to consolidate the userland hexdump and kernel hexdump implementations, which are currently separate. This would also mean making applications that currently link in libutil link in libsbuf. sys/sys/sbuf.h: Add the prototype for sbuf_hexdump(), and add another copy of the hexdump flag values if they aren't already defined. Ideally the flags should be defined in one place but the implementation makes it difficult to do properly. (See above.) Sponsored by: Spectra Logic Corporation MFC after: 1 week	2015-06-09 21:39:38 +00:00
Konstantin Belousov	2c6946dca2	When updating/accessing the timehands, barriers are needed to ensure that: - th_generation update is visible after the parameters update is visible; - the read of parameters is not reordered before initial read of th_generation. On UP kernels, compiler barriers are enough. For SMP machines, CPU barriers must be used too, as was confirmed by submitter by testing on the Freescale T4240 platform with 24 PowerPC processors. Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 1 week	2015-06-09 11:49:56 +00:00
John Baldwin	15c2b30155	Revert r284153, as I believe it breaks the dtrace sdt module. I will fix the original issue a different way.	2015-06-08 18:06:00 +00:00
Ed Maste	6b16d66497	Add user facing errors for exceeding process memory limits Previously the process terminating with SIGABRT at startup was the only notification. PR: 200617 Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D2731	2015-06-08 16:07:07 +00:00
John Baldwin	69c5c774fe	Add an internal "locked" variant of linker_file_lookup_set() and change the public function to acquire the global linker lock directly. This permits linker_file_lookup_set() to be safely used from other modules.	2015-06-08 14:06:47 +00:00
Mark Johnston	8b84d791a0	witness: don't warn about matrix inconsistencies without holding the mutex Lock order checking is done without the witness mutex held, so multiple threads that are racing to establish a new lock order may read matrix entries that are in an inconsistent state. Don't print a warning in this case, but instead just redo the check after taking the witness lock. Differential Revision: https://reviews.freebsd.org/D2713 Reviewed by: jhb MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division	2015-06-07 18:59:47 +00:00
Sean Bruno	280b716943	Revert 284029, update imgact_binmisctl.c change mtx to reader count, at the request of the submitter. Will attempt to use an sx_lock for this fix to WITNESS crashes in a later revision. Submitted by: sson	2015-06-05 18:16:10 +00:00
Sean Bruno	8c8613a14f	This change uses a reader count instead of holding the mutex for the interpreter list to avoid the problem of holding a non-sleep lock during a page fault as reported by witness. In addition, it consistently uses memset()/memcpy() instead of bzero()/bcopy() except in the case where bcopy() is required (i.e. overlapping copy). Differential Revision: https://reviews.freebsd.org/D2123 Submitted by: sson MFC after: 2 weeks Relnotes: Yes	2015-06-05 16:21:43 +00:00
John Baldwin	7077c42623	Add a new file operations hook for mmap operations. File type-specific logic is now placed in the mmap hook implementation rather than requiring it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to support mmap() as well as potentially allowing mmap() for existing file types that do not currently support any mapping. The vm_mmap() function is now split up into two functions. A new vm_mmap_object() function handles the "back half" of vm_mmap() and accepts a referenced VM object to map rather than a (handle, handle_type) tuple. vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a VM object and then calling vm_mmap_object() to handle the actual mapping. The vm_mmap() function remains for use by other parts of the kernel (e.g. device drivers and exec) but now only supports mapping vnodes, character devices, and anonymous memory. The mmap() system call invokes vm_mmap_object() directly with a NULL object for anonymous mappings. For mappings using a file descriptor, the descriptor's fo_mmap() hook is invoked instead. The fo_mmap() hook is responsible for performing type-specific checks and adjustments to arguments as well as possibly modifying mapping parameters such as flags or the object offset. The fo_mmap() hook routines then call vm_mmap_object() to handle the actual mapping. The fo_mmap() hook is optional. If it is not set, then fo_mmap() will fail with ENODEV. A fo_mmap() hook is implemented for regular files, character devices, and shared memory objects (created via shm_open()). While here, consistently use the VM_PROT_* constants for the vm_prot_t type for the 'prot' variable passed to vm_mmap() and vm_mmap_object() as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines. Previously some places were using the mmap()-specific PROT_* constants instead. While this happens to work because PROT_xx == VM_PROT_xx, using VM_PROT_* is more correct. 
Differential Revision: https://reviews.freebsd.org/D2658 Reviewed by: alc (glanced over), kib MFC after: 1 month Sponsored by: Chelsio	2015-06-04 19:41:15 +00:00
Eric van Gyzen	63e4c6cdf9	Provide vnode in memory map info for files on tmpfs When providing memory map information to userland, populate the vnode pointer for tmpfs files. Set the memory mapping to appear as a vnode type, to match FreeBSD 9 behavior. This fixes the use of tmpfs files with the dtrace pid provider, procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY). Submitted by: Eric Badger <eric@badgerio.us> (initial revision) Obtained from: Dell Inc. PR: 198431 MFC after: 2 weeks Reviewed by: jhb Approved by: kib (mentor)	2015-06-02 18:37:04 +00:00
Xin LI	4c372ca254	Clear p_stops when doing PT_DETACH. Without this, if a process was being traced by truss(1), which uses different p_stops bits than gdb(1), the latter would misbehave because of the unexpected bits. Reported by: jceel Submitted by: sef Sponsored by: iXsystems, Inc. MFC after: 2 weeks	2015-06-01 18:15:45 +00:00
Konstantin Belousov	aef68c961a	When delivering a signal with default disposition to a thread, tdsigwakeup() increases the priority of low-priority threads, to give them a chance to be terminated in a timely manner. Also, the kernel allows the user to signal kernel processes. The combined effect is that signalling the idle process bumps the priority of the selected delivery thread, which starts eating CPU. Check whether the delivery thread is an idle thread and do not raise its priority in that case. Signal delivery to kernel threads must be an opt-in feature. A kernel thread should explicitly declare the ability to handle signals directed to it. E.g., nfsd threads check for a signal as an indication of an exit request. Most threads do not handle signals at all, and queuing a signal to them causes odd side-effects. The most innocent consequence is a memory leak due to the queued ksiginfo, which is never deleted from the sigqueue. Code to prevent even queuing signals to kernel threads is trivial, but it requires careful examination of each call to kproc/kthread creation to decide whether signalling should be allowed. The commit is a stop-gap measure which fixes the immediate case for now. PR: 200493 Reported and tested by: trasz Discussed with: trasz, emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-29 16:26:08 +00:00
Konstantin Belousov	69baeadc31	Remove several write-only variables, all reported by the gcc 4.9 buildkernel run. Some of them were write-only under some kernel options, e.g. variables keeping values only used by CTR() macros. It costs nothing to the code readability and correctness to eliminate the warnings in those cases too by removing the local cached values used only for single-access. Review: https://reviews.freebsd.org/D2665 Reviewed by: rodrigc Looked at by: bjk Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-29 13:24:17 +00:00
Konstantin Belousov	780dca1b1e	Right now, dounmount() is called with an unreferenced mount point. Nothing stops a parallel unmount from succeeding before the given call to dounmount() checks and locks the covered vnode. Prevent dounmount() from acting on the freed (although type-stable) memory by changing the interface to require the mount point to be referenced. dounmount() consumes the reference on return, regardless of whether the result is successful or erroneous. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:22:50 +00:00
Konstantin Belousov	2db0e1f50d	Add V_MNTREF flag to the vn_start_write(9) and vn_start_secondary_write(9) functions. The flag indicates that the caller already owns a reference on the mount point, and the functions can consume it. The reference is released by vn_finished_write(9) and vn_finished_secondary_write(9) in due course. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:21:47 +00:00
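
Note on 2c6946dca2 (timehands barriers): the two ordering requirements listed in that message follow the usual generation-counter (seqlock-style) pattern. The following is a minimal userland sketch in C11 atomics, not the kernel code; the struct th_sketch type, its fields, and the th_update()/th_read() names are hypothetical and chosen only for illustration. It assumes a single writer and a generation value of 0 reserved to mean "update in progress".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's timehands; names are illustrative. */
struct th_sketch {
	atomic_uint_fast64_t scale;
	atomic_uint_fast64_t offset;
	atomic_uint generation;	/* 0 means "update in progress" */
};

/* Writer: invalidate, update the parameters, then publish a new generation. */
static void
th_update(struct th_sketch *th, uint64_t scale, uint64_t offset)
{
	unsigned int gen;

	assert(th != NULL);
	gen = atomic_load_explicit(&th->generation, memory_order_relaxed);
	atomic_store_explicit(&th->generation, 0, memory_order_relaxed);
	/* Keep the parameter stores from becoming visible before the zeroing. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->scale, scale, memory_order_relaxed);
	atomic_store_explicit(&th->offset, offset, memory_order_relaxed);
	if (++gen == 0)
		gen = 1;	/* skip the reserved value on wraparound */
	/* Release store: parameters are visible before the new generation. */
	atomic_store_explicit(&th->generation, gen, memory_order_release);
}

/* Reader: retry until the generation is nonzero and unchanged across the reads. */
static void
th_read(struct th_sketch *th, uint64_t *scale, uint64_t *offset)
{
	unsigned int gen;

	do {
		/* Acquire load: the parameter reads cannot move before it. */
		gen = atomic_load_explicit(&th->generation, memory_order_acquire);
		*scale = atomic_load_explicit(&th->scale, memory_order_relaxed);
		*offset = atomic_load_explicit(&th->offset, memory_order_relaxed);
		/* Keep the parameter reads ahead of the generation re-check. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->generation, memory_order_relaxed));
}
```

The generation must be initialized to a nonzero value before the first th_read() call, otherwise the reader spins until the first update completes. On a uniprocessor the fences only need to constrain the compiler, which matches the UP/SMP distinction drawn in the commit message.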

14513 Commits
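
Note on 7077c42623 (fo_mmap hook): the message says the hook is optional and that fo_mmap() fails with ENODEV when it is not set. A rough sketch of that dispatch rule is below. All type and function names here (file_sketch, fileops_sketch, fo_mmap_sketch, regular_mmap_sketch) are hypothetical, and the hook signature is simplified; the real kernel hook takes more arguments.

```c
#include <errno.h>
#include <stddef.h>

struct file_sketch;

/* Simplified, hypothetical hook signature; the kernel's differs. */
typedef int (*mmap_hook_t)(struct file_sketch *fp, size_t len, int prot);

struct fileops_sketch {
	mmap_hook_t fo_mmap;	/* NULL when the file type cannot be mapped */
};

struct file_sketch {
	const struct fileops_sketch *f_ops;
};

/* Example hook for a mappable file type: accept every request. */
static int
regular_mmap_sketch(struct file_sketch *fp, size_t len, int prot)
{
	(void)fp;
	(void)len;
	(void)prot;
	return (0);
}

/* Dispatch wrapper: a missing hook means the type does not support mmap(). */
static int
fo_mmap_sketch(struct file_sketch *fp, size_t len, int prot)
{
	if (fp->f_ops->fo_mmap == NULL)
		return (ENODEV);
	return (fp->f_ops->fo_mmap(fp, len, prot));
}
```

Keeping the ENODEV default in the wrapper means a file type opts in to mapping simply by filling in the hook, with no change needed in the generic mmap() path.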