freebsd-dev

Author	SHA1	Message	Date
Alan Cox	2d612d2dd2	When tmpfs and POSIX shm pagein a page for the sole purpose of performing truncation, immediately queue the page for asynchronous laundering rather than making the page pass through inactive queue first. Reviewed by: kib, markj	2016-12-11 19:24:41 +00:00
Konrad Witaszczyk	480f31c214	Add support for encrypted kernel crash dumps. Changes include modifications in kernel crash dump routines, dumpon(8) and savecore(8). A new tool called decryptcore(8) was added. A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump configuration in the diocskerneldump_arg structure to the kernel. The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for backward ABI compatibility. dumpon(8) generates an one-time random symmetric key and encrypts it using an RSA public key in capability mode. Currently only AES-256-CBC is supported but EKCD was designed to implement support for other algorithms in the future. The public key is chosen using the -k flag. The dumpon rc(8) script can do this automatically during startup using the dumppubkey rc.conf(5) variable. Once the keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O control. When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random IV and sets up the key schedule for the specified algorithm. Each time the kernel tries to write a crash dump to the dump device, the IV is replaced by a SHA-256 hash of the previous value. This is intended to make a possible differential cryptanalysis harder since it is possible to write multiple crash dumps without reboot by repeating the following commands: # sysctl debug.kdb.enter=1 db> call doadump(0) db> continue # savecore A kernel dump key consists of an algorithm identifier, an IV and an encrypted symmetric key. The kernel dump key size is included in a kernel dump header. The size is an unsigned 32-bit integer and it is aligned to a block size. The header structure has 512 bytes to match the block size so it was required to make a panic string 4 bytes shorter to add a new field to the header structure. If the kernel dump key size in the header is nonzero it is assumed that the kernel dump key is placed after the first header on the dump device and the core dump is encrypted. Separate functions were implemented to write the kernel dump header and the kernel dump key as they need to be unencrypted. The dump_write function encrypts data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps are not supported due to the way they are constructed which makes it impossible to use the CBC mode for encryption. It should be also noted that textdumps don't contain sensitive data by design as a user decides what information should be dumped. savecore(8) writes the kernel dump key to a key.# file if its size in the header is nonzero. # is the number of the current core dump. decryptcore(8) decrypts the core dump using a private RSA key and the kernel dump key. This is performed by a child process in capability mode. If the decryption was not successful the parent process removes a partially decrypted core dump. Description on how to encrypt crash dumps was added to the decryptcore(8), dumpon(8), rc.conf(5) and savecore(8) manual pages. EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU. The feature still has to be tested on arm and arm64 as it wasn't possible to run FreeBSD due to the problems with QEMU emulation and lack of hardware. Designed by: def, pjd Reviewed by: cem, oshogbo, pjd Partial review: delphij, emaste, jhb, kib Approved by: pjd (mentor) Differential Revision: https://reviews.freebsd.org/D4712	2016-12-10 16:20:39 +00:00
Mark Johnston	02315a6759	Use a consistent snapshot of the lock state in owner_mtx(). MFC after: 2 weeks	2016-12-10 02:59:34 +00:00
Mark Johnston	c365a2934e	Return a non-NULL owner only if the lock is exclusively held in owner_sx(). Fix some whitespace bugs while here. MFC after: 2 weeks	2016-12-10 02:56:44 +00:00
Gleb Smirnoff	5040da77c1	Use acquire write to cr_lock to complement with release write at end of locked region. Submitted by: kib	2016-12-09 19:07:31 +00:00
Gleb Smirnoff	169170209c	Provide counter_ratecheck(), a MP-friendly substitution to ppsratecheck(). When rated event happens at a very quick rate, the ppsratecheck() is not only racy, but also becomes a performance bottleneck. Together with: rrs, jtl	2016-12-09 17:58:34 +00:00
Robert Watson	52b42f6287	Regnerate system-call definitions following r309677 correcting a whitespace glitch in syscalls.master.	2016-12-07 16:12:27 +00:00
Robert Watson	82d8d2b8bc	Replace spaces with tabs in definition of SCTP system calls, for consistency with the remainder of the syscalls.master file. This problem does not occur in the freebsd32 version of the same system calls.	2016-12-07 16:11:55 +00:00
Eric van Gyzen	3d32d4a7c9	Export the whole thread name in kinfo_proc kinfo_proc::ki_tdname is three characters shorter than thread::td_name. Add a ki_moretdname field for these three extra characters. Add the new field to kinfo_proc32, as well. Update all in-tree consumers to read the new field and assemble the full name, except for lldb's HostThreadFreeBSD.cpp, which I will handle separately. Bump __FreeBSD_version. Reviewed by: kib MFC after: 1 week Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D8722	2016-12-07 15:04:22 +00:00
Konstantin Belousov	435da98564	Restructure the code to handle reporting of non-exited processes from wait(2). - Do not acquire the process spinlock if neither WTRAPPED nor WUNTRACED options were passed [1]. - Extract the code to report alive process into a new helper report_alive_proc() and use it for trapped, stopped and continued childrens. Note that the process spinlock is required around the WTRAPPED and WUNTRACED tests, because P_STOPPED_TRACE and P_STOPPED_SIG flags are set before other threads are stopped at the suspension point, and that threads increment p_suspcount while owning only the process spinlock, the process lock is dropped by them. If the spinlock is not taken for tests, the syscall thread might miss both p_suspcount increment and wakeup in wakeup in thread_suspend_switch(). Based on the submission by: mjg [1] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-04 20:44:58 +00:00
Eric van Gyzen	ff07dd913e	thr_set_name(): silently truncate the given name as needed Instead of failing with ENAMETOOLONG, which is swallowed by pthread_set_name_np() anyway, truncate the given name to MAXCOMLEN+1 bytes. This is more likely what the user wants, and saves the caller from truncating it before the call (which was the only recourse). Polish pthread_set_name_np(3) and add a .Xr to thr_set_name(2) so the user might find the documentation for this behavior. Reviewed by: jilles MFC after: 3 days Sponsored by: Dell EMC	2016-12-03 01:14:21 +00:00
Mateusz Guzik	a2d3554542	vfs: provide fake locking primitives for the crossmp vnode Since the vnode is only expected to be shared locked, we can save a little overhead by only pretending we are locking in the first place. Reviewed by: kib Tested by: pho	2016-12-02 18:03:15 +00:00
Mateusz Guzik	a4ce25b5b0	vfs: fix a whitespace nit in r309307	2016-11-30 02:17:03 +00:00
Mateusz Guzik	1babea0341	vfs: avoid VOP_ISLOCKED in the common case in lookup	2016-11-30 02:14:53 +00:00
Mark Johnston	64910ddbff	Launder VPO_NOSYNC pages upon vnode deactivation. As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be laundered. This behaviour has two problems: 1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be reclaimed, and this work may be unfairly deferred to the vnlru process or an unrelated application when the system is under vnode pressure. 2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if the laundry thread needs to launder pages from an unreferenced such vnode, it will reactivate and deactivate the vnode with each laundering, potentially resulting in a large number of expensive scans. Therefore, ensure that all dirty pages are laundered upon deactivation, i.e., when all maps of the vnode are removed and all references are released. Reviewed by: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8641	2016-11-26 21:00:27 +00:00
John Baldwin	9f3aabb9eb	Permit timed sleeps for threads other than thread0 before timers are working. The callout subsystem already handles early callouts and schedules the first clock interrupt appropriately based on the currently pending callouts. The one nit to fix was that callouts scheduled via C_HARDCLOCK during early boot could fire too early once timers were enabled as the per-CPU base time is always zero until timers are initialized. The change in callout_when() handles this case by using the current uptime as the base time of the callout during bootup if the per-CPU base time is zero. Reviewed by: kib MFC after: 2 weeks Sponsored by: Netflix	2016-11-25 18:02:43 +00:00
Mateusz Guzik	746b6e8176	wait: avoid relocking the child if proc_to_reap returns 1 proc_to_reap would always unlock. However, if it returned 1, kern_wait6 would immediately lock it again. Save the dance. Reviewed by: kib	2016-11-24 18:21:48 +00:00
Mateusz Guzik	8b0e0c91e0	cache: ensure that the number of bucket locks does not exceed hash size The size can be changed by side effect of modifying kern.maxvnodes. Since numbucketlocks was not modified, setting a sufficiently low value would give more locks than actual buckets, which would then lead to corruption. Force the number of buckets to be not smaller. Note this should not matter for real world cases. Reported and tested by: pho	2016-11-23 19:50:12 +00:00
Mark Johnston	99e6e1930c	Release laundered vnode pages to the head of the inactive queue. The swap pager enqueues laundered pages near the head of the inactive queue to avoid another trip through LRU before reclamation. This change adds support for this behaviour to the vnode pager and makes use of it in UFS and ext2fs. Some ioflag handling is consolidated into a common subroutine so that this support can be easily extended to other filesystems which make use of the buffer cache. No changes are needed for ZFS since its putpages routine always undirties the pages before returning, and the laundry thread requeues the pages appropriately in this case. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D8589	2016-11-23 17:53:07 +00:00
Ruslan Bukin	dd7d4f199e	Revert r306186 ("Adjust the sopt_val pointer on bigendian systems"). This logic doesn't work with bigger sopt_valsize (e.g. when ipfw passing 2048 bytes rule). Reported by: adrian Sponsored by: DARPA, AFRL	2016-11-22 18:31:43 +00:00
Konstantin Belousov	eb962424ba	Restore vnode pager statistic for buffer pagers. Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8585	2016-11-22 10:06:39 +00:00
John Baldwin	5d8cce1764	Initialize 'ticks' earlier in boot after 'hz' is set. This avoids the time-warp after kthreads have started running and the required fixup to td_slptick and td_blktick in the EARLY_AP_STARTUP case. Now, 'ticks' is initialized before any kthreads are created or any context switches are performed. Tested by: gavin MFC after: 2 weeks Sponsored by: Netflix	2016-11-22 01:02:59 +00:00
Robert Watson	1279fdafce	Audit 'fd' and 'cmd' arguments to fcntl(2), and when generating BSM, always audit the file-descriptor number and vnode information for all fnctl(2) commands, not just locking-related ones. This was likely an oversight in the original adaptation of this code from XNU. MFC after: 3 days Sponsored by: DARPA, AFRL	2016-11-22 00:41:24 +00:00
Gleb Smirnoff	00b5ffde8e	Add flag SF_USER_READAHEAD to sendfile(2). When specified, the syscall won't do any speculations about readahead, and use exactly the amount of readahead specified by user. E.g. setting SF_FLAGS(0, SF_USER_READAHEAD) will guarantee that no readahead at all will be performed.	2016-11-17 21:36:18 +00:00
Gleb Smirnoff	5dba303d01	Use bogus_page to properly reduce number of I/Os in sendfile(2). The new sendfile_swapin() loop works this way: - Find first invalid page in the request. - Do vm_pager_has_page() and get count of pages, that can be taken in single I/O. - Trim valid pages from the end of the request. - Cycle through the request and substitute to bogus_page all valid pages that are in the middle of the request. - After I/O launched (pager copies array of pages into buf(9), it is important to restore proper page pointers with help vm_page_lookup(). Count bogus pages used and report them in sendfile stats.	2016-11-17 21:02:55 +00:00
Ruslan Bukin	6e18247a3d	Fix build when no INET and INET6 in kernel config. Submitted by: kan Sponsored by: DARPA, AFRL	2016-11-17 16:13:30 +00:00
Alan Cox	7667839a7e	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497	2016-11-15 18:22:50 +00:00
Mateusz Guzik	6ce45c6ac3	cache: plug a write-only variable in cache_negative_zap_one	2016-11-15 03:43:10 +00:00
Mateusz Guzik	317cac6d5a	cache: fix a race between entry removal and demotion The negative list shrinker can demote an entry with only hotlist + neglist locks held. On the other hand entry removal possibly sets the NCF_DVDROP without aformentioned locks held prior to detaching it from the respective netlist., which can lose the update made by the shrinker. Reported and tested by: truckman	2016-11-15 03:38:05 +00:00
Adrian Chadd	8ffa01a061	[mips] enable relbuf on mips for now to work around page aliasing in mips hardware. Although the higher end MIPS hardware handles cache aliasing issues in hardware, the older cores (r4k, etc) and some compile versions of the newer cores (mips24k, mips34k, mips74k) don't have this feature. This means we end up with some very unfortunate behaviour that was made very obvious by some recent changes to the FFS pager by kib. So, flip this off until we get our MIPS pmap/cache code upgraded to handle aliased pages in software. Discussed with: kib, bsdimp, juli	2016-11-15 01:41:45 +00:00
Adrian Chadd	0046bef85a	[mips] make UMTX_CHAINS configurable at compile time. The default (512) wastes quite a bit of space which doesn't really buy us much on highly embedded systems which don't take a lot of locks in parallel. This makes it at least build time configurable so people can experiment.	2016-11-15 01:34:38 +00:00
Konstantin Belousov	ae44bb0146	Initialize reserved bytes in struct mq_attr and its 32compat counterpart, to avoid kernel stack content leak in kmq_setattr(2) syscall. Also slightly simplify the checks around copyout()s. Reported by: Vlad Tsyrklevich <vlad902+spam@gmail.com> PR: 214488 MFC after: 1 week	2016-11-14 13:20:10 +00:00
Konstantin Belousov	714b7df502	Provide simple mutual exclusion between mount point update and unmount. Currently mount update keeps vfs_busy(9) reference on the mount point during MNT_UPDATE VFS_MOUNT() vfsops call. This already provides the exclusion, but is problematic for filesystems which need to perform namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh mnt_from path, because namei(9) must not be called while the vfs_busy(9) reference is owned. Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing syscalls with EBUSY if conflict is detected. Keep vfs_busy(9) reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS KPI. In the update path in ffs_mount(), drop vfs_busy() reference around namei(), which is now safe due to unmount never executing in parallel with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-11-13 21:49:51 +00:00
Konstantin Belousov	9eb8f495b8	Move common cleanup code into helper. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-13 21:39:55 +00:00
Justin Hibbits	6487a709f3	Add two new ddb commands: show device/show all devices Shows several useful pieces of information from the device including the softc and ivars pointers.	2016-11-13 00:46:11 +00:00
John Baldwin	892f0ab0ab	Allow scheduling during early boot. - Send IPI wakeups once SMP is started even if cold is true. - Permit preemptions when cold is true. These changes are needed for EARLY_AP_STARTUP. MFC after: 2 weeks Sponsored by: Netflix	2016-11-12 00:23:09 +00:00
John Baldwin	a6b91f0f45	Don't place threads on the run queue after waking up other CPUs. The other CPU might resume and see a still-empty runq and go back to sleep before sched_add() adds the thread to the runq. This results in a lost wakeup and a potential hang if the system is otherwise completely idle. The race originated due to a micro-optimization (my fault) in 4BSD in that it avoided putting a thread on the run queue if the scheduler was going to preempt to the new thread. To avoid complexity while fixing this race, just drop this optimization. 4BSD now always sets the "owepreempt" flag when a preemption is warranted and defers the actual preemption to the thread_unlock of the caller the same as ULE. MFC after: 2 weeks Sponsored by: Netflix	2016-11-12 00:14:13 +00:00
Bryan Drewery	28323add09	Fix improper use of "its". Sponsored by: Dell EMC Isilon	2016-11-08 23:59:41 +00:00
Konstantin Belousov	9a639daf77	Tweaks for the buffer pager. Pass current thread credentials instead of NOCRED. Only allow unmapped buffers for filesystem which proclaimed the support. For all filesystems which currently use buffer pager (UFS, msdosfs and cd9660), the changes are effectively nop. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-08 10:10:55 +00:00
Konstantin Belousov	9bd4f0a2c6	vn_fullpath1() checked VV_ROOT and then unreferenced vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race. Lock vnode after we noticed the VV_ROOT flag. See comments for explanation why unlocked check for the flag is considered safe. Reported and tested by: avg Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-07 10:55:56 +00:00
Konstantin Belousov	75409ce1dd	Remove remnants of the recursive sleep support. Instead assert that we never try to sleep while the thread is on a sleepqueue. Reviewed by: jhb Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8422	2016-11-02 20:57:20 +00:00
Konstantin Belousov	7359fdcf5f	Allow some dotdot lookups in capability mode. If dotdot lookup does not escape from the file descriptor passed as the lookup root, we can allow the component traversal. Track the directories traversed, and check the result of dotdot lookup against the recorded list of the directory vnodes. Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently disabled by default until more verification of the approach is done. Disallow non-local filesystems for dotdot, since remote server might conspire with the local process to allow it to escape the namespace. This might be too cautious, provide the knob vfs.lookup_cap_dotdot_nonlocal to override as well. Idea by: rwatson Discussed with: emaste, jonathan, rwatson Reviewed by: mjg (previous version) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 week Differential revision: https://reviews.freebsd.org/D8110	2016-11-02 12:43:15 +00:00
Konstantin Belousov	1bf6a0900d	Remove tautological casts. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-02 12:10:39 +00:00
Konstantin Belousov	ec84693535	Style fixes. Discussed with: emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-02 12:02:31 +00:00
Edward Tomasz Napierala	53ae7e833c	Fix getfsstat(2) with MNT_WAIT to not skip filesystems that are in the process of being unmounted. Previously it would skip them, even if the unmount eventually failed eg due to the filesystem being busy. This behaviour broke autounmountd(8) - if you tried to manually unmount a mounted filesystem, using 'automount -u', and the autounmountd attempted to refresh the filesystem list in that very moment, it would conclude that the filesystem got unmounted and not try to unmount it afterwards. Reviewed by: kib@ Tested by: pho@ MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8030	2016-11-02 09:43:19 +00:00
Conrad Meyer	8532d381a9	Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code. This can be handy in tracking down what code touched hung bios and bufs last. The full history is especially useful, but adds enough bloat that it shouldn't be enabled in release builds. Function names (or arbitrary string constants) are tracked in a fixed-size ring in bufs. Bios gain a pointer to the upper buf for tracking. SCSI CCBs gain a pointer to the upper bio for tracking. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8366	2016-10-31 23:09:52 +00:00
Mark Johnston	ac91917211	Fix WITNESS hints for pagequeue locks. MFC after: 1 week	2016-10-29 20:01:48 +00:00
Edward Tomasz Napierala	6eeff7a7b2	Fix getfsstat(2) handling of flags. The 'flags' argument is an enum, not a bitfield. For the intended usage - being passed either MNT_WAIT, or MNT_NOWAIT - this shouldn't introduce any changes in behaviour. Reviewed by: jhb@ MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8373	2016-10-29 12:38:30 +00:00
Konstantin Belousov	c39baa7480	Generalize UFS buffer pager to allow it serving other filesystems which also use buffer cache. Most important addition to the code is the handling of filesystems where the block size is less than the machine page size, which might require reading several buffers to validate single page. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-10-28 11:43:59 +00:00
Marcel Moolenaar	07f862a769	Include <stdarg.h> instead of <machine/stdarg.h> when compiled as part of libsbuf. The former is the standard header, and allows us to compile libsbuf on macOS/linux.	2016-10-24 18:03:04 +00:00
Konstantin Belousov	835c2787be	Handle broadcast NMIs. On several Intel chipsets, diagnostic NMIs sent from BMC or NMIs reporting hardware errors are broadcasted to all CPUs. When kernel is configured to enter kdb on NMI, the outcome is problematic, because each CPU tries to enter kdb. All CPUs are executing NMI handlers, which set the latches disabling the nested NMI delivery; this means that stop_cpus_hard(), used by kdb_enter() to stop other cpus by broadcasting IPI_STOP_HARD NMI, cannot work. One indication of this is the harmless but annoying diagnostic "timeout stopping cpus". Much more harming behaviour is that because all CPUs try to enter kdb, and if ddb is used as debugger, all CPUs issue prompt on console and race for the input, not to mention the simultaneous use of the ddb shared state. Try to fix this by introducing a pseudo-lock for simultaneous attempts to handle NMIs. If one core happens to enter NMI trap handler, other cores see it and simulate reception of the IPI_STOP_HARD. More, generic_stop_cpus() avoids sending IPI_STOP_HARD and avoids waiting for the acknowledgement, relying on the nmi handler on other cores suspending and then restarting the CPU. Since it is impossible to detect at runtime whether some stray NMI is broadcast or unicast, add a knob for administrator (really developer) to configure debugging NMI handling mode. The updated patch was debugged with the help from Andrey Gapon (avg) and discussed with him. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8249	2016-10-24 16:40:27 +00:00
Konstantin Belousov	55ee7a4c5f	In the fueword64(9) wrapper for architectures which do not implemented native fueword64(9) still, use proper type for local where fuword64() result is stored. Note that fueword64() is unused in the tree. Submitted by: Chunhui He <hchunhui@mail.ustc.edu.cn> PR: 212520 MFC after: 1 week	2016-10-23 11:23:17 +00:00
Conrad Meyer	8798ef0679	ddb(4): Add sleepchains to "show allchains" Reported by: markj Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8320	2016-10-22 18:02:20 +00:00
Hiren Panchasara	9d71a3975e	Rework r306337. In sendit(), if mp->msg_control is present, then in sockargs() we are allocating mbuf to store mp->msg_control. Later in kern_sendit(), call to getsock_cap(), will check validity of file pointer passed, if this fails EBADF is returned but mbuf allocated in sockargs() is not freed. Made code changes to free the same. Since freeing control mbuf in sendit() after checking (control != NULL) may lead to double freeing of control mbuf in sendit(), we can free control mbuf in kern_sendit() if there are any errors in the routine. Submitted by: Lohith Bellad <lohith.bellad@me.com> Reviewed by: glebius MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D8152	2016-10-21 18:27:30 +00:00
Mariusz Zaborski	4b83a77606	capsicum: perform copyout without the fildesc lock held in sys_cap_ioctls_get Reviewed by: pjd	2016-10-21 16:12:23 +00:00
Mateusz Guzik	bb697a20d7	cache: fix up a corner case in r307650 If no negative entry is found on the last list, the ncp pointer will be left uninitialized and a non-null value will make the function assume an entry was found. Fix the problem by initializing to NULL on entry. Reported by: glebius	2016-10-20 19:55:50 +00:00
Kevin Lo	61f481fb7e	Remove register keyword. Reviewed by: kib	2016-10-20 01:21:10 +00:00
Kevin Lo	7c68685366	Remove a sentence about putting initialization in init_proc.c or kern_proc.c and useless comment. Reviewed by: kib	2016-10-20 01:19:37 +00:00
Sean Bruno	026204b4c6	Resolve whitespace diff to NextBSD. Check to see that the taskqueue thread count requires us to acutally iterate over the thread count to bind to cpus. Submitted by: mmacy@nextbsd.org	2016-10-19 21:01:24 +00:00
Mateusz Guzik	53dc58f2dc	Mark a bunch of mpsafe sysctls as such. This gives me a sysctl Giant-free buildworld.	2016-10-19 19:42:01 +00:00
Mateusz Guzik	a45a1a25b8	cache: split negative entry LRU into multiple lists This splits the ncneg_mtx lock while preserving the hit ratio at least during buildworld. Create N dedicated lists for new negative entries. Entries with at least one hit get promoted to the hot list, where they get requeued every M hits. Shrinking demotes one hot entry and performs a round-robin shrinking of regular lists. Reviewed by: kib	2016-10-19 18:29:52 +00:00
Sean Bruno	abf38392c6	Assert that we're assigning a non-null taskqueue. ref: `535865d02c` Fix cpu assignment by assuring stride is non-zero, assert that all tasks have a valid taskqueue. ref: `db39817623` Start cpu assignment from zero. ref: `d99d39b6b6` Submitted by: mmacy@nextbsd.org	2016-10-18 14:00:26 +00:00
Sean Bruno	12d1b8c9f3	Ensure that tasks with a specific cpu set prior to smp starting get re-attached to a thread running on that cpu. ref: `fcc20e306b` Submitted by: mmacy@nextbsd.org	2016-10-18 13:55:34 +00:00
Sean Bruno	dc35f36560	Tell gtask to what we've been bound. ref: `54414984cf` Submitted by: mmacy@nextbsd.org	2016-10-18 13:16:27 +00:00
Ed Maste	9e62195361	makesyscalls.sh: remove trailing space on the "created from" line In r10905 and r10906 makesyscalls was modified to avoid emitting a literal $Id$ string in the generated file, with: gsub("[$]Id: ", "", $0) gsub(" [$]", "", $0) Then r11294 added some functionality and also tried to address the $Id$ problem in a different way, by removing every $: sed -e 's/\$//g ... This rendered the gsub infeffective. The gsub was later updated to track the $Id$ -> $FreeBSD$ switch, even though it did not do anything. Revert the addition of the s/\$//g, and update the gsub to keep the resulting format the same. Discussed with: bde MFC after: 1 week Sponsored by: The FreeBSD Foundation	2016-10-17 13:52:24 +00:00
Hans Petter Selasky	d3bf5efc1f	Fix device delete child function. When detaching device trees parent devices must be detached prior to detaching its children. This is because parent devices can have pointers to the child devices in their softcs which are not invalidated by device_delete_child(). This can cause use after free issues and panic(). Device drivers implementing trees, must ensure its detach function detaches or deletes all its children before returning. While at it remove now redundant device_detach() calls before device_delete_child() and device_delete_children(), mostly in the USB controller drivers. Tested by: Jan Henrik Sylvester <me@janh.de> Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D8070 MFC after: 2 weeks	2016-10-17 10:20:38 +00:00
Konstantin Belousov	5975e53d40	Fix a race in vm_page_busy_sleep(9). Suppose that we have an exclusively busy page, and a thread which can accept shared-busy page. In this case, typical code waiting for the page xbusy state to pass is again: VM_OBJECT_WLOCK(object); ... if (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(object); <---1 vm_page_busy_sleep(p, "vmopax"); goto again; } Suppose that the xbusy state owner locked the object, unbusied the page and unlocked the object after we are at the line [1], but before we executed the load of the busy_lock word in vm_page_busy_sleep(). If it happens that there is still no waiters recorded for the busy state, the xbusy owner did not acquired the page lock, so it proceeded. More, suppose that some other thread happen to share-busy the page after xbusy state was relinquished but before the m->busy_lock is read in vm_page_busy_sleep(). Again, that thread only needs vm_object lock to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to the VPB_SHARERS_WORD(1). In this case, all tests in vm_page_busy_sleep(9) pass and we are going to sleep, despite the page being share-busied. Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9) to also accept shared-busy state if we only wait for the xbusy state to pass. Merge sequential if()s with the same 'then' clause in vm_page_busy_sleep(). Note that the current code does not share-busy pages from parallel threads, the only way to have more that one sbusy owner is right now is to recurse. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8196	2016-10-13 14:41:05 +00:00
Conrad Meyer	d9ce8a41ea	kern_linker: Handle module-loading failures in preloaded .ko files The runtime kernel loader, linker_load_file, unloads kernel files that failed to load all of their modules. For consistency, treat preloaded (loader.conf loaded) kernel files in the same way. Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8200	2016-10-13 02:06:23 +00:00
Ed Maste	2a059700b6	Use correct size type in do_setopt_accept_filter Submitted by: ecturt@gmail.com	2016-10-12 00:56:49 +00:00
Oleksandr Tymoshenko	609b0fe966	INTRNG - fix MSI/MSIX release path Use isrc in attached MSI data structure instead of using map's isrc directly. map's isrc is set to NULL on IRQ deactivation which happens prior to pci_release_msi so MSI_RELEASE_MSI receives array of NULLs Reviewed by: mmel Differential Revision: https://reviews.freebsd.org/D8206	2016-10-11 17:00:29 +00:00
Sean Bruno	1ee17b070d	Fix bug where malloc(.., M_NOWAIT) return value is not checked, Change to M_WAITOK and move outside the mutex Submitted by: shurd Reviewed by: mmacy@nextbsd.org MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D7649	2016-10-11 14:08:53 +00:00
Mateusz Guzik	45571f8886	vfs: assert empty tmp free list on unmount	2016-10-08 13:38:05 +00:00
Mateusz Guzik	c6c44ff7eb	vfs: clear the tmp free list flag before taking the free vnode list lock Safe access is already guaranteed because of the mnt_listmx lock.	2016-10-08 13:36:59 +00:00
Konstantin Belousov	f71d08566c	Limit scope of the optimization in r306608 to dounmount() caller only. Other uses of cache_purgevfs() do rely on the cache purge for correct operations, when paths are invalidated without unmount. Reported and tested by: jkim Discussed with: mjg Sponsored by: The FreeBSD Foundation	2016-10-07 11:38:28 +00:00
Bryan Drewery	32641585a9	vrefl: Assert that the interlock is held. Sponsored by: Dell EMC Isilon MFC after: 2 weeks	2016-10-06 18:10:19 +00:00
Bryan Drewery	5a22c9582c	Add vrecyclel() to vrecycle() a vnode with the interlock already held. Obtained from: OneFS Sponsored by: Dell EMC Isilon MFC after: 2 weeks	2016-10-06 18:09:22 +00:00
Conrad Meyer	f43292ecf4	vfs_bio: Remove a leading space (style) Introduced in r282085. Sponsored by: Dell EMC Isilon	2016-10-05 23:42:02 +00:00
Bryan Drewery	0617f64ec6	Correct some comments after r294299. Sponsored by: Dell EMC Isilon	2016-10-04 21:44:20 +00:00
Ed Maste	65eea7ede6	ANSIfy inflate.c Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D8143	2016-10-04 17:57:30 +00:00
Konstantin Belousov	5420f76b59	Style. Reviewed by: emaste Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-10-04 15:23:03 +00:00
Mateusz Guzik	4876636eb7	cache: ignore purgevfs requests for filesystems with few vnodes purgevfs is purely optional and induces lock contention in workloads which frequently mount and unmount filesystems. In particular, poudriere will do this for filesystems with 4 vnodes or less. Full cache scan is clearly wasteful. Since there is no explicit counter for namecache entries, the number of vnodes used by the target fs is checked. The default limit is the number of bucket locks. Reviewed by: kib	2016-10-03 00:02:32 +00:00
Mateusz Guzik	5bb81f9b2d	vfs: batch free vnodes in per-mnt lists Previously free vnodes would always by directly returned to the global LRU list. With this change up to mnt_free_list_batch vnodes are collected first. syncer runs always return the batch regardless of its size. While vnodes on per-mnt lists are not counted as free, they can be returned in case of vnode shortage. Reviewed by: kib Tested by: pho	2016-09-30 17:27:17 +00:00
Mateusz Guzik	8660b707ff	vfs: remove the __bo_vnode field from struct vnode The pointer can be obtained using __containerof instead. Reviewed by: kib	2016-09-30 17:11:03 +00:00
Gleb Smirnoff	7ed6b78b92	Provide kern.maxphys sysctl, which returns MAXPHYS. Naming matches NetBSD.	2016-09-29 23:07:28 +00:00
Allan Jude	0176ca2ed5	Allow reading the following sysctl MIBs in capability mode: kern.hostname, kern.domainname, and kern.hostuuid This allows sandboxed applications to read these sysctls Submitted by: cem (original version) Reviewed by: cem, jonathan, rwatson (original version) Sponsored by: ScaleEngine Inc. Differential Revision: https://reviews.freebsd.org/D8015	2016-09-29 16:29:49 +00:00
Hans Petter Selasky	99eca1b2b3	While draining a timeout task prevent the taskqueue_enqueue_timeout() function from restarting the timer. Commonly taskqueue_enqueue_timeout() is called from within the task function itself without any checks for teardown. Then it can happen the timer stays active after the return of taskqueue_drain_timeout(), because the timeout and task is drained separately. This patch factors out the teardown flag into the timeout task itself, allowing existing code to stay as-is instead of applying a teardown flag to each and every of the timeout task consumers. Add assert to taskqueue_drain_timeout() which prevents parallel execution on the same timeout task. Update manual page documenting the return value of taskqueue_enqueue_timeout(). Differential Revision: https://reviews.freebsd.org/D8012 Reviewed by: kib, trasz MFC after: 1 week	2016-09-29 10:38:20 +00:00
Hiren Panchasara	7c9a4d09d6	Revert r306337. dhw@ reproted a panic which seems related to this and bde@ has raised some issues.	2016-09-26 15:45:30 +00:00
Eric van Gyzen	310ab671b8	Make no assertions about mutex state when the scheduler is stopped. This changes the assert path to match the lock and unlock paths. MFC after: 1 week Sponsored by: Dell EMC	2016-09-26 15:30:30 +00:00
Hiren Panchasara	41bb1a25a9	In sendit(), if mp->msg_control is present, then in sockargs() we are allocating mbuf to store mp->msg_control. Later in kern_sendit(), call to getsock_cap(), will check validity of file pointer passed, if this fails EBADF is returned but mbuf allocated in sockargs() is not freed. Fix this possible leak. Submitted by: Lohith Bellad <lohith.bellad@me.com> Reviewed by: adrian MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D7910	2016-09-26 10:13:58 +00:00
Julian Elischer	1c8260b61d	Give the user a clue as to which process hit maxfiles. MFC after: 1 week Sponsored by: Panzura	2016-09-24 22:56:13 +00:00
Konstantin Belousov	939457e3e0	Add the foundation copyrights to procctl kernel sources. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-23 12:32:20 +00:00
Mariusz Zaborski	ad5e83dd3c	fd: fix up fget_cap If the kernel is not compiled with the CAPABILITIES kernel options fget_unlocked doesn't return the sequence number so fd_modify will always report modification, in that case we got infinity loop. Reported by: br Reviewed by: mjg Tested by: br, def	2016-09-23 08:13:46 +00:00
Mateusz Guzik	deffc4a026	fd: fix up fgetvp_rights after r306184 fget_cap_locked returns a referenced file, but the fgetvp_rights does not need it. Instead, due to the filedesc lock being held, it can ref the vnode after the file was looked up. Fix up fget_cap_locked to be consistent with other _locked helpers and not ref the file. This plugs a leak introduced in r306184. Pointy hat to: mjg, oshogbo	2016-09-23 06:51:46 +00:00
Mateusz Guzik	1d2541fd1a	cache: get rid of the global lock Add a table of vnode locks and use them along with bucketlocks to provide concurrent modification support. The approach taken is to preserve the current behaviour of the namecache and just lock all relevant parts before any changes are made. Lookups still require the relevant bucket to be locked. Discussed with: kib Tested by: pho	2016-09-23 04:45:11 +00:00
Gleb Smirnoff	a2d8f9d2fc	Fix regression from r297400, which truncates headers in case of low socket buffer and put a small optimization for low socket buffer case: - Do not hack uio_resid, and let m_uiotombuf() properly take care of it. This fixes truncation of headers at low buffer. - If headers ate all the space, jump right to the end of the cycle, to avoid doing single page I/O and allocating zero length mbuf. - Clear hdr_uio only if space is positive, which indicates that all uio was copied in. Reviewed by: pluknet, jtl, emax, rrs, lstewart, emax, gallatin, scottl	2016-09-22 20:34:44 +00:00
Ruslan Bukin	30f3bfe58e	Adjust the sopt_val pointer on bigendian systems (e.g. MIPS64EB). sooptcopyin() checks if size of data provided by user is <= than we can accept, else it strips down the size. On bigendian platforms we have to move pointer as well so we copy the actual data. Reviewed by: gnn Sponsored by: DARPA, AFRL Sponsored by: HEIF5 Differential Revision: https://reviews.freebsd.org/D7980	2016-09-22 12:41:53 +00:00
Mariusz Zaborski	6490bc6529	fd: simplify fgetvp_rights by using fget_cap_locked Reviewed by: mjg	2016-09-22 11:54:20 +00:00
Mariusz Zaborski	85b0f9de11	capsicum: propagate rights on accept(2) Descriptor returned by accept(2) should inherits capabilities rights from the listening socket. PR: 201052 Reviewed by: emaste, jonathan Discussed with: many Differential Revision: https://reviews.freebsd.org/D7724	2016-09-22 09:58:46 +00:00
Mark Johnston	bdaf6d6913	Regenerate syscall provider argument strings.	2016-09-22 04:50:03 +00:00
Mark Johnston	5a4dfc8d83	Annotate syscall provider pointer arguments with the "userland" keyword. This causes dtrace to automatically copyin arguments from userland, so one no longer has to explicitly use the copyin() action to do so. Moreover, copyin() on userland addresses is a no-op, so existing scripts should be unaffected by this change. Discussed with: rstone MFC after: 2 weeks	2016-09-22 04:49:31 +00:00
Konstantin Belousov	851194715d	Make resettodr_lock accessible outside subr_rtc.c. Protect CLOCK_GETTIME() with the lock. Now all time-related accesses to the CMOS for RTC should be under the lock. This is needed to allow upcoming EFI Runtime Services support to provide required execution environment for the firmware calls. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-09-21 10:15:08 +00:00
Konstantin Belousov	643f6f47fd	Add PROC_TRAPCAP procctl(2) controls and global sysctl kern.trap_enocap. Both can be used to cause processes in capability mode to receive SIGTRAP when ENOTCAPABLE or ECAPMODE errors are returned from syscalls. Idea by: emaste Reviewed by: oshogbo (previous version), emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D7965	2016-09-21 08:23:33 +00:00
Edward Tomasz Napierala	e313b4dd95	Fix bug introduced with r302388, which could cause processes accessing automounted shares to hang with "vfs_busy" wchan. (As a workaround one can run 'automount -u' from cron.) Reviewed by: kib@ MFC after: 1 month	2016-09-21 05:44:13 +00:00
Sepherosa Ziehau	a5ec35dfee	Fix LINT building. Sponsored by: Microsoft	2016-09-18 07:37:00 +00:00
Ed Maste	69a2875821	Renumber license clauses in sys/kern to avoid skipping #3	2016-09-15 13:16:20 +00:00
Kevin Lo	c3bef61e58	Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead. Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D7878	2016-09-15 07:41:48 +00:00
Mariusz Zaborski	6e70b4f058	fd: add fget_cap and fget_cap_locked primitives They can be used to obtain capabilities along with a referenced fp. Reviewed by: mjg@	2016-09-12 22:46:19 +00:00
John Baldwin	71499f6a2d	Make device_quiet() an attachment property. In particular, reset the DF_QUIET flag when detaching from a device so that a driver that marks a device quiet doesn't dictate policy for a different driver that may claim the device in the future. Reviewed by: rpokala, wblock MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D7803	2016-09-12 18:06:42 +00:00
Mateusz Guzik	a27815330c	cache: improve scalability by introducing bucket locks An array of bucket locks is added. All modifications still require the global cache_lock to be held for writing. However, most readers only need the relevant bucket lock and in effect can run concurrently to the writer as long as they use a different lock. See the added comment for more details. This is an intermediate step towards removal of the global lock. Reviewed by: kib Tested by: pho	2016-09-10 16:29:53 +00:00
Konstantin Belousov	2e4fd101fa	Fix build	2016-09-10 09:00:12 +00:00
Jilles Tjoelker	d30e66e53a	wait: Do not copyout uninitialized status/rusage/wrusage. If wait4() or wait6() return 0 because of WNOHANG, the status, rusage and wrusage information should not be returned. PR: 212048 Reported by: Casey Lucas MFC after: 2 weeks	2016-09-09 21:58:48 +00:00
Mateusz Guzik	a0d45f0fc8	locks: add backoff for spin mutexes and thread lock Reviewed by: jhb	2016-09-09 19:13:02 +00:00
Ed Maste	82b3cec52b	ANSIfy uipc_syscalls.c Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D7839	2016-09-09 17:40:26 +00:00
Ed Maste	e62264e2dd	Update capabilities.conf comment getdtablesize is per-process state, not global state	2016-09-08 14:04:04 +00:00
Kevin Lo	cee4a05669	In m_devget(), if the data fits in a packet header mbuf, check the amount of data is less than or equal to MHLEN instead of MLEN when placing initial small packet header at end of mbuf. Reviewed by: glebius MFC after: 3 days	2016-09-08 01:02:53 +00:00
Brooks Davis	ed6d876b19	Modernize the initalization of sigproptbl. Use C99 designators to set the value of each slot and the nitems macro to check for valid entries. In the process, switch to indexing by signal number rather than signal-1 for improved clarity. Obtained from: CheriBSD (`a6053c5abf`) Sponsored by: DARPA, AFRL Reviewed by: kib	2016-09-06 22:03:53 +00:00
Mateusz Guzik	5b7d9ae2fd	cv: do a lockless check for no waiters in cv_signal and cv_broadcastpri In case of some consumers like zfs there are no waiters vast majority of the time Reviewed by: jhb MFC after: 1 week	2016-09-06 17:16:59 +00:00
Mateusz Guzik	591df14528	cache: defer freeing entries until after the global lock is dropped This also defers vdrop for held vnodes. Glanced at by: kib	2016-09-04 16:52:14 +00:00
Mateusz Guzik	31977b420a	cache: manage negative entry list with a dedicated lock Since negative entries are managed with a LRU list, a hit requires a modificaton. Currently the code tries to upgrade the global lock if needed and is forced to retry the lookup if it fails. Provide a dedicated lock for use when the cache is only shared-locked. Reviewed by: kib MFC after: 1 week	2016-09-04 08:58:35 +00:00
Mateusz Guzik	b9042ae1bf	cache: put all negative entry management code into dedicated functions Reviewed by: kib MFC after: 1 week	2016-09-04 08:55:15 +00:00
Mark Johnston	3da0f3c9ae	Micro-optimize sleepq_signal(). Lift a comparison out of the loop that finds the highest-priority thread on the queue. MFC after: 1 week	2016-09-04 00:29:48 +00:00
Brooks Davis	fd50a70770	Merge from CheriBSD: Rename sigprop-table constants to SIGPROP_ from SA_ to reduce the impression of a namespace collision. Submitted by: rwatson Reviewed by: jhb, kib (slightly different versions) Obtained from: CheriBSD (`814ec5771c`) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D7616	2016-09-02 18:22:56 +00:00
Ed Maste	dd38731e09	allow kern.proc.nfds sysctl in capability mode Reviewed by: allanjude MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D7733	2016-09-01 02:51:50 +00:00
Patrick Kelsey	da2ded6575	_taskqueue_start_threads() now fails if it doesn't actually start any threads. Reviewed by: jhb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D7701	2016-09-01 02:05:46 +00:00
Mark Johnston	99ab95db4d	Rename unp_dispose_so() to unp_dispose(). It implements the dom_dispose method for local socket domain, so its name should match the method name.	2016-08-31 21:48:22 +00:00
Ed Maste	bce38b9f35	Regnerate after r305140, getdtablesize in capability mode Sponsored by: The FreeBSD Foundation	2016-08-31 18:37:51 +00:00
Ed Maste	ca380195ab	Allow getdtablesize in capability mode getdtablesize is "trivial global state" and is similar to getrlimit(RLIMIT_NOFILE), so should be permitted in capability mode. Reviewed by: oshogbo MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D7719	2016-08-31 18:33:15 +00:00
Allan Jude	61bd7ae0ec	Eliminate unnecessary loop in _cap_check() Calling cap_rights_contains() several times with the same inputs is not going to produce a different output. The variable being iterated, i, is never used inside the for loop. The loop is actually done in cap_rights_contains() Submitted by: Ryan Moeller <ryan@freqlabs.com> Reviewed by: oshogbo, ed MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7369	2016-08-31 17:52:11 +00:00
Nathan Whitehorn	09c697016b	Back out misfired extra file in r305108.	2016-08-31 04:03:55 +00:00
Nathan Whitehorn	c9a124dc9a	Refix operation on sparse CPU mappings as in r302372, temporarily broken by r304716. PR: kern/210106 MFC after: 2 days	2016-08-31 04:02:52 +00:00
Mateusz Guzik	4cbafea09c	fd: add fdeget_locked and use in kern_descrip	2016-08-30 21:53:22 +00:00
Bryan Drewery	533f3e1026	Reduce duplicated logic for !SMP Sponsored by: EMC / Isilon Storage Division	2016-08-30 19:26:07 +00:00
John Baldwin	e05ec081fe	Implement 'devctl clear driver' to undo a previous 'devctl set driver'. Add a new 'clear driver' command for devctl along with the accompanying ioctl and devctl_clear_driver() library routine to reset a device to use a wildcard devclass instead of a fixed devclass. This can be used to undo a previous 'set driver' command. After the device's name has been reset to permit wildcard names, it is reprobed so that it can attach to newly-available (to it) device drivers. MFC after: 1 month Sponsored by: Chelsio Communications	2016-08-29 22:48:36 +00:00
Mateusz Guzik	11d3ad2eab	vfs: provide a common exit point in namei for error cases This shortens the function, adds the SDT_PROBE use for error cases and consistenly unrefs rootdir last. Reviewed by: kib MFC after: 2 weeks	2016-08-27 22:43:41 +00:00
Konstantin Belousov	9ce60e28fd	Consistently delimit each vnode description block with two blank lines. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-08-27 18:12:42 +00:00
Konstantin Belousov	0f2d97838d	In both do_rw_wrlock() and do_rw_rdlock() after r304808, do not obliterate possible error from sleep with errors from umtxq_check_susp(), when looping to clear URWLOCK_{READ,WRITE}_WAITERS. Noted and reviewed by: vangyzen Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-25 19:15:02 +00:00
Konstantin Belousov	28e21133f3	Prevent leak of URWLOCK_READ_WAITERS flag for urwlocks. If there was some error, e.g. the sleep was interrupted, as in the referenced PR, do_rw_rdlock() did not cleared URWLOCK_READ_WAITERS. Since unlock only wakes up write waiters when there is no read waiters, for URWLOCK_PREFER_READER kind of locks, the result was missed wakeups for writers. In particular, the most visible victims are ld-elf.so locks in processes which loaded libthr, because rtld locks are urwlocks in prefer-reader mode. Normal rwlocks fall into prefer-reader mode only if thread already owns rw lock in read mode, which is not typical and correspondingly less visible. In the PR, unowned rtld bind lock was waited for in the process where only one thread was left alive. Note that do_rw_wrlock() correctly clears URWLOCK_WRITE_WAITERS in case of errors. Reported and tested by: longwitz@incore.de PR: 211947 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-25 16:35:42 +00:00
Bruce Evans	d350ce61cf	Less-quick fix for locking fixes in r172250. r172250 added a second syscons spinlock for the output routine alone. It is better to extend the coverage of the first syscons spinlock added in r162285. 2 locks might work with complicated juggling, but no juggling was done. What the 2 locks actually did was to cover some of the missing locking in each other and deadlock less often against each other than a single lock with larger coverage would against itself. Races are preferable to deadlocks here, but 2 locks are still worse since they are harder to understand and fix. Prefer deadlocks to races and merge the second lock into the first one. Extend the scope of the spinlocking to all of sc_cnputc() instead of just the sc_puts() part. This further prefers deadlocks to races. Extend the kdb_active hack from sc_puts() internals for the second lock to all spinlocking. This reduces deadlocks much more than the other changes increases them. The s/p,10* test in ddb gets much further now. Hide this detail in the SC_VIDEO_LOCK() macro. Add namespace pollution in 1 nested #include and reduce namespace pollution in other nested #includes to pay for this. Move the first lock higher in the witness order. The second lock was unnaturally low and the first lock was unnaturally high. The second lock had to be above "sleepq chain" and/or "callout" to avoid spurious LORs for visual bells in sc_puts(). Other console driver locks are already even higher (but not adjacent like they should be) except when they are missing from the table. Audio bells also benefit from the syscons lock being high so that audio mutexes have chance of being lower. Otherwise, console drviver locks should be as low as possible. Non-spurious LORs now occur if the bell code calls printf() or is interrupted (perhaps by an NMI) and the interrupt handler calls printf(). Previous commits turned off many bells in console i/o but missed ones done by the teken layer.	2016-08-25 13:46:52 +00:00
Robert Watson	70a98c110e	Audit the accepted (or rejected) username argument to setlogin(2). (NB: This was likely a mismerge from XNU in audit support, where the text argument to setlogin(2) is captured -- but as a text token, whereas this change uses the dedicated login-name field in struct audit_record.) MFC after: 2 weeks Sponsored by: DARPA, AFRL	2016-08-20 20:28:08 +00:00
Robert Watson	c3c0088bb0	Audit additional vnode information in the implementation of the ftruncate(2) system call. This was not required by the Common Criteria, which needed only open-time audit. MFC after: 2 weeks Sponsored by: DARPA, AFRL	2016-08-20 18:51:48 +00:00
Mark Johnston	e5574e0966	Don't set P2_PTRACE_FSTP in a process that invokes ptrace(PT_TRACE_ME). Such processes are stopped synchronously by a direct call to ptracestop(SIGTRAP) upon exec. P2_PTRACE_FSTP causes the exec()ing thread to suspend itself while waiting for a SIGSTOP that never arrives. Reviewed by: kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D7576	2016-08-19 17:57:14 +00:00
Michal Meloun	895c8b1c39	INTRNG: Rework handling with resources. Partially revert r301453. - Read interrupt properties at bus enumeration time and store it into global mapping table. - At bus_activate_resource() time, given mapping entry is resolved and connected to real interrupt source. A copy of mapping entry is attached to given resource. - At bus_setup_intr() time, mapping entry stored in resource is used for delivery of requested interrupt configuration. - For MSI/MSIX interrupts, mapping entry is created within pci_alloc_msi()/pci_alloc_msix() call. - For legacy PCI interrupts, mapping entry must be created within pcib_route_interrupt() by pcib driver itself. Reviewed by: nwhitehorn, andrew Differential Revision: https://reviews.freebsd.org/D7493	2016-08-19 10:52:39 +00:00
Mark Johnston	7f649dda55	Correct a check for P2_PTRACE_FSTP in ptracestop(). MFC after: 1 day	2016-08-19 01:27:24 +00:00
George V. Neville-Neil	3e7e23332f	Remove the obsolete and unused openbsd_poll system call. (Phase 2) Reported by: brooks Reviewed by: brooks, jhb Differential Revision: https://reviews.freebsd.org/D7548	2016-08-18 10:54:39 +00:00
George V. Neville-Neil	5cba398b0c	Remove unusedd and obsolete openbsd_poll system call. (Phase 1) Reported by: brooks Reviewed by: brooks,jhb Differential Revision: https://reviews.freebsd.org/D7548	2016-08-18 10:50:40 +00:00
Bryan Drewery	b387915115	Garbage collect _umtx_lock(2)/_umtx_unlock(2) references removed in r263318. This has no real impact on the resulting libc.so file. MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2016-08-17 10:20:05 +00:00
Konstantin Belousov	e2a18110f0	Remove duplicated code. aio_aqueue() calls aio_init_aioinfo() as the first action. There is no need to duplicate the code in kern_aio_fsync(). Also fix indent for aio_aqueue() definition. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D7523	2016-08-17 10:14:22 +00:00
Konstantin Belousov	1680854946	Implement userspace gettimeofday(2) with HPET timecounter. Right now, userspace (fast) gettimeofday(2) on x86 only works for RDTSC. For older machines, like Core2, where RDTSC is not C2/C3 invariant, and which fall to HPET hardware, this means that the call has both the penalty of the syscall and of the uncached hw behind the QPI or PCIe connection to the sought bridge. Nothing can me done against the access latency, but the syscall overhead can be removed. System already provides mappable /dev/hpetX devices, which gives straight access to the HPET registers page. Add yet another algorithm to the x86 'vdso' timehands. Libc is updated to handle both RDTSC and HPET. For HPET, the index of the hpet device to mmap is passed from kernel to userspace, index might be changed and libc invalidates its mapping as needed. Remove cpu_fill_vdso_timehands() KPI, instead require that timecounters which can be used from userspace, to provide tc_fill_vdso_timehands{,32}() methods. Merge i386 and amd64 libc/<arch>/sys/__vdso_gettc.c into one source file in the new libc/x86/sys location. __vdso_gettc() internal interface is changed to move timecounter algorithm detection into the MD code. Measurements show that RDTSC even with the syscall overhead is faster than userspace HPET access. But still, userspace HPET is three-four times faster than syscall HPET on several Core2 and SandyBridge machines. Tested by: Howard Su <howard0su@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D7473	2016-08-17 09:52:09 +00:00
Gleb Smirnoff	dc4ee9a895	Fix a stupid typo (or copy/paste buffer malfunction).	2016-08-16 23:00:22 +00:00
Gleb Smirnoff	c0f50fa012	We should not be allowing a timeout to reset when a drain is in progress on it (either async or sync drain). At this moment the only user of drain is TCP, but TCP wouldn't reschedule a callout after it has drained it, since it drains only when a tcpcb is closed. This for now the problem isn't observed. Submitted by: rrs	2016-08-16 21:55:34 +00:00
Ed Schouten	93d9ebd82e	Eliminate use of sys_fsync() and sys_fdatasync(). Make the kern_fsync() function public, so that it can be used by other parts of the kernel. Fix up existing consumers to make use of it. Requested by: kib	2016-08-15 20:11:52 +00:00
Eric Badger	b0f2185bbe	sem_post(): wake up the sleeper only after adjusting has_waiters If the caller of sem_post() wakes up a thread sleeping via sem_wait() before it clears the has_waiters flag, the caller of sem_wait() has no way of knowing when it is safe to destroy the semaphore and reuse the memory. This is because the caller of sem_post() may be interrupted between the wake step and the clearing of has_waiters. It will then write into the has_waiters flag in userspace after being preempted for some unknown amount of time. Reviewed by: jhb, kib, vangyzen Approved by: kib (mentor), vangyzen (mentor) MFC after: 2 weeks Sponsored by: Dell Inc. Differential Revision: https://reviews.freebsd.org/D7505	2016-08-15 20:09:09 +00:00
Konstantin Belousov	47e61f6cc6	Implement VOP_FDATASYNC() for msdosfs. Standard VOP_FSYNC() implementation just syncs data buffers, and due to this, is the correct and efficient implementation for msdosfs or any other filesystem which uses bufer cache trivially. Provide globally visible wrapper vop_stdfdatasync_buf() for future consumption by other filesystems. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7471	2016-08-15 19:17:00 +00:00
Konstantin Belousov	1d2537a26a	Regen after r304176, fdatasync(2) addition.	2016-08-15 19:15:46 +00:00
Konstantin Belousov	295af703a0	Add an implementation of fdatasync(2). The syscall is a trivial wrapper around new VOP_FDATASYNC(), sharing code with fsync(2). For all filesystems, this commit provides the implementation which delegates the work of VOP_FDATASYNC() to VOP_FSYNC(). This is functionally correct but not efficient. This is not yet POSIX-compliant implementation, because it does not ensure that queued AIO requests are completed before returning. Reviewed by: mckusick Discussed with: avg (ZFS), jhb (AIO part) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7471	2016-08-15 19:08:51 +00:00
Konstantin Belousov	c73fb33115	VOP_FSYNC() does not take cred as an argument. Correct comment. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-15 18:55:33 +00:00
Alan Cox	ce3ee09b53	Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when neither vm_pager_has_page() nor vm_pager_get_pages() is called. Reviewed by: kib, markj MFC after: 3 weeks	2016-08-14 22:00:45 +00:00
Bruce Evans	99061149a3	Print the tid of curthread in "show pcpu" in ddb. It was remarkably hard to trace all current threads. "show pcpu" only showed the pid, and there was nothing (?) better than searching ps output to find the tids on CPUs. This change simplifies the search, but you still have to trace the tid for each CPU manually.	2016-08-14 15:52:00 +00:00
Alan Cox	fc9bbf2794	Eliminate two calls to vm_page_xunbusy() that are both unnecessary and incorrect from the error cases in exec_map_first_page(). They are unnecessary because we automatically unbusy the page in vm_page_free() when we remove it from the object. The calls are incorrect because they happen after the page is freed, so we might actually unbusy the page after it has been reallocated to a different object. (This error was introduced in r292373.) Reviewed by: kib MFC after: 1 week	2016-08-13 18:10:32 +00:00
Edward Tomasz Napierala	f8acef5a3e	Remove unused "X" vnode lock assertion, somehow missed in r303743. MFC after: 1 month	2016-08-12 22:22:11 +00:00
Edward Tomasz Napierala	f83cc0aae4	Print vnode details when vnode locking assertion gets triggered. MFC after: 1 month	2016-08-12 22:20:52 +00:00
Stephen Hurd	23ac9029f9	Update iflib to support more NIC designs - Move group task queue into kern/subr_gtaskqueue.c - Change intr_enable to return an int so it can be detected if it's not implemented - Allow different TX/RX queues per set to be different sizes - Don't split up TX mbufs before transmit - Allow a completion queue for TX as well as RX - Pass the RX budget to isc_rxd_available() to allow an earlier return and avoid multiple calls Submitted by: shurd Reviewed by: gallatin Approved by: scottl Differential Revision: https://reviews.freebsd.org/D7393	2016-08-12 21:29:44 +00:00
Mark Johnston	5004817335	Remove b_pin_count from struct buf. It was added in r153192 for XFS and doesn't appear to have been used for anything else. XFS was disconnected in r241607 and removed entirely in r247631. Reported by: mlaier Reviewed by: imp, kib Differential Revision: https://reviews.freebsd.org/D7468	2016-08-11 07:58:23 +00:00
Edward Tomasz Napierala	411455a8fb	Replace all remaining calls to vprint(9) with vn_printf(9), and remove the old macro. MFC after: 1 month	2016-08-10 16:12:31 +00:00
Mateusz Guzik	7c34b35b57	ktrace: do a lockless check on fork to see if tracing is enabled This saves 2 lock acquisitions in the common case.	2016-08-10 15:25:44 +00:00
Mateusz Guzik	382172be68	sigio: do a lockless check in funsetownlist There is no need to grab the lock first to see if sigio is used, and it typically is not.	2016-08-10 15:24:15 +00:00
Konstantin Belousov	3a77833e87	Fix indentation. Reported by: hselasky MFC after: 17 days	2016-08-10 14:41:53 +00:00
Konstantin Belousov	49c394a970	Re-schedule signals after kthread exits, since apparently there are processes which combine kernel and non-kernel threads, e.g. nfsd. For such processes, termination of a kthread must recheck signal delivery among other threads according to masks. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-10 13:47:12 +00:00
Jean-Sébastien Pédron	bd937497ea	Consistently use `device_t` Several files use the internal name of `struct device` instead of `device_t` which is part of the public API. This patch changes all `struct device *` to `device_t`. The remaining occurrences of `struct device` are those referring to the Linux or OpenBSD version of the structure, or the code is not built on FreeBSD and it's unclear what to do. Submitted by: Matthew Macy <mmacy@nextbsd.org> (previous version) Approved by: emaste, jhibbits, sbruno MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D7447	2016-08-09 19:32:06 +00:00
Stephen J. Kiernan	0ce1624d0e	Move IPv4-specific jail functions to new file netinet/in_jail.c _prison_check_ip4 renamed to prison_check_ip4_locked Move IPv6-specific jail functions to new file netinet6/in6_jail.c _prison_check_ip6 renamed to prison_check_ip6_locked Add appropriate prototypes to sys/sys/jail.h Adjust kern_jail.c to call prison_check_ip4_locked and prison_check_ip6_locked accordingly. Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that need to be built when INET and INET6, respectively, are configured in the kernel configuration file. Reviewed by: jtl Approved by: sjg (mentor) Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D6799	2016-08-09 02:16:21 +00:00
Mark Johnston	434ac8b6b7	Handle races with listening socket close when connecting a unix socket. If the listening socket is closed while sonewconn() is executing, the nascent child socket is aborted, which results in recursion on the unp_link lock when the child's pru_detach method is invoked. Fix this by using a flag to mark such sockets, and skip a part of the socket's teardown during detach. Reported by: Raviprakash Darbha <rdarbha@juniper.net> Tested by: pho MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D7398	2016-08-08 20:25:04 +00:00
Bryan Drewery	c1fa440409	Regenerate after r303755. MFC after: 3 days X-MFC-With: r303755 Sponsored by: EMC / Isilon Storage Division	2016-08-04 19:15:51 +00:00
Bryan Drewery	417d5dec39	Still provide freebsd10_* symbols from libc for COMPAT10. r296773 was done to only remove libc symbols for <7. We want to provide the syscall symbols going forward for 7+. Discussed with: jhb MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2016-08-04 19:14:18 +00:00
Edward Tomasz Napierala	7b255097eb	Remove unused - never actually implemented - vnode lock types from vnode_if.src. MFC after: 1 month	2016-08-04 13:45:18 +00:00
Bryan Drewery	78be18ae6e	Correct some comments. Sponsored by: EMC / Isilon Storage Division MFC after: 3 days	2016-08-03 18:48:56 +00:00
Mateusz Guzik	0453ade508	locks: fix sx compilation on mips after r303643 The kernel.h header is required for the SYSINIT macro, which apparently was present on amd64 by accident. Reported by: kib	2016-08-03 09:15:10 +00:00
Konstantin Belousov	e69ba32f88	Remove mention of the Giant from the fork_return() description. Making emphasis on this lock in the core function comment is confusing for the modern kernel. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-08-03 07:10:09 +00:00
Ed Schouten	e938ebbc0c	Regenerate system call tables for r303699 and r303700.	2016-08-03 06:36:45 +00:00
Ed Schouten	1b42875d4b	Re-add traling slash that was removed in r303699. I must have accidentally pressed some random key in vim.	2016-08-03 06:35:58 +00:00
Ed Schouten	a813fdc6c3	mprotect(): Change prototype to comply to POSIX. Our mprotect() function seems to take a "const void " address to the pages whose permissions need to be adjusted. POSIX uses "void ". Simply stick to the POSIX one to prevent us from writing unportable code. PR: 211423 (exp-run) Tested by: antoine@ (Thanks!)	2016-08-03 06:33:04 +00:00
Mateusz Guzik	fa5000a4f3	locks: fix compilation for KDTRACE_HOOKS && !ADAPTIVE_* case Reported by: Michael Butler <imb protected-networks.net>	2016-08-02 03:05:59 +00:00
Mateusz Guzik	0412689595	locks: fix up ifdef guards introduced in r303643 Both sx and rwlocks had copy-pasted ADAPTIVE_MUTEXES instead of the correct define. MFC after: 1 week	2016-08-02 00:15:08 +00:00
Mateusz Guzik	1ada904147	Implement trivial backoff for locking primitives. All current spinning loops retry an atomic op the first chance they get, which leads to performance degradation under load. One classic solution to the problem consists of delaying the test to an extent. This implementation has a trivial linear increment and a random factor for each attempt. For simplicity, this first thouch implementation only modifies spinning loops where the lock owner is running. spin mutexes and thread lock were not modified. Current parameters are autotuned on boot based on mp_cpus. Autotune factors are very conservative and are subject to change later. Reviewed by: kib, jhb Tested by: pho MFC after: 1 week	2016-08-01 21:48:37 +00:00
Mateusz Guzik	61852185ba	locks: change sleep_cnt and spin_cnt types to u_int Both variables are uint64_t, but they only count spins or sleeps. All reasonable values which we can get here comfortably hit in 32-bit range. Suggested by: kib MFC after: 1 week	2016-07-31 12:11:55 +00:00
Mateusz Guzik	e0c45af904	sx: increment spin_cnt before cpu_spinwait in xlock The change is a no-op only done for consistency with the rest of the file.	2016-07-30 22:23:31 +00:00
Mateusz Guzik	7a54be1870	rwlock: s/READER/WRITER/ in wlock lockstat annotation	2016-07-30 22:21:48 +00:00
Konstantin Belousov	50c22263bb	Cache getbintime(9) answer in timehands, similarly to getnanotime(9) and getmicrotime(9). Suggested and reviewed by: bde (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 month	2016-07-30 09:25:57 +00:00
John Baldwin	e2325d8285	Don't treat NOCPU as a valid CPU to CPU_ISSET. If a thread is created bound to a cpuset it might already be bound before it's very first timeslice, and td_lastcpu will be NOCPU in that case. MFC after: 1 week	2016-07-29 20:19:14 +00:00
John Baldwin	005ce8e4e6	Fix locking issues with aio_fsync(). - Use correct lock in aio_cancel_sync when dequeueing job. - Add _locked variants of aio_set/clear_cancel_function and use those to avoid lock recursion when adding and removing fsync jobs to the per-process sync queue. - While here, add a basic test for aio_fsync(). PR: 211390 Reported by: Randy Westlund <rwestlun@gmail.com> MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D7339	2016-07-29 18:26:15 +00:00
Brooks Davis	40018b91dd	Don't create pointless backups of generated files in "make sysent". Any sensible workflow will include a revision control system from which to restore the old files if required. In normal usage, developers just have to clean up the mess. Reviewed by: jhb Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D7353	2016-07-28 21:29:04 +00:00
Ed Schouten	5590eb985e	Regenerate system call table for r303435.	2016-07-28 12:22:34 +00:00
Ed Schouten	d9c4cd2fbc	Change the return type of msgrcv() to ssize_t as required by POSIX. It looks like the msgrcv() system call is already written in such a way that the size is internally computed as a size_t and written into all of td_retval[0]. This means that it is effectively already returning ssize_t. It's just that the userspace prototype doesn't match up.	2016-07-28 12:22:01 +00:00
Konstantin Belousov	2d19b736ed	Rewrite subr_sleepqueue.c use of callouts to not depend on the specifics of callout KPI. Esp., do not depend on the exact interface of callout_stop(9) return values. The main change is that instead of requiring precise callouts, code maintains absolute time to wake up. Callouts now should ensure that a wake occurs at the requested moment, but we can tolerate both run-away callout, and callout_stop(9) lying about running callout either way. As consequence, it removes the constant source of the bugs where sleepq_check_timeout() causes uninterruptible thread state where the thread is detached from CPU, see e.g. r234952 and r296320. Patch also removes dual meaning of the TDF_TIMEOUT flag, making code (IMO much) simpler to reason about. Tested by: pho Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D7137	2016-07-28 09:09:55 +00:00
Konstantin Belousov	a9e182e895	Extract the calculation of the callout fire time into the new function callout_when(9). See the man page update for the description of the intended use. Tested by: pho Reviewed by: jhb, bjk (man page updates) Sponsored by: The FreeBSD Foundation MFC after: 1 month X-Differential revision: https://reviews.freebsd.org/D7137	2016-07-28 08:57:01 +00:00
Konstantin Belousov	b7a25e63b6	When a debugger attaches to the process, SIGSTOP is sent to the target. Due to a way issignal() selects the next signal to deliver and report, if the simultaneous or already pending another signal exists, that signal might be reported by the next waitpid(2) call. This causes minor annoyance for debuggers, which must be prepared to take any signal as the first event, then filter SIGSTOP later. More importantly, for tools like gcore(1), which attach and then detach without processing events, SIGSTOP might leak to be delivered after PT_DETACH. This results in the process being unintentionally stopped after detach, which is fatal for automatic tools. The solution is to force SIGSTOP to be the first signal reported after the attach. Attach code is modified to set P2_PTRACE_FSTP to indicate that the attaching ritual was not yet finished, and issignal() prefers SIGSTOP in that condition. Also, the thread which handles P2_PTRACE_FSTP is made to guarantee to own p_xthread during the first waitpid(2). All that ensures that SIGSTOP is consumed first. Additionally, if P2_PTRACE_FSTP is still set on detach, which means that waitpid(2) was not called at all, SIGSTOP is removed from the queue, ensuring that the process is resumed on detach. In issignal(), when acting on STOPing signals, remove the signal from queue before suspending. Otherwise parallel attach could result in ptracestop() acting on that STOP as if it was the STOP signal from the attach. Then SIGSTOP from attach leaks again. As a minor refactoring, some bits of the common attach code is moved to new helper proc_set_traced(). Reported by: markj Reviewed by: jhb, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7256	2016-07-28 08:41:13 +00:00
Stephen J. Kiernan	4ac21b4f09	Prepare for network stack as a module - Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the INET and INET6-specific code from the rest of the prot code (It is only used by the network stack, so it makes sense for it to live with the other network stack code.) - Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h - Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to cr_canseeothergids, make them non-static, and add prototypes (so they can be seen/called by in_prot.c functions.) - Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an unused variable. Reviewed by: gnn, jtl Approved by: sjg (mentor) Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D2901	2016-07-27 20:34:09 +00:00
John Baldwin	b9a53e161b	Adjust tests in fsync job scheduling loop to reduce indentation.	2016-07-27 19:31:25 +00:00
Ed Maste	18c23dffac	ANSIfy kern_proc.c and delete register keyword Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6478	2016-07-27 14:27:08 +00:00
Konstantin Belousov	1822421c0b	Remove Giant from settime(), tc_setclock_mtx guards tc_windup() calls, and there is no other issues with parallel settime(). Remove spl() vestiges there as well. Tested by: pho (as part of the whole patch) Reviewed by: jhb (same) Discussed wit: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D7302	2016-07-27 11:54:24 +00:00
Konstantin Belousov	5760b029ee	Prevent parallel tc_windup() calls, both parallel top-level calls from setclock() and from simultaneous top-level and interrupt. For this, tc_windup() is protected with a tc_setclock_mtx spinlock, in the try mode when called from hardclock interrupt. If spinlock cannot be obtained without spinning from the interrupt context, this means that top-level executes tc_windup() on other core and our try may be avoided. The boottimebin and boottime variables should be adjusted from tc_windup(). To be correct, they must be part of the timehands and read using lockless protocol. Remove the globals and reimplement the getboottime(9)/getboottimebin(9) KPI using the timehands read protocol. Tested by: pho (as part of the whole patch) Reviewed by: jhb (same) Discussed wit: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month X-Differential revision: https://reviews.freebsd.org/D7302	2016-07-27 11:49:41 +00:00
Konstantin Belousov	4493f659e5	Fix a bug in r302252. Change ntpadj_lock to spinlock always, and rename stuff removing ADJ/adj from the names. ntp_update_second() requires ntp_lock and is called from the tc_windup(), so ntp_lock must be a spinlock. Add missed lock to ntp_update_second(). Tested by: pho (as part of the whole patch) Reviewed by: jhb (same) Noted by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month X-Differential revision: https://reviews.freebsd.org/D7302	2016-07-27 11:40:06 +00:00
Konstantin Belousov	21547fc7ca	Reduce the resettodr_lock scope to only CLOCK_SETTIME() call. Tested by: pho (as part of the whole patch) Reviewed by: jhb (same) Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month X-Differential revision: https://reviews.freebsd.org/D7302	2016-07-27 11:34:25 +00:00
Konstantin Belousov	4d29106e55	Style. Sponsored by: The FreeBSD Foundation MFC after: 1 month X-Differential revision: https://reviews.freebsd.org/D7302	2016-07-27 11:33:33 +00:00
Konstantin Belousov	a83c016f00	Reduce number of timehands to just two. This is useful because consumers can now be only one tc_windup() call late. Use C99 initialization. Tested by: pho (as part of the whole patch) Reviewed by: jhb (same) Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month X-Differential revision: https://reviews.freebsd.org/D7302	2016-07-27 11:27:52 +00:00
Konstantin Belousov	584b675ed6	Hide the boottime and bootimebin globals, provide the getboottime(9) and getboottimebin(9) KPI. Change consumers of boottime to use the KPI. The variables were renamed to avoid shadowing issues with local variables of the same name. Issue is that boottime* should be adjusted from tc_windup(), which requires them to be members of the timehands structure. As a preparation, this commit only introduces the interface. Some uses of boottime were found doubtful, e.g. NLM uses boottime to identify the system boot instance. Arguably the identity should not change on the leap second adjustment, but the commit is about the timekeeping code and the consumers were kept bug-to-bug compatible. Tested by: pho (as part of the bigger patch) Reviewed by: jhb (same) Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month X-Differential revision: https://reviews.freebsd.org/D7302	2016-07-27 11:08:59 +00:00
Stephen J. Kiernan	cc37baea09	Add the NUM_CORE_FILES kernel config option which specifies the limit for the number of core files allowed by a particular process when using the %I core file name pattern. Sanity check at compile time to ensure the value is within the valid range of 0-10. Reviewed by: jtl, sjg Approved by: sjg (mentor) Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D6812	2016-07-27 03:21:02 +00:00
Ed Schouten	f63cd251b2	Add shmatt_t. It looks like our "struct shmid_ds::shm_nattch" deviates from the standard in the sense that it is a signed integer, whereas POSIX requires that it is unsigned, having a special type shmatt_t. Patch up our native and 32-bit copies to use a new shmatt_t that is an unsigned integer. As it's unsigned, we can relax the comparisons that are performed on it. Leave the Linux, iBCS2, etc. copies of the structure alone. Reviewed by: ngie Differential Revision: https://reviews.freebsd.org/D6655	2016-07-26 17:23:49 +00:00
Conrad Meyer	af326ace9d	devfs: Move most ioctl logic down to vnode layer Devfs' file layer ioctl is now just a thin shim around the vnode layer. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7286	2016-07-25 16:28:02 +00:00
Konstantin Belousov	90b581f2cc	Implement mtx_trylock_spin(9). Discussed with: bde Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D7192	2016-07-23 05:30:55 +00:00
John Baldwin	9c20dc9963	Add more documentation regarding unsafe AIO requests. The asynchronous I/O changes made previously result in different behavior out of the box. Previously all AIO requests failed with ENOSYS / SIGSYS unless aio.ko was explicitly loaded. Now, some AIO requests complete and others ("unsafe" requests) fail with EOPNOTSUPP. Reword the introductory paragraph in aio(4) to add a general description of AIO before describing the vfs.aio.enable_unsafe sysctl. Remove the ENOSYS error description from aio_fsync(2), aio_read(2), and aio_write(2) and replace it with a description of EOPNOTSUPP. Remove the ENOSYS error description from aio_mlock(2). Log a message to the system log the first time a process requests an "unsafe" AIO request that fails with EOPNOTSUPP. This is modeled on the log message used for processes using the legacy pty devices. Reviewed by: kib (earlier version) MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D7151	2016-07-21 22:49:47 +00:00
Konstantin Belousov	492fe1b774	Hide counted_warning(9) under #ifdef _KERNEL braces, to allow building subr_prf.c in userspace for libsbuf. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-21 17:59:30 +00:00
Konstantin Belousov	9fe297bbdc	Declare aio requests on files from local filesystems safe. Two notes: - I allow AIO on reclaimed vnodes, since it is deterministically terminated fast. - devfs mounts are marked as MNT_LOCAL, but device vnodes have type VCHR, so the slow device io is not allowed. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D7273	2016-07-21 17:07:06 +00:00
Konstantin Belousov	9837947b07	Provide counter_warning(9) KPI which allows to issue limited number of warnings for some kernel events, mostly intended for the use of obsoleted or otherwise undersired interfaces. This is an abstracted and race-expelled code from compat pty driver. Requested and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D7270	2016-07-21 16:34:56 +00:00
Conrad Meyer	1005d8aff4	imgact_elf: Rename the segment iterator to match reality The each_writable_segment routine evaluates segments on a slightly little more nuanced metric than simply "writable" or not. Rename the function to more closely match its behavior (each_dumpable_segment). Suggested by: jhb Sponsored by: EMC / Isilon Storage Division	2016-07-20 22:51:33 +00:00
Conrad Meyer	f3325003d9	ANSI-fy imgact_elf.c Sponsored by: EMC / Isilon Storage Division	2016-07-20 22:46:56 +00:00
Conrad Meyer	07f825e871	Fix DEBUG build on 64-bit arch after r303099 Reported by: Larry Rosenman <ler at lerctr.org>	2016-07-20 18:11:22 +00:00
Conrad Meyer	c17b0bd2a6	Extend ELF coredump to support more than 65535 segments The ELF e_phnum field is only 16 bits wide. To support more than 65535 segments (program headers), Sun's "Linker and Libraries Guide" table 7-7 (or 12-7, depending on document version) prescribes a special first section header where sh_info represents the real number of program headers. Test code to follow, when it is ready. Reference: http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf Reviewed by: emaste, markj Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7255	2016-07-20 16:59:36 +00:00
Gleb Smirnoff	9f3391243b	Redo the r302894: the very new value for a non-scheduled callout is -1. This was recently added in r290664. Noticed by: hselasky Tested by: Larry Rosenman <ler lerctr.org> PR: 210884	2016-07-20 16:48:25 +00:00
Gleb Smirnoff	47e4280922	Revert r303037. It re-introduces the panic with TCP timers. Agreed by: rrs, re (gjb)	2016-07-20 16:44:22 +00:00
Randall Stewart	3d84a18803	This reverts out Gleb's changes and adds three small fixes that I think closes up the races Gleb was looking for. This is running quite nicely in Netflix and now no longer causes TCP-tcb leaks. Differential Revision: 7135	2016-07-19 18:31:19 +00:00
John Baldwin	ccb83afd81	Include process IDs in core dumps. When threads were added to the kernel, the pr_pid member of the NT_PRSTATUS note was repurposed to store LWP IDs instead of process IDs. However, the process ID was no longer recorded in core dumps. This change adds a pr_pid field to prpsinfo (NT_PRSINFO). Rather than bumping the prpsinfo version number, note parsers can use the note's payload size to determine if pr_pid is present. Reviewed by: kib, emaste (older version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D7117	2016-07-18 15:14:23 +00:00
John Baldwin	fc4f075a1a	Add PTRACE_VFORK to trace vfork events. First, PL_FLAG_FORKED events now also set a PL_FLAG_VFORKED flag when the new child was created via vfork() rather than fork(). Second, a new PL_FLAG_VFORK_DONE event can now be enabled via the PTRACE_VFORK event mask. This new stop is reported after the vfork parent resumes due to the child calling exit or exec. Debuggers can use this stop to reinsert breakpoints in the vfork parent process before it resumes. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7045	2016-07-18 14:53:55 +00:00
Konstantin Belousov	77d6809483	The assertion re-added in r302614 was triggered when stopping signal is delivered to vforked child. Issue is that we avoid stopping such children in issignal() to not block parents. But executed AST, which ignored stops, leaves the child with the signal pending but no AST pending. On first exec after vfork(), call signotify() to handle pending reenabled signals. Adjust the assert to not check vfork children until exec. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-07-18 10:53:47 +00:00
Gleb Smirnoff	809a9d1353	Revert the last commit. It must get more review and testing first.	2016-07-18 09:29:08 +00:00
Gleb Smirnoff	ef58c6a7a3	Redo the r302894: the very new value for a non-scheduled callout is -1. This was recently added in r290664. Noticed by: hselasky PR: 210884	2016-07-18 09:26:06 +00:00
Konstantin Belousov	56d48d6dcb	Fix another bug after r302350. Reported and tested by: pho PR: 210884 Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-07-18 04:30:34 +00:00
Konstantin Belousov	86f1146329	Another issue reported on http://seclists.org/oss-sec/2016/q3/68 is that struct kevent member ident has uintptr_t type, which is silently truncated to int in the call to fget(). Explicitely check for the valid range. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-16 13:24:58 +00:00
Konstantin Belousov	f470cca578	In ptrace_vm_entry(), do not call vmspace_free() while owning a vm object lock. The vmspace_free() operations might need to lock map, object etc on last dereference. Postpone the free until object's inspection is done. Reported and tested by: will Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-15 23:26:33 +00:00
John Baldwin	8d570f64aa	Add a mask of optional ptrace() events. ptrace() now stores a mask of optional events in p_ptevents. Currently this mask is a single integer, but it can be expanded into an array of integers in the future. Two new ptrace requests can be used to manipulate the event mask: PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK sets the current event mask. The current set of events include: - PTRACE_EXEC: trace calls to execve(). - PTRACE_SCE: trace system call entries. - PTRACE_SCX: trace syscam call exits. - PTRACE_FORK: trace forks and auto-attach to new child processes. - PTRACE_LWP: trace LWP events. The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have been replaced by PTRACE_SCE and PTRACE_SCX. PTRACE_FORK replaces P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS. The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for compatibility but now simply toggle corresponding flags in the event mask. While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both modify the event mask and continue the traced process. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7044	2016-07-15 15:32:09 +00:00
Gleb Smirnoff	2138e263cb	Fix regression introduced by r302350. The change of return value for a callout that wasn't scheduled at all was unintentional and yielded in several panics. PR: 210884	2016-07-15 09:28:32 +00:00
Konstantin Belousov	d7e8cfd63d	Do not allow creation of char or block special nodes with VNOVAL dev_t. As was reported on http://seclists.org/oss-sec/2016/q3/68, tmpfs code contains assertion that rdev != VNOVAL. On FreeBSD, there is no other consequences except triggering the assert. To be compatible with systems where device nodes have some significance, reject mknod(2) call with dev == VNOVAL at the syscall level. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-15 09:23:18 +00:00
John Baldwin	c77547d2f9	Include command line arguments in core dump process info. Fill in pr_psargs in the NT_PRSINFO ELF core dump note with command line arguments. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D7116	2016-07-14 23:20:05 +00:00
Mark Johnston	0649caae32	Let DDB's buf printer handle NULL pointers in the buf page array. A buf's b_pages and b_npages fields may be inconsistent after a panic. For instance, vfs_vmio_invalidate() sets b_npages to zero only after all pages are unwired and their page array entries are cleared. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2016-07-14 18:49:05 +00:00
Eric Badger	fdb6320d45	Add explicit detection of KVM hypervisor Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use vm_guest in conditionals testing for KVM. Also, fix a conditional checking if we're running in a VM which caught only the generic VM case, but not more specific VMs (KVM, VMWare, etc.). (Spotted by: vangyzen). Differential revision: https://reviews.freebsd.org/D7172 Sponsored by: Dell Inc. Approved by: kib (mentor), vangyzen (mentor) Reviewed by: alc MFC after: 4 weeks	2016-07-13 19:19:18 +00:00
Konstantin Belousov	de56aee0bf	Trace timeval parameters to the getitimer(2) and setitimer(2) syscalls. Reviewed by: jhb Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D7158	2016-07-13 14:37:58 +00:00
Konstantin Belousov	8f01bee46b	Revive the check, disabled in r197963. Despite the implication (process has pending signals -> the current thread marked for AST and has TDF_NEEDSIGCHK set) is not true due to other thread might manipulate its signal blocking mask, it should still hold for the single-threaded processes. Enable check for the condition for single-threaded case, and replicate it from userret() to ast() as well, where we check that ast indeed has no signal to deliver. Note that the check is under DIAGNOSTIC, it is not enabled for INVARIANTS but !DIAGNOSTIC since it imposes too heavy-weight locking for day-to-day used debugging kernel. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-12 03:53:15 +00:00
Konstantin Belousov	4f4c35bb38	Add assert to complement r302328. AST must not execute with TDF_SBDRY or TDF_SEINTR/TDF_SERESTART thread flags set, which is asserted in userret(). As the consequence, -1 return from cursig() must not be possible. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-12 03:52:05 +00:00
Nathan Whitehorn	8c636a11dc	Remove assumptions in MI code that the BSP is CPU 0. MFC after: 2 weeks	2016-07-11 21:25:28 +00:00
Konstantin Belousov	725496ced5	Fix grammar. Submitted by: alc MFC after: 2 weeks	2016-07-11 17:04:22 +00:00
Konstantin Belousov	19efd8a5a8	In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called, if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate(). The vnode_destroy_object(), when calling into vm_object_terminate(), must be able to flush buffers. BO_DEAD purpose is to quickly destroy buffers on write when the underlying vnode is not operable any more (one example is the devfs node after geom is gone). Setting BO_DEAD for reclaiming vnode before object is terminated is premature, and results in unability to flush buffers with live SU dependencies from vinvalbuf() in vm_object_terminate(). Reported by: David Cross <dcrosstech@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-07-11 14:19:09 +00:00
Robert Watson	5fa69ff015	In process-descriptor close(2) and fstat(2), audit target process information. pgkill(2) already audits target process ID. MFC after: 3 days Sponsored by: DARPA, AFRL	2016-07-10 14:17:36 +00:00
Robert Watson	e5ec733909	Do allow auditing of read(2) and write(2) system calls, by assigning those system calls audit event identifiers AUE_READ and AUE_WRITE. While auditing file-descriptor I/O is not required by the Common Criteria, in practice this proves useful for both live and forensic analysis. NB: freebsd32 already assigns AUE_READ and AUE_WRITE to read(2) and write(2). MFC after: 3 days Sponsored by: DARPA, AFRL	2016-07-10 13:42:33 +00:00
Robert Watson	8ec75c0fc3	Audit the file-descriptor number argument for openat(2). Remove a comment about the desirability of auditing the number, as it was in fact in the wrong place (in the common path for open(2) and openat(2), and only the latter accepts a file-descriptor argument). Where other ABIs support openat(2), it may be necessary to do additional argument auditing as it is not performed in kern_openat(9). MFC after: 3 days Sponsored by: DARPA, AFRL	2016-07-10 09:50:21 +00:00
Robert Watson	51d1f69069	Audit file-descriptor arguments to I/O system calls such as read(2), write(2), dup(2), and mmap(2). This auditing is not required by the Common Criteria (and hence was not being performed), but is valuable in both contemporary live analysis and forensic use cases. MFC after: 3 days Sponsored by: DARPA, AFRL	2016-07-10 08:04:02 +00:00
Edward Tomasz Napierala	debc480e03	Add new unmount(2) flag, MNT_NONBUSY, to check whether there are any open vnodes before proceeding. Make autounmound(8) use this flag. Without it, even an unsuccessfull unmount causes filesystem flush, which interferes with normal operation. Reviewed by: kib@ Approved by: re (gjb@) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7047	2016-07-07 09:03:57 +00:00
Nathan Whitehorn	96c85efb4b	Replace a number of conflations of mp_ncpus and mp_maxid with either mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try, for example, to run tasks on CPUs that did not exist or to allocate too few buffers on systems with sparse CPU IDs in which there are holes in the range and mp_maxid > mp_ncpus. Such circumstances generally occur on systems with SMT, but on which SMT is disabled. This patch restores system operation at least on POWER8 systems configured in this way. There are a number of other places in the kernel with potential problems in these situations, but where sparse CPU IDs are not currently known to occur, mostly in the ARM machine-dependent code. These will be fixed in a follow-up commit after the stable/11 branch. PR: kern/210106 Reviewed by: jhb Approved by: re (glebius)	2016-07-06 14:09:49 +00:00
Gleb Smirnoff	d153eeee97	The paradigm of a callout is that it has three consequent states: not scheduled -> scheduled -> running -> not scheduled. The API and the manual page assume that, some comments in the code assume that, and looks like some contributors to the code also did. The problem is that this paradigm isn't true. A callout can be scheduled and running at the same time, which makes API description ambigouous. In such case callout_stop() family of functions/macros should return 1 and 0 at the same time, since it successfully unscheduled future callout but the current one is running. Before this change we returned 1 in such a case, with an exception that if running callout was migrating we returned 0, unless CS_MIGRBLOCK was specified. With this change, we now return 0 in case if future callout was unscheduled, but another one is still in action, indicating to API users that resources are not yet safe to be freed. However, the sleepqueue code relies on getting 1 return code in that case, and there already was CS_MIGRBLOCK flag, that covered one of the edge cases. In the new return path we will also use this flag, to keep sleepqueue safe. Since the flag CS_MIGRBLOCK doesn't block migration and now isn't limited to migration edge case, rename it to CS_EXECUTING. This change fixes panics on a high loaded TCP server. Reviewed by: jch, hselasky, rrs, kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D7042	2016-07-05 18:47:17 +00:00
Gleb Smirnoff	a0d20ecb94	Compile in the kassert_panic() function with INVARIANT_SUPPORT option, not INVARIANTS. The function is required if we want to load in a module that is compiled with INVARIANTS. Reviewed by: jhb Approved by: re (gjb)	2016-07-05 18:34:34 +00:00
Mark Johnston	f61d6c5a5b	Ensure that spinlock sections are balanced even after a panic. vpanic() uses spinlock_enter() to disable interrupts before dumping core. However, when the scheduler is stopped and INVARIANTS is not configured, thread_lock() does not acquire a spinlock section, while thread_unlock() releases one. This can result in interrupts staying enabled while the kernel dumps core, complicating post-mortem analysis of the crash. Approved by: re (gjb) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2016-07-05 17:59:04 +00:00
Robert Watson	971711fb7c	Call audit hooks to capture vnode attributes for three file-descriptor method implementations: fstat(2), close(2), and poll(2). This change synchronises auditing here with similar auditing for VFS-specific system calls such as stat(2) that audit more complete vnode information. Sponsored by: DARPA, AFRL Approved by: re (kib) MFC after: 1 week	2016-07-05 16:37:01 +00:00
Ed Maste	1cbb879df8	add description for debug.elf{32,64}_legacy_coredump sysctl Approved by: re (kib) MFC after: 1 week Sponsored by: The FreeBSD Foundation	2016-07-05 14:46:06 +00:00
Konstantin Belousov	46e47c4f8d	Provide helper macros to detect 'non-silent SBDRY' state and to calculate appropriate return value for stops. Simplify the code by using them. Fix typo in sig_suspend_threads(). The thread which sleep must be aborted is td2. () In issignal(), when handling stopping signal for thread in TD_SBDRY_INTR state, do not stop, this is wrong and fires assert. This is yet another place where execution should be forced out of SBDRY-protected region. For such case, return -1 from issignal() and translate it to corresponding error code in sleepq_catch_signals(). Assert that other consumers of cursig() are not affected by the new return value. () Micro-optimize, mostly VFS and VOP methods, by avoiding calling the functions when SIGDEFERSTOP_NOP non-change is requested. (*) Reported and tested by: pho () Requested by: bde (**) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-07-03 18:19:48 +00:00
Konstantin Belousov	d12d6a8476	Remove racy assert. The thread which changes vnode usecount from 0 to 1 does it under the vnode interlock, but the interlock is not owned by the asserting thread. As result, we might read increased use counter but also still see VI_OWEINACT. In collaboration with: nwhitehorn Hardware donated by: IBM LTC Sponsored by: The FreeBSD Foundation (kib) Approved by: re (gjb)	2016-07-03 01:56:48 +00:00
Konstantin Belousov	e18ee4957d	When a process knote was attached to the process which is already exiting, the knote is activated immediately. If the exit1() later activates knotes, such knote is attempted to be activated second time. Detect the condition by zeroed kn_ptr.p_proc pointer, and avoid excessive activation. Before r302235, such knotes were removed from the knlist immediately upon activation. Reported by: truckman Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-07-01 20:11:28 +00:00
Konstantin Belousov	364c516cff	Currently the ntptime code and resettodr() are Giant-locked. In particular, the Giant is supposed to protect against parallel ntp_adjtime(2) invocations. But, for instance, sys_ntp_adjtime() does copyout(9) under Giant and then examines time_status to return syscall result. Since copyout(9) could sleep, the syscall result might be inconsistent. Another and more important issue is that if PPS is configured, hardpps(9) is executed without any protection against the parallel top-level code invocation. Potentially, this may result in the inconsistent state of the ntptime state variables, but I cannot say how serious such distortion is. The non-functional splclock() call in sys_ntp_adjtime() protected against clock interrupts calling hardpps() in the pre-SMP era. Modernize the locking. A mutex protects ntptime data. Due to the hardpps() KPI legitimately serving from the interrupt filters (and e.g. uart(4) does call it from filter), the lock cannot be sleepable mutex if PPS_SYNC is defined. Otherwise, use normal sleepable mutex to reduce interrupt latency. Reviewed by: imp, jhb Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6825	2016-06-28 16:43:23 +00:00
Konstantin Belousov	3a0e6f920a	Do not use Giant to prevent parallel calls to CLOCK_SETTIME(). Use private mtx in resettodr(), no implementation of CLOCK_SETTIME() is allowed to sleep. Reviewed by: imp, jhb Sponsored by: The FreeBSD Foundation Approved by: re (gjb) X-Differential revision: https://reviews.freebsd.org/D6825	2016-06-28 16:42:40 +00:00
Konstantin Belousov	6f56cb8dbf	Complete r302215. TDF_SBDRY \| TDF_SERESTART and TDF_SBDRY \| TDF_SEINTR flags values, unlike TDF_SBDRY, must be treated almost as if TDF_SBDRY is not set for STOP signal delivery. The only difference is that sig_suspend_threads() should abort the sleep instead of doing immediate suspension. Reported by: ngie Sponsored by: The FreeBSD Foundation MFC after: 12 days Approved by: re (gjb)	2016-06-28 16:41:50 +00:00
Konstantin Belousov	9eb3f14324	Fix userspace build after r302235: do not expose bool field of the structure, change it to int. The real fix is to sanitize user-visible definitions in sys/event.h, e.g. the affected struct knlist is of no use for userspace programs. Reported and tested by: jkim Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-27 23:34:53 +00:00
Konstantin Belousov	9e590ff04b	When filt_proc() removes event from the knlist due to the process exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen: - knote kn_knlist pointer is reset - INFLUX knote is removed from the process knlist. And, there are two consequences: - KN_LIST_UNLOCK() on such knote is nop - there is nothing which would block exit1() from processing past the knlist_destroy() (and knlist_destroy() resets knlist lock pointers). Both consequences result either in leaked process lock, or dereferencing NULL function pointers for locking. Handle this by stopping embedding the process knlist into struct proc. Instead, the knlist is allocated together with struct proc, but marked as autodestroy on the zombie reap, by knlist_detach() function. The knlist is freed when last kevent is removed from the list, in particular, at the zombie reap time if the list is empty. As result, the knlist_remove_inevent() is no longer needed and removed. Other changes: In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events from kn_sfflags for knote registered by kernel to only get NOTE_CHILD notifications. The flags leak resulted in excessive NOTE_EXEC/NOTE_FORK reports. Fix immediate note activation in filt_procattach(). Condition should be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT report for the exiting process. In knote_fork(), do not perform racy check for KN_INFLUX before kq lock is taken. Besides being racy, it did not accounted for notes just added by scan (KN_SCAN). Some minor and incomplete style fixes. Analyzed and tested by: Eric Badger <eric@badgerio.us> Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6859	2016-06-27 21:52:17 +00:00
Konstantin Belousov	883a5a4a6a	When sleeping waiting for either local or remote advisory lock, interrupt sleeps with the ERESTART on the suspension attempts. Otherwise, single-threading requests are deferred until the locks are granted for NFS files, which causes hangs. When retrying local registration of the remotely-granted adv lock, allow full suspension and check for suspension, for usual reasons. Reported by: markj, pho Reviewed by: jilles Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-26 20:08:42 +00:00
Konstantin Belousov	3a1e5dd8e6	Rewrite sigdeferstop(9) and sigallowstop(9) into more flexible framework allowing to set the suspension policy for the dynamic block. Extend the currently possible policies of stopping on interruptible sleeps and ignoring such sleeps by two more: do not suspend at interruptible sleeps, but interrupt them with either EINTR or ERESTART. Reviewed by: jilles Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-26 20:07:24 +00:00
Konstantin Belousov	8a06de9e92	Do not clear robust lists pointers on fork. The forked child thread lists must be functional. Reported by: Daniel Engberg <daniel.engberg.lists@pyret.net>, Guy Yur <guyyur@gmail.com> Tested by: Guy Yur <guyyur@gmail.com> Sponsored by: The FreeBSD Foundation Approved by: re (gjb), including the KBI change	2016-06-25 11:31:25 +00:00
Jilles Tjoelker	6ea906eec0	posixshm: Fix lock leak when mac_posixshm_check_read rejects read. While reading the code, I noticed that shm_read() returns without unlocking foffset and rangelock if mac_posixshm_check_read() rejects the read. Reviewed by: kib, jhb, rwatson Approved by: re (gjb) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6927	2016-06-23 20:59:13 +00:00
Brooks Davis	a72c64b0b6	Generate syscall tables and update pipe() implementation after r302094. Mark the pipe() system call as COMPAT10. As of r302092 libc uses pipe2() with a zero flags value instead of pipe(). Approved by: re (gjb) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D6816	2016-06-22 21:18:19 +00:00
Brooks Davis	e16e64098c	Mark the pipe() system call as COMPAT10. As of r302092 libc uses pipe2() with a zero flags value instead of pipe(). Commit with regenerated files and implementation to follow. Approved by: re (gjb) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D6816	2016-06-22 21:15:59 +00:00
Brooks Davis	e52e02ba24	Add support for COMPAT10 keywords in syscalls.master. Approved by: re (gjb) Sponsored by: DARPA, AFRL	2016-06-22 21:12:53 +00:00
John Baldwin	b1012d8036	Account for AIO socket operations in thread/process resource usage. File and disk-backed I/O requests store counts of read/written disk blocks in each AIO job so that they can be charged to the thread that completes an AIO request via aio_return() or aio_waitcomplete(). This change extends AIO jobs to store counts of received/sent messages and updates socket backends to set these counts accordingly. Note that the socket backends are careful to only charge a single messages for each AIO request even though a single request on a blocking socket might invoke sosend or soreceive multiple times. This is to mimic the resource accounting of synchronous read/write. Adjust the UNIX socketpair AIO test to verify that the message resource usage counts update accordingly for aio_read and aio_write. Approved by: re (hrs) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D6911	2016-06-21 22:19:06 +00:00
Bjoern A. Zeeb	89856f7e2d	Get closer to a VIMAGE network stack teardown from top to bottom rather than removing the network interfaces first. This change is rather larger and convoluted as the ordering requirements cannot be separated. Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and related modules to their own SI_SUB_PROTO_FIREWALL. Move initialization of "physical" interfaces to SI_SUB_DRIVERS, move virtual (cloned) interfaces to SI_SUB_PSEUDO. Move Multicast to SI_SUB_PROTO_MC. Re-work parts of multicast initialisation and teardown, not taking the huge amount of memory into account if used as a module yet. For interface teardown we try to do as many of them as we can on SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling over a higher layer protocol such as IP. In that case the interface has to go along (or before) the higher layer protocol is shutdown. Kernel hhooks need to go last on teardown as they may be used at various higher layers and we cannot remove them before we cleaned up the higher layers. For interface teardown there are multiple paths: (a) a cloned interface is destroyed (inside a VIMAGE or in the base system), (b) any interface is moved from a virtual network stack to a different network stack ("vmove"), or (c) a virtual network stack is being shut down. All code paths go through if_detach_internal() where we, depending on the vmove flag or the vnet state, make a decision on how much to shut down; in case we are destroying a VNET the individual protocol layers will cleanup their own parts thus we cannot do so again for each interface as we end up with, e.g., double-frees, destroying locks twice or acquiring already destroyed locks. When calling into protocol cleanups we equally have to tell them whether they need to detach upper layer protocols ("ulp") or not (e.g., in6_ifdetach()). Provide or enahnce helper functions to do proper cleanup at a protocol rather than at an interface level. Approved by: re (hrs) Obtained from: projects/vnet Reviewed by: gnn, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D6747	2016-06-21 13:48:49 +00:00
Konstantin Belousov	d9a503bec4	Fix typo. Note that atomic is still required even for interlocked case. Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2016-06-20 15:45:50 +00:00
Mateusz Guzik	e896fb3bae	vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels This removes calls to empty functions like vop_lock_{pre/post} from common vfs routines. Approved by: re (gjb)	2016-06-17 19:41:30 +00:00
Konstantin Belousov	f8a75278dc	Add VFS interface to flush specified amount of free vnodes belonging to mount points with the given filesystem type, specified by mount vfs_ops pointer. Based on patch by: mckusick Reviewed by: avg, mckusick Tested by: allanjude, madpilot Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-06-17 17:33:25 +00:00
Konstantin Belousov	5c2cf81845	Update comments for the MD functions managing contexts for new threads, to make it less confusing and using modern kernel terms. Rename the functions to reflect current use of the functions, instead of the historic KSE conventions: cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads) cpu_set_upcall -> cpu_copy_thread (for forks) cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation) Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs) Differential revision: https://reviews.freebsd.org/D6731	2016-06-16 12:05:44 +00:00
Konstantin Belousov	bd07998e0e	Remove XXX comments from kern_thread.c. In one case, there is no reason for it in modern times. In the other case, expand the comment stating instead of doubting. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs) X-Differential revision: https://reviews.freebsd.org/D6731	2016-06-16 12:01:11 +00:00
Konstantin Belousov	13d2cd3b68	Remove code duplication. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs) X-Differential revision: https://reviews.freebsd.org/D6731	2016-06-16 11:58:46 +00:00
John Baldwin	fe0bdd1d2c	Move backend-specific fields of kaiocb into a union. This reduces the size of kaiocb slightly. I've also added some generic fields that other backends can use in place of the BIO-specific fields. Change the socket and Chelsio DDP backends to use 'backend3' instead of abusing _aiocb_private.status directly. This confines the use of _aiocb_private to the AIO internals in vfs_aio.c. Reviewed by: kib (earlier version) Approved by: re (gjb) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D6547	2016-06-15 20:56:45 +00:00
Konstantin Belousov	9fdbfd3b6c	Do not assume that we own the use reference on the covered vnode until we set MNTK_UNMOUNT flag on the mp. Otherwise parallel unmount which wins race with us could dereference the covered vnode, and we are left with the locked freed memory. Reported and tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week	2016-06-15 15:56:03 +00:00
Jamie Gritton	932a6e432d	Fix a vnode leak when giving a child jail a too-long path when debug.disablefullpath=1.	2016-06-09 21:59:11 +00:00
Jamie Gritton	cf0313c679	Re-order some jail parameter reading to prevent a vnode leak.	2016-06-09 20:43:14 +00:00
Jamie Gritton	176ff3a066	Clean up some logic in jail error messages, replacing a missing test and a redundant test with a single correct test.	2016-06-09 20:39:57 +00:00
Mariusz Zaborski	fb4cdc96a8	Define tunable instead of using CTLFLAG_RWTUN flag with kern.corefile. The allproc_lock lock used in the sysctl_kern_corefile function is initialized in the procinit function which is called after setting sysctl values at boot. That means if we set kern.corefile at boot we will be trying to use lock with is uninitialized and machine will crash. If we define kern.corefile as tunable instead of using CTFLAG_RWTUN we will not call the sysctl_kern_corefile function and we will not use an uninitialized lock. When machine will boot then we will start using function depending on the lock. Reviewed by: pjd	2016-06-09 20:23:30 +00:00
Conrad Meyer	8a3aeac27b	Add DDB command "kldstat" It prints much the same information as kldstat(8) without any arguments. Suggested by: jhibbits Sponsored by: EMC / Isilon Storage Division	2016-06-09 18:27:41 +00:00
Conrad Meyer	dd6ea7f7bc	kvprintf: Pad %c to width, like %s Sponsored by: EMC / Isilon Storage Division	2016-06-09 18:24:51 +00:00
Jamie Gritton	ef0ddea316	Make sure the OSD methods for jail set and remove can't run concurrently, by holding allprison_lock exclusively (even if only for a moment before downgrading) on all paths that call PR_METHOD_REMOVE. Since they may run on a downgraded lock, it's still possible for them to run concurrently with PR_METHOD_GET, which will need to use the prison lock.	2016-06-09 16:41:41 +00:00
Jamie Gritton	5f02f22af1	Remove a comment that was part of copied code, and is misleading in the new location.	2016-06-09 15:34:33 +00:00
Mark Johnston	508d856999	Fix some cosmetic issues in kern_fail.c omitted from r296927. Obtained from: Matthew Bryan <matthew.bryan@isilon.com>	2016-06-09 13:17:08 +00:00
Konstantin Belousov	3fc292d56b	Old process credentials for setuid execve must not be dereferenced when the process credentials were not changed. This can happen if an error occured trying to activate the setuid binary. And on error, if new credentials were not yet assigned, they must be freed to not create the leak. Use oldcred == NULL as the predicate to detect credential reassignment. Reported and tested by: pho Sponsored by: The FreeBSD Foundation	2016-06-08 04:37:03 +00:00
Mariusz Zaborski	b3a734483e	Introduce the PD_CLOEXEC for pdfork(2). Reviewed by: mjg	2016-06-08 02:09:14 +00:00
Svatopluk Kraus	c4263292fe	Remove temporary solution for storing interrupt mapping data as it's not needed after r301451 and follow-ups r301453, r301539. This makes INTRNG clean of all additions related to various buses.	2016-06-07 09:03:27 +00:00
Michal Meloun	949883bd72	INTRNG: As follow up of r301451, implement mapping and configuration of gpio pin interrupts by new way. Note: This removes last consumer of intr_ddata machinery and we remove it in separate commit.	2016-06-07 05:08:24 +00:00
Bjoern A. Zeeb	3af72c1124	Implement a `show panic` command to DDB which will helpfully print the panic string again if set, in case it scrolled out of the active window. This avoids having to remember the symbol name. Also add a show callout <addr> command to DDB in order to inspect some struct callout fields in case of panics in the callout code. This may help to see if there was memory corruption or to further ease debugging problems. Obtained from: projects/vnet MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Reviewed by: jhb (comment only on the show panic initally) Differential Revision: https://reviews.freebsd.org/D4527	2016-06-06 20:57:24 +00:00
Konstantin Belousov	93ccd6bf87	Get rid of struct proc p_sched and struct thread td_sched pointers. p_sched is unused. The struct td_sched is always co-allocated with the struct thread, except for the thread0. Avoid useless indirection, instead calculate td_sched location using simple pointer arithmetic in td_get_sched(9). For thread0, which is statically allocated, create a structure to emulate layout of the dynamic allocation. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D6711	2016-06-05 17:04:03 +00:00
Konstantin Belousov	314381b529	Use ANSI function definition. Sponsored by: The FreeBSD Foundation	2016-06-05 16:55:55 +00:00
Svatopluk Kraus	ad5244ece1	INTRNG - change the way how an interrupt mapping data are provided to the framework in OFW (FDT) case. This is a follow-up to r301451. Differential Revision: https://reviews.freebsd.org/D6634	2016-06-05 16:20:12 +00:00
Svatopluk Kraus	0869297dd9	(1) Add a new bus method to get a mapping data for an interrupt. BUS_MAP_INTR() is used to get an interrupt mapping data according to provided hints. The hints could be modified afterwards, but only if mapping data was allocated. This method is intended to be called before BUS_ALLOC_RESOURCE(). An interrupt mapping data describes an interrupt - hardware number, type, configuration, cpu binding, and whatever is needed to setup it. (2) Introduce a method which allows storing of an additional data in struct resource to be available for bus drivers. This method is convenient in two ways: - there is no need to rework existing bus drivers as they can simply be extended to provide an additional data, - there is no need to modify any existing bus methods as struct resource is already passed to them as argument and thus stored data is simply accessible by other bus drivers. For now, implement this method only for INTRNG. This is motivated by needs of modern SOCs where hardware initialization is not straightforward and resources descriptions are complex, opaque for everyone but provider, and may vary from SOC to SOC. Typical situation is that one bus driver can fetch a resource description for its child device, but it's opaque for this driver. Another bus driver knows a provider for this kind of resource and can pass this resource description to it. In fact, something like device IVARS would be perfect for that if implemented generally enough. Unfortunatelly, IVARS are usable only by their owners now. Only owner knows its IVARS layout, thus other bus drivers are not able to use them. Differential Revision: https://reviews.freebsd.org/D6632	2016-06-05 16:07:57 +00:00
Andrew Turner	d1605cda2b	Add an interface to handle interrupt controllers that have a contiguous range of interrupts they pass to a second controller driver to handle. The parent driver is expected to detect when one of these interrupts has been triggered and call intr_child_irq_handler to pass the interrupt to a child. The children controllers are then expected to manage the range by allocating interrupts as needed. This will initially be used by the ARM GICv3 driver, but is is expected to be useful for other driver where this type of allocation applies. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6436	2016-06-03 10:13:18 +00:00
Mateusz Guzik	f3c8e16ea5	taskqueue: plug a leak in _taskqueue_create While here make some style fixes and postpone the sprintf so that it is only done when the function can no longer fail. CID: 1356041	2016-06-02 15:52:34 +00:00
Mateusz Guzik	fc4f686d59	Microoptimize locking primitives by avoiding unnecessary atomic ops. Inline version of primitives do an atomic op and if it fails they fallback to actual primitives, which immediately retry the atomic op. The obvious optimisation is to check if the lock is free and only then proceed to do an atomic op. Reviewed by: jhb, vangyzen	2016-06-01 18:32:20 +00:00
Bjoern A. Zeeb	3f58662dd9	The pr_destroy field does not allow us to run the teardown code in a specific order. VNET_SYSUNINITs however are doing exactly that. Thus remove the VIMAGE conditional field from the domain(9) protosw structure and replace it with VNET_SYSUNINITs. This also allows us to change some order and to make the teardown functions file local static. Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use internally. Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g., for pfil consumers (firewalls), partially for this commit and for others to come. Reviewed by: gnn, tuexen (sctp), jhb (kernel.h) Obtained from: projects/vnet MFC after: 2 weeks X-MFC: do not remove pr_destroy Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6652	2016-06-01 10:14:04 +00:00
Gleb Smirnoff	34e05ebe72	Fix kernel stack disclosures in the Linux and 4.3BSD compat layers. Submitted by: CTurt Security: SA-16:20 Security: SA-16:21	2016-05-31 16:56:30 +00:00
Edward Tomasz Napierala	f7bd221730	Cosmetics - add missing space after ellipses in shutdown messages. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-05-31 15:27:33 +00:00
Jamie Gritton	ee8d6bd352	Mark jail(2), and the sysctls that it (and only it) uses as deprecated. jail(8) has long used jail_set(2), and those sysctl only cause confusion.	2016-05-30 05:21:24 +00:00
Mateusz Guzik	2dbdf49cf4	fd: provide a common exit point for unlock in kern_dup While here assert dropped filedesc lock on return from closefp.	2016-05-27 17:00:15 +00:00
Mateusz Guzik	cda688443a	exec: get rid of one vnode lock/unlock pair in do_execve The lock was temporarily dropped for vrele calls, but they can be postponed to a point where the lock is not held in the first place. While here shuffle other code not needing the lock.	2016-05-27 15:03:38 +00:00
Bryan Drewery	1afd78b34d	exec: Provide execpath in imgp for the process_exec hook. This was previously set after the hook and only if auxargs were present. Now always provide it if possible. MFC after: 2 weeks Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D6546	2016-05-26 23:19:39 +00:00
Bryan Drewery	881010f05d	exec: Add credential change information into imgp for process_exec hook. This allows an EVENTHANDLER(process_exec) hook to see if the new image will cause credentials to change whether due to setgid/setuid or because of POSIX saved-id semantics. This adds 3 new fields into image_params: struct ucred *newcred Non-null if the credentials will change. bool credential_setid True if the new image is setuid or setgid. This will pre-determine the new credentials before invoking the image activators, where the process_exec hook is called. The new credentials will be installed into the process in the same place as before, after image activators are done handling the image. MFC after: 2 weeks Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D6544	2016-05-26 23:18:54 +00:00
Conrad Meyer	571ebf7685	crypto routines: Hint minimum buffer sizes to the compiler Use the C99 'static' keyword to hint to the compiler IVs and output digest sizes. The keyword informs the compiler of the minimum valid size for a given array. Obviously not every pointer can be validated (i.e., the compiler can produce false negative but not false positive reports). No functional change. No ABI change. Sponsored by: EMC / Isilon Storage Division	2016-05-26 19:29:29 +00:00
Hans Petter Selasky	84e717c4cf	Add support for boolean sysctl's. Because the size of bool can be implementation defined, make a bool sysctl handler which handle bools. Userspace sees the bools like unsigned 8-bit integers. Values are filtered to either 1 or 0 upon read and write, similar to what a compiler would do. Requested by: kmacy @ Sponsored by: Mellanox Technologies	2016-05-26 08:41:55 +00:00
Ian Lepore	a66dc0c52b	Include machine/acle-compat.h in cdefs.h on arm if the compiler doesn't have ACLE support built in. The ACLE (ARM C Language Extensions) defines a set of standardized symbols which indicate the architecture version and features available. ACLE support is built in to modern compilers (both clang and gcc), but absent from gcc prior to 4.4. ARM (the company) provides the acle-compat.h header file to define the right symbols for older versions of gcc. Basically, acle-compat.h does for arm about the same thing cdefs.h does for freebsd: defines standardized macros that work no matter which compiler you use. If ARM hadn't provided this file we would have ended up with a big #ifdef __arm__ section in cdefs.h with our own compatibility shims. Remove #include <machine/acle-compat.h> from the zillion other places (an ever-growing list) that it appears. Since style(9) requires sys/types.h or sys/param.h early in the include list, and both of those lead to including cdefs.h, only a couple special cases still need to include acle-compat.h directly. Loves it: imp	2016-05-25 19:44:26 +00:00
Konstantin Belousov	c5e44d6cd5	Silence false LOR report due to the taskqueue mutex and kqueue lock named the same. Reported by: Doug Luce <doug@freebsd.con.com> Sponsored by: The FreeBSD Foundation	2016-05-24 21:13:33 +00:00
John Baldwin	778ce4f297	Return the correct status when a partially completed request is cancelled. After the previous changes to fix requests on blocking sockets to complete across multiple operations, an edge case exists where a request can be cancelled after it has partially completed. POSIX doesn't appear to dictate exactly how to handle this case, but in general I feel that aio_cancel() should arrange to cancel any request it can, but that any partially completed requests should return a partial completion rather than ECANCELED. To that end, fix the socket AIO cancellation routine to return a short read/write if a partially completed request is cancelled rather than ECANCELED. Sponsored by: Chelsio Communications	2016-05-24 21:09:05 +00:00
Andrew Turner	974692e3bf	Limit calling pmc_hook to when the interrupt comes while running userspace. We may enable interrupts from within the callback, e.g. in a data abort during copyin. If we receive an interrupt at that time pmc_hook will be called again and, as it is handling userspace stack tracing, will hit a KASSERT as it checks if the trapframe is from userland. With this I can run hwpmc with intrng on a ThunderX and have it trace all CPUs. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-24 12:06:56 +00:00
John Baldwin	1717b68af1	Don't prematurely return short completions on blocking sockets. Always requeue an AIO job at the head of the socket buffer's queue if sosend() or soreceive() returns EWOULDBLOCK on a blocking socket. Previously, requests were only requeued if they returned EWOULDBLOCK and completed no data. Now after a partial completion on a blocking socket the request is queued and the remaining request is retried when the socket is ready. This allows writes larger than the currently available space on a blocking socket to fully complete. Reads on a blocking socket that satifsy the low watermark can still return a short read (same as read()). In order to track previously completed data, the internal 'status' field of the AIO job is used to store the amount of previously computed data. Non-blocking sockets continue to return short completions for both reads and writes. Add a test for a "large" AIO write on a blocking socket that writes twice the socket buffer size to a UNIX domain socket. Sponsored by: Chelsio Communications	2016-05-24 03:13:27 +00:00
Alan Somers	37f32e5379	Fix build of kern/subr_unit.c, broken by r300539 Reported by: peter Pointyhat to: asomers Sponsored by: Spectra Logic Corp	2016-05-24 00:14:58 +00:00
Alan Somers	1b82e02f4d	Add bit_count to the bitstring(3) api Add a bit_count function, which efficiently counts the number of bits set in a bitstring. sys/sys/bitstring.h tests/sys/sys/bitstring_test.c share/man/man3/bitstring.3 Add bit_alloc sys/kern/subr_unit.c Use bit_count instead of a naive counting loop in check_unrhdr, used when INVARIANTS are enabled. The userland test runs about 6x faster in a generic build, or 8.5x faster when built for Nehalem, which has the POPCNT instruction. sys/sys/param.h Bump __FreeBSD_version due to the addition of bit_alloc UPDATING Add a note about the ABI incompatibility of the bitstring(3) changes, as suggested by lidl. Suggested by: gibbs Reviewed by: gibbs, ngie MFC after: 9 days X-MFC-With: 299090, 300538 Relnotes: yes Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6255	2016-05-23 20:29:18 +00:00
Andrew Turner	df7a2251cc	Add the needed hwpmc hooks to subr_intr.c. This is needed for the correct operation of hwpmc on, for example, arm64 with intrng. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-23 15:26:35 +00:00
Hans Petter Selasky	bc64e52679	Use DELAY() instead of _sleep() when SCHEDULER_STOPPED() is set inside pause_sbt(). This allows pause() to continue working during a panic() which is not invoking KDB. This is useful when debugging graphics drivers using the LinuxKPI. Obtained from: kmacy @ MFC after: 1 week	2016-05-23 10:31:54 +00:00
Baptiste Daroussin	306e53bce9	Fix typo introduced by me (not the submitter) when fixing typos	2016-05-22 13:10:48 +00:00
Baptiste Daroussin	2fd642c899	Fix typos in the comments Submitted by: cipherwraith666@gmail.com (via github)	2016-05-22 13:04:45 +00:00
Andriy Gapon	7107bed0f0	fix loss of taskqueue wakeups (introduced in r300113) Submitted by: kmacy Tested by: dchagin	2016-05-21 14:51:49 +00:00
John Baldwin	20fee1093e	Add sglist functions for working with arrays of VM pages. sglist_count_vmpages() determines the number of segments required for a buffer described by an array of VM pages. sglist_append_vmpages() adds the segments described by such a buffer to an sglist. The latter function is largely pulled from sglist_append_bio(), and sglist_append_bio() now uses sglist_append_vmpages(). Reviewed by: kib Sponsored by: Chelsio Communications	2016-05-20 23:28:43 +00:00
John Baldwin	f0ec174043	Consistently set status to -1 when completing an AIO request with an error. Sponsored by: Chelsio Communications	2016-05-20 19:46:25 +00:00
John Baldwin	cc981af204	Add new bus methods for mapping resources. Add a pair of bus methods that can be used to "map" resources for direct CPU access using bus_space(9). bus_map_resource() creates a mapping and bus_unmap_resource() releases a previously created mapping. Mappings are described by 'struct resource_map' object. Pointers to these objects can be passed as the first argument to the bus_space wrapper API used for bus resources. Drivers that wish to map all of a resource using default settings (for example, using uncacheable memory attributes) do not need to change. However, drivers that wish to use non-default settings can now do so without jumping through hoops. First, an RF_UNMAPPED flag is added to request that a resource is not implicitly mapped with the default settings when it is activated. This permits other activation steps (such as enabling I/O or memory decoding in a device's PCI command register) to be taken without creating a mapping. Right now the AGP drivers don't set RF_ACTIVE to avoid using up a large amount of KVA to map the AGP aperture on 32-bit platforms. Once RF_UNMAPPED is supported on all platforms that support AGP this can be changed to using RF_UNMAPPED with RF_ACTIVE instead. Second, bus_map_resource accepts an optional structure that defines additional settings for a given mapping. For example, a driver can now request to map only a subset of a resource instead of the entire range. The AGP driver could also use this to only map the first page of the aperture (IIRC, it calls pmap_mapdev() directly to map the first page currently). I will also eventually change the PCI-PCI bridge driver to request mappings of the subset of the I/O window resource on its parent side to create mappings for child devices rather than passing child resources directly up to nexus to be mapped. This also permits bridges that do address translation to request suitable mappings from a resource on the "upper" side of the bus when mapping resources on the "lower" side of the bus. Another attribute that can be specified is an alternate memory attribute for memory-mapped resources. This can be used to request a Write-Combining mapping of a PCI BAR in an MI fashion. (Currently the drivers that do this call pmap_change_attr() directly for x86 only.) Note that this commit only adds the MI framework. Each platform needs to add support for handling RF_UNMAPPED and thew new bus_map/unmap_resource methods. Generally speaking, any drivers that are calling rman_set_bustag() and rman_set_bushandle() need to be updated. Discussed on: arch Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D5237	2016-05-20 17:57:47 +00:00
Mark Johnston	5e0a6f31e5	Move IPv6 malloc tag definitions into the IPv6 code.	2016-05-20 04:45:08 +00:00
Scott Long	7e52504fc2	Adjust the creation of tq_name so it can be freed correctly Reviewed by: jhb, allanjude Differential Revision: D6454	2016-05-19 17:14:24 +00:00
Kenneth D. Merry	9a6844d55f	Add support for managing Shingled Magnetic Recording (SMR) drives. This change includes support for SCSI SMR drives (which conform to the Zoned Block Commands or ZBC spec) and ATA SMR drives (which conform to the Zoned ATA Command Set or ZAC spec) behind SAS expanders. This includes full management support through the GEOM BIO interface, and through a new userland utility, zonectl(8), and through camcontrol(8). This is now ready for filesystems to use to detect and manage zoned drives. (There is no work in progress that I know of to use this for ZFS or UFS, if anyone is interested, let me know and I may have some suggestions.) Also, improve ATA command passthrough and dispatch support, both via ATA and ATA passthrough over SCSI. Also, add support to camcontrol(8) for the ATA Extended Power Conditions feature set. You can now manage ATA device power states, and set various idle time thresholds for a drive to enter lower power states. Note that this change cannot be MFCed in full, because it depends on changes to the struct bio API that break compatilibity. In order to avoid breaking the stable API, only changes that don't touch or depend on the struct bio changes can be merged. For example, the camcontrol(8) changes don't depend on the new bio API, but zonectl(8) and the probe changes to the da(4) and ada(4) drivers do depend on it. Also note that the SMR changes have not yet been tested with an actual SCSI ZBC device, or a SCSI to ATA translation layer (SAT) that supports ZBC to ZAC translation. I have not yet gotten a suitable drive or SAT layer, so any testing help would be appreciated. These changes have been tested with Seagate Host Aware SATA drives attached to both SAS and SATA controllers. Also, I do not have any SATA Host Managed devices, and I suspect that it may take additional (hopefully minor) changes to support them. Thanks to Seagate for supplying the test hardware and answering questions. sbin/camcontrol/Makefile: Add epc.c and zone.c. sbin/camcontrol/camcontrol.8: Document the zone and epc subcommands. sbin/camcontrol/camcontrol.c: Add the zone and epc subcommands. Add auxiliary register support to build_ata_cmd(). Make sure to set the CAM_ATAIO_NEEDRESULT, CAM_ATAIO_DMA, and CAM_ATAIO_FPDMA flags as appropriate for ATA commands. Add a new get_ata_status() function to parse ATA result from SCSI sense descriptors (for ATA passthrough over SCSI) and ATA I/O requests. sbin/camcontrol/camcontrol.h: Update the build_ata_cmd() prototype Add get_ata_status(), zone(), and epc(). sbin/camcontrol/epc.c: Support for ATA Extended Power Conditions features. This includes support for all features documented in the ACS-4 Revision 12 specification from t13.org (dated February 18, 2016). The EPC feature set allows putting a drive into a power power mode immediately, or setting timeouts so that the drive will automatically enter progressively lower power states after various idle times. sbin/camcontrol/fwdownload.c: Update the firmware download code for the new build_ata_cmd() arguments. sbin/camcontrol/zone.c: Implement support for Shingled Magnetic Recording (SMR) drives via SCSI Zoned Block Commands (ZBC) and ATA Zoned Device ATA Command Set (ZAC). These specs were developed in concert, and are functionally identical. The primary differences are due to SCSI and ATA differences. (SCSI is big endian, ATA is little endian, for example.) This includes support for all commands defined in the ZBC and ZAC specs. sys/cam/ata/ata_all.c: Decode a number of additional ATA command names in ata_op_string(). Add a new CCB building function, ata_read_log(). Add ata_zac_mgmt_in() and ata_zac_mgmt_out() CCB building functions. These support both DMA and NCQ encapsulation. sys/cam/ata/ata_all.h: Add prototypes for ata_read_log(), ata_zac_mgmt_out(), and ata_zac_mgmt_in(). sys/cam/ata/ata_da.c: Revamp the ada(4) driver to support zoned devices. Add four new probe states to gather information needed for zone support. Add a new adasetflags() function to avoid duplication of large blocks of flag setting between the async handler and register functions. Add new sysctl variables that describe zone support and paramters. Add support for the new BIO_ZONE bio, and all of its subcommands: DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP, DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS. sys/cam/scsi/scsi_all.c: Add command descriptions for the ZBC IN/OUT commands. Add descriptions for ZBC Host Managed devices. Add a new function, scsi_ata_pass() to do ATA passthrough over SCSI. This will eventually replace scsi_ata_pass_16() -- it can create the 12, 16, and 32-byte variants of the ATA PASS-THROUGH command, and supports setting all of the registers defined as of SAT-4, Revision 5 (March 11, 2016). Change scsi_ata_identify() to use scsi_ata_pass() instead of scsi_ata_pass_16(). Add a new scsi_ata_read_log() function to facilitate reading ATA logs via SCSI. sys/cam/scsi/scsi_all.h: Add the new ATA PASS-THROUGH(32) command CDB. Add extended and variable CDB opcodes. Add Zoned Block Device Characteristics VPD page. Add ATA Return SCSI sense descriptor. Add prototypes for scsi_ata_read_log() and scsi_ata_pass(). sys/cam/scsi/scsi_da.c: Revamp the da(4) driver to support zoned devices. Add five new probe states, four of which are needed for ATA devices. Add five new sysctl variables that describe zone support and parameters. The da(4) driver supports SCSI ZBC devices, as well as ATA ZAC devices when they are attached via a SCSI to ATA Translation (SAT) layer. Since ZBC -> ZAC translation is a new feature in the T10 SAT-4 spec, most SATA drives will be supported via ATA commands sent via the SCSI ATA PASS-THROUGH command. The da(4) driver will prefer the ZBC interface, if it is available, for performance reasons, but will use the ATA PASS-THROUGH interface to the ZAC command set if the SAT layer doesn't support translation yet. As I mentioned above, ZBC command support is untested. Add support for the new BIO_ZONE bio, and all of its subcommands: DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP, DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS. Add scsi_zbc_in() and scsi_zbc_out() CCB building functions. Add scsi_ata_zac_mgmt_out() and scsi_ata_zac_mgmt_in() CCB/CDB building functions. Note that these have return values, unlike almost all other CCB building functions in CAM. The reason is that they can fail, depending upon the particular combination of input parameters. The primary failure case is if the user wants NCQ, but fails to specify additional CDB storage. NCQ requires using the 32-byte version of the SCSI ATA PASS-THROUGH command, and the current CAM CDB size is 16 bytes. sys/cam/scsi/scsi_da.h: Add ZBC IN and ZBC OUT CDBs and opcodes. Add SCSI Report Zones data structures. Add scsi_zbc_in(), scsi_zbc_out(), scsi_ata_zac_mgmt_out(), and scsi_ata_zac_mgmt_in() prototypes. sys/dev/ahci/ahci.c: Fix SEND / RECEIVE FPDMA QUEUED in the ahci(4) driver. ahci_setup_fis() previously set the top bits of the sector count register in the FIS to 0 for FPDMA commands. This is okay for read and write, because the PRIO field is in the only thing in those bits, and we don't implement that further up the stack. But, for SEND and RECEIVE FPDMA QUEUED, the subcommand is in that byte, so it needs to be transmitted to the drive. In ahci_setup_fis(), always set the the top 8 bits of the sector count register. We need it in both the standard and NCQ / FPDMA cases. sys/geom/eli/g_eli.c: Pass BIO_ZONE commands through the GELI class. sys/geom/geom.h: Add g_io_zonecmd() prototype. sys/geom/geom_dev.c: Add new DIOCZONECMD ioctl, which allows sending zone commands to disks. sys/geom/geom_disk.c: Add support for BIO_ZONE commands. sys/geom/geom_disk.h: Add a new flag, DISKFLAG_CANZONE, that indicates that a given GEOM disk client can handle BIO_ZONE commands. sys/geom/geom_io.c: Add a new function, g_io_zonecmd(), that handles execution of BIO_ZONE commands. Add permissions check for BIO_ZONE commands. Add command decoding for BIO_ZONE commands. sys/geom/geom_subr.c: Add DDB command decoding for BIO_ZONE commands. sys/kern/subr_devstat.c: Record statistics for REPORT ZONES commands. Note that the number of bytes transferred for REPORT ZONES won't quite match what is received from the harware. This is because we're necessarily counting bytes coming from the da(4) / ada(4) drivers, which are using the disk_zone.h interface to communicate up the stack. The structure sizes it uses are slightly different than the SCSI and ATA structure sizes. sys/sys/ata.h: Add many bit and structure definitions for ZAC, NCQ, and EPC command support. sys/sys/bio.h: Convert the bio_cmd field to a straight enumeration. This will yield more space for additional commands in the future. After change r297955 and other related changes, this is now possible. Converting to an enumeration will also prevent use as a bitmask in the future. sys/sys/disk.h: Define the DIOCZONECMD ioctl. sys/sys/disk_zone.h: Add a new API for managing zoned disks. This is very close to the SCSI ZBC and ATA ZAC standards, but uses integers in native byte order instead of big endian (SCSI) or little endian (ATA) byte arrays. This is intended to offer to the complete feature set of the ZBC and ZAC disk management without requiring the application developer to include SCSI or ATA headers. We also use one set of headers for ioctl consumers and kernel bio-level consumers. sys/sys/param.h: Bump __FreeBSD_version for sys/bio.h command changes, and inclusion of SMR support. usr.sbin/Makefile: Add the zonectl utility. usr.sbin/diskinfo/diskinfo.c Add disk zoning capability to the 'diskinfo -v' output. usr.sbin/zonectl/Makefile: Add zonectl makefile. usr.sbin/zonectl/zonectl.8 zonectl(8) man page. usr.sbin/zonectl/zonectl.c The zonectl(8) utility. This allows managing SCSI or ATA zoned disks via the disk_zone.h API. You can report zones, reset write pointers, get parameters, etc. Sponsored by: Spectra Logic Differential Revision: https://reviews.freebsd.org/D6147 Reviewed by: wblock (documentation)	2016-05-19 14:08:36 +00:00
Gleb Smirnoff	e987742995	The SA-16:19 wouldn't have happened if the sockargs() had properly typed argument for length. While here make it static and convert to ANSI C. Reviewed by: C Turt	2016-05-18 22:05:50 +00:00
Ravi Pokala	08907ec39d	Fix misleading comments in bus_if.m While looking at r300073, I noticed these incorrect comments in the context of the diff. Reviewed by: imp, jhb Differential Revision: https://reviews.freebsd.org/D6431	2016-05-18 16:25:34 +00:00
Andrew Turner	9346e9130d	Return the struct intr_pic pointer from intr_pic_register. This will be needed in later changes where we may not be able to lock the pic list lock to perform a lookup, e.g. from within interrupt context. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-18 15:05:44 +00:00
Konstantin Belousov	3f7ca894de	Ensure that ftruncate(2) is performed synchronously when file is opened in O_SYNC mode, at least for UFS. This also handles truncation, done due to the O_SYNC \| O_TRUNC flags combination to open(2), in synchronous way. Noted by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-05-18 12:03:57 +00:00
Scott Long	4c7070db25	Import the 'iflib' API library for network drivers. From the author: "iflib is a library to eliminate the need for frequently duplicated device independent logic propagated (poorly) across many network drivers." Participation is purely optional. The IFLIB kernel config option is provided for drivers that want to transition between legacy and iflib modes of operation. ixl and ixgbe driver conversions will be committed shortly. We hope to see participation from the Broadcom and maybe Chelsio drivers in the near future. Submitted by: mmacy@nextbsd.org Reviewed by: gallatin Differential Revision: D5211	2016-05-18 04:35:58 +00:00
Mark Johnston	ef89d843d9	Do not acquire the thread lock in hardclock_cnt() unless needed. This function only sets thread flags if a SIGPROF or SIGVTALRM timer has fired, which is almost never the case. MFC after: 2 weeks	2016-05-18 03:55:54 +00:00
Mark Johnston	d2f5f8db87	Micro-optimize sleepq_broadcast(). - Avoid a conditional branch on the return value of sleepq_resume_thread() by ORing its return value into the boolean wakeup_swapper. This is consistent with other sleepqueue functions which just pass this return value to their caller. - sleepq_resume_thread() unconditionally removes the thread from its queue, so there's no need to maintain a pointer to the next element in the queue. MFC after: 2 weeks	2016-05-18 03:50:21 +00:00
Mark Johnston	be2dfd58fe	Remove the MUTEX_DEBUG kernel option. It has no counterpart among the other lock primitives and has been a no-op for years. Mutex consistency checks are generally done whenver INVARIANTS is enabled.	2016-05-18 03:34:02 +00:00
Mark Johnston	5002e19502	Guard the lockstat:::thread-spin probe with KDTRACE_HOOKS. X-MFC-With: r300103	2016-05-18 03:23:07 +00:00
Mark Johnston	156fbc14a0	lockstat:::thread-spin should only fire after spinning for the lock. MFC after: 1 week	2016-05-18 03:21:21 +00:00
Gleb Smirnoff	17cd649f4a	Add a comment and KASSERT that a M_NOFREE mbuf has always EXT_EXTREF ext. Submitted by: kmacy	2016-05-17 23:15:16 +00:00
Warner Losh	0ac974ec78	Don't forget to quote \ characters with \.	2016-05-17 22:52:42 +00:00
Gleb Smirnoff	7349ea785c	Validate that user supplied control message length is not negative. Submitted by: C Turt <cturt hardenedbsd.org> Security: SA-16:19 Security: CVE-2016-1887	2016-05-17 22:28:53 +00:00
John Baldwin	ed7ed7f0ca	Document the formatting requirements of location and pnpinfo strings. devd requires location and pnpinfo strings generated by bus drivers to be formatted as a list of name=value keypairs. Non-conforming bus drivers cause devd to mis-parse device events for these buses. Note that this documents the desired requirements. devctl_safe_quote() doesn't yet escape backslash characters, and devd doesn't handle escaped characters in quoted values. Differential Revision: https://reviews.freebsd.org/D6252	2016-05-17 19:34:07 +00:00
Konstantin Belousov	2a339d9e3d	Add implementation of robust mutexes, hopefully close enough to the intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013. A robust mutex is guaranteed to be cleared by the system upon either thread or process owner termination while the mutex is held. The next mutex locker is then notified about inconsistent mutex state and can execute (or abandon) corrective actions. The patch mostly consists of small changes here and there, adding neccessary checks for the inconsistent and abandoned conditions into existing paths. Additionally, the thread exit handler was extended to iterate over the userspace-maintained list of owned robust mutexes, unlocking and marking as terminated each of them. The list of owned robust mutexes cannot be maintained atomically synchronous with the mutex lock state (it is possible in kernel, but is too expensive). Instead, for the duration of lock or unlock operation, the current mutex is remembered in a special slot that is also checked by the kernel at thread termination. Kernel must be aware about the per-thread location of the heads of robust mutex lists and the current active mutex slot. When a thread touches a robust mutex for the first time, a new umtx op syscall is issued which informs about location of lists heads. The umtx sleep queues for PP and PI mutexes are split between non-robust and robust. Somewhat unrelated changes in the patch: 1. Style. 2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared pi mutexes. 3. Removal of the userspace struct pthread_mutex m_owner field. 4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls the lifetime of the shared mutex associated with a vnode' page. Reviewed by: jilles (previous version, supposedly the objection was fixed) Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects) Tested by: pho Sponsored by: The FreeBSD Foundation	2016-05-17 09:56:22 +00:00
Andrew Turner	3fc155dc64	Introduce MSI and MSI-X support to intrng. This adds a new msi device interface with 5 methods to mirror the 5 MSI/MSI-X methods in the pcib interface. The pcib driver will need to perform a device specific lookup to find the MSI controller and pass this to intrng as the xref. Intrng will finally find the controller and have it handle the requested operation. Obtained from: ABT Systems Ltd MFH: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5985	2016-05-16 09:11:40 +00:00
Andriy Gapon	27d4b35f6e	vfs_read_dirent: increment ncookies after adding a cookie It seems that at present vfs_read_dirent() is used only with filesystems that do not support cookies, so the bug never manifested itself. MFC after: 1 week	2016-05-16 07:31:11 +00:00
Andriy Gapon	8614f45b2d	dounmount: do not call mountcheckdirs() for mounts with MNT_IGNORE This is a bit hackish, but the flag is currently set only for ZFS snapshots mounted under .zfs. mountcheckdirs() can change cdir/rdir references to a covered vnode. But for the said snapshots the covered vnode is really ephemeral and it must never be accessed (except for a few specific cases). To do: consider removing mountcheckdirs() entirely MFC after: 5 days	2016-05-16 07:23:24 +00:00
John Baldwin	fdce57a042	Add an EARLY_AP_STARTUP option to start APs earlier during boot. Currently, Application Processors (non-boot CPUs) are started by MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until SI_SUB_SMP at which point they are released to run kernel threads. SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter the scheduler and start running threads until fairly late in the boot. This change moves SI_SUB_SMP up to just before software interrupt threads are created allowing the APs to start executing kernel threads much sooner (before any devices are probed). This allows several initialization routines that need to perform initialization on all CPUs to now perform that initialization in one step rather than having to defer the AP initialization to a second SYSINIT run at SI_SUB_SMP. It also permits all CPUs to be available for handling interrupts before any devices are probed. This last feature fixes a problem on with interrupt vector exhaustion. Specifically, in the old model all device interrupts were routed onto the boot CPU during boot. Later after the APs were released at SI_SUB_SMP, interrupts were redistributed across all CPUs. However, several drivers for multiqueue hardware allocate N interrupts per CPU in the system. In a system with many CPUs, just a few drivers doing this could exhaust the available pool of interrupt vectors on the boot CPU as each driver was allocating N * mp_ncpu vectors on the boot CPU. Now, drivers will allocate interrupts on their desired CPUs during boot meaning that only N interrupts are allocated from the boot CPU instead of N * mp_ncpu. Some other bits of code can also be simplified as smp_started is now true much earlier and will now always be true for these bits of code. This removes the need to treat the single-CPU boot environment as a special case. As a transition aid, the new behavior is available under a new kernel option (EARLY_AP_STARTUP). This will allow the option to be turned off if need be during initial testing. I plan to enable this on x86 by default in a followup commit in the next few days and to have all platforms moved over before 11.0. Once the transition is complete, the option will be removed along with the !EARLY_AP_STARTUP code. These changes have only been tested on x86. Other platform maintainers are encouraged to port their architectures over as well. The main things to check for are any uses of smp_started in MD code that can be simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in the EARLY_AP_STARTUP case (e.g. the interrupt shuffling). PR: kern/199321 Reviewed by: markj, gnn, kib Sponsored by: Netflix	2016-05-14 18:22:52 +00:00
Edward Tomasz Napierala	ebc2f37754	Stop hiding errors that result in failure to mount /dev. Otherwise, missing /dev directory makes one end up with a completely deaf (init without stdout/stderr) system with no hints on the console, unless you've booted up with bootverbose. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-05-12 07:38:10 +00:00
Conrad Meyer	fe4be618c9	subr_vmem: Fix double-free in error case of vmem_create If vmem_init() fails, 'vm' is already destroyed and freed. Don't free it again. Reported by: Coverity CID: 1042110 Sponsored by: EMC / Isilon Storage Division	2016-05-11 23:16:11 +00:00
Konstantin Belousov	54a33d2f97	Add vfs_hash_ref(9) function, which finds a vnode by the hash value and returns it referenced. The function is similar to vfs_hash_get(9), but unlike the later, returned vnode is not locked. This operation cannot be requested with the vget(9) flags. Reviewed and tested by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-11 06:32:22 +00:00
Konstantin Belousov	cd85d599d8	Style: wrap long lines. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-11 06:27:00 +00:00
John Baldwin	8d791e5af1	Add a new bus method to fetch device-specific CPU sets. bus_get_cpus() returns a specified set of CPUs for a device. It accepts an enum for the second parameter that indicates the type of cpuset to request. Currently two valus are supported: - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the device when DEVICE_NUMA is enabled) - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core) For systems that do not support NUMA (or if it is not enabled in the kernel config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus' by default. The idea is that INTR_CPUS should always return a valid set. Device drivers which want to use per-CPU interrupts should start using INTR_CPUS instead of simply assigning interrupts to all available CPUs. In the future we may wish to add tunables to control the policy of INTR_CPUS (e.g. should it be local-only or global, should it ignore SMT threads or not). The x86 nexus driver exposes the internal set of interrupt CPUs from the the x86 interrupt code via INTR_CPUS. The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and the global INTR_CPUS set from the nexus driver with the per-domain set from _PXM to generate a local INTR_CPUS set for child devices. Compared to the r298933, this version uses 'struct _cpuset' in <sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h> (<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though <sys/_bitset.h> does not after recent changes).	2016-05-09 20:50:21 +00:00
Andrew Turner	b48c608386	Check malloc succeeded in pic_create, with M_NOWAIT it may return NULL. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-09 12:24:39 +00:00

... 5 6 7 8 9 ...

15463 Commits