freebsd-dev

Author	SHA1	Message	Date
Mateusz Guzik	f652d856ab	filedesc: make fdinit return with source filedesc locked and new one sized appropriately Assert FILEDESC_XLOCK_ASSERT only for already used tables in fdgrowtable. We don't have to call it with the lock held if we are just creating new filedesc. As a side note, strictly speaking processes can have fdtables with fd_lastfile = -1, but then they cannot enter fdgrowtable. Very first file descriptor they get will be 0 and the only syscall allowing to choose fd number requires an active file descriptor. Should this ever change, we can add an 'init' (or similar) parameter to fdgrowtable.	2014-10-31 09:25:28 +00:00
Mateusz Guzik	ffeb890592	filedesc: iterate over fd table only once in fdcopy While here add 'fdused_init' which does not perform unnecessary work. Drop FILEDESC_LOCK_ASSERT from fdisused and rely on callers to hold it when appropriate. This function is only used with INVARIANTS. No functional changes intended.	2014-10-31 09:19:46 +00:00
Mateusz Guzik	1a0c80a3df	filedesc: tidy up fdfree Implement fdefree_last variant and get rid of 'last' parameter. No functional changes.	2014-10-31 09:15:59 +00:00
Mateusz Guzik	b97a758ffc	filedesc: tidy up fdcopy a little bit Test for file availability by fde_file != NULL instead of fdisused, this is consistent with similar checks later. Drop badfileops check. badfileops don't have DFLAG_PASSABLE set, so it was never reached in practice. fdiused is now only used in some KASSERTS, so ifdef it under INVARIANTS. No functional changes.	2014-10-31 05:41:27 +00:00
Mark Murray	10cb24248a	This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random. This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources. The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people. The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway. Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to. My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise. My Nomex pants are on. Let the feedback commence! Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?) Approved by: so(des)	2014-10-30 21:21:53 +00:00
Mateusz Guzik	f55cf4b0d1	filedesc: make sure to force table reload in fget_unlocked when count == 0 This is a fixup to r273843.	2014-10-30 07:21:38 +00:00
Mateusz Guzik	29c85772bb	filedesc: microoptimize fget_unlocked by retrying obtaining reference count without restarting whole lookup Restart is only needed when fp was closed by current process, which is a much rarer event than ref/deref by some other thread.	2014-10-30 05:21:12 +00:00
Mateusz Guzik	aa77d52800	filedesc: get rid of atomic_load_acq_int from fget_unlocked A read barrier was necessary because fd table pointer and table size were updated separately, opening a window where fget_unlocked could read new size and old pointer. This patch puts both these fields into one dedicated structure, pointer to which is later atomically updated. As such, fget_unlocked only needs data a dependency barrier which is a noop on all supported architectures. Reviewed by: kib (previous version) MFC after: 2 weeks	2014-10-30 05:10:33 +00:00
John Baldwin	01e1933dcc	Rework virtual machine hypervisor detection. - Move the existing code to x86/x86/identcpu.c since it is x86-specific. - If the CPUID2_HV flag is set, assume a hypervisor is present and query the 0x40000000 leaf to determine the hypervisor vendor ID. Export the vendor ID and the highest supported hypervisor CPUID leaf via hv_vendor[] and hv_high variables, respectively. The hv_vendor[] array is also exported via the hw.hv_vendor sysctl. - Merge the VMWare detection code from tsc.c into the new probe in identcpu.c. Add a VM_GUEST_VMWARE to identify vmware and use that in the TSC code to identify VMWare. Differential Revision: https://reviews.freebsd.org/D1010 Reviewed by: delphij, jkim, neel	2014-10-28 19:17:44 +00:00
Konstantin Belousov	f7e91c288a	Convert kern_umtx.c to use fueword() and casueword(). Also fix some mishandling of suword(9) errors as errno, which resulted in spurious ERESTART. Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 3 weeks	2014-10-28 15:30:33 +00:00
Konstantin Belousov	0a2c94b86e	Replace some calls to fuword() by fueword() with proper error checking. Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 3 weeks	2014-10-28 15:28:20 +00:00
Konstantin Belousov	4f3dc90023	Add fueword(9) and casueword(9) functions. They are like fuword(9) and casuword(9), but do not mix value read and indication of fault. I know (or remember) enough assembly to handle x86 and powerpc. For arm, mips and sparc64, implement fueword() and casueword() as wrappers around fuword() and casuword(), which means that the functions cannot distinguish between -1 and fault. On architectures where fueword() and casueword() are native, implement fuword() and casuword() using fueword() and casuword(), to reduce assembly code duplication. Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 2 weeks (ia64 needs treating)	2014-10-28 15:22:13 +00:00
Hans Petter Selasky	0e1152fcc2	The SYSCTL data pointers can come from userspace and must not be directly accessed. Although this will work on some platforms, it can throw an exception if the pointer is invalid and then panic the kernel. Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-28 12:00:39 +00:00
Mateusz Guzik	a7963b9758	Simplify sys_getloginclass. Just use current thread credentials as they have the same accuracy as the ones obtained from proc..	2014-10-28 04:59:33 +00:00
Mateusz Guzik	b720dc9749	Change loginclass mutex to an rwlock. While here reduce nesting in loginclass_free. Submitted by: Tiwei Bie <btw mail.ustc.edu.cn> X-Additional: JuniorJobs project MFC after: 2 weeks	2014-10-28 04:33:57 +00:00
Mateusz Guzik	8d866b10fc	Tidy up functions related to uidinfo management. - reference found uidinfo in uilookup - reduce nesting by handling shorter cases first	2014-10-27 20:20:05 +00:00
Mateusz Guzik	8101958c4f	De-k&r-ify function definitions in kern/kern_resource.c No functional changes.	2014-10-27 20:18:30 +00:00
Mateusz Guzik	e015b1ab0a	Avoid dynamic syscall overhead for statically compiled modules. The kernel tracks syscall users so that modules can safely unregister them. But if the module is not unloadable or was compiled into the kernel, there is no need to do this. Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC during kernel build and 0 otherwise. Reviewed by: kib (previous version) MFC after: 2 weeks	2014-10-26 19:42:44 +00:00
Mateusz Guzik	b90638866e	Fix up an assertion in kern_setgroups, it should compare with ngroups_max + 1 Bug introdued in r273685. Noted by: Tiwei Bie <btw mail.ustc.edu.cn>	2014-10-26 14:25:42 +00:00
Mateusz Guzik	7e9a456a53	Tidy up sys_setgroups and kern_setgroups. - 'groups' initialization to NULL is always ovewrwriten before use, so plug it - get rid of 'goto out' - kern_setgroups's callers already validate ngrp, so only assert the condition - ngrp is an u_int, so 'ngrp < 1' is more readable as 'ngrp == 0' No functional changes.	2014-10-26 06:04:09 +00:00
Mateusz Guzik	92b064f43d	Use a temporary buffer in sys_setgroups for requests with <= XU_NGROUPS groups. Submitted by: Tiwei Bie <btw mail.ustc.edu.cn> X-Additional: JuniorJobs project MFC after: 2 weeks	2014-10-26 05:39:42 +00:00
Mateusz Guzik	f84f8f9468	Now that sysctl_root is only called with sysctl lock in shared mode, update its assertion to require that. Update comment missed in r273400: sysctl_xlock/unlock -> sysctl_xlock/xunlock Noted by: jhb	2014-10-26 01:47:55 +00:00
John Baldwin	1bc9ea1caa	Use correct type in __DEVOLATILE().	2014-10-25 20:42:47 +00:00
Alexander Motin	ccf8a5688a	Revert somewhat hackish geom_disk optimization, committed as part of r256880, and the following r273143 commit, supposed to workaround introduced issue by quite innocent-looking change. While there is no clear understanding why, but r273143 is accused in data corruption in some environments with high I/O load. I personally don't see any problem in that commit, and possibly it is just a trigger to some other bug somewhere, but better safe then sorry for now. Requested by: scottl@ MFC after: 3 days	2014-10-25 15:16:19 +00:00
Mateusz Guzik	675c3507d4	rlimit: plug duplicate assertion counter sanity is already checked by refcount_release.	2014-10-25 05:56:21 +00:00
Xin LI	6362e06b42	Fix build.	2014-10-25 00:16:36 +00:00
John Baldwin	53e1ffbbce	The current POSIX semaphore implementation stores the _has_waiters flag in a separate word from the _count. This does not permit both items to be updated atomically in a portable manner. As a result, sem_post() must always perform a system call to safely clear _has_waiters. This change removes the _has_waiters field and instead uses the high bit of _count as the _has_waiters flag. A new umtx object type (_usem2) and two new umtx operations are added (SEM_WAIT2 and SEM_WAKE2) to implement these semantics. The older operations are still supported under the COMPAT_FREEBSD9/10 options. The POSIX semaphore API in libc has been updated to use the new implementation. Note that the new implementation is not compatible with the previous implementation. However, this only affects static binaries (which cannot be helped by symbol versioning). Binaries using a dynamic libc will continue to work fine. SEM_MAGIC has been bumped so that mismatched binaries will error rather than corrupting a shared semaphore. In addition, a padding field has been added to sem_t so that it remains the same size. Differential Revision: https://reviews.freebsd.org/D961 Reported by: adrian Reviewed by: kib, jilles (earlier version) Sponsored by: Norse	2014-10-24 20:02:44 +00:00
Dag-Erling Smørgrav	b0d69dfad9	In all cases except CTLTYPE_STRING, penv is NULL here, so passing it indiscriminately to printf() and freeenv() is incorrect. Add a NULL check before freeenv(); as for printf(), we could use req.newptr instead, but we'd have to select the correct format string based on the type, and that's too much work for an error message, so just remove it.	2014-10-23 22:42:56 +00:00
Mateusz Guzik	ffc5ce7b75	In selfdfree re-evaulate sf_si after takin the lock. Otherwise we can race with doselwakeup. This is a fixup to r273549 Reviewed by: jhb Reported by: everyone and their dog	2014-10-23 19:06:08 +00:00
Xin LI	2735a91d93	Test if 'env' is NULL before doing memset() and strlen(), the caller may pass NULL to freeenv().	2014-10-23 18:23:50 +00:00
Mateusz Guzik	73f2e5f759	Avoid taking the lock in selfdfree when not needed.	2014-10-23 15:35:47 +00:00
Colin Percival	b9f6af45a5	Avoid leaking data from the kernel environment: When we convert the initial static environment to a dynamic one, zero the static environment buffer, and zero individual values when kern_unsetenv and freeenv are called. Tested by: kmoore (VM memory dump + grep) Tested by: cperciva (kernel panic dump + grep)	2014-10-22 23:35:32 +00:00
Mateusz Guzik	58a3dcb229	filedesc assert that table size is at least 3 in fdsetugidsafety Requested by: kib	2014-10-22 08:56:57 +00:00
Mateusz Guzik	4bc68ed7bc	Plug unnecessary PRS_NEW check in kern_procctl. pfind does not return processes in such state.	2014-10-22 04:16:09 +00:00
Mateusz Guzik	a39d200bb9	Reduce nesting in vn_access. No functional changes.	2014-10-22 01:53:00 +00:00
Mateusz Guzik	eac9678110	Avoid crdup when possible in kern_accessat. While here tidy up a little.	2014-10-22 01:09:07 +00:00
Mateusz Guzik	11888da8d9	filedesc: cleanup setugidsafety a little Rename it to fdsetugidsafety for consistency with other functions. There is no need to take filedesc lock if not closing any files. The loop has to verify each file and we are guaranteed fdtable has space for at least 20 fds. As such there is no need to check fd_lastfile. While here tidy up is_unsafe.	2014-10-22 00:23:43 +00:00
Mateusz Guzik	07b384cbe2	Eliminate unnecessary memory allocation in sys_getgroups and its ibcs2 counterpart.	2014-10-21 23:08:46 +00:00
Mateusz Guzik	2afec8edfc	Take the lock shared in linker_search_symbol_name. This helps sysctl kern.proc.stack.	2014-10-21 21:29:20 +00:00
Mateusz Guzik	fca7732078	Mark some more sysctl stuff shared-locked and MPSAFE.	2014-10-21 21:08:45 +00:00
Mateusz Guzik	b564c5d6aa	Make sysctl name2oid shared-locked as well. This is a follow-up to r273401.	2014-10-21 19:45:08 +00:00
Mateusz Guzik	efe0abddf5	Implement shared locking for sysctl.	2014-10-21 19:05:44 +00:00
Mateusz Guzik	580a011762	Rename sysctl_lock and _unlock to sysctl_xlock and _xunlock.	2014-10-21 19:02:26 +00:00
Hans Petter Selasky	f0188618f2	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, hence there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
Mateusz Guzik	5c37b305fd	Plug unnecessary binvp NULL initialization and test. Reported by: Coverity CID: 1018889	2014-10-20 22:52:15 +00:00
Mateusz Guzik	966ee9f25f	filedesc: plug 2 write-only variables Reported by: Coverity CID: 1245745, 1245746	2014-10-20 21:57:24 +00:00
Mark Johnston	4fd6ca7275	Fix a typo from r189544, which replaced unp_global_rwlock with unp_list_lock and unp_link_rwlock. MFC after: 3 days	2014-10-20 20:21:40 +00:00
Mateusz Guzik	4fce16e4c9	Provide vfs suspension support only for filesystems which need it, take two. nullfs and unionfs need to request suspension if underlying filesystem(s) use it. Utilize mnt_kern_flag for this purpose. This is a fixup for 273271. No strong objections from: kib Pointy hat to: mjg MFC after: 2 weeks	2014-10-20 18:00:50 +00:00
Marcel Moolenaar	0067051fe7	Fully support constructors for the purpose of code coverage analysis. This involves: 1. Have the loader pass the start and size of the .ctors section to the kernel in 2 new metadata elements. 2. Have the linker backends look for and record the start and size of the .ctors section in dynamically loaded modules. 3. Have the linker backends call the constructors as part of the final work of initializing preloaded or dynamically loaded modules. Note that LLVM appends the priority of the constructors to the name of the .ctors section. Not so when compiling with GCC. The code currently works for GCC and not for LLVM. Submitted by: Dmitry Mikulin <dmitrym@juniper.net> Obtained from: Juniper Networks, Inc.	2014-10-20 17:04:03 +00:00
Mateusz Guzik	020b8f17a0	Provide vfs suspension support only for filesystems which need it. Need is expressed by providing vfs_susp_clean function in vfsops. Differential Revision: D952 Reviewed by: kib (previous version) MFC after: 2 weeks	2014-10-19 06:59:33 +00:00
Adrian Chadd	3fe93b946f	Convert a missed u_char cpu -> int cpu. This was caught by a gcc build. Reported by: luigi Sponsored by: Norse Corp, Inc.	2014-10-19 04:38:02 +00:00
Adrian Chadd	e77f9fed15	Update the ULE scheduler + thread and kinfo structs to use int for cpuid rather than u_char. To try and play nice with the ABI, the u_char CPU ID values are clamped at 254. The new fields now contain the full CPU ID, or -1 for no cpu. Differential Revision: D955 Reviewed by: jhb, kib Sponsored by: Norse Corp, Inc.	2014-10-18 19:36:11 +00:00
Davide Italiano	2be111bf7d	Follow up to r225617. In order to maximize the re-usability of kernel code in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe	2014-10-16 18:04:43 +00:00
Alexander Motin	99b9076c21	Remove setting BIO_DONE flag for BIOs that have done() method. This fixes use-after-free, caused by geom_disk, completing same BIO twice to save extra allocation, and getting BIO_DONE set after the first. MFC after: 1 week	2014-10-15 18:36:34 +00:00
Konstantin Belousov	f821fad417	Implement FIODTYPE for master ptys. Requested and reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-10-15 12:38:26 +00:00
Mateusz Guzik	32e7f8e4d5	Don't take devmtx unnecessarily in vn_isdisk. MFC after: 1 week	2014-10-15 05:17:36 +00:00
Mateusz Guzik	55056be254	filedesc: plug 2 assignments to M_ZERO-ed pointers in falloc_noinstall No functional changes.	2014-10-15 01:16:11 +00:00
Marcel Moolenaar	ddbe5b951f	Fix nits in previous commit: 1. Remove initializer for badstack_sbuf_size; it gets set unconditionally. 2. Remove meaningless comment. 3. Group witness_count and its sysctl together. 4. Fix spacing in for statements (space after for and within condition). 5. Change all M_NOWAIT usages in witness_initialize() to M_WAITOK; not just those that were newly introduced -- the allocation is assumed to succeed for all allocations. 6. Avoid using uint8_t as the base type in sizeof() expressions; Use the variable name (w_rmatrix) as much as possible. Pointed out by: jhb@ (thanks!)	2014-10-11 16:34:01 +00:00
Marcel Moolenaar	90a5222f14	Turn WITNESS_COUNT into a tunable and sysctl. This allows adjusting the value without recompiling the kernel. This is useful when recompiling is not possible as an immediate solution. When we run out of witness objects, witness is completely disabled. Not having an immediate solution can therefore be problematic. Submitted by: Sreekanth Rupavatharam <rupavath@juniper.net> Obtained from: Juniper Networks, Inc.	2014-10-11 02:02:58 +00:00
Marcel Moolenaar	2e7634503e	Regenerate after r272823: Move the SCTP syscalls to netinet with the rest of the SCTP code. Submitted by: Steve Kiernan <stevek@juniper.net> Reviewed by: tuexen, rrs Obtained from: Juniper Networks, Inc.	2014-10-09 15:19:35 +00:00
Marcel Moolenaar	80b47aefa1	Move the SCTP syscalls to netinet with the rest of the SCTP code. The syscalls themselves are tightly coupled with the network stack and therefore should not be in the generic socket code. The following four syscalls have been marked as NOSTD so they can be dynamically registered in sctp_syscalls_init() function: sys_sctp_peeloff sys_sctp_generic_sendmsg sys_sctp_generic_sendmsg_iov sys_sctp_generic_recvmsg The syscalls are also set up to be dynamically registered when COMPAT32 option is configured. As a side effect of moving the SCTP syscalls, getsock_cap needs to be made available outside of the uipc_syscalls.c source file. A proper prototype has been added to the sys/socketvar.h header file. API tests from the SCTP reference implementation have been run to ensure compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout) Submitted by: Steve Kiernan <stevek@juniper.net> Reviewed by: tuexen, rrs Obtained from: Juniper Networks, Inc.	2014-10-09 15:16:52 +00:00
Adrian Chadd	ffcf962dab	Add a bus method to fetch the VM domain for the given device/bus. * Add a bus_if.m method - get_domain() - returning the VM domain or ENOENT if the device isn't in a VM domain; * Add bus methods to print out the domain of the device if appropriate; * Add code in srat.c to save the PXM -> VM domain mapping that's done and expose a function to translate VM domain -> PXM; * Add ACPI and ACPI PCI methods to check if the bus has a _PXM attribute and if so map it to the VM domain; * (.. yes, this works recursively.) * Have the pci bus glue print out the device VM domain if present. Note: this is just the plumbing to start enumerating information - it doesn't at all modify behaviour. Differential Revision: D906 Reviewed by: jhb Sponsored by: Norse Corp	2014-10-09 05:33:25 +00:00
Marcel Moolenaar	383f423be1	Fix draining in ttydev_leave(): 1. ERESTART is not only returned when the revoke count changed. It is also returned when a signal is received. While a change in the revoke count should be ignored, a signal should not. 2. Waiting until the output queue is entirely drained can cause a hang when the underlying device is stuck or broken. Have tty_drain() take care of this by telling it when we're leaving. When leaving, tty_drain() will use a timed wait to address point 2 above and it will check the revoke count to handle point 1 above. The timeout is set to 1 second, which is arbitrary and long enough to expect a change in the output queue. Discussed with: jilles@ Reported by: Yamagi Burmeister <lists@yamagi.org>	2014-10-09 02:30:38 +00:00
Marcel Moolenaar	75c2b79df8	Apply r269126 to tty_timedwait(): Don't return ERESTART when the device is gone.	2014-10-09 01:59:25 +00:00
John Baldwin	232e8b52b0	Add schedgraph traces for callout handlers. Specifically, a callwheel logs a running event each time it executes a callout function. The event includes the function pointer, argument, and whether or not it was run from hardware interrupt context. The callwheel is marked idle when each handler completes. This effectively logs the duration of each callout routine in the graph.	2014-10-08 16:22:59 +00:00
Jung-uk Kim	37417245bf	Make kern.nswbuf tunable from loader. MFC after: 1 week	2014-10-07 20:13:47 +00:00
Mateusz Guzik	dd2390be68	Convert racct stubs to inline functions. This saves some symbols and function calls for kernel without RACCT. MFC after: 1 week	2014-10-06 02:31:33 +00:00
Mateusz Guzik	2b4a2528d7	filedesc: fix up breakage introduced in 272505 Include sequence counter supports incoditionally [1]. This fixes reprted build problems with e.g. nvidia driver due to missing opt_capsicum.h. Replace fishy looking sizeof with offsetof. Make fde_seq the last member in order to simplify calculations. Suggested by: kib [1] X-MFC: with 272505	2014-10-05 19:40:29 +00:00
Konstantin Belousov	57c2505e65	On error, sbuf_bcat() returns -1. Some callers returned this -1 to the upper layers, which interpret it as errno value, which happens to be ERESTART. The result was spurious restarts of the sysctls in loop, e.g. kern.proc.proc, instead of returning ENOMEM to caller. Convert -1 from sbuf_bcat() to ENOMEM, when returning to the callers expecting errno. In collaboration with: pho Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2014-10-05 17:35:59 +00:00
Mateusz Guzik	bad2520a2b	Avoid unnecessary ppeers_lock acquisition in exit1. MFC after: 1 week	2014-10-05 07:21:41 +00:00
Mateusz Guzik	25108069ec	Get rid of crshared.	2014-10-05 02:16:53 +00:00
Konstantin Belousov	4142462eeb	Slightly reword comment. Move code, which is described by the comment, after it. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-10-04 18:51:55 +00:00
Konstantin Belousov	b76278407d	Add kernel option KSTACK_USAGE_PROF to sample the stack depth on interrupts and report the largest value seen as sysctl debug.max_kstack_used. Useful to estimate how close the kernel stack size is to overflow. In collaboration with: Larry Baird <lab@gta.com> Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week	2014-10-04 18:38:14 +00:00
Konstantin Belousov	539c9eef12	Fixes for i/o during coredumping: - Do not dump into system files. - Do not acquire write reference to the mount point where img.core is written, in the coredump(). The vn_rdwr() calls from ELF imgact request the write ref from vn_rdwr(). Recursive acqusition of the write ref deadlocks with the unmount. - Instead, take the range lock for the whole core file. This prevents parallel dumping from two processes executing the same image, converting the useless interleaved dump into sequential dumping, with second core overwriting the first. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-10-04 18:35:00 +00:00
Konstantin Belousov	e3d6feceb1	Add IO_RANGELOCKED flag for vn_rdwr(9), which specifies that vnode is not locked, but range is. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-10-04 18:28:27 +00:00
Ian Lepore	41e8f7efbe	Make kevent(2) periodic timer events more reliably periodic. The event callout is now scheduled using the C_ABSOLUTE flag, and the absolute time of each event is calculated as the time the previous event was scheduled for plus the interval. This ensures that latency in processing a given event doesn't perturb the arrival time of any subsequent events. Reviewed by: jhb	2014-10-04 15:59:15 +00:00
Mateusz Guzik	ee3fd7bbb1	Plug capability races. fp and appropriate capability lookups were not atomic, which could result in improper capabilities being checked. This could result either in protection bypass or in a spurious ENOTCAPABLE. Make fp + capability check atomic with the help of sequence counters. Reviewed by: kib MFC after: 3 weeks	2014-10-04 08:08:56 +00:00
John Baldwin	7aa1071e48	Require p_cansched() for changing a process' protection status via procctl() rather than p_cansee(). Submitted by: rwatson MFC after: 3 days	2014-10-02 21:18:16 +00:00
Will Andrews	9832a24d27	In the syncer, drop the sync mutex while patting the watchdog. Some watchdog drivers (like ipmi) need to sleep while patting the watchdog. See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK). Submitted by: asomers MFC after: 1 month Sponsored by: Spectra Logic MFSpectraBSD: 637548 on 2012/10/04	2014-10-01 15:32:28 +00:00
Navdeep Parhar	a9fa76f27e	Test for absence of M_NOFREE before attempting to purge the mbuf's tags. This will leave more state intact should the assertion go off. MFC after: 1 month	2014-09-30 23:16:26 +00:00
Mateusz Guzik	8e572983d3	Use bzero instead of explicitly zeroing stuff in do_execve. While strictly speaking this is not correct since some fields are pointers, it makes no difference on all supported archs and we already rely on it doing the right thing in other places. No functional changes.	2014-09-29 23:59:19 +00:00
Neel Natu	fbe602fb61	tty_rel_free() can be called more than once for the same tty so make sure that the tty is dequeued from 'tty_list' only the first time. The panic below was seen when a revoke(2) was issued on an nmdm device. In this case there was also a thread that was blocked on a read(2) on the device. The revoke(2) woke up the blocked thread which would typically return an error to userspace. In this case the reader also held the last reference on the file descriptor so fdrop() ended up calling tty_rel_free() via ttydev_close(). tty_rel_free() then tried to dequeue 'tp' again which led to the panic. panic: Bad link elm 0xfffff80042602400 prev->next != elm cpuid = 1 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00f9c90460 kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe00f9c90510 vpanic() at vpanic+0x189/frame 0xfffffe00f9c90590 panic() at panic+0x43/frame 0xfffffe00f9c905f0 tty_rel_free() at tty_rel_free+0x29b/frame 0xfffffe00f9c90640 ttydev_close() at ttydev_close+0x1f9/frame 0xfffffe00f9c90690 devfs_close() at devfs_close+0x298/frame 0xfffffe00f9c90720 VOP_CLOSE_APV() at VOP_CLOSE_APV+0x13c/frame 0xfffffe00f9c90770 vn_close() at vn_close+0x194/frame 0xfffffe00f9c90810 vn_closefile() at vn_closefile+0x48/frame 0xfffffe00f9c90890 devfs_close_f() at devfs_close_f+0x2c/frame 0xfffffe00f9c908c0 _fdrop() at _fdrop+0x29/frame 0xfffffe00f9c908e0 sys_read() at sys_read+0x63/frame 0xfffffe00f9c90980 amd64_syscall() at amd64_syscall+0x2b3/frame 0xfffffe00f9c90ab0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe00f9c90ab0 --- syscall (3, FreeBSD ELF64, sys_read), rip = 0x800b78d8a, rsp = 0x7fffffbfdaf8, rbp = 0x7fffffbfdb30 --- CR: https://reviews.freebsd.org/D851 Reviewed by: glebius, ed Reported by: Leon Dang Sponsored by: Nahanni Systems MFC after: 1 week	2014-09-28 21:12:23 +00:00
Gleb Smirnoff	bd071d4d19	- Remove empty wrappers ether_poll_[de]register_drv(). [1] - Move polling(9) declarations out of ifq.h back to if_var.h they are absolutely unrelated to queues. Submitted by: Mikhail <mp lenta.ru> [1]	2014-09-28 14:05:18 +00:00
Mateusz Guzik	0c4a09a378	Make do_dup() static and move relevant macros to kern_descrip.c No functional changes.	2014-09-26 19:48:47 +00:00
John Baldwin	c1d67516d9	Don't panic if a resource is allocated twice. Instead, print a warning and fail the allocation request. Allocations of "reserved" resources such as PCI BARs already fail the request instead of panic'ing in this case. MFC after: 1 week	2014-09-26 18:37:49 +00:00
Konstantin Belousov	f69261f2f9	Fix fcntl(2) compat32 after r270691. The copyin and copyout of the struct flock are done in the sys_fcntl(), which mean that compat32 used direct access to userland pointers. Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which performs neccessary userland memory accesses, and use it from both native and compat32 fcntl syscalls. Reported by: jhibbits Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-09-25 21:07:19 +00:00
Konstantin Belousov	2c6fbcbed6	In kern_linkat() and kern_renameat(), do not call namei(9) while holding a write reference on the filesystem. Try to get write reference in unblocked way after all vnodes are resolved; if failed, drop all locks and retry after waiting for suspension end. The VFS_UNMOUNT() methods for UFS and tmpfs try to establish suspension on unmount, while covered vnode is locked by VFS, which prevents namei() from stepping over the mount point. The thread doing namei() sleeps on the covered vnode lock, owning the write ref. Reported by: bdrewery Tested by: bdrewery (previous version), pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-09-25 20:42:25 +00:00
Justin Hibbits	a1c1634858	Stage one of multipass suspend/resume Summary: Add the beginnings of multipass suspend/resume, by introducing BUS_SUSPEND_CHILD/BUS_RESUME_CHILD, and move the PCI driver to this. Reviewers: jhb Reviewed By: jhb Differential Revision: https://reviews.freebsd.org/D590	2014-09-23 02:56:40 +00:00
John Baldwin	9696feebe2	Add a new fo_fill_kinfo fileops method to add type-specific information to struct kinfo_file. - Move the various fill_*_info() methods out of kern_descrip.c and into the various file type implementations. - Rework the support for kinfo_ofile to generate a suitable kinfo_file object for each file and then convert that to a kinfo_ofile structure rather than keeping a second, different set of code that directly manipulates type-specific file information. - Remove the shm_path() and ksem_info() layering violations. Differential Revision: https://reviews.freebsd.org/D775 Reviewed by: kib, glebius (earlier version)	2014-09-22 16:20:47 +00:00
John Baldwin	fadf3fb98c	Convert from timeout(9) to callout(9).	2014-09-22 14:27:26 +00:00
Hans Petter Selasky	9fd573c39d	Improve transmit sending offload, TSO, algorithm in general. The current TSO limitation feature only takes the total number of bytes in an mbuf chain into account and does not limit by the number of mbufs in a chain. Some kinds of hardware is limited by two factors. One is the fragment length and the second is the fragment count. Both of these limits need to be taken into account when doing TSO. Else some kinds of hardware might have to drop completely valid mbuf chains because they cannot loaded into the given hardware's DMA engine. The new way of doing TSO limitation has been made backwards compatible as input from other FreeBSD developers and will use defaults for values not set. Reviewed by: adrian, rmacklem Sponsored by: Mellanox Technologies MFC after: 1 week	2014-09-22 08:27:27 +00:00
Sean Bruno	7c51714e0a	svn revisions r269964 and r269963 seemed to have impaired small memory footprint systems(32M/64M) and didn't leave enough free memory to load modules when it was setting up page tables that for sizes that are never used on these smallish boards. Set kmem_zmax to PAGE_SIZE on these smaller systems (< 128M) to keep this from happening. Verified on mips32 h/w. PR: 193465 Submitted by: delphij Reviewed by: adrian	2014-09-22 05:07:22 +00:00
Alexander Motin	ae9e9b4fda	Reprase r271616 comments. Submitted by: alc MFC after: 1 month	2014-09-17 17:43:32 +00:00
Adrian Chadd	066da8050b	Migrate ie->ie_assign_cpu and associated code to use an int for CPU rather than u_char. Migrate post_filter to use an int for a CPU rather than u_char. Change intr_event_bind() to use an int for CPU rather than u_char. It touches the ppc, sparc64, arm and mips machdep code but it should (hah!) be a no-op. Tested: * i386, AMD64 laptops Reviewed by: jhb	2014-09-17 17:33:22 +00:00
Adrian Chadd	7f7528fc79	Modify cpuset_setithread() to take a CPU ID as an integer, not a char. We're going to end up having > 254 CPUs at some point.	2014-09-16 01:21:47 +00:00
Enji Cooper	257597a434	Validate the mode argument in access, eaccess, and faccessat for optional POSIX compliance and to improve compatibility with Linux and NetBSD The issue was identified with lib/libc/sys/t_access:access_inval from NetBSD Update the manpage accordingly PR: 181155 Reviewed by: jilles (code), jmmv (code), wblock (manpage), wollman (code) MFC after: 4 weeks Phabric: D678 (code), D786 (manpage) Sponsored by: EMC / Isilon Storage Division	2014-09-16 00:56:47 +00:00
Alexander Motin	7965496958	Add comments describing r271604 change. MFC after: 3 days	2014-09-15 11:17:36 +00:00
Alexander Motin	7e9b58eaaa	Add couple memory barries to serialize tdq_cpu_idle and tdq_load accesses. This change fixes transient performance drops in some of my benchmarks, vanishing as soon as I am trying to collect any stats from the scheduler. It looks like reordered access to those variables sometimes caused loss of IPI_PREEMPT, that delayed thread execution until some later interrupt. MFC after: 3 days	2014-09-14 22:13:19 +00:00
Alexander V. Chernikov	c1d9ecf2be	Fix error handling in cpuset_setithread() introduced in r267716. Noted by: kib MFC after: 1 week	2014-09-13 13:46:16 +00:00
John Baldwin	2d69d0dcc2	Fix various issues with invalid file operations: - Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(), and invfo_kqfilter() for use by file types that do not support the respective operations. Home-grown versions of invfo_poll() were universally broken (they returned an errno value, invfo_poll() uses poll_no_poll() to return an appropriate event mask). Home-grown ioctl routines also tended to return an incorrect errno (invfo_ioctl returns ENOTTY). - Use the invfo_() functions instead of local versions for unsupported file operations. - Reorder fileops members to match the order in the structure definition to make it easier to spot missing members. - Add several missing methods to linuxfileops used by the OFED shim layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most of these used invfo_(), but a dummy fo_stat() implementation was added.	2014-09-12 21:29:10 +00:00
John Baldwin	cd550b9b52	Tweak pipe_truncate() to more closely match pipe_chown() and pipe_chmod() by checking PIPE_NAMED and using invfo_truncate() for unnamed pipes.	2014-09-12 21:20:36 +00:00
John Baldwin	0ed667f6e5	Simplify vntype_to_kinfo() by returning when the desired value is found instead of breaking out of the loop and then immediately checking the loop index so that if it was broken out of the proper value can be returned. While here, use nitems().	2014-09-12 20:56:09 +00:00
Gleb Smirnoff	27ad26d8c7	Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().	2014-09-10 12:36:41 +00:00
Edward Tomasz Napierala	f514b97b7d	Avoid unlocking unlocked mutex in RCTL jail code. Specific test case is attached to PR. PR: 193457 MFC after: 1 week Sponsored by: The FreeBSD Foundation	2014-09-09 16:05:33 +00:00
Hiroki Sato	714266373c	- Make hhook_run_socket() vnet-aware instead of adding CURVNET_SET() around the function calls. - Fix a memory leak and stats in the case that hhook_run_socket() fails in soalloc(). PR: 193265	2014-09-08 09:04:22 +00:00
Jean-Sébastien Pédron	ffea80b445	pause_sbt(): Take the cold path (ie. use DELAY()) if KDB is active This fixes a panic in the i915 driver when one uses debug.kdb.enter=1 under vt(4). PR: 193269 Reported by: emaste@ Submitted by: avg@ MFC after: 3 days	2014-09-08 08:44:50 +00:00
Gleb Smirnoff	9e739a5a05	Fix for r271182. Submitted by: mjg Pointy hat to: me, submitter and everyone who urged me to commit	2014-09-07 05:44:14 +00:00
Mateusz Guzik	64196a9996	Plug unnecessary fp assignments in kern_fcntl. No functional changes.	2014-09-05 23:56:25 +00:00
Gleb Smirnoff	d9257d8b57	Set vnet context before accessing V_socket_hhh[]. Submitted by: "Hiroo Ono (小野寛生)" <hiroo.ono+freebsd gmail.com>	2014-09-05 19:50:18 +00:00
Sean Bruno	65f20a89f1	Allow multiple image activators to run on the same execution by changing imgp->interpreted to a bitmask instead of, functionally, a bool. Each imgactivator now requires its own flag in interpreted to indicate whether or not it has already examined argv[0]. Change imgp->interpreted to an unsigned char to add one extra bit for future use. With this change, one can execute a shell script from a 64bit host native make and still get the binmisc image activator to fire for the script interpreter. Prior to this, execution would fail. Phabric: https://reviews.freebsd.org/D696 Reviewed by: jhb@ MFC after: 4 weeks	2014-09-04 21:31:25 +00:00
Gleb Smirnoff	7ee2d05890	Change a very strange code in m_demote() to simple assertion. Sponsored by: Nginx, Inc.	2014-09-04 19:27:30 +00:00
Gleb Smirnoff	1967edba02	Provide m_catpkt(), a wrapper around m_cat() that deals with M_PKTHDR mbufs. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-09-04 09:07:14 +00:00
Mateusz Guzik	2570cdd605	Plug a hypothetical use after free in sysctl kern.proc.groups. MFC after: 1 week	2014-09-04 01:21:33 +00:00
Benno Rice	c079e1c018	Add KASSERTs to catch the case where a developer may have forgotten to set bo_bsize on a bufobj. This is a slight modification of the patch provided. PR: 193146 Submitted by: Conrad Meyer <conrad.meyer@isilon.com> Sponsored by: EMC Isilon Storage Division	2014-09-04 00:10:06 +00:00
Konstantin Belousov	624bf9e134	Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-09-03 08:40:16 +00:00
Konstantin Belousov	8626a0ddc6	Retire thread_unthread(), it has only one caller. Update comment in the block of code before the previous call to thread_unthread(). Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-09-03 08:35:42 +00:00
Konstantin Belousov	fd229b5b75	Right now, thread_single(SINGLE_EXIT) returns after the p_numthreads reaches 1. The p_numthreads counter is decremented in thread_exit() by a call to thread_unlink(). This means that the exiting threads may still execute on other CPUs when thread_single(SINGLE_EXIT) returns. As result, vmspace could be destroyed while paging structures are still used on other CPUs by exiting threads. Delay the return from thread_single(SINGLE_EXIT) until all threads are really destroyed by thread_stash() after the last switch out. The p_exitthreads counter already provides the required mechanism, move the wait from the thread_wait() (which is called from wait(2) code) into thread_single(). Reported by: many (as "panic: pmap active <addr>") Reviewed by: alc, jhb Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-09-03 08:18:07 +00:00
Gleb Smirnoff	5b5477d762	Fix dereference after NULL check. CID: 1234607 Sponsored by: Nginx, Inc.	2014-09-03 08:14:07 +00:00
Mateusz Guzik	9152087ea7	Fix up proc_realparent to always return correct process. Prior to the change it would always return initproc for non-traced processes. This fixes ps apparently always returning 1 as ppid. Pointy hat: mjg Reported by: many MFC after: 1 week	2014-09-03 06:25:34 +00:00
Alan Cox	01a8fb7db5	Automatically prefault a limited number of mappings to resident pages in shmat(2), just like mmap(2). MFC after: 5 days Sponsored by: EMC / Isilon Storage Division	2014-08-31 17:38:41 +00:00
Mateusz Guzik	6662ce5aab	Add missing proctree locking to fill_kinfo_proc consumers. This fixes r270444. Pointy hat: mjg Reported by: many MFC after: 1 week	2014-08-30 03:10:55 +00:00
Andreas Tobler	5be725d7e8	Rename shm_dict_init to shm_init to fix a compiler warning. Reviewed by: jhb	2014-08-29 21:50:32 +00:00
John Baldwin	610a2b3c45	Use a unit number allocator to provide suitable st_dev and st_ino values for POSIX shared memory descriptors. The implementation is similar to that used for pipes. MFC after: 1 week	2014-08-29 18:18:29 +00:00
Konstantin Belousov	575e02d94f	Add function and wrapper to switch lockmgr and vnode lock back to auto-promotion of shared to exclusive. Tested by: hrs, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-29 09:02:01 +00:00
Mateusz Guzik	8b04bbef31	Return real parent pid in kinfo (used by e.g. ps) Add a separate field which exports tracer pid and add a new keyword ("tracer") for ps to display it. This is a follow up to r270444. Reviewed by: kib MFC after: 1 week Relnotes: yes	2014-08-28 08:41:11 +00:00
Jean-Sébastien Pédron	3e206539a1	vt(4): Add cngrab() and cnungrab() callbacks They are used when a panic occurs or when entering a DDB session for instance. cngrab() forces a vt-switch to the console window, no matter if the original window is another terminal or an X session. However, cnungrab() doesn't vt-switch back to the original window currently. MFC after: 1 week	2014-08-27 10:04:10 +00:00
Gleb Smirnoff	e86447ca44	- Remove socket file operations declaration from sys/file.h. - Make them static in sys_socket.c. - Provide generic invfo_truncate() instead of soo_truncate(). Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-08-26 14:44:08 +00:00
Mateusz Guzik	037755fd15	Fix up races with f_seqcount handling. It was possible that the kernel would overwrite user-supplied hint. Abuse vnode lock for this purpose. In collaboration with: kib MFC after: 1 week	2014-08-26 08:17:22 +00:00
Konstantin Belousov	c83655f334	Revert the handling of all siginfo sa_flags except SA_SIGINFO to the pre-r270321. Namely, the flags are preserved for SIG_DFL and SIG_IGN dispositions. Requested and reviewed by: jilles Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-24 16:37:50 +00:00
Mateusz Guzik	8cc11167fb	Plug a memory leak in case of failed lookups in capability mode. Put common cnp cleanup into one function and use it for this purpose. MFC after: 1 week	2014-08-24 12:51:12 +00:00
Mateusz Guzik	ce8daaadbd	Use refcount_init in sigacts_alloc. This change is a no-op, but fixes up an inconsistency introduced with r268634. MFC after: 3 days	2014-08-24 09:24:37 +00:00
Mateusz Guzik	abd386bafe	Fix getppid for traced processes. Traced processes always have the tracer set as the parent. Utilize proc_realparent to obtain the right process when needed. Reviewed by: kib MFC after: 1 week	2014-08-24 09:04:09 +00:00
Mateusz Guzik	a661bebe26	Properly reparent traced processes when the tracer dies. Previously they were uncoditionally reparented to init. In effect it was possible that tracee was never returned to original parent. Reviewed by: kib MFC after: 1 week	2014-08-24 09:02:16 +00:00
Alexander Motin	2e7d7bb294	Restore pre-r239157 handling of sched_yield(), when thread time slice was aborted, allowing other threads to run. Without this change thread is just rescheduled again, that was illustrated by provided test tool. PR: 192926 Submitted by: eric@vangyzen.net MFC after: 2 weeks	2014-08-23 17:31:56 +00:00
Konstantin Belousov	fbb6eca60f	In do_lock_pi(), do not override error from umtxq_sleep_pi() when doing suspend check. This restores the pre-r251684 behaviour, to retry once after the signal is detected. PR: kern/192918 Submitted by: Elliott Rabe, Dell Inc., Eric van Gyzen <eric@vangyzen.net> Obtained from: Dell Inc. MFC after: 1 week	2014-08-22 18:42:14 +00:00
Konstantin Belousov	350ae56373	Ensure that sigaction flags for signal, which disposition is reset to ignored or default, are not leaking. Apparently, there exists code which relies on SA_SIGINFO not reported for SIG_DFL or SIG_IGN. In kern_sigaction, ignore flags when resetting. Encapsulate the flag and disposition testing into helper sigact_flag_test(). On exec, and when delivering signal with SA_RESETHAND flag set, signals are reset automatically. Use new helper sigdflt(), which removes duplicated code and corrects all flag bits for the signal. For proc0, set sigintr bit for all ignored signals. Ignored signals are consumed in tdsendsignal() and not delivered to the victim thread at all. Reported and tested by: royger Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-22 08:19:08 +00:00
Konstantin Belousov	2d86417410	Check the validity of struct sigaction sa_flags value, reject unknown flags. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-22 07:52:47 +00:00
Hiroki Sato	ed063112f4	Fix a panic which occurs in a VIMAGE-enabled kernel after r270158, and separate socket_hhook_register() part and put it into VNET_SYS{,UN}INIT() handler. Discussed with: marcel	2014-08-22 05:03:30 +00:00
Marcel Moolenaar	4ec7371233	For vendors like Juniper, extensibility for sockets is important. A good example is socket options that aren't necessarily generic. To this end, OSD is added to the socket structure and hooks are defined for key operations on sockets. These are: o soalloc() and sodealloc() o Get and set socket options o Socket related kevent filters. One aspect about hhook that appears to be not fully baked is the return semantics (the return value from the hook is ignored in hhook_run_hooks() at the time of commit). To support return values, the socket_hhook_data structure contains a 'status' field to hold return values. Submitted by: Anuranjan Shukla <anshukla@juniper.net> Obtained from: Juniper Networks, Inc.	2014-08-18 23:45:40 +00:00
Warner Losh	817dc00433	Expand the elf brandelf infrastructure to give access to the whole ELF header (Elf_Ehdr) to determine if a particular interpretor wants to accept it or not. Use this mechanism to filter EABI arm on OABI arm kernels, and vice versa. This method could also be used to implement OABI on EABI arm kernels, if desired, or to allow a single mips kernel to run o32, n32 and n64 binaries. Differential Revision: https://reviews.freebsd.org/D609	2014-08-18 02:44:56 +00:00
Edward Tomasz Napierala	3914ddf8a7	Bring in the new automounter, similar to what's provided in most other UNIX systems, eg. MacOS X and Solaris. It uses Sun-compatible map format, has proper kernel support, and LDAP integration. There are still a few outstanding problems; they will be fixed shortly. Reviewed by: allanjude@, emaste@, kib@, wblock@ (earlier versions) Phabric: D523 MFC after: 2 weeks Relnotes: yes Sponsored by: The FreeBSD Foundation	2014-08-17 09:44:42 +00:00
Mark Johnston	ba78d6b7a1	Correct the order of arguments passed to LIST_INSERT_AFTER(). Reviewed by: kib X-MFC-With: r269656	2014-08-15 15:42:58 +00:00
Xin LI	7001d850bb	Add a new loader tunable, vm.kmem_zmax which allows a system administrator to limit the maximum allocation size that malloc(9) would consider using the UMA cache allocator as backend. Suggested by: alfred MFC after: 2 weeks	2014-08-14 05:31:39 +00:00
Xin LI	bda06553fd	Re-instate UMA cached backend for 4K - 64K allocations. New consumers like geli(4) uses malloc(9) to allocate temporary buffers that gets free'ed shortly, causing frequent TLB shootdown as observed in hwpmc supported flame graph. Discussed with: jeff, alfred MFC after: 1 week	2014-08-14 05:13:24 +00:00
Konstantin Belousov	70978c93b8	If vm_page_grab() allocates a new page, the page is not inserted into page queue even when the allocation is not wired. It is responsibility of the vm_page_grab() caller to ensure that the page does not end on the vm_object queue but not on the pagedaemon queue, which would effectively create unpageable unwired page. In exec_map_first_page() and vm_imgact_hold_page(), activate the page immediately after unbusying it, to avoid leak. In the uiomove_object_page(), deactivate page before the object is unlocked. There is no leak, since the page is deactivated after uiomove_fromphys() finished. But allowing non-queued non-wired page in the unlocked object queue makes it impossible to assert that leak does not happen in other places. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-13 05:44:08 +00:00
Gleb Smirnoff	cd1692fa5d	Move KASSERT into locked region. Submitted by: kib	2014-08-11 15:06:07 +00:00
Gleb Smirnoff	eaf78ad3f7	Use M_WAITOK in sf_buf_init(). Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-08-11 13:12:18 +00:00
Gleb Smirnoff	818d40d033	Provide sf_buf_ref() to optimize refcounting of already allocated sendfile(2) buffers. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-08-11 12:59:55 +00:00
Bjoern A. Zeeb	e346f8c452	Split up sys_ktimer_getoverrun() into a sys_ and a kern_ variant and export the kern_ version needed by an upcoming linuxolator change. MFC after: 3 days Sponsored by: DARPA,AFRL	2014-08-07 16:49:50 +00:00
Andrey V. Elsukov	3e40097976	Temporary revert r269661, it looks like the patch isn't complete.	2014-08-07 14:32:28 +00:00
Andrey V. Elsukov	c60e497af9	Use cpuset_setithread() to apply cpu mask to taskq threads. Sponsored by: Yandex LLC	2014-08-07 10:23:50 +00:00
Konstantin Belousov	d735998057	Correct the problems with the ptrace(2) making the debuggee an orphan. One problem is inferior(9) looping due to the process tree becoming a graph instead of tree if the parent is traced by child. Another issue is due to the use of p_oppid to restore the original parent/child relationship, because real parent could already exited and its pid reused (noted by mjg). Add the function proc_realparent(9), which calculates the parent for given process. It uses the flag P_TREE_FIRST_ORPHAN to detect the head element of the p_orphan list and than stepping back to its container to find the parent process. If the parent has already exited, the init(8) is returned. Move the P_ORPHAN and the new helper flag from the p_flag* to new p_treeflag field of struct proc, which is protected by proctree lock instead of proc lock, since the orphans relationship is managed under the proctree_lock already. The remaining uses of p_oppid in ptrace(PT_DETACH) and process reapping are replaced by proc_realparent(9). Phabric: D417 Reviewed by: jhb Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-08-07 05:47:53 +00:00
Gleb Smirnoff	c8d2ffd6a7	Merge all MD sf_buf allocators into one MI, residing in kern/subr_sfbuf.c The MD allocators were very common, however there were some minor differencies. These differencies were all consolidated in the MI allocator, under ifdefs. The defines from machine/vmparam.h turn on features required for a particular machine. For details look in the comment in sys/sf_buf.h. As result no MD code left in sys///vm_machdep.c. Some arches still have machine/sf_buf.h, which is usually quite small. Tested by: glebius (i386), tuexen (arm32), kevlo (arm32) Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-08-05 09:44:10 +00:00
Kirk McKusick	5f9500c358	Add support for multi-threading of soft updates. Replace a single soft updates thread with a thread per FFS-filesystem mount point. The threads are associated with the bufdaemon process. Reviewed by: kib Tested by: Peter Holm and Scott Long MFC after: 2 weeks Sponsored by: Netflix	2014-08-04 22:03:58 +00:00
Davide Italiano	4295aa9240	Fix an overflow in getsockopt(). optval isn't big enough to hold sbintime_t. Re-introduce r255030 behaviour capping socket timeouts to INT_32 if they're too large. CR: https://phabric.freebsd.org/D433 Reported by: demon Reviewed by: bde [1], jhb [2] MFC after: 2 weeks	2014-08-04 05:40:51 +00:00
Peter Wemm	6dde7ecb5d	Partial revert of r262867. r262867 was described as fixing socket buffer checks for SOCK_SEQPACKET, but also changed one of the SOCK_DGRAM code paths to use the new sbappendaddr_nospacecheck_locked() function. This lead to SOCK_DGRAM bypassing socket buffer limits.	2014-08-03 22:37:21 +00:00
Sergey Kandaurov	bcdd3bceb6	vn_path_to_global_path: update comment.	2014-08-03 07:59:19 +00:00
Warner Losh	146cbf6fa2	Make the witness lock limit an option.	2014-08-03 05:00:43 +00:00
Konstantin Belousov	168f4ee0a8	Remove Giant acquisition from the mount and unmount pathes. It could be claimed that two things were reasonable protected by Giant. One is vfsconf list links, which is converted to the new dedicated sx vfsconf_sx. Another is vfsconf.vfc_refcount, which is now updated with atomics. Note that vfc_refcount still has the same races now as it has under the Giant, the unload of filesystem modules can happen while the module is still in use. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-08-03 03:27:54 +00:00
Rui Paulo	551a78956c	In the shm_open() and shm_unlink() syscalls, export the path to KTR. MFC after: 1 week	2014-08-01 23:29:04 +00:00
Konstantin Belousov	634012b917	Remove one-time use macros which check for the vnode lifecycle. More, some parts of the checks are in fact redundand in the surrounding code, and it is more clear what the conditions are by direct testing of the flags. Two of the three macros were only used in assertions. In vnlru_free(), all relevant parts of vholdl() were already inlined, except the increment of v_holdcnt itself. Do not call vholdl() to do the increment as well, this allows to make assertions in vholdl()/vhold() more strict. In v_incr_usecount(), call vholdl() before incrementing other ref counters. The change is no-op, but it makes less surprising to see the vnode state in debugger if interrupted inside v_incr_usecount(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-29 16:42:34 +00:00
Konstantin Belousov	d3a3b8b038	Simplify the expression, by removing redundand calculation. Noted by: "O'Connor, Daniel" <Daniel.O'Connor@emc.com> MFC after: 3 days	2014-07-29 01:46:31 +00:00
Konstantin Belousov	5d9b4508fd	For md(4), posix shm(3) and tmpfs(5), free swap space used by paged in dirty page, which is written by the process. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-28 14:27:05 +00:00
Pietro Cerutti	adecd05bf0	Unbreak the ABI by reverting r268494 until the compat shims are provided	2014-07-28 07:20:22 +00:00
Marcel Moolenaar	1e0a021e3d	The accept filter code is not specific to the FreeBSD IPv4 network stack, so it really should not be under "optional inet". The fact that uipc_accf.c lives under kern/ lends some weight to making it a "standard" file. Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the need for #ifdef INET in kern/uipc_socket.c. Also, this meant the net.inet.accf.unloadable sysctl needed to move, as net.inet does not exist without networking compiled in (as it lives in netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable. In order to support existing accept filter sysctls, the net.inet.accf node has been added netinet/in_proto.c. Submitted by: Steve Kiernan <stevek@juniper.net> Obtained from: Juniper Networks, Inc.	2014-07-26 19:27:34 +00:00
Marcel Moolenaar	be836fab6c	Don't return ERESTART when the device is gone. In ttydev_leave() ERESTART is the indication that draining got interrupted due to a revoke(2) and that tty_drain() is to be called again for draining to complete. If the device is flagged as gone, then waiting/draining is not possible. Only return ERESTART when waiting is still possible. Obtained from: Juniper Networks, Inc.	2014-07-26 15:46:41 +00:00
Gavin Atkinson	f6b4f5ca21	Add error return to dumpsys(), and use it in doadump(). This commit does not add error returns to minidumpsys() or textdump_dumpsys(); those can also be added later. Submitted by: Conrad Meyer (EMC / Isilon storage division)	2014-07-25 23:52:53 +00:00
Daniel Eischen	66d8df9dfc	Insert new threads at the end of the thread list in the process instead of at the beginning. This allows an intra process signal to be sent to the oldest thread with the signal unmasked - which, if it still exists, is the main thread. This mimics behavior found in Linux and Solaris.	2014-07-25 20:21:02 +00:00
Mateusz Guzik	a1bf811596	Prepare fget_unlocked for reading fd table only once. Some capsicum functions accept fdp + fd and lookup fde based on that. Add variants which accept fde. Reviewed by: pjd MFC after: 1 week	2014-07-23 19:33:49 +00:00
Mateusz Guzik	6a1cf96b4a	Cosmetic changes to unp_internalize Don't throw away the result of fget_unlocked. Move fdp increment to for loop to make it consistent with similar code elsewhere. MFC after: 1 week	2014-07-23 18:04:52 +00:00
Gleb Smirnoff	c71b4037ff	Use assignment instead of bcopy. Submitted by: jmg	2014-07-18 14:59:35 +00:00
Baptiste Daroussin	42e62eca52	Extend kqueue's EVFILT_TIMER by adding precision unit flags support Define the precision macros as bits sets to conform with XNU equivalent. Test fflags passed for EVFILT_TIMER and return EINVAL in case an invalid flag is passed. Phabric: https://phabric.freebsd.org/D421 Reviewed by: kib	2014-07-18 14:27:04 +00:00
Kevin Lo	c29a33213b	Deprecate m_act. Use m_nextpkt always.	2014-07-17 05:21:16 +00:00
Don Lewis	d3a6879421	Nuke the never-used RF_TIMESHARE feature, reducing the complexity of the code. The consensus on arch@ is that this feature might have been useful in the distant past, but is now just unnecessary bloat. The int_rman_activate_resource() and int_rman_deactivate_resource() functions become trivial, so manually inline them. The special deferred handling of RF_ACTIVE is no longer needed in reserve_resource_bound(), so eliminate the associated code at the end of the function. These changes reduce the object file size by more than 500 bytes on i386. Update the rman.9 man page to reflect the removal of the RF_TIMESHARE feature. MFC after: 2 weeks	2014-07-16 22:18:19 +00:00
Konstantin Belousov	65589a29f4	Check for the cross-device cross-link attempt in the VFS, instead of forcing filesystem VOP_LINK() methods to repeat the code. In tmpfs_link(), remove redundand check for the type of the source, already done by VFS. Note that NFS server already performs this check before calling VOP_LINK(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-16 14:04:46 +00:00
Konstantin Belousov	a62eb1398a	Followup to r268466. - Move the code to calculate resident count into separate function. It reduces the indent level and makes the operation of vmmap_skip_res_cnt tunable more clear. - Optimize the calculation of the resident page count for map entry. Skip directly to the next lowest available index and page among the whole shadow chain. - Restore the use of pmap_incore(9), only to verify that current mapping is indeed superpage. - Note the issue with the invalid pages. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-15 19:57:03 +00:00
Konstantin Belousov	3760e341ca	Change the calculation of the kinfo_vmentry field kve_private_resident to reflect its name. Noted and reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-15 19:49:00 +00:00
Mateusz Guzik	965d08605f	Plug p_pptr null test in do_execve. It is always true.	2014-07-14 22:40:46 +00:00
Mateusz Guzik	c959c23740	Manage struct sigacts refcnt with atomics instead of a mutex. MFC after: 1 week	2014-07-14 21:12:59 +00:00
Konstantin Belousov	895b3782c6	Extract the code to put a filesystem into the suspended state (at the unmount time) in the helper vfs_write_suspend_umnt(). Use it instead of two inline copies in FFS. Fix the bug in the FFS unmount, when suspension failed, the ufs extattrs were not reinitialized. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:10:00 +00:00
Konstantin Belousov	57ef02ff0f	In kern_linkat(), avoid passing doomed vnode to the VOP. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 08:41:13 +00:00
Konstantin Belousov	a69452162a	Generalize vn_get_ino() to allow filesystems to use custom vnode producer, instead of hard-coding VFS_VGET(). New function, which takes callback, is called vn_get_ino_gen(), standard callback for vn_get_ino() is provided. Convert inline copies of vn_get_ino() in msdosfs and cd9660 into the uses of vn_get_ino_gen(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 08:34:54 +00:00
Kevin Lo	cb7df69b7e	Make bind(2) and connect(2) return EAFNOSUPPORT for AF_UNIX on wrong address family. See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191586 for the original discussion. Reviewed by: terry	2014-07-14 06:00:01 +00:00
Mateusz Guzik	8bedd5d782	Clear nonblock and async on devctl close instaed of open. This is a purely cosmetic change.	2014-07-12 15:35:04 +00:00
Gleb Smirnoff	1fbe6a82f4	Improve reference counting of EXT_SFBUF pages attached to mbufs. o Do not use UMA refcount zone. The problem with this zone is that several refcounting words (16 on amd64) share the same cache line, and issueing atomic(9) updates on them creates cache line contention. Also, allocating and freeing them is extra CPU cycles. Instead, refcount the page directly via vm_page_wire() and the sfbuf via sf_buf_alloc(sf_buf_page(sf)) [1]. o Call refcounting/freeing function for EXT_SFBUF via direct function call, instead of function pointer. This removes barrier for CPU branch predictor. o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to satisfy assertion in mb_dtor_mbuf(). Remove the assertion from mb_dtor_mbuf(). Use bcopy() instead of manual assignments to copy m_ext in mb_dupcl(). [1] This has some problems for now. Using sf_buf_alloc() merely to increase refcount is expensive, and is broken on sparc64. To be fixed. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-07-11 19:40:50 +00:00
Gleb Smirnoff	fcc34a238c	Fix style bug: rename the refcount field of m_ext to ext_cnt, to match other members. Sponsored by: Nginx, Inc.	2014-07-11 14:34:29 +00:00
Gleb Smirnoff	15c28f87b8	All mbuf external free functions never fail, so let them be void. Sponsored by: Nginx, Inc.	2014-07-11 13:58:48 +00:00
Mateusz Guzik	88f98985aa	Eliminate plim and vtmp local vars in exit1. No functional changes. MFC after: 1 week	2014-07-10 22:54:38 +00:00
Mateusz Guzik	30d58d6b39	Don't make a temporary copy of fixed sysctl strings.	2014-07-10 21:46:57 +00:00
Mateusz Guzik	b23c40d7b1	Don't zero fd_nfiles during fdp destruction. Code trying to take a look has to check fd_refcnt and it is 0 by that time. This is a follow up to r268505, without this the code would leak memory for tables bigger than the default. MFC after: 1 week	2014-07-10 21:05:45 +00:00
Mateusz Guzik	e518baf8f9	Avoid relocking filedesc lock when closing fds during fdp destruction. Don't call bzero nor fdunused from fdfree for such cases. It would do unnecessary work and complain that the lock is not taken. MFC after: 1 week	2014-07-10 20:59:54 +00:00
Pietro Cerutti	7150b86bfe	Implement Short/Small String Optimization in SBUF(9) and change lengths and positions in the API from ssize_t and int to size_t. CR: D388 Approved by: des, bapt	2014-07-10 13:08:51 +00:00
Konstantin Belousov	479fcb4e32	Unconditionally initialize addr to handle the case of changed map timestamp while the map is unlocked. Reported by: bz Sponsored by: The FreeBSD Foundation MFC after: 6 days	2014-07-10 11:20:24 +00:00
Konstantin Belousov	a91831a261	Current code in sysctl proc.vmmap, which intent is to calculate the amount of resident pages, in fact calculates the amount of installed pte entries in the region. Resident pages which were not soft-faulted yet are not counted. Calculate the amount of resident pages by looking in the objects chain backing the region. Add a knob to disable the residency calculation at all. For large sparce regions, either previous or updated algorithm runs for too long time, while several introspection tools do not need the (advisory) RSS value at all. PR: kern/188911 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-09 19:11:57 +00:00
Xin LI	2827952eb4	Don't leave the padding between the msg header and the cmsg data, and the padding after the cmsg data un-initialized. Submitted by: tuexen Security: CVE-2014-3952 Security: FreeBSD-SA-14:17.kmem	2014-07-08 21:54:23 +00:00
Konstantin Belousov	3bcc218f46	Correct the problem reported by test16 from tools/regression/file/flock/flock.c, which completes the fix in r192685. When the lock was stolen from us, retry the whole lock sequence in kernel, instead of returning EINTR to usermode and hoping that application would handle it correctly by restarting the lock acquire. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-08 08:10:15 +00:00
Don Lewis	626a79752f	Declaration whitespace changes for style(9). MFC after: 1 week	2014-07-07 22:02:39 +00:00
Mateusz Guzik	5e2554b7f8	Don't call crdup nor uifind under vnode lock. A locked vnode can get into the way of satisyfing malloc with M_WATOK. This is a fixup to r268087. Suggested by: kib MFC after: 1 week	2014-07-07 14:03:30 +00:00
Marcel Moolenaar	e7d939bda2	Remove ia64. This includes: o All directories named ia64 o All files named ia64 o All ia64-specific code guarded by __ia64__ o All ia64-specific makefile logic o Mention of ia64 in comments and documentation This excludes: o Everything under contrib/ o Everything under crypto/ o sys/xen/interface o sys/sys/elf_common.h Discussed at: BSDcan	2014-07-07 00:27:09 +00:00
Hans Petter Selasky	604bf9d37e	When getting the initial value of numeric tunables use the getenv_xxx() functions instead of strtoq(), because the getenv_xxx() functions include wrappers for various postfixes like G/M/K, which strtoq() doesn't do.	2014-07-05 06:12:48 +00:00
Konstantin Belousov	2499a5ccef	Micro-manage clang to get the expected inlining for cpu_search(). Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline, while cpu_search() gets always_inline. With the attributes set, cpu_search() is inlined in wrappers, and if()s with constant conditionals are optimized. On some tests on many-core machine, the hwpmc reported samples for cpu_search*() are reduced from 25% total to 9%. Submitted by: "Rang, Anton" <anton.rang@isilon.com> MFC after: 1 week	2014-07-03 11:06:27 +00:00
Marcel Moolenaar	054b57a740	Drop KTR records when we're in the debugger so that the debugger isn't changing or overwriting the trace buffer. When KTR is enabled for things like traps or pmap functions, the amount of logging can be substantial.	2014-07-02 22:13:07 +00:00
Ed Maste	969d3cc28b	Fix typos in VTY constant names from r268158	2014-07-02 14:47:48 +00:00
Ed Maste	018147eef9	Prefer vt(4) for UEFI boot The UEFI framebuffer driver vt_efifb requires vt(4), so add a mechanism for the startup routine to set the preferred console. This change is ugly because console init happens very early in the boot, making a cleaner interface difficult. This change is intended only to facilitate the sc(4) / vt(4) transition, and can be reverted once vt(4) is the default.	2014-07-02 13:24:21 +00:00
Mateusz Guzik	a6bad85e8e	Plug gcc warning after r268074 about unitialized newsigacts Reported by: Gary Jennejohn <gljennjohn gmail.com>	2014-07-02 05:45:40 +00:00
Mateusz Guzik	350d51816e	Don't call crcopysafe or uifind unnecessarily in execve. MFC after: 1 week	2014-07-01 09:21:32 +00:00
Mateusz Guzik	d00c8ea429	Perform a lockless check in sigacts_shared. It is used only during execve (i.e. singlethreaded), so there is no fear of returning 'not shared' which soon becomes 'shared'. While here reorganize the code a little to avoid proc lock/unlock in shared case. MFC after: 1 week	2014-07-01 06:29:15 +00:00
Adrian Chadd	c445c3c7f6	If we're doing RSS then ensure that the callwheel swi's are CPU pinned.	2014-06-30 04:25:51 +00:00
Hans Petter Selasky	4813ad54f8	Compile fixes: Remove duplicate "debug_ktr.mask" sysctl definition. Remove now unused variable from "kern_ktr.c". This fixes build of "ktr" which was broken by r267961. Let the default value for "vm_kmem_size_scale" be zero. It is setup after that the sysctl has been initialized from "getenv()" in the "kmeminit()" function to equal the "VM_KMEM_SIZE_MAX" value, if zero. On Sparc64 the "VM_KMEM_SIZE_MAX" macro is not a constant. This fixes build of Sparc64 which was broken by r267961. Add a special macro to dynamically create SYSCTL root nodes, because root nodes have a special parent. This fixes build of existing OFED module and CANBUS module for pc98 which was broken by r267961. Add missing "sysctl.h" includes to get the needed sysctl header file declarations. This is needed after r267961. MFC after: 2 weeks	2014-06-28 17:36:18 +00:00
Mateusz Guzik	b0bc0cadbe	Call fdcloseexec right after fdunshare. No functional changes. MFC after: 1 week	2014-06-28 05:51:45 +00:00
Mateusz Guzik	b9d32c36fa	Make fdunshare accept only td parameter. Proc had to match the thread anyway and 2 parameters were inconsistent with the rest. MFC after: 1 week	2014-06-28 05:41:53 +00:00
Mateusz Guzik	35778d7aa9	Make sure to always clear p_fd for process getting rid of its filetable. Filetable can be shared with other processes. Previous code failed to clear the pointer for all but the last process getting rid of the table. This is mostly cosmetics. Get rid of 'This should happen earlier' comment. Clearing the pointer in this place is fine as consumers can reliably check for files availability by inspecting fd_refcnt and vnodes availabity by NULL-checking them. MFC after: 1 week	2014-06-28 05:18:03 +00:00
Hans Petter Selasky	6a3287f889	Fix regression issue after r267961. Handle special string case for SYSCTLs like previously. MFC after: 2 weeks Reported by: several people	2014-06-28 03:59:04 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Marius Strobl	7344ee184b	In order to get vt(4) a bit closer to the feature set provided by sc(4), implement options TERMINAL_{KERN,NORM}_ATTR. These are aliased to SC_{KERNEL_CONS,NORM}_ATTR and like these latter, allow to change the default colors of normal and kernel text respectively. Note on the naming: Although affecting the output of vt(4), technically kern/subr_terminal.c is primarily concerned with changing default colors so it would be inconsistent to term these options VT_{KERN,NORM}_ATTR. Actually, if the architecture and abstraction of terminal+teken+vt would be perfect, dev/vt/* wouldn't be touched by this commit at all. Reviewed by: emaste MFC after: 3 days Sponsored by: Bally Wulff Games & Entertainment GmbH	2014-06-27 19:57:57 +00:00
Ed Maste	6ac6c9d5f4	Add CTLFLAG_NOFETCH flag; console vty code runs before tunable fetch Also remove redundant "" assignment for string in BSS. Submitted by: hselasky@	2014-06-27 19:07:35 +00:00
Ed Maste	59644098f8	Use a common tunable to choose between vt(4)/sc(4) With this change and previous work from ray@ it will be possible to put both in GENERIC, and have one enabled by default, but allow the other to be selected via the loader. (The previous implementation had separate kern.vt.disable and hw.syscons.disable tunables, and would panic if both drivers were compiled in and neither was explicitly disabled.) MFC after: 1 week Sponsored by: The FreeBSD Foundation	2014-06-27 17:50:33 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating childrens of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading cludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Mateusz Guzik	de966666a2	Check lower bound of cmsg_len. If passed cm->cmsg_len was below cmsghdr size the experssion: datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data; would give negative result. However, in practice it would not result in a crash because the kernel would try to obtain garbage fds for given process and would error out with EBADF. PR: 124908 Submitted by: campbell mumble.net (modified a little) MFC after: 1 week	2014-06-27 05:04:36 +00:00
Pawel Jakub Dawidek	e16406c7ba	Remove duplicated includes. Submitted by: Mariusz Zaborski <oshogbo@FreeBSD.org>	2014-06-26 13:57:44 +00:00
Attilio Rao	e989086b1d	sysctl subsystem uses sxlocks so avoid to setup dynamic sysctl nodes before sleepinit() has been fully executed in the SLEEPQUEUE_PROFILING case. Sponsored by: EMC / Isilon storage division	2014-06-24 15:16:55 +00:00
Mateusz Guzik	450570a55e	Tidy up fd-related functions called by do_execve o assert in each one that fdp is not shared o remove unnecessary NULL checks - all userspace processes have fdtables and kernel processes cannot execve o remove comments about the danger of fd_ofiles getting reallocated - fdtable is not shared and fd_ofiles could be only reallocated if new fd was about to be added, but if that was possible the code would already be buggy as setugidsafety work could be undone MFC after: 1 week	2014-06-23 01:28:18 +00:00
Mateusz Guzik	158627616c	Don't take filedesc lock in fdunshare(). We can read refcnt safely and only care if it is equal to 1. If it could suddenly change from 1 to something bigger the code would be buggy even in the previous form and transitions from > 1 to 1 are equally racy and harmless (we copy even though there is no need). MFC after: 1 week	2014-06-22 21:37:27 +00:00
Alexander V. Chernikov	811985398d	Permit changing cpu mask for cpu set 1 in presence of drivers binding their threads to particular CPU. Changing ithread cpu mask is now performed by special cpuset_setithread(). It creates additional cpuset root group on first bind invocation. No objection: jhb Tested by: hiren MFC after: 2 weeks Sponsored by: Yandex LLC	2014-06-22 11:32:23 +00:00
Mateusz Guzik	adf87ab01c	fd: replace fd_nfiles with fd_lastfile where appropriate fd_lastfile is guaranteed to be the biggest open fd, so when the intent is to iterate over active fds or lookup one, there is no point in looking beyond that limit. Few places are left unpatched for now. MFC after: 1 week	2014-06-22 01:31:55 +00:00
Mateusz Guzik	0f0b852c73	do_dup: plug redundant adjustment of fd_lastfile By that time it was already set by fdalloc, or was there in the first place if fd is replaced. MFC after: 1 week	2014-06-22 00:53:33 +00:00
Konstantin Belousov	7b81a399a4	In msdosfs_setattr(), add a check for result of the utimes(2) permissions test, forgotten in r164033. Refactor the permission checks for utimes(2) into vnode helper function vn_utimes_perm(9), and simplify its code comparing with the UFS origin, by writing the call to VOP_ACCESSX only once. Use the helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5). Reported by: bde Reviewed by: bde, trasz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-17 07:11:00 +00:00
Dmitry Chagin	2dedc1281a	Revert r266925 as it can lead to instant panic at fexecve(): To allow to run the interpreter itself add a new ELF branding type. Pointed out by: kib, mjg	2014-06-17 05:29:18 +00:00
Attilio Rao	3ae10f7477	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2014-06-16 18:15:27 +00:00
Konstantin Belousov	2e501b0a9e	Use vn_io_fault for the writes from core dumping code. Recursing into VM due to copyin(9) faulting while VFS locks are held is deadlock-prone there in the same way as for the write(2) syscall. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-06-15 04:51:53 +00:00
Alexander Motin	781c93d405	Implement simple direct-mapped cache for popular filesystem identifiers to avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while traversing through the list of mount points. This change significantly improves NFS server scalability, since it had to do this translation for every request, and the global lock becomes quite congested. This code is more optimized for relatively small number of mount points. On systems with hundreds of active mount points this simple cache may have many collisions. But the original traversal code in that case should also behave much worse, so we are not loosing much. Reviewed by: attilio MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 12:43:48 +00:00
Alexander Motin	4f655310bf	Remove unneeded mountlist_mtx acquisition from sync_fsync(). All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.	2014-06-11 12:56:49 +00:00
Alexander Motin	eb6d6216c4	Move root_mount_hold() functionality to separate mutex. It has nothing to share with mutex protecting list of mounted file systems.	2014-06-11 08:14:08 +00:00
Konstantin Belousov	a19c5d3716	Devolatile as needed. Sponsored by: The FreeBSD Foundation MFC after: 13 days	2014-06-09 09:10:31 +00:00
Konstantin Belousov	7f82c6c17f	Change the nblock mutex, protecting the needsbuffer buffer deficit flags, to rwlock. Lock it in read mode when used from subroutines called from buffer release code paths. The needsbuffer is now updated using atomics, while read lock of nblock prevents loosing the wakeups from bufspacewakeup() and bufcountadd() in getnewbuf_bufd_help(). In several interesting loads, needsbuffer flags are never set, while buffers are reused quickly. This causes brelse() and bqrelse() from different threads to content on the nblock. Now they take nblock in read mode, together with needsbuffer not needing an update, allowing higher parallelism. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-06-09 03:38:03 +00:00
Alan Cox	78960940fe	Refresh a comment. The VM_STACK option was eliminated in r43209. Sponsored by: EMC / Isilon Storage Division	2014-06-09 00:15:16 +00:00
Alexander Motin	3345d73ca8	Remove extra branching from r267232. MFC after: 2 weeks	2014-06-08 19:01:37 +00:00
Alexander Motin	590d636321	Use atomics to modify numvnodes variable. This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(), that reduces number of global vnode_free_list_mtx mutex acquisitions from 4 to 2 per NFS request on ZFS, improving SMP scalability. Reviewed by: kib MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-08 15:38:40 +00:00
Konstantin Belousov	a288c757d4	Remove write-only local variable. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-08 10:56:25 +00:00
Konstantin Belousov	23f6698fbd	Initialize the pbuf counter for directio using SYSINIT, instead of using a direct hook called from kern_vfs_bio_buffer_alloc(). Mark ffs_rawread.c as requiring both ffs and directio options to be compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko module' sources. In addition to stopping breaking the layering violation, it also allows to link kernel when FFS is configured as module and DIRECTIO is enabled. One consequence of the change is that ffs_rawread.o is always linked into the module regardless of the DIRECTIO option. This is similar to the option QUOTA and ufs_quota.c. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-08 10:55:06 +00:00
Jilles Tjoelker	093e059c7d	ktrace: Use designated initializers for the data_lengths array. In the .o file, this only changes some line numbers (head amd64) because element 0 is no longer explicitly initialized. This should make bugs like FreeBSD-SA-14:12.ktrace less likely. Discussed with: des MFC after: 1 week	2014-06-06 14:49:00 +00:00
Davide Italiano	e392e44c27	Convert functions to the new-style format. Submitted by: Vijay Singh <vijju.singh@gmail.com> via -hackers	2014-06-05 03:46:46 +00:00
Marcel Moolenaar	62d76917b8	Introduce a procedural interface to the ifnet structure. The new interface allows the ifnet structure to be defined as an opaque type in NIC drivers. This then allows the ifnet structure to be changed without a need to change or recompile NIC drivers. Put differently, NIC drivers can be written and compiled once and be used with different network stack implementations, provided of course that those network stack implementations have an API and ABI compatible interface. This commit introduces the 'if_t' type to replace 'struct ifnet ' as the type of a network interface. The 'if_t' type is defined as 'void ' to enable the compiler to perform type conversion to 'struct ifnet *' and vice versa where needed and without warnings. The functions that implement the API are the only functions that need to have an explicit cast. The MII code has been converted to use the driver API to avoid unnecessary code churn. Code churn comes from having to work with both converted and unconverted drivers in correlation with having callback functions that take an interface. By converting the MII code first, the callback functions can be defined so that the compiler will perform the typecasts automatically. As soon as all drivers have been converted, the if_t type can be redefined as needed and the API functions can be fix to not need an explicit cast. The immediate benefactors of this change are: 1. Juniper Networks - The network stack implementation in Junos is entirely different from FreeBSD's one and this change allows Juniper to build "stock" NIC drivers that can be used in combination with both the FreeBSD and Junos stacks. 2. FreeBSD - This change opens the door towards changing ifnet and implementing new features and optimizations in the network stack without it requiring a change in the many NIC drivers FreeBSD has. Submitted by: Anuranjan Shukla <anshukla@juniper.net> Reviewed by: glebius@ Obtained from: Juniper Networks, Inc.	2014-06-02 17:54:39 +00:00
Adrian Chadd	924aaf69ff	Pin the right thread. This _was_ right, a last minute suggestion and not enough testing makes Adrian a bad boy. Tested: * igb(4) with RSS patches, by hand verifying each igb(4) taskqueue tid from procstat -ka using cpuset -g -t <tid>.	2014-06-01 04:11:05 +00:00
Dmitry Chagin	5f56da1891	To allow to run the interpreter itself add a new ELF branding type. Allow Linux ABI to run ELF interpreter. MFC after: 3 days	2014-05-31 15:01:51 +00:00
Gleb Smirnoff	c46713e636	Whitespace only.	2014-05-30 08:22:58 +00:00
Mark Johnston	f2789bd5c7	Commit the rest of the changes that were intended to be part of r266826. X-MFC-with: r266826	2014-05-29 01:42:22 +00:00
Don Lewis	5b892e7363	Initialize r_flags the same way in all cases using a sanitized copy of flags that has several bits cleared. The RF_WANTED and RF_FIRSTSHARE bits are invalid in this context, and we want to defer setting RF_ACTIVE in r_flags until later. This should make rman_get_flags() return the correct answer in all cases. Add a KASSERT() to catch callers which incorrectly pass the RF_WANTED or RF_FIRSTSHARE flags. Do a strict equality check on the share type bits of flags. In particular, do an equality check on RF_PREFETCHABLE. The previous code would allow one type of mismatch of RF_PREFETCHABLE but disallow the other type of mismatch. Also, ignore the the RF_ALIGNMENT_MASK bits since alignment validity should be handled by the amask check. This field contains an integer value, but previous code did a strange bitwise comparison on it. Leave the original value of flags unmolested as a minor debug aid. Change the start+amask overflow check to a KASSERT() since it is just meant to catch a highly unlikely programming error in the caller. Reviewed by: jhb MFC after: 1 month	2014-05-28 16:57:17 +00:00
Adrian Chadd	5a6f0eee47	Add a new taskqueue setup method that takes a cpuid to pin the taskqueue worker thread(s) to. For now it isn't a taskqueue/taskthread error to fail to pin to the given cpuid. Thanks to rpaulo@, kib@ and jhb@ for feedback. Tested: * igb(4), with local RSS patches to pin taskqueues. TODO: * ask the doc team for help in documenting the new API call. * add a taskqueue_start_threads_cpuset() method which takes a cpuset_t - but this may require a bunch of surgery to bring cpuset_t into scope.	2014-05-24 20:37:15 +00:00

... 3 4 5 6 7 ...

14128 Commits