freebsd-nq

Author	SHA1	Message	Date
Mark Johnston	f61d6c5a5b	Ensure that spinlock sections are balanced even after a panic. vpanic() uses spinlock_enter() to disable interrupts before dumping core. However, when the scheduler is stopped and INVARIANTS is not configured, thread_lock() does not acquire a spinlock section, while thread_unlock() releases one. This can result in interrupts staying enabled while the kernel dumps core, complicating post-mortem analysis of the crash. Approved by: re (gjb) MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2016-07-05 17:59:04 +00:00
Robert Watson	971711fb7c	Call audit hooks to capture vnode attributes for three file-descriptor method implementations: fstat(2), close(2), and poll(2). This change synchronises auditing here with similar auditing for VFS-specific system calls such as stat(2) that audit more complete vnode information. Sponsored by: DARPA, AFRL Approved by: re (kib) MFC after: 1 week	2016-07-05 16:37:01 +00:00
Ed Maste	1cbb879df8	add description for debug.elf{32,64}_legacy_coredump sysctl Approved by: re (kib) MFC after: 1 week Sponsored by: The FreeBSD Foundation	2016-07-05 14:46:06 +00:00
Konstantin Belousov	46e47c4f8d	Provide helper macros to detect 'non-silent SBDRY' state and to calculate appropriate return value for stops. Simplify the code by using them. Fix typo in sig_suspend_threads(). The thread which sleep must be aborted is td2. () In issignal(), when handling stopping signal for thread in TD_SBDRY_INTR state, do not stop, this is wrong and fires assert. This is yet another place where execution should be forced out of SBDRY-protected region. For such case, return -1 from issignal() and translate it to corresponding error code in sleepq_catch_signals(). Assert that other consumers of cursig() are not affected by the new return value. () Micro-optimize, mostly VFS and VOP methods, by avoiding calling the functions when SIGDEFERSTOP_NOP non-change is requested. (*) Reported and tested by: pho () Requested by: bde (**) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-07-03 18:19:48 +00:00
Konstantin Belousov	d12d6a8476	Remove racy assert. The thread which changes vnode usecount from 0 to 1 does it under the vnode interlock, but the interlock is not owned by the asserting thread. As result, we might read increased use counter but also still see VI_OWEINACT. In collaboration with: nwhitehorn Hardware donated by: IBM LTC Sponsored by: The FreeBSD Foundation (kib) Approved by: re (gjb)	2016-07-03 01:56:48 +00:00
Konstantin Belousov	e18ee4957d	When a process knote was attached to the process which is already exiting, the knote is activated immediately. If the exit1() later activates knotes, such knote is attempted to be activated second time. Detect the condition by zeroed kn_ptr.p_proc pointer, and avoid excessive activation. Before r302235, such knotes were removed from the knlist immediately upon activation. Reported by: truckman Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-07-01 20:11:28 +00:00
Konstantin Belousov	364c516cff	Currently the ntptime code and resettodr() are Giant-locked. In particular, the Giant is supposed to protect against parallel ntp_adjtime(2) invocations. But, for instance, sys_ntp_adjtime() does copyout(9) under Giant and then examines time_status to return syscall result. Since copyout(9) could sleep, the syscall result might be inconsistent. Another and more important issue is that if PPS is configured, hardpps(9) is executed without any protection against the parallel top-level code invocation. Potentially, this may result in the inconsistent state of the ntptime state variables, but I cannot say how serious such distortion is. The non-functional splclock() call in sys_ntp_adjtime() protected against clock interrupts calling hardpps() in the pre-SMP era. Modernize the locking. A mutex protects ntptime data. Due to the hardpps() KPI legitimately serving from the interrupt filters (and e.g. uart(4) does call it from filter), the lock cannot be sleepable mutex if PPS_SYNC is defined. Otherwise, use normal sleepable mutex to reduce interrupt latency. Reviewed by: imp, jhb Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6825	2016-06-28 16:43:23 +00:00
Konstantin Belousov	3a0e6f920a	Do not use Giant to prevent parallel calls to CLOCK_SETTIME(). Use private mtx in resettodr(), no implementation of CLOCK_SETTIME() is allowed to sleep. Reviewed by: imp, jhb Sponsored by: The FreeBSD Foundation Approved by: re (gjb) X-Differential revision: https://reviews.freebsd.org/D6825	2016-06-28 16:42:40 +00:00
Konstantin Belousov	6f56cb8dbf	Complete r302215. TDF_SBDRY \| TDF_SERESTART and TDF_SBDRY \| TDF_SEINTR flags values, unlike TDF_SBDRY, must be treated almost as if TDF_SBDRY is not set for STOP signal delivery. The only difference is that sig_suspend_threads() should abort the sleep instead of doing immediate suspension. Reported by: ngie Sponsored by: The FreeBSD Foundation MFC after: 12 days Approved by: re (gjb)	2016-06-28 16:41:50 +00:00
Konstantin Belousov	9eb3f14324	Fix userspace build after r302235: do not expose bool field of the structure, change it to int. The real fix is to sanitize user-visible definitions in sys/event.h, e.g. the affected struct knlist is of no use for userspace programs. Reported and tested by: jkim Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-27 23:34:53 +00:00
Konstantin Belousov	9e590ff04b	When filt_proc() removes event from the knlist due to the process exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen: - knote kn_knlist pointer is reset - INFLUX knote is removed from the process knlist. And, there are two consequences: - KN_LIST_UNLOCK() on such knote is nop - there is nothing which would block exit1() from processing past the knlist_destroy() (and knlist_destroy() resets knlist lock pointers). Both consequences result either in leaked process lock, or dereferencing NULL function pointers for locking. Handle this by stopping embedding the process knlist into struct proc. Instead, the knlist is allocated together with struct proc, but marked as autodestroy on the zombie reap, by knlist_detach() function. The knlist is freed when last kevent is removed from the list, in particular, at the zombie reap time if the list is empty. As result, the knlist_remove_inevent() is no longer needed and removed. Other changes: In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events from kn_sfflags for knote registered by kernel to only get NOTE_CHILD notifications. The flags leak resulted in excessive NOTE_EXEC/NOTE_FORK reports. Fix immediate note activation in filt_procattach(). Condition should be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT report for the exiting process. In knote_fork(), do not perform racy check for KN_INFLUX before kq lock is taken. Besides being racy, it did not accounted for notes just added by scan (KN_SCAN). Some minor and incomplete style fixes. Analyzed and tested by: Eric Badger <eric@badgerio.us> Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6859	2016-06-27 21:52:17 +00:00
Konstantin Belousov	883a5a4a6a	When sleeping waiting for either local or remote advisory lock, interrupt sleeps with the ERESTART on the suspension attempts. Otherwise, single-threading requests are deferred until the locks are granted for NFS files, which causes hangs. When retrying local registration of the remotely-granted adv lock, allow full suspension and check for suspension, for usual reasons. Reported by: markj, pho Reviewed by: jilles Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-26 20:08:42 +00:00
Konstantin Belousov	3a1e5dd8e6	Rewrite sigdeferstop(9) and sigallowstop(9) into more flexible framework allowing to set the suspension policy for the dynamic block. Extend the currently possible policies of stopping on interruptible sleeps and ignoring such sleeps by two more: do not suspend at interruptible sleeps, but interrupt them with either EINTR or ERESTART. Reviewed by: jilles Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-26 20:07:24 +00:00
Konstantin Belousov	8a06de9e92	Do not clear robust lists pointers on fork. The forked child thread lists must be functional. Reported by: Daniel Engberg <daniel.engberg.lists@pyret.net>, Guy Yur <guyyur@gmail.com> Tested by: Guy Yur <guyyur@gmail.com> Sponsored by: The FreeBSD Foundation Approved by: re (gjb), including the KBI change	2016-06-25 11:31:25 +00:00
Jilles Tjoelker	6ea906eec0	posixshm: Fix lock leak when mac_posixshm_check_read rejects read. While reading the code, I noticed that shm_read() returns without unlocking foffset and rangelock if mac_posixshm_check_read() rejects the read. Reviewed by: kib, jhb, rwatson Approved by: re (gjb) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D6927	2016-06-23 20:59:13 +00:00
Brooks Davis	a72c64b0b6	Generate syscall tables and update pipe() implementation after r302094. Mark the pipe() system call as COMPAT10. As of r302092 libc uses pipe2() with a zero flags value instead of pipe(). Approved by: re (gjb) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D6816	2016-06-22 21:18:19 +00:00
Brooks Davis	e16e64098c	Mark the pipe() system call as COMPAT10. As of r302092 libc uses pipe2() with a zero flags value instead of pipe(). Commit with regenerated files and implementation to follow. Approved by: re (gjb) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D6816	2016-06-22 21:15:59 +00:00
Brooks Davis	e52e02ba24	Add support for COMPAT10 keywords in syscalls.master. Approved by: re (gjb) Sponsored by: DARPA, AFRL	2016-06-22 21:12:53 +00:00
John Baldwin	b1012d8036	Account for AIO socket operations in thread/process resource usage. File and disk-backed I/O requests store counts of read/written disk blocks in each AIO job so that they can be charged to the thread that completes an AIO request via aio_return() or aio_waitcomplete(). This change extends AIO jobs to store counts of received/sent messages and updates socket backends to set these counts accordingly. Note that the socket backends are careful to only charge a single messages for each AIO request even though a single request on a blocking socket might invoke sosend or soreceive multiple times. This is to mimic the resource accounting of synchronous read/write. Adjust the UNIX socketpair AIO test to verify that the message resource usage counts update accordingly for aio_read and aio_write. Approved by: re (hrs) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D6911	2016-06-21 22:19:06 +00:00
Bjoern A. Zeeb	89856f7e2d	Get closer to a VIMAGE network stack teardown from top to bottom rather than removing the network interfaces first. This change is rather larger and convoluted as the ordering requirements cannot be separated. Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and related modules to their own SI_SUB_PROTO_FIREWALL. Move initialization of "physical" interfaces to SI_SUB_DRIVERS, move virtual (cloned) interfaces to SI_SUB_PSEUDO. Move Multicast to SI_SUB_PROTO_MC. Re-work parts of multicast initialisation and teardown, not taking the huge amount of memory into account if used as a module yet. For interface teardown we try to do as many of them as we can on SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling over a higher layer protocol such as IP. In that case the interface has to go along (or before) the higher layer protocol is shutdown. Kernel hhooks need to go last on teardown as they may be used at various higher layers and we cannot remove them before we cleaned up the higher layers. For interface teardown there are multiple paths: (a) a cloned interface is destroyed (inside a VIMAGE or in the base system), (b) any interface is moved from a virtual network stack to a different network stack ("vmove"), or (c) a virtual network stack is being shut down. All code paths go through if_detach_internal() where we, depending on the vmove flag or the vnet state, make a decision on how much to shut down; in case we are destroying a VNET the individual protocol layers will cleanup their own parts thus we cannot do so again for each interface as we end up with, e.g., double-frees, destroying locks twice or acquiring already destroyed locks. When calling into protocol cleanups we equally have to tell them whether they need to detach upper layer protocols ("ulp") or not (e.g., in6_ifdetach()). Provide or enahnce helper functions to do proper cleanup at a protocol rather than at an interface level. Approved by: re (hrs) Obtained from: projects/vnet Reviewed by: gnn, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D6747	2016-06-21 13:48:49 +00:00
Konstantin Belousov	d9a503bec4	Fix typo. Note that atomic is still required even for interlocked case. Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2016-06-20 15:45:50 +00:00
Mateusz Guzik	e896fb3bae	vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels This removes calls to empty functions like vop_lock_{pre/post} from common vfs routines. Approved by: re (gjb)	2016-06-17 19:41:30 +00:00
Konstantin Belousov	f8a75278dc	Add VFS interface to flush specified amount of free vnodes belonging to mount points with the given filesystem type, specified by mount vfs_ops pointer. Based on patch by: mckusick Reviewed by: avg, mckusick Tested by: allanjude, madpilot Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-06-17 17:33:25 +00:00
Konstantin Belousov	5c2cf81845	Update comments for the MD functions managing contexts for new threads, to make it less confusing and using modern kernel terms. Rename the functions to reflect current use of the functions, instead of the historic KSE conventions: cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads) cpu_set_upcall -> cpu_copy_thread (for forks) cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation) Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs) Differential revision: https://reviews.freebsd.org/D6731	2016-06-16 12:05:44 +00:00
Konstantin Belousov	bd07998e0e	Remove XXX comments from kern_thread.c. In one case, there is no reason for it in modern times. In the other case, expand the comment stating instead of doubting. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs) X-Differential revision: https://reviews.freebsd.org/D6731	2016-06-16 12:01:11 +00:00
Konstantin Belousov	13d2cd3b68	Remove code duplication. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs) X-Differential revision: https://reviews.freebsd.org/D6731	2016-06-16 11:58:46 +00:00
John Baldwin	fe0bdd1d2c	Move backend-specific fields of kaiocb into a union. This reduces the size of kaiocb slightly. I've also added some generic fields that other backends can use in place of the BIO-specific fields. Change the socket and Chelsio DDP backends to use 'backend3' instead of abusing _aiocb_private.status directly. This confines the use of _aiocb_private to the AIO internals in vfs_aio.c. Reviewed by: kib (earlier version) Approved by: re (gjb) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D6547	2016-06-15 20:56:45 +00:00
Konstantin Belousov	9fdbfd3b6c	Do not assume that we own the use reference on the covered vnode until we set MNTK_UNMOUNT flag on the mp. Otherwise parallel unmount which wins race with us could dereference the covered vnode, and we are left with the locked freed memory. Reported and tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week	2016-06-15 15:56:03 +00:00
Jamie Gritton	932a6e432d	Fix a vnode leak when giving a child jail a too-long path when debug.disablefullpath=1.	2016-06-09 21:59:11 +00:00
Jamie Gritton	cf0313c679	Re-order some jail parameter reading to prevent a vnode leak.	2016-06-09 20:43:14 +00:00
Jamie Gritton	176ff3a066	Clean up some logic in jail error messages, replacing a missing test and a redundant test with a single correct test.	2016-06-09 20:39:57 +00:00
Mariusz Zaborski	fb4cdc96a8	Define tunable instead of using CTLFLAG_RWTUN flag with kern.corefile. The allproc_lock lock used in the sysctl_kern_corefile function is initialized in the procinit function which is called after setting sysctl values at boot. That means if we set kern.corefile at boot we will be trying to use lock with is uninitialized and machine will crash. If we define kern.corefile as tunable instead of using CTFLAG_RWTUN we will not call the sysctl_kern_corefile function and we will not use an uninitialized lock. When machine will boot then we will start using function depending on the lock. Reviewed by: pjd	2016-06-09 20:23:30 +00:00
Conrad Meyer	8a3aeac27b	Add DDB command "kldstat" It prints much the same information as kldstat(8) without any arguments. Suggested by: jhibbits Sponsored by: EMC / Isilon Storage Division	2016-06-09 18:27:41 +00:00
Conrad Meyer	dd6ea7f7bc	kvprintf: Pad %c to width, like %s Sponsored by: EMC / Isilon Storage Division	2016-06-09 18:24:51 +00:00
Jamie Gritton	ef0ddea316	Make sure the OSD methods for jail set and remove can't run concurrently, by holding allprison_lock exclusively (even if only for a moment before downgrading) on all paths that call PR_METHOD_REMOVE. Since they may run on a downgraded lock, it's still possible for them to run concurrently with PR_METHOD_GET, which will need to use the prison lock.	2016-06-09 16:41:41 +00:00
Jamie Gritton	5f02f22af1	Remove a comment that was part of copied code, and is misleading in the new location.	2016-06-09 15:34:33 +00:00
Mark Johnston	508d856999	Fix some cosmetic issues in kern_fail.c omitted from r296927. Obtained from: Matthew Bryan <matthew.bryan@isilon.com>	2016-06-09 13:17:08 +00:00
Konstantin Belousov	3fc292d56b	Old process credentials for setuid execve must not be dereferenced when the process credentials were not changed. This can happen if an error occured trying to activate the setuid binary. And on error, if new credentials were not yet assigned, they must be freed to not create the leak. Use oldcred == NULL as the predicate to detect credential reassignment. Reported and tested by: pho Sponsored by: The FreeBSD Foundation	2016-06-08 04:37:03 +00:00
Mariusz Zaborski	b3a734483e	Introduce the PD_CLOEXEC for pdfork(2). Reviewed by: mjg	2016-06-08 02:09:14 +00:00
Svatopluk Kraus	c4263292fe	Remove temporary solution for storing interrupt mapping data as it's not needed after r301451 and follow-ups r301453, r301539. This makes INTRNG clean of all additions related to various buses.	2016-06-07 09:03:27 +00:00
Michal Meloun	949883bd72	INTRNG: As follow up of r301451, implement mapping and configuration of gpio pin interrupts by new way. Note: This removes last consumer of intr_ddata machinery and we remove it in separate commit.	2016-06-07 05:08:24 +00:00
Bjoern A. Zeeb	3af72c1124	Implement a `show panic` command to DDB which will helpfully print the panic string again if set, in case it scrolled out of the active window. This avoids having to remember the symbol name. Also add a show callout <addr> command to DDB in order to inspect some struct callout fields in case of panics in the callout code. This may help to see if there was memory corruption or to further ease debugging problems. Obtained from: projects/vnet MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Reviewed by: jhb (comment only on the show panic initally) Differential Revision: https://reviews.freebsd.org/D4527	2016-06-06 20:57:24 +00:00
Konstantin Belousov	93ccd6bf87	Get rid of struct proc p_sched and struct thread td_sched pointers. p_sched is unused. The struct td_sched is always co-allocated with the struct thread, except for the thread0. Avoid useless indirection, instead calculate td_sched location using simple pointer arithmetic in td_get_sched(9). For thread0, which is statically allocated, create a structure to emulate layout of the dynamic allocation. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D6711	2016-06-05 17:04:03 +00:00
Konstantin Belousov	314381b529	Use ANSI function definition. Sponsored by: The FreeBSD Foundation	2016-06-05 16:55:55 +00:00
Svatopluk Kraus	ad5244ece1	INTRNG - change the way how an interrupt mapping data are provided to the framework in OFW (FDT) case. This is a follow-up to r301451. Differential Revision: https://reviews.freebsd.org/D6634	2016-06-05 16:20:12 +00:00
Svatopluk Kraus	0869297dd9	(1) Add a new bus method to get a mapping data for an interrupt. BUS_MAP_INTR() is used to get an interrupt mapping data according to provided hints. The hints could be modified afterwards, but only if mapping data was allocated. This method is intended to be called before BUS_ALLOC_RESOURCE(). An interrupt mapping data describes an interrupt - hardware number, type, configuration, cpu binding, and whatever is needed to setup it. (2) Introduce a method which allows storing of an additional data in struct resource to be available for bus drivers. This method is convenient in two ways: - there is no need to rework existing bus drivers as they can simply be extended to provide an additional data, - there is no need to modify any existing bus methods as struct resource is already passed to them as argument and thus stored data is simply accessible by other bus drivers. For now, implement this method only for INTRNG. This is motivated by needs of modern SOCs where hardware initialization is not straightforward and resources descriptions are complex, opaque for everyone but provider, and may vary from SOC to SOC. Typical situation is that one bus driver can fetch a resource description for its child device, but it's opaque for this driver. Another bus driver knows a provider for this kind of resource and can pass this resource description to it. In fact, something like device IVARS would be perfect for that if implemented generally enough. Unfortunatelly, IVARS are usable only by their owners now. Only owner knows its IVARS layout, thus other bus drivers are not able to use them. Differential Revision: https://reviews.freebsd.org/D6632	2016-06-05 16:07:57 +00:00
Andrew Turner	d1605cda2b	Add an interface to handle interrupt controllers that have a contiguous range of interrupts they pass to a second controller driver to handle. The parent driver is expected to detect when one of these interrupts has been triggered and call intr_child_irq_handler to pass the interrupt to a child. The children controllers are then expected to manage the range by allocating interrupts as needed. This will initially be used by the ARM GICv3 driver, but is is expected to be useful for other driver where this type of allocation applies. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6436	2016-06-03 10:13:18 +00:00
Mateusz Guzik	f3c8e16ea5	taskqueue: plug a leak in _taskqueue_create While here make some style fixes and postpone the sprintf so that it is only done when the function can no longer fail. CID: 1356041	2016-06-02 15:52:34 +00:00
Mateusz Guzik	fc4f686d59	Microoptimize locking primitives by avoiding unnecessary atomic ops. Inline version of primitives do an atomic op and if it fails they fallback to actual primitives, which immediately retry the atomic op. The obvious optimisation is to check if the lock is free and only then proceed to do an atomic op. Reviewed by: jhb, vangyzen	2016-06-01 18:32:20 +00:00
Bjoern A. Zeeb	3f58662dd9	The pr_destroy field does not allow us to run the teardown code in a specific order. VNET_SYSUNINITs however are doing exactly that. Thus remove the VIMAGE conditional field from the domain(9) protosw structure and replace it with VNET_SYSUNINITs. This also allows us to change some order and to make the teardown functions file local static. Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use internally. Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g., for pfil consumers (firewalls), partially for this commit and for others to come. Reviewed by: gnn, tuexen (sctp), jhb (kernel.h) Obtained from: projects/vnet MFC after: 2 weeks X-MFC: do not remove pr_destroy Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6652	2016-06-01 10:14:04 +00:00
Gleb Smirnoff	34e05ebe72	Fix kernel stack disclosures in the Linux and 4.3BSD compat layers. Submitted by: CTurt Security: SA-16:20 Security: SA-16:21	2016-05-31 16:56:30 +00:00
Edward Tomasz Napierala	f7bd221730	Cosmetics - add missing space after ellipses in shutdown messages. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-05-31 15:27:33 +00:00
Jamie Gritton	ee8d6bd352	Mark jail(2), and the sysctls that it (and only it) uses as deprecated. jail(8) has long used jail_set(2), and those sysctl only cause confusion.	2016-05-30 05:21:24 +00:00
Mateusz Guzik	2dbdf49cf4	fd: provide a common exit point for unlock in kern_dup While here assert dropped filedesc lock on return from closefp.	2016-05-27 17:00:15 +00:00
Mateusz Guzik	cda688443a	exec: get rid of one vnode lock/unlock pair in do_execve The lock was temporarily dropped for vrele calls, but they can be postponed to a point where the lock is not held in the first place. While here shuffle other code not needing the lock.	2016-05-27 15:03:38 +00:00
Bryan Drewery	1afd78b34d	exec: Provide execpath in imgp for the process_exec hook. This was previously set after the hook and only if auxargs were present. Now always provide it if possible. MFC after: 2 weeks Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D6546	2016-05-26 23:19:39 +00:00
Bryan Drewery	881010f05d	exec: Add credential change information into imgp for process_exec hook. This allows an EVENTHANDLER(process_exec) hook to see if the new image will cause credentials to change whether due to setgid/setuid or because of POSIX saved-id semantics. This adds 3 new fields into image_params: struct ucred *newcred Non-null if the credentials will change. bool credential_setid True if the new image is setuid or setgid. This will pre-determine the new credentials before invoking the image activators, where the process_exec hook is called. The new credentials will be installed into the process in the same place as before, after image activators are done handling the image. MFC after: 2 weeks Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D6544	2016-05-26 23:18:54 +00:00
Conrad Meyer	571ebf7685	crypto routines: Hint minimum buffer sizes to the compiler Use the C99 'static' keyword to hint to the compiler IVs and output digest sizes. The keyword informs the compiler of the minimum valid size for a given array. Obviously not every pointer can be validated (i.e., the compiler can produce false negative but not false positive reports). No functional change. No ABI change. Sponsored by: EMC / Isilon Storage Division	2016-05-26 19:29:29 +00:00
Hans Petter Selasky	84e717c4cf	Add support for boolean sysctl's. Because the size of bool can be implementation defined, make a bool sysctl handler which handle bools. Userspace sees the bools like unsigned 8-bit integers. Values are filtered to either 1 or 0 upon read and write, similar to what a compiler would do. Requested by: kmacy @ Sponsored by: Mellanox Technologies	2016-05-26 08:41:55 +00:00
Ian Lepore	a66dc0c52b	Include machine/acle-compat.h in cdefs.h on arm if the compiler doesn't have ACLE support built in. The ACLE (ARM C Language Extensions) defines a set of standardized symbols which indicate the architecture version and features available. ACLE support is built in to modern compilers (both clang and gcc), but absent from gcc prior to 4.4. ARM (the company) provides the acle-compat.h header file to define the right symbols for older versions of gcc. Basically, acle-compat.h does for arm about the same thing cdefs.h does for freebsd: defines standardized macros that work no matter which compiler you use. If ARM hadn't provided this file we would have ended up with a big #ifdef __arm__ section in cdefs.h with our own compatibility shims. Remove #include <machine/acle-compat.h> from the zillion other places (an ever-growing list) that it appears. Since style(9) requires sys/types.h or sys/param.h early in the include list, and both of those lead to including cdefs.h, only a couple special cases still need to include acle-compat.h directly. Loves it: imp	2016-05-25 19:44:26 +00:00
Konstantin Belousov	c5e44d6cd5	Silence false LOR report due to the taskqueue mutex and kqueue lock named the same. Reported by: Doug Luce <doug@freebsd.con.com> Sponsored by: The FreeBSD Foundation	2016-05-24 21:13:33 +00:00
John Baldwin	778ce4f297	Return the correct status when a partially completed request is cancelled. After the previous changes to fix requests on blocking sockets to complete across multiple operations, an edge case exists where a request can be cancelled after it has partially completed. POSIX doesn't appear to dictate exactly how to handle this case, but in general I feel that aio_cancel() should arrange to cancel any request it can, but that any partially completed requests should return a partial completion rather than ECANCELED. To that end, fix the socket AIO cancellation routine to return a short read/write if a partially completed request is cancelled rather than ECANCELED. Sponsored by: Chelsio Communications	2016-05-24 21:09:05 +00:00
Andrew Turner	974692e3bf	Limit calling pmc_hook to when the interrupt comes while running userspace. We may enable interrupts from within the callback, e.g. in a data abort during copyin. If we receive an interrupt at that time pmc_hook will be called again and, as it is handling userspace stack tracing, will hit a KASSERT as it checks if the trapframe is from userland. With this I can run hwpmc with intrng on a ThunderX and have it trace all CPUs. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-24 12:06:56 +00:00
John Baldwin	1717b68af1	Don't prematurely return short completions on blocking sockets. Always requeue an AIO job at the head of the socket buffer's queue if sosend() or soreceive() returns EWOULDBLOCK on a blocking socket. Previously, requests were only requeued if they returned EWOULDBLOCK and completed no data. Now after a partial completion on a blocking socket the request is queued and the remaining request is retried when the socket is ready. This allows writes larger than the currently available space on a blocking socket to fully complete. Reads on a blocking socket that satifsy the low watermark can still return a short read (same as read()). In order to track previously completed data, the internal 'status' field of the AIO job is used to store the amount of previously computed data. Non-blocking sockets continue to return short completions for both reads and writes. Add a test for a "large" AIO write on a blocking socket that writes twice the socket buffer size to a UNIX domain socket. Sponsored by: Chelsio Communications	2016-05-24 03:13:27 +00:00
Alan Somers	37f32e5379	Fix build of kern/subr_unit.c, broken by r300539 Reported by: peter Pointyhat to: asomers Sponsored by: Spectra Logic Corp	2016-05-24 00:14:58 +00:00
Alan Somers	1b82e02f4d	Add bit_count to the bitstring(3) api Add a bit_count function, which efficiently counts the number of bits set in a bitstring. sys/sys/bitstring.h tests/sys/sys/bitstring_test.c share/man/man3/bitstring.3 Add bit_alloc sys/kern/subr_unit.c Use bit_count instead of a naive counting loop in check_unrhdr, used when INVARIANTS are enabled. The userland test runs about 6x faster in a generic build, or 8.5x faster when built for Nehalem, which has the POPCNT instruction. sys/sys/param.h Bump __FreeBSD_version due to the addition of bit_alloc UPDATING Add a note about the ABI incompatibility of the bitstring(3) changes, as suggested by lidl. Suggested by: gibbs Reviewed by: gibbs, ngie MFC after: 9 days X-MFC-With: 299090, 300538 Relnotes: yes Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6255	2016-05-23 20:29:18 +00:00
Andrew Turner	df7a2251cc	Add the needed hwpmc hooks to subr_intr.c. This is needed for the correct operation of hwpmc on, for example, arm64 with intrng. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-23 15:26:35 +00:00
Hans Petter Selasky	bc64e52679	Use DELAY() instead of _sleep() when SCHEDULER_STOPPED() is set inside pause_sbt(). This allows pause() to continue working during a panic() which is not invoking KDB. This is useful when debugging graphics drivers using the LinuxKPI. Obtained from: kmacy @ MFC after: 1 week	2016-05-23 10:31:54 +00:00
Baptiste Daroussin	306e53bce9	Fix typo introduced by me (not the submitter) when fixing typos	2016-05-22 13:10:48 +00:00
Baptiste Daroussin	2fd642c899	Fix typos in the comments Submitted by: cipherwraith666@gmail.com (via github)	2016-05-22 13:04:45 +00:00
Andriy Gapon	7107bed0f0	fix loss of taskqueue wakeups (introduced in r300113) Submitted by: kmacy Tested by: dchagin	2016-05-21 14:51:49 +00:00
John Baldwin	20fee1093e	Add sglist functions for working with arrays of VM pages. sglist_count_vmpages() determines the number of segments required for a buffer described by an array of VM pages. sglist_append_vmpages() adds the segments described by such a buffer to an sglist. The latter function is largely pulled from sglist_append_bio(), and sglist_append_bio() now uses sglist_append_vmpages(). Reviewed by: kib Sponsored by: Chelsio Communications	2016-05-20 23:28:43 +00:00
John Baldwin	f0ec174043	Consistently set status to -1 when completing an AIO request with an error. Sponsored by: Chelsio Communications	2016-05-20 19:46:25 +00:00
John Baldwin	cc981af204	Add new bus methods for mapping resources. Add a pair of bus methods that can be used to "map" resources for direct CPU access using bus_space(9). bus_map_resource() creates a mapping and bus_unmap_resource() releases a previously created mapping. Mappings are described by 'struct resource_map' object. Pointers to these objects can be passed as the first argument to the bus_space wrapper API used for bus resources. Drivers that wish to map all of a resource using default settings (for example, using uncacheable memory attributes) do not need to change. However, drivers that wish to use non-default settings can now do so without jumping through hoops. First, an RF_UNMAPPED flag is added to request that a resource is not implicitly mapped with the default settings when it is activated. This permits other activation steps (such as enabling I/O or memory decoding in a device's PCI command register) to be taken without creating a mapping. Right now the AGP drivers don't set RF_ACTIVE to avoid using up a large amount of KVA to map the AGP aperture on 32-bit platforms. Once RF_UNMAPPED is supported on all platforms that support AGP this can be changed to using RF_UNMAPPED with RF_ACTIVE instead. Second, bus_map_resource accepts an optional structure that defines additional settings for a given mapping. For example, a driver can now request to map only a subset of a resource instead of the entire range. The AGP driver could also use this to only map the first page of the aperture (IIRC, it calls pmap_mapdev() directly to map the first page currently). I will also eventually change the PCI-PCI bridge driver to request mappings of the subset of the I/O window resource on its parent side to create mappings for child devices rather than passing child resources directly up to nexus to be mapped. This also permits bridges that do address translation to request suitable mappings from a resource on the "upper" side of the bus when mapping resources on the "lower" side of the bus. Another attribute that can be specified is an alternate memory attribute for memory-mapped resources. This can be used to request a Write-Combining mapping of a PCI BAR in an MI fashion. (Currently the drivers that do this call pmap_change_attr() directly for x86 only.) Note that this commit only adds the MI framework. Each platform needs to add support for handling RF_UNMAPPED and thew new bus_map/unmap_resource methods. Generally speaking, any drivers that are calling rman_set_bustag() and rman_set_bushandle() need to be updated. Discussed on: arch Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D5237	2016-05-20 17:57:47 +00:00
Mark Johnston	5e0a6f31e5	Move IPv6 malloc tag definitions into the IPv6 code.	2016-05-20 04:45:08 +00:00
Scott Long	7e52504fc2	Adjust the creation of tq_name so it can be freed correctly Reviewed by: jhb, allanjude Differential Revision: D6454	2016-05-19 17:14:24 +00:00
Kenneth D. Merry	9a6844d55f	Add support for managing Shingled Magnetic Recording (SMR) drives. This change includes support for SCSI SMR drives (which conform to the Zoned Block Commands or ZBC spec) and ATA SMR drives (which conform to the Zoned ATA Command Set or ZAC spec) behind SAS expanders. This includes full management support through the GEOM BIO interface, and through a new userland utility, zonectl(8), and through camcontrol(8). This is now ready for filesystems to use to detect and manage zoned drives. (There is no work in progress that I know of to use this for ZFS or UFS, if anyone is interested, let me know and I may have some suggestions.) Also, improve ATA command passthrough and dispatch support, both via ATA and ATA passthrough over SCSI. Also, add support to camcontrol(8) for the ATA Extended Power Conditions feature set. You can now manage ATA device power states, and set various idle time thresholds for a drive to enter lower power states. Note that this change cannot be MFCed in full, because it depends on changes to the struct bio API that break compatilibity. In order to avoid breaking the stable API, only changes that don't touch or depend on the struct bio changes can be merged. For example, the camcontrol(8) changes don't depend on the new bio API, but zonectl(8) and the probe changes to the da(4) and ada(4) drivers do depend on it. Also note that the SMR changes have not yet been tested with an actual SCSI ZBC device, or a SCSI to ATA translation layer (SAT) that supports ZBC to ZAC translation. I have not yet gotten a suitable drive or SAT layer, so any testing help would be appreciated. These changes have been tested with Seagate Host Aware SATA drives attached to both SAS and SATA controllers. Also, I do not have any SATA Host Managed devices, and I suspect that it may take additional (hopefully minor) changes to support them. Thanks to Seagate for supplying the test hardware and answering questions. sbin/camcontrol/Makefile: Add epc.c and zone.c. sbin/camcontrol/camcontrol.8: Document the zone and epc subcommands. sbin/camcontrol/camcontrol.c: Add the zone and epc subcommands. Add auxiliary register support to build_ata_cmd(). Make sure to set the CAM_ATAIO_NEEDRESULT, CAM_ATAIO_DMA, and CAM_ATAIO_FPDMA flags as appropriate for ATA commands. Add a new get_ata_status() function to parse ATA result from SCSI sense descriptors (for ATA passthrough over SCSI) and ATA I/O requests. sbin/camcontrol/camcontrol.h: Update the build_ata_cmd() prototype Add get_ata_status(), zone(), and epc(). sbin/camcontrol/epc.c: Support for ATA Extended Power Conditions features. This includes support for all features documented in the ACS-4 Revision 12 specification from t13.org (dated February 18, 2016). The EPC feature set allows putting a drive into a power power mode immediately, or setting timeouts so that the drive will automatically enter progressively lower power states after various idle times. sbin/camcontrol/fwdownload.c: Update the firmware download code for the new build_ata_cmd() arguments. sbin/camcontrol/zone.c: Implement support for Shingled Magnetic Recording (SMR) drives via SCSI Zoned Block Commands (ZBC) and ATA Zoned Device ATA Command Set (ZAC). These specs were developed in concert, and are functionally identical. The primary differences are due to SCSI and ATA differences. (SCSI is big endian, ATA is little endian, for example.) This includes support for all commands defined in the ZBC and ZAC specs. sys/cam/ata/ata_all.c: Decode a number of additional ATA command names in ata_op_string(). Add a new CCB building function, ata_read_log(). Add ata_zac_mgmt_in() and ata_zac_mgmt_out() CCB building functions. These support both DMA and NCQ encapsulation. sys/cam/ata/ata_all.h: Add prototypes for ata_read_log(), ata_zac_mgmt_out(), and ata_zac_mgmt_in(). sys/cam/ata/ata_da.c: Revamp the ada(4) driver to support zoned devices. Add four new probe states to gather information needed for zone support. Add a new adasetflags() function to avoid duplication of large blocks of flag setting between the async handler and register functions. Add new sysctl variables that describe zone support and paramters. Add support for the new BIO_ZONE bio, and all of its subcommands: DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP, DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS. sys/cam/scsi/scsi_all.c: Add command descriptions for the ZBC IN/OUT commands. Add descriptions for ZBC Host Managed devices. Add a new function, scsi_ata_pass() to do ATA passthrough over SCSI. This will eventually replace scsi_ata_pass_16() -- it can create the 12, 16, and 32-byte variants of the ATA PASS-THROUGH command, and supports setting all of the registers defined as of SAT-4, Revision 5 (March 11, 2016). Change scsi_ata_identify() to use scsi_ata_pass() instead of scsi_ata_pass_16(). Add a new scsi_ata_read_log() function to facilitate reading ATA logs via SCSI. sys/cam/scsi/scsi_all.h: Add the new ATA PASS-THROUGH(32) command CDB. Add extended and variable CDB opcodes. Add Zoned Block Device Characteristics VPD page. Add ATA Return SCSI sense descriptor. Add prototypes for scsi_ata_read_log() and scsi_ata_pass(). sys/cam/scsi/scsi_da.c: Revamp the da(4) driver to support zoned devices. Add five new probe states, four of which are needed for ATA devices. Add five new sysctl variables that describe zone support and parameters. The da(4) driver supports SCSI ZBC devices, as well as ATA ZAC devices when they are attached via a SCSI to ATA Translation (SAT) layer. Since ZBC -> ZAC translation is a new feature in the T10 SAT-4 spec, most SATA drives will be supported via ATA commands sent via the SCSI ATA PASS-THROUGH command. The da(4) driver will prefer the ZBC interface, if it is available, for performance reasons, but will use the ATA PASS-THROUGH interface to the ZAC command set if the SAT layer doesn't support translation yet. As I mentioned above, ZBC command support is untested. Add support for the new BIO_ZONE bio, and all of its subcommands: DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP, DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS. Add scsi_zbc_in() and scsi_zbc_out() CCB building functions. Add scsi_ata_zac_mgmt_out() and scsi_ata_zac_mgmt_in() CCB/CDB building functions. Note that these have return values, unlike almost all other CCB building functions in CAM. The reason is that they can fail, depending upon the particular combination of input parameters. The primary failure case is if the user wants NCQ, but fails to specify additional CDB storage. NCQ requires using the 32-byte version of the SCSI ATA PASS-THROUGH command, and the current CAM CDB size is 16 bytes. sys/cam/scsi/scsi_da.h: Add ZBC IN and ZBC OUT CDBs and opcodes. Add SCSI Report Zones data structures. Add scsi_zbc_in(), scsi_zbc_out(), scsi_ata_zac_mgmt_out(), and scsi_ata_zac_mgmt_in() prototypes. sys/dev/ahci/ahci.c: Fix SEND / RECEIVE FPDMA QUEUED in the ahci(4) driver. ahci_setup_fis() previously set the top bits of the sector count register in the FIS to 0 for FPDMA commands. This is okay for read and write, because the PRIO field is in the only thing in those bits, and we don't implement that further up the stack. But, for SEND and RECEIVE FPDMA QUEUED, the subcommand is in that byte, so it needs to be transmitted to the drive. In ahci_setup_fis(), always set the the top 8 bits of the sector count register. We need it in both the standard and NCQ / FPDMA cases. sys/geom/eli/g_eli.c: Pass BIO_ZONE commands through the GELI class. sys/geom/geom.h: Add g_io_zonecmd() prototype. sys/geom/geom_dev.c: Add new DIOCZONECMD ioctl, which allows sending zone commands to disks. sys/geom/geom_disk.c: Add support for BIO_ZONE commands. sys/geom/geom_disk.h: Add a new flag, DISKFLAG_CANZONE, that indicates that a given GEOM disk client can handle BIO_ZONE commands. sys/geom/geom_io.c: Add a new function, g_io_zonecmd(), that handles execution of BIO_ZONE commands. Add permissions check for BIO_ZONE commands. Add command decoding for BIO_ZONE commands. sys/geom/geom_subr.c: Add DDB command decoding for BIO_ZONE commands. sys/kern/subr_devstat.c: Record statistics for REPORT ZONES commands. Note that the number of bytes transferred for REPORT ZONES won't quite match what is received from the harware. This is because we're necessarily counting bytes coming from the da(4) / ada(4) drivers, which are using the disk_zone.h interface to communicate up the stack. The structure sizes it uses are slightly different than the SCSI and ATA structure sizes. sys/sys/ata.h: Add many bit and structure definitions for ZAC, NCQ, and EPC command support. sys/sys/bio.h: Convert the bio_cmd field to a straight enumeration. This will yield more space for additional commands in the future. After change r297955 and other related changes, this is now possible. Converting to an enumeration will also prevent use as a bitmask in the future. sys/sys/disk.h: Define the DIOCZONECMD ioctl. sys/sys/disk_zone.h: Add a new API for managing zoned disks. This is very close to the SCSI ZBC and ATA ZAC standards, but uses integers in native byte order instead of big endian (SCSI) or little endian (ATA) byte arrays. This is intended to offer to the complete feature set of the ZBC and ZAC disk management without requiring the application developer to include SCSI or ATA headers. We also use one set of headers for ioctl consumers and kernel bio-level consumers. sys/sys/param.h: Bump __FreeBSD_version for sys/bio.h command changes, and inclusion of SMR support. usr.sbin/Makefile: Add the zonectl utility. usr.sbin/diskinfo/diskinfo.c Add disk zoning capability to the 'diskinfo -v' output. usr.sbin/zonectl/Makefile: Add zonectl makefile. usr.sbin/zonectl/zonectl.8 zonectl(8) man page. usr.sbin/zonectl/zonectl.c The zonectl(8) utility. This allows managing SCSI or ATA zoned disks via the disk_zone.h API. You can report zones, reset write pointers, get parameters, etc. Sponsored by: Spectra Logic Differential Revision: https://reviews.freebsd.org/D6147 Reviewed by: wblock (documentation)	2016-05-19 14:08:36 +00:00
Gleb Smirnoff	e987742995	The SA-16:19 wouldn't have happened if the sockargs() had properly typed argument for length. While here make it static and convert to ANSI C. Reviewed by: C Turt	2016-05-18 22:05:50 +00:00
Ravi Pokala	08907ec39d	Fix misleading comments in bus_if.m While looking at r300073, I noticed these incorrect comments in the context of the diff. Reviewed by: imp, jhb Differential Revision: https://reviews.freebsd.org/D6431	2016-05-18 16:25:34 +00:00
Andrew Turner	9346e9130d	Return the struct intr_pic pointer from intr_pic_register. This will be needed in later changes where we may not be able to lock the pic list lock to perform a lookup, e.g. from within interrupt context. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-18 15:05:44 +00:00
Konstantin Belousov	3f7ca894de	Ensure that ftruncate(2) is performed synchronously when file is opened in O_SYNC mode, at least for UFS. This also handles truncation, done due to the O_SYNC \| O_TRUNC flags combination to open(2), in synchronous way. Noted by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-05-18 12:03:57 +00:00
Scott Long	4c7070db25	Import the 'iflib' API library for network drivers. From the author: "iflib is a library to eliminate the need for frequently duplicated device independent logic propagated (poorly) across many network drivers." Participation is purely optional. The IFLIB kernel config option is provided for drivers that want to transition between legacy and iflib modes of operation. ixl and ixgbe driver conversions will be committed shortly. We hope to see participation from the Broadcom and maybe Chelsio drivers in the near future. Submitted by: mmacy@nextbsd.org Reviewed by: gallatin Differential Revision: D5211	2016-05-18 04:35:58 +00:00
Mark Johnston	ef89d843d9	Do not acquire the thread lock in hardclock_cnt() unless needed. This function only sets thread flags if a SIGPROF or SIGVTALRM timer has fired, which is almost never the case. MFC after: 2 weeks	2016-05-18 03:55:54 +00:00
Mark Johnston	d2f5f8db87	Micro-optimize sleepq_broadcast(). - Avoid a conditional branch on the return value of sleepq_resume_thread() by ORing its return value into the boolean wakeup_swapper. This is consistent with other sleepqueue functions which just pass this return value to their caller. - sleepq_resume_thread() unconditionally removes the thread from its queue, so there's no need to maintain a pointer to the next element in the queue. MFC after: 2 weeks	2016-05-18 03:50:21 +00:00
Mark Johnston	be2dfd58fe	Remove the MUTEX_DEBUG kernel option. It has no counterpart among the other lock primitives and has been a no-op for years. Mutex consistency checks are generally done whenver INVARIANTS is enabled.	2016-05-18 03:34:02 +00:00
Mark Johnston	5002e19502	Guard the lockstat:::thread-spin probe with KDTRACE_HOOKS. X-MFC-With: r300103	2016-05-18 03:23:07 +00:00
Mark Johnston	156fbc14a0	lockstat:::thread-spin should only fire after spinning for the lock. MFC after: 1 week	2016-05-18 03:21:21 +00:00
Gleb Smirnoff	17cd649f4a	Add a comment and KASSERT that a M_NOFREE mbuf has always EXT_EXTREF ext. Submitted by: kmacy	2016-05-17 23:15:16 +00:00
Warner Losh	0ac974ec78	Don't forget to quote \ characters with \.	2016-05-17 22:52:42 +00:00
Gleb Smirnoff	7349ea785c	Validate that user supplied control message length is not negative. Submitted by: C Turt <cturt hardenedbsd.org> Security: SA-16:19 Security: CVE-2016-1887	2016-05-17 22:28:53 +00:00
John Baldwin	ed7ed7f0ca	Document the formatting requirements of location and pnpinfo strings. devd requires location and pnpinfo strings generated by bus drivers to be formatted as a list of name=value keypairs. Non-conforming bus drivers cause devd to mis-parse device events for these buses. Note that this documents the desired requirements. devctl_safe_quote() doesn't yet escape backslash characters, and devd doesn't handle escaped characters in quoted values. Differential Revision: https://reviews.freebsd.org/D6252	2016-05-17 19:34:07 +00:00
Konstantin Belousov	2a339d9e3d	Add implementation of robust mutexes, hopefully close enough to the intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013. A robust mutex is guaranteed to be cleared by the system upon either thread or process owner termination while the mutex is held. The next mutex locker is then notified about inconsistent mutex state and can execute (or abandon) corrective actions. The patch mostly consists of small changes here and there, adding neccessary checks for the inconsistent and abandoned conditions into existing paths. Additionally, the thread exit handler was extended to iterate over the userspace-maintained list of owned robust mutexes, unlocking and marking as terminated each of them. The list of owned robust mutexes cannot be maintained atomically synchronous with the mutex lock state (it is possible in kernel, but is too expensive). Instead, for the duration of lock or unlock operation, the current mutex is remembered in a special slot that is also checked by the kernel at thread termination. Kernel must be aware about the per-thread location of the heads of robust mutex lists and the current active mutex slot. When a thread touches a robust mutex for the first time, a new umtx op syscall is issued which informs about location of lists heads. The umtx sleep queues for PP and PI mutexes are split between non-robust and robust. Somewhat unrelated changes in the patch: 1. Style. 2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared pi mutexes. 3. Removal of the userspace struct pthread_mutex m_owner field. 4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls the lifetime of the shared mutex associated with a vnode' page. Reviewed by: jilles (previous version, supposedly the objection was fixed) Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects) Tested by: pho Sponsored by: The FreeBSD Foundation	2016-05-17 09:56:22 +00:00
Andrew Turner	3fc155dc64	Introduce MSI and MSI-X support to intrng. This adds a new msi device interface with 5 methods to mirror the 5 MSI/MSI-X methods in the pcib interface. The pcib driver will need to perform a device specific lookup to find the MSI controller and pass this to intrng as the xref. Intrng will finally find the controller and have it handle the requested operation. Obtained from: ABT Systems Ltd MFH: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5985	2016-05-16 09:11:40 +00:00
Andriy Gapon	27d4b35f6e	vfs_read_dirent: increment ncookies after adding a cookie It seems that at present vfs_read_dirent() is used only with filesystems that do not support cookies, so the bug never manifested itself. MFC after: 1 week	2016-05-16 07:31:11 +00:00
Andriy Gapon	8614f45b2d	dounmount: do not call mountcheckdirs() for mounts with MNT_IGNORE This is a bit hackish, but the flag is currently set only for ZFS snapshots mounted under .zfs. mountcheckdirs() can change cdir/rdir references to a covered vnode. But for the said snapshots the covered vnode is really ephemeral and it must never be accessed (except for a few specific cases). To do: consider removing mountcheckdirs() entirely MFC after: 5 days	2016-05-16 07:23:24 +00:00
John Baldwin	fdce57a042	Add an EARLY_AP_STARTUP option to start APs earlier during boot. Currently, Application Processors (non-boot CPUs) are started by MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until SI_SUB_SMP at which point they are released to run kernel threads. SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter the scheduler and start running threads until fairly late in the boot. This change moves SI_SUB_SMP up to just before software interrupt threads are created allowing the APs to start executing kernel threads much sooner (before any devices are probed). This allows several initialization routines that need to perform initialization on all CPUs to now perform that initialization in one step rather than having to defer the AP initialization to a second SYSINIT run at SI_SUB_SMP. It also permits all CPUs to be available for handling interrupts before any devices are probed. This last feature fixes a problem on with interrupt vector exhaustion. Specifically, in the old model all device interrupts were routed onto the boot CPU during boot. Later after the APs were released at SI_SUB_SMP, interrupts were redistributed across all CPUs. However, several drivers for multiqueue hardware allocate N interrupts per CPU in the system. In a system with many CPUs, just a few drivers doing this could exhaust the available pool of interrupt vectors on the boot CPU as each driver was allocating N * mp_ncpu vectors on the boot CPU. Now, drivers will allocate interrupts on their desired CPUs during boot meaning that only N interrupts are allocated from the boot CPU instead of N * mp_ncpu. Some other bits of code can also be simplified as smp_started is now true much earlier and will now always be true for these bits of code. This removes the need to treat the single-CPU boot environment as a special case. As a transition aid, the new behavior is available under a new kernel option (EARLY_AP_STARTUP). This will allow the option to be turned off if need be during initial testing. I plan to enable this on x86 by default in a followup commit in the next few days and to have all platforms moved over before 11.0. Once the transition is complete, the option will be removed along with the !EARLY_AP_STARTUP code. These changes have only been tested on x86. Other platform maintainers are encouraged to port their architectures over as well. The main things to check for are any uses of smp_started in MD code that can be simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in the EARLY_AP_STARTUP case (e.g. the interrupt shuffling). PR: kern/199321 Reviewed by: markj, gnn, kib Sponsored by: Netflix	2016-05-14 18:22:52 +00:00
Edward Tomasz Napierala	ebc2f37754	Stop hiding errors that result in failure to mount /dev. Otherwise, missing /dev directory makes one end up with a completely deaf (init without stdout/stderr) system with no hints on the console, unless you've booted up with bootverbose. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-05-12 07:38:10 +00:00
Conrad Meyer	fe4be618c9	subr_vmem: Fix double-free in error case of vmem_create If vmem_init() fails, 'vm' is already destroyed and freed. Don't free it again. Reported by: Coverity CID: 1042110 Sponsored by: EMC / Isilon Storage Division	2016-05-11 23:16:11 +00:00
Konstantin Belousov	54a33d2f97	Add vfs_hash_ref(9) function, which finds a vnode by the hash value and returns it referenced. The function is similar to vfs_hash_get(9), but unlike the later, returned vnode is not locked. This operation cannot be requested with the vget(9) flags. Reviewed and tested by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-11 06:32:22 +00:00
Konstantin Belousov	cd85d599d8	Style: wrap long lines. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-11 06:27:00 +00:00
John Baldwin	8d791e5af1	Add a new bus method to fetch device-specific CPU sets. bus_get_cpus() returns a specified set of CPUs for a device. It accepts an enum for the second parameter that indicates the type of cpuset to request. Currently two valus are supported: - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the device when DEVICE_NUMA is enabled) - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core) For systems that do not support NUMA (or if it is not enabled in the kernel config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus' by default. The idea is that INTR_CPUS should always return a valid set. Device drivers which want to use per-CPU interrupts should start using INTR_CPUS instead of simply assigning interrupts to all available CPUs. In the future we may wish to add tunables to control the policy of INTR_CPUS (e.g. should it be local-only or global, should it ignore SMT threads or not). The x86 nexus driver exposes the internal set of interrupt CPUs from the the x86 interrupt code via INTR_CPUS. The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and the global INTR_CPUS set from the nexus driver with the per-domain set from _PXM to generate a local INTR_CPUS set for child devices. Compared to the r298933, this version uses 'struct _cpuset' in <sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h> (<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though <sys/_bitset.h> does not after recent changes).	2016-05-09 20:50:21 +00:00
Andrew Turner	b48c608386	Check malloc succeeded in pic_create, with M_NOWAIT it may return NULL. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-09 12:24:39 +00:00
Mateusz Guzik	0cfe1a1fec	fd: assert dropped filedesc lock in fdcloseexec	2016-05-08 03:26:12 +00:00
Svatopluk Kraus	a100280e59	Set correct size to the size member of struct intr_map_data when initialized. As the size member is not used at the present, it did not break anything.	2016-05-06 08:54:00 +00:00
Ed Maste	b4f6936599	Add explicit cast to fix mips and powerpc build after r299090 Sponsored by: The FreeBSD Foundation	2016-05-05 15:21:33 +00:00
Svatopluk Kraus	cd642c88a1	INTRNG - redefine struct intr_map_data to avoid headers pollution. Each struct associated with some type defined in enum intr_map_data_type must have struct intr_map_data on the top of its own definition now. When such structs are used, correct type and size must be filled in. There are three such structs defined in sys/intr.h now. Their definitions should be moved to corresponding headers by follow-up commits. While this change was propagated to all INTRNG like PICs, pic_map_intr() method implementations were corrected on some places. For this specific method, it's ensured by a caller that the 'data' argument passed to this method is never NULL. Also, the return error values were standardized there.	2016-05-05 13:31:19 +00:00
Svatopluk Kraus	15adccc687	Remove superfluous check. The pic_dev member of struct pic is never NULL on PIC found by pic_lookup().	2016-05-05 13:23:38 +00:00
Adrian Chadd	d17f808fab	s/struct device */device_t/g Submitted by: kmacy	2016-05-04 23:31:52 +00:00
Alan Somers	8907f744ff	Improve performance and functionality of the bitstring(3) api Two new functions are provided, bit_ffs_at() and bit_ffc_at(), which allow for efficient searching of set or cleared bits starting from any bit offset within the bit string. Performance is improved by operating on longs instead of bytes and using ffsl() for searches within a long. ffsl() is a compiler builtin in both clang and gcc for most architectures, converting what was a brute force while loop search into a couple of instructions. All of the bitstring(3) API continues to be contained in the header file. Some of the functions are large enough that perhaps they should be uninlined and moved to a library, but that is beyond the scope of this commit. sys/sys/bitstring.h: Convert the majority of the existing bit string implementation from macros to inline functions. Properly protect the implementation from inadvertant macro expansion when included in a user's program by prefixing all private macros/functions and local variables with '_'. Add bit_ffs_at() and bit_ffc_at(). Implement bit_ffs() and bit_ffc() in terms of their "at" counterparts. Provide a kernel implementation of bit_alloc(), making the full API usable in the kernel. Improve code documenation. share/man/man3/bitstring.3: Add pre-exisiting API bit_ffc() to the synopsis. Document new APIs. Document the initialization state of the bit strings allocated/declared by bit_alloc() and bit_decl(). Correct documentation for bitstr_size(). The original code comments indicate the size is in bytes, not "elements of bitstr_t". The new implementation follows this lead. Only hastd assumed "elements" rather than bytes and it has been corrected. etc/mtree/BSD.tests.dist: tests/sys/Makefile: tests/sys/sys/Makefile: tests/sys/sys/bitstring.c: Add tests for all existing and new functionality. include/bitstring.h Include all headers needed by sys/bitstring.h lib/libbluetooth/bluetooth.h: usr.sbin/bluetooth/hccontrol/le.c: Include bitstring.h instead of sys/bitstring.h. sbin/hastd/activemap.c: Correct usage of bitstr_size(). sys/dev/xen/blkback/blkback.c Use new bit_alloc. sys/kern/subr_unit.c: Remove hard-coded assumption that sizeof(bitstr_t) is 1. Get rid of unrb.busy, which caches the number of bits set in unrb.map. When INVARIANTS are disabled, nothing needs to know that information. callapse_unr can be adapted to use bit_ffs and bit_ffc instead. Eliminating unrb.busy saves memory, simplifies the code, and provides a slight speedup when INVARIANTS are disabled. sys/net/flowtable.c: Use the new kernel implementation of bit-alloc, instead of hacking the old libc-dependent macro. sys/sys/param.h Update __FreeBSD_version to indicate availability of new API Submitted by: gibbs, asomers Reviewed by: gibbs, ngie MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6004	2016-05-04 22:34:11 +00:00
Roger Pau Monné	731c90b713	rtc: fix inverted resolution check The current code in clock_register checks if the newly added clock has a resolution value higher than the current one in order to make it the default, which is wrong. Clocks with a lower resolution value should be better than ones with a higher resolution value, in fact with the current code FreeBSD is always selecting the worse clock. Reviewed by: kib jhb jkim Sponsored by: Citrix Systems R&D MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D6185	2016-05-04 13:48:59 +00:00
Sepherosa Ziehau	d5f0ea7ca2	kern: Factor out function to convert hash flags to malloc(9) flags Suggested by: jhb Reviewed by: jhb, kib Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6184	2016-05-04 03:07:52 +00:00
Konstantin Belousov	c89e1b8739	Add EVFILT_VNODE open, read and close notifications. While there, order EVFILT_VNODE notes descriptions alphabetically. Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru> MFC after: 2 weeks	2016-05-03 15:17:43 +00:00
Sepherosa Ziehau	f8ce3dfaf1	kern: Add phashinit_flags(), which allows malloc(M_NOWAIT) It will be used for the upcoming LRO hash table initialization. And probably will be useful in other cases, when M_WAITOK can't be used. Reviewed by: jhb, kib Sponsored by: Microsoft OSTC Differential Revision: https://reviews.freebsd.org/D6138	2016-05-03 07:17:13 +00:00
John Baldwin	8a08b7d36b	Revert bus_get_cpus() for now. I really thought I had run this through the tinderbox before committing, but many places need <sys/types.h> -> <sys/param.h> for <sys/bus.h> now.	2016-05-03 01:17:40 +00:00
John Baldwin	bc153c692f	Add a new bus method to fetch device-specific CPU sets. bus_get_cpus() returns a specified set of CPUs for a device. It accepts an enum for the second parameter that indicates the type of cpuset to request. Currently two valus are supported: - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the device when DEVICE_NUMA is enabled) - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core) For systems that do not support NUMA (or if it is not enabled in the kernel config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus' by default. The idea is that INTR_CPUS should always return a valid set. Device drivers which want to use per-CPU interrupts should start using INTR_CPUS instead of simply assigning interrupts to all available CPUs. In the future we may wish to add tunables to control the policy of INTR_CPUS (e.g. should it be local-only or global, should it ignore SMT threads or not). The x86 nexus driver exposes the internal set of interrupt CPUs from the the x86 interrupt code via INTR_CPUS. The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and the global INTR_CPUS set from the nexus driver with the per-domain set from _PXM to generate a local INTR_CPUS set for child devices. Reviewed by: wblock (manpage) Differential Revision: https://reviews.freebsd.org/D5519	2016-05-02 18:00:38 +00:00
Konstantin Belousov	f7b71c8a5b	Issue NOTE_EXTEND when a directory entry is added to or removed from the monitored directory as the result of rename(2) operation. The renames staying in the directory are not reported. Submitted by: Vladimir Kondratyev <wulf@cicgroup.ru> MFC after: 2 weeks	2016-05-02 13:18:17 +00:00
Konstantin Belousov	bd2ead6b2e	Fix reporting of NOTE_LINK when directory link count changes due to rename removing or adding subdirectory entry. Discussed with and tested by: Vladimir Kondratyev <wulf@cicgroup.ru> NetBSD PR: 48958 (http://gnats.netbsd.org/48958) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2016-05-02 13:13:32 +00:00
Pedro F. Giffuni	e3043798aa	sys/kern: spelling fixes in comments. No functional change.	2016-04-29 22:15:33 +00:00
Pedro F. Giffuni	31b6732008	sys/kern: spelling fixes. Mostly on comments but affects some debug messages. MFC after: 2 weeks	2016-04-29 21:54:28 +00:00
Alan Somers	794277da54	Automate the subr_unit test. Build and install the subr_unit test program originally written by phk, and run it with the other ATF tests. tests/sys/kern/Makefile * Build and install the subr_unit test as a plain test sys/kern/subr_unit.c * Reduce the default number of repetitions from 100 to 1, and add a command-line parser to override it. * Don't be so noisy by default * Fix an include problem for the test build Reviewed by: ngie MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6038	2016-04-29 21:11:31 +00:00
John Baldwin	5163d2ec94	Expose soaio_enqueue(). This can be used by protocol-specific AIO handlers to queue work to the socket AIO daemon pool. Sponsored by: Chelsio Communications	2016-04-29 20:12:45 +00:00
John Baldwin	8722384b22	Introduce a new protocol hook pru_aio_queue. This allows a protocol to claim individual AIO requests instead of using the default socket AIO handling. Sponsored by: Chelsio Communications	2016-04-29 20:11:09 +00:00
Pedro F. Giffuni	f3b827f3ec	bufs: make B_DIRTY and B_PERSISTENT flags available It appears these flags were related to ext2fs but are completely unused nowadays. Retire them. Suggested by: mckusick	2016-04-29 16:32:28 +00:00
Michal Meloun	8442087f15	INTRNG: Define 'INTR_IRQ_INVALID' constant and use it consistently as error indicator.	2016-04-28 12:04:12 +00:00
Michal Meloun	39f6c1bdf4	GPIO: Add support for gpio pin interrupts. Add new function gpio_alloc_intr_resource(), which allows an allocation of interrupt resource associated to given gpio pin. It also allows to specify interrupt configuration. Note: This functionality is dependent on INTRNG, and must be implemented in each GPIO controller.	2016-04-28 12:03:22 +00:00
John Baldwin	e240255ffc	Add a bus_null_rescan() method that always fails with an error. Use this in place of kobj_error_method to disable BUS_RESCAN() on PCI drivers that do not use the "standard" scanning algorithm.	2016-04-27 17:49:42 +00:00
John Baldwin	88eb5c506d	Add 'devctl delete' that calls device_delete_child(). 'devctl delete' can be used to delete a device that is no longer present. As an anti-foot-shooting measure, 'delete' will not delete a device unless it's parent bus says it is no longer present. This can be overridden by passing the force ('-f') flag. Note that this command should be used with care. If a device is deleted that is actually present it can't be resurrected unless the parent bus device's driver supports rescans. Differential Revision: https://reviews.freebsd.org/D6019	2016-04-27 16:33:17 +00:00
John Baldwin	a907c6914c	Add a new rescan method to the bus interface. The BUS_RESCAN() method rescans a single bus device checking for devices that have been added or removed from the bus. A new 'rescan' command is added to devctl(8) to trigger a rescan. Differential Revision: https://reviews.freebsd.org/D6016	2016-04-27 16:29:03 +00:00
Jamie Gritton	73d9e52d2f	Delay revmoing the last jail reference in prison_proc_free, and instead put it off into the pr_task. This is similar to prison_free, and in fact uses the same task even though they do something slightly different. This resolves a LOR between the process lock and allprison_lock, which came about in r298565. PR: 48471	2016-04-27 02:25:21 +00:00
Conrad Meyer	da95a2ae56	posix4_mib: Don't overrun facility_initialized array The facility_initialized and facility arrays are the same size and were intended to be indexed the same. I believe this mismatch was just a typo/braino in r208731. Reported by: Coverity CID: 1017430 Sponsored by: EMC / Isilon Storage Division	2016-04-27 00:10:32 +00:00
Conrad Meyer	a286650b08	subr_mbpool: Don't free bogus pointer in error paths An mbpool is allocated with a contiguous array of mbpages. Freeing an individual mbpage has never been valid. Don't do it. This bug has been present since this code was introduced in r117624 (2003). Reported by: Coverity CID: 1009687 Sponsored by: EMC / Isilon Storage Division	2016-04-26 23:58:55 +00:00
Jamie Gritton	1fb6767d27	Use crcopysafe in jail_attach.	2016-04-26 21:19:12 +00:00
Conrad Meyer	aa90aec270	osd(9): Change array pointer to array pointer type from void* This is a minor follow-up to r297422, prompted by a Coverity warning. (It's not a real defect, just a code smell.) OSD slot array reservations are an array of pointers (void *) but were cast to void and back unnecessarily. Keep the correct type from reservation to use. osd.9 is updated to match, along with a few trivial igor fixes. Reported by: Coverity CID: 1353811 Sponsored by: EMC / Isilon Storage Division	2016-04-26 19:57:35 +00:00
Jamie Gritton	5579267b08	Redo the changes to the SYSV IPC sysctl functions from r298585, so they don't (mis)use sbufs. PR: 48471	2016-04-26 18:17:44 +00:00
Pedro F. Giffuni	55e0987aea	sys: extend use of the howmany() macro when available. We have a howmany() macro in the <sys/param.h> header that is convenient to re-use as it makes things easier to read.	2016-04-26 15:38:17 +00:00
Ruslan Bukin	3a32292401	Add support for RISC-V.	2016-04-26 12:29:47 +00:00
Ruslan Bukin	30b72b6871	Move arm's devmap to some generic place, so it can be used by other architectures. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D6091 Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-04-26 11:53:37 +00:00
Jamie Gritton	0bfd7a267e	Fix the logic in r298585: shm_prison_cansee returns an errno, so is the opposite of a boolean. PR: 48471	2016-04-25 22:30:10 +00:00
Jamie Gritton	52a510ace9	Encapsulate SYSV IPC objects in jails. Define per-module parameters sysvmsg, sysvsem, and sysvshm, with the following bahavior: inherit: allow full access to the IPC primitives. This is the same as the current setup with allow.sysvipc is on. Jails and the base system can see (and moduly) each other's objects, which is generally considered a bad thing (though may be useful in some circumstances). disable: all no access, same as the current setup with allow.sysvipc off. new: A jail may see use the IPC objects that it has created. It also gets its own IPC key namespace, so different jails may have their own objects using the same key value. The parent jail (or base system) can see the jail's IPC objects, but not its keys. PR: 48471 Submitted by: based on work by kikuchan98@gmail.com MFC after: 5 days	2016-04-25 17:06:50 +00:00
Jamie Gritton	add14c83aa	Use the new PR_METHOD_REMOVE to clean up jail handling in POSIX message queues.	2016-04-25 04:36:54 +00:00
Jamie Gritton	b6f47c231f	Pass the current/new jail to PR_METHOD_CHECK, which pushes the call until after the jail is found or created. This requires unlocking the jail for the call and re-locking it afterward, but that works because nothing in the jail has been changed yet, and other processes won't change the important fields as long as allprison_lock remains held. Keep better track of name vs namelc in kern_jail_set. Name should always be the hierarchical name (relative to the caller), and namelc the last component. PR: 48471 MFC after: 5 days	2016-04-25 04:27:58 +00:00
Jamie Gritton	cc5fd8c748	Add a new jail OSD method, PR_METHOD_REMOVE. It's called when a jail is removed from the user perspective, i.e. when the last pr_uref goes away, even though the jail mail still exist in the dying state. It will also be called if either PR_METHOD_CREATE or PR_METHOD_SET fail. PR: 48471 MFC after: 5 days	2016-04-25 04:24:00 +00:00
Jamie Gritton	2a54950713	Remove the PR_REMOVE flag, which was meant as a temporary marker for a jail that might be seen mid-removal. It hasn't been doing the right thing since at least the ability to resurrect dying jails, and such resurrection also makes it unnecessary.	2016-04-25 03:58:08 +00:00
Pedro F. Giffuni	d9c9c81c08	sys: use our roundup2/rounddown2() macros when param.h is available. rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement. This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.	2016-04-21 19:57:40 +00:00
Edward Tomasz Napierala	bbe4eb6d54	Get rid of rctl_lock; use racct_lock where appropriate. The fast paths already required both of them, so having a separate rctl_lock didn't buy us anything. Reviewed by: mjg@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5914	2016-04-21 16:22:52 +00:00
Pedro F. Giffuni	8dfea46460	Remove slightly used const values that can be replaced with nitems(). Suggested by: jhb	2016-04-21 15:38:28 +00:00
Konstantin Belousov	3e937c3a77	Arm and arm64 both have fueword() implemented for some time. Correct the comment. Sponsored by: The FreeBSD Foundation	2016-04-20 17:28:21 +00:00
Pedro F. Giffuni	63b6b7a74a	Indentation issues. Contract some lines leftover from r298310. Mea culpa.	2016-04-20 16:19:44 +00:00
Conrad Meyer	b483e111c4	kern_rctl: Fix resource leak in error path Ordinarily, rctl_write_outbuf frees 'sb'. However, if we are in low memory conditions we skip past the rctl_write_outbuf. In that case, free 'sb'. Reported by: Coverity CID: 1338539 Sponsored by: EMC / Isilon Storage Division	2016-04-20 02:09:38 +00:00
Pedro F. Giffuni	02abd40029	kernel: use our nitems() macro when it is available through param.h. No functional change, only trivial cases are done in this sweep, Discussed in: freebsd-current	2016-04-19 23:48:27 +00:00
Edward Tomasz Napierala	74a7305a91	Fix debugging printf. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-19 13:36:31 +00:00
Konstantin Belousov	2cfddaa6ff	Fix umtx lock/trylock for compat32. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-04-19 11:37:43 +00:00
Mark Johnston	11748dae80	Use a loop instead of a goto in sysctl_kern_proc_kstack(). MFC after: 3 days	2016-04-17 23:22:32 +00:00
Konstantin Belousov	ccd0ec4066	The struct thread td_estcpu member is only used by the 4BSD scheduler. Move it to the struct td_sched for 4BSD, removing always present field, otherwise unused for ULE. New scheduler method sched_estcpu() returns the estimation for kinfo_proc consumption. As before, it always returns 0 for ULE. Remove sched_tick() scheduler method, unused both by 4BSD and ULE. Update locking comment for the 4BSD struct td_sched, copying it from the same comment for ULE. Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment. Based on some notes from, and reviewed by: bde Sponsored by: The FreeBSD Foundation	2016-04-17 11:04:27 +00:00
Conrad Meyer	5dc5dab6eb	Add 4Kn kernel dump support (And 4Kn minidump support, but only for amd64.) Make sure all I/O to the dump device is of the native sector size. To that end, we keep a native sector sized buffer associated with dump devices (di->blockbuf) and use it to pad smaller objects as needed (e.g. kerneldumpheader). Add dump_write_pad() as a convenience API to dump smaller objects with zero padding. (Rather than pull in NPM leftpad, we wrote our own.) Savecore(1) has been updated to deal with these dumps. The format for 512-byte sector dumps should remain backwards compatible. Minidumps for other architectures are left as an exercise for the reader. PR: 194279 Submitted by: ambrisko@ Reviewed by: cem (earlier version), rpokala Tested by: rpokala (4Kn/512 except 512 fulldump), cem (512 fulldump) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D5848	2016-04-15 17:45:12 +00:00
Pedro F. Giffuni	b85f65af68	kern: for pointers replace 0 with NULL. These are mostly cosmetical, no functional change. Found with devel/coccinelle.	2016-04-15 16:10:11 +00:00
Edward Tomasz Napierala	23e6fff29d	Allocate RACCT/RCTL zones without UMA_ZONE_NOFREE; no idea why it was there in the first place. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-15 13:34:59 +00:00
Edward Tomasz Napierala	c1a43e73c5	Sort variable declarations. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-15 11:55:29 +00:00
Warner Losh	15be49f5a1	Create wrappers for uint64_t and int64_t for the tunables. While not strictly necessary, it is more convenient.	2016-04-15 03:09:55 +00:00
Jamie Gritton	44c16975a2	Clean up some style(9) violations.	2016-04-14 17:07:26 +00:00
Jamie Gritton	adb023ae59	Separate POSIX mqueue objects in jails; actually, separate them by the jail's root, so jails that don't have their own filesystem directory also won't have their own mqueue namespace. PR: 208082	2016-04-13 20:15:49 +00:00
Jamie Gritton	cc7b259a26	Separate POSIX sem/shm objects in jails, by prepending the jail's path name to the object's "path". While the objects don't have real path names, it's a filesystem-like namespace, which allows jails to be kept to their own space, but still allows the system / jail parent to access a jail's IPC. PR: 208082	2016-04-13 20:14:13 +00:00
Edward Tomasz Napierala	f459a81824	Fix overflow checking. There are some other potential problems related to overflowing racct counters; I'll revisit those later. Submitted by: Pieter de Goeje (earlier version) Reviewed by: emaste@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-12 18:13:24 +00:00
Pedro F. Giffuni	74b8d63dcc	Cleanup unnecessary semicolons from the kernel. Found with devel/coccinelle.	2016-04-10 23:07:00 +00:00
John Baldwin	70e22add96	Add a function to lookup a device_t object by name. This just walks the global list of devices looking for one with the requested name. The one use case outside of devctl2's implementation is for DDB commands that wish to lookup devices by name.	2016-04-10 05:05:02 +00:00
John Baldwin	62d70a8174	Add more fine-grained kernel options for NUMA support. VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in the virtual memory system. DEVICE_NUMA is used to enable affinity reporting for devices such as bus_get_domain(). MAXMEMDOM must still be set to a value greater than for any NUMA support to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is enabled and the system supports NUMA. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D5782	2016-04-09 13:58:04 +00:00
Bjoern A. Zeeb	029f99dcc4	Make the KASSERT message in hash destroy more informative. While the pointer might not be too helpful, the malloc type might at least give a good hint about which hashtbl we are talking. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Reviewed by: gnn, emaste Differential Revision: https://reviews.freebsd.org/D5802	2016-04-09 09:24:05 +00:00
Edward Tomasz Napierala	8bd8c8f14c	Make it possible to tweak RCTL throttling sysctls at runtime. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-08 18:15:31 +00:00
Andriy Gapon	a449bdba32	topo_set_pu_id: turn a check into an assertion The new id must not be present in any cpu set in any topology element. MFC after: 30 days	2016-04-08 11:59:11 +00:00
Konstantin Belousov	b715d9af68	Use the ABI-prescribed name for SHT_X86_64_UNWIND in the loader and kernel linker, after the r297686. Sponsored by: The FreeBSD Foundation	2016-04-08 10:23:48 +00:00
Svatopluk Kraus	cf55df9f83	Fix intr_irq_shuffle(). After r297539, ISRCs doing IPI may be also registered into global interrupt table. Thus, they must be filtered out like per-cpu interrupts. Fortunately, it does not influence anything on interrupt controllers which already use INTRNG.	2016-04-07 15:16:33 +00:00
Svatopluk Kraus	5b613c19b5	Implement intr_isrc_init_on_cpu() and use it to replace very same code implemented in every interrupt controller driver running SMP. This function returns true, if provided ISRC should be enabled on given cpu.	2016-04-07 15:00:25 +00:00
Edward Tomasz Napierala	ae34b6ff96	Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO. Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds. Testing - and review of the code, in particular the VFS and VM parts - is very welcome. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080	2016-04-07 04:23:25 +00:00
Svatopluk Kraus	4be58cba48	Fix PIC lookup by device and xref. There was not taken into account the situation that someone has a pointer to device but not its xref. This situation is regular now, after r297539.	2016-04-06 12:48:45 +00:00
Edward Tomasz Napierala	4c230cdafd	Use proper locking macros in RACCT in RCTL. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-05 11:30:52 +00:00
Andriy Gapon	c77702de74	x86 topo: add some comments, descriptions and references to documentation Plus a minor cosmetic change. MFC after: 1 month	2016-04-05 10:36:40 +00:00
Andriy Gapon	4725e6bff3	new x86 smp topology detection code Previously, the code determined a topology of processing units (hardware threads, cores, packages) and then deduced a cache topology using certain assumptions. The new code builds a topology that includes both processing units and caches using the information provided by the hardware. At the moment, the discovered full topology is used only to creeate a scheduling topology for SCHED_ULE. There is no KPI for other kernel uses. Summary: - based on APIC ID derivation rules for Intel and AMD CPUs - can handle non-uniform topologies - requires homogeneous APIC ID assignment (same bit widths for ID components) - topology for dual-node AMD CPUs may not be optimal - topology for latest AMD CPU models may not be optimal as the code is several years old - supports only thread/package/core/cache nodes Todo: - AMD dual-node processors - latest AMD processors - NUMA nodes - checking for homogeneity of the APIC ID assignment across packages - more flexible cache placement within topology - expose topology to userland, e.g., via sysctl nodes Long term todo: - KPI for CPU sharing and affinity with respect to various resources (e.g., two logical processors may share the same FPU, etc) Reviewed by: mav Tested by: mav MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D2728	2016-04-04 16:09:29 +00:00
Andrew Turner	6b42a1f4c0	Include sys/rman.h directly rather than relying on header pollution. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-04-04 10:52:43 +00:00
Svatopluk Kraus	bff6be3e9b	Remove FDT specific parts from INTRNG. Change its interface to make it universal. (1) New struct intr_map_data is defined as a container for arbitrary description of an interrupt used by a device. Typically, an interrupt number and configuration relevant to an interrupt controller is encoded in such description. However, any additional information may be encoded too like a set of cpus on which an interrupt should be enabled or vendor specific data needed for setup of an interrupt in controller. The struct intr_map_data itself is meant to be opaque for INTRNG. (2) An intr_map_irq() function is created which takes an interrupt controller identification and struct intr_map_data as arguments and returns global interrupt number which identifies an interrupt. (3) A set of functions to be used by bus drivers is created as well as a corresponding set of methods for interrupt controller drivers. These sets take both struct resource and struct intr_map_data as one of the arguments. There is a goal to keep struct intr_map_data in struct resource, however, this way a final solution is not limited to that. (4) Other small changes are done to reflect new situation. This is only first step aiming to create stable interface for interrupt controller drivers. Thus, some temporary solution is taken. Interrupt descriptions for devices are stored in INTRNG and two specific mapping function are created to be temporary used by bus drivers. That's why the struct intr_map_data is not opaque for INTRNG now. This temporary solution will be replaced by final one in next step. Differential Revision: https://reviews.freebsd.org/D5730	2016-04-04 09:15:25 +00:00
Edward Tomasz Napierala	f70c075e32	Add configurable rate limit for "log" and "devctl" actions. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-02 09:11:52 +00:00
Edward Tomasz Napierala	097e0da79d	Fix mismerge. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-01 18:45:04 +00:00
Edward Tomasz Napierala	862d03fb7f	Drop the 'resource' argument to racct_decay(); it wouldn't make sense to iterate separately for each resource. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-01 18:36:10 +00:00
John Baldwin	611fcff994	Cap IOSIZE_MAX to INT_MAX for 32-bit processes. Previously, freebsd32 binaries could submit read/write requests with lengths greater than INT_MAX that a native kernel would have rejected. Reviewed by: kib Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5788	2016-04-01 18:29:38 +00:00
Edward Tomasz Napierala	659e74662f	Call rctl_enforce() in all cases the resource usage goes up, even when called from racct_*_force() functions. It makes the "log" and "devctl" actions work in those cases. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-01 17:28:55 +00:00
Edward Tomasz Napierala	0b9f1ecb87	Reorder the functions; no functional changes. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-01 17:21:55 +00:00
Edward Tomasz Napierala	7cea96f606	Reduce code duplication. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-01 17:17:32 +00:00
Edward Tomasz Napierala	1028719823	Reduce code duplication. There should be no (intended) functional changes. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-04-01 17:05:46 +00:00
Sean Bruno	910938f079	Repair a overflow condition where a user could submit a string that was not getting a proper bounds check. Thanks to CTurt for pointing at this with a big red blinking neon sign. PR: 206761 Submitted by: sson Reviewed by: cturt@hardenedbsd.org MFC after: 3 days	2016-04-01 16:16:26 +00:00
John Baldwin	b4f1d267b7	Rework handling of thread sleeps before timers are working. Previously, calls to sleep() and cv_wait*() immediately returned during early boot. Instead, permit threads that request a sleep without a timeout to sleep as wakeup() works during early boot. Sleeps with timeouts are harder to emulate without working timers, so just punt and panic explicitly if any thread tries to use those before timers are working. Any threads that depend on timeouts should either wait until SI_SUB_KICK_SCHEDULER to start or they should use DELAY() until timers are available. Until APs are started earlier this should be a no-op as other kthreads shouldn't get a chance to start running until after timers are working regardless of when they were created. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D5724	2016-03-31 18:10:29 +00:00
Edward Tomasz Napierala	ac3c9819ab	Refactor; no functional changes. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-31 17:32:28 +00:00
John Baldwin	4d805eacfa	Tidy up the unmapped I/O code in qphysio. - Move some blocks around to reduce the number of 'if (unmap)' checks. - Use 'pbuf == NULL' instead of 'unmap'. - Use nitems. - Pull an assignment out of an if expression. Reviewed by: kib Sponsored by: Chelsio Communications	2016-03-31 17:27:30 +00:00
Edward Tomasz Napierala	b450d4479d	Fix overflows, making it impossible to add negative amounts using rctl(8). MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-31 17:00:47 +00:00
Jamie Gritton	320d842101	Add osd_reserve() and osd_set_reserved(), which allow M_WAITOK allocation of an OSD array,	2016-03-30 16:57:28 +00:00
Gleb Smirnoff	9c64cfe56c	The sendfile(2) allows to send extra data from userspace before the file data (headers). Historically the size of the headers was not checked against the socket buffer space. Application could easily overcommit the socket buffer space. With the new sendfile (r293439) the problem remained, but a KASSERT was inserted that checked that amount of data written to the socket matches its space. In case when size of headers is bigger that socket space, KASSERT fires. Without INVARIANTS the new sendfile won't panic, but would report incorrect amount of bytes sent. o With this change, the headers copyin is moved down into the cycle, after the sbspace() check. The uio size is trimmed by socket space there, which fixes the overcommit problem and its consequences. o The compatibility handling for FreeBSD 4 sendfile headers API is pushed up the stack to syscall wrappers. This required a copy and paste of the code, but in turn this allowed to remove extra stack carried parameter from fo_sendfile_t, and embrace entire compat code into #ifdef. If in future we got more fo_sendfile_t function, the copy and paste level would even reduce. Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru> Tested by: Vitalij Satanivskij <satan ukr.net> Sponsored by: Netflix	2016-03-29 19:57:11 +00:00
Edward Tomasz Napierala	35030a5dd4	Remove some NULL checks for M_WAITOK allocations. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-29 13:56:59 +00:00
Jamie Gritton	859f4d0611	Move the various per-type arrays of OSD data into a single structure array.	2016-03-28 22:18:37 +00:00
Warner Losh	cb49b65481	Move pccard_safe_quote() up to subr_bus.c and rename to devctl_safe_quote() so it can be used more generally.	2016-03-28 20:16:29 +00:00
Navdeep Parhar	fddd4f6273	Plug leak in m_unshare. m_unshare passes on the source mbuf's flags as-is to m_getcl and this results in a leak if the flags include M_NOFREE. The fix is to clear the bits not listed in M_COPYALL before calling m_getcl. M_RDONLY should probably be filtered out too but that's outside the scope of this fix. Add assertions in the zone_mbuf and zone_pack ctors to catch similar bugs. Update netmap_get_mbuf to not pass M_NOFREE to m_getcl. It's not clear what the original code was trying to do but it's likely incorrect. Updated code is no different functionally but it avoids the newly added assertions. Reviewed by: gnn@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5698	2016-03-26 23:39:53 +00:00
Conrad Meyer	c975a5d359	Add td_swinvoltick to track last involuntary context switch Expose in DDB via "show thread." Reviewed by: markj Sponsored by: EMC / Isilon Storage Division	2016-03-25 19:35:29 +00:00
Gleb Smirnoff	c8f59118be	Space and style(9) corrections for recent mbuf changes.	2016-03-24 20:06:52 +00:00
Svatopluk Kraus	61c8fde5d6	Generalize IPI support for ARM intrng and use it for interrupt controller IPI provider. New struct intr_ipi is defined which keeps all info about an IPI: its name, counter, send and dispatch methods. Generic intr_ipi_setup(), intr_ipi_send() and intr_ipi_dispatch() functions are implemented. An IPI provider must implement two functions: (1) an intr_ipi_send_t function which is able to send an IPI, (2) a setup function which initializes itself for an IPI and calls intr_ipi_setup() with appropriate arguments. Differential Revision: https://reviews.freebsd.org/D5700	2016-03-24 09:55:11 +00:00
George V. Neville-Neil	dcd070d80a	Move mbuf provider under SDT to indicate that it is FreeBSD specific and not a stable interface. Reviewed by: markj MFC after: 2 weeks Sponsored by: Rubicon Communications (Netgate) Differential Revision: https://reviews.freebsd.org/D5716	2016-03-24 08:26:06 +00:00
Bryan Drewery	1c9b29f95f	Pass the expected struct radix_node_head * to vfs_free_netcred. No functional change. struct radix_node_head's first element is rh so this was already referring to the same address. It was likely an unintended s/rnh/&rnh->rh/ change from r294706 as all other rnh_walktree() callers pass the expected struct radix_node_head * rather than obscurely passing the address of their first element. Sponsored by: EMC / Isilon Storage Division	2016-03-24 04:40:07 +00:00
Bryan Drewery	57a8e34141	Fix M_RTABLE memory leak from r274118 (11/2014). Replace free(M_RTABLE) with rn_detachhead() to match rn_inithead(). This would trigger when reloading NFS exports and was similar to problems with pf reload [1]. PR: 194078 [1] Sponsored by: EMC / Isilon Storage Division	2016-03-24 03:08:39 +00:00
Edward Tomasz Napierala	68d35798b9	Wait for root mount tokens before showing the root mount prompt. This restores the pre-r290196 behaviour, eliminating the need to manually press '.' a couple of times to get USB to finish probing. Note that there's still something wrong with the console (character echoing doesn't quite work), and there's also a reported problem with BHyVe, but those two don't seem related to the problem above. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-22 13:46:01 +00:00
George V. Neville-Neil	480f4e946d	Add an mbuf provider to DTrace. The mbuf provider is made up of a set of Statically Defined Tracepoints which help us look into mbufs as they are allocated and freed. This can be used to inspect the buffers or for a simplified mbuf leak detector. New tracepoints are: mbuf:::m-init mbuf:::m-gethdr mbuf:::m-get mbuf:::m-getcl mbuf:::m-clget mbuf:::m-cljget mbuf:::m-cljset mbuf:::m-free mbuf:::m-freem There is also a translator for mbufs which gives some visibility into the structure, see mbuf.d for more details. Reviewed by: bz, markj MFC after: 2 weeks Sponsored by: Rubicon Communications (Netgate) Differential Revision: https://reviews.freebsd.org/D5682	2016-03-22 13:16:52 +00:00
John Baldwin	70f52fd69f	Regen.	2016-03-21 21:38:35 +00:00
John Baldwin	bb430bc740	Fully handle size_t lengths in AIO requests. First, update the return types of aio_return() and aio_waitcomplete() to ssize_t. POSIX requires aio_return() to return a ssize_t so that it can represent all return values from read() and write(). aio_waitcomplete() should use ssize_t for the same reason. aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and system call entry were not updated. aio_waitcomplete() has always returned int. Note that this does not require new system call stubs as this is effectively only an API change in how the compiler interprets the return value. Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX. aio_read/write should now honor the same length limits as normal read/write. Third, use longs instead of ints in the aio_return() and aio_waitcomplete() system call functions so that the 64-bit size_t in the in-kernel aiocb isn't truncated to 32-bits before being copied out to userland or being returned. Finally, a simple test has been added to verify the bounds checking on the maximum read size from a file.	2016-03-21 21:37:33 +00:00
Maxim Konovalov	ab77175063	o "avaliable" -> "available". PR: 208141 Submitted by: Tyler Littlefield	2016-03-21 08:03:50 +00:00
Pedro F. Giffuni	5166fdde7f	aio_qphysio(): Avoid uninitialized pointer read on error. For the !unmap case it may happen that pbuf gets called unreferenced when vm_fault_quick_hold_pages() fails. Initialize it so it doesn't cause trouble. CID: 1352776 Reviewed by: jhb MFC after: 1 week	2016-03-18 19:04:01 +00:00
Justin Hibbits	da1b038af9	Use uintmax_t (typedef'd to rman_res_t type) for rman ranges. On some architectures, u_long isn't large enough for resource definitions. Particularly, powerpc and arm allow 36-bit (or larger) physical addresses, but type `long' is only 32-bit. This extends rman's resources to uintmax_t. With this change, any resource can feasibly be placed anywhere in physical memory (within the constraints of the driver). Why uintmax_t and not something machine dependent, or uint64_t? Though it's possible for uintmax_t to grow, it's highly unlikely it will become 128-bit on 32-bit architectures. 64-bit architectures should have plenty of RAM to absorb the increase on resource sizes if and when this occurs, and the number of resources on memory-constrained systems should be sufficiently small as to not pose a drastic overhead. That being said, uintmax_t was chosen for source clarity. If it's specified as uint64_t, all printf()-like calls would either need casts to uintmax_t, or be littered with PRI64 macros. Casts to uintmax_t aren't horrible, but it would also bake into the API for resource_list_print_type() either a hidden assumption that entries get cast to uintmax_t for printing, or these calls would need the PRI64 macros. Since source code is meant to be read more often than written, I chose the clearest path of simply using uintmax_t. Tested on a PowerPC p5020-based board, which places all device resources in 0xfxxxxxxxx, and has 8GB RAM. Regression tested on qemu-system-i386 Regression tested on qemu-system-mips (malta profile) Tested PAE and devinfo on virtualbox (live CD) Special thanks to bz for his testing on ARM. Reviewed By: bz, jhb (previous) Relnotes: Yes Sponsored by: Alex Perez/Inertial Computing Differential Revision: https://reviews.freebsd.org/D4544	2016-03-18 01:28:41 +00:00
Conrad Meyer	82e17adcb4	fail(9): Only gather/print stacks if STACK is enabled This is a follow-up fix to the earlier r296927. Reported by: bz Sponsored by: EMC / Isilon Storage Division	2016-03-17 01:05:53 +00:00
Conrad Meyer	70e20d4e1a	fail(9): Upstreaming some fail point enhancements This is several year's worth of fail point upgrades done at EMC Isilon. They are interdependent enough that it makes sense to put a single diff up for them. Primarily, we added: - Changing all mainline execution paths to be lockless, which lets us use fail points in more sleep-sensitive areas, and allows more parallel execution - A number of additional commands, including 'pause' that lets us do some interesting deterministic repros of race conditions - The ability to dump the stacks of all threads sleeping on a fail point - A number of other API changes to allow marking up the fail point's context in the code, and firing callbacks before and after execution - A man page update Submitted by: Matthew Bryan <matthew.bryan@isilon.com> Reviewed by: cem (earlier version), jhb, kib, pho With feedback from: bdrewery Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D5427	2016-03-16 04:22:32 +00:00
Gleb Smirnoff	1d52250154	Free the temporary buffer in sysctl_handle_counter_u64_array(). Submitted by: mjg	2016-03-15 00:21:32 +00:00
Gleb Smirnoff	b5b7b142a7	Provide sysctl(9) macro to deal with array of counter(9).	2016-03-15 00:05:00 +00:00
Justin T. Gibbs	5405e7e2ee	Provide high precision conversion from ns,us,ms -> sbintime in kevent In timer2sbintime(), calculate the second and fractional second portions of the sbintime separately. When calculating the the fractional second portion, use a 64bit multiply to prevent excess truncation. This avoids the ~7% error in the original conversion for ns, and smaller errors of the same type for us and ms. PR: 198139 Reviewed by: jhb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D5397	2016-03-12 23:02:53 +00:00
John Baldwin	f479b2ac89	Do not include system call wrappers in libc for old FreeBSD system calls. The base system libc is only used to run binaries built on FreeBSD 7.0 and later. It does not need to include system call wrappers for system calls only used by FreeBSD binaries built on versions older than 7.0. This was already true for "COMPAT" system calls, but now wrappers for system calls used on FreeBSD 4 and 6 are excluded as well. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D5597	2016-03-12 22:53:46 +00:00
Edward Tomasz Napierala	2a5a08cb38	Refactor the way we restore cn_lkflags; no functional changes. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-12 09:05:43 +00:00
Edward Tomasz Napierala	f69db55151	Remove cn_consume from 'struct componentname'. It was never set to anything other than 0. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5611	2016-03-12 08:50:38 +00:00
Edward Tomasz Napierala	213ed83855	Fix autofs triggering problem. Assume you have an NFS server, 192.168.1.1, with share "share". This commit fixes a problem where "mkdir /net/192.168.1.1/share/meh" would return spurious error instead of creating the directory if the target filesystem wasn't mounted yet; subsequent attempts would work correctly. The failure scenario is kind of complicated to explain, but it all boils down to calling VOP_MKDIR() for the target filesystem (NFS) with wrong dvp - the autofs vnode instead of the filesystem root mounted over it. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5442	2016-03-12 07:54:42 +00:00
John Baldwin	47cedcbd72	Use SI_SUB_LAST instead of SI_SUB_SMP as the "catch-all" subsystem. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D5515	2016-03-11 23:18:06 +00:00
John Baldwin	8d91aced32	Regen.	2016-03-09 19:06:46 +00:00
John Baldwin	399e8c1773	Simplify AIO initialization now that it is standard. - Mark AIO system calls as STD and remove the helpers to dynamically register them. - Use COMPAT6 for the old system calls with the older sigevent instead of an 'o' prefix. - Simplify the POSIX configuration to note that AIO is always available. - Handle AIO in the default VOP_PATHCONF instead of special casing it in the pathconf() system call. fpathconf() is still hackish. - Remove freebsd32_aio_cancel() as it just called the native one directly. Reviewed by: kib Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5589	2016-03-09 19:05:11 +00:00
Konstantin Belousov	3cfce8e4df	Convert all panics from the link_elf_obj kernel linker for object files format into printfs and errors to caller. Some leaks of resources are there, but the same leaks are present in other error pathes. With the change, the kernel at least boots even when module with unexpected or corrupted ELF structure is preloaded. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-03-07 18:44:06 +00:00
Konstantin Belousov	13f28d969a	In the link_elf_obj.c, handle sections of type SHT_AMD64_UNWIND same as SHT_PROGBITS. This is needed after the clang 3.8 import, which generates that type for .eh_frame section, which had SHT_PROGBITS type before. Reported by: Nikolai Lifanov <lifanov@mail.lifanov.com> PR: 207729 Tested by: dim (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-03-06 00:31:11 +00:00
Justin Hibbits	534ccd7bbf	Replace all resource occurrences of '0UL/~0UL' with '0/~0'. Summary: The idea behind this is '~0ul' is well-defined, and casting to uintmax_t, on a 32-bit platform, will leave the upper 32 bits as 0. The maximum range of a resource is 0xFFF.... (all bits of the full type set). By dropping the 'ul' suffix, C type promotion rules apply, and the sign extension of ~0 on 32 bit platforms gets it to a type-independent 'unsigned max'. Reviewed By: cem Sponsored by: Alex Perez/Inertial Computing Differential Revision: https://reviews.freebsd.org/D5255	2016-03-03 05:07:35 +00:00
Konstantin Belousov	5db9ed8062	If callout_stop_safe() noted that the callout is currently executing, but next invocation is cancelled while migrating, sleepq_check_timeout() needs to be informed that the callout is stopped. Otherwise the thread switches off CPU and never become runnable, since running callout could have already raced with us, while the migrating and cancelled callout could be one which is expected to set TDP_TIMOFAIL flag for us. This contradicts with the expected behaviour of callout_stop() for other callers, which e.g. decrement references from the callout callbacks. Add a new flag CS_MIGRBLOCK requesting report of the situation as 'successfully stopped'. Reviewed by: jhb (previous version) Tested by: cognet, pho PR: 200992 Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D5221	2016-03-02 18:46:17 +00:00
Gleb Smirnoff	0ea37a86ba	Fix regression in r296242 affecting several drivers. For EXT_NET_DRV, EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free callback, then free the mbuf, otherwise we will derefernce memory that was just freed. Reported and tested: jhibbits	2016-03-02 02:12:01 +00:00
Bryan Drewery	3b77ac5f9e	Correct a comment.	2016-03-01 23:58:53 +00:00
John Baldwin	7b8cfe26a0	Use SCHEDULER_STOPPED() in cv_wait() instead of checking panicstr. Reviewed by: kib MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D5516	2016-03-01 22:51:44 +00:00
John Baldwin	f3215338ef	Refactor the AIO subsystem to permit file-type-specific handling and improve cancellation robustness. Introduce a new file operation, fo_aio_queue, which is responsible for queueing and completing an asynchronous I/O request for a given file. The AIO subystem now exports library of routines to manipulate AIO requests as well as the ability to run a handler function in the "default" pool of AIO daemons to service a request. A default implementation for file types which do not include an fo_aio_queue method queues requests to the "default" pool invoking the fo_read or fo_write methods as before. The AIO subsystem permits file types to install a private "cancel" routine when a request is queued to permit safe dequeueing and cleanup of cancelled requests. Sockets now use their own pool of AIO daemons and service per-socket requests in FIFO order. Socket requests will not block indefinitely permitting timely cancellation of all requests. Due to the now-tight coupling of the AIO subsystem with file types, the AIO subsystem is now a standard part of all kernels. The VFS_AIO kernel option and aio.ko module are gone. Many file types may block indefinitely in their fo_read or fo_write callbacks resulting in a hung AIO daemon. This can result in hung user processes (when processes attempt to cancel all outstanding requests during exit) or a hung system. To protect against this, AIO requests are only permitted for known "safe" files by default. AIO requests for all file types can be enabled by setting the new vfs.aio.enable_usafe sysctl to a non-zero value. The AIO tests have been updated to skip operations on unsafe file types if the sysctl is zero. Currently, AIO requests on sockets and raw disks are considered safe and are enabled by default. aio_mlock() is also enabled by default. Reviewed by: cem, jilles Discussed with: kib (earlier version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D5289	2016-03-01 18:12:14 +00:00
John Baldwin	cbc4d2db75	Remove taskqueue_enqueue_fast(). taskqueue_enqueue() was changed to support both fast and non-fast taskqueues 10 years ago in r154167. It has been a compat shim ever since. It's time for the compat shim to go. Submitted by: Howard Su <howard0su@gmail.com> Reviewed by: sephe Differential Revision: https://reviews.freebsd.org/D5131	2016-03-01 17:47:32 +00:00
Svatopluk Kraus	32d001e479	Remove an alternative way for dealing with root interrupt controller which is not complete. Likely, it was forgotten and not removed before committing.	2016-03-01 11:27:58 +00:00
Svatopluk Kraus	169e6abd8f	Mark other parts of interrupt framework as INTR_SOLO option specific. Note that isrc_arg member of struct intr_irqsrc is used only for INTR_SOLO and IPI filter. This should be remembered if IPI filters and their arguments will be stored on another place. This option could be unusable very soon, if interrupt controllers implementations will not be implemented considering it.	2016-03-01 10:57:29 +00:00
Gleb Smirnoff	56a5f52e80	New way to manage reference counting of mbuf external storage. The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used. The first mbuf to attach a cluster stores the refcount. The further mbufs to reference the cluster point at refcount in the first mbuf. The first mbuf is freed only when the last reference is freed. The benefit over refcounts stored in separate slabs is that now refcounts of different, unrelated mbufs do not share a cache line. For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd() becomes void, making widely used M_EXTADD macro safe. For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization exactly against the cache aliasing problem with regular refcounting. Discussed with: rrs, rwatson, gnn, hiren, sbruno, np Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D5396 Sponsored by: Netflix	2016-03-01 00:17:14 +00:00
Konstantin Belousov	1bdbd70599	Implement process-shared locks support for libthr.so.3, without breaking the ABI. Special value is stored in the lock pointer to indicate shared lock, and offline page in the shared memory is allocated to store the actual lock. Reviewed by: vangyzen (previous version) Discussed with: deischen, emaste, jhb, rwatson, Martin Simmons <martin@lispworks.com> Tested by: pho Sponsored by: The FreeBSD Foundation	2016-02-28 17:52:33 +00:00
Svatopluk Kraus	5b70c08cdf	Move IPI related parts back to (ARM) machine specific file now, when the interrupt framework is also going to be used by another (MIPS) architecture. IPI implementations may vary much across different architectures. An IPI implementation should still define INTR_IPI_COUNT and use intr_ipi_setup_counters() to setup IPI counters which are inside of intrcnt[] and intrnames[] arrays. Those are used for sysctl and ddb. Then, intr_ipi_increment_count() should be used to increment obtained counter. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D5459	2016-02-27 12:03:07 +00:00
Ed Schouten	afc055d90f	Remove the errno argument from unp_drop(). While there, add a comment to clarify that ECONNRESET should always be returned for POSIX conformance. Suggested by: Steven Hartland	2016-02-26 12:46:34 +00:00
Mark Johnston	0acf5d0bfd	Improve error handling for posix_fallocate(2) and posix_fadvise(2). - Set td_errno so that ktrace and dtrace can obtain the syscall error number in the usual way. - Pass negative error numbers directly to the syscall layer, as they're not intended to be returned to userland. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D5425	2016-02-25 19:58:23 +00:00
Ed Schouten	72c8072ee5	Make asynchronous connection failures on UNIX sockets fail with ECONNRESET. While making CloudABI work well on Linux, I discovered that I had a FreeBSD-ism in one of my unit tests. The test did the following: - Create UNIX socket 1, bind it, make it listen. - Create UNIX socket 2, connect it to UNIX socket 1. - Close UNIX socket 1. - Obtain SO_ERROR from socket 2. On FreeBSD this returns ECONNABORTED, while on Linux it returns ECONNRESET. I dug through some of the relevant specifications[1] and it looks like Linux is all right here. ECONNABORTED should only be returned when the local connection (socket 2) is aborted; not the peer (socket 1). It is of course slightly misleading: the function in which we set this error is called uipc_abort(), but keep in mind that we're aborting the peer, thus resetting the local socket. [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/connect.html Reviewed by: cem Sponsored by: Nuxi, the Netherlands Differential Revision: https://reviews.freebsd.org/D5419	2016-02-24 17:10:32 +00:00
Konstantin Belousov	0791e0c0e7	Provide more correct sizing of the KVA consumed by a vnode, used by the virtvnodes calculation. Include the size of fs-specific v_data as the nfs nclnode inline, the NFS nclnode is bigger than either ZFS znode or UFS inode. Include the size of namecache_ts and short cache path element, multiplied by the name cache population factor, again inline. Inline defines are used to avoid pollution of the vnode.h with the subsystem-private objects. Non-significant unsynchronized changes of the definitions are fine, we do not care about that precision, and e.g. ZFS consumes much malloced memory per vnode for reasons unaccounted in the formula. Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10. The measures reduce vnode cache pressure on kmem and bring the vnode cache memory use below some apparent thresholds that were exceeded by r291244 due to more robust vnode reuse. Reported and tested by: marius (i386, previous version) Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-02-24 15:15:46 +00:00
Bryan Drewery	200241b504	Fix build after r295934.	2016-02-23 23:37:10 +00:00
Mariusz Zaborski	94e6fdd806	According to the sys/kern/capabilities.conf, gethostid(3) should be allowed. Pointed out by: Milosz Kaniewski <m.kaniewski@wheelsystems.com> Approved by: pjd (mentor) MFC after: 3 days Sponsored by: Wheel Systems, http://wheelsystems.com	2016-02-23 22:02:25 +00:00
Ian Lepore	85143dd18d	Allow a dynamic env to override a compiled-in static env by passing in the override indication in the env data. Submitted by: bde	2016-02-21 18:35:01 +00:00
Justin Hibbits	7915adb560	Introduce a RMAN_IS_DEFAULT_RANGE() macro, and use it. This simplifies checking for default resource range for bus_alloc_resource(), and improves readability. This is part of, and related to, the migration of rman_res_t from u_long to uintmax_t. Discussed with: jhb Suggested by: marcel	2016-02-20 01:32:58 +00:00
Mark Johnston	88c2beac9c	Ensure that we test the event condition when a disabled kevent is enabled. r274560 modified kqueue_register() to only test the event condition if the corresponding knote is not disabled. However, this check takes place before the EV_ENABLE flag is used to clear the KN_DISABLED flag on the knote, so enabling a previously-disabled kevent would not result in a notification for a triggered event. This change fixes the problem by testing for EV_ENABLED before possibly checking the event condition. This change also updates a kqueue regression test to exercise this case. PR: 206368 Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D5307	2016-02-19 01:49:33 +00:00
Mark Johnston	fe169828c3	Return an error if both EV_ENABLE and EV_DISABLE are specified for a kevent. Currently, this combination results in EV_DISABLE being ignored. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D5307	2016-02-19 01:35:01 +00:00
Zbigniew Bodek	910905c74f	Fix build for i386 and arm64 after r295755 - Take bus_space_tag_t type into consideration when returning default, zero value. - Include missing rman.h required by ofw_pci.h	2016-02-18 15:44:45 +00:00
Zbigniew Bodek	b998c9656b	Introduce bus_get_bus_tag() method Provide bus_get_bus_tag() for sparc64, powerpc, arm, arm64 and mips nexus and its children in order to return a platform specific default tag. This is required to ensure generic correctness of the bus_space tag. It is especially needed for arches where child bus tag does not match the parent bus tag. This solves the problem with ppc architecture where the PCI bus tag differs from parent bus tag which is big-endian. This commit is a part of the following patch: https://reviews.freebsd.org/D4879 Submitted by: Marcin Mazurek <mma@semihalf.com> Obtained from: Semihalf Sponsored by: Annapurna Labs Reviewed by: jhibbits, mmel Differential Revision: https://reviews.freebsd.org/D4879	2016-02-18 13:00:04 +00:00
Konstantin Belousov	fa48f413ef	In bnoreuselist(), check both ends of the specified logical block numbers range. This effectively skips indirect and extdata blocks on the buffer queue. Since their logical block numbers are negative, bnoreuselist() could loop infinitely. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-02-17 19:39:57 +00:00

... 3 4 5 6 7 ...

15115 Commits