freebsd-skq

Author	SHA1	Message	Date
des	7dad8b80f6	Add YARROW_RNG and FORTUNA_RNG to sys/conf/options. Add a SYSINIT that forces a reseed during proc0 setup, which happens fairly late in the boot process. Add a RANDOM_DEBUG option which enables some debugging printf()s. Add a new RANDOM_ATTACH entropy source which harvests entropy from the get_cyclecount() delta across each call to a device attach method.	2013-10-08 11:05:26 +00:00
markm	04741fa764	Debugging. My attempt at EVENTHANDLER(multiuser) was a failure; use EVENTHANDLER(mountroot) instead. This means we can't count on /var being present, so something will need to be done about harvesting /var/db/entropy/... . Some policy now needs to be sorted out, and a pre-sync cache needs to be written, but apart from that we are now ready to go. Over to review.	2013-10-08 06:54:52 +00:00
markm	6492773aa9	Snapshot. Looking pretty good; this mostly works now. New code includes: * Read cached entropy at startup, both from files and from loader(8) preloaded entropy. Failures are soft, but announced. Untested. * Use EVENTHANDLER to do above just before we go multiuser. Untested.	2013-10-06 22:45:02 +00:00
markm	21998ad688	MFC - tracking commit	2013-10-06 09:37:57 +00:00
kib	1a04bcfbb6	Remove the uipc_cow.c file, which is not used since the zero copy sockets removal. Noted by: alc Sponsored by: The FreeBSD Foundation Approved by: re (delphij)	2013-10-06 06:57:28 +00:00
alc	4c5162818b	Tidy up kmeminit(): Since r245575, 'nmbclusters' is calculated after kmeminit() runs, so it contributes nothing to 'vm_kmem_size'; update a comment to reflect that r254025 replaced the kmem submap with the kmem arena. Reviewed by: kib Approved by: re (gjb) Sponsored by: EMC / Isilon Storage Division	2013-10-05 18:53:03 +00:00
markm	3d04a78b78	MFC - tracking commit.	2013-10-04 07:00:59 +00:00
markm	b28953010e	Snapshot. This passes the build test, but has not yet been finished or debugged. Contains: * Refactor the hardware RNG CPU instruction sources to feed into the software mixer. This is unfinished. The actual harvesting needs to be sorted out. Modified by me (see below). * Remove 'frac' parameter from random_harvest(). This was never used and adds extra code for no good reason. * Remove device write entropy harvesting. This provided a weak attack vector, was not very good at bootstrapping the device. To follow will be a replacement explicit reseed knob. * Separate out all the RANDOM_PURE sources into separate harvest entities. This adds some secuity in the case where more than one is present. * Review all the code and fix anything obviously messy or inconsistent. Address som review concerns while I'm here, like rename the pseudo-rng to 'dummy'. Submitted by: Arthur Mesh <arthurmesh@gmail.com> (the first item)	2013-10-04 06:55:06 +00:00
sbruno	6b64ef7266	Change len checks for fstypelen and fspathlen to be against absolute len not strlen as they are not strings. Discovered by GSOC student, Mike Ma <mikemandarine@gmail.com> during his fuse.glusterfs port to FreeBSD. Final patch from mckusick@ Submitted by: mckusick@ Approved by: re (hrs) MFC after: 2 weeks	2013-10-03 22:52:03 +00:00
markm	e7fc623a18	MFC - tracking update.	2013-10-02 18:12:18 +00:00
kib	031d824a49	When helping the bufdaemon from the buffer allocation context, there is no sense to walk the whole dirty buffer queue. We are only interested in, and can operate on, the buffers owned by the current vnode [1]. Instead of calling generic queue flush routine, do VOP_FSYNC() if possible. Holding the dirty buffer queue lock in the bufdaemon, without dropping it, can cause starvation of buffer writes from other threads. This is esp. easy to reproduce on the big memory machines, where large files are written, causing almost all dirty buffers accumulating in several big files, which vnodes are locked by writers. Bufdaemon cannot flush any buffer, but is iterating over the whole dirty queue continuously. Since dirty queue mutex is not dropped, bufdone() in g_up thread is starved, usually deadlocking the machine [2]. Mitigate this by dropping the queue lock after the vnode is locked, allowing other queue lock contenders to make a progress. Discussed with: Jeff [1] Reported by: pho [2] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (hrs)	2013-10-02 06:00:34 +00:00
kib	2f645d31a9	When printing the vnode information from ddb, print the lengths of the dirty and clean buffer queues. Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)	2013-10-01 20:18:33 +00:00
kib	26571b5c44	For vunref(), try to upgrade the vnode lock if the function was called with the vnode shared-locked. If upgrade succeeded, the inactivation can be done immediately, instead of being postponed. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)	2013-09-29 18:07:14 +00:00
kib	e309fe770f	Reimplement r255797 using LK_TRYUPGRADE. The r255797 was: Increase the chance of the buffer write from the bufdaemon helper context to succeed. If the locked vnode which owns the buffer to be written is shared locked, try the non-blocking upgrade of the lock to exclusive. PR: kern/178997 Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)	2013-09-29 18:04:57 +00:00
kib	632f1d29f3	Add LK_TRYUPGRADE operation for lockmgr(9), which attempts to atomically upgrade shared lock to exclusive. On failure, error is returned and lock is not dropped in the process. Tested by: pho (previous version) No objections from: attilio Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)	2013-09-29 18:02:23 +00:00
jmg	1dc8591054	it must be the last member, not might... Reviewed by: attilio Approved by: re (delphij, gjb)	2013-09-26 17:55:04 +00:00
kib	9a6c86c297	Do not allow negative timeouts for kqueue timers, check for the negative timeout both before and after the conversion to sbintime_t. For periodic kqueue timer, convert zero timeout into 1ms, to avoid interrupt storm on fast event timers. Reported and tested by: pho Discussed with: mav Reviewed by: davide Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2013-09-26 13:17:31 +00:00
kib	c58dbf73e0	Acquire a hold reference on the vnode when a knote is instantiated. Otherwise, knote keeps a pointer to a vnode which could become invalid any time. Reported by: many Tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Discussed with: jmg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)	2013-09-26 13:14:51 +00:00
davide	5c31dfc658	Make the callout arithmetic more robust adding checks for overflow. Without these, if the timeout value passed is "large enough", the value of the sum of it and other factors (e.g. current time as returned by sbinuptime() or 'precision' argument) might result in a negative number. This negative number is then passed to eventtimers(4), which causes et_start() routine to load et_min_period into eventtimer, making the CPU where the thread is stuck forever in timer interrupt handler routine. This is now avoided rounding to INT64_MAX the timeout period in case of overflow. Reported by: kib, pho Discussed with: kib, mav Tested by: pho (stress2 suite, kevent7.sh scenario) Approved by: re (kib)	2013-09-26 10:06:50 +00:00
attilio	29d161240e	Avoid memory accesses reordering which can result in fget_unlocked() seeing a stale fd_ofiles table once fd_nfiles is already updated, resulting in OOB accesses. Approved by: re (kib) Sponsored by: EMC / Isilon storage division Reported and tested by: pho Reviewed by: benno	2013-09-25 13:37:52 +00:00
mav	a836e0c536	Make load average sampling asynchronous to hardclock ticks. This improves measurement of load caused by time-related events still using hardclock. For example, without this change dummynet, scheduling events each hardclock tick, was always miscounted as load of 1. There is still aliasing with events delayed by the new precision mechanism, but it probably can't be avoided without moving this sampling from using callout to some lower-level code or handling it in some other special way. Reviewed by: davide Approved by: re (marius)	2013-09-24 07:03:16 +00:00
des	65c545bff1	Always request zeroed memory, in case we're dumb enough to leak it later. Approved by: re (gjb)	2013-09-22 23:47:56 +00:00
kib	752a7a786f	Revert r255797. The LK_UPGRADE \| LK_NOWAIT drops the lock. Approved by: re (marius, implicit)	2013-09-22 20:29:03 +00:00
kib	5252c8b0bf	Pre-acquire the filedesc sx when a possibility exists that the later code could need to remove a kqueue from the filedesc list. Global lock is already locked, which causes sleepable after non-sleepable lock acquisition. Reported and tested by: pho Reviewed by: jmg Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2013-09-22 19:54:47 +00:00
kib	46dd93739b	Increase the chance of the buffer write from the bufdaemon helper context to succeed. If the locked vnode which owns the buffer to be written is shared locked, try the non-blocking upgrade of the lock to exclusive. PR: kern/178997 Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)	2013-09-22 19:23:48 +00:00
davide	30f0365ae9	Consistently use the same value to indicate exclusively-held and shared-held locks for all the primitives in lc_lock/lc_unlock routines. This fixes the problems introduced in r255747, which indeed introduced an inversion in the logic. Reported by: many Tested by: bdrewery, pho, lme, Adam McDougall, O. Hartmann Approved by: re (glebius)	2013-09-22 14:09:07 +00:00
glebius	12fb397079	- Create kern.ipc.sendfile namespace, and put the new "readhead" OID there as "kern.ipc.sendfile.readahead". - Push all nsfbuf related tunables into MD code. Don't move them to new namespace in favor of POLA. Reviewed by: scottl Approved by: re (gjb)	2013-09-22 13:36:52 +00:00
gibbs	f749b57e89	Fix ia64 and mips kernel builds due to XENHVM=>GENERIC integration in revision 255744. sys/kern/subr_smp.c: IPI_SUSPEND is only available on amd64 and i386. Protect new uses of this constant with #ifdefs to avoid impacting other platforms. Approved by: re (blanket Xen)	2013-09-22 02:46:13 +00:00
markj	d3af58e0a0	Regenerate syscall argument strings after r255777. Approved by: re (gjb) MFC after: 1 week	2013-09-21 23:06:36 +00:00
markj	885bf4020f	Omit "__restrict" when generating syscall argument strings. DTrace doesn't handle it and cannot determine the argument type when it's present. Approved by: re (gjb) MFC after: 1 week	2013-09-21 23:05:44 +00:00
davide	defa3b8b85	Fix callout_init_rm() in the shared case, allocating storage for 'struct rm_priotracker' directly in the softclock thread. Now consumers can pass CALLOUT_SHAREDLOCK flag to callout initialization routine safely. The choice of the already existing flags instead of special casing shared rmlocks is done to prevent consumer footshooting. Suggested by: jhb Reviewed by: jhb Approved by: re (delphij)	2013-09-20 23:16:15 +00:00
davide	5273d359fd	Fix lc_lock/lc_unlock() support for rmlocks held in shared mode. With current lock classes KPI it was really difficult because there was no way to pass an rmtracker object to the lock/unlock routines. In order to accomplish the task, modify the aforementioned functions so that they can return (or pass as argument) an uinptr_t, which is in the rm case used to hold a pointer to struct rm_priotracker for current thread. As an added bonus, this fixes rm_sleep() in the rm shared case, which right now can communicate priotracker structure between lc_unlock()/lc_lock(). Suggested by: jhb Reviewed by: jhb Approved by: re (delphij)	2013-09-20 23:06:21 +00:00
gibbs	7ed30adae7	Merge Xen PVHVM support into the GENERIC kernel config for both amd64 and i386. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: - Introduce two new CPU hooks for initialization and resume purposes. This allows us to get rid of the XENHVM ifdefs in mp_machdep, and also sets some hooks into common code that can be used by other hypervisor implementations. sys/amd64/conf/XENHVM: sys/i386/conf/XENHVM: - Remove these configs now that GENERIC has builtin support for Xen HVM. sys/kern/subr_smp.c: - Make sure there are no pending IPIs when suspending a system. sys/x86/xen/hvm.c: - Add cpu init and resume vectors that are called from mp_machdep using the new hooks. - Only clear the vcpu_info mapping data on resume. It is already clear for the BSP on a cold boot and is set correctly as APs are started. - Gate xen_hvm_init_cpu only to systems running under Xen. sys/x86/xen/xen_intr.c: - Gate the setup of event channels only to systems running under Xen.	2013-09-20 22:59:22 +00:00
gibbs	a9c07a6f67	Add support for suspend/resume/migration operations when running as a Xen PVHVM guest. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: - Make sure that are no MMU related IPIs pending on migration. - Reset pending IPI_BITMAP on resume. - Init vcpu_info on resume. sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: sys/x86/acpica/acpi_wakeup.c: sys/x86/x86/intr_machdep.c: sys/x86/isa/atpic.c: sys/x86/x86/io_apic.c: sys/x86/x86/local_apic.c: - Add a "suspend_cancelled" parameter to pic_resume(). For the Xen PIC, restoration of interrupt services differs between the aborted suspend and normal resume cases, so we must provide this information. sys/dev/acpica/acpi_timer.c: sys/dev/xen/timer/timer.c: sys/timetc.h: - Don't swap out "suspend safe" timers across a suspend/resume cycle. This includes the Xen PV and ACPI timers. sys/dev/xen/control/control.c: - Perform proper suspend/resume process for PVHVM: - Suspend all APs before going into suspension, this allows us to reset the vcpu_info on resume for each AP. - Reset shared info page and callback on resume. sys/dev/xen/timer/timer.c: - Implement suspend/resume support for the PV timer. Since FreeBSD doesn't perform a per-cpu resume of the timer, we need to call smp_rendezvous in order to correctly resume the timer on each CPU. sys/dev/xen/xenpci/xenpci.c: - Don't reset the PCI interrupt on each suspend/resume. sys/kern/subr_smp.c: - When suspending a PVHVM domain make sure there are no MMU IPIs in-flight, or we will get a lockup on resume due to the fact that pending event channels are not carried over on migration. - Implement a generic version of restart_cpus that can be used by suspended and stopped cpus. sys/x86/xen/hvm.c: - Implement resume support for the hypercall page and shared info. - Clear vcpu_info so it can be reset by APs when resuming from suspension. sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/x86/xen/xen_intr.c: - Support UP kernel configurations. sys/x86/xen/xen_intr.c: - Properly rebind per-cpus VIRQs and IPIs on resume.	2013-09-20 05:06:03 +00:00
jhb	c6151e30b1	Regen. Approved by: re (delphij)	2013-09-19 18:56:00 +00:00
jhb	d3ef75b6c7	Extend the support for exempting processes from being killed when swap is exhausted. - Add a new protect(1) command that can be used to set or revoke protection from arbitrary processes. Similar to ktrace it can apply a change to all existing descendants of a process as well as future descendants. - Add a new procctl(2) system call that provides a generic interface for control operations on processes (as opposed to the debugger-specific operations provided by ptrace(2)). procctl(2) uses a combination of idtype_t and an id to identify the set of processes on which to operate similar to wait6(). - Add a PROC_SPROTECT control operation to manage the protection status of a set of processes. MADV_PROTECT still works for backwards compatability. - Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc) the first bit of which is used to track if P_PROTECT should be inherited by new child processes. Reviewed by: kib, jilles (earlier version) Approved by: re (delphij) MFC after: 1 month	2013-09-19 18:53:42 +00:00
pjd	667d7255be	Fix panic in ktrcapfail() when no capability rights are passed. While here, correct all consumers to pass NULL instead of 0 as we pass capability rights as pointers now, not uint64_t. Reported by: Daniel Peyrolon Tested by: Daniel Peyrolon Approved by: re (marius)	2013-09-18 19:26:08 +00:00
rdivacky	fae003a069	Revert r255672, it has some serious flaws, leaking file references etc. Approved by: re (delphij)	2013-09-18 18:48:33 +00:00
rdivacky	d57db3eead	Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue to implement epoll subset of functionality. The kqueue user data are 32bit on i386 which is not enough for epoll user data so this patch overrides kqueue fileops to maintain enough space in struct file. Initial patch developed by me in 2007 and then extended and finished by Yuri Victorovich. Approved by: re (delphij) Sponsored by: Google Summer of Code Submitted by: Yuri Victorovich <yuri at rawbw dot com> Tested by: Yuri Victorovich <yuri at rawbw dot com>	2013-09-18 17:56:04 +00:00
glebius	ab92040a7b	Fix assertion in sendfile_readpage() to assert only the validity of requested amount of data in a page. Move assertion down below object unlock. Approved by: re (kib) Sponsored by: Nginx, Inc. Sponsored by: Netflix	2013-09-17 06:37:21 +00:00
kib	6796656333	Remove zero-copy sockets code. It only worked for anonymous memory, and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor. Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned. Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)	2013-09-16 06:25:54 +00:00
kib	36c84de38f	Use TAILQ instead of STAILQ for kqeueue filedescriptors to ensure constant time removal on kqueue close. Reported and tested by: pho Reviewed by: jmg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (delphij)	2013-09-13 19:50:50 +00:00
kib	4c87f20f3a	When opening or closing fifo, ensure that the vnode is locked exclusively. Filesystems are assumed to disable shared locking for the fifo vnode locks, but some do not. Reported and tested by: olgeni Discussed with: avg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)	2013-09-13 06:52:23 +00:00
kib	b3080da236	Reduce the scope of the proctree_lock. If several processes cause continuous calls to the uprintf(9), the proctree_lock could be shared-locked for indefinite amount of time, starving exclusive requests. Since proctree_lock is needed for fork() and exit(), this effectively stops the machine. While there, do the similar reduction for tprintf(9). Reported and tested by: pho Reviewed by: ed Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)	2013-09-13 06:39:10 +00:00
jhb	5dbaab99c0	Regen. Approved by: re (kib)	2013-09-12 18:03:51 +00:00
jhb	e0689d5d63	Fix the type of the idtype argument to wait6() in syscalls.master. (Accidentally missed this in the previous commit) Approved by: re (kib) MFC after: 1 week	2013-09-12 18:01:13 +00:00
jhb	d31f6d6b64	Fix the type of the idtype argument to wait6() in syscalls.master. Approved by: re (kib) MFC after: 1 week	2013-09-12 17:52:18 +00:00
glebius	3d45194f2c	Provide pr_ctloutput method for AF_LOCAL/SOCK_SEQPACKET sockets. This makes setsockopt() on them working. Reported by: Yuri <yuri rawbw.com> Approved by: re (kib)	2013-09-11 18:22:30 +00:00
kib	4d92de31b2	Fix build with gcc. Build-tested by: gjb Approved by: re (glebius)	2013-09-11 17:31:22 +00:00
kib	55cc853a4d	Implement sendfile(2) for the posix shared memory segment file descriptor, in addition to the regular files. Requested by: alc Discussed with: emaste Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Approved by: re (hrs)	2013-09-11 06:41:15 +00:00
des	21e6fd796b	Fix the length calculation for the final block of a sendfile(2) transmission which could be tricked into rounding up to the nearest page size, leaking up to a page of kernel memory. [13:11] In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR and SIOCSIFNETMASK at the socket layer rather than pass them on to the link layer without validation or credential checks. [SA-13:12] Prevent cross-mount hardlinks between different nullfs mounts of the same underlying filesystem. [SA-13:13] Security: CVE-2013-5666 Security: FreeBSD-SA-13:11.sendfile Security: CVE-2013-5691 Security: FreeBSD-SA-13:12.ifioctl Security: CVE-2013-5710 Security: FreeBSD-SA-13:13.nullfs Approved by: re	2013-09-10 10:05:59 +00:00
jhb	04bb6e10cd	Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use an address in the first 2GB of the process's address space. This flag should have the same semantics as the same flag on Linux. To facilitate this, add a new parameter to vm_map_find() that specifies an optional maximum virtual address. While here, fix several callers of vm_map_find() to use a VMFS_* constant for the findspace argument instead of TRUE and FALSE. Reviewed by: alc Approved by: re (kib)	2013-09-09 18:11:59 +00:00
delphij	7917506a10	In r243868, the error message buffer errmsg have been changed from an on-stack array to a pointer and therefore sizeof(errmsg) would become 4 or 8 bytes depending on the architecture. Fix this by using ERRMSGL in place of sizeof(). Submitted by: J David <j.david.lists@gmail.com> MFC after: 3 days Approved by: re (kib)	2013-09-09 05:01:18 +00:00
kib	56cc686058	Drain for the xbusy state for two places which potentially do pmap_remove_all(). Not doing the drain allows the pmap_enter() to proceed in parallel, making the pmap_remove_all() effects void. The race results in an invalidated page mapped wired by usermode. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (glebius)	2013-09-08 17:51:22 +00:00
pjd	f16777c9dc	Sort properly.	2013-09-07 19:16:02 +00:00
pjd	e781fc782f	Fix panic in cap_rights_is_valid() when invalid rights are provided - the right_to_index() function should assert correctness in this case. Improve other assertions. Reported by: pho Tested by: pho	2013-09-07 19:03:16 +00:00
mav	e372a1314d	Micro-optimize cpu_search(), allowing compiler to use more efficient inline ffsl() implementation, when it is available, instead of homegrown iteration. On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.	2013-09-07 15:16:30 +00:00
markm	b41e1125b0	MFC	2013-09-07 07:58:29 +00:00
np	24e0bcea20	Add a vtprintf. It is to tprintf what vprintf is to printf. Reviewed by: kib	2013-09-07 07:53:21 +00:00
markm	9d67aa8bff	MFC	2013-09-06 17:42:12 +00:00
jamie	d13d69ef17	Keep PRIV_KMEM_READ permitted inside jails as it is on the outside.	2013-09-06 17:32:29 +00:00
hiren	0493244548	Fixing a small typo. Reviewed by: gjb Approved by: sbruno (mentor)	2013-09-05 18:18:23 +00:00
kib	a08e5f97bc	The vm_pageout_flush() functions sbusies pages in the passed pages run. After that, the pager put method is called, usually translated to VOP_WRITE(). For the filesystems which use buffer cache, bufwrite() sbusies the buffer pages again, waiting for the xbusy state to drain. The later is done in vfs_drain_busy_pages(), which is called with the buffer pages already sbusied (by vm_pageout_flush()). Since vfs_drain_busy_pages() can only wait for one page at the time, and during the wait, the object lock is dropped, previous pages in the buffer must be protected from other threads busying them. Up to the moment, it was done by xbusying the pages, that is incompatible with the sbusy state in the new implementation of busy. Switch to sbusy. Reported and tested by: pho Sponsored by: The FreeBSD Foundation	2013-09-05 12:56:08 +00:00
pjd	e6563595d7	The fget() function now takes pointer to cap_rights_t, so change 0 to NULL.	2013-09-05 11:59:23 +00:00
pjd	1c7defb76e	Handle cases where capability rights are not provided. Reported by: kib	2013-09-05 11:58:12 +00:00
glebius	12cfeea19a	Fix !CAPABILITIES build.	2013-09-05 10:24:09 +00:00
pjd	ad93dafc6a	Correct the logic broken in my last commit. Reported by: tijl	2013-09-05 09:36:19 +00:00
sbruno	bf3232d8bb	Restore builds on architectures that don't support CAPABILITIES (mips).	2013-09-05 03:46:44 +00:00
sbruno	c1754372ea	This looks like a typo that breaks the build. Yell at me if this isn't the intended declaration.	2013-09-05 03:36:57 +00:00
pjd	add7315b85	Style fixes.	2013-09-05 00:19:30 +00:00
pjd	36e3a50584	Style fixes. Most fixes are about not treating integers and pointers as booleans.	2013-09-05 00:17:38 +00:00
pjd	d1a65cb7ef	Regenerate after r255219. Sponsored by: The FreeBSD Foundation	2013-09-05 00:11:59 +00:00
pjd	029a6f5d92	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
jhb	00a02e74da	Trim a couple of panic messages.	2013-09-04 11:52:28 +00:00
davide	3e4464bbe3	Fix socket buffer timeouts precision using the new sbintime_t KPI instead of relying on the tvtohz() workaround. The latter has been introduced lately by jhb@ (r254699) in order to have a fix that can be backported to STABLE. Reported by: Vitja Makarov <vitja.makarov at gmail dot com> Reviewed by: jhb (earlier version)	2013-09-01 23:34:53 +00:00
ray	c555be1512	MFC @r255128.	2013-09-01 23:06:28 +00:00
rmacklem	8d06f831a7	Forced dismounts of NFS mounts can fail when thread(s) are stuck waiting for an RPC reply from the server while holding the mount point busy (mnt_lockref incremented). This happens because dounmount() msleep()s waiting for mnt_lockref to become 0, before calling VFS_UNMOUNT(). This patch adds a new VFS operation called VFS_PURGE(), which the NFS client implements as purging RPCs in progress. Making this call before checking mnt_lockref fixes the problem, by ensuring that the VOP_xxx() calls will fail and unbusy the mount point. Reported by: sbruno Reviewed by: kib MFC after: 2 weeks	2013-09-01 23:02:59 +00:00
markm	ff7909302f	MFC	2013-08-30 11:38:34 +00:00
hselasky	5a68ee8757	Simplify pause_sbt() logic. Don't call DELAY() if remainder is less than or equal to zero.	2013-08-30 10:39:56 +00:00
kib	dc9c173247	Move the definition of the struct unrhdr into a separate header file, to allow embedding the struct. Add init_unrhdr(9) initializer, which sets up preallocated unrhdr. Reviewed by: alc Tested by: pho, bf	2013-08-30 07:37:45 +00:00
ken	6165f30980	Fix some issues in change 254760 pointed out by Bruce Evans: - Remove excessive parenthesis - Use KNF continuation indentation - Cut down on excessive continuation lines - More consistent style in messages - Use uprintf() instead of printf() Submitted by: bde	2013-08-29 16:41:40 +00:00
jhb	5875a10cf6	Don't return an error for socket timeouts that are too large. Just cap them to INT_MAX ticks instead. PR: kern/181416 (r254699 really) Requested by: bde MFC after: 2 weeks	2013-08-29 15:59:05 +00:00
andre	73f239a63e	Pad m_hdr on 32bit architectures to to prevent alignment and padding problems with the way MLEN, MHLEN, and struct mbuf are set up. CTASSERT's are provided to detect such issues at compile time in the future. The #define MLEN and MHLEN calculation do not take actual compiler- induced alignment and padding inside the complete struct mbuf into account. Accordingly appropriate attention is required when changing members of struct mbuf. Ideally one would calculate MLEN as (MSIZE - sizeof(((struct mbuf *)0)->m_hdr) but that doesn't work as the compiler refuses to operate on an as of yet incomplete structure. In particular ARM 32bit has more strict alignment requirements which caused 4 bytes of padding between m_hdr and pkthdr in struct mbuf because of the 64bit members in pkthdr. This wasn't picked up by MLEN and MHLEN causing an overflow of the mbuf provided data storage by overestimating its size. I386 didn't show this problem because it handles unaligned access just fine, albeit at a small performance penalty. On 64bit architectures the struct mbuf layout is 64bit aligned in all places. Reported by: Thomas Skibo <ThomasSkibo-at-sbcglobal-dot-net> Tested by: tuexen, ian, Thomas Skibo (extended patch) Sponsored by: The FreeBSD Foundation	2013-08-27 20:52:02 +00:00
kib	9426e190e7	When allocating a pbuf for the cluster write, do not sleep waiting for the available pbuf when passed vnode is backing md(4). Other i/o directed to the same md device might already hold pbufs, and then we could deadlock since only our progress can free a pbuf needed for wakeup. Obtained from: projects/vm6 Reminded and tested by: pho MFC after: 1 week	2013-08-27 01:31:12 +00:00
will	a52b9ca1d3	Add the ability to display the default FIB number for a process to the ps(1) utility, e.g. "ps -O fib". bin/ps/keyword.c: Add the "fib" keyword and default its column name to "FIB". bin/ps/ps.1: Add "fib" as a supported keyword. sys/compat/freebsd32/freebsd32.h: sys/kern/kern_proc.c: sys/sys/user.h: Add the default fib number for a process (p->p_fibnum) to the user land accessible process data of struct kinfo_proc. Submitted by: Oliver Fromme <olli@fromme.com>, gibbs	2013-08-26 23:48:21 +00:00
jmg	f4ae1f5e6d	fix up some comments and a white space issue... MFC after: 3 days	2013-08-26 18:53:19 +00:00
markm	8a3bb03c25	Snapshot; Do some running repairs on entropy harvesting. More needs to follow.	2013-08-26 18:35:21 +00:00
andre	6c0efad132	Give (*ext_free) an int return value allowing for very sophisticated external mbuf buffer management capabilities in the future. For now only EXT_FREE_OK is defined with current legacy behavior. Sponsored by: The FreeBSD Foundation	2013-08-25 10:57:09 +00:00
andre	46cd7f807a	Adjust socow_iodone() after r254799. Sponsored by: The FreeBSD Foundation	2013-08-25 09:40:15 +00:00
andre	e02967a3c7	After r254779 "error" must always be present in mb_ctor_pack(), not only when MAC is defined. Reported by: gjb / tinderbox Sponsored by: The FreeBSD Foundation	2013-08-24 21:25:53 +00:00
markj	3541d8b143	Rename the kld_unload event handler to kld_unload_try, and add a new kld_unload event handler which gets invoked after a linker file has been successfully unloaded. The kld_unload and kld_load event handlers are now invoked with the shared linker lock held, while kld_unload_try is invoked with the lock exclusively held. Convert hwpmc(4) to use these event handlers instead of having kern_kldload() and kern_kldunload() invoke hwpmc(4) hooks whenever files are loaded or unloaded. This has no functional effect, but simplifes the linker code somewhat. Reviewed by: jhb	2013-08-24 21:13:38 +00:00
andre	6ad648e223	Remove unused m_free_fast(). The difference to m_free() is only 2 predictable branches nowadays. However as a pre-condition the caller had to ensure that the mbuf pkthdr did not have any mtags attached to it, costing some potential branches again. Sponsored by: The FreeBSD Foundation	2013-08-24 21:09:57 +00:00
markj	351bab29d0	Set things up so that linker_file_lookup_set() is always called with the linker lock held. This makes it possible to call it from a kld event handler with the shared lock held. Reviewed by: jhb	2013-08-24 21:08:55 +00:00
markj	6eda1f0daf	Remove the kld lock macros and just use the sx(9) API. Add locking in linker_init_kernel_modules() and linker_preload() in order to remove most of the checks for !cold before asserting that the kld lock is held. These routines are invoked by SYSINIT(9), so there's no harm in them taking the kld lock.	2013-08-24 21:07:04 +00:00
markj	86598e9098	Remove some code that has been commented out since it was added in 2000.	2013-08-24 21:00:39 +00:00
andre	e3737c33e7	Restructure the mbuf pkthdr to make it fit for upcoming capabilities and features. The changes in particular are: o Remove rarely used "header" pointer and replace it with a 64bit protocol/ layer specific union PH_loc for local use. Protocols can flexibly overlay their own 8 to 64 bit fields to store information while the packet is worked on. o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc instead of pkthdr.header. o Extend csum_flags to 64bits to allow for additional future offload information to be carried (e.g. iSCSI, IPsec offload, and others). o Move the RSS hash type enumerator from abusing m_flags to its own 8bit rsstype field. Adjust accessor macros. o Add cosqos field to store Class of Service / Quality of Service information with the packet. It is not yet supported in any drivers but allows us to get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with a modernized ALTQ. o Add four 8 bit fields l[2-5]hlen to store the relative header offsets from the start of the packet. This is important for various offload capabilities and to relieve the drivers from having to parse the packet and protocol headers to find out location of checksums and other information. Header parsing in drivers is a lot of copy-paste and unhandled corner cases which we want to avoid. o Add another flexible 64bit union to map various additional persistent packet information, like ether_vtag, tso_segsz and csum fields. Depending on the csum_flags settings some fields may have different usage making it very flexible and adaptable to future capabilities. o Restructure the CSUM flags to better signify their outbound (down the stack) and inbound (up the stack) use. The CSUM flags used to be a bit chaotic and rather poorly documented leading to incorrect use in many places. Bring clarity into their use through better naming. Compatibility mappings are provided to preserve the API. The drivers can be corrected one by one and MFC'd without issue. o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures). Sponsored by: The FreeBSD Foundation	2013-08-24 19:51:18 +00:00
ken	32eeb1af5c	Fix a printf format warning on 32-bit mips and powerpc. Reported by: bde, gjb Pointy hat to: ken	2013-08-24 19:02:36 +00:00
andre	b148bf45e0	Add an mbuf pointer parameter to (*ext_free) to give the external free function access to the mbuf the external memory was attached to. Mechanically adjust all users to include the mbuf parameter. This fixes a long standing annoyance for external free functions. Before one had to sacrifice one of the argument pointers for this. Sponsored by: The FreeBSD Foundation	2013-08-24 16:57:44 +00:00
mav	8d8ac3d3c9	MFprojects/camlock r254460: Remove locking from taskqueue_member(). The list of threads is static during the taskqueue life cycle, so there is no need to protect it, taking quite congested lock several more times for each ZFS I/O.	2013-08-24 14:41:49 +00:00
andre	817be10f1c	dd a 24 bits wide ext_flags field to m_ext by reducing ext_type to 8 bits. ext_type is an enumerator and the number of types we have is a mere dozen. A couple of ext_types are renumbered to fit within 8 bits. EXT_VENDOR[1-4] and EXT_EXP[1-4] types for vendor-internal and experimental local mapping. The ext_flags field is currently unused but has a couple of flags already defined for future use. Again vendor and experimental flags are provided for local mapping. EXT_FLAG_BITS is provided for the printf(9) %b identifier. Initialize and copy ext_flags in the relevant mbuf functions. Improve alignment and packing of struct m_ext on 32 and 64 archs by carefully sorting the fields.	2013-08-24 13:15:42 +00:00
andre	b9a332b6b2	Avoid code duplication for mbuf initialization and use m_init() instead in mb_ctor_mbuf() and mb_ctor_pack().	2013-08-24 12:24:58 +00:00
ken	281a193b53	Add support to physio(9) for devices that don't want I/O split and configure sa(4) to request no I/O splitting by default. For tape devices, the user needs to be able to clearly understand what blocksize is actually being used when writing to a tape device. The previous behavior of physio(9) was that it would split up any I/O that was too large for the device, or too large to fit into MAXPHYS. This means that if, for instance, the user wrote a 1MB block to a tape device, and MAXPHYS was 128KB, the 1MB write would be split into 8 128K chunks. This would be done without informing the user. This has suboptimal effects, especially when trying to communicate status to the user. In the event of an error writing to a tape (e.g. physical end of tape) in the middle of a 1MB block that has been split into 8 pieces, the user could have the first two 128K pieces written successfully, the third returned with an error, and the last 5 returned with 0 bytes written. If the user is using a standard write(2) system call, all he will see is the ENOSPC error. He won't have a clue how much actually got written. (With a writev(2) system call, he should be able to determine how much got written in addition to the error.) The solution is to prevent physio(9) from splitting the I/O. The new cdev flag, SI_NOSPLIT, tells physio that the driver does not want I/O to be split beforehand. Although the sa(4) driver now enables SI_NOSPLIT by default, that can be disabled by two loader tunables for now. It will not be configurable starting in FreeBSD 11.0. kern.cam.sa.allow_io_split allows the user to configure I/O splitting for all sa(4) driver instances. kern.cam.sa.%d.allow_io_split allows the user to configure I/O splitting for a specific sa(4) instance. There are also now three sa(4) driver sysctl variables that let the users see some sa(4) driver values. kern.cam.sa.%d.allow_io_split shows whether I/O splitting is turned on. kern.cam.sa.%d.maxio shows the maximum I/O size allowed by kernel configuration parameters (e.g. MAXPHYS, DFLTPHYS) and the capabilities of the controller. kern.cam.sa.%d.cpi_maxio shows the maximum I/O size supported by the controller. Note that a better long term solution would be to implement support for chaining buffers, so that that MAXPHYS is no longer a limiting factor for I/O size to tape and disk devices. At that point, the controller and the tape drive would become the limiting factors. sys/conf.h: Add a new cdev flag, SI_NOSPLIT, that allows a driver to tell physio not to split up I/O. sys/param.h: Bump __FreeBSD_version to 1000049 for the addition of the SI_NOSPLIT cdev flag. kern_physio.c: If the SI_NOSPLIT flag is set on the cdev, return any I/O that is larger than si_iosize_max or MAXPHYS, has more than one segment, or would have to be split because of misalignment with EFBIG. (File too large). In the event of an error, print a console message to give the user a clue about what happened. scsi_sa.c: Set the SI_NOSPLIT cdev flag on the devices created for the sa(4) driver by default. Add tunables to control whether we allow I/O splitting in physio(9). Explain in the comments that allowing I/O splitting will be deprecated for the sa(4) driver in FreeBSD 11.0. Add sysctl variables to display the maximum I/O size we can do (which could be further limited by read block limits) and the maximum I/O size that the controller can do. Limit our maximum I/O size (recorded in the cdev's si_iosize_max) by MAXPHYS. This isn't strictly necessary, because physio(9) will limit it to MAXPHYS, but it will provide some clarity for the application. Record the controller's maximum I/O size reported in the Path Inquiry CCB. sa.4: Document the block size behavior, and explain that the option of allowing physio(9) to split the I/O will disappear in FreeBSD 11.0. Sponsored by: Spectra Logic	2013-08-24 04:52:22 +00:00
delphij	b93cf73204	Allow tmpfs be mounted inside jail.	2013-08-23 22:52:20 +00:00
jkim	e9e439afa6	Fix a whitespace.	2013-08-23 16:54:38 +00:00
kib	4731d4c0e7	Since the 253927, which removed the soft busy call for the sf page, it does not make sense to wait for the soft busy state of the page to drain. The vm object lock is dropped immediately after, so the result of the wait is invalidated. It might make sense to not wait for the hard busy state as well, esp. for the fully valid page, but this is postponed for now. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-23 14:50:03 +00:00
jhb	6d81dda61e	Use tvtohz() to convert a socket buffer timeout to a tick value rather than using a home-rolled version. The home-rolled version could result in shorter-than-requested sleeps. Reported by: Vitja Makarov <vitja.makarov@gmail.com> MFC after: 2 weeks	2013-08-23 13:47:41 +00:00
kib	e25f6a560e	Both cluster_rbuild() and cluster_wbuild() sometimes set the pages shared busy without first draining the hard busy state. Previously it went unnoticed since VPO_BUSY and m->busy fields were distinct, and vm_page_io_start() did not verified that the passed page has VPO_BUSY flag cleared, but such page state is wrong. New implementation is more strict and catched this case. Drain the busy state as needed, before calling vm_page_sbusy(). Tested by: pho, jkim Sponsored by: The FreeBSD Foundation	2013-08-22 18:26:45 +00:00
kib	ba12eedccd	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-22 07:39:53 +00:00
andre	d8dc6ade5b	Revert r254520 and resurrect the M_NOFREE mbuf flag and functionality. Requested by: np, grehan	2013-08-21 18:12:04 +00:00
kib	d11c4f9c32	Implement read(2)/write(2) and neccessary lseek(2) for posix shmfd. Add MAC framework entries for posix shm read and write. Do not allow implicit extension of the underlying memory segment past the limit set by ftruncate(2) by either of the syscalls. Read and write returns short i/o, lseek(2) fails with EINVAL when resulting offset does not fit into the limit. Discussed with: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-21 17:45:00 +00:00
kib	6a459eb27c	Make the seek a method of the struct fileops. Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-21 17:36:01 +00:00
kib	a3e8f2c6dc	Extract the general-purpose code from tmpfs to perform uiomove from the page queue of some vm object. Discussed with: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-21 17:23:24 +00:00
pho	967f62efad	Added sysctl to turn off calls to vmem_check(). Sponsored by: EMC / Isilon storage division Discussed with: jeff	2013-08-20 11:06:56 +00:00
jeff	ed90d4ba3f	- Use an arbitrary but reasonably large import size for kva on architectures that don't support superpages. This keeps the number of spans and internal fragmentation lower. - When the user asks for alignment from vmem_xalloc adjust the imported size by 2*align to be certain we can satisfy the allocation. This comes at the expense of potential failures when the backend can't supply enough memory but could supply the requested size and alignment. Sponsored by: EMC / Isilon Storage Division	2013-08-19 23:02:39 +00:00
andre	e1092223ba	Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree users for a very long time, if ever. Should such a functionality ever be needed again the appropriate and much better way to do it is through a custom EXT_SOMETHING external mbuf type together with a dedicated *ext_free function. Discussed with: trociny, glebius	2013-08-19 11:16:53 +00:00
jilles	1d1d0428d2	Disallow opening a POSIX message queue for execute. O_EXEC was formerly ignored, so equivalent to O_RDONLY. Reject O_EXEC with [EINVAL] like the invalid mode 3.	2013-08-18 13:27:04 +00:00
pjd	3014e000ae	Implement 32bit versions of the cap_ioctls_limit(2) and cap_ioctls_get(2) system calls as unsigned longs have different size on i386 and amd64. Reported by: jilles Sponsored by: The FreeBSD Foundation	2013-08-18 10:30:41 +00:00
bryanv	1b4affdd25	Do not use potentially stale thread in kthread_add() When an existing process is provided, the thread selected to use to initialize the new thread could have exited and be reaped. Acquire the proc lock earlier to ensure the thread remains valid. Reviewed by: jhb, julian (previous version) MFC after: 3 days	2013-08-17 17:02:43 +00:00
pjd	635a029a89	In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated. Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().	2013-08-17 14:13:45 +00:00
delphij	a1c410b061	Fix build.	2013-08-17 00:25:11 +00:00
kib	408a640438	Restore the previous sendfile(2) behaviour on the block devices. Provide valid .fo_sendfile method for several missed struct fileops. Reviewed by: glebius Sponsored by: The FreeBSD Foundation	2013-08-16 14:22:20 +00:00
markj	e461168640	Use strdup(9) instead of reimplementing it.	2013-08-16 03:41:41 +00:00
ken	5591de079d	Change the way that unmapped I/O capability is advertised. The previous method was to set the D_UNMAPPED_IO flag in the cdevsw for the driver. The problem with this is that in many cases (e.g. sa(4)) there may be some instances of the driver that can handle unmapped I/O and some that can't. The isp(4) driver can handle unmapped I/O, but the esp(4) driver currently cannot. The cdevsw is shared among all driver instances. So instead of setting a flag on the cdevsw, set a flag on the cdev. This allows drivers to indicate support for unmapped I/O on a per-instance basis. sys/conf.h: Remove the D_UNMAPPED_IO cdevsw flag and replace it with an SI_UNMAPPED cdev flag. kern_physio.c: Look at the cdev SI_UNMAPPED flag to determine whether or not a particular driver can handle unmapped I/O. geom_dev.c: Set the SI_UNMAPPED flag for all GEOM cdevs. Since GEOM will create a temporary mapping when needed, setting SI_UNMAPPED unconditionally will work. Remove the D_UNMAPPED_IO flag. nvme_ns.c: Set the SI_UNMAPPED flag on cdevs created here if NVME_UNMAPPED_BIO_SUPPORT is enabled. vfs_aio.c: In aio_qphysio(), check the SI_UNMAPPED flag on a cdev instead of the D_UNMAPPED_IO flag on the cdevsw. sys/param.h: Bump __FreeBSD_version to `1000045` for the switch from setting the D_UNMAPPED_IO flag in the cdevsw to setting SI_UNMAPPED in the cdev. Reviewed by: kib, jimharris MFC after: 1 week Sponsored by: Spectra Logic	2013-08-15 22:52:39 +00:00
cperciva	8544609e64	Change the queue of locks in kern_rangelock.c from holding lock requests in the order that they arrive, to holding (a) granted write lock requests, followed by (b) granted read lock requests, followed by (c) ungranted requests, in order of arrival. This changes the stopping condition for iterating through granted locks to see if a new request can be granted: When considering a read lock request, we can stop iterating as soon as we see a read lock request, since anything after that point is either a granted read lock request or a request which has not yet been granted. (For write lock requests, we must still compare against all granted lock requests.) For workloads with R parallel reads and W parallel writes, this improves the time spent from O((R+W)^2) to O(W*(R+W)); i.e., heavy parallel-read workloads become significantly more scalable. No statistically significant change in buildworld time has been measured, but synthetic tests of parallel 'dd > /dev/null' and 'openssl enc >/dev/null' with the input file cached yield dramatic (up to 10x) improvement with high (up to 128 processes) levels of parallelism. Reviewed by: kib	2013-08-15 20:19:17 +00:00
ray	3f79da49d7	MFC @r254374.	2013-08-15 19:32:08 +00:00
glebius	722a1a5e5d	Make sendfile() a method in the struct fileops. Currently only vnode backed file descriptors have this method implemented. Reviewed by: kib Sponsored by: Nginx, Inc. Sponsored by: Netflix	2013-08-15 07:54:31 +00:00
markj	6c24c9fb32	Specify SDT probe argument types in the probe definition itself rather than using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API to allow probes with dynamically-translated types. There is no functional change. MFC after: 2 weeks	2013-08-15 04:08:55 +00:00
markj	cee1e037da	Use kld_{load,unload} instead of mod_{load,unload} for the linker file load and unload event handlers added in r254266. Reported by: jhb X-MFC with: r254266	2013-08-14 00:42:21 +00:00
jeff	ca13e69c2f	- Disable quantum caches on the kmem_arena. This can make fragmentation worse on small KVA systems. I had intended to only enable it for debugging. Sponsored by: EMC / Isilon Storage Division	2013-08-13 22:41:24 +00:00
jeff	d330a11545	- Add a statically allocated memguard arena since it is needed very early on. - Pass the appropriate flags to vmem_xalloc() when allocating space for the arena from kmem_arena. Sponsored by: EMC / Isilon Storage Division	2013-08-13 22:40:43 +00:00
jhb	7120845712	Some small cleanups to the fixes in r180340: - Set NOTE_TRACKERR before running filt_proc(). If the knote did not have NOTE_FORK set in fflags when registered, then the TRACKERR event could miss being posted. - Don't pass the pid in to filt_proc() for NOTE_FORK events. The special handling for pids is done knote_fork() directly and no longer in filt_proc(). MFC after: 2 weeks	2013-08-13 18:45:58 +00:00
glebius	33afc0388b	- Minor style(9) fix. - Bring a comment up to date.	2013-08-13 13:40:31 +00:00
markj	5a3f78714c	FreeBSD's DTrace implementation has a few problems with respect to handling probes declared in a kernel module when that module is unloaded. In particular, * Unloading a module with active SDT probes will cause a panic. [1] * A module's (FBT/SDT) probes aren't destroyed when the module is unloaded; trying to use them after the fact will generally cause a panic. This change fixes both problems by porting the DTrace module load/unload handlers from illumos and registering them with the corresponding EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all probes defined in a module when that module is unloaded, and to prevent a module unload from proceeding if some of its probes are active. The latter problem has already been fixed for FBT probes by checking lf->nenabled in kern_kldunload(), but moving the check into the DTrace framework generalizes it to all kernel providers and also fixes a race in the current implementation (since a probe may be activated between the check and the call to linker_file_unload()). Additionally, the SDT implementation has been reworked to define SDT providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT to create and destroy SDT probes when a module is loaded or unloaded. This simplifies things quite a bit since it means that pretty much all of the SDT code can live in sdt.ko, and since it becomes easier to integrate SDT with the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible in that SDT providers spanning multiple modules can be created on the fly when a module is loaded; at the moment it looks like illumos' SDT implementation requires all SDT probes to be statically defined in a single kernel table. PR: 166927, 166926, 166928 Reported by: davide [1] Reviewed by: avg, trociny (earlier version) MFC after: 1 month	2013-08-13 03:10:39 +00:00
markj	5423ffaa89	Remove some unused fields from struct linker_file. They were added in r172862 for use by the DTrace SDT framework but don't seem to have ever been used. MFC after: 2 weeks	2013-08-13 03:09:00 +00:00
markj	80dd3f5e73	Add event handlers for module load and unload events. The load handlers are called after the module has been loaded, and the unload handlers are called before the module is unloaded. Moreover, the module unload handlers may return an error to prevent the unload from proceeding. Reviewed by: avg MFC after: 2 weeks	2013-08-13 03:07:49 +00:00
kib	58a6f3bcbe	The r254167 moved initialization of the sleepqueues before the witness is operational. init_sleepqueues() initializes 256 mutexes, which, due to witness still being cold, started to overflow the pending_locks array. As stated in the reported panic message, increase WITNESS_PENDLIST from 768 to 1024, which provides space for additional 256 locks. Reported by: many Tested by: rakuco, bdrewery	2013-08-10 21:42:14 +00:00
cognet	333a884980	Don't call sleepinit() from proc0_init(), make it a SYSINIT instead. vmem needs the sleepq locks to be initialized when free'ing kva, so we want it called as early as possible.	2013-08-09 23:13:52 +00:00
cognet	51c3f72bfa	Instead of just trying to do it for arm, make sure vm_kmem_size is properly aligned in kmeminit(), where it'll work for any arch. Suggested by: alc	2013-08-09 22:30:54 +00:00
attilio	e9f37cac74	On all the architectures, avoid to preallocate the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid to pre-allocate the KVA for such nodes. In order to do so make the operations derived from vm_radix_insert() to fail and handle all the deriving failure of those. vm_radix-wise introduce a new function called vm_radix_replace(), which can replace a leaf node, already present, with a new one, and take into account the possibility, during vm_radix_insert() allocation, that the operations on the radix trie can recurse. This means that if operations in vm_radix_insert() recursed vm_radix_insert() will start from scratch again. Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl	2013-08-09 11:28:55 +00:00
attilio	3f74b0e634	Give mutex(9) the ability to recurse on a per-instance basis. Now the MTX_RECURSE flag can be passed to the mtx_*_flag() calls. This helps in cases we want to narrow down to specific calls the possibility to recurse for some locks. Sponsored by: EMC / Isilon storage division Reviewed by: jeff, alc Tested by: pho	2013-08-09 11:24:29 +00:00
attilio	16c7563cf4	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concept into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becames the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becames an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavilly changed this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
trasz	fcb31f05a9	Don't dereference null pointer should acl_alloc() be passed M_NOWAIT and allocation failed. Nothing in the tree passed M_NOWAIT. Obtained from: mjg MFC after: 1 month	2013-08-09 08:40:31 +00:00
scottl	40b11a1746	Add a helpful message that can help point to why a sysctl tree removal failed Obtained from: Netflix MFC after: 3 days	2013-08-09 01:04:44 +00:00
rstone	d9719f74bc	Allow drivers to return BUS_PROBE_NOWILDCARD from their attach routine to match devices where the driver class was fixed but the unit number was wildcarded. This better matches the documented behaviour in DEVICE_PROBE(9). Reviewed by: imp	2013-08-08 19:30:49 +00:00
jhb	9481e259bb	Don't emit a spurious EVFILT_PROC event with no fflags set on process exit if NOTE_EXIT is not being monitored. The rationale is that a listener should only get an event for exit() if they registered interest via NOTE_EXIT. This matches the behavior on OS X. - Don't save the exit status on process exit unless NOTE_EXIT is being monitored. - Add an internal EV_DROP flag that requests kqueue_scan() to free the knote without signalling it to userland and use this when a process exits but the fflags in the knote is zero. Reviewed by: jmg MFC after: 1 month	2013-08-07 19:56:35 +00:00
kevlo	52419f21e1	Remove unsigned comparison < 0 Found by: LLVM Reviewed by: luigi	2013-08-07 07:22:56 +00:00
jeff	de4ecca213	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
kib	103825c951	Do not override the ENOENT error for the empty path, or EFAULT errors from copyins, with the relative lookup check. Discussed with: rwatson Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-08-05 19:42:03 +00:00
attilio	899ab64514	Revert r253939: We cannot busy a page before doing pagefaults. Infact, it can deadlock against vnode lock, as it tries to vget(). Other functions, right now, have an opposite lock ordering, like vm_object_sync(), which acquires the vnode lock first and then sleeps on the busy mechanism. Before this patch is reinserted we need to break this ordering. Sponsored by: EMC / Isilon storage division Reported by: kib	2013-08-05 08:55:35 +00:00
attilio	19b2ea9f81	The page hold mechanism is fast but it has couple of fallouts: - It does not let pages respect the LRU policy - It bloats the active/inactive queues of few pages Try to avoid it as much as possible with the long-term target to completely remove it. Use the soft-busy mechanism to protect page content accesses during short-term operations (like uiomove_fromphys()). After this change only vm_fault_quick_hold_pages() is still using the hold mechanism for page content access. There is an additional complexity there as the quick path cannot immediately access the page object to busy the page and the slow path cannot however busy more than one page a time (to avoid deadlocks). Fixing such primitive can bring to complete removal of the page hold mechanism. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff Tested by: pho	2013-08-04 21:07:24 +00:00
attilio	e825889721	Remove unnecessary soft busy of the page before to do vn_rdwr() in kern_sendfile() which is unnecessary. The page is already wired so it will not be subjected to pagefault. The content cannot be effectively protected as it is full of races already. Multiple accesses to the same indexes are serialized through vn_rdwr(). Sponsored by: EMC / Isilon storage division Reviewed by: alc, jeff Tested by: pho	2013-08-04 15:56:19 +00:00
marcel	65c945d583	Add a tunable for the default timeout.	2013-08-03 04:25:25 +00:00
glebius	7eacd3a0af	Remove extra zeroing after M_ZERO allocation.	2013-08-02 13:06:49 +00:00
kib	04cde0067f	Remove unused malloc type. Requested by: alc MFC after: 1 week	2013-08-01 12:55:41 +00:00
ian	24c5871f5a	Changes to allow using BOOTP_NFSROOT and mounting an nfs root filesystem other than the one specified by the BOOTP server. This configures NFS using the BOOTP protocol while also respecting other root-path options such as setting vfs.root.mountfrom in the environment or using the RB_DFLTROOT boot option. It allows you to override the root path provided by the server, or to supply a root path when the server provides IP configuration but no root path info. This maintains the historical BOOTP_NFSROOT behavior of panicking on a failure to mount the root path provided by the server, unless you've provided an alternative via the ROOTDEVNAME kernel option or by setting vfs.root.mountfrom. The behavior of panicking when given no other options is preserved because it amounts to a bit of a retry loop that could eventually recover from a transient network or server problem. The user can now override the root path from loader(8) even if the kernel is compiled with BOOTP_NFSROOT. If vfs.root.mountfrom is set in the environment it is used unconditionally -- it always overrides the BOOTP info. If it begins with [old]nfs: then the BOOTP code uses it instead of the server-provided info. If it specifies some other filesystem then the bootp code will not panic like it used to and the code in vfs_mountroot.c will invoke the right filesystem to do the mount. If the kernel is compiled with the ROOTDEVNAME option, then that name is used by the BOOTP code if either * The server doesn't provide a pathname. * The boothowto flags include RB_DFLTROOT. The latter allows the user to compile in alternate path in ROOTDEVNAME such as ufs:/dev/da0s1a and boot from that path by setting boot_dftlroot=1 in loader(8) or using the '-r' option in boot(8). The one thing not provided here is automatic failover from a server-provided path to a compiled-in one without the user manually requesting that. The code just isn't currently structured in a way that makes that possible with a lot of rewrite. I think the ability to set vfs.root.mountfrom and to use ROOTDEVNAME automatically when the server doesn't provide a name covers the most common needs. A set of patches submitted by Lars Eggert provided the part I couldn't figure out by myself when I tried to do this last year; many thanks. Reviewed by: rodrigc	2013-07-31 19:14:00 +00:00
scottl	8ceb091210	Another fix for r253823; retain the default of 1 readahead block for sendfile. Submitted by: glebius Obtained from: Netflix MFC after: 3 days	2013-07-31 15:55:01 +00:00
scottl	b9c4e3dc58	Fix r253823. Some WIP patches snuck in. Submitted by: zont	2013-07-30 23:50:09 +00:00
scottl	0eaffce7b3	Create a knob, kern.ipc.sfreadahead, that allows one to tune the amount of readahead that sendfile() will do. Default remains the same. Obtained from: Netflix MFC after: 3 days	2013-07-30 23:26:05 +00:00
kib	6660649d5c	When creation of the v_pollinfo raced and our instance of vpollinfo must be destroyed, knlist_clear() and seldrain() calls could be avoided, since vpollinfo was not used. More, the knlist_clear() calling protocol requires the knlist locked, which is not true at the call site. Split the destruction into the helper destroy_vpollinfo_free(), and call it when raced, instead of destroy_vpollinfo(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2013-07-28 06:59:29 +00:00
ray	0ac02980ce	MFC @r219886.	2013-07-25 12:43:22 +00:00
jhb	1e215bb36e	Use VMFS_OPTIMAL_SPACE instead of VMFS_ALIGNED_SPACE in shm_map().	2013-07-24 20:34:25 +00:00
marcel	13d997511f	Further restrict the MAC addresses that we use for UUID generation to those that are universally administered. While it is possible to add locally administered MAC addresses, it's unclear whether those are (expected) to be more unique than random multicast MAC addresses or not. With many U-Boot configurations assigning fixed and non-official MAC addresses to ethernet ports and without setting the 'X' flag, this change may have very little value in the embedded (development) space. Uniqueness of the universally administered addresses is non- existent on the (H/W) bench and questionable under the (S/W) desk. In short: this change is aimed at production environments...	2013-07-24 18:13:43 +00:00
marcel	d7c9064369	In uuid_ether_add(), avoid false positives due to the limited type used to hold the sum of the bytes of the MAC address. While here, rename the variable that holds the sum from 'c' to 'sum'. Pointed out by: thompsa@	2013-07-24 16:22:27 +00:00
avg	9e6374b6a9	rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST Also directly call swapper() at the end of mi_startup instead of relying on swapper being the last thing in sysinits order. Rationale: - "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage - "scheduler" was misleading, the function swaps in the swapped out processes - another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be invoked depending on its relative order with scheduler; this was not obvious and the bug actually used to exist Reviewed by: kib (ealier version) MFC after: 14 days	2013-07-24 09:45:31 +00:00
glebius	a0ef2f4f21	Remove unused argument from vmem_add1(). Reviewed by: jeff	2013-07-24 08:02:56 +00:00
marcel	4ca16da195	Decouple the UUID generator from network interfaces by having MAC addresses added to the UUID generator using uuid_ether_add(). The UUID generator keeps an arbitrary number of MAC addresses, under the assumption that they are rarely removed (= uuid_ether_del()). This achieves the following: 1. It brings up closer to having the network stack as a loadable module. 2. It allows the UUID generator to filter MAC addresses for best results (= highest chance of uniqeness). 3. MAC addresses can come from anywhere, irrespactive of whether it's used for an interface or not. A side-effect of the change is that when no MAC addresses have been added, a random multicast MAC address is created once and re-used if needed. Previusly, when a random MAC address was needed, it was created for every call. Thus, a change in behaviour is introduced for when no MAC addresses exist. Obtained from: Juniper Networks, Inc.	2013-07-24 04:24:21 +00:00
glebius	8a9169a4ba	Revert r249590 and in case if mp_ncpus isn't initialized use MAXCPU. This allows us to init counter zone at early stage of boot. Reviewed by: kib Tested by: Lytochkin Boris <lytboris gmail.com>	2013-07-23 11:16:40 +00:00
mjg	d467dfaf08	Remove cr_prison NULL check from proc_to_reap. Userspace processes always have a prison. MFC after: 2 weeks	2013-07-22 02:07:15 +00:00
mjg	de1e379f48	Remove duplicate assertion from tdsendsignal. MFC after: 2 weeks	2013-07-22 00:44:37 +00:00
kib	a7dacef5ab	Implement compat32 wrappers for the ktimer_* syscalls. Reported, reviewed and tested by: Petr Salinger <Petr.Salinger@seznam.cz> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-07-21 19:43:52 +00:00
kib	e9d8b81db7	Wrap kmq_notify(2) for compat32 to properly consume struct sigevent32 argument. Reviewed and tested by: Petr Salinger <Petr.Salinger@seznam.cz> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-07-21 19:40:30 +00:00
kib	97d40396c6	Move the convert_sigevent32() utility function into freebsd32_misc.c for consumption outside the vfs_aio.c. For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods, also copy in the sigev_value, since librt event pumping loop compares note generation number with the value passed through sigev_value. Tested by: Petr Salinger <Petr.Salinger@seznam.cz> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-07-21 19:33:48 +00:00
kib	82f12b6237	id_t is 64bit, provide the compat32 wrapper for clock_getcpuclockid2(2). Reported and tested by: Petr Salinger <Petr.Salinger@seznam.cz> PR: threads/180652 Sponsored by: The FreeBSD Foundation	2013-07-20 13:39:41 +00:00
jhb	d67e7a1cc9	Be more aggressive in using superpages in all mappings of objects: - Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for vm_map_find() that will try to alter the alignment of a mapping to match any existing superpage mappings of the object being mapped. If no suitable address range is found with the necessary alignment, vm_map_find() will fall back to using the simple first-fit strategy (VMFS_ANY_SPACE). - Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE. Reviewed by: alc (earlier version) MFC after: 2 weeks	2013-07-19 19:06:15 +00:00
kib	fcfcea28a9	Clear the vnode knotes before destroying vpollinfo. Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-17 10:56:21 +00:00
glebius	8f9d71bb1b	Nuke mbstat. It wasn't used for mbuf statistics since FreeBSD 5. Now that r253351 moved sendfile() stats to a separate struct, the last field used in mbstat is m_mcfail, which is updated, but never read or obtained from userland.	2013-07-15 12:18:36 +00:00
ae	6f8e41d6cb	Introduce new structure sfstat for collecting sendfile's statistics and remove corresponding fields from struct mbstat. Use PCPU counters and SFSTAT_INC() macro for update these statistics. Discussed with: glebius	2013-07-15 06:16:57 +00:00
rodrigc	7e3e1747c8	PR: 168520 170096 Submitted by: adrian, zec Fix multiple kernel panics when VIMAGE is enabled in the kernel. These fixes are based on patches submitted by Adrian Chadd and Marko Zec. (1) Set curthread->td_vnet to vnet0 in device_probe_and_attach() just before calling device_attach(). This fixes multiple VIMAGE related kernel panics when trying to attach Bluetooth or USB Ethernet devices because curthread->td_vnet is NULL. (2) Set curthread->td_vnet in if_detach(). This fixes kernel panics when detaching networking interfaces, especially USB Ethernet devices. (3) Use VNET_DOMAIN_SET() in ng_btsocket.c (4) In ng_unref_node() set curthread->td_vnet. This fixes kernel panics when detaching Netgraph nodes.	2013-07-15 01:32:55 +00:00
kib	cbf04bd9c1	Assert that runningbufspace does not underflow. Sponsored by: The FreeBSD Foundation	2013-07-13 19:36:18 +00:00
kib	0c582b5a5c	There is no need to count waiters for the runningbufspace. Sponsored by: The FreeBSD Foundation	2013-07-13 19:34:34 +00:00
kib	66a95162d6	Allow to call clock_gettime() on the clock id for zombie process. Reported by: Petr Salinger <Petr.Salinger@seznam.cz> PR: threads/180496 Sponsored by: The FreeBSD Foundation	2013-07-13 19:32:50 +00:00
andre	19a467c450	Make use of the fact that uma_zone_set_max(9) already returns the rounded limit making a call to uma_zone_get_max(9) unnecessary. MFC after: 1 day	2013-07-11 12:53:13 +00:00
andre	a54d54c890	Fix style issues, a typo in "kern.ipc.nmbufs" and correctly plave and expose the value of the tunable maxmbufmem as "kern.ipc.maxmbufmem" through sysctl. Reported by: smh MFC after: 1 day	2013-07-11 12:46:35 +00:00
kib	6d30588666	Do not invalidate page of the B_NOCACHE buffer or buffer after an I/O error if any user wired mappings exist. Doing the invalidation destroys the user wiring. The change is the temporal measure to close the bug, the more proper fix is to delegate the invalidation of the page to upper layers always. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:36:26 +00:00
marcel	c660176671	Add vfs_mounted and vfs_unmounted events so that components can be informed about mount and unmount events. This is used by Juniper to implement a more optimal implementation of NetBSD's veriexec. This change differs from r253224 in the following way: o The vfs_mounted handler is called before mountcheckdirs() and with newdp locked. vp is unlocked. o The event handlers are declared in <sys/eventhandler.h> and not in <sys/mount.h>. The <sys/mount.h> header is used in user land code that pretends to be kernel code and as such creates a very convoluted environment. It's hard to untangle. Submitted by: stevek@juniper.net Discussed with: pjd@ Obtained from: Juniper Networks, Inc.	2013-07-10 15:35:25 +00:00
kib	a7b76b76e1	There are several code sequences like vfs_busy(mp); vfs_write_suspend(mp); which are problematic if other thread starts unmount between two calls. The unmount starts a write, while vfs_write_suspend() drain writers. On the other hand, unmount drains busy references, causing the deadlock. Add a flag argument to vfs_write_suspend and require the callers of it to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the mount path, i.e. the covered vnode is not locked. The suspension is not attempted if VS_SKIP_UNMOUNT is specified and unmount is in progress. Reported and tested by: Andreas Longwitz <longwitz@incore.de> Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2013-07-09 20:49:32 +00:00
avg	44c288e149	should_yield: protect from td_swvoltick being uninitialized or too stale The distance between ticks and td_swvoltick should be calculated as an unsigned number. Previously we could end up comparing a negative number with hogticks in which case should_yield() would give incorrect answer. We should probably ensure that td_swvoltick is properly initialized. Sponsored by: HybridCluster MFC after: 5 days	2013-07-09 09:01:44 +00:00
avg	a9e12e57cb	namecache sdt: freebsd doesn't support structured characters yet :-) MFC after: 7 days	2013-07-09 08:58:34 +00:00
jhb	581b7760a5	Fix build with INVARIANT_SUPPORT enabled but not INVARIANTS. Reported by: "Matthew D. Fuller" <fullermd@over-yonder.net>	2013-07-08 21:17:20 +00:00
alfred	89aac2b3a1	Make kassert_printf use __printflike. Fix associated errors/warnings while I'm here. Requested by: avg	2013-07-07 21:39:37 +00:00
jamie	7dc297170a	Make the comments a little more clear about PRIV_KMEM_*, explicitly referring to /dev/[k]mem and noting it's about opening the files rather than actually reading and writing. Reviewed by: jmallett	2013-07-06 00:10:52 +00:00
jamie	33714247f6	Add new privileges, PRIV_KMEM_READ and PRIV_KMEM_WRITE, used in opening /dev/kmem and /dev/mem (in addition to traditional file permission checks). PRIV_KMEM_READ is different from other PRIV_* checks in that it's allowed by default. Reviewed by: kib, mckusick	2013-07-05 21:31:16 +00:00
alfred	9bc7dd4897	The change in r236456 (atomic_store_rel not locked) exposed a bug in the ithread code where we could lose ithread interrupts if intr_event_schedule_thread() is called while the ithread is already running. Effectively memory writes could be ordered incorrectly such that the unatomic write of 0 to ithd->it_need (in ithread_loop) could be delayed until after it was set to be triggered again by intr_event_schedule_thread(). This was particularly a big problem for CAM because CAM optimizes scheduling of ithreads such that it only signals camisr() when it queues to an empty queue. This means that additional completion events would not unstick CAM and quickly lead to a complete lockup of the CAM stack. To fix this use atomics in all places where ithd->it_need is accessed. Submitted by: delphij, mav Obtained from: TrueOS, iXsystems MFC After: 1 week	2013-07-04 05:53:05 +00:00
mjg	80f69955c8	Fix receiving fd over unix socket broken in r247740. If n fds were passed, it would receive the first one n times. Reported by: Shawn Webb <lattera@gmail.com>, koobs, gleb Tested by: koobs, gleb Reviewed by: pjd	2013-07-02 07:36:04 +00:00
trociny	68a112b157	Plug up the lock lock leakage when exporting to a short buffer. Reported by: Alexander Leidinger Submitted by: mjg MFC after: 1 week	2013-07-01 03:27:14 +00:00
kib	8a0279994d	Fix issues with zeroing and fetching the counters, on x86 and ppc64. Issues were noted by Bruce Evans and are present on all architectures. On i386, a counter fetch should use atomic read of 64bit value, otherwise carry from the increment on other CPU could be lost for the given fetch, making error of 2^32. If 64bit read (cmpxchg8b) is not available on the machine, it cannot be SMP and it is enough to disable preemption around read to avoid the split read. On x86 the counter increment is not atomic on purpose, which makes it possible for the store of the incremented result to override just zeroed per-cpu slot. The effect would be a counter going off by arbitrary value after zeroing. Perform the counter zeroing on the same processor which does the increments, making the operations mutually exclusive. On i386, same as for the fetching, if the cmpxchg8b is not available, machine is not SMP and we disable preemption for zeroing. PowerPC64 is treated the same as amd64. For other architectures, the changes made to allow the compilation to succeed, without fixing the issues with zeroing or fetching. It should be possible to handle them by using the 64bit loads and stores atomic WRT preemption (assuming the architectures also converted from using critical sections to proper asm). If architecture does not provide the facility, using global (spin) mutex would be non-optimal but working solution. Noted by: bde Sponsored by: The FreeBSD Foundation	2013-07-01 02:48:27 +00:00
mjg	436f750292	acct: create a special plimit object and set it for exiting processes instead of allocating new one each time All limits are set to RLIM_INFINITY which sould be ok (even though we care only about RLIMT_FSIZE in this case). MFC after: 1 week	2013-06-30 19:08:06 +00:00
mjg	523cdd9810	acct: reduce code duplication by using acct_disable as cleanup for failed kproc_create MFC after: 1 week	2013-06-30 13:17:37 +00:00
peter	51861f0c71	Revert accidental commit. Pointy hat to: peter	2013-06-29 05:05:57 +00:00
peter	b08615c213	Help out gcc. clang understands. sys_generic.c:1510: warning: 'precision' may be used uninitialized *** [sys_generic.o] Error code 1	2013-06-29 04:35:04 +00:00

... 2 3 4 5 6 ...

13633 Commits