freebsd-skq

Author	SHA1	Message	Date
mckusick	6f0b4b3366	As the kernel allocates and frees vnodes, it fully initializes them on every allocation and fully releases them on every free. These are not trivial costs: it starts by zeroing a large structure then initializes a mutex, a lock manager lock, an rw lock, four lists, and six pointers. And looking at vfs.vnodes_created, these operations are being done millions of times an hour on a busy machine. As a performance optimization, this code update uses the uma_init and uma_fini routines to do these initializations and cleanups only as the vnodes enter and leave the vnode_zone. With this change the initializations are only done kern.maxvnodes times at system startup and then only rarely again. The frees are done only if the vnode_zone shrinks which never happens in practice. For those curious about the avoided work, look at the vnode_init() and vnode_fini() functions in kern/vfs_subr.c to see the code that has been removed from the main vnode allocation/free path. Reviewed by: kib Tested by: Peter Holm	2015-11-29 21:42:26 +00:00
kib	ee461b4bba	Remove sv_prepsyscall, sv_sigsize and sv_sigtbl members of the struct sysent. sv_prepsyscall is unused. sv_sigsize and sv_sigtbl translate signal number from the FreeBSD namespace into the ABI domain. It is only utilized on i386 for iBCS2 binaries. The issue with this approach is that signals for iBCS2 were delivered with the FreeBSD signal frame layout, which does not follow iBCS2. The same note is true for any other potential user if sv_sigtbl. In other words, if ABI needs signal number translation, it really needs custom sv_sendsig method instead. Sponsored by: The FreeBSD Foundation	2015-11-28 08:49:07 +00:00
kib	d58541e227	Remove VI_AGE vnode iflag, it is unused. Noted by: bde Sponsored by: The FreeBSD Foundation	2015-11-27 01:45:40 +00:00
kib	3cb64bd1ce	Move the comment about resident pages preventing vnode from leaving active list, into the header comment for vdrop(), which is the function that decides whether to leave the vnode on the list. Note that dirty page write-out in vinactive() is asynchronous. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-11-27 01:16:35 +00:00
ae	da001d5bf7	Check that hhk_helper pointer isn't NULL before access. It isn't forbidden to use NULL pointer for hook_helper in hookinfo structure when hhook_add_hook() adds new helper hook.	2015-11-25 07:14:58 +00:00
kib	896302b6a8	Rework the vnode cache recycling to meet free and unused vnodes targets. See the comment above wantfreevnodes variable for the description of the algorithm. The vfs.vlru_alloc_cache_src sysctl is removed. New code frees namecache sources as the last chance to satisfy the highest watermark, instead of selecting the source vnodes randomly. This provides good enough behaviour to keep vn_fullpath() working in most situations. The filesystem layout with deep trees, where the removed knob was required, is thus handled automatically. Submitted by: bde Discussed with: mckusick Tested by: pho MFC after: 1 month	2015-11-24 09:45:36 +00:00
markj	9b7d03a6f4	The buffer passed to an sbuf drain callback is not necessarily null-terminated, so don't assume that it is. Reported by: pho X-MFC-With: r291059	2015-11-23 18:45:35 +00:00
kib	e0c4faece4	Split kerne timekeep ABI structure vdso_sv_tk out of the struct sysentvec. This allows the timekeep data to be shared between similar ABIs which cannot share sysentvec. Make the timekeep_push_vdso() tick callback to the timekeep structures instead of sysentvecs. If several sysentvec share the vdso_sv_tk structure, we would update the userspace data several times on each tick, without the change. Only allocate vdso_sv_tk in the exec_sysvec_init() sysinit when sysentvec is marked with the new SV_TIMEKEEP flag. This saves allocation and update of unneeded vdso_sv_tk for ABIs which do not provide userspace gettimeofday yet, which are PowerPCs arches right now. Make vdso_sv_tk allocator public, namely split out and export alloc_sv_tk() and alloc_sv_tk_compat32(). ABIs which share timekeep data now can allocate it manually and share as appropriate. Requested by: nwhitehorn Tested by: nwhitehorn, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-11-23 07:09:35 +00:00
glebius	6460e3db4e	Remove remnants of the old NFS from vnode pager. Reviewed by: kib Sponsored by: Netflix	2015-11-20 23:52:27 +00:00
trasz	a6632d64f5	The freebsd4_getfsstat() was broken in r281551 to always return 0 on success. All versions of getfsstat(3) are supposed to return the number of [o]statfs structs in the array that was copied out. Also fix missing bounds checking and signed comparison of unsigned types. Submitted by: bde@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-11-20 14:08:12 +00:00
jtl	f2aa140123	Consistently enforce the restriction against calling malloc/free when in a critical section. uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the zone. The malloc() family of functions may call uma_zalloc_arg() or uma_zalloc_free(). The malloc(9) man page currently claims that free() will never sleep. It also implies that the malloc() family of functions will not sleep when called with M_NOWAIT. However, it is more correct to say that these functions will not sleep indefinitely. Indeed, they may acquire a sleepable lock. However, a developer may overlook this restriction because the WITNESS check that catches attempts to call the malloc() family of functions within a critical section is inconsistenly applied. This change clarifies the language of the malloc(9) man page to clarify the restriction against calling the malloc() family of functions while in a critical section or holding a spin lock. It also adds KASSERTs at appropriate points to make the enforcement of this restriction more consistent. PR: 204633 Differential Revision: https://reviews.freebsd.org/D4197 Reviewed by: markj Approved by: gnn (mentor) Sponsored by: Juniper Networks	2015-11-19 14:04:53 +00:00
markj	7b874e569f	Remove a commented-out debug print. MFC after: 1 week	2015-11-19 05:58:51 +00:00
markj	7df27c4620	Add support for a configurable output channel to witness(4). This is useful in environments where system configuration is performed by automated interaction with the system console, since unexpected witness output makes such automation difficult. With this change, the new debug.witness.output_channel sysctl allows one to specify that witness output is to be printed to the kernel log (using log(9)) rather than the console. Reviewed by: cem, jhb MFC after: 2 weeks Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4183	2015-11-19 05:56:59 +00:00
markj	afc7726a46	Add vlog(9). Reviewed by: cem, jhb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D4183	2015-11-19 05:50:22 +00:00
nwhitehorn	2b225aeb0b	Extend r270123 to run the brand info's header_supported() routine for branded as well as unbranded binaries. This will be required to add support for the new ELFv2 ABI on powerpc64, which is distinguished from ELFv1 by the contents of the ELF header's flags field. Reviewed by: imp MFC after: 2 weeks	2015-11-18 17:03:22 +00:00
marius	7965ca51f7	- Unbreak dumpsys(9) on sparc64 after r276772 - While at it, arrange #ifndefs in kern_dump.c more intelligently; it's rather confusing to have multiple competing and/or unused functions in the kernel.	2015-11-16 23:02:33 +00:00
trasz	ae1d62cec8	Speed up rctl operation with large rulesets, by holding the lock during iteration instead of relocking it for each traversed rule. Reviewed by: mjg@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D4110	2015-11-15 12:10:51 +00:00
rrs	dc494194a2	This fixes several places where callout_stops return is examined. The new return codes of -1 were mistakenly being considered "true". Callout_stop now returns -1 to indicate the callout had either already completed or was not running and 0 to indicate it could not be stopped. Also update the manual page to make it more consistent no non-zero in the callout_stop or callout_reset descriptions. MFC after: 1 Month with associated callout change.	2015-11-13 22:51:35 +00:00
jhb	c1d9f70889	Export various helper variables describing the layout and size of certain kernel structures for use by debuggers. This mostly aids in examining cores from a kernel without debug symbols as a debugger can infer these values if debug symbols are available. One set of variables describes the layout of 'struct linker_file' to walk the list of loaded kernel modules. A second set of variables describes the layout of 'struct proc' and 'struct thread' to walk the list of processes in the kernel and the threads in each process. The 'pcb_size' variable is used to index into the stoppcbs[] array. The 'vm_maxuser_address' is used to distinguish kernel virtual addresses from user addresses. This doesn't have to be perfect, and 'vm_maxuser_address' is a cheap and simple way to differentiate kernel pointers from simple values like TIDs and PIDs. While here, annotate the fields in struct pcb used by kgdb on amd64 and i386 to note that their ABI should be preserved. Annotations for other platforms will be added in the future. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3773	2015-11-12 22:00:59 +00:00
rrs	24a4335f25	Add new async_drain to the callout system. This is so-far not used but should be used by TCP for sure in its cleanup of the IN-PCB (will be coming shortly). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D4076	2015-11-10 14:49:32 +00:00
jpaetzel	22e9b6f780	Fix a bug in the CPU % limiting code If you attempt to set a pcpu limit that is higher than 110% using rctl (for instance, you want a jail to be able to use 2 cores on your system so you set pcpu to 200%) the thing you are trying to limit becomes unthrottled. PR: 189870 Submitted by: dustinwenz@ebureau.com Reviewed by: trasz MFC after: 1 week	2015-11-10 14:14:32 +00:00
trasz	425e227d02	Make naming more consistent; no functional changes. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-11-08 18:11:24 +00:00
trasz	183c511071	Speed up rctl(8) rule retrieval; the difference shows mostly in "rctl -n", as otherwise most of the time is spent resolving UIDs to names. Reviewed by: mjg@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D4059	2015-11-08 18:08:31 +00:00
tijl	4e8b6b4a06	Since r289279 bufinit() uses mp_ncpus, but some architectures set this variable during mp_start() which is too late. Move this to mp_setmaxid() where other architectures set it and move x86 assertions to MI code. Reviewed by: kib (x86 part)	2015-11-08 14:26:50 +00:00
markj	a7bb6eb720	- Consistently use PROC_ASSERT_HELD() to verify that a process' hold count is non-zero. - Include the process address in the PROC_ASSERT_HELD() and PROC_ASSERT_NOT_HELD() assertion messages so that the corresponding process can be found easily when debugging. MFC after: 1 week	2015-11-08 01:38:56 +00:00
cem	c56fbbf66e	Flesh out sysctl types further (follow-up of r290475) Use the right intmax_t type instead of intptr_t in a few remaining places. Add support for CTLFLAG_TUN for the new fixed with types. Bruce will be upset that the new handlers silently truncate tuned quad-sized inputs, but so do all of the existing handlers. Add the new types to debug_dump_node, for whatever use that is. Bump FreeBSD_version again, for good measure. We are changing SYSCTL_HANDLER_ARGS and a member of struct sysctl_oid to intmax_t. Correct the sysctl typed NULL values for the fixed-width types. (Hat tip: hps@.) Suggested by: hps (partial) Sponsored by: EMC / Isilon Storage Division	2015-11-07 18:26:32 +00:00
adrian	4c139fdba3	Add a sched_yield() to work around low memory conditions in the current code. Things seem to get stuck in low memory conditions where no bufs are available, the reclamation path is called to wakeup the daemon, but no sleeping is done. Because of this, we are stuck in a tight loop in the current process and never run said reclamation path. This was introduced in r289279 . This is only a temporary workaround to restore system usefulness until the more permanent solutions can be found. Tested: * Carambola2, 64MB (and 32MB by manual config.)	2015-11-07 04:04:00 +00:00
cem	65f14c5699	Round out SYSCTL macros to the full set of fixed-width types Add S8, S16, S32, and U32 types; add SYSCTL() macros for them, as well as for the existing 64-bit types. (While SYSCTLQUAD and UQUAD macros already exist, they do not take the same sort of 'val' parameter that the other macros do.) Clean up the documented "types" in the sysctl.9 document. (These are macros and thus not real types, but the manual page documents intent.) The sysctl_add_oid(9) arg2 has been bumped from intptr_t to intmax_t to accommodate 64-bit types on 32-bit pointer architectures. This is just the kernel support piece; the userspace sysctl(1) support will follow in a later patch. Submitted by: Ravi Pokala <rpokala@panasas.com> Reviewed by: cem Relnotes: no Sponsored by: Panasas Differential Revision: https://reviews.freebsd.org/D4091	2015-11-07 01:43:01 +00:00
mjg	bc4e60465f	fd: implement kern.proc.nfds sysctl Intended purpose is to provide an equivalent of OpenBSD's getdtablecount syscall for the compat library..	2015-11-07 00:18:14 +00:00
jhb	78dafbc6fb	When dumping an rman in DDB, include the RID of each resource. Submitted by: Ravi Pokala (rpokala@panasas.com) Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D4086	2015-11-05 23:12:23 +00:00
markj	70321eeb6d	Have elf_lookup() return an error if the specified non-weak symbol could not be found. Otherwise, relocations against such symbols will be silently ignored instead of causing an error to be raised. Reviewed by: kib MFC after: 1 week	2015-11-03 03:29:35 +00:00
ngie	c5650de671	Define `fhard` in pps_event(..) only when PPS_SYNC is defined to mute an -Wunused-but-set-variable warning Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job Sponsored by: EMC / Isilon Storage Division	2015-11-02 03:14:37 +00:00
ngie	c5d7b522c7	Define `compress` in `__elfN(coredump)` when #ifdef GZIO is true to mute an -Wunused-but-set-variable warning Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job Sponsored by: EMC / Isilon Storage Division	2015-11-02 01:47:26 +00:00
imp	6471aad35d	The error classification from lower layers is a poor indicator of whether an error is recoverable. Always re-dirty the buffer on errors from write requests. The invalidation we used to do for errors not EIO doesn't need to be done for a device that's really gone, since that's done in a different path. Reviewed by: mckusick@, kib@	2015-10-31 04:53:07 +00:00
kib	979d5cd34d	Minor (and incomplete) style cleanup. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-30 20:47:42 +00:00
kib	eddc63a998	Also mark compat32 umtx op table as constant. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-30 19:32:30 +00:00
kib	25e819cedd	Use C99 array initialization, which also makes the code self-documented, and eases addition of new ops. For the similar reasons, eliminate UMTX_OP_MAX. nitems() handles the only use of the symbol. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-30 19:20:40 +00:00
trasz	e52d6f1e4a	After r290196, the kernel won't wait for stuff like gmirror nodes if they are not required for mounting rootfs. However, it's possible that some setups try to mount them in mountcritlocal (ie from fstab). Export the list of current root mount holds using a new sysctl, vfs.root_mount_hold, and make mountcritlocal retry if "mount -a" fails and the list is not empty. MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3709	2015-10-30 15:52:10 +00:00
trasz	3b970d7ccc	Make root mount wait mechanism smarter, by making it wait only if the root device doesn't yet exist. Reviewed by: kib@, marcel@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3709	2015-10-30 15:35:04 +00:00
bdrewery	7d7f09674b	getnewbuf: Initialize bp to avoid uninitialized pointer dereference and brelse(). This came in recently in r289279. Coverity CID: 1331561	2015-10-29 19:02:24 +00:00
hselasky	881f337ccc	Add missing NULL check in physio(). When destroying a character device the si_devsw field is set to NULL before all references are gone, to indicate the character device is going away. This can cause a NULL-dereference fault inside physio(). The callers of physio() should own a thread reference on the cdev and if si_devsw is seen as non-NULL, it is usable during the execution of the function. Else an ENXIO error code is returned. Reviewed by: kib MFC after: 2 weeks	2015-10-29 13:53:37 +00:00
imp	20050b3d3c	Add a note to the effect that BUS_ADD_CHILD calls device_add_child_ordered to add the child. device_add_child_ordered doesn't call BUS_ADD_CHILD.	2015-10-28 18:53:18 +00:00
mckusick	485c0ddc22	Bring the tags and links entries for amd64 up to date. Based on how out of date it is, I doubt that anyone other than me and my code-reading students still use it.	2015-10-27 22:59:24 +00:00
pjd	10f57f392f	The aio_waitcomplete(2) syscall should not sleep when the given timeout is 0. Without this change it was sleeping for one tick. Maybe not a big deal, but it makes share/dtrace/blocking script to report that. Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D3814 Sponsored by: Wheel Systems, http://wheelsystems.com	2015-10-25 18:48:09 +00:00
cem	51e3af66cf	Sysctl: Add common support for U8, U16 types Sponsored by: EMC / Isilon Storage Division	2015-10-22 23:03:06 +00:00
jhb	88ad316e08	Missing regen after last change to sys/kern/syscalls.master.	2015-10-22 21:30:39 +00:00
jhb	9740ac3060	Rename remaining linux32 symbols such as linux_sysent[] and linux_syscallnames[] from linux_* to linux32_* to avoid conflicts with linux64.ko. While here, add support for linux64 binaries to systrace. - Update NOPROTO entries in amd64/linux/syscalls.master to match the main table to fix systrace build. - Add a special case for union l_semun arguments to the systrace generation. - The systrace_linux32 module now only builds the systrace_linux32.ko. module on amd64. - Add a new systrace_linux module that builds on both i386 and amd64. For i386 it builds the existing systrace_linux.ko. For amd64 it builds a systrace_linux.ko for 64-bit binaries. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D3954	2015-10-22 21:28:20 +00:00
ed	8be8acd7af	Add a way to distinguish between forking and thread creation in schedtail. For CloudABI we need to initialize the registers of new threads differently based on whether the thread got created through a fork or through simple thread creation. Add a flag, TDP_FORKING, that is set by do_fork() and cleared by fork_exit(). This can be tested against in schedtail. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3973	2015-10-22 09:33:34 +00:00
kib	f2fc8e1816	Trim spaces at end of line to record the proper commit message for r289660: Do not allow to execute ptrace(PT_TRACE_ME) when the process is already traced. Do not allow to execute ptrace(PT_TRACE_ME) when there is no parent which can trace the process, i.e. when the parent is already init. Note that after the PT_TRACE_ME request the process is unkillable and non-continuable until a debugger is attached, or parent is killed, the later clears P_TRACED state. Since init clearly would not debug the caller, and cannot be killed, disallow creation of unkillable processes. Reviewed by: jhb, pho Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3908	2015-10-20 20:38:20 +00:00
kib	c517f862f2	Mark struct thread zone as type-stable. When establishing the locking state for several lock types (including blockable mutexes and sx) failed, locking primitives try to spin while the owner thread is running. The spinning loop performs the test for running condition by dereferencing the owner->td_state field of the owner thread. If the owner thread exited while spinner was put off the processor, it is harmless to access reused struct thread owner, since in some near future the current processor would notice the owner change and make appropriate progress. But it could be that the page which carried the freed struct thread was unmapped, then we fault (this cannot happen on amd64). For now, disallowing free of the struct thread seems to be good enough, and tests which create a lot of threads once, did not demonstrated regressions. Reviewed by: jhb, pho Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3908	2015-10-20 20:29:21 +00:00
kib	791509a17b	Reviewed by: jhb, pho Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3908	2015-10-20 20:22:57 +00:00
kib	f8f0302151	No need to dereference struct proc to pids when comparing processes for equality. Reviewed by: jhb, pho Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-10-20 20:12:42 +00:00
jhb	6db15b38fe	Switch pl_child_pid from int to pid_t. Reviewed by: emaste, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3857	2015-10-20 17:58:21 +00:00
ian	e4cc0fc7e4	Fix printf format to allow for bus_size_t not being u_long on all platforms.	2015-10-20 03:25:17 +00:00
ngie	42e86731a7	Replace /dev/acd0 with /dev/cd1 atapicd(4) has been removed since r249083, and if a system has more than one optical drive, it will likely be /dev/cd1 Update mount.conf(8) to reflect the change in behavior MFC after: never Sponsored by: EMC / Isilon Storage Division	2015-10-17 08:51:10 +00:00
kib	0fdc52ff38	If falloc_caps() failed, cleanup needs to be performed. This is a bug in r289026. Sponsored by: The FreeBSD Foundation	2015-10-16 14:55:39 +00:00
kib	a6091af923	Allow PT_INTERP and PT_NOTES segments to be located anywhere in the executable image. Keep one page (arbitrary) limit on the max allowed size of the PT_NOTES. The ELF image activators still require that program headers of the executable are fully contained in the first page of the image file. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3871	2015-10-14 18:27:35 +00:00
jeff	4402204d47	Parallelize the buffer cache and rewrite getnewbuf(). This results in a 8x performance improvement in a micro benchmark on a 4 socket machine. - Get buffer headers from a per-cpu uma cache that sits in from of the free queue. - Use a per-cpu quantum cache in vmem to eliminate contention for kva. - Use multiple clean queues according to buffer cache size to eliminate clean queue lock contention. - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers from blocking or doing direct recycling. - Close some bufspace allocation races that could lead to endless recycling. - Further the transition to a more modern style of small functions grouped by prefix in order to improve growing complexity. Sponsored by: EMC / Isilon Reviewed by: kib Tested by: pho	2015-10-14 02:10:07 +00:00
hiren	0d12306188	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
trasz	3e6041333a	Change the default setting of kern.ipc.shm_allow_removed from 0 to 1. This removes the need for manually changing this flag for Google Chrome users. It also improves compatibility with Linux applications running under Linuxulator compatibility layer, and possibly also helps in porting software from Linux. Generally speaking, the flag allows applications to create the shared memory segment, attach it, remove it, and then continue to use it and to reattach it later. This means that the kernel will automatically "clean up" after the application exits. It could be argued that it's against POSIX. However, SUSv3 says this about IPC_RMID: "Remove the shared memory identifier specified by shmid from the system and destroy the shared memory segment and shmid_ds data structure associated with it." From my reading, we break it in any case by deferring removal of the segment until it's detached; we won't break it any more by also deferring removal of the identifier. This is the behaviour exhibited by Linux since... probably always, and also by OpenBSD since the following commit: revision 1.54 date: 2011/10/27 07:56:28; author: robert; state: Exp; lines: +3 -8; Allow segments to be used even after they were marked for deletion with the IPC_RMID flag. This is permitted as an extension beyond the standards and this is similar to what other operating systems like linux do. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3603	2015-10-10 09:29:47 +00:00
trasz	887351d9dc	Provide better debug message on kernel module name clash. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-10-10 09:21:55 +00:00
trasz	9b6b02c02f	Remove root_mount_wait(). It's not used anywhere. Reviewed by: bapt@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3787	2015-10-09 12:11:37 +00:00
kib	753bf3d282	Enforce the maxproc limitation before allocating struct proc, initial struct thread and kernel stack for the thread. Otherwise, a load similar to a fork bomb would exhaust KVA and possibly kmem, mostly due to the struct proc being type-stable. The nprocs counter is changed from being protected by allproc_lock sx to be an atomic variable. Note that ddb/db_ps.c:db_ps() use of nprocs was unsafe before, and is still unsafe, but it seems that the only possible undesired consequence is the harmless warning printed when allproc linked list length does not match nprocs. Diagnosed by: Svatopluk Kraus <onwahe@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-08 11:07:09 +00:00
fabient	368842a107	Fix r283998 that broke mapin events for hwpmc. Reviewed by: jhb Sponsored by: Stormshield	2015-10-08 09:54:33 +00:00
glebius	daa604f294	Fix regression from r248371. We need to copy packet header to new mbuf. Unlike in the pre-r248371 code, assert that M_PKTHDR is set only on a first mbuf. Reported & tested by: Andriy Voskoboinyk <s3erios gmail.com> Sponsored by: Nginx, Inc.	2015-10-07 12:40:00 +00:00
jhb	b86a33b09a	Fix various edge cases related to system call tracing. - Always set td_dbg_sc_* when P_TRACED is set on system call entry even if the debugger is not tracing system call entries. This ensures the fields are valid when reporting other stops that occur at system call boundaries such as for PT_FOLLOW_FORKS or when only tracing system call exits. - Set TDB_SCX when reporting the stop for a new child process in fork_return(). This causes the event to be reported as a system call exit. - Report a system call exit event in fork_return() for new threads in a traced process. - Copy td_dbg_sc_* to new threads instead of zeroing. This ensures that td_dbg_sc_code in particular will report the system call that created the new thread or process when it reports a system call exit event in fork_return(). - Add new ptrace tests to verify that new child processes and threads report system call exit events with a valid pl_syscall_code via PT_LWPINFO. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3822	2015-10-06 19:29:05 +00:00
cem	9c1e214f79	Fix core corruption caused by race in note_procstat_vmmap This fix is spiritually similar to r287442 and was discovered thanks to the KASSERT added in that revision. NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' vm map via vn_fullpath. As vnodes may move during coredump, this is racy. We do not remove the race, only prevent it from causing coredump corruption. - Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable kinfo packing for PROCSTAT_VMMAP notes. This avoids VMMAP corruption and truncation, even if names change, at the cost of up to PATH_MAX bytes per mapped object. The new sysctl is documented in core.5. - Fix note_procstat_vmmap to self-limit in the second pass. This addresses corruption, at the cost of sometimes producing a truncated result. - Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste) to grok the new zero padding. Reported by: pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3824	2015-10-06 18:07:00 +00:00
glebius	94c16d891e	Remove debugging variable from r143761.	2015-10-06 09:43:49 +00:00
jhb	e2e358d7ec	Include additional info in ptrace(2) KTR traces: - The new PC value and signal passed to PT_CONTINUE, PT_DETACH, PT_SYSCALL, and PT_TO_SC[EX]. - The system call code returned via PT_LWPINFO. MFC after: 1 week	2015-10-05 21:36:53 +00:00
markj	d6b5b38ff0	Revert r288628 and instead fix a discrepancy between the posix_fadvise(2) man page and POSIX: posix_fadvise(2) returns an error number on failure. Reported by: jilles MFC after: 1 week	2015-10-03 22:27:14 +00:00
markj	f73def85e6	The return value of posix_fadvise(2) is just an error status, so sys_posix_fadvise() should simply return the errno (or 0) to syscallenter() rather than setting a return value. MFC after: 1 week	2015-10-03 19:37:41 +00:00
alc	df033a6930	Perform a single batched update to the object's paging-in-progress count rather than updating it for each page.	2015-10-03 17:04:52 +00:00
phk	0c2e89c505	Fail the sbuf if vsnprintf(3) fails.	2015-10-02 09:23:14 +00:00
markj	49cededc7e	Ensure that vop_stdadvise() does not call getblk() on vnodes that have an empty bufobj. Otherwise, vnodes belonging to filesystems that do not use the buffer cache may trigger assertion failures. Reported by: Fabien Keil	2015-10-01 16:34:53 +00:00
cperciva	8e136c4370	Disable suspend when we're shutting down. This solves the "tell FreeBSD to shut down; close laptop lid" scenario which otherwise tended to end with a laptop overheating or the battery dying. The implementation uses a new sysctl, kern.suspend_blocked; init(8) sets this while rc.suspend runs, and the ACPI sleep code ignores requests while the sysctl is set. Discussed on: freebsd-acpi (35 emails) MFC after: 1 week	2015-10-01 10:52:26 +00:00
markj	6348241c12	As a step towards the elimination of PG_CACHED pages, rework the handling of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726	2015-09-30 23:06:29 +00:00
avg	425c0bb088	save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters where n is typically smaller than 5. Perhaps SDT_PROBE should be made a private implementation detail. MFC after: 20 days	2015-09-28 12:14:16 +00:00
jeff	5ba0fc23d6	- Collapse vfs_vmio_truncate & vfs_vmio_release into a single function. - Allow vfs_vmio_invalidate() to free the pages, leaving us with a single loop and bufobj lock when B_NOCACHE/B_INVAL is used. - Eliminate the special B_ASYNC handling on free that has not been relevant for some time. - Remove the extraneous page busy from vfs_vmio_truncate(). Reviewed by: kib Tested by: pho Sponsored by: EMC / Isilon storage division	2015-09-27 05:16:06 +00:00
markj	bb2ac24a51	Remove a check for a condition that is always false by a preceding KASSERT that was added in r144704.	2015-09-26 22:26:55 +00:00
markj	3eefaf542b	Fix argument ordering in vn_printf(). MFC after: 3 days	2015-09-26 22:16:54 +00:00
cem	d13a26b53a	sbuf: Process more than one char at a time Revamp sbuf_put_byte() to sbuf_put_bytes() in the obvious fashion and fixup callers. Add a thin shim around sbuf_put_bytes() with the old ABI to avoid ugly changes to some callers. Reviewed by: jhb, markj Obtained from: Dan Sledz Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3717	2015-09-25 18:37:14 +00:00
kib	a20643c9c7	Use per-cpu values for base and last in tc_cpu_ticks(). The values are updated lockess, different CPUs write its own view of timecounter state. The critical section is done for safety, callers of tc_cpu_ticks() are supposed to already enter critical section, or to own a spinlock. The change fixes sporadical reports of too high values reported for the (W)CPU on platforms that do not provide cpu ticker and use tc_cpu_ticks(), in particular, arm*. Diagnosed and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-25 13:03:57 +00:00
mjg	93991d4618	kqueue: simplify kern_kqueue by not refing/unrefing creds too early No functional changes.	2015-09-23 12:45:08 +00:00
jeff	9a362d63de	- Fix a nonsense reordering that somehow slipped into my last diff. Reported by: pho	2015-09-23 07:44:07 +00:00
jeff	b17d09da88	Some refactoring of the buf/vm interface. - Eliminate bogus page replacement that is inconsistently applied in the invalidation loop in brelse. This has been a no-op in modern times as biodone() is responsible for cleaning up after bogus pages. This would've spammed the console with printfs at a minimum. - Allow the compiler and human readers alike to reason about allocbuf() by splitting it into constituent parts. - Separate the VM manipulating and buf manipulating code in brelse() and bufdone() so that the intentions are clear. This makes it evident that there are several duplicated buf pages loops that will be consolidated at a later time. Reviewed by: kib Tested by: pho Sponsored by: EMC / Isilon Storage Division	2015-09-22 23:57:52 +00:00
alc	7567a2924b	Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified queue and (2) returns a Boolean indicating whether the page's wire count transitioned to zero. Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing a page that is about to be freed. (An earlier version of this change was developed by attilio@ and kmacy@. Any errors in this version are my own.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-09-22 18:16:52 +00:00
hselasky	d198450d08	Revert r287780 until more developers have their say. Differential Revision: https://reviews.freebsd.org/D3521 Requested by: gnn	2015-09-22 06:51:55 +00:00
bdrewery	2be5ce0af5	vfs_mountroot_shuffle() never returns non-zero.	2015-09-22 03:34:07 +00:00
kib	9704d25bc3	Ensure that maxproc does not exceed pid_max, at the time of boot. Changes to kern.pid_max mib after the boot can break this relation. The maxfiles value was calculated by the MAXFILES formula based on maxproc value, but this change decouples them, and MAXFILES now references maxusers. Without manual tuning, the maxfiles default value remains as it was prior to this commit. But for systems which have tuned maxproc and rely on maxfiles to adjust, additional reconfiguration is needed. Reported by: rwatson Reviewed by: emaste Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-21 15:02:59 +00:00
kib	518734671f	Add support for weak symbols to the kernel linkers. It means that linkers no longer raise an error when undefined weak symbols are found, but relocate as if the symbol value was 0. Note that we do not repeat the mistake of userspace dynamic linker of making the symbol lookup prefer non-weak symbol definition over the weak one, if both are available. In fact, kernel linker uses the first definition found, and ignores duplicates. Signature of the elf_lookup() and elf_obj_lookup() functions changed to split result/error code and the symbol address returned. Otherwise, it is impossible to return zero address as the symbol value, to MD relocation code. This explains the mechanical changes in elf_machdep.c sources. The powerpc64 R_PPC_JMP_SLOT handler did not checked error from the lookup() call, the patch leaves the code as is (untested). Reported by: glebius Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-20 01:27:59 +00:00
trasz	bd8e12dd02	Kernel part of reroot support - a way to change rootfs without reboot. Note that the mountlist manipulations are somewhat fragile, and not very pretty. The reason for this is to avoid changing vfs_mountroot(), which is (obviously) rather mission-critical, but not very well documented, and thus hard to test properly. It might be possible to rework it to use its own simple root mount mechanism instead of vfs_mountroot(). Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D2698	2015-09-18 17:32:22 +00:00
jhb	9d7ff3e89e	Always clear TDB_USERWR before fetching system call arguments. The TDB_USERWR flag may still be set after a debugger detaches from a process via PT_DETACH. Previously the flag would never be cleared forcing a double fetch of the system call arguments for each system call. Note that the flag cannot be cleared at PT_DETACH time in case one of the threads in the process is currently stopped in syscallenter() and the debugger has modified the arguments for that pending system call before detaching. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3678	2015-09-16 20:55:00 +00:00
jhb	d32c347344	When a process group leader exits, all of the processes in the group are sent SIGHUP and SIGCONT if any of the processes are stopped. Currently this behavior is triggered for any type of process stop including ptrace() stops and transient stops for single threading during exit() and execve(). Thus, if a debugger is attached to a process in a group when the leader exits, the entire group can be HUPed. Instead, only send the signals if a process in the group is stopped due to SIGSTOP. PR: 201149 Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3681	2015-09-16 16:40:07 +00:00
mjg	8ada4c3bbf	sysctl: switch sysctllock to a sleepable rmlock, take 2 This restores r285125. Previous attempt was reverted due to a bug in rmlocks, which is fixed since r287833.	2015-09-15 23:06:56 +00:00
jhb	d7f36d77c9	Threads holding a read lock of a sleepable rm lock are not permitted to sleep. The rmlock implementation enforces this by disabling sleeping when a read lock is acquired. To simplify the implementation, sleeping is disabled for most of the duration of rm_rlock. However, it doesn't need to be disabled until the lock is acquired. If a sleepable rm lock is contested, then rm_rlock may need to acquire the backing sx lock. This tripped the overly-broad assertion. Fix by relaxing the assertion around the call to sx_xlock(). Reported by: mjg Reviewed by: kib, mjg MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3324	2015-09-15 22:16:21 +00:00
cem	9fe341127e	kevent(2): Note DOOMED vnodes with NOTE_REVOKE In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at registration will never be woken by the RECLAIM trigger.) Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that were fine at registration but are vgoned while being monitored should signal waiters.) Reviewed by: kib Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3675	2015-09-15 20:22:30 +00:00
hselasky	1a99c0928d	Implement callout_drain_async(), inspired by the projects/hps_head branch. This function is used to drain a callout via a callback instead of blocking the caller until the drain is complete. Refer to the callout_drain_async() manual page for a detailed description. Limitation: If a lock is used with the callout, the callout can only be drained asynchronously one time unless the callout_init_mtx() function is called again. This limitation is not present in projects/hps_head and will require more invasive changes to the timeout code, which was not in the scope of this patch. Differential Revision: https://reviews.freebsd.org/D3521 Reviewed by: wblock MFC after: 1 month	2015-09-14 10:52:26 +00:00
imp	dbff665874	bufdonebio is now unused. Retire it too.	2015-09-11 04:20:04 +00:00
markj	e8967c8bd9	Add stack_save_td_running(), a function to trace the kernel stack of a running thread. It is currently implemented only on amd64 and i386; on these architectures, it is implemented by raising an NMI on the CPU on which the target thread is currently running. Unlike stack_save_td(), it may fail, for example if the thread is running in user mode. This change also modifies the kern.proc.kstack sysctl to use this function, so that stacks of running threads are shown in the output of "procstat -kk". This is handy for debugging threads that are stuck in a busy loop. Reviewed by: bdrewery, jhb, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3256	2015-09-11 03:54:37 +00:00
imp	9fcb42ef36	dev_strategy and dev_strategy_csw are unused since r281825. Remove them. Differential Revision: https://reviews.freebsd.org/D3620	2015-09-11 00:38:58 +00:00
adrian	73868b5389	Also make kern.maxfilesperproc a boot time tunable. Auto-tuning threshold discussions aside, it turns out that if you want to lower this on say, rather memory-packed machines, you either set maxusers or kern.maxfiles, or you set it in sysctl. The former is a non-exact way to tune this; the latter doesn't actually affect anything in the startup scripts. This first occured because I wondered why the hell screen would take upwards of 10 seconds to spawn a new screen. I then found python doing the same thing during fork/exec of child processes - it calls close() on each FD up to the current openfiles limit. On a 1TB machine this is like, 26 million FDs per process. Ugh. So: * This allows it to be set early in /boot/loader.conf; * It can be used to work around the ridiculous situation of screen, python, etc doing a close() on potentially millions of FDs even though you only have four open. Tested: * 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring screen and python forking doesn't result in some pretty hilariously bad behaviour. TODO: * Note that the default login.conf sets openfiles-cur to unlimited, effectively obeying kern.maxfilesperproc. Perhaps we should fix this. * .. and even if we do, we need to also ensure that daemons get a soft limit of something reasonable and capped - they can request more FDs themselves. MFC after: 1 week Sponsored by: Norse Corp, Inc.	2015-09-10 04:05:58 +00:00
kib	9f66ca5a18	For open("name", O_DIRECTORY \| O_CREAT), do not try to create the named node, open(2) cannot create directories. But do allow the flag combination to succeed if the directory already exists. Declare the open("name", O_DIRECTORY \| O_CREAT \| O_EXCL) always invalid for the same reason, since open(2) cannot create directory. Note that there is an argument that O_DIRECTORY \| O_CREAT should be invalid always, regardless of the target directory existence or O_EXCL. The current fix is conservative and allows the call to succeed in the situation where it succeeded before the patch. Reported by: Tom Ridge <freebsd@tom-ridge.com> Reviewed by: rwatson PR: 202892 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-09 19:31:08 +00:00
mjg	519b1a7211	fd: make rights a mandatory argument to fgetvp_rights The only caller already always passes rights.	2015-09-07 20:05:56 +00:00
mjg	cc8534cb73	fd: make the common case in filecaps_copy work lockless The filedesc lock is only needed if ioctls caps are present, which is a rare situation. This is a step towards reducing the scope of the filedesc lock.	2015-09-07 20:02:56 +00:00
cem	a8fae65d69	Follow-up to r287442: Move sysctl to compiled-once file Avoid duplicate sysctl nodes. Found by: tijl Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3586	2015-09-07 16:44:28 +00:00
mckusick	17357462a0	Track changes to kern.maxvnodes and appropriately increase or decrease the size of the name cache hash table (mapping file names to vnodes) and the vnode hash table (mapping mount point and inode number to vnode). An appropriate locking strategy is the key to changing hash table sizes while they are in active use. Reviewed by: kib Tested by: Peter Holm Differential Revision: https://reviews.freebsd.org/D2265 MFC after: 2 weeks	2015-09-06 05:50:51 +00:00
delphij	db0a2c953d	Expose an interface to determine if an ACE is inherited. Submitted by: sef Reviewed by: trasz MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3540	2015-09-04 00:14:20 +00:00
cem	f96df638b8	Detect badly behaved coredump note helpers Coredump notes depend on being able to invoke dump routines twice; once in a dry-run mode to get the size of the note, and another to actually emit the note to the corefile. When a note helper emits a different length section the second time around than the length it requested the first time, the kernel produces a corrupt coredump. NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' fd table via vn_fullpath. As vnodes may move around during dump, this is racy. So: - Detect badly behaved notes in putnote() and pad underfilled notes. - Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to exercise the NT_PROCSTAT_FILES corruption. It simply picks random lengths to expand or truncate paths to in fo_fill_kinfo_vnode(). - Add a sysctl, kern.coredump_pack_fileinfo, to allow users to disable kinfo packing for PROCSTAT_FILES notes. This should avoid both FILES note corruption and truncation, even if filenames change, at the cost of about 1 kiB in padding bloat per open fd. Document the new sysctl in core.5. - Fix note_procstat_files to self-limit in the 2nd pass. Since sometimes this will result in a short write, pad up to our advertised size. This addresses note corruption, at the risk of sometimes truncating the last several fd info entries. - Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the zero padding. With suggestions from: bjk, jhb, kib, wblock Approved by: markj (mentor) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3548	2015-09-03 20:32:10 +00:00
mjg	0c1fc3bcd2	fd: remove UMA_ZONE_ZINIT argument from Files zone Originally it was added in order to prevent trashing of objects with INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE. This reverts r286921. Discussed with: kib	2015-09-02 23:14:39 +00:00
jhb	0e4166b711	The 'sa' argument to syscallret() is not unused.	2015-09-01 22:28:23 +00:00
jhb	5c94ee0044	Export current system call code and argument count for system call entry and exit events. procfs stop events for system call tracing report these values (argument count for system call entry and code for system call exit), but ptrace() does not provide this information. (Note that while the system call code can be determined in an ABI-specific manner during system call entry, it is not generally available during system call exit.) The values are exported via new fields at the end of struct ptrace_lwpinfo available via PT_LWPINFO. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3536	2015-09-01 22:24:54 +00:00
kib	7f517dc1f3	Exit notification for EVFILT_PROC removes knote from the knlist. In particular, this invalidates the knote kn_link linkage, making the SLIST_FOREACH() loop accessing undefined values (e.g. trashed by QUEUE_MACRO_DEBUG). If the knote is freed by other thread when kq lock is released or when influx is cleared, e.g. by knote_scan() for kqueue owning the knote, the iteration step would access freed memory. Use SLIST_FOREACH_SAFE() to fix iteration. Diagnosed by: avg Tested by: avg, lstewart, pawel Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 14:05:29 +00:00
kib	cb97f9b456	Clean up the kqueue use of the uma KPI. Explain why it is fine to not check for M_NOWAIT failures in kqueue_register(). Remove unneeded check for NULL result from waitable allocation in kqueue_scan(). uma_free(9) handles NULL argument correctly, remove checks for NULL. Remove useless cast and adjust style in knote_alloc(). Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 13:21:32 +00:00
avg	a047794b25	callout_reset: fix a reversed check for cc_exec_cancel The typo was introduced in r278469 / `344ecf88af`. As a result of the bug there was a timing window where callout_reset() would fail to cancel a concurrent execution of a callout that is about to start and would schedule the callout again. The callout would fire more times than it is scheduled. That would happen even if the callout is initialized with a lock. For example, the bug triggered the "Stray timeout" assertion in taskqueue_timeout_func(). MFC after: 5 days	2015-09-01 09:27:14 +00:00
kib	b1635150b8	Use P1B_PRIO_MAX to designate max posix priority for the RR/FIFO scheduler types. It was intended to be used there, compare with the min value, and with the test for correctness in ksched_setscheduler(). Note that P1B_PRIO_MAX and RTP_PRIO_MAX do have the same numerical values, the change is cosmetical. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-30 18:02:57 +00:00
kib	5b7cf788aa	Remove single-use macros obfuscating malloc(9) and free(9) calls. Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-30 17:58:11 +00:00
jch	f011b0c89e	Revert r286880: If at first this change made sense, it turns out it helps only the TCP timers callout(9) usage. As the benefit for others callout(9) usages did not reach a consensus the historical usage should prevail. Differential Revision: https://reviews.freebsd.org/D3078	2015-08-30 13:44:46 +00:00
imp	ae67ea5282	Remove now obsolete comment. MFC After: 2 days	2015-08-28 20:06:58 +00:00
imp	48cf3d6ef2	Per overwhelming sentiment in the code review, use FEATURE instead. Differential Revision: https://reviews.freebsd.org/D3488 MFC After: 2 days	2015-08-28 19:53:19 +00:00
ed	066f63003b	Decompose linkat()/renameat() rights to source and target. To make it easier to understand how Capsicum interacts with linkat() and renameat(), rename the rights to CAP_{LINK,RENAME}AT_{SOURCE,TARGET}. This also addresses a shortcoming in Capsicum, where it isn't possible to disable linking to files stored in a directory. Creating hardlinks essentially makes it possible to access files with additional rights. Reviewed by: rwatson, wblock Differential Revision: https://reviews.freebsd.org/D3411	2015-08-27 15:16:41 +00:00
jch	761d6ba633	Silent a compilation warning on callout_stop()	2015-08-27 10:43:35 +00:00
jch	4015a832a3	In callout_stop(), do not forget to initialize not_running variable. Thanks to hselasky for noticing that. Differential Revision: https://reviews.freebsd.org/D3078 (Updated) Submitted by: hselasky Pointy hat to: jch	2015-08-27 08:58:03 +00:00
jch	433b16759b	In callout_stop(), if a callout is both pending and currently being serviced return 0 (fail) but it is applicable only mpsafe callouts. Thanks to hselasky for finding this. Differential Revision: https://reviews.freebsd.org/D3078 (Updated) Submitted by: hselasky Reviewed by: jch	2015-08-27 08:15:32 +00:00
marcel	22e802147b	An error of -1 from parse_mount() indicates that the specification was invalid. Don't trigger a mount failure (which by default means a panic), but instead just move on to the next directive in the configuration. This typically has us ask for the root mount. PR: 163245	2015-08-27 04:25:27 +00:00
imp	b82b5280a1	When the kernel is compiled with INVARIANTS, export that as debug.invariants. Differential Revision: https://reviews.freebsd.org/D3488 MFC after: 3 days	2015-08-26 23:58:03 +00:00
gnn	304a5ae6cb	Summary: Add the interactivity equations to the header comment for our interactivity calculation routine. Suggested by: rwatson	2015-08-26 16:36:41 +00:00
trasz	c1207ee9b8	Make vfs_unmountall() unmount /dev after /, not before. The only reason this didn't result in an unclean shutdown is that devfs ignores MNT_FORCE flag. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3467	2015-08-24 13:18:13 +00:00
trasz	9e31188bdc	After r286237 it should be fine to call vgone(9) on a busy GEOM vnode; remove KASSERT that would prevent forced devfs unmount from working. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-08-23 14:53:54 +00:00
royger	5b319fbe38	preload_search_info: make sure mod is set Add a check to preload_search_info to make sure mod is set. Most of the callers of preload_search_info don't check that the mod parameter is set, which can cause page faults. While at it, remove some now unnecessary checks before calling preload_search_info. Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3440	2015-08-21 15:57:57 +00:00
kib	fd746f6246	If process becomes reaper (procctl(PROC_REAP_ACQUIRE)) while already having some children, the children' reaper is not reset to the parent. This allows for the situation where reaper has children but not descendands and the too strict asserts in the reap_status() fire. Remove the wrong asserts, add some clarification for the situation to the procctl(2) REAP_STATUS. Reported and tested by: feld Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-20 22:44:26 +00:00
kib	730b59f53e	fget_unlocked() depends on the freed struct file f_count field being zero. The file_zone if no-free, but r284861 added trashing of the freed memory. Most visible manifestation of the issue were 'memory modified after free' panics for the file zone, triggered from falloc_noinstall(). Add UMA_ZONE_ZINIT flag to turn off trashing. Mjg noted that it makes sense to not trash freed memory for any non-free zone, which will be done later. Reported and tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation	2015-08-19 11:53:32 +00:00
jch	7242654eab	callout_stop() should return 0 (fail) when the callout is currently being serviced and indeed unstoppable. A scenario to reproduce this case is: - the callout is being serviced and at same time, - callout_reset() is called on this callout that sets the CALLOUT_PENDING flag and at same time, - callout_stop() is called on this callout and returns 1 (success) even if the callout is indeed currently running and unstoppable. This issue was caught up while making r284245 (D2763) workaround, and was discussed at BSDCan 2015. Once applied the r284245 workaround is not needed anymore and will be reverted. Differential Revision: https://reviews.freebsd.org/D3078 Reviewed by: jhb Sponsored by: Verisign, Inc.	2015-08-18 10:15:09 +00:00
rpaulo	32e921b2a6	genassym.sh: call nm(1) with NMFLAGS.	2015-08-14 22:57:13 +00:00
ian	e065188170	If a specific timecounter has been chosen via sysctl, and a new timecounter with higher quality registers (presumably in a module that has just been loaded), do not undo the user's choice by switching to the new timecounter. Document that behavior, and also the fact that there is no way to unregister a timecounter (and thus no way to unload a module containing one).	2015-08-12 20:50:20 +00:00
oshogbo	7dae5ce371	When the wait*(2) syscalls wait for any process (P_ALL), they should ignore processes created with the pdfork(2) syscall. PR: 201054 Approved by: pjd (mentor) Discussed with: emaste, rwatson	2015-08-12 20:08:54 +00:00
ed	1bf9368636	Perform cleanups in response to D3307. - Document the kern_kevent_anonymous() function. - Add assertions to ensure that we don't silently leave the kqueue linked from a file descriptor table. Reviewed by: jmg Differential Revision: https://reviews.freebsd.org/D3364	2015-08-12 17:46:26 +00:00
ed	e8ba6b4817	Properly return ENOTDIR when calling at() on a non-vnode. We already properly return ENOTDIR when calling at() on a non-directory vnode, but it turns out that if you call it on a socket, we see EINVAL. Patch up namei to properly translate this to ENOTDIR.	2015-08-12 16:17:00 +00:00
ed	cd3dfbae4e	Unignore signals when starting CloudABI processes. As CloudABI processes cannot adjust their signal handlers, we need to make sure that we start up CloudABI processes with consistent signal masks. Though the POSIx standard signal behavior is all right, we do need to make sure that we ignore SIGPIPE, as it would otherwise be hard to interact with pipes and sockets. Extend execsigs() to iterate over ps_sigignore and call sigdflt() for each of the ignored signals. Reviewed by: kib Obtained from: https://github.com/NuxiNL/freebsd Differential Revision: https://reviews.freebsd.org/D3365	2015-08-12 11:30:31 +00:00
ed	598c6e70ce	Add support for anonymous kqueues. CloudABI's polling system calls merge the concept of one-shot polling (poll, select) and stateful polling (kqueue). They share the same data structures. Extend FreeBSD's kqueue to provide support for waiting for events on an anonymous kqueue. Unlike stateful polling, there is no need to support timeouts, as an additional timer event could be used instead. Furthermore, it makes no sense to use a different number of input and output kevents. Merge this into a single argument. Obtained from: https://github.com/NuxiNL/freebsd Differential Revision: https://reviews.freebsd.org/D3307	2015-08-11 13:47:23 +00:00
ed	8fba8d3894	Introduce kern_cap_rights_limit(). The existing sys_cap_rights_limit() expects that a cap_rights_t object lives in userspace. It is therefore hard to call into it from kernelspace. Move the interesting bits of sys_cap_rights_limit() into kern_cap_rights_limit(), so that we can call into it from the CloudABI compatibility layer. Obtained from: https://github.com/NuxiNL/freebsd Differential Revision: https://reviews.freebsd.org/D3314	2015-08-11 08:43:50 +00:00
kib	9033c894a1	Make kstack_pages a tunable on arm, x86, and powepc. On i386, the initial thread stack is not adjusted by the tunable, the stack is allocated too early to get access to the kernel environment. See TD0_KSTACK_PAGES for the thread0 stack sizing on i386. The tunable was tested on x86 only. From the visual inspection, it seems that it might work on arm and powerpc. The arm USPACE_SVC_STACK_TOP and powerpc USPACE macros seems to be already incorrect for the threads with non-default kstack size. I only changed the macros to use variable instead of constant, since I cannot test. On arm64, mips and sparc64, some static data structures are sized by KSTACK_PAGES, so the tunable is disabled. Sponsored by: The FreeBSD Foundation MFC after: 2 week	2015-08-10 17:18:21 +00:00
melifaro	627f722ad6	Add const-qualifiers for source mbuf argument in m_dup(), m_copym(), m_dup_pkthdr() and m_tag_copy_chain().	2015-08-08 15:50:46 +00:00
ian	33a3844dbf	Only process the PPS event types currently enabled in pps_params.mode. This makes the PPS API behave correctly, but isn't ideal -- we still end up capturing PPS data for non-enabled edges, we just don't process the data into an event that becomes visible outside of kern_tc. That's because the event type isn't passed to pps_capture(), so it can't do the filtering. Any solution for capture filtering is going to require touching every driver.	2015-08-07 23:31:31 +00:00
ian	cbf6cdcbbb	RFC 2783 requires a status of ETIMEDOUT, not EWOULDBLOCK, on a timeout.	2015-08-07 21:14:19 +00:00
ed	0698a33dea	Allow the creation of kqueues with a restricted set of Capsicum rights. On CloudABI we want to create file descriptors with just the minimal set of Capsicum rights in place. The reason for this is that it makes it easier to obtain uniform behaviour across different operating systems. By explicitly whitelisting the operations, we can return consistent error codes, but also prevent applications from depending OS-specific behaviour. Extend kern_kqueue() to take an additional struct filecaps that is passed on to falloc_caps(). Update the existing consumers to pass in NULL. Differential Revision: https://reviews.freebsd.org/D3259	2015-08-05 07:36:50 +00:00
ed	4a54322f0b	Make it possible to implement poll(2) on top of kqueue(2). It looks like EVFILT_READ and EVFILT_WRITE trigger under the same conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX. The only difference is that POLLRDNORM has to be triggered on regular files unconditionally, whereas EVFILT_READ only triggers when not EOF. Introduce a new flag, NOTE_FILE_POLL, that can be used to make EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag will be used by cloudlibc's poll() function. Reviewed by: jmg Differential Revision: https://reviews.freebsd.org/D3303	2015-08-05 07:34:29 +00:00
kib	abef15cd81	Copy the fencing of the algorithm to do lock-less update and reading of the timehands, from the kern_tc.c implementation to vdso. Add comments giving hints where to look for the algorithm explanation. To compensate the removal of rmb() in userspace binuptime(), add explicit lfence instruction before rdtsc. On i386, add usual complications to detect SSE2 presence; assume that old CPUs which do not implement SSE2 also execute rdtsc almost in order. Reviewed by: alc, bde (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-08-04 12:33:51 +00:00
trasz	b8165ab6a9	Mark vgonel() as static. It was already declared static earlier; no idea why compilers don't warn about this. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-08-04 08:51:56 +00:00
ed	2f1d917e8e	Fix bad arithmetic in umtx_key_get() to compute object offset. It looks like umtx_key_get() has the addition and subtraction the wrong way around, meaning that it fails to match in certain cases. This causes the cloudlibc unit tests to deadlock in certain cases. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3287	2015-08-04 06:01:13 +00:00
ed	e5726c0608	Add missing const keyword to function parameter. The umtx_key_get() function does not dereference the address off the userspace object. The pointer can safely be const.	2015-08-03 21:11:33 +00:00

1 2 3 4 5 ...

14687 Commits