freebsd-dev

Author	SHA1	Message	Date
Alexander Motin	abeb9f61f9	Increase MTX_POOL_SLEEP_SIZE from 128 to 1024. This value remained unchanged for 15 years, and now this bump reduces lock spinning in GEOM and BIO layers while doing ~1.6M IOPS to 4 NVMe on 72-core system from ~25% to ~5% by the cost of additional 28KB RAM. While there, align struct mtx_pool fields to cache lines. MFC after: 1 month	2018-12-24 23:52:35 +00:00
Konstantin Belousov	6c59824b31	Properly test for vmio buffer in bnoreuselist(). The presence of allocated v_object does not imply that the buffer is necessarily of the VMIO kind. The buffer might have been allocated before the object was created, in which case the buffer is malloced. Although we try to avoid such a situation, it seems to be still legitimate. Reported and tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-12-23 18:52:02 +00:00
Bruce Evans	5ef4f86d7a	Oops, rounddown() for the start was misspelled roundup() in r342295, so only aligned starts worked. This broke releasing caches in most cases where the i/o size is smaller than the fs block size.	2018-12-22 09:31:55 +00:00
Bruce Evans	2c0434acb0	Fix rounding in vop_stdadvise() for POSIX_FADV_NOREUSE (really POSIX_FADV_DONTNEED). The most broken case was for applications that advise for the whole file and then do block-aligned i/o's 1 block at a time. Then advice is sent to VOP_ADVISE() 1 block at a time, but in vop_stdadvise() the 1-block advice was turned into 0-block advice for the buffer cache part. The bugs were caused partly by callers representing the region as (a_start, a_end), where a_end is actually the maximum, and everything else representing the region as (start, end) where 'end' is actually the end (1 after the maximum). The maximum a_end must be rounded up, but was rounded down. Also, rounding to page boundaries was inconsistent. The bugs and fixes have no effect for zfs and other file systems that don't use the buffer cache or the page cache. Most or all file systems currently use the default VOP_FADVISE(), but it finds a null buffer cache and a null page cache for file systems that don't use normal methods. Reviewed by: kib	2018-12-21 04:57:59 +00:00
Kirk McKusick	13c31c29ca	Some filesystems (like cd9660 and ext3) require that VFS_STATFS() be called before VFS_ROOT() is called. Move the call for VFS_STATFS() so that it is done after VFS_MOUNT(), but before VFS_ROOT(). This change actually improves the robustness of the mount system call because it returns an error rather than failing silently when VFS_STATFS() returns failure. Reported by: Rebecca Cran <rebecca@bluestop.org> Sponsored by: Netflix	2018-12-21 01:09:25 +00:00
Mateusz Guzik	3e0178fb94	Check for probes enabled in priv_check_cred before evaluating the error. Sponsored by: The FreeBSD Foundation	2018-12-19 23:28:29 +00:00
Mateusz Guzik	628888f0e0	Remove iBCS2, part2: general kernel Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation	2018-12-19 21:57:58 +00:00
Mateusz Guzik	19b75ef59a	Microoptimize corner case of ID bitmap handling. Prior to the change we would avoidably test more possibly used IDs. While here update the comment: there is no pidchecked variable anymore.	2018-12-19 20:29:52 +00:00
Mateusz Guzik	7d065d876e	Deinline vfork handling out of the syscall return path. vfork is rarely called (comparatively to other syscalls) and it avoidably pollutes the fast path. Sponsored by: The FreeBSD Foundation	2018-12-19 20:27:26 +00:00
Mark Johnston	26e9d9b01f	Fix DDB's "show malloc" after r338899. MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-12-19 00:17:22 +00:00
Brooks Davis	10f7b12c13	const poison the `new` pointer of __sysctl. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D18444	2018-12-18 12:44:38 +00:00
Andriy Gapon	82a5a27527	add support for marking interrupt handlers as suspended The goal of this change is to fix a problem with PCI shared interrupts during suspend and resume. I have observed a couple of variations of the following scenario. Devices A and B are on the same PCI bus and share the same interrupt. Device A's driver is suspended first and the device is powered down. Device B generates an interrupt. Interrupt handlers of both drivers are called. Device A's interrupt handler accesses registers of the powered down device and gets back bogus values (I assume all 0xff). That data is interpreted as interrupt status bits, etc. So, the interrupt handler gets confused and may produce some noise or enter an infinite loop, etc. This change affects only PCI devices. The pci(4) bus driver marks a child's interrupt handler as suspended after the child's suspend method is called and before the device is powered down. This is done only for traditional PCI interrupts, because only they can be shared. At the moment the change is only for x86. Notable changes in core subsystems / interfaces: - BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus interface along with convenience functions bus_suspend_intr and bus_resume_intr; - rman_set_irq_cookie and rman_get_irq_cookie functions are added to provide a way to associate an interrupt resource with an interrupt cookie; - intr_event_suspend_handler and intr_event_resume_handler functions are added to the MI interrupt handler interface. I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to implement the new intr_event functions. IH_SUSP marks a suspended interrupt handler. IH_CHANGED is used to implement a barrier that ensures that a change to the interrupt handler's state is visible to future interrupts. While there, I fixed some whitespace issues in comments and changed a couple of logically boolean variables to be bool. MFC after: 1 month (maybe) Differential Revision: https://reviews.freebsd.org/D15755	2018-12-17 17:11:00 +00:00
Kirk McKusick	17ca94cfc0	Clarify panic in set_rootvnode(). Check for panic in vfs_mountroot_shuffle(). Sponsored by: Netflix	2018-12-15 19:18:58 +00:00
Kirk McKusick	e04d2a3c5a	Under UFS/FFS the VFS_ROOT() function will return an error if the inode check-hash fails. Panic'ing is not an appropriate response. So, check for an error return from VFS_ROOT() and when an error is reported, unwind and return the error. Reported by: Gary Jennejohn (gj) Sponsored by: Netflix	2018-12-15 19:04:50 +00:00
Mateusz Guzik	24d64be4c5	vfs: mostly depessimize NDINIT_ALL 1) filecaps_init was unnecessarily a function call 2) an assignment at the end was preventing tail calling of cap_rights_init Sponsored by: The FreeBSD Foundation	2018-12-14 03:55:08 +00:00
Mateusz Guzik	cc426dd319	Remove unused argument to priv_check_cred. Patch mostly generated with coccinelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation	2018-12-11 19:32:16 +00:00
Mateusz Guzik	6b2d61136f	fd: dedup code in sys_getdtablesize Sponsored by: The FreeBSD Foundation	2018-12-11 12:08:18 +00:00
Mateusz Guzik	73e62bc9bb	Make lim_cur inline if possible. It is a function call only to accommodate some ABIs which install a hook. They only care for 3 types of limits: DATA, STACK, VMEM Instead of always calling the func, see at compilation time if the requested limit is something else and just do the read if so. Sponsored by: The FreeBSD Foundation	2018-12-11 12:01:46 +00:00
Mateusz Guzik	86db4d40ac	fd: tidy up closing a fd - avoid a call to knote_close in the common case - annotate mqueue as unlikely Sponsored by: The FreeBSD Foundation	2018-12-11 11:58:44 +00:00
Mateusz Guzik	663de8167e	fd: stop looking for exact freefile after allocation If a lower fd is closed later, the lookup goes to waste. Allocation always performs the lookup anyway. Sponsored by: The FreeBSD Foundation	2018-12-11 11:57:12 +00:00
Konstantin Belousov	94dd54b9a2	Free bootstacks after AP startup. Bootstacks are unused after APs executed sched_throw() in init_secondary_tail() and started executing on proper idle thread stack. Add sysinit that detects that the idle thread for each CPU was scheduled at least once, and free corresponding bootstack. Slight addition of the code (~200 bytes) is compensated by the saving, because even on typical small modern desktop CPU we leak 128K of memory otherwise (4 pages x 8 threads). Reviewed by: jhb MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18486	2018-12-11 02:54:36 +00:00
Konstantin Belousov	eba8ab0e3e	Remove special case handling for getfhat(fd, NULL, handle). There is no reason for it to behave differently from openat(fd, NULL). Also the handling did not work because the substituted path was from the system address space, causing EFAULT. Submitted by: Jack Halford <jack@gandi.net> MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18501	2018-12-11 02:48:49 +00:00
John Baldwin	c5786670ac	Don't report stale signal information for non-signal events in ptrace_lwpinfo. Once a signal's siginfo was copied to 'td_si' as part of the signal exchange in issignal(), it was never cleared. This caused future thread events that are reported as SIGTRAP events without signal information to report the stale siginfo in 'td_si'. For example, if a debugger created a new process and used SIGSTOP to stop it after PT_ATTACH, future system call entry / exit events would set PL_FLAG_SI with the SIGSTOP siginfo in pl_siginfo. This broke 'catch syscall' in current versions of gdb as it assumed PL_FLAG_SI with SIGTRAP indicates a breakpoint or single step trap. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18487	2018-12-10 19:39:24 +00:00
Alan Cox	2905d1ceaf	blst_leaf_alloc updates bighint for a leaf when an allocation is successful and includes the last block represented by the leaf. The reasoning is that, if the last block is included, then there must be no solution before that one in the leaf, so the leaf cannot provide an allocation that big again; indeed, the leaf cannot provide a solution bigger than range1. Which is all correct, except that if the value of blk passed in did not represent the first block of the leaf, because the cursor was pointing to the middle of the leaf, then a possible solution before the cursor may have been ignored, and bighint cannot be updated. Consider the sequence allocate 63 (returning address 0), free 0,63 (freeing that same block), and allocate 1 (returning 63). The result is that one block is allocated from the first leaf, and the value of bighint is 0, so that nothing can be allocated from that leaf until the only block allocated from that leaf is freed. This change detects that skipped-over solution, and when there is one it makes sure that the value of bighint is not changed when the last block is allocated. Submitted by: Doug Moore <dougm@rice.edu> Tested by: pho X-MFC with: r340402 Differential Revision: https://reviews.freebsd.org/D18474	2018-12-09 17:55:10 +00:00
Mateusz Guzik	6017827676	umtx: avoid umtxshm locking on object termination if possible Sample build world result on tmpfs: kern.ipc.umtx_terminate_notempty: 0 kern.ipc.umtx_terminate_empty: 2891815 Sponsored by: The FreeBSD Foundation	2018-12-08 14:04:57 +00:00
Mateusz Guzik	b0b246b0ba	Remove proctree acquire from note_procstat_proc It is not needed since r340482 ("proc: always store parent pid in p_oppid") Sponsored by: The FreeBSD Foundation	2018-12-08 11:38:39 +00:00
Mateusz Guzik	eab2132ad9	Fix a corner case in ID bitmap management. If all IDs from trypid to pid_max were used as pids, the code would enter a loop which would be infinite if none of the IDs could become free (e.g. they all belong to processes which did not transition to zombie). Fixes: r341684 ("Manage process-related IDs with bitmaps") Sponsored by: The FreeBSD Foundation	2018-12-08 10:22:12 +00:00
Mateusz Guzik	e52327e3c5	proc: postpone proc unlock until after reporting with kqueue kqueue would always relock immediately afterwards. While here drop the NULL check for list itself. The list is always allocated. Sponsored by: The FreeBSD Foundation	2018-12-08 06:34:12 +00:00
Mateusz Guzik	eadb1dcb71	proc: handle sdt exit probe before taking the proc lock Sponsored by: The FreeBSD Foundation	2018-12-08 06:31:43 +00:00
Mateusz Guzik	13a45e4b14	Provide SDT_PROBES_ENABLED macro. Sponsored by: The FreeBSD Foundation	2018-12-08 06:30:41 +00:00
Konstantin Belousov	18519f1583	Simplify kern_readlink_vp(). When we detected that the vnode is not symlink, return immediately. This moves the readlink code out of else branch and unindents it. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-12-07 23:07:51 +00:00
Konstantin Belousov	978f879483	Fix expression evaluation. Braces were put in the wrong place, causing failing EAGAIN check to return zero result. Remove the problematic assignment from the conditional expression altogether. While there, remove the used-once variable vp, and wrap a too-long line. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-12-07 23:05:12 +00:00
Mateusz Guzik	08d005e6a3	fd: use racct_set_unlocked Sponsored by: The FreeBSD Foundation	2018-12-07 16:51:38 +00:00
Mateusz Guzik	448db4f761	racct: add RACCT_ENABLED macro and racct_set_unlocked This allows removing PROC_LOCK/UNLOCK pairs spread throughout the kernel only used to appease racct_set. Sponsored by: The FreeBSD Foundation	2018-12-07 16:47:34 +00:00
Mateusz Guzik	82f4b82634	fd: try to do less work with the lock in dup Sponsored by: The FreeBSD Foundation	2018-12-07 16:44:52 +00:00
Mateusz Guzik	6ff4688b09	Replace hand-rolled unrefs if > 1 with refcount_release_if_not_last Sponsored by: The FreeBSD Foundation	2018-12-07 16:11:45 +00:00
Konstantin Belousov	fd52edaf70	Regen.	2018-12-07 15:19:00 +00:00
Konstantin Belousov	d1fd400a80	Add new file handle system calls. Namely, getfhat(2), fhlink(2), fhlinkat(2), fhreadlink(2). The syscalls are provided for a NFS userspace server (nfs-ganesha). Submitted by: Jack Halford <jack@gandi.net> Sponsored by: Gandi.net Tested by: pho Feedback from: brooks, markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18359	2018-12-07 15:17:29 +00:00
Mateusz Guzik	b1fbffe73c	proc: when exiting move to zombproc before taking proctree The kernel was already doing this prior to r329615. It was changed to reduce contention on allproc. However, introduction of pidhash locks and removal of proctree -> allproc ordering from fork thanks to bitmaps fixed things enough to make this change pessimal. waitpid takes proctree on each call and this change (now) causes avoidable stalls if allproc is held. Sponsored by: The FreeBSD Foundation	2018-12-07 12:32:25 +00:00
Mateusz Guzik	34ebdceac0	Manage process-related IDs with bitmaps Currently unique pid allocation on fork often requires a full walk of process, group, session lists to make sure it is not used by anything. This has a side effect of requiring proctree to be held along with allproc, which adds more contention in poudriere -j 128. The patch below implements trivial bitmaps which get rid of the problem. Dedicated lock is introduced to manage IDs. While here a bug was discovered: all processes would inherit reap id from the first process spawned by init. This had a side effect of keeping the ID used and when allocation rolls over to the beginning it keeps being skipped. The patch is loosely based on initial work by mjoras@. Reviewed by: kib Sponsored by: The FreeBSD Foundation	2018-12-07 12:22:32 +00:00
Mateusz Guzik	6e8c1ccbe2	Annotate Giant drop/pickup macros with __predict_false They are used in important places of the kernel with the lock not being held majority of the time. Sponsored by: The FreeBSD Foundation	2018-12-07 12:06:03 +00:00
Mark Johnston	afde86eba3	Let kern.trap_enotcap be set as a tunable. This is handy for testing programs that are run by rc. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-12-06 17:29:37 +00:00
Brooks Davis	827c3852fe	Further simplify arguments to init. With the removal of BOOTCDROM and fastboot support, this code always passed "-s" or "--". The latter simply terminates getopt(3) processing in init so we only need to pass "-s" in the single user case, or nothing in other cases. The passing of "--" seems to have been done to ensure that the number of arguments passed to init was always the same and thus that argc was the same. Also GC the write-only variable pathlen (not in reviewed version). Reviewed by: kib, jhb Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D18441	2018-12-05 19:18:16 +00:00
Alan Cox	749cdf6f3b	Terminate a blist_alloc search when a blst_meta_alloc call fails with cursor == 0. Every call to blst_meta_alloc but the one at the root is made only when the meta-node is known to include a free block, so that either the allocation will succeed, the node hint will be updated, or the last block of the meta- node range is, and remains, free. But the call at the root is made without checking that there is a free block, so in the case that every block is allocated, there is no hint update to prevent the current code from looping forever. Submitted by: Doug Moore <dougm@rice.edu> Reported by: pho Reviewed by: pho Tested by: pho X-MFC with: r340402 Differential Revision: https://reviews.freebsd.org/D17999	2018-12-05 18:26:40 +00:00
Brooks Davis	68ea829fe7	Remove never enabled support for "fastboot". This has been ifdef notyet since the import of BSD 4.4 Lite Kernel Sources in r1541. Sponsored by: DARPA, AFRL	2018-12-05 17:35:15 +00:00
Brooks Davis	7a5db3a770	Remove ifdef BOOTCDROM option to start init. When BOOTCDROM is defined (via CFLAGS as there is no config option) it causes -C to be passed to init, but our init and the version of sysinstall I glanced at in 6.x don't support -C. The last plausibly related support was removed from the tree in 1995. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D18431	2018-12-05 17:29:14 +00:00
Mateusz Guzik	f26db6948d	sx: retire SX_NOADAPTIVE The flag is not used by anything for years and supporting it requires an explicit read from the lock when entering slow path. Flag value is left unused on purpose. Sponsored by: The FreeBSD Foundation	2018-12-05 16:43:03 +00:00
Brooks Davis	41f7b25317	Remove NOARGS from oaccept. This was in the original patch, but lost in a rebase. Reported by: andrew Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15816	2018-12-04 21:56:45 +00:00
Brooks Davis	63de13cfee	Regen after r341474: Normalize COMPAT_43 syscall declarations.	2018-12-04 16:49:14 +00:00
Brooks Davis	d48719bd96	Normalize COMPAT_43 syscall declarations. Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept declare o<foo>_args structs rather than non-compat ones. Due to a failure to use NOARGS in most cases this adds only one new declaration. No changes required in freebsd32 as only ogetpagesize() is implemented and it has a 32-bit specific implementation. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15816	2018-12-04 16:48:47 +00:00
Brooks Davis	3a325dec32	Remove a needlessly clever hack to start init with sys_exec(). Construct a struct image_args with the help of new exec_args_*() helper functions and call kern_execve(). The previous code mapped a page in userspace, copied arguments out to it one at a time, and then constructed a struct execve_args all so that sys_execve() can call exec_copyin_args() to copy the data back in to a struct image_args. Opencode the part of pre_execve()/post_execve() that releases a reference to the initial vmspace. We don't need to stop threads like they do. Reviewed by: kib, jhb (prior version) Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15469	2018-12-04 00:15:47 +00:00
Mark Johnston	02164d3603	Add a missing definition for the !COMPAT_FREEBSD32 case. Reported by: jenkins MFC with: r341442 Sponsored by: The FreeBSD Foundation	2018-12-03 21:07:10 +00:00
Mark Johnston	352aaa5122	Plug memory disclosures via ptrace(2). On some architectures, the structures returned by PT_GET*REGS were not fully populated and could contain uninitialized stack memory. The same issue existed with the register files in procfs. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: kib MFC after: 3 days Security: kernel stack memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18421	2018-12-03 20:54:17 +00:00
Konstantin Belousov	200bf72793	Correct accuracy of the barrier writes accounting. Discussed with: mckusick MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-12-02 12:53:39 +00:00
Eric van Gyzen	5e38e3f5eb	Include path for tmpfs objects in vm.objects sysctl This applies the fix in r283924 to the vm.objects sysctl added by r283624 so the output will include the vnode information (i.e. path) for tmpfs objects. Reviewed by: kib, dab MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D2724	2018-11-30 04:59:43 +00:00
Brooks Davis	f373437a01	Add helper functions to copy strings into struct image_args. Given a zeroed struct image_args with an allocated buf member, exec_args_add_fname() must be called to install a file name (or NULL). Then zero or more calls to exec_args_add_arg() followed by zero or more calls to exec_args_add_env(). exec_args_adjust_args() may be called after args and/or env to allow an interpreter to be prepended to the argument list. To allow code reuse when adding arg and env variables, begin_envv should be accessed with the accessor exec_args_get_begin_envv() which handles the case when no environment entries have been added. Use these functions to simplify exec_copyin_args() and freebsd32_exec_copyin_args(). Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15468	2018-11-29 21:00:56 +00:00
Konstantin Belousov	7d2b0bd7d7	If BENEATH is specified, always latch the topping directory vnode. It is possible that we started with a relative path but during the lookup, found an absolute symlink. In this case, BENEATH handling code needs the latch, but it is too late to calculate it. While there, somewhat improve the assertions. Clear the NI_LCF_LATCH flag when the latch vnode is released, so that asserts know the state. Assert that there is a latch if we entered beneath+abs path mode, after the starting point is processed. Reported by: wulf With more input from: pho Sponsored by: The FreeBSD Foundation	2018-11-29 19:13:10 +00:00
Mateusz Guzik	1f6ad48c76	vfs: fix i386 build after r341220	2018-11-29 09:54:27 +00:00
Mateusz Guzik	22443809ff	cache: retire cache_enter compat shim It was added over 6 years ago for binary compat. cache_enter macro remains as it expands to cache_enter_time. Sponsored by: The FreeBSD Foundation	2018-11-29 09:32:59 +00:00
Mateusz Guzik	712775843f	vfs: drop spurious memcpy in stat Sponsored by: The FreeBSD Foundation	2018-11-29 09:04:10 +00:00
Mateusz Guzik	d47f3fdb0a	fd: unify fd range check across the routines While here annotate out of range as unlikely. Sponsored by: The FreeBSD Foundation	2018-11-29 08:53:39 +00:00
Mateusz Guzik	eec8d0a378	Convert racct_enable to bool and annotate as __read_frequently Sponsored by: The FreeBSD Foundation	2018-11-29 05:17:16 +00:00
Mateusz Guzik	64cf6a62d4	Deinline racct throttling out of syscall exit path. racct is not enabled by default and even when it is enabled processes are typically not throttled. The order of checks is left unchanged since racct_enable will be annotated as __read_frequently, while checking for the flag in the processes would probably require an extra fetch. Sponsored by: The FreeBSD Foundation	2018-11-29 05:08:46 +00:00
Mateusz Guzik	e272bf479b	Annotate td_cowgen check as unlikely. Sponsored by: The FreeBSD Foundation	2018-11-29 04:48:22 +00:00
Mateusz Guzik	3277792bde	Tidy up hardclock. - use fcmpset for updating ticks - move (rarely used) itimer handling to a dedicated function Sponsored by: The FreeBSD Foundation	2018-11-29 03:44:02 +00:00
Mateusz Guzik	1e9a1bf589	proc: create a dedicated lock for zombproc to lighten the load on allproc_lock waitpid always takes proctree to evaluate the list, but only takes allproc if it can reap. With this patch allproc is no longer taken, which helps during poudriere -j 128. Discussed with: kib Sponsored by: The FreeBSD Foundation	2018-11-29 02:52:08 +00:00
Konstantin Belousov	affd918514	Improve sigonstack(). Avoid relying on unsigned overflow for the test. Simplify expressions to avoid duplicate check for the range. Style. Add herald comment. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18361	2018-11-27 19:50:58 +00:00
Jamie Gritton	b307954481	In hardened systems, where the security.bsd.unprivileged_proc_debug sysctl node is set, allow setting security.bsd.unprivileged_proc_debug per-jail. In part, this is needed to create jails in which the Address Sanitizer (ASAN) fully works as ASAN utilizes libkvm to inspect the virtual address space. Instead of having to allow unprivileged process debugging for the entire system, allow setting it on a per-jail basis. The sysctl node is still security.bsd.unprivileged_proc_debug and the jail(8) param is allow.unprivileged_proc_debug. The sysctl code is now a sysctl proc rather than a sysctl int. This allows us to determine setting the flag for the corresponding jail (or prison0). As part of the change, the dynamic allow.* API needed to be modified to take into account pr_allow flags which may now be disabled in prison0. This prevents conflicts with new pr_allow flags (like that of vmm(4)) that are added (and removed) dynamically. Also teach the jail creation KPI to allow differences for certain pr_allow flags between the parent and child jail. This can happen when unprivileged process debugging is disabled in the parent prison, but enabled in the child. Submitted by: Shawn Webb <lattera at gmail.com> Obtained from: HardenedBSD (45b3625edba0f73b3e3890b1ec3d0d1e95fd47e1, deba0b5078cef0faae43cbdafed3035b16587afc, ab21eeb3b4c72f2500987c96ff603ccf3b6e7de8) Relnotes: yes Sponsored by: HardenedBSD and G2, Inc Differential Revision: https://reviews.freebsd.org/D18319	2018-11-27 17:51:50 +00:00
Eric van Gyzen	607a0eb2f1	Remove superfluous bzero in getcontext/swapcontext/sendsig We zero the whole structure; we don't need to zero the __spare__ field again. Remove trailing whitespace. MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2018-11-26 20:56:05 +00:00
Alan Somers	72bce9fff6	vfs_aio.c: rename "physio" symbols to "bio". aio has two paths: an asynchronous "physio" path and a synchronous path. Confusingly, physio(9) isn't actually used by the "physio" path, and never has been. In fact, it may even be called by the synchronous path! Rename the "physio" path to the "bio" path to reflect what it actually does: directly compose BIOs and send them to character devices. MFC after: 2 weeks	2018-11-26 18:31:00 +00:00
Alan Cox	ee73fef96e	blist_meta_alloc assumes that mask=scan->bm_bitmap is nonzero. But if the cursor lies in the middle of the space that the meta node represents, then blanking the low bits of mask may make it zero, and break later code that expects a nonzero value. Add a test that returns failure if the mask has been cleared. Submitted by: Doug Moore <dougm@rice.edu> Reported by: pho Tested by: pho X-MFC with: r340402 Differential Revision: https://reviews.freebsd.org/D18058	2018-11-24 21:52:10 +00:00
Mark Johnston	792843c38f	Pass malloc flags directly through kevent(2) subroutines. Some kevent functions have a boolean "waitok" parameter for use when calling malloc(9). Replace them with the corresponding malloc() flags: the desired behaviour is known at compile-time, so this eliminates a couple of conditional branches, and makes the code easier to read. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18318	2018-11-24 17:06:01 +00:00
Mark Johnston	36c4960ef8	Plug some kernel memory disclosures via kevent(2). The kernel may register for events on behalf of a userspace process, in which case it must be careful to zero the kevent struct that will be copied out to userspace. Reviewed by: kib MFC after: 3 days Security: kernel stack memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18317	2018-11-24 17:02:31 +00:00
Mark Johnston	a2afae524a	Ensure that knotes do not get registered when KQ_CLOSING is set. KQ_CLOSING is set before draining the knotes associated with a kqueue, so we must ensure that new knotes are not added after that point. In particular, some kernel facilities may register for events on behalf of a userspace process and race with a close of the kqueue. PR: 228858 Reviewed by: kib Tested by: pho MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18316	2018-11-24 16:58:34 +00:00
Mark Johnston	1eeab857a3	Lock the knlist before releasing the in-flux state in knote_fork(). Otherwise there is a window, before iteration is resumed, during which the knote may be freed. The in-flux state ensures that the knote will not be removed from the knlist while locks are dropped. PR: 228858 Reviewed by: kib Tested by: pho MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18316	2018-11-24 16:41:29 +00:00
Konstantin Belousov	cefb93f253	Parse FreeBSD Feature Control note on the ELF image activation. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-11-23 23:33:55 +00:00
Konstantin Belousov	92328a3251	Generalize ELF parse_notes(). Remove the knowledge of the ABI note type and brandnote from it, instead provide it with a callback to do note-specific matching and data fetching. Implement callback to match against ELF brand. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-11-23 23:29:14 +00:00
Konstantin Belousov	eda8fe63c9	Trivial reduction of the code duplication, reuse the return FALSE code. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-11-23 23:16:01 +00:00
Mark Johnston	96fdfb3649	Honour the waitok parameter in kevent_expand(). Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18316	2018-11-23 23:10:03 +00:00
Konstantin Belousov	f5cf758998	Provide storage for the process feature control flags in struct proc. The flags are cleared on exec, it is up to the image activator to set them. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-11-23 23:07:57 +00:00
Mark Johnston	6d2e2df764	Ensure that directory entry padding bytes are zeroed. Directory entries must be padded to maintain alignment; in many filesystems the padding was not initialized, resulting in stack memory being copied out to userspace. With the ino64 work there are also some explicit pad fields in struct dirent. Add a subroutine to clear these bytes and use it in the in-tree filesystems. The NFS client is omitted for now as it was fixed separately in r340787. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-11-23 22:24:59 +00:00
Mateusz Guzik	e3d3e8289b	Revert "fork: fix use-after-free with vfork" This unreliably breaks libc handling of vfork where forking succeeded, but execve did not. vfork code in libc performs waitpid with WNOHANG in case of failed exec. With the fix exit codepath was waking up the parent before the child fully transitioned to a zombie. Woken up parent would waitpid, which could find a not-yet-zombie child and fail to reap it due to the WNOHANG flag. While removing the flag fixes the problem, it is not an option due to older releases which would still suffer from the kernel change. Revert the fix until a solution can be worked out. Note that while use-after-free which gets back due to the revert is a real bug, its side effects are limited due to the fact that struct proc memory is never released by UMA.	2018-11-23 04:38:50 +00:00
Mateusz Guzik	adce241981	Annotate TDP_RFPPWAIT as unlikely. The flag is only set on vfork, but is tested for all syscalls. On amd64 this shortens common-case (not vfork) code.	2018-11-22 21:38:24 +00:00
Mateusz Guzik	a5ac8272c0	fork: remove avoidable proc lock/unlock pair We don't have to access the process after making it runnable, so there is no need to hold it either. Sponsored by: The FreeBSD Foundation	2018-11-22 21:29:36 +00:00
Mateusz Guzik	b00b27e925	fork: fix use-after-free with vfork The pointer to the child is stored without any reference held. Then it is blindly used to wait until P_PPWAIT is cleared. However, if the child is autoreaped it could have exited and get freed before the parent started waiting. Use the existing hold mechanism to mitigate the problem. Most common case of doing exec remains unchanged. The corner case of doing exit performs wake up before waiting for holds to clear. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18295	2018-11-22 21:08:37 +00:00
Mark Johnston	79db6fe7aa	Plug some networking sysctl leaks. Various network protocol sysctl handlers were not zero-filling their output buffers and thus would export uninitialized stack memory to userland. Fix a number of such handlers. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: tuexen MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18301	2018-11-22 20:49:41 +00:00
Mateusz Guzik	f218ac5087	uipc_usrreq: fix inode number assignment The code was incrementing a global variable in an unsafe manner. Two different threads stating two different sockets could have resulted in the same inode numbers assigned to both. Creation is protected with a global lock, move the assignment there. Since inode numbers are 64-bit now drop the check for overflows. Sponsored by: The FreeBSD Foundation	2018-11-21 22:25:05 +00:00
Mateusz Guzik	a627b4629d	proc: update list manipulation comment on process exit Processes stay in the hash until they get reaped. This code does not unlink the child from the parent, so remove the claim that it does. Sponsored by: The FreeBSD Foundation	2018-11-21 22:16:10 +00:00
Mateusz Guzik	7883ce1f26	uipc_shm: use unr64 for inode numbers Sponsored by: The FreeBSD Foundation	2018-11-21 22:01:06 +00:00
Mateusz Guzik	53011553fa	proc: convert pfind & friends to use pidhash locks and other cleanup pfind_locked is retired as it relied on allproc which unnecessarily restricts locking of the hash. Sponsored by: The FreeBSD Foundation	2018-11-21 20:15:56 +00:00
Mateusz Guzik	3d3e6793f6	proc: implement pid hash locks and an iterator forks, exits and waits are frequently stalled during poudriere -j 128 runs due to killpg and process list exports performed for each package. Both uses take the allproc lock. The latter case can be modified to iterate over the hash with finer grained locking instead. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17817	2018-11-21 18:56:15 +00:00
Mark Johnston	d5e494fee4	Avoid unsynchronized updates to kn_status. kn_status is protected by the kqueue's lock, but we were updating it without the kqueue lock held. For EVFILT_TIMER knotes, there is no knlist lock, so the knote activation could occur during the kn_status update and result in KN_QUEUED being lost, in which case we'd enqueue an already-enqueued knote, corrupting the queue. Fix the problem by setting or clearing KN_DISABLED before dropping the kqueue lock to call into the filter. KN_DISABLED is used only by the core kevent code, so there is no side effect from setting it earlier. Reported and tested by: Sylvain GALLIANO <sg@efficientip.com> Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18060	2018-11-21 17:32:09 +00:00
Mark Johnston	45aecd0422	Remove KN_HASKQLOCK. It is a write-only flag whose last use was removed in r302235. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18059	2018-11-21 17:28:10 +00:00
Mark Johnston	bb58b5d670	Add a taskqueue_quiesce(9) KPI. This is similar to taskqueue_drain_all(9) but will wait for the queue to become idle before returning instead of only waiting for already-enqueued tasks to finish. This will be used in the opensolaris compat layer. PR: 227784 Reviewed by: cem MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17975	2018-11-21 17:18:27 +00:00
Mark Johnston	c7dc361d6f	Clear pad bytes in the struct exported by kern.ntp_pll.gettime. Reported by: Thomas Barabosch, Fraunhofer FKIE MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-11-20 20:32:10 +00:00
Mateusz Guzik	737037f6c0	pipe: use unr64 Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18054	2018-11-20 14:59:27 +00:00
Mateusz Guzik	435bef7a2f	Implement unr64 Important users of unr like tmpfs or pipes can get away with just ever-increasing counters, making the overhead of managing the state for 32 bit counters a pessimization. Change it to an atomic variable. This can be further sped up by making the counts variable "allocate" ranges and store them per-cpu. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18054	2018-11-20 14:58:41 +00:00
Ben Widawsky	1a305bda15	acpi: fix acpi_ec_probe to only check EC devices This patch utilizes the fixed_devclass attribute in order to make sure other acpi devices with params don't get confused for an EC device. The existing code assumes that acpi_ec_probe is only ever called with a dereferenceable acpi param. Aside from being incorrect because other devices of ACPI_TYPE_DEVICE may be probed here which aren't ec devices, (and they may have set acpi private data), it is even more nefarious if another ACPI driver uses private data which is not dereferenceable. This will result in a pointer deref during boot and therefore boot failure. On X86, as it stands today, no other devices actually do this (acpi_cpu checks for PROCESSOR type devices) and so there is no issue. I ran into this because I am adding such a device which gets probed before acpi_ec_probe and sets private data. If ARM ever has an EC, I think they'd run into this issue as well. There have been several iterations of this patch. Earlier iterations had ECDT enumerated ECs not call into the probe/attach functions of this driver. This change was Suggested by: jhb@. Reviewed by: jhb Approved by: emaste (mentor) Differential Revision: https://reviews.freebsd.org/D16635	2018-11-19 18:29:03 +00:00
Hans Petter Selasky	90acd1d139	Minor code factoring. No functional change. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-11-19 09:36:09 +00:00
Hans Petter Selasky	2205f61a31	Be more verbose when a sysctl fails to unregister. Print name of sysctl in question. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-11-19 09:35:16 +00:00
Kevin Bowling	2a24f4d911	Retire sbsndptr() KPI As of r340465 all consumers use sbsndptr_adv and sbsndptr_noadv Reviewed by: gallatin Approved by: krion (mentor) Differential Revision: https://reviews.freebsd.org/D17998	2018-11-19 00:54:31 +00:00
Mateusz Guzik	2c054ce924	proc: always store parent pid in p_oppid Doing so removes the dependency on proctree lock from sysctl process list export which further reduces contention during poudriere -j 128 runs. Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17825	2018-11-16 17:07:54 +00:00
Mark Johnston	aeb7a84ee1	Remove mostly-useless proc provider probes. For some reason the proc UMA zone's ctor, dtor and init functions are instrumented, but these functions are always available through FBT. Moreover, the probes are not part of the original Solaris proc provider, aren't documented, have no uses (e.g., in dwatch(8)) and have no clear use to begin with. Therefore, remove them. Reviewed by: rpaulo Differential Revision: https://reviews.freebsd.org/D2169	2018-11-15 23:02:59 +00:00
Warner Losh	36173f6976	Do proper conversion to/from sbt. Doh! sbttoX and Xtosbt were backwards. While they ran, they produced bogus results. Pointy hat to: imp@	2018-11-15 16:02:24 +00:00
Gleb Smirnoff	905837ebe7	Initialize compatibility epoch tracker for thread0. Fixes panics for drivers that call if_maddr_lock() during startup. Reported by: cy	2018-11-14 19:10:35 +00:00
Brooks Davis	5b1df30051	Use the main capabilities.conf for freebsd32. Allow the location of capabilities.conf to be configured. Also allow a per-abi syscall prefix to be configured with the abi_func_prefix syscalls.conf variable and check syscalls against entries in capabilities.conf with and without the prefix amended. Take advantage of these two features to allow use of a shared capabilities.conf between the default syscall vector and the freebsd32 compatibility layer. We've been inconsistent about keeping the two in sync as evidenced by the bugs fixed in r340294. This eliminates that problem going forward. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17932	2018-11-14 00:46:02 +00:00
Gleb Smirnoff	6febf18036	Fix build on some architectures after r340413. On amd64 epoch.h appeared to be included implicitly.	2018-11-14 00:33:03 +00:00
Matt Macy	91cf497515	epoch(9) revert r340097 - no longer a need for multiple sections per cpu I spoke with Samy Bahra and recent changes to CK to make ck_epoch_call and ck_epoch_poll not modify the record have eliminated the need for this.	2018-11-14 00:12:04 +00:00
Gleb Smirnoff	635c18840a	style(9), mostly adjusting overly long lines.	2018-11-13 23:57:34 +00:00
Gleb Smirnoff	a760c50c9e	With epoch not inlined, there is no point in using _lite KPI. While here, remove some unnecessary casts.	2018-11-13 23:45:38 +00:00
Gleb Smirnoff	9f360eecf9	The dualism between epoch_tracker and epoch_thread is fragile and unnecessary. So, expose CK types to kernel and use a single normal structure for epoch_tracker. Reviewed by: jtl, gallatin	2018-11-13 23:20:55 +00:00
Gleb Smirnoff	b79aa45e0e	For compatibility KPI functions like if_addr_rlock() that used to have mutexes but now are converted to epoch(9) use thread-private epoch_tracker. Embedding tracker into ifnet(9) or ifnet derived structures creates a non reentrable function, that will fail miserably if called simultaneously from two different contexts. A thread private tracker will provide a single tracker that would allow to call these functions safely. It doesn't allow nested call, but this is not expected from compatibility KPIs. Reviewed by: markj	2018-11-13 22:58:38 +00:00
Mateusz Guzik	f183fb162c	locks: plug warnings about uninitialized variables They only showed up after I redefined LOCKSTAT_ENABLED to 0. doing_lockprof in mutex.c is a real (but harmless) bug. Should the value be non-zero it will do checks for lock profiling which would otherwise be skipped. state in rwlock.c is a wart from the compiler, the value can't be used if lock profiling is not enabled. Sponsored by: The FreeBSD Foundation	2018-11-13 21:29:56 +00:00
Eric van Gyzen	d54474e63b	Make no assertions about lock state when the scheduler is stopped. Change the assert paths in rm, rw, and sx locks to match the lock and unlock paths. I did this for mutexes in r306346. Reported by: Travis Lane <tlane@isilon.com> MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2018-11-13 20:48:05 +00:00
Gleb Smirnoff	a82296c2df	Uninline epoch(9) entrance and exit. There is no proof that modern processors would benefit from avoiding a function call, but bloating code. In fact, clang created an uninlined real function for many object files in the network stack. - Move epoch_private.h into subr_epoch.c. Code copied exactly, avoiding any changes, including style(9). - Remove private copies of critical_enter/exit. Reviewed by: kib, jtl Differential Revision: https://reviews.freebsd.org/D17879	2018-11-13 19:02:11 +00:00
Mark Johnston	bb4a27f927	Allow allocations across meta boundaries. Remove restrictions that prevent allocation requests to cross the boundary between two meta nodes. Replace the bmu_avail field in meta nodes with a bitmap that identifies which subtrees have some free memory, and iterate over the nonempty subtrees only in blst_meta_alloc. If free memory is scarce, this should make searching for it faster. Put the code for handling the next-leaf allocation in a separate function. When taking blocks from the next leaf empties the leaf, be sure to clear the appropriate bit in its parent, and so on, up to the least-common ancestor of this leaf and the next. Eliminate special terminator nodes, and rely instead on the fact that there is a 0-bit at the end of the bitmask at the root of the tree that will stop a meta_alloc search, or a next-leaf search, before the search falls off the end of the tree. Make sure that the tree is big enough to have space for that 0-bit. Eliminate special all-free indicators. Lazy initialization of subtrees stands in the way of having an allocation span a meta-node boundary, so a subtree of all free blocks is not treated specially. Subtrees of all-allocated blocks are still recognized by looking at the bitmask at the root and finding 0. Don't print all-allocated subtrees. Do print the bitmasks for meta nodes, when tree-printing. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: alc MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12635	2018-11-13 18:40:01 +00:00
Kyle Evans	75beb4d46a	Add dynamic_kenv assertion to init_static_kenv Both to formally document the requirement that this not be called after the dynamic kenv is setup, and to perhaps help static analyzers figure out what's going on. While calling init_static_kenv this late isn't fatal, there are some caveats that the caller should be aware of: - Late calls are effectively a no-op, as far as default FreeBSD is concerned, as everything will switch to searching the dynamic kenv once it's available. - Each of the kern_getenv calls will leak memory, as it's assumed that these are searching static environment and allocations will not be made. As such, this usage is not sensible and should be detected.	2018-11-13 04:34:30 +00:00
Konstantin Belousov	389474c122	Allow set ether/vlan PCP operation from the VNET jails. The vlan interfaces can be created from vnet jails, it seems, so it sounds logical to allow pcp configuration as well. Reviewed by: bz, hselasky (previous version) Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17777	2018-11-12 15:59:32 +00:00
Conrad Meyer	0d1467b199	netdump: Fix netdumping with INVARIANTS kernels Correct boneheaded assertion I added in r339501. Mea culpa. The intent is to notice when an M_WAITOK zone allocation would fail during netdump, not to prevent all use of mbufs during netdump. Reviewed by: markj X-MFC-With: r339501 Differential Revision: https://reviews.freebsd.org/D17957	2018-11-12 05:24:20 +00:00
Konstantin Belousov	8782eef46f	Remove one-use variable. This also removes a lot of #ifdefs and cleans up a warning when the AUDIT kernel option is defined, but neither KDTRACE_HOOKS nor MAC are. Reported and tested by: danger Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-11-11 00:21:28 +00:00
Konstantin Belousov	ade85c5eec	Allow absolute paths for O_BENEATH. The path must have a tail which does not escape starting/topping directory. The documentation will come shortly, see the man pages commit message for the reason of separate commit. Reviewed by: jilles (previous version) Discussed with: emaste Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17714	2018-11-11 00:04:36 +00:00
Brooks Davis	9a38df59e9	Fix freebsd32 mknod(at). As dev_t is now a 64-bit integer, it requires special handling as a system call argument. 64-bit arguments are split between two 64-bit integers due to the way arguments are promoted to allow reuse of most system call implementations. They must be reassembled before use. Further, 64-bit arguments at an odd offset (counting from zero) are padded and slid to the next slot on powerpc and mips. Fix the non-COMPAT11 system call by adding a freebsd32_mknodat() and appropriately padded declarations. The COMPAT11 system calls are fully compatible with the 64-bit implementations so remove the freebsd32_ versions. Use uint32_t consistently as the type of the old dev_t. This matches the old definition. Reviewed by: kib MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17928	2018-11-09 21:01:16 +00:00
Brooks Davis	b34f4419fb	Make freebsd32_umtx_op follow the freebsd32_foo convention. Sponsored by: DARPA, AFRL	2018-11-09 00:46:10 +00:00
John Baldwin	4bf4b0f139	Enable non-executable stacks by default on RISC-V. Reviewed by: markj Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D17878	2018-11-07 18:32:02 +00:00
Brooks Davis	5577e44bf4	Regen after r340221: allow pointer return types. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17873	2018-11-07 16:56:07 +00:00
Brooks Davis	e56ec0e519	makesyscalls.sh: allow pointer return types. The previous code required that the return type be a single word. This allows it to be a pointer without using a typedef. Update the return types of break, mmap, and shmat to be void * as declared. This only effects systrace output in-tree, but can aid in generating system call wrappers from syscalls.master. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17873	2018-11-07 16:55:04 +00:00
Mark Johnston	f8a222010f	Avoid fixing the tty_info() buffer size in tty.h. Different compilation units may otherwise get a different view of the layout of struct tty depending on whether they include opt_printf.h. This caused a blowup in the number of types defined in the kernel's CTF file after r339468; thanks to dim@ for bisecting down to that revision. PR: 232675 Reported by: dim Reviewed by: cem (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17877	2018-11-06 23:41:44 +00:00
Mark Johnston	07702f72e5	Avoid specifying VM_PROT_EXECUTE in mappings from pipe_map and exec_map. These submaps are used for mapping pipe buffers and execv() argument strings respectively, so there's no need for such mappings to have execute permissions. Reported by: jhb Reviewed by: alc, jhb, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17827	2018-11-06 21:57:03 +00:00
Mark Johnston	6741ea083f	We need opt_stack.h after r339605. Reviewed by: cem Sponsored by: The FreeBSD Foundation	2018-11-06 21:47:22 +00:00
Brooks Davis	dd4d2f216f	Update some comments made obsolete by recent commits.	2018-11-06 20:45:15 +00:00
Brooks Davis	938e8dcf60	Regen after r340199: Use declared types for caddr_t arguments. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17852	2018-11-06 18:47:29 +00:00
Brooks Davis	318f0d7720	Use declared types for caddr_t arguments. Leave ptrace(2) alone for the moment as it's defined to take a caddr_t. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17852	2018-11-06 18:46:38 +00:00
Mariusz Zaborski	f4a035b8df	Regenerate after r340129. Pointed out by: brooks	2018-11-06 18:03:04 +00:00
Mark Johnston	f71ef9b686	Use plain atomic_{add,subtract} when that's sufficient. CID: 1386920 MFC after: 2 weeks	2018-11-06 17:32:25 +00:00
Andrew Turner	4ea56599e8	Port the NetBSD ubsan runtime to the FreeBSD kernel. This allows us to build the ubsan code added in r340189 into the kernel with the KUBSAN option. This will report when undefined behaviour is detected in the currently running kernel. As it can be large, the kernel is 65MB on arm64, loader may not be able to load the kernel on all architectures so is disabled by default for now. Sponsored by: DARPA, AFRL	2018-11-06 17:32:07 +00:00
Andrew Turner	0645126fae	Import the NetBSD micro ubsan code for the kernel. This imports revision 1.3 of common/lib/libc/misc/ubsan.c from NetBSD, the micro-ubsan code. It is an implementation of the Undefined Behavior Sanitizer runtime for use with recent clang and gcc. The uubsan code will be used in a later commit to implement kubsan to help find undefined behavior in the kernel. Sponsored by: DARPA, AFRL	2018-11-06 16:56:49 +00:00
Brooks Davis	44cbc1c2b7	Fix a couple indentation errors in r339958.	2018-11-06 00:09:43 +00:00
John Baldwin	4cbbb74888	Add a KPI for the delay while spinning on a spin lock. Replace a call to DELAY(1) with a new cpu_lock_delay() KPI. Currently cpu_lock_delay() is defined to DELAY(1) on all platforms. However, platforms with a DELAY() implementation that uses spin locks should implement a custom cpu_lock_delay() that doesn't use locks. Reviewed by: kib MFC after: 3 days	2018-11-05 21:34:17 +00:00
Mariusz Zaborski	82560231d3	capsicum: allow ppoll(2) in capability mode We already allow to use poll(2). There is no reason to disallow ppoll(2). PR: 232495 Submitted by: Stefan Grundmann <sg2342@googlemail.com> Reviewed by: cem, oshogbo MFC after: 2 weeks	2018-11-04 17:12:53 +00:00
Matt Macy	10f42d244b	Convert epoch to read / write records per cpu In discussing D17503 "Run epoch calls sooner and more reliably" with sbahra@ we came to the conclusion that epoch is currently misusing the ck_epoch API. It isn't safe to do a "write side" operation (ck_epoch_call or ck_epoch_poll) in the middle of a "read side" section. Since, by definition, it's possible to be preempted during the middle of an EPOCH_PREEMPT epoch the GC task might call ck_epoch_poll or another thread might call ck_epoch_call on the same section. The right solution is ultimately to change the way that ck_epoch works for this use case. However, as a stopgap for 12 we agreed to simply have separate records for each use case. Tested by: pho@ MFC after: 3 days	2018-11-03 03:43:32 +00:00
Brooks Davis	4e8c73eb20	Regen after r340080: Add const to input-only char * arguments. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17812	2018-11-02 20:56:19 +00:00
Brooks Davis	12e69f96a2	Add const to input-only char * arguments. These arguments are mostly paths handled by NAMEI() macros which already take const char arguments. This change improves the match between syscalls.master and the public declarations of system calls. Reviewed by: kib (prior version) Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17812	2018-11-02 20:50:22 +00:00
Warner Losh	003ffd57fe	Add sysctl_usec_to_sbintime and sysctl_msec_to_sbintime. These functions are used to present a sbintime_t as either a number of microseconds or a number of milliseconds respectively. Sponsored by: Netflix	2018-11-02 17:50:57 +00:00
Mark Johnston	2203c46d87	Initialize the eflags field of vm_map headers. Initializing the eflags field of the map->header entry to a value with a unique new bit set makes a few comparisons to &map->header unnecessary. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: alc, kib Tested by: pho MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D14005	2018-11-02 16:26:44 +00:00
Brooks Davis	1493c2ee62	Make vop_symlink take a const target path. This will enable callers to take const paths as part of syscall declaration improvements. Where doing so is easy and non-disruptive carry the const through implementations. In UFS the value is passed to an interface that must take non-const values. In ZFS, const poisoning would touch code shared with upstream and it's not worth adding diffs. Bump __FreeBSD_version for external API consumers. Reviewed by: kib (prior version) Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17805	2018-11-02 14:42:36 +00:00
Conrad Meyer	78c2a9806e	kern_poll: Restore explanatory comment removed in r177374 The comment isn't stale. The check is bogus in the sense that poll(2) does not require pollfd entries to be unique in fd space, so there is no reason there cannot be more pollfd entries than open or even allowed fds. The check is mostly a seatbelt against accidental misuse or abuse. FD_SETSIZE, while usually unrelated to poll, is used as an arbitrary floor for systems with very low kern.maxfilesperproc. Additionally, document this possible EINVAL condition in the poll.2 manual. No functional change. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17671	2018-11-01 23:46:23 +00:00
Brooks Davis	f7e5ce325f	Regen after r340034: Use mode_t when the documented signature does. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17784	2018-11-01 23:10:53 +00:00
Brooks Davis	2105ac07d7	Use mode_t when the documented signature does. This is more clear and produces better results when generating function stubs from syscalls.master. Reviewed by: kib, emaste Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17784	2018-11-01 23:06:50 +00:00
John Baldwin	b317cfd4c0	Don't enter DDB for fatal traps before panic by default. Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic' and make the calls to kdb_trap() in MD fatal trap handlers prior to calling panic() conditional on this new knob instead of 'debugger_on_panic'. Disable the new knob by default. Developers who wish to recover from a fatal fault by adjusting saved register state and retrying the faulting instruction can still do so by enabling the new knob. However, for the more common case this makes the user experience for panics due to a fatal fault match the user experience for other panics, e.g. 'c' in DDB will generate a crash dump and reboot the system rather than being stuck in an infinite loop of fatal fault messages and DDB prompts. Reviewed by: kib, avg MFC after: 2 months Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D17768	2018-11-01 21:34:17 +00:00
Brooks Davis	e3e5481326	Reformat syscalls.master for better readability. This takes advantage of two recent changes to makesyscalls.sh: r328598: Permit a range of syscall numbers for UNIMPL r339624: Remove the need for backslashes in syscalls.master Syscall declarations are now split across multiple lines with the syscall name and variables each on separate lines (with an exception for syscalls taking no arguments.) Reviewed by: imp Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17706	2018-10-31 16:17:45 +00:00
Bjoern A. Zeeb	9afc56849a	Fix mips build after r339931. I erroneously thought that it was two 64bit platforms which use link_elf_obj.c. PR: 228854 Reported by: ci.f.o. MFC after: 3 days X-MFC with: r339931 Pointyhat to: bz	2018-10-30 21:35:56 +00:00
Bjoern A. Zeeb	0f823b6497	As a follow-up to r339930 and various reports implement logging in case we fail during module load because the pcpu or vnet module sections are full. We did return a proper error but left no indication to the user as to what the actual problem was. Even worse, on 12/13 currently we are seeing an unrelated error (ENOSYS instead of ENOSPC, which gets skipped over in kern_linker.c) to be printed which made problem diagnostics even harder. PR: 228854 MFC after: 3 days	2018-10-30 20:51:03 +00:00
Mark Johnston	9978bd996b	Add malloc_domainset(9) and _domainset variants to other allocator KPIs. Remove malloc_domain(9) and most other _domain KPIs added in r327900. The new functions allow the caller to specify a general NUMA domain selection policy, rather than specifically requesting an allocation from a specific domain. The latter policy tends to interact poorly with M_WAITOK, resulting in situations where a caller is blocked indefinitely because the specified domain is depleted. Most existing consumers of the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy, in which we fall back to other domains to satisfy the allocation request. This change also defines a set of DOMAINSET_FIXED() policies, which only permit allocations from the specified domain. Discussed with: gallatin, jeff Reported and tested by: pho (previous version) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17418	2018-10-30 18:26:34 +00:00
Mark Johnston	920239efde	Fix some problems that manifest when NUMA domain 0 is empty. - In uma_prealloc(), we need to check for an empty domain before the first allocation attempt, not after. Fix this by switching uma_prealloc() to use a vm_domainset iterator, which addresses the secondary issue of using a signed domain identifier in round-robin iteration. - Don't automatically create a page daemon for domain 0. - In domainset_empty_vm(), recompute ds_cnt and ds_order after excluding empty domains; otherwise we may frequently specify an empty domain when calling in to the page allocator, wasting CPU time. Convert DOMAINSET_PREF() policies for empty domains to round-robin. - When freeing bootstrap pages, don't count them towards the per-domain total page counts for now: some vm_phys segments are created before the SRAT is parsed and are thus always identified as being in domain 0 even when they are not. Then, when bootstrap pages are freed, they are added to a domain that we had previously thought was empty. Until this is corrected, we simply exclude them from the per-domain page count. Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com> Reviewed by: gallatin MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17704	2018-10-30 17:57:40 +00:00
Eric van Gyzen	fcbb889fdb	Always stop the scheduler when entering kdb Set curthread->td_stopsched when entering kdb via any vector. Previously, it was only set when entering via panic, so when entering kdb another way, mutexes and such were still "live", and an attempt to lock an already locked mutex would panic. Reviewed by: kib, cem Discussed with: jhb Tested by: pho MFC after: 2 months Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17687	2018-10-30 14:54:15 +00:00
Stephen Hurd	5201e0f110	Drain grouptaskqueue of the gtask before detaching it. taskqgroup_detach() would remove the task even if it was running or enqueued, which could lead to panics (see D17404). With this change, taskqgroup_detach() drains the task and sets a new flag which prevents the task from being scheduled again. I've added grouptask_block() and grouptask_unblock() to allow control over the flag from other locations as well. Reviewed by: Jeffrey Pieper <jeffrey.e.pieper@intel.com> MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D17674	2018-10-29 14:36:03 +00:00
Mark Johnston	4aed5937db	Use M_WAITOK in init_hwpmc(). No functional change intended. MFC after: 2 weeks	2018-10-27 18:48:49 +00:00
Conrad Meyer	2384981b61	poll: Unify userspace pollfd pointer name Some of the poll code used 'fds' and some used 'ufds' to refer to the uap->fds userspace pointer that was passed around to subroutines. Some of the poll code used 'fds' to refer to the kernel memory pollfd arrays, which seemed unnecessarily confusing. Unify on 'ufds' to refer to the userspace pollfd array. Additionally, 'bits' is not an accurate description of the kernel pollfd array in kern_poll, so rename that to 'kfds'. Finally, clean up some logic with mallocarray() and nitems(). No functional change. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D17670	2018-10-26 20:07:46 +00:00
Brooks Davis	ed34a7fcf2	Move 32-bit compat support for FIODGNAME to the right place. ioctl(2) commands only have meaning in the context of a file descriptor so translating them in the syscall layer is incorrect. The new handler uses an accessor to retrieve/construct a pointer from the last member of the passed structure and relies on type punning to access the other member which requires no translation. Unlike r339174 this change supports both places FIODGNAME is handled. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17475	2018-10-26 17:59:25 +00:00
Konstantin Belousov	4f77f48884	Implement O_BENEATH and AT_BENEATH. Flags prevent open(2) and *at(2) vfs syscalls name lookup from escaping the starting directory. Supposedly the interface is similar to the same proposed Linux flags. Reviewed by: jilles (code, previous version of manpages), 0mp (manpages) Discussed with: allanjude, emaste, jonathan Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17547	2018-10-25 22:16:34 +00:00
Mark Johnston	ad054101eb	Remove a dead store. CID: 1304878 MFC after: 1 week	2018-10-25 17:36:28 +00:00
Mark Johnston	970a174f3b	Add FALLTHROUGH comments to appease Coverity. CID: 1017862-1017864, 1017866-1017868 MFC after: 2 weeks	2018-10-25 15:43:21 +00:00
Mark Johnston	a0a18fd46b	Remove a redundant check. CID: 1042100 MFC after: 2 weeks	2018-10-25 15:40:59 +00:00
Konstantin Belousov	4fceda6206	Correct condition to detect mount(2) support by a filesystem. Reported and tested by: cy Sponsored by: The FreeBSD Foundation Approved by: re (rgrimes)	2018-10-24 19:40:09 +00:00
Konstantin Belousov	8ff7fad1d7	Only call sigdeferstop() for NFS. Use bypass to catch any NFS VOP dispatch and route it through the wrapper which does sigdeferstop() and then dispatches original VOP. NFS does not need a bypass below it, which is not supported. The vop offset in the vop_vector is added since otherwise it is impossible to get vop_op_t from the internal table, and I did not want to create the layered fs only to wrap NFS VOPs. VFS_OP()s wrap is straightforward. Requested and reviewed by: mjg (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17658	2018-10-23 21:43:41 +00:00
Eric Joyner	46fa0c2552	Revert r339634. That commit is causing kernel panics in em(4), so this will be reverted until those are fixed. Reported by: ae@, pho@, et al Sponsored by: Intel Corporation	2018-10-23 17:06:36 +00:00
Eric Joyner	940f62d616	iflib: drain enqueued tasks before detaching from taskqgroup The taskqgroup_detach function does not check if task is already enqueued when detaching it. This may lead to kernel panic if enqueued task starts after context state lock is destroyed. Ensure that the already enqueued admin tasks are executed before detaching them. The issue was discovered during validation of D16429. Unloading of if_ixlv followed by immediate removal of VFs with iovctl -D may lead to panic on NODEBUG kernel. As well, check if iflib is in detach before enqueueing new admin or iov tasks, to prevent new tasks from executing while the taskqgroup tasks are being drained. Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com> Reviewed by: shurd@, erj@ Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D17404	2018-10-23 04:37:29 +00:00
Brooks Davis	9d7051d920	Remove the need for backslashes in syscalls.master. Join non-special lines together until we hit a line containing a '}' character. This allows the function declaration body to be split across multiple lines without backslash continuation characters. Continue to join lines ending with backslashes to allow gradual migration and to support out-of-tree syscall vectors Reviewed by: emaste, kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17488	2018-10-22 22:13:00 +00:00
Brooks Davis	22c0c9a481	Remove __restrict qualifiers from syscalls.master. The restrict qualifier is intended to aid code generation in the compiler, but the only access to storage through these pointers is via structs using copyin/copyout and the like which can not be written in C or C++ and thus the compiler gains nothing from the qualifiers. As such, the qualifiers add no value in current usage. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17574	2018-10-22 21:50:43 +00:00
Mark Johnston	b61f314290	Make it possible to disable NUMA support with a tunable. This provides a chicken switch for anyone negatively impacted by enabling NUMA in the amd64 GENERIC kernel configuration. With NUMA disabled at boot-time, information about the NUMA topology is not exposed to the rest of the kernel, and all of physical memory is viewed as coming from a single domain. This method still has some performance overhead relative to disabling NUMA support at compile time. PR: 231460 Reviewed by: alc, gallatin, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17439	2018-10-22 20:13:51 +00:00
Conrad Meyer	93940996fd	Conditionalize kern.tty_info_kstacks feature on STACKS option Fix tinderbox (mips XLPN32) after r339471. Reported by: tinderbox X-MFC-With: r339471 Sponsored by: Dell EMC Isilon	2018-10-22 17:42:57 +00:00
Mark Johnston	21744c825f	Don't import 0 into vmem quantum caches. vmem uses UMA cache zones to implement the quantum cache. Since uma_zalloc() returns 0 (NULL) to signal an allocation failure, UMA should not be used to cache resource 0. Fix this by ensuring that 0 is never cached in UMA in the first place, and by modifying vmem_alloc() to fall back to a search of the free lists if the cache is depleted, rather than blocking in qc_import(). Reported by and discussed with: Brett Gutstein <bgutstein@rice.edu> Reviewed by: alc MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D17483	2018-10-22 16:16:42 +00:00
Warner Losh	e9b5375b04	Retire dpt(4) Marked as gone in 12 and not relevant since the early 90s. No sightings in nycbug's dmesg database. Relnotes: yes	2018-10-22 02:35:12 +00:00
Warner Losh	a1db7455b7	Remove bt(4) driver The buslogic scsi driver has been tagged as gone in 12 for some time now. Remove it. The nycbug dmesg database shows only one sighting in 6 for this driver. It was very popular in the early days of the project, but that popularity seems to have died by 2004 when the nycbug database started up. Relnotes: yes	2018-10-22 02:34:59 +00:00
Warner Losh	43b16da804	Remove adv(4) and adw(4) Remove the advansys drivers (both adv and adw). They were tagged as gone in 12 a while ago. The nycbug dmesg database shows this was last seen in 6 and there were only a few adv sightings then (none for adw). Relnotes: yes	2018-10-22 02:34:47 +00:00
Warner Losh	39c362e0b0	Remove aha(4) from the tree. We tagged aha as gone in 12 a while ago. Proceed with its removal. Data from nycbug's database shows the last sighting of this driver in 6, with the prior one in 4.x show its popularity had died prior to 4.x. Relnotes: yes	2018-10-22 02:34:25 +00:00
Warner Losh	c4b23051af	Remove stray reference to pdq. Like the infamous twenty first of Johann Sebastian Bach's twenty children, it hasn't been seen in many years.	2018-10-21 16:49:49 +00:00
Conrad Meyer	3937ee7557	netdump: Zone mbufs should be allocated before dump Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17306	2018-10-20 22:24:58 +00:00
Conrad Meyer	767bc248de	ZSTDIO: Correctly initialize zstd context with provided 'level' Prior to this revision, we allocated sufficient context space for 'level' but never actually set the compress level parameter, so we would always get the default '3'. Reviewed by: markj, vangyzen MFC after: 12 hours Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17144	2018-10-20 21:49:44 +00:00
Conrad Meyer	ec86f8b28b	dev_refthread: Do not initialize ref when reference was not acquired Like the companion API devvn_refthread, leave ref uninitialized when a reference was not acquired. Initializing to 1 provides a vaguely correct-looking but bogus value for broken callers to (mistakenly) pass to dev_relthread() when refthread fails. Make it even more clear to consumers that dev_relthread is only valid when dev_refthread succeeds. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D16885	2018-10-20 19:42:38 +00:00
Conrad Meyer	d7aa89c363	tty info (^T): Add optional kernel stack(9) traces It is often useful for developers and administrators to determine a running thread's stack for debugging purposes. With this feature, using ^T will print that information. For now, the feature is disabled by default. Enable with sysctl kern.tty_info_kstacks=1. Discussed with: markj Reviewed by: oshogbo Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17621	2018-10-20 18:42:28 +00:00
Conrad Meyer	6858c2cc8f	Replace ttyprintf with sbuf_printf and tty drain routine Add string variants of cnputc and tty_putchar, and use them from the tty sbuf drain routine. Suggested by: ed@ Sponsored by: Dell EMC Isilon	2018-10-20 18:31:36 +00:00
Conrad Meyer	d158fa4ade	Add flags variants to linker_files / stack(9) symbol resolution Some best-effort consumers may find trylock behavior for stack(9) symbol resolution acceptable. Expose that behavior to such consumers. This API is ugly. If in the future the modules and linker file list locking is cleaned up such that the linker_files list can be iterated safely without acquiring a sleepable lock, this API should be removed. However, most of the time nothing will be holding the linker files lock exclusive and the acquisition can proceed. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17620	2018-10-20 18:08:43 +00:00
Mark Johnston	662e7fa8d9	Create some global domainsets and refactor NUMA registration. Pre-defined policies are useful when integrating the domainset(9) policy machinery into various kernel memory allocators. The refactoring will make it easier to add NUMA support for other architectures. No functional change intended. Reviewed by: alc, gallatin, jeff, kib Tested by: pho (part of a larger patch) MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17416	2018-10-20 17:36:00 +00:00
Bjoern A. Zeeb	9ae7bc39c2	In r78161 the lookup_set linker method was introduced which optionally returns the section start and stop locations as well as a count if the caller asks for them. There was only one out-of-file consumer of count which did not actually use it and hence was eliminated in r339407. In r194784 parse_dpcpu(), and in r195699 parse_vnet() (a copy of the former) started to use the link_elf_lookup_set() interface internally also asking for the count. count is computed as the difference of the void **stop - void **start locations and as such, if the absolute numbers (stop - start) % sizeof(void *) != 0 a round-down happens, e.g., stop 0x1003 - start 0x1000 => count 0. To get the section size instead of "count is the number of pointer elements in the section", the parse_*() functions do a count *= sizeof(void *). They use the result to allocate memory and copy the section data into the "master" and per-instance memory regions with a size of count. As a result of count possibly round-down this can miss the last bytes of the section. The good news is that we do not touch out of bounds memory during these operations (we may at a later stage if the last bytes would overflow the master sections). Given relocation in elf_relocaddr() works based on the absolute numbers of start and stop, this means that we can possibly try to access relocated data which was never copied and hence we get random garbage or at best zeroed memory. Stop the two (last) consumers of count (the parse_*() functions) from using count as well, and calculate the section size based on the absolute numbers of stop and start and use the proper size for the memory allocation and data copies. This will make the symbols in the last bytes of the pcpu or vnet sections be presented as expected. PR: 232289 Approved by: re (gjb) MFC after: 2 weeks	2018-10-18 20:20:41 +00:00
Jamie Gritton	4520f617c9	Fix typos from r339409. Reported by: maxim Approved by: re (gjb)	2018-10-18 15:02:57 +00:00
Jonathan T. Looney	e77f0bdcb5	r334853 added a "socket destructor" callback. However, as implemented, it was really a "socket close" callback. Update the socket destructor functionality to run when a socket is destroyed (rather than when it is closed). The original submitter has confirmed that this change satisfies the intended use case. Suggested by: rwatson Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> Tested by: Michio Honda <micchie at sfc.wide.ad.jp> Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17590	2018-10-18 14:20:15 +00:00
Jamie Gritton	b19d66fd5a	Add a new jail permission, allow.read_msgbuf. When true, jailed processes can see the dmesg buffer (this is the current behavior). When false (the new default), dmesg will be unavailable to jailed users, whether root or not. The security.bsd.unprivileged_read_msgbuf sysctl still works as before, controlling system-wide whether non-root users can see the buffer. PR: 211580 Submitted by: bz Approved by: re@ (kib@) MFC after: 3 days	2018-10-17 16:11:43 +00:00
Bjoern A. Zeeb	0455a92bcb	The countp argument passed to linker_file_lookup_set() in linker_load_dependencies() is unused, so no need to ask for the value in first place. Remove the unused "count" variable. Approved by: re (kib)	2018-10-17 10:31:08 +00:00
Mark Johnston	ddab8c351a	Reparent a child of pdfork(2) to its reaper when the procdesc is closed. Unconditionally reparenting to PID 1 breaks the procctl(2) reaper functionality. Add a regression test for this case. Reviewed by: kib Approved by: re (gjb) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17589	2018-10-16 20:06:56 +00:00
Gleb Smirnoff	47f1ea5109	Plug sendfile(2) on a listening socket with proper error code. Reported by: ngie Reviewed by: ngie Approved by: re (delphij)	2018-10-16 15:57:16 +00:00
Kyle Evans	29bf3a7ba8	Correct COMPAT* macro names in syscalls.master Both ^/sys/compat/freebsd32/syscalls.master and ^/sys/kern/syscalls.master cited "COMPAT[n] #ifdef" instead of "COMPAT_FREEBSD[n] #ifdef" in places. Approved by: re (glebius)	2018-10-15 21:35:57 +00:00
Mateusz Guzik	98fca94d22	capsicum: provide cap_rights_fde_inline Reading caps is in the hot path (on each successful fd lookup), but completely unnecessarily requires a function call. Approved by: re (gjb) Sponsored by: The FreeBSD Foundation	2018-10-12 23:48:10 +00:00
Mateusz Guzik	c9964045a0	Add a file missed in r339321 Approved by: re (implicit) Sad face: mjg	2018-10-12 00:32:45 +00:00
Mateusz Guzik	3f102f5881	Provide string functions for use before ifuncs get resolved. The change is a no-op for architectures which don't ifunc memset, memcpy nor memmove. Convert places which need them. Xen bits by royger. Reviewed by: kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17487	2018-10-11 23:28:04 +00:00
Jamie Gritton	08b4333399	Fix the test prohibiting jails from sharing IP addresses. It's not supposed to be legal for two jails to contain the same IP address, unless both jails contain only that one address. This is the behavior documented in jail(8), and is there to prevent confusion when multiple jails are listening on INADDR_ANY. VIMAGE jails (now the default for GENERIC kernels) test this correctly, but non-VIMAGE jails have been performing an incomplete test when nested jails are used. Approved by: re@ (kib@) MFC after: 5 days	2018-10-06 02:10:32 +00:00
Matt Macy	e8bb589d56	eliminate locking surrounding ui_vmsize and swap reserve by using atomics Change swap_reserve and swap_total to be in units of pages so that swap reservations can be done using only atomics instead of using a single global mutex for swap_reserve and a single mutex for all processes running under the same uid for uid accounting. Results in mmap speed up and a 70% increase in brk calls / second. Reviewed by: alc@, markj@, kib@ Approved by: re (delphij@) Differential Revision: https://reviews.freebsd.org/D16273	2018-10-05 05:50:56 +00:00
Gleb Smirnoff	ad7eb8cad5	In PR 227259, a user is reporting that they have code which is using shutdown() to wakeup another thread blocked on a stream listen socket. This code is failing, while it used to work on FreeBSD 10 and still works on Linux. It seems reasonable to add another exception to support something users are actually doing, which used to work on FreeBSD 10, and still works on Linux. And, it seems like it should be acceptable to POSIX, as we still return ENOTCONN. This patch is different to what had been committed to stable/11, since code around listening sockets is different. Patch in D15019 is written by jtl@, slightly modified by me. PR: 227259 Obtained from: jtl Approved by: re (kib) Differential Revision: D15019	2018-10-03 17:40:04 +00:00
Andrew Turner	8696dcdacf	Add kernel ifunc support on arm64. Tested with ifunc resolvers in the kernel and module with calls from kernel to kernel, module to kernel, and module to module. Reviewed by: kib (previous version) Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17370	2018-10-01 18:51:08 +00:00
Andrew Gallatin	30c5525b3c	Allow empty NUMA memory domains to support Threadripper2 The AMD Threadripper 2990WX is basically a slightly crippled Epyc. Rather than having 4 memory controllers, one per NUMA domain, it has only 2 memory controllers enabled. This means that only 2 of the 4 NUMA domains can be populated with physical memory, and the others are empty. Add support to FreeBSD for empty NUMA domains by: - creating empty memory domains when parsing the SRAT table, rather than failing to parse the table - not running the pageout daemon threads in empty domains - adding defensive code to UMA to avoid allocating from empty domains - adding defensive code to cpuset to avoid binding to an empty domain Thanks to Jeff for suggesting this strategy. Reviewed by: alc, markj Approved by: re (gjb@) Differential Revision: https://reviews.freebsd.org/D1683	2018-10-01 14:14:21 +00:00


16605 Commits