freebsd-skq

Author	SHA1	Message	Date
kib	f34dbcf7d0	Implement shmat(2) flag SHM_REMAP. Based on the description in Linux man page. Reviewed by: markj, ngie (previous version) Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18837	2019-01-16 05:15:57 +00:00
delphij	74be0aa2e9	Use TD_IS_IDLETHREAD instead of unrolled version. MFC after: 2 weeks	2019-01-15 06:44:37 +00:00
glebius	7ee1aa34d4	Allocate pager bufs from UMA instead of 80-ish mutex protected linked list. o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many pbufs we are going to have set. In various subsystems that are going to utilize pbufs create private zones via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(), and sets a limit on created zone. After startup preallocate pbufs according to requirements of all pbuf zones. Subsystems that used to have a private limit with old allocator now have private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS, swap, vnode pager. The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9), aio(4). They should have their private limits, but changing that is out of scope of this commit. o Fetch tunable value of kern.nswbuf from init_param2() and while here move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only this option. Default values aren't touched by this commit, but they probably should be reviewed wrt modern hardware. This change removes a tight bottleneck from sendfile(2) operation, that uses pbufs in vnode pager. Other pagers also would benefit from faster allocation. Together with: gallatin Tested by: pho	2019-01-15 01:02:16 +00:00
glebius	55f760dc93	Add flag LK_NEW for lockinit() that is converted to LO_NEW and passed down to lock_init(). This allows for lockinit() on a not prezeroed memory.	2019-01-15 00:35:19 +00:00
kib	db6bffc60a	Handle overflow in calculating max kmem size. vm_kmem_size is u_long, and it might not be capable of holding page count times PAGE_SIZE, even when scaled down by VM_KMEM_SIZE_SCALE. As bde reported, 12G PAE config ends up with zero for kmem size. Explicitly check for overflow and clamp kmem size at vm_kmem_size_max. If we end up at zero size because VM_KMEM_SIZE_MAX is not defined, panic with clear explanation rather than failing in a way which is hard to relate. Reported by: bde, pho Tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18767	2019-01-14 07:31:19 +00:00
jah	23cce14a7d	Handle SIGIO for listening sockets r319722 separated struct socket and parts of the socket I/O path into listening-socket-specific and dataflow-socket-specific pieces. Listening socket connection notifications are now handled by solisten_wakeup() instead of sowakeup(), but solisten_wakeup() does not currently post SIGIO to the owning process. PR: 234258 Reported by: Kenneth Adelman MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18664	2019-01-13 20:33:54 +00:00
cognet	a7c2a2d8c6	Instead of using an incomplete list of platforms that use 64bits time_t in 32bits mode, special case amd64, as i386 is the only arch that still uses 32bits time_t.	2019-01-13 00:19:15 +00:00
andrew	32f8cf00d6	Fix the check for the offset of td_frame and td_emuldata in struct thread. Pointy hat: andrew Sponsored by: DARPA, AFRL	2019-01-12 20:41:57 +00:00
andrew	5e0e456d9f	Add support for the Clang Coverage Sanitizer in the kernel (KCOV). When building with KCOV enabled the compiler will insert function calls to probes allowing us to trace the execution of the kernel from userspace. These probes are on function entry (trace-pc) and on comparison operations (trace-cmp). Userspace can enable the use of these probes on a single kernel thread with an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE, then mmap the allocated buffer and enable tracing with KIOENABLE, with the trace mode being passed in as the int argument. When complete KIODISABLE is used to disable tracing. The first item in the buffer is the number of trace events that have happened. Userspace can write 0 to this to reset the tracing, and is expected to do so on first use. The format of the buffer depends on the trace mode. When in PC tracing just the return address of the probe is stored. Under comparison tracing the comparison type, the two arguments, and the return address are traced. The former method uses one entry per trace event, while the latter uses 4. As such they are incompatible so only a single mode may be enabled. KCOV is expected to help fuzzing the kernel, and while in development has already found a number of issues. It is required for the syzkaller system call fuzzer [1]. Other kernel fuzzers could also make use of it, either with the current interface, or by extending it with new modes. A man page is currently being worked on and is expected to be committed soon, however having the code in the kernel now is useful for other developers to use. [1] https://github.com/google/syzkaller Submitted by: Mitchell Horne <mhorne063@gmail.com> (Earlier version) Reviewed by: kib Testing by: tuexen Sponsored by: DARPA, AFRL Sponsored by: The FreeBSD Foundation (Mitchell Horne) Differential Revision: https://reviews.freebsd.org/D14599	2019-01-12 11:21:28 +00:00
glebius	22a41cecb2	Simplify sosetopt() so that function has single return point. No functional change.	2019-01-10 00:25:12 +00:00
brooks	48665600ce	style(9): fix the indent of a return.	2019-01-09 17:23:59 +00:00
tuexen	e8a9c3693f	Avoid overflow in vtruncbuf() Using daddr_t instead of int keeps trunclbn from becoming negative when it shouldn't. This issue was found by running syzkaller. Reviewed by: mckusick, kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18763	2019-01-08 09:04:27 +00:00
kp	39b0b1bf8e	Remove unneeded NULL check for td_ucred td_ucred is always set, so we don't need the ternary expression to check for it.	2019-01-04 21:12:17 +00:00
cem	d6edca0f6c	Expose threads-per-core and physical core count information With new sysctls (to the best of our ability to detect them). Restructured smp.4 slightly for clarity (keep relevant stuff closer to the top) while documenting. Reviewed by: markj, jhibbits (ppc parts) MFC after: 3 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D18322	2019-01-04 18:31:17 +00:00
markj	85037a09f3	Support MSG_DONTWAIT in send(2). As it does for recv(2), MSG_DONTWAIT indicates that the call should not block, returning EAGAIN instead. Linux and OpenBSD both implement this, so the change makes porting easier, especially since we do not return EINVAL or so when unrecognized flags are specified. Submitted by: Greg V <greg@unrelenting.technology> Reviewed by: tuexen MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18728	2019-01-04 17:31:50 +00:00
kp	59252de0d7	Simplify jail ID printing on process exit As suggested by kib@, we don't need to check p_ucred, because that's only NULL during process creation, and cr_prison is never NULL.	2018-12-29 21:36:02 +00:00
cem	432d4753f2	Update to Zstandard 1.3.8 This merge brings in a couple new files, which needed to be attached to the build; a new dependency on <limits.h>, which must be stubbed; and a name change in the Context parameter constants, from ZSTD_p_foo to ZSTD_c_foo. Significantly, it fixes a kernel build error with GCC where floating-point functions were included in the kernel build, by hiding them under the same compile-time #ifdef that already covered their invocation. That issue was introduced to FreeBSD in the 1.3.7 update and tracked upstream here: https://github.com/facebook/zstd/issues/1386 The full 1.3.8 release notes can be found on Github: https://github.com/facebook/zstd/releases/tag/v1.3.8 Relnotes: yes	2018-12-29 21:18:01 +00:00
kib	0bf5594c41	For hw.{physmem,realmem,usermem} MIBs, clamp instead of truncating. If the memory size does not fit into u_long, current code truncates the returned value and returns complete nonsense. Make the result slightly more useful by clamping it at ULONG_MAX. Reported and tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-12-29 15:55:44 +00:00
kp	7f6b1eda35	Make kernel print jail ID when logging a process exit Kernel now includes jail ID when logging a process exit. jid is 0 for unjailed processes. Submitted by: Marie Helene Kvello-Aune <freebsd@mhka.no> Relnotes: yes Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D18618	2018-12-29 14:48:51 +00:00
jilles	9cebe88602	pfind, pfind_any: Correct zombie logic SVN r340744 erroneously changed pfind() to return any process including zombies and pfind_any() to return only non-zombie processes. In particular, this caused kill() on a zombie process to fail with [ESRCH]. There is no direct test case for this but /usr/tests/bin/sh/builtins/kill1.0 occasionally triggers it (as reported by lwhsu). Conversely, returning zombies from pfind() seems likely to violate invariants and cause panics, but I have not looked at this. PR: 233646 Reviewed by: mjg, kib, ngie Differential Revision: https://reviews.freebsd.org/D18665	2018-12-28 13:32:14 +00:00
mckusick	51739d277b	When loading an inode from disk, verify that its mode is valid. If invalid, return EINVAL. Note that inode check-hashes greatly reduce the chance that these errors will go undetected. Reported by: Christopher Krah <krah@protonmail.com> Reported as: FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read) Reviewed by: kib MFC after: 1 week Sponsored by: Netflix M sys/fs/ext2fs/ext2_vnops.c M sys/kern/vfs_subr.c M sys/ufs/ffs/ffs_snapshot.c M sys/ufs/ufs/ufs_vnops.c	2018-12-27 07:18:53 +00:00
mav	3f27e240ec	Increase MTX_POOL_SLEEP_SIZE from 128 to 1024. This value remained unchanged for 15 years, and now this bump reduces lock spinning in GEOM and BIO layers while doing ~1.6M IOPS to 4 NVMe on 72-core system from ~25% to ~5% at the cost of additional 28KB RAM. While there, align struct mtx_pool fields to cache lines. MFC after: 1 month	2018-12-24 23:52:35 +00:00
kib	f31f9d5413	Properly test for vmio buffer in bnoreuselist(). The presence of allocated v_object does not imply that the buffer is necessarily of the VMIO kind. The buffer might have been allocated before the object was created, in which case the buffer is malloced. Although we try to avoid such a situation, it seems to be still legitimate. Reported and tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-12-23 18:52:02 +00:00
bde	6702ac5ced	Oops, rounddown() for the start was misspelled roundup() in r342295, so only aligned starts worked. This broke releasing caches in most cases where the i/o size is smaller than the fs block size.	2018-12-22 09:31:55 +00:00
bde	97d496ecdd	Fix rounding in vop_stdadvise() for POSIX_FADV_NOREUSE (really POSIX_FADV_DONTNEED). The most broken case was for applications that advise for the whole file and then do block-aligned i/o's 1 block at a time. Then advice is sent to VOP_ADVISE() 1 block at a time, but in vop_stdadvise() the 1-block advice was turned into 0-block advice for the buffer cache part. The bugs were caused partly by callers representing the region as (a_start, a_end), where a_end is actually the maximum, and everything else representing the region as (start, end) where 'end' is actually the end (1 after the maximum). The maximum a_end must be rounded up, but was rounded down. Also, rounding to page boundaries was inconsistent. The bugs and fixes have no effect for zfs and other file systems that don't use the buffer cache or the page cache. Most or all file systems currently use the default VOP_FADVISE(), but it finds a null buffer cache and a null page cache for file systems that don't use normal methods. Reviewed by: kib	2018-12-21 04:57:59 +00:00
mckusick	b9ea8013d1	Some filesystems (like cd9660 and ext3) require that VFS_STATFS() be called before VFS_ROOT() is called. Move the call for VFS_STATFS() so that it is done after VFS_MOUNT(), but before VFS_ROOT(). This change actually improves the robustness of the mount system call because it returns an error rather than failing silently when VFS_STATFS() returns failure. Reported by: Rebecca Cran <rebecca@bluestop.org> Sponsored by: Netflix	2018-12-21 01:09:25 +00:00
mjg	286549b24a	Check for probes enabled in priv_check_cred before evaluating the error. Sponsored by: The FreeBSD Foundation	2018-12-19 23:28:29 +00:00
mjg	c39e5a0486	Remove iBCS2, part2: general kernel Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation	2018-12-19 21:57:58 +00:00
mjg	655bf8e58a	Microoptimize corner case of ID bitmap handling. Prior to the change we would avoidably test more possibly used IDs. While here update the comment: there is no pidchecked variable anymore.	2018-12-19 20:29:52 +00:00
mjg	94437ac557	Deinline vfork handling out of the syscall return path. vfork is rarely called (comparatively to other syscalls) and it avoidably pollutes the fast path. Sponsored by: The FreeBSD Foundation	2018-12-19 20:27:26 +00:00
markj	b0758adada	Fix DDB's "show malloc" after r338899. MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-12-19 00:17:22 +00:00
brooks	a3c153e5af	const poison the `new` pointer of __sysctl. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D18444	2018-12-18 12:44:38 +00:00
avg	260cc02954	add support for marking interrupt handlers as suspended The goal of this change is to fix a problem with PCI shared interrupts during suspend and resume. I have observed a couple of variations of the following scenario. Devices A and B are on the same PCI bus and share the same interrupt. Device A's driver is suspended first and the device is powered down. Device B generates an interrupt. Interrupt handlers of both drivers are called. Device A's interrupt handler accesses registers of the powered down device and gets back bogus values (I assume all 0xff). That data is interpreted as interrupt status bits, etc. So, the interrupt handler gets confused and may produce some noise or enter an infinite loop, etc. This change affects only PCI devices. The pci(4) bus driver marks a child's interrupt handler as suspended after the child's suspend method is called and before the device is powered down. This is done only for traditional PCI interrupts, because only they can be shared. At the moment the change is only for x86. Notable changes in core subsystems / interfaces: - BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus interface along with convenience functions bus_suspend_intr and bus_resume_intr; - rman_set_irq_cookie and rman_get_irq_cookie functions are added to provide a way to associate an interrupt resource with an interrupt cookie; - intr_event_suspend_handler and intr_event_resume_handler functions are added to the MI interrupt handler interface. I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to implement the new intr_event functions. IH_SUSP marks a suspended interrupt handler. IH_CHANGED is used to implement a barrier that ensures that a change to the interrupt handler's state is visible to future interrupts. While there, I fixed some whitespace issues in comments and changed a couple of logically boolean variables to be bool. MFC after: 1 month (maybe) Differential Revision: https://reviews.freebsd.org/D15755	2018-12-17 17:11:00 +00:00
mckusick	fae3bd79a5	Clarify panic in set_rootvnode(). Check for panic in vfs_mountroot_shuffle(). Sponsored by: Netflix	2018-12-15 19:18:58 +00:00
mckusick	afd5ac8b62	Under UFS/FFS the VFS_ROOT() function will return an error if the inode check-hash fails. Panic'ing is not an appropriate response. So, check for an error return from VFS_ROOT() and when an error is reported, unwind and return the error. Reported by: Gary Jennejohn (gj) Sponsored by: Netflix	2018-12-15 19:04:50 +00:00
mjg	8598ea893e	vfs: mostly depessimize NDINIT_ALL 1) filecaps_init was unnecessarily a function call 2) an assignment at the end was preventing tail calling of cap_rights_init Sponsored by: The FreeBSD Foundation	2018-12-14 03:55:08 +00:00
mjg	7e31d1de7e	Remove unused argument to priv_check_cred. Patch mostly generated with Coccinelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation	2018-12-11 19:32:16 +00:00
mjg	ba8523cc7c	fd: dedup code in sys_getdtablesize Sponsored by: The FreeBSD Foundation	2018-12-11 12:08:18 +00:00
mjg	59363837d9	Make lim_cur inline if possible. It is a function call only to accommodate some ABIs which install a hook. They only care for 3 types of limits: DATA, STACK, VMEM Instead of always calling the func, see at compilation time if the requested limit is something else and just do the read if so. Sponsored by: The FreeBSD Foundation	2018-12-11 12:01:46 +00:00
mjg	45f96abb72	fd: tidy up closing a fd - avoid a call to knote_close in the common case - annotate mqueue as unlikely Sponsored by: The FreeBSD Foundation	2018-12-11 11:58:44 +00:00
mjg	78cf9b9e38	fd: stop looking for exact freefile after allocation If a lower fd is closed later, the lookup goes to waste. Allocation always performs the lookup anyway. Sponsored by: The FreeBSD Foundation	2018-12-11 11:57:12 +00:00
kib	28de053312	Free bootstacks after AP startup. Bootstacks are unused after APs executed sched_throw() in init_secondary_tail() and started executing on proper idle thread stack. Add sysinit that detects that the idle thread for each CPU was scheduled at least once, and free corresponding bootstack. Slight addition of the code (~200 bytes) is compensated by the saving, because even on typical small modern desktop CPU we leak 128K of memory otherwise (4 pages x 8 threads). Reviewed by: jhb MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18486	2018-12-11 02:54:36 +00:00
kib	4045451f9e	Remove special case handling for getfhat(fd, NULL, handle). There is no reason for it to behave differently from openat(fd, NULL). Also the handling did not work because the substituted path was from the system address space, causing EFAULT. Submitted by: Jack Halford <jack@gandi.net> MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18501	2018-12-11 02:48:49 +00:00
jhb	7b28e77e79	Don't report stale signal information for non-signal events in ptrace_lwpinfo. Once a signal's siginfo was copied to 'td_si' as part of the signal exchange in issignal(), it was never cleared. This caused future thread events that are reported as SIGTRAP events without signal information to report the stale siginfo in 'td_si'. For example, if a debugger created a new process and used SIGSTOP to stop it after PT_ATTACH, future system call entry / exit events would set PL_FLAG_SI with the SIGSTOP siginfo in pl_siginfo. This broke 'catch syscall' in current versions of gdb as it assumed PL_FLAG_SI with SIGTRAP indicates a breakpoint or single step trap. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18487	2018-12-10 19:39:24 +00:00
alc	af34503285	blst_leaf_alloc updates bighint for a leaf when an allocation is successful and includes the last block represented by the leaf. The reasoning is that, if the last block is included, then there must be no solution before that one in the leaf, so the leaf cannot provide an allocation that big again; indeed, the leaf cannot provide a solution bigger than range1. Which is all correct, except that if the value of blk passed in did not represent the first block of the leaf, because the cursor was pointing to the middle of the leaf, then a possible solution before the cursor may have been ignored, and bighint cannot be updated. Consider the sequence allocate 63 (returning address 0), free 0,63 (freeing that same block), and allocate 1 (returning 63). The result is that one block is allocated from the first leaf, and the value of bighint is 0, so that nothing can be allocated from that leaf until the only block allocated from that leaf is freed. This change detects that skipped-over solution, and when there is one it makes sure that the value of bighint is not changed when the last block is allocated. Submitted by: Doug Moore <dougm@rice.edu> Tested by: pho X-MFC with: r340402 Differential Revision: https://reviews.freebsd.org/D18474	2018-12-09 17:55:10 +00:00
mjg	d21951d547	umtx: avoid umtxshm locking on object termination if possible Sample build world result on tmpfs: kern.ipc.umtx_terminate_notempty: 0 kern.ipc.umtx_terminate_empty: 2891815 Sponsored by: The FreeBSD Foundation	2018-12-08 14:04:57 +00:00
mjg	bae6f9dc2d	Remove proctree acquire from note_procstat_proc It is not needed since r340482 ("proc: always store parent pid in p_oppid") Sponsored by: The FreeBSD Foundation	2018-12-08 11:38:39 +00:00
mjg	59185429c4	Fix a corner case in ID bitmap management. If all IDs from trypid to pid_max were used as pids, the code would enter a loop which would be infinite if none of the IDs could become free (e.g. they all belong to processes which did not transition to zombie). Fixes: r341684 ("Manage process-related IDs with bitmaps") Sponsored by: The FreeBSD Foundation	2018-12-08 10:22:12 +00:00
mjg	c2763443b4	proc: postpone proc unlock until after reporting with kqueue kqueue would always relock immediately afterwards. While here drop the NULL check for list itself. The list is always allocated. Sponsored by: The FreeBSD Foundation	2018-12-08 06:34:12 +00:00
mjg	af8321b07f	proc: handle sdt exit probe before taking the proc lock Sponsored by: The FreeBSD Foundation	2018-12-08 06:31:43 +00:00
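
The rounding problem described in 97d496ecdd and 6702ac5ced can be illustrated with a small userland sketch. This is not the code from vop_stdadvise(); the function names and the `ROUNDDOWN`/`ROUNDUP` macros are hypothetical stand-ins that only mirror the arithmetic of rounddown() and roundup(). It assumes the caller passes an inclusive range (a_start, a_end) that must be widened to whole blocks.

```c
#include <stdint.h>

/* Stand-ins for the kernel's rounddown()/roundup() arithmetic (assumption). */
#define ROUNDDOWN(x, y)	(((x) / (y)) * (y))
#define ROUNDUP(x, y)	((((x) + ((y) - 1)) / (y)) * (y))

/*
 * Correct conversion: a_end is the last byte covered (inclusive), so the
 * exclusive end is a_end + 1, rounded up to the next block boundary.
 */
static int64_t
blocks_covered(int64_t a_start, int64_t a_end, int64_t bsize)
{
	int64_t start = ROUNDDOWN(a_start, bsize);
	int64_t end = ROUNDUP(a_end + 1, bsize);	/* exclusive */

	return ((end - start) / bsize);
}

/*
 * Buggy conversion: rounding the inclusive maximum down turns advice
 * for exactly one block into advice for zero blocks.
 */
static int64_t
blocks_covered_buggy(int64_t a_start, int64_t a_end, int64_t bsize)
{
	int64_t start = ROUNDDOWN(a_start, bsize);
	int64_t end = ROUNDDOWN(a_end, bsize);

	return ((end - start) / bsize);
}
```

With a 32768-byte block and advice for bytes 0 through 32767, the first function reports one block while the second reports none, which matches the "1-block advice was turned into 0-block advice" symptom in the commit message.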
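
The clamp-instead-of-truncate idea from 0bf5594c41 and db6bffc60a can be sketched as follows. This is an illustration only, not the kernel code: `uint32_t` stands in for a 32-bit u_long, and both function names are hypothetical. It also assumes the 64-bit product itself does not overflow.

```c
#include <stdint.h>

/* Truncating conversion: the high bits are silently discarded. */
static uint32_t
pages_to_bytes_truncating(uint64_t pages, uint32_t page_size)
{
	return ((uint32_t)(pages * page_size));
}

/* Clamping conversion: saturate at the largest representable value. */
static uint32_t
pages_to_bytes_clamped(uint64_t pages, uint32_t page_size)
{
	uint64_t bytes = pages * page_size;

	return (bytes > UINT32_MAX ? UINT32_MAX : (uint32_t)bytes);
}
```

For 1048576 pages of 4096 bytes (exactly 4 GiB), the truncating version returns 0 while the clamped version returns UINT32_MAX, which is the kind of difference the two commits describe.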
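
The send(2) behaviour added in 85037a09f3 can be exercised from userland with a sketch like the one below. The helper name `demo_send_dontwait` is hypothetical. It assumes a system that honours MSG_DONTWAIT on send(2) and uses a UNIX-domain socket pair whose receiving end is never read, so the send buffer eventually fills.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Write to a blocking socket with MSG_DONTWAIT until send(2) fails and
 * return the errno of that failure. On a system that implements the
 * flag, the expected result is EAGAIN rather than an indefinite block.
 */
static int
demo_send_dontwait(void)
{
	int sv[2], error;
	char chunk[4096];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (errno);
	memset(chunk, 'x', sizeof(chunk));
	/* Nobody reads sv[1], so the buffer fills and send() must fail. */
	while (send(sv[0], chunk, sizeof(chunk), MSG_DONTWAIT) != -1)
		continue;
	error = errno;	/* save before close() can overwrite it */
	close(sv[0]);
	close(sv[1]);
	return (error);
}
```

The sockets are left in blocking mode on purpose: the point of the flag is that a single call can be made non-blocking without changing the descriptor's O_NONBLOCK state.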


16503 Commits