freebsd-dev

Author	SHA1	Message	Date
Mateusz Guzik	f837888a3e	thread: use tdfind in sysctl_kern_proc_kstack This trades linear scans for locked lookup, but more importantly removes the only consumer of thread_find.	2020-11-10 01:57:19 +00:00
Mateusz Guzik	94275e3e69	threads: remove the unused TID_BUFFER_SIZE macro	2020-11-10 01:31:06 +00:00
Mateusz Guzik	934e7e5ec9	thread: add newer bits for r367537 The committed patch was an older version.	2020-11-10 01:13:58 +00:00
Mateusz Guzik	35bb59edc5	threads: reimplement tid allocation on top of a bitmap There are workloads with very bursty tid allocation and since unr tries very hard to have small-sized bitmaps it keeps reallocating memory. Just doing buildkernel gives almost 150k calls to free coming from unr. This also gets rid of the hack which tried to postpone TID reuse. Reviewed by: kib, markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D27101	2020-11-09 23:05:28 +00:00
Mateusz Guzik	1bd3cf5de5	threads: introduce a limit for total number The intent is to replace the current id allocation method and a known upper bound will be useful. Reviewed by: kib (previous version), markj (previous version) Tested by: pho Differential Revision: https://reviews.freebsd.org/D27100	2020-11-09 23:04:30 +00:00
Mateusz Guzik	f6dd1aefb7	vfs: group mount per-cpu vars into one struct While here move frequently read stuff into the same cacheline. This shrinks struct mount by 64 bytes. Tested by: pho	2020-11-09 23:02:13 +00:00
Mateusz Guzik	f0c90a0931	malloc: provide 384 byte zone Total page count after buildworld on ZFS for 384 (if present) and 512 zones: before: 29713 after: 25946 per-zone page use: vm.uma.malloc_384.keg.domain.1.pages: 11621 vm.uma.malloc_384.keg.domain.0.pages: 11597 vm.uma.malloc_512.keg.domain.1.pages: 1280 vm.uma.malloc_512.keg.domain.0.pages: 1448 Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27145	2020-11-09 22:59:41 +00:00
Mateusz Guzik	8e6526e966	malloc: retire mt_stats_zone in favor of pcpu_zone_64 Reviewed by: markj, imp Differential Revision: https://reviews.freebsd.org/D27142	2020-11-09 22:58:29 +00:00
Mateusz Guzik	3a440a421d	Add more per-cpu zones. This covers powers of 2 up to 64. Example pending user is ZFS.	2020-11-09 00:34:23 +00:00
Mateusz Guzik	523d66730c	procdesc: convert the zone to a malloc type The object is 128 bytes in size.	2020-11-09 00:05:21 +00:00
Mateusz Guzik	e90afaa015	kqueue: save space by using only one func pointer for assertions	2020-11-09 00:04:35 +00:00
Edward Tomasz Napierala	a1bd83fede	Move syscall_thread_{enter,exit}() into the slow path. This is only needed for syscalls from unloadable modules. Reviewed by: kib MFC after: 2 weeks Sponsored by: EPSRC Differential Revision: https://reviews.freebsd.org/D26988	2020-11-08 15:54:59 +00:00
Kyle Evans	8c28aa5e45	imgact_binmisc: limit the extent of match on incoming entries imgact_binmisc matches magic/mask from imgp->image_header, which is only a single page in size mapped from the first page of an image. One can specify an interpreter that matches on, e.g., --offset 4096 --size 256 to read up to 256 bytes past the mapped first page. The limitation is that we cannot specify a magic string that exceeds a single page, and we can't allow offset + size to exceed a single page either. A static assert has been added in case someone finds it useful to try and expand the size, but it does seem a little unlikely. While this looks kind of exploitable at a sideways squinty-glance, there are a couple of mitigating factors: 1.) imgact_binmisc is not enabled by default, 2.) entries may only be added by the superuser, 3.) trying to exploit this information to read what's mapped past the end would be worse than a root canal or some other relatably painful experience, and 4.) there's no way one could pull this off without it being completely obvious. The first page is mapped out of an sf_buf, the implementation of which (or lack thereof) depends on your platform. MFC after: 1 week	2020-11-08 04:24:29 +00:00
Michael Tuexen	f908d8247e	The ioctl() calls using FIONREAD, FIONWRITE, FIONSPACE, and SIOCATMARK access the socket send or receive buffer. This is not possible for listening sockets since r319722. Because send()/recv() calls fail on listening sockets, fail also ioctl() indicating EINVAL. PR: 250366 Reported by: Yong-Hao Zou Reviewed by: glebius, rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D26897	2020-11-07 21:17:49 +00:00
Kyle Evans	1024ef27fe	imgact_binmisc: move some calculations out of the exec path The offset we need to account for in the interpreter string comes in two variants: 1. Fixed - macros other than #a that will not vary from invocation to invocation 2. Variable - #a, which is substituted with the argv0 that we're replacing Note that we don't have a mechanism to modify an existing entry. By recording both of these offset requirements when the interpreter is added, we can avoid some unnecessary calculations in the exec path. Most importantly, we can know up-front whether we need to calculate/grab the filename for this interpreter. We also get to avoid walking the string a first time looking for macros. For most invocations, it's a swift exit as they won't have any, but there's no point entering a loop and searching for the macro indicator if we already know there will not be one. While we're here, go ahead and only calculate the argv0 name length once per invocation. While it's unlikely that we'll have more than one #a, there's no reason to recalculate it every time we encounter an #a when it will not change. I have not bothered trying to benchmark this at all, because it's arguably a minor and straightforward/obvious improvement. MFC after: 1 week	2020-11-07 18:07:55 +00:00
Mateusz Guzik	42e7abd5db	rms: several cleanups + debug read lockers handling This adds a dedicated counter updated with atomics when INVARIANTS is used. As a side effect one can reliably determine the lock is held for reading by at least one thread, but it's still not possible to find out whether curthread has the lock in said mode. This should be good enough in practice. Problem spotted by avg.	2020-11-07 16:57:53 +00:00
Kyle Evans	ecb4fdf943	imgact_binmisc: reorder members of struct imgact_binmisc_entry (NFC) This doesn't change anything at the moment since the out-of-order elements were a pair of uint32_t, but future additions may have caused unnecessary padding by following the existing precedent. MFC after: 1 week	2020-11-07 16:41:59 +00:00
Michal Meloun	eb20867f52	Add a method to determine whether given interrupt is per CPU or not. MFC after: 2 weeks	2020-11-07 14:58:01 +00:00
Edward Tomasz Napierala	da45ea6bc6	Move TDB_USERWR check under 'if (traced)'. If we hadn't been traced in the first place when syscallenter() started executing, we can ignore TDB_USERWR. TDB_USERWR can get set, sure, but if it does, it's because the debugger raced with the syscall, and it cannot depend on winning that race. Reviewed by: kib MFC after: 2 weeks Sponsored by: EPSRC Differential Revision: https://reviews.freebsd.org/D26585	2020-11-07 13:09:51 +00:00
Kyle Evans	2192cd125f	imgact_binmisc: abstract away the list lock (NFC) This module handles relatively few execs (initial qemu-user-static, then qemu-user-static handles exec'ing itself for binaries it's already running), but all execs pay the price of at least taking the relatively expensive sx/slock to check for a match when this module is loaded. Future work will almost certainly swap this out for another lock, perhaps an rmslock. The RLOCK/WLOCK phrasing was chosen based on what the callers are really wanting, rather than using the verbiage typically appropriate for an sx. MFC after: 1 week	2020-11-07 05:10:46 +00:00
Kyle Evans	7d3ed9777a	imgact_binmisc: validate flags coming from userland We may want to reserve bits in the future for kernel-only use, so start rejecting any that aren't the two that we're currently expecting from userland. MFC after: 1 week	2020-11-07 04:10:23 +00:00
Kyle Evans	7667824ade	epoch: support non-preemptible epochs checking in_epoch() Previously, non-preemptible epochs could not check; in_epoch() would always fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING. For default epochs, it's easy enough to verify that we're in the given epoch: if we're in a critical section and our record for the given epoch is active, then we're in it. This patch also adds some additional INVARIANTS bookkeeping. Notably, we set and check the recorded thread in epoch_enter/epoch_exit to try and catch some edge-cases for the caller. It also checks upon freeing that none of the records had a thread in the epoch, which may make it a little easier to diagnose some improper use if epoch_free() took place while some other thread was inside. This version differs slightly from what was just previously reviewed by the below-listed, in that in_epoch() will assert that no CPU has this thread recorded even if it is currently in a critical section. This is intended to catch cases where the caller might have somehow messed up critical section nesting, we can catch both if they exited the critical section or if they exited, migrated, then re-entered (on the wrong CPU). Reviewed by: kib, markj (both previous version) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27098	2020-11-07 03:29:04 +00:00
Kyle Evans	80083216cb	imgact_binmisc: minor re-organization of imgact_binmisc_exec exits Notably, streamline error paths through the existing 'done' label, making it easier to quickly verify correct cleanup. Future work might add a kernel-only flag to indicate that an interpreter uses #a. Currently, all executions via imgact_binmisc pay the penalty of constructing sname/fname, even if they will not use it. qemu-user-static doesn't need it, the stock rc script for qemu-user-static certainly doesn't use it, and I suspect these are the vast majority of (if not the only) current users. MFC after: 1 week	2020-11-07 03:28:32 +00:00
Mateusz Guzik	e25d8b67c3	malloc: tweak the version check in r367432 to include type name While here fix a whitespace problem.	2020-11-07 01:32:16 +00:00
Mateusz Guzik	bdcc222644	malloc: move malloc_type_internal into malloc_type According to code comments the original motivation was to allow for malloc_type_internal changes without ABI breakage. This can be trivially accomplished by providing spare fields and versioning the struct, as implemented in the patch below. The upshots are one less memory indirection on each alloc and disappearance of mt_zone. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27104	2020-11-06 21:33:59 +00:00
Konstantin Belousov	f10845877e	Suspend all writeable local filesystems on power suspend. This ensures that no writes are pending in memory, either metadata or user data, but not including dirty pages not yet converted to fs writes. Only filesystems declared local are suspended. Note that this does not guarantee absence of the metadata errors or leaks if resume is not done: for instance, on UFS unlinked but opened inodes are leaked and require fsck to gc. Reviewed by: markj Discussed with: imp Tested by: imp (previous version), pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D27054	2020-11-05 20:52:49 +00:00
Mateusz Guzik	16b971ed6d	malloc: add a helper returning size allocated for given request Sample usage: kernel modules can decide whether to stick to malloc or create their own zone. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27097	2020-11-05 16:21:21 +00:00
Mateusz Guzik	2dee296a3d	Rationalize per-cpu zones. The 2 provided zones had inconsistent naming between each other ("int" and "64") and other allocator zones (which use bytes). Follow malloc by naming them "pcpu-" + size in bytes. This is a step towards replacing ad-hoc per-cpu zones with general slabs.	2020-11-05 15:08:56 +00:00
Mateusz Guzik	ea33cca971	poll/select: change selfd_zone into a malloc type On a sample box vmstat -z shows: ITEM SIZE LIMIT USED FREE REQ 64: 64, 0, 1043784, 4367538,3698187229 selfd: 64, 0, 1520, 13726,182729008 But at the same time: vm.uma.selfd.keg.domain.1.pages: 121 vm.uma.selfd.keg.domain.0.pages: 121 Thus 242 pages got pulled even though the malloc zone would likely accommodate the load without using extra memory.	2020-11-05 12:24:37 +00:00
Mateusz Guzik	2fbb45c601	vfs: change nt_zone into a malloc type Elements are small in size and allocated for short periods.	2020-11-05 12:06:50 +00:00
Kyle Evans	df69035d7f	imgact_binmisc: fix up some minor nits - Removed a bunch of redundant headers - Don't explicitly initialize to 0 - The !error check prior to setting imgp->interpreter_name is redundant, all error paths should and do return or go to 'done'. We have larger problems otherwise.	2020-11-05 04:19:48 +00:00
Mateusz Guzik	3c50616fc1	fd: make all f_count uses go through refcount_*	2020-11-05 02:12:33 +00:00
Mateusz Guzik	d737e9eaf5	fd: hide _fdrop 0 count check behind INVARIANTS While here use refcount_load and make sure to report the tested value.	2020-11-05 02:12:08 +00:00
Mateusz Guzik	331c21dd5e	pipe: whitespace nit in previous	2020-11-04 23:17:41 +00:00
Mateusz Guzik	c22ba7bb06	pipe: fix POLLHUP handling if no events were specified Linux allows polling without any events specified and it happens to be the case in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask and this works fine if the condition is already present. However, if it is missing, selrecord is only called if the eventmask has relevant bits set. This in particular leads to a condition where pipe_poll can return 0 events and neglect to selrecord, while kern_poll takes it as an indication it has to go to sleep, but then there is nobody to wake it up. While the problem seems systemic to *_poll handlers the least we can do is fix it up for pipes. Reported by: Jeremie Galarneau <jeremie.galarneau at efficios.com> Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D27094	2020-11-04 23:11:54 +00:00
Mateusz Guzik	6fc2b069ca	rms: fixup concurrent writer handling and add more features Previously the code had one wait channel for all pending writers. This could result in a buggy scenario where, after a writer switches the lock mode from readers to writers and goes off CPU, another writer queues itself and then the last reader wakes up the latter instead of the former. Use a separate channel. While here add features to reliably detect whether curthread has the lock write-owned. This will be used by ZFS.	2020-11-04 21:18:08 +00:00
Mark Johnston	f7db0c9532	vmspace: Convert to refcount(9) This is mostly mechanical except for vmspace_exit(). There, use the new refcount_release_if_last() to avoid switching to vmspace0 unless other processes are sharing the vmspace. In that case, upon switching to vmspace0 we can unconditionally release the reference. Remove the volatile qualifier from vm_refcnt now that accesses are protected using refcount(9) KPIs. Reviewed by: alc, kib, mmel MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27057	2020-11-04 16:30:56 +00:00
Brooks Davis	19647e76fc	sysvshm: pass relevant uap members as arguments Alter shmget_allocate_segment and shmget_existing to take the values they want from struct shmget_args rather than passing the struct around. In general, uap structures should only be the interface to sys_<foo> functions. This makes one small functional change and records the allocated space rather than the requested space. If this turns out to be a problem (e.g. if software tries to find undersized segments by exact size rather than using keys), we can correct that easily. Reviewed by: kib Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27077	2020-11-03 19:14:03 +00:00
Conrad Meyer	2de07e4096	unix(4): Add SOL_LOCAL:LOCAL_CREDS_PERSISTENT This option is intended to be semantically identical to Linux's SOL_SOCKET:SO_PASSCRED. For now, it is mutually exclusive with the pre-existing sockopt SOL_LOCAL:LOCAL_CREDS. Reviewed by: markj (penultimate version) Differential Revision: https://reviews.freebsd.org/D27011	2020-11-03 01:17:45 +00:00
Mateusz Guzik	e1b6a7f83f	malloc: prefix zones with malloc- Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27038	2020-11-02 17:39:15 +00:00
Mateusz Guzik	828afdda17	malloc: export kernel zones instead of relying on them being power-of-2 Reviewed by: markj (previous version) Differential Revision: https://reviews.freebsd.org/D27026	2020-11-02 17:38:08 +00:00
Stefan Eßer	1ebef47735	Make sysctl user.localbase a tunable that can be written at run-time This sysctl value had been provided as a read-only variable that is compiled into the C library based on the value of _PATH_LOCALBASE in paths.h. After this change, the value is compiled into the kernel as an empty string, which is translated to _PATH_LOCALBASE by the C library. This empty string can be overridden at boot time or by a privileged user at run time and will then be returned by sysctl. When set to an empty string, the value returned by sysctl reverts to _PATH_LOCALBASE. This update does not change the behavior on any system that does not modify the default value of user.localbase. I consider this change as experimental and would prefer if the run-time write permission was reconsidered and the sysctl variable defined with CTLFLAG_RDTUN instead to restrict it to be set at boot time. MFC after: 1 month	2020-10-31 23:48:41 +00:00
Mateusz Guzik	82c174a3b4	malloc: delegate M_EXEC handling to dedicated routines It is almost never needed and adds an avoidable branch. While here do minor clean ups in preparation for larger changes. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27019	2020-10-30 20:02:32 +00:00
Stefan Eßer	147eea393f	Add read only sysctl variable user.localbase The value is provided by the C library as for other sysctl variables in the user tree. It is compiled in and returns the value of _PATH_LOCALBASE defined in paths.h. Reviewed by: imp, scottl Differential Revision: https://reviews.freebsd.org/D27009	2020-10-30 18:48:09 +00:00
Mateusz Guzik	0685574968	vfs: change vnode poll to just a malloc type The size is 120, close fit for 128 and rarely used. The infrequent use avoidably populates per-CPU caches and ends up with more memory.	2020-10-30 14:02:56 +00:00
Mateusz Guzik	4bfebc8d2c	cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename	2020-10-30 10:46:35 +00:00
John Baldwin	36e0a362ac	Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc(). This gives a more uniform API for send tag life cycle management. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27000	2020-10-29 23:28:39 +00:00
Mateusz Guzik	62568e886a	vfs: add NAMEI_DBG_HADSTARTDIR handling lost in rewrite Noted by: rpokala	2020-10-29 18:43:37 +00:00
Mateusz Guzik	eebc2e450f	vfs: add NDREINIT to facilitate repeated namei calls struct nameidata mixes caller arguments, internal state and output, which can be quite error prone. Recent addition of validating ni_resflags uncovered a caller which could repeatedly call namei, effectively operating on partially populated state. Add bare minimum validation that this does not happen. The real fix would decouple aforementioned state. Reported by: pho Tested by: pho (different variant)	2020-10-29 12:56:02 +00:00
John Baldwin	521eac97f3	Support hardware rate limiting (pacing) with TLS offload. - Add a new send tag type for a send tag that supports both rate limiting (packet pacing) and TLS offload (mostly similar to D22669 but adds a separate structure when allocating the new tag type). - When allocating a send tag for TLS offload, check to see if the connection already has a pacing rate. If so, allocate a tag that supports both rate limiting and TLS offload rather than a plain TLS offload tag. - When setting an initial rate on an existing ifnet KTLS connection, set the rate in the TCP control block inp and then reset the TLS send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit send tag. This allocates the TLS send tag asynchronously from a task queue, so the TLS rate limit tag alloc is always sleepable. - When modifying a rate on a connection using KTLS, look for a TLS send tag. If the send tag is only a plain TLS send tag, assume we failed to allocate a TLS ratelimit tag (either during the TCP_TXTLS_ENABLE socket option, or during the send tag reset triggered by ktls_output_eagain) and ignore the new rate. If the send tag is a ratelimit TLS send tag, change the rate on the TLS tag and leave the inp tag alone. - Lock the inp lock when setting sb_tls_info for a socket send buffer so that the routines in tcp_ratelimit can safely dereference the pointer without needing to grab the socket buffer lock. - Add an IFCAP_TXTLS_RTLMT capability flag and associated administrative controls in ifconfig(8). TLS rate limit tags are only allocated if this capability is enabled. Note that TLS offload (whether unlimited or rate limited) always requires IFCAP_TXTLS[46]. Reviewed by: gallatin, hselasky Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26691	2020-10-29 00:23:16 +00:00
Konstantin Belousov	3cbf9dc81c	Check for process group change in tty_wait_background(). The calling process's process group can change between PROC_UNLOCK(p) and PGRP_LOCK(pg) in tty_wait_background(), e.g. by a setpgid() call from another process. If that happens, the signal is not sent to the calling process, even if the prior checks determine that one should be sent. Re-check that the process group hasn't changed after acquiring the pgrp lock, and if it has, redo the checks. PR: 250701 Submitted by: Jakub Piecuch <j.piecuch96@gmail.com> MFC after: 2 weeks	2020-10-28 22:12:47 +00:00
Edward Tomasz Napierala	bdc0cb4e2c	Add local variable to store the sysent pointer. Just a cleanup, no functional changes. Reviewed by: kib (earlier version) MFC after: 2 weeks Sponsored by: EPSRC Differential Revision: https://reviews.freebsd.org/D26977	2020-10-28 14:43:38 +00:00
Edward Tomasz Napierala	bce7ee9d41	Drop "All rights reserved" from all my stuff. This includes Foundation copyrights, approved by emaste@. It does not include files which carry other people's copyrights; if you're one of those people, feel free to make similar change. Reviewed by: emaste, imp, gbe (manpages) Differential Revision: https://reviews.freebsd.org/D26980	2020-10-28 13:46:11 +00:00
Mateusz Guzik	11743b6e47	vfs: tidy up vnlru_free Apart from cosmetic changes make sure to only decrease the recycled counter if vtryrecycle succeeded. Tested by: pho	2020-10-27 18:13:09 +00:00
Mateusz Guzik	68ac2b804c	vfs: fix vnode reclaim races against getnewvnode All vnodes allocated by UMA are present on the global list used by vnlru. getnewvnode modifies the state of the vnode (most notably altering v_holdcnt) but never locks it. Moreover filesystems also modify it in arbitrary manners sometimes before taking the vnode lock or adding any other indicator that the vnode can be used. Picking up such a vnode by vnlru would be problematic. To that end there are 2 fixes: - vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the interlock and verifies that v_mount has been set. It is an invariant that the vnode lock is held by that point, providing the necessary serialisation against locking after vhold. - vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes, now makes sure to only transition the count 0->1 and newly allocated vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only transition VHOLD_NO_SMR->1 once more making the hold fail Tested by: pho	2020-10-27 18:12:07 +00:00
Mateusz Guzik	d681c51d36	cache: add missing NIRES_ABS handling	2020-10-26 18:01:18 +00:00
Alexander Motin	3c0177b887	Enable bioq 'car limit' added at r335066 at 128 bios. Without the 'car limit' enabled (before this), running sequential ZFS scrub on HDD without command queuing support, I've measured latency on concurrent random reads reaching 4 seconds (surprised that not more). Enabling this reduced the latency to 65 milliseconds, while scrub still doing ~180MB/s. For disks with command queuing this does not make much difference (if any), since most time all the requests are queued down to the disk or HBA, leaving nothing in the queue to sort. And even if something does not fit, staying on the queue, it is likely not for long. To not limit sorting in such bursty scenarios I've added batched counter zeroing when the queue is getting empty. The internal scheduler of the SAS HDD I was testing seems to be even more loyal to random I/O, reducing the scrub speed to ~120MB/s. So in case somebody is worried this limit is too strict -- it actually looks relaxed. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-10-26 04:04:06 +00:00
Alexander Motin	8b220f8915	Fix asymmetry in devstat(9) calls by GEOM. Before this GEOM passed bio pointer to transaction start, but not end. It was irrelevant until devstat(9) got DTrace hooks, that appeared to provide bio pointer on I/O completion, but not on submission. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-10-24 21:07:10 +00:00
Ruslan Bukin	f32f0095e9	o Add iommu de-initialization method for MSI interface. o Add iommu_unmap_msi() to release the msi GAS entry. o Provide default implementations for iommu init/deinit methods. Reviewed by: kib Sponsored by: Innovate DSbD Differential Revision: https://reviews.freebsd.org/D26906	2020-10-24 20:09:27 +00:00
Ryan Moeller	e58483c4fb	sysctl+kern_sysctl: Honor SKIP for descendant nodes Ensure we also skip descendants of SKIP nodes when iterating through children of an explicitly specified node. Reported by: np Reviewed by: np MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D26833	2020-10-24 16:17:07 +00:00
Ryan Moeller	0595c12484	kern_sysctl: Misc code cleanup Remove unused oidpp parameter from sysctl_sysctl_next_ls and add high level comments to describe how it works. No functional change. Reviewed by: imp MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D26854	2020-10-24 14:46:38 +00:00
Kyle Evans	275c821d3d	audit: correct reporting of execve(2) success r326145 corrected do_execve() to return EJUSTRETURN upon success so that important registers are not clobbered. This had the side effect of tapping out 'failures' for all execve(2) audit records, which is less than useful for auditing purposes. Audit exec returns earlier, where we can know for sure that EJUSTRETURN translates to success. Note that this unsets TDP_AUDITREC as we commit the audit record, so the usual audit in the syscall return path will do nothing. PR: 249179 Reported by: Eirik Oeverby <ltning-freebsd anduin net> Reviewed by: csjp, kib MFC after: 1 week Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26922	2020-10-24 14:39:17 +00:00
Mateusz Guzik	eb65cde4f5	cache: assorted typo fixes	2020-10-24 13:31:40 +00:00
Mateusz Guzik	029cfccc71	cache: add the missing NC_NOMAKEENTRY and NC_KEEPPOSENTRY to lockless lookup They are de facto ignored.	2020-10-24 13:31:25 +00:00
Mateusz Guzik	7cc1718613	vfs: fix a race where reclaim vholds freed vnodes Reported by: pho Tested by: pho (previous version) Fixes: r366974 ("vfs: stop taking the interlock in vnode reclaim")	2020-10-24 13:30:37 +00:00
Mateusz Guzik	acb41008f3	cache: batch updates to numcache in case of mass removal	2020-10-24 01:14:52 +00:00
Mateusz Guzik	208cb7c4b6	cache: refactor alloc/free This in particular centralizes manipulation of numcache.	2020-10-24 01:14:17 +00:00
Mateusz Guzik	1d44405690	cache: fold branch prediction into cache_ncp_canuse	2020-10-24 01:13:47 +00:00
Mateusz Guzik	c13d7d1f98	cache: fix some typos	2020-10-24 01:13:16 +00:00
Mateusz Guzik	f878526f20	cache: drop write-only vars	2020-10-24 01:13:02 +00:00
Ruslan Bukin	9729b14985	Move the iommu stubs to a generic place, so they are available on all the platforms. This allows to not depend on the IOMMU macro in AHCI driver. Requested by: kib Suggested by: andrew Reviewed by: kib Sponsored by: Innovate DSbD Differential Revision: https://reviews.freebsd.org/D26887	2020-10-23 21:27:48 +00:00
Mateusz Guzik	3862838921	cache: reduce memory waste in struct namecache The previous scheme for calculating the total size was doing sizeof on the struct and then adding the wanted space for the buffer. nc_name is at offset 58 while sizeof(struct namecache) is 64. With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were allocating 104 bytes for the entry and never accounting for the 6 byte padding, wasting that space.	2020-10-23 15:56:22 +00:00
Mateusz Guzik	703f3fafa5	vfs: stop taking the interlock in vnode reclaim It no longer protects any of tested fields, keeping all the checks racy. While here make vtryrecycle drop the vnode on its own. Avoids an additional lock trip.	2020-10-23 15:49:18 +00:00
Mateusz Guzik	c7520caa4f	vfs: prevent avoidable evictions on mkdir of existing directories mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST. The NOCACHE lookup will make the namecache unnecessarily evict the existing entry, and then fallback to the fs lookup routine eventually leading namei to return an error as the directory is already there. For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers fallbacks to the slowpath for concurrently executing lookups. Tested by: pho Discussed with: kib	2020-10-22 19:28:12 +00:00
Mateusz Guzik	54f09403a3	cache: assert the created entry does not point to itself	2020-10-22 19:22:34 +00:00
Konstantin Belousov	18b8496c23	sysv_sem: semusz depends on semume. Size of the per-process semaphore undo structure (semusz) depends on the number of the per-process undos. If kern.ipc.semume is adjusted, semusz must be adjusted as well, and it makes no sense to delegate adjustment to user. Make it automatic. Reported and tested by: Olef <o.vandestadt@gmail.com> PR: 250361 Reviewed by: jhb, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26826	2020-10-22 09:28:11 +00:00
Hans Petter Selasky	2ae634c6db	Implement mbuf hashing routines for IP over infiniband, IPoIB. No functional change intended. Differential Revision: https://reviews.freebsd.org/D26254 Reviewed by: melifaro@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-22 09:17:56 +00:00
Brooks Davis	44ca4575ea	vmapbuf: don't smuggle address or length in buf Instead, add arguments to vmapbuf. Since this argument is always a pointer use a type of void * and cast to vm_offset_t in vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to take a pointer and check its bounds.) In no other situation does b_data contain a user pointer and vmapbuf replaces b_data with the actual mapping. Suggested by: jhb Reviewed by: imp, jhb Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26784	2020-10-21 16:00:15 +00:00
Mateusz Guzik	2f1c35053c	cache: drop the spurious slash_prefixed argument	2020-10-21 05:57:25 +00:00
Mateusz Guzik	ab21ed17ed	vfs: drop the de facto curthread argument from VOP_INACTIVE	2020-10-20 07:19:03 +00:00
Mateusz Guzik	8ecd87a3e7	vfs: drop spurious cred argument from VOP_VPTOCNP	2020-10-20 07:18:27 +00:00
Konstantin Belousov	c0baa3dc4a	vgonel(): avoid recursing into VOP_INACTIVE(). It is a common pattern for filesystems' VOP_INACTIVE() implementation to forcibly reclaim the vnode when its state is final. For instance, UFS vnode with zero link count is removed, and since it is inactivated, the last open reference on it is dropped. On the other hand, vnode might get spurious usecount reference for many reasons. If the spurious reference exists while vgonel() checks for active state of the vnode, it would recurse into VOP_INACTIVE(). Fix it by checking and not doing inactivation when vgone() was called from inactive VOP. Reported and tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-10-19 19:20:23 +00:00
Mateusz Guzik	6d5d469fc1	cache: promote negative entries based on more than one hit During tinderbox and similar workloads negative entries get at least one hit before they get evicted. In the current scheme this avoidably promotes them. Be conservative and stick to 2 hits for now.	2020-10-19 18:51:51 +00:00
John Baldwin	6bcf3c46d8	Check TF_TOE not the tod pointer to determine if TOE is active. The TF_TOE flag is the check used in the rest of the network stack to determine if TOE is active on a socket. There is at least one path in the cxgbe(4) TOE driver that can leave the tod pointer non-NULL on a socket not using TOE. Reported by: Sony Arpita Das <sonyarpitad@chelsio.com> Reviewed by: np Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D26803	2020-10-19 18:24:06 +00:00
Mark Johnston	d80126a6f4	link_elf_obj: Colour VM objects This will cause the VM to back sufficiently large .text sections, such as those in zfs.ko or amdgpu.ko on amd64, with superpage mappings when possible. Reviewed by: alc, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26802	2020-10-19 16:57:59 +00:00
Mark Johnston	6351771b7c	vmem: Allocate btags before looping in vmem_xalloc() BT_MAXALLOC (4) is the number of boundary tags required to complete an allocation in the worst case: two to clip a free segment, and two to import from a parent arena. vmem_xalloc() preallocates four boundary tags before attempting a search to simplify the segment allocation code. It implements a loop that: 1) ensures that BT_MAXALLOC boundary tags are available, 2) attempts to find and clip a free segment satisfying the allocation constraints, and failing that, 3) attempts to import a segment. On !UMA_MD_SMALL_ALLOC platforms the btag zone has to handle recursion: it needs boundary tags to allocate boundary tags. Thus we reserve 2 * BT_MAXALLOC * mp_ncpus tags for use when recursing: the factor of 2 is because there are two layers of vmem arenas, the per-domain arena and global arena. For a single thread, 2 * BT_MAXALLOC tags should be sufficient. Because of the way the loop is structured, BT_MAXALLOC tags are not sufficient. The first bt_fill() call may allocate BT_MAXALLOC tags, then import a segment (consuming two tags), then attempt to top up the preallocation before carving into the imported free segment, thus requiring up to six tags in the worst case. Because we don't preallocate that many, this bug can cause deadlocks in rare scenarios. Fix the problem by moving the preallocation out of the loop. This assumes that only a single import is ever required to satisfy an allocation request. Thanks to manu, emaste and lwhsu for helping test debug patches. Reported by: Jenkins (hardware CI lab) Reviewed by: alc, kib, rlibby MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26770	2020-10-19 16:54:06 +00:00
Mark Johnston	33a9bce62f	vmem: Simplify bt_fill() callers a bit No functional change intended. Reviewed by: alc, kib, rlibby MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26769	2020-10-19 16:52:27 +00:00
Ruslan Bukin	e707c8be4e	Manage MSI iommu pages. With this, the interrupt controller driver needs only a small change to create a map for the page the device will write to raise an interrupt. Submitted by: andrew Reviewed by: kib Sponsored by: Innovate DSbD Differential Revision: https://reviews.freebsd.org/D26705	2020-10-19 13:10:21 +00:00
Mateusz Guzik	665c8c3e7d	cache: refactor negative promotion/demotion handling This will simplify policy changes.	2020-10-19 09:52:52 +00:00
Bjoern A. Zeeb	f7a0bb0dec	ddb: add show sysinit command Add a show sysinit command to ddb (similar to show vnet_sysinit) which proved to be helpful to debug some ordering issues on early-mid kernel start panics.	2020-10-17 22:47:08 +00:00
Mateusz Guzik	4c4aa84848	cache: shorten names of debug stats	2020-10-17 21:30:46 +00:00
Mateusz Guzik	676557143f	cache: don't automatically evict negative entries if usage is low The previous scheme only looked at negative entry count in relation to the total count, leading to tons of spurious evictions if the cache is not significantly populated. Instead, only try the above if negative entry count goes beyond namecache capacity.	2020-10-17 21:22:40 +00:00
Mateusz Guzik	e98c3bc667	cache: rework sysctl vfs.cache tree Split everything into neg, debug, param and stat categories. The legacy nchstats sysctl (queried e.g., by systat) remains untouched. While here rename some vars to be easier on the eye.	2020-10-17 13:06:29 +00:00
Mateusz Guzik	fa7c73d30c	cache: factor negative lookup out of cache_fplookup_next	2020-10-17 13:04:46 +00:00
Mateusz Guzik	41e6b18422	cache: avoid smr in cache_neg_evict in favor of the already held bucket lock	2020-10-17 13:04:25 +00:00
Mateusz Guzik	c38d8e1eb2	cache: rework parts of negative entry management - declutter sysctl vfs.cache by moving relevant entries into vfs.cache.neg - add a little more parallelism to eviction by replacing the global lock with an atomically modified counter - track more statistics The code needs further effort.	2020-10-17 08:48:58 +00:00
Mateusz Guzik	b31b5e9cfd	cache: remove entries before trying to add new ones, not after Should allow positive entries to replace negative ones in case the cache is full.	2020-10-17 08:48:32 +00:00
Mateusz Guzik	ad89066af4	vfs: annotate mountlist_mtx with __exclusive_cache_line	2020-10-17 08:47:08 +00:00
Mateusz Guzik	d6eee35004	cache: add a probe reporting addition of duplicate entries	2020-10-17 00:27:26 +00:00
Mateusz Guzik	a59b0ac3aa	cache: flip inverted condition in previous It happened to not affect correctness in that the fallback code would simply neglect to promote the entry.	2020-10-16 02:19:33 +00:00
Mateusz Guzik	e7602e04c7	cache: support negative entry promotion in slowpath smr	2020-10-16 00:56:13 +00:00
Mateusz Guzik	571bc3d1af	cache: elide vhold/vdrop around promoting negative entry	2020-10-16 00:55:57 +00:00
Mateusz Guzik	640e6162ee	cache: dedup code for negative promotion	2020-10-16 00:55:31 +00:00
Mateusz Guzik	c97c8746c0	cache: neglist -> nl; negstate -> ns No functional changes.	2020-10-16 00:55:09 +00:00
Mateusz Guzik	43777a207d	cache: split hotlist between existing negative lists This simplifies the code while allowing for concurrent negative eviction down the road. Cache misses increased slightly due to higher rate of evictions allowed by the change. The current algorithm remains too aggressive.	2020-10-15 17:44:17 +00:00
Mateusz Guzik	430dc4518d	cache: make neglist an array given the static size	2020-10-15 17:42:22 +00:00
Brooks Davis	16e4a0c89c	physio: Don't store user addresses in bio_data Only assign the address from the iovec to bio_data if it is a kernel address. This was the single place where bio_data stored (however briefly) a userspace pointer. Reviewed by: imp, markj Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26783	2020-10-15 17:05:21 +00:00
Mateusz Guzik	214eccf4b6	vfs: add VOP_EAGAIN Can be used to stub fplookup for example.	2020-10-15 04:48:14 +00:00
Konstantin Belousov	6f3b523c9a	Avoid dump_avail[] redefinition. Move dump_avail[] extern declaration and inlines into a new header vm/vm_dumpset.h. This fixes default gcc build for mips. Reviewed by: alc, scottph Tested by: kevans (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26741	2020-10-14 22:51:40 +00:00
John Baldwin	c2a8fd6f05	Permit sending empty fragments for TLS 1.0. Due to a weakness in the TLS 1.0 protocol, OpenSSL will periodically send empty TLS records ("empty fragments"). These TLS records have no payload (and thus a page count of zero). m_uiotombuf_nomap() was returning NULL instead of an empty mbuf, and a few places needed to be updated to treat an empty TLS record as having a page count of "1" as 0 means "no work to do" (e.g. nothing to encrypt, or nothing to mark ready via sbready()). Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26729	2020-10-13 17:30:34 +00:00
Warner Losh	e59db46854	newbus: use ssize_t to match sb's len and size, fix ordering of space check Both s_len and s_size are ssize_t, so their difference is also more properly a ssize_t not a size_t. Also, assert that len is <= size when we enter. This should always be the case. Ensure that we have that one byte that we write to the end of the buffer before we do so, though the error should already be set on the buffer if not, and the only times we supply 'partial' buffers they should be plenty large. Reviewed by: cem, jhb (prior version, I did cem's suggestion) Differential Revision: https://reviews.freebsd.org/D26752	2020-10-12 22:07:44 +00:00
Conrad Meyer	f8e8a06d23	random(4) FenestrasX: Push root seed version to arc4random(3) Push the root seed version to userspace through the VDSO page, if the RANDOM_FENESTRASX algorithm is enabled. Otherwise, there is no functional change. The mechanism can be disabled with debug.fxrng_vdso_enable=0. arc4random(3) obtains a pointer to the root seed version published by the kernel in the shared page at allocation time. Like arc4random(9), it maintains its own per-process copy of the seed version corresponding to the root seed version at the time it last rekeyed. On read requests, the process seed version is compared with the version published in the shared page; if they do not match, arc4random(3) reseeds from the kernel before providing generated output. This change does not implement the FenestrasX concept of PCPU userspace generators seeded from a per-process base generator. That change is left for future discussion/work. Reviewed by: kib (previous version) Approved by: csprng (me -- only touching FXRNG here) Differential Revision: https://reviews.freebsd.org/D22839	2020-10-10 21:52:00 +00:00
Mateusz Guzik	dd28b379cb	vfs: support lockless dirfd lookups	2020-10-10 03:48:17 +00:00
Bryan Drewery	c2c6fb90e0	Use unlocked page lookup for inmem() to avoid object lock contention Reviewed By: kib, markj Submitted by: mlaier Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D26653	2020-10-09 23:49:42 +00:00
Mateusz Guzik	deb1339f3f	vfs: fix a panic when truncating coming from copy_file_range Truncating requires an exclusive lock, but it was not taken if the filesystem indicates support for shared writes. This only concerns ZFS. In particular fixes cp of files which have trailing holes. Reported by: bdrewery	2020-10-09 20:31:42 +00:00
John Baldwin	7e8bd70cff	Don't invoke semunload() if seminit() fails during MOD_LOAD. The module handler code invokes a MOD_UNLOAD event immediately if MOD_LOAD fails. The result was that if seminit() failed, semunload() was invoked twice. semunload() is not idempotent, however, and would try to remove its process_exit eventhandler twice, resulting in a panic. Reviewed by: kib, markj Obtained from: CheriBSD MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26696	2020-10-09 20:20:42 +00:00
Mateusz Guzik	eb88fed446	cache: fix vexec panic when racing against vgone Use of dead_vnodeops would result in a panic instead of returning the intended EOPNOTSUPP error. While here make sure to abort, not just try to return a partial result. The former allows the regular lookup to restart from scratch, while the latter makes it stuck with an unusable vnode. Reported by: kevans	2020-10-09 19:10:00 +00:00
Rick Macklem	19fe23fa2b	Make vn_generic_copy_file_range() interruptible via a signal. Without this patch, when vn_generic_copy_file_range() is doing a large copy, it will remain in the function for a considerable amount of time, delaying handling of any outstanding signals until the copy completes. This patch adds checks for signals that need to be processed after each successful data copy cycle. When sig_intr() returns non-zero, vn_generic_copy_file_range() will return. The check "if (len < savlen)" ensures that some data has been copied, so that progress will be made. Note that, since copy_file_range(2) is allowed to return fewer bytes copied than requested, it will never return EINTR/ERESTART when sig_intr() returns non-zero. Reviewed by: kib, asomers Differential Revision: https://reviews.freebsd.org/D26620	2020-10-09 01:04:28 +00:00
Konstantin Belousov	203dda8a63	sig_intr(9): return early if AST is not scheduled. Check td_flags for relevant AST requests lock-less. This opens the race slightly wider where sig_intr() returns false negative, but might be it is worth it. Requested by: mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-10-08 22:34:34 +00:00
Konstantin Belousov	4ea4966009	Do not allow to use O_BENEATH as an oracle. Specifically, if lookup() returned any error and the topping directory was not latched, which means that (non-existent) path did not return to the topping location, give ENOTCAPABLE a priority over the lookup() error. PR: 249960 Reviewed by: emaste, ngie Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26695	2020-10-08 22:31:11 +00:00
Mitchell Horne	841dad02e9	Fix a loop condition The correct way to identify the end of the metadata is two adjacent entries set to zero/MODINFO_END. I made a typo and this was checking the first entry twice. Reported by: rpokala Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc.	2020-10-08 18:29:17 +00:00
Mitchell Horne	22e6a67086	Add a routine to dump boot metadata The boot metadata (also referred to as modinfo, or preload metadata) provides information about the size and location of the kernel, pre-loaded modules, and other metadata (e.g. the EFI framebuffer) to be consumed by the kernel during early boot. It is encoded as a series of type-length-value entries and is usually constructed by loader(8) and passed to the kernel. It is also faked on some architectures when booted by other means. Although much of the module information is available via kldstat(8), there is no easy way to debug the metadata in its entirety. Add some routines to parse this data and allow it to be printed to the console during early boot or output via a sysctl. Since the output can be lengthy, printing to the console is gated behind the debug.dump_modinfo_at_boot kenv variable as well as the BOOTVERBOSE flag. The sysctl to print the metadata is named debug.dump_modinfo. Reviewed by: tsoome Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26687	2020-10-08 18:02:05 +00:00
Hans Petter Selasky	eccb214897	The ethernet header structure is read-only. Add const keyword. (This is a diff reduction towards D26254) MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-08 11:25:19 +00:00
Mitchell Horne	44c705cf15	Handle kmod local relocation failures gracefully It is possible for elf_reloc_local() to fail in the unlikely case of an unsupported relocation type. If this occurs, do not continue to process the file. Reviewed by: kib, markj (earlier version) MFC after: 1 week Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26701	2020-10-07 23:14:49 +00:00
Warner Losh	bc683a89a3	Move kernel env global variables, etc to sys/kenv.h The kernel globals for kenv are confined to 2 files that need them and a few that likely shouldn't (but as written the code does). Move them from sys/systm.h to sys/kenv.h. This removed a XXX from systm.h and cleans it up a little bit...	2020-10-07 06:16:37 +00:00
Mitchell Horne	6debfd4b13	Remove unused function cpu_boot() The prototype was added with the creation of kern_shutdown.c in r17658, but it appears to have never been implemented. Remove it now. Reviewed by: cem, kib Differential Revision: https://reviews.freebsd.org/D26702	2020-10-06 23:16:56 +00:00
John Baldwin	56fb710f1b	Store the send tag type in the common send tag header. Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with their own identical headers that stored the type that the type-specific tag structures inherited from, so in practice it seems drivers need this in the tag anyway. This permits removing these extra header indirections (struct cxgbe_snd_tag and struct mlx5e_snd_tag). In addition, this permits driver-independent code to query the type of a tag, e.g. to know what type of tag is being queried via if_snd_query. Reviewed by: gallatin, hselasky, np, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26689	2020-10-06 17:58:56 +00:00
Ryan Moeller	92e17803cd	Enable iterating all sysctls, even ones with CTLFLAG_SKIP Add an "nextnoskip" sysctl that allows for listing of sysctls intended to be normally skipped for cost reasons. This makes it so the names/descriptions of those sysctls can be discovered with sysctl -aN/sysctl -ad/sysctl -at. It also makes it so children are visited when a node flagged with CTLFLAG_SKIP is explicitly requested. The intended use case is to mark the root "kstat" node with CTLFLAG_SKIP so that the extensive and expensive stats are skipped by default but may still be easily obtained without having to know them all (which may not even be possible) and request each one-by-one. Reviewed by: jhb MFC after: 2 weeks Relnotes: yes Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D26560	2020-10-05 20:13:22 +00:00
Mateusz Guzik	4e2266100d	cache: fix pwd use-after-free in setting up fallback Since the code exits smr section prior to calling pwd_hold, the used pwd can be freed and a new one allocated with the same address, making the comparison erroneously true. Note it is very unlikely anyone ran into it.	2020-10-05 19:38:51 +00:00
Mark Johnston	780766eb52	Remove sysctl_kern_consmute() It is a trivial wrapper for sysctl_handle_int() since r184521. Also remove the NEEDGIANT flag, cn_mute is accessed locklessly. MFC after: 1 week	2020-10-05 15:54:19 +00:00
Konstantin Belousov	0400be45e9	Add sig_intr(9). It gives the answer would the thread sleep according to current state of signals and suspensions. Of course the answer is racy and allows for false-negatives (no sleep when signal is delivered after process lock is dropped). Also the answer might change due to signal rescheduling among threads in multi-threaded process. Still it is the best approximation I can provide, to answering the question was the thread interrupted. Reviewed by: markj Tested by: pho, rmacklem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D26628	2020-10-04 16:33:42 +00:00
Konstantin Belousov	0c82fb267b	Refactor sleepq_catch_signals(). - Extract suspension check into sig_ast_checksusp() helper. - Extract signal check and calculation of the interruption errno into sig_ast_needsigchk() helper. The helpers are moved to kern_sig.c which is the proper place for signal-related code. Improve control flow in sleepq_catch_signals(), to handle ret == 0 (can sleep) and ret != 0 (interrupted) only once, by separating checking code into sleepq_check_ast_sq_locked(), which return value is interpreted at single location. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D26628	2020-10-04 16:30:05 +00:00
Edward Tomasz Napierala	4658877815	Move KTRUSERRET() from userret() to ast(). It's a really long detour - it writes ktrace entries to the filesystem - so the overhead of ast() won't make any difference. Reviewed by: kib Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26404	2020-10-03 12:03:08 +00:00
Mark Johnston	c88285c54a	Fix the INVARIANTS build for 32-bit platforms Reported by: Jenkins MFC with: r366368	2020-10-02 18:54:37 +00:00
Mark Johnston	f31695cc64	Implement sparse core dumps Currently we allocate and map zero-filled anonymous pages when dumping core. This can result in lots of needless disk I/O and page allocations. This change tries to make the core dumper more clever and represent unbacked ranges of virtual memory by holes in the core dump file. Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to clean up and return an error when it would otherwise map a zero-filled page. Then, in the core dumper code, prefault all user pages and handle errors by simply extending the size of the core file. This also fixes a bug related to the fact that vn_io_fault1() does not attempt partial I/O in the face of errors from vm_fault_quick_hold_pages(): if a truncated file is mapped into a user process, an attempt to dump beyond the end of the file results in an error, but this means that valid pages immediately preceding the end of the file might not have been dumped either. The change reduces the core dump size of trivial programs by a factor of ten simply by excluding unaccessed libc.so pages. PR: 249067 Reviewed by: kib Tested by: pho MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26590	2020-10-02 17:50:22 +00:00
Mark Johnston	fec41f0751	Simplify the check for non-dumpable VM object types OBJT_DEFAULT, _SWAP, _VNODE and _PHYS is exactly the set of non-fictitious object types, so just check for OBJ_FICTITIOUS. The check no longer excludes dead objects, but such objects have to be handled regardless. No functional change intended. Reviewed by: alc, dougm, kib Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26589	2020-10-02 17:49:13 +00:00
Mateusz Guzik	aa34e791fa	cache: update the commentary for path parsing	2020-10-02 14:50:03 +00:00
Bryan Drewery	9ceba22462	Revert r366340. CR wasn't finished and it breaks the build.	2020-10-01 20:08:27 +00:00
Bryan Drewery	2398cd1103	Use unlocked page lookup for inmem() to avoid object lock contention Reviewed By: kib, markj Sponsored by: Dell EMC Isilon Submitted by: mlaier Differential Revision: https://reviews.freebsd.org/D26597	2020-10-01 19:17:03 +00:00
Edward Tomasz Napierala	4c6f466cb4	Only clear TDP_NERRNO when needed, ie when it's previously been set. Reviewed by: kib Tested by: pho Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26612	2020-10-01 18:45:31 +00:00
Mateusz Guzik	b5ab177a99	cache: properly report ENOTDIR on foo/bar lookups where foo is a file Reported by: fernape	2020-10-01 08:46:21 +00:00
Rick Macklem	961afe3c99	Clip the "len" argument to vn_generic_copy_file_range() at a hole size boundary. By clipping the len argument of vn_generic_copy_file_range() to end at an exact multiple of hole size, holes are more likely to be maintained during the copy. A hole can still straddle the boundary at the end of the copy range, resulting in a block being allocated in the output file as it is being grown in size, but this will reduce the likelihood of this happening. While here, also modify setting of blksize to better handle the case where _PC_MIN_HOLE_SIZE is returned as 1. Reviewed by: asomers Differential Revision: https://reviews.freebsd.org/D26570	2020-10-01 00:33:44 +00:00
John Baldwin	8128c65b4c	Avoid a dubious assignment to bio_data in aio_qbio(). A user pointer is not a suitable value for bio_data and the next block of code always overwrites bio_data anyway. Just use cb->aio_buf directly in the call to vm_fault_quick_hold_pages(). Reviewed by: kib Obtained from: CheriBSD MFC after: 1 month Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26595	2020-09-30 17:49:06 +00:00
Mateusz Guzik	4301a5a794	cache: push the lock into cache_purge_impl	2020-09-30 17:08:34 +00:00
Mateusz Guzik	d4cac59429	cache: use cache_has_entries where appropriate instead of opencoding it	2020-09-30 04:27:38 +00:00
Rick Macklem	164aa1e941	Make copy_file_range(2) Linux compatible for overflow of offset + len. Without this patch, if a call to copy_file_range(2) specifies an input file offset + len that would wrap around, EINVAL is returned. I thought that was the Linux behaviour, but recent testing showed that Linux accepts this case and does the copy_file_range() to EOF. This patch changes the FreeBSD code to exhibit the same behaviour as Linux for this case. Reviewed by: asomers, kib Differential Revision: https://reviews.freebsd.org/D26569	2020-09-30 02:18:09 +00:00
Edward Tomasz Napierala	3409864922	Use the 'traced' variable instead of comparing p->p_flag again. Reviewed by: kib Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26577	2020-09-29 11:18:48 +00:00
Kyle Evans	5f0601fd19	Address whitespace nits in subr_rtc.c These were separated out from a nearby patch from Andrew Gierth. MFC after: 3 days	2020-09-28 17:19:57 +00:00
Warner Losh	ab3f5b6ef2	For multicons boot, report it and which console is primary Until we can do proper /etc/rc output on both consoles in multicons boot (or all of them if we ever generalize), report when we are booting multicons. Also report the primary console. This will be a big hint why output stops after this line (though some slow USB discovery still happens after mountroot / init starts). Reviewed by: scottl@, tsoome@ Differential Revision: https://reviews.freebsd.org/D26574	2020-09-28 16:19:29 +00:00
Edward Tomasz Napierala	1e2521ffae	Get rid of sa->narg. It serves no purpose; use sa->callp->sy_narg instead. Reviewed by: kib Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26458	2020-09-27 18:47:06 +00:00
Edward Tomasz Napierala	0c5bd5f993	Regen after r366145. Sponsored by: DARPA	2020-09-25 10:05:38 +00:00
Alan Somers	5710395f4d	Fix some signed/unsigned comparison warnings in NFS Reviewed by: rmacklem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26533	2020-09-24 15:38:01 +00:00
Konstantin Belousov	5dca94ee82	Remove pointless local variable. Reported by: alc Sponsored by: The FreeBSD Foundation MFC after: 6 days	2020-09-24 12:14:25 +00:00
Mateusz Guzik	1b2edd6e2b	cache: eliminate cache_zap_locked_vnode It is only ever called for negative entries and for those it is just a wrapper around cache_zap_negative_locked_vnode_kl which always succeeds. This also fixes a bug where cache_lookup_fallback should have been calling cache_zap_locked_bucket instead. Note that in order to trigger the bug NOCACHE must not be set, which currently only happens when creating a new coredump (and then the coredump-to-be has to have a negative entry).	2020-09-24 03:38:32 +00:00
Mark Johnston	78257765f2	Add a vmparam.h constant indicating pmap support for large pages. Enable SHM_LARGEPAGE support on arm64. Reviewed by: alc, kib Sponsored by: Juniper Networks, Inc., Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26467	2020-09-23 19:34:21 +00:00
Konstantin Belousov	aaf78c16f5	Do not leak oldvmspace if image activation failed and current address space is already destroyed, so kern_execve() terminates the process. While there, clean up some internals of post_execve() inlined in init_main. Reported by: Peter <pmc@citylink.dinoex.sub.org> Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26525	2020-09-23 18:03:07 +00:00
Mateusz Guzik	a3d9bf49b5	cache: drop the force flag from purgevfs The optional scan is wasteful, thus it is removed altogether from unmount. Callers which always want it anyway remain unaffected.	2020-09-23 10:46:07 +00:00
Mateusz Guzik	a952fefff2	cache: reimplement purgevfs to iterate vnodes instead of the entire hash The entire cache scan was a leftover from the old implementation. It is incredibly wasteful in presence of several mount points and does not win much even for single ones.	2020-09-23 10:44:49 +00:00
Mateusz Guzik	efeec5f0c6	cache: clean up atomic ops on numneg and numcache - use subtract instead of adding -1 - drop the useless _rel fence Note this should be converted to a scalable scheme.	2020-09-23 10:42:41 +00:00
Konstantin Belousov	1317da4349	Add O_RESOLVE_BENEATH and AT_RESOLVE_BENEATH to mimic Linux' RESOLVE_BENEATH. It is like O_BENEATH, but disables to walk out of the subtree rooted in the starting directory. O_BENEATH does not care if path walks out if it returned. Requested by: Dan Gohman <dev@sunfishcode.online> PR: 248335 Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:48:12 +00:00
Konstantin Belousov	6a9c72d901	Change O_BENEATH to handle relative paths same as absolute. Do not care if path walks out of the topping directory if it returns back. Requested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:43:32 +00:00
Konstantin Belousov	07e7ad2b98	Only clear latch for BENEATH when we walk out of the startdir, not unconditionally on any dotdot component. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:36:02 +00:00
Konstantin Belousov	4a0b316d2a	Add open2nameif() the helper to calculate namei flags both for open(2) and creat(2). Suggested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:23:58 +00:00
Konstantin Belousov	861f039df1	Add at2cnpflags() the helper to convert AT_ flags for *at() syscalls to namei flags. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:22:29 +00:00
Konstantin Belousov	c7de3d6f0b	Add NIRES_STRICTREL. Stop abusing internal namei flag NI_LCF_STRICTRELATIVE as indicator of cap-restricted lookup. Add designated returned flag NIRES_STRICTREL to inform kern_openat() that lookup was restricted. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:06:20 +00:00
Konstantin Belousov	f9e46c9bf1	lookup: Track last lookup component if it is directory. This makes open("/a/../a", O_BENEATH) with cwd == "/a" work. Reviewed by: markj Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 21:59:18 +00:00
Konstantin Belousov	44619a5e86	Improve comment above nameicap_check_dotdot(). Explain why tracker is needed at all. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 21:54:30 +00:00
Mitchell Horne	624a7e1f4f	Use getenv_is_true() in init_static_kenv() A small example of how these functions can be used to simplify checks of this nature. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26271	2020-09-21 15:44:23 +00:00
Mitchell Horne	cba446e2c2	Add getenv(9) boolean parsing functions This adds the getenv_bool() function, to parse a boolean value from a kernel environment variable or tunable. This works for traditional boolean values like "0" and "1", and also "true" and "false" (case-insensitive). These semantics do not yet apply to sysctls declared using SYSCTL_BOOL with CTLFLAG_TUN (they still only parse 1 and 0). Also added are two wrapper functions, getenv_is_true() and getenv_is_false(). These are slightly simpler for callers wishing to perform a single check of a configuration variable. Reviewed by: jhb (slightly earlier version) Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26270	2020-09-21 15:24:44 +00:00
Michal Meloun	95a85c125d	Add NetBSD compatible bus_space_peek_N() and bus_space_poke_N() functions. One problem with the bus_space_read_N() and bus_space_write_N() family of functions is that they provide no protection against exceptions which can occur when no physical hardware or device responds to the read or write cycles. In such a situation, the system typically would panic due to a kernel-mode bus error. The bus_space_peek_N() and bus_space_poke_N() family of functions provide a mechanism to handle these exceptions gracefully without the risk of crashing the system. Typical example is access to PCI(e) configuration space in bus enumeration function on badly implemented PCI(e) root complexes (RK3399 or Neoverse N1 N1SDP) and/or access to PCI(e) register when device is in deep sleep state. This commit adds a real implementation for arm64 only. The remaining architectures have bus_space_peek()/bus_space_poke() emulated by using bus_space_read()/bus_space_write() (without exception handling). MFC after: 1 month Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D25371	2020-09-19 11:06:41 +00:00
Eric van Gyzen	f9cc8410e1	vm_ooffset_t is now unsigned vm_ooffset_t is now unsigned. Remove some tests for negative values, or make other adjustments accordingly. Reported by: Coverity Reviewed by: kib markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26214	2020-09-18 16:48:08 +00:00
Warner Losh	fd0a41d241	Move to a more robust and conservative allocation scheme for devctl messages Change the zone setup: - Allow slabs to be returned to the OS - Set the number of slots to the max devctl will queue before discarding - Reserve 2% of the max (capped at 100) for low memory allocations - Disable per-cpu caching since we don't need it and we avoid some pathologies Change the allocation strategy a bit: - If a normal allocation fails, try to get the reserve - If a reserve allocation fails, re-use the oldest-queued entry for storage - If there's a weird race/failure and nothing on the queue to steal, return NULL This addresses two main issues in the old code: - If devd had died, and we're generating a lot of messages, we have an unbounded leak. This new scheme avoids the issue that led to this. - The MPASS that was 'sure' the allocation couldn't have failed turned out to be wrong in some rare cases. The new code doesn't make this assumption. Since we reserve only 2% of the space, we go from about 1MB of allocation all the time to more like 50kB for the reserve. Reviewed by: markj@ Differential Revision: https://reviews.freebsd.org/D26448	2020-09-17 17:29:33 +00:00
Edward Tomasz Napierala	70890254b3	Get rid of sv_errtbl and SV_ABI_ERRNO(). Reviewed by: kib Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26388	2020-09-17 11:39:33 +00:00
Konstantin Belousov	dd90d96342	Put calls to check_pgrp_jobc() in fixjobc_kill() under INVARIANTS. Reported by: Michael Butler <imb@protected-networks.net> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-09-17 00:07:15 +00:00
Konstantin Belousov	182cfe6ff4	Add check_pgrp_jobc() calls into process exit path. Both before and after job control adjustments. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26416	2020-09-16 21:49:19 +00:00
Konstantin Belousov	2f5f11f533	Fix fixjobc+orphanage. Orphans affect job control state, we must account for them when changing pg_jobc. Instead of p_pptr, use proc_realparent() to get parent relevant for job control. Use correct calculation of the parent for exiting process. For jobc purposes, we must use realparent, but if it is also exiting, we should fall to reaper, then recursively find non-exiting reaper. Reported by: trasz PR: 249257 Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26416	2020-09-16 21:46:57 +00:00
Konstantin Belousov	928b85384a	Assert that P_TREE_GRPEXITED is set only once. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26416	2020-09-16 21:40:32 +00:00
Konstantin Belousov	844219f471	proc_realparent: if p_oppid does not match pid of the current parent for non-orphaned process, return reaper instead of init. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26416	2020-09-16 21:38:24 +00:00
Konstantin Belousov	82207cd246	Improve ddb 'show pgrpdump' command. Use ddb pager. Make lines more compact. Eliminate unneeded casts. Print more job-control related info when reporting process group. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26416	2020-09-16 21:34:18 +00:00
Warner Losh	9ea860660f	Use standard bool type, instead of non-standard boolean_t	2020-09-16 06:02:30 +00:00
Mark Johnston	b4e07e3da5	Fix locking in uipc_accept(). Reported by: cy MFC after: 1 week Sponsored by: The FreeBSD Foundation	2020-09-15 23:03:56 +00:00
Konstantin Belousov	3c484f325e	Convert page cache read to VOP. There are several negative side-effects of not calling into VOP layer at all for page cache reads. The biggest is the missed activation of EVFILT_READ knotes. Also, it allows filesystem to make more fine grained decision to refuse read from page cache. Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for asserts. Reviewed by: markj Tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 22:06:36 +00:00
Konstantin Belousov	888636655d	vfs_subr.c: export io_hold_cnt and vn_read_from_obj(). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 22:00:58 +00:00
Konstantin Belousov	96474d2a3f	Do not copy vp into f_data for DTYPE_VNODE files. The pointer to vnode is already stored into f_vnode, so f_data can be reused. Fix all found users of f_data for DTYPE_VNODE. Provide finit_vnode() helper to initialize file of DTYPE_VNODE type. Reviewed by: markj (previous version) Discussed with: freqlabs (openzfs chunk) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 21:55:21 +00:00
Mark Johnston	448000279e	Fix locking in uipc_accept(). This function wasn't converted to use the new locking protocol in r333744. Make it use the PCB lock for synchronizing connection state. Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26300	2020-09-15 19:23:42 +00:00
Mark Johnston	ccdadf1a9b	Simplify unix socket connection peer locking. unp_pcb_owned_lock2() has some sharp edges and forces callers to deal with a bunch of cases. Simplify it: - Rename to unp_pcb_lock_peer(). - Return the connected peer instead of forcing callers to load it beforehand. - Handle self-connected sockets. - In unp_connectat(), just lock the accept socket directly. It should not be possible for the nascent socket to participate in any other lock orders. - Get rid of connect_internal(). It does not provide any useful checking anymore. - Block in unp_connectat() when a different thread is concurrently attempting to lock both sides of a connection. This provides simpler semantics for callers of unp_pcb_lock_peer(). - Make unp_connectat() return EISCONN if the socket is already connected. This fixes a race[1] when multiple threads attempt to connect() to different addresses using the same datagram socket. Upper layers will disconnect a connected datagram socket before calling the protocol connect's method, but there is no synchronization between this and protocol-layer code. Reported by: syzkaller [1] Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26299	2020-09-15 19:23:22 +00:00
Mark Johnston	ed92e1c78c	Avoid an unnecessary malloc() when connecting dgram sockets. The allocated memory is only required for SOCK_STREAM and SOCK_SEQPACKET sockets. Reviewed by: kevans Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26298	2020-09-15 19:23:01 +00:00
Mark Johnston	f0317f868b	Simplify unp_disconnect() callers. In all cases, PCBs are unlocked after unp_disconnect() returns. Since unp_disconnect() may release the last PCB reference, callers may have to bump the refcount before the call just so that they can release them again. Change unp_disconnect() to release PCB locks as well as connection references; this lets us remove several refcount manipulations. Tighten assertions. Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26297	2020-09-15 19:22:37 +00:00
Mark Johnston	4820bf6ac2	Rename unp_pcb_lock2(). unp_pcb_lock_pair() seems like a better name. Also make it handle the case where the two sockets are the same instead of making callers do it. No functional change intended. Reviewed by: glebius, kevans, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26296	2020-09-15 19:22:16 +00:00
Mark Johnston	5362170da7	Improve unix socket PCB refcounting. - Use refcount_init(). - Define an INVARIANTS-only zone destructor to assert that various bits of PCB state aren't left dangling. - Annotate unp_pcb_rele() with __result_use_check. - Simplify control flow. Reviewed by: glebius, kevans, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26295	2020-09-15 19:21:58 +00:00
Mark Johnston	d5cbccecd8	Update unix domain socket locking comments. - Define a locking key for unpcb members. - Rewrite some of the locking protocol description to make it less verbose and avoid referencing some subroutines which will be renamed. - Reorder includes. Reviewed by: glebius, kevans, kib Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D26294	2020-09-15 19:21:33 +00:00
Edward Tomasz Napierala	fecb19e431	Move td_softdep_cleanup() from userret() to ast(); it's infrequent at best. The schedule_cleanup() function already sets TDF_ASTPENDING. Reviewed by: kib, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26375	2020-09-14 10:17:07 +00:00
Edward Tomasz Napierala	60f083efe2	Move TDP_GEOM check from userret() to ast(); this code path is quite infrequent. Reviewed by: kib No objections: mav Tested by: pho MFC after: 2 weeks Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26374	2020-09-14 10:14:03 +00:00
Edward Tomasz Napierala	30d158eecc	Move racct/rctl throttling from userret() to ast(). There's no reason for it to sit in the syscall fast path. Reviewed by: kib MFC after: 2 weeks Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26368	2020-09-14 09:44:24 +00:00
Scott Long	74c781ed91	Refine the busdma template interface. Provide tools for filling in fields that can be extended, but also ensure compile-time type checking. Refactor common code out of arch-specific implementations. Move the mpr and mps drivers to this new API. The template type remains visible to the consumer so that it can be allocated on the stack, but should be considered opaque.	2020-09-14 05:58:12 +00:00
Konstantin Belousov	7978363417	Fix interaction between largepages and seals/writes. On write with SHM_GROW_ON_WRITE, use proper truncate. Do not allow to grow largepage shm if F_SEAL_GROW is set. Note that shrinks are not supported at all due to unmanaged mappings. Call to vm_pager_update_writecount() is only valid for swap objects, skip it for unmanaged largepages. Largepages cannot support write sealing. Do not writecnt largepage mappings. Reported by: kevans Reviewed by: kevans, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26394	2020-09-10 20:54:44 +00:00
Konstantin Belousov	d301b3580f	Support for userspace non-transparent superpages (largepages). Created with shm_open2(SHM_LARGEPAGE) and then configured with FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee that all userspace mappings of it are served by superpage non-managed mappings. Only amd64 for now, both 2M and 1G superpages can be requested, the latter requires CPU feature. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 22:12:51 +00:00
Konstantin Belousov	25f44824ba	uipc_shm.c: Move comment where it belongs. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-09 21:00:11 +00:00
Gleb Smirnoff	022c2f5570	In r354148 the goal was to check THREAD_CAN_SLEEP() only once for the purpose of epoch_trace() and for calling subsequent panic, but to keep code fully under INVARIANTS, so don't use bare function call to panic(). However, at the last stage of review a true value slipped in, while always false was assumed. I checked that in email archive with kib@. Noticed by: trasz	2020-09-09 16:13:33 +00:00
Konstantin Belousov	fbf2a77876	Convert allocations of the phys pager to vm_pager_allocate(). Future changes would require additional initialization of OBJT_PHYS objects, and vm_object_allocate() is not suitable for it. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24652	2020-09-08 23:38:49 +00:00
Mateusz Guzik	54052edaa0	fd: fix fhold on an uninitialized var in fdcopy_remapped Reported by: gcc9	2020-09-08 16:07:47 +00:00
Mateusz Guzik	da62ed4f1a	cache: drop write-only tvp_seqc vars	2020-09-08 16:06:46 +00:00
Mateusz Guzik	2bcfa5ba6f	vfs: drop a write-only var in vfs_periodic_msync_inactive	2020-09-08 16:06:26 +00:00
Konstantin Belousov	7de1bc13e2	imgact_elf.c: unify check for phdr fitting into the first page. Similar to the userspace rtld check. Reviewed by: dim, emaste (previous versions) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26339	2020-09-07 21:37:16 +00:00
Chuck Silvers	a0a36d4886	vfs: avoid exposing partially constructed vnodes If multiple threads race calling vfs_hash_insert() while creating vnodes with the same identity, all of the vnodes which lose the race must be destroyed before any other thread can see them. Previously this was accomplished by the vput() in vfs_hash_insert() resulting in the vnode's VOP_INACTIVE() method calling vgone() before the vnode lock was unlocked, but at some point changes to the vnode refcount/inactive logic have caused that to no longer work, leading to crashes, so instead vfs_hash_insert() must call vgone() itself before calling vput() on vnodes which lose the race. Reviewed by: mjg, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26291	2020-09-05 00:26:03 +00:00
Bjoern A. Zeeb	d29a3de296	uipc_ktls: remove unused static function m_segments() was added with r363464 but never used. Remove it to avoid warnings when compiling kernels. Reported by: rmacklem (also says jhb) Reviewed by: gallatin, jhb Differential Revision: https://reviews.freebsd.org/D26330	2020-09-05 00:19:40 +00:00
Andrew Gallatin	9675d8895a	ktls: Check for a NULL send tag in ktls_cleanup() When using ifnet ktls, and when ktls_reset_send_tag() fails to allocate a replacement tag, it leaves the tls session's snd_tag pointer NULL. ktls_cleanup() tries to release the send tag, and will trip over this NULL pointer and panic unless NULL is checked for. Reviewed by: jhb Sponsored by: Netflix	2020-09-04 17:36:15 +00:00
Brooks Davis	18f917a90e	Always report ENOSYS in init While rare, encountering an unimplemented system call early in init is catastrophic and difficult to debug. Even after a SIGSYS handler is registered, such configurations are problematic. As such, always report such events for pid 1 (following kern.lognosys if non-zero). Reviewed by: kevans, imp Obtained from: CheriBSD (plus suggestions from kevans) MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26288	2020-09-02 23:17:33 +00:00
Mateusz Guzik	b1a824b684	vfs: retire vholdl as a symbol Similarly to vrefl in r364283.	2020-09-02 19:21:37 +00:00
Mateusz Guzik	2b4632aee9	vfs: purge cache entries early on vgone There is no reason for them to linger across reclaim and it is an invariant that doomed vnodes are not added to the namecache.	2020-09-02 19:21:10 +00:00
Mark Johnston	a0efcf6400	Add sysctl(8) formatting for hw.pagesizes. - Change the type of hw.pagesizes to OPAQUE, since it returns an array. - Modify the handler to only truncate the returned length if the caller supplied an output buffer. This allows use of the trick of passing a NULL output buffer to fetch the output size, while preserving compatibility if MAXPAGESIZES is increased. - Add a "S,pagesize" formatter to sysctl(8). Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26239	2020-09-02 18:17:08 +00:00
Hans Petter Selasky	624677fad7	Assert that cc_exec_drain(cc, direct) is NULL before assigning a new value. Suggested by: markj@ Tested by: callout_test MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-09-02 10:00:30 +00:00
Hans Petter Selasky	0d0053d7ed	Micro optimise _callout_stop_safe() by removing dead code. The CS_DRAIN flag cannot be set at the same time like the async-drain function pointer is set. These are orthogonal features. Assert this at the beginning of the function. Before: if (flags & CS_DRAIN) { /* FALLTHROUGH */ } else if (xxx) { return yyy; } if (drain) { zzz = drain; } After: if (flags & CS_DRAIN) { /* FALLTHROUGH */ } else if (xxx) { return yyy; } else { if (drain) { zzz = drain; } } Reviewed by: markj@ Tested by: callout_test Differential Revision: https://reviews.freebsd.org/D26285 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-09-02 09:44:00 +00:00
Mateusz Guzik	6fed89b179	kern: clean up empty lines in .c and .h files	2020-09-01 22:12:32 +00:00
Kyle Evans	5dd47b52e5	posixshm: fix setting of shm_flags Noted in D24652, we currently set shmfd->shm_flags on every shm_open()/shm_open2(). This wasn't properly thought out; one shouldn't be able to specify incompatible flags on subsequent opens of non-anon shm. Move setting of shm_flags explicitly to the two places shmfd are created, as we do with seals, and validate when we're opening a pre-existing mapping that we've either passed no flags or we've passed the exact same flags as the first time. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D26242	2020-08-31 15:07:15 +00:00
Andrew Gallatin	796d4eb89e	make m_getm2() resilient to zone_jumbop exhaustion When the zone_jumbop is exhausted, most things using sosend* (like sshd) will eventually fail or hang if allocations are limited to the depleted jumbop zone. This makes it impossible to communicate with a box which is under an attack which exhausts the jumbop zone. Rather than depending on the page size zone, also try cluster allocations to satisfy larger requests. This allows me to ssh to, and serve 100Gb/s of traffic from, a server which is under attack and has had its page-sized zone exhausted. Reviewed by: glebius, markj, rmacklem Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26150	2020-08-31 13:53:14 +00:00
Vladimir Kondratyev	5d4bf0578f	LinuxKPI: Implement ksize() function. In Linux, ksize() gets the actual amount of memory allocated for a given object. This commit adds malloc_usable_size() to FreeBSD KPI which does the same. It also maps LinuxKPI ksize() to newly created function. ksize() function is used by drm-kmod. Reviewed by: hselasky, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D26215	2020-08-29 19:26:31 +00:00
Warner Losh	f6c941f347	We don't need to INCLUDENUL, so turn it off to avoid assertion... sbuf_new_for_sysctl turns on INCLUDENUL, but we don't need it. And we assert for it in the new bus_pnpinfo_sb and bus_location_sb strings.	2020-08-29 11:46:50 +00:00
Warner Losh	6dd5b77a15	Use sbuf_cat instead of sbuf_cpy sbuf_cpy doesn't work with sysctl sbufs because of the drain function.	2020-08-29 11:18:10 +00:00
Warner Losh	5eade881a8	Avoid NULL pointer dereferences Add back NULL pointer checks accidentally dropped in r364946. We need to append a NUL character when that happens.	2020-08-29 09:59:52 +00:00
Warner Losh	17c219fd6f	Move to using sbuf for some sysctl in newbus Convert two different sysctl to using sbuf. First, for all the default sysctls we implement for each device driver that's attached. This is a pure sbuf conversion. Second, convert sysctl_devices to fill its buffer with sbuf rather than a hand-rolled crappy thing I wrote years ago. Reviewed by: cem, markj Differential Revision: https://reviews.freebsd.org/D26206	2020-08-29 04:30:12 +00:00
Warner Losh	887611b122	Retire devctl_notify_f() devctl_notify_f isn't needed, so retire it. The flags argument is now unused, so rather than keep it around, retire it. Convert all old users of it to devctl_notify(). This path no longer sleeps, so is safe to call from any context. Since it doesn't sleep, it doesn't need to know if it is OK to sleep or not. Reviewed by: markj@ Differential Revision: https://reviews.freebsd.org/D26140	2020-08-29 04:30:06 +00:00
Warner Losh	bca8f35f28	devctl: move to using a uma zone Convert the memory management of devctl. Rewrite it to make better use of memory. This eliminates several mallocs (5? worst case) needed to send a message. It's now possible to always send a message, though if things are really backed up the oldest message will be dropped to free up space for the newest. Add a static bus_child_{location,pnpinfo}_sb to start migrating to sbuf instead of buffer + length. Use it in the new code. Other code will be converted later (bus_child_*_str is only used inside of subr_bus.c, though implemented in ~100 places in the tree). Reviewed by: markj@ Differential Revision: https://reviews.freebsd.org/D26140	2020-08-29 04:29:53 +00:00
Kirk McKusick	66ac5b2c5a	Add a comment to clarify when and why cached names are deleted during pathname lookup. Reviewed by: kib MFC after: 3 days Sponsored by: Netflix	2020-08-27 22:14:58 +00:00
Mark Johnston	6255e8c8e2	Fix writing of the final block of encrypted, compressed kernel dumps. Previously any residual data in the final block of a compressed kernel dump would be written unencrypted. Note, such a configuration already does not work properly when using AES-CBC since the compressed data is typically not a multiple of the AES block length in size and EKCD does not implement any padding scheme. However, EKCD more recently gained support for using the ChaCha20 cipher, which being a stream cipher does not have this problem. Submitted by: sigsys@gmail.com Reviewed by: cem MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D26188	2020-08-27 17:36:06 +00:00
Mateusz Guzik	84ecea90b7	cache: don't update timestamps on found entry	2020-08-27 06:31:55 +00:00
Mateusz Guzik	5f08d440b0	cache: assorted clean ups In particular remove spurious comments, duplicate assertions and the inconsistently done KTR support.	2020-08-27 06:31:27 +00:00
Mateusz Guzik	12441fcbe2	cache: ncp = NULL early to account for sdt probes in failure path CID: 1432106	2020-08-27 06:30:40 +00:00
Warner Losh	cbda6f66f4	Implement FLUSHO Turn FLUSHO on/off with ^O (or whatever VDISCARD is). Honor that to throw away output quickly. This tries to remain true to 4.4BSD behavior (since that was the origin of this feature), with any corrections NetBSD has done. Since the implementations are a little different, though, some edge conditions may be handled differently. Reviewed by: kib, kevans Differential Review: https://reviews.freebsd.org/D26148	2020-08-27 05:11:15 +00:00
Rick Macklem	df665abd34	Fix a "v_seqc_users == 0 not met" panic when VFS_STATFS() fails during mount. r363210 introduced v_seqc_users to the vnodes. This change requires a vn_seqc_write_end() to match the vn_seqc_write_begin() in vfs_cache_root_clear(). mjg@ provided this patch which seems to fix the panic. Tested for an NFS mount where the VFS_STATFS() call will fail. Submitted by: mjg Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D26160	2020-08-26 21:49:43 +00:00
Mark Johnston	41c6838786	vmem: Avoid allocating span tags when segments are never released. vmem uses span tags to delimit imported segments, so that they can be released if the segment becomes free in the future. However, the per-domain kernel KVA arenas never release resources, so the span tags between imported ranges are unused when the ranges are contiguous. Furthermore, such span tags prevent coalescing of free segments across KVA_QUANTUM boundaries, resulting in internal fragmentation which inhibits superpage promotion in the kernel map. Stop allocating span tags in arenas that never release resources. This saves a small amount of memory and allows free segments to coalesce across import boundaries. This manifests as improved kernel superpage usage during poudriere runs, which also helps to reduce physical memory fragmentation by reducing the number of broken partially populated reservations. Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24548	2020-08-26 14:31:35 +00:00
Mateusz Guzik	1e9a0b391d	cache: relock on failure in cache_zap_locked_vnode This gets rid of bogus scheme of yielding in hopes the blocking thread will make progress.	2020-08-26 12:54:18 +00:00
Mateusz Guzik	075f58f231	cache: stop null checking in cache_free	2020-08-26 12:53:16 +00:00
Mateusz Guzik	66fa11c898	cache: make it mandatory to request both timestamps or neither	2020-08-26 12:52:54 +00:00
Mateusz Guzik	eef63775b6	cache: convert bucketlocks to a mutex By now bucket locks are almost never taken for anything but writing and converting to mutex simplifies the code.	2020-08-26 12:52:17 +00:00
Mateusz Guzik	32f3d0821c	cache: only evict negative entries on CREATE when ISLASTCN is set	2020-08-26 12:50:57 +00:00
Mateusz Guzik	935e15187c	cache: decouple smr and locked lookup in the slowpath Tested by: pho	2020-08-26 12:50:10 +00:00
Mateusz Guzik	d3476daddc	cache: factor dotdot lookup out of cache_lookup Tested by: pho	2020-08-26 12:49:39 +00:00
Alan Somers	e6f6d0c9bc	crypto(9): add CRYPTO_BUF_VMPAGE crypto(9) functions can now be used on buffers composed of an array of vm_page_t structures, such as those stored in an unmapped struct bio. It requires the running kernel to support the direct memory map, so not all architectures can use it. Reviewed by: markj, kib, jhb, mjg, mat, bcr (manpages) MFC after: 1 week Sponsored by: Axcient Differential Revision: https://reviews.freebsd.org/D25671	2020-08-26 02:37:42 +00:00
Mateusz Guzik	a459a6cfe7	vfs: respect PRIV_VFS_LOOKUP in vaccess_smr Reported by: novel	2020-08-25 14:18:50 +00:00
Rick Macklem	22df1ffd81	Fix hangs with processes stuck sleeping on btalloc on i386. r358097 introduced a problem for i386, where kernel builds will intermittently get hung, typically with many processes sleeping on "btalloc". I know nothing about VM, but received assistance from rlibby@ and markj@. rlibby@ stated the following: It looks like the problem is that for systems that do not have UMA_MD_SMALL_ALLOC, we do uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc); but we haven't set an appropriate free function. This is probably why UMA_ZONE_NOFREE was originally there. When NOFREE was removed, it was appropriate for systems with uma_small_alloc. So by default we get page_free as our free function. That calls kmem_free, which calls vmem_free ... but we do our allocs with vmem_xalloc. I'm not positive, but I think the problem is that in effect we vmem_xalloc -> vmem_free, not vmem_xfree. Three possible fixes: 1: The one you tested, but this is not best for systems with uma_small_alloc. 2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC. 3: Actually provide an appropriate vmem_bt_free function. I think we should just do option 2 with a comment, it's simple and it's what we used to do. I'm not sure how much benefit we would see from option 3, but it's more work. This patch implements #2. I haven't done a comment, since I don't know what the problem is. markj@ noted the following: I think the suggested patch is ok, but not for the reason stated. On platforms without a direct map the problem is: to allocate btags we need a slab, and to allocate a slab we need to map a page, and to map a page we need to allocate btags. We handle this recursion using a custom slab allocator which specifies M_USE_RESERVE, allowing it to dip into a reserve of free btags. Because the returned slab can be used to keep the reserve populated, this ensures that there are always enough free btags available to handle the recursion. 
UMA_ZONE_NOFREE ensures that we never reclaim free slabs from the zone. However, when it was removed, an apparent bug in UMA was exposed: keg_drain() ignores the reservation set by uma_zone_reserve() in vmem_startup(). So under memory pressure we reclaim the free btags that are needed to break the recursion. That's why adding _NOFREE back fixes the problem: it disables the reclamation. We could perhaps fix it more cleverly, by modifying keg_drain() to always leave uk_reserve slabs available. markj@'s initial patch failed testing, so committing this patch was agreed upon as the interim solution. Either rlibby@ or markj@ might choose to add a comment to it. PR: 248008 Reviewed by: rlibby, markj	2020-08-25 00:58:14 +00:00
Alexander V. Chernikov	592d300e34	Remove RT_LOCK mutex from rte. rtentry lock traditionally served 2 purposes: first was protecting refcounts, the second was assuring consistent field access/changes. Since route nexthop introduction, the need for the former disappeared and the need for the latter reduced. To be more precise, the following rte fields are mutable: rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info) rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal) rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info) rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK) rt_chain (deletion chain pointer, updated with RIB_WLOCK) All of them are updated under RIB_WLOCK, so the only remaining concern is the reading. rt_nhop and rt_weight (addressed in this review) are read under rib lock and stored in the rib_cmd_info, so the caller has no problem with consistency. rte_flags is currently read unlocked in rtsock reporting (however the scope is only RTF_UP flag, which is pretty static). rt_expire is currently read unlocked in rtsock reporting. rt_chain accesses are safe, as this is only used at route deletion. rt_expire and rte_flags reads will be dealt with in separate reviews soon. Differential Revision: https://reviews.freebsd.org/D26162	2020-08-24 20:23:34 +00:00
Warner Losh	f87655ec76	Change the resume notification event from 'kern' to 'kernel' We have both a system of 'kern' and of 'kernel'. Prefer the latter and convert this notification to use 'kernel' instead of 'kern'. As a transition period, continue to also generate the 'kern' notification until sometime after FreeBSD 13 is branched. MFC After: 3 days	2020-08-24 19:35:15 +00:00
Mateusz Guzik	f9cdb0775e	cache: remove leftover assert in vn_fullpath_any_smr It is only valid when !slash_prefixed. For slash_prefixed the length is properly accounted for later. Reported by: markj (syzkaller)	2020-08-24 18:23:58 +00:00
Mateusz Guzik	e35406c8f7	cache: lockless reverse lookup This enables fully scalable operation for getcwd and significantly improves realpath. For example: PATH_CUSTOM=/usr/src ./getcwd_processes -t 104 before: 1550851 after: 380135380 Tested by: pho	2020-08-24 09:00:57 +00:00
Mateusz Guzik	feabaaf995	cache: drop the always curthread argument from reverse lookup routines Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs. Tested by: pho	2020-08-24 08:57:02 +00:00
Mateusz Guzik	f0696c5e4b	cache: perform reverse lookup using v_cache_dd if possible Tested by: pho	2020-08-24 08:55:55 +00:00
Mateusz Guzik	ce575cd0e2	cache: populate v_cache_dd for non-VDIR entries It makes v_cache_dd into a little bit of a misnomer and it may be addressed later. Tested by: pho	2020-08-24 08:55:04 +00:00
Mateusz Guzik	f0d9c77e52	vfs: validate ndp state after the lookup The intent is to remove known-to-be-nops NDFREE calls after many lookups.	2020-08-23 21:06:41 +00:00
Mateusz Guzik	4b5001196a	vfs: convert nameiop into an enum While here change the field size from long to int and move it into the gap next to cn_flags. Shrinks struct componentname from 64 to 56 bytes on amd64.	2020-08-23 21:05:39 +00:00

18073 Commits