freebsd-skq

Author	SHA1	Message	Date
Konstantin Belousov	f10845877e	Suspend all writeable local filesystems on power suspend. This ensures that no writes are pending in memory, either metadata or user data, but not including dirty pages not yet converted to fs writes. Only filesystems declared local are suspended. Note that this does not guarantee absence of the metadata errors or leaks if resume is not done: for instance, on UFS unlinked but opened inodes are leaked and require fsck to gc. Reviewed by: markj Discussed with: imp Tested by: imp (previous version), pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D27054	2020-11-05 20:52:49 +00:00
Mateusz Guzik	16b971ed6d	malloc: add a helper returning size allocated for given request Sample usage: kernel modules can decide whether to stick to malloc or create their own zone. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27097	2020-11-05 16:21:21 +00:00
Mateusz Guzik	2dee296a3d	Rationalize per-cpu zones. The 2 provided zones had inconsistent naming between each other ("int" and "64") and other allocator zones (which use bytes). Follow malloc by naming them "pcpu-" + size in bytes. This is a step towards replacing ad-hoc per-cpu zones with general slabs.	2020-11-05 15:08:56 +00:00
Mateusz Guzik	ea33cca971	poll/select: change selfd_zone into a malloc type On a sample box vmstat -z shows: ITEM SIZE LIMIT USED FREE REQ 64: 64, 0, 1043784, 4367538,3698187229 selfd: 64, 0, 1520, 13726,182729008 But at the same time: vm.uma.selfd.keg.domain.1.pages: 121 vm.uma.selfd.keg.domain.0.pages: 121 Thus 242 pages got pulled even though the malloc zone would likely accommodate the load without using extra memory.	2020-11-05 12:24:37 +00:00
Mateusz Guzik	2fbb45c601	vfs: change nt_zone into a malloc type Elements are small in size and allocated for short periods.	2020-11-05 12:06:50 +00:00
Kyle Evans	df69035d7f	imgact_binmisc: fix up some minor nits - Removed a bunch of redundant headers - Don't explicitly initialize to 0 - The !error check prior to setting imgp->interpreter_name is redundant, all error paths should and do return or go to 'done'. We have larger problems otherwise.	2020-11-05 04:19:48 +00:00
Mateusz Guzik	3c50616fc1	fd: make all f_count uses go through refcount_*	2020-11-05 02:12:33 +00:00
Mateusz Guzik	d737e9eaf5	fd: hide _fdrop 0 count check behind INVARIANTS While here use refcount_load and make sure to report the tested value.	2020-11-05 02:12:08 +00:00
Mateusz Guzik	331c21dd5e	pipe: whitespace nit in previous	2020-11-04 23:17:41 +00:00
Mateusz Guzik	c22ba7bb06	pipe: fix POLLHUP handling if no events were specified Linux allows polling without any events specified and it happens to be the case in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask and this works fine if the condition is already present. However, if it is missing, selrecord is only called if the eventmask has relevant bits set. This in particular leads to a condition where pipe_poll can return 0 events and neglect to selrecord, while kern_poll takes it as an indication it has to go to sleep, but then there is nobody to wake it up. While the problem seems systemic to *_poll handlers the least we can do is fix it up for pipes. Reported by: Jeremie Galarneau <jeremie.galarneau at efficios.com> Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D27094	2020-11-04 23:11:54 +00:00
Mateusz Guzik	6fc2b069ca	rms: fixup concurrent writer handling and add more features Previously the code had one wait channel for all pending writers. This could result in a buggy scenario where after a writer switches the lock mode from readers to writers and goes off CPU, another writer queues itself and then the last reader wakes up the latter instead of the former. Use a separate channel. While here add features to reliably detect whether curthread has the lock write-owned. This will be used by ZFS.	2020-11-04 21:18:08 +00:00
Mark Johnston	f7db0c9532	vmspace: Convert to refcount(9) This is mostly mechanical except for vmspace_exit(). There, use the new refcount_release_if_last() to avoid switching to vmspace0 unless other processes are sharing the vmspace. In that case, upon switching to vmspace0 we can unconditionally release the reference. Remove the volatile qualifier from vm_refcnt now that accesses are protected using refcount(9) KPIs. Reviewed by: alc, kib, mmel MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27057	2020-11-04 16:30:56 +00:00
Brooks Davis	19647e76fc	sysvshm: pass relevant uap members as arguments Alter shmget_allocate_segment and shmget_existing to take the values they want from struct shmget_args rather than passing the struct around. In general, uap structures should only be the interface to sys_<foo> functions. This makes one small functional change and records the allocated space rather than the requested space. If this turns out to be a problem (e.g. if software tries to find undersized segments by exact size rather than using keys), we can correct that easily. Reviewed by: kib Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27077	2020-11-03 19:14:03 +00:00
Conrad Meyer	2de07e4096	unix(4): Add SOL_LOCAL:LOCAL_CREDS_PERSISTENT This option is intended to be semantically identical to Linux's SOL_SOCKET:SO_PASSCRED. For now, it is mutually exclusive with the pre-existing sockopt SOL_LOCAL:LOCAL_CREDS. Reviewed by: markj (penultimate version) Differential Revision: https://reviews.freebsd.org/D27011	2020-11-03 01:17:45 +00:00
Mateusz Guzik	e1b6a7f83f	malloc: prefix zones with malloc- Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27038	2020-11-02 17:39:15 +00:00
Mateusz Guzik	828afdda17	malloc: export kernel zones instead of relying on them being power-of-2 Reviewed by: markj (previous version) Differential Revision: https://reviews.freebsd.org/D27026	2020-11-02 17:38:08 +00:00
Stefan Eßer	1ebef47735	Make sysctl user.localbase a tunable that can be written at run-time This sysctl value had been provided as a read-only variable that is compiled into the C library based on the value of _PATH_LOCALBASE in paths.h. After this change, the value is compiled into the kernel as an empty string, which is translated to _PATH_LOCALBASE by the C library. This empty string can be overridden at boot time or by a privileged user at run time and will then be returned by sysctl. When set to an empty string, the value returned by sysctl reverts to _PATH_LOCALBASE. This update does not change the behavior on any system that does not modify the default value of user.localbase. I consider this change as experimental and would prefer if the run-time write permission was reconsidered and the sysctl variable defined with CTLFLAG_RDTUN instead to restrict it to be set at boot time. MFC after: 1 month	2020-10-31 23:48:41 +00:00
Mateusz Guzik	82c174a3b4	malloc: delegate M_EXEC handling to dedicated routines It is almost never needed and adds an avoidable branch. While here do minor clean ups in preparation for larger changes. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D27019	2020-10-30 20:02:32 +00:00
Stefan Eßer	147eea393f	Add read only sysctl variable user.localbase The value is provided by the C library as for other sysctl variables in the user tree. It is compiled in and returns the value of _PATH_LOCALBASE defined in paths.h. Reviewed by: imp, scottl Differential Revision: https://reviews.freebsd.org/D27009	2020-10-30 18:48:09 +00:00
Mateusz Guzik	0685574968	vfs: change vnode poll to just a malloc type The size is 120, close fit for 128 and rarely used. The infrequent use avoidably populates per-CPU caches and ends up with more memory.	2020-10-30 14:02:56 +00:00
Mateusz Guzik	4bfebc8d2c	cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename	2020-10-30 10:46:35 +00:00
John Baldwin	36e0a362ac	Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc(). This gives a more uniform API for send tag life cycle management. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27000	2020-10-29 23:28:39 +00:00
Mateusz Guzik	62568e886a	vfs: add NAMEI_DBG_HADSTARTDIR handling lost in rewrite Noted by: rpokala	2020-10-29 18:43:37 +00:00
Mateusz Guzik	eebc2e450f	vfs: add NDREINIT to facilitate repeated namei calls struct nameidata mixes caller arguments, internal state and output, which can be quite error prone. Recent addition of validating ni_resflags uncovered a caller which could repeatedly call namei, effectively operating on partially populated state. Add bare minimum validation that this does not happen. The real fix would decouple aforementioned state. Reported by: pho Tested by: pho (different variant)	2020-10-29 12:56:02 +00:00
John Baldwin	521eac97f3	Support hardware rate limiting (pacing) with TLS offload. - Add a new send tag type for a send tag that supports both rate limiting (packet pacing) and TLS offload (mostly similar to D22669 but adds a separate structure when allocating the new tag type). - When allocating a send tag for TLS offload, check to see if the connection already has a pacing rate. If so, allocate a tag that supports both rate limiting and TLS offload rather than a plain TLS offload tag. - When setting an initial rate on an existing ifnet KTLS connection, set the rate in the TCP control block inp and then reset the TLS send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit send tag. This allocates the TLS send tag asynchronously from a task queue, so the TLS rate limit tag alloc is always sleepable. - When modifying a rate on a connection using KTLS, look for a TLS send tag. If the send tag is only a plain TLS send tag, assume we failed to allocate a TLS ratelimit tag (either during the TCP_TXTLS_ENABLE socket option, or during the send tag reset triggered by ktls_output_eagain) and ignore the new rate. If the send tag is a ratelimit TLS send tag, change the rate on the TLS tag and leave the inp tag alone. - Lock the inp lock when setting sb_tls_info for a socket send buffer so that the routines in tcp_ratelimit can safely dereference the pointer without needing to grab the socket buffer lock. - Add an IFCAP_TXTLS_RTLMT capability flag and associated administrative controls in ifconfig(8). TLS rate limit tags are only allocated if this capability is enabled. Note that TLS offload (whether unlimited or rate limited) always requires IFCAP_TXTLS[46]. Reviewed by: gallatin, hselasky Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26691	2020-10-29 00:23:16 +00:00
Konstantin Belousov	3cbf9dc81c	Check for process group change in tty_wait_background(). The calling process's process group can change between PROC_UNLOCK(p) and PGRP_LOCK(pg) in tty_wait_background(), e.g. by a setpgid() call from another process. If that happens, the signal is not sent to the calling process, even if the prior checks determine that one should be sent. Re-check that the process group hasn't changed after acquiring the pgrp lock, and if it has, redo the checks. PR: 250701 Submitted by: Jakub Piecuch <j.piecuch96@gmail.com> MFC after: 2 weeks	2020-10-28 22:12:47 +00:00
Edward Tomasz Napierala	bdc0cb4e2c	Add local variable to store the sysent pointer. Just a cleanup, no functional changes. Reviewed by: kib (earlier version) MFC after: 2 weeks Sponsored by: EPSRC Differential Revision: https://reviews.freebsd.org/D26977	2020-10-28 14:43:38 +00:00
Edward Tomasz Napierala	bce7ee9d41	Drop "All rights reserved" from all my stuff. This includes Foundation copyrights, approved by emaste@. It does not include files which carry other people's copyrights; if you're one of those people, feel free to make similar change. Reviewed by: emaste, imp, gbe (manpages) Differential Revision: https://reviews.freebsd.org/D26980	2020-10-28 13:46:11 +00:00
Mateusz Guzik	11743b6e47	vfs: tidy up vnlru_free Apart from cosmetic changes make sure to only decrease the recycled counter if vtryrecycle succeeded. Tested by: pho	2020-10-27 18:13:09 +00:00
Mateusz Guzik	68ac2b804c	vfs: fix vnode reclaim races against getnewvnode All vnodes allocated by UMA are present on the global list used by vnlru. getnewvnode modifies the state of the vnode (most notably altering v_holdcnt) but never locks it. Moreover filesystems also modify it in arbitrary manners sometimes before taking the vnode lock or adding any other indicator that the vnode can be used. Picking up such a vnode by vnlru would be problematic. To that end there are 2 fixes: - vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the interlock and verifies that v_mount has been set. It is an invariant that the vnode lock is held by that point, providing the necessary serialisation against locking after vhold. - vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes, now makes sure to only transition the count 0->1 and newly allocated vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only transition VHOLD_NO_SMR->1 once more making the hold fail Tested by: pho	2020-10-27 18:12:07 +00:00
Mateusz Guzik	d681c51d36	cache: add missing NIRES_ABS handling	2020-10-26 18:01:18 +00:00
Alexander Motin	3c0177b887	Enable bioq 'car limit' added at r335066 at 128 bios. Without the 'car limit' enabled (before this), running sequential ZFS scrub on HDD without command queuing support, I've measured latency on concurrent random reads reaching 4 seconds (surprised that not more). Enabling this reduced the latency to 65 milliseconds, while scrub still doing ~180MB/s. For disks with command queuing this does not make much difference (if any), since most time all the requests are queued down to the disk or HBA, leaving nothing in the queue to sort. And even if something does not fit, staying on the queue, it is likely not for long. To not limit sorting in such bursty scenarios I've added batched counter zeroing when the queue is getting empty. The internal scheduler of the SAS HDD I was testing seems to be even more loyal to random I/O, reducing the scrub speed to ~120MB/s. So in case somebody is worried this limit is too strict -- it actually looks relaxed. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-10-26 04:04:06 +00:00
Alexander Motin	8b220f8915	Fix asymmetry in devstat(9) calls by GEOM. Before this GEOM passed bio pointer to transaction start, but not end. It was irrelevant until devstat(9) got DTrace hooks, that appeared to provide bio pointer on I/O completion, but not on submission. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-10-24 21:07:10 +00:00
Ruslan Bukin	f32f0095e9	o Add iommu de-initialization method for MSI interface. o Add iommu_unmap_msi() to release the msi GAS entry. o Provide default implementations for iommu init/deinit methods. Reviewed by: kib Sponsored by: Innovate DSbD Differential Revision: https://reviews.freebsd.org/D26906	2020-10-24 20:09:27 +00:00
Ryan Moeller	e58483c4fb	sysctl+kern_sysctl: Honor SKIP for descendant nodes Ensure we also skip descendants of SKIP nodes when iterating through children of an explicitly specified node. Reported by: np Reviewed by: np MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D26833	2020-10-24 16:17:07 +00:00
Ryan Moeller	0595c12484	kern_sysctl: Misc code cleanup Remove unused oidpp parameter from sysctl_sysctl_next_ls and add high level comments to describe how it works. No functional change. Reviewed by: imp MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D26854	2020-10-24 14:46:38 +00:00
Kyle Evans	275c821d3d	audit: correct reporting of execve(2) success r326145 corrected do_execve() to return EJUSTRETURN upon success so that important registers are not clobbered. This had the side effect of tapping out 'failures' for all execve(2) audit records, which is less than useful for auditing purposes. Audit exec returns earlier, where we can know for sure that EJUSTRETURN translates to success. Note that this unsets TDP_AUDITREC as we commit the audit record, so the usual audit in the syscall return path will do nothing. PR: 249179 Reported by: Eirik Oeverby <ltning-freebsd anduin net> Reviewed by: csjp, kib MFC after: 1 week Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26922	2020-10-24 14:39:17 +00:00
Mateusz Guzik	eb65cde4f5	cache: assorted typo fixes	2020-10-24 13:31:40 +00:00
Mateusz Guzik	029cfccc71	cache: add the missing NC_NOMAKEENTRY and NC_KEEPPOSENTRY to lockless lookup They are de facto ignored.	2020-10-24 13:31:25 +00:00
Mateusz Guzik	7cc1718613	vfs: fix a race where reclaim vholds freed vnodes Reported by: pho Tested by: pho (previous version) Fixes: r366974 ("vfs: stop taking the interlock in vnode reclaim")	2020-10-24 13:30:37 +00:00
Mateusz Guzik	acb41008f3	cache: batch updates to numcache in case of mass removal	2020-10-24 01:14:52 +00:00
Mateusz Guzik	208cb7c4b6	cache: refactor alloc/free This in particular centralizes manipulation of numcache.	2020-10-24 01:14:17 +00:00
Mateusz Guzik	1d44405690	cache: fold branch prediction into cache_ncp_canuse	2020-10-24 01:13:47 +00:00
Mateusz Guzik	c13d7d1f98	cache: fix some typos	2020-10-24 01:13:16 +00:00
Mateusz Guzik	f878526f20	cache: drop write-only vars	2020-10-24 01:13:02 +00:00
Ruslan Bukin	9729b14985	Move the iommu stubs to a generic place, so they are available on all the platforms. This allows to not depend on the IOMMU macro in AHCI driver. Requested by: kib Suggested by: andrew Reviewed by: kib Sponsored by: Innovate DSbD Differential Revision: https://reviews.freebsd.org/D26887	2020-10-23 21:27:48 +00:00
Mateusz Guzik	3862838921	cache: reduce memory waste in struct namecache The previous scheme for calculating the total size was doing sizeof on the struct and then adding the wanted space for the buffer. nc_name is at offset 58 while sizeof(struct namecache) is 64. With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were allocating 104 bytes for the entry and never accounting for the 6 byte padding, wasting that space.	2020-10-23 15:56:22 +00:00
Mateusz Guzik	703f3fafa5	vfs: stop taking the interlock in vnode reclaim It no longer protects any of tested fields, keeping all the checks racy. While here make vtryrecycle drop the vnode on its own. Avoids an additional lock trip.	2020-10-23 15:49:18 +00:00
Mateusz Guzik	c7520caa4f	vfs: prevent avoidable evictions on mkdir of existing directories mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST. The NOCACHE lookup will make the namecache unnecessarily evict the existing entry, and then fallback to the fs lookup routine eventually leading namei to return an error as the directory is already there. For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers fallbacks to the slowpath for concurrently executing lookups. Tested by: pho Discussed with: kib	2020-10-22 19:28:12 +00:00
Mateusz Guzik	54f09403a3	cache: assert the created entry does not point to itself	2020-10-22 19:22:34 +00:00

17848 Commits