freebsd-skq

Author	SHA1	Message	Date
Mateusz Guzik	9d4e369ae8	Don't generate data in sysctl_out_proc unless we intend to copy out. The first call is used to gauge how much spaces is needed. Just computing the size instead of generating the output allows to not take the proctree lock.	2018-02-25 15:16:58 +00:00
Jeff Roberson	1c2529ab32	Fix issues with sparse cpu allocation. Consistently use mp_maxid + 1. Reported by: pho Reviewed by: markj Sponsored by: Netflix, Dell/EMC Isilon	2018-02-25 00:35:21 +00:00
Conrad Meyer	63901c0171	kern/sys_generic.c: style(9) return(foo) -> return (foo) No functional change. Sponsored by: Dell EMC Isilon	2018-02-24 01:15:33 +00:00
Jeff Roberson	5f8cd1c0bf	Add a generic Proportional Integral Derivative (PID) controller algorithm and use it to regulate page daemon output. This provides much smoother and more responsive page daemon output, anticipating demand and avoiding pageout stalls by increasing the number of pages to match the workload. This is a reimplementation of work done by myself and mlaier at Isilon. Reviewed by: bsdimp Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14402	2018-02-23 22:51:51 +00:00
Kirk McKusick	16680b6af5	Include error number in the "fsync: giving up on dirty" message (in case it ever starts happening again in spite of 328444). Submitted by: Andreas Longwitz <longwitz at incore.de>	2018-02-23 21:57:10 +00:00
Konstantin Belousov	4c8a8cfcde	Restore UP build. Reviewed by: truckman Sponsored by: The FreeBSD Foundation	2018-02-23 18:26:31 +00:00
Ed Maste	315fbaeca2	Correct pseudo misspelling in sys/ comments contrib code and #define in intel_ata.h unchanged.	2018-02-23 18:15:50 +00:00
Don Lewis	97e9382d56	Decrease latency by not wrapping the idle loop's potentially lengthy search for a thread to steal inside a critical section. Since this allows the search to be preempted, restart the search if preemption happens since the search results found earlier may no longer be valid. Decrease the latency of starting a thread that may be assigned to this CPU during the search by polling for incoming threads during the search and switching to that thread instead of continuing the search. Test for stale search results and restart the search before going through the expense of calling tdq_lock_pair(). Retry some tests after grabbing the locks since things may have changed while waiting to get both locks. Eliminate special case handling for stealing from an SMT peer that uses 1 as the steal threshold. This can only succeed if a thread has been assigned but our SMT peer has not yet started executing it. This is quite rare and when it happens the other SMT thread is generally waiting for the same tdq lock that we hold. Basically both SMT threads are racing to grab the same spin lock. Add the kern.sched.always_steal knob from a ULE patch by jeff@. Incorporate another idea from Jeff's ULE patch. If the sched_switch() detects that the CPU is about to go idle, try to steal a thread before switching to the idle thread. Since the search for a thread to steal has to be done inside a critical section in this context, limit the impact on latency by adding the knob kern.sched.trysteal_limit to limit the topological distance of the search and don't restart the search if we detect stale results. If this search can't find an stealable thread, the idle loop can do a more complete search. Also poll for threads being assigned to this CPU during the search and switch to them instead of continuing the search. This change is responsibile for the majority of the improvement in parallel buildworld times. In sched_balance_group() change the minimum threshold from stealing a thread from 1 to 2. Poaching a newly assigned thread from a CPU that is waking up hasn't yet switched to that thread from idle is likely very rare and is likely to have the same lock race as is seen when stealing threads in the idle loop. Also use tdq_notify() to kick the destintation CPU instead of always sending an IPI. Update a stale comment, the number of transferable threads is not calculated. Reviewed by: kib (earlier version) Comments by: avg, jeff, mav MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12130	2018-02-23 00:12:51 +00:00
Mateusz Guzik	a0c722bdbf	Fix up sysctl vfs.buffercache broken in r329612 Sample problem: top: sysctl(vfs.bufspace...) expected 8, got 4 Reported by: O. Hartmann <ohartmann walstatt.org>	2018-02-22 20:39:25 +00:00
Eric van Gyzen	0127914caa	sched_ule: update a comment to reflect reality MFC after: 3 days Sponsored by: Dell EMC	2018-02-22 17:09:26 +00:00
Jeff Roberson	683ca3a432	Fix the broken subqueue assignment for the cleanq. Reported by: pho Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-02-20 21:27:17 +00:00
Mateusz Guzik	500ca73d43	mtx: add debug assertions to mtx_spin_wait_unlocked	2018-02-20 20:39:34 +00:00
Mateusz Guzik	862db53fb5	Fix reaping on process fd close broken after r329449 The only consumer of proc_reap other than proc_to_reap was not updated to not PROC_SLOCK. Reported by: Juan Ramon Molina Menor <listjm club.fr>	2018-02-20 20:19:38 +00:00
Brooks Davis	b81e88d296	Reduce duplication in dynamic syscall registration code. Remove the unused syscall_(de)register() functions in favor of the better documented and easier to use syscall_helper_(un)register(9) functions. The default and freebsd32 versions differed in which array of struct sysents they used and a few missing updates to the 32-bit code as features were added to the main code. Reviewed by: cem Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14337	2018-02-20 18:08:57 +00:00
Mateusz Guzik	681a1b752c	Make killpg1 perform process validity checks without proc lock held.	2018-02-20 10:52:07 +00:00
Mateusz Guzik	81d68271d7	Reduce contention on the proctree lock during heavy package build. There is a proctree -> allproc ordering established. Most of the time it is either xlock -> xlock or slock -> slock. On fork however there is a slock -> xlock pair which results in pathological wait times due to threads keeping proctree held for reading and all waiting on allproc. Switch this to xlock -> xlock. Longer term fix would get rid of proctree in this place to begin with. Right now it is necessary to walk the session/process group lists to determine which id is free. The walk can be avoided e.g. with bitmaps. The exit path used to have one place which dealt with allproc and then with proctree. Move the allproc acquire into the section protected by proctree. This reduces contention against threads waiting on proctree in the fork codepath - the fork proctree holder does not have to wait for allproc as often. Finally, move tidhash manipulation outside of the area protected by either of these locks. The removal from the hash was already unprotected. There is no legitimate reason to look up thread ids for a process still under construction. This results in about 50% wait time reduction during -j 128 package build.	2018-02-20 02:18:30 +00:00
Jeff Roberson	06220fa737	Further parallelize the buffer cache. Provide multiple clean queues partitioned into 'domains'. Each domain manages its own bufspace and has its own bufspace daemon. Each domain has a set of subqueues indexed by the current cpuid to reduce lock contention on the cleanq. Refine the sleep/wakeup around the bufspace daemon to use atomics as much as possible. Add a B_REUSE flag that is used to requeue bufs during the scan to approximate LRU rather than locking the queue on every use of a frequently accessed buf. Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts. Reviewed by: markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14274	2018-02-20 00:06:07 +00:00
Mateusz Guzik	2ca66c1ef5	Fix process exit vs reap race introduced in r329449 The race manifested itself mostly in terms of crashes with "spin lock held too long". Relevant parts of respective code paths: exit: reap: PROC_LOCK(p); PROC_SLOCK(p); p->p_state == PRS_ZOMBIE PROC_UNLOCK(p); PROC_LOCK(p); /* exit work / if (p->p_state == PRS_ZOMBIE) / true / proc_reap() free proc / more exit work */ PROC_SUNLOCK(p); Thus a still exiting process is reaped. Prior to the change the zombie check was followed by slock/sunlock trip which prevented the problem. Even code prior to this commit has a bug: the proc is still accessed for statistic collection purposes. However, the severity is rather small and the bug may be fixed in a future commit. Reported by: many Tested by: allanjude	2018-02-19 00:54:08 +00:00
Mateusz Guzik	d257698833	mtx: add mtx_spin_wait_unlocked The primitive can be used to wait for the lock to be released. Intended usage is for locks in structures which are about to be freed. The benefit is the avoided interrupt enable/disable trip + atomic op to grab the lock and shorter wait if the lock is held (since there is no worry someone will contend on the lock, re-reads can be more aggressive). Briefly discussed with: kib	2018-02-19 00:38:14 +00:00
Mateusz Guzik	7beb60820f	exit: get rid of PROC_SLOCK when checking a process to report, take #2 The suspension counter needs synchronisation through slock, but we don't need it to check if inspecting the counter is necessary to begin with. In the common case it is not, thus avoid the lock if possible. Reviewed by: kib Tested by: pho	2018-02-18 21:07:15 +00:00
Mariusz Zaborski	965cd21173	Fix broken assertion in r329520. Reported by: pho@ lwhsu@	2018-02-18 20:04:39 +00:00
Brooks Davis	7a095112b2	Correct/improve the descriptions if kern.ipc.(shmsegs,sema,msqids). The description of kern.ipc.shmsegs was wrong since 2005. I updated the others (which were more correct) to match. PR: 225933 Reviewed by: cem MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14391	2018-02-18 19:19:36 +00:00
Mariusz Zaborski	20641651ec	Use the fdeget_locked function instead of the fget_locked in the sys_capability. Reviewed by: pjd@ (earlier version) Discussed with: mjg@	2018-02-18 15:27:24 +00:00
Mateusz Guzik	8bf6ff2226	Revert r329448. Turns out is is actually racy, reproducible with stress2/misc/truss.sh Requested by: kib	2018-02-17 17:23:43 +00:00
Mateusz Guzik	e4ccf57fdc	Undo LOCK_PROFILING pessimisation after r313454 and r313455 With the option used to compile the kernel both sx and rw shared ops would always go to the slow path which added avoidable overhead even when the facility is disabled. Furthermore the increased time spent doing uncontested shared lock acquire would be bogusly added to total wait time, somewhat skewing the results. Restore old behaviour of going there only when profiling is enabled. This change is a no-op for kernels without LOCK_PROFILING (which is the default).	2018-02-17 12:07:09 +00:00
Mateusz Guzik	ad58e5e86c	exit: stop doing PROC_SLOCK just to call proc_reap It immediately does PROC_SUNLOCK anyway and the lock plays no role.	2018-02-17 09:03:11 +00:00
Mateusz Guzik	9c0e785c58	exit: get rid of PROC_SLOCK when checking a process to report All accessed fields are protected with already held process lock.	2018-02-17 08:48:45 +00:00
Mateusz Guzik	015cd8dc93	On process exit signal the parent after dropping the proctree lock.	2018-02-17 00:24:50 +00:00
Mateusz Guzik	7e588b9219	Unref the prison after proctree is dropped.	2018-02-17 00:23:56 +00:00
Mateusz Guzik	65f29b9caa	Postpone sx_sunlock(&proctree_lock) on fork until after allproc is dropped. There is a significant contention on the lock during -j 128 package build. This change drops total wait time on this lock by 60%.	2018-02-17 00:23:28 +00:00
Mateusz Guzik	6776bfeb8f	Tidy up kern_wait6 - don't relock curproc in msleep - don't relock proctree if P_STATCHILD is spotted - reformat the proc_to_reap call in the main loop	2018-02-17 00:21:50 +00:00
Brooks Davis	aff4f2d315	Reduce duplication in __acl_*_(file\|link). Add const to new kern_ functions and push down as required. Reviewed by: rwatson Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14174	2018-02-15 21:24:43 +00:00
Mark Johnston	05f0f0e9ea	Fix the test for SET_FOREACH termination. Unlike the queue(3) _FOREACH macros, the iterator for a SET_FOREACH is not NULL after the end of the set is reached.	2018-02-15 17:35:40 +00:00
Mateusz Guzik	f795032b47	rwlock: diff-reduction of runlock compared to sx sunlock	2018-02-14 20:37:33 +00:00
Bryan Drewery	70c144dc78	nanosleep(2): Fix bogus incrementing of rmtp by tc_tick_sbt on [EINTR]. sbt is the time in the future that the tsleep_sbt() is expected to be completed at. sbtt is the current time. Depending on the precision with sysctl kern.timecounter.alloweddeviation the start time may be incremented by tc_tick_sbt. The same increment is needed for the current time of sbtt before calculating the difference. The impact of missing this increment is that rmtp may increase by one tc_tick_sbt on every early [EINTR] return. If the same struct is passed in for rqtp as rmtp this can result in rqtp effectively incrementing by tc_tick_sbt and sleeping longer than originally intended. This problem was introduced in r247797. Reviewed by: kib, markj, vangyzen (all on an older version of the test) MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D14362	2018-02-14 18:43:50 +00:00
Mark Johnston	6026dcd7ca	Add support for zstd-compressed user and kernel core dumps. This works similarly to the existing gzip compression support, but zstd is typically faster and gives better compression ratios. Support for this functionality must be configured by adding ZSTDIO to one's kernel configuration file. dumpon(8)'s new -Z option is used to configure zstd compression for kernel dumps. savecore(8) now recognizes and saves zstd-compressed kernel dumps with a .zst extension. Submitted by: cem (original version) Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13101, https://reviews.freebsd.org/D13633	2018-02-13 19:28:02 +00:00
Ian Lepore	157f3d7649	Fix bad indentation. Whitespace only, no functional changes. Reported by: bde@	2018-02-13 17:38:08 +00:00
Jeff Roberson	e958ad4cf3	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
Ian Lepore	bd54c5acba	Add a new sysctl, debug.clock_do_io, to allow manully triggering a one-shot read or write of all registered realtime clocks. In the read case, the values read are simply discarded. For writes, there's no alternative but to actually write the current system time to the device.	2018-02-12 17:41:11 +00:00
Ian Lepore	45eee6db6f	Add a set of convenience routines for RTC drivers to use for debug output, and a debug.clock_show_io sysctl to control debugging output.	2018-02-12 17:33:14 +00:00
Ian Lepore	aedc51f11a	Replace the existing print_ct() private debugging function with a set of three public functions to format and print the three major data structures used by realtime clock drivers (clocktime, bcd_clocktime, and timespec).	2018-02-12 16:25:56 +00:00
Kirk McKusick	13a025d5d8	Merge biodone_finish() back into biodone(). The primary purpose is to make the order of operations clearer to avoid the race condition that was fixed in r328914. In particular, this commit corrects a similar race that existed in the soft updates callback. Doing some sleuthing through the SVN repository, it appears that bufdone_finish() was added to support XFS: ------------------------------------------------------------------------ r153192 \| rodrigc \| 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) \| 13 lines Changes imported from XFS for FreeBSD project: - add fields to struct buf (needed by XFS) - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3 - b_pin_count, count of pinned buffer - add new B_MANAGED flag - add breada() function to initiate asynchronous I/O on read-ahead blocks. - add bufdone_finish(), bpin(), bunpin_wait() functions Patches provided by: kan Reviewed by: phk Silence on: arch@ ------------------------------------------------------------------------ It does not appear to ever have been used for anything else. XFS was disconnected in r241607: ------------------------------------------------------------------------ r241607 \| attilio \| 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) \| 5 lines Disconnect non-MPSAFE XFS from the build in preparation for dropping GIANT from VFS. This is not targeted for MFC. ------------------------------------------------------------------------ and removed entirely in r247631: ------------------------------------------------------------------------ r247631 \| attilio \| 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) \| 5 lines Garbage collect XFS bits which are now already completely disconnected from the tree since few months. This is not targeted for MFC. ------------------------------------------------------------------------ Since XFS support is gone, there is no reason to retain biodone_finish(). Suggested by: Warner Losh (imp) Discussed with: cem, kib Tested by: Peter Holm (pho)	2018-02-09 19:50:47 +00:00
Gleb Smirnoff	f7d3578564	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
Andriy Gapon	b2387aa652	exec_map_first_page: fix an inverse condition introduced in r254138 While the bug itself was serious, as we could either pass a non-busied page to vm_pager_get_pages() or leak a busy page, it could only be triggered under a very rare condition where the page is already inserted into the object, but it is not valid yet. Reviewed by: kib MFC after: 2 weeks	2018-02-07 21:51:59 +00:00
Gleb Smirnoff	5073a08328	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]	2018-02-07 18:32:51 +00:00
Mark Johnston	1d3a1bcfac	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943	2018-02-07 16:57:10 +00:00
Ian Lepore	ce75945d3c	Use const pointers for input data not modified by clock utility functions.	2018-02-06 22:17:01 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Gleb Smirnoff	ae941b1b4e	Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC. o Call uma_startup1() after initializing kmem, vmem and domains. o Include 8 eight VM startup pages into uma_startup_count() calculation. o Account for vmem_startup() and vm_map_startup() preallocating pages. o Account for extra two allocations done by kmem_init() and vmem_create(). o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT allowed several other SYSINITs to sneak in before it, thus bumping requirement for amount of boot pages.	2018-02-06 22:06:59 +00:00
Gleb Smirnoff	f4bef67c9c	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The functions calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only of zone structures, but also of kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00

1 2 3 4 5 ...

15875 Commits