freebsd-skq

Author	SHA1	Message	Date
mjg	428cfb036c	mtx: add debug assertions to mtx_spin_wait_unlocked	2018-02-20 20:39:34 +00:00
mjg	fcb26b586a	Fix reaping on process fd close broken after r329449 The only consumer of proc_reap other than proc_to_reap was not updated to not PROC_SLOCK. Reported by: Juan Ramon Molina Menor <listjm club.fr>	2018-02-20 20:19:38 +00:00
brooks	3217658e61	Reduce duplication in dynamic syscall registration code. Remove the unused syscall_(de)register() functions in favor of the better documented and easier to use syscall_helper_(un)register(9) functions. The default and freebsd32 versions differed in which array of struct sysents they used and a few missing updates to the 32-bit code as features were added to the main code. Reviewed by: cem Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14337	2018-02-20 18:08:57 +00:00
mjg	d4ad49fd39	Make killpg1 perform process validity checks without proc lock held.	2018-02-20 10:52:07 +00:00
mjg	25e9d331dc	Reduce contention on the proctree lock during heavy package build. There is a proctree -> allproc ordering established. Most of the time it is either xlock -> xlock or slock -> slock. On fork however there is a slock -> xlock pair which results in pathological wait times due to threads keeping proctree held for reading and all waiting on allproc. Switch this to xlock -> xlock. Longer term fix would get rid of proctree in this place to begin with. Right now it is necessary to walk the session/process group lists to determine which id is free. The walk can be avoided e.g. with bitmaps. The exit path used to have one place which dealt with allproc and then with proctree. Move the allproc acquire into the section protected by proctree. This reduces contention against threads waiting on proctree in the fork codepath - the fork proctree holder does not have to wait for allproc as often. Finally, move tidhash manipulation outside of the area protected by either of these locks. The removal from the hash was already unprotected. There is no legitimate reason to look up thread ids for a process still under construction. This results in about 50% wait time reduction during -j 128 package build.	2018-02-20 02:18:30 +00:00
jeff	e3be9f8fb6	Further parallelize the buffer cache. Provide multiple clean queues partitioned into 'domains'. Each domain manages its own bufspace and has its own bufspace daemon. Each domain has a set of subqueues indexed by the current cpuid to reduce lock contention on the cleanq. Refine the sleep/wakeup around the bufspace daemon to use atomics as much as possible. Add a B_REUSE flag that is used to requeue bufs during the scan to approximate LRU rather than locking the queue on every use of a frequently accessed buf. Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts. Reviewed by: markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14274	2018-02-20 00:06:07 +00:00
mjg	7fcf37e646	Fix process exit vs reap race introduced in r329449 The race manifested itself mostly in terms of crashes with "spin lock held too long". Relevant parts of respective code paths: exit: reap: PROC_LOCK(p); PROC_SLOCK(p); p->p_state == PRS_ZOMBIE PROC_UNLOCK(p); PROC_LOCK(p); /* exit work / if (p->p_state == PRS_ZOMBIE) / true / proc_reap() free proc / more exit work */ PROC_SUNLOCK(p); Thus a still exiting process is reaped. Prior to the change the zombie check was followed by slock/sunlock trip which prevented the problem. Even code prior to this commit has a bug: the proc is still accessed for statistic collection purposes. However, the severity is rather small and the bug may be fixed in a future commit. Reported by: many Tested by: allanjude	2018-02-19 00:54:08 +00:00
mjg	2fd1c912f6	mtx: add mtx_spin_wait_unlocked The primitive can be used to wait for the lock to be released. Intended usage is for locks in structures which are about to be freed. The benefit is the avoided interrupt enable/disable trip + atomic op to grab the lock and shorter wait if the lock is held (since there is no worry someone will contend on the lock, re-reads can be more aggressive). Briefly discussed with: kib	2018-02-19 00:38:14 +00:00
mjg	7669ad8f23	exit: get rid of PROC_SLOCK when checking a process to report, take #2 The suspension counter needs synchronisation through slock, but we don't need it to check if inspecting the counter is necessary to begin with. In the common case it is not, thus avoid the lock if possible. Reviewed by: kib Tested by: pho	2018-02-18 21:07:15 +00:00
oshogbo	e41b3243d5	Fix broken assertion in r329520. Reported by: pho@ lwhsu@	2018-02-18 20:04:39 +00:00
brooks	91b2a928ff	Correct/improve the descriptions if kern.ipc.(shmsegs,sema,msqids). The description of kern.ipc.shmsegs was wrong since 2005. I updated the others (which were more correct) to match. PR: 225933 Reviewed by: cem MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14391	2018-02-18 19:19:36 +00:00
oshogbo	6d3d70b9a6	Use the fdeget_locked function instead of the fget_locked in the sys_capability. Reviewed by: pjd@ (earlier version) Discussed with: mjg@	2018-02-18 15:27:24 +00:00
mjg	075146c37f	Revert r329448. Turns out is is actually racy, reproducible with stress2/misc/truss.sh Requested by: kib	2018-02-17 17:23:43 +00:00
mjg	c782ed4dbc	Undo LOCK_PROFILING pessimisation after r313454 and r313455 With the option used to compile the kernel both sx and rw shared ops would always go to the slow path which added avoidable overhead even when the facility is disabled. Furthermore the increased time spent doing uncontested shared lock acquire would be bogusly added to total wait time, somewhat skewing the results. Restore old behaviour of going there only when profiling is enabled. This change is a no-op for kernels without LOCK_PROFILING (which is the default).	2018-02-17 12:07:09 +00:00
mjg	018f299fa7	exit: stop doing PROC_SLOCK just to call proc_reap It immediately does PROC_SUNLOCK anyway and the lock plays no role.	2018-02-17 09:03:11 +00:00
mjg	fc7474d6fb	exit: get rid of PROC_SLOCK when checking a process to report All accessed fields are protected with already held process lock.	2018-02-17 08:48:45 +00:00
mjg	7616e6b9b0	On process exit signal the parent after dropping the proctree lock.	2018-02-17 00:24:50 +00:00
mjg	6b281480bb	Unref the prison after proctree is dropped.	2018-02-17 00:23:56 +00:00
mjg	2ba397be42	Postpone sx_sunlock(&proctree_lock) on fork until after allproc is dropped. There is a significant contention on the lock during -j 128 package build. This change drops total wait time on this lock by 60%.	2018-02-17 00:23:28 +00:00
mjg	c1f86f952d	Tidy up kern_wait6 - don't relock curproc in msleep - don't relock proctree if P_STATCHILD is spotted - reformat the proc_to_reap call in the main loop	2018-02-17 00:21:50 +00:00
brooks	e21f157f4d	Reduce duplication in __acl_*_(file\|link). Add const to new kern_ functions and push down as required. Reviewed by: rwatson Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14174	2018-02-15 21:24:43 +00:00
markj	e1e7fa21c2	Fix the test for SET_FOREACH termination. Unlike the queue(3) _FOREACH macros, the iterator for a SET_FOREACH is not NULL after the end of the set is reached.	2018-02-15 17:35:40 +00:00
mjg	9e89b7286b	rwlock: diff-reduction of runlock compared to sx sunlock	2018-02-14 20:37:33 +00:00
bdrewery	db51b6bdae	nanosleep(2): Fix bogus incrementing of rmtp by tc_tick_sbt on [EINTR]. sbt is the time in the future that the tsleep_sbt() is expected to be completed at. sbtt is the current time. Depending on the precision with sysctl kern.timecounter.alloweddeviation the start time may be incremented by tc_tick_sbt. The same increment is needed for the current time of sbtt before calculating the difference. The impact of missing this increment is that rmtp may increase by one tc_tick_sbt on every early [EINTR] return. If the same struct is passed in for rqtp as rmtp this can result in rqtp effectively incrementing by tc_tick_sbt and sleeping longer than originally intended. This problem was introduced in r247797. Reviewed by: kib, markj, vangyzen (all on an older version of the test) MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D14362	2018-02-14 18:43:50 +00:00
markj	4c9fc08f4a	Add support for zstd-compressed user and kernel core dumps. This works similarly to the existing gzip compression support, but zstd is typically faster and gives better compression ratios. Support for this functionality must be configured by adding ZSTDIO to one's kernel configuration file. dumpon(8)'s new -Z option is used to configure zstd compression for kernel dumps. savecore(8) now recognizes and saves zstd-compressed kernel dumps with a .zst extension. Submitted by: cem (original version) Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13101, https://reviews.freebsd.org/D13633	2018-02-13 19:28:02 +00:00
ian	ae623dba38	Fix bad indentation. Whitespace only, no functional changes. Reported by: bde@	2018-02-13 17:38:08 +00:00
jeff	ba27b5187b	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
ian	e1340a314d	Add a new sysctl, debug.clock_do_io, to allow manully triggering a one-shot read or write of all registered realtime clocks. In the read case, the values read are simply discarded. For writes, there's no alternative but to actually write the current system time to the device.	2018-02-12 17:41:11 +00:00
ian	e5909c170b	Add a set of convenience routines for RTC drivers to use for debug output, and a debug.clock_show_io sysctl to control debugging output.	2018-02-12 17:33:14 +00:00
ian	19e2991d59	Replace the existing print_ct() private debugging function with a set of three public functions to format and print the three major data structures used by realtime clock drivers (clocktime, bcd_clocktime, and timespec).	2018-02-12 16:25:56 +00:00
mckusick	83f7902028	Merge biodone_finish() back into biodone(). The primary purpose is to make the order of operations clearer to avoid the race condition that was fixed in r328914. In particular, this commit corrects a similar race that existed in the soft updates callback. Doing some sleuthing through the SVN repository, it appears that bufdone_finish() was added to support XFS: ------------------------------------------------------------------------ r153192 \| rodrigc \| 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) \| 13 lines Changes imported from XFS for FreeBSD project: - add fields to struct buf (needed by XFS) - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3 - b_pin_count, count of pinned buffer - add new B_MANAGED flag - add breada() function to initiate asynchronous I/O on read-ahead blocks. - add bufdone_finish(), bpin(), bunpin_wait() functions Patches provided by: kan Reviewed by: phk Silence on: arch@ ------------------------------------------------------------------------ It does not appear to ever have been used for anything else. XFS was disconnected in r241607: ------------------------------------------------------------------------ r241607 \| attilio \| 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) \| 5 lines Disconnect non-MPSAFE XFS from the build in preparation for dropping GIANT from VFS. This is not targeted for MFC. ------------------------------------------------------------------------ and removed entirely in r247631: ------------------------------------------------------------------------ r247631 \| attilio \| 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) \| 5 lines Garbage collect XFS bits which are now already completely disconnected from the tree since few months. This is not targeted for MFC. ------------------------------------------------------------------------ Since XFS support is gone, there is no reason to retain biodone_finish(). Suggested by: Warner Losh (imp) Discussed with: cem, kib Tested by: Peter Holm (pho)	2018-02-09 19:50:47 +00:00
glebius	2c573b6df7	Fix boot_pages exhaustion on machines with many domains and cores, where size of UMA zone allocation is greater than page size. In this case zone of zones can not use UMA_MD_SMALL_ALLOC, and we need to postpone switch off of this zone from startup_alloc() until full launch of VM. o Always supply number of VM zones to uma_startup_count(). On machines with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over a page. In the latter case account VM zones for number of allocations from the zone of zones. o Rewrite startup_alloc() so that it will immediately switch off from itself any zone that is already capable of running real alloc. In worst case scenario we may leak a single page here. See comment in uma_startup_count(). o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some extra SYSINITs, e.g. vm_page_init() may sneak in before. o While here, remove uma_boot_pages_mtx. With recent changes to boot pages calculation, we are guaranteed to use all of the boot_pages in the early single threaded stage. Reported & tested by: mav	2018-02-09 04:45:39 +00:00
avg	75f02086d1	exec_map_first_page: fix an inverse condition introduced in r254138 While the bug itself was serious, as we could either pass a non-busied page to vm_pager_get_pages() or leak a busy page, it could only be triggered under a very rare condition where the page is already inserted into the object, but it is not valid yet. Reviewed by: kib MFC after: 2 weeks	2018-02-07 21:51:59 +00:00
glebius	ed8d237f4f	Fix three miscalculations in amount of boot pages: o Most of startup zones have struct uma_slab embedded into the slab, so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE, when calculating how many pages would certain kind of allocations require. Some zones are offpage, so we might have a positive inaccuracy. o The keg for the zone of zones is allocated "dynamically", so we need +1 when calculating amount of pages for kegs. [1] o The zones of zones and zones of kegs have arbitrary alignment of 32, and this also needs to be accounted for. [2] While here, spread more comments and improve diagnostic messages. Reported by: pho [1], jtl [2]	2018-02-07 18:32:51 +00:00
markj	d241ba38fa	Dequeue wired pages lazily. Previously, wiring a page would cause it to be removed from its page queue. In the common case, unwiring causes it to be enqueued at the tail of that page queue. This change modifies vm_page_wire() to not dequeue the page, thus avoiding the highly contended page queue locks. Instead, vm_page_unwire() takes care of requeuing the page as a single operation, and the page daemon dequeues wired pages as they are encountered during a queue scan to avoid needlessly revisiting them later. For pages in PQ_ACTIVE we do even better, since a requeue is unnecessary. The change improves scalability for some common workloads. For instance, threads wiring pages into the buffer cache no longer need to modify global page queues, and unwiring is usually done by the bufspace thread, so concurrency is not as much of an issue. As another example, many sysctl handlers wire the output buffer to avoid faults on copyout, and since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid modifying the page queue in this case. The change also adds a block comment describing some properties of struct vm_page's reference counters, and the busy lock. Reviewed by: jeff Discussed with: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D11943	2018-02-07 16:57:10 +00:00
ian	7f4e81c493	Use const pointers for input data not modified by clock utility functions.	2018-02-06 22:17:01 +00:00
jeff	e67ec0d694	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
glebius	59ed8012f8	Fix boot_pages calculation for machines that don't have UMA_MD_SMALL_ALLOC. o Call uma_startup1() after initializing kmem, vmem and domains. o Include 8 eight VM startup pages into uma_startup_count() calculation. o Account for vmem_startup() and vm_map_startup() preallocating pages. o Account for extra two allocations done by kmem_init() and vmem_create(). o Hardcode the place of execution of vm_radix_reserve_kva(). Using SYSINIT allowed several other SYSINITs to sneak in before it, thus bumping requirement for amount of boot pages.	2018-02-06 22:06:59 +00:00
glebius	baecd1a4b3	Followup on r302393 by cperciva, improving calculation of boot pages required for UMA startup. o Introduce another stage of UMA startup, which is entered after vm_page_startup() finishes. After this stage we don't yet enable buckets, but we can ask VM for pages. Rename stages to meaningful names while here. New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING. Enabling page alloc earlier allows us to dramatically reduce number of boot pages required. What is more important number of zones becomes consistent across different machines, as no MD allocations are done before the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use startup_alloc(), however that may change, so vm_page_startup() provides its need for early zones as argument. o Introduce uma_startup_count() function, to avoid code duplication. The functions calculates sizes of zones zone and kegs zone, and calculates how many pages UMA will need to bootstrap. It counts not only of zone structures, but also of kegs, slabs and hashes. o Hide uma_startup_foo() declarations from public file. o Provide several DIAGNOSTIC printfs on boot_pages usage. o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of mp_ncpus. Use resulting number not only in the size argument to zone_ctor() but also as args.size. Reviewed by: imp, gallatin (earlier version) Differential Revision: https://reviews.freebsd.org/D14054	2018-02-06 04:16:00 +00:00
mckusick	a7ac77d8a7	Occasional cylinder-group check-hash errors were being reported on systems running with a heavy filesystem load. Tracking down this bug was elusive because there were actually two problems. Sometimes the in-memory check hash was wrong and sometimes the check hash computed when doing the read was wrong. The occurrence of either error caused a check-hash mismatch to be reported. The first error was that the check hash in the in-memory cylinder group was incorrect. This error was caused by the following sequence of events: - We read a cylinder-group buffer and the check hash is valid. - We update its cg_time and cg_old_time which makes the in-memory check-hash value invalid but we do not mark the cylinder group dirty. - We do not make any other changes to the cylinder group, so we never mark it dirty, thus do not write it out, and hence never update the incorrect check hash for the in-memory buffer. - Later, the buffer gets freed, but the page with the old incorrect check hash is still in the VM cache. - Later, we read the cylinder group again, and the first page with the old check hash is still in the VM cache, but some other pages are not, so we have to do a read. - The read does not actually get the first page from disk, but rather from the VM cache, resulting in the old check hash in the buffer. - The value computed after doing the read does not match causing the error to be printed. The fix for this problem is to only set cg_time and cg_old_time as the cylinder group is being written to disk. This keeps the in-memory check-hash valid unless the cylinder group has had other modifications which will require it to be written with a new check hash calculated. It also requires that the check hash be recalculated in the in-memory cylinder group when it is marked clean after doing a background write. The second problem was that the check hash computed at the end of the read was incorrect because the calculation of the check hash on completion of the read was being done too soon. - When a read completes we had the following sequence: - bufdone() -- b_ckhashcalc (calculates check hash) -- bufdone_finish() --- vfs_vmio_iodone() (replaces bogus pages with the cached ones) - When we are reading a buffer where one or more pages are already in memory (but not all pages, or we wouldn't be doing the read), the I/O is done with bogus_page mapped in for the pages that exist in the VM cache. This mapping is done to avoid corrupting the cached pages if there is any I/O overrun. The vfs_vmio_iodone() function is responsible for replacing the bogus_page(s) with the cached ones. But we were calculating the check hash before the bogus_page(s) were replaced. Hence, when we were calculating the check hash, we were partly reading from bogus_page, which means we calculated a bad check hash (e.g., because multiple pages have been mapped to bogus_page, so its contents are indeterminate). The second fix is to move the check-hash calculation from bufdone() to bufdone_finish() after the call to vfs_vmio_iodone() so that it computes the check hash over the correct set of pages. With these two changes, the occasional cylinder-group check-hash errors are gone. Submitted by: David Pfitzner <dpfitzner@netflix.com> Reviewed by: kib Tested by: David Pfitzner	2018-02-06 00:19:46 +00:00
jhb	6145bc74c1	Ignore relocation tables for non-memory-resident sections. As a followup to r328101, ignore relocation tables for ELF object sections that are not memory resident. For modules loaded by the loader, ignore relocation tables whose associated section was not loaded by the loader (sh_addr is zero). For modules loaded at runtime via kldload(2), ignore relocation tables whose associated section is not marked with SHF_ALLOC. Reported by: Mori Hiroki <yamori813@yahoo.co.jp>, adrian Tested on: mips, mips64 MFC after: 1 month Sponsored by: DARPA / AFRL	2018-02-05 23:35:33 +00:00
jhb	3f3863ea7c	Always give ELF brands a chance to veto a match. If a brand provides a header_supported hook, check it when trying to find a brand based on a matching interpreter as well as in the final loop for the fallback brand. Previously a brand might reject a binary via a header_supported hook in one of the earlier loops, but still be chosen by one of these later loops. Reviewed by: kib Obtained from: CheriBSD MFC after: 2 weeks Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D13945	2018-02-05 23:27:42 +00:00
brooks	dc50858dea	Reduce duplication in extattr_*_(file\|link) syscalls. Reviewed by: rwatson Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14173	2018-02-05 19:06:34 +00:00
brooks	44d1111709	ANSIfy syscall implementations. Reviewed by: rwatson Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14172	2018-02-05 18:58:55 +00:00
brooks	59fcf4b241	Add kern.ipc.{msqids,semsegs,sema} sysctls for FreeBSD32. Stop leaking kernel pointers though theses sysctls and make sure that the padding in the structures is zeroed on allocation to avoid other leaks. Reviewed by: gordon, kib Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D13459	2018-02-02 18:03:12 +00:00
hselasky	91b534f7d7	Slightly bump the maximum OID path for loading tunable SYSCTLs. Coming updates to the mlx5en(4) driver will require this. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-02-02 12:42:46 +00:00
mckusick	dca0cf2869	One of the vnode fields listed by vn_printf is the union of pointers whose type depends on the type of vnode. Correct vn_printf so that it correctly identifies the name of the pointer that it is printing. Submitted by: Andreas Longwitz <longwitz at incore.de> MFC after: 1 week	2018-01-31 22:49:50 +00:00
emaste	187eb493d5	makesyscalls: permit a range of syscall numbers for UNIMPL Some ABIs have large gaps in syscall numbers. Allow gaps to be filled as ranges of UNIMPL, with an entry like: 248-1023 AUE_NULL UNIMPL unimplemented Reviewed by: jhb, gnn Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14122	2018-01-30 18:29:38 +00:00
bdrewery	59f27534cc	Don't use an .OBJDIR for 'make sysent'. Reported by: emaste, jhb Sponsored by: Dell EMC	2018-01-29 19:14:15 +00:00
lwhsu	520dbdaa53	Fix LINT build after r328508, add forgotten part in format string Reviewed by: delphij Differential Revision: https://reviews.freebsd.org/D14089	2018-01-29 02:29:08 +00:00

1 2 3 4 5 ...

15891 Commits