freebsd-dev

Author	SHA1	Message	Date
Mateusz Guzik	41b0046a4d	vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9) Reviewed by: kib	2016-12-31 19:59:31 +00:00
Mateusz Guzik	0b3b55a0f2	Remove cpu_spinwait after seq_consistent. It does not add any benefit as the read routine will do it as necessary.	2016-12-30 06:26:17 +00:00
Mateusz Guzik	4938d86764	cache: sprinkle __predict_false	2016-12-29 16:35:49 +00:00
Mateusz Guzik	b37707533e	cache: move shrink lock init to nchinit This gets rid of unnecesary sysinit usage. While here also rename the lock to be consistent with the rest.	2016-12-29 12:01:54 +00:00
Mateusz Guzik	0569bc9ca9	cache: depessimize hashing macros/inlines All hash sizes are power-of-2, but the compiler does not know that for sure and 'foo % size' forces doing a division. Store the size - 1 and use 'foo & hash' instead which allows mere shift.	2016-12-29 08:41:25 +00:00
Mateusz Guzik	6dd9661b77	cache: drop the NULL check from VP2VNODELOCK Now that negative entries are annotated with a dedicated flag, NULL vnodes are no longer passed.	2016-12-29 08:34:50 +00:00
John Baldwin	1fabda45c3	Regen after r310638. Differential Revision: https://reviews.freebsd.org/D8854	2016-12-27 20:22:17 +00:00
John Baldwin	34ed0c63c8	Rename the 'flags' argument to getfsstat() to 'mode' and validate it. This argument is not a bitmask of flags, but only accepts a single value. Fail with EINVAL if an invalid value is passed to 'flag'. Rename the 'flags' argument to getmntinfo(3) to 'mode' as well to match. This is a followup to r308088. Reviewed by: kib MFC after: 1 month	2016-12-27 20:21:11 +00:00
Konstantin Belousov	fd30dd7c26	Make knote KN_INFLUX state counted. This is final fix for the issue closed by r310302 for knote(). If KN_INFLUX \| KN_SCAN flags are set for the note passed to knote() or knote_fork(), i.e. the knote is scanned, we might erronously clear INFLUX when finishing notification. For normal knote() it was fixed in r310302 simply by remembering the fact that we do not own KN_INFLUX, since there we own knlist lock and scan thread cannot clear KN_INFLUX until we drop the lock. For knote_fork(), the situation is more complicated, e must drop knlist lock AKA the process lock, since we need to register new knotes. Change KN_INFLUX into counter and allow shared ownership of the in-flux state between scan and knote_fork() or knote(). Both in-flux setters need to ensure that knote is not dropped in parallel. Added assert about kn_influx == 1 in knote_drop() verifies that in-flux state is not shared when knote is destroyed. Since KBI of the struct knote is changed by addition of the int kn_influx field, reorder kn_hook and kn_hookid to fill pad on LP64 arches [1]. This keeps sizeof(struct knote) to same 128 bytes as it was before addition of kn_influx, on amd64. Reviewed by: markj Suggested by: markj [1] Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D8898	2016-12-26 19:33:40 +00:00
Konstantin Belousov	5c36b2e8cb	Change knlist_destroy() to assert that knlist is empty instead of accepting the wrong state and printing warning. Do not obliterate kl_lock and kl_unlock pointers, they are often useful for post-mortem analysis. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D8898	2016-12-26 19:28:10 +00:00
Konstantin Belousov	34311568dc	Style. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D8898	2016-12-26 19:26:40 +00:00
Konstantin Belousov	fc05543fa7	Some optimizations for kqueue timers. There is no need to do two allocations per kqueue timer. Gather all data needed by the timer callout into the structure and allocate it at once. Use the structure to preserve the result of timer2sbintime(), to not perform repeated 64bit calculations in callout. Remove tautological casts. Remove now unused p_nexttime [1]. Noted by: markj [1] Reviewed by: markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week X-MFC note: do not remove p_nexttime Differential revision: https://reviews.freebsd.org/D8901	2016-12-25 19:49:35 +00:00
Konstantin Belousov	7611b72816	Some style. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D8901	2016-12-25 19:38:07 +00:00
Mark Johnston	eab80d9276	Add a comment explaining the race fixed by r310423. Suggested and reviewed by: jhb X-MFC With: r310423	2016-12-23 05:02:17 +00:00
Mark Johnston	aa3c544349	Revert part of r300109. The removal of TAILQ_FOREACH_SAFE introduced a small race: when the last thread on a sleepqueue is awoken, it reclaims the sleepqueue and may begin executing on a different CPU before sleepq_resume_thread() returns. This leaves a window during which it may go back to sleep and incorrectly be awoken again by the caller of sleepq_broadcast(). Reported and tested by: pho MFC after: 3 days Sponsored by: Dell EMC Isilon	2016-12-22 17:51:44 +00:00
John Baldwin	99bc7e4123	Don't spin in pause() during early boot for kthreads other than thread0. pause() uses a spin loop to simulate a sleep during early boot. However, we only need this for thread0 to get far enough in the boot process to enable timers (at which point pause() can sleep). For other kthreads, sleeping in pause() is ok as the callout will be scheduled and will eventually fire once thread0 initializes timers. Tested by: Steven Kargl Sleuthing by: markj MFC after: 1 week Sponsored by: Netflix	2016-12-20 19:44:44 +00:00
Konstantin Belousov	4afd808be7	Do not clear KN_INFLUX when not owning influx state. For notes in KN_INFLUX\|KN_SCAN state, the influx bit is set by a parallel scan. When knote() reports event for the vnode filters, which require kqueue unlocked, it unconditionally sets and then clears influx to keep note around kqueue unlock. There, do not clear influx flag if a scan set it, since we do not own it, instead we prevent scan from executing by holding knlist lock. The knote_fork() function has somewhat similar problem, it might set KN_INFLUX for scanned note, drop kqueue and list locks, and then clear the flag after relock. A solution there would be different enough, as well as the test program, so close the reported issue first. Reported and test case provided by: yjh0502@gmail.com PR: 214923 Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-19 22:18:36 +00:00
Konstantin Belousov	69baec3619	Switch from stdatomic.h to atomic.h for kernel. Apparently stdatomic.h implementation for gcc 4.2 on sparc64 does not work properly. This effectively reverts r251803. Reported and tested by: lidl Discussed with: ed Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-16 17:41:20 +00:00
Ed Schouten	669a25b50d	Document the existence of the {0, 6, ...} sysctl.	2016-12-15 15:45:11 +00:00
Jilles Tjoelker	b9a6fb9343	reaper: Make REAPER_KILL_SUBTREE actually work. MFC after: 2 weeks	2016-12-14 22:49:20 +00:00
Ed Schouten	ae15715360	Add a "device_index" label to all sysctls under dev.$driver.$index. This way it becomes possible to graph a property for all instances of a single driver. For example, graphing the number of packets across all USB controllers, the amount of dropped packets on all NICs, etc. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D8775	2016-12-14 13:03:01 +00:00
Ed Schouten	fd0f59709d	Add labels to sysctls related to clocks. Sysctls like kern.eventtimer.et.*.quality currently embed the name of the clock device. This is problematic for the Prometheus metrics exporter for two reasons: - Some of those clocks have dashes in their names, which Prometheus doesn't allow to be used in metric names. - It doesn't allow for extracting the same property of all clocks on the system from within a single query. Attach these nodes to have a label, so that the Prometheus metrics exporter gives these metric a uniform name with the name of the clock attached as a label. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D8775	2016-12-14 12:56:58 +00:00
Ed Schouten	1e1f3941e4	Add support for attaching aggregation labels to sysctl objects. I'm currently working on writing a metrics exporter for the Prometheus monitoring system to provide access to sysctl metrics. Prometheus and sysctl have some structural differences: - sysctl is a tree of string component names. - Prometheus uses a flat namespace for its metrics, but allows you to attach labels with values to them, so that you can do aggregation. An initial version of my exporter simply translated hw.acpi.thermal.tz1.temperature to sysctl_hw_acpi_thermal_tz1_temperature_celcius while we should ideally have sysctl_hw_acpi_thermal_temperature_celcius{thermal_zone="tz1"} allowing you to graph all thermal zones on a system in one go. The change presented in this commit adds support for accomplishing this, by providing the ability to attach labels to nodes. In the example I gave above, the label "thermal_zone" would be attached to "tz1". As this is a feature that will only be used very rarely, I decided to not change the KPI too aggressively. Discussed on: hackers@ Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D8775	2016-12-14 12:47:34 +00:00
Gleb Smirnoff	1276a8363c	Zero return value when counter_rate() switches over to next second and value is positive, but below the limit.	2016-12-13 20:11:45 +00:00
Mateusz Guzik	25e578de55	vfs: use vrefact in getcwd and fchdir	2016-12-12 19:16:35 +00:00
Edward Tomasz Napierala	e3d4c4dcde	Undo r309891. Konstantin is right in that this condition normally cannot happen - the um_dev field is assigned at mount and never written to afterwards.	2016-12-12 19:11:04 +00:00
Mateusz Guzik	5afb134c32	vfs: add vrefact, to be used when the vnode has to be already active This allows blind increment of relevant counters which under contention is cheaper than inc-not-zero loops at least on amd64. Use it in some of the places which are guaranteed to see already active vnodes. Reviewed by: kib (previous version)	2016-12-12 15:37:11 +00:00
Edward Tomasz Napierala	223cb0e434	Avoid dereferencing NULL pointers in devtoname(). I've seen it panic, called from ufs_print() in DDB. MFC after: 1 month	2016-12-12 15:22:21 +00:00
Konstantin Belousov	778aa66a68	Enable lookup_cap_dotdot and lookup_cap_dotdot_nonlocal. Requested and reviewed by: cem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8746	2016-12-12 11:12:04 +00:00
Konstantin Belousov	545d312293	When a zombie gets reparented due to the parent exit, send SIGCHLD to the reaper. The traditional reaper init(8) is aware of zombies silently reparented to it after the parents exit, it loops around waitpid(2) to collect them. For other reapers, the silent reparenting is surprising and collecting zombies requires a thread blocking in waitpid(2) just for that purpose. It seems that sending second SIGCHLD is a better workaround than forcing all reapers to obey the setup. Reported by: Michael Zuo <muh.muhten@gmail.com>, jilles PR: 213928 Reviewed by: jilles (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-12-12 11:11:50 +00:00
Alan Cox	2d612d2dd2	When tmpfs and POSIX shm pagein a page for the sole purpose of performing truncation, immediately queue the page for asynchronous laundering rather than making the page pass through inactive queue first. Reviewed by: kib, markj	2016-12-11 19:24:41 +00:00
Konrad Witaszczyk	480f31c214	Add support for encrypted kernel crash dumps. Changes include modifications in kernel crash dump routines, dumpon(8) and savecore(8). A new tool called decryptcore(8) was added. A new DIOCSKERNELDUMP I/O control was added to send a kernel crash dump configuration in the diocskerneldump_arg structure to the kernel. The old DIOCSKERNELDUMP I/O control was renamed to DIOCSKERNELDUMP_FREEBSD11 for backward ABI compatibility. dumpon(8) generates an one-time random symmetric key and encrypts it using an RSA public key in capability mode. Currently only AES-256-CBC is supported but EKCD was designed to implement support for other algorithms in the future. The public key is chosen using the -k flag. The dumpon rc(8) script can do this automatically during startup using the dumppubkey rc.conf(5) variable. Once the keys are calculated dumpon sends them to the kernel via DIOCSKERNELDUMP I/O control. When the kernel receives the DIOCSKERNELDUMP I/O control it generates a random IV and sets up the key schedule for the specified algorithm. Each time the kernel tries to write a crash dump to the dump device, the IV is replaced by a SHA-256 hash of the previous value. This is intended to make a possible differential cryptanalysis harder since it is possible to write multiple crash dumps without reboot by repeating the following commands: # sysctl debug.kdb.enter=1 db> call doadump(0) db> continue # savecore A kernel dump key consists of an algorithm identifier, an IV and an encrypted symmetric key. The kernel dump key size is included in a kernel dump header. The size is an unsigned 32-bit integer and it is aligned to a block size. The header structure has 512 bytes to match the block size so it was required to make a panic string 4 bytes shorter to add a new field to the header structure. If the kernel dump key size in the header is nonzero it is assumed that the kernel dump key is placed after the first header on the dump device and the core dump is encrypted. Separate functions were implemented to write the kernel dump header and the kernel dump key as they need to be unencrypted. The dump_write function encrypts data if the kernel was compiled with the EKCD option. Encrypted kernel textdumps are not supported due to the way they are constructed which makes it impossible to use the CBC mode for encryption. It should be also noted that textdumps don't contain sensitive data by design as a user decides what information should be dumped. savecore(8) writes the kernel dump key to a key.# file if its size in the header is nonzero. # is the number of the current core dump. decryptcore(8) decrypts the core dump using a private RSA key and the kernel dump key. This is performed by a child process in capability mode. If the decryption was not successful the parent process removes a partially decrypted core dump. Description on how to encrypt crash dumps was added to the decryptcore(8), dumpon(8), rc.conf(5) and savecore(8) manual pages. EKCD was tested on amd64 using bhyve and i386, mipsel and sparc64 using QEMU. The feature still has to be tested on arm and arm64 as it wasn't possible to run FreeBSD due to the problems with QEMU emulation and lack of hardware. Designed by: def, pjd Reviewed by: cem, oshogbo, pjd Partial review: delphij, emaste, jhb, kib Approved by: pjd (mentor) Differential Revision: https://reviews.freebsd.org/D4712	2016-12-10 16:20:39 +00:00
Mark Johnston	02315a6759	Use a consistent snapshot of the lock state in owner_mtx(). MFC after: 2 weeks	2016-12-10 02:59:34 +00:00
Mark Johnston	c365a2934e	Return a non-NULL owner only if the lock is exclusively held in owner_sx(). Fix some whitespace bugs while here. MFC after: 2 weeks	2016-12-10 02:56:44 +00:00
Gleb Smirnoff	5040da77c1	Use acquire write to cr_lock to complement with release write at end of locked region. Submitted by: kib	2016-12-09 19:07:31 +00:00
Gleb Smirnoff	169170209c	Provide counter_ratecheck(), a MP-friendly substitution to ppsratecheck(). When rated event happens at a very quick rate, the ppsratecheck() is not only racy, but also becomes a performance bottleneck. Together with: rrs, jtl	2016-12-09 17:58:34 +00:00
Robert Watson	52b42f6287	Regnerate system-call definitions following r309677 correcting a whitespace glitch in syscalls.master.	2016-12-07 16:12:27 +00:00
Robert Watson	82d8d2b8bc	Replace spaces with tabs in definition of SCTP system calls, for consistency with the remainder of the syscalls.master file. This problem does not occur in the freebsd32 version of the same system calls.	2016-12-07 16:11:55 +00:00
Eric van Gyzen	3d32d4a7c9	Export the whole thread name in kinfo_proc kinfo_proc::ki_tdname is three characters shorter than thread::td_name. Add a ki_moretdname field for these three extra characters. Add the new field to kinfo_proc32, as well. Update all in-tree consumers to read the new field and assemble the full name, except for lldb's HostThreadFreeBSD.cpp, which I will handle separately. Bump __FreeBSD_version. Reviewed by: kib MFC after: 1 week Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D8722	2016-12-07 15:04:22 +00:00
Konstantin Belousov	435da98564	Restructure the code to handle reporting of non-exited processes from wait(2). - Do not acquire the process spinlock if neither WTRAPPED nor WUNTRACED options were passed [1]. - Extract the code to report alive process into a new helper report_alive_proc() and use it for trapped, stopped and continued childrens. Note that the process spinlock is required around the WTRAPPED and WUNTRACED tests, because P_STOPPED_TRACE and P_STOPPED_SIG flags are set before other threads are stopped at the suspension point, and that threads increment p_suspcount while owning only the process spinlock, the process lock is dropped by them. If the spinlock is not taken for tests, the syscall thread might miss both p_suspcount increment and wakeup in wakeup in thread_suspend_switch(). Based on the submission by: mjg [1] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-12-04 20:44:58 +00:00
Eric van Gyzen	ff07dd913e	thr_set_name(): silently truncate the given name as needed Instead of failing with ENAMETOOLONG, which is swallowed by pthread_set_name_np() anyway, truncate the given name to MAXCOMLEN+1 bytes. This is more likely what the user wants, and saves the caller from truncating it before the call (which was the only recourse). Polish pthread_set_name_np(3) and add a .Xr to thr_set_name(2) so the user might find the documentation for this behavior. Reviewed by: jilles MFC after: 3 days Sponsored by: Dell EMC	2016-12-03 01:14:21 +00:00
Mateusz Guzik	a2d3554542	vfs: provide fake locking primitives for the crossmp vnode Since the vnode is only expected to be shared locked, we can save a little overhead by only pretending we are locking in the first place. Reviewed by: kib Tested by: pho	2016-12-02 18:03:15 +00:00
Mateusz Guzik	a4ce25b5b0	vfs: fix a whitespace nit in r309307	2016-11-30 02:17:03 +00:00
Mateusz Guzik	1babea0341	vfs: avoid VOP_ISLOCKED in the common case in lookup	2016-11-30 02:14:53 +00:00
Mark Johnston	64910ddbff	Launder VPO_NOSYNC pages upon vnode deactivation. As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be laundered. This behaviour has two problems: 1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be reclaimed, and this work may be unfairly deferred to the vnlru process or an unrelated application when the system is under vnode pressure. 2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if the laundry thread needs to launder pages from an unreferenced such vnode, it will reactivate and deactivate the vnode with each laundering, potentially resulting in a large number of expensive scans. Therefore, ensure that all dirty pages are laundered upon deactivation, i.e., when all maps of the vnode are removed and all references are released. Reviewed by: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8641	2016-11-26 21:00:27 +00:00
John Baldwin	9f3aabb9eb	Permit timed sleeps for threads other than thread0 before timers are working. The callout subsystem already handles early callouts and schedules the first clock interrupt appropriately based on the currently pending callouts. The one nit to fix was that callouts scheduled via C_HARDCLOCK during early boot could fire too early once timers were enabled as the per-CPU base time is always zero until timers are initialized. The change in callout_when() handles this case by using the current uptime as the base time of the callout during bootup if the per-CPU base time is zero. Reviewed by: kib MFC after: 2 weeks Sponsored by: Netflix	2016-11-25 18:02:43 +00:00
Mateusz Guzik	746b6e8176	wait: avoid relocking the child if proc_to_reap returns 1 proc_to_reap would always unlock. However, if it returned 1, kern_wait6 would immediately lock it again. Save the dance. Reviewed by: kib	2016-11-24 18:21:48 +00:00
Mateusz Guzik	8b0e0c91e0	cache: ensure that the number of bucket locks does not exceed hash size The size can be changed by side effect of modifying kern.maxvnodes. Since numbucketlocks was not modified, setting a sufficiently low value would give more locks than actual buckets, which would then lead to corruption. Force the number of buckets to be not smaller. Note this should not matter for real world cases. Reported and tested by: pho	2016-11-23 19:50:12 +00:00
Mark Johnston	99e6e1930c	Release laundered vnode pages to the head of the inactive queue. The swap pager enqueues laundered pages near the head of the inactive queue to avoid another trip through LRU before reclamation. This change adds support for this behaviour to the vnode pager and makes use of it in UFS and ext2fs. Some ioflag handling is consolidated into a common subroutine so that this support can be easily extended to other filesystems which make use of the buffer cache. No changes are needed for ZFS since its putpages routine always undirties the pages before returning, and the laundry thread requeues the pages appropriately in this case. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D8589	2016-11-23 17:53:07 +00:00
Ruslan Bukin	dd7d4f199e	Revert r306186 ("Adjust the sopt_val pointer on bigendian systems"). This logic doesn't work with bigger sopt_valsize (e.g. when ipfw passing 2048 bytes rule). Reported by: adrian Sponsored by: DARPA, AFRL	2016-11-22 18:31:43 +00:00

1 2 3 4 5 ...

15193 Commits