Commit Graph

230449 Commits

Author SHA1 Message Date
Don Lewis
97e9382d56 Decrease latency by not wrapping the idle loop's potentially lengthy
search for a thread to steal inside a critical section.  Since this
allows the search to be preempted, restart the search if preemption
happens since the search results found earlier may no longer be
valid.

Decrease the latency of starting a thread that may be assigned to
this CPU during the search by polling for incoming threads during
the search and switching to that thread instead of continuing the
search.

Test for stale search results and restart the search before going
through the expense of calling tdq_lock_pair().  Retry some tests
after grabbing the locks since things may have changed while waiting
to get both locks.

Eliminate special case handling for stealing from an SMT peer that
uses 1 as the steal threshold.  This can only succeed if a thread
has been assigned but our SMT peer has not yet started executing
it.  This is quite rare and when it happens the other SMT thread
is generally waiting for the same tdq lock that we hold.  Basically
both SMT threads are racing to grab the same spin lock.

Add the kern.sched.always_steal knob from a ULE patch by jeff@.

Incorporate another idea from Jeff's ULE patch.  If the sched_switch()
detects that the CPU is about to go idle, try to steal a thread
before switching to the idle thread.  Since the search for a thread
to steal has to be done inside a critical section in this context,
limit the impact on latency by adding the knob kern.sched.trysteal_limit
to limit the topological distance of the search and don't restart
the search if we detect stale results.  If this search can't find
a stealable thread, the idle loop can do a more complete search.
Also poll for threads being assigned to this CPU during the search
and switch to them instead of continuing the search.  This change
is responsible for the majority of the improvement in parallel
buildworld times.

In sched_balance_group() change the minimum threshold for stealing
a thread from 1 to 2.  Poaching a newly assigned thread from a CPU
that is waking up but hasn't yet switched to that thread from idle
is very rare and is likely to have the same lock race as is seen
when stealing threads in the idle loop.  Also use tdq_notify() to
kick the destination CPU instead of always sending an IPI.  Update
a stale comment: the number of transferable threads is not
calculated.

Reviewed by:	kib (earlier version)
Comments by:	avg, jeff, mav
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12130
2018-02-23 00:12:51 +00:00
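As an illustration (not part of the commit), a minimal userland sketch of how the two knobs introduced above could be read and, with sufficient privilege, tuned through sysctlbyname(3); the value written at the end is only an example.

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
        int always_steal, trysteal_limit;
        size_t len;

        /* Read the current values of the two knobs added by this change. */
        len = sizeof(always_steal);
        if (sysctlbyname("kern.sched.always_steal", &always_steal, &len,
            NULL, 0) == 0)
                printf("kern.sched.always_steal: %d\n", always_steal);
        len = sizeof(trysteal_limit);
        if (sysctlbyname("kern.sched.trysteal_limit", &trysteal_limit, &len,
            NULL, 0) == 0)
                printf("kern.sched.trysteal_limit: %d\n", trysteal_limit);

        /* Example only: restrict the pre-idle steal search (needs root). */
        int new_limit = 2;
        if (sysctlbyname("kern.sched.trysteal_limit", NULL, NULL,
            &new_limit, sizeof(new_limit)) != 0)
                perror("sysctlbyname");
        return (0);
}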
Ravi Pokala
dcd935dfd1 jedec_dimm(4): report asset info and temperatures for DDR3 and DDR4 DIMMs
A superset of the functionality of jedec_ts(4): jedec_dimm(4) reports asset
information (Part Number, Serial Number) encoded in the "Serial Presence
Detect" (SPD) data on JEDEC DDR3 and DDR4 DIMMs. It also calculates and
reports the memory capacity of the DIMM, in megabytes. If the DIMM includes
a "Thermal Sensor On DIMM" (TSOD), the temperature is also reported.

Reviewed by:	cem
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D14392
Discussed with:	avg, cem
Tested by:	avg, cem (previous version, no semantic changes)
2018-02-22 23:18:46 +00:00
Ian Lepore
363b2c7fd2 Add a missing line continuation.
How many commits does it take to get a simple module makefile working?
Apparently at least three.

Pointy hat to:  ian
2018-02-22 22:25:26 +00:00
Mateusz Guzik
a0c722bdbf Fix up sysctl vfs.buffercache broken in r329612
Sample problem:
top: sysctl(vfs.bufspace...) expected 8, got 4

Reported by:	O. Hartmann <ohartmann walstatt.org>
2018-02-22 20:39:25 +00:00
Kyle Evans
66964bbc36 lualoader: Attend to some 80-col issues, pointed out by luacheck
Graphics have a tendency to cause 80-col issues, so make an exception to our
standard indentation guidelines for these graphics. This does not hamper
readability too badly.

Two 40-column strings of spaces are trivially replaced with
string.rep(" ", 80)
2018-02-22 20:10:23 +00:00
Oleksandr Tymoshenko
94b8a54ae6 [chvgpio] add GPIO driver for Intel Z8xxx SoC family
Add the chvgpio(4) driver for the Intel Z8xxx SoC family. This product
was formerly known as Cherry Trail, but the Linux and OpenBSD drivers
refer to it as Cherry View. This driver is derived from the OpenBSD
one, so the name is kept for alignment with the other BSD system.

Submitted by:	Tom Jones <tj@enoti.me>
Reviewed by:	gonzo, wblock(man page)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D13086
2018-02-22 19:12:32 +00:00
Kyle Evans
b0d9fbf070 Fix userboot w/ ZFS after r329725
r329725 cleaned up ZFS commands duplicated in multiple places, but userboot
was not setting HAVE_ZFS when MK_ZFS != "no". This resulted in a failure to
boot (as seen in PR 226118) in bhyve, with the following message:

/boot/userboot.so: Undefined symbol "ldi_get_size"

PR:		226118
Glanced at by:	imp
2018-02-22 18:49:53 +00:00
Alan Somers
deeec7728b nvmecontrol: fix build on amd64/clang
Broken by:	329824
Sponsored by:	Spectra Logic Corp
2018-02-22 17:47:16 +00:00
Eric van Gyzen
0127914caa sched_ule: update a comment to reflect reality
MFC after:	3 days
Sponsored by:	Dell EMC
2018-02-22 17:09:26 +00:00
Kyle Evans
afdc2600c2 nvme: Unbreak LE builds after r329824
The parameter 'p' is unused if _BYTE_ORDER == _LITTLE_ENDIAN. Add a
(void)p cast to fix the build.
2018-02-22 16:16:49 +00:00
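For illustration only (the structure and function here are hypothetical, not the driver's), the idiom the fix refers to looks like this:

#include <stdint.h>
#include <sys/endian.h>         /* _BYTE_ORDER, _BIG_ENDIAN, htole32() */

struct example {                /* hypothetical structure */
        uint32_t field;
};

static void
example_to_le(struct example *p)
{
#if _BYTE_ORDER == _BIG_ENDIAN
        p->field = htole32(p->field);   /* byte swap is only needed on big-endian */
#else
        (void)p;                        /* p is otherwise unused; silence the warning */
#endif
}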
Kyle Evans
5af17cb319 lua-lint: Add note about luacheck in ports, silence warning
luacheck was added in ports r462609.

Silence warning about cli_execute -- it's non-standard, but for our setup it
will be a standard global.
2018-02-22 15:29:57 +00:00
Hans Petter Selasky
949440623b Return correct error code to user-space when a system call receives a
signal in the LinuxKPI.

The read(), write() and mmap() system calls can return either EINTR or
ERESTART upon receiving a signal. Add code to figure out the correct
return value by temporarily storing the return code from the relevant
FreeBSD kernel APIs in the Linux task structure.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2018-02-22 15:29:19 +00:00
Wojciech Macek
0d787e9b35 NVMe: Add big-endian support
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           imp, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
2018-02-22 13:32:31 +00:00
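To illustrate the kind of change involved (the field layout and macro names below are simplified stand-ins, not the actual driver definitions): a C bitfield's bit ordering is implementation-defined, whereas explicit shift/mask macros applied to a value in host byte order behave the same on little- and big-endian machines.

#include <stdint.h>

/* Old style (removed): the layout of bitfields is implementation-defined. */
struct ex_status_bitfield {
        uint16_t p    : 1;      /* phase tag */
        uint16_t sc   : 8;      /* status code */
        uint16_t sct  : 3;      /* status code type */
        uint16_t rsvd : 2;
        uint16_t m    : 1;      /* more */
        uint16_t dnr  : 1;      /* do not retry */
};

/* New style: explicit shifts and masks over a host-order value. */
#define EX_STATUS_P_SHIFT       0
#define EX_STATUS_P_MASK        0x1
#define EX_STATUS_SC_SHIFT      1
#define EX_STATUS_SC_MASK       0xFF
#define EX_STATUS_GET(x, field)                                         \
        (((x) >> EX_STATUS_##field##_SHIFT) & EX_STATUS_##field##_MASK)

static inline uint8_t
ex_status_code(uint16_t status)
{
        return (EX_STATUS_GET(status, SC));
}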
Andriy Gapon
de2cb430ad another rework of getzfsvfs / getzfsvfs_impl code
This change is designed to account for yet another difference between
illumos and FreeBSD VFS.  In FreeBSD a filesystem driver is supposed to
clean up mnt_data in its VFS_UNMOUNT method because it's the last call
into the driver before a struct mount object is destroyed.  The VFS
drains all references to the object before destroying it, but for the
driver it's already as good as gone.
In contrast, illumos VFS provides another method, VFS_FREEVFS, that is
called when all references are drained.  So, the driver can keep its
data after VFS_UNMOUNT and clean it up in VFS_FREEVFS after all
references are gone. This is what ZFS does on illumos.
So, on illumos a reference to a filesystem is sufficient to guarantee that the
ZFS-specific data, aka zfsvfs_t, stays around (even if the filesystem
gets unmounted).  In FreeBSD we need to vfs_busy the filesystem to get
the same guarantee; vfs_ref guarantees only that the struct mount is
kept.

The following rules should be observed in getzfsvfs / getzfsvfs_impl on
FreeBSD:
- if we need access to zfsvfs_t then we must use vfs_busy
- if we only need to access struct mount (aka vfs_t), then vfs_ref is
  enough
- when illumos code actually needs only the vfs_t, they still can pass
  the zfsvfs_t and get the vfs_t from it;  that can work in FreeBSD if
  the filesystem is busied, but when it's just referenced then we have
  to pass the vfs_t explicitly
- we cannot call vfs_busy while holding a dataset because that creates a
  LOR with dp_config_rwlock

As a result:
- getzfsvfs_impl now only references the filesystem, same as in illumos,
  but unlike illumos it has to return the vfs_t
- the consumers are updated to account for the change
- getzfsvfs busies the filesystem (and drops the reference from
  getzfsvfs_impl)

Also, zfs_unmount_snap() now gets a busied filesystem, references it,
and then unbusies it, essentially reverting the actions done in getzfsvfs.
This is needed because the code may perform some checks that require the
zfsvfs_t.  So, those are done before the unbusying.

MFC after:	2 weeks
2018-02-22 13:06:27 +00:00
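A condensed sketch of the rules above (the _sketch functions are hypothetical; error handling and the actual lookups are elided):

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/mount.h>

static int
use_zfsvfs_sketch(struct mount *mp)
{
        /*
         * Need the zfsvfs_t (mp->mnt_data)?  Busy the filesystem so it
         * cannot be unmounted; that is what keeps mnt_data valid.
         */
        if (vfs_busy(mp, 0) != 0)
                return (ENOENT);
        /* ... safe to use mp->mnt_data here ... */
        vfs_unbusy(mp);
        return (0);
}

static void
use_vfs_only_sketch(struct mount *mp)
{
        /*
         * Only need the vfs_t itself?  A reference is enough: it keeps the
         * struct mount from being destroyed but does not prevent unmount,
         * so mnt_data must not be touched.
         */
        vfs_ref(mp);
        /* ... use mp, but not mp->mnt_data ... */
        vfs_rel(mp);
}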
Wojciech Macek
ea1d5fd117 Add bsdlabel and fdisk to powerpc64
Submitted by:          Wojciech Macek <wma@semihalf.org>
Obtained from:         Semihalf
Sponsored by:          IBM, QCM Technologies
2018-02-22 12:31:28 +00:00
Warner Losh
07e5967a22 Revert r329814 as well. It should have been in r329819. 2018-02-22 11:51:50 +00:00
Andriy Gapon
8d69fe5cc8 followup to r329556, completely remove the covered vnode assert
vrele() acquires the vnode lock only if the hold count drops to zero.
In other scenarios it needs only the interlock.  So,
zfsctl_snapdir_lookup() can race with vfs_mount_destroy() -> vrele()
such that the lookup adds a new reference, then vrele() drops the
mountpoint's reference, and only then do we check the reference count,
which would be just one in this case.

In fact, the assert should have been removed in r323483 when the code
learned how to deal with the uncovered vnode.

PR:		225795
MFC after:	4 days
X-MFC with:	r329556
2018-02-22 11:41:00 +00:00
Warner Losh
0028abe633 Backout r329818, r329816 and r329815.
These aren't the commits I thought I was testing prior to
commit. Revert until I can sort out what happened and fix it.
2018-02-22 11:18:33 +00:00
Warner Losh
91acaad987 Fix typo in last commit after last rebase before commit... 2018-02-22 10:55:23 +00:00
Marcelo Araujo
61e7e50da9 The firewall_type is ignored if not set in rc.conf or rc.conf.local;
after r190575 there is an option to call rc.firewall with the firewall_type
passed in as an argument.

Submitted by:	David P. Discher <dpd@dpdtech.com>
MFC after:	3 weeks.
Sponsored by:	iXsystems Inc.
Differential Revision:	https://reviews.freebsd.org/D14286
2018-02-22 08:25:39 +00:00
Warner Losh
4d87e27125 Combine BIO_DELETE requests for nda devices
Now that we're queueing BIO_DELETE requests in the CAM I/O scheduler,
it makes sense to try to combine as many as possible into a single
request to send down to hardware. Hopefully, lots of larger requests
like this are better than lots of individual transactions.

Note for future: need to limit based on total size of the trim
request. Should also collapse adjacent ranges where possible to
increase the size of the max payload.

Sponsored by: Netflix
2018-02-22 05:44:00 +00:00
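Purely to illustrate the 'collapse adjacent ranges' idea noted above (this is not the nda driver code, and the types are made up):

#include <stdint.h>
#include <stddef.h>

struct trim_range {
        uint64_t lba;           /* starting LBA */
        uint64_t nblocks;       /* length in blocks */
};

/*
 * Merge adjacent entries of a sorted range list in place and return the
 * new number of entries.
 */
static size_t
merge_adjacent(struct trim_range *r, size_t n)
{
        size_t out = 0;

        if (n == 0)
                return (0);
        for (size_t i = 1; i < n; i++) {
                if (r[out].lba + r[out].nblocks == r[i].lba)
                        r[out].nblocks += r[i].nblocks; /* extend the range */
                else
                        r[++out] = r[i];                /* start a new range */
        }
        return (out + 1);
}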
Warner Losh
c5fe3ae9b8 Introduce capacity flags for periphs
Introduce a flags word to describe the capacities of the peripheral.
The first bit describes whether the periph driver allows multiple
outstanding TRIMs to be active on a device.

Modify the I/O scheduler so that the nda driver can queue trims
for a while after the first one arrives. We'll queue until we see
an I/O scheduler tick, then we'll schedule as many TRIMs as allowed
by other factors (currently this is slots in the NVMe controller).
This marginally helps the read latency issues we see with reads,
but sets the stage for the nda driver to do TRIM collapsing like the
da and ada drivers do today.

Sponsored by: Netflix
2018-02-22 05:43:55 +00:00
Warner Losh
c9878d6d63 Note when we tick.
To help implement a 'queue all trims until the next I/O sched tick'
policy to coalesce them, note when we tick so we can do something
special on the first call after the tick to get more work.

Sponsored by: Netflix
2018-02-22 05:43:50 +00:00
Warner Losh
f2b9885036 Wrap an extra long line
This debugging line is too big for even my largest xterm. Wrap it at
about 80 columns.

Sponsored by: Netflix
2018-02-22 05:43:45 +00:00
Warner Losh
97f8aa050e Don't sort TRIMs.
While the code for ada and da both assumes that the trim list is
ordered when coalescing the TRIMs, it turns out that creating the
sorted list uses more resources than are saved by sending slightly
fewer trims to the device.

Sponsored by: Netflix
2018-02-22 05:43:20 +00:00
Kyle Evans
3c6b387ad1 lualoader: Clear up an empty conditional branch
We will likely uncomment this whole monster in the near future.
2018-02-22 04:30:52 +00:00
Kyle Evans
c8a0a7ab9b Add script for linting stand/lua to tools/boot.
We require some --globals due to custom loader extensions in our
environment. Add everything required for this to tools/boot so that other
interested parties can get up and go with linting our scripts and not get a
bunch of false positives.
2018-02-22 04:28:52 +00:00
Kyle Evans
e2df27e363 lualoader: Address some 'luacheck' concerns
luacheck pointed out an assortment of issues, ranging from non-standard
globals being created to unused parameters, unused variables, and redundant
assignments.

Using '_' as a placeholder for unused values (whether they be unused
parameters or unused return values, in the multiple-return case) feels clean
and gets the point across, so I've adopted it. It also helps flag candidates
for cleanup later in some of the lambdas I've created, giving me an easy way
to re-evaluate later whether we're still not using some of these features.
2018-02-22 04:15:02 +00:00
Alexander Motin
03d54eb339 MFV r329807:
8940 Sending an intra-pool resumable send stream may result in EXDEV

illumos/illumos-gate@544132fce3

"zfs send -t <token>" for an incremental send should be able to resume
successfully when sending to the same pool: a subtle issue in
zfs_iter_children() doesn't currently allow this.

Because resuming from a token requires "guid" -> "dataset" mapping
(guid_to_name()), we have to walk the whole hierarchy to find the right
snapshots to send.
When resuming an incremental send, both source and destination live in the
same pool and have the same guid: this is where zfs_iter_children() gets
confused and picks up the wrong snapshot, so we end up trying to send an
incremental "destination@snap1 -> source@snap2" stream instead of
"source@snap1 -> source@snap2": this fails with an "Invalid cross-device
link" (EXDEV) error.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 04:01:55 +00:00
Alexander Motin
34ff7cee7a 8940 Sending an intra-pool resumable send stream may result in EXDEV
illumos/illumos-gate@544132fce3

"zfs send -t <token>" for an incremental send should be able to resume
successfully when sending to the same pool: a subtle issue in
zfs_iter_children() doesn't currently allow this.

Because resuming from a token requires "guid" -> "dataset" mapping
(guid_to_name()), we have to walk the whole hierarchy to find the right
snapshots to send.
When resuming an incremental send, both source and destination live in the
same pool and have the same guid: this is where zfs_iter_children() gets
confused and picks up the wrong snapshot, so we end up trying to send an
incremental "destination@snap1 -> source@snap2" stream instead of
"source@snap1 -> source@snap2": this fails with an "Invalid cross-device
link" (EXDEV) error.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 04:01:05 +00:00
Kyle Evans
011eae6c57 lualoader: Consistently use double quotes 2018-02-22 03:55:02 +00:00
Alexander Motin
dd9ceab333 MFV r329803:
9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()

illumos/illumos-gate@bdfded42e6

A scenario came up where a callback executed by vdev_indirect_remap() on a vdev calls
vdev_indirect_remap() on the same vdev and tries to reacquire the vdev_indirect_rwlock that
was already acquired by the first call to vdev_indirect_remap(). The specific scenario
is that we want to remap a block pointer that is snapshotted, but its dataset's remap_deadlist
is not cached. So, in order to add it, we issue a read through vdev_indirect_remap() on the
same vdev, which brings up the aforementioned issue.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-02-22 03:54:59 +00:00
Kyle Evans
4ab039b57f lualoader: Eliminate some unused locals 2018-02-22 03:53:49 +00:00
Alexander Motin
4195015764 9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()
illumos/illumos-gate@bdfded42e6

A scenario came up where a callback executed by vdev_indirect_remap() on a vdev calls
vdev_indirect_remap() on the same vdev and tries to reacquire the vdev_indirect_rwlock that
was already acquired by the first call to vdev_indirect_remap(). The specific scenario
is that we want to remap a block pointer that is snapshotted, but its dataset's remap_deadlist
is not cached. So, in order to add it, we issue a read through vdev_indirect_remap() on the
same vdev, which brings up the aforementioned issue.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-02-22 03:52:03 +00:00
Alexander Motin
064827be34 MFV r329799, r329800:
9079 race condition in starting and ending condensing thread for indirect vdevs

illumos/illumos-gate@667ec66f1b

The timeline of the race condition is the following:
[1] Thread A is about to finish condensing the first vdev in spa_condense_indirect_thread(),
so it calls the spa_condense_indirect_complete_sync() sync task which sets the
spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A
sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock
and set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks
whether it should condense the second vdev in vdev_indirect_should_condense() by checking
the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread()
from thread A. So it goes on and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync(), and the aforementioned assertion fails because thread A
has not set spa_condense_thread to NULL (which is basically the last thing it does before
returning).

The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to
signify whether a condensing thread is running. Ideally we would only use one throughout the
codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which
basically ties condensing to scrubbing when it comes to pausing and resuming those actions
during spa export.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-02-22 03:49:06 +00:00
Ed Maste
a0409b6f36 Remove accidental vim droppings
Reported by:	cy
2018-02-22 03:37:01 +00:00
Alexander Motin
9a64996e42 Missed pieces of r329799. 2018-02-22 03:23:43 +00:00
Alexander Motin
e7245cfc4c 9079 race condition in starting and ending condensing thread for indirect vdevs
illumos/illumos-gate@667ec66f1b

The timeline of the race condition is the following:
[1] Thread A is about to finish condensing the first vdev in spa_condense_indirect_thread(),
so it calls the spa_condense_indirect_complete_sync() sync task which sets the
spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A
sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock
and set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks
whether it should condense the second vdev in vdev_indirect_should_condense() by checking
the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread()
from thread A. So it goes on and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync(), and the aforementioned assertion fails because thread A
has not set spa_condense_thread to NULL (which is basically the last thing it does before
returning).

The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to
signify whether a condensing thread is running. Ideally we would only use one throughout the
codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which
basically ties condensing to scrubbing when it comes to pausing and resuming those actions
during spa export.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-02-22 03:22:27 +00:00
Alexander Motin
1ea10a60f9 MFV r329793, r329795:
9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a

Some work has been done lately to improve the debuggability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there are a few changes that were made to make the
import process more resilient and crash-free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted; however, it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools across
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 03:15:35 +00:00
John Baldwin
642ffab5fc Avoid grabbing locks when grabbing the vt(4) console for DDB.
Trying to grab locks during cngrab() when entering the debugger is
deadlock prone as all other CPUs are already halted (and thus unable
to release locks) when cngrab() is invoked.  One could instead use
try-locks.  However, the case that the try-lock fails still has to
be handled.  In addition, if the try-lock works it doesn't provide
any greater ordering guarantees than is already provided by entering
and exiting DDB.  It is simpler to define a simpler path for the
case that the try-lock would fail and always use that when entering
DDB.  Messing with timers, etc. when entering DDB is dubious even if
the try-lock succeeds.

This patch attempts to use the smallest possible set of operations to
grab the vt(4) console when entering DDB without using any locks.

Reviewed by:	emaste
Tested by:	Matthew Macy
MFC after:	1 week
2018-02-22 02:26:29 +00:00
Alexander Motin
250699c304 r329793 | mav | 2018-02-22 04:21:03 +0200 (Thu, 22 Feb 2018) | 58 lines
9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a

Some work has been done lately to improve the debuggability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there are a few changes that were made to make the
import process more resilient and crash-free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted; however, it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools across
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 02:25:09 +00:00
Ed Maste
eae594f7d5 Correct proper nouns in the Linuxulator
- Capitalize Linux
- Spell FreeBSD out in full
- Address some style(9) on changed lines

Sponsored by:	Turing Robotic Industries Inc.
2018-02-22 02:24:17 +00:00
Alexander Motin
2e4bc6ee5c 9075 Improve ZFS pool import/load process and corrupted pool recovery
illumos/illumos-gate@6f7938128a

Some work has been done lately to improve the debuggability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there are a few changes that were made to make the
import process more resilient and crash-free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted; however, it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools across
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 02:21:03 +00:00
John Baldwin
6619d9fb70 Bring in additional constants and message fields for TLS-related messages.
Sponsored by:	Chelsio Communications
2018-02-22 02:02:31 +00:00
Ed Maste
581bf7cbda Use 'const int *' for sysentvec errno translation table
This allows an sv_errtbl to be placed in read-only .rodata.

Sponsored by:	Turing Robotic Industries Inc.
2018-02-22 01:59:59 +00:00
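For illustration (the table below is hypothetical, not FreeBSD's actual translation table): a const-qualified array can be placed by the linker in .rodata, which is exactly what typing the sysentvec member as 'const int *' permits.

/* Hypothetical errno translation table; const lets it live in .rodata. */
static const int example_errtbl[] = {
        0,      /*  0           */
        1,      /*  1  EPERM    */
        2,      /*  2  ENOENT   */
        /* ... remaining entries elided ... */
};

/* A sysentvec-style pointer can reference the read-only table. */
static const int *example_sv_errtbl = example_errtbl;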
Kyle Evans
2e4dad82c7 lualoader: Attach cli command functions to cli module
Instead of the global namespace, let's attach these to the cli module. Other
users, including the "local" module, can attach functions to the cli module
at will to add other cli commands and things will still Just Work.

This distills down the candidates for functions that may be invoked via the
cli to a minimal set (boot, autoboot, arguments), rather than any function
that happens to live in the global lua namespace.
2018-02-22 01:57:38 +00:00
John Baldwin
125d42fe81 Move DDP PCB state into a helper structure.
This consolidates all of the DDP state in one place.  Also, the code has
now been fixed to ensure that DDP state is only accessed for DDP
connections.  This should not be a functional change but makes it cleaner
and easier to add state for other TOE socket modes in the future.

MFC after:	1 month
Sponsored by:	Chelsio Communications
2018-02-22 01:50:30 +00:00
Kyle Evans
eca5ca66d0 lualoader: Pull argument extraction for cli functions into cli.arguments
This will be the translation layer for varargs -> cmd_name, argv for cli
commands. We reserve the right to break exactly what the varargs inclulde,
but this gives us a stable way to pull the arguments out of varargs.
2018-02-22 01:44:30 +00:00
Alexander Motin
613b0d87da 8942 zfs promote .../%recv should be an error
illumos/illumos-gate@add927f8c8

Reported on ZFSonLinux at https://github.com/zfsonlinux/zfs/issues/4843,
fixed by https://github.com/zfsonlinux/zfs/pull/6339:

If we are in the middle of an incremental zfs receive, the child .../%recv
will exist. If you concurrently run zfs promote .../%recv, it will "work",
but then zfs gets confused. For example, there's no obvious way to destroy
the containing filesystem (because it is now a clone of its invisible child).

Attempting to do this promote should be an error. We could fix this by
having zfs_ioc_promote() check if zc_name contains a %, similar to
zfs_ioc_rename().

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 01:42:13 +00:00
Kyle Evans
3e6c7d5436 lualoader: Unbreak 'boot [kernel]' by including config 2018-02-22 01:31:05 +00:00