illumos/illumos-gate@94ddd0900ahttps://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race is present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@cf07d3da99https://www.illumos.org/issues/8603:
To help make the ZIL's code more understandable, it was suggested that
the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".
Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@a3b2868063https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.
This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
break is probably intended and correct,
but has no correctness implications due to is94 => is96
Reviewed by: cem, jilles
Reported by: swildner@DragonFlyBSD.org
MFC After: 1 week
At least on GCC7 calling __alloc_size(x) twice is not equivalent to
calling using the attribute once with two arguments. The later is the
documented use in GCC documentation so add a new alloc_size(n, x)
alternative to cover for the few places where it is used: basically:
calloc(3), reallocarray(3) and mallocarray(9).
Submitted by: Mark Millard
MFC after: 3 days
Reference:
http://docs.freebsd.org/cgi/mid.cgi?F227842D-6BE2-4680-82E7-07906AF61CD7
illumos/illumos-gate@a3b2868063https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.
This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Nowadays we do not pass zfs_cmd_t directly through the ioctl interface.
Instead a small zfs_iocparm_t object is passed and the command is
explicitly copied in and out. So, the check has become irrelevant.
MFC after: 3 weeks
Sponsored by: Panzura
libc is set for WARNS=2, but the incoming libregex will use WARNS=6.
Sprinkle some casts and (void)bc's to alleviate the warnings that come along
with the higher WARNS level.
These 'bc' parameters could be outright removed, but as of right now they
will be used in some parts of libregex land. Silence the warnings instead
rather than flip-flopping.
Similarly as other extres pseudo-drivers, implement phy by using kobj model.
This detaches it from provider device, so single device driver can export
multiple different phys. Additionally, this allows phy to be subclassed to
more specialized drivers, like is USB OTG phy, or PCIe phy with hot-plug
capability.
Tested by: manu (previous version, on Allwinner board)
MFC after: 1 month
During set_freq a clknode might have reparent (using a better parent that
have a higher frequency for example), before refreshing the cache, re-get
the parent frequency.
Reviewed by: mmel
functionality on Raspberry Pi 0.
Reviewed by: hselasky@
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D13924
possible to change string and numeric vendor and product identifiers,
as well as anything else there might be to change for a particular
device side template, eg the MAC address.
Reviewed by: hselasky@
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D13920
In several places, entry start and end field are checked, after
excluding the possibility that the entry is map->header. By assigning
max and min values to the start and end fields of map->header in
vm_map_init, the explicit map->header checks become unnecessary.
Submitted by: Doug Moore <dougm@rice.edu>
Reviewed by: alc, kib, markj (previous version)
Tested by: pho (previous version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13735
This fixes a panic when `EVDEV_SUPPORT` is enabled: if a trackpoint
packet was detected but there was no trackpoint, we still tried to emit an
evdev event even though the associated relative evdev device (`evdev_r`)
was not initialized.
PR: 225339
MFC after: 1 week
In psmprobe(), we set the initial `syncmask` to the vendor default value
if the `PSM_CONFIG_NOCHECKSYNC` bit is unset. However, we currently only
set it for the Elantech touchpad later in psmattach(), thus `syncmask`
is always configured.
Now, we check `PSM_CONFIG_NOCHECKSYNC` and skip sync check if it is set.
This fixes Elantech touchpad support for units which have `hascrc` set.
To clarify that, when we log the `syncmask` and `syncbits` fields, also
mention if they are actually used.
Finally, when we set `PSM_CONFIG_NOCHECKSYNC`, clear `PSM_NEED_SYNCBITS`
flag.
PR: 225338
MFC after: 1 week
Restore the original character to print if we used the look-ahead
buffer, but that didn't help -- we either got an illegal sequence
or still can't complete.
PR: 224552
Submitted by: Yuri Pankov
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13963
any children prior to detach.
With the newbus child deletion ordering changes introduced in r307518,
parent devices are now detached (and their driver set to NULL) prior to
detaching and deleting child devices; child-related bus methods (e.g.
BUS_CHILD_DETACHED, BUS_CHILD_DELETED) are no longer be dispatched to the
parent device driver after it returns 0 (success) from DEVICE_DETACH.
Sponsored by: The FreeBSD Foundation
addressing. The host addressing constraint does not apply to device address
space, and shouldn't be passed to bhnd_get_dma_translation() as the
maximum supported device address width.
Sponsored by: The FreeBSD Foundation
- Extend the probe method to accept devclasses that inherit from the pci
devclass (e.g. cardbus).
- Some BCM4306-based CardBus adapters appear to advertise 4K SPROM, but
only the first 2K is mapped into BAR0. We can safely assume that the
SPROM data fits within the first 2K of the SPROM, rather than rejecting
the SPROM mapping as invalid.
Sponsored by: The FreeBSD Foundation
(i386 and arm) that never implement them. This allows the removal of
#ifdef PHYS_TO_DMAP on code otherwise protected by a runtime check on
PMAP_HAS_DMAP. It also fixes the build on ARM and i386 after I forgot an
#ifdef in r328168.
Reported by: Milan Obuch
Pointy hat to: me
still active.
Map userspace portion of VA in the PTI kernel-mode page table as
non-executable. This way, if we ever miss reloading ucr3 into %cr3 on
the return to usermode, the process traps instead of executing in
potentially vulnerable setup. Catch the condition of such trap and
verify user-mode %cr3, which is saved by page fault handler.
I peek this trick in some article about Linux implementation.
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
DIfferential revision: https://reviews.freebsd.org/D13956
- Do not panic on siba(4) detach when the bhnd(4) bus calls
bhnd_get_pmu_info() on a PMU-less device.
- Fix bhnd_pwrctl attach/detach on fixed-clock devices:
- Treat bhnd_pwrctl_updateclk() as a no-op on fixed-clock devices.
- Use bhnd_pwrctl_updateclk() to perform the appropriate clock
transition on detach.
Sponsored by: The FreeBSD Foundation
Highlights of this update:
- /__local_fixups__ is now generated to be GPL dtc and libfdt compliant
- Compiling with -@ will now cause dtc to assign phandles to all labelled
nodes
- /include/ and /incbin/ now handle absolute paths correctly
- The manpage now has information about overlays, including how to apply
them and how to generate them
- Syntactic sugar for overlays is now supported, allowing an overlay DTS
like:
=
/dts-v1/;
/plugin/;
&foo {
foo,status = "okay";
};
=
to generate a fragment targetting <&foo>.
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
Kernel Page Table Isolation (KPTI) was introduced in r328083 as a
mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected,
per https://www.amd.com/en/corporate/speculative-execution:
We believe AMD processors are not susceptible due to our use of
privilege level protections within paging architecture and no
mitigation is required.
Thus default KPTI to off for AMD CPUs, and to on for others. This may
be refined later as we obtain more specific information on the sets of
CPUs that are and are not affected.
Submitted by: Mitchell Horne
Reviewed by: cem
Relnotes: Yes
Security: CVE-2017-5754
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D13971
Make it possible to retrieve mmc parameters via the XPT_GET_ADVINFO
call instead. Convert camcontrol to the new scheme.
Reviewed by: imp. kibab
Sponsored by: Netflix
Differential Revision: D13868