The crash can occur when all of the following conditions are true:
- a packet consists of multiple segements (requires LRO enabled)
- there has been a failure to allocate an mbuf for the packet and
the packet has to be dropped
- a host (vmware) still owned at least one segment of the packet,
so the driver had to wait for another interrupt to proceed to
discarding the remaning segment(s)
Reviewed by: rstone
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D10874
In the deep past, when this code compiled as a binary module, ath_hal
built as a module. This allowed custom, smaller HAL modules to be built.
This was especially beneficial for small embedded platforms where you
didn't require /everything/ just to run.
However, sometime around the HAL opening fanfare, the HAL landed here
as one big driver+HAL thing, and a lot of the (dirty) infrastructure
(ie, #ifdef AH_SUPPORT_XXX) to build specific subsets of the HAL went away.
This was retained in sys/conf/files as "ath_hal_XXX" but it wasn't
really floated up to the modules themselves.
I'm now in a position where for the reaaaaaly embedded boards (both the
really old and the last couple generation of QCA MIPS boards) having a
cut down HAL module and driver loaded at runtime is /actually/ beneficial.
This reduces the kernel size down by quite a bit. The MIPS modules look
like this:
adrian@gertrude:~/work/freebsd/head-embedded/src % ls -l ../root/mips_ap/boot/kernel.CARAMBOLA2/ath*ko
-r-xr-xr-x 1 adrian adrian 5076 May 23 23:45 ../root/mips_ap/boot/kernel.CARAMBOLA2/ath_dfs.ko
-r-xr-xr-x 1 adrian adrian 100588 May 23 23:45 ../root/mips_ap/boot/kernel.CARAMBOLA2/ath_hal.ko
-r-xr-xr-x 1 adrian adrian 627324 May 23 23:45 ../root/mips_ap/boot/kernel.CARAMBOLA2/ath_hal_ar9300.ko
-r-xr-xr-x 1 adrian adrian 314588 May 23 23:45 ../root/mips_ap/boot/kernel.CARAMBOLA2/ath_main.ko
-r-xr-xr-x 1 adrian adrian 23472 May 23 23:45 ../root/mips_ap/boot/kernel.CARAMBOLA2/ath_rate.ko
And the x86 versions, like this:
root@gertrude:/home/adrian # ls -l /boot/kernel/ath*ko
-r-xr-xr-x 1 root wheel 36632 May 24 18:32 /boot/kernel/ath_dfs.ko
-r-xr-xr-x 1 root wheel 134440 May 24 18:32 /boot/kernel/ath_hal.ko
-r-xr-xr-x 1 root wheel 82320 May 24 18:32 /boot/kernel/ath_hal_ar5210.ko
-r-xr-xr-x 1 root wheel 104976 May 24 18:32 /boot/kernel/ath_hal_ar5211.ko
-r-xr-xr-x 1 root wheel 236144 May 24 18:32 /boot/kernel/ath_hal_ar5212.ko
-r-xr-xr-x 1 root wheel 336104 May 24 18:32 /boot/kernel/ath_hal_ar5416.ko
-r-xr-xr-x 1 root wheel 598336 May 24 18:32 /boot/kernel/ath_hal_ar9300.ko
-r-xr-xr-x 1 root wheel 406144 May 24 18:32 /boot/kernel/ath_main.ko
-r-xr-xr-x 1 root wheel 55352 May 24 18:32 /boot/kernel/ath_rate.ko
.. so you can see, not building the whole HAL can save quite a bit.
For example, if you don't need AR9300 support, you can actually avoid
wasting half a megabyte of RAM. On embedded routers this is quite a
big deal.
The AR9300 HAL can be later further shrunk because, hilariously,
it indeed supports AH_SUPPORT_<xxx> for optionally adding chipset support.
(I'll chase that down later as it's quite a big savings if you're only
building for a single embedded target.)
So:
* Create a very hackish way to load/unload HAL modules
* Create module metadata for each HAL subtype - ah_osdep_arXXXX.c
* Create module metadata for ath_rate and ath_dfs (bluetooth is
currently just built as part of it)
* .. yes, this means we could actually build multiple rate control
modules and pick one at load time, but I'd rather just glue this
into net80211's rate control code. Oh well, baby steps.
* Main driver is now "ath_main"
* Create an "if_ath" module that does what the ye olde one did -
load PCI glue, main driver, HAL and all child modules.
In this way, if you have "if_ath_load=YES" in /boot/modules.conf
it will load everything the old way and stuff should still work.
* For module autoloading purposes, I actually /did/ fix up
the name of the modules in if_ath_pci and if_ath_ahb.
If you want to selectively load things (eg on ye cheape ARM/MIPS platforms
where RAM is at a premium) you should:
* load ath_hal
* load the chip modules in question
* load ath_rate, ath_dfs
* load ath_main
* load if_ath_pci and/or if_ath_ahb depending upon your particular
bus bind type - this is where probe/attach is done.
TODO:
* AR5312 module and associated pieces - yes, we have the SoC side support
now so the wifi support would be good to "round things out";
* Just nuke AH_SUPPORT_AR5416 for now and always bloat the packet
structures; this'll simplify other things.
* Should add a simple refcnt thing to the HAL RF/chip modules so you
can't unload them whilst you're using them.
* Manpage updates, UPDATING if appropriate, etc.
illumos/illumos-gate@b127fe3c05b127fe3c05https://www.illumos.org/issues/6101
lzc_create(), or more correctly, zfs_ioc_create() does not reject an attempt to
create a filesystem as a child of a volume, instead it proceeds to a crash.
A crash stack obtained on FreeBSD:
page fault while in kernel mode
zap_leaf_lookup()
fzap_lookup()
zap_lookup_norm()
zap_lookup()
zfs_get_zplprop()
zfs_fill_zplprops_impl()
zfs_ioc_create()
zfsdev_ioctl()
devfs_ioctl_f()
kern_ioctl()
sys_ioctl()
This crash happened with a kernel without debugging assertions.
The immediate cause of crash appears to an attempt to interpret a zvol object
as a zap object.
For filesystems:
#define MASTER_NODE_OBJ 1
For zvols:
#define ZVOL_OBJ 1ULL
#define ZVOL_ZAP_OBJ 2ULL
So, I see two problems here:
1. an attempt to create a filesystem under a zvol should be rejected as
early as possible, maybe in zfs_fill_zplprops()
2. maybe zap_lookup / zap_lockdir should reject objects that are not of one
of the zap object types
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@6b036259816b03625981https://www.illumos.org/issues/8026
zfs_throttle_delay and zfs_throttle_resolution became disused since the new
write throttling mechanism was introduced.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 1 week
illumos/illumos-gate@471a88e499471a88e499https://www.illumos.org/issues/5380
A stream created with zfs send -p -I contains properties of all snapshots of a
given dataset as opposed to only properties of snapshots in a given range.
Not only this is suboptimal but the receive code also does not filter
properties by the range. So, properties of earlier snapshots would be updated
even though the snapshots themselves are not in the stream (just their
properties).
Given that modifying the snapshot properties requires a TXG sync and that the
snapshots are updated one by one the described behavior may lead to a sever
performance penalty.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 3 weeks
illumos/illumos-gate@313ae1e182313ae1e182https://www.illumos.org/issues/8027
dsl_pool_dirty_delta() should not wake up waiters when dp->dp_dirty_total ==
zfs_dirty_data_max, because they wait for dp_dirty_total to fall strictly below
the threshold.
It's probably very rare for that condition to occur, but it's better to have
more accurate code.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 1 week
illumos/illumos-gate@94c2d0eb2294c2d0eb22https://www.illumos.org/issues/7968
spa_sync() iterates over all the dirty dnodes and processes each of them by
calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
or removed a lot of files), the single thread of spa_sync() calling
dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied
concurrently in open context (e.g. due to concurrent file creation), the
os_lock will experience lock contention via dnode_setdirty().
The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
use separate threads to process each of the sublists in the multilist.
On the concurrent file creation microbenchmark, the performance improvement
from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/
reward, once the other bottlenecks are addressed, fixing this bug will provide
a medium-large performance gain and require a medium amount of effort to
implement.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@10fbdecb0510fbdecb05https://www.illumos.org/issues/7970
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@b0c42cd470b0c42cd470https://www.illumos.org/issues/7801
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Ported from: 0eef1bde31
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: bzzz77 <bzzz.tomas@gmail.com>
MFC after: 24 days
illumos/illumos-gate@a3905a4592a3905a4592https://www.illumos.org/issues/7869
The issue fixed by this patch is a race condition in the deadlist code.
A thread executing an administrative command that uses
`dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to
protect the access of all its entries that the deadlist contains in an
avl tree.
Sync threads trying to insert a new entry in the deadlist
(through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the
deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our
sentinel value), we close and reopen it. Between these two operations,
it is possible for the `dsl_deadlist_space_range()` thread to dereference
that bpobj which is `NULL` during that window.
Threads should hold the a deadlist's `dl_lock` when they manipulate its
internal data so scenarios like the one above are avoided. In addition,
threads should also hold the bpobj lock whenever they are allocating the
subobj list of a bpobj, and not just when they actually insert the subobj
to the list. This way we can avoid potential memory leaks.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
MFC after: 2 weeks
illumos/illumos-gate@61e255ce7261e255ce72https://www.illumos.org/issues/7793
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 3 weeks
in the PCM feeder mixer. Without this change a value of 32 channels is
treated like zero, due to using a mask of 0x1f, causing a kernel
assert when trying to playback bitperfect 32-channel audio. Also
update the AWK script which is generating the division tables to
handle more than 18 channels. This commit complements r282650.
MFC after: 3 days
illumos/illumos-gate@8b65a70b768b65a70b76https://www.illumos.org/issues/7541
When calling zpool import, zpool does a few ioctls to ZFS.
zpool allocates a buffer in userland and passes it to the kernel so that ZFS
can copy info into it. ZFS will use it to put the nvlist that describes the
pool configuration.
If the allocated buffer is too small, ZFS will return ENOMEM and the call will
have to be redone. This wastes CPU time and slows down the import process. This
happens very often for the ZFS_IOC_POOL_TRYIMPORT call.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
MFC after: 2 weeks
illumos/illumos-gate@1c17160ac51c17160ac5https://www.illumos.org/issues/1300
FreeBSD note: recent FreeBSD was not affected by the issue fixed as the
name cache is completely bypassed when normalization is enabled.
The change is imported for the sake of ZAP infrastructure modifications.
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>
MFC after: 3 weeks
illumos/illumos-gate@4dd77f9e384dd77f9e38https://www.illumos.org/issues/7545
When evicting from the ARC, we manipulate some refcount_t's, e.g. arcs_size.
When using zdb to examine a large amount of data (e.g. zdb -bb on a large pool
with small blocks), the ARC may have a large number of entries. If reference
tracking is enabled, there will be ~1 reference for each block in the ARC. When
evicting, we decrement the refcount and have to search all the references to
find the one that we are removing, which is very slow.
Since zdb is typically used to find problems with the on-disk format, and not
with the code it is running, we should disable reference tracking in zdb.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 2 weeks
considering cache line hits and misses. Put the lock and hash list
glue into the first cache line, put inp_refcount inp_flags inp_socket
into the second cache line.
o On allocation zero out entire structure except the lock and list entries,
including inp_route inp_lle inp_gencnt. When inp_route and inp_lle were
introduced, they were added below inp_zero_size, resulting on not being
cleared after free/alloc. This definitely was a source of bugs with route
caching. Could be that r315956 has just fixed one of them.
The inp_gencnt is reinitialized on every alloc, so it is safe to clear it.
This has been proved to improve TCP performance at Netflix.
Obtained from: rrs
Differential Revision: D10686
- mention COMPAT_FREEBSD11 earlier so that the steps are in chronological
order
- suggest removing /usr/obj before build to ensure there are no stale
objects
Reviewed by: allanjude, kib
Sponsored by: The FreeBSD Foundation
If /etc/bootparams contains a line with an excessively long pathname, and a
client asks for that path, then bootparamd will overflow a buffer and crash
while parsing that line. This is not remotely exploitable since it requires
a malformed /etc/bootparams file.
Reported by: Coverity
CID: 1305954
MFC after: 1 week
Sponsored by: Spectra Logic Corp
Also add __FBSDID.
Reviewed by: grehan
This file lacks a license(!) so for this change the following declaration
applies:
To the greatest extent permitted by, but not in contravention of,
applicable law, Affirmer hereby overtly, fully, permanently, irrevocably
and unconditionally waives, abandons, and surrenders all of Affirmer's
Copyright and Related Rights and associated claims and causes of action,
whether now known or unknown (including existing as well as future claims
and causes of action).
The Termios headers <termios.h> and <sys/_termios.h> used sometimes
_POSIX_SOURCE directly to determine if a thing should be exposed to
the user. This circumvented the feature mechanisms of <sys/cdefs.h>.
Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after: 2 weeks
This brings the AHB support in line with the PCI support - now other "things"
can wrap up the calibration / board data into a firmware blob and have them
probe/attach after the system has finished booting.
Note that this change requires /all/ of the AHB using kernel configurations
to change - so until I drop those changes in, this breaks AHB.
Fear not, I'll do that soon.
TODO:
* the above stuff.
Tested:
* AR9331, carambola 2, loading if_ath / wlan as modules at run time
bhyve was recently sandboxed with capsicum, and needs to be able to
control the CPU sets of its vcpu threads
Reviewed by: emaste, oshogbo, rwatson
MFC after: 2 weeks
Sponsored by: ScaleEngine Inc.
Differential Revision: https://reviews.freebsd.org/D10170
The latest firmware has a number of link related fixes, support for a
new custom card, and the fix for a bug that affected rate limiting on
FreeBSD.
Obtained from: Chelsio Communications
MFC after: 1 week
Sponsored by: Chelsio Communications
An extra copy of the system call gate was added to the default LDT back
in 1996 (r18513 / r18514). However, the ability to run BSD/OS 2.1
i386 binaries under FreeBSD's native ABI is most likely no longer
needed.
Discussed with: kib
In r315866, we introduced a direct read of the 8-bit sromrev field from the
memory mapped SPROM/OTP device. On OTP devices that require 16-bit access
alignment, this read fails, preventing identification of the SPROM layout.
So, let's perform an aligned read of the combined 16-bit sromrev/crc field
instead.
Approved by: adrian (mentor, implicit)
The upgrade process requires COMPAT_FREEBSD11 to support the combination
of "old" userland and "new" kernel that exists after "make kernel" and
reboot. Mention this explicitly for those using custom kernel configs.
Once the "new" world is installed the COMPAT_FREEBSD11 could be removed
again, but that does not seem necessary to mention in UPDATING.
Reported by: kib
Sponsored by: The FreeBSD Foundation
The existing upgrade process documented in UPDATING is both necessary
and sufficient for upgrading across the ino64 change. However, the
shortcut of installing both kernel + world before a single reboot has
been possible for quite some time, and several developers and users
were surprised by fallout from ino64. Add an explicit entry pointing
out that the full process must be followed.
Reviewed by: allanjude, gjb, vangyzen
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10877
The original blacklist library supported two notification types:
- failed auth attempt, which incremented the failed login count
by one for the remote address
- successful auth attempt, which reset the failed login count
to zero for that remote address
When the failed login count reached the limit in the configuration
file, the remote address would be blocked by a packet filter.
This patch implements a new notification type, "abusive behavior",
and accepts, but does not act on an additional type, "bad username".
It is envisioned that a system administrator will configure a small
list of "known bad usernames" that should be blocked immediately.
Reviewed by: emaste
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10604