mddestroy() only if the file is from a non-MPSAFE VFS.
- No longer unconditionally hold Giant in the md kthread for vnode-backed
kthreads.
- Improve the handling of the thread exit race when destroying an md
device.
a problem with listing large number of md(4) devices. Either 'list' or
'query' mode uses XML.
Additionally, new functionality was introduced. It's possible to pass
multiple devices to -u:
# ./mdconfig -l -u md0,md1
Approved by: cognet (mentor)
- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some
memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
Remove md_mtx.
Remove GIANT from the mdctl device driver and avoid DROP_GIANT,
PICKUP_GIANT and geom events since we can call into GEOM directly
now.
Pick up Giant around vn_close().
Apply an exclusive sx around mdctls ioctl and preloading to protect
lists etc..
Don't initialize our lock (md_mtx or md_sx) from a
SYSINIT when there is a perfectly good pair of _fini/_init
functions to do it from.
Prune any final fractional sector from the mediasize to
keep GEOM happy.
Cleanups:
Unify MDIOVERSION check in (x)mdctlioctl()
Add pointer to start() routine to softc to eliminate a switch{}
Inline guts of mddetach().
Always pass error pointer to mdnew(), simplify implementation.
- Always check mdnew() return value, as even in !autounit case
kthread_create() can fail.
Those two changes fix serval panics provked by simple stress test.
Tested by: Kris The BugMagnet
MFC after: 3 days
by md(4). Before this change, it was possible to by-pass these flags
by creating memory disks which used a file as a backing store and
writing to the device.
This was discussed by the security team, and although this is problematic,
it was decided that it was not critical as we never guarantee that root will
be restricted.
This change implements the following behavior changes:
-If the user specifies the readonly flag, unset write operations before
opening the file. If the FWRITE mask is unset, the device will be
created with the MD_READONLY mask set. (readonly)
-Add a check in g_md_access which checks to see if the MD_READONLY mask
is set, if so return EROFS
-Do not gracefully downgrade access modes without telling the user. Instead
make the user specify their intentions for the device (assuming the file is
read only). This seems like the more correct way to handle things.
This is a RELENG_6 candidate.
PR: kern/84635
Reviewed by: phk
memory disk is larger than the number of available sf_bufs, this improves
performance on SMPs by eliminating interprocessor TLB shootdowns. For
example, with 6656 sf_bufs, the default on my test machine, and a 256MB
swap-backed memory disk, I see the command
"dd if=/dev/md0 of=/dev/null bs=64k" achieve ~489MB/sec with the default,
shared mappings, and ~587MB/sec with CPU private mappings.
in mddestroy() to properly free already allocated memory.
This fixes a panic when we want to create too big memory backed device
with preallocate memory (-o reserve).
- Remove redundant { }.
MFC after: 1 week
show file name for 'mdconfig -l -u <x>' command.
This allows to preserve API/ABI compatibility with version 0 (that's why
I changed version number back to 0) and will allow to merge this change
to RELENG_5.
MFC after: 5 days
md(8). The former is generally not going to fail, but the latter can
fail when the underlying swap device returns an error.
There are still plenty of other places where vm_pager_get_pages() failing
will lead directly to crashes, so it's a good idea to put your swap on
RAID if you care enough to put any of your disks on RAID....
After this change it should be possible to use very big md(4) devices.
- Clean up and simplify the code a bit.
- Use humanize_number(3) to print size of md(4) devices.
- Add 't' suffix which stands for terabyte.
- Make '-S' to really work with all types of devices.
- Other minor changes.
it is only used in one function. While doing so, change its type to
vm_ooffset_t.
We are still limited for swap-backed devices to 16TB on 32-bit architectures
where PAGE_SIZE is 4096 bytes.
before returning. Device nodes are created via the "taste" mechanism,
so this is necessary in order to make sure that devfs entries are
created before mdconfig(8) returns.
This may be a MFC candidate for 5.3.
Suggested by: phk
for unknown events.
A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
completely understand], md_takeroot() runs before md_preloaded(),
rendering both useless.
As a fix, move the body (effectively one line!) of md_takeroot()
into md_preloaded(), and get rid of the stuff that has become useless.
Bug and fix reported 10 days ago on -current, no reply.
mappings required by mdstart_swap(). On i386, if the ephemeral mapping
is already in the sf_buf mapping cache, a swap-backed md performs
similarly to a malloc-backed md. Even if the ephemeral mapping is not
cached, this implementation is still faster. On 64-bit platforms, this
change has the effect of using the direct virtual-to-physical mapping,
avoiding ephemeral mapping overheads, such as TLB shootdowns on SMPs.
On a 2.4GHz, 400MHz FSB P4 Xeon configured with 64K sf_bufs and
"mdmfs -S -o async -s 128m md /mnt"
before:
dd if=/dev/md0 of=/dev/null bs=64k
134217728 bytes transferred in 0.430923 secs (311465697 bytes/sec)
after with cold sf_buf cache:
dd if=/dev/md0 of=/dev/null bs=64k
134217728 bytes transferred in 0.367948 secs (364773576 bytes/sec)
after with warm sf_buf cache:
dd if=/dev/md0 of=/dev/null bs=64k
134217728 bytes transferred in 0.252826 secs (530870010 bytes/sec)
malloc-backed md:
dd if=/dev/md0 of=/dev/null bs=64k
134217728 bytes transferred in 0.253126 secs (530240978 bytes/sec)
On vnode backed md(4) devices over a certain, currently undetermined
size relative to the buffer cache our "lemming-syncer" can provoke
a buffer starvation which puts the md thread to sleep on wdrain.
This generally tends to grind the entire system to a stop because the
event that is supposed to wake up the thread will not happen until a fair
bit of the piled up I/O requests in the system finish, and since a lot
of those are on a md(4) vnode backed device which is currently waiting
on wdrain until a fair amount of the piled up ... you get the picture.
The cure is to issue all VOP_WRITES on the vnode backing the device
with IO_SYNC.
In addition to more closely emulating a real disk device with a
non-lying write-cache, this makes the writes exempt from rate-limited
(there to avoid starving the buffer cache) and consequently prevents
the deadlock.
Unfortunately performance takes a hit.
Add "async" option to give people who know what they are doing the
old behaviour.