- the mutex aac_io_lock protects the main codepaths which handle queues and
hardware registers. Only one acquire/release is done in the top-half and
the taskqueue. This mutex also applies to the userland command path and
CAM data path.
- Move the taskqueue to the new Giant-free version.
- Register the disk device with DISKFLAG_NOGIANT so the top-half processing
runs without Giant.
- Move the dynamic command allocator to the worker thread to avoid locking
issues with bus_dmamem_alloc().
This gives about 20% improvement in most of my benchmarks.
turns runs its tasks free of Giant too. It is intended that as drivers
become locked down, they will move out of the old, Giant-bound taskqueue
and into this new one. The old taskqueue has been renamed to
taskqueue_swi_giant, and the new one keeps the name taskqueue_swi.
an if clause was true. Break the two clauses out into seperate statements
since they require different actions.
Reported/Tested by: jake
Spotted by: jhb
delta 1.371) we must ensure that we do not get ourselves into a
recursive trap endlessly trying to clean up after ourselves.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
from the filesystem size field to the filesystem maximum blocksize
field. The problem is that older versions of growfs updated only the
new size field and not the old size field. This resulted in the old
(smaller) size field being copied up to the new size field which
caused the filesystem to appear to fsck to be badly trashed.
This also adds a sanity check to ensure that the superblock is not
being updated when the filesystem is mounted read-only. Obviously
such an update should never happen.
Reported by: Nate Lawson <nate@root.org>
Sponsored by: DARPA & NAI Labs.
o Always check for null when dereferencing the filename component.
o Implement a try-and-backoff method for allocating memory to
dump stats to avoid a spin-lock -> sleep-lock mutex lock order
panic with WITNESS.
Approved by: des, markm (mentor)
Not objected: jhb
for testing and setting the current and alternate address spaces.
- Changed PTDpde and APTDpde to arrays to support multiple page directory
pages.
ponsored by: DARPA, Network Associates Laboratories
tcpcb is NULL, but also its connected inpcb, since we now allow
elements of a TCP connection to hang around after other state, such
as the socket, has been recycled.
Tested by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.
Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
- Get rid of the useless atop() / pmap_phys_address() detour. The
device mmap handlers must now give back the physical address
without atop()'ing it.
- Don't borrow the physical address of the mapping in the returned
int. Now we properly pass a vm_offset_t * and expect it to be
filled by the mmap handler when the mapping was successful. The
mmap handler must now return 0 when successful, any other value
is considered as an error. Previously, returning -1 was the only
way to fail. This change thus accidentally fixes some devices
which were bogusly returning errno constants which would have been
considered as addresses by the device pager.
- Garbage collect the poorly named pmap_phys_address() now that it's
no longer used.
- Convert all the d_mmap_t consumers to the new API.
I'm still not sure wheter we need a __FreeBSD_version bump for this,
since and we didn't guarantee API/ABI stability until 5.1-RELEASE.
Discussed with: alc, phk, jake
Reviewed by: peter
Compile-tested on: LINT (i386), GENERIC (alpha and sparc64)
Runtime-tested on: i386
time and there's no indication that it will improve anytime soon.
By removing support for SimOS it is possible to build LINT on
Alpha, which is considered more important at the moment.
Not objected to on: alpha@
- Changed VM_MAXUSER_ADDRESS to be defined in terms of PTDPTDI. In order for
assumptions about the recursive page table map to work it must be the base
of the recursive map. Any pte offset that's not NPTEPG will break these
assumptions.
Sponsored by: DARPA, Network Associates Laboratories
flag that can be marked on each symmetric op
o eliminate hw.ubsec.maxbatch and hw.ubsec.maxaggr since they are not
needed anymore
o change ubsec_feed to return void instead of int since zero is always
returned and noone ever looked at the return value
instead of allocating another page of kva and mapping it in again. This was
likely an oversight in revision 1.174 (cut and paste from pmap_pinit).
Discussed with: peter, tegge
Sponsored by: DARPA, Network Associates Laboratories
Reading the PCI config space with the wrong (larger) size is not
a problem in this case, but writing can be as it clobbers unrelated
registers. In this case the clobbering is for reserved fields, which
too is mostly harmless... for now. Hence, this change is mostly
preventive in nature.
page directory.
- Use these instead of the magic constants 1 or PAGE_SIZE where appropriate.
There are still numerous assumptions that the page directory is exactly
1 page.
Sponsored by: DARPA, Network Associates Laboratories
without waiting, since they are called from a system-call context only.
This appears to fix all sorts of problems with open("/dev/dsp", O_WRONLY)
randomly returning ENXIO.
Found by: cognet
Security improvements:
- Increase the size of each syncookie secret from 32 to 128 bits
in order to make brute force attacks on the secrets much more
difficult.
- Always return the lowest order dword from the MD5 hash; this
allows us to expose 2 more bits of the cookie and makes ACK
floods which seek to guess the cookie value more difficult.
Performance improvements:
- Increase the lifetime of each syncookie from 4 seconds to 16
seconds. This increases the usefulness of syncookies during
an attack.
- From Yahoo!: Reduce the number of calls to MD5Update; this
results in a ~17% increase in cookie generation time here.
Reviewed by: hsu, jayanth, jlemon, nectar
MFC After: 15 seconds
in massive locking issues on diskless systems.
It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.
type M_STRING, now defined in malloc.h. Useful when string parsing
must occur using the kernel strsep() and we want to avoid toasting
the source string.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
should be done in crypto_done rather than in the callback thread
o use this flag to mark operations from /dev/crypto since the callback
routine just does a wakeup; this eliminates the last unneeded ctx switch
o change CRYPTO_F_NODELAY to CRYPTO_F_BATCH with an inverted meaning
so "0" becomes the default/desired setting (needed for user-mode
compatibility with openbsd)
o change crypto_dispatch to honor CRYPTO_F_BATCH instead of always
dispatching immediately
o remove uses of CRYPTO_F_NODELAY
o define COP_F_BATCH for ops submitted through /dev/crypto and pass
this on to the op that is submitted
Similar changes and more eventually coming for asymmetric ops.
MFC if re gives approval.
branch targets that are too far apart for the BRADDR relocation.
This is caused by the branch prediction optimizationi in the atomic
inlines here, because they jump across sections.
The workaround is to suppress jumping to a different section when
compiling LINT. To generate correct code in that case, the section
directives are replaced by a branch and a label to deal with the
fall-through case. Reasonably good C compilers will optimize this
away anyway, so the end result isn't really that bad.
packets coming out of a GIF tunnel are re-processed by ipfw, et. al.
By default they are not reprocessed. With the option they are.
This reverts 1.214. Prior to that change packets were not re-processed.
After they were which caused problems because packets do not have
distinguishing characteristics (like a special network if) that allows
them to be filtered specially.
This is really a stopgap measure designed for immediate MFC so that
4.8 has consistent handling to what was in 4.7.
PR: 48159
Reviewed by: Guido van Rooij <guido@gvr.org>
MFC after: 1 day
o Make DXS3 the primary playback channel. It may be the only
universally supported channel with the assorted revisions of this
chipset.
o Add sysctl and handler for enabling s/pdif output from DXS3.
as opposed to one after the other. This is faster in both -CURRENT
and -STABLE. Additionally, there is less code duplication for
error-checking.
One thing to note is that this code seems to return(1) when no buffers
are available; perhaps ENOBUFS should be the correct return value?
Partially submitted & tested by: Hiten Pandya <hiten@unixdaemons.com>
MFC after: 1 week
and enable it by default, with a limit of 16.
At the same time, tweak maxfragpackets downward so that in the worst
possible case, IP reassembly can use only 1/2 of all mbuf clusters.
MFC after: 3 days
Reviewed by: hsu
Liked by: bmah
a snapshot. As part of taking a snapshot of a filesystem, the kernel
builds up a list of the filesystem metadata (such as the cylinder
group bitmaps) that are contained in the snapshot. When doing a
copy-on-write check, the list is first consulted. If the block being
written is found on the list, then the full snapshot lookup can be
avoided. Besides providing an important performance speedup this
check also avoids a potential deadlock between the code creating
the snapshot and the bufdaemon trying to cleanup snapshot related
buffers. This fix creates a temporary list containing the key
metadata blocks that can cause the deadlock. This temporary list
is used between the time that the snapshot is first enabled and the
time that the fully complete list is built.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
is being taken from panicing with either "freeing free block" or
"freeing free inode". The problem arises when the snapshot code
is scanning the filesystem looking for inodes with a reference
count of zero (e.g., unlinked but still open) so that it can
expunge them from its view. If it encounters a reclaimed vnode
and has to restart its scan, then it will panic if it encounters
and tries to free an inode that it has already processed. The fix
is to check each candidate inode to see if it has already been
processed before trying to delete it from the snapshot image.
Sponsored by: DARPA & NAI Labs.
that they convert to 64-bit values before shifting rather than
afterwards. Once fixed, they can be used rather than inline expanded.
Sponsored by: DARPA & NAI Labs.