track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.
Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
- Get rid of the useless atop() / pmap_phys_address() detour. The
device mmap handlers must now give back the physical address
without atop()'ing it.
- Don't borrow the physical address of the mapping in the returned
int. Now we properly pass a vm_offset_t * and expect it to be
filled by the mmap handler when the mapping was successful. The
mmap handler must now return 0 when successful, any other value
is considered as an error. Previously, returning -1 was the only
way to fail. This change thus accidentally fixes some devices
which were bogusly returning errno constants which would have been
considered as addresses by the device pager.
- Garbage collect the poorly named pmap_phys_address() now that it's
no longer used.
- Convert all the d_mmap_t consumers to the new API.
I'm still not sure wheter we need a __FreeBSD_version bump for this,
since and we didn't guarantee API/ABI stability until 5.1-RELEASE.
Discussed with: alc, phk, jake
Reviewed by: peter
Compile-tested on: LINT (i386), GENERIC (alpha and sparc64)
Runtime-tested on: i386
in massive locking issues on diskless systems.
It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.
Retire the "d_dump_t" and use the "dumper_t" type instead.
Dumper_t takes a void * as first arg which is more general than the
dev_t taken by d_dump_t. (Remember: we could have net-dumpers if
somebody wrote us one!)
Define the convention for GEOM controlled disk devices to be that the
first argument to the dumper function is the struct disk pointer.
Change device drivers accordingly.
dev_t to the method functions.
The dev_t can still be found at struct consdev *->cn_dev.
Add a void *cn_arg element to struct consdev which the drivers can use
for retrieving their softc.
In devsw() return dead_cdevsw instead of NULL in case the dev_t does not
have a si_devsw.
This may improve our survival chances with devices which go away unexpectedly.
compile-time constants). That is, a "bucket" now is not necessarily
a page-worth of mbufs or clusters, but it is MBUF_BUCK_SZ, CLUS_BUCK_SZ
worth of mbufs, clusters.
o Rename {mbuf,clust}_limit to {mbuf,clust}_hiwm and introduce
{mbuf,clust}_lowm, which currently has no effect but will be used
to set the low watermarks.
o Fix netstat so that it can deal with the differently-sized buckets
and teach it about the low watermarks too.
o Make sure the per-cpu stats for an absent CPU has mb_active set to 0,
explicitly.
o Get rid of the allocate refcounts from mbuf map mess. Instead,
just malloc() the refcounts in one shot from mbuf_init()
o Clean up / update comments in subr_mbuf.c
used to share resource limits between rfork threads, but never was.
Removing it makes resource limit locking much simpler -- only the current
process can change the contents of the structure that p_limit points to.
reference counter array for mbuf clusters. I don't know
how this got past early testing nor how it survived so long
without getting caught. If anyone was seeing really really
bizarre memory corruption in a few mbufs this would be why.
#if'ed out for a while. Complete the deed and tidy up some other bits.
We need to be able to call this stuff from outer edges of interrupt
handlers for devices that have the ISR bits in pci config space. Making
the bios code mpsafe was just too hairy. We had also stubbed it out some
time ago due to there simply being too much brokenness in too many systems.
This adds a leaf lock so that it is safe to use pci_read_config() and
pci_write_config() from interrupt handlers. We still will use pcibios
to do interrupt routing if there is no acpi.. [yes, I tested this]
Briefly glanced at by: imp
sched_lock around accesses to p_stats->p_timer[] to avoid a potential
race with hardclock. getitimer(), setitimer() and the realitexpire()
callout are now Giant-free.
add a signal to a mailbox's pending set.
- Add a new function, thread_signal_upcall(), this causes the current thread
to upcall so that we can deliver pending signals.
Reviewed by: mini
I was in two minds as to where to put them in the first case..
I should have listenned to the other mind.
Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@
queue lock already held.
- In getblk() and flushbufqueues() use bremfreel() while we still have the
buf queue lock held to keep the lists consistent.
- Add LK_NOWAIT to two cases where we're essentially asserting that the bufs
are not locked while acquiring the locks. This will make sure that we get
the appropriate panic() and not another one for sleeping with a lock held.
- Mark the process leader as having an advisory lock
- Check if process leader is marked as having advisory lock when
closing file
- Check that file is still open after lock has been obtained
- Don't allow file descriptor table sharing between processes
with different leaders
PR: 10265
Reviewed by: alfred
freebsd4_sigaction() and osigaction() instead of around the whole
body of those functions. They now no longer hold Giant around calls
to copyin() and copyout(), and it is slightly more obvious what
Giant is protecting.
barrier between free'ing filedesc structures. Basically if you want to
access another process's filedesc, you want to hold this mutex over the
entire operation.
opposed to returning the top of the old chain when there was one and
the top of the newly allocated chain if there was no old chain.
Actually, it should be noted that prior to this fix, although the
comment above m_getm() advertised that m_getm() would return the
top of the old chain (if an old chain was being passed in) it
actually [wrongly] was returning the tail mbuf in the old chain
instead. This is a bug but since the one use of m_getm() in
the tree luckily did not depend on the behavior, it happened
to work out without notice.
Harti Brandt pointed out that the advertised behavior was actually
not the real behavior and so this change makes m_getm() ALWAYS
return the newly allocated chain (and fixes the comment). This
is less confusing and is the best course of action as then the
caller is always able to have both a reference to the top of
the original chain (because it's passing it in in the call) and
a reference to the newly attached chain. Although the API is
slightly modified, I don't think that any third-party code uses
m_getm() and if it does, it surely can't be working properly
because the old behavior was bogus.
API bug pointed out by: Harti Brandt <brandt@fokus.fraunhofer.de>
To fix scsi, don't wait for ithreads if we're dumping, it makes the
debugger sad.
To fix ata, use what appears to be a polling method if we're dumping,
I stole this from tmm but added code to ensure that this change is
only in effect while dumping.
Tested by: des
The locking here needs to be revisited, but this ought to get rid of the
LOR messages that people are complaining about for now. I imagine either
I or someone else interested with smp will eventually clear this up.
- Use the ratio of kg_runtime / kg_slptime to determine our dynamic priority.
- Scale kg_runtime and kg_slptime back when the sum of the two exceeds
SCHED_SLP_RUN_MAX. This allows us to slowly forget old behavior.
- Scale back the runtime and slptime in fork so that the new process has the
same ratio but much less accumulated time. This causes new behavior to be
noticed more quickly.
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h
buf lists, synchronization variables, and atomic ops for the counters.
This change does not remove giant from any code although some pushdown
may be possible.
- In vfs_bio_awrite() don't access buf fields without the buf lock.
Change the si_name of dev_t's to be a char * and put a private buffer for
holding the name at then end of the struct.
Initialize si_name to point to the private buffer.
Put a KASSERT in geom_disk to prevent overrun on the fake dev_t we still
have to generate for the disk_drivers.
prevent the compiler from optimizing assignments into byte-copy
operations which might make access to the individual fields non-atomic.
Use the individual fields throughout, and don't bother locking them with
Giant: it is no longer needed.
Inspired by: tjr
statclock based on profhz when profiling is enabled MD, since most platforms
don't use this anyway. This removes the need for statclock_process, whose
only purpose was to subdivide profhz, and gets the profiling clock running
outside of sched_lock on platforms that implement suswintr.
Also changed the interface for starting and stopping the profiling clock to
do just that, instead of changing the rate of statclock, since they can now
be separate.
Reviewed by: jhb, tmm
Tested on: i386, sparc64