the device statistics structures into userland instead of using sysctl.
Introduce new devstat_new_entry() function which allocates the devstat
structure an calls devstat_add_entry() on it.
- On receive, vm_map_lookup() needs to trigger the creation of a shadow
object. To make that happen, call vm_map_lookup() with PROT_WRITE
instead of PROT_READ in vm_pgmoveco().
- On send, a shadow object will be created by the vm_map_lookup() in
vm_fault(), but vm_page_cowfault() will delete the original page from
the backing object rather than simply letting the legacy COW mechanism
take over. In other words, the new page should be added to the shadow
object rather than replacing the old page in the backing object. (i.e.
vm_page_cowfault() should not be called in this case.) We accomplish
this by making sure fs.object == fs.first_object before calling
vm_page_cowfault() in vm_fault().
Submitted by: gallatin, alc
Tested by: ken
witness. Sleepable locks such as sx locks always come before all mutexes
including Giant. However, the static lock order list placed Giant before
the proctree and allproc sx locks. This resulted in witness creating a
cycle in its lock order "tree" (real trees don't have cycles) leading to
infinite recursion and eventually a double fault. To fix, put Giant after
sx locks in the lock order list.
check, mac_check_sysarch_ioperm(), permitting MAC security policy
modules to control access to these interfaces. Currently, they
protect access to IOPL on i386, and setting HAE on Alpha.
Additional checks might be required on other platforms to prevent
bypass of kernel security protections by unauthorized processes.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
modules to authorize disabling of swap against a particular vnode.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
before the MAC check so that we pass the flags field into the MAC
check properly initialized. This didn't affect any current MAC
modules since they didn't care what the flags argument was (as
they were primarily interested in the fact that it was a meta-data
write, not the contents of the write), but would be relevant to
future modules relying on that field.
Submitted by: Mike Halderman <mrh@spawar.navy.mil>
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.
Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs
- Use gbincore() and not incore() so that we can drop the vnode interlock
as we acquire the buflock.
- Use GB_LOCK_NOWAIT when getting bufs for read ahead clusters so that we
don't block on locked bufs.
- Convert a while loop to a howmany() that will most likely be faster on
modern processors. There is another while loop divide that was left
near by because it is operating on a 64bit int and is most likely faster.
- Cleanup the cluster_read() code a little to get rid of a goto and make
the logic clearer.
Tested on: x86, alpha
Tested by: Steve Kargl <sgk@troutmask.apl.washington.edu>
Reviewd by: arch
already own. The mtx_trylock() will fail however. Enhance the comment
at the top of the try lock function to explain this.
Requested by: jlemon and his evil netisr locking
- Add a comment about special lock order rules and Giant near the top of
subr_witness.c. Specifically, this documents and explains the real lock
order relationship between Giant and sleepable locks (i.e. lockmgr locks
and sx locks). Basically, Giant can be safely acquired either before or
after sleepable locks and the case of Giant before a sleepable lock is
exempted as a special case.
- Add a new static function 'witness_list_lock()' that displays a single
line of information about a struct lock_instance. This is used to
make the output of witness messages more consistent and reduce some code
duplication.
- Fixup a few comments in witness_lock().
- Properly handle the Giant-before-sleepable-lock lock order exception in
a more general fashion and remove the no longer needed LI_SLEPT flag.
- Break up the last condition before assuming a reversal a bit to try
and make the logic less confusing in witness_lock().
- Axe WITNESS_SLEEP() now that LI_SLEPT is no longer needed and replace it
with a more general WITNESS_WARN() macro/function combination.
WITNESS_WARN() allows you to output a customized message out to the
console along with a list of held locks. It will optionally drop into
the debugger as well. You can exempt a single lock from the check by
passing it in as the second argument. You can also use flags to specify
if Giant should be exempt from the check, if all sleepable locks should
be exempt from the check, and if witness should panic if any non-exempt
locks are found.
- Make the witness_list() function static. Other areas of the kernel
should use the new WITNESS_WARN() instead.
- Declare some local variables at the top of the function instead of in a
nested block.
- Use mtx_owned() instead of masking off bits from mtx_lock manually.
- Read the value of mtx_lock into 'v' as a separate line rather than inside
an if statement for clarity. This code is hairy enough as it is.
owned. Previously the KASSERT would only trigger if we successfully
acquired a lock that we already held. However, _obtain_lock() fails to
acquire locks that we already hold, so the KASSERT was never checked in
the case it was supposed to fail.
interactivity of a kseg and assigns it a value of 0 through 100.
- Use sched_interact_score() to determine the dynamic priority.
- Define SCHED_CURR() in terms of sched_interact_score().
- Adjust the maximum slice back down to 100ms.
- Remove redundant clearing of ke_runq in sched_wakeup()
- Clean up #defines and comment them.
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.
Reviwed by: arch
Not objected to by: mckusick
Remove extraneous uses of vop_null, instead defering to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.
branches:
Initialize struct cdevsw using C99 sparse initializtion and remove
all initializations to default values.
This patch is automatically generated and has been tested by compiling
LINT with all the fields in struct cdevsw in reverse order on alpha,
sparc64 and i386.
Approved by: re(scottl)
calculations. Keep this changes local to the function so the tick count
is in its natural form otherwise. Previously 1000 was added each time
a tick fired and we divided by 1000 when it was reported. This is done
to reduce rounding errors.
To do this, initialize the d_maj member of the cdevsw to MAJOR_AUTO.
When the cdevsw is first passed to make_dev() a free major number
will be assigned.
Until we have a bit more experience with this a printf will announce
this fact.
Major numbers are not reclaimed, so loading/unloading the same
device driver which uses MAJOR_AUTO will eventually deplete the
pool of free major numbers and the system will panic when it can
not allocate one. Still undecided who to invonvenience with the
solution to this.
td_wmesg field in the thread structure points to the description string of
the condition variable or mutex. If the condvar or the mutex had been
initialized from a loadable module that was unloaded in the meantime,
td_wmesg may now point to invalid memory. Retrieving the process table now
may panic the kernel (or access junk). Setting the td_wmesg field to NULL
after unblocking on the condvar/mutex prevents this panic.
PR: kern/47408
Approved by: jake (mentor)
turns runs its tasks free of Giant too. It is intended that as drivers
become locked down, they will move out of the old, Giant-bound taskqueue
and into this new one. The old taskqueue has been renamed to
taskqueue_swi_giant, and the new one keeps the name taskqueue_swi.
delta 1.371) we must ensure that we do not get ourselves into a
recursive trap endlessly trying to clean up after ourselves.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
o Always check for null when dereferencing the filename component.
o Implement a try-and-backoff method for allocating memory to
dump stats to avoid a spin-lock -> sleep-lock mutex lock order
panic with WITNESS.
Approved by: des, markm (mentor)
Not objected: jhb
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.
Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
- Get rid of the useless atop() / pmap_phys_address() detour. The
device mmap handlers must now give back the physical address
without atop()'ing it.
- Don't borrow the physical address of the mapping in the returned
int. Now we properly pass a vm_offset_t * and expect it to be
filled by the mmap handler when the mapping was successful. The
mmap handler must now return 0 when successful, any other value
is considered as an error. Previously, returning -1 was the only
way to fail. This change thus accidentally fixes some devices
which were bogusly returning errno constants which would have been
considered as addresses by the device pager.
- Garbage collect the poorly named pmap_phys_address() now that it's
no longer used.
- Convert all the d_mmap_t consumers to the new API.
I'm still not sure wheter we need a __FreeBSD_version bump for this,
since and we didn't guarantee API/ABI stability until 5.1-RELEASE.
Discussed with: alc, phk, jake
Reviewed by: peter
Compile-tested on: LINT (i386), GENERIC (alpha and sparc64)
Runtime-tested on: i386
in massive locking issues on diskless systems.
It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.
Retire the "d_dump_t" and use the "dumper_t" type instead.
Dumper_t takes a void * as first arg which is more general than the
dev_t taken by d_dump_t. (Remember: we could have net-dumpers if
somebody wrote us one!)
Define the convention for GEOM controlled disk devices to be that the
first argument to the dumper function is the struct disk pointer.
Change device drivers accordingly.
dev_t to the method functions.
The dev_t can still be found at struct consdev *->cn_dev.
Add a void *cn_arg element to struct consdev which the drivers can use
for retrieving their softc.
In devsw() return dead_cdevsw instead of NULL in case the dev_t does not
have a si_devsw.
This may improve our survival chances with devices which go away unexpectedly.
compile-time constants). That is, a "bucket" now is not necessarily
a page-worth of mbufs or clusters, but it is MBUF_BUCK_SZ, CLUS_BUCK_SZ
worth of mbufs, clusters.
o Rename {mbuf,clust}_limit to {mbuf,clust}_hiwm and introduce
{mbuf,clust}_lowm, which currently has no effect but will be used
to set the low watermarks.
o Fix netstat so that it can deal with the differently-sized buckets
and teach it about the low watermarks too.
o Make sure the per-cpu stats for an absent CPU has mb_active set to 0,
explicitly.
o Get rid of the allocate refcounts from mbuf map mess. Instead,
just malloc() the refcounts in one shot from mbuf_init()
o Clean up / update comments in subr_mbuf.c
used to share resource limits between rfork threads, but never was.
Removing it makes resource limit locking much simpler -- only the current
process can change the contents of the structure that p_limit points to.
reference counter array for mbuf clusters. I don't know
how this got past early testing nor how it survived so long
without getting caught. If anyone was seeing really really
bizarre memory corruption in a few mbufs this would be why.
#if'ed out for a while. Complete the deed and tidy up some other bits.
We need to be able to call this stuff from outer edges of interrupt
handlers for devices that have the ISR bits in pci config space. Making
the bios code mpsafe was just too hairy. We had also stubbed it out some
time ago due to there simply being too much brokenness in too many systems.
This adds a leaf lock so that it is safe to use pci_read_config() and
pci_write_config() from interrupt handlers. We still will use pcibios
to do interrupt routing if there is no acpi.. [yes, I tested this]
Briefly glanced at by: imp
sched_lock around accesses to p_stats->p_timer[] to avoid a potential
race with hardclock. getitimer(), setitimer() and the realitexpire()
callout are now Giant-free.
add a signal to a mailbox's pending set.
- Add a new function, thread_signal_upcall(), this causes the current thread
to upcall so that we can deliver pending signals.
Reviewed by: mini
I was in two minds as to where to put them in the first case..
I should have listenned to the other mind.
Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@
queue lock already held.
- In getblk() and flushbufqueues() use bremfreel() while we still have the
buf queue lock held to keep the lists consistent.
- Add LK_NOWAIT to two cases where we're essentially asserting that the bufs
are not locked while acquiring the locks. This will make sure that we get
the appropriate panic() and not another one for sleeping with a lock held.
- Mark the process leader as having an advisory lock
- Check if process leader is marked as having advisory lock when
closing file
- Check that file is still open after lock has been obtained
- Don't allow file descriptor table sharing between processes
with different leaders
PR: 10265
Reviewed by: alfred
freebsd4_sigaction() and osigaction() instead of around the whole
body of those functions. They now no longer hold Giant around calls
to copyin() and copyout(), and it is slightly more obvious what
Giant is protecting.
barrier between free'ing filedesc structures. Basically if you want to
access another process's filedesc, you want to hold this mutex over the
entire operation.
opposed to returning the top of the old chain when there was one and
the top of the newly allocated chain if there was no old chain.
Actually, it should be noted that prior to this fix, although the
comment above m_getm() advertised that m_getm() would return the
top of the old chain (if an old chain was being passed in) it
actually [wrongly] was returning the tail mbuf in the old chain
instead. This is a bug but since the one use of m_getm() in
the tree luckily did not depend on the behavior, it happened
to work out without notice.
Harti Brandt pointed out that the advertised behavior was actually
not the real behavior and so this change makes m_getm() ALWAYS
return the newly allocated chain (and fixes the comment). This
is less confusing and is the best course of action as then the
caller is always able to have both a reference to the top of
the original chain (because it's passing it in in the call) and
a reference to the newly attached chain. Although the API is
slightly modified, I don't think that any third-party code uses
m_getm() and if it does, it surely can't be working properly
because the old behavior was bogus.
API bug pointed out by: Harti Brandt <brandt@fokus.fraunhofer.de>
To fix scsi, don't wait for ithreads if we're dumping, it makes the
debugger sad.
To fix ata, use what appears to be a polling method if we're dumping,
I stole this from tmm but added code to ensure that this change is
only in effect while dumping.
Tested by: des
The locking here needs to be revisited, but this ought to get rid of the
LOR messages that people are complaining about for now. I imagine either
I or someone else interested with smp will eventually clear this up.
- Use the ratio of kg_runtime / kg_slptime to determine our dynamic priority.
- Scale kg_runtime and kg_slptime back when the sum of the two exceeds
SCHED_SLP_RUN_MAX. This allows us to slowly forget old behavior.
- Scale back the runtime and slptime in fork so that the new process has the
same ratio but much less accumulated time. This causes new behavior to be
noticed more quickly.
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h
buf lists, synchronization variables, and atomic ops for the counters.
This change does not remove giant from any code although some pushdown
may be possible.
- In vfs_bio_awrite() don't access buf fields without the buf lock.
Change the si_name of dev_t's to be a char * and put a private buffer for
holding the name at then end of the struct.
Initialize si_name to point to the private buffer.
Put a KASSERT in geom_disk to prevent overrun on the fake dev_t we still
have to generate for the disk_drivers.
prevent the compiler from optimizing assignments into byte-copy
operations which might make access to the individual fields non-atomic.
Use the individual fields throughout, and don't bother locking them with
Giant: it is no longer needed.
Inspired by: tjr
statclock based on profhz when profiling is enabled MD, since most platforms
don't use this anyway. This removes the need for statclock_process, whose
only purpose was to subdivide profhz, and gets the profiling clock running
outside of sched_lock on platforms that implement suswintr.
Also changed the interface for starting and stopping the profiling clock to
do just that, instead of changing the rate of statclock, since they can now
be separate.
Reviewed by: jhb, tmm
Tested on: i386, sparc64
have some negative effect on interactivity but it yields great perf. gains.
This also brings the conditions under which ULE context switches inline
with SCHED_4BSD.
- Define some new kseq_* functions for manipulating the run queue.
- Add a new kseq member ksq_rslices and ksq_bload. rslices is the sum of
the slices of runnable kses. This will be used for push load balance
decisions. bload is the number of threads blocked waiting on IO.
I'm not convinced there is anything major wrong with the patch but
them's the rules..
I am using my "David's mentor" hat to revert this as he's
offline for a while.
than having change_dir() release the vnode lock on success, hold the
lock so that we can use it later when invoking MAC checks and
VOP_ACCESS() in the chroot() code. Update the comment to reflect
this calling convention. Update callers to unlock the vnode
lock. Correct a typo regarding vnode naming in the MAC case that
crept in via the previous patch applied.
cases: we might multiply vrele() a vnode when certain classes of
failures occur. This appears to stem from earlier Giant/file
descriptor lock pushdown and restructuring.
Submitted by: maxim
This implicitly removes the need for major numbers, but a number of
drivers still know things they shouldn't need to, and we need to
consider if there are applications which cache major(+minor) gleaned
from stat(2) and rely on it being constant over reboots before we
start assigning random majors.
sched_runnable() et all.
- Remove some dead code in sched_clock().
- Define two macros KSEQ_SELF() and KSEQ_CPU() for getting the kseq of the
current cpu or some alternate cpu.
- Start introducing kseq_() functions, such as kseq_choose() and kseq_setup().
run queue for each cpu.
- Introduce kse stealing into the sched_choose() code. This helps balance
cpus better in cases where process turnover is high. This implementation
is fairly trivial and will likely be only a temporary measure until
something more sophisticated has been written.
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.
A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.
Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.
Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.
KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.
When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.
The code hasn't been tested under SMP by author due to lack of hardware.
Reviewed by: julian
potential discontinuities in our UTC timescale.
Applications can monitor this variable if they want to be informed
about steps in the timescale. Slews (ntp and adjtime(2)) and
frequency adjustments (ntp) will not increment this counter, only
operations which set the clock. No attempt is made to classify
size or direction of the step.