the process and session. Instead, cache a true reference to the session
when we do the hold and release our reference on that session. This avoids
the need for the proc lock when dropping the reference.
protects, so don't bother locking it while we assign it to a ucred's
cr_prison.
- Fully construct the new credential for a process before assigning it to
p_ucred.
- Set p_acflag earlier while already hold the proc lock in fork1().
- Mark the realitexpire() callout MPSAFE for new processes. It was already
marked safe for proc0 a long while ago.
generation number back to a long (sizeof(u_int) != sizeof(long) on
sparc64). The alternative would have been to heavily change the libdevstat API.
Discussed with: phk, ken
don't try and convert the argument flags to malloc flags, or we risk
implicitly requesting blocking and generating witness warnings.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
When enabled, this causes m_defrag to randomly return NULL (following
its normal failure case so that extra memory leaks are not introduced.)
Code similar to this was used to find / fix a few bugs last week.
returning some additional room in the first mbuf in a chain, and
avoiding feature-specific contents in the mbuf header. To do this:
- Modify mbuf_to_label() to extract the tag, returning NULL if not
found.
- Introduce mac_init_mbuf_tag() which does most of the work
mac_init_mbuf() used to do, except on an m_tag rather than an
mbuf.
- Scale back mac_init_mbuf() to perform m_tag allocation and invoke
mac_init_mbuf_tag().
- Replace mac_destroy_mbuf() with mac_destroy_mbuf_tag(), since
m_tag's are now GC'd deep in the m_tag/mbuf code rather than
at a higher level when mbufs are directly free()'d.
- Add mac_copy_mbuf_tag() to support m_copy_pkthdr() and related
notions.
- Generally change all references to mbuf labels so that they use
mbuf_to_label() rather than &mbuf->m_pkthdr.label. This
required no changes in the MAC policies (yay!).
- Tweak mbuf release routines to not call mac_destroy_mbuf(),
tag destruction takes care of it for us now.
- Remove MAC magic from m_copy_pkthdr() and m_move_pkthdr() --
the existing m_tag support does all this for us. Note that
we can no longer just zero the m_tag list on the target mbuf,
rather, we have to delete the chain because m_tag's will
already be hung off freshly allocated mbuf's.
- Tweak m_tag copying routines so that if we're copying a MAC
m_tag, we don't do a binary copy, rather, we initialize the
new storage and do a deep copy of the label.
- Remove use of MAC_FLAG_INITIALIZED in a few bizarre places
having to do with mbuf header copies previously.
- When an mbuf is copied in ip_input(), we no longer need to
explicitly copy the label because it will get handled by the
m_tag code now.
- No longer any weird handling of MAC labels in if_loop.c during
header copies.
- Add MPC_LOADTIME_FLAG_LABELMBUFS flag to Biba, MLS, mac_test.
In mac_test, handle the label==NULL case, since it can be
dynamically loaded.
In order to improve performance with this change, introduce the notion
of "lazy MAC label allocation" -- only allocate m_tag storage for MAC
labels if we're running with a policy that uses MAC labels on mbufs.
Policies declare this intent by setting the MPC_LOADTIME_FLAG_LABELMBUFS
flag in their load-time flags field during declaration. Note: this
opens up the possibility of post-boot policy modules getting back NULL
slot entries even though they have policy invariants of non-NULL slot
entries, as the policy might have been loaded after the mbuf was
allocated, leaving the mbuf without label storage. Policies that cannot
handle this case must be declared as NOTLATE, or must be modified.
- mac_labelmbufs holds the current cumulative status as to whether
any policies require mbuf labeling or not. This is updated whenever
the active policy set changes by the function mac_policy_updateflags().
The function iterates the list and checks whether any have the
flag set. Write access to this variable is protected by the policy
list; read access is currently not protected for performance reasons.
This might change if it causes problems.
- Add MAC_POLICY_LIST_ASSERT_EXCLUSIVE() to permit the flags update
function to assert appropriate locks.
- This makes allocation in mac_init_mbuf() conditional on the flag.
Reviewed by: sam
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
mbuf_to_label(). This permits the vast majority of entry point code
to be unaware that labels are stored in m->m_pkthdr.label, such that
we can experiment storage of labels elsewhere (such as in m_tags).
Reviewed by: sam
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
the current queue if its priority is really elevated. This needs more work
as there are cases where a next queue kse could be holding up what would
be a curr queue kse, and thus hurting interactivity. Also, when a thread
with an elevated priority has its priority lowered it should be placed
back on the next queue.
the target process exiting which causes attempts to register the kevent
to randomly fail depending on whether the target runs to completion before
the parent can call kevent(2). The bug actually effects EVFILT_PROC
events on any zombie process, but the most common manifestation is with
parents trying to monitor child processes.
MFC after: 2 weeks
Sponsored by: NTT Multimedia Communications Labs
the second kseq's run queue so that it is referenced by the kse when
it is switched out.
- Spell ksq_rslices properly.
Reported by: Ian Freislich <ianf@za.uu.net>
- Allow user adjustable min and max time slices (suggested by hiten).
- Change the SLP_RUN_MAX to 100ms from 2 seconds so that we learn whether a
process is interactive or not much more quickly.
- Place a process on the current run queue if it is interactive or if it is
running at an interrupt thread priority due to priority prop.
- Use the 'current' timeshare queue for interrupt threads, realtime threads,
and idle threads that are running at higher priority due to priority prop.
This fixes problems where priorities would have been elevated but we would
not check the timeshare run queue until other lower priority tasks were
no longer runnable.
- Keep an array of loads indexed by the priority class as well as a global
load.
- Keep an bucket of nice values with a count of the number of kses currently
runnable with that nice value.
- Keep track of the minimum nice value of any running thread.
- Remove the unused short term sleep accounting. I was attempting to use
this for load balancing but it didn't work out.
- Define a kseq_print() for use with debugging.
- Add KTR debugging at useful places so we can easily debug slice and
priority assignment.
- Decouple the runq assignment from the kseq assignment. kseq_add now keeps
track of statistics. This is done so that the nice and load is still
tracked for the currently running process. Previously if a niced process
was added while a non nice process was running the niced process would
still get a slice since it was not aware of the unnice process.
- Make adjustments for the sched api changes.
of ksegs since they primarily operation on processes.
- KSEs take ticks so pass the kse through sched_clock().
- Add a sched_class() routine that adjusts a ksegrp pri class.
- Define a sched_fork_{kse,thread,ksegrp} and sched_exit_{kse,thread,ksegrp}
that will be used to tell the scheduler about new instances of these
structures within the same process. These will be used by THR and KSE.
- Change sched_4bsd to reflect this API update.
by allprison_mtx), a unique prison/jail identifier field, two path
fields (pr_path for reporting and pr_root vnode instance) to store
the chroot() point of each jail.
o Add jail_attach(2) to allow a process to bind to an existing jail.
o Add change_root() to perform the chroot operation on a specified
vnode.
o Generalize change_dir() to accept a vnode, and move namei() calls
to callers of change_dir().
o Add a new sysctl (security.jail.list) which is a group of
struct xprison instances that represent a snapshot of active jails.
Reviewed by: rwatson, tjr
of asserting that an mbuf has a packet header. Use it instead of hand-
rolled versions wherever applicable.
Submitted by: Hiten Pandya <hiten@unixdaemons.com>
- Treat each class specially in kseq_{choose,add,rem}. Let the rest of the
code be less aware of scheduling classes.
- Skip the interactivity calculation for non TIMESHARE ksegrps.
- Move slice and runq selection into kseq_add(). Uninline it now that it's
big.
as it could be and can do with some more cleanup. Currently its under
options LAZY_SWITCH. What this does is avoid %cr3 reloads for short
context switches that do not involve another user process. ie: we can
take an interrupt, switch to a kthread and return to the user without
explicitly flushing the tlb. However, this isn't as exciting as it could
be, the interrupt overhead is still high and too much blocks on Giant
still. There are some debug sysctls, for stats and for an on/off switch.
The main problem with doing this has been "what if the process that you're
running on exits while we're borrowing its address space?" - in this case
we use an IPI to give it a kick when we're about to reclaim the pmap.
Its not compiled in unless you add the LAZY_SWITCH option. I want to fix a
few more things and get some more feedback before turning it on by default.
This is NOT a replacement for Bosko's lazy interrupt stuff. This was more
meant for the kthread case, while his was for interrupts. Mine helps a
little for interrupts, but his helps a lot more.
The stats are enabled with options SWTCH_OPTIM_STATS - this has been a
pseudo-option for years, I just added a bunch of stuff to it.
One non-trivial change was to select a new thread before calling
cpu_switch() in the first place. This allows us to catch the silly
case of doing a cpu_switch() to the current process. This happens
uncomfortably often. This simplifies a bit of the asm code in cpu_switch
(no longer have to call choosethread() in the middle). This has been
implemented on i386 and (thanks to jake) sparc64. The others will come
soon. This is actually seperate to the lazy switch stuff.
Glanced at by: jake, jhb
to select a KSE with a slice of 0 we will update its slice and insert it
onto the next queue.
- Pass the KSE instead of the ksegrp into sched_slice(). This more
accurately reflects the behavior of the code. Slices are granted to kses.
- Add a function kseq_nice_min() which finds the smallest nice value
assigned to the kseg of any KSE on the queue.
- Rewrite the logic in sched_slice(). Add a large comment describing the
new slice selection scheme. To summarize, slices are assigned based on
the nice value. Priorities are still calculated based on the nice and
interactivity of a process. Slice sizes of 0 may be granted for KSEs
whos nice is 20 or futher away from the lowest nice on the run queue.
Other nice values are scaled across the range [min, min+20]. This fixes
ULEs bad behavior with positively niced processes.
if (p->p_numthreads > 1) and not a flag because action is only necessary
if there are other threads. The rest of the system has no need to
identify thr threaded processes.
- In kern_thread.c use thr_exit1() instead of thread_exit() if P_THREADED
is not set.
- umtx_lock() is defined as an inline in umtx.h. It tries to do an
uncontested acquire of a lock which falls back to the _umtx_lock()
system-call if that fails.
- umtx_unlock() is also an inline which falls back to _umtx_unlock() if the
uncontested unlock fails.
- Locks are keyed off of the thr_id_t of the currently running thread which
is currently just the pointer to the 'struct thread' in kernel.
- _umtx_lock() uses the proc pointer to synchronize access to blocked thread
queues which are stored in the first blocked thread.
- sys/thr.h contains the user space visible api that is intended only for
use in threading library packages.
- kern/kern_thr.c contains thr system calls and other thr specific code.
kern_sigtimedwait() which is capable of supporting all of their semantics.
- These should be POSIX compliant but more careful review is needed before
we announce this.
a follow on commit to kern_sig.c
- signotify() now operates on a thread since unmasked pending signals are
stored in the thread.
- PS_NEEDSIGCHK moves to TDF_NEEDSIGCHK.
- Change all consumers to pass in a thread.
Right now this does not cause any functional changes but it will be important
later when signals can be delivered to specific threads.
of sf_buf_alloc() instead of expecting sf_buf_alloc()'s caller to map it.
The ultimate reason for this change is to enable two optimizations:
(1) that there never be more than one sf_buf mapping a vm_page at a time
and (2) 64-bit architectures can transparently use their 1-1 virtual
to physical mapping (e.g., "K0SEG") avoiding the overhead of pmap_qenter()
and pmap_qremove().
time a character is written. Use this at boot time to reject the
existing buffer contents if they are corrupt. This fixes a problem
seen on some hardware (especially laptops) where the message buffer
gets partially corrupted during a short power cycle or reset, but
the msgbuf structure is left intact so it gets reused, resulting
in random junk and control characters appearing in dmesg and
/var/log/messages.
PR: kern/28497
that the feature can be enabled during the boot process. Note the
continued limitation that FreeBSD fails so rapidly with this setting
enabled that it's hard to narrow down particular failures for
correction; we really need per-malloc type failure rates.
in a debugging feature causing M_NOWAIT allocations to fail at
a specified rate. This can be useful for detecting poor
handling of M_NOWAIT: the most frequent problems I've bumped
into are unconditional deference of the pointer even though
it's NULL, and hangs as a result of a lost event where memory
for the event couldn't be allocated. Two sysctls are added:
debug.malloc.failure_rate
How often to generate a failure: if set to 0 (default), this
feature is disabled. Otherwise, the frequency of failures --
I've been using 10 (one in ten mallocs fails), but other
popular settings might be much lower or much higher.
debug.malloc.failure_count
Number of times a coerced malloc failure has occurred as a
result of this feature. Useful for tracking what might have
happened and whether failures are being generated.
Useful possible additions: tying failure rate to malloc type,
printfs indicating the thread that experienced the coerced
failure.
Reviewed by: jeffr, jhb
additional flags argument to indicate blocking disposition, and
pass in M_NOWAIT from the IP reassembly code to indicate that
blocking is not OK when labeling a new IP fragment reassembly
queue. This should eliminate some of the WITNESS warnings that
have started popping up since fine-grained IP stack locking
started going in; if memory allocation fails, the creation of
the fragment queue will be aborted.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
where physical addresses larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.
Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.
Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)
flexible process_fork, process_exec, and process_exit eventhandlers. This
reduces code duplication and also means that I don't have to go duplicate
the eventhandler locking three more times for each of at_fork, at_exec, and
at_exit.
Reviewed by: phk, jake, almost complete silence on arch@
is set to 0, it now has the same affect as setting witness_dead used to
have.
- Added a sysctl handler that allows root to change witness_watch from a
non-zero value to zero to disable witness at runtime. Note that you
can't turn witness back on once it is off. You can only turn it off as
a one-way switch.
- Added a comment describing the possible values of witness_watch.
kse_mailbox to schedule an upcall, this is useful for userland timeout
routine, for example pthread_cond_timedwait().
Also extract upcall scheduling code from kse_reassign and create
a new function called thread_switchout to include these code.
Reviewed by: julain
the devstat is for an "interior" GEOM node and register using the
name argument as a geom identity pointer. Do not put these devstat
structures on the list returned by the sysctl.
This gives us the ability to tell the two kinds of nodes apart and
leave the current "strictly physical" view of devstat intact without
modifications, yet be able to use devstat for both kinds of devices.
It also saves us bloating struct devstat with another 48 bytes of
space for the name. At least for now.
Reviewed by: ken
Add a mutex and protect the allocation and traversal of the list with it.
When we allocate a page for devstat use we drop the mutex and use
M_WAITOK this is not nice, but under the given circumstances the
best we can do.
In the sysctl handler for returning the devstat entries we do not want to
hold the mutex across copyout(9) calls, so we keep a very careful eye on
the devstat_generation count, and abandon with EBUSY if it changes under
our feet.
Specifically test for BIO_WRITE, rather than default non-read,non-deletes
as write. Make the default be DEVSTAT_NO_DATA.
Add atomic increments of the sequence[01] fields so applications using the
mmap'ed view stand a chance of detecting updates in progress.
Reviewed by: ken
but I decided that it was important for this patch to not bit-rot, and
since it is mainly moving code around, the total amount of entropy is
epsilon /phk)
This is a patch to move the common parts of linux_getcwd() back into
kern/vfs_cache.c so that the standard FreeBSD libc getcwd() can use it's
extended functionality. The linux syscall linux_getcwd() in
compat/linux/linux_getcwd.c has been rewritten to use it too. It should
be possible to simplify libc's getcwd() after this. No doubt this code
needs some cleaning up, since I've left in the sysctl variables I used
for debugging.
PR: 48169
Submitted by: James Whitwell <abacau@yahoo.com.au>
Kernel:
Change statistics to use the *uptime() timescale (ie: relative to
boottime) rather than the UTC aligned timescale. This makes the
device statistics code oblivious to clock steps.
Change timestamps to bintime format, they are cheaper.
Remove the "busy_count", and replace it with two counter fields:
"start_count" and "end_count", which are updated in the down and
up paths respectively. This removes the locking constraint on
devstat.
Add a timestamp argument to devstat_start_transaction(), this will
normally be a timestamp set by the *_bio() function in bp->bio_t0.
Use this field to calculate duration of I/O operations.
Add two timestamp arguments to devstat_end_transaction(), one is
the current time, a NULL pointer means "take timestamp yourself",
the other is the timestamp of when this transaction started (see
above).
Change calculation of busy_time to operate on "the salami principle":
Only when we are idle, which we can determine by the start+end
counts being identical, do we update the "busy_from" field in the
down path. In the up path we accumulate the timeslice in busy_time
and update busy_from.
Change the byte_* and num_* fields into two arrays: bytes[] and
operations[].
Userland:
Change the misleading "busy_time" name to be called "snap_time" and
make the time long double since that is what most users need anyway,
fill it using clock_gettime(CLOCK_MONOTONIC) to put it on the same
timescale as the kernel fields.
Change devstat_compute_etime() to operate on struct bintime.
Remove the version 2 legacy interface: the change to bintime makes
compatibility far too expensive.
Fix a bug in systat's "vm" page where boot relative busy times would
be bogus.
Bump __FreeBSD_version to 500107
Review & Collaboration by: ken
KTRFAC_DROP to track instances when ktrace events are dropped due to the
request pool being exhausted. When a thread tries to post a ktrace event
and is unable to due to no available ktrace request objects, it sets
KTRFAC_DROP in its process' p_traceflag field. The next trace event to
successfully post from that process will set the KTR_DROP flag in the
header of the request going out and clear KTRFAC_DROP.
The KTR_DROP flag is the high bit in the type field of the ktr_header
structure. Older kdump binaries will simply complain about an unknown type
when seeing an entry with KTR_DROP set. Note that KTR_DROP being set on a
record in a ktrace file does not tell you anything except that at least one
event from this process was dropped prior to this event. The user has no
way of knowing what types of events were dropped nor how many were dropped.
Requested by: phk
struct proc as p_tracecred alongside the current cache of the vnode in
p_tracep. This credential is then used for all later ktrace operations on
this file rather than using the credential of the current thread at the
time of each ktrace event.
- Now that we have multiple ktrace-related items in struct proc that are
pointers, rename p_tracep to p_tracevp to make it less ambiguous.
Requested by: rwatson (1)
- Create a new function bdone() which sets B_DONE and calls wakup(bp). This
is suitable for use as b_iodone for buf consumers who are not going
through the buf cache.
- Create a new function bwait() which waits for the buf to be done at a set
priority and with a specific wmesg.
- Replace several cases where the above functionality was implemented
without locking with the new functions.
possible for some time.
- Lock the buf before accessing fields. This should very rarely be locked.
- Assert that B_DELWRI is set after we acquire the buf. This should always
be the case now.
requiring locked bufs in vfs_bio_awrite(). Previously the buf could
have been written out by fsync before we acquired the buf lock if it
weren't for giant. The cluster_wbuild() handles this race properly but
the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a
parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find
any bufs without deps.
than a MAXPHYS size block ahead. Having this set too high just leaves
other processes starved for IO and screws up interactive response. Let the
users with RAID set it higher when they need it.
- If SYSCTL_OUT() fails in sysctl_kern_proc_args(), return the error
instead of ignoring it if we have new arguments for the process.
- If the new arguments for a process are too long, return ENOMEM instead of
returning success but not doing the actual copy.
Submitted by: bde
hold hold it across the check to avoid extra lock operations in the
common case.
- Copy in the new args to a temporary pargs structure before we drop the
reference to the old one. Thus, if the copyin() fails, the process
arguments are unchanged rather than being deleted. Also, p_args is no
longer NULL during the sysctl operation.
it from its pgrp to avoid leaving zombies around with p_pgrp == NULL.
This bug was apparent as a NULL-dereference in the pid selection code
in fork1().
closely what function is really doing. Update all existing consumers
to use the new name.
Introduce a new vfs_stdsync function, which iterates over mount
point's vnodes and call FSYNC on each one of them in turn.
Make nwfs and smbfs use this new function instead of rolling their
own identical sync implementations.
Reviewed by: jeff
a parameter instead of using the level of a given witness. When
recursing, pass an indent level of indent + 1.
- Make use of the information witness_levelall() provides in
witness_display_list() to use an O(n) algorithm instead of an O(n^2)
algo to decide which witnesses to display hierarchies from. Basically,
we only display a hierarchy for witnesses with a level of 0.
- Add a new per-witness flag that is reset at the start of
witness_display() for all witness's and is set the first time a witness
is displayed in witness_displaydescendants(). If a witness is
encountered more than once in the lock order tree (which happens often),
witness_displaydescendants() marks the later occurrences with the string
"(already displayed)" and doesn't display the subtree under that
witness. This avoids duplicating large amounts of the lock order tree
in the 'show witness' output in DDB.
All these changes serve to make 'show witness' a lot more readable and
useful than it was previously.
adds a witness to the child list of a parent witness. rebalancetree()
runs through the entire tree removing direct descendants of witnesses
who already have said child witness as an indirect descendant through
another direct descendant. itismychild() now calls insertchild()
followed by rebalancetree() and no longer needs the evil hack of
having static recursed variable.
- Add a function reparentchildren() that adds all the direct descendants
of one witness as direct descendants of another witness.
- Change the return value of itismychild() and similar functions so that
they return 0 in the case of failure due to lack of resources instead
of 1. This makes the return value more intuitive.
- Check the return value of itismychild() when defining the static lock
order in witness_initialize().
- Don't try to setup a lock instance in witness_lock() if itismychild()
fails. Witness is hosed anyways so no need to do any more witness
related activity at that point. It also makes the code flow easier to
understand.
- Add a new depart() function as the opposite of enroll(). When the
reference count of a witness drops to 0 in witness_destroy(), this
function is called on that witness. First, it runs through the
lock order tree using reparentchildren() to reparent direct descendants
of the departing witness to each of the witness' parents in the tree.
Next, it releases it's own child list and other associated resources.
Finally it calls rebalanacetree() to rebalance the lock order tree.
- Sort function prototypes into something closer to alphabetical order.
As a result of these changes, there should no longer be 'dead' witnesses
in the order tree, and repeatedly loading and unloading a module should no
longer exhaust witness of its internal resources.
Inspired by: gallatin
recursing on a lock instead of before. This fixes a bug where WITNESS
could get a little confused if you did an sx_tryslock() on a sx lock that
you already had an slock on. WITNESS would still function correctly but
it could result in weirdness in the output of 'show locks'. This also
makes it possible for mtx_trylock() to recurse on a lock.
used popped into my head during my morning commute a few weeks ago, but
it is also very similar (though a bit simpler) to a patch that mini@
developed a while ago. Basically, each eventhandler list has a mutex and
a run count. During an eventhandler invocation, the mutex is held while
we traverse the list but is dropped while we execute actual handlers. Also,
a runcount counter is incremented at the start of an invocation and
decremented at the end of an invocation. Adding to the list is not a big
deal since the reference of a thread currently executing the handlers
remains valid across an add operation. Whether or not new handlers are
executed by threads currently executing the handlers for a given list is
indeterminate however. The harder case is when a handler is removed from
the list. If the runcount is zero, the handler is simply removed from the
list directly. If the runcount is not zero, then another thread is
currently executing the handlers of this list, so the priority of this
handler is set to a magic value (currently -1) to mark it as dead. Dead
handlers are not executed during an invocation. If the runcount is zero
after it is decremented at the end of an invocation, then a new
eventhandler_prune_list() function is called to remove dead handlers from
the list.
Additional minor notes:
- All the common parts of EVENTHANDLER_INVOKE() and
EVENTHANDLER_FAST_INVOKE() have been merged into a common
_EVENTHANDLER_INVOKE() macro to reduce duplication and ease maintenance.
- KTR logging for eventhandlers is now available via the KTR_EVH mask.
- The global eventhander_mutex is no longer recursive.
Tested by: scottl (SMP i386)
- Issue the io that we will later block on prior to doing cluster read ahead
so that it is more likely to be ready when we block.
- Loop issuing clustered reads until we've exhausted the seq count supplied
by the file system.
- Use a sysctl tunable "vfs.read_max" to determine the maximum number of
blocks that we'll read ahead.
I had commented the #ifdef INVARIANTS checks out to make sure I ran this
code in all kernels and forgot to comment the #ifdefs back in before I
committed.
Spotted by: bmilekic
[1] PHCC = Pointy Hat Correction Commit
ddb 'show locks' command. Thus, move witness_list() to the #ifdef DDB
section and remove extra checks for calling this function outside of
DDB. Also, witness_list() now returns void instead of returning an int.
Reported by: Steve Ames <steve@energistic.com>
Prodded by: davidxu
Remove an incorrect comment. (Incrementing an object's reference count
does not prevent a process from exiting. The real concern here is that the
physical page must not be deleted until transmission is complete. That is
already handled by the VM system and sf_buf_free().)
Tested by: ken
is more robust and prevents the hijacking of /dev/console for the typical
mistake.
Remove unneeded MAJOR_AUTO uses, it is only needed explicitly now if the
driver source has cross-branch compatibility to old releases.
the device statistics structures into userland instead of using sysctl.
Introduce new devstat_new_entry() function which allocates the devstat
structure an calls devstat_add_entry() on it.
- On receive, vm_map_lookup() needs to trigger the creation of a shadow
object. To make that happen, call vm_map_lookup() with PROT_WRITE
instead of PROT_READ in vm_pgmoveco().
- On send, a shadow object will be created by the vm_map_lookup() in
vm_fault(), but vm_page_cowfault() will delete the original page from
the backing object rather than simply letting the legacy COW mechanism
take over. In other words, the new page should be added to the shadow
object rather than replacing the old page in the backing object. (i.e.
vm_page_cowfault() should not be called in this case.) We accomplish
this by making sure fs.object == fs.first_object before calling
vm_page_cowfault() in vm_fault().
Submitted by: gallatin, alc
Tested by: ken
witness. Sleepable locks such as sx locks always come before all mutexes
including Giant. However, the static lock order list placed Giant before
the proctree and allproc sx locks. This resulted in witness creating a
cycle in its lock order "tree" (real trees don't have cycles) leading to
infinite recursion and eventually a double fault. To fix, put Giant after
sx locks in the lock order list.
check, mac_check_sysarch_ioperm(), permitting MAC security policy
modules to control access to these interfaces. Currently, they
protect access to IOPL on i386, and setting HAE on Alpha.
Additional checks might be required on other platforms to prevent
bypass of kernel security protections by unauthorized processes.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
modules to authorize disabling of swap against a particular vnode.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
before the MAC check so that we pass the flags field into the MAC
check properly initialized. This didn't affect any current MAC
modules since they didn't care what the flags argument was (as
they were primarily interested in the fact that it was a meta-data
write, not the contents of the write), but would be relevant to
future modules relying on that field.
Submitted by: Mike Halderman <mrh@spawar.navy.mil>
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.
Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs
- Use gbincore() and not incore() so that we can drop the vnode interlock
as we acquire the buflock.
- Use GB_LOCK_NOWAIT when getting bufs for read ahead clusters so that we
don't block on locked bufs.
- Convert a while loop to a howmany() that will most likely be faster on
modern processors. There is another while loop divide that was left
near by because it is operating on a 64bit int and is most likely faster.
- Cleanup the cluster_read() code a little to get rid of a goto and make
the logic clearer.
Tested on: x86, alpha
Tested by: Steve Kargl <sgk@troutmask.apl.washington.edu>
Reviewd by: arch
already own. The mtx_trylock() will fail however. Enhance the comment
at the top of the try lock function to explain this.
Requested by: jlemon and his evil netisr locking
- Add a comment about special lock order rules and Giant near the top of
subr_witness.c. Specifically, this documents and explains the real lock
order relationship between Giant and sleepable locks (i.e. lockmgr locks
and sx locks). Basically, Giant can be safely acquired either before or
after sleepable locks and the case of Giant before a sleepable lock is
exempted as a special case.
- Add a new static function 'witness_list_lock()' that displays a single
line of information about a struct lock_instance. This is used to
make the output of witness messages more consistent and reduce some code
duplication.
- Fixup a few comments in witness_lock().
- Properly handle the Giant-before-sleepable-lock lock order exception in
a more general fashion and remove the no longer needed LI_SLEPT flag.
- Break up the last condition before assuming a reversal a bit to try
and make the logic less confusing in witness_lock().
- Axe WITNESS_SLEEP() now that LI_SLEPT is no longer needed and replace it
with a more general WITNESS_WARN() macro/function combination.
WITNESS_WARN() allows you to output a customized message out to the
console along with a list of held locks. It will optionally drop into
the debugger as well. You can exempt a single lock from the check by
passing it in as the second argument. You can also use flags to specify
if Giant should be exempt from the check, if all sleepable locks should
be exempt from the check, and if witness should panic if any non-exempt
locks are found.
- Make the witness_list() function static. Other areas of the kernel
should use the new WITNESS_WARN() instead.
- Declare some local variables at the top of the function instead of in a
nested block.
- Use mtx_owned() instead of masking off bits from mtx_lock manually.
- Read the value of mtx_lock into 'v' as a separate line rather than inside
an if statement for clarity. This code is hairy enough as it is.
owned. Previously the KASSERT would only trigger if we successfully
acquired a lock that we already held. However, _obtain_lock() fails to
acquire locks that we already hold, so the KASSERT was never checked in
the case it was supposed to fail.
interactivity of a kseg and assigns it a value of 0 through 100.
- Use sched_interact_score() to determine the dynamic priority.
- Define SCHED_CURR() in terms of sched_interact_score().
- Adjust the maximum slice back down to 100ms.
- Remove redundant clearing of ke_runq in sched_wakeup()
- Clean up #defines and comment them.
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.
Reviwed by: arch
Not objected to by: mckusick
Remove extraneous uses of vop_null, instead defering to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.
branches:
Initialize struct cdevsw using C99 sparse initializtion and remove
all initializations to default values.
This patch is automatically generated and has been tested by compiling
LINT with all the fields in struct cdevsw in reverse order on alpha,
sparc64 and i386.
Approved by: re(scottl)
calculations. Keep this changes local to the function so the tick count
is in its natural form otherwise. Previously 1000 was added each time
a tick fired and we divided by 1000 when it was reported. This is done
to reduce rounding errors.
To do this, initialize the d_maj member of the cdevsw to MAJOR_AUTO.
When the cdevsw is first passed to make_dev() a free major number
will be assigned.
Until we have a bit more experience with this a printf will announce
this fact.
Major numbers are not reclaimed, so loading/unloading the same
device driver which uses MAJOR_AUTO will eventually deplete the
pool of free major numbers and the system will panic when it can
not allocate one. Still undecided who to invonvenience with the
solution to this.
td_wmesg field in the thread structure points to the description string of
the condition variable or mutex. If the condvar or the mutex had been
initialized from a loadable module that was unloaded in the meantime,
td_wmesg may now point to invalid memory. Retrieving the process table now
may panic the kernel (or access junk). Setting the td_wmesg field to NULL
after unblocking on the condvar/mutex prevents this panic.
PR: kern/47408
Approved by: jake (mentor)
turns runs its tasks free of Giant too. It is intended that as drivers
become locked down, they will move out of the old, Giant-bound taskqueue
and into this new one. The old taskqueue has been renamed to
taskqueue_swi_giant, and the new one keeps the name taskqueue_swi.
delta 1.371) we must ensure that we do not get ourselves into a
recursive trap endlessly trying to clean up after ourselves.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
o Always check for null when dereferencing the filename component.
o Implement a try-and-backoff method for allocating memory to
dump stats to avoid a spin-lock -> sleep-lock mutex lock order
panic with WITNESS.
Approved by: des, markm (mentor)
Not objected: jhb
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.
Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
- Get rid of the useless atop() / pmap_phys_address() detour. The
device mmap handlers must now give back the physical address
without atop()'ing it.
- Don't borrow the physical address of the mapping in the returned
int. Now we properly pass a vm_offset_t * and expect it to be
filled by the mmap handler when the mapping was successful. The
mmap handler must now return 0 when successful, any other value
is considered as an error. Previously, returning -1 was the only
way to fail. This change thus accidentally fixes some devices
which were bogusly returning errno constants which would have been
considered as addresses by the device pager.
- Garbage collect the poorly named pmap_phys_address() now that it's
no longer used.
- Convert all the d_mmap_t consumers to the new API.
I'm still not sure wheter we need a __FreeBSD_version bump for this,
since and we didn't guarantee API/ABI stability until 5.1-RELEASE.
Discussed with: alc, phk, jake
Reviewed by: peter
Compile-tested on: LINT (i386), GENERIC (alpha and sparc64)
Runtime-tested on: i386
in massive locking issues on diskless systems.
It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.
Retire the "d_dump_t" and use the "dumper_t" type instead.
Dumper_t takes a void * as first arg which is more general than the
dev_t taken by d_dump_t. (Remember: we could have net-dumpers if
somebody wrote us one!)
Define the convention for GEOM controlled disk devices to be that the
first argument to the dumper function is the struct disk pointer.
Change device drivers accordingly.
dev_t to the method functions.
The dev_t can still be found at struct consdev *->cn_dev.
Add a void *cn_arg element to struct consdev which the drivers can use
for retrieving their softc.
In devsw() return dead_cdevsw instead of NULL in case the dev_t does not
have a si_devsw.
This may improve our survival chances with devices which go away unexpectedly.
compile-time constants). That is, a "bucket" now is not necessarily
a page-worth of mbufs or clusters, but it is MBUF_BUCK_SZ, CLUS_BUCK_SZ
worth of mbufs, clusters.
o Rename {mbuf,clust}_limit to {mbuf,clust}_hiwm and introduce
{mbuf,clust}_lowm, which currently has no effect but will be used
to set the low watermarks.
o Fix netstat so that it can deal with the differently-sized buckets
and teach it about the low watermarks too.
o Make sure the per-cpu stats for an absent CPU has mb_active set to 0,
explicitly.
o Get rid of the allocate refcounts from mbuf map mess. Instead,
just malloc() the refcounts in one shot from mbuf_init()
o Clean up / update comments in subr_mbuf.c
used to share resource limits between rfork threads, but never was.
Removing it makes resource limit locking much simpler -- only the current
process can change the contents of the structure that p_limit points to.
reference counter array for mbuf clusters. I don't know
how this got past early testing nor how it survived so long
without getting caught. If anyone was seeing really really
bizarre memory corruption in a few mbufs this would be why.
#if'ed out for a while. Complete the deed and tidy up some other bits.
We need to be able to call this stuff from outer edges of interrupt
handlers for devices that have the ISR bits in pci config space. Making
the bios code mpsafe was just too hairy. We had also stubbed it out some
time ago due to there simply being too much brokenness in too many systems.
This adds a leaf lock so that it is safe to use pci_read_config() and
pci_write_config() from interrupt handlers. We still will use pcibios
to do interrupt routing if there is no acpi.. [yes, I tested this]
Briefly glanced at by: imp
sched_lock around accesses to p_stats->p_timer[] to avoid a potential
race with hardclock. getitimer(), setitimer() and the realitexpire()
callout are now Giant-free.
add a signal to a mailbox's pending set.
- Add a new function, thread_signal_upcall(), this causes the current thread
to upcall so that we can deliver pending signals.
Reviewed by: mini
I was in two minds as to where to put them in the first case..
I should have listenned to the other mind.
Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@
queue lock already held.
- In getblk() and flushbufqueues() use bremfreel() while we still have the
buf queue lock held to keep the lists consistent.
- Add LK_NOWAIT to two cases where we're essentially asserting that the bufs
are not locked while acquiring the locks. This will make sure that we get
the appropriate panic() and not another one for sleeping with a lock held.
- Mark the process leader as having an advisory lock
- Check if process leader is marked as having advisory lock when
closing file
- Check that file is still open after lock has been obtained
- Don't allow file descriptor table sharing between processes
with different leaders
PR: 10265
Reviewed by: alfred
freebsd4_sigaction() and osigaction() instead of around the whole
body of those functions. They now no longer hold Giant around calls
to copyin() and copyout(), and it is slightly more obvious what
Giant is protecting.
barrier between free'ing filedesc structures. Basically if you want to
access another process's filedesc, you want to hold this mutex over the
entire operation.
opposed to returning the top of the old chain when there was one and
the top of the newly allocated chain if there was no old chain.
Actually, it should be noted that prior to this fix, although the
comment above m_getm() advertised that m_getm() would return the
top of the old chain (if an old chain was being passed in) it
actually [wrongly] was returning the tail mbuf in the old chain
instead. This is a bug but since the one use of m_getm() in
the tree luckily did not depend on the behavior, it happened
to work out without notice.
Harti Brandt pointed out that the advertised behavior was actually
not the real behavior and so this change makes m_getm() ALWAYS
return the newly allocated chain (and fixes the comment). This
is less confusing and is the best course of action as then the
caller is always able to have both a reference to the top of
the original chain (because it's passing it in in the call) and
a reference to the newly attached chain. Although the API is
slightly modified, I don't think that any third-party code uses
m_getm() and if it does, it surely can't be working properly
because the old behavior was bogus.
API bug pointed out by: Harti Brandt <brandt@fokus.fraunhofer.de>
To fix scsi, don't wait for ithreads if we're dumping, it makes the
debugger sad.
To fix ata, use what appears to be a polling method if we're dumping,
I stole this from tmm but added code to ensure that this change is
only in effect while dumping.
Tested by: des