class to trace witness events.
- Make the ktr_cpu field of ktr_entry be a standard field rather than one
present only in the KTR_EXTEND case.
- Move the default definition of KTR_ENTRIES from sys/ktr.h to
kern/kern_ktr.c. It has not been needed in the header file since KTR
was un-inlined.
- Minor include cleanup in kern/kern_ktr.c.
- Fiddle with the ktr_cpumask in ktr_tracepoint() to disable KTR events
on the current CPU while we are processing an event.
- Set the current CPU inside of the critical section to ensure we don't
migrate CPU's after the critical section but before we set the CPU.
introduce a modified allocation mechanism for mbufs and mbuf clusters; one
which can scale under SMP and which offers the possibility of resource
reclamation to be implemented in the future. Notable advantages:
o Reduce contention for SMP by offering per-CPU pools and locks.
o Better use of data cache due to per-CPU pools.
o Much less code cache pollution due to excessively large allocation macros.
o Framework for `grouping' objects from same page together so as to be able
to possibly free wired-down pages back to the system if they are no longer
needed by the network stacks.
Additional things changed with this addition:
- Moved some mbuf specific declarations and initializations from
sys/conf/param.c into mbuf-specific code where they belong.
- m_getclr() has been renamed to m_get_clrd() because the old name is really
confusing. m_getclr() HAS been preserved though and is defined to the new
name. No tree sweep has been done "to change the interface," as the old
name will continue to be supported and is not depracated. The change was
merely done because m_getclr() sounds too much like "m_get a cluster."
- TEMPORARILY disabled mbtypes statistics displaying in netstat(1) and
systat(1) (see TODO below).
- Fixed systat(1) to display number of "free mbufs" based on new per-CPU
stat structures.
- Fixed netstat(1) to display new per-CPU stats based on sysctl-exported
per-CPU stat structures. All infos are fetched via sysctl.
TODO (in order of priority):
- Re-enable mbtypes statistics in both netstat(1) and systat(1) after
introducing an SMP friendly way to collect the mbtypes stats under the
already introduced per-CPU locks (i.e. hopefully don't use atomic() - it
seems too costly for a mere stat update, especially when other locks are
already present).
- Optionally have systat(1) display not only "total free mbufs" but also
"total free mbufs per CPU pool."
- Fix minor length-fetching issues in netstat(1) related to recently
re-enabled option to read mbuf stats from a core file.
- Move reference counters at least for mbuf clusters into an unused portion
of the cluster itself, to save space and need to allocate a counter.
- Look into introducing resource freeing possibly from a kproc.
Reviewed by (in parts): jlemon, jake, silby, terry
Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
Preliminary performance measurements: jlemon (and me, obviously)
URL: http://people.freebsd.org/~bmilekic/mb_alloc/
and its associated constants. Implement _SC_IOV_MAX in the usual way.
Be a bit sloppy about the namespace question; this should get cleared up
in time for 5.0.
MFC after: 1 month
take a const 'name', since they dont modify anything.
159: warning: passing arg 1 of `getenv_int' discards qualifiers...
167: warning: passing arg 1 of `getenv' discards qualifiers from pointer..
Replace the a.out emulation of 'struct linker_set' with something
a little more flexible. <sys/linker_set.h> now provides macros for
accessing elements and completely hides the implementation.
The linker_set.h macros have been on the back burner in various
forms since 1998 and has ideas and code from Mike Smith (SET_FOREACH()),
John Polstra (ELF clue) and myself (cleaned up API and the conversion
of the rest of the kernel to use it).
The macros declare a strongly typed set. They return elements with the
type that you declare the set with, rather than a generic void *.
For ELF, we use the magic ld symbols (__start_<setname> and
__stop_<setname>). Thanks to Richard Henderson <rth@redhat.com> for the
trick about how to force ld to provide them for kld's.
For a.out, we use the old linker_set struct.
NOTE: the item lists are no longer null terminated. This is why
the code impact is high in certain areas.
The runtime linker has a new method to find the linker set
boundaries depending on which backend format is in use.
linker sets are still module/kld unfriendly and should never be used
for anything that may be modular one day.
Reviewed by: eivind
- Replace some very poorly thought out API hacks that should have been
fixed a long while ago.
- Provide some much more flexible search functions (resource_find_*())
- Use strings for storage instead of an outgrowth of the rather
inconvenient temporary ioconf table from config(). We already had a
fallback to using strings before malloc/vm was running anyway.
Following changed was made by previous commit:
- add a pointer to struct mauxtag. two integer was too restrictive.
- add m_aux_{add,find}2.
- make sure to nuke mbuf pointed to m_aux.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.
TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.
Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks
around, use a common function for looking up and extracting the tunables
from the kernel environment. This saves duplicating the same function
over and over again. This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.
compliant. All the variable definitions and function names are
reasonably consistent, and the functions which should be static (i.e.,
all of them) are. Other assorted fixes were made. The majority of
the delta is indentation fixes.
Partially reviewed by: bde
- move the sysctl code to kern_intr.c
- do not use INTRCNT_COUNT, but rather eintrcnt - intrcnt to determine
the length of the intrcnt array
- move the declarations of intrnames, eintrnames, intrcnt and eintrcnt
from machine-dependent include files to sys/interrupt.h
- remove the hw.nintr sysctl, it is not needed.
- fix various style bugs
Requested by: bde
Reviewed by: bde (some time ago)
dev_t. The dev_depends(dev_t, dev_t) function is for tying them
to each other.
When destroy_dev() is called on a dev_t, all dev_t's depending
on it will also be destroyed (depth first order).
Rewrite the make_dev_alias() to use this dependency facility.
kern/subr_disk.c:
Make the disk mini-layer use dependencies to make sure all
relevant dev_t's are removed when the disk disappears.
Make the disk mini-layer precreate some magic sub devices
which the disk/slice/label code expects to be there.
kern/subr_disklabel.c:
Remove some now unneeded variables.
kern/subr_diskmbr.c:
Remove some ancient, commented out code.
kern/subr_diskslice.c:
Minor cleanup. Use name from dev_t instead of dsname()
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.
Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.
Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.
I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.
Submitted by: tegge, dillon
copies out the current contents of the video buffer for a syscons terminal,
providing a snapshot of the text and attributes.
Based heavily on work originally submitted by Joel Holveck <joelh@gnu.org>
for 2.2.x almost 30 months ago, which I cleaned up a little, and forward
ported to -current.
See also the usr.bin/scrshot utility.
sleep locks.
- Delay returning from ithread_remove_handler() until we are certain that
the interrupt handler being removed has in fact been removed from the
ithread.
- XXX: There is still a problem in that nothing protects the kernel from
adding a new handler while the ithread is running, though with our
current architectures this is not a problem.
Requested by: gibbs (2)
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.
All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.
This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.
Reviewed by: phk, bp
. FD_CLRERR clears the error counter, thus re-enables kernel error
printf()s,
. FD_GSTAT obtains the last FDC operation state, if any,
. FDOPT_NOERRLOG (temporarily) turns off kernel printf() floppy
error logging,
. FDOPT_NOERROR makes the kernel ignore an FDC error, thus can
enable the transfer of an erroneous sector to the user application
All options are being cleared on (last) close.
Prime consumer of the last features will be fdread(1), to be committed
shortly.
(FD_CLRERR should be wired into fdcontrol(8), but then fdcontrol(8)
needs a major rewrite anyway.)
- Since polling should not involve sleeping, keep holding a
process lock upon scanning file descriptors.
- Hold a reference to every file descriptor prior to entering
polling loop in order to avoid lock order reversal between
lockmgr and p_mtx upon calling fdrop() in fo_poll().
(NOTE: this work has not been done for netncp and netsmb
yet because a socket itself has no reference counts.)
Reviewed by: jhb
struct lock_instance that is stored in the per-process and per-CPU lock
lists. Previously, the lock lists just kept a pointer to each lock held.
That pointer is now replaced by a lock instance which contains a pointer
to the lock object, the file and line of the last acquisition of a lock,
and various flags about a lock including its recursion count.
- If we sleep while holding a sleepable lock, then mark that lock instance
as having slept and ignore any lock order violations that occur while
acquiring Giant when we wake up with slept locks. This is ok because of
Giant's special nature.
- Allow witness to differentiate between shared and exclusive locks and
unlocks of a lock. Witness will now detect the case when a lock is
acquired first in one mode and then in another. Mutexes are always
locked and unlocked exclusively. Witness will also now detect the case
where a process attempts to unlock a shared lock while holding an
exclusive lock and vice versa.
- Fix a bug in the lock list implementation where we used the wrong
constant to detect the case where a lock list entry was full.
This driver supports PCI Xr-based and ISA Xem Digiboard cards.
dgm will go away soon if there are no problems reported. For now,
configuring dgm into your kernel warns that you should be using
digi. This driver is probably close to supporting Xi, Xe and Xeve
cards, but I wouldn't expect them to work properly (hardware
donations welcome).
The digi_* pseudo-drivers are not drivers themselves but contain
the BIOS and FEP/OS binaries for various digiboard cards and are
auto-loaded and auto-unloaded by the digi driver at initialisation
time. They *may* be configured into the kernel, but waste a lot
of space if they are. They're intended to be left as modules.
The digictl program is (mainly) used to re-initialise cards that
have external port modules attached such as the PC/Xem.
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.
been made machine independent and various other adjustments have been made
to support Alpha SMP.
- It splits the per-process portions of hardclock() and statclock() off
into hardclock_process() and statclock_process() respectively. hardclock()
and statclock() call the *_process() functions for the current process so
that UP systems will run as before. For SMP systems, it is simply necessary
to ensure that all other processors execute the *_process() functions when the
main clock functions are triggered on one CPU by an interrupt. For the alpha
4100, clock interrupts are delievered in a staggered broadcast fashion, so
we simply call hardclock/statclock on the boot CPU and call the *_process()
functions on the secondaries. For x86, we call statclock and hardclock as
usual and then call forward_hardclock/statclock in the MD code to send an IPI
to cause the AP's to execute forwared_hardclock/statclock which then call the
*_process() functions.
- forward_signal() and forward_roundrobin() have been reworked to be MI and to
involve less hackery. Now the cpu doing the forward sets any flags, etc. and
sends a very simple IPI_AST to the other cpu(s). AST IPIs now just basically
return so that they can execute ast() and don't bother with setting the
astpending or needresched flags themselves. This also removes the loop in
forward_signal() as sched_lock closes the race condition that the loop worked
around.
- need_resched(), resched_wanted() and clear_resched() have been changed to take
a process to act on rather than assuming curproc so that they can be used to
implement forward_roundrobin() as described above.
- Various other SMP variables have been moved to a MI subr_smp.c and a new
header sys/smp.h declares MI SMP variables and API's. The IPI API's from
machine/ipl.h have moved to machine/smp.h which is included by sys/smp.h.
- The globaldata_register() and globaldata_find() functions as well as the
SLIST of globaldata structures has become MI and moved into subr_smp.c.
Also, the globaldata list is only available if SMP support is compiled in.
Reviewed by: jake, peter
Looked over by: eivind
to struct mount.
This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.
Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().
At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
required by POSIX.1e. This maintains the current 'struct acl'
in the kernel while providing the generic external acl_t
interface required to complete the ACL editing library.
o Add the acl_get_entry() function.
o Convert the existing ACL utilities, getfacl and setfacl, to
fully make use of the ACL editing library.
Obtained from: TrustedBSD Project
only aiod does this and is also marked P_SYSTEM, the locations that
reference p->p_vmspace usually do it within the context of the caller,
the async access from the vm system is protected by the fact that it
will skip over P_SYSTEM processes.
Ok'd by: jhb
we also reserve _adequate_ space for the mb_map submap; i.e. we need
space for nmbclusters, nmbufs, _and_ nmbcnt. Furthermore, we need to
rounddown, and not roundup, so that we are consistent.
Pointed out by: bde
Also move the insertion of the request to after the request is validated,
there's still looks like there may be some problems if an invalid address
is passed to the aio routines, basically a possible leak or having a
not completely initialized structure on the queue may still be possible.
A new sig macro was made _SIG_VALID to check the validity of a signal,
it would be advisable to use it from now on (in kern/kern_sig.c) rather
than rolling your own.
PR: kern/17152
available.
Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.
The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).
Suggested by: tegge
Approved by: phk
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.
This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.
The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the
access ACL could be used by privileged processes to change file/directory
ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and
ACL_OTHER) should have undefined ae_id fields; this commit attempts
to correct that misunderstanding.
o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid
associated with the vnode, as those can no longer be extracted from
the ACL passed as an argument. Perform all comparisons against
the passed arguments. This actually has the effect of simplifying
a number of components of this call, as well as reducing the indent
level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP.
o Modify acl_posix1e_check() to return EINVAL if the ae_id field of
any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value
other than ACL_UNDEFINED_ID. As a temporary work-around to allow
clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before
each check so that this cannot cause a failure in the short term
(this work-around will be removed when the userland libraries and
utilities are updated to take this change into account).
o Modify ufs_sync_acl_from_inode() so that it forces
ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID
when synchronizing the ACL from the inode.
o Modify ufs_sync_inode_from_acl to not propagate uid and gid
information to the inode from the ACL during ACL update. Also
modify the masking of permission bits that may be set from
ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not
carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT).
o Modify ufs_getacl() so that when it emulates an access ACL from
the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.
o Clean up ufs_setacl() substantially since it is no longer possible
to perform chown/chgrp operations using vop_setacl(), so all the
access control for that can be eliminated.
o Modify ufs_access() so that it passes owner uid and gid information
into vaccess_acl_posix1e().
Pointed out by: jedger
Obtained from: TrustedBSD Project
panic_cpu shared variable. I used a simple atomic operation here instead
of a spin lock as it seemed to be excessive overhead. Also, this can avoid
recursive panics if, for example, witness is broken.
from signal authorization checking.
o p_cansignal() takes three arguments: subject process, object process,
and signal number, unlike p_cankill(), which only took into account
the processes and not the signal number, improving the abstraction
such that CANSIGNAL() from kern_sig.c can now also be eliminated;
previously CANSIGNAL() special-cased the handling of SIGCONT based
on process session. privused is now deprecated.
o The new p_cansignal() further limits the set of signals that may
be delivered to processes with P_SUGID set, and restructures the
access control check to allow it to be extended more easily.
o These changes take into account work done by the OpenBSD Project,
as well as by Robert Watson and Thomas Moestl on the TrustedBSD
Project.
Obtained from: TrustedBSD Project
Add simple "xlat" converter which performs 8to8 table based conversion.
Unicode converter will be added in the near future.
Reviewed by: silence on arch@
Files placement reviewed by: bde
Obtained from: smbfs
list into a public witness_list_locks() function. Call this function
twice in witness_list() instead of using an evil goto.
- Adjust the 'show locks' command to take an optional parameter which
specifies the pid of a process to list the locks of. By default the
locks held by the current process are displayed.
Also place the macros under #ifdef _KERNEL. Equally hide the internal
structures such as the freelist structs which include condition variables.
Reviewed by: bde
Mostly suggested by: bde
The mbuf and mcluster free lists now each "own" a condition variable,
m_starved.
- Clean up minor indentention issues in sys/mbuf.h caused by previous
commit.
Don't use atomic operations for the stats updating, instead protect
the counts with the mbuf mutex. Most twiddling of the stats was
done right before or after releasing a mutex. By doing this we
reduce the number of locked ops needed as well as allow a sysctl
to gain a consitant view of the entire stats structure.
In the future...
This will allow us to chain common mbuf operations that would
normally need to aquire/release 2 or 3 of the locks to build an
mbuf with a cluster or external data attached into a single op
requiring only one lock.
Simplify the per-cpu locks that are planned.
There's also some if (1) code that should check if the "how"
operation specifies blocking/non-blocking behavior, we _could_ make
it so that we hold onto the mutex through calls into kmem_alloc
when non-blocking requests are made, but for safety reasons we
currently drop and reaquire the mutex around the calls.
Also, note that calling kmem_alloc is rare and only happens during
a shortage so drop/re-getting the mutex will not be a common
occurance.
Remove some #define's that seemed to obfuscate the code to me.
Remove an extranious comment.
Remove an XXX, including mutex.h isn't a crime.
Reviewed by: bmilekic
operations on file descriptors, which complement the existing set of
calls, extattr_{delete,get,set}_file() which act on paths. In doing
so, restructure the system call implementation such that the two sets
of functions share most of the relevant code, rather than duplicating
it. This pushes the vnode locking into the shared code, but keeps
the copying in of some arguments in the system call code. Allowing
access via file descriptors reduces the opportunity for race
conditions when managing extended attributes.
Obtained from: TrustedBSD Project
two subject ucreds. Unlike p_cansee(), u_cansee() doesn't have
process lock requirements, only valid ucred reference requirements,
so is prefered as process locking improves. For now, back p_cansee()
into u_cansee(), but eventually p_cansee() will go away.
Reviewed by: jhb, tmm
Obtained from: TrustedBSD Project
- Introduce lock classes and lock objects. Each lock class specifies a
name and set of flags (or properties) shared by all locks of a given
type. Currently there are three lock classes: spin mutexes, sleep
mutexes, and sx locks. A lock object specifies properties of an
additional lock along with a lock name and all of the extra stuff needed
to make witness work with a given lock. This abstract lock stuff is
defined in sys/lock.h. The lockmgr constants, types, and prototypes have
been moved to sys/lockmgr.h. For temporary backwards compatability,
sys/lock.h includes sys/lockmgr.h.
- Replace proc->p_spinlocks with a per-CPU list, PCPU(spinlocks), of spin
locks held. By making this per-cpu, we do not have to jump through
magic hoops to deal with sched_lock changing ownership during context
switches.
- Replace proc->p_heldmtx, formerly a list of held sleep mutexes, with
proc->p_sleeplocks, which is a list of held sleep locks including sleep
mutexes and sx locks.
- Add helper macros for logging lock events via the KTR_LOCK KTR logging
level so that the log messages are consistent.
- Add some new flags that can be passed to mtx_init():
- MTX_NOWITNESS - specifies that this lock should be ignored by witness.
This is used for the mutex that blocks a sx lock for example.
- MTX_QUIET - this is not new, but you can pass this to mtx_init() now
and no events will be logged for this lock, so that one doesn't have
to change all the individual mtx_lock/unlock() operations.
- All lock objects maintain an initialized flag. Use this flag to export
a mtx_initialized() macro that can be safely called from drivers. Also,
we on longer walk the all_mtx list if MUTEX_DEBUG is defined as witness
performs the corresponding checks using the initialized flag.
- The lock order reversal messages have been improved to output slightly
more accurate file and line numbers.
and change the u_int mtx_saveintr member of struct mtx to a critical_t
mtx_savecrit.
- On the alpha we no longer need a custom _get_spin_lock() macro to avoid
an extra PAL call, so remove it.
- Partially fix using mutexes with WITNESS in modules. Change all the
_mtx_{un,}lock_{spin,}_flags() macros to accept explicit file and line
parameters and rename them to use a prefix of two underscores. Inside
of kern_mutex.c, generate wrapper functions for
_mtx_{un,}lock_{spin,}_flags() (only using a prefix of one underscore)
that are called from modules. The macros mtx_{un,}lock_{spin,}_flags()
are mapped to the __mtx_* macros inside of the kernel to inline the
usual case of mutex operations and map to the internal _mtx_* functions
in the module case so that modules will use WITNESS and KTR logging if
the kernel is compiled with support for it.
Some of the major changes include:
- The SCSI error handling portion of cam_periph_error() has
been broken out into a number of subfunctions to better
modularize the code that handles the hierarchy of SCSI errors.
As a result, the code is now much easier to read.
- String handling and error printing has been significantly
revamped. We now use sbufs to do string formatting instead
of using printfs (for the kernel) and snprintf/strncat (for
userland) as before.
There is a new catchall error printing routine,
cam_error_print() and its string-based counterpart,
cam_error_string() that allow the kernel and userland
applications to pass in a CCB and have errors printed out
properly, whether or not they're SCSI errors. Among other
things, this helped eliminate a fair amount of duplicate code
in camcontrol.
We now print out more information than before, including
the CAM status and SCSI status and the error recovery action
taken to remedy the problem.
- sbufs are now available in userland, via libsbuf. This
change was necessary since most of the error printing code
is shared between libcam and the kernel.
- A new transfer settings interface is included in this checkin.
This code is #ifdef'ed out, and is primarily intended to aid
discussion with HBA driver authors on the final form the
interface should take. There is example code in the ahc(4)
driver that implements the HBA driver side of the new
interface. The new transfer settings code won't be enabled
until we're ready to switch all HBA drivers over to the new
interface.
src/Makefile.inc1,
lib/Makefile: Add libsbuf. It must be built before libcam,
since libcam uses sbuf routines.
libcam/Makefile: libcam now depends on libsbuf.
libsbuf/Makefile: Add a makefile for libsbuf. This pulls in the
sbuf sources from sys/kern.
bsd.libnames.mk: Add LIBSBUF.
camcontrol/Makefile: Add -lsbuf. Since camcontrol is statically
linked, we can't depend on the dynamic linker
to pull in libsbuf.
camcontrol.c: Use cam_error_print() instead of checking for
CAM_SCSI_STATUS_ERROR on every failed CCB.
sbuf.9: Change the prototypes for sbuf_cat() and
sbuf_cpy() so that the source string is now a
const char *. This is more in line wth the
standard system string functions, and helps
eliminate warnings when dealing with a const
source buffer.
Fix a typo.
cam.c: Add description strings for the various CAM
error status values, as well as routines to
look up those strings.
Add new cam_error_string() and
cam_error_print() routines for userland and
the kernel.
cam.h: Add a new CAM flag, CAM_RETRY_SELTO.
Add enumerated types for the various options
available with cam_error_print() and
cam_error_string().
cam_ccb.h: Add new transfer negotiation structures/types.
Change inq_len in the ccb_getdev structure to
be "reserved". This field has never been
filled in, and will be removed when we next
bump the CAM version.
cam_debug.h: Fix typo.
cam_periph.c: Modularize cam_periph_error(). The SCSI error
handling part of cam_periph_error() is now
in camperiphscsistatuserror() and
camperiphscsisenseerror().
In cam_periph_lock(), increase the reference
count on the periph while we wait for our lock
attempt to succeed so that the periph won't go
away while we're sleeping.
cam_xpt.c: Add new transfer negotiation code. (ifdefed
out)
Add a new function, xpt_path_string(). This
is a string/sbuf analog to xpt_print_path().
scsi_all.c: Revamp string handing and error printing code.
We now use sbufs for much of the string
formatting code. More of that code is shared
between userland the kernel.
scsi_all.h: Get rid of SS_TURSTART, it wasn't terribly
useful in the first place.
Add a new error action, SS_REQSENSE. (Send a
request sense and then retry the command.)
This is useful when the controller hasn't
performed autosense for some reason.
Change the default actions around a bit.
scsi_cd.c,
scsi_da.c,
scsi_pt.c,
scsi_ses.c: SF_RETRY_SELTO -> CAM_RETRY_SELTO. Selection
timeouts shouldn't be covered by a sense flag.
scsi_pass.[ch]: SF_RETRY_SELTO -> CAM_RETRY_SELTO.
Get rid of the last vestiges of a read/write
interface.
libkern/bsearch.c,
sys/libkern.h,
conf/files: Add bsearch.c, which is needed for some of the
new table lookup routines.
aic7xxx_freebsd.c: Define AHC_NEW_TRAN_SETTINGS if
CAM_NEW_TRAN_CODE is defined.
sbuf.h,
subr_sbuf.c: Add the appropriate #ifdefs so sbufs can
compile and run in userland.
Change sbuf_printf() to use vsnprintf()
instead of kvprintf(), which is only available
in the kernel.
Change the source string for sbuf_cpy() and
sbuf_cat() to be a const char *.
Add __BEGIN_DECLS and __END_DECLS around
function prototypes since they're now exported
to userland.
kdump/mkioctls: Include stdio.h before cam.h since cam.h now
includes a function with a FILE * argument.
Submitted by: gibbs (mostly)
Reviewed by: jdp, marcel (libsbuf makefile changes)
Reviewed by: des (sbuf changes)
Reviewed by: ken
MCLGET macros in order to avoid incrementing the drop count twice.
Otherwise, in some cases, we may increment m_drops once in m_mballoc()
for example, and increment it again in m_mballoc_wait() if the
wait fails.
because libc/rpc/key_call.c references uname(), and ps/print.c also
defines uname(), and ps is linked statically. This leads to a symbol
clash. The userland uname(3) kinda sucked anyway as the hostname
etc was too short. And since the libc rpc interface now uses
the utsname.nodename which gets truncated, I was tempted into doing
something about it. Create a new userland uname function, called
__xuname() which takes an extra argument that allows you to change
the size of the fields. uname() becomes a static inline function
in sys/utsname.h that passes the extra argument in. struct utsname
has its field members expanded by default now in userland.
We still provide a 'uname' externally linkable function for things
that either think that they ``know'' the utsname format and assume
32 character strings and bypass the include file, or objects that
are linked against old libcs. ie: just about every plausible
case that I can think of is covered. Should we ever change the
default lengths again, a libc major bump should not be required
as the size is now passed to the function.
XXX the uname(2) in the kernel is for FreeBSD 1.1 binary compatability!
All the uname(3) functions that are exported to userland are actually
implemented in libc with sysctl. uname(1) uses sysctl directly and
does not call uname(3).
PR: bin/4688
acl_add_perm, acl_clear_perms, acl_copy_entry, acl_create_entry,
acl_delete_perm, acl_get_permset, acl_get_qualifier, acl_get_tag_type,
acl_set_permset, acl_set_qualifier, acl_set_tag_type
This brings us within 4 functions of a full ACL editing library.
Reviewed by: rwatson
Make struct cmessage visible from socket.h (about 4 places were
defining it for themselves which wasn't good)
Make __rpc_get_local_uid() useable and give it prototype that's
visible.
Fix some issues with printing out usernames from rpcbind and keyserv.
Allow the initial hash value to be passed in, as the examples do.
Incrementally hash in the dvp->v_id (using the official api) rather than
add it. This seems to help power-of-two predictable filename trees
where the filenames repeat on a power-of-two cycle and the directory trees
have power-of-two components in it. The simple add then mask was causing
things like 12000+ entry collision chains while most other entries have
between 0 and 3 entries each. This way seems to improve things.
o acl_calc_mask(): calculates the ACL mask entry associated with
the given ACL.
o acl_delete_entry(): remove a specified ACL entry from the given
ACL.
Approved by: rwatson
char *
FooFileChunk(const char *filename, char *buf, off_t offset, off_t length)
Which only hashes part of a file.
Implement FooFile() in terms of this function.
Submitted by: roam
Make the name cache hash as well as the nfsnode hash use it.
As a special tweak, create an unsigned version of register_t. This allows
us to use a special tweak for the 64 bit versions that significantly
speeds up the i386 version (ie: int64 XOR int64 is slower than int64
XOR int32).
The code layout is a little strange for the string function, but I was
able to get between 5 to 10% improvement over the original version I
started with. The layout affects gcc code generation choices and this way
was fastest on x86 and alpha.
Note that 'CPUTYPE=p3' etc makes a fair difference to this. It is
around 45% faster with -march=pentiumpro on a p6 cpu.
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.
o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to advoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.
o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.
o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.
o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.
o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.
Obtained from: TrustedBSD Project
used for up to "vfs.aio.max_buf_aio" of the requests. If a request
size is MAXPHYS, but the request base isn't page aligned, vmapbuf()
will map the end of the user space buffer into the start of the kva
allocated for the next physical buffer. Don't use a physical buffer
in this case. (This change addresses problem report 25617.)
When an aio_read/write() on a raw device has completed, timeout() is
used to schedule a signal to the process. Thus, the reporting is
delayed up to 10 ms (assuming hz is 100). The process might have
terminated in the meantime, causing a trap 12 when attempting to
deliver the signal. Thus, the timeout must be cancelled when removing
the job.
aio jobs in state JOBST_JOBQGLOBAL should be removed from the
kaio_jobqueue list during process rundown.
During process rundown, some aio jobs might move from one list to a
different list that has already been "emptied", causing the rundown to
be incomplete. Retry the rundown.
A call to BUF_KERNPROC() is needed after obtaining a physical buffer
to disassociate the lock from the running process since it can return
to userland without releasing that lock.
PR: 25617
Submitted by: tegge
if we hold a spin mutex, since we can trivially get into deadlocks if we
start switching out of processes that hold spinlocks. Checking to see if
interrupts were disabled was a sort of cheap way of doing this since most
of the time interrupts were only disabled when holding a spin lock. At
least on the i386. To fix this properly, use a per-process counter
p_spinlocks that counts the number of spin locks currently held, and
instead of checking to see if interrupts are disabled in the witness code,
check to see if we hold any spin locks. Since child processes always
start up with the sched lock magically held in fork_exit(), we initialize
p_spinlocks to 1 for child processes. Note that proc0 doesn't go through
fork_exit(), so it starts with no spin locks held.
Consulting from: cp
Since the compiler lays out the stuct so that pointers are naturally
(8-byte) aligned aligned, adding the int ki_layout didn't change the size of
the stuct; it just converted the alignment padding to a usable struct
field.