option is used (not on by default).
- In the case of trying to lock a mutex, if the MTX_CONTESTED flag is set,
then we can safely read the thread pointer from the mtx_lock member while
holding sched_lock. We then examine the thread to see if it is currently
executing on another CPU. If it is, then we keep looping instead of
blocking.
- In the case of trying to unlock a mutex, it is now possible for a mutex
to have MTX_CONTESTED set in mtx_lock but to not have any threads
actually blocked on it, so we need to handle that case. In that case,
we just release the lock as if MTX_CONTESTED was not set and return.
- We do not adaptively spin on Giant as Giant is held for long times and
it slows SMP systems down to a crawl (it was taking several minutes,
like 5-10 or so for my test alpha and sparc64 SMP boxes to boot up when
they adaptively spinned on Giant).
- We only compile in the code to do this for SMP kernels, it doesn't make
sense for UP kernels.
Tested on: i386, alpha, sparc64
IFS had its fingers deep in the belly of the UFS/FFS split. IFS
will be reimplemented by the maintainer at a later date.
Requested by: adrian (maintainer)
behavior by default. Also, change the options line to reflect this.
If there are no problems reported this will become the only behavior and the
knob will be removed in a month or so.
Demanded by: obrien
following sysctl variables:
debug.mutex.prof.enable enable / disable profiling
debug.mutex.prof.acquisitions number of mutex acquisitions recorded
debug.mutex.prof.records number of acquisition points recorded
debug.mutex.prof.maxrecords max number of acquisition points
debug.mutex.prof.rejected number of rejections (due to full table)
debug.mutex.prof.hashsize hash size
debug.mutex.prof.collisions number of hash collisions
debug.mutex.prof.stats profiling statistics
The code records four numbers for each acquisition point (identified by
source file name and line number): longest time held, total time held,
number of non-recursive acquisitions, average time held. The measurements
are in clock cycles (as returned by get_cyclecount(9)); this may cause
measurements on some SMP systems to be unreliable. This can probably be
worked around by replacing get_cyclecount(9) by some incarnation of
nanotime(9).
This work was derived from initial patches by eivind.
dump the trace buffer feasible.
- Remove KTR_EXTEND. This changes the format of the trace entries when
activated, making writing a userland tool which is not tied to a specific
kernel configuration difficult.
- Use get_cyclecount() for timestamps. nanotime() is much too heavy weight
and requires recursion protection due to ktr traces occuring as a result
of ktr traces. KTR_VERBOSE may still require recursion protection, which
is now conditional on it.
- Allow KTR_CPU to be overridden by MD code. This is so that it is possible
to trace early in startup before pcpu and/or curthread are setup.
- Add a version number for the ktr interface. A userland tool can check this
to detect mismatches.
- Use an array for the parameters to make decoding in userland easier.
- Add file and line recording to the non-extended traces now that the extended
version is no more.
These changes will break gdb macros to decode the extended version of the
trace buffer which are floating around. Users of these macros should either
use the show ktr command in ddb, or use the userland utility which can be run
on a core dump.
Approved by: jhb
Tested on: i386, sparc64
The stat() and open() calls have been changed to make use of this new functionality. Using shared locks in
these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes. Also,
this reduces the number of exclusive locks that are floating around in the system, which helps reduce the
number of deadlocks that occur.
A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so this patch can be turned on for
testing, and should eventually go away once it is proven to be stable. I have personally been running this
patch for over a year now, so it is believed to be fully stable.
Reviewed by: jake, obrien
Approved by: jake
- Temporary fix a bug of Intel ACPI CA core code.
- Add OS layer ACPI mutex support. This can be disabled by
specifying option ACPI_NO_SEMAPHORES.
- Add ACPI threading support. Now that we have a dedicate taskqueue for
ACPI tasks and more ACPI task threads can be created by specifying option
ACPI_MAX_THREADS.
- Change acpi_EvaluateIntoBuffer() behavior slightly to reuse given
caller's buffer unless AE_BUFFER_OVERFLOW occurs. Also CM battery's
evaluations were changed to use acpi_EvaluateIntoBuffer().
- Add new utility function acpi_ConvertBufferToInteger().
- Add simple locking for CM battery and temperature updating.
- Fix a minor problem on EC locking.
- Make the thermal zone polling rate to be changeable.
- Change minor things on AcpiOsSignal(); in ACPI_SIGNAL_FATAL case,
entering Debugger is easier to investigate the problem rather than panic.
built without support for miibus PHYs. Most ed cards don't need
miibus support, so it's useful to be able to avoid the bloat of
all the mii devices for small fixed-purpose kernels.
then one can restart from a panic by resetting the panicstr variable to
NULL. This commit conditionalizes the previously committed functionality
on this variable. It also removes the __dead2 attribute from the panic()
function so that when one continues from a panic() the behavior will
be predictable.
information. The default limits only effect machines with > 1GB of ram
and can be overriden with two new kernel conf variables VM_SWZONE_SIZE_MAX
and VM_BCACHE_SIZE_MAX, or with loader variables kern.maxswzone and
kern.maxbcache. This has the effect of leaving more KVM available for
sizing NMBCLUSTERS and 'maxusers' and should avoid tripups where a sysad
adds memory to a machine and then sees the kernel panic on boot due to
running out of KVM.
Also change the default swap-meta auto-sizing calculation to allocate half
of what it was previously allocating. The prior defaults were way too high.
Note that we cannot afford to run out of swap-meta structures so we still
stay somewhat conservative here.
directories. When enabled via "options UFS_DIRHASH", in-core hash
arrays are maintained for large directories. These allow all
directory operations to take place quickly instead of requiring
long linear searches. For now anyway, dirhash is not enabled by
default.
The in-core hash arrays have a memory requirement that is approximately
half the size of the size of the on-disk directory file. A number
of new sysctl variables allow control over which directories get
hashed and over the maximum amount of memory that dirhash will use:
vfs.ufs.dirhash_minsize
The minimum on-disk directory size for which hashing should be
used. The default is 2560 (2.5k).
vfs.ufs.dirhash_maxmem
The system-wide maximum total memory to be used by dirhash data
structures. The default is 2097152 (2MB).
The current amount of memory being used by dirhash is visible
through the read-only sysctl variable vfs.ufs.dirhash_maxmem.
Finally, some extra sanity checks that are enabled by default, but
which may have an impact on performance, can be disabled by setting
vfs.ufs.dirhash_docheck to 0.
Discussed on: -fs, -hackers
1. Add SA_IO_TIMEOUT as an option (4 minutes default) to cover reads,
writes, wfm, test unit ready.
2. Add internal SCSIOP_TIMEOUT (e.g., for mode sense) at 1 minute. This
should not require an option, but is cleaner to parameterize.
MFC after: 1 week
This closes a minor information leak which allows a remote observer to
determine the rate at which the machine is generating packets, since the
default behaviour is to increment a counter for each packet sent.
Reviewed by: -net
Obtained from: OpenBSD
systems were repo-copied from sys/miscfs to sys/fs.
- Renamed the following file systems and their modules:
fdesc -> fdescfs, portal -> portalfs, union -> unionfs.
- Renamed corresponding kernel options:
FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.
- Install header files for the above file systems.
- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
Makefiles.
If for some reason DEVFS is undesired, the "NODEVFS" option is
needed now.
Pending any significant issues, DEVFS will be made mandatory in
-current on july 1st so that we can start reaping the full
benefits of having it.
interfaces and functionality intended for use during correctness and
regression testing. Features enabled by "options REGRESSION" may
in and of themselves introduce security or correctness problems if
used improperly, and so are not intended for use in production
systems, only in testing environments.
Obtained from: TrustedBSD Project
Add simple "xlat" converter which performs 8to8 table based conversion.
Unicode converter will be added in the near future.
Reviewed by: silence on arch@
Files placement reviewed by: bde
Obtained from: smbfs
implementation is still experimental, and while fairly broadly tested,
is not yet intended for production use. Support for POSIX.1e ACLs on
UFS will not be MFC'd to RELENG_4.
This implementation works by providing implementations of VOP_[GS]ETACL()
for FFS, as well as modifying the appropriate access control and file
creation routines. In this implementation, ACLs are backed into extended
attributes; the base ACL (owner, group, other) permissions remain in the
inode for performance and compatibility reasons, so only the extended and
default ACLs are placed in extended attributes. The logic for ACL
evaluation is provided by the fs-independent kern/kern_acl.c.
o Introduce UFS_ACL, a compile-time configuration option that enables
support for ACLs on FFS (and potentially other UFS-based file systems).
o Introduce ufs_getacl(), ufs_setacl(), ufs_aclcheck(), which
respectively get, set, and check the ACLs on the passed vnode.
o Introduce ufs_sync_acl_from_inode(), ufs_sync_inode_from_acl() to
maintain access control information between inode permissions and
extended attribute data.
o Modify ufs_access() to load a file access ACL and invoke
vaccess_acl_posix1e() if ACLs are available on the file system
o Modify ufs_mkdir() and ufs_makeinode() to associate ACLs with newly
created directories and files, inheriting from the parent directory's
default ACL.
o Enable these new vnode operations and conditionally compiled code
paths if UFS_ACL is defined.
A few notes:
o This implementation is fairly widely tested, but still should be
considered experimental.
o Currently, ACLs are not exported via NFS, instead, the summarizing
file mode/etc from the inode is. This results in conservative
protection behavior, similar to the behavior of ACL-nonaware programs
acting locally.
o It is possible that underlying binary data formats associated with
this implementation may change. Consumers of the implementation
should expect to find their local configuration obsoleted in the
next few months, resulting in possible loss of ACL data during an
upgrade.
o The extended attributes interface and implementation is still
undergoing modification to address portable interface concerns, as
well as performance.
o Many applications do not yet correctly handle ACLs. In general,
due to the POSIX.1e ACL model, behavior of ACL-unaware applications
will be conservative with respects to file protection; some caution
is recommended.
o Instructions for configuring and maintaining ACLs on UFS will be
committed in the near future; in the mean time it is possible to
reference the README included in the last UFS ACL distribution
placed in the TrustedBSD web site:
http://www.TrustedBSD.org/downloads/
Substantial debugging, hardware, travel, or connectivity support for this
project was provided by: BSDi, Safeport Network Services, and NAI Labs.
Significant coding contributions were made by Chris Faulhaber. Additional
support was provided by Brian Feldman, Thomas Moestl, and Ilmar Habibulin.
Reviewed by: jedgar, keichii, mckusick, trustedbsd-discuss, freebsd-fs
Obtained from: TrustedBSD Project
very specific scenarios, and now that we have had net.inet.tcp.blackhole for
quite some time there is really no reason to use it any more.
(first of three commits)
options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively. This change
reflects the fact that our EA support is implemented entirely at the
UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount
events). This also better reflects the fact that [shortly] MFS will also
support EAs, as well as possibly IFS.
o Consumers of the EA support in FFS are reminded that as a result, they
must change kernel config files to reflect the new option names.
Obtained from: TrustedBSD Project
"options FFS_EXTATTR". When extended attribute auto-starting
is enabled, FFS will scan the .attribute directory off of the
root of each file system, as it is mounted. If .attribute
exists, EA support will be started for the file system. If
there are files in the directory, FFS will attempt to start
them as attribute backing files for attributes baring the same
name. All attributes are started before access to the file
system is permitted, so this permits race-free enabling of
attributes. For attributes backing support for security
features, such as ACLs, MAC, Capabilities, this is vital, as
it prevents the file system attributes from getting out of
sync as a result of file system operations between mount-time
and the enabling of the extended attribute. The userland
extattrctl tool will still function exactly as previously.
Files must be placed directly in .attribute, which must be
directly off of the file system root: symbolic links are
not permitted. FFS_EXTATTR will continue to be able
to function without FFS_EXTATTR_AUTOSTART for sites that do not
want/require auto-starting. If you're using the UFS_ACL code
available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART
is recommended.
o This support is implemented by adding an invocation of
ufs_extattr_autostart() to ffs_mountfs(). In addition,
several new supporting calls are introduced in
ufs_extattr.c:
ufs_extattr_autostart(): start EAs on the specified mount
ufs_extattr_lookup(): given a directory and filename,
return the vnode for the file.
ufs_extattr_enable_with_open(): invoke ufs_extattr_enable()
after doing the equililent of vn_open()
on the passed file.
ufs_extattr_iterate_directory(): iterate over a directory,
invoking ufs_extattr_lookup() and
ufs_extattr_enable_with_open() on each
entry.
o This feature is not widely tested, and therefore may contain
bugs, caution is advised. Several changes are in the pipeline
for this feature, including breaking out of EA namespaces into
subdirectories of .attribute (this is waiting on the updated
EA API), as well as a per-filesystem flag indicating whether
or not EAs should be auto-started. This is required because
administrators may not want .attribute auto-started on all
file systems, especially if non-administrators have write access
to the root of a file system.
Obtained from: TrustedBSD Project
valid) if BPF is missing.
The netgraph_bpf node forced bpf to be present, reflect that in the
options.
Stop doing a 'count bpf' - we provide stubs.
Since a handful of drivers still refer to "bpf.h", provide a more accurate
indication that the API is present always. (eg: netinet6)
exactly the same functionality via a sysctl, making this feature
a run-time option.
The default is 1(ON), which means that /dev/random device will
NOT block at startup.
setting kern.random.sys.seeded to 0(OFF) will cause /dev/random
to block until the next reseed, at which stage the sysctl
will be changed back to 1(ON).
While I'm here, clean up the sysctls, and make them dynamic.
Reviewed by: des
Tested on Alpha by: obrien
- Break out the /dev/pci driver into a separate file.
- Kill the COMPAT_OLDPCI support.
- Make the EISA bridge attach a bit more like the old code; explicitly
check for the existence of eisa0/isa0 and only attach if they don't
already exist. Only make one bus_generic_attach() pass over the
bridge, once both busses are attached. Note that the stupid Intel
bridge's class is entirely unpredictable.
- Add prototypes and re-layout the core PCI modules in line with
current coding standards (not a major whitespace change, just moving
the module data to the top of the file).
- Remove redundant type-2 bridge support from the core PCI code; the
PCI-CardBus code does this itself internally. Remove the now
entirely redundant header-class-specific support, as well as the
secondary and subordinate bus number fields. These are bridge
attributes now.
- Add support for PCI Extended Capabilities.
- Add support for PCI Power Management. The interface currently
allows a driver to query and set the power state of a device.
- Add helper functions to allow drivers to enable/disable busmastering
and the decoding of I/O and memory ranges.
- Use PCI_SLOTMAX and PCI_FUNCMAX rather than magic numbers in some
places.
- Make the PCI-PCI bridge code a little more paranoid about valid
I/O and memory decodes.
- Add some more PCI register definitions for the command and status
registers. Correct another bogus definition for type-1 bridges.
function declared in kern_ktr.c. The only inline checks left are the
checks that compare KTR_COMPILE with the supplied mask and thus should
be optimized away into either nothing or a direct call to ktr_tracepoint().
- Move several KTR-related options to opt_ktr.h now that they are only
needed by kern_ktr.c and not by ktr.h.
- Add in the ktr_verbose functionality if KTR_EXTEND is turned on. If the
global variable 'ktr_verbose' is non-zero, then KTR messages will be
dumped to the console. This variable can be set by either kernel code
or via the 'debug.ktr_verbose' sysctl. It defaults to off unless the
KTR_VERBOSE kernel option is specified in which case it defaults to on.
This can be useful when the machine locks up spinning in a loop with
interrupts disabled as you might be able to see what it is doing when it
locks up.
Requested by: phk
reducues the maintenance load for the mutex code. The only MD portions
of the mutex code are in machine/mutex.h now, which include the assembly
macros for handling mutexes as well as optionally overriding the mutex
micro-operations. For example, we use optimized micro-ops on the x86
platform #ifndef I386_CPU.
- Change the behavior of the SMP_DEBUG kernel option. In the new code,
mtx_assert() only depends on INVARIANTS, allowing other kernel developers
to have working mutex assertiions without having to include all of the
mutex debugging code. The SMP_DEBUG kernel option has been renamed to
MUTEX_DEBUG and now just controls extra mutex debugging code.
- Abolish the ugly mtx_f hack. Instead, we dynamically allocate
seperate mtx_debug structures on the fly in mtx_init, except for mutexes
that are initiated very early in the boot process. These mutexes
are declared using a special MUTEX_DECLARE() macro, and use a new
flag MTX_COLD when calling mtx_init. This is still somewhat hackish,
but it is less evil than the mtx_f filler struct, and the mtx struct is
now the same size with and without mutex debugging code.
- Add some micro-micro-operation macros for doing the actual atomic
operations on the mutex mtx_lock field to make it easier for other archs
to override/optimize mutex ops if needed. These new tiny ops also clean
up the code in some places by replacing long atomic operation function
calls that spanned 2-3 lines with a short 1-line macro call.
- Don't call mi_switch() from mtx_enter_hard() when we block while trying
to obtain a sleep mutex. Calling mi_switch() would bogusly release
Giant before switching to the next process. Instead, inline most of the
code from mi_switch() in the mtx_enter_hard() function. Note that when
we finally kill Giant we can back this out and go back to calling
mi_switch().
description:
How it works:
--
Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.)
I didn't see the need in duplicating all of sys/ufs/ffs to get this
off the ground.
File creation is done through a special file - 'newfile' . When newfile
is called, the system allocates and returns an inode. Note that newfile
is done in a cloning fashion:
fd = open("newfile", O_CREAT|O_RDWR, 0644);
fstat(fd, &st);
printf("new file is %d\n", (int)st.st_ino);
Once you have created a file, you can open() and unlink() it by its returned
inode number retrieved from the stat call, ie:
fd = open("5", O_RDWR);
The creation permissions depend entirely if you have write access to the
root directory of the filesystem.
To get the list of currently allocated inodes, VOP_READDIR has been added
which returns a directory listing of those currently allocated.
--
What this entails:
* patching conf/files and conf/options to include IFS as a new compile
option (and since ifs depends upon FFS, include the FFS routines)
* An entry in i386/conf/NOTES indicating IFS exists and where to go for
an explanation
* Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS
routines require (ffs_mount() and ffs_reload())
* a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS
routines. IFS replaces some of the vfsops, and a handful of vnops -
most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR().
Any other directory operation is marked as invalid.
What this results in:
* an IFS partition's create permissions are controlled by the perm/ownership of
the root mount point, just like a normal directory
* Each inode has perm and ownership too
* IFS does *NOT* mean an FFS partition can be opened per inode. This is a
completely seperate filesystem here
* Softupdates doesn't work with IFS, and really I don't think it needs it.
Besides, fsck's are FAST. (Try it :-)
* Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC).
Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against
this particular inode, and unravelling THAT code isn't trivial. Therefore,
useful inodes start at 3.
Enjoy, and feedback is definitely appreciated!
that it's enabled in acpireg.h only if DIAGNOSTIC option is specified.
ACPICA OSD functions will be compiled in machine/acpi_machdep.c again
tentatively (if DIAGNOSTIC option is specified).
# Should we have acpica_osd.c ?
include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).
Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
to recycle inodes after a destroy_dev() but not until all mounts
have picked up the change.
Add support for an overflow table for DEVFS inodes. The static
table defaults to 1024 inodes, if that fills, an overflow table
of 32k inodes is allocated. Both numbers can be changed at
compile time, the size of the overflow table also with the
sysctl vfs.devfs.noverflow.
Use atomic instructions to barrier between make_dev()/destroy_dev()
and the mounts.
Add lockmgr() locking of directories for operations accessing or
modifying the directory TAILQs.
Various nitpicking here and there.