available.
Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.
The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).
Suggested by: tegge
Approved by: phk
Add TI4451 as well.
These are untested since I don't have the hardware to test against.
Also, some O2Micro devices are #define w/o numbers as place holders so that
I can encourage people to submit them when they appear in the channels.
parts. This is based on the newcard code that turns it off :-). We
can now reboot after NEWCARD or Windows and have OLDCARD work. Add
support for the RL5C466 while I'm at it.
Treat TI1031 the same as the CLPD6832. It doesn't work yet, but sucks
less than it did before.
Also add a few #defines for other changes in the pipe.
and __i386__ are defined rather than if SMP and BETTER_CLOCK are defined.
The removal of BETTER_CLOCK would have broken this except that kern_clock.c
doesn't include <machine/smptests.h>, so it doesn't see the definition of
BETTER_CLOCK, and forward_*clock aren't called, even on 4.x. This seems to
fix the problem where a n-way SMP system would see 100 * n clk interrupts
and 128 * n rtc interrupts.
and AS4100s into single user mode. This work was done jointly by jhb and
myself, and builds on dfr's earlier work.
smp_init_secondary() / smp_start_secondary()
- use the uniq val to pass the globalp (me)
- fancy footwork to take any pending machine checks (me)
- doing things the FreeBSD way and getting the per-cpu idleproc created
correctly, and synchronizing the startup of secondaries (jhb)
mp_start()
- better recognition of available cpus (jhb)
smp_rendezvous()
- if smp hasn't started, only run the rendezvous function on the current
cpu. Sleuthing and (prior) incorrect fix by me, correct fix by jhb
smp_handle_ipi()
- more verbose handling of console messages (jhb)
- grab sched lock around setting PS_ASTPENDING (jhb)
forward_*clock()
- commented out. Joint decision by dfr, jhb and myself
General synchronization improvements (more mb()s, etc) (jhb)
Printf cleanups (joint)
Whitespace cleanups (jhb)
- don't do the stack overflow sanity check on MP systems -- p->p_addr
will be malloc'ed memory (not K0SEG) and the check will fail.
- don't ignore clock interrupts on secondaries. Alphas apparently
roundrobin clock interrupts to all cpus, so we're going to take clock
interrupts on all CPUS and not forward them.
- use the unique value to save the per-cpu globalp struct like the
comment says
- don't lower the ipl to ALPHA_PSL_IPL_HIGH: we may have a pending machine
check to take and we're not prepared for that yet, as we haven't setup
our interrupt entry points. (this may only happen on sable/lynx)
- indicate the fact that the working version of smp_init_secondary() doesn't
return (this is tied up in other changes and hasn't yet been committed).
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.
This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.
The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicing
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:
ffs_clusteralloc: map mismatch
ffs_nodealloccg: map corrupted
ffs_nodealloccg: block not in map
ffs_alloccg: map corrupted
ffs_alloccg: block not in map
ffs_alloccgblk: cyl groups corrupted
ffs_alloccgblk: can't find blk in cyl
ffs_checkblk: partially free fragment
The following panics are less likely to be related to this problem,
but might be helped by these debugging options:
ffs_valloc: dup alloc
ffs_blkfree: freeing free block
ffs_blkfree: freeing free frag
ffs_vfree: freeing free inode
If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.
ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the
access ACL could be used by privileged processes to change file/directory
ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and
ACL_OTHER) should have undefined ae_id fields; this commit attempts
to correct that misunderstanding.
o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid
associated with the vnode, as those can no longer be extracted from
the ACL passed as an argument. Perform all comparisons against
the passed arguments. This actually has the effect of simplifying
a number of components of this call, as well as reducing the indent
level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP.
o Modify acl_posix1e_check() to return EINVAL if the ae_id field of
any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value
other than ACL_UNDEFINED_ID. As a temporary work-around to allow
clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before
each check so that this cannot cause a failure in the short term
(this work-around will be removed when the userland libraries and
utilities are updated to take this change into account).
o Modify ufs_sync_acl_from_inode() so that it forces
ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID
when synchronizing the ACL from the inode.
o Modify ufs_sync_inode_from_acl to not propagate uid and gid
information to the inode from the ACL during ACL update. Also
modify the masking of permission bits that may be set from
ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not
carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT).
o Modify ufs_getacl() so that when it emulates an access ACL from
the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.
o Clean up ufs_setacl() substantially since it is no longer possible
to perform chown/chgrp operations using vop_setacl(), so all the
access control for that can be eliminated.
o Modify ufs_access() so that it passes owner uid and gid information
into vaccess_acl_posix1e().
Pointed out by: jedger
Obtained from: TrustedBSD Project
panic_cpu shared variable. I used a simple atomic operation here instead
of a spin lock as it seemed to be excessive overhead. Also, this can avoid
recursive panics if, for example, witness is broken.
can happen if witness runs out of resources during initialization or if
witness_skipspin is enabled.
Sleuthing by: Peter Jeremy <peter.jeremy@alcatel.com.au>
"inside" of locked regions. That is, an acquire atomic operation will
always enforce a memory barrier after the atomic operation and a release
operation will always enforce a memory barrier before the atomic
operation.
- Explicitly use 'mb' instead of 'wmb' in release atomic operations. The
'wmb' memory barrier is not strong enough to guarantee coherence with
other processors. This is effectively a nop since alpha_wmb() actually
performs a 'mb' and not a 'wmb', but I wanted the code to be more
correct since at some point in the future alpha_wmb()'s implementation
may switch to being a real 'wmb'.
we should call ast(). This allows us to branch to a separate Lkernelret
label so we can fixup the saved t7 register in the trapframe. Otherwise
we can run into a problem on SMP systems where a process is interrupted by
a trap or interrupt on one CPU, migrates to another CPU, and then returns
with the t7 in the stack clobbering the CPU's t7. As a result, two CPU's
would both point to the same per-CPU data and things would go downhill from
there.
Sleuthing help by: gallatin
- Add a new ddb command: 'show pcpu' similar to the i386 command added
recently. By default it displays the current CPU's info, but an optional
argument can specify the logical ID of a specific CPU to examine.
bcopy would go off the end of the array by two elements, which sometimes
causes a panic if it happens to cross into a page that isn't mapped.
Submitted by: gibbs
Reviewed by: peter
are some good reasons for not doing this, even if the linting of
the code breaks.
1) If lint were ever to understand the stuff inside the macros,
that would break the checks.
2) There are ways to use __GNUC__ to exclude overly specific
code.
3) (Not yet practical) Lint(1) needs to properlyu understand
all of te code we actually run.
Complained about by: bde
Education by: jake, jhb, eivind
It is described in ufs/ffs/fs.h as follows:
/*
* Filesystem flags.
*
* Note that the FS_NEEDSFSCK flag is set and cleared only by the
* fsck utility. It is set when background fsck finds an unexpected
* inconsistency which requires a traditional foreground fsck to be
* run. Such inconsistencies should only be found after an uncorrectable
* disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
* it has successfully cleaned up the filesystem. The kernel uses this
* flag to enforce that inconsistent filesystems be mounted read-only.
*/
#define FS_UNCLEAN 0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */
and non-P_SUGID cases, simplify p_cansignal() logic so that the
P_SUGID masking of possible signals is independent from uid checks,
removing redundant code and generally improving readability.
Reviewed by: tmm
Obtained from: TrustedBSD Project
the ability of unprivileged processes to deliver arbitrary signals
to daemons temporarily taking on unprivileged effective credentials
when P_SUGID is not set on the target process:
Removed:
(p1->p_cred->cr_ruid != ps->p_cred->cr_uid)
(p1->p_ucred->cr_uid != ps->p_cred->cr_uid)
o Replace two "allow this" exceptions in p_cansignal() restricting
the ability of unprivileged processes to deliver arbitrary signals
to daemons temporarily taking on unprivileged effective credentials
when P_SUGID is set on the target process:
Replaced:
(p1->p_cred->p_ruid != p2->p_ucred->cr_uid)
(p1->p_cred->cr_uid != p2->p_ucred->cr_uid)
With:
(p1->p_cred->p_ruid != p2->p_ucred->p_svuid)
(p1->p_ucred->cr_uid != p2->p_ucred->p_svuid)
o These changes have the effect of making the uid-based handling of
both P_SUGID and non-P_SUGID signal delivery consistent, following
these four general cases:
p1's ruid equals p2's ruid
p1's euid equals p2's ruid
p1's ruid equals p2's svuid
p1's euid equals p2's svuid
The P_SUGID and non-P_SUGID cases can now be largely collapsed,
and I'll commit this in a few days if no immediate problems are
encountered with this set of changes.
o These changes remove a number of warning cases identified by the
proc_to_proc inter-process authorization regression test.
o As these are new restrictions, we'll have to watch out carefully for
possible side effects on running code: they seem reasonable to me,
but it's possible this change might have to be backed out if problems
are experienced.
Submitted by: src/tools/regression/security/proc_to_proc/testuid
Reviewed by: tmm
Obtained from: TrustedBSD Project
ability of unprivileged processes to modify the scheduling properties
of daemons temporarily taking on unprivileged effective credentials.
These cases (p1->p_cred->p_ruid == p2->p_ucred->cr_uid) and
(p1->p_ucred->cr_uid == p2->p_ucred->cr_uid), respectively permitting
a subject process to influence the scheduling of a daemon if the subject
process has the same real uid or effective uid as the daemon's effective
uid. This removes a number of the warning cases identified by the
proc_to_proc iner-process authorization regression test.
o As these are new restrictions, we'll have to watch out carefully for
possible side effects on running code: they seem reasonable to me,
but it's possible this change might have to be backed out if problems
are experienced.
Reported by: src/tools/regression/security/proc_to_proc/testuid
Obtained from: TrustedBSD Project
by p_can(...P_CAN_SEE), rather than returning EACCES directly. This
brings the error code used here into line with similar arrangements
elsewhere, and prevents the leakage of pid usage information.
Reviewed by: jlemon
Obtained from: TrustedBSD Project
p_can(...P_CAN_SEE...) to getpgid(), getsid(), and setpgid(),
blocking these operations on processes that should not be visible
by the requesting process. Required to reduce information leakage
in MAC environments.
Obtained from: TrustedBSD Project
from signal authorization checking.
o p_cansignal() takes three arguments: subject process, object process,
and signal number, unlike p_cankill(), which only took into account
the processes and not the signal number, improving the abstraction
such that CANSIGNAL() from kern_sig.c can now also be eliminated;
previously CANSIGNAL() special-cased the handling of SIGCONT based
on process session. privused is now deprecated.
o The new p_cansignal() further limits the set of signals that may
be delivered to processes with P_SUGID set, and restructures the
access control check to allow it to be extended more easily.
o These changes take into account work done by the OpenBSD Project,
as well as by Robert Watson and Thomas Moestl on the TrustedBSD
Project.
Obtained from: TrustedBSD Project
toggle the P_SUGID bit explicitly, rather than relying on it being
set implicitly by other protection and credential logic. This feature
is introduced to support inter-process authorization regression testing
by simplifying userland credential management allowing the easy
isolation and reproduction of authorization events with specific
security contexts. This feature is enabled only by "options REGRESSION"
and is not intended to be used by applications. While the feature is
not known to introduce security vulnerabilities, it does allow
processes to enter previously inaccessible parts of the credential
state machine, and is therefore disabled by default. It may not
constitute a risk, and therefore in the future pending further analysis
(and appropriate need) may become a published interface.
Obtained from: TrustedBSD Project
interfaces and functionality intended for use during correctness and
regression testing. Features enabled by "options REGRESSION" may
in and of themselves introduce security or correctness problems if
used improperly, and so are not intended for use in production
systems, only in testing environments.
Obtained from: TrustedBSD Project
enable easy access to the hash chain stats. The raw prefixed versions
dump an integer array to userland with the chain lengths. This cheats
and calls it an array of 'struct int' rather than 'int' or sysctl -a
faithfully dumps out the 128K array on an average machine. The non-raw
versions return 4 integers: count, number of chains used, maximum chain
length, and percentage utilization (fixed point, multiplied by 100).
The raw forms are more useful for analyzing the hash distribution, while
the other form can be read easily by humans and stats loggers.
API for IPI's that isn't tied to the Intel APIC. MD code can still use
the apic_ipi() function or dink with the apic directly if needed to send
MD IPI's.
because:
- it used a better namespace (smp_ipi_* rather than *_ipi),
- it used better constant names for the IPI's (IPI_* rather than
X*_OFFSET), and
- this API also somewhat exists for both alpha and ia64 already.
codecs. Also, add some additional code to check for future cards without
this feature - attempting to initialise them as AC97 cards will hang the
machine.
PR: 26427
Reviewed by: cg
His description of the problem and solution follow. My own tests show
speedups on typical filesystem intensive workloads of 5% to 12% which
is very impressive considering the small amount of code change involved.
------
One day I noticed that some file operations run much faster on
small file systems then on big ones. I've looked at the ffs
algorithms, thought about them, and redesigned the dirpref algorithm.
First I want to describe the results of my tests. These results are old
and I have improved the algorithm after these tests were done. Nevertheless
they show how big the perfomance speedup may be. I have done two file/directory
intensive tests on a two OpenBSD systems with old and new dirpref algorithm.
The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports".
The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release.
It contains 6596 directories and 13868 files. The test systems are:
1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for
test is at wd1. Size of test file system is 8 Gb, number of cg=991,
size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current
from Dec 2000 with BUFCACHEPERCENT=35
2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system
at wd0, file system for test is at wd1. Size of test file system is 40 Gb,
number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k
OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50
You can get more info about the test systems and methods at:
http://www.ptci.ru/gluk/dirpref/old/dirpref.html
Test Results
tar -xzf ports.tar.gz rm -rf ports
mode old dirpref new dirpref speedup old dirprefnew dirpref speedup
First system
normal 667 472 1.41 477 331 1.44
async 285 144 1.98 130 14 9.29
sync 768 616 1.25 477 334 1.43
softdep 413 252 1.64 241 38 6.34
Second system
normal 329 81 4.06 263.5 93.5 2.81
async 302 25.7 11.75 112 2.26 49.56
sync 281 57.0 4.93 263 90.5 2.9
softdep 341 40.6 8.4 284 4.76 59.66
"old dirpref" and "new dirpref" columns give a test time in seconds.
speedup - speed increasement in times, ie. old dirpref / new dirpref.
------
Algorithm description
The old dirpref algorithm is described in comments:
/*
* Find a cylinder to place a directory.
*
* The policy implemented by this algorithm is to select from
* among those cylinder groups with above the average number of
* free inodes, the one with the smallest number of directories.
*/
A new directory is allocated in a different cylinder groups than its
parent directory resulting in a directory tree that is spreaded across
all the cylinder groups. This spreading out results in a non-optimal
access to the directories and files. When we have a small filesystem
it is not a problem but when the filesystem is big then perfomance
degradation becomes very apparent.
What I mean by a big file system ?
1. A big filesystem is a filesystem which occupy 20-30 or more percent
of total drive space, i.e. first and last cylinder are physically
located relatively far from each other.
2. It has a relatively large number of cylinder groups, for example
more cylinder groups than 50% of the buffers in the buffer cache.
The first results in long access times, while the second results in
many buffers being used by metadata operations. Such operations use
cylinder group blocks and on-disk inode blocks. The cylinder group
block (fs->fs_cblkno) contains struct cg, inode and block bit maps.
It is 2k in size for the default filesystem parameters. If new and
parent directories are located in different cylinder groups then the
system performs more input/output operations and uses more buffers.
On filesystems with many cylinder groups, lots of cache buffers are
used for metadata operations.
My solution for this problem is very simple. I allocate many directories
in one cylinder group. I also do some things, so that the new allocation
method does not cause excessive fragmentation and all directory inodes
will not be located at a location far from its file's inodes and data.
The algorithm is:
/*
* Find a cylinder group to place a directory.
*
* The policy implemented by this algorithm is to allocate a
* directory inode in the same cylinder group as its parent
* directory, but also to reserve space for its files inodes
* and data. Restrict the number of directories which may be
* allocated one after another in the same cylinder group
* without intervening allocation of files.
*
* If we allocate a first level directory then force allocation
* in another cylinder group.
*/
My early versions of dirpref give me a good results for a wide range of
file operations and different filesystem capacities except one case:
those applications that create their entire directory structure first
and only later fill this structure with files.
My solution for such and similar cases is to limit a number of
directories which may be created one after another in the same cylinder
group without intervening file creations. For this purpose, I allocate
an array of counters at mount time. This array is linked to the superblock
fs->fs_contigdirs[cg]. Each time a directory is created the counter
increases and each time a file is created the counter decreases. A 60Gb
filesystem with 8mb/cg requires 10kb of memory for the counters array.
The maxcontigdirs is a maximum number of directories which may be created
without an intervening file creation. I found in my tests that the best
performance occurs when I restrict the number of directories in one cylinder
group such that all its files may be located in the same cylinder group.
There may be some deterioration in performance if all the file inodes
are in the same cylinder group as its containing directory, but their
data partially resides in a different cylinder group. The maxcontigdirs
value is calculated to try to prevent this condition. Since there is
no way to know how many files and directories will be allocated later
I added two optimization parameters in superblock/tunefs. They are:
int32_t fs_avgfilesize; /* expected average file size */
int32_t fs_avgfpdir; /* expected # of files per directory */
These parameters have reasonable defaults but may be tweeked for special
uses of a filesystem. They are only necessary in rare cases like better
tuning a filesystem being used to store a squid cache.
I have been using this algorithm for about 3 months. I have done
a lot of testing on filesystems with different capacities, average
filesize, average number of files per directory, and so on. I think
this algorithm has no negative impact on filesystem perfomance. It
works better than the default one in all cases. The new dirpref
will greatly improve untarring/removing/coping of big directories,
decrease load on cvs servers and much more. The new dirpref doesn't
speedup a compilation process, but also doesn't slow it down.
Obtained from: Grigoriy Orlov <gluk@ptci.ru>
s/1518/ETHER_MAX_LEN
Some style changes, add some braces, mostly residual from having
a lot of debug hooks added while working on this driver.
Bring in a plethora of changes from NetBSD:
revision 1.58
date: 2001/03/08 11:07:08; author: ichiro; state: Exp; lines: +17 -1
it wait until busy flag disappears.
it was able to prevent some cards with late initializing faling in wi_reset().
revision 1.41
date: 2000/10/13 19:15:08; author: jonathan; state: Exp; lines: +4 -2
Fix wi_intr() to avoid touching card registers during insert/remove events,
when sharing an interrupt with other devices:
check sc->sc_enabled, and drop the interrupt if its' off.
revision 1.30
date: 2000/08/18 04:11:48; author: jhawk; state: Exp; lines: +4 -4
Copy wi_{dst,src}_addr from struct wi_frame into faked-up ether_header
instead of addr1 and addr2. THis means that tcpdump -e will show the
correct MAC address for communications with access points instead of showing
the BSSID.
In the future there should be 802.11 support for bpf/libpcap/tcpdump,
but that is aways down the road.
count drops to 0 in witness_destroy, set the w_name and w_file pointers
to point to the string "(dead)" and the w_line field to 0. This way,
if a mutex of a given name is used only in a module, then as long as
all mutexes in the module are destroyed when the module is unloaded,
witness will not maintain stale references to the mutex's name in the
module's data section causing a panic later on when the w_name or w_file
field's are examined.
1. Pick up MII/PHY support for Livengood copper part (10/100/1000) from
Parag Patel. It was a fairly complete but not quite platform independent
job.
2. Finish silly offset differences that LIVENGOOD vs. WISEMAN registers
have (so the !)$*!)$*!$ fiber LIVENGOOD now works too).
3. Ansify the source.
So- we now suppor tthe PRO1000F and PRO1000T adapters.
1. The offsets for some registers change in LIVENGOOD. Gratuitously.
2. Define LIVENGOOD and LIVENGOOD_CU part numbers. Add some more
specific LIVENGOOD defaults.
3. Add definitions for PHY support for the copper LIVENGOOD part
(10/100/1000).
Since pid's are not in the kernel address space, this doesn't conflict
with the funcionality of specifying an arbitrary frame pointer to the
trace command.
- If the first function of a backtrace maps to fork_trampoline, then this
is a newly fork'd process that has not been executed yet, so just print
out the first frame and then return for that case.
- Lower the default count from 65535 to 1024. ddb doesn't trace into
userland, and if the stack gets hosed and starts looping it's less
annoying.
Parag Patel did all of the grunt work, so he gets the credit.
Register definitions and actions inferred from a Linux driver,
so Intel also gets some 'credit'.
the main benefit this gives for now is that via686 audio devices on
motherboards with ac97 codecs that do not support vra will be able to use
sample rates other than 48khz.
Add simple "xlat" converter which performs 8to8 table based conversion.
Unicode converter will be added in the near future.
Reviewed by: silence on arch@
Files placement reviewed by: bde
Obtained from: smbfs
o Change the number of init tries from 5 to a #define.
o Allow up to 5s rather than 2s for commands to complete. This
is still much less than 51 minutes, but makes my intel card init
with more reliability than before.
possible for some systems where the device is there, but the BIOS
hasn't allocated memory resources for it), we don't panic.
Submitted by: Gerard Roudier
peer out from sppp_lcp_open() to sppp_lcp_up(). For one, this makes
things look more symmetrical to sppp_lcp_close(), and somehow it also
just occurred to me that an Up event following the open caused the
value of the authentication option to be clobbered.
badaddr_read(). This fixes 'machine check in pal mode' halts on
ev5 2100As.
MFC candidate -- after spending 6 hours tracking this down, I checked and
discovered that it has been in NetBSD for over a year, so it should be safe
for MFC into 4.3-RELEASE
It's not finished yet (I still have to find a way to implement process-
dependent nodes without consuming too much memory, and the permission
system needs tightening up), but it's becoming hard to work on without
a repo (I've accidentally almost nuked it once already), and it works
(except for the lack of process-dependent nodes, that is).
I was supposed to commit this a week ago, but timed out waiting for jkh
to reply to some questions I had. Pass him a spoonful of bad karma :)