Symbol values are now represented using array sizes (4 arrays per symbol
so that 16-bit machines can represent 64-bit values) instead of being raw
binary values.
Reviewed by: marcel
Further experimentation showed that some Dell 2450 machines with the
prevention kludge installed still got T_RESERVED traps. CPU interrupt
vector 0x7A was observed to be triggered. This might have been the
bitwise OR of two different vectors sent from each of the IOAPICs at
the same time.
IOAPIC #0: 0x68 --> irq 8: RTC timer interrupt
IOAPIC #1: 0x32 --> irq 18: scsi host adapter or network interface
----
0x7a --> T_RESERVED
Both IOAPICs had ID 0.
Appendix B.3 in the MP spec indicates that the operating system is
responsible for assigning unique IDs to the IOAPICs.
The enclosed patch programs the IOAPIC IDs according to the IOAPIC
entries in the MP table.
Submitted by: tegge
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTICIOUS.
A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather then swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.
Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.
buzy, only search upwards for a free slot to use..
This broke unit numbering on ATA systems where PCI attached controllers
come before the mainboard ones...
Reviewed by: dfr
This function will probably rewritten/renamed to devpp.
Submitted by: Assar Westerlund <assar@sics.se> on -current
Confirmed to work: Steinar Haug <sthaug@nethelp.no>,
Manfred Antar <mantar@pacbell.net>
Reviewed by: phk
this in awk using the hack of counting args of type off_t twice and args
of all other types once. This is too simple to work. It gave benignly
wrong results on alphas (off_t shouldn't be counted twice) and for
svr4_sys_mmap64() on i386's (off64_t should be counted twice). It gave
fatally wrong results for i386's with 64-bit longs (longs should be
counted twice). The correct value for sy_nargs is easier to determine
from the size of the args struct anyway, except for complications to
make the generated code almost readable.
Improved formatting of sysent tables by lining up the comments where
possible.
type. This gave an inconsistent amount of crufty padding on i386's with
64-bit longs (8 bytes instead of 4). On alphas it gives a consistent
amount of crufty padding (8 bytes) in addition to the 4 bytes of normal
padding caused by passing int args as register_t's.
Fixed the args struct tag for the NOPROTO syscalls (netbsd_lchown() and
netbsd_msync()). The tag is currently unused for NOPROTO syscalls, so
the bug has no effect, but it will be used even in the NOPROTO case to
calculate sy_nargs correctly.
<sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
spl() protection in the case of a copyout error.
Add missing spl calls around the intial activation call that is
done when when the kevent is added.
Add two KASSERT macros to help catch errors in the future.
is used to control whether the debug messages are output at runtime.
It defaults to on so that if you define BUS_DEBUG in your kernel
then you get all the debugging info when you boot.
It's very useful for disabling all the debugging info when you're
developing a loadable device driver and you're doing lots of loads
and unloads but don't always want to see all the debugging info.
Remove evil allocation macros from machdep.c (why was that there???) and
use malloc() instead.
Move paramters out of param.h and into the code itself.
Move a bunch of internal definitions from public sys/*.h headers (without
#ifdef _KERNEL even) into the code itself.
I had hoped to make some of this more dynamic, but the cost of doing
wakeups on all sleeping processes on old arrays was too frightening.
The other possibility is to initialize on the first use, and allow
dynamic sysctl changes to parameters right until that point. That would
allow /etc/rc.sysctl to change SEM* and MSG* defaults as we presently
do with SHM*, but without the nightmare of changing a running system.
wrong for many years that negative niceness would lower the priority
of a process below PUSER, and once below PUSER, there were conditionals
in the code that are required to test for whether a process was in
the kernel which would break.
The breakage could (and did) cause lock-ups, basically nothing else
but the least nice program being able to run in some conditions. The
algorithm which adjusts the priority now subtracts PRIO_MIN to do
things properly, and the ESTCPULIM() algorithm was updated to use
PRIO_TOTAL (PRIO_MAX - PRIO_MIN) to calculate the estcpu.
NICE_WEIGHT is now 1 to accomodate the full range of priorities better
(a -20 process with full CPU time has the priority of a +0 process with
no CPU time). There are now 20 queues (exactly; 80 priorities) for
use in user processes' scheduling, and PUSER has been lowered to 48
to accomplish this.
This means, to the user, that things will be scheduled more correctly
(noticeable), there is no lock-up anymore WRT a niced -20 process
never releasing the CPU time for other processes. In this fair system,
tsleep()ed < PUSER processes now will get the proper higher priority
than priority >= PUSER user processes.
The detective work of this was done by me, along with part of the
solution. Luoqi Chen has provided most of the solution, and really
helped me understand what was happening better, to boot :)
Submitted by: luoqi
Concept reviewed by: bde
bus/driver/kobj system. I am not 100% sure that this is the correct fix,
but it is harmless and does seem to solve the problem. At worst, it could
cause a tiny memory leak at unload time - this is better than a free(NULL)
and subsequent panic. I'm waiting for comments from Doug about this.
This may yet be backed out and fixed differently.
The change itself is to increment the reference count on drivers in one
case where it appears to have been missed. When everything is unloaded,
kobj_class_free() was being called twice in some cases, and panicing the
second time.
version dependency system. This isn't quite finished, but it is at a
useful stage to do a functional checkpoint.
Highlights:
- version and dependency metadata is gathered via linker sets, so things
are handled the same for static kernels and code built to live in a kld.
- The dependencies are at module level (versus at file level).
- Dependencies determine kld symbol search order - this means that you
cannot link against symbols in another file unless you depend on it. This
is so that you cannot accidently unload the target out from underneath
the ones referencing it.
- It is flexible enough that we can put tags in #include files and macros
so that we can get decent hooks for enforcing recompiles on incompatable
ABI changes. eg: if we change struct proc, we could force a recompile
for all kld's that reference the proc struct.
- Tangled dependency references at boot time are sorted. Files are
relocated once all their dependencies are already relocated.
Caveats:
- Loader support is incomplete, but has been worked on seperately.
- Actual enforcement of the version number tags is not active yet - just
the module dependencies are live. The actual structure of versioning
hasn't been agreed on yet. (eg: major.minor, or whatever)
- There is some backwards compatability for old modules without metadata
but I'm not sure how good it is.
This is based on work originally done by Boris Popov (bp@freebsd.org),
but I'm not sure he'd recognize much of it now. Don't blame him. :-)
Also, ideas have been borrowed from Mike Smith.
program running under linux emulation, the script binary is checked for
in /compat/linux first. Without this patch the wrong script binary
(i.e. the FreeBSD binary) will be run instead of the linux binary.
For example, #!/bin/sh, thus breaking out of linux compatibility mode.
This solves a number of problems people have had installing linux
software on FreeBSD boxes.
There's no excuse to have code in synthetic filestores that allows direct
references to the textvp anymore.
Feature requested by: msmith
Feature agreed to by: warner
Move requested by: phk
Move agreed to by: bde
in struct bio. Eventually, bio_offset will probably obsolete the
bio_blkno and bio_pblkno fields.
Remove the special hack in atapi-cd.c to determine of bio_offset was valid.
* Report link errors to stdout with uprintf() so that the user can see
what went wrong (PR kern/9214).
* Add support code to allow module symbols to be loaded into GDB using
the debugger's "sharedlibrary" command.
maintainers.
After we established our branding method of writing upto 8 characters of
the OS name into the ELF header in the padding; the Binutils maintainers
and/or SCO (as USL) decided that instead the ELF header should grow two new
fields -- EI_OSABI and EI_ABIVERSION. Each of these are an 8-bit unsigned
integer. SCO has assigned official values for the EI_OSABI field. In
addition to this, the Binutils maintainers and NetBSD decided that a better
ELF branding method was to include ABI information in a ".note" ELF
section.
With this set of changes, we will now create ELF binaries branded using
both "official" methods. Due to the complexity of adding a section to a
binary, binaries branded with ``brandelf'' will only brand using the
EI_OSABI method. Also due to the complexity of pulling a section out of an
ELF file vs. poking around in the ELF header, our image activator only
looks at the EI_OSABI header field.
Note that a new kernel can still properly load old binaries except for
Linux static binaries branded in our old method.
*
* For a short period of time, ``ld'' will also brand ELF binaries
* using our old method. This is so people can still use kernel.old
* with a new world. This support will be removed before 5.0-RELEASE,
* and may not last anywhere upto the actual release. My expiration
* time for this is about 6mo.
*
Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.
CCD not converted yet, casts to struct buf (still safe)
atapi-cd casts to struct buf to examine B_PHYS
non-device code.
* Re-implement the method dispatch to improve efficiency. The new system
takes about 40ns for a method dispatch on a 300Mhz PII which is only
10ns slower than a direct function call on the same hardware.
This changes the new-bus ABI slightly so make sure you re-compile any
driver modules which you use.
devstat_end_transaction_bio()
bioq_* versions of bufq_* incl bioqdisksort()
the corresponding "buf" versions will disappear when no longer used.
Move b_offset, b_data and b_bcount to struct bio.
Add BIO_FORMAT as a hack for fd.c etc.
We are now largely ready to start converting drivers to use struct
bio instead of struct buf.
(Much of this done by script)
Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.
Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.
Add bio_queue field for struct bio aware disksort.
Address a lot of stylistic issues brought up by bde.
async I/O's. The sequential read heuristic has been extended to
cover writes as well. We continue to call cluster_write() normally,
thus blocks in the file will still be reallocated for large (but still
random) I/O's, but I/O will only be initiated for truely sequential
writes.
This solves a number of annoying situations, especially with DBM (hash
method) writes, and also has the side effect of fixing a number of
(stupid) benchmarks.
Reviewed-by: mckusick
release for inclusion into the release, but bde talked me out of
committing the module that needs this until after the release. It is
after the release now. :-)
via sysctl. It's done pretty simply but it should be quite adequate.
Also move SHMMAXPGS from $machine/include/vmparam.h as the comments that
went with it were wrong... we don't allocate KVM space for the pages so
that comment is bogus.. The only practical limit is how much physical
ram you want to lock up as this stuff isn't paged out or swap backed.
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).
A great deal of conditional SMP code for various deadended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.
Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.
This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.
This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.
Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch
fragmentation problem due to geteblk() reserving too much space for the
buffer and imposes a larger granularity (16K) on KVA reservations for
the buffer cache to avoid fragmentation issues. The buffer cache size
calculations have been redone to simplify them (fewer defines, better
comments, less chance of running out of KVA).
The geteblk() fix solves a performance problem that DG was able reproduce.
This patch does not completely fix the KVA fragmentation problems, but
it goes a long way
Mostly Reviewed by: bde and others
Approved by: jkh
static int setrootbyname(char *name);
out into
dev_t getdiskbyname(char *name);
This makes it easy to create a new DDB command, which is the big reason
for the change. You can now do the following in DDB:
Example rc.conf entry:
dumpdev="/dev/ad0s1b" # Device name to crashdump to (if enabled).
db> show disk/ad0s1b
dev_t = 0xc0b7ea00
db> p *dumpdev
c0b7ea00
Make the public interface more systematically named.
Remove the alternate method, it doesn't do any good, only ruins performance.
Add counters to profile the usage of the 8 access functions.
Apply the beer-ware to my code.
The weird +/- counts are caused by two repocopies behind the scenes:
kern/kern_clock.c -> kern/kern_tc.c
sys/time.h -> sys/timetc.h
(thanks peter!)
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)
substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)
This patch is machine generated except for the ccd.c and buf.h parts.
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.
B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.
Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.
Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.
This change is a step in the direction towards a stackable BIO capability.
A lot of this patch were machine generated (Thanks to style(9) compliance!)
Vinum users: Greg has not had time to test this yet, be careful.
return ENXIO (Device not configured). Without this, vn_isdisk()
could (and did in the case of lstat() under fdesc) pass a NULL pointer
to devsw(), which caused a page fault.
Reviewed by: alfred
This avoids the unit number from going up indefinitely when
diconnecting and connecting 2 devices alternately.
Noticed by: nsayer (quite a while ago)
And stop calling DEVICE_NOMATCH at probe repeatedly. This stops the
message on the PCI VGA board from being printed when loading a PCI driver.
to recycle full fsids after only 16 mount/unmount's. This is probably
too often for exported fsids. Now we recycle the full fsids only
after 2^16 mount/ umount's and only ensure uniqueness in the lower 16
bits if there have been <= 256 calls to vfs_getnewfsid() since the
system started.
Previously, it was being called whether it was needed or not and the
ASU flag was being set (as a side affect of calling 'suser()') in
cases where superuser privileges were not actually needed. This was
all pointed out to me by Bruce Evans.
Reviewed by: bde
number was packed very wastefully, giving perfect non-uniqeness in
the lower 16 bits of fsids for filesystems with the same vfs type.
This made linux_stat() return perfectly non-unique (broken) 16-bit
st_dev's for nfs mount points, and effectively reduced mntid_base to
8 bits so that the vfs_getnewfsid() looped endlessly when there are
already 256 mounted filesystems with the required vfs type.
Approved by: jkh
VM_PROT_EXECUTE must be added to prot before calling vm_map_find.
Without this change, an mprotect on a shmat'ed region fails (when
it shouldn't). This bug was reported Feb 28 by Brooks Davis
<brooks@one-eyed-alien.net> on -hackers.
Reviewed by: bde
Approved by: jkh
with the known bogus currtpriority. This undoes the previous changes to
sys/i386/i386/trap.c, sys/alpha/alpha/trap.c, sys/sys/systm.h
Now we have the patch set approved by bde.
Approved by: bde
This
This feature allows you to specify if mmap'd data is included in
an application's corefile.
Change the type of eflags in struct vm_map_entry from u_char to
vm_eflags_t (an unsigned int).
Reviewed by: dillon,jdp,alfred
Approved by: jkh
how the kernel was booted and perhaps do conditional things
based upon it (sysinstall, for example, will now turn Debug mode
on automatically if boot -v was done).
Submitted by: msmith
Suggested by: ulf
Now this check is necessary because IPv6 source routing might use
control data bigger than MLEN. (e.g. 16bytes IPv6 addr x 23 hops)
Actually mbuf cluster should be used in uipc_socket.c:sbcreatecontrol()
and uipc_syscalls.c:sockargs() when data size is bigger then MLEN,
and such patches were already in KAME environment and have been
confirmed to work well. I just forgot to merge them into 4.0, sorry.
For safety, I'll postpone such patches until after 4.0 release.
The effect of postponement is followings.
-Ping6 source routing hops are limitted to around 6 or so.
-If some apps do setsockopt IPV6_RTHDR and try to receive
incoming IPv6 source routing info, it can't receive more
than 6 hops source routing info.
(But currently, no apps seems to be doing it.)
Approved by: jkh
VFS_AIO option is specified, all aio-related syscalls return ENOSYS.
The aio code is very fragile right now, and is unsuitable for default
inclusion in a production shell box.
Approved by: jkh
was using them exits.
Don't allow a user process to cause the kernel to take a TRCTRAP on a
user space address.
Reviewed by: jlemon, sef
Approved by: jkh
fd's in the range of 32-63, 96-127 etc. The first problem was the
FD_*() macros were shifting a 32 bit integer "1" left by more than
32 bits. The same problem happened in selscan(). ffs() also takes
an int argument and causes failure. For cases where int == long
(ie: the usual case for x86, but not always as gcc can have long
being a 64 bit quantity) ffs() could be used.
Reported by: Marian Stagarescu <marian@bile.skycache.com>
Reviewed by: dfr, gallatin (sys/types.h only)
Approved by: jkh
was needed to make attach/detach of devices work, which is
needed for the PCCARD support.
(PCCARD support is still not working though, more to come on that)
Support the CMD646 chip which is used on many alphas, sadly only
in WDMA2 mode, as the silicon is broken beyond belief for UDMA modes.
Lots of cosmetic fixes here and there.
Sorry for the size of this megapatchfromhell but it was not
possible otherwise...
newbus patches based on work from: dfr (Doug Rabson)
firmware prompt. Several sleepy folk mistook the '>>>' for the SRM
prompt, which was never the desired idea.
Submitted by: Andrew Gallatin <gallatin@cs.duke.edu>
Approved by: jkh
run out of KVM through a mmap()/fork() bomb that allocates hundreds
of thousands of vm_map_entry structures.
Add panic to make null-pointer dereference crash a little more verbose.
Add a new sysctl, vm.max_proc_mmap, which specifies the maximum number
of mmap()'d spaces (discrete vm_map_entry's in the process). The value
defaults to around 9000 for a 128MB machine. The test is scaled for the
number of processes sharing a vmspace (aka linux threads). Setting
the value to 0 disables the feature.
PR: kern/16573
Approved by: jkh
This unspams the boot messages, concentrating on the drivers that have
actually been probed.
This basically resurrects revision 1.106 from old /sys/i386/isa/isa.c.
Reviewed by: jkh, dfr
curproc. This only makes any difference on SMP, where we used a
(potentially very bogus) switchtime from our own CPU to calculate
resource usage on another CPU.
This should remove some if not all calcru() related warnings on SMP.
Approved by: jkh
#! /bin/sh # -*- perl -*-
This is simply "delete everything after the next '#', not counting the
first char in the line". No effort has been made to allow quoting,
backslash escaping or '#' in interpreter names.
The complies to POSIX 1003.2 in that Posix says the implementation is
free to choose whatever it likes.
PR: bin/16393
``jail'', and move the set_hostname_allowed sysctl there, as well as
fixing a bug in the sysctl that resulted in jails being over-limited
(preventing them from reading as well as writing the hostname). Also,
correct some formatting issues, courtesy bde :-).
Reviewed by: phk
Approved by: jkh
or not a process in a jail, with privilege, may set the jail's hostname.
Defaults to 1, which permits this. May be set to 0 by a process with
appropriate privilege outside of jail. Preventing hostname renaming
from within a jail is currently required to make jails manageable, as they
a currently identifiable only by hostname using /proc, which may be
modified without this sysctl being set to 0. This will be documented
in upcoming man commits.
Authorized by: jkh, the ever-patient
ptys in ways that might be unethical, especially towards processes not in
jail, or in other jails.
Submitted by: phk
Reviewed by: rwatson
Approved by: jkh
support code). It hasn't worked since at least October 1995, and probably
has never worked in the FreeBSD 2.0+ tree. Obviously it's not a priority
to many folks.
Reviewed by: phk, sos
the codepath is followed.
From the PR:
vclean calls vrele leading to deadlock (if usecount > 0)
vclean() calls vrele() if v_usecount of the node was higher than one.
But before calling it, it sets the VXLOCK flag, which will make
vn_lock called from vrele dead-lock.
PR: kern/15117
Submitted by: Assar Westerlund <assar@stacken.kth.se>
Reviewed by: rwatson
Obtained from: NetBSD
system is slowed down and in the right spot (a race condition in fork()).
The "previous time" fields have moved from pstat to proc. Anything which
uses KVM needs to be recompiled with a new libkvm/headers.
A couple wacky u_quad_t's in struct proc are now u_int64_t (the same, but
according to lack of 'quad's in proc.h and usage in kern_resource.c).
This will have no effect on code.
This has been make-world-and-installed-new-kernel-which-works-fine-tested.
Reviewed by: bde (previous version)
subr_diskmbr.c:
Don't "helpfully" enlarge our idea of the disk size to cover all the
primary slices. Instead, truncate or discard slices that don't seem
to be on the disk. The enlargement was a hack for disks that don't
report their size (e.g., MFM disks). It is just wrong in general.
wd.c:
In CHS mode, limit the disk size so that cylinder numbers >= 65536
cannot occur. This normally only affects disks larger than 33.8GB.
CHS mode accesses to addresses above the limit are now properly broken
(an error is returned instead of garbage for reads and disk corruption
for writes).
PR: 15611
Reviewed by: readers of freebsd-bugs did not respond to a request
for review
malloc region (kmem_map) to be wrong and semi-random on systems with more
than 1GB of RAM. This is not a complete fix, but is sufficient for
machines with 4GB or less of memory. A complete fix will require some
changes to the getenv stuff so that 64bit values can be passed around.
NOT FIXED: machines with more than 4GB of RAM (e.g. some large Alphas)
since we're still using ints to hold some of the values.
Reviewed by: bde
Using recursion to traverse the recursive data structure for extended
partitions was never good, but when slice support was implemented in
1995, the recursion worked for the default maximum number of slices
(32), and standard fdisk utilities didn't support creating more than
the default number. Even then, corrupt extended partitions could
cause endless recursion, because we attempt to check all slices, even
ones which we don't turn into devices.
The recursion has succumbed to creeping features. The stack requirements
for each level had grown to 204 bytes on i386's. Most of the growth was
caused by adding a 64-byte copy of the DOSpartition table to each frame.
The kernel stack size has shrunk to about 5K on i386's. Most of the
shrinkage was caused by the growth of `struct sigacts' by 2388 bytes
to support 128 signals.
Linux fdisk (a 1997 version at least) can now create 60 slices (4 standard
ones, 56 for logical drives within extended partitions, and it seems to
be leaving room to map the 4 BSD partitions on my test drive), and Linux
(2.2.29 and 2.3.35 at least) now reports all these slices at boot time.
The fix limits the recursion to 16 levels (4 + 16 slices) and recovers
32 bytes per level caused by gcc pessimizing for space. Switching to
a static buffer doesn't cause any problems due to recursion, since the
buffer is not passed down. Using a static buffer is wrong in general
because it requires the giant lock to protect it. However, this problem
is small compared with using a static buffer for dsname(). We sometimes
neglect to copy the result of dsname() before sleeping.
Also fixed slice names when we find more than MAX_SLICES (32) slices.
The number of the last slice found was not passed passed recursively.
The limit on the recursion now prevents finding more than 32 slices
with a standard extended partition data structure anyway.
despite having a non-null cn_tab entry. This case now works the same
as if there is no physical console, except i/o at the kernel printf
level may still work. This frees drivers of physical console drivers
from the responsibility of attaching the device no matter what.
file open in one of the special file descriptors (0, 1, or 2), close
it before completing the exec.
Submitted by: nergal@idea.avet.com.pl
Constructive comments: deraadt@openbsd.org, sef, peter, jkh
again (without this the rollback analysis was being lost). Should reduce
the write count for most workloads.
Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
my tree for ages (~2 years) waiting for an excuse to commit it. Now Linux
has implemented it and it seems that Staroffice (when using the
linux_base6.1 port's libc) calls this in the linux emulator and dies in
setup. The Linux emulator can call these now.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.
PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>
ddb is entered. Don't refer to `in_Debugger' to see if we
are in the debugger. (The variable used to be static in Debugger()
and wasn't updated if ddb is entered via traps and panic anyway.)
- Don't refer to `in_Debugger'.
- Add `db_active' to i386/i386/db_interface.d (as in
alpha/alpha/db_interface.c).
- Remove cnpollc() stub from ddb/db_input.c.
- Add the dbctl function to syscons, pcvt, and sio. (The function for
pcvt and sio is noop at the moment.)
Jointly developed by: bde and me
(The final version was tweaked by me and not reviewed by bde. Thus,
if there is any error in this commit, that is entirely of mine, not
his.)
Some changes were obtained from: NetBSD
to wake up any processes waiting via PIOCWAIT on process exit, and truss
needs to be more aware that a process may actually disappear while it's
waiting.
Reviewed by: Paul Saab <ps@yahoo-inc.com>
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.
it only on the buf_daemon process). The problem is that when the
syncer process starts running the worklist, it wants to delete
lots of files. It does this by VFS_VGET'ing the vnodes, clearing
the blocks in them and bdwrite'ing the buffer. It can process close
to a thousand files per second which generates a large number of
dirty buffers. So, giving it special priviledge at the buffer trough
leads to trouble as the buf_daemon does occationally need a free
buffer to proceed and if the syncer has used every last one up,
we are toast.
into vnode dirtyblkhd we append it to the list instead of prepend it to
the list in order to maintain a 'forward' locality of reference, which
is arguably better then 'reverse'. The original algorithm did things this
way to but at a huge time cost.
Enhance the append interlock for NFS writes to handle intr/soft mounts
better.
Fix the hysteresis for NFS async daemon I/O requests to reduce the
number of unnecessary context switches.
Modify handling of NFS mount options. Any given user option that is
too high now defaults to the kernel maximum for that option rather then
the kernel default for that option.
Reviewed by: Alfred Perlstein <bright@wintelcom.net>
the low level interrupt handler number should be used. Change
setup_apic_irq_mapping() to allocate low level interrupt handler X (Xintr${X})
for any ISA interrupt X mentioned in the MP table.
Remove an assumption in the driver for the system clock (clock.c) that
interrupts mentioned in the MP table as delivered to IOAPIC #0 intpin Y
is handled by low level interrupt handler Y (Xintr${Y}) but don't assume
that low level interrupt handler 0 (Xintr0) is used.
Don't allocate two low level interrupt handlers for the system clock.
Reviewed by: NOKUBI Hirotaka <hnokubi@yyy.or.jp>
hardpps() produced offset component. This is tested and behaved
stable with frequency offsets from -338.05 to +499.91 PPM.
Interestingly the machine I tested this on would fail if the clock
were slower than 14.3132 MHz whereas it was perfectly happy to run
at 16.384 MHz, in other words [-340PPM ... +14.4%]
Make pps_shift tweakable with sysctl.
login (or not if root)
then exit the shell
truss will get stuct in tsleep
I dont know if this is correct, but it fixes the problem and
according to the commends in pioctl.h, PF_ISUGID is set when we
want to ignore UID changes.
The code is checking for when PF_ISUGID is not set and since it
never is set, we always ignore UID changes.
Submitted by: Paul Saab <ps@yahoo-inc.com>
resulted in vastly optimistic offset values reported to userland
(typically a factor 40+ too small). Apart from that, the code had
two sign-bugs.
Apply the hardpps() phase with the right sign with a simply
scaling by integration interval. (This may be too stiff at
long integration intervals, see below).
Allow pps_shiftmax to be reduced again.
Before this, the phase lock in hardpps() were broken, but due to
two bugs mostly cancelling out, it would end up basically working
with a large stochastic component. Now it behaves as one would
expect: smooth and quiet.
It seems that pps_shiftmax above 7..9 somewhere makes the phaselock
too weak to hold onto random walk phase errors from a HP-105 OCXO,
which basically means that it is too weak for real-life use with
such integration times. This is yet to be resolved.
Submitted to: Prof. Dave "NTP" Mills.
Tested by: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
because bsd.kmod.mk is usually out of sync with kernel source. However
bsd.kmod.mk has to be updated now because of the _KERNEL change so there
is no need to keep this (pre-repo copy) version around.
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.
to `register_t *'. This fixes bugs like misplacement of argc and argv
on the user stack on i386's with 64-bit longs. We still use longs to
represent "words" like argc and argv, and assume that they are on the
stack (and that there is stack). The suword() and fuword() families
should also use register_t.
register_t, so pointers to it must be passed around as `register_t *',
not as `int *'. The type mismatches were non-benign on alphas, but
the broken code is normally only configured by LINT.
fixee incoherency of pipe timestamps relative to file timestamps in
the usual case where getnanotime() is not used for the latter. (File
and pipe timestamps are still incoherent relative to real time unless
the vfs_timestamp_precision sysctl is set to 2 or 3).
NFSSERVER defined, useful for userland fileservers that want to
use a filehandle type interface to the filesystem.
Submitted by: Assar Westerlund assar@stacken.kth.se
PR: kern/15452
stressful situations. buf_daemon now makes a distinction between
being woken up and its sleep timing out, and as a consequence is now
much better able to dynamically tune itself to its environment.
Reviewed by: Alfred Perlstein <bright@wintelcom.net>
differentiate between one of three different scenarios:
1. No init.
2. Path to init munged by an incorrect loader configuration.
3. Root file system not mounted.
Reviewed-by: billf
The variables "m_mclalloc_wid" and "m_mballoc_wid" were not in the
proper place. They should have been in uipc_mbuf.c and have been global,
not in mbuf.h and local per each file that uses mbuf.h.
Sorta bug fix:
In mbuf.h, the definitions of various things for KERNEL and not
KERNEL cases were very screwy. This fixes all of that which I could
find.
1. Data written beyond end of pipe buffer, causing kernel memory corruption.
- Check that space is still valid after obtaining the pipe lock.
- Defer the calculation of transfer size until the pipe
lock has been obtained.
- Update the pipe buffer pointers while holding the pipe lock.
2. Writes of size <= PIPE_BUF not always atomic.
- Allow an internal write to span two contiguous segments,
so writes of size <= PIPE_BUF can be kept atomic
when wrapping around from the end to the start of the
pipe buffer.
PR: 15235
Reviewed by: Matt Dillon <dillon@FreeBSD.org>
the kernel while the vnode_if.h header is a bunch of inlines to call the
code that is in the kernel. Generating the .h file on the fly is kinda
bogus because it has to match the one compiled into the kernel.
IMHO we should have kern/vnode_if.c and sys/vnode_if.h committed in the
tree but that's another battle.
means that running out of mbuf space isn't a panic anymore, and code
which runs out of network memory will sleep to wait for it.
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: green, wollman
madvise().
This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.
MAP_NOSYNC allows one to use mmap() to share memory between processes
without incuring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.
Reviewed by: julian, alc, dg
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.
This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.
Discussed with: grog, mch, peter, phk
Reviewed by: peter
adequate for the IDE disks that I have available for testing. Most seem
to wait between 1 and 3 seconds before flushing their caches.
Add the ability to override the delay at compile time via the
undocumented option POWEROFF_DELAY. The delay can still be set via
sysctl as it was originally implemented.
boots I try in vain to remember which month or even year this system
was last booted in.
Print out the uptime before rebooting, and give people like me
less (or more as it may be) to think about while the systems boots.
some time ago that changes kern.randompid from a boolean to a randomness
range for the next pid assigment. Too high causes a lot of extra work
to scan for free pids, and too low merely wastes randomness entropy. It's
still possible to select a completely random range by using PID_MAX (100k)
or -1 as a shortcut to mean "the whole range".
Also, don't waste randomness when doing a wraparound.
device_add_child_ordered(). 'ivars' may now be set using the
device_set_ivars() function.
This makes it easier for us to change how arbitrary data structures are
associated with a device_t. Eventually we won't be modifying device_t
to add additional pointers for ivars, softc data etc.
Despite my best efforts I've probably forgotten something so let me know
if this breaks anything. I've been running with this change for months
and its been quite involved actually isolating all the changes from
the rest of the local changes in my tree.
Reviewed by: peter, dfr
because in the case of mbuf clusters they only increment the reference
count rather than actually copying the data.
Add comments to this effect, and add a new routine called m_dup() that
returns a real, writable copy of an mbuf chain.
This is preliminary work required for implementing 'ipfw tee'.
Reviewed by: julian
drops the counting in bwrite and puts it all in spec_strategy.
I did some tests and verified that the counts collected for writes
in spec_strategy is identical to the counts that we previously
collected in bwrite. We now also get read counts (async reads
come from requests for read-ahead blocks). Note that you need
to compile a new version of mount to get the read counts printed
out. The old mount binary is completely compatible, the only
reason to install a new mount is to get the read counts printed.
Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
Hopefully this clears up some confusion about the nature of
devclass_get_softc() vs. device_get_softc() as well.
The check against DS_ATTACHED remains as this is not
a change that modifies functionality.
Reviewed by: Peter "in principle" Wemm
what it is.
Be more correct in unbusying the mountpoint (especially before freeing it).
Remove support for mounting 'r' devices as root. You don't mount 'r'
devices anywhere else, and they're going away anyway.
Submitted by: bde
(kern.randompid), which is currently defaulted off. Use ARC4 (RC4) for our
random number generation, which will not get me executed for violating
crypto laws; a Good Thing(tm).
Reviewed and Approved by: bde, imp
commit to kern_synch.c:
----------------------------
revision 1.55
date: 1999/02/23 02:56:03; author: ross; state: Exp; lines: +39 -10
Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclk() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code
=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
=== New schedclk() mechansism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurance an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock. Define these for alpha, where hz==1024, and nice was
totally broke.
=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
be incorporated directly (with greater wieght) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this, it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.
=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calcuation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
----------------------------
The details are a little different in FreeBSD:
=== nice bug === Fixing this is the main point of this commit. We use
essentially the same clipping rule as NetBSD (our limit on p_estcpu
differs by a scale factor). However, clipping at all is fundamentally
bad. It gives free CPU the hoggiest hogs once they reach the limit, and
reaching the limit is normal for long-running hogs. This will be fixed
later.
=== New schedclk() mechanism === We don't use the NetBSD schedclk()
(now schedclock()) mechanism. We require (real)stathz to be about 128
and scale by an extra factor of 2 compared with NetBSD's statclock().
We scale p_estcpu instead of scaling the clock. This is more accurate
and flexible.
=== Algorithm change === Same change.
=== Other bugs === The p_pctcpu bug was fixed long ago. We don't try as
hard to abstract functionality yet.
Related changes: the new limit on p_estcpu must be exported to kern_exit.c
for clipping in wait1().
Agreed with by: dufault
commit to kern_synch.c:
----------------------------
revision 1.55
date: 1999/02/23 02:56:03; author: ross; state: Exp; lines: +39 -10
Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclk() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code
=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
=== New schedclk() mechansism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurance an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock. Define these for alpha, where hz==1024, and nice was
totally broke.
=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
be incorporated directly (with greater wieght) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this, it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.
=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calcuation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
----------------------------
The details are a little different in FreeBSD:
=== nice bug === Fixing this is the main point of this commit. We use
essentially the same clipping rule as NetBSD (our limit on p_estcpu
differs by a scale factor). However, clipping at all is fundamentally
bad. It gives free CPU the hoggiest hogs once they reach the limit, and
reaching the limit is normal for long-running hogs. This will be fixed
later.
=== New schedclk() mechanism === We don't use the NetBSD schedclk()
(now schedclock()) mechanism. We require (real)stathz to be about 128
and scale by an extra factor of 2 compared with NetBSD's statclock().
We scale p_estcpu instead of scaling the clock. This is more accurate
and flexible.
=== Algorithm change === Same change.
=== Other bugs === The p_pctcpu bug was fixed long ago. We don't try as
hard to abstract functionality yet.
Related changes: the new limit on p_estcpu must be exported to kern_exit.c
for clipping in wait1().
Agreed with by: dufault