that should be better.
The old code counted references to mbuf clusters by using the offset
of the cluster from the start of memory allocated for mbufs and
clusters as an index into an array of chars, which did the reference
counting. If the external storage was not a cluster then reference
counting had to be done by the code using that external storage.
NetBSD's system of linked lists of mbufs was considered, but Alfred
felt it would have locking issues when the kernel was made more
SMP friendly.
The system implemented uses a pool of unions to track external
storage. The union contains an int for counting the references and
a pointer for forming a free list. The reference counts are
incremented and decremented atomically and so should be SMP friendly.
This system can track reference counts for any sort of external
storage.
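
A minimal userland sketch of the pool-of-unions idea described above.
All names here (ext_refcnt, ext_alloc, ext_ref, ext_unref, EXT_POOL_SIZE)
are hypothetical and are not the macros from mbuf.h; C11 atomics stand in
for the kernel's atomic operations, and the free list itself is left
unprotected, so only the count updates are SMP safe in this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical counter slot: a count while in use, a link while free. */
union ext_refcnt {
	atomic_int	  refcnt;	/* in use: number of references */
	union ext_refcnt *next;		/* free: next slot on the free list */
};

#define	EXT_POOL_SIZE	4		/* assumption: tiny pool for the demo */

static union ext_refcnt	 ext_pool[EXT_POOL_SIZE];
static union ext_refcnt	*ext_free;

/* Thread every slot of the pool onto the free list. */
static void
ext_pool_init(void)
{
	ext_free = NULL;
	for (size_t i = 0; i < EXT_POOL_SIZE; i++) {
		ext_pool[i].next = ext_free;
		ext_free = &ext_pool[i];
	}
}

/* Take a slot off the free list with one reference; NULL if exhausted. */
static union ext_refcnt *
ext_alloc(void)
{
	union ext_refcnt *c = ext_free;

	if (c == NULL)
		return (NULL);
	ext_free = c->next;
	atomic_init(&c->refcnt, 1);
	return (c);
}

/* Add a reference atomically. */
static void
ext_ref(union ext_refcnt *c)
{
	assert(c != NULL);
	atomic_fetch_add(&c->refcnt, 1);
}

/*
 * Drop a reference atomically.  Returns 1 if this was the last
 * reference (the slot goes back on the free list and the caller
 * should release the external storage), 0 otherwise.
 */
static int
ext_unref(union ext_refcnt *c)
{
	if (atomic_fetch_sub(&c->refcnt, 1) != 1)
		return (0);
	c->next = ext_free;
	ext_free = c;
	return (1);
}
```

Because the slot is a union, a free counter costs no extra memory for
its list linkage, and nothing in the slot depends on what kind of
external storage it is counting.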
Access to the reference counting stuff is now through macros defined
in mbuf.h, so it should be easier to make changes to the system in
the future.
The possibility of storing the reference count in one of the
referencing mbufs was considered, but was rejected because it would
often leave extra mbufs allocated. Storing the reference count in
the cluster was also considered, but because the external storage
may not be a cluster this isn't an option.
The size of the pool of reference counters is available in the
stats provided by "netstat -m".
PR: 19866
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: alfred (glanced at by others on -net)
- stop using the evil 'struct trapframe' argument for mi_startup()
(formerly main()). There are much better ways of doing it.
- do not use prepare_usermode() - setregs() in execve() will do it
all for us as long as the p_md.md_regs pointer is set. (which is
now done in machdep.c rather than init_main.c. The Alpha port did it
this way all along and is much cleaner).
- collect all the magic %cr0 etc register settings into one place and
have the APs call that instead of using magic numbers (!!) that keep
changing over and over again.
- Make it safe to call kthread_create() earlier, including during the
device probe sequence. It doesn't need the callback mechanism that
NetBSD's version uses.
- kthreads created this way are root-less as they exist before the root
filesystem is mounted. init(1) is set up so that it acquires the root
pointers prior to running. If other kthreads want filesystem access
we can make this code more generic.
- set all threads' start times once we have decided what time it is.
- init uses a trampoline rather than the evil prepare_usermode() hack.
- kern_descrip.c has a couple of tweaks to deal with forking when there
is no rootdir or cwd etc.
- adjust the early SYSINIT() sequence so that a few prerequisites are in
place. eg: make sure the run queue is initialized before doing forks.
With this, the USB code can easily create a kthread to do the device
tree discovery. (I have tested it, it works nicely).
There are still some open issues before this is truly useful.
- tsleep() does not like working before the clock is running. It
sort-of tries to spin wait, but it can do more useful things now.
- stopping a kthread in kld code at unload time is "interesting" but
we have a solution for that.
The Alpha code needs no changes for this. It already uses pretty much the
same strategies, but a little cleaner.
been done.
Don't allow multiple mount operations with MNT_UPDATE at the same
time on the same mount point. When the first mount operation
completed, MNT_UPDATE was cleared in the mount structure, causing
the second to complete as if it was a no-update mount operation
with the following bad side effects:
- mount structure inserted multiple times onto the mountlist
- vp->v_mountedhere incorrectly set, causing next namei
operation walking into the mountpoint to crash with
a locking against myself panic.
Plug a vnode leak in case vinvalbuf fails.
and VOP_SETEXTATTR to simplify calling from in-kernel consumers,
such as capability code. Both accept a vnode (optionally locked,
with ioflg to indicate that), attribute name, and a buffer + buffer
length in UIO_SYSSPACE. Both authorize the call as a kernel request,
with cred set to NULL for the actual VOP_ calls.
Obtained from: TrustedBSD Project
passing a zero-valued timeout, the code would always sleep for one tick.
Change code to avoid calling tsleep if we have no intention of sleeping.
Bring in bugfix from sys_select.c, r1.60 which also applies here.
Modify error handling slightly; passing in an invalid fd will now result
in EBADF returned in the eventlist, while an attempt to change a knote
which does not exist will result in ENOENT being returned. Previously
such attempts would fail silently without notification.
Pointed out by: nicolas.leonard@animaths.com
Rick Reed (rr@yahoo-inc.com)
panicking and return a status so that we can decide whether to drop
into DDB or panic. If the status from isa_nmi is true, panic the
kernel based on machdep.panic_on_nmi, otherwise if DDB is
enabled, drop to DDB based on machdep.ddb_on_nmi.
Reviewed by: peter, phk
Don't allow cpu entries in the MP table to contain APIC IDs out of range.
Don't write outside array boundaries if an IO APIC entry in the MP table
contains an APIC ID out of range.
Assign APIC IDs for all IO APICs according to section 3.6.6 in the
Intel MP spec:
- If the current APIC ID on an IO APIC doesn't conflict with other
IO APICs or CPUs, that APIC ID should be used. The copy of the MP
table must be updated if the corresponding APIC ID in the MP table
is different.
- If the current APIC ID was in conflict with other units, the
corresponding APIC ID specified in the MP table is checked for conflict.
- If a conflict is still found then fall back to using a new unique ID.
The copy of the MP table must be updated.
- IDs out of range are considered to be in conflict.
During these operations, the IO_TO_ID array cannot be used, since any
conflict would have caused information loss. The array is then corrected,
since all APIC ID conflicts should have been resolved.
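
The selection order in the list above can be sketched as follows. This
is an illustration only, not the committed code: N_APIC_IDS, the used[]
map and choose_ioapic_id() are hypothetical names, and the range of 16
IDs is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define	N_APIC_IDS	16	/* assumption: valid IDs are 0..15 */

/* An ID conflicts if it is out of range or already claimed. */
static int
apic_id_conflicts(int id, const unsigned char *used)
{
	return (id < 0 || id >= N_APIC_IDS || used[id] != 0);
}

/*
 * Pick an ID for one IO APIC.  "cur" is the ID currently programmed
 * into the unit and "mptable" is the ID listed in the MP table.
 * used[] marks IDs already claimed by CPUs or other IO APICs and is
 * updated with the choice.  Returns the chosen ID, or -1 if no ID is
 * free.  The caller would update its copy of the MP table whenever
 * the result differs from "mptable".
 */
static int
choose_ioapic_id(int cur, int mptable, unsigned char *used)
{
	int id;

	assert(used != NULL);
	if (!apic_id_conflicts(cur, used)) {
		id = cur;		/* keep the current ID */
	} else if (!apic_id_conflicts(mptable, used)) {
		id = mptable;		/* fall back to the MP table ID */
	} else {
		/* Last resort: first ID nobody has claimed. */
		for (id = 0; id < N_APIC_IDS && used[id] != 0; id++)
			;
		if (id == N_APIC_IDS)
			return (-1);
	}
	used[id] = 1;
	return (id);
}
```

Treating an out-of-range ID as a conflict means the range check and the
duplicate check share one code path, so a bad MP table entry can never
index outside used[].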
PR: 20312, 18919
usb, all in usb.ko. uhub depends on usb. The bug was that the preload
processing only adds a module to the list once its internal dependencies
are resolved... Since it was not "seeing" the internal usb module it
believed that uhub had a missing dependency.
gcc's internal exit() prototypes and the (futile) hackery that we did to
try and avoid warnings. main() was renamed for similar reasons.
Remove an exit related hack from makesyscalls.sh.
!VFS_AIO case. Lots of things have hooks into here (kqueue, exit(),
sockets, etc), I elected to keep the external interfaces the same
rather than spread more #ifdefs around the kernel.
"kern/sys_generic.c:358: warning: cast discards qualifiers from pointer
target type"
The idea for using the uintptr_t intermediate cast for de-constifying
a pointer was hinted at by bde some time ago.
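
A small sketch of the uintptr_t intermediate cast, for illustration.
The struct io_vec type and io_vec_set() are hypothetical stand-ins and
are not taken from sys_generic.c; the only point is the form of the
cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical I/O descriptor whose base pointer is not const. */
struct io_vec {
	void	*base;
	size_t	 len;
};

/*
 * Store a const buffer in a descriptor with a non-const field.
 * Casting straight to (void *) would discard the qualifier and
 * trigger the warning; going through uintptr_t converts the pointer
 * to an integer and back, so no qualifier is discarded by a cast.
 * The caller must still never write through iov->base if the object
 * really is read-only.
 */
static void
io_vec_set(struct io_vec *iov, const void *buf, size_t len)
{
	assert(iov != NULL);
	iov->base = (void *)(uintptr_t)buf;
	iov->len = len;
}
```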
with an error condition such as EINTR, EWOULDBLOCK, and ERESTART,
are reported to the application, not silently concealed. This
behavior was copied from the {read,write}v() syscalls, and is
appropriate there but not here.
o Correct a bug in extattr_delete() wherein the LOCKLEAF flag is
passed to the wrong argument in namei(), resulting in some
unexpected errors during name resolution, and passing in an unlocked
vnode.
Obtained from: TrustedBSD Project
is not desired, then the user can register an EV_SIGNAL filter to
explicitly catch a signal event.
Change requested by: jayanth, ps, peter
"Why is kevent non-restartable after a signal?"
operation or after it. If the ktrace operation was enabled while the
process was blocked doing IO, the race would allow it to pass down
invalid (uninitialized) data and panic later down the call stack.
allow for that.
o Remember to call NDFREE() if exiting as a result of a failed
vn_start_write() when snapshotting.
Reviewed by: mckusick
Obtained from: TrustedBSD Project
with the new snapshot code.
Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.
Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().
Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.
Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.
Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.
Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.
Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occasionally read
them which can cause unexpected behavior.
Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.
Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
a loop down in pmap_init_pt(). A subtraction causes the number of
pages to become negative; that value was assigned to an unsigned variable,
so the loop runs for a very long time. The bug is due to the ELF image
activator not properly checking for its files being the correct size
as specified by the ELF header.
The solution is to check that the header doesn't ask for part of a
file when that part of the file doesn't exist. Make sure to set
VEXEC at the proper times to make the executables immutable (remove
race conditions). Also, the ELF format specifies header entries
that allow embedding of other executables (hence how ld-elf.so.1
gets loaded, but not the same as loading shared libraries), so those
executables need to be set VEXEC, too, so they're immutable.
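
The header check described above amounts to a bounds test like the one
sketched here. The struct seg_hdr type and segment_in_file() are
hypothetical names for illustration; the real activator works on the
ELF program headers directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the two header fields that matter here. */
struct seg_hdr {
	uint64_t	offset;		/* where the segment starts in the file */
	uint64_t	filesz;		/* how many bytes it claims from the file */
};

/*
 * Return 1 if the segment lies entirely inside a file of file_size
 * bytes, 0 otherwise.  The test is written as a comparison against
 * "file_size - offset" only after offset has been bounded, so the
 * unsigned subtraction can never wrap around the way the page count
 * did.
 */
static int
segment_in_file(const struct seg_hdr *seg, uint64_t file_size)
{
	assert(seg != NULL);
	if (seg->offset > file_size)
		return (0);
	return (seg->filesz <= file_size - seg->offset);
}
```

Writing the check as offset + filesz <= file_size would be wrong for
the same reason as the original bug: the sum can overflow and pass.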
Reviewed by: peter
interfaces. The original resource_find() returned a pointer to an internal
resource table entry. resource_find_hard() dereferences the actual
passed in value (oops!) - effectively trashing random memory due to
the pointer being passed in with a random initial value.
Submitted by: bde
and remove sysctl oids at will during runtime - they don't rely on
linker sets. Also, the node oids can be referenced by more than
one kernel user, which means that it's possible to create partially
overlapping trees.
Add sysctl contexts to help programmers manage multiple dynamic
oids in a convenient way.
Please see the manpages for detailed discussion, and example module
for typical use.
This work is based on ideas and code snippets coming from many
people, among them: Arun Sharma, Jonathan Lemon, Doug Rabson,
Brian Feldman, Kelly Yancey, Poul-Henning Kamp and others. I'd like
to specially thank Brian Feldman for detailed review and style
fixes.
PR: kern/16928
Reviewed by: dfr, green, phk
an NMI occurred, you could type continue in DDB and the kernel would
not attempt to detect what type of NMI was received. Now we check
for the type of NMI first and then go to DDB if it is enabled.
This will solve the problem with having DDB enabled and getting an
NMI due to some possibly bad error and being able to continue the
operation of the kernel when you really want to panic and know
what happened.
Submitted by: jhb
never expire if poll() or select() was called before the system had been
in multiuser for 1 second. This was caused by only checking to see if
tv_sec was zero rather than checking both tv_sec and tv_usec.
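
To illustrate the difference between the two checks, here is a sketch
using hypothetical helper names (tv_sec_is_zero and tv_is_zero are not
the functions touched by this change):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* The incomplete test: any time below one second looks like "zero". */
static int
tv_sec_is_zero(const struct timeval *tv)
{
	assert(tv != NULL);
	return (tv->tv_sec == 0);
}

/* The complete test: zero only if both fields are zero. */
static int
tv_is_zero(const struct timeval *tv)
{
	assert(tv != NULL);
	return (tv->tv_sec == 0 && tv->tv_usec == 0);
}
```

A value of 0 seconds and 500000 microseconds passes the first test but
not the second, which is the sub-second window described above.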