* Don't call ast() from interrupt() - if we switch, then we will miss
writing cr.eoi which will prevent the current cpu from receiving
interrupts until the current thread is resumed. The call to ast()
happens magically in exception_restore where it is safe.
* Add DDB 'show irq' command to examine interrupt hardware state.
* Use ptc.g instead of ptc.l so that TLB shootdowns are broadcast to the
coherence domain.
* Use smp_rendezvous for pmap_invalidate_all to ensure it happens on all
cpus.
* Dike out a DIAGNOSTIC printf which didn't compile.
* Protect the internals of pmap_install with cpu_critical_enter/exit.
Problem:
selwakeup required calling pfind which would cause lock order
reversals with the allproc_lock and the per-process filedesc lock.
Solution:
Instead of recording the pid of the select()'ing process into the
selinfo structure, actually record a pointer to the thread. To
avoid dereferencing a bad address all the selinfo structures that
are in use by a thread are kept in a list hung off the thread
(protected by sellock). When a selwakeup occurs the selinfo is
removed from that threads list, it is also removed on the way out
of select or poll where the thread will traverse its list removing
all the selinfos from its own list.
Problem:
Previously the PROC_LOCK was used to provide the mutual exclusion
needed to ensure proper locking, this couldn't work because there
was a single condvar used for select and poll and condvars can
only be used with a single mutex.
Solution:
Introduce a global mutex 'sellock' which is used to provide mutual
exclusion when recording events to wait on as well as performing
notification when an event occurs.
Interesting note:
schedlock is required to manipulate the per-thread TDF_SELECT
flag, however if given its own field it would not need schedlock,
also because TDF_SELECT is only manipulated under sellock one
doesn't actually use schedlock for syncronization, only to protect
against corruption.
Proc locks are no longer used in select/poll.
Portions contributed by: davidc
While doing this, move it earlier in the sysinit boot process so that the
VM system can use it.
After that, the system is now able to use sx locks instead of lockmgr
locks in the VM system. To accomplish this, some of the more
questionable uses of the locks (such as testing whether they are
owned or not, as well as allowing shared+exclusive recursion) are
removed, and simpler logic throughout is used so locks should also be
easier to understand.
This has been tested on my laptop for months, and has not shown any
problems on SMP systems, either, so appears quite safe. One more
user of lockmgr down, many more to go :)
which is initialized with whatever string a dhcp/bootp server passes
as vendor tag 134.
There is no standard tag that I know with this information, and
no vendor-defined tag that applies to FreeBSD that I could find
doing the same thing.
The intended use is to pass information to userland for run-time
configuration of a diskless client without having to run a bootp/dhcp
client for the third time (after the one in pxeboot/etherboot, and
the one in the kernel bootp), also because these clients generally
screwup the interface configuration, which is not exactly what you
want when you have your disks nfs-mounted.
Manpage update to follow soon.
MFC-after: 3 days
wait for those cpus, instead of all of them by using a count. Oops.
Make the pointer to the mask that the primary cpu spins on volatile, so
gcc doesn't optimize out an important load. Oops again.
Activate tlb shootdown ipi synchronization now that it works. We have
all involved cpus wait until all the others are done. This may not be
necessary, it is mostly for sanity.
Make the trigger level interrupt ipi handler work.
Submitted by: tmm
The stat() and open() calls have been changed to make use of this new functionality. Using shared locks in
these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes. Also,
this reduces the number of exclusive locks that are floating around in the system, which helps reduce the
number of deadlocks that occur.
A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so this patch can be turned on for
testing, and should eventually go away once it is proven to be stable. I have personally been running this
patch for over a year now, so it is believed to be fully stable.
Reviewed by: jake, obrien
Approved by: jake
test and play with this.
This is not yet production quality and should be run only on dedicated
test boxes.
For people who want to develop transformations for GEOM there exist a
set of shims to run geom in userland (ask phk@freebsd.org).
Reports of all kinds to: phk@freebsd.org
Please include in report:
dmesg
sysctl debug.geomdot
sysctl debug.geomconf
Known significant limitations:
no kernel dump facility.
ioctls severely restricted.
Sponsored by: DARPA, NAI Labs
update the free-space statistics in some cases. The problem affected
directory blocks when the free space dropped below the size of the
maximum allowed entry size. When this happened, the free-space
summary information could claim that there are no further blocks
that can fit a maximum-size entry, even if there are.
The effect of this bug is that the directory may be enlarged even
though there is space within the directory for the new entry. This
wastes disk space and has a negative impact on performance.
Fix it by correctly computing the dh_firstfree array index, adding
a helper macro for clarity. Put an extra sanity check into
ufsdirhash_checkblock() to detect the situation in future.
Found by: dwmalone
Reviewed by: dwmalone
MFC after: 1 week
read-only.
The trouble here is that we don't reopen the device in read/write mode
when we remount in read/write mode resulting in a filesystem sending
write requests to a device which was only opened read/only.
I'm not quite sure how such a reopen would best be done and defer
the problem to more agile hackers.
unit allocation with a bitmap in the generic layer. This
allows us to get rid of the duplicated rman code in every
clonable interface.
Reviewed by: brooks
Approved by: phk
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.
Reviewed by: alc
machdep.guessed_bootdev, and add code to sysctl to parse its value
and give a (not necessarily correct) name to the device we booted
from (the main motivation for this code is to use the info in the
PicoBSD boot scripts, and the impact on the kernel is minimal).
NOTE: the information available in bootdev is not always reliable,
so you should not trust it too much. The parsing code is the same
as in boot2.c, and cannot cover all cases -- as it is, it seems to
work fine with floppies and IDE disks recognised by the BIOS. It
_should_ work as well with SCSI disks recognised by the BIOS.
Booting from a CDROM in floppy emulation will return /dev/fd0 (because
this is what the BIOS tells us).
Booting off the network (e.g. with etherboot) leaves bootdev unset so
the value will be printed as "invalid (0xffffffff)".
Finally, this feature might go away at some point, hopefully when we
have a more reliable way to get the same information.
MFC-after: 5 days
in "missing dependencies" error when loading some kld modules. It is sad to
see how often these days style cleanus break doesn't broken things. Perhaps
people should recall good old principle: "don't fix it if it isn't broken".
-#if defined(__FreeBSD__) && __FreeBSD_version__ >= 500023
+#if defined(__FreeBSD__) && __FreeBSD_version >= 500023
is a genuine bug -- __FreeBSD_version__ does not exist.
The other one:
-#if (__FreeBSD__ < 5)
+#if (__FreeBSD_version < 500000)
pops out when you cross-compile the code:
__FreeBSD__ is a compiler predefine,
__FreeBSD_version is defined in <sys/param.h> .
Given that in this case (and all others in sys/dev/usb and sys/i4b)
the goal is to adapt to a different kernel interface, and not to
a compiler feature, I believe the correct form is the second one
(in the best case the two are synonyms so the change does not break
anything anyways).
for POSIX.1-2001 conformance.
o Add magic to <netinet/in.h> and <netinet6/in6.h> to prevent
redefining INET_ADDRSTRLEN and INET6_ADDRSTRLEN.
o Add a note about missing typedefs in <arpa/inet.h>.
the number of physical pages per KVA page allocated scales properly with
memory size. This fixes problems with kmem_map being too small.
Noticed by: mike, wollman
Submitted by: tmm
fully instaniated.
Revert the logic in pipeclose so that we don't have the entire function
pretty much under a single if() statement, instead invert the test and
just return if it fails.
Submitted (in different form) by: bde
Don't use pool mutexes for pipes. We can not use pool mutexes
because we will need to grab the select lock while holding a pipe
lock which is not allowed because you may not aquire additional
mutexes when holding a pool mutex.
Instead malloc(9) space for the mutex that is shared between the
pipes.
o In i386's <machine/endian.h>, macros have some advantages over
inlines, so change some inlines to macros.
o In i386's <machine/endian.h>, ungarbage collect word_swap_int()
(previously __uint16_swap_uint32), it has some uses on i386's with
PDP endianness.
Submitted by: bde
o Move a comment up in <machine/endian.h> that was accidentially moved
down a few revisions ago.
o Reenable userland's use of optimized inline-asm versions of
byteorder(3) functions.
o Fix ordering of prototypes vs. redefinition of byteorder(3)
functions, so that the non-GCC (libc asm) case has proper
prototypes.
o Add proper prototypes for byteorder(3) functions in <sys/param.h>.
o Prevent redundant duplicate prototypes by making use of the
_BYTEORDER_PROTOTYPED define.
o Move the bswap16(), bswap32(), bswap64() C functions into MD space
for platforms in which asm versions don't exist. This significantly
reduces the complexity of some things at the cost of duplicate code.
Reviewed by: bde
the revived code.
vm pages newly allocated are marked busy (PG_BUSY), thus calling
vm_page_delete before the pages has been freed or unbusied will
cause a deadlock since vm_page_object_page_remove will wait for the
busy flag to be cleared. This can be triggered by calling malloc
with size > PAGE_SIZE and the M_NOWAIT flag on systems low on
physical free memory.
A kernel module that reproduces the problem, written by Logan Gabriel
<logan@mail.2cactus.com>, can be found in the freebsd-hackers mail
archive (12 Apr 2001). The problem was recently noticed again by
Archie Cobbs <archie@dellroad.org>.
Reviewed by: dillon
simply need to prevent switching from another CPU and do not need
interrupts disabled.
- Add a comment to witness_list() about why displaying spin locks for
threads on other CPU's really is just a bad idea and probably shouldn't
be done.
and commenting of NETSMB, NETWMBCRYPTO, and SMBFS. In NOTES, they
had all floated to the bottom of the file with the list of seemingly
random and unclassified kernel options. This change moves them back
up to the network protocol and file system areas, and also documents
the dependencies.
This is belived to be the only place where a soft reference to a vnode
is held with no sort of hard reference, consequently this change should
allow us to free(9) vnodes from the freelist after properly cleaning
them up.
Reviewed by: dillon
it worked- but I ran into a case with a 2204 where commands were being lost
right and left. Best be safe.
For target mode, or things called if we call isp_handle_other response- note
that we might have dropped locks by changing the output pointer so we bail
from the loop. It's the responsibility of the entity dropping the lock to
make sure that we let the f/w know we've read thus far into the response
queue (else we begin processing the same entries again- blech!).
MFC after: 1 day
the upcoming 7.4 family (7xxx controllers).
- improved error reporting and handling
- more diagnostic output
- add extra command packet definitions
- merge sources again with -stable
than the other implementations; we have complete control over the tlb, so we
only demap specific pages. We take advantage of the ranged tlb flush api
to send one ipi for a range of pages, and due to the pm_active optimization
we rarely send ipis for demaps from user pmaps.
Remove now unused routines to load the tlb; this is only done once outside
of the tlb fault handlers.
Minor cleanups to the smp startup code.
This boots multi user with both cpus active on a dual ultra 60 and on a
dual ultra 2.
Due to allocating tlb contexts on the fly, we only ever need to demap the
primary context, non-primary contexts have already been implicitly flushed
by context switching. All we really need to tell is if its a kernel demap
or not, and its easier just to compare against the kernel_pmap which is a
constant.
the context is not actually stolen, as it would be for i386. Instead of
deactivating a user vmspace immediately when switching out, and recycling
its tlb context, wait until the next context switch to a different user
vmspace. In this way we can switch from a user process to any number of
kernel threads and back to the same user process again, without losing any
of its mappings in the tlb that would not already be knocked by the automatic
replacement algorithm. This is not expected to have a measurable performance
improvement on the machines we currently run on, but it sounds cool and makes
the sparc64 port SMPng buzz word compliant.
to exhaust all kmaps. The only reward for setting maxproc
to a value which will cause kmap exhaustion is a panic
during a forkbomb attack.
MFC after: 3 days
by removing parentheses. The main bug is in gcc: on machines with
64-bit longs and 64-bit long longs,
(unsigned long long)rdp->total_sectors / ((1024L * 1024L) / DEV_BSIZE))
has type plain unsigned long instead of the correctly promoted type
unsigned long long, so it can not be printfed using %llu format. Even
1ULL / 1L is mispromoted. Anyway, casting the correct operand
automatically avoids the problem. We do not want to to pessimize the
division; we just want to convert to a common maximal type for printing.
moderately improves msync's and VM object flushing for objects containing
randomly dirtied pages (fsync(), msync(), filesystem update daemon),
and improves cpu use for small-ranged sequential msync()s in the face of
very large mmap()ings from O(N) to O(1) as might be performed by a database.
A sysctl, vm.msync_flush_flag, has been added and defaults to 3 (the two
committed optimizations are turned on by default). 0 will turn off both
optimizations.
This code has already been tested under stable and is one in a series of
memq / vp->v_dirtyblkhd / fsync optimizations to remove O(N^2) restart
conditions that will be coming down the pipe.
MFC after: 3 days
- Move jail checks and some other checks involving constants and stack
variables out from under Giant. This isn't perfectly safe atm because
jail_sysvipc_allowed is read w/o a lock meaning that its value could be
stale. This global variable will soon become a per-jail flag, however,
at which time it will either not need a lock or will use the prison lock.
individual filesystems to determine whether they should operate in
"file system as a single object" mode, or "file system as a set of objects
with individual labels" mode. Note: in the trustedbsd_mac branch,
this is refered to as "MNT_MULTILEVEL", but the two mean the same thing.
MNT_MULTILABEL is more suggestive of a flexible policy system than one
providing purely hierarchal policies. The need for a reserved flag will
go away once nmount() is done.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
and for individual MAC policies. The framework event initializes the
access control subsystem; the policy event allows policies to register
themselves. The gap in between is for all the things we'll think of
later.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
pseudo-devices when an interface goes away. Otherwise, an open /dev/net/foo0
when the interface is removed can cause a crash.
Not objected to by: jlemon
Some buggy firmware workarounds. Fix some endian bugs.
These reduce the diffs from NetBSD, but NetBSD does have more changes since
my last manual merge.
people working on the MAC tree from getting toasted whenever system call
numbers are allocated in the main tree (for example, for KSE :-).
Calls allocated: __mac_{get,set}_proc, __mac_{get,set}_{fd,file}().
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)
Reviewed by: phk (but all errors are mine)
active-filter in pppd(8).
PR: kern/12281
Submitted by: Tim Moore <moore@bricoworks.com>
Not objected by: peter
Reviewed by: ru
Approved by: ru
MFC after: 1 week
be allocated as arrays indexed by the cpu id. Previously the only reliable
way to know the max cpu id was through MAXCPU. mp_ncpus isn't useful here
because cpu ids may be sparsely mapped, although x86 and alpha do not do this.
Also, call cpu_mp_probe much earlier so the max cpu id is known before the VM
starts up. This is intended to help support per cpu queues for the new
allocator, but may be useful elsewhere.
Reviewed by: jake
Approved by: jake
Link if only ATAPI device in kernel config
Remove unused #includes
Rearrange a bit in ata-raid to make diff against -stable smaller
Enable wc as default again, dunne how this happend...
and again in vm_page.c and vm_pageq.c.
o Delete unusused prototypes. (Mainly a result of the earlier renaming
of various functions from vm_page_*() to vm_pageq_*().)
This makes other power-management system (APM for now) to be able to
generate power profile change events (ie. AC-line status changes), and
other kernel components, not only the ACPI components, can be notified
the events.
- move subroutines in acpi_powerprofile.c (removed) to kern/subr_power.c
- call power_profile_set_state() also from APM driver when AC-line
status changes
- add call-back function for Crusoe LongRun controlling on power
profile changes for a example
on the loader to do it. Improve smp startup code to be less racy and to
defer certain things until the right time. This almost boots single user
on my dual ultra 60, it is still very fragile:
SMP: AP CPU #1 Launched!
Enter full pathname of shell or RETURN for /bin/sh:
# ls
Debugger("trapsig")
Stopped at Debugger+0x1c: ta %xcc, 1
db> heh
No such command
db>
with pmaps. When the context numbers wrap around we flush all user mappings
from the tlb. This makes use of the array indexed by cpuid to allow a pmap
to have a different context number on a different cpu. If the context numbers
are then divided evenly among cpus such that none are shared, we can avoid
sending tlb shootdown ipis in an smp system for non-shared pmaps. This also
removes a limit of 8192 processes (pmaps) that could be active at any given
time due to running out of tlb contexts.
Inspired by: the brown book
Crucial bugfix from: tmm
clobbered by the child. This is more complicated than usual because the
window that could get clobbered is pushed in kernel mode, so a lot of
registers would have to be saved in other registers in userland and we
don't have enough. What we do have is space in the pcb to temporarily
store user windows that were spilled in kernel mode, but could not be
immediately stored to the user stack. So we copy in the parent's topmost
window and save it in the pcb, and arrange for it to be copied back out
when the child is done frobbing the stack.
Reviewed by: tmm
PAGE_SIZE / MCLBYTES == 1 crash. Fix them by changing the
appropriate "allocate new page and bucket" code in mb_alloc to use
the macro for properly grabbing an allocated object from a bucket,
the one that checks whether the bucket is empty.
This should allow ken to continue testing zero-copy stuff on -CURRENT.
Noticed and provided debug info: ken
* Move the section which manipulates ia64_pal_base to after cninit() so
that we don't risk printing anything before we have a console.
* Don't call ia64_probe_sapics() for a SKI build. This should really
be dependant on ACPICA being present or something.
Add code to properly detach/attach disks that are part of a RAID.
Mark a disk that is attached on an ATA channel belonging to a
RAID as a spare disk that can be used for rebuilding failed RAID1's.
Add support for rebuilding failed RAID1's.
Several fixes to the detach/attach code.
For replacing a disk in a failed RAID1 do the following:
Find the controller channel# of the failed disk.
Exec 'atacontrol detach <channel#>' to free the disk from the system.
Replace the failed disk with a new one of at least the same size.
If your have your disks in drawers/enclosures this can be done with
the system still running.
Exec 'atacontrol attach <channel#>' to add the disk to the system and
mark it as a valid spare for rebuild.
Exec 'atacontrol rebuild <array#>'
The system will rebuild the array on the fly, the array can still
be used during this, although with slower performance.
Please let me know of any problems with this!
Sponsored by: Advanis Inc.
MFC after: 2 weeks
fill out netc_anon (a `struct ucred'), and add an XXX around the
entire operation since it isn't clear whether it's doing the right
thing with things like cr_uidinfo and cr_prison.
is not a neighbor. see comments for the detailed reason.
- Rejected the process of nd6_rtrequest() when the request is RESOLVE and
the interface does not need neighbor caches.
Obtained from: KAME
MFC After: 1 week
the data was supplied as a uio or an mbuf. Previously the limit was
ignored for mbuf data, and NFS could run the kernel out of mbufs
when an ipfw rule blocked retransmissions.
Previously, the UPAGES/KSTACK area of processes/threads would leak memory
at the time that a previously swapped process was terminated. Lukcily, the
leak was only 12K/proc, so it was unlikely to be a major problem unless you
had an undersized swap partition.
Submitted by: dillon
Reviewed by: silby
MFC after: 1 week
I put these in to match the use of spl*() in the NetBSD code I was basing this
on, but it appears to cause problems.
I'm doing this in a separate commit so as to be able to refer back if locking
becomes an issue at a later stage.
Define the atm_dev_free() routine so that its OK to free stuff that is defined
as volatile. Note this doesn't FORCE the arguemnts to be volatile,
just says that it's not an error if it is..
and isn't strictly required. However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.
the pipe is locked and shouldn't be.
initialize pipe->pipe_mtxp to NULL when creating pipes in order not
to trip the above assertions.
swap pipe lock with giant around calls to pipe_destroy_write_buffer()
pipe_destroy_write_buffer issue noticed by: jhb