to a copied-in copy of the 'union semun' and a uioseg to indicate which
memory space the 'buf' pointer of the union points to. This is then used
in linux_semctl() and svr4_sys_semctl() to eliminate use of the stackgap.
- Mark linux_ipc() and svr4_sys_semsys() MPSAFE.
from going away. mount(2) is now MPSAFE.
- Expand the scope of Giant some in unmount(2) to protect the mp structure
(or rather, to handle concurrent unmount races) from going away.
umount(2) is now MPSAFE, as well as linux_umount() and linux_oldumount().
- nmount(2) and linux_mount() were already MPSAFE.
pmap_copy() if the mapping is VM_INHERIT_SHARE. Suppose the mapping
is also wired. vmspace_fork() clears the wiring attributes in the vm
map entry but pmap_copy() copies the PG_W attribute in the PTE. I
don't think this is catastrophic. It blocks pmap_remove_pages() from
destroying the mapping and corrupts the pmap's wiring count.
This revision fixes the problem by changing pmap_copy() to clear the
PG_W attribute.
Reviewed by: tegge@
This driver was ported from OpenBSD by Shigeaki Tagashira
<shigeaki@se.hiroshima-u.ac.jp> and posted at
http://www.se.hiroshima-u.ac.jp/~shigeaki/software/freebsd-nfe.html
It was additionally cleaned up by me.
It is still a work-in-progress and thus is purposefully not in GENERIC.
And it conflicts with nve(4), so only one should be loaded.
in 1999, and there are changes to the sysctl names compared to PR,
according to that discussion. The description is in sys/conf/NOTES.
Lines in the GENERIC files are added in commented-out form.
I'll attach the test script I've used to PR.
PR: kern/14584
Submitted by: babkin
VM_ALLOC_NORMAL instead of VM_ALLOC_SYSTEM when try is TRUE. In other
words, when get_pv_entry() is permitted to fail, it no longer tries as
hard to allocate a page.
Change pmap_enter_quick_locked() to fail rather than wait if it is
unable to allocate a page table page. This prevents a race between
pmap_enter_object() and the page daemon. Specifically, an inactive
page that is a successor to the page that was given to
pmap_enter_quick_locked() might become a cache page while
pmap_enter_quick_locked() waits and later pmap_enter_object() maps
the cache page violating the invariant that cache pages are never
mapped. Similarly, change
pmap_enter_quick_locked() to call pmap_try_insert_pv_entry() rather
than pmap_insert_entry(). Generally speaking,
pmap_enter_quick_locked() is used to create speculative mappings. So,
it should not try hard to allocate memory if free memory is scarce.
Add an assertion that the object containing m_start is locked in
pmap_enter_object(). Remove a similar assertion from
pmap_enter_quick_locked() because that function no longer accesses the
containing object.
Remove a stale comment.
Reviewed by: ups@
syscalls. This way there will be a log message printed to the console
(this time for real).
Note: UNIMPL should be used for syscalls we do not implement ever, e.g.
syscalls to load linux kernel modules.
Submitted by: rdivacky
Sponsored by: Goole SoC 2006
P4 IDs: 99600, 99602
when we're about to call kdb_trap() because the latter MI
function can disable interrupts by itself now.
Pointed out by: bde
X-MFC remark: depends on kern/subr_kdb.c#1.18
Sponsored by: RiNet (Cronyx Plus LLC)
Use the method described in IA-32 Intel Architecture Software
Developer's Manual chapter 11.6.6 to get valid mxcsr bits,
use the mxcsr mask to clear invalid bits passed by user code.
an explicit comment that it's needed for the linuxolator. This is not the
case anymore. For all other architectures there was only a "KEEP THIS".
I'm (and other people too) running a COMPAT_43-less kernel since it's not
necessary anymore for the linuxolator. Roman is running such a kernel for a
for longer time. No problems so far. And I doubt other (newer than ia32
or alpha) architectures really depend on it.
This may result in a small performance increase for some workloads.
If the removal of COMPAT_43 results in a not working program, please
recompile it and all dependencies and try again before reporting a
problem.
The only place where COMPAT_43 is needed (as in: does not compile without
it) is in the (outdated/not usable since too old) svr4 code.
Note: this does not remove the COMPAT_43TTY option.
Nagging by: rdivacky
There is a race with the current locking scheme and removing
it should have no measurable performance impact.
This fixes page faults leading to panics in pmap_enter_quick_locked()
on amd64/i386.
Reviewed by: alc,jhb,peter,ps
Update of syscall.master:
o Adding of several new dummy syscalls (268-310)
o Synchronization of amd64 syscall.master with i386 one
o Auditing added to amd64 syscall.master
o Change auditing type for lstat syscall (bugfix). [1]
P4-Changes: 98672, 98674
Noticed by: rwatson [1]
Sponsored by: Google SoC 2006
Submitted by: rdivacky
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.
Briefly, the scheduler has following characteristic:
1. Timesharing process's nice value is seriously respected,
timeslice and interaction detecting algorithm are based
on nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler
uses 256 priority queues. Unlike ULE which using pull and push, the
scheduelr uses pull method, the main reason is to let relative idle
cpu do the work, but current the whole scheduler is protected by the
big sched_lock, so the benefit is not visible, it really can be worse
than nothing because all other cpu are locked out when we are doing
balancing work, which the 4BSD scheduelr does not have this problem.
The scheduler does not support hyperthreading very well, in fact,
the scheduler does not make the difference between physical CPU and
logical CPU, this should be improved in feature. The scheduler has
priority inversion problem on MP machine, it is not good for
realtime scheduling, it can cause realtime process starving.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, despite on UP or SMP kernel.
the arm to compile without all the extras that don't appear, at least
not in the flavors of ARM I deal with. This helps us save about 100k.
If I've botched the available devices on a platform, please let me
know and I'll correct ASAP.
that it just warns the user with a printf when it misaligns a piece
of memory that was requested through a busdma tag.
Some drivers (such as mpt, and probably others) were asking for alignments
that could not be satisfied, but as far as driver operation was concerned,
that did not matter. In the theory that other drivers will fall into
this same category, we agreed that panicing or making the allocation
fail will cause more hardship than is necessary. The printf should
be sufficient motivation to get the driver glitch fixed.
Add a quick hack to ensure that bus_dmamem_alloc properly aligns
small allocations with large alignment requirements.
Add a panic to detect cases where we've still failed to properly align.
conformance with the mbuf and uio load routines. ENOMEM can only happen
with BUS_DMA_NOWAIT is passed in, thus the deferals are disabled. I don't
like doing this, but fixing this fixes assumptions in other important drivers,
which is a net benefit for now.
o Properly use rman(9) to manage resources. This eliminates the
need to puc-specific hacks to rman. It also allows devinfo(8)
to be used to find out the specific assignment of resources to
serial/parallel ports.
o Compress the PCI device "database" by optimizing for the common
case and to use a procedural interface to handle the exceptions.
The procedural interface also generalizes the need to setup the
hardware (program chipsets, program clock frequencies).
o Eliminate the need for PUC_FASTINTR. Serdev devices are fast by
default and non-serdev devices are handled by the bus.
o Use the serdev I/F to collect interrupt status and to handle
interrupts across ports in priority order.
o Sync the PCI device configuration to include devices found in
NetBSD and not yet merged to FreeBSD.
o Add support for Quatech 2, 4 and 8 port UARTs.
o Add support for a couple dozen Timedia serial cards as found
in Linux.
entry (PTE) have the same meaning. The exception to this rule is the
eighth bit (0x080). It is the PS bit in a PDE and the PAT bit in a
PTE. This change avoids the possibility that pmap_enter() confuses a
PAT bit with a PS bit, avoiding a panic().
Eliminate a diagnostic printf() from the i386 pmap_enter() that serves
no current purpose, i.e., I've seen no bug reports in the last two
years that are helped by this printf().
Reviewed by: jhb
caches are dangerous" to "a shared L1 data cache is dangerous". This
is a compromise between paranoia and performance: Unlike the L1 cache,
nobody has publicly demonstrated a cryptographic side channel which
exploits the L2 cache -- this is harder due to the larger size, lower
bandwidth, and greater associativity -- and prohibiting shared L2
caches turns Intel Core Duo processors into Intel Core Solo processors.
As before, the 'machdep.hyperthreading_allowed' sysctl will allow even
the L1 data cache to be shared.
Discussed with: jhb, scottl
Security: See FreeBSD-SA-05:09.htt for background material.
via the debug.minidump sysctl and tunable.
Traditional dumps store all physical memory. This was once a good thing
when machines had a maximum of 64M of ram and 1GB of kvm. These days,
machines often have many gigabytes of ram and a smaller amount of kvm.
libkvm+kgdb don't have a way to access physical ram that is not mapped
into kvm at the time of the crash dump, so the extra ram being dumped
is mostly wasted.
Minidumps invert the process. Instead of dumping physical memory in
in order to guarantee that all of kvm's backing is dumped, minidumps
instead dump only memory that is actively mapped into kvm.
amd64 has a direct map region that things like UMA use. Obviously we
cannot dump all of the direct map region because that is effectively
an old style all-physical-memory dump. Instead, introduce a bitmap
and two helper routines (dump_add_page(pa) and dump_drop_page(pa)) that
allow certain critical direct map pages to be included in the dump.
uma_machdep.c's allocator is the intended consumer.
Dumps are a custom format. At the very beginning of the file is a header,
then a copy of the message buffer, then the bitmap of pages present in
the dump, then the final level of the kvm page table trees (2MB mappings
are expanded into a 4K page mappings), then the sparse physical pages
according to the bitmap. libkvm can now conveniently access the kvm
page table entries.
Booting my test 8GB machine, forcing it into ddb and forcing a dump
leads to a 48MB minidump. While this is a best case, I expect minidumps
to be in the 100MB-500MB range. Obviously, never larger than physical
memory of course.
minidumps are on by default. It would want be necessary to turn them off
if it was necessary to debug corrupt kernel page table management as that
would mess up minidumps as well.
Both minidumps and regular dumps are supported on the same machine.
to reduce the pv_entry_count counter. This was found by Tor Egge. In the
same email, Tor also pointed out the pv_stats problem in the previous
commit, but I'd forgotten about it until I went looking for this email
about this allocation problem.
create managed mappings within the clean submap. To prevent regressions,
add assertions blocking the creation of managed mappings within the clean
submap.
Reviewed by: tegge
so that we only have to do an ioapic_write() instead of an ioapic_read()
followed by an ioapic_write() every time we mask and unmask level triggered
interrupts. This cuts the execution time for these operations roughly in
half.
Profiled by: Paolo Pisati <p.pisati@oltrelinux.com>
MFC after: 1 week
PCB in which the context of stopped CPUs is stored. To access this
PCB from KDB, we introduce a new define, called KDB_STOPPEDPCB. The
definition, when present, lives in <machine/kdb.h> and abstracts
where MD code saves the context. Define KDB_STOPPEDPCB on i386,
amd64, alpha and sparc64 in accordance to previous code.
with large mmap files mapped into many processes, this saves hundreds of
megabytes of ram.
pv entries were individually allocated and had two tailq entries and two
pointers (or addresses). Each pv entry was linked to a vm_page_t and
a process's address space (pmap). It had the virtual address and a
pointer to the pmap.
This change replaces the individual allocation with a per-process
allocation system. A page ("pv chunk") is allocated and this provides
168 pv entries for that process. We can now eliminate one of the 16 byte
tailq entries because we can simply iterate through the pv chunks to find
all the pv entries for a process. We can eliminate one of the 8 byte
pointers because the location of the pv entry implies the containing
pv chunk, which has the pointer. After overheads from the pv chunk
bitmap and tailq linkage, this works out that each pv entry has an
effective size of 24.38 bytes.
Future work still required, and other problems:
* when running low on pv entries or system ram, we may need to defrag
the chunk pages and free any spares. The stats (vm.pmap.*) show that
this doesn't seem to be that much of a problem, but it can be done if
needed.
* running low on pv entries is now a much bigger problem. The old
get_pv_entry() routine just needed to reclaim one other pv entry.
Now, since they are per-process, we can only use pv entries that are
assigned to our current process, or by stealing an entire page worth
from another process. Under normal circumstances, the pmap_collect()
code should be able to dislodge some pv entries from the current
process. But if needed, it can still reclaim entire pv chunk pages
from other processes.
* This should port to i386 really easily, except there it would reduce
pv entries from 24 bytes to about 12 bytes.
(I have integrated Alan's recent changes.)
a pv entry if the number of entries is below the high water mark for pv
entries.
Use pmap_try_insert_pv_entry() in pmap_copy() instead of
pmap_insert_entry(). This avoids possible recursion on a pmap lock in
get_pv_entry().
Eliminate the explicit low-memory checks in pmap_copy(). The check that
the number of pv entries was below the high water mark was largely
ineffective because it was located in the outer loop rather than the
inner loop where pv entries were allocated. Instead of checking, we
attempt the allocation and handle the failure.
Reviewed by: tegge
Reported by: kris
MFC after: 5 days
back to using the RSDT instead. ACPI-CA already follows this same strategy
as a workaround for yet another instance of brain-damaged BIOS writers.
PR: i386/93963
Submitted by: Masayuki FUKUI <fukui.FreeBSD@fanet.net>
Specifically, on mappings with PG_G set pmap_remove() not only performs
the necessary per-page invlpg invalidations but also performs an
unnecessary invalidation of the entire set of non-PG_G entries.
Reviewed by: tegge
Previously, we tried to allow this only for root. However, we were calling
suser() on the *target* process rather than the current process. This
means that if you can ptrace() a process running as root you can set a
hardware watch point in the kernel. In practice I think you probably have
to be root in order to pass the p_candebug() checks in ptrace() to attach
to a process running as root anyway. Rather than fix the suser(), I just
axed the entire idea, as I can't think of any good reason _at all_ for
userland to set hardware watch points for KVM.
MFC after: 3 days
Also thinks hardware watch points on KVM from userland are bad: bde, rwatson
to be 'long' instead of 'int' so that sysctl(8) correctly displays
the 8 returned bytes as a single 'long' instead of two 'int' values.
Submitted by: peter
In at least one benchmark this showed around a 20% performance increase.
If other workloads do benefit from having hyperthreads service interrupts,
we can always make this a loader tunable.
MFC after: 3 days
Tested by: ps
- Throw out all of the logical APIC ID stuff. The Intel docs are somewhat
ambiguous, but it seems that the "flat" cluster model we are currently
using is only supported on Pentium and P6 family CPUs. The other
"hierarchy" cluster model that is supported on all Intel CPUs with
local APICs is severely underdocumented. For example, it's not clear
if the OS needs to glean the topology of the APIC hierarchy from
somewhere (neither ACPI nor MP Table include it) and setup the logical
clusters based on the physical hierarchy or not. Not only that, but on
certain Intel chipsets, even though there were 4 CPUs in a logical
cluster, all the interrupts were only sent to one CPU anyway.
- We now bind interrupts to individual CPUs using physical addressing via
the local APIC IDs. This code has also moved out of the ioapic PIC
driver and into the common interrupt source code so that it can be
shared with MSI interrupt sources since MSI is addressed to APICs the
same way that I/O APIC pins are.
- Interrupt source classes grow a new method pic_assign_cpu() to bind an
interrupt source to a specific local APIC ID.
- The SMP code now tells the interrupt code which CPUs are avaiable to
handle interrupts in a simpler and more intuitive manner. For one thing,
it means we could now choose to not route interrupts to HT cores if we
wanted to (this code is currently in place in fact, but under an #if 0
for now).
- For now we simply do static round-robin of IRQs to CPUs when the first
interrupt handler just as before, with the change that IRQs are now
bound to individual CPUs rather than groups of up to 4 CPUs.
- Because the IRQ to CPU mapping has now been moved up a layer, it would
be easier to manage this mapping from higher levels. For example, we
could allow drivers to specify a CPU affinity map for their interrupts,
or we could allow a userland tool to bind IRQs to specific CPUs.
The MFC is tentative, but I want to see if this fixes problems some folks
had with UP APIC kernels on 6.0 on SMP machines (an SMP kernel would work
fine, but a UP APIC kernel (such as GENERIC in RELENG_6) would lose
interrupts).
MFC after: 1 week
Keep accounting time (in per-cpu) cputicks and the statistics counts
in the thread and summarize into struct proc when at context switch.
Don't reach across CPUs in calcru().
Add code to calibrate the top speed of cpu_tickrate() for variable
cpu_tick hardware (like TSC on power managed machines).
Don't enforce monotonicity (at least for now) in calcru. While the
calibrated cpu_tickrate ramps up it may not be true.
Use 27MHz counter on i386/Geode.
Use TSC on amd64 & i386 if present.
Use tick counter on sparc64
Rename struct thread's td_sticks to td_pticks, we will need the
other name for more appropriately named use shortly. Reduce it
from uint64_t to u_int.
Clear td_pticks whenever we enter the kernel instead of recording
its value as reference for userret(). Use the absolute value of
td->pticks in userret() and eliminate third argument.
Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.
For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines. (The main reason for this first milestone commit is
to verify that hypothesis.)
On slower machines, the avoided multiplications to normalize timestams
at every context switch, comes out as a 5-7% better score on the
unixbench/context1 microbenchmark. On more modern hardware no change
in performance is seen.
the callers if the exec either succeeds or fails early.
- Move the code to call exit1() if the exec fails after the vmspace is
gone to the bottom of kern_execve() to cut down on some code duplication.
dedicated to storing pv entries, originally so that kva didn't have to be
allocated at inconvenient times. For amd64, we can get the same effect by
using the direct map area. Allocating pages is the same as with the object
backed method, but now we can just lookup the page in the direct map area.
Thus, no more pageable kva is reserved. This is the single largest
consumer of kva on our work machines and this change should help conserve
the fixed size 2GB pageable kva on the amd64 kernel.
There are a pair of sysctl nodes introduced, named the same as their
tunable counterparts. vm.pmap.shpgperproc and vm.pmap.pv_entry_max
They work just like the tunables of the same path, except the values are
linked. The pv entry cap is now dynamically changeable.
I didn't make them totally unlimited because we need some sort of safety
limit still. One could consume all physical memory without a cap.
is a fatal fault if we are holding any non-sleepable locks. This should
cut down on the number of bogus LORs we currently get when the kernel
panics due to a NULL (or bogus) pointer dereference that goes wandering
off into the VM system which tries to acquire locks and then kicks off
the spurious LORs. This should probably be ported to all the archs at
some point.
Tested on: i386
to COMPAT_43TTY.
Add COMPAT_43TTY to NOTES and */conf/GENERIC
Compile tty_compat.c only under the new option.
Spit out
#warning "Old BSD tty API used, please upgrade."
if ioctl_compat.h gets #included from userland.
param.h. Per request, I've placed these just after the
_NO_NAMESPACE_POLLUTION ifndef. I've not renamed anything yet, but
may since we don't need the __.
Submitted by: bde, jhb, scottl, many others.
various pcib drivers to use their own private devclass_t variables for
their modules.
- Use the DEFINE_CLASS_0() macro to declare drivers for the various pcib
drivers while I'm here.
- provide an interface (macros) to the page coloring part of the VM system,
this allows to try different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)
MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contains
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)
amd64_set_watch() as 'unsigned int' and 'unsigned int' is 32bit long on amd64.
Even with that fix hardware watchpoint don't work for me on amd64, ie. when
I set the watchpoint and write a byte there, nothing happens.
with flags bitfield and set BI_CAN_EXEC_DYN flag for all brands that usually
allow executing elf dynamic binaries (aka shared libraries). When it is
requested to execute ET_DYN elf image check if this flag is on after we
know the elf brand allowing execution if so.
PR: kern/87615
Submitted by: Marcin Koziej <creep@desk.pl>
passing a pointer to an opaque clockframe structure and requiring the
MD code to supply CLKF_FOO() macros to extract needed values out of the
opaque structure, just pass the needed values directly. In practice this
means passing the pair (usermode, pc) to hardclock() and profclock() and
passing the boolean (usermode) to hardclock_cpu() and hardclock_process().
Other details:
- Axe clockframe and CLKF_FOO() macros on all architectures. Basically,
all the archs were taking a trapframe and converting it into a clockframe
one way or another. Now they can just extract the PC and usermode values
directly out of the trapframe and pass it to fooclock().
- Renamed hardclock_process() to hardclock_cpu() as the latter is more
accurate.
- On Alpha, we now run profclock() at hz (profhz == hz) rather than at
the slower stathz.
- On Alpha, for the TurboLaser machines that don't have an 8254
timecounter, call hardclock() directly. This removes an extra
conditional check from every clock interrupt on Alpha on the BSP.
There is probably room for even further pruning here by changing Alpha
to use the simplified timecounter we use on x86 with the lapic timer
since we don't get interrupts from the 8254 on Alpha anyway.
- On x86, clkintr() shouldn't ever be called now unless using_lapic_timer
is false, so add a KASSERT() to that affect and remove a condition
to slightly optimize the non-lapic case.
- Change prototypeof arm_handler_execute() so that it's first arg is a
trapframe pointer rather than a void pointer for clarity.
- Use KCOUNT macro in profclock() to lookup the kernel profiling bucket.
Tested on: alpha, amd64, arm, i386, ia64, sparc64
Reviewed by: bde (mostly)
duplicated anyways) and into a single MI driver. Extend the driver a bit
to implement the bus and PCI kobj interfaces such that other drivers can
attach to it and transparently act as if their parent device is the PCI
bus (for the most part).
means:
o Remove Elf64_Quarter,
o Redefine Elf64_Half to be 16-bit,
o Redefine Elf64_Word to be 32-bit,
o Add Elf64_Xword and Elf64_Sxword for 64-bit entities,
o Use Elf_Size in MI code to abstract the difference between
Elf32_Word and Elf64_Word.
o Add Elf_Ssize as the signed counterpart of Elf_Size.
MFC after: 2 weeks
which existed to cleanup the linux_osname mutex. Now that MTX_SYSINIT()
has grown a SYSUNINIT to destroy mutexes on unload, the extra destroy here
was redundant and resulted in panics in debug kernels.
MFC after: 1 week
Reported by: Goran Gajic ggajic at afrodita dot rcub dot bg dot ac dot yu
originally thought. The BIOS that cleared CPUID_APIC actually managed
to disable the local APIC entirely and even Windows 64 doesn't boot on
it.
Reported by: bz
if the boot CPU has a local APIC because some BIOS vendors are not
competent enough to set this bit. Instead, just assume that we always have
a local APIC on amd64. For i386 the check is a bit more subtle. FreeBSD
requires either an MP Table or an ACPI MADT table to enumerate APICs. The
only systems that have one of those tables that don't have local APICs are
some presumably rare (and old) SMP 486 systems using external APICs. Thus,
instead of checking the CPUID_APIC flag, check the CPU class and abort if
we are running on a 486.
MFC after: 1 week
Reported by: bz
changes DELAY to use the TSC once it has been calibrated. This does NOT
use the TSC for long-term timekeeping. It only uses it to bound the
DELAY() spinloop. This should not be affected by the Athlon64 X2 TSC
quirks because the cpu is not halted while we use DELAY().
- Move PUSH_FRAME and POP_FRAME to asmacros.h and use PUSH_FRAME in
atpic entry points.
- Move PCPU_* asm macros out of the middle of the asm profiling macros.
- Pass IRQ vector argument as an int rather than void * to reduce diffs
with i386.
- EOI the lapic in C for the lapic timer handler.
- GC unused Xcpuast function.
- Split IPI_STOP handling code of ipi_nmi_handler() out into a
cpustop_handler() function and call it from Xcpustop rather than
duplicating all the logic in assembly.
- Fixup the list of symbols with interrupt frames in ddb traces.
Xatpic_fastintr* have never existed on amd64, and the lapic timer
handler and various IPI handlers were missing.
- Use trapframe instead of intrframe for interrupt entry points (on amd64
the interrupt vector was already a separate argument, so the two frames
were already identical) and GC intrframe.
Submitted by: peter (3)
- Move vtophys() macros next to vtopte() where vtopte() exists to match
comments above vtopte().
- Remove references to the alternate address space in the comment above
vtopte(). amd64 never had the alternate address space, and i386 lost it
prior to PAE support being added.
- s/entires/entries/ in comments.
Reviewed by: alc
MACHINE_ARCH and MACHINE). Their purpose was to be able to test
in cpp(1), but cpp(1) only understands integer type expressions.
Using such unsupported expressions introduced a number of subtle
bugs, which were discovered by compiling with -Wundef.
Use the following kernel configuration option to enable:
options BPF_JITTER
If you want to use bpf_filter() instead (e. g., debugging), do:
sysctl net.bpf.jitter.enable=0
to turn it off.
Currently BIOCSETWF and bpf_mtap2() are unsupported, and bpf_mtap() is
partially supported because 1) no need, 2) avoid expensive m_copydata(9).
Obtained from: WinPcap 3.1 (for i386)
working IRQ0 with APIC anymore. Previously, it was possible to have
some other ATPIC IRQS "leak" through in a few edge cases. For example, on
my x86 test machine, ACPI re-routes the SCI (IRQ 9) to intpin 13 on the
first I/O APIC. This leaves a hole for IRQ 13 (since the APIC doesn't
provide a source for IRQ 13 in that case) with the result that the ATPIC
IRQ13 source was registered instead. This changes the 8259A drivers to
only register their interrupt sources if none of the 16 ISA IRQs have an
interrupt source already installed.
MFC after: 1 week
- S3 Savage driver ported.
- Added support for ATI_fragment_shader registers for r200.
- Improved r300 support, needed for latest r300 DRI driver.
- (possibly) r300 PCIE support, needs X.Org server from CVS.
- Added support for PCI Matrox cards.
- Software fallbacks fixed for Rage 128, which used to render badly or hang.
- Some issues reported by WITNESS are fixed.
- i915 module Makefile added, as the driver may now be working, but is untested.
- Added scripts for copying and preprocessing DRM CVS for inclusion in the
kernel. Thanks to Daniel Stone for getting me started on that.
rare case of a stray interrupt to an unregistered source (such as a stray
interrupt from the 8259As when using APIC), this could result in a page
fault when it tried to walk the list of interrupt handlers to execute
INTR_FAST handlers. This bug was introduced with the intr_event changes,
so it's not present in 5.x or 6.x.
Submitted by: Mark Tinguely tinguely at casselton dot net
via the DEFAULTS kernel configs. This allows folks to turn it that option
off in the kernel configs if desired without having to hack the source.
This is especially useful since PUC_FASTINTR hangs the kernel boot on my
ultra60 which has two uart(4) devices hung off of a puc(4) device.
I did not enable PUC_FASTINTR by default on powerpc since powerpc does not
currently allow sharing of INTR_FAST with non-INTR_FAST like the other
archs.
during boot up. Now we do a full reset of the 8259As and setup a simple
interrupt handler (we actually borrow the apic one that just does an
immediate iret) to handle any spurious interrupts triggered by either chip.
This should fix some folks that were getting a Trap 30 during bootup of
certain SMP AMD systems. This might get pushed into the 6.0 branch as an
errata. For now a suitable workaround is to add 'device atpic' to your
kernel config.
Tested by: scottl
Helpful info from: dillon
MFC after: 1 week
mystery traps. If we don't have a message for a given trap, just use
UNKNOWN for the message.
- Add trap messages for T_XMMFLT and T_RESERVED.
MFC after: 1 week
IPI_STOP handling code use atomic_readandclear() to execute the restart
function on the first CPU to resume and restore the behavior of always
executing the restart function on the BSP since this is in fact what the
non-NMI IPI_STOP handler does. I did add back in a statement to clear
the restart function pointer after it is executed to match the behavior
of the non-NMI IPI_STOP handler.
I/O APIC that doesn't exist, then a read of the version register is going
to return -1 which is 0xffffffff not 0xffffff.
Tested on: i386
Tested by: Nikos Ntarmos ntarmos at ceid dot upatras dot gr
MFC after: 1 week
The following repo-copies were made (by Mark Murray):
sys/i386/isa/spkr.c -> sys/dev/speaker/spkr.c
sys/i386/include/speaker.h -> sys/dev/speaker/speaker.h
share/man/man4/man4.i386/spkr.4 -> share/man/man4/spkr.4
reclamation synchronously from get_pv_entry() instead of
asynchronously as part of the page daemon. Additionally, limit the
reclamation to inactive pages unless allocation from the PV entry zone
or reclamation from the inactive queue fails. Previously, reclamation
destroyed mappings to both inactive and active pages. get_pv_entry()
still, however, wakes up the page daemon when reclamation occurs. The
reason being that the page daemon may move some pages from the active
queue to the inactive queue, making some new pages available to future
reclamations.
Print the "reclaiming PV entries" message at most once per minute, but
don't stop printing it after the fifth time. This way, we do not give
the impression that the problem has gone away.
Reviewed by: tegge
sio(4) will claim it. This change therefore only affects how ports
are handled when they are not claimed by sio(4), and in principle
will improve hardware support.
MFC after: 2 months
Previously, pvzone's initialization was split between pmap_init() and
pmap_init2(). This split initialization was the underlying cause of
some UMA panics during initialization. Specifically, if the UMA boot
pages was exhausted before the pvzone was fully initialized, then UMA,
through no fault of its own, would use an inappropriate back-end
allocator leading to a panic. (Previously, as a workaround, we have
increased the UMA boot pages.) Fortunately, there is no longer any
reason that pvzone's initialization cannot be completed in
pmap_init().
Eliminate a check for whether pv_entry_high_water has been initialized
or not from get_pv_entry(). Since pvzone's initialization is
completed in pmap_init(), this check is no longer needed.
Use cnt.v_page_count, the actual count of available physical pages,
instead of vm_page_array_size to compute the maximum number of pv
entries.
Introduce the vm.pmap.pv_entries tunable on alpha and ia64.
Eliminate some unnecessary white space.
Discussed with: tegge (item #1)
Tested by: marcel (ia64)
source is first enabled similar to how intr_event's now allocate ithreads
on-demand. Previously, we would map IDT vectors 1:1 to IRQs. Since we
only have 191 available IDT vectors for I/O interrupts, this limited us
to only supporting IRQs 0-190 corresponding to the first 190 I/O APIC
intpins. On many machines, however, each PCI-X bus has its own APIC even
though it only has 1 or 2 devices, thus, we were reserving between 24 and
32 IRQs just for 1 or 2 devices and thus 24 or 32 IDT vectors. With this
change, a machine with 100 IRQs but only 5 in use will only use up 5 IDT
vectors. Also, this change provides an API (apic_alloc_vector() and
apic_free_vector()) that will allow a future MSI interrupt source driver to
request IDT vectors for use by MSI interrupts on x86 machines.
Tested on: amd64, i386
fails, reclaim a pv entry by destroying a mapping to an inactive
page.
Change the format strings in many of the assertions that were recently
converted from PMAP_DIAGNOSTIC printf()s so that they are compatible
with PAE. Avoid unnecessary differences between the amd64 and i386
format strings.
- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some
memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
'device npx' (both of which aren't really optional right now) and
'device io' and 'device mem' (to preserve POLA for 4.x users upgrading
to 6.0) from GENERIC into DEFAULTS.
Requested by: scottl
Reviewed by: scottl
* Don't recursively panic if we've already paniced and the local apic is
now stuck.
* Add hw.apic.* tunables/sysctls for extint controls
* Change "lapic%d timer" to "cpu%d timer" intname to match i386
and increase flexibility to allow various different approaches to be tried
in the future.
- Split struct ithd up into two pieces. struct intr_event holds the list
of interrupt handlers associated with interrupt sources.
struct intr_thread contains the data relative to an interrupt thread.
Currently we still provide a 1:1 relationship of events to threads
with the exception that events only have an associated thread if there
is at least one threaded interrupt handler attached to the event. This
means that on x86 we no longer have 4 bazillion interrupt threads with
no handlers. It also means that interrupt events with only INTR_FAST
handlers no longer have an associated thread either.
- Renamed struct intrhand to struct intr_handler to follow the struct
intr_foo naming convention. This did require renaming the powerpc
MD struct intr_handler to struct ppc_intr_handler.
- INTR_FAST no longer implies INTR_EXCL on all architectures except for
powerpc. This means that multiple INTR_FAST handlers can attach to the
same interrupt and that INTR_FAST and non-INTR_FAST handlers can attach
to the same interrupt. Sharing INTR_FAST handlers may not always be
desirable, but having sio(4) and uhci(4) fight over an IRQ isn't fun
either. Drivers can always still use INTR_EXCL to ask for an interrupt
exclusively. The way this sharing works is that when an interrupt
comes in, all the INTR_FAST handlers are executed first, and if any
threaded handlers exist, the interrupt thread is scheduled afterwards.
This type of layout also makes it possible to investigate using interrupt
filters ala OS X where the filter determines whether or not its companion
threaded handler should run.
- Aside from the INTR_FAST changes above, the impact on MD interrupt code
is mostly just 's/ithread/intr_event/'.
- A new MI ddb command 'show intrs' walks the list of interrupt events
dumping their state. It also has a '/v' verbose switch which dumps
info about all of the handlers attached to each event.
- We currently don't destroy an interrupt thread when the last threaded
handler is removed because it would suck for things like ppbus(8)'s
braindead behavior. The code is present, though, it is just under
#if 0 for now.
- Move the code to actually execute the threaded handlers for an interrrupt
event into a separate function so that ithread_loop() becomes more
readable. Previously this code was all in the middle of ithread_loop()
and indented halfway across the screen.
- Made struct intr_thread private to kern_intr.c and replaced td_ithd
with a thread private flag TDP_ITHREAD.
- In statclock, check curthread against idlethread directly rather than
curthread's proc against idlethread's proc. (Not really related to intr
changes)
Tested on: alpha, amd64, i386, sparc64
Tested on: arm, ia64 (older version of patch by cognet and marcel)
other OSes (Solaris, Linux, VxWorks). It's not necessary to write a 0
to the config address register when using config mechanism 1 to turn
off config access. In fact, it can be downright troublesome, since it
seems to confuse the PCI-PCI bridge in the AMD8111 chipset and cause
it to sporadically botch reads from some devices. This is the cause
of the missing USP ports problem I was experiencing with my Sun Opteron
system.
Also correct the case for mechanism 2: it's only necessary to write
a 0 to the ENABLE port.
IPI_STOP IPIs.
- Change the i386 and amd64 MD IPI code to send an NMI if STOP_NMI is
enabled if an attempt is made to send an IPI_STOP IPI. If the kernel
option is enabled, there is also a sysctl to change the behavior at
runtime (debug.stop_cpus_with_nmi which defaults to enabled). This
includes removing stop_cpus_nmi() and making ipi_nmi_selected() a
private function for i386 and amd64.
- Fix ipi_all(), ipi_all_but_self(), and ipi_self() on i386 and amd64 to
properly handle bitmapped IPIs as well as IPI_STOP IPIs when STOP_NMI is
enabled.
- Fix ipi_nmi_handler() to execute the restart function on the first CPU
that is restarted making use of atomic_readandclear() rather than
assuming that the BSP is always included in the set of restarted CPUs.
Also, the NMI handler didn't clear the function pointer meaning that
subsequent stop and restarts could execute the function again.
- Define a new macro HAVE_STOPPEDPCBS on i386 and amd64 to control the use
of stoppedpcbs[] and always enable it for i386 and amd64 instead of
being dependent on KDB_STOP_NMI. It works fine in both the NMI and
non-NMI cases.
get a new pv under high system load where the available pv entries
have been exhausted before the pagedaemon has a chance to wake up
to reclaim some.
Prior to this, the NULL pointer dereference ended up causing
secondary panics with rather less than useful resulting tracebacks.
Reviewed by: alc, jhb
MFC after: 1 week
and fs.base. We always update pcb.pcb_gsbase and pcb.pcb_fsbase
when user wants to set them, in context switch routine, we only need to
write them into registers, we never have to read them out from registers
when thread is switched away. Since rdmsr is a serialization instruction,
micro benchmark shows it is worthy to do.
Reviewed by: peter, jhb
- Add newer CPUID definitions for future use.
Many thanks to Mike Tancsa <mike at sentex dot net> for providing test
cases for Intel Pentium D and AMD Athlon 64 X2.
Approved by: anholt (mentor)
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.
2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.
3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.
4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.
5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.
6. In this sigqueue implementation, pending signal set is kept as before,
an extra siginfo list holds additional siginfo_t data for signals.
kernel code uses psignal() still behavior as before, it won't be failed
even under memory pressure, only exception is when deleting a signal,
we should call sigqueue_delete to remove signal from sigqueue but
not SIGDELSET. Current there is no kernel code will deliver a signal
with additional data, so kernel should be as stable as before,
a ksiginfo can carry more information, for example, allow signal to
be delivered but throw away siginfo data if memory is not enough.
SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to target
process, if resource is unavailable, EAGAIN will be returned as
specification said.
Just before thread exits, signal queue memory will be freed by
sigqueue_flush.
Current, all signals are allowed to be queued, not only realtime signals.
Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
o Axe poll in trap.
o Axe IFF_POLLING flag from if_flags.
o Rework revision 1.21 (Giant removal), in such a way that
poll_mtx is not dropped during call to polling handler.
This fixes problem with idle polling.
o Make registration and deregistration from polling in a
functional way, insted of next tick/interrupt.
o Obsolete kern.polling.enable. Polling is turned on/off
with ifconfig.
Detailed kern_poll.c changes:
- Remove polling handler flags, introduced in 1.21. The are not
needed now.
- Forget and do not check if_flags, if_capenable and if_drv_flags.
- Call all registered polling handlers unconditionally.
- Do not drop poll_mtx, when entering polling handlers.
- In ether_poll() NET_LOCK_GIANT prior to locking poll_mtx.
- In netisr_poll() axe the block, where polling code asks drivers
to unregister.
- In netisr_poll() and ether_poll() do polling always, if any
handlers are present.
- In ether_poll_[de]register() remove a lot of error hiding code. Assert
that arguments are correct, instead.
- In ether_poll_[de]register() use standard return values in case of
error or success.
- Introduce poll_switch() that is a sysctl handler for kern.polling.enable.
poll_switch() goes through interface list and enabled/disables polling.
A message that kern.polling.enable is deprecated is printed.
Detailed driver changes:
- On attach driver announces IFCAP_POLLING in if_capabilities, but
not in if_capenable.
- On detach driver calls ether_poll_deregister() if polling is enabled.
- In polling handler driver obtains its lock and checks IFF_DRV_RUNNING
flag. If there is no, then unlocks and returns.
- In ioctl handler driver checks for IFCAP_POLLING flag requested to
be set or cleared. Driver first calls ether_poll_[de]register(), then
obtains driver lock and [dis/en]ables interrupts.
- In interrupt handler driver checks IFCAP_POLLING flag in if_capenable.
If present, then returns.This is important to protect from spurious
interrupts.
Reviewed by: ru, sam, jhb
osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60,
svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81,
svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55,
svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10,
ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58,
unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133:
Now that Giant is acquired in uprintf() and tprintf(), the caller no
longer leads to acquire Giant unless it also holds another mutex that
would generate a lock order reversal when calling into these functions.
Specifically not backed out is the acquisition of Giant in nfs_socket.c
and rpcclnt.c, where local mutexes are held and would otherwise violate
the lock order with Giant.
This aligns this code more with the eventual locking of ttys.
Suggested by: bde
kernel modules. We actually need to include any addends and the symbol
offset value, but for gcc/binutils didn't set it anywhere I've found on
'cc -fpic -shared' kernel modules.
available and can give the wrong impression when there are memory holes.
Report the total amount of usable memory that we detected instead of the
highest address.
variable and returns the previous value of the variable.
Tested on: i386, alpha, sparc64, arm (cognet)
Reviewed by: arch@
Submitted by: cognet (arm)
MFC after: 1 week
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).
Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions. In most cases, this means adding an
acquisition of Giant immediately around the function. In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.
With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.
NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.
NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset. This needs to be fixed, but was a problem before
this change.
NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having
non-MPSAFE tty code.
MFC after: 1 week
This kernel config briefly describes some of the major MAC policies
available on FreeBSD. The hope is that this will raise the awareness
about MAC and get more people interested.
Discussed with: scottl
constraint is actually only allowed for register operands. Instead, use
separate input and output memory constraints.
Education from: alc
Reviewed by: alc
Tested on: i386, alpha
MFC after: 1 week
eliminate TLB invalidations when permissions are relaxed, such as when a
read-only mapping is changed to a read/write mapping. Additionally,
eliminate TLB invalidations when bits that are ignored by the hardware,
such as PG_W ("wired mapping"), are changed.
Reviewed by: tegge
On entry or exit from the kernel the 'alltraps' and 'doreti' code
used taken by normal traps disables interrupts to protect the
critical sections where it is setting up %gs.
This protection is insufficient in the presence of NMIs since NMIs
can be taken even when the processor has disabled normal interrupts.
Thus the NMI handler needs to actually read MSR_GBASE on entry to
the kernel to determine whether a swap of %gs using 'swapgs' is
needed. However, reads of MSRs are expensive and integrating this
check into the 'alltraps'/'doreti' path would penalize normal
interrupts.
- Teach DDB about the 'nmi_calltrap' symbol.
Reviewed by: bde, peter (older versions of this change)
1. The amd64 pmap, unlike the i386 pmap, maintains a reference count
for each page directory (PD) page. However, in the transformation
of the i386 pmap into the amd64 pmap, operations, such as
pmap_copy() and pmap_object_init_pt(), that create 2MB "superpage"
mappings by setting the PG_PS bit in a PD entry were not modified
to adjust the underlying PD page's reference count. Consequently,
superpage mappings could disappear prematurely.
2. pmap_object_init_pt() could crash or corrupt memory if either the
virtual address range being mapped crosses a 1GB boundary in the
virtual address space or nothing is mapped in the 1GB area.
3. When pmap_allocpte() destroys a 2MB "superpage" mapping it does not
reduce the pmap's resident count accordingly. It should. (This
bug is inherited from i386.)
Discussed with: peter
Reviewed by: tegge
than ~PDRMASK to extract the physical address of a superpage from a PDE.
The use of ~PDRMASK is problematic if the PDE has PG_NX set. Specifically,
the PG_NX bit will be included in the physical address if ~PDRMASK is used.
Reviewed by: peter
it to __MINSIGSTKSZ. Define MINSIGSTKSZ in <sys/signal.h>.
This is done in order to use MINSIGSTKSZ for the macro PTHREAD_STACK_MIN
in <pthread.h> (soon <limits.h>) without having to include the whole
<sys/signal.h> header.
Discussed with: bde
made in pmap_protect(): The pmap's resident count should not be reduced
unless mappings are removed.
The errant change to the pmap's resident count could result in a later
pmap_remove() failing to remove any mappings if the errant change has set
the pmap's resident count to zero.
chips are commonly found, it makes sense to have it in GENERIC. This
is a candidate for a RELENG_6 MFC.
Approved by; peter
Requested by: pav
Tested by: pav
and return a printable representation.
This fixes recognition of the PC Engines WRAP and improves the
recognition of the Soekris boards (Bios version can now be
seen in the dmesg output for instance).
Also, add watchdog support for PCM-582x platforms.
Submitted by: Adrian Steinmann <ast@marabu.ch>
Slightly changed by: phk
PR: 81360