Make part of John Birrell's KSE patch permanent.
Specifically, remove:
Any reference to the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.
Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.
The ULE scheduler compiles again but I have no idea if it works.
The 4bsd scheduler still requires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.
Tested by David Xu, and Dan Eischen using libthr and libpthread.
and by only delaying when an RTC register is written to. The delay
after writing to the data register is now not just a workaround.
This reduces the number of ISA accesses in the usual case from 4 to
1. The usual case is 2 rtcin()'s for each RTC interrupt. The index
register is almost always RTC_INTR for this. The 3 extra ISA accesses
were 1 for writing the index and 2 for delays. Some delays are needed
in theory, but in practice they now just slow down slow accesses some
more since almost everyone including us does them wrong, so modern systems
enforce sufficient delays in hardware. I used to have the delays ifdefed
out, but with the index register optimization the delays are rarely
executed so the old magic ones can be kept or even implemented non-
magically without significant cost.
Optimizing RTC interrupt handling is more interesting than it used to
be because RTC interrupts are currently needed to fix the more efficient
apic timer interrupts on some systems. apic_timer_hz is normally 2000
so the RTC interrupt rate needs to be 2048 to keep the apic timer
firing on such systems. Without these changes, each RTC interrupt
normally took 10 ISA accesses (2 PIC accesses and 2 sets of 4 RTC
accesses).  Each ISA access takes 1-1.5us, so 10 of them at 2048 Hz
takes 2-3% of a CPU. Now 4 of them take 0.8-1.2% of a CPU.
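For illustration, a minimal sketch of the index-register caching idea,
assuming hypothetical names (this is not the actual sys/isa/clock.c code):

static int rtc_index = -1;	/* last value written to the index port */

static u_char
rtcin_sketch(int reg)
{

	if (rtc_index != reg) {
		outb(IO_RTC, reg);	/* select the register ... */
		inb(0x84);		/* ... and delay only after a write */
		rtc_index = reg;
	}
	return (inb(IO_RTC + 1));	/* read the data register */
}

In the usual case the index already holds RTC_INTR, so the whole call is
the single data-register read mentioned above.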
by default for sun4v where it is absolutely required.
This change moves the buffer from struct pcpu to the stack to avoid
using the critical section which created a LOR in a couple of cases
due to interaction with the tty code and kqueue. The LOR can't be
fixed with the critical section and the pcpu buffer can't be used
without the critical section.
Putting the buffer on the stack was my initial solution, but it was
pointed out that the stress on the stack might cause problems
depending on the call path. We don't have a way of creating tests
for those possible cases, so it's best to leave this as an option
for the time being. In time we may get enough data to enable this
option more generally.
of various scattered magic values.
- Pretty print the address of hardware watchpoints in 'show watch' rather
than just displaying hex.
- Expand address field width on amd64 for 64-bit pointers.
- Drop the printf in intr_machdep.c when we assign an interrupt source to
a CPU. Each source already has a more detailed printf.
- Don't output a line for each ioapic pin showing its initial state, this
has outlived its usefulness.
- When an APIC enumerator sets the bus, polarity, or trigger mode of an
ioapic pin, just return success without printing anything if the new
value matches the current one.
MFC after: 2 weeks
- Add a new apic_alloc_vectors() method to the local APIC support code
to allocate N contiguous IDT vectors (aligned on a M >= N boundary).
This function is used to allocate IDT vectors for a group of MSI
messages.
- Add MSI and MSI-X PICs. The PIC code here provides methods to manage
edge-triggered MSI messages as x86 interrupt sources. In addition to
the PIC methods, msi.c also includes methods to allocate and release
MSI and MSI-X messages. For x86, we allow for up to 128 different
MSI IRQs starting at IRQ 256 (IRQs 0-15 are reserved for ISA IRQs,
16-254 for APIC PCI IRQs, and IRQ 255 is reserved).
- Add pcib_(alloc|release)_msi[x]() methods to the MD x86 PCI bridge
drivers to bubble the request up to the nexus driver.
- Add pcib_(alloc|release)_msi[x]() methods to the x86 nexus drivers that
ask the MSI PIC code to allocate resources and IDT vectors.
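As an illustration only, the kind of search apic_alloc_vectors(count, align)
could perform might look like the sketch below; the ioint_irqs[] bookkeeping
and APIC_* constants are used here as assumptions, not verbatim code:

static u_int
alloc_vectors_sketch(u_int count, u_int align)
{
	u_int first, run, vector;

	first = run = 0;
	for (vector = 0; vector < APIC_NUM_IOINTS; vector++) {
		if (ioint_irqs[vector] != 0) {		/* vector already in use */
			run = 0;
			continue;
		}
		if (run == 0) {
			if ((APIC_IO_INTS + vector) % align != 0)
				continue;		/* a run must start aligned */
			first = vector;
		}
		if (++run == count)
			return (APIC_IO_INTS + first);
	}
	return (0);	/* no suitable run of contiguous vectors */
}

A group of N MSI messages would then ask for count = N with the alignment
rounded up to the next power of two.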
MFC after: 2 months
of NKPT is no longer enough to run amd64 with 16G of RAM, as it
doesn't have space for mapping a kernel (16M kernel would require
additionally 8 page tables).
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
it as a default.
For the record, the KDTRACE option caused _no_ additional source files
to be compiled in; certainly no CDDL source files. All it did was to
allow existing BSD licensed kernel files to include one or more CDDL
header files.
By removing this from DEFAULTS, the onus is on a kernel builder to add
the option to the kernel config, possibly by including GENERIC and
customising from there. It means that DTrace won't be a feature
available in FreeBSD by default, which is the way I intended it to be.
Without this option, you can't load the dtrace module (which contains
the dtrace device and the DTrace framework). This is equivalent to
requiring an option in a kernel config before you can load the linux
emulation module, for example.
I think it is a mistake to have DTrace ported to FreeBSD, but not
to have it available to everyone, all the time. The only exception
to this is the companies which distribute systems with FreeBSD embedded.
Those companies will customise their systems anyway. The KDTRACE
option was intended for them, and only them.
adds the hooks that DTrace modules register with, and adds a few functions
which have the dtrace_ prefix to allow the DTrace FBT (function boundary
trace) provider to avoid tracing because they are called from the DTrace
probe context.
Unlike other forms of tracing and debug, DTrace support in the kernel
incurs negligible run-time cost.
I think the only reason why anyone wouldn't want to have kernel support
enabled for DTrace would be due to the license (CDDL) under which DTrace
is released.
a lock to prevent interspersed strings written from different CPUs
at the same time.
To avoid putting a buffer on the stack or having to malloc one,
space is incorporated in the per-cpu structure. The buffer
size is 128 bytes, chosen because it's the next power of 2 size
up from 80 characters.
String writes to the console are buffered up to the end of the line
or until the buffer fills. Then the buffer is flushed to all
console devices.
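A rough sketch of that buffering, with hypothetical names (cnputs_flush()
in particular does not exist; it stands in for the flush-to-all-consoles
step):

#define	CONS_BUF_SIZE	128	/* next power of 2 above an 80-column line */

struct cons_buf {
	char	cb_buf[CONS_BUF_SIZE];
	int	cb_len;
};

static void
cons_putc_buffered(struct cons_buf *cb, int c)
{

	cb->cb_buf[cb->cb_len++] = c;
	if (c == '\n' || cb->cb_len == CONS_BUF_SIZE) {
		cnputs_flush(cb->cb_buf, cb->cb_len);	/* write to all consoles */
		cb->cb_len = 0;
	}
}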
Existing low level console output via cnputc() is unaffected by
this change. ithread calls to log() are also unaffected to avoid
blocking those threads.
A minor change to the behaviour in a panic situation is that
console output will still be buffered, but won't be written to
a tty as before. This should prevent interspersed panic output
as a number of CPUs panic before we end up single threaded
running ddb.
Reviewed by: scottl, jhb
MFC after: 2 weeks
dynamic nature (if no native aio code is available, the linux part
returns ENOSYS because of missing requisites) should be solved differently
than it is.
All this will be done in P4.
Not included in this commit is a backout of the changes to the native aio
code (removing static in some places). Those changes (and some more) will
also be needed when the reworked linux aio stuff reenters the tree.
Requested by: rwatson
Discussed with: rwatson
not completely decided at config time. Just don't default to using
the TSC if there are multiple active CPUs. Also, don't default to
using the TSC if it is broken. SMP ifdefs are still used to disallow
using perfmon since perfmon is always broken if SMP is just configured.
This only helps much for SMP kernels running on 1 CPU. The overheads
for using the i8254 cputime clock were a bit too high on 486/33's, and
now on multi-GHz CPUs they are usually in the 99-99.9% range. Switching
from the old default of an i8254 clock to the TSC works poorly because
the overheads are not recalibrated.
Use the same condition for declaring perfmon stuff as for using it.
Fixed a syntax error for the (!__KERNEL && !__GNUCLIKE_ASM) case in
rev.1.36. Apparently, this case has never been reached even by lint.
Submitted by: stefanf
{amd64,i386}/include/profile.h:
In case the above case is actually reached, break it properly by
providing null support that will fail at link time instead of a stub
that gives wrong (null) profiling at runtime.
was only used in the GUPROF case, so the messes to get its i386 prerequisites
included shouldn't have been needed.
Fixed some style bugs. Quote #error contents, and don't repeat an #error
directive on amd64.
this used to be slightly cleaner than using ifdefs in a few places to
support both a.out and elf, but using it now just causes messes and
unportabilities. It seems to be impossible to implement the elf
HIDENAME() portably in cpp (since token pasting of "." and <name> is
invalid).
*/prof_machdep.c:
- Removed all uses of CNAME(). CNAME() is easy enough to use in pure
asm code, but using it in inline asm requires messy quoting. The
core pure asm code has been hacked on more and all uses of CNAME() in
it have already gone away. Just assume the elf convention here too.
- Removed now-unneeded include of <machine/asmacros.h>.
- Removed the workaround for a namespace conflict with this include.
profiling is configured but high resolution profiling is not configured.
Only functions in *.[Ss] called the stub, so efficiency was not
significantly affected.
The 'nooption' kernel config entry has to be used to turn KSE off now.
This isn't my preferred way of dealing with this, but I'll defer to
scottl's experience with the io/mem kernel option change and the grief
experienced over that.
Submitted by: scottl@
except sun4v.
This change makes the transition from a default to an option more
transparent and is an attempt to head off all the complaints that are
likely from people who don't read UPDATING, based on experience with
the io/mem change.
Submitted by: scottl@
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
be displayed specially, and debug registers are among the least
interesting special registers (far behind %cr3). The debug registers
are still accessible as variables and displayed in another bogus place
("show watches").
Implement the linux_io_* syscalls (AIO). They are only enabled if the native
AIO code is available (either compiled in to the kernel or as a module) at
the time the functions are used. If the AIO stuff is not available there
will be an ENOSYS.
From the submitter:
---snip---
DESIGN NOTES:
1. Linux permits a process to own multiple AIO queues (distinguished by
"context"), but FreeBSD creates only one single AIO queue per process.
My code maintains a request queue (STAILQ of queue(3)) per "context",
and throws all AIO requests of all contexts owned by a process into
the single FreeBSD per-process AIO queue.
When the process calls io_destroy(2), io_getevents(2), io_submit(2) and
io_cancel(2), my code can pick out requests owned by the specified context
from the single FreeBSD per-process AIO queue according to the per-context
request queues maintained by my code.
2. The request queue maintained by my code stores the mapping between
Linux IO control blocks (struct linux_iocb) and FreeBSD IO control blocks
(struct aiocb). The FreeBSD IO control block actually exists in userland memory
space, as required by the native FreeBSD aio_XXXXXX(2) calls.
3. It is quite troubling that the function io_getevents() of libaio-0.3.105
needs to use Linux-specific "struct aio_ring", which is a partial mirror
of context in user space. I would rather take the address of context in
kernel as the context ID, but the io_getevents() of libaio forces me to
take the address of the "ring" in user space as the context ID.
To my surprise, one comment line in the file "io_getevents.c" of
libaio-0.3.105 reads:
Ben will hate me for this
REFERENCE:
1. Linux kernel source code: http://www.kernel.org/pub/linux/kernel/v2.6/
(include/linux/aio_abi.h, fs/aio.c)
2. Linux manual pages: http://www.kernel.org/pub/linux/docs/manpages/
(io_setup(2), io_destroy(2), io_getevents(2), io_submit(2), io_cancel(2))
3. Linux Scalability Effort: http://lse.sourceforge.net/io/aio.html
The design notes: http://lse.sourceforge.net/io/aionotes.txt
4. The package libaio, both source and binary:
http://rpmfind.net/linux/rpm2html/search.php?query=libaio
Simple transparent interface to Linux AIO system calls.
5. Libaio-oracle: http://oss.oracle.com/projects/libaio-oracle/
POSIX AIO implementation based on Linux AIO system calls (depending on
libaio).
---snip---
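To make design notes 1 and 2 concrete, the per-context bookkeeping could be
pictured roughly as below; the names are hypothetical, not the submitted code:

#include <sys/queue.h>

struct linux_aio_request {
	struct linux_iocb	*lr_linux_iocb;	/* Linux control block */
	struct aiocb		*lr_aiocb;	/* FreeBSD cb, lives in user memory */
	STAILQ_ENTRY(linux_aio_request) lr_link;
};

struct linux_aio_context {
	unsigned long		lc_id;		/* user-space "ring" address */
	STAILQ_HEAD(, linux_aio_request) lc_requests;
	LIST_ENTRY(linux_aio_context) lc_link;	/* all contexts of a process */
};

io_getevents()/io_cancel() walk lc_requests for the given context and match
the entries against the single per-process FreeBSD AIO queue.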
Submitted by: Li, Xiao <intron@intron.ac>
(PICs) rather than interrupt sources. This allows interrupt controllers
with no interrupt pins (such as the 8259As when APIC is in use) to
participate in suspend/resume.
- Always register the 8259A PICs even if we don't use any of their pins.
- Explicitly reset the 8259As on resume on amd64 if 'device atpic' isn't
included.
- Add a "dummy" PIC for the local APIC on the BSP to reset the local APIC
on resume. This gets suspend/resume working with APIC on UP systems.
SMP still needs more work to bring the APs back to life.
The MFC after is tentative.
Tested by: anholt (i386)
Submitted by: Andrea Bittau <a.bittau at cs.ucl.ac.uk> (3)
MFC after: 1 week
unsuspecting users.
- Add a comment in NOTES about experimental status of SCHED_ULE.
- Make warning about experimental status in sched_ule(4) a bit
stronger.
Suggested and reviewed by: dougb
Discussed on: developers
MFC after: 3 days
Move the relocation definitions to the common elf header so that DTrace
can use them on one architecture targeted to a different one.
Add the additional ELF types defined in Sun's "Linker and Libraries"
manual.
Split subr_clock.c in two parts (by repo-copy):
subr_clock.c contains generic RTC and calendrical code, etc.
subr_rtc.c contains the newbus'ified RTC interface.
Centralize the machdep.{adjkerntz,disable_rtc_set,wall_cmos_clock}
sysctls and associated variables into subr_clock.c. They are
not machine dependent and we have generic code that relies on them being
present, so they are not even optional.
timer interrupt servicing for disabled HTT cores in the ULE case. Should
probably be fixed in the ULE code instead, but we have no real maintainer for
ULE to do it.
PR: 103697
just use the COULD_BOUNCE flag for both and retire the USE_FILTER flag.
This fixes the problem that rev 1.81 introduced with the if_bfe driver
(and possibly others).
- Split out the communication protocols into their own files and use
a couple of function pointers in the softc that the communication
protocols set up in their own attach routines.
- Add support for the SSIF interface (talking to IPMI over SMBus).
- Add an ACPI attachment.
- Add a PCI attachment that attaches to devices with the IPMI interface
subclass.
- Split the ISA attachment out into its own file: ipmi_isa.c.
- Change the code to probe the SMBIOS table for an IPMI entry to just use
pmap_mapbios() to map the table in rather than trying to setup a fake
resource on an isa device and then activating the resource to map in the
table.
- Make bus attachments leaner by adding attach functions for each
communication interface (ipmi_kcs_attach(), ipmi_smic_attach(), etc.)
that setup per-interface data.
- Formalize the model used by the driver to handle requests by adding an
explicit struct ipmi_request object that holds the state of a given
request and reply for the entire lifetime of the request. By bundling
the request into an object, it is easier to add retry logic to the various
communication backends (as well as eventually support BT mode which uses
a slightly different message format than KCS, SMIC, and SSIF).
- Add a per-softc lock and remove D_NEEDGIANT as the driver is now MPSAFE.
- Add 32-bit compatibility ioctl shims so you can use a 32-bit ipmitool
on FreeBSD/amd64.
- Add ipmi(4) to i386 and amd64 NOTES.
Submitted by: ambrisko (large portions of 2 and 3)
Sponsored by: IronPort Systems, Yahoo!
MFC after: 6 days
of directory reading system calls.
Respell a mis-spelled event name.
Clean up white space/line wraps in a couple of places.
Assign event numbers to some new system call entries that have turned
up in the list since audit support was added.
Obtained from: TrustedBSD Project
behaves. This fixes a lot of tests which failed before. For amd64 there
are still some problems, but without any testers who apply patches
and run some predefined tests we can't do more ATM.
Submitted by: Marcin Cieslak <saper@SYSTEM.PL> (minor fixups by myself)
Tested with: LTP
pmap_invalidate_cache() in the SMP case so pmap_mapdev() in multiuser
doesn't panic with a trap 30. I broke this many months ago when I
added pmap_invalidate_cache() as early parts of the PAT work.
Patience from: jmg
Pointy hat: jhb
for overlaps, but more importantly, it collapses adjacent free regions.
This is needed to cope with BIOSen that split up ports for system devices
(like IPMI controllers) across multiple system resource entries.
- Now that rman_manage_region() is not so dumb, remove extra logic in the
x86 nexus drivers to populate the IRQ rman that manually coalesced the
regions.
MFC after: 1 week
old/broken hardware. Unfortunately, it adds cache pressure and possible
mispredicted branches to the fast path of the bus_dmamap_load collection of
functions. Since it's meant for slow path exception processing, de-inline
it and allow its conditions to be pre-computed at tag_create time and thus
short-circuited at runtime.
While here, cut down on the size of _bus_dmamap_load_buffer() by pushing the
bounce page logic into a non-inlined function. Again, this helps with
cache pressure and mispredicted branches.
According to the TSC, this shaves off a few cycles on average. Unfortunately,
the data varies quite a bit due to interrupts and preemption, so it's hard to
get a good measurement. Real world measurements of network PPS are welcomed.
A merge to amd64 and other arches is pending more testing.
and dump_avail[] arrays so they are in sync (previously it was possible
to store more entries in the physmap[] than we could store in phys_avail[],
which was pointless). While I'm here, bump up the length of these tables
to hold 30 entries on amd64 and 16 on i386. This allows machines with
fairly fragmented memory maps to boot ok (at least one machine would
not boot FreeBSD/i386 but would boot FreeBSD/amd64 because amd64 allowed
for more fragments).
MFC after: 3 days
from both the acpi module build directory and a kernel build directory.
The latter didn't work when one attempted to build a kernel which had
"device acpi" with the "make kernel-toolchain buildkernel" command
because a cross-compiler couldn't find anything in the standard system
include path (it's empty in the kernel-toolchain case).
Fix this by passing a better root path to kernel headers (src/sys)
which works for both cases, kernel and module (-I@ only worked for
module).
Also, while here, pass -nostdinc (and a different spelling for icc) --
it's a feature that the kernel source tree is self-contained, and this
change enforces this.
Reported by: glebius
any threads to them. However, it still counts those cores as "active but
permanently idle" when calculating system-wide CPUs statistics. It is
incorrect, since it skews statistics quite a bit and creates real problems
for certain types of applications (monitoring applications for example),
by making them believe that the system does have enough idle CPU resources,
while in fact it does not.
Correct the problem by not calling performance counting routines on "disabled"
cores. The cleaner solution would be to just disable APIC timer interrupts on
those cores completely, but ENOTIME here, and it is not clear if the
additional complexity is really worth the minor performance gain.
Reviewed by: ssouhlal
Sponsored by: Sippy Software, Inc.
MFC after: 2 weeks
other stuff) in the osrelease=2.6.16 case:
- implement CLONE_PARENT semantic
- fix TLS loading in clone CLONE_SETTLS
- lock proc in the currently disabled part of CLONE_THREAD
I suggest not unloading the linux module after testing this; there are
some "<defunct>" processes hanging around after exiting (they aren't
with osrelease=2.4.2) and they may panic your kernel when unloading the
linux module. They are in state Z and some of them consume CPU according
to ps. But I don't trust the CPU part; the idle threads get too much CPU
for this to be possible (accumulating idle, X and 2 defunct processes
results in 104.7%, which looks like too much to be a rounding error).
Noticed by: Intron <mag@intron.ac>
Submitted by: rdivacky (in collaboration with Intron)
Tested by: Intron, netchild
Reviewed by: jhb (previous version)
but further on -current (still not successful, but a step in the right
direction).
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Tested by: Paul Mather <paul@gromit.dlib.vt.edu>
we can do the stuff we need to do with linux processes at fork and
don't panic the kernel at exit of the child.
Submitted by: rdivacky
Tested with: tst-vfork* (glibc regression tests)
Tested by: netchild
- Send the systrace_args files for all the compat ABIs to /dev/null for
now. Right now makesyscalls.sh generates a file with a hardcoded
function name, so it wouldn't work for any of the ABIs anyway. Probably
the function name should be configurable via a 'systracename' variable
and the functions should be stored in a function pointer in the sysvec
structure.
line switch. Other files which may make the same mistake (according to
fxr.watson.org) but aren't fixed in this commit (people with more clue
about those files should fix this):
- i386/xbox/xbox.c
- arm/arm/elf_trampoline.c
- arm/arm/mem.c
Noticed by: cognet
- TLS - complete
- pid/tid mangling - complete
- thread area - complete
- futexes - complete with issues
- clone() extension - complete with some possible minor issues
- mq*/timer*/clock* stuff - complete but untested and the mq* stuff is
disabled when not built as part of the kernel with native FreeBSD mq*
support (module support for this will come later)
Tested with:
- linux-firefox - works, tested
- linux-opera - works, tested
- linux-realplay - doesn't work, issue with futexes
- linux-skype - doesn't work, issue with futexes
- linux-rt2-demo - works, tested
- linux-acroread - doesn't work, unknown reason (coredump) and sometimes
issue with futexes
- various unix utilities in linux-base-gentoo3 and linux-base-fc4:
everything tried worked
On amd64 not everything is supported as it is on i386; catching up is planned for
later when the remaining bugs in the new functions are fixed.
To test this new stuff, you have to run
sysctl compat.linux.osrelease=2.6.16
to switch back use
sysctl compat.linux.osrelease=2.4.2
Don't switch while running a linux program, strange things may or may not
happen.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Some suggestions/help by: jhb, kib, manu@NetBSD.org, netchild
compat.linux.osrelease is changed to "2.6.16" or similar).
On amd64 not everything is supported as it is on i386; catching up is planned for
later when the remaining bugs in the new functions are fixed.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
aren't mapped via pmap_enter() (KVA). We will eventually support PAT bits
on user pages, but those will require some sort of MI caching mode stored
in the vm_page.
Reviewed by: alc
WB (write-back) on x86 via control bits in PTEs and PDEs (including making
use of the PAT MSR). Changes include:
- A new pmap_mapdev_attr() function for amd64 and i386 which takes an
additional parameter (relative to pmap_mapdev()) specifying the cache
mode for this mapping. Note that on amd64 only WB mappings are done with
the direct map, all other modes result in a private mapping.
- pmap_mapdev() on i386 and amd64 now defaults to using UC (uncached)
mappings rather than WB. Previously we relied on the BIOS setting up
MTRR's to enforce memio regions being treated as UC. This might make
hw.cbb_start_memory unnecessary in some cases now for example.
- A new pmap_mapbios()/pmap_unmapbios() API has been added to allow places
that used pmap_mapdev() to map non-device memory (such as ACPI tables)
to do so using WB as before.
- A new pmap_change_attr() function for amd64 and i386 that changes the
caching mode for a range of KVA.
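A hedged usage sketch of the calls listed above (the PAT_* constant names
and exact prototypes are assumptions here, not verbatim from this commit):

static void
map_device_example(vm_paddr_t bar_pa, vm_size_t bar_size,
    vm_paddr_t table_pa, vm_size_t table_size)
{
	void *regs, *tbl;

	/* Device registers: uncached, now also the pmap_mapdev() default. */
	regs = pmap_mapdev_attr(bar_pa, bar_size, PAT_UNCACHEABLE);

	/* Firmware/ACPI tables: ordinary write-back memory, as before. */
	tbl = pmap_mapbios(table_pa, table_size);

	/* Change the caching mode of an already-mapped KVA range. */
	(void)pmap_change_attr((vm_offset_t)regs, bar_size, PAT_WRITE_COMBINING);

	pmap_unmapbios((vm_offset_t)tbl, table_size);
	pmap_unmapdev((vm_offset_t)regs, bar_size);
}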
Reviewed by: alc
Originally, I had adopted sparc64's name, pmap_clear_write(), for the
function that is now pmap_remove_write(). However, this function is more
like pmap_remove_all() than like pmap_clear_modify() or
pmap_clear_reference(), hence, the name change.
The higher-level rationale behind this change is described in
src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm
trying to clean up and fix our support for execute access.
Reviewed by: marcel@ (ia64)
and pc98 MD files. Remove nodevice and nooption lines specific
to sio(4) from ia64, powerpc and sparc64 NOTES. There were no
such lines for arm yet.
sio(4) is usable on less than half the platforms, not counting
a future mips platform. Its presence in MI files is therefore
increasingly becoming a burden.
mark system calls as being MPSAFE:
- Stop conditionally acquiring Giant around system call invocations.
- Remove all of the 'M' prefixes from the master system call files.
- Remove support for the 'M' prefix from the script that generates the
syscall-related files from the master system call files.
- Don't explicitly set SYF_MPSAFE when registering nfssvc.
implementations and adjust some of the checks while I'm here:
- Add a new check to make sure we don't return from a syscall in a critical
section.
- Add a new explicit check before userret() to make sure we don't return
with any locks held. The advantage here is that we can include the
syscall number and name in syscall() whereas that info is not available
in userret().
- Drop the mtx_assert()'s of sched_lock and Giant. They are replaced by
the more general checks just added.
MFC after: 2 weeks
map was obtained from the SMAP. SMAP is trustworthy, and the memory
extending feature is a band-aid for older systems where FreeBSD's methods
of detecting memory were not always trustworthy. This fixes the issue
where using hw.physmem could result in the ACPI tables getting trashed
breaking ACPI.
MFC after: 3 days
Tested on: i386
Giant VFS locking in that function.
- Remove bogus code to handle the case where namei() returns success but a
NULL vnode pointer.
- Note that this code duplicates exec_check_permissions() and annotate
where it differs.
- Hold the vnode lock longer to protect the write to set VV_TEXT in
v_vflag.
- Mark linux_uselib() MPSAFE.
Reviewed by: rwatson
system's machine-dependent and machine-independent layers. Once
pmap_clear_write() is implemented on all of our supported
architectures, I intend to replace all calls to pmap_page_protect() by
calls to pmap_clear_write(). Why? Both the use and implementation of
pmap_page_protect() in our virtual memory system has subtle errors,
specifically, the management of execute permission is broken on some
architectures. The "prot" argument to pmap_page_protect() should
behave differently from the "prot" argument to other pmap functions.
Instead of meaning, "give the specified access rights to all of the
physical page's mappings," it means "don't take away the specified
access rights from all of the physical page's mappings, but do take
away the ones that aren't specified." However, owing to our i386
legacy, i.e., no support for no-execute rights, all but one invocation
of pmap_page_protect() specifies VM_PROT_READ only, when the intent
is, in fact, to remove only write permission. Consequently, a
faithful implementation of pmap_page_protect(), e.g., ia64, would
remove execute permission as well as write permission. On the other
hand, some architectures that support execute permission have
basically ignored whether or not VM_PROT_EXECUTE is passed to
pmap_page_protect(), e.g., amd64 and sparc64. This change represents
the first step in replacing pmap_page_protect() by the less subtle
pmap_clear_write() that is already implemented on amd64, i386, and
sparc64.
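As a purely illustrative example of the change in idiom (not code from this
commit): both calls below have the same effect on architectures without
no-execute support, but only the second says what is actually meant.

static void
revoke_write_access_sketch(vm_page_t m)
{

	pmap_page_protect(m, VM_PROT_READ);	/* old, indirect phrasing */
	pmap_clear_write(m);			/* new, explicit phrasing */
}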
Discussed with: grehan@ and marcel@
pmap_clear_ptes() is already convoluted. This will worsen with the
implementation of superpages. Eliminate it and add pmap_clear_write().
There are no functional changes.
Checked by: md5
pmap_remove_all() before rather than after the pmap is unlocked. At
present, the page queues lock provides sufficient sychronization. In the
future, the page queues lock may not always be held when free_pv_entry() is
called.
Make three simplifications to pmap_ts_referenced():
Eliminate an initialized but otherwise unused variable.
Eliminate an unnecessary test.
Exit the loop in a shorter way.
install custom pager functions didn't actually happen in practice (they
all just used the simple pager and passed in a local quit pointer). So,
just hardcode the simple pager as the only pager and make it set a global
db_pager_quit flag that db commands can check when the user hits 'q' (or a
suitable variant) at the pager prompt. Also, now that it's easy to do so,
enable paging by default for all ddb commands. Any command that wishes to
honor the quit flag can do so by checking db_pager_quit. Note that the
pager can also be effectively disabled by setting $lines to 0.
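For instance, a command that iterates over a list can honor the flag like
this sketch (db_pager_quit is the real flag; everything else here is
illustrative):

static void
db_show_foos_sketch(void)
{
	struct foo *fp;

	LIST_FOREACH(fp, &foo_list, f_link) {
		db_printf("foo %p refs %d\n", fp, fp->f_refs);
		if (db_pager_quit)	/* user answered 'q' at the prompt */
			return;
	}
}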
Other fixes:
- 'show idt' on i386 and pc98 now actually checks the quit flag and
terminates early.
- 'show intr' now actually checks the quit flag and terminates early.
ibcs2_getdents(), ibcs2_read(), ogetdirentries(), svr4_sys_getdents(),
and svr4_sys_getdents64() similar to that in getdirentries().
- Mark ibcs2_getdents(), ibcs2_read(), linux_getdents(), linux_getdents64(),
linux_readdir(), ogetdirentries(), svr4_sys_getdents(), and
svr4_sys_getdents64() MPSAFE.
that the 'data' pointer is already set up to point to a valid KVM buffer
or contains the copied-in data from userland as appropriate (ioctl(2)
still does this). kern_ioctl() takes care of looking up a file pointer,
implementing FIONCLEX and FIOCLEX, and calling fi_ioctl().
- Use kern_ioctl() to implement xenix_rdchk() instead of using the stackgap
and mark xenix_rdchk() MPSAFE.
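Roughly, the stackgap-free version follows this pattern (FIONREAD as the
example request; treat the exact kern_ioctl() prototype and argument struct
as assumptions):

static int
xenix_rdchk_sketch(struct thread *td, struct xenix_rdchk_args *uap)
{
	int data, error;

	error = kern_ioctl(td, uap->fd, FIONREAD, (caddr_t)&data);
	if (error == 0)
		td->td_retval[0] = data ? 1 : 0;
	return (error);
}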
ibcs2_[gs]etgroups() rather than using the stackgap. This also makes
ibcs2_[gs]etgroups() MPSAFE. Also, it cleans up one bit of weirdness in
the old setgroups() where it allocated an entire credential just so it had
a place to copy the group list into. Now setgroups just allocates a
NGROUPS_MAX array on the stack that it copies into and then passes to
kern_setgroups().
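Sketched out (prototypes and field names assumed; the 16-bit ibcs2 gid
conversion the real code needs is omitted), the new pattern is simply:

static int
ibcs2_setgroups_sketch(struct thread *td, struct ibcs2_setgroups_args *uap)
{
	gid_t gp[NGROUPS_MAX];
	int error;

	if (uap->gidsetsize < 0 || uap->gidsetsize > NGROUPS_MAX)
		return (EINVAL);
	error = copyin(uap->gidset, gp, uap->gidsetsize * sizeof(gid_t));
	if (error == 0)
		error = kern_setgroups(td, uap->gidsetsize, gp);
	return (error);
}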
ABI as FreeBSD's poll(2) is ABI compatible. The ibcs2_poll() function
attempted to implement poll(2) using a wrapper around select(2). Besides
being somewhat ugly, it also had at least one bug in that instead of
allocating complete fdset's on the stack via the stackgap it just allocated
pointers to fdsets.
OpenBSD. This driver seems to give a small performance increase, and
should lead to better maintainability in the future.
The nForce Ethernet-specific hack in sys/i386/xbox/xbox.c is still
required, judging from dev/nfe/if_nfe.c. The condition it hacks will
almost certainly only occur on XBOX-es anyway, so it is best left there.
Approved by: imp (mentor)
to a copied-in copy of the 'union semun' and a uioseg to indicate which
memory space the 'buf' pointer of the union points to. This is then used
in linux_semctl() and svr4_sys_semctl() to eliminate use of the stackgap.
- Mark linux_ipc() and svr4_sys_semsys() MPSAFE.
from going away. mount(2) is now MPSAFE.
- Expand the scope of Giant some in unmount(2) to protect the mp structure
(or rather, to handle concurrent unmount races) from going away.
umount(2) is now MPSAFE, as well as linux_umount() and linux_oldumount().
- nmount(2) and linux_mount() were already MPSAFE.
pmap_copy() if the mapping is VM_INHERIT_SHARE. Suppose the mapping
is also wired. vmspace_fork() clears the wiring attributes in the vm
map entry but pmap_copy() copies the PG_W attribute in the PTE. I
don't think this is catastrophic. It blocks pmap_remove_pages() from
destroying the mapping and corrupts the pmap's wiring count.
This revision fixes the problem by changing pmap_copy() to clear the
PG_W attribute.
Reviewed by: tegge@
This driver was ported from OpenBSD by Shigeaki Tagashira
<shigeaki@se.hiroshima-u.ac.jp> and posted at
http://www.se.hiroshima-u.ac.jp/~shigeaki/software/freebsd-nfe.html
It was additionally cleaned up by me.
It is still a work-in-progress and thus is purposefully not in GENERIC.
And it conflicts with nve(4), so only one should be loaded.
in 1999, and there are changes to the sysctl names compared to the PR,
according to that discussion. The description is in sys/conf/NOTES.
Lines in the GENERIC files are added in commented-out form.
I'll attach the test script I've used to the PR.
PR: kern/14584
Submitted by: babkin
VM_ALLOC_NORMAL instead of VM_ALLOC_SYSTEM when try is TRUE. In other
words, when get_pv_entry() is permitted to fail, it no longer tries as
hard to allocate a page.
Change pmap_enter_quick_locked() to fail rather than wait if it is
unable to allocate a page table page. This prevents a race between
pmap_enter_object() and the page daemon. Specifically, an inactive
page that is a successor to the page that was given to
pmap_enter_quick_locked() might become a cache page while
pmap_enter_quick_locked() waits and later pmap_enter_object() maps
the cache page violating the invariant that cache pages are never
mapped. Similarly, change
pmap_enter_quick_locked() to call pmap_try_insert_pv_entry() rather
than pmap_insert_entry(). Generally speaking,
pmap_enter_quick_locked() is used to create speculative mappings. So,
it should not try hard to allocate memory if free memory is scarce.
Add an assertion that the object containing m_start is locked in
pmap_enter_object(). Remove a similar assertion from
pmap_enter_quick_locked() because that function no longer accesses the
containing object.
Remove a stale comment.
Reviewed by: ups@
syscalls. This way there will be a log message printed to the console
(this time for real).
Note: UNIMPL should be used for syscalls we do not implement ever, e.g.
syscalls to load linux kernel modules.
Submitted by: rdivacky
Sponsored by: Google SoC 2006
P4 IDs: 99600, 99602
when we're about to call kdb_trap() because the latter MI
function can disable interrupts by itself now.
Pointed out by: bde
X-MFC remark: depends on kern/subr_kdb.c#1.18
Sponsored by: RiNet (Cronyx Plus LLC)
when bit 22 is set to 1, CPUID with EAX=0 returns a maximum
value in EAX[7..0] of 3; when set to 0 (default), CPUID with EAX=0
returns the number corresponding to the maximum standard function
supported. On my machine, BIOS sets the bit to 1 to make it to be
compatible with old OS, this causes dual-core Pentium-D (two
physical cores) to be identified as hyperthreading (two logical
cores) by function mp_topology().
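For reference, a sketch of detecting and clearing that bit; the MSR number
(IA32_MISC_ENABLE, 0x1a0) and bit 22 come from Intel's documentation, while
the surrounding code is illustrative only:

#define	MSR_IA32_MISC_ENABLE	0x1a0
#define	MISC_ENABLE_LIMIT_CPUID	(1ULL << 22)

static void
unlimit_cpuid_sketch(void)
{
	u_int regs[4];
	uint64_t misc;

	misc = rdmsr(MSR_IA32_MISC_ENABLE);
	if (misc & MISC_ENABLE_LIMIT_CPUID) {
		wrmsr(MSR_IA32_MISC_ENABLE, misc & ~MISC_ENABLE_LIMIT_CPUID);
		do_cpuid(0, regs);	/* now reports the real maximum leaf */
	}
}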
the return address on the stack and only then "dereferences" %pc.
Therefore, in the case of a call to an invalid address, we arrive
to the trap handler with the invalid value in tf_eip. This used
to prevent db_backtrace() from assigning the most recent and interesting
frame on the stack to the right spot in the right function, from
which the invalid call was attempted.
Try to detect and work around that by recovering the return address
from the stack.
The work-around requires the fault address be passed to db_backtrace().
Smuggle it as tf_err.
MFC after: 1 month
Sponsored by: RiNet (Cronyx Plus LLC)
Now GCC likes to stick a "mov %eax, %FOO" instruction before
"addl $BAR, %esp" if the function just called returns an int,
which is a very common case in the kernel.
Sponsored by: RiNet (Cronyx Plus LLC)
an explicit comment that it's needed for the linuxolator. This is not the
case anymore. For all other architectures there was only a "KEEP THIS".
I'm (and other people too) running a COMPAT_43-less kernel since it's not
necessary anymore for the linuxolator. Roman is running such a kernel for a
for longer time. No problems so far. And I doubt other (newer than ia32
or alpha) architectures really depend on it.
This may result in a small performance increase for some workloads.
If the removal of COMPAT_43 results in a not working program, please
recompile it and all dependencies and try again before reporting a
problem.
The only place where COMPAT_43 is needed (as in: does not compile without
it) is in the (outdated/not usable since too old) svr4 code.
Note: this does not remove the COMPAT_43TTY option.
Nagging by: rdivacky
There is a race with the current locking scheme and removing
it should have no measurable performance impact.
This fixes page faults leading to panics in pmap_enter_quick_locked()
on amd64/i386.
Reviewed by: alc,jhb,peter,ps
Update of syscall.master:
o Adding of several new dummy syscalls (268-310)
o Synchronization of amd64 syscall.master with i386 one
o Auditing added to amd64 syscall.master
o Change auditing type for lstat syscall (bugfix). [1]
P4-Changes: 98672, 98674
Noticed by: rwatson [1]
Sponsored by: Google SoC 2006
Submitted by: rdivacky
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.
Briefly, the scheduler has following characteristic:
1. A timesharing process's nice value is seriously respected;
the timeslice and interaction-detection algorithm are based
on the nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike the 4BSD and ULE schedulers, which use a fuzzy RQ_PPQ, this scheduler
uses 256 priority queues. Unlike ULE, which uses both pull and push, this
scheduler uses the pull method only; the main reason is to let a relatively
idle cpu do the work, but currently the whole scheduler is protected by the
big sched_lock, so the benefit is not visible. It really can be worse
than nothing because all other cpus are locked out when we are doing
balancing work, a problem the 4BSD scheduler does not have.
The scheduler does not support hyperthreading very well; in fact,
the scheduler does not distinguish between physical CPUs and
logical CPUs. This should be improved in the future. The scheduler has
a priority inversion problem on MP machines; it is not good for
realtime scheduling, as it can cause realtime processes to starve.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, whether on a UP or SMP kernel.
for CBUS-PNP cards there) by default, as no amd64 or sparc64 machines
with ISA slots, which therefore could make use of this code, are known
to exist. For sparc64 this additionally allows getting rid of the
compat shims for in{b,w,l}()/out{b,w,l}() etc and the associated hacks.
OK'ed by: imp, peter
the arm to compile without all the extras that don't appear, at least
not in the flavors of ARM I deal with. This helps us save about 100k.
If I've botched the available devices on a platform, please let me
know and I'll correct ASAP.
not be necessary but might be helpful and at least reduce fragmentation.
* Add an assert to detect if the wakecode ever grows too big. We include
1 KB for stack, which should be more than enough also.
* Remove unnecessary initialization of static variables.
* Add comments and a bootverbose print giving the page phys address.
to 4. There is no need to be more strict at assembly time since we copy
the code anyway to a private page.
* Clear the direction flag and eflags. Probably not necessary but it won't
hurt to be safe.
* Add prefixes to all instructions to prevent any assembler mistakes.
* Remove zeroing of eax - edi. We use those registers immediately after
to transfer values to protected mode so this was pointless.
* Update comments to reflect info found during code review.
* Add hw.acpi.resume_beep tunable and sysctl, default to 0. Beeps the PC
speaker soon after waking to diagnose whether the wakeup code is even
getting run before other drivers possibly hang the system. To stop the beep,
cause another beep (i.e. keyboard bell). Submitted by takawata@; I changed
the frequency to be lower.
* Use 4096 instead of 4 byte alignment. Might be useful although doesn't
seem to be necessary.
* Remove a useless assignment to acpi_reset_video. It was overwritten by
the default sysctl value anyway.
Eliminate unnecessary, recursive acquisitions and releases of the page
queues lock by free_pv_entry() and pmap_remove_pages().
Reduce the scope of the page queues lock in pmap_remove_pages().
that it just warns the user with a printf when it misaligns a piece
of memory that was requested through a busdma tag.
Some drivers (such as mpt, and probably others) were asking for alignments
that could not be satisfied, but as far as driver operation was concerned,
that did not matter. On the theory that other drivers will fall into
this same category, we agreed that panicing or making the allocation
fail will cause more hardship than is necessary. The printf should
be sufficient motivation to get the driver glitch fixed.
POSIX (susv3) requires this, but it is unclear what should be inherited,
duplicating whole 387 stack for new thread seems to be unnecessary and
dangerous. Revert to previous code, force a new thread to be started with
clean FP state.
the high 16 bits are non-zero, the fxrstor instruction will generate a GP
fault, resulting in a kernel crash; this bug can be triggered by setcontext
and ptrace(PT_SETXMMREGS).
other timeouts could not happen while suspending, including timeouts
for things like msleep. This caused the system to hang on suspend
when the cbb was enabled, since its suspend path powered down the
socket which used a timeout to wait for it to be done.
APM now creates a thread when it is enabled, and deletes the thread
when it is disabled. This thread takes the place of the timeout by
doing its polling every ~.9s. When the thread is disabled, it will
wake up early; otherwise it times out and polls the various things the
old timeout polled (APM events, suspend delays, etc).
This makes my Sony VAIO 505TS suspend/resume correctly when APM is
enabled (ACPI is black listed on my 505TS).
This will likely fix other problems with the suspend path where
drivers would sleep with msleep and/or do other timeouts. Maybe
there's some special case code that would use DELAY while suspending
and msleep otherwise that can be revisited and removed.
This was also tested by glebius@, who pointed out that in the patch I
sent him, I'd forgotten apm_saver.c
MFC after: 3 weeks
lnc(4) on PC98 and i386. The ISA front-end supports the same non-PNP
network cards as lnc(4) did and additionally a couple of PNP ones.
Like lnc(4), the C-bus front-end of le(4) only supports C-NET(98)S
and is untested due to lack of such hardware, but given that's it's
based on the respective lnc(4) and not too different from the ISA
front-end it should be highly likely to work.
- Remove the descriptions of le(4), which were converted from lnc(4),
from sys/i386/conf/NOTES and sys/pc98/conf/NOTES as there's a common
one in sys/conf/NOTES.
entry to the PCI NICs section so it's in the same spot in all GENERIC
config files.
- Add a note to the description of pcn(4) informing that it has precedence
over le(4).
Remove an unnecessary check of the table's bus clock. CPUs that
support this feature export only the high/low settings via the MSR,
packed into 32 bits.
Hardware from: Centaur Technologies
MFC after: 1 week
Add back in a scheme to emulate old type major/minor numbers via hooks into
stat, linprocfs to return major/minors that Linux apps expect. Currently
only /dev/null is always registered. Drivers can register via the Linux
type shim similar to the ioctl shim but by using
linux_device_register_handler/linux_device_unregister_handler functions.
The structure is:
struct linux_device_handler {
char *bsd_driver_name;
char *linux_driver_name;
char *bsd_device_name;
char *linux_device_name;
int linux_major;
int linux_minor;
int linux_char_device;
};
Linprocfs uses this to display the major number of the driver. The
soon to be available linsysfs will use it to fill in the driver name.
Linux_stat uses it to translate the major/minor into Linux type values.
Note major numbers are dynamically assigned via passing in a -1 for
the major number so we don't need to keep track of them.
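A hedged example of registering the always-present /dev/null entry using
that structure (the major/minor values are Linux's for /dev/null; the exact
registration call is assumed):

static struct linux_device_handler null_handler = {
	.bsd_driver_name	= "mem",
	.linux_driver_name	= "mem",
	.bsd_device_name	= "null",
	.linux_device_name	= "null",
	.linux_major		= 1,	/* pass -1 for a dynamically assigned major */
	.linux_minor		= 3,
	.linux_char_device	= 1,
};

static int
register_null_sketch(void)
{

	return (linux_device_register_handler(&null_handler));
}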
This is somewhat needed due to us switching to our devfs. MegaCli
will not run until I add in the linsysfs and mfi Linux compat changes.
Sponsored by: IronPort Systems
conformance with the mbuf and uio load routines. ENOMEM can only happen
when BUS_DMA_NOWAIT is passed in, thus the deferrals are disabled. I don't
like doing this, but fixing this fixes assumptions in other important drivers,
which is a net benefit for now.
to the unused kva in the pv memory block to thread a freelist through.
This allows us to free pages that used to be used for pv entry chunks
since we can now track holes in the kva memory block.
Idea from: ups
o Properly use rman(9) to manage resources. This eliminates the
need for puc-specific hacks to rman. It also allows devinfo(8)
to be used to find out the specific assignment of resources to
serial/parallel ports.
o Compress the PCI device "database" by optimizing for the common
case and by using a procedural interface to handle the exceptions.
The procedural interface also generalizes the need to setup the
hardware (program chipsets, program clock frequencies).
o Eliminate the need for PUC_FASTINTR. Serdev devices are fast by
default and non-serdev devices are handled by the bus.
o Use the serdev I/F to collect interrupt status and to handle
interrupts across ports in priority order.
o Sync the PCI device configuration to include devices found in
NetBSD and not yet merged to FreeBSD.
o Add support for Quatech 2, 4 and 8 port UARTs.
o Add support for a couple dozen Timedia serial cards as found
in Linux.
Remove the code to dynamically change the pv_entry limits. Go back
to a single fixed kva reservation for pv entries, like was done
before when using the uma zone. Go back to never freeing pages
back to the free pool after they are no longer used, just like
before.
This stops the lock order reversal due to acquiring the kernel map
lock while pmap was locked.
This fixes the recursive panic if invariants are enabled.
The problem was that allocating/freeing kva causes vm_map_entry
nodes to be allocated/freed. That can recurse back into pmap as
new pages are hooked up to kvm, hence all the problems.
Allocating/freeing kva indirectly allocates/frees memory.
So, by going back to a single fixed size kva block and an index,
we avoid the recursion panics and the LOR.
The problem is that now with a linear block of kva, we have no
mechanism to track holes once pages are freed. UMA has the same
problem when using custom object for a zone and a fixed reservation
of kva. Simple solutions like having a bitmap would work, but would
be very inefficient when there are hundreds of thousands of bits
in the map. A first-free pointer is similarly flawed because pages
can be freed at random and the first-free pointer would be rewinding
huge amounts. If we could allocate memory for tree structures or
an external freelist, that would work. Except we cannot allocate/free
memory here because we cannot allocate/free address space to use
it in. Anyway, my change here reverts back to the UMA behavior of
not freeing pages for now, thereby avoiding holes in the map.
ups@ had a truly evil idea that I'll investigate. It should allow
freeing unused pages again by giving us a no-cost way to track the
holes in the kva block. But in the meantime, this should get people
booting with witness and/or invariants again.
Footnote: amd64 doesn't have this problem because of the direct map
access method. I'd done all my witness/invariants testing there. I'd
never considered that the harmless-looking kmem_alloc/kmem_free calls
would cause such a problem and it didn't show up on the boot test.
entry (PTE) have the same meaning. The exception to this rule is the
eighth bit (0x080). It is the PS bit in a PDE and the PAT bit in a
PTE. This change avoids the possibility that pmap_enter() confuses a
PAT bit with a PS bit, avoiding a panic().
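Made concrete (bit values are from the processor manuals; the macro names
follow FreeBSD's pmap headers but are shown here only for illustration):

#define	PG_PS		0x080	/* in a PDE: entry maps a 2M/4M superpage */
#define	PG_PTE_PAT	0x080	/* in a PTE: the PAT attribute bit */
#define	PG_PDE_PAT	0x1000	/* in a superpage PDE the PAT bit moves here */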
Eliminate a diagnostic printf() from the i386 pmap_enter() that serves
no current purpose, i.e., I've seen no bug reports in the last two
years that are helped by this printf().
Reviewed by: jhb
(i.e. no keyboard controller present), try two other common methods for
resetting an i386 machine - pci reset and port 0x92 fast reset. Only if neither
works, warn the user and resort to the "unmap entire address space and hope for good"
hack. This makes my MacBook Pro reboot just fine and should also help
other legacy-free hardware out there.
Also, disable interrupts unconditionally in cpu_reset_real(), since we don't
want any interference.
MFC after: 1 week
per page = effectively 12.19 bytes per pv entry after overheads).
Instead of using a shared UMA zone for 24 byte pv entries (two 8-byte tailq
nodes, a 4 byte pointer, and a 4 byte address), we allocate a page at a
time per process. This provides 336 pv entries per process (actually, per
pmap address space) and eliminates one of the 8-byte tailq entries since
we now can track per-process pv entries implicitly. The pointer to
the pmap can be eliminated by doing address arithmetic to find the metadata
on the page header, yielding a single pointer shared by all 336 entries.
There is an 11-int bitmap for the freelist of those 336 entries.
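A sketch of the resulting page-sized chunk (field names illustrative;
amd64's existing code is the model): one 4K page holds a small header plus
336 12-byte pv entries, with the 11-word bitmap acting as the freelist.

#define	_NPCM	11
#define	_NPCPV	336

struct pv_entry {
	vm_offset_t	pv_va;			/* 4-byte address */
	TAILQ_ENTRY(pv_entry) pv_list;		/* the single remaining tailq node */
};

struct pv_chunk {
	pmap_t		pc_pmap;		/* shared by all 336 entries */
	TAILQ_ENTRY(pv_chunk) pc_list;
	uint32_t	pc_map[_NPCM];		/* freelist bitmap */
	struct pv_entry	pc_pventry[_NPCPV];
};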
This is mostly a mechanical conversion from amd64, except:
* i386 has to allocate kvm and map the pages, amd64 has them outside of kvm
* native word size is smaller, so bitmaps etc become 32 bit instead of 64
* no dump_add_page() etc stuff because they are in kvm always.
* various pmap internals tweaks because pmap uses direct map on amd64 but
on i386 it has to use sched_pin and temporary mappings.
Also, sysctl vm.pmap.pv_entry_max and vm.pmap.shpgperproc are now
dynamic sysctls. Like on amd64, i386 can now tune the pv entry limits
without a recompile or reboot.
This is important because of the following scenario. If you have a 1GB
file (262144 pages) mmap()ed into 50 processes, that requires 13 million
pv entries. At 24 bytes per pv entry, that is 314MB of ram and kvm, while
at 12 bytes it is 157MB. A 157MB saving is significant.
Test-run by: scottl (Thanks!)
caches are dangerous" to "a shared L1 data cache is dangerous". This
is a compromise between paranoia and performance: Unlike the L1 cache,
nobody has publicly demonstrated a cryptographic side channel which
exploits the L2 cache -- this is harder due to the larger size, lower
bandwidth, and greater associativity -- and prohibiting shared L2
caches turns Intel Core Duo processors into Intel Core Solo processors.
As before, the 'machdep.hyperthreading_allowed' sysctl will allow even
the L1 data cache to be shared.
Discussed with: jhb, scottl
Security: See FreeBSD-SA-05:09.htt for background material.
Major differences:
* since there is no direct map region, there is no custom uma memory
allocator to modify to include its pages in the dumps.
* Various data entries are reduced from 64 bit to 32 bit to match the
native size.
dump_add_page() and dump_drop_page() are still present in case one wants to
arrange for arbitrary pages to be dumped. This is of marginal use though
because libkvm+kgdb cannot address physical memory that isn't mapped into
kvm.
create managed mappings within the clean submap. To prevent regressions,
add assertions blocking the creation of managed mappings within the clean
submap.
Reviewed by: tegge
earlier in cpu_setregs().
- If we know this CPU has a FPU via cpuid, then just assume the INT16
interface and make the npx device quiet to not clutter the dmesg. This
is true for all Pentium and later CPUs and even some of the later 486dx
CPUs.
Reviewed by: bde
Tested by: ps
MFC after: 1 week
so that we only have to do an ioapic_write() instead of an ioapic_read()
followed by an ioapic_write() every time we mask and unmask level triggered
interrupts. This cuts the execution time for these operations roughly in
half.
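Schematically, masking a pin now looks like this sketch (field and function
names are illustrative): the low redirection word is kept in the pin's
softc and only written back, never re-read.

static void
ioapic_mask_pin_sketch(struct ioapic *io, int pin)
{
	struct ioapic_intsrc *intpin = &io->io_pins[pin];

	intpin->io_lowreg |= IOART_INTMSET;	/* cached low word */
	ioapic_write(io->io_addr, IOAPIC_REDTBL_LO(pin), intpin->io_lowreg);
}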
Profiled by: Paolo Pisati <p.pisati@oltrelinux.com>
MFC after: 1 week
PCB in which the context of stopped CPUs is stored. To access this
PCB from KDB, we introduce a new define, called KDB_STOPPEDPCB. The
definition, when present, lives in <machine/kdb.h> and abstracts
where MD code saves the context. Define KDB_STOPPEDPCB on i386,
amd64, alpha and sparc64 in accordance to previous code.
a pv entry if the number of entries is below the high water mark for pv
entries.
Use pmap_try_insert_pv_entry() in pmap_copy() instead of
pmap_insert_entry(). This avoids possible recursion on a pmap lock in
get_pv_entry().
Eliminate the explicit low-memory checks in pmap_copy(). The check that
the number of pv entries was below the high water mark was largely
ineffective because it was located in the outer loop rather than the
inner loop where pv entries were allocated. Instead of checking, we
attempt the allocation and handle the failure.
Reviewed by: tegge
Reported by: kris
MFC after: 5 days
back to using the RSDT instead. ACPI-CA already follows this same strategy
as a workaround for yet another instance of brain-damaged BIOS writers.
PR: i386/93963
Submitted by: Masayuki FUKUI <fukui.FreeBSD@fanet.net>
Specifically, on mappings with PG_G set pmap_remove() not only performs
the necessary per-page invlpg invalidations but also performs an
unnecessary invalidation of the entire set of non-PG_G entries.
Reviewed by: tegge
Previously, we tried to allow this only for root. However, we were calling
suser() on the *target* process rather than the current process. This
means that if you can ptrace() a process running as root you can set a
hardware watch point in the kernel. In practice I think you probably have
to be root in order to pass the p_candebug() checks in ptrace() to attach
to a process running as root anyway. Rather than fix the suser(), I just
axed the entire idea, as I can't think of any good reason _at all_ for
userland to set hardware watch points for KVM.
MFC after: 3 days
Also thinks hardware watch points on KVM from userland are bad: bde, rwatson
In at least one benchmark this showed around a 20% performance increase.
If other workloads do benefit from having hyperthreads service interrupts,
we can always make this a loader tunable.
MFC after: 3 days
Tested by: ps
into a separate module. Accordingly, convert the option into a device
named similarly.
Note for MFC: Perhaps the option should stay in RELENG_6 for POLA reasons.
Suggested by: scottl
Reviewed by: cokane
MFC after: 5 days
- Throw out all of the logical APIC ID stuff. The Intel docs are somewhat
ambiguous, but it seems that the "flat" cluster model we are currently
using is only supported on Pentium and P6 family CPUs. The other
"hierarchy" cluster model that is supported on all Intel CPUs with
local APICs is severely underdocumented. For example, it's not clear
if the OS needs to glean the topology of the APIC hierarchy from
somewhere (neither ACPI nor MP Table include it) and setup the logical
clusters based on the physical hierarchy or not. Not only that, but on
certain Intel chipsets, even though there were 4 CPUs in a logical
cluster, all the interrupts were only sent to one CPU anyway.
- We now bind interrupts to individual CPUs using physical addressing via
the local APIC IDs. This code has also moved out of the ioapic PIC
driver and into the common interrupt source code so that it can be
shared with MSI interrupt sources since MSI is addressed to APICs the
same way that I/O APIC pins are.
- Interrupt source classes grow a new method pic_assign_cpu() to bind an
interrupt source to a specific local APIC ID.
- The SMP code now tells the interrupt code which CPUs are available to
handle interrupts in a simpler and more intuitive manner. For one thing,
it means we could now choose to not route interrupts to HT cores if we
wanted to (this code is currently in place in fact, but under an #if 0
for now).
- For now we simply do static round-robin of IRQs to CPUs when the first
  interrupt handler is added, just as before, with the change that IRQs are now
bound to individual CPUs rather than groups of up to 4 CPUs.
- Because the IRQ to CPU mapping has now been moved up a layer, it would
be easier to manage this mapping from higher levels. For example, we
could allow drivers to specify a CPU affinity map for their interrupts,
or we could allow a userland tool to bind IRQs to specific CPUs.
The MFC is tentative, but I want to see if this fixes problems some folks
had with UP APIC kernels on 6.0 on SMP machines (an SMP kernel would work
fine, but a UP APIC kernel (such as GENERIC in RELENG_6) would lose
interrupts).
MFC after: 1 week
- Reorder the events in exit(2) slightly so that we trigger the S_EXIT
stop event earlier. After we have signalled that, we set P_WEXIT and
then wait for any processes with a hold on the vmspace via PHOLD to
release it. PHOLD now KASSERT()'s that P_WEXIT is clear when it is
invoked, and PRELE now does a wakeup if P_WEXIT is set and p_lock drops
to zero.
- Change proc_rwmem() to require that the process being read from has its
vmspace held via PHOLD by the caller and get rid of all the junk to
screw around with the vmspace reference count as we no longer need it.
- In ptrace() and pseudofs(), treat a process with P_WEXIT set as if it
doesn't exist.
- Only do one PHOLD in kern_ptrace() now, and do it earlier so it covers
FIX_SSTEP() (since on alpha at least this can end up calling proc_rwmem()
to clear an earlier single-step simulated via a breakpoint). We only
do one to avoid races. Also, by making the EINVAL error for unknown
requests be part of the default: case in the switch, the various
switch cases can now just break out to return which removes a _lot_ of
duplicated PRELE and proc unlocks, etc. Also, it fixes at least one bug
where a LWP ptrace command could return EINVAL with the proc lock still
held.
- Changed the locking for ptrace_single_step(), ptrace_set_pc(), and
ptrace_clear_single_step() to always be called with the proc lock
held (it was a mixed bag previously). Alpha and arm have to drop
the lock while they mess around with breakpoints, but other archs
avoid extra lock release/acquires in ptrace(). I did have to fix a
couple of other consumers in kern_kse and a few other places to
hold the proc lock and PHOLD.
Tested by: ps (1 mostly, but some bits of 2-4 as well)
MFC after: 1 week
Keep accounting time (in per-cpu) cputicks and the statistics counts
in the thread and summarize into struct proc at context switch.
Don't reach across CPUs in calcru().
Add code to calibrate the top speed of cpu_tickrate() for variable
cpu_tick hardware (like TSC on power managed machines).
Don't enforce monotonicity (at least for now) in calcru. While the
calibrated cpu_tickrate ramps up it may not be true.
Use 27MHz counter on i386/Geode.
Use TSC on amd64 & i386 if present.
Use tick counter on sparc64
it. The former code used to hang older Intel CPUs by trying to get
non-existent TLB info 2^32 times.
Reduce code duplication around the calls to CPUID 0x02 by using
do-while loops.
PR: i386/92977
Tested by: cy
Rename struct thread's td_sticks to td_pticks, we will need the
other name for more appropriately named use shortly. Reduce it
from uint64_t to u_int.
Clear td_pticks whenever we enter the kernel instead of recording
its value as reference for userret(). Use the absolute value of
td->pticks in userret() and eliminate third argument.
Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.
For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines. (The main reason for this first milestone commit is
to verify that hypothesis.)
On slower machines, the avoided multiplications to normalize timestamps
at every context switch come out as a 5-7% better score on the
unixbench/context1 microbenchmark. On more modern hardware no change
in performance is seen.