other CPUs doesn't require locking so get rid of it. As the latter is used
for the timecounter on certain machine models, using a spin lock in this
case can lead to a deadlock with the upcoming callout(9) rework.
- Merge r134227/r167250 from x86:
Avoid cross-IPI SMP deadlock by using the smp_ipi_mtx spin lock not only
for smp_rendezvous_cpus() but also for the MD cache invalidation and TLB
demapping IPIs.
- Mark some unused function arguments as such.
MFC after: 1 week
recent regression with ULE, causing processes to get stuck in getblk
as well as interrupt handler execution delays to rise above the command
timeout of mpt(4).
MFC after: 3 days
This is required for ARM EABI. Section 7.1.1 of the Procedure Call for the
ARM Architecture (AAPCS) defines wchar_t as either an unsigned int or an
unsigned short with the former preferred.
Because of this requirement we need to move the definition of __wchar_t to
a machine dependent header. It also cleans up the macros defining the limits
of wchar_t by defining __WCHAR_MIN and __WCHAR_MAX in the same machine
dependent header then using them to define WCHAR_MIN and WCHAR_MAX
respectively.
Discussed with: bde
usermode, using shared page. The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.
The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.
The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.
Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.
Minimal stubs neccessary for non-x86 architectures to still compile
are provided.
Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.
Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.
As an added bonus, tidy up some nearby comments concerning page flags.
Reviewed by: kib
MFC after: 6 weeks
to this pmap.c. This new r/w lock is used primarily to synchronize access
to the TTE lists. However, it will be used in a somewhat unconventional
way. As finer-grained TTE list locking is added to each of the pmap
functions that acquire this r/w lock, its acquisition will be changed from
write to read, enabling concurrent execution of the pmap functions with
finer-grained locking.
Reviewed by: attilio
Tested by: flo
MFC after: 10 days
in_cksum.h required ip.h to be included for struct ip. To be
able to use some general checksum functions like in_addword()
in a non-IPv4 context, limit the (also exported to user space)
IPv4 specific functions to the times, when the ip.h header is
present and IPVERSION is defined (to 4).
We should consider more general checksum (updating) functions
to also allow easier incremental checksum updates in the L3/4
stack and firewalls, as well as ponder further requirements by
certain NIC drivers needing slightly different pseudo values
in offloading cases. Thinking in terms of a better "library".
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
discrepancy between modules and kernel, but deal with SMP differences
within the functions themselves.
As an added bonus this also helps in terms of code readability.
Requested by: gibbs
Reviewed by: jhb, marius
MFC after: 1 week
last show-stopper keeping PREEMPTION from being usable on sparc64 should
have been dealt with in r230662.
At least on 2-way systems, PREEMPTION causes a little bit of a degradation
in worldstone performance. However, FreeBSD seems to have started building
up regressions in !PREEMPTION cases so sparc64 better should not be an
oddball in this regard.
MFC after: 1 week
r233961:
Fix interrupt load balancing regression, introduced in revision
222813, that left all un-pinned interrupts assigned to CPU 0.
In intr_shuffle_irqs(), remove CPU_SETOF() call that initialized
the "intr_cpus" cpuset to only contain CPU0.
This initialization is too late and nullifies the results of calls
to the intr_add_cpu() that occur much earlier in the boot process.
r234074 (partial):
The BSP is not added to the mask of valid target CPUs for interrupts.
Fix this by adding the BSP as an interrupt target directly in
r234105:
Fix !SMP build after r234074.
MFC after: 3 days
- Correctly determine the maximum payload size for setting the TX link
frequent NACK latency and replay timer thresholds.
Submitted by: stefanf [1]
MFC after: 3 days
DMA tag with a 4 GB boundary as required by PCI-Express. With r232403 in
place this actually is redundant. However, the host-PCI-Express bridge
driver is the more appropriate place for implementing this restriction.
MFC after: 3 days
As of FreeBSD 8, this driver should not be used. Applications that use
posix_openpt(2) and openpty(3) use the pts(4) that is built into the
kernel unconditionally. If it turns out high profile depend on the
pty(4) module anyway, I'd rather get those fixed. So please report any
issues to me.
The pty(4) module is still available as a kernel module of course, so a
simple `kldload pty' can be used to run old-style pseudo-terminals.
didn't already have them. This is because the ternary expression will
return int, due to the Usual Arithmetic Conversions. Such casts are not
needed for the 32 and 64 bit variants.
While here, add additional parentheses around the x86 variant, to
protect against unintended consequences.
MFC after: 2 weeks
platforms.
This will make every attempt to mount a non-mpsafe filesystem to the
kernel forbidden, unless it is expressely compiled with
VFS_ALLOW_NONMPSAFE option.
This patch is part of the effort of killing non-MPSAFE filesystems
from the tree.
No MFC is expected for this patch.
The tag enforces a single restriction that all DMA transactions must not
cross a 4GB boundary. Note that while this restriction technically only
applies to PCI-express, this change applies it to all PCI devices as it
is simpler to implement that way and errs on the side of caution.
- Add a softc structure for PCI bus devices to hold the bus_dma tag and
a new pci_attach_common() routine that performs actions common to the
attach phase of all PCI bus drivers. Right now this only consists of
a bootverbose printf and the allocate of a bus_dma tag if necessary.
- Adjust all PCI bus drivers to allocate a PCI bus softc and to call
pci_attach_common() from their attach routines.
MFC after: 2 weeks
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
constraints.
These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t). Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.
Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.
Reviewed by: scottl
so try harder to get the CDMA sync interrupt delivered and also in
a more efficient way:
- wrap the whole process of sending and receiving the CDMA sync
interrupt in a critical section so we don't get preempted,
- send the CDMA sync interrupt to the CPU that is actually waiting
for it to happen so we don't take a detour via another CPU,
- instead of waiting for up to 15 seconds for the interrupt to
trigger try the whole process for up to 15 times using a one
second timeout (the code was also changed to just ignore belated
interrupts of a previous tries should they appear).
According to testing done by Peter Jeremy with the debugging also
added as part of this commit the first two changes apparently are
sufficient to now properly get the CDMA sync interrupts delivered
at the first try though.
VIS-based block copy/zero implementations. While with 4BSD it's
sufficient to just disable the tick interrupts, with ULE+PREEMPTION
it's otherwise also possible that these are preempted via IPIs.
helper since r230632, use these for output and panicing during the
early cycles and move cninit() until after the static per-CPU data
has been set up. This solves a couple of issue regarding the non-
availability of the static per-CPU data:
- panic() not working and only making things worse when called,
- having to supply a special DELAY() implementation to the low-level
console drivers,
- curthread accesses of mutex(9) usage in low-level console drivers
that aren't conditional due to compiler optimizations (basically,
this is the problem described in r227537 but in this case for
keyboards attached via uart(4)). [1]
PR: 164123 [1]
implementing a simple OF_panic() that may be used during the early
cycles when panic() isn't available, yet.
- Mark cpu_{exit,shutdown}() as __dead2 as appropriate.
VM_KMEM_SIZE_SCALE to 2, awaiting more insight from alc@. As it turns
out, the VM apparently has problems with machines that have large holes
in the physical address space, causing the kmem_suballoc() call in
kmeminit() to fail with a VM_KMEM_SIZE_SCALE of 1. Using a value of 2
allows these, namely Blade 1500 with 2GB of RAM, to boot.
PR: 164227
CTL is a disk and processor device emulation subsystem originally written
for Copan Systems under Linux starting in 2003. It has been shipping in
Copan (now SGI) products since 2005.
It was ported to FreeBSD in 2008, and thanks to an agreement between SGI
(who acquired Copan's assets in 2010) and Spectra Logic in 2010, CTL is
available under a BSD-style license. The intent behind the agreement was
that Spectra would work to get CTL into the FreeBSD tree.
Some CTL features:
- Disk and processor device emulation.
- Tagged queueing
- SCSI task attribute support (ordered, head of queue, simple tags)
- SCSI implicit command ordering support. (e.g. if a read follows a mode
select, the read will be blocked until the mode select completes.)
- Full task management support (abort, LUN reset, target reset, etc.)
- Support for multiple ports
- Support for multiple simultaneous initiators
- Support for multiple simultaneous backing stores
- Persistent reservation support
- Mode sense/select support
- Error injection support
- High Availability support (1)
- All I/O handled in-kernel, no userland context switch overhead.
(1) HA Support is just an API stub, and needs much more to be fully
functional.
ctl.c: The core of CTL. Command handlers and processing,
character driver, and HA support are here.
ctl.h: Basic function declarations and data structures.
ctl_backend.c,
ctl_backend.h: The basic CTL backend API.
ctl_backend_block.c,
ctl_backend_block.h: The block and file backend. This allows for using
a disk or a file as the backing store for a LUN.
Multiple threads are started to do I/O to the
backing device, primarily because the VFS API
requires that to get any concurrency.
ctl_backend_ramdisk.c: A "fake" ramdisk backend. It only allocates a
small amount of memory to act as a source and sink
for reads and writes from an initiator. Therefore
it cannot be used for any real data, but it can be
used to test for throughput. It can also be used
to test initiators' support for extremely large LUNs.
ctl_cmd_table.c: This is a table with all 256 possible SCSI opcodes,
and command handler functions defined for supported
opcodes.
ctl_debug.h: Debugging support.
ctl_error.c,
ctl_error.h: CTL-specific wrappers around the CAM sense building
functions.
ctl_frontend.c,
ctl_frontend.h: These files define the basic CTL frontend port API.
ctl_frontend_cam_sim.c: This is a CTL frontend port that is also a CAM SIM.
This frontend allows for using CTL without any
target-capable hardware. So any LUNs you create in
CTL are visible in CAM via this port.
ctl_frontend_internal.c,
ctl_frontend_internal.h:
This is a frontend port written for Copan to do
some system-specific tasks that required sending
commands into CTL from inside the kernel. This
isn't entirely relevant to FreeBSD in general,
but can perhaps be repurposed.
ctl_ha.h: This is a stubbed-out High Availability API. Much
more is needed for full HA support. See the
comments in the header and the description of what
is needed in the README.ctl.txt file for more
details.
ctl_io.h: This defines most of the core CTL I/O structures.
union ctl_io is conceptually very similar to CAM's
union ccb.
ctl_ioctl.h: This defines all ioctls available through the CTL
character device, and the data structures needed
for those ioctls.
ctl_mem_pool.c,
ctl_mem_pool.h: Generic memory pool implementation used by the
internal frontend.
ctl_private.h: Private data structres (e.g. CTL softc) and
function prototypes. This also includes the SCSI
vendor and product names used by CTL.
ctl_scsi_all.c,
ctl_scsi_all.h: CTL wrappers around CAM sense printing functions.
ctl_ser_table.c: Command serialization table. This defines what
happens when one type of command is followed by
another type of command.
ctl_util.c,
ctl_util.h: CTL utility functions, primarily designed to be
used from userland. See ctladm for the primary
consumer of these functions. These include CDB
building functions.
scsi_ctl.c: CAM target peripheral driver and CTL frontend port.
This is the path into CTL for commands from
target-capable hardware/SIMs.
README.ctl.txt: CTL code features, roadmap, to-do list.
usr.sbin/Makefile: Add ctladm.
ctladm/Makefile,
ctladm/ctladm.8,
ctladm/ctladm.c,
ctladm/ctladm.h,
ctladm/util.c: ctladm(8) is the CTL management utility.
It fills a role similar to camcontrol(8).
It allow configuring LUNs, issuing commands,
injecting errors and various other control
functions.
usr.bin/Makefile: Add ctlstat.
ctlstat/Makefile
ctlstat/ctlstat.8,
ctlstat/ctlstat.c: ctlstat(8) fills a role similar to iostat(8).
It reports I/O statistics for CTL.
sys/conf/files: Add CTL files.
sys/conf/NOTES: Add device ctl.
sys/cam/scsi_all.h: To conform to more recent specs, the inquiry CDB
length field is now 2 bytes long.
Add several mode page definitions for CTL.
sys/cam/scsi_all.c: Handle the new 2 byte inquiry length.
sys/dev/ciss/ciss.c,
sys/dev/ata/atapi-cam.c,
sys/cam/scsi/scsi_targ_bh.c,
scsi_target/scsi_cmds.c,
mlxcontrol/interface.c: Update for 2 byte inquiry length field.
scsi_da.h: Add versions of the format and rigid disk pages
that are in a more reasonable format for CTL.
amd64/conf/GENERIC,
i386/conf/GENERIC,
ia64/conf/GENERIC,
sparc64/conf/GENERIC: Add device ctl.
i386/conf/PAE: The CTL frontend SIM at least does not compile
cleanly on PAE.
Sponsored by: Copan Systems, SGI and Spectra Logic
MFC after: 1 month
configurations for various architectures in FreeBSD 10.x. This allows
basic Capsicum functionality to be used in the default FreeBSD
configuration on non-embedded architectures; process descriptors are not
yet enabled by default.
MFC after: 3 months
Sponsored by: Google, Inc
no need to additionally add CPU memory barriers to the acquire variants of
atomic(9), these are documented to also include compiler memory barriers.
So add the latter, which were previously included by using membar(), back.
According to the open firmware standard, finddevice call has to return
a phandle with value of -1 in case of error.
This commit is to:
- Fix the FDT implementation of this interface (ofw_fdt_finddevice) to
return (phandle_t)-1 in case of error, instead of 0 as it does now.
- Fix up the callers of OF_finddevice() to compare the return value with
-1 instead of 0 to check for errors.
- Since phandle_t is unsigned, the return value of OF_finddevice should
be checked with '== -1' rather than '<= 0' or '> 0', fix up these cases
as well.
Reported by: nwhitehorn
Reviewed by: raj
Approved by: raj, nwhitehorn
the 16-bit cylinders field of the VTOC8 disk label (at around 502GB). The
geometry chosen for disks above that limit allows to use disks up to 2TB,
which is the limit of the extended VTOC8 format. The geometry used for
disks smaller than the 16-bit cylinders limit stays the same as used by
cam_calc_geometry(9) for extended translation.
Thanks to Hans-Joerg Sirtl for providing hardware for testing this change.
MFC after: 3 days
compatible with each other and since r227539 the last issue seen when
using SCHED_ULE is fixed. At least on UP and 2-way machines SCHED_4BSD
still performs better than SCHED_ULE, however, the optimizations done
in r225889 pretty much compensate that so there's at least no net
regression.
Thanks go to Peter Jeremy for extensive testing.
one. Interestingly, these are actually the default for quite some time
(bus_generic_driver_added(9) since r52045 and bus_generic_print_child(9)
since r52045) but even recently added device drivers do this unnecessarily.
Discussed with: jhb, marcel
- While at it, use DEVMETHOD_END.
Discussed with: jhb
- Also while at it, use __FBSDID.
directly from g7, the pcpu pointer. This guarantees correct behavior
when the thread migrates to a different CPU.
Commit message stolen from r205431. Additional testing by Peter Jeremy.
MFC after: 3 days
all the architectures.
The option allows to mount non-MPSAFE filesystem. Without it, the
kernel will refuse to mount a non-MPSAFE filesytem.
This patch is part of the effort of killing non-MPSAFE filesystems
from the tree.
No MFC is expected for this patch.
Tested by: gianni
Reviewed by: kib
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
replace amd(4) with the former in the amd64, i386 and pc98 GENERIC kernel
configuration files. Besides duplicating functionality, amd(4), which
previously also supported the AMD Am53C974, unlike esp(4) is no longer
maintained and has accumulated enough bit rot over time to always cause
a panic during boot as long as at least one target is attached to it
(see PR 124667).
PR: 124667
Obtained from: NetBSD (based on)
MFC after: 3 days
- Move esp_devclass to ncr53c9x.c in order to allow different bus front-ends
to use it.
- Use KOBJMETHOD_END.
- Remove the gl_clear_latched_intr hook as it's not needed for any of the
chips nor the front-ends supported in FreeBSD and likely never will be.
- Correct the DMA constraints used in the SBus front-end, the LSI64854 isn't
limited to 32-bit DMA.
- The ESP200 also only supports up to 64k transfers.
- Don't let the DMA and SBus front-end supply a maximum transfer size larger
than MAXPHYS as that's the maximum the upper layers use and we otherwise
just waste resources unnecessarily.
- Initialize the ECB callout and don't zero the handle when returning ECBs
to the free list so that ncr53c9x_callout() actually is called with the
driver lock held.
- On detach the driver lock should be held across cam_sim_free() according
to isp(4) and a panic received.
- Check the return value of NCRDMA_SETUP(), i.e. bus_dmamap_load(9), and try
to handle failures gracefully.
- In ncr53c9x_action() replace N calls to xpt_done() in a switch with just
one at the end.
- On XPT_PATH_INQ report "NCR" rather than "Sun" as the vendor as the former
is somewhat more correct as well as the maximum supported transfer size via
maxio in order to take advantage of controllers that that can handle more
than DFLTPHYS.
- Print the number of MESSAGE (EXTENDED) rejected.
- Fix the path encoded in the multiple inclusion protection of ncr53c9xvar.h.
- Correct the DMA constraints used in the LSI64854 core to not exceed the
maximum supported transfer size and include the boundary so we don't need
to check on every setup of a DMA transfer.
- Let the bus DMA map callbacks do nothing in case of an error.
- Correctly handle > 64k transfers for FAS366 in the LSI64854. A new feature
flag NCR_F_LARGEXFER was introduced so we just need to check for this one
and not for individual controllers supporting large transfers in several
places.
- Let the LSI64854 core load transfer buffers using BUS_DMA_NOWAIT as the
NCR53C9x core can't handle EINPROGRESS. Due to lack of bounce buffers
support, sparc64 doesn't actually use EINPROGRESS and likely never will,
as an example for writing additional front-ends for the NCR53C9x core it
makes sense to set BUS_DMA_NOWAIT anyway though.
- Some minor cleanup.
thing when changing the debugging options as part of head becoming a new
stable branch. It may also help people who for one reason or another want
to run head but don't want it slowed down by the debugging support.
Reviewed by: kib
implement a deprecated FPU control interface in addition to the
standard one. To make this clearer, further deprecate ieeefp.h
by not declaring the function prototypes except on architectures
that implement them already.
Currently i386 and amd64 implement the ieeefp.h interface for
compatibility, and for fp[gs]etprec(), which doesn't exist on
most other hardware. Powerpc, sparc64, and ia64 partially implement
it and probably shouldn't, and other architectures don't implement it
at all.
As part of the 8.0-RELEASE cycle this was done in stable/8 (r199112)
but was left alone in head so people could work on fixing an issue that
caused boot failure on some motherboards. Apparently nobody has worked
on it and we are getting reports of boot failure with the 9.0 test builds.
So this time I'll comment out the driver in head (still hoping someone
will work on it) and MFC to stable/9.
Submitted by: Alberto Villa <avilla at FreeBSD dot org>
and pc_pmap for SMP. This is key to allowing adding support for SCHED_ULE.
Thanks go to Peter Jeremy for additional testing.
- Add support for SCHED_ULE to cpu_switch().
Committed from: 201110DevSummit
- Implement bus_adjust_resource() methods as far as necessary and in non-PCI
bridge drivers as far as feasible without rototilling them.
- As NEW_PCIB does a layering violation by activating resources at layers
above pci(4) without previously bubbling up their allocation there, move
the assignment of bus tags and handles from the bus_alloc_resource() to
the bus_activate_resource() methods like at least the other NEW_PCIB
enabled architectures do. This is somewhat unfortunate as previously
sparc64 (ab)used resource activation to indicate whether SYS_RES_MEMORY
resources should be mapped into KVA, which is only necessary if their
going to be accessed via the pointer returned from rman_get_virtual() but
not for bus_space(9) as the later always uses physical access on sparc64.
Besides wasting KVA if we always map in SYS_RES_MEMORY resources, a driver
also may deliberately not map them in if the firmware already has done so,
possibly in a special way. So in order to still allow a driver to decide
whether a SYS_RES_MEMORY resource should be mapped into KVA we let it
indicate that by calling bus_space_map(9) with BUS_SPACE_MAP_LINEAR as
actually documented in the bus_space(9) page. This is implemented by
allocating a separate bus tag per SYS_RES_MEMORY resource and passing the
resource via the previously unused bus tag cookie so we later on can call
rman_set_virtual() in sparc64_bus_mem_map(). As a side effect this now
also allows to actually indicate that a SYS_RES_MEMORY resource should be
mapped in as cacheable and/or read-only via BUS_SPACE_MAP_CACHEABLE and
BUS_SPACE_MAP_READONLY respectively.
- Do some minor cleanup like taking advantage of rman_init_from_resource(),
factor out the common part of bus tag allocation into a newly added
sparc64_alloc_bus_tag(), hook up some missing newbus methods and replace
some homegrown versions with the generic counterparts etc.
- While at it, let apb_attach() (which can't use the generic NEW_PCIB code
as APB bridges just don't have the base and limit registers implemented)
regarding the config space registers cached in pcib_softc and the SYSCTL
reporting nodes set up.
also use the streaming buffer of pre version 5/revision 2.3 hardware as
long as we stay away from context flushes (which iommu(4) so far doesn't
take advantage of). OpenSolaris does the same.
atomic operations behave as if the were followed by a memory barrier so
there's no need to include ones in the acquire variants of atomic(9).
Removing these results a small performance improvement, specifically this
is sufficient to compensate the performance loss seen in the worldstone
benchmark seen when using SCHED_ULE instead of SCHED_4BSD.
This change is inspired by Linux even more radically doing the equivalent
thing some time ago.
Thanks go to Peter Jeremy for additional testing.
protected the dirty mask updates. The dirty mask updates are handled
by atomics after the r225840.
Submitted by: alc
Tested by: flo (sparc64)
MFC after: 2 weeks
thing all the other architectures already do) thus just initialize
kernel_pmap in pmap_bootstrap().
Reported by: alc
Reviewed by: alc, marius
Tested by: flo, marius
Approved by: re (kib)
MFC after: 1 week
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.
Document the changes to flags field to only require the page lock.
Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.
Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)
which may cause problems when these contain garbage so zero the range
descriptors embedding the rmans when allocating them.
Approved by: re (kib)
MFC after: 3 days
to VPO_UNMANAGED (and also making the flag protected by the vm object
lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
VPO_UNMANAGED. As a consequence, pmap code now can use use just
VPO_UNMANAGED to decide whether the page is unmanaged.
Reviewed by: alc
Tested by: pho (x86, previous version), marius (sparc64),
marcel (arm, ia64, powerpc), ray (mips)
Sponsored by: The FreeBSD Foundation
Approved by: re (bz)
NFSCL, NFSD instead of NFSCLIENT, NFSSERVER since
NFSCL and NFSD are now the defaults. The client change is
needed for diskless configurations, so that the root
mount works for fstype nfs.
Reported by seanbru at yahoo-inc.com for i386/XEN.
Approved by: re (hrs)
When the last, possibly partially filled buffer is flushed, we didn't
reset fragsz to 0 and as such would stop reflecting reality.
- Use __FBSDID.
- Wrap a too long line.
Approved by: re (kib)
MFC after: 1 week
This patch is going to help in cases like mips flavours where you
want a more granular support on MAXCPU.
No MFC is previewed for this patch.
Tested by: pluknet
Approved by: re (kib)
sintrcnt/sintrnames which are symbols containing the size of the 2
tables.
- For amd64/i386 remove the storage of intr* stuff from assembly files.
This area can be widely improved by applying the same to other
architectures and likely finding an unified approach among them and
move the whole code to be MI. More work in this area is expected to
happen fairly soon.
No MFC is previewed for this patch.
Tested by: pluknet
Reviewed by: jhb
Approved by: re (kib)
cycle mode as timecounter just works fine. My best guess is that a firmware
update has fixed this, check at run-time whether it advances and use a
positive quality if it does. The latter will cause this timecounter to be
used instead of the tick counter based one, which just sucks for SMP.
- Remove a redundant NULL assignment from the timecounter initialization.
as STX_CTRL_PERF_CNT_CNT0_SHIFT actually is zero, if we were using the
second counter in the upper 32 bits this would be required though as the MI
timecounter code doesn't support 64-bit counters/counter registers.
- Remove a redundant NULL assignment from the timecounter initialization.
This is just a simple approach. For reasons unknown OpenSolaris uses a
more sophisticated one involving IPIing the remaining CPUs in reverse
order after the first batch of 32.
erratum causing them to trigger stray vector interrupts accompanied by a
state in which they even fault on locked TLB entries. Just retrying the
instruction in that case gets the CPU back on track though. OpenSolaris
also just ignores a certain number of stray vector interrupts.
While at it, implement the stray vector interrupt handling for SPARC64-VI
which use these for indicating uncorrectable errors in interrupt packets.
the TLBs in order to get rid of the user mappings but instead traverse
them an flush only the latter like we also do for the Spitfire-class.
Also flushing the unlocked kernel entries can cause instant faults which
when called from within cpu_switch() are handled with the scheduler lock
held which in turn can cause timeouts on the acquisition of the lock by
other CPUs. This was easily seen with a 16-core V890 but occasionally
also happened with 2-way machines.
While at it, move the SPARC64-V support code entirely to zeus.c. This
causes a little bit of duplication but is less confusing than partially
using Cheetah-class bits for these.
- For SPARC64-V ensure that 4-Mbyte page entries are stored in the 1024-
entry, 2-way set associative TLB.
- In {d,i}tlb_get_data_sun4u() turn off the interrupts in order to ensure
that ASI_{D,I}TLB_DATA_ACCESS_REG actually are read twice back-to-back.
Tested by: Peter Jeremy (16-core US-IV), Michael Moll (2-way SPARC64-V)
more explicit comments about what's going on and what future maintainers
need to do when e.g. adding a new operation to a sys_machdep.c.
Approved by: mentor(rwatson), re(bz)
interrupts. Bringup on additional machine models repeatedly reveals
firmware that enables interrupts behind our back, causing the console
to be flooded otherwise.
- As with the regular interrupt counters using uint16_t instead of
u_long for counting the stray vector interrupts should be more than
sufficient.
- Cache the interrupt vector in intr_stray_vector().
have to ignore it when sending the IPI anyway. Actually I can't think of
a good reason why this ever was done that way in the first place as it's
not even usefull for debugging.
While at it replace the use of pc_other_cpus as it's slated for deorbit.
RSF_FATAL we need to switch to alternate globals for KSTACK_CHECK just
like tl1_data_excptn(_trap) does. This is more or less cosmetic because
in case RSF_FATAL is called we're already heading south.
- Correct an END().
- Read the window state from the correct register for a CATR().
more than three temporary register in several places CATR() is used so
this code trades instructions in for registers. Actually, this still isn't
sufficient and CATR() has the side-effect of clobbering %y. Luckily, with
the current uses of CATR() this either doesn't matter or we are able to
(save and) restore it.
Now that there's only one use of AND() and TEST() left inline these.
This introduce all the underlying support for making this possible (via
the function cpusetobj_strscan() and keeps ktr_cpumask exported. sparc64
implements its own assembly primitives for tracing events and needs to
properly check it. Anyway the sparc64 logic is not implemented yet due
to lack of knowledge (by me) and time (by marius), but it is just a
matter of using ktr_cpumask when possible.
Tested and fixed by: pluknet
Reviewed by: marius
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.
Reviewed by: jhb
Some files keep the SUN4V tags as a code reference, for the future,
if any rewamped sun4v support wants to be added again.
Reviewed by: marius
Tested by: sbruno
Approved by: re
architectures (i386, for example) the virtual memory space may be
constrained enough that 2MB is a large chunk. Use 64K for arches
other than amd64 and ia64, with special handling for sparc64 due to
differing hardware.
Also commit the comment changes to kmem_init_zero_region() that I
missed due to not saving the file. (Darn the unfamiliar development
environment).
Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you
see fit.
Requested by: alc
MFC after: 1 week
MFC with: r221853
pc_cpumask were changed to cpuset_t. This now calculates the cpumask based
on pc_cpuid itself as pc_cpumask is slated for being deorbited. Note that
this needs r221750 to be MFC'ed in order to compile.
This seems to work fine but after a few dozens of successful IPIs something
suddenly adds pc_cpuid to pc_other_cpus, causing the respective assertions
in mp_machdep.c to be triggered when the latter is used as the base for the
targets.
must not, otherwise we tell the CPU to IPI itself, which the sun4u CPUs don't
support. For reasons unknown so far MD and MI IPI use actually still triggers
that assertion though.
This now calculates pc_cpumask based on pc_cpuid itself as the former is
slated for being deorbited.
This branch now at least boots UP again. MP needs more things converted and
the existing conversion from cpumask_t to cpuset_t still has bugs.
versions instead. They were never needed as bus_generic_intr() and
bus_teardown_intr() had been changed to pass the original child device up
in 42734, but the ISA bus was not converted to new-bus until 45720.
driver would verify that requests for child devices were confined to any
existing I/O windows, but the driver relied on the firmware to initialize
the windows and would never grow the windows for new requests. Now the
driver actively manages the I/O windows.
This is implemented by allocating a bus resource for each I/O window from
the parent PCI bus and suballocating that resource to child devices. The
suballocations are managed by creating an rman for each I/O window. The
suballocated resources are mapped by passing the bus_activate_resource()
call up to the parent PCI bus. Windows are grown when needed by using
bus_adjust_resource() to adjust the resource allocated from the parent PCI
bus. If the adjust request succeeds, the window is adjusted and the
suballocation request for the child device is retried.
When growing a window, the rman_first_free_region() and
rman_last_free_region() routines are used to determine if the front or
end of the existing I/O window is free. From using that, the smallest
ranges that need to be added to either the front or back of the window
are computed. The driver will first try to grow the window in whichever
direction requires the smallest growth first followed by the other
direction if that fails.
Subtractive bridges will first attempt to satisfy requests for child
resources from I/O windows (including attempts to grow the windows). If
that fails, the request is passed up to the parent PCI bus directly
however.
The PCI-PCI bridge driver will try to use firmware-assigned ranges for
child BARs first and only allocate a "fresh" range if that specific range
cannot be accommodated in the I/O window. This allows systems where the
firmware assigns resources during boot but later wipes the I/O windows
(some ACPI BIOSen are known to do this) to "rediscover" the original I/O
window ranges.
The ACPI Host-PCI bridge driver has been adjusted to correctly honor
hw.acpi.host_mem_start and the I/O port equivalent when a PCI-PCI bridge
makes a wildcard request for an I/O window range.
The new PCI-PCI bridge driver is only enabled if the NEW_PCIB kernel option
is enabled. This is a transition aide to allow platforms that do not
yet support bus_activate_resource() and bus_adjust_resource() in their
Host-PCI bridge drivers (and possibly other drivers as needed) to use the
old driver for now. Once all platforms support the new driver, the
kernel option and old driver will be removed.
PR: kern/143874 kern/149306
Tested by: mav
NFS client (which I guess is no longer experimental). The fstype "newnfs"
is now "nfs" and the regular/old NFS client is now fstype "oldnfs".
Although mounts via fstype "nfs" will usually work without userland
changes, an updated mount_nfs(8) binary is needed for kernels built with
"options NFSCL" but not "options NFSCLIENT". Updated mount_nfs(8) and
mount(8) binaries are needed to do mounts for fstype "oldnfs".
The GENERIC kernel configs have been changed to use options
NFSCL and NFSD (the new client and server) instead of NFSCLIENT and NFSSERVER.
For kernels being used on diskless NFS root systems, "options NFSCL"
must be in the kernel config.
Discussed on freebsd-fs@.
stack. It means that all legacy ATA drivers are disabled and replaced by
respective CAM drivers. If you are using ATA device names in /etc/fstab or
other places, make sure to update them respectively (adX -> adaY,
acdX -> cdY, afdX -> daY, astX -> saY, where 'Y's are the sequential
numbers for each type in order of detection, unless configured otherwise
with tunables, see cam(4)).
ataraid(4) functionality is now supported by the RAID GEOM class.
To use it you can load geom_raid kernel module and use graid(8) tool
for management. Instead of /dev/arX device names, use /dev/raid/rX.
r220375 all drivers enabled in the sparc64 GENERIC should be either
correctly using bus_dmamap_sync(9) calls or supply BUS_DMA_COHERENT
when appropriate or as a workaround for missing bus_dmamap_sync(9)
calls (sound(4) drivers and partially sym(4)). In at least some
configurations taking advantage of the streaming cache results in
a modest performance improvement.
- Remove the memory barrier for BUS_DMASYNC_PREREAD which as the
comment already suggested is bogus.
- Add my copyright for having implemented several things like support
for the Fire and Oberon IOMMUs, taking over PROM IOMMU mappings etc.
Introduce the AHB glue for Atheros embedded systems. Right now it's
hard-coded for the AR9130 chip whose support isn't yet in this HAL;
it'll be added in a subsequent commit.
Kernel configuration files now need both 'ath' and 'ath_pci' devices; both
modules need to be loaded for the ath device to work.
the iommu(4) provided one, i.e. in case of Hummingbird and Sabre bridges,
otherwise just use the iommu(4) one. This also fixes a bug introduced in
r220039 which caused an empty DMA method table to be used for the second
of a pair of Psycho bridges.
syncing for Hummingbird and Sabre bridges should be applied with every
BUS_DMASYNC_POSTREAD instead of in a wrapper around interrupt handlers
for devices behind PCI-PCI bridges only as suggested by the documentation
(code for the latter actually exists in OpenSolaris but is disabled by
default), which also makes more sense.
- Take advantage of the ofw_pci_setup_device method introduced in r220038
for disabling bus parking for certain EBus bridges in order to
- Mark some unused parameters as such.
register changes when compiled with SCHIZO_DEBUG and take advantage
of them.
- Add support for the XMITS Fireplane/Safari to PCI-X bridges. I tought
I'd need this for a Sun Fire 3800, which then turned out to not being
equipped with such a bridge though. The support for these should be
complete but given that it hasn't actually been tested probing is
disabled for now.
This required a way to alter the XMITS configuration in case a PCI-X
device is found further down the device tree so the sparc64 specific
ofw_pci kobj was revived with a ofw_pci_setup_device method, which is
called by the ofw_pcibus code for every device added.
- A closer inspection of the OpenSolaris code indicates that consistent
DMA flushing/syncing as well as the block store workaround should be
applied with every BUS_DMASYNC_POSTREAD instead of in a wrapper around
interrupt handlers for devices behind PCI-PCI bridges only as suggested
by the documentation (code for the latter actually exists in OpenSolaris
but is disabled by default), which also makes more sense.
- Add a workaround for Casinni/Skyhawk combinations. Chances are that
this solves the crashes seen when using the the on-board Casinni NICs
of Sun Fire V480 equipped with centerplanes other than 501-6780 or
501-6790. This also takes advantage of the ofw_pci_setup_device method.
- Mark some unused parameters as such.
- A closer inspection of the OpenSolaris code indicates the block store
workaround is only necessary in case of BUS_DMASYNC_POSTREAD.
- Mark some unused parameters as such.
- Emitt an error when encountering an unsupported and in case of the
kernel also for unaligned relocations.
- Fix R_SPARC_LOX10 relocations. Apparently these are hardly ever used.
- Add the _RF_X committed in r212998 also to the tables in the sparc64
reloc.c in order reduce differences between the kernel and the userland
source. This results in no functional change though.
- Fix further inconsistencies in the abbreviations of the names of the
relocations.
- Further whitespace fixes.
Obtained from: NetBSD [1]
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.
Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.
While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.
Discussed with: kib
MFC after: 2 Week
values for resolved symbols relative to relocbase instead of sections
so detect this case and handle as appropriate, which allows using
kernel modules linked with affected versions of binutils. Actually I
think this is a bug in binutils but given that apparently nobody
complained for nearly six years and powerpc has basically the same
workaround I decided to put it in for the sparc64 kernel, too.
- Fix R_SPARC_HIX22 relocations. Apparently these are hardly ever used.
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().
Change several checks for a magic number of iterations to use
should_yield() instead.
MFC after: 1 week
sf buf allocation, use wakeup() instead of wakeup_one() to notify sf
buffer waiters about free buffer.
sf_buf_alloc() calls msleep(PCATCH) when SFB_CATCH flag was given,
and for simultaneous wakeup and signal delivery, msleep() returns
EINTR/ERESTART despite the thread was selected for wakeup_one(). As
result, we loose a wakeup, and some other waiter will not be woken up.
Reported and tested by: az
Reviewed by: alc, jhb
MFC after: 1 week
Compile sys/dev/mem/memutil.c for all supported platforms and remove now
unnecessary dev_mem_md_init(). Consistently define mem_range_softc from
mem.c for all platforms. Add missing #include guards for machine/memdev.h
and sys/memrange.h. Clean up some nearby style(9) nits.
MFC after: 1 month
TSB is located within the 32-bit address space, which held true as long as
we were using virtual addresses magic-mapped before the location of the
kernel for addressing it. However, with r216803 in place when possible we
address it via its physical address instead, which on machines like Sun Fire
V880 have no physical memory in the 32-bit address space at all requires
to use 64-bit addressing. When using physical addressing it still should
be safe to assume that we can just ignore the lowest 10 bits of the address
as a minor optimization as we did before r216803.
architecture macros (__mips_n64, __powerpc64__) when 64 bit types (and
corresponding macros) are different from 32 bit. [1]
Correct the type of INT64_MIN, INT64_MAX and UINT64_MAX.
Define (U)INTMAX_C as an alias for (U)INT64_C matching the type definition
for (u)intmax_t. Do this on all architectures for consistency.
Suggested by: bde [1]
Approved by: kib (mentor)
On some architectures UCHAR_MAX and USHRT_MAX had type unsigned int.
However, lacking integer suffixes for types smaller than int, their type
should correspond to that of an object of type unsigned char (or short)
when used in an expression with objects of type int. In that case unsigned
char (short) are promoted to int (i.e. signed) so the type of UCHAR_MAX and
USHRT_MAX should also be int.
Where MIN/MAX constants implicitly have the correct type the suffix has
been removed.
While here, correct some comments.
Reviewed by: bde
Approved by: kib (mentor)
and switch sparc64 to use the first one for bus error filter handlers of
bridge drivers instead of (ab)using INTR_FAST for that so we eventually
can get rid of the latter.
Reviewed by: jhb
MFC after: 1 month
functions, otherwise if we get preempted after checking whether a certain
pmap is active on the current CPU but before disabling interrupts we might
operate on an outdated state as the pmap might have been deactivated in
the meantime. As the same issue may arises when the TLB demap function is
interrupted by a TLB demap IPI, just entering a critical section before
the check isn't sufficient so we have to fully disable interrupts instead.
MFC after: 3 days
which takes an physical address instead of an virtual one, for loading TTEs
of the kernel TSB so we no longer need to lock the kernel TSB into the dTLB,
which only has a very limited number of lockable dTLB slots. The net result
is that we now basically can handle a kernel TSB of any size and no longer
need to limit the kernel address space based on the number of dTLB slots
available for locked entries. Consequently, other parts of the trap handlers
now also only access the the kernel TSB via its physical address in order
to avoid nested traps, as does the PMAP bootstrap code as we haven't taken
over the trap table at that point, yet. Apart from that the kernel TSB now
is accessed via a direct mapping when we are otherwise taking advantage of
ASI_ATOMIC_QUAD_LDD_PHYS so no further code changes are needed. Most of this
is implemented by extending the patching of the TSB addresses and mask as
well as the ASIs used to load it into the trap table so the runtime overhead
of this change is rather low. Currently the use of ASI_ATOMIC_QUAD_LDD_PHYS
is not yet enabled on SPARC64 CPUs due to lack of testing and due to the
fact it might require minor adjustments there.
Theoretically it should be possible to use the same approach also for the
user TSB, which already is not locked into the dTLB, avoiding nested traps.
However, for reasons I don't understand yet OpenSolaris only does that with
SPARC64 CPUs. On the other hand I think that also addressing the user TSB
physically and thus avoiding nested traps would get us closer to sharing
this code with sun4v, which only supports trap level 0 and 1, so eventually
we could have a single kernel which runs on both sun4u and sun4v (as does
Linux and OpenBSD).
Developed at and committed from: 27C3
STICK/STICK_COMPARE independently of the selected instruction set by
TICK_COMPARE so tick.c as of r214358 once again can be compiled with
gcc -mcpu=v9 for reference purposes.
kernel address space in order to leave space for the buffer cache, pipes,
thread stacks, etc on machines with more physical memory until we take
advantage of ASI_ATOMIC_QUAD_LDD_PHYS on CPUs providing it so we don't need
to lock the kernel TSB pages into the dTLB, basically making the entire
64-bit kernel address space available on relevant machines.
Submitted by: alc
Passing a count of zero on i386 and amd64 for [I386|AMD64]_BUS_SPACE_MEM
causes a crash/hang since the 'loop' instruction decrements the counter
before checking if it's zero.
PR: kern/80980
Discussed with: jhb
DEBUG_MEMGUARD panics early in kmeminit() with the message
"kmem_suballoc: bad status return of 1" because of zero "size" argument
passed to kmem_suballoc() due to "vm_kmem_size_max" being zero.
The problem also exists on ia64.
creation of large page mappings in the pmap, it can provide modest
performance benefits. In particular, for a "buildworld" on a 2x 1GHz
Ultrasparc IIIi it reduced the wall clock time by 2.2% and the system
time by 12.6%.
Tested by: marius@
contents of the ones that were not empty were stale and unused.
- Now that <machine/mutex.h> no longer exists, there is no need to allow it
to override various helper macros in <sys/mutex.h>.
- Rename various helper macros for low-level operations on mutexes to live
in the _mtx_* or __mtx_* namespaces. While here, change the names to more
closely match the real API functions they are backing.
- Drop support for including <sys/mutex.h> in assembly source files.
Suggested by: bde (1, 2)
work properly with single-stepping in a kernel debugger. Specifically,
these routines have always disabled interrupts before increasing the nesting
count and restored the prior state of interrupts after decreasing the nesting
count to avoid problems with a nested interrupt not disabling interrupts
when acquiring a spin lock. However, trap interrupts for single-stepping
can still occur even when interrupts are disabled. Now the saved state of
interrupts is not saved in the thread until after interrupts have been
disabled and the nesting count has been increased. Similarly, the saved
state from the thread cannot be read once the nesting count has been
decreased to zero. To fix this, use temporary variables to store interrupt
state and shuffle it between the thread's MD area and the appropriate
registers.
In cooperation with: bde
MFC after: 1 month
need locking as otherwise we may race against the other parts of the
MD code which expects a consistent state of these. While at it move
the resetting of the pmap before entering it in the TSB.
- Spell a 0 as TLB_CTX_KERNEL.
introduce function pointers once set up to the respective implementation
for reading the (S)TICK and writing the (S)STICK_COMPARE registers as a
compromise between duplicating code and selecting between different
implementations during execution over and over again, similar to what is
done elsewhere in the MD in order to support different CPU models that
won't ever change at runtime.
- In the remaining tick interrupt handler further push down disabling of
interrupts to the periodic case as it isn't necessary here in one-shot
mode at all.
a critical section as apparently required by both. I don't think either
belongs in the event timer front-ends but the callback should handle
this as necessary instead just like for example intr_event_handle()
does but this is how the other architectures currently handle it, either
explicitly or implicitly.
- Further rename and reword references to hardclock as this front-end no
longer has a notion of actually calling it.
drift in order to achieve a more stable clock as the tick intervals may
vary in the first place. In fact I haven't seen this code kick in when
in oneshot-mode so just skip it in that case.
- There's no need to explicitly stop the (S)TICK counter in oneshot-mode
with every tick as it just won't trigger again with the (S)TICK compare
register set to a value in the past (with a wrap-around once every ~195
years of uptime at 1.5 GHz this isn't something we have to worry about
in practice).
- Given that we'll disable interrupts completely anyway there's no
need to enter critical sections.
what is done on other platforms. Unlike as with the sched_throw(NULL)
called on BSPs during their startup apparently there's nothing which will
reliably lower it on APs. I'm unsure why this only came up on V215 though,
breaking these with r207248. My best guess is that these are the only
supported ones so far fast enough to loose some race.
PR: 151404
MFC after: 3 days
to the expected type so they work like the corresponding __bswapN_var()
functions and the compiler doesn't complain when arguments of different
width are passed.
The check for alignment should be made against the physical address and not
the virtual address that maps it.
Sponsored by: NetApp
Submitted by: Will McGovern (will at netapp dot com)
Reviewed by: mjacob, jhb
the location, apply elf_relocaddr to the symbol value to have right
values for the symbols from dpcpu segment.
PR: kern/147769
Discussed with: avg
Tested by: marius
MFC after: 2 weeks
additionally takes advantage of the prefetch cache of these CPUs.
Unlike the uncommitted US-III version, which provide no measurable
speedup or even resulted in a slight slowdown on certain CPUs models
compared to using the US-I version with these, the SPARC64 version
actually results in a slight improvement.
- make dflt_lock() always panic,
- add kludge to use contigmalloc() when the alignment is larger than the size
and print a diagnostic when we didn't satisfy the alignment.
unlikely that support for these ever will be implemented on sparc64 as
the IOMMUs are able to translate to up to the maximum physical address
supported by the respective machine, bypassing the IOMMU is affected
by hardware errata and being able to support DMA engines which cannot
do at least 32-bit DMA does not justify the costs.
- The page zeroing in uma_small_alloc() may use the VIS-based block zero
function so take advantage of it.
of small maxsize and "large" (including BUS_SPACE_UNRESTRICTED) nsegments
parameters. Generally using a presz of 0 (which indeed might indicate the
use of bogus parameters for DMA tag creation) is not fatal, it just means
that no additional DVMA space will be preallocated.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.
This reflects actual type used to store and compare child device orders.
Change is mostly done via a Coccinelle (soon to be devel/coccinelle)
semantic patch.
Verified by LINT+modules kernel builds.
Followup to: r212213
MFC after: 10 days
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.
Tested by: marius (sparc64)
MFC after: 1 month
td_critnest > 1 when not already running on the desired CPU read the
TICK counter of the BSP via a direct cross trap request in that case
instead.
- Treat the STICK based timecounter the same way as the TICK based one
regarding its quality and obtaining the counter value from the BSP.
Like the TICK timers the STICK ones also are only synchronized during
their startup (which might not result in good synchronicity in the
first place) but not afterwards and might drift over time, causing
problems when the time is read from different CPUs (see r135972).
to single CPUs more efficiently with Cheetah(-class) and Jalapeno CPUs.
Besides being used to implement the ipi_cpu() introduced in r210939,
cpu_ipi_single() will also be used internally by the sparc64 MD code.
- Factor out the Jalapeno support from the Cheetah IPI send functions
in order to be able to more easily and efficiently implement support
for more than 32 target CPUs as well as a workaround for Cheetah+
erratum 25 for the latter.
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.
Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
zones for each malloc bucket size. The purpose is to isolate
different malloc types into hash classes, so that any buffer overruns
or use-after-free will usually only affect memory from malloc types in
that hash class. This is purely a debugging tool; by varying the hash
function and tracking which hash class was corrupted, the intersection
of the hash classes from each instance will point to a single malloc
type that is being misused. At this point inspection or memguard(9)
can be used to catch the offending code.
Add MALLOC_DEBUG_MAXZONES=8 to -current GENERIC configuration files.
The suggestion to have this on by default came from Kostik Belousov on
-arch.
This code is based on work by Ron Steinke at Isilon Systems.
Reviewed by: -arch (mostly silence)
Reviewed by: zml
Approved by: zml (mentor)
now it uses a very dumb first-touch allocation policy. This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
a CPU belongs to. Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
(VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
The MD code is required to populate an array of mem_affinity structures.
Each entry in the array defines a range of memory (start and end) and a
domain for the range. Multiple entries may be present for a single
domain. The list is terminated by an entry where all fields are zero.
This array of structures is used to split up phys_avail[] regions that
fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
used when fulfulling a physical memory allocation. Right now the
per-domain freelists are listed in a round-robin order for each domain.
In the future a table such as the ACPI SLIT table may be used to order
the per-domain lookup lists based on the penalty for each memory domain
relative to a specific domain. The lookup lists may be examined via a
new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
pick a lookup list when allocating memory.
Reviewed by: alc
between determining the other CPUs and calling cpu_ipi_selected(), which
apart from generally doing the wrong thing can lead to a panic when a
CPU is told to IPI itself (which sun4u doesn't support).
Reported and tested by: Nathaniel W Filardo
- Add __unused where appropriate.
MFC after: 3 days