Issues were noted by Bruce Evans and are present on all architectures.
On i386, a counter fetch should use atomic read of 64bit value,
otherwise carry from the increment on other CPU could be lost for the
given fetch, making error of 2^32. If 64bit read (cmpxchg8b) is not
available on the machine, it cannot be SMP and it is enough to disable
preemption around read to avoid the split read.
On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to override just
zeroed per-cpu slot. The effect would be a counter going off by
arbitrary value after zeroing. Perform the counter zeroing on the
same processor which does the increments, making the operations
mutually exclusive. On i386, same as for the fetching, if the
cmpxchg8b is not available, machine is not SMP and we disable
preemption for zeroing.
PowerPC64 is treated the same as amd64.
For other architectures, the changes made to allow the compilation to
succeed, without fixing the issues with zeroing or fetching. It
should be possible to handle them by using the 64bit loads and stores
atomic WRT preemption (assuming the architectures also converted from
using critical sections to proper asm). If architecture does not
provide the facility, using global (spin) mutex would be non-optimal
but working solution.
Noted by: bde
Sponsored by: The FreeBSD Foundation
The atomic_load() and atomic_store() macros conflict with the equally
named macros from <stdatomic.h>. Remove them, as they are only used to
implement functions that are not present on any of the other
architectures.
o Relax locking assertions for pmap_enter_object() and add them also
to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
mostl of the times only with readlocks.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
order to match the MAXCPU concept. The change should also be useful
for consolidation and consistency.
Sponsored by: EMC / Isilon storage division
Obtained from: jeff
Reviewed by: alc
and kern.cam.ctl.disable tunable; those were introduced as a workaround
to make it possible to boot GENERIC on low memory machines.
With ctl(4) being built as a module and automatically loaded by ctladm(8),
this makes CTL work out of the box.
Reviewed by: ken
Sponsored by: FreeBSD Foundation
Introduce counter(9) API, that implements fast and raceless counters,
provided (but not limited to) for gathering of statistical data.
See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.
In collaboration with: kib
Reviewed by: luigi
Tested by: ae, ray
Sponsored by: Nginx, Inc.
most kernels before FreeBSD 9.0. Remove such modules and respective kernel
options: atadisk, ataraid, atapicd, atapifd, atapist, atapicam. Remove the
atacontrol utility and some man pages. Remove useless now options ATA_CAM.
No objections: current@, stable@
MFC after: never
uart(4) allocates send and receiver buffers in attach() before it calls
the low-level driver's attach routine. Many low-level drivers set the
fifo sizes in their attach routine, which is too late. Other drivers set
them in the probe() routine, so that they're available when uart(4)
allocates buffers. This fixes the ones that were setting the values too
late by moving the code to probe().
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.
By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
pages around, taking array of vm_page_t both for source and
destination. Starting offsets and total transfer size are specified.
The function implements optimal algorithm for copying using the
platform-specific optimizations. For instance, on the architectures
were the direct map is available, no transient mappings are created,
for i386 the per-cpu ephemeral page frame is used. The code was
typically borrowed from the pmap_copy_page() for the same
architecture.
Only i386/amd64, powerpc aim and arm/arm-v6 implementations were
tested at the time of commit. High-level code, not committed yet to
the tree, ensures that the use of the function is only allowed after
explicit enablement.
For sparc64, the existing code has known issues and a stab is added
instead, to allow the kernel linking.
Sponsored by: The FreeBSD Foundation
Tested by: pho (i386, amd64), scottl (amd64), ian (arm and arm-v6)
MFC after: 2 weeks
interrupt shutdown handlers into filters. Shutdown_nice(9) acquires a sleep
lock, which filters shouldn't do. It also seems that kern_reboot(9) still
may require Giant to be hold.
- Correct an incorrect argument to shutdown_nice(9).
Submitted by: bde
interrupt shutdown handlers into filters. Shutdown_nice(9) acquires a sleep
lock, which filters shouldn't do. It also seems that kern_reboot(9) still
may require Giant to be hold.
Submitted by: bde
an interrupt filter (some other drivers in the tree do the same). So
change the overtemperature and power fail interrupts from handlers in order
to code and get rid of a !INTR_MPSAFE handlers.
- Mark unused parameters as such.
- Use NULL instead of 0 for pointers.
MFC after: 1 week
fail interrupt handler, there seems to be either a broken batch of them
or a tendency to develop a defect which causes this interrupt to fire
inadvertedly. Given that apart from this problem these machines work
just fine, add a tunable allowing the setup of the power fail interrupt
to be disabled.
While at it, remove the DEBUGGER_ON_POWERFAIL compile time option and
make that behavior also selectable via the newly added tunable.
- Apparently, it's no longer a problem to call shutdown_nice(9) from within
an interrupt filter (some other drivers in the tree do the same). So
change the power fail interrupt from an handler in order to simplify the
code and get rid of a !INTR_MPSAFE handler.
- Use NULL instead of 0 for pointers.
MFC after: 1 week
- Use NULL instead of 0 for pointers.
- Let ofw_pcib_probe() return BUS_PROBE_DEFAULT instead of 0 so specialized
PCI-PCI-bridge drivers may attach instead.
- Add WARs for PLX Technology PEX 8114 bridges and PEX 8532 switches.
Ideally, these should live in MI code but at least for the latter we're
missing the necessary infrastructure there.
MFC after: 1 week
Switch eventtimers(9) from using struct bintime to sbintime_t.
Even before this not a single driver really supported full dynamic range of
struct bintime even in theory, not speaking about practical inexpediency.
This change legitimates the status quo and cleans up the code.
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
files require including directly rwlock.h
every architecture's busdma_machdep.c. It is done by unifying the
bus_dmamap_load_buffer() routines so that they may be called from MI
code. The MD busdma is then given a chance to do any final processing
in the complete() callback.
The cam changes unify the bus_dmamap_load* handling in cam drivers.
The arm and mips implementations are updated to track virtual
addresses for sync(). Previously this was done in a type specific
way. Now it is done in a generic way by recording the list of
virtuals in the map.
Submitted by: jeff (sponsored by EMC/Isilon)
Reviewed by: kan (previous version), scottl,
mjacob (isp(4), no objections for target mode changes)
Discussed with: ian (arm changes)
Tested by: marius (sparc64), mips (jmallet), isci(4) on x86 (jharris),
amd64 (Fabian Keil <freebsd-listen@fabiankeil.de>)
reading registers from other CPUs. As it turns out, the hardware doesn't
really like concurrent IPI'ing causing adverse effects. Also the thought
deadlock when using this spin lock here and the targeted CPU(s) are also
holding or in case of nested locks can't actually happen. This is due to
the fact that on sparc64, spinlock_enter() only raises the PIL but doesn't
disable interrupts completely. Thus direct cross calls as used for the
register reading (and all other MD IPI needs) still will be executed by
the targeted CPU(s) in that case.
MFC after: 3 days
- Implement a function to ensure that all preempted threads have switched
back out at least once. Use this to make sure there are no stale
references to the old ktr_buf or the lock profiling buffers before
updating them.
Reviewed by: marius (sparc64 parts), attilio (earlier patch)
Sponsored by: EMC / Isilon Storage Division
sleep, and perform the page allocations with VM_ALLOC_SYSTEM
class. Previously, the allocation was also allowed to completely drain
the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT
request class for vm_page_alloc() and similar functions.
Allow the caller of malloc* to request the 'deep drain' semantic by
providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT
class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM
allocation class.
Centralize the translation of the M_* malloc(9) flags in the single
inline function malloc2vm_flags().
Discussion started by: "Sears, Steven" <Steven.Sears@netapp.com>
Reviewed by: alc, mdf (previous version)
Tested by: pho (previous version)
MFC after: 2 weeks
them, please let me know if not). Most of these are of the form:
static const struct bzzt_type {
[...list of members...]
} const bzzt_devs[] = {
[...list of initializers...]
};
The second const is unnecessary, as arrays cannot be modified anyway,
and if the elements are const, the whole thing is const automatically
(e.g. it is placed in .rodata).
I have verified this does not change the binary output of a full kernel
build (except for build timestamps embedded in the object files).
Reviewed by: yongari, marius
MFC after: 1 week
The reason for this is that the SPARC v9 architecture allows nested
interrupts of higher priority/level than that of the current interrupt
to occur (and we can't just entirely bypass this model, also, at least
for tick interrupts, this also wouldn't be wise). However, when a
preemption interrupt interrupts another interrupt of lower priority,
f.e. PIL_ITHREAD, and that one in turn is nested by a third interrupt,
f.e. PIL_TICK, with SCHED_ULE the execution of interrupts higher than
PIL_PREEMPT may be migrated to another CPU. In particular, tl1_ret(),
which is responsible for restoring the state of the CPU prior to entry
to the interrupt based on the (also migrated) trap frame, then is run
on a CPU which actually didn't receive the interrupt in question,
causing an inappropriate processor interrupt level to be "restored".
In turn, this causes interrupts of the first level, i.e. PIL_ITHREAD
in the above scenario, to be blocked on the target of the migration
until the correct PIL happens to be restored again on that CPU again.
Making PIL_PREEMPT the lowest real priority, this effectively prevents
this scenario from happening, as preemption interrupts no longer can
interrupt any other interrupt besides stray ones (which is no issue).
Thanks to attilio@ and especially mav@ for helping me to understand
this problem at the 201208DevSummit.
- Give PIL_STOP (which is also used for IPI_STOP_HARD, given that there's
no real equivalent to NMIs on SPARC v9) the highest possible priority
just below the hardwired PIL_TICK, so it has a chance to interrupt
more things.
MFC after: 1 week
when running tick_process(), similarly to what the x86 equivalents of
this function do, however employing the less racy sequence also used in
intr_event_handle().
MFC after: 3 days
instruction loads/stores at its will.
The macro __compiler_membar() is currently supported for both gcc and
clang, but kernel compilation will fail otherwise.
Reviewed by: bde, kib
Discussed with: dim, theraven
MFC after: 2 weeks