the cpufreq code. Replace its use with smp_started. There's at least
one userland tool that still looks at the kern.smp.active sysctl, so
preserve it but point it to smp_started as well.
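A minimal sketch of how the compatibility sysctl could be wired to
smp_started (the exact declaration in the tree may differ):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/smp.h>
#include <sys/sysctl.h>

/*
 * Sketch: keep kern.smp.active for old userland tools, but have it
 * report smp_started instead of a separate "active" variable.
 */
SYSCTL_INT(_kern_smp, OID_AUTO, active, CTLFLAG_RD,
    &smp_started, 0, "Indicates system is running in SMP mode");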
Discussed with: peter, jhb
MFC after: 3 days
Obtained from: Netflix
Prototypes specific to ia64 have been left in this file for now, under
__ia64__, rather than moving them to a new header under sys/ia64.
I anticipate that (some of) the corresponding functions will be shared
by the amd64, arm64, i386, and ia64 architectures, and we can adjust
this as EFI support on architectures other than ia64 continues to develop.
Sponsored by: The FreeBSD Foundation
To reduce the diff, the struct pcpu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version in case some out-of-tree module/code relies on
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
1. move unmapped_buf_allowed to machdep.c.
2. map both pmap_mapbios() and pmap_mapdev() to pmap_mapdev_attr()
and have the actual work done by pmap_mapdev_priv() (renamed from
pmap_mapdev()). Use pmap_mapdev_priv() to map the I/O port space
because we can't use CTR() that early.
3. add pmap_pinit_common() that's used by both pmap_pinit0() and
pmap_pinit(). Previously pmap_pinit0() would call pmap_pinit(),
but that would create 2 KTR events. While here, use pmap_t instead
of "struct pmap *".
4. fix pmap_kenter() to use vm_paddr_t as the type for the physical address.
5. various white-space adjustments for consistency.
6. use C99 and KNF for function definitions where appropriate.
7. slightly re-order prototypes and defines in <machine/pmap.h>
No functional change (other than the creation of KTR_PMAP events).
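A sketch of the layering from items 2 and 3 (signatures follow the usual
pmap_mapdev_attr() convention; the memory-attribute choices here are
illustrative, not necessarily what the ia64 code picks):

void	*pmap_mapdev_priv(vm_paddr_t, vm_size_t, vm_memattr_t); /* does the work */

void *
pmap_mapdev_attr(vm_paddr_t pa, vm_size_t sz, vm_memattr_t attr)
{
	void *va;

	va = pmap_mapdev_priv(pa, sz, attr);
	CTR4(KTR_PMAP, "%s(%#lx, %#lx) = %p", __func__, pa, sz, va);
	return (va);
}

void *
pmap_mapdev(vm_paddr_t pa, vm_size_t sz)
{
	return (pmap_mapdev_attr(pa, sz, VM_MEMATTR_UNCACHEABLE));
}

void *
pmap_mapbios(vm_paddr_t pa, vm_size_t sz)
{
	return (pmap_mapdev_attr(pa, sz, VM_MEMATTR_WRITE_BACK));
}

Early boot callers (such as the I/O port space mapping) call
pmap_mapdev_priv() directly and skip the CTR event.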
1. Name the kernel option XTRACE instead of EXCEPTION_TRACING
2. Put support functions in ia64/ia64/xtrace.c
3. Make it work with SMP by giving each CPU its own buffer
4. Save 16 key registers in the buffer for every exception
5. In ia64_handle_intr() and trap(), transfer the trace record
to the KTR trace buffer using CTRx(), with some basic
information for now
6. Use a tunable to enable tracing and stop tracing as soon as
we enter the debugger
Room for improvements:
1. Transferring exception-relevant information to KTR
2. Add a sysctl to enable/disable tracing
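A schematic sketch of the shape of this (the xtrace_* names and the
tunable are illustrative, not the actual ia64/ia64/xtrace.c code):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/systm.h>

#define	XTRACE_NREGS	16			/* key registers per record */

struct xtrace_record {
	uint64_t	xr_regs[XTRACE_NREGS];
};

static struct xtrace_record *xtrace_buf[MAXCPU];	/* one buffer per CPU */
static int xtrace_enabled;				/* set from a tunable */

static void
xtrace_init(void *dummy __unused)
{
	TUNABLE_INT_FETCH("debug.xtrace.enable", &xtrace_enabled);
	/* per-CPU buffers would be allocated here */
}
SYSINIT(xtrace, SI_SUB_CPU, SI_ORDER_ANY, xtrace_init, NULL);

void
xtrace_stop(void)
{
	/* Called on entry into the debugger: stop collecting records. */
	xtrace_enabled = 0;
}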
don't do it also in ia64_handle_intr(). With ia64_handle_intr()
not saving and setting td_intr_frame, make sure to do it for
timer interrupts in ia64_ih_hardclock().
don't do it also in ia64_handle_intr(). With ia64_handle_intr()
not saving and setting td_intr_frame, make sure to do it for
clock interrupts in ia64_ih_clock().
memory attributes. The same applies to the mmap(2) interface. Not
doing so results in machine checks.
We find the memory attributes in the EFI memory map, as queried by
mem_phys2virt().
fence. Under system load, the CPU has been found to change the order
in which the stores are made visible. When the tag is made visible
before the other TLB values, other CPUs may use the invalid TLB values
and do bad things.
While here (i.e. not a fix) don't return errors from pmap_remove_vhpt()
to callers of pmap_remove_pte(). Those callers don't check the return
value and as such don't do what is needed to keep a consistent state.
More importantly, pmap_remove_vhpt() can't really have an error without
it indicating something unintended. Using KASSERT is therefore better.
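A sketch of the store ordering the fence enforces (the field names are
illustrative of the VHPT entry layout):

	/* Install the new translation in the VHPT entry. */
	vhpte->pte = pte->pte;
	vhpte->itir = pte->itir;
	ia64_mf();		/* make the words above visible first */
	vhpte->tag = tag;	/* only now can another CPU match the entry */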
PR: 182999, 183227
ia64 for the very first time. Only 9 years in the making...
Note that the vt/vga driver does not actually make sure there's
VGA hardware at the standard/legacy VGA I/O port and memory I/O
addresses. This can cause machine checks if the H/W does not have
a VGA controller.
sure to clear the lower 12 bits. We're adding the translation
attributes to the physical address and non-zero bits in the first
12 bits would give us something unexpected, including invalid bit
values. Those trigger nested general protection faults.
We do not have to clear the region bits, because they are ignored
anyway, so we can replace an existing dep instruction with the one
we need.
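In C terms the fix amounts to masking before the attribute bits are
merged in (the actual change replaces a dep instruction in the nested
TLB fault handler; attr_bits is a stand-in name):

	pa &= ~0xfffUL;		/* clear the low 12 bits; the region bits
				   are ignored by the hardware anyway */
	pte = pa | attr_bits;	/* translation attributes land in a clean field */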
This fixes GP faults for the swapper thread, as it's the only thread
that has a direct-mapped stack. Since the bug is in the nested TLB
fault handler, the frequency of hitting the GP is in the order of
hours/days under load.
The operation was documented and partially implemented (both from a
type and an architecture perspective) on 2013-08-21, and it got used in
ZFS with revision 260150 (zfeature.c). Since ZFS is supported on
ia64, the lack of atomic_swap became a problem.
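A sketch of one way the 64-bit variant can be provided on top of the
existing compare-and-set primitive (the real ia64 code may use the xchg
instruction directly):

static __inline uint64_t
atomic_swap_64(volatile uint64_t *p, uint64_t v)
{
	uint64_t old;

	/* Retry until v is installed over a value we actually observed;
	 * that observed value is what the caller gets back. */
	do {
		old = *p;
	} while (!atomic_cmpset_64(p, old, v));
	return (old);
}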
change virt_size(), virt_dumphdrs() and virt_dumpdata() into callback
functions of virt_foreach().
In virt_foreach() we iterate over all the virtual memory regions that we
want in the minidump. For now, just start with the PBVM (= kernel text
and data plus preloaded modules). The core file this produces can already
be used to work out the libkvm changes that need to be made to support it.
In parallel, we can flesh out the in-kernel bits to dump more of what we
need in a minidump without changing the core file structure.
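A schematic sketch of the callback arrangement (the callback type and
the PBVM symbols are illustrative):

typedef int (*minidump_cb_t)(vm_offset_t va, vm_size_t len, void *arg);

static int
virt_foreach(minidump_cb_t cb, void *arg)
{
	/* For now only the PBVM (kernel text and data plus preloaded
	 * modules) is walked; more region types can be added later
	 * without changing the core file structure. */
	return (cb(pbvm_base, pbvm_size, arg));
}

/* virt_size(), virt_dumphdrs() and virt_dumpdata() each have this
 * callback signature and are passed to virt_foreach() in turn. */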
except the chunks aren't physical memory regions but virtual memory
regions. In both cases, the core file is an ELF file and flags in
the header allow libkvm to distinguish one from the other.
requires process descriptors to work and having PROCDESC in GENERIC
does not seem to be enough, especially since we hope to have more and
more consumers in the base system.
MFC after: 3 days
words, every architecture is now auto-sizing the kmem arena. This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.
Replace or eliminate all existing definitions of VM_KMEM_SIZE. With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for
clarity.
Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size. Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.
Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena. Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.
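A condensed sketch of the resulting order of operations in kmeminit()
(clamping against VM_KMEM_SIZE_{MIN,MAX} and physical memory is elided;
mem_size stands for the auto-sizing input):

	/* VM_KMEM_SIZE_SCALE is mandatory; the tunable can still override it. */
	vm_kmem_size_scale = VM_KMEM_SIZE_SCALE;
	TUNABLE_INT_FETCH("vm.kmem_size_scale", &vm_kmem_size_scale);

	/* Auto-size the arena from physical memory ... */
	vm_kmem_size = (mem_size / vm_kmem_size_scale) * PAGE_SIZE;

	/* ... then let VM_KMEM_SIZE behave like setting vm.kmem_size:
	 * it is no longer overridden by the auto-sizing above. */
#if defined(VM_KMEM_SIZE)
	vm_kmem_size = VM_KMEM_SIZE;
#endif
	TUNABLE_ULONG_FETCH("vm.kmem_size", &vm_kmem_size);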
Reviewed by: kib (an earlier version)
Sponsored by: EMC / Isilon Storage Division
outer and inner loop as 32-bit integers mux'd in 64-bit return
values. Change our data types for the count and stride to match
and simplify/adjust accordingly.
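A sketch of the unpacking this implies (PAL_PTCE_INFO returns the counts
in one 64-bit word and the strides in another; which half holds which
loop is per the PAL documentation and only illustrated here):

	struct ia64_pal_result res;
	uint32_t count0, count1, stride0, stride1;

	res = ia64_call_pal_static(PAL_PTCE_INFO, 0, 0, 0);
	count0  = res.pal_result[1] >> 32;		/* outer loop count  */
	count1  = res.pal_result[1] & 0xffffffffu;	/* inner loop count  */
	stride0 = res.pal_result[2] >> 32;		/* outer loop stride */
	stride1 = res.pal_result[2] & 0xffffffffu;	/* inner loop stride */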
Note that with this change the defaults of the ptc.e parameters
have changed. Since all hardware is supposed to support the PAL
call, there should be no impact. Even ski is unaffected, because
the TC is re-initialized without considering the virtual address.
So, as long as we call ptc.e at least once, we're good. That's
what the new defaults do.
Most processor implementations need but a single ptc.e to purge
the entire TC anyway...
end, make pmap_invalidate_all() global and have it only handle the
local CPU -- i.e. no rendezvous. We do not use pmap_invalidate_all
other than during initialization.
Note that the BSP already purges its TC -- it was missing for APs
only. Nonetheless, this so far seems to eliminate random problems.
ia64_probe_sapics(), we also create PCPU structures for any Local SAPICs
we encounter. With SMP disabled, this leaves us with partially
initialized PCPU structures, which typically results in panics when we
iterate over CPUs. We now prevent the creation of these PCPU structures
when SMP is disabled.
vm_pages. Provide a trivial implementation which forwards the load to
_bus_dmamap_load_phys() page by page. Right now all architectures use
bus_dmamap_load_ma_triv().
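A sketch of the trivial forwarder (error handling for partially loaded
maps is simplified):

int
bus_dmamap_load_ma_triv(bus_dma_tag_t dmat, bus_dmamap_t map,
    struct vm_page **ma, bus_size_t tlen, int ma_offs, int flags,
    bus_dma_segment_t *segs, int *segp)
{
	vm_paddr_t paddr;
	bus_size_t len;
	int error, i;

	error = 0;
	for (i = 0; tlen > 0; i++, tlen -= len) {
		/* Load at most one page at a time. */
		len = MIN(PAGE_SIZE - ma_offs, tlen);
		paddr = VM_PAGE_TO_PHYS(ma[i]) + ma_offs;
		error = _bus_dmamap_load_phys(dmat, map, paddr, len,
		    flags, segs, segp);
		if (error != 0)
			break;
		ma_offs = 0;
	}
	return (error);
}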
Tested by: pho (as part of the functional patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
pmap_clear_reference() has had exactly one caller in the kernel for
several years, more precisely, since FreeBSD 8. Now, that call no
longer exists.
Approved by: re (kib)
Sponsored by: EMC / Isilon Storage Division
an address in the first 2GB of the process's address space. This flag should
have the same semantics as the same flag on Linux.
To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address. While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.
Reviewed by: alc
Approved by: re (kib)
sf_buf_alloc()/sf_buf_free() inlines, to save two calls to absolutely
empty functions.
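On an architecture with a direct map the inlines can be trivial; a
sketch (letting the page pointer itself stand in for the sf_buf is the
usual trick in that case):

static inline struct sf_buf *
sf_buf_alloc(struct vm_page *m, int flags)
{
	/* No KVA has to be allocated with a direct map. */
	return ((struct sf_buf *)m);
}

static inline void
sf_buf_free(struct sf_buf *sf)
{
	/* Nothing was allocated, so nothing to free. */
}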
Reviewed by: alc, kib, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE. Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference(). Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2). A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).
Since our malloc(3) makes heavy use of madvise(2), this change can have a
measurable impact. For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.
Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures. I will commit implementations for the
remaining architectures after further testing. For now, a stub function is
sufficient because of the advisory nature of pmap_advise().
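A sketch of such a stub:

void
pmap_advise(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, int advice)
{
	/* Advisory only: doing nothing is always correct, merely less
	 * efficient than clearing the referenced/dirty state here. */
}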
Discussed with: jeff, jhb, kib
Tested by: pho (i386), marcel (ia64)
Sponsored by: EMC / Isilon Storage Division
which is part of struct vmspace, allocated from a UMA_ZONE_NOFREE
zone. Initialize the pmap lock in the vmspace zone init function, and
remove pmap lock initialization and destruction from pmap_pinit() and
pmap_release().
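A sketch of the zone-init hook (vmspace_zinit() is where the vmspace
zone already performs its one-time setup):

static int
vmspace_zinit(void *mem, int size, int flags)
{
	struct vmspace *vm;

	vm = (struct vmspace *)mem;
	/* The pmap lives inside the NOFREE vmspace, so its lock can be
	 * set up once here instead of in pmap_pinit()/pmap_release(). */
	PMAP_LOCK_INIT(vmspace_pmap(vm));
	return (0);
}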
Suggested and reviewed by: alc (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
used by tools in the base system, and with sandboxing more and more tools
the usage should only increase.
Submitted by: Mariusz Zaborski <oshogbo@FreeBSD.org>
Sponsored by: Google Summer of Code 2013
MFC after: 1 month
Unify the two concepts into a real, minimal sxlock where shared
acquisition represents the soft busy and exclusive acquisition
represents the hard busy.
The old VPO_WANTED mechanism becomes the hard path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becomes an interlock for this functionality:
it can be held in either read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumptions about the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared-busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag.
The KPI is heavily changed; this is why the version is bumped.
It is very likely that some users of the VM KPI will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl
- update powerpc/GENERIC64 as well, suggested by mdf
- update comments so that they make sense after the change, suggested by
jhb
X-MFC after: never (change specific to head)
KDB_TRACE is not an alternative to DDB/etc.; they are complementary.
So I do not see any reason not to enable KDB_TRACE by default.
X-MFC after: never (change specific to head)
transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
* Make Yarrow an optional kernel component -- enabled by "YARROW_RNG" option.
The files sha2.c, hash.c, randomdev_soft.c and yarrow.c comprise Yarrow.
* random(4) device doesn't really depend on rijndael-*. Yarrow, however, does.
* Add random_adaptors.[ch] which is basically a store of random_adaptor's.
random_adaptor is basically an adapter that plugs in to random(4).
random_adaptor can only be plugged in to random(4) very early in bootup.
Unplugging random_adaptor from random(4) is not supported, and is probably a
bad idea anyway, due to potential loss of entropy pools.
We currently have 3 random_adaptors:
+ yarrow
+ rdrand (ivy.c)
+ nehemiah
* Remove platform dependent logic from probe.c, and move it into
corresponding registration routines of each random_adaptor provider.
probe.c doesn't do anything other than picking a specific random_adaptor
from a list of registered ones.
* If the kernel doesn't have any random_adaptor adapters present, then the
creation of /dev/random is postponed until the next random_adaptor is kldload'ed.
* Fix randomdev_soft.c to refer to its own random_adaptor, instead of a
system wide one.
Submitted by: arthurmesh@gmail.com, obrien
Obtained from: Juniper Networks
Reviewed by: obrien
bus number into the bus argument. The bus number occupies the least
significant 8 bits. The PCI domain occupies the most significant 24
bits.
On the Altix 350, the PCI domain is a required parameter, but
changing the prototype of the pci_cfgreg*() functions to include a
separate domain argument has wide-spread consequences across the
supported architectures. We'd be changing a known interface.
Multiplexing is an acceptable kluge to give us what we need with
manageable impact. Note that the PCI bus number fits in 8 bits,
so the multiplexing of the domain is a backward compatible change.
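A sketch of the encoding (the macro names are illustrative):

#define	PCI_BUSMAX		255
#define	PCI_MUXBUS(domain, bus)	(((domain) << 8) | ((bus) & PCI_BUSMAX))
#define	PCI_MUX_DOMAIN(bus)	((bus) >> 8)		/* upper 24 bits */
#define	PCI_MUX_BUS(bus)	((bus) & PCI_BUSMAX)	/* lower 8 bits */

An old-style caller that passes a plain bus number below 256 decodes to
domain 0, which is what makes the multiplexing backward compatible.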
fall within the first 256MB of memory. The origin/reason for that
limitation is not known, but it's not believed to be required for
proper initialization. What is known is that the Altix 350 does not
have physical memory at that address (by virtue of the address space
bits).
Keep the boundary at 256MB so that the info block will be covered
by a single direct-mapped translation.
While here, change the flags to M_NOWAIT to eliminate confusion. It
does not change the behaviour of contigmalloc(). What it does is
make the flags argument explicitly say what the actual behaviour
is.
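A sketch of the resulting call (size, malloc type and variable names are
placeholders; the argument order is that of contigmalloc(9)):

	ib = contigmalloc(sizeof(*ib), M_DEVBUF, M_NOWAIT,
	    0ul,			/* low */
	    ~0ul,			/* high: no longer limited to the first 256MB */
	    PAGE_SIZE,			/* alignment */
	    256UL * 1024 * 1024);	/* boundary: one direct-mapped translation */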
memory descriptor, don't return NULL as the virtual address; instead return
the direct-mapped uncacheable virtual address for it. At first, this was
needed only for the Altix 350, but now even some high-end HP machines
have devices mapped to physical addresses that aren't covered by the
EFI memory map.
Issues were noted by Bruce Evans and are present on all architectures.
On i386, a counter fetch should use an atomic read of the 64-bit value,
otherwise a carry from an increment on another CPU could be lost for the
given fetch, producing an error of 2^32. If a 64-bit read (cmpxchg8b) is
not available on the machine, it cannot be SMP and it is enough to disable
preemption around the read to avoid the split read.
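A sketch of the i386 fetch rule described above (the function name is
illustrative and atomic_load_64_i386() is a stand-in for a
cmpxchg8b-based 64-bit load; the real code uses inline assembly):

static inline uint64_t
counter_fetch_one(uint64_t *p)
{
	uint64_t v;

	if ((cpu_feature & CPUID_CX8) == 0) {
		/* No cmpxchg8b: the machine cannot be SMP, so disabling
		 * preemption is enough to avoid a split (torn) read. */
		critical_enter();
		v = *p;
		critical_exit();
	} else {
		v = atomic_load_64_i386(p);	/* hypothetical helper */
	}
	return (v);
}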
On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to overwrite a just-zeroed
per-CPU slot. The effect would be a counter going off by an
arbitrary value after zeroing. Perform the counter zeroing on the
same processor which does the increments, making the operations
mutually exclusive. On i386, as for the fetching, if cmpxchg8b is not
available, the machine is not SMP and we disable preemption for zeroing.
PowerPC64 is treated the same as amd64.
For other architectures, changes were made to allow the compilation to
succeed, without fixing the issues with zeroing or fetching. It
should be possible to handle them by using 64-bit loads and stores that
are atomic WRT preemption (assuming the architectures are also converted
from using critical sections to proper asm). If an architecture does not
provide such a facility, using a global (spin) mutex would be a
non-optimal but working solution.
Noted by: bde
Sponsored by: The FreeBSD Foundation