Commit Graph

1943 Commits

Author SHA1 Message Date
kib
7c26a038f9 Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer.  For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
kib
63efc821c3 Add pmap function pmap_copy_pages(), which copies the content of the
pages around, taking array of vm_page_t both for source and
destination.  Starting offsets and total transfer size are specified.

The function implements optimal algorithm for copying using the
platform-specific optimizations.  For instance, on the architectures
were the direct map is available, no transient mappings are created,
for i386 the per-cpu ephemeral page frame is used.  The code was
typically borrowed from the pmap_copy_page() for the same
architecture.

Only i386/amd64, powerpc aim and arm/arm-v6 implementations were
tested at the time of commit. High-level code, not committed yet to
the tree, ensures that the use of the function is only allowed after
explicit enablement.

For sparc64, the existing code has known issues and a stab is added
instead, to allow the kernel linking.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho (i386, amd64), scottl (amd64), ian (arm and arm-v6)
MFC after:	2 weeks
2013-03-14 20:18:12 +00:00
attilio
e98f58faf6 MFC 2013-03-02 14:48:41 +00:00
mav
6cf7cc6e4d MFcalloutng:
Switch eventtimers(9) from using struct bintime to sbintime_t.
Even before this not a single driver really supported full dynamic range of
struct bintime even in theory, not speaking about practical inexpediency.
This change legitimates the status quo and cleans up the code.
2013-02-28 13:46:03 +00:00
davide
2bf12d0c7c MFcalloutng:
When CPU becomes idle, cpu_idleclock() calculates time to the next timer
event in order to reprogram hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more factor (quite precise) to extimate furter sleep time and choose
optimal sleep state. This is a preparatory change for further callout
improvements will be committed in the next days.

The commmit is not targeted for MFC.
2013-02-28 10:46:54 +00:00
attilio
756a9b3e47 MFC 2013-02-26 01:05:25 +00:00
marcel
e8ba976b7d kernacc() expects all KVAs to be covered in the kernel map. With the
introduction of the PBVM, this stopped being the case. Redefine the
VM parameters so that the PBVM is included in the kernel map. In
particular this introduces VM_INIT_KERNEL_ADDRESS to point to the base
of region 5 now that VM_MIN_KERNEL_ADDRESS points to the base of
region 4 to include the PBVM.
While here define KERNBASE to the actual link address of the kernel as
is intended.

PR:		169926
2013-02-25 02:41:38 +00:00
attilio
2ad8e10333 MFC 2013-02-24 17:20:53 +00:00
marcel
dc27ab27e9 Enable PREEMPTION by default now that PR 147501 has been fixed. 2013-02-23 19:27:53 +00:00
attilio
905e648d42 Hide the details for the assertion for VM_OBJECT_LOCK operations.
Rename current VM_OBJECT_LOCK_ASSERT(foo, RA_WLOCKED) into
VM_OBJECT_ASSERT_WLOCKED(foo)

Sponsored by:	EMC / Isilon storage division
Requested by:	alc
2013-02-21 21:54:53 +00:00
attilio
066bbc97b6 Fix other architectures and ZFS.
Sponsored by:	EMC / Isilon storage division
2013-02-21 15:02:36 +00:00
attilio
658534ed5a Switch vm_object lock to be a rwlock.
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
  get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
  files require including directly rwlock.h
2013-02-20 10:38:34 +00:00
marcel
78b3406ac0 Close a race relating to setting the PCPU pointer (r13). Register r13
points to the TLS in user space and points to the PCPU structure in
the kernel. The race is the result of having the exception handler on
the one hand and the RPC system call entry on the other. The EPC
syscall path is non-atomic in that interrupts are enabled while the
two stacks are switched. The register stack is switched last as that
is the stack used to determine whether we're going back to user space
by the exception handler. If we go back to user space, we restore r13,
otherwise we leave r13 alone. The EPC syscall path however set r13 to
the PCPU structure *before* switching the register stack, which means
that there was a window in which the exception handler would restore
r13 when it was already pointing to the PCPU structure. This is fatal
when the exception happened on CPU x, but left from the exception on
anotehr CPU. In that case r13 would point to the PCPU of the CPU the
thread was running on. This immediately results in getting the wrong
value for curthread.
The fix is to make sure we assign r13 *after* we set ar.bspstore to
point to the kernel register stack for the thread.
2013-02-17 00:51:34 +00:00
marcel
e81aa2332a Return EFAULT when the address is not a kernel virtual address. 2013-02-16 21:46:27 +00:00
marcel
6da471ce02 Eliminate the PC_CURTHREAD symbol and load the current thread's
thread structure pointer atomically from r13 (the pcpu pointer)
for the current CPU/core.
Add a CTASSERT in machdep.c to make sure that pc_curthread is in
fact the first field in struct pcpu.

The only non-atomic operations left were those related to process-
space operations, such as casuword, subyte, suword16, fubyte,
fuword16, copyin, copyout and their variations.

The casuword function has been re-structured more complete than
the others. This way we have an example of a better bundling
without introducing a lot of risk when we get it wrong. The
other functions can be rebundled in separate commits and with
the appropriate testing.
2013-02-12 17:38:35 +00:00
marcel
bd362fd0fc Eliminate padding by moving 'narg' next to 'code'. Both are 32-bit
entities in the syscall_args structure that otherwise has 64-bit
only fields.
2013-02-12 17:24:41 +00:00
kib
bd7f0fa0bb Reform the busdma API so that new types may be added without modifying
every architecture's busdma_machdep.c.  It is done by unifying the
bus_dmamap_load_buffer() routines so that they may be called from MI
code.  The MD busdma is then given a chance to do any final processing
in the complete() callback.

The cam changes unify the bus_dmamap_load* handling in cam drivers.

The arm and mips implementations are updated to track virtual
addresses for sync().  Previously this was done in a type specific
way.  Now it is done in a generic way by recording the list of
virtuals in the map.

Submitted by:	jeff (sponsored by EMC/Isilon)
Reviewed by:	kan (previous version), scottl,
	mjacob (isp(4), no objections for target mode changes)
Discussed with:	     ian (arm changes)
Tested by:	marius (sparc64), mips (jmallet), isci(4) on x86 (jharris),
	amd64 (Fabian Keil <freebsd-listen@fabiankeil.de>)
2013-02-12 16:57:20 +00:00
marcel
e0a463e76c Now that we actually use more memory descriptors, make sure to dump
them as well.
2013-02-12 16:51:43 +00:00
hrs
ef74211c05 Remove firewire devices missed in r244992. 2013-01-04 15:29:50 +00:00
kib
5a9188f8d3 Enable the UFS quotas for big-iron GENERIC kernels.
Discussed with:	      mckusick
MFC after:	      2 weeks
2013-01-03 19:03:41 +00:00
des
67e77c00a8 As discussed on -current last October, remove the firewire drivers from
GENERIC.
2013-01-03 14:30:24 +00:00
kib
e8ae50d444 Flip the semantic of M_NOWAIT to only require the allocation to not
sleep, and perform the page allocations with VM_ALLOC_SYSTEM
class. Previously, the allocation was also allowed to completely drain
the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT
request class for vm_page_alloc() and similar functions.

Allow the caller of malloc* to request the 'deep drain' semantic by
providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT
class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM
allocation class.

Centralize the translation of the M_* malloc(9) flags in the single
inline function malloc2vm_flags().

Discussion started by:	"Sears, Steven" <Steven.Sears@netapp.com>
Reviewed by:	alc, mdf (previous version)
Tested by:	pho (previous version)
MFC after:	2 weeks
2012-11-14 20:01:40 +00:00
attilio
f3501b109e Rework the known rwlock to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct rwlock_padalign.

Reviewed by:	alc, jimharris
2012-11-03 23:03:14 +00:00
kib
4d0fac96d4 Fix compilation on ia64 when page size is configured for 16KB.
Reviewed by:	alc, marcel
2012-10-28 11:53:54 +00:00
alc
d5d3fe3da5 Port the new PV entry allocator from amd64/i386. This allocator has two
advantages.  First, PV entries are roughly half the size.  Second, this
allocator doesn't access the paging queues, and thus it allows for the
removal of the page queues lock from this pmap.

Replace all uses of the page queues lock by a R/W lock that is private
to this pmap.

Tested by:	marcel
2012-10-26 03:02:39 +00:00
alc
55f6ff40ed Eliminate a stale comment. It describes another use case for the pmap in
Mach that doesn't exist in FreeBSD.
2012-09-28 05:30:59 +00:00
attilio
8dece93b14 userret() already checks for td_locks when INVARIANTS is enabled, so
there is no need to check if Giant is acquired after it.

Reviewed by:	kib
MFC after:	1 week
2012-09-08 18:27:11 +00:00
marcel
caa9902fff Use pmap_kextract(x) rather than pmap_extract(kernel_pmap, x). The
former knows about all the special mappings, like PBVM. The kernel
text and data are in the PBVM.
2012-08-18 23:28:34 +00:00
marcel
12f7f0563e Remove support for SKI: HP's Itanium simulator. It's pretty much not
used, serves very little value given that FreeBSD runs on real H/W
for a long time.
Note that SKI is open-source (see http://ski.sourceforge.net), so
if there's interest and value again, then this code can be revived.

Discussed with: jhb
2012-08-18 22:59:06 +00:00
jhb
8a9e433782 Add locking for sscdisk(4) and mark it MPSAFE. Since this driver just
makes calls out to the emulator, the locking is fairly simple.  A global
mutex protects the list of ssc disks, and each ssc disk has a mutex
to protect it's bioq.

Approved by:	marcel
2012-08-16 17:17:08 +00:00
kib
cac2fe116f After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed.  Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by:	alc
MFC after:    2 weeks
2012-08-05 14:11:42 +00:00
marcel
070f1b40de Move PCPU initialization to a new function called cpu_pcpu_setup().
This makes it easier to add additional CPU or platform information
to the per-CPU structure without duplicated code.
2012-07-08 18:00:22 +00:00
marcel
bfa21239bc Unleash the APs at SI_SUB_KICK_SCHEDULER so that we have them all
up and running to service interrupts. This is especially important
when the firmware has bound interrupts to CPUs, like for the SGI
Altix 350. We wake up APs at SI_SUB_CPU time and they sit and spin
until we unleash them, so there's nothing fundamentally different
from a MD perspective.
2012-07-08 17:43:25 +00:00
marcel
aac33e9ad6 Implement ia64_physmem_alloc() and use it consistently to get memory
before VM has been initialized. This includes:
1.  Replacing pmap_steal_memory(),
2.  Replace the handcrafted logic to allocate a naturally aligned VHPT,
3.  Properly allocate the DPCPU for the BSP.

Ad 3: Appending the DPCPU to kernend worked as long as we wouldn't
      cross into the next PBVM page. If we were to cross into the next
      page, then there wouldn't be a PTE entry on the page table for it
      and we would end up with a MCA following a page fault. As such,
      this commit fixes MCAs occasionally seen.
2012-07-07 05:17:43 +00:00
marcel
c55246a63a Hide the creation of phys_avail behind an API to make it easier to do it
correctly. We now iterate the EFI memory descriptors once and collect all
the information in a single pass. This includes:
1.  The I/O port base address,
2.  The PAL memory region. Have the physmem API track this.
3.  Memory descriptors of memory we can't use, like bad memory, runtime
    services code & data, etc. Have the physmem API track these.
4.  memory descriptors of memory we can use or re-use, such as free
    memory, boot time services code & data, loader code & data, etc.
    These are added by the physmem API.

Since the PBVM page table and pages are in memory described as loader
data, inform the physmem API of chunks that need to be delated from the
available physical memory.

While here, remove Maxmem and replace it with the better named paddr_max.
Maxmem was defined as physmem, which is generally wrong. Now, paddr_max
is properly defined as the largesty physical address.

The upshot of all this is that:
1.  We properly determine realmem.
2.  We maximize physmem by re-using memory where possible.
3.  We remove complexity from ia64_init() in machdep.c.
4.  Remove confusion about realmem, physmem & Maxmem.

The new ia64_physmem_alloc() is to replace pmap_steal_memory() in pmap.c,
as well as replace the handcrafted allocation of the VHPT for the BSP in
pmap_bootstrap() in pmap.c. This is step 2 and addresses the manipulation
of phys_avail after it is being created.
2012-07-07 00:25:17 +00:00
andrew
0a7002aae7 Make the wchar_t type machine dependent.
This is required for ARM EABI. Section 7.1.1 of the Procedure Call for the
ARM Architecture (AAPCS) defines wchar_t as either an unsigned int or an
unsigned short with the former preferred.

Because of this requirement we need to move the definition of __wchar_t to
a machine dependent header. It also cleans up the macros defining the limits
of wchar_t by defining __WCHAR_MIN and __WCHAR_MAX in the same machine
dependent header then using them to define WCHAR_MIN and WCHAR_MAX
respectively.

Discussed with:	bde
2012-06-24 04:15:58 +00:00
kib
7b36a08108 Implement mechanism to export some kernel timekeeping data to
usermode, using shared page.  The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs neccessary for non-x86 architectures to still compile
are provided.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:06:40 +00:00
kib
d98ad62d7e Reserve AT_TIMEKEEP auxv entry for providing usermode the pointer to
timekeeping information.

MFC after:  1 week
2012-06-22 06:38:31 +00:00
alc
6eeaee04e4 The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer.  This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by:	kib
MFC after:	6 weeks
2012-06-16 18:56:19 +00:00
jkim
7c0f0ac672 Improve style(9) in the previous commit. 2012-06-01 17:07:52 +00:00
iwasaki
9cbcc3c492 Call AcpiLeaveSleepStatePrep() in interrupt disabled context
(described in ACPICA source code).

- Move intr_disable() and intr_restore() from acpi_wakeup.c to acpi.c
  and call AcpiLeaveSleepStatePrep() in interrupt disabled context.
- Add acpi_wakeup_machdep() to execute wakeup MD procedures and call
  it twice in interrupt disabled/enabled context (ia64 version is
  just dummy).
- Rename wakeup_cpus variable in acpi_sleep_machdep() to suspcpus in
  order to be shared by acpi_sleep_machdep() and acpi_wakeup_machdep().
- Move identity mapping related code to acpi_install_wakeup_handler()
  (i386 version) for preparation of x86/acpica/acpi_wakeup.c
  (MFC candidate).

Reviewed by:	jkim@
MFC after:	2 days
2012-06-01 15:26:32 +00:00
alc
0ca67337b8 pmap_alloc_vhpt() doesn't need the pages that it allocates to be mapped
into the kernel map, so vm_page_alloc_contig() can be used in place of
contigmalloc().

Reviewed by:	marcel
2012-06-01 03:56:12 +00:00
bz
cd8b136e12 MFp4 bz_ipv6_fast:
in_cksum.h required ip.h to be included for struct ip.  To be
  able to use some general checksum functions like in_addword()
  in a non-IPv4 context, limit the (also exported to user space)
  IPv4 specific functions to the times, when the ip.h header is
  present and IPVERSION is defined (to 4).

  We should consider more general checksum (updating) functions
  to also allow easier incremental checksum updates in the L3/4
  stack and firewalls, as well as ponder further requirements by
  certain NIC drivers needing slightly different pseudo values
  in offloading cases.  Thinking in terms of a better "library".

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-24 22:00:48 +00:00
marcel
c1bb05ca58 Don't assume we have legacy PICs (i.e. 8259A in cascade) at the legacy
I/O port addresses. Even if we do, this is hardly the place to mask
interrupts. It's not clear that this was at all needed. The code came
with CVS revision 1.2 of nexus.c when interrupt support was first added.
What is known is that ia64 has always been designed around the IOSAPIC,
and that doing I/O like this prevents Altix from booting.
2012-05-04 23:16:29 +00:00
dim
81a3e2b46d Add a convenience macro for the returns_twice attribute, and apply it to
the prototypes of the appropriate functions (getcontext, savectx,
setjmp, sigsetjmp and vfork).

MFC after:	2 weeks
2012-04-29 11:04:31 +00:00
ed
619ab2327e Remove pty(4) from our kernel configurations.
As of FreeBSD 8, this driver should not be used. Applications that use
posix_openpt(2) and openpty(3) use the pts(4) that is built into the
kernel unconditionally. If it turns out high profile depend on the
pty(4) module anyway, I'd rather get those fixed. So please report any
issues to me.

The pty(4) module is still available as a kernel module of course, so a
simple `kldload pty' can be used to run old-style pseudo-terminals.
2012-03-21 08:38:42 +00:00
tijl
42f636fdcd Copy i386 specialreg.h to x86 and merge with amd64 specialreg.h. Replace
amd64/i386/pc98 specialreg.h with stubs.
2012-03-19 21:34:11 +00:00
tijl
ea1f38e564 Copy i386 psl.h to x86 and replace amd64/i386/pc98 psl.h with stubs. 2012-03-19 21:29:57 +00:00
tijl
2a126f1975 Move userland bits (and some common kernel bits) from amd64 and i386
segments.h to a new x86 segments.h.

Add __packed attribute to some structs (just to be sure).
Also make it clear that i386 GDT and LDT entries are used in ia64 code.
2012-03-19 21:24:50 +00:00
tijl
35c7447060 Eliminate ia32_reg.h by moving its contents to x86 and ia64 reg.h.
Reviewed by:	kib
2012-03-18 19:12:11 +00:00