Commit Graph

1919 Commits

Author SHA1 Message Date
kib
a9d505a22a Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by:    alc, attilio
Tested by:      marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by:	re (bz)
2011-09-06 10:30:11 +00:00
kib
f408aa11a3 - Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag
to VPO_UNMANAGED (and also making the flag protected by the vm object
  lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
  VPO_UNMANAGED. As a consequence, pmap code now can use use just
  VPO_UNMANAGED to decide whether the page is unmanaged.

Reviewed by:	alc
Tested by:	pho (x86, previous version), marius (sparc64),
    marcel (arm, ia64, powerpc), ray (mips)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (bz)
2011-08-09 21:01:36 +00:00
marcel
74b9549a99 Fix kernel core dumps now that the kernel is using PBVM. The basic
problem to solve is that we don't have a fixed mapping from kernel
text to physical address so that libkvm can bootstrap itself. We
solve this by passing the physical address of the bootinfo structure
to the consumer as the entry point of the core file. This way,
libkvm can extract the PBVM page table information and locate the
kernel in the core file.
We also need to dump memory chunks of type loader data, because
those hold the kernel and the PBVM page table (among other things).

Approved by:	re (blanket)
2011-08-06 03:40:33 +00:00
marcel
729191f841 Follow-up commit: refactor pmap_kextract() to make it easier to
catch and debug issues like the one fixed in the previous commit:
Replace all return statements with goto statements so that we end
up at a single place with a value for the physical address.
Print a message for all unknown KVA addresses.

Approved by:	re (blanket)
2011-08-05 23:10:47 +00:00
marcel
1d5d907de5 Remove stray semicolon in pmap_kextract() that turned the conditional
"return (0)" into an unconditional one and as such broke PBVM address
queries -- such as during kernel core dumps.

Approved by:	re (blanket)
2011-08-05 23:05:46 +00:00
attilio
30d87a57de Bump MAXCPU for amd64, ia64 and XLP mips appropriately.
From now on, default values for FreeBSD will be 64 maxiumum supported
CPUs on amd64 and ia64 and 128 for XLP. All the other architectures
seem already capped appropriately (with the exception of sparc64 which
needs further support on jalapeno flavour).

Bump __FreeBSD_version in order to reflect KBI/KPI brekage introduced
during the infrastructure cleanup for supporting MAXCPU > 32. This
covers cpumask_t retiral too.

The switch is considered completed at the present time, so for whatever
bug you may experience that is reconducible to that area, please report
immediately.

Requested by:	marcel, jchandra
Tested by:	pluknet, sbruno
Approved by:	re (kib)
2011-07-19 13:00:30 +00:00
attilio
1f74ad4e1b On 64 bit architectures size_t is 8 bytes, thus it should use an 8 bytes
storage.
Fix the sintrcnt/sintrnames specification.

No MFC is previewed for this patch.

Reported, reviewed and tested by:	marcel
Approved by:	re (kib)
2011-07-19 12:41:57 +00:00
attilio
a73e834ebb Add the possibility to specify from kernel configs MAXCPU value.
This patch is going to help in cases like mips flavours where you
want a more granular support on MAXCPU.

No MFC is previewed for this patch.

Tested by:	pluknet
Approved by:	re (kib)
2011-07-19 00:37:24 +00:00
attilio
9a6ff5ad37 - Remove the eintrcnt/eintrnames usage and introduce the concept of
sintrcnt/sintrnames which are symbols containing the size of the 2
  tables.
- For amd64/i386 remove the storage of intr* stuff from assembly files.
  This area can be widely improved by applying the same to other
  architectures and likely finding an unified approach among them and
  move the whole code to be MI. More work in this area is expected to
  happen fairly soon.

No MFC is previewed for this patch.

Tested by:	pluknet
Reviewed by:	jhb
Approved by:	re (kib)
2011-07-18 15:19:40 +00:00
jhb
838b39dd92 Enable NEW_PCIB by default on ia64.
Approved by:	re (kib), marcel
2011-07-18 14:05:14 +00:00
jhb
0f0af1e2f2 Implement bus_adjust_resource() for the ia64 nexus driver.
Reviewed by:	marcel
Approved by:	re (kib)
2011-07-18 14:04:37 +00:00
marcel
3abb8f9591 Don't assume pmap_mapdev() gets called only for memory mapped I/O
addresses (i.e. uncacheable). ACPI in particular uses pmap_mapdev()
rather excessively (number of calls) just to get a valid KVA. To
that end, have pmap_mapdev():
1.  cache the last result so that we don't waste time for multiple
    consecutive invocations with the same PA/SZ.
2.  find the memory descriptor that covers the PA and return NULL
    if none was found or when the PA is for a common DRAM address.
3.  Use either a region 6 or region 7 KVA, in accordance with the
    memory attribute.
2011-07-16 20:34:02 +00:00
marcel
87a4ad12f0 Don't send EOI to the CPU before we handled the interrupt. This could
potentially trigger multiple pending interrupts for level-sensitive
interrupts.  However, the event timer interrupt does need EOI before
being handled to avoid missing clock events.

These conflicting requirements are handled by having the XIV handler
inform the dispatch code whether or not it send EOI to the CPU. If not,
the dispatch code will do it. This allows handlers to send EOI before
doing potentially long-running activities, while still have a sensible
default behaviour.
2011-07-16 20:16:49 +00:00
marcel
21706948a5 Add a few more helper functions for working with memory descriptors:
o   efi_md_find() - returns the md that covers the given address
o   efi_md_last() - returns the last md in the list
o   efi_md_prev() - returns the md that preceeds the given md.
2011-07-16 19:56:07 +00:00
marcel
01e4705d93 Implement basic support for memory attributes. At this time we only
distinguish between UC and WB memory so that we can map the page to
either a region 6 address (for UC) or a region 7 address (for WB).

This change is only now possible, because previously we would map
regions 6 and 7 with 256MB translations and on top of that had the
kernel mapped in region 7 using a wired translation. The introduction
of the PBVM moved the kernel into its own region and freed up region
7 and allowed us to revert to standard page-sized translations.

This commit inroduces pmap_page_to_va() that respects the attribute.
2011-07-08 16:30:54 +00:00
marcel
3ab6414525 Disable PREEMPTION for now. See also PR ia64/147501. 2011-07-04 16:59:26 +00:00
attilio
364d0522f7 With retirement of cpumask_t and usage of cpuset_t for representing a
mask of CPUs, pc_other_cpus and pc_cpumask become highly inefficient.

Remove them and replace their usage with custom pc_cpuid magic (as,
atm, pc_cpumask can be easilly represented by (1 << pc_cpuid) and
pc_other_cpus by (all_cpus & ~(1 << pc_cpuid))).

This change is not targeted for MFC because of struct pcpu members
removal and dependency by cpumask_t retirement.

MD review by:	marcel, marius, alc
Tested by:	pluknet
MD testing by:	marcel, marius, gonzo, andreast
2011-07-04 12:04:52 +00:00
alc
9cb2f04853 When iterating over a paging queue, explicitly check for PG_MARKER, instead
of relying on zeroed memory being interpreted as an empty PV list.

Reviewed by:	kib
2011-07-02 23:42:04 +00:00
marcel
858ce69e22 Change the management of nested faults by switching to physical
addressing while reading or writing the trap frame. It's not
possible to guarantee that the one translation cache entry that
we depend on is not going to get purged by the CPU. We already
know that global shootdowns (ptc.g and/or ptc.ga) can (and will)
cause multiple TC entries to get purged and we initialize tried
to handle that by serializing kernel entry with these operations.
However, we need to serialize kernel exit as well.

But even if we can serialize, it appears that CPU threads within
a core can affect each other's TC entries beyond the global
shootdown. This would mean serializing any and all translatation
cache updates with the threads in a core with the kernel entry
and exit of any thread in that core. This is just too painful
and complicated.

Since we already properly coded for the 2 nested faults that we
can get, all we need to do is use those to obtain the physical
address of the trap frame, switch to physical mode and in that
way eliminate any further faults. The trap frame is already
aligned to 1KB boundaries to make sure we don't cross the page
boundary, this is safe to do.

We still need to serialize ptc.g or ptc.ga across CPUs because
the platform can only have 1 such operation outstanding at the
same time. We can now use a regular (spin) lock for this.

Also, it has been observed that we can get a nested TLB faults
for region 7 virtual addresses. This was unexpected. For now,
we enhance the nested TLB fault handler to deal with those as
well, but it needs to be understood.
2011-06-30 20:34:55 +00:00
alc
21902be08c Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings.  Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write().  It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by:	kib
2011-06-29 16:40:41 +00:00
marcel
985a8017af Oops. The sec field of struct bintime is *not* a 32-bit type.
It's time_t, which is 64 bits on ia64.
2011-06-25 17:58:35 +00:00
marcel
628e425dc7 Define the minimum fractional period in terms of hz. We know hz is
a magnitude smaller than itc_freq. A minimum period of 10*hz is
sufficient precision. As a side-effect, the number of clocks per
second, when the machine is idle, dropped by more than 50%.
Be anal and define the maximum period to be at least 4G seconds.
With a 64-bit counter and an ITC frequency that's expected to be
always less than 4Ghz, it takes longer than that to wrap around.
2011-06-25 16:35:43 +00:00
marcel
31a073f6e1 Replace the original copyright notice with my own. Everything in
this file is written by me and has no bearing on the initial or
original version.
2011-06-25 03:43:58 +00:00
marcel
34aad3a65e Update copyright. 2011-06-25 03:37:40 +00:00
marcel
215e258fd1 Switch to the event timers infrastructure. This includes:
o   Setting td_intr_frame to the XIVs trap frame because it's referenced
    by the ET event handler.
o   Signal EOI to the CPU before calling the registered XIV handlers.
    This prevents lost ITC interrupts, which cause starvation in one-shot
    mode.
o   Adding support for IPI_HARDCLOCK with corresponding per-CPU counters.
o   Have the APs call cpu_initclocks() so as to limited the scattering of
    clock related initialization. cpu_initclocks() calls the <self>_bsp()
    or <self>_ap() version accordingly.
o   Uncomment the ET clock handling in cpu_idle().
o   Update the DDB 'show pcpu' output for the new MD fields.
o   Entirely rewritten ia64_ih_clock(). Note that we don't create as many
    clock XIVs as we have CPUs, as is done on PowerPC. It doesn't scale.
    We can only have 240 XIVs and we can have more CPUs than that. There's
    a single intrcnt index for the cumulative clock ticks and we keep per
    CPU counts in the PCPU stats structure.
o   Register the ITC by hooking SI_SUB_CONFIGURE (2nd order).

Open issues:
o   Clock interrupts can still be lost. Some tweaking is still necessary.

Thanks to: mav@ for his support, feedback and explanations.

ET stats while committing:
eris% sysctl machdep.cpu | grep nclks

machdep.cpu.0.nclks: 24007
machdep.cpu.1.nclks: 22895
machdep.cpu.2.nclks: 13523
machdep.cpu.3.nclks: 9342
machdep.cpu.4.nclks: 9103
machdep.cpu.5.nclks: 9298
machdep.cpu.6.nclks: 10039
machdep.cpu.7.nclks: 9479
eris% vmstat -i | grep clock
clock                      108599         50
2011-06-25 02:15:14 +00:00
marcel
594e41652d Unblock the outgoing thread after we performed pmap_switch() to
switch the region registers. pmap_switch() returns the pmap for
which the region register are currently programmed, which needs
to be re-programmed on the CPU the ougoing thread gets switched
in.  This change does not noticibly change anything or fix known
bugs, but does give me a warm fuzzy feeling by being more
correct.
2011-06-23 16:21:43 +00:00
alc
17293d73f8 Use a non-standard page size that is supported. 2011-06-21 12:38:40 +00:00
marcel
bf21fc1cf4 Improve on style(9) 2011-06-17 05:30:12 +00:00
marcel
f60d9c4627 Properly serialize the global shootdown with the instruction
stream of the local processor. Also explicitly invalidate
the ALAT. This is done on the other CPUs in the coherence
domain by virtue of the ptc.ga instruction, but does not
apply to the local CPU.
2011-06-17 04:26:03 +00:00
marcel
41adb82a21 Add the model number for the Montvale processor (marketed as Itanium 2 9100).
At this time we're missing just one: Tukwila (Itanium 2 9300).
2011-06-11 02:22:11 +00:00
attilio
6ed3ca2c5b MFC 2011-06-07 08:24:29 +00:00
marcel
2bf7ed6fc0 Call set_cputicker() to have the time counter use the ITC register.
Note that the ITC frequency is fixed.
2011-06-07 01:06:49 +00:00
attilio
fcefe479fe MFC 2011-06-06 21:38:39 +00:00
marcel
b85f295840 Improve cpu_idle():
o   cpu_idle_hook is expected to be called with interrupts
    disabled and re-enables interrupts on return.
o   sync with x86: don't idle when the CPU has runnable tasks
o   have callers of ia64_call_pal_static() disable interrupts
    and re-enable interrupts.
o   add, but compile-out, support for idle mode. This will be
    enabled at some later time, after proper testing.
2011-06-06 19:06:15 +00:00
attilio
bc4d32e80b MFC 2011-05-31 21:22:44 +00:00
nwhitehorn
a69e106b2f On multi-core, multi-threaded PPC systems, it is important that the threads
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.

Reviewed by:	jhb
2011-05-31 15:11:43 +00:00
attilio
336fa86932 MFC 2011-05-14 19:20:13 +00:00
marcel
8f1059ecb5 Prefer switching the memory stack from user to kernel *before* switching
the register stack. While the ordering doesn't matter, it creates an
invariant not previously there: the memory stack pointer will always be
larger than the register stack pointer. With this invariant in place,
it's easier to add instrumentation code that detects a stack overflow
because in such a scenario the memory stack pointer and register stack
pointers have crossed each other.

Aside: basic kernel operation needs about half the stack size (~16K)
at most. We have plenty of head room on the kernel stack...
2011-05-14 14:55:15 +00:00
marcel
dce7f91629 Sharpening the saw:
o   Clobber the register that holds the restart token immediately after
    crossing the restart point. This prevents false positives (i.e. a
    nested exception that we don't know can happen and that is being
    treated as one we know by virtue of a lingering restart token).
o   Now that the bootstrap kernel stack is free, switch onto it and call
    trap() for nested traps that we don't know about. In trap we panic()
    so that we can analyze the condition.
2011-05-14 14:47:19 +00:00
marcel
5a5941f823 Be pedantic: mark the pcpu pointer (= register r13) itself as volatile. 2011-05-14 14:40:24 +00:00
marcel
2619d467b3 Turn ia64_srlz() and ia64_srlz_i() into defines so that the code is
still correct when inlining is disabled.
2011-05-14 14:36:08 +00:00
attilio
9ff3491e67 MFC 2011-05-13 20:58:48 +00:00
mdf
3d3b036f95 Move the ZERO_REGION_SIZE to a machine-dependent file, as on many
architectures (i386, for example) the virtual memory space may be
constrained enough that 2MB is a large chunk.  Use 64K for arches
other than amd64 and ia64, with special handling for sparc64 due to
differing hardware.

Also commit the comment changes to kmem_init_zero_region() that I
missed due to not saving the file.  (Darn the unfamiliar development
environment).

Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you
see fit.

Requested by:	alc
MFC after:	1 week
MFC with:	r221853
2011-05-13 19:35:01 +00:00
attilio
ebb514a54a Fix remaining bits that actually weren't converted by mistake.
Reported by:	sbruno
2011-05-13 14:57:20 +00:00
attilio
cae315a375 MFC 2011-05-07 23:34:14 +00:00
marcel
b7e6c09c4f In pmap_kextract(), return the physical address for PBVM virtual
addresses as well (incl. the PBVM page table).
2011-05-07 17:23:13 +00:00
attilio
a0b51ba62f MFC 2011-05-06 22:45:33 +00:00
jhb
5512bf549d Retire isa_setup_intr() and isa_teardown_intr() and use the generic bus
versions instead.  They were never needed as bus_generic_intr() and
bus_teardown_intr() had been changed to pass the original child device up
in 42734, but the ISA bus was not converted to new-bus until 45720.
2011-05-06 13:48:53 +00:00
attilio
fe4de567b5 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primirally done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
marcel
7bd34e0180 Don't use the whole region 5 for KVA, because the CPU may not implement all
of the 61 bits available within the region for virtual addressing.  Since
there's no good way for us to map out the gap in the virtual address space,
limit KVA to the architectural minimum implemented address bits. This still
gives us 1 petabyte of KVA, so no worries.
2011-05-02 17:49:05 +00:00
marcel
a524773fc0 Stop linking against a direct-mapped virtual address and instead
use the PBVM. This eliminates the implied hardcoding of the
physical address at which the kernel needs to be loaded. Using the
PBVM makes it possible to load the kernel irrespective of the
physical memory organization and allows us to replicate kernel text
on NUMA machines.

While here, reduce the direct-mapped page size to the kernel's
page size so that we can support memory attributes better.
2011-04-30 20:49:00 +00:00
jhb
08955ceac0 Change rman_manage_region() to actually honor the rm_start and rm_end
constraints on the rman and reject attempts to manage a region that is out
of range.
- Fix various places that set rm_end incorrectly (to ~0 or ~0u instead of
  ~0ul).
- To preserve existing behavior, change rman_init() to set rm_start and
  rm_end to allow managing the full range (0 to ~0ul) if they are not set by
  the caller when rman_init() is called.
2011-04-29 18:41:21 +00:00
attilio
d685681d59 Add the watchdogs patting during the (shutdown time) disk syncing and
disk dumping.
With the option SW_WATCHDOG on, these operations are doomed to let
watchdog fire, fi they take too long.

I implemented the stubs this way because I really want wdog_kern_*
KPI to not be dependant by SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also avoid to
call them when not necessary (avoiding not-volountary watchdog
activations).

Sponsored by:	Sandvine Incorporated
Discussed with:	emaste, des
MFC after:	2 weeks
2011-04-28 16:02:05 +00:00
rmacklem
66b402e198 This patch changes head so that the default NFS client is now the new
NFS client (which I guess is no longer experimental). The fstype "newnfs"
is now "nfs" and the regular/old NFS client is now fstype "oldnfs".
Although mounts via fstype "nfs" will usually work without userland
changes, an updated mount_nfs(8) binary is needed for kernels built with
"options NFSCL" but not "options NFSCLIENT". Updated mount_nfs(8) and
mount(8) binaries are needed to do mounts for fstype "oldnfs".
The GENERIC kernel configs have been changed to use options
NFSCL and NFSD (the new client and server) instead of NFSCLIENT and NFSSERVER.
For kernels being used on diskless NFS root systems, "options NFSCL"
must be in the kernel config.
Discussed on freebsd-fs@.
2011-04-27 17:51:51 +00:00
marcel
f3caa733d6 Remove prototypes of non-existent functions. 2011-04-25 22:38:09 +00:00
mav
512a6cd715 Switch the GENERIC kernels for all architectures to the new CAM-based ATA
stack. It means that all legacy ATA drivers are disabled and replaced by
respective CAM drivers. If you are using ATA device names in /etc/fstab or
other places, make sure to update them respectively (adX -> adaY,
acdX -> cdY, afdX -> daY, astX -> saY, where 'Y's are the sequential
numbers for each type in order of detection, unless configured otherwise
with tunables, see cam(4)).

ataraid(4) functionality is now supported by the RAID GEOM class.
To use it you can load geom_raid kernel module and use graid(8) tool
for management. Instead of /dev/arX device names, use /dev/raid/rX.
2011-04-24 08:58:58 +00:00
marcel
09f8bacb54 Use the new arch_loadaddr I/F to align ELF objects to PBVM page
boundaries. For good measure, align all other objects to cache
lines boundaries.

Use the new arch_loadseg I/F to keep track of kernel text and
data so that we can wire as much of it as is possible. It is
the responsibility of the kernel to link critical (read IVT
related) code and data at the front of the respective segment
so that it's covered by TRs before the kernel has a chance to
add more translations.

Use a better way of determining whether we're loading a legacy
kernel or not. We can't check for the presence of the PBVM page
table, because we may have unloaded that kernel and loaded an
older (legacy) kernel after that. Simply use the latest load
address for it.
2011-04-03 23:49:20 +00:00
kib
7c2eaa21fe Add support for executing the FreeBSD 1/i386 a.out binaries on amd64.
In particular:
- implement compat shims for old stat(2) variants and ogetdirentries(2);
- implement delivery of signals with ancient stack frame layout and
  corresponding sigreturn(2);
- implement old getpagesize(2);
- provide a user-mode trampoline and LDT call gate for lcall $7,$0;
- port a.out image activator and connect it to the build as a module
  on amd64.

The changes are hidden under COMPAT_43.

MFC after:   1 month
2011-04-01 11:16:29 +00:00
alc
411b69f8a0 Eliminate an unused definition.
Reviewed by:	marcel
2011-03-26 20:40:33 +00:00
marcel
cd817c0d10 Fix switching to physical mode as part of calling into EFI runtime
services or PAL procedures. The new implementation is based on
specific functions that are known to be called in certain scenarios
only. This in particular fixes the PAL call to obtain information
about translation registers. In general, the new implementation does
not bank on virtual addresses being direct-mapped and will work when
the kernel uses PBVM.

When new scenarios need to be supported, new functions are added if
the existing functions cannot be changed to handle the new scenario.
If a single generic implementation is possible, it will become clear
in due time.

While here, change bootinfo to a pointer type in anticipation of
future development.
2011-03-21 18:20:53 +00:00
marcel
e347771c19 Change region 4 to be part of the kernel. This serves 2 purposes:
1.  The PBVM is in region 4, so if we want to make use of it, we
    need region 4 freed up.
2.  Region 4 and above cannot be represented by an off_t by virtue
    of that type being signed. This is problematic for truss(1),
    ktrace(1) and other such programs.
2011-03-21 01:09:50 +00:00
bz
c41eae2d13 For now remove options FLOWTABLE from the remaining GENERIC kernel
configurations and make it opt-in for those who want it.  LINT will
still build it.

While it may be a perfect win in some scenarios, it still troubles users
(see PRs) in general cases.  In addition we are still allocating resources
even if disabled by sysctl and still leak arp/nd6 entries in case of
interface destruction.

Discussed with:	qingli (2010-11-24, just never executed)
Discussed with: juli (OCTEON1)
PR:		kern/148018, kern/155604, kern/144917, kern/146792
MFC after:	2 weeks
2011-03-19 15:50:34 +00:00
marcel
2fa9bafa3d o Move the IVT and supporting functions to the front of the text
segment so that it's always mapped by the loader.
o   Change the alternate fault handlers to account for PBVM. Since
    currently the region is handled by the VHPT, no alternate faults
    will be generated for it.
2011-03-18 22:45:43 +00:00
marcel
1e701a660a Remove inclusion of unneeded bootinfo.h header. 2011-03-18 22:33:19 +00:00
marcel
a49f210540 Use VM_MAXUSER_ADDRESS rather than VM_MAX_ADDRESS when we talk about
the bounds of user space. Redefine VM_MAX_ADDRESS as ~0UL, even though
it's not used anywhere in the source tree.
2011-03-18 15:36:28 +00:00
marcel
8e0b0a2284 MFaltix:
Add support for Pre-Boot Virtual Memory (PBVM) to the loader.

PBVM allows us to link the kernel at a fixed virtual address without
having to make any assumptions about the physical memory layout. On
the SGI Altix 350 for example, there's no usuable physical memory
below 192GB. Also, the PBVM allows us to control better where we're
going to physically load the kernel and its modules so that we can
make sure we load the kernel in memory that's close to the BSP.

The PBVM is managed by a simple page table. The minimum size of the
page table is 4KB (EFI page size) and the maximum is currently set
to 1MB. A page in the PBVM is 64KB, as that's the maximum alignment
one can specify in a linker script. The bottom line is that PBVM is
between 64KB and 8GB in size.

The loader maps the PBVM page table at a fixed virtual address and
using a single translations. The PBVM itself is also mapped using a
single translation for a maximum of 32MB.

While here, increase the heap in the EFI loader from 512KB to 2MB
and set the stage for supporting relocatable modules.
2011-03-16 03:53:18 +00:00
mdf
f225a1fc1c Mostly revert r219468, as I had misremembered the C standard regarding
the size of an extern array.

Keep one change from strncpy to strlcpy.
2011-03-11 18:56:55 +00:00
mdf
675238071f Use MAXPATHLEN rather than the size of an extern array when copying the
kernel name.  Also consistenly use strlcpy().

Suggested by:	Warner Losh
2011-03-10 22:56:00 +00:00
dchagin
69b8756d3d Extend struct sysvec with new method sv_schedtail, which is used for an
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.

Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.

While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.

Discussed with:	kib

MFC after:	2 Week
2011-03-08 19:01:45 +00:00
alc
2f4da8e71e Remove pmap fields that are either unused or not fully implemented.
Discussed with:	kib
2011-02-17 15:36:29 +00:00
marcel
78c6e82cfc Comment-out FLOWTABLE. It causes a kernel panic due to a misaligned memory
access related to an IPv6 route update.

PR:		kern/148018
2011-02-06 22:18:37 +00:00
mdf
b291e9a365 Put the general logic for being a CPU hog into a new function
should_yield().  Use this in various places.  Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after:	1 week
2011-02-02 16:35:10 +00:00
pluknet
5f536fc1d3 Make MSGBUF_SIZE kernel option a loader tunable kern.msgbufsize.
Submitted by:	perryh pluto.rain.com (previous version)
Reviewed by:	jhb
Approved by:	kib (mentor)
Tested by:	universe
2011-01-21 10:26:26 +00:00
jkim
c94e91a673 Remove empty dev_mem_md_init() stubs. 2011-01-17 23:06:47 +00:00
jkim
ea861abf2a Add reader/writer lock around mem_range_attr_get() and mem_range_attr_set().
Compile sys/dev/mem/memutil.c for all supported platforms and remove now
unnecessary dev_mem_md_init().  Consistently define mem_range_softc from
mem.c for all platforms.  Add missing #include guards for machine/memdev.h
and sys/memrange.h.  Clean up some nearby style(9) nits.

MFC after:	1 month
2011-01-17 22:58:28 +00:00
jhb
c17f46e472 Remove unneeded includes of <sys/linker_set.h>. Other headers that use
it internally contain nested includes.

Reviewed by:	bde
2011-01-11 13:59:06 +00:00
kib
4f8260e700 Move repeated MAXSLP definition from machine/vmparam.h to sys/vmmeter.h.
Update the outdated comments describing MAXSLP and the process
selection algorithm for swap out.

Comments wording and reviewed by:	alc
2011-01-09 12:50:44 +00:00
das
404cc61926 The highest-precision floating point type on ia64 has 64 bits of
precision, so DECIMAL_DIG should be 21, as on i386/amd64.
2011-01-09 06:05:02 +00:00
tijl
89281909e1 On mixed 32/64 bit architectures (mips, powerpc) use __LP64__ rather than
architecture macros (__mips_n64, __powerpc64__) when 64 bit types (and
corresponding macros) are different from 32 bit. [1]

Correct the type of INT64_MIN, INT64_MAX and UINT64_MAX.

Define (U)INTMAX_C as an alias for (U)INT64_C matching the type definition
for (u)intmax_t. Do this on all architectures for consistency.

Suggested by:	bde [1]
Approved by:	kib (mentor)
2011-01-08 12:43:05 +00:00
tijl
af03e997ba Fix types of some values in machine/_limits.h.
On some architectures UCHAR_MAX and USHRT_MAX had type unsigned int.
However, lacking integer suffixes for types smaller than int, their type
should correspond to that of an object of type unsigned char (or short)
when used in an expression with objects of type int. In that case unsigned
char (short) are promoted to int (i.e. signed) so the type of UCHAR_MAX and
USHRT_MAX should also be int.

Where MIN/MAX constants implicitly have the correct type the suffix has
been removed.

While here, correct some comments.

Reviewed by:	bde
Approved by:	kib (mentor)
2011-01-08 11:13:34 +00:00
kib
ed862725de Add AT_STACKPROT elf aux vector. Will be used to inform rtld about the
initial stack protection set by the kernel image activator.
2011-01-07 14:22:34 +00:00
brucec
6e3faf1602 Revert r216134. This checkin broke platforms where bus_space are macros:
they need to be a single statement, and do { } while (0) doesn't work in this
situation so revert until a solution can be devised.
2010-12-03 07:09:23 +00:00
brucec
dc1c4b9270 Disallow passing in a count of zero bytes to the bus_space(9) functions.
Passing a count of zero on i386 and amd64 for [I386|AMD64]_BUS_SPACE_MEM
causes a crash/hang since the 'loop' instruction decrements the counter
before checking if it's zero.

PR:	kern/80980
Discussed with:	jhb
2010-12-02 22:19:30 +00:00
alc
de534f6887 phys_avail[] is correctly defined as an array of vm_paddr_t's in
machdep.c.  Use that same type, and not vm_offset_t, in this include file.
2010-12-01 05:52:27 +00:00
jhb
acd72eb169 - Remove <machine/mutex.h>. Most of the headers were empty, and the
contents of the ones that were not empty were stale and unused.
- Now that <machine/mutex.h> no longer exists, there is no need to allow it
  to override various helper macros in <sys/mutex.h>.
- Rename various helper macros for low-level operations on mutexes to live
  in the _mtx_* or __mtx_* namespaces.  While here, change the names to more
  closely match the real API functions they are backing.
- Drop support for including <sys/mutex.h> in assembly source files.

Suggested by:	bde (1, 2)
2010-11-09 20:46:41 +00:00
jhb
c016e5df49 Remove unused includes of <sys/mutex.h> and <machine/mutex.h>. 2010-11-09 20:41:10 +00:00
jkim
36651dc0f0 Reduce diff between platforms and fix style(9) bugs. 2010-11-09 00:14:39 +00:00
jhb
45c0759920 Adjust the order of operations in spinlock_enter() and spinlock_exit() to
work properly with single-stepping in a kernel debugger.  Specifically,
these routines have always disabled interrupts before increasing the nesting
count and restored the prior state of interrupts after decreasing the nesting
count to avoid problems with a nested interrupt not disabling interrupts
when acquiring a spin lock.  However, trap interrupts for single-stepping
can still occur even when interrupts are disabled.  Now the saved state of
interrupts is not saved in the thread until after interrupts have been
disabled and the nesting count has been increased.  Similarly, the saved
state from the thread cannot be read once the nesting count has been
decreased to zero.  To fix this, use temporary variables to store interrupt
state and shuffle it between the thread's MD area and the appropriate
registers.

In cooperation with:	bde
MFC after:     1 month
2010-11-05 13:42:58 +00:00
neel
11129fcf49 Fix bogus error message from bus_dmamem_alloc() about incorrect alignment.
The check for alignment should be made against the physical address and not
the virtual address that maps it.

Sponsored by:	NetApp
Submitted by:	Will McGovern (will at netapp dot com)
Reviewed by:	mjacob, jhb
2010-09-29 21:53:11 +00:00
imp
425d541f16 Remove clauses 3 and 4, per changes to NetBSD versions of these files. 2010-09-25 04:41:42 +00:00
avg
c9fe8ad7f0 bus_add_child: change type of order parameter to u_int
This reflects actual type used to store and compare child device orders.
Change is mostly done via a Coccinelle (soon to be devel/coccinelle)
semantic patch.
Verified by LINT+modules kernel builds.

Followup to:	r212213
MFC after:	10 days
2010-09-10 11:19:03 +00:00
jhb
d02cab2556 Remove unused KTRACE includes. 2010-08-19 16:41:27 +00:00
kib
d9f088a03e Supply some useful information to the started image using ELF aux vectors.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.

Tested by:	marius (sparc64)
MFC after:	1 month
2010-08-17 08:55:45 +00:00
kib
8e1e89f01b Prefer struct sysentvec sv_psstrings to hardcoding FREEBSD32_PS_STRINGS
in the compat32 code. Use sv_usrstack instead of FREEBSD32_USRSTACK as well.

MFC after:	1 week
2010-08-07 11:57:13 +00:00
jhb
19ddbf5c38 Add a new ipi_cpu() function to the MI IPI API that can be used to send an
IPI to a specific CPU by its cpuid.  Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead.  This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.

Submitted by:	peter, sbruno
Reviewed by:	rookie
Obtained from:	Yahoo! (x86)
MFC after:	1 month
2010-08-06 15:36:59 +00:00
jhb
007376c918 Mark the __curthread() functions as __pure2 and remove the volatile keyword
from the inline assembly.  This allows the compiler to cache invocations of
curthread since it's value does not change within a thread context.

Submitted by:	zec (i386)
MFC after:	1 week
2010-07-29 18:44:10 +00:00
mdf
6857471cf3 Add MALLOC_DEBUG_MAXZONES debug malloc(9) option to use multiple uma
zones for each malloc bucket size.  The purpose is to isolate
different malloc types into hash classes, so that any buffer overruns
or use-after-free will usually only affect memory from malloc types in
that hash class.  This is purely a debugging tool; by varying the hash
function and tracking which hash class was corrupted, the intersection
of the hash classes from each instance will point to a single malloc
type that is being misused.  At this point inspection or memguard(9)
can be used to catch the offending code.

Add MALLOC_DEBUG_MAXZONES=8 to -current GENERIC configuration files.
The suggestion to have this on by default came from Kostik Belousov on
-arch.

This code is based on work by Ron Steinke at Isilon Systems.

Reviewed by:    -arch (mostly silence)
Reviewed by:    zml
Approved by:    zml (mentor)
2010-07-28 15:36:12 +00:00
jhb
f27c8b35e2 Very rough first cut at NUMA support for the physical page allocator. For
now it uses a very dumb first-touch allocation policy.  This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
  via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
  a CPU belongs to.  Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
  (VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
  The MD code is required to populate an array of mem_affinity structures.
  Each entry in the array defines a range of memory (start and end) and a
  domain for the range.  Multiple entries may be present for a single
  domain.  The list is terminated by an entry where all fields are zero.
  This array of structures is used to split up phys_avail[] regions that
  fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
  used when fulfulling a physical memory allocation.  Right now the
  per-domain freelists are listed in a round-robin order for each domain.
  In the future a table such as the ACPI SLIT table may be used to order
  the per-domain lookup lists based on the penalty for each memory domain
  relative to a specific domain.  The lookup lists may be examined via a
  new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
  pick a lookup list when allocating memory.

Reviewed by:	alc
2010-07-27 20:33:50 +00:00
kib
9ac2754b6d When compat32 binary asks for the value of hw.machine_arch, report the
name of 32bit sibling architecture instead of the host one. Do the
same for hw.machine on amd64.

Add a safety belt debug.adaptive_machine_arch sysctl, to turn the
substitution off.

Reviewed by:	jhb, nwhitehorn
MFC after:	2 weeks
2010-07-22 09:13:49 +00:00
marcel
d729021fb8 Add acpi_find_table() -- a convenience function for looking up an
ACPI table given the signature.
2010-07-07 20:07:33 +00:00