Commit Graph

3440 Commits

Author SHA1 Message Date
royger
c9ce58e137 vm_phys: remove limitation on number of fictitious regions
The number of vm fictitious regions was limited to 8 by default, but
Xen will make heavy usage of those kind of regions in order to map
memory from foreign domains, so instead of increasing the default
number, change the implementation to use a red-black tree to track vm
fictitious ranges.

The public interface remains the same.

Sponsored by: Citrix Systems R&D
Reviewed by: kib, alc
Approved by: gibbs

vm/vm_phys.c:
 - Replace the vm fictitious static array with a red-black tree.
 - Use a rwlock instead of a mutex, since now we also need to take the
   lock in vm_phys_fictitious_to_vm_page, and it can be shared.
2014-07-09 08:12:58 +00:00
marcel
9f28abd980 Remove ia64.
This includes:
o   All directories named *ia64*
o   All files named *ia64*
o   All ia64-specific code guarded by __ia64__
o   All ia64-specific makefile logic
o   Mention of ia64 in comments and documentation

This excludes:
o   Everything under contrib/
o   Everything under crypto/
o   sys/xen/interface
o   sys/sys/elf_common.h

Discussed at: BSDcan
2014-07-07 00:27:09 +00:00
alc
d74e85dbb9 Introduce pmap_unwire(). It will replace pmap_change_wiring(). There are
several reasons for this change:

pmap_change_wiring() has never (in my memory) been used to set the wired
attribute on a virtual page.  We have always used pmap_enter() to do that.
Moreover, it is not really safe to use pmap_change_wiring() to set the wired
attribute on a virtual page.  The description of pmap_change_wiring() says
that it assumes the existence of a mapping in the pmap.  However, non-wired
mappings may be reclaimed by the pmap at any time.  (See pmap_collect().)
Many implementations of pmap_change_wiring() will crash if the mapping does
not exist.

pmap_unwire() accepts a range of virtual addresses, whereas
pmap_change_wiring() acts upon a single virtual page.  Since we are
typically unwiring a range of virtual addresses, pmap_unwire() will be more
efficient.  Moreover, pmap_unwire() allows us to unwire superpage mappings.
Previously, we were forced to demote the superpage mapping, because
pmap_change_wiring() only allowed us to express the unwiring of a single
base page mapping at a time.  This added to the overhead of unwiring for
large ranges of addresses, including the implicit unwiring that occurs at
process termination.

Implementations for arm and powerpc will follow.

Discussed with:	jeff, marcel
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-07-06 17:42:38 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
alc
50c074c333 Delay the call to crhold() in vm_map_insert() until we know that we won't
have to undo it by calling crfree().  This reduces the total number of calls
by vm_map_insert() to crhold() and crfree() by 45% in my tests.

Eliminate an unnecessary variable from vm_map_insert().

Reviewed by:	kib
Tested by:	pho
2014-06-26 16:04:03 +00:00
alc
aeda9d4c41 Now that vm_map_insert() sets MAP_ENTRY_GROWS_{DOWN,UP} on the stack entries
that it creates (r267645), we can place the check that blocks map entry
coalescing on stack entries in vm_map_simplify_entry() where it properly
belongs.

Reviewed by:	kib
2014-06-25 03:30:03 +00:00
kib
c2a4e94982 Use correct names for the flags. MAP_ENTRY_GROWS_* have the same
numerical values as MAP_STACK_GROWS_*, but the former is for entries'
eflags, while the later for the cow argument of vm_map_insert().

Submitted by:	alc
2014-06-23 07:03:47 +00:00
kib
bded3c3768 Assert that the new entry is inserted into the right location in the
map entries list, and that it does not overlap with the previous and
next entries.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-20 07:01:53 +00:00
alc
2cd5489f2c Eliminate a pointless call to vm_map_clip_start() from vm_map_growstack().
For this call to do anything at all we would have to have two overlapping
map entries.

Submitted by:	kib
2014-06-19 21:05:07 +00:00
alc
1fa56d3c4e When MAP_STACK_GROWS_{DOWN,UP} are passed to vm_map_insert() set the
corresponding flag(s) in the new map entry.  Previously, the caller was
responsible for setting them after vm_map_insert() returned.

Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when
extending the stack in the downward direction.

Together these changes slightly simplify the caller's task when creating a
downward growing stack.  In particular, the caller no longer needs to clip
the previous entry, because the new stack entry can't possibly coalesce
with the previous entry.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-19 16:26:16 +00:00
kib
893100aa10 Add MAP_EXCL flag for mmap(2). It should be combined with MAP_FIXED,
and prevents the request from deleting existing mappings in the
region, failing instead.

Reviewed by:	alc
Discussed with:	jhb
Tested by:	markj, pho (previous version, as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-19 05:00:39 +00:00
attilio
2802c525ad - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
alc
7b0b901024 Tidy up the early parts of vm_map_insert(), in particular, simplify one
of the assertions and eliminate a comment that has grown stale.

Reviewed by:	kib
MFC after:	1 week
2014-06-16 16:37:41 +00:00
alc
5badd21ab6 One of the intentions behind r267254 was that the global variable "sgrowsiz"
would be read once and cached in a local variable so that the resource limit
check and map entry insertion would be guaranteed to use the same value.
However, the value being passed to vm_map_insert() is still from "sgrowsiz"
and not the local variable.  Correct this oversight.

Reviewed by:	kib
2014-06-15 07:52:59 +00:00
mav
2bc26491c3 Introduce new "256 Bucket" zone to split requests and reduce congestion
on "128 Bucket" zone lock.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 11:57:07 +00:00
mav
c7987fc583 Allocating new bucket for bucket zone, never take it from the zone itself,
since it will almost certanly fail.  Take next bigger zone instead.

This situation should not happen with original bucket zones configuration:
"32 Bucket" zone uses "64 Bucket" and vice versa.  But if "64 Bucket" zone
lock is congested, zone may grow its bucket size and start biting itself.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 11:36:22 +00:00
alc
f133c95804 Correct a bug in the management of the population map on big-endian
machines.  Specifically, there was a mismatch between how the routine
allocation and deallocation operations accessed the population map
and how the aggressively optimized reservation-breaking operation
accessed it.  So, problems only occurred when reservations were broken.
This change makes the routine operations access the population map in
the same way as the reservation breaking operation.

This bug was introduced in r259999.

PR:		187080
Tested by:	jmg (on an "armeb" machine)
Sponsored by:	EMC / Isilon Storage Division
2014-06-11 16:11:12 +00:00
kib
7f8a65c9fc Make mmap(MAP_STACK) search for the available address space, similar
to !MAP_STACK mapping requests.  For MAP_STACK | MAP_FIXED, clear any
mappings which could previously exist in the used range.

For this, teach vm_map_find() and vm_map_fixed() to handle
MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new
vm_map_stack_locked() helper, which is factored out from
vm_map_stack().

The side effect of the change is that MAP_STACK started obeying
MAP_ALIGNMENT and MAP_32BIT flags.

Reported by:	rwatson
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-09 03:37:41 +00:00
alc
39548e640f Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping.  For
example, both kinds of mappings entail the creation of a single PTE and PV
entry.  With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter.  Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages.  Now, it will
create up to 96 base page or superpage mappings.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
kib
b03014a79c Remove the assert which can be triggered by the userspace. The
situation checked by assert is verified to not take place in
vm_map_wire(), and protection permissions on the wired entry can be
revoked afterward.

Reported by:	markj
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-28 00:45:35 +00:00
alc
549a5817c0 There is no reason to perform the pmap_remove() on the kernel pmap while
the kmem object lock is held.  Do the pmap_remove() before acquiring the
kmem object lock.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-23 16:22:36 +00:00
kib
62a5a6e7ef Remove redundand loop. The inner goto restarts the whole page
handling in the situation identical to the loop condition.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-21 08:19:04 +00:00
kib
235f4b1083 When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace.  The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process.  The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.

If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successfull exec.  Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.

To fix, postpone the free of old vmspace until the threads are resumed
and exited.  To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.

http://bugs.debian.org/743141

Reported by:	Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-20 09:19:35 +00:00
alc
cf3c13370d On a fork allow read-only wired pages to be copy-on-write shared between the
parent and child processes.  Previously, we copied these pages even though
they are read only.  However, the reason for copying them is historical and
no longer exists.  In recent times, vm_map_protect() has developed the
ability to copy pages when write access is added to wired copy-on-write
pages.  So, in this case, copy-on-write sharing of wired pages is not to be
feared.  It is not going to lead to copy-on-write faults on wired memory.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-13 13:20:23 +00:00
kib
ba20870cbd Fix locking. The dst_object must remain locked on the retry of the
loop iteration.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2014-05-11 18:07:07 +00:00
alc
b716a32dc2 With the new-and-improved vm_fault_copy_entry() (r265843), we can always
avoid soft page faults when adding write access to user wired entries in
vm_map_protect().  Previously, we only avoided the soft page fault when
the underlying pages were copy-on-write.  In other words, we avoided the
pages faults that might sleep on page allocation, but not the trivial
page faults to update the physical map.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-11 17:41:29 +00:00
alc
d283b621dc About 9% of the pmap_protect() calls being performed by vm_map_copy_entry()
are unnecessary.  Eliminate the unnecessary calls.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-10 19:47:00 +00:00
kib
7798d7f7f4 For the upgrade case in vm_fault_copy_entry(), when the entry does not
need COW and is writeable (i.e. becoming writeable due to the
mprotect(2) operation), do not create a new backing object for the
entry.  The caller of the function is vm_map_protect(), the call is
made to ensure that wired entry has all pages resident and wired in
the top level object and to enable the write.  We might need to copy
read-only page from some backing objects into the top object or remap
the page with the write allowed.

This fixes the issue with mishandling of the swap accounting when
read-only wired mapping is upgraded to write-enabled after fork.  The
previous code path did not accounted the new object, but it creation
is redundand anyway and the change provides an optimization for the
non-common situation.

Reported by:	markj
Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-10 17:03:33 +00:00
kib
32f811c3b5 When printing the map with the ddb 'show procvm' command, do not dump
page queues for the backing objects.  The queues are huge and clutter
the display, when mostly the map entries and its backing storage is
interesting.

The page queues can be seen with ddb 'show object' command.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-10 16:36:13 +00:00
kib
90e904e481 Print the entry address in addition to the object. The variable is
typically optimized out and debuggers cannot find its value.

Sponsored by:	    The FreeBSD Foundation
MFC after:	1 week
2014-05-10 16:30:48 +00:00
pho
1108602cbc msync(2) must return ENOMEM and not EINVAL when the address is outside the
allowed range or when one or more pages are not mapped. This according to
The Open Group Base Specifications Issue 7.

Discussed with:	 attilio, Bruce Evans
Reviewed by:	 alc, Garrett Cooper
Reported by:	 ATF
MFC after:	 2 weeks
Sponsored by:	EMC / Isilon storage division
2014-05-07 08:38:02 +00:00
alc
c77af15259 Prior to r254304, a separate function, vm_pageout_page_stats(), was used to
periodically update the reference status of the active pages.  This function
was called, instead of vm_pageout_scan(), when memory was not scarce.  The
objective was to provide up to date reference status for active pages in
case memory did become scarce and active pages needed to be deactivated.

The active page queue scan performed by vm_pageout_page_stats() was
virtually identical to that performed by vm_pageout_scan(), and so r254304
eliminated vm_pageout_page_stats().  Instead, vm_pageout_scan() is
called with the parameter "pass" set to zero.  The intention was that when
pass is zero, vm_pageout_scan() would only scan the active queue.  However,
the variable page_shortage can still be greater than zero when memory is not
scarce and vm_pageout_scan() is called with pass equal to zero.
Consequently, the inactive queue may be scanned and dirty pages laundered
even though that was not intended by r254304.  This revision fixes that.

Reported by:	avg
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-06 03:42:04 +00:00
kib
0c45ba8eb0 For the VM_PHYSSEG_DENSE case, checking the requested range to fall
into the area backed by vm_page_array wrongly compared end with
vm_page_array_size.  It should be adjusted by first_page index to be
correct.

Also, the corner and incorrect case of the requested range extending
after the end of the vm_page_array was incorrectly handled by
allocating the segment.

Fix the comparision for the end of range and return EINVAL if the end
extends beyond vm_page_array.

Discussed with:	royger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-29 18:42:37 +00:00
kib
c36743d23f When vm_fault_copy_entry() is called from vm_map_protect() for a wired
entry and performs the upgrade of the entry permissions from read-only
to read-write, we must allow to search for the source pages in the
backing object, like we do in the case of forking the read-only wired
entry. For the fork case, the behaviour is allowed by src_readonly
boolean, which in fact is only used to assert that read-write case
provides all source pages in the top-level object.

Eliminate the src_readonly variable.  Allow for the copy loop to look
into the backing objects, add explicit asserts to ensure that only
read-only and upgrade case actually does.

Expand comments. Change the panic call into assert.

Reported by:	markj
Tested by:	markj, pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-27 05:19:01 +00:00
des
65493094e9 Add sysctl OIDs showing the actual size and capacity of the swap zone.
MFC after:	1 week
2014-04-26 12:18:17 +00:00
bdrewery
6fcf6199a4 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
kib
24c4e4a548 Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, for accesses to direct map region should check for the limit by
which direct map is instantiated.

Second, for accesses to the kernel map, success returned from the
kernacc(9) does not guarantee that consequent attempt to read or write
to the checked address succeed, since other thread might invalidate
the address meantime.  Add a new thread private flag TDP_DEVMEMIO,
which instructs vm_fault() to return error when fault happens on the
MAP_ENTRY_NOFAULT entry, instead of panicing.  The trap handler would
then see a page fault from access, and recover in normal way, making
/dev/mem access safer.

Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed
and having Giant locked does not solve issues for amd64.

Note that at least the second issue exists on other architectures, and
requires similar patching for md code.

Reported and tested by:	clusteradm (gjb, sbruno)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-21 14:25:09 +00:00
kib
07e1d8d74f Initialize vm_map_entry member wiring_thread on the map entry creation.
This was missed in r253190.

Reported by:	hps, peter
Tested by:	hps
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-03-21 13:55:57 +00:00
attilio
f19bbde667 vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock,
then threads can sleep on the pip condition.
Avoid to deadlock such threads by correctly awakening the sleeping ones
after the pip is finished.
swapoff side of the bug can likely result in shutdown deadlocks.

Sponsored by:	EMC / Isilon Storage Division
Reported by:	pho, pluknet
Tested by:	pho
2014-03-19 01:13:42 +00:00
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
kib
12c58a8fb0 Initialize paddr to handle the case of zero size.
Reported and reviewed by:	Conrad Meyer <cemeyer@uw.edu>
MFC after:	1 week
2014-03-12 16:38:55 +00:00
kib
e4111a6b71 Do not vdrop() the tmpfs vnode until it is unlocked. The hold
reference might be the last, and then vdrop() would free the vnode.

Reported and tested by:	bdrewery
MFC after:	1 week
2014-03-12 15:13:57 +00:00
dim
e21b440a4c After r251709, avoid a clang 3.4 warning about an unused static const
variable (uma_max_ipers), when asserts are disabled.

Reviewed by:	glebius
MFC after:	3 days
2014-02-14 17:47:18 +00:00
attilio
6a31e25bb9 Fix-up r254141: in the process of making a failing vm_page_rename()
a call of pager_swap_freespace() was moved around, now leading to freeing
the incorrect page because of the pindex changes after vm_page_rename().

Get back to use the correct pindex when destroying the swap space.

Sponsored by:	EMC / Isilon storage division
Reported by:	avg
Tested by:	pho
MFC after:	7 days
2014-02-14 03:34:12 +00:00
glebius
b1650f2d1e Fix function name in KASSERT().
Submitted by:	hiren
2014-02-12 20:11:20 +00:00
jhb
ee403b8a8c Correct assertion to assert that the existing device VM object uses the
same type rather than asserting in the case where we just created a new
VM object.

Reviewed by:	kib
2014-02-11 22:05:21 +00:00
glebius
45bf1cc683 Create two public UMA_ZONE_PCPU zones: 64 bit sized and pointer sized.
Sponsored by:	Nginx, Inc.
2014-02-10 19:59:46 +00:00
glebius
613c5f4e53 Style. 2014-02-10 19:51:15 +00:00