Commit Graph

3369 Commits

Author SHA1 Message Date
Adrian Chadd
6520495abc Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
  the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
  if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
  policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
  policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
  set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
  issue, but has not been able to verify it (it doesn't look NUMA
  related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
  all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
  NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@.  The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus.  My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
  unbalanced memory setup whilst under high memory pressure, VM page allocation
  may fail leading to a kernel panic.  This was a problem in the past, but it's
  much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
  affect contigmalloc, KVA placement, UMA, etc.  So, driver placement of memory
  isn't really guaranteed in any way.  That's next on my plate.

Sponsored by:	Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
Alan Cox
22cf98d1f3 The intention of r254304 was to scan the active queue continuously.
However, I've observed the active queue scan stopping when there are
frequent free page shortages and the inactive queue is steadily refilled
by other mechanisms, such as the sequential access heuristic in vm_fault()
or madvise(2).  To remedy this problem, record the time of the last active
queue scan, and always scan a number of pages proportional to the time
since the last scan, regardless of whether that last scan was a
timeout-triggered ("pass == 0") or free-page-shortage-triggered ("pass >
0") scan.

Also, on a timeout-triggered scan, allow a full scan of the active queue
when the system is short of inactive pages.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2015-07-08 17:45:59 +00:00
Mark Johnston
010ba3842c Add a local variable initialization needed in the OBJT_DEFAULT case.
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D2992
2015-07-05 22:26:19 +00:00
Mateusz Guzik
cd336bad26 vm: don't lock proc around accesses to vm_{t,d}addr and RLIMIT_DATA in sys_mmap
vm_{t,d}addr are constant and we can use thread's copy of resource limits
2015-07-02 18:30:12 +00:00
Konstantin Belousov
be930a2021 Account for the main process stack being one page below the highest
user address when ABI uses shared page.

Note that the change is no-op for correctness, since shared page does
not fault.  The mapping for the shared page is installed at the
address space creation, the page is unmanaged and its pte/pv entry
cannot be reclaimed.

Submitted by:	Oliver Pinter
Review:	https://reviews.freebsd.org/D2954
MFC after:	1 week
2015-07-02 15:22:13 +00:00
Mark Murray
d1b06863fb Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
  neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
  random(4) device in the kernel, and the KERN_ARND sysctl will
  return nothing. With RANDOM_DUMMY there will be a random(4) that
  always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
  through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
  suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
  This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
  there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
  behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
  (direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
  weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
  weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
  when the dust settles.
- Use macros for locks.
- Fix comments.

* src/share/man/...
- Update the man pages.

* src/etc/...
- The startup/shutdown work is done in D2924.

* src/UPDATING
- Add UPDATING announcement.

* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.

* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.

* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.

* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
  selection is the way to go.

* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.

* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.

* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
  that instead. All that is needed here is N=0, N++, N==0, and some
  localised trickery is used to manufacture a 128-bit 0ULLL.

* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
  now the test will do a basic check of compressibility. Clairvoyant
  talent is still a good idea.
- This is still a long way off a proper unit test.

* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])

* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
  it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
  functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.

Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
John-Mark Gurney
afc6dc3669 If INVARIANTS is specified, add ctor/dtor to junk memory if they are
unspecified...

Submitted by:	Suresh Gumpula at Netapp
Differential Revision:	https://reviews.freebsd.org/D2725
2015-06-25 20:44:46 +00:00
Alan Cox
aa04413540 Avoid pmap_is_modified() on pages that can't be mapped.
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2015-06-21 01:22:35 +00:00
Gleb Smirnoff
093ebe1d28 o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async().
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-06-17 22:44:27 +00:00
Konstantin Belousov
776f729c86 Invalid pages do not need neither update of the activation count nor
they coould be dirty.  Move the handling if the invalid pages in the
inactive scan earlier.

Remove some code duplication in the scan by introducing the
'drop_page' label, which centralizes the object and the page unlock.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-14 20:23:41 +00:00
Alan Cox
78afdce6af As the next step in eliminating PG_CACHE pages, free rather than cache
pages in vm_pageout_scan().  The reactivation rate of cache pages created
by vm_pageout_scan() is extremely low; typically no more than 0.5% to
2.25% of the pages are ever reactivated.  At the same time, caching pages
is more expensive than freeing them.  For example, in a test with
PostgreSQL, this change reduced the amount of time spent in the inactive
queue scan by 1/6.

Differential Revision:	https://reviews.freebsd.org/D2805
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-06-14 05:23:39 +00:00
Gleb Smirnoff
093c7f396d Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
Alan Cox
5cd7a4f76c Correct a type error in kmem_unback(). Previously, kmem_unback() did not
correctly handle deallocation requests of two or more gigabytes in size.
Eventually, this would lead to a panic elsewhere in the kernel, such as
"vm_radix_insert: key <vm_pindex_t> is already present".

Reported by:	Ilias Marinos
MFC after:	1 week
2015-06-10 05:17:14 +00:00
Alan Cox
966272ca33 Retire VM_FREEPOOL_CACHE as the next step in eliminating PG_CACHE pages.
Differential Revision:	https://reviews.freebsd.org/D2712
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-06-08 04:59:32 +00:00
John Baldwin
7077c42623 Add a new file operations hook for mmap operations. File type-specific
logic is now placed in the mmap hook implementation rather than requiring
it to be placed in sys/vm/vm_mmap.c.  This hook allows new file types to
support mmap() as well as potentially allowing mmap() for existing file
types that do not currently support any mapping.

The vm_mmap() function is now split up into two functions.  A new
vm_mmap_object() function handles the "back half" of vm_mmap() and accepts
a referenced VM object to map rather than a (handle, handle_type) tuple.
vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a
a VM object and then calling vm_mmap_object() to handle the actual mapping.
The vm_mmap() function remains for use by other parts of the kernel
(e.g. device drivers and exec) but now only supports mapping vnodes,
character devices, and anonymous memory.

The mmap() system call invokes vm_mmap_object() directly with a NULL object
for anonymous mappings.  For mappings using a file descriptor, the
descriptors fo_mmap() hook is invoked instead.  The fo_mmap() hook is
responsible for performing type-specific checks and adjustments to
arguments as well as possibly modifying mapping parameters such as flags
or the object offset.  The fo_mmap() hook routines then call
vm_mmap_object() to handle the actual mapping.

The fo_mmap() hook is optional.  If it is not set, then fo_mmap() will
fail with ENODEV.  A fo_mmap() hook is implemented for regular files,
character devices, and shared memory objects (created via shm_open()).

While here, consistently use the VM_PROT_* constants for the vm_prot_t
type for the 'prot' variable passed to vm_mmap() and vm_mmap_object()
as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines.
Previously some places were using the mmap()-specific PROT_* constants
instead.  While this happens to work because PROT_xx == VM_PROT_xx,
using VM_PROT_* is more correct.

Differential Revision:	https://reviews.freebsd.org/D2658
Reviewed by:	alc (glanced over), kib
MFC after:	1 month
Sponsored by:	Chelsio
2015-06-04 19:41:15 +00:00
Eric van Gyzen
63e4c6cdf9 Provide vnode in memory map info for files on tmpfs
When providing memory map information to userland, populate the vnode pointer
for tmpfs files.  Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by:   Eric Badger <eric@badgerio.us> (initial revision)
Obtained from:  Dell Inc.
PR:             198431
MFC after:      2 weeks
Reviewed by:    jhb
Approved by:    kib (mentor)
2015-06-02 18:37:04 +00:00
Alan Cox
c4e49ba477 Document vm_page_alloc_contig()'s support for the VM_ALLOC_NODUMP option.
MFC after:	3 days
2015-05-30 23:37:47 +00:00
John Baldwin
ff87ae350e Export a list of VM objects in the system via a sysctl. The list can be
examined via 'vmstat -o'.  It can be used to determine which files are
using physical pages of memory and how much each is using.

Differential Revision:	https://reviews.freebsd.org/D2277
Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc. (forward porting to HEAD/10)
2015-05-27 18:11:05 +00:00
Jung-uk Kim
fd90e2ed54 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
Konstantin Belousov
2c20bd8b99 Do grammar fix in the comment to record the right commit message for
r283162.

Fix a cosmetic issue with vm_page_alloc() calling vm_page_free_toq()
with the page not completely satisfying vm_page_free() assertions.
The page is not owned by the object, since insertion failed.  But
besides m->object reset to NULL, we should also set VPO_UNMANAGED flag
for consistency.

Reported by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-20 23:15:56 +00:00
Konstantin Belousov
da47499040 Remove the write-only variable phent. We currently do not check the
size of the program header's entries.

Reported by:	adrian (by using gcc 4.9)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-20 23:03:22 +00:00
Konstantin Belousov
9cddade79a Satisfy vm_object uma zone destructor requirements after r282660 when
vnode object creation raced.

Reported by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2015-05-10 08:21:03 +00:00
Konstantin Belousov
44ec2b63c5 The vmem callback to reclaim kmem arena address space on low or
fragmented conditions currently just wakes up the pagedaemon.  The
kmem arena is significantly smaller then the total available physical
memory, which means that there are loads where kmem arena space could
be exhausted, while there is a lot of pages available still.  The
woken up pagedaemon sees vm_pages_needed != 0, verifies the condition
vm_paging_needed() which is false, clears the pass and returns back to
sleep, not calling neither uma_reclaim() nor lowmem handler.

To handle low kmem arena conditions, create additional pagedaemon
thread which calls uma_reclaim() directly.  The thread sleeps on the
dedicated channel and kmem_reclaim() wakes the thread in addition to
the pagedaemon.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-09 20:08:36 +00:00
John Baldwin
e735691b61 Place VM objects on the object list when created and never remove them.
This is ok since objects come from a NOFREE zone and allows objects to
be locked while traversing the object list without triggering a LOR.

Ensure that objects on the list are marked DEAD while free or stillborn,
and that they have a refcount of zero.  This required updating most of
the pagers to explicitly mark an object as dead when deallocating it.
(Only the vnode pager did this previously.)

Differential Revision:	https://reviews.freebsd.org/D2423
Reviewed by:	alc, kib (earlier version)
MFC after:	2 weeks
Sponsored by:	Norse Corp, Inc.
2015-05-08 19:43:37 +00:00
Adrian Chadd
cf82d8112e oops - how'd i miss this. Sorry! 2015-05-08 06:02:23 +00:00
Adrian Chadd
415d7ccab2 Add initial memory locality cost awareness to the VM, and include
a basic ACPI SLIT table parser.

For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.

* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
  relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.

Differential Revision:	https://reviews.freebsd.org/D2460
Reviewed by:	rpaulo, stas, jhb
Sponsored by:	Norse Corp, Inc (hardware, coding); Dell (hardware)
2015-05-08 00:56:56 +00:00
Gleb Smirnoff
d2596d179a Fix the KASSERT and improve wording in r282426.
Submitted by:	alc
2015-05-06 08:07:11 +00:00
Gleb Smirnoff
84d313761d Fix arithmetical bug in vnode_pager_haspage(). The check against object size
should be done not with the number of pages in the first block, but with the
overall number of pages.  While here, add KASSERT that makes sure that BMAP
doesn't return completely irrelevant blocks.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-05-04 18:49:25 +00:00
Gleb Smirnoff
89c241d1a6 Instead of reading, validating and adjusting value of the vm.swap_async_max
in the main swapper work cycle, do it in the sysctl handler.  This removes
extra mutex acquisition from the main cycle and makes the sysctl knob return
error on an invalid value, instead of accepting and fixing it.

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-05-02 20:27:37 +00:00
John Baldwin
ed95805e90 Remove support for Xen PV domU kernels. Support for HVM domU kernels
remains.  Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead.  Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
  components for PV kernels.  These components are now standard as they
  are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
  etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.

Differential Revision:	https://reviews.freebsd.org/D2362
Reviewed by:	royger
Tested by:	royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes:	yes
2015-04-30 15:48:48 +00:00
Scott Long
affc4a4bff Improve support for blacklisting bad memory locations. The user can supply
a text file with a list of physical memory addresses to exclude, and have it
loaded at boot time via the provided example in loader.conf.  The tunable
'vm.blacklist' remains, but using an external file means that there's no
practical limit to the size of the list.  This change also improves the
scanning algorithm for processing the list, scanning the list only once
instead of scanning it for every page in the system.  Both the sysctl and
the file can be unsorted and contain duplicates so long as each entry is
numeric (decimal or hex) and is separated by a space, comma, or newline
character.  The sysctl 'vm.page_blacklist' is now provided to report what
memory locations were successfully excluded.

Reviewed by:	imp, emax
Obtained from:	Netflix, Inc.
MFC after:	3 days
2015-04-29 15:57:14 +00:00
Edward Tomasz Napierala
4b5c9cf62f Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision:	https://reviews.freebsd.org/D2369
Reviewed by:	kib@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-04-29 10:23:02 +00:00
Konstantin Belousov
85af31a464 Do not sleep waiting for the MAP_ENTRY_IN_TRANSITION state ending with
the vnode locked.

Review:	https://reviews.freebsd.org/D2381
Submitted by:	Conrad Meyer, Attilio Rao
MFC after:	1 week
2015-04-28 08:20:23 +00:00
Scott Long
43ffa928f9 Revert r281451. It causes a panic/hang early in boot for a number of
users, myself included.  The original code is likely papering over a
larger bug that needs to be explored, but for now get things back to
a working state.

Obtained from:	Netflix, Inc.
MFC after:	immediately
2015-04-24 17:03:53 +00:00
John Baldwin
179fa75e6e Reassign copyright statements on several files from Advanced
Computing Technologies LLC to Hudson River Trading LLC.

Approved by:	Hudson River Trading LLC (who owns ACT LLC)
MFC after:	1 week
2015-04-23 14:22:20 +00:00
Alan Cox
d74e6a1d27 Eliminate an unused variable.
MFC after:	1 week
2015-04-20 16:48:21 +00:00
Alan Cox
74f944344b Eliminate an unused variable.
MFC after:	1 week
2015-04-19 00:29:02 +00:00
Konstantin Belousov
0538aafc41 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
Dmitry Chagin
51cfb0be84 Rework r281162. Indeed, the flexible array member is preferable here.
Suggested by:   Justin T. Gibbs

MFC after:	3 days
2015-04-12 06:21:58 +00:00
Alan Cox
6a93e36b5f Correct an off-by-one error in vm_reserv_reclaim_contig() that results in
an infinite loop.

Submitted by:	Svatopluk Kraus
MFC after:	1 week
2015-04-11 22:57:13 +00:00
Gleb Smirnoff
16be9f54e7 UMA zone limit can be lowered, so remove protection against from
the sysctl_handle_uma_zone_max().

Sponsored by:	Nginx, Inc.
2015-04-10 06:56:49 +00:00
Alexander Motin
0ada3afc25 Remove sleeps from geom_up thread on device destruction.
MFC after:	3 days.
2015-04-09 13:09:05 +00:00
Jeff Roberson
34d8b7ea3b - Simplify vm_pageout_scan() by introducing a new vm_pageout_clean()
function that does the locking and validation associated with cleaning
   a page.  This moves 150 lines of code into its own function.
 - Rename vm_pageout_clean() to vm_pageout_cluster() to define what it
   really does; clustering nearby pages for pageout optimization.

Reviewd by:	alc, kib, kmacy
Tested by:	pho (earlier version)
Sponsored by:	EMC / Isilon
2015-04-07 02:18:52 +00:00
Dmitry Chagin
6723fdfe46 Properly calculate "UMA Zones" per cpu cache size. Avoid allocating
an extra struct uma_cache since the struct uma_zone already has one.

PR:		199169
Submitted by:	luke.tw gmail com
MFC after:	1 week
2015-04-06 18:45:41 +00:00
Alan Cox
b5ab20c066 Until the lock assertions in vm_page_advise() are properly reevaluated,
vm_fault_dontneed() should acquire a write lock on the first object in
the shadow chain.

Reported by:	gleb, David Wolfskill
2015-04-05 20:07:33 +00:00
Dmitry Chagin
1d2c0c460c Fix wrong kassert msg in uma.
PR:		199172
Submitted by:	luke.tw gmail com
MFC after:	1 week
2015-04-05 18:25:23 +00:00
Alan Cox
a8b0f1009d Replace vm_fault()'s heuristic for automatic cache behind with a heuristic
that performs the equivalent of an automatic madvise(..., MADV_DONTNEED).
The current heuristic, even with the improvements that I made a few years
ago, is a good example of making the wrong trade-off, or optimizing for
the infrequent case.  The infrequent case being reading a single file that
is much larger than memory using mmap(2).  And, in this case, the page
daemon isn't the bottleneck; it's the I/O.

In all other cases, the current heuristic has too many false positives,
i.e., it caches too many pages that are later reused.  To give one
example, thousands of pages are cached by the current heuristic during a
buildworld and all of them are reactivated before the buildworld
completes.  In particular, clang reads source files using mmap(2) and
there are some relatively large source files in our source tree, e.g.,
sqlite, that are read multiple times.  With the new heuristic, I see fewer
false positives and they have a much lower cost.

I actually tried something like this more than two years ago and it
didn't perform as well as the cache behind heuristic.  However, that was
before the changes to the page daemon in late summer of 2013 and the
existence of pmap_advise().  In particular, with the page daemon doing
its work more frequently and in smaller batches, it now completes its
work while the application accessing the file is blocked on I/O.
Whereas previously, the page daemon appeared to hog the CPU for so long
that it caused "hiccups" in the application's execution.

Finally, I'll add that the elimination of cache pages is a prerequisite
for NUMA support.

Reviewed by:	jeff, kib
Sponsored by:	EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
Ryan Stone
f2c2231e0c Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
Gleb Smirnoff
f6d6b5e262 Catch up on r271387 and remove unused parameter from
VOP_GETPAGES_ASYNC().
2015-03-30 22:49:26 +00:00