Commit Graph

66 Commits

Author SHA1 Message Date
Andrew Gallatin
edb2994a62 Conditionally move initial vfs bio alloc above 4G
On machines with just the wrong amount of physical memory (enough to
have a lot of bufs, but not enough to use VM_FREELIST_DMA32) it is
possible for 32-bit address limited devices to have little to no
memory left when attaching, due to potentially large vfs bio configs
consuming all memory below 4GB not protected by VM_FREELIST_ISADMA.
This causes the 32-bit devices to allocate from VM_FREELIST_ISADMA,
leaving that freelist emtpy when ISA devices need DMAable memory.

Rather than decrease VM_DMA32_NPAGES_THRESHOLD, use the time honored
technique of putting initially allocated kernel data structs
at the end (or at least not the beginning) of memory.

Since this allocation is done at boot and is wired, is not freed,
so the system is low on 32-bit (and ISA) dma'ble memory forever.
So it is a good candidate to move above 4GB.

While here, remove an unneeded round_page() from kmem_malloc's size
argument as suggested by alc.  The first thing kmem_malloc() does
is a round_page(size), so there is no need to do it before the call.

Reviewed by: alc
Sponsored by: Netflix
2016-10-03 13:23:43 +00:00
Jeff Roberson
21fae96123 Parallelize the buffer cache and rewrite getnewbuf(). This results in a
8x performance improvement in a micro benchmark on a 4 socket machine.

 - Get buffer headers from a per-cpu uma cache that sits in from of the
   free queue.
 - Use a per-cpu quantum cache in vmem to eliminate contention for kva.
 - Use multiple clean queues according to buffer cache size to eliminate
   clean queue lock contention.
 - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
   from blocking or doing direct recycling.
 - Close some bufspace allocation races that could lead to endless
   recycling.
 - Further the transition to a more modern style of small functions grouped
   by prefix in order to improve growing complexity.

Sponsored by:	EMC / Isilon
Reviewed by:	kib
Tested by:	pho
2015-10-14 02:10:07 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
John Baldwin
edb572a38c Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
Jeff Roberson
d91722fb27 - Use an arbitrary but reasonably large import size for kva on architectures
that don't support superpages.  This keeps the number of spans and internal
   fragmentation lower.
 - When the user asks for alignment from vmem_xalloc adjust the imported size
   by 2*align to be certain we can satisfy the allocation.  This comes at
   the expense of potential failures when the backend can't supply enough
   memory but could supply the requested size and alignment.

Sponsored by:	EMC / Isilon Storage Division
2013-08-19 23:02:39 +00:00
John Baldwin
5aa60b6f21 Add new mmap(2) flags to permit applications to request specific virtual
address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
  Requests for n >= number of bits in a pointer or less than the size of
  a page fail with EINVAL.  This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED.  It can be used
  to optimize the chances of using large pages.  By default it will align
  the mapping on a large page boundary (the system is free to choose any
  large page size to align to that seems best for the mapping request).
  However, if the object being mapped is already using large pages, then
  it will align the virtual mapping to match the existing large pages in
  the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
  VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
  MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
  MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
  explicitly using VMFS_SUPER_SPACE.  All device objects are forced to
  use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
  equivalent.

Reviewed by:	alc
MFC after:	1 month
2013-08-16 21:13:55 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Jeff Roberson
5f51836645 - Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
   of usenix 2001.  The NetBSD version was heavily refactored for bugs
   and simplicity.
 - Use this resource allocator to allocate the buffer and transient maps.
   Buffer cache defrags are reduced by 25% when used by filesystems with
   mixed block sizes.  Ultimately this may permit dynamic buffer cache
   sizing on low KVA machines.

Discussed with:	alc, kib, attilio
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-06-28 03:51:20 +00:00
Konstantin Belousov
7db07e1c85 Only size and create the bio_transient_map when unmapped buffers are
enabled.  Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.

Sponsored by:	The FreeBSD Foundation
2013-03-21 07:28:15 +00:00
Konstantin Belousov
ee75e7de7b Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer.  For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Andre Oppermann
15ae0c9af9 Move the callout subsystem initialization to its own SYSINIT()
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.

Allocation of the callout array is changed to stardard kernel malloc
from a slightly obscure direct kernel_map allocation.

kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.

Reviewed by:	davide
2013-03-08 10:37:17 +00:00
Alan Cox
2af6e14d39 Introduce exec_alloc_args(). The objective being to encapsulate the
details of the string buffer allocation in one place.

Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name.  The pointer to the interpreter name can simply be
made to point to the appropriate argument string.

Reviewed by:	kib
2010-07-27 17:31:03 +00:00
Alan Cox
9e4e511499 Change the order in which the file name, arguments, environment, and
shell command are stored in exec*()'s demand-paged string buffer.  For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%.  It also eliminates about
330k page faults on the kernel address space.

Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings.  Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings.  While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.

Clean up the initialization of the exec map.  In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated.  The old size was one page too
large.  (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)

Reviewed by:	kib
2010-07-25 17:43:38 +00:00
Alan Cox
fd776d18a0 Align the start of the clean submap to a superpage boundary. Although
no superpage mappings are created within the clean submap, aligning the
start of the clean submap helps to prevent interference with kmem_alloc()'s
use of superpages.
2010-02-21 22:23:13 +00:00
John Baldwin
5bd65606f4 Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints.  Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva.  Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers.  Now one has to overflow a long to see
such problems.  There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf.  I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after:	1 month
2009-03-09 19:35:20 +00:00
Alan Cox
3202ed7523 Introduce a new parameter "superpage_align" to kmem_suballoc() that is
used to request superpage alignment for the submap.

Request superpage alignment for the kmem_map.

Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find().  (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)

Remove a stale comment from kmem_malloc().
2008-05-10 21:46:20 +00:00
Robert Watson
237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
Kris Kennaway
f5fca0d8be Add the vm.exec_map_entries tunable and read-only sysctl, which controls
the number of entries in exec_map (maximum number of simultaneous execs
that can be handled by the kernel).  The default value of 16 is
insufficient on heavily loaded machines (particularly SMP machines), and
if it is exceeded then executing further processes will generate a SIGABRT.

This is a workaround until a better solution can be implemented.

Reviewed by:	alc
MFC after:	3 days
2005-04-25 19:22:05 +00:00
Warner Losh
60727d8b86 /* -> /*- for license, minor formatting changes 2005-01-07 02:29:27 +00:00
Alan Cox
e0b47a134b Remove dead code. A vm_map's first_free is never NULL (even if the map is
full).

(This is preparation for an O(log n) implementation of vm_map_findspace().)

Submitted by:	Mark W. Krentel
2004-08-07 05:58:31 +00:00
Alan Cox
41f1b2c460 The demise of vm_pager_map_page() in revision 1.93 of vm/vm_pager.c permits
the reduction of the pager map's size by 8M bytes.  In other words, eight
megabytes of largely wasted KVA are returned to the kernel map for use
elsewhere.
2004-04-08 19:08:49 +00:00
Warner Losh
05eb3785e7 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-06 20:15:37 +00:00
Alan Cox
bdb93eb248 Remove unused arguments from pmap_init(). 2004-04-05 00:37:50 +00:00
Alan Cox
889eb0fc62 Eliminate unused arguments from vm_page_startup(). 2004-04-04 23:33:36 +00:00
Eivind Eklund
2ae51145e8 Change clean_map from a global to an auto variable 2003-09-01 16:46:47 +00:00
Mike Silbersack
cebde06978 More pipe changes:
From alc:
Move pageable pipe memory to a seperate kernel submap to avoid awkward
vm map interlocking issues.  (Bad explanation provided by me.)

From me:
Rework pipespace accounting code to handle this new layout, and adjust
our default values to account for the fact that we now have a solid
limit on allocations.

Also, remove the "maxpipes" limit, as it no longer has a purpose.
(The limit on kva usage solves the problem of having two many pipes.)
2003-08-11 05:51:51 +00:00
Robert Drehmel
857961d925 Avoid an unnecessary calculation: there is no need to subtract
`firstaddr' from `v' if we know that the former equals zero.
2003-07-13 21:02:11 +00:00
David E. O'Brien
874651b13c Use __FBSDID(). 2003-06-11 23:50:51 +00:00
Thomas Moestl
0fca57b8b8 Move the definitions of the hw.physmem, hw.usermem and hw.availpages
sysctls to MI code; this reduces code duplication and makes all of them
available on sparc64, and the latter two on powerpc.
The semantics by the i386 and pc98 hw.availpages is slightly changed:
previously, holes between ranges of available pages would be included,
while they are excluded now. The new behaviour should be more correct
and brings i386 in line with the other architectures.

Move physmem to vm/vm_init.c, where this variable is used in MI code.
2002-11-07 23:57:17 +00:00
Peter Wemm
447b3772dc Change hw.physmem and hw.usermem to unsigned long like they used to be
in the original hardwired sysctl implementation.

The buf size calculator still overflows an integer on machines with large
KVA (eg: ia64) where the number of pages does not fit into an int.  Use
'long' there.

Change Maxmem and physmem and related variables to 'long', mostly for
completeness.  Machines are not likely to overflow 'int' pages in the
near term, but then again, 640K ought to be enough for anybody.  This
comes for free on 32 bit machines, so why not?
2002-08-30 04:04:37 +00:00
Jeff Roberson
670d17b5c0 Remove references to vm_zone.h and switch over to the new uma API. 2002-03-20 04:02:59 +00:00
Alfred Perlstein
11caded34f Remove __P. 2002-03-19 22:20:14 +00:00
Jeff Roberson
8355f576a9 This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by:	arch@
2002-03-19 09:11:49 +00:00
Eivind Eklund
a128794977 - Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
  while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by:	alc
2002-03-10 21:52:48 +00:00
Matthew Dillon
219d632c15 Move most of the kernel submap initialization code, including the
timeout callwheel and buffer cache, out of the platform specific areas
and into the machine independant area.  i386 and alpha adjusted here.
Other cpus can be fixed piecemeal.

Reviewed by:    freebsd-smp, jake
2001-08-22 04:07:27 +00:00
Matthew Dillon
0cddd8f023 With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
John Baldwin
874468957d Sort includes from previous commit. 2001-05-22 05:35:45 +00:00
Alfred Perlstein
2395531439 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
Mark Murray
fb919e4d5a Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by:	bde (with reservations)
2001-05-01 08:13:21 +00:00
John Baldwin
0f68b6595a Add missing include. 2001-01-24 06:54:24 +00:00
Dag-Erling Smørgrav
e78411656b Call vm_zone_init() at the appropriate time.
Reviewed by:	jasone, jhb
2001-01-22 07:02:42 +00:00
Philippe Charnier
5929bcfaba Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce
2000-03-27 20:41:17 +00:00
Philippe Charnier
956f31353c Spelling 2000-03-26 15:20:23 +00:00
Poul-Henning Kamp
923502ff91 useracc() the prequel:
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>.  This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
1999-10-29 18:09:36 +00:00
Peter Wemm
c3aac50f28 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
Eivind Eklund
0b08f5f737 Back out DIAGNOSTIC changes. 1998-02-06 12:14:30 +00:00
Eivind Eklund
47cfdb166d Turn DIAGNOSTIC into a new-style option. 1998-02-04 22:34:03 +00:00