Commit Graph

144 Commits

Author SHA1 Message Date
Navdeep Parhar
fddd4f6273 Plug leak in m_unshare.
m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE.  The fix is to clear
the bits not listed in M_COPYALL before calling m_getcl.  M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.

Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.

Update netmap_get_mbuf to not pass M_NOFREE to m_getcl.  It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.

Reviewed by:	gnn@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5698
2016-03-26 23:39:53 +00:00
Gleb Smirnoff
c8f59118be Space and style(9) corrections for recent mbuf changes. 2016-03-24 20:06:52 +00:00
George V. Neville-Neil
480f4e946d Add an mbuf provider to DTrace.
The mbuf provider is made up of a set of Statically Defined Tracepoints
which help us look into mbufs as they are allocated and freed.  This can be
used to inspect the buffers or for a simplified mbuf leak detector.

New tracepoints are:

mbuf:::m-init
mbuf:::m-gethdr
mbuf:::m-get
mbuf:::m-getcl
mbuf:::m-clget
mbuf:::m-cljget
mbuf:::m-cljset
mbuf:::m-free
mbuf:::m-freem

There is also a translator for mbufs which gives some visibility into the structure,
see mbuf.d for more details.

Reviewed by:	bz, markj
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D5682
2016-03-22 13:16:52 +00:00
Gleb Smirnoff
0ea37a86ba Fix regression in r296242 affecting several drivers. For EXT_NET_DRV,
EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free
callback, then free the mbuf, otherwise we will derefernce memory that
was just freed.

Reported and tested:	jhibbits
2016-03-02 02:12:01 +00:00
Gleb Smirnoff
56a5f52e80 New way to manage reference counting of mbuf external storage.
The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with:		rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D5396
Sponsored by:		Netflix
2016-03-01 00:17:14 +00:00
Gleb Smirnoff
5e4bc63b7c o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all
mbuf(9) manipulation functions into uipc_mbuf.c.  This looks like
  the initial intent, but had diffused in the last decade.

o Gather all declarations in mbuf.h in one place and sort them.

o Uninline m_clget() and m_cljget().

There are no functional changes in this patch.

The patch comes from a larger version, where all mbuf(9) allocation was
uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c.
The performance impact of the total uninlining is still unclear, so we
are holding on now with larger version.

Together with:	melifaro, olivier
2016-02-11 21:32:23 +00:00
Gleb Smirnoff
b4b12e52fb Garbage collect unused arguments of m_init(). 2016-02-10 18:54:18 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Gleb Smirnoff
9542ea7b80 Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.
2016-02-03 22:02:36 +00:00
Jonathan T. Looney
54503a13d8 Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
Ryan Stone
f2c2231e0c Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
Navdeep Parhar
a9fa76f27e Test for absence of M_NOFREE before attempting to purge the mbuf's tags.
This will leave more state intact should the assertion go off.

MFC after:	1 month
2014-09-30 23:16:26 +00:00
Gleb Smirnoff
1fbe6a82f4 Improve reference counting of EXT_SFBUF pages attached to mbufs.
o Do not use UMA refcount zone. The problem with this zone is that
  several refcounting words (16 on amd64) share the same cache line,
  and issueing atomic(9) updates on them creates cache line contention.
  Also, allocating and freeing them is extra CPU cycles.
  Instead, refcount the page directly via vm_page_wire() and the sfbuf
  via sf_buf_alloc(sf_buf_page(sf)) [1].

o Call refcounting/freeing function for EXT_SFBUF via direct function
  call, instead of function pointer. This removes barrier for CPU
  branch predictor.

o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to
  satisfy assertion in mb_dtor_mbuf(). Remove the assertion from
  mb_dtor_mbuf(). Use bcopy() instead of manual assignments to
  copy m_ext in mb_dupcl().

[1] This has some problems for now. Using sf_buf_alloc() merely to
    increase refcount is expensive, and is broken on sparc64. To be
    fixed.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-07-11 19:40:50 +00:00
Gleb Smirnoff
fcc34a238c Fix style bug: rename the refcount field of m_ext to ext_cnt, to match
other members.

Sponsored by:	Nginx, Inc.
2014-07-11 14:34:29 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Ed Maste
b7bd677fe1 Initialise m_pkthdr via bzero instead of explicitly zeroing each member
Sponsored by:	The FreeBSD Foundation
2014-04-04 21:09:06 +00:00
John Baldwin
d251e7006b Ignore attempts to set the nmbcluster sysctls to their current value
rather than failing with an error.

Reviewed by:	andre
Approved by:	re (delphij)
MFC after:	2 weeks
2013-10-10 16:11:34 +00:00
Hiren Panchasara
b6f49c23a3 Fixing a small typo.
Reviewed by:	gjb
Approved by:	sbruno (mentor)
2013-09-05 18:18:23 +00:00
Andre Oppermann
ce28636bcf After r254779 "error" must always be present in mb_ctor_pack(),
not only when MAC is defined.

Reported by:	gjb / tinderbox
Sponsored by:	The FreeBSD Foundation
2013-08-24 21:25:53 +00:00
Andre Oppermann
ce6169e715 Remove unused m_free_fast(). The difference to m_free() is only
2 predictable branches nowadays.  However as a pre-condition the
caller had to ensure that the mbuf pkthdr did not have any mtags
attached to it, costing some potential branches again.

Sponsored by:	The FreeBSD Foundation
2013-08-24 21:09:57 +00:00
Andre Oppermann
1b4381afbb Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
Andre Oppermann
894734cbd6 dd a 24 bits wide ext_flags field to m_ext by reducing ext_type
to 8 bits.  ext_type is an enumerator and the number of types we
have is a mere dozen.

A couple of ext_types are renumbered to fit within 8 bits.

EXT_VENDOR[1-4] and EXT_EXP[1-4] types for vendor-internal and
experimental local mapping.

The ext_flags field is currently unused but has a couple of flags
already defined for future use.  Again vendor and experimental
flags are provided for local mapping.

EXT_FLAG_BITS is provided for the printf(9) %b identifier.

Initialize and copy ext_flags in the relevant mbuf functions.

Improve alignment and packing of struct m_ext on 32 and 64 archs
by carefully sorting the fields.
2013-08-24 13:15:42 +00:00
Andre Oppermann
afb295cc9a Avoid code duplication for mbuf initialization and use m_init() instead
in mb_ctor_mbuf() and mb_ctor_pack().
2013-08-24 12:24:58 +00:00
Andre Oppermann
1227f20d26 Revert r254520 and resurrect the M_NOFREE mbuf flag and functionality.
Requested by:	np, grehan
2013-08-21 18:12:04 +00:00
Andre Oppermann
aa3cb8fb64 Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree users
for a very long time, if ever.

Should such a functionality ever be needed again the appropriate and
much better way to do it is through a custom EXT_SOMETHING external mbuf
type together with a dedicated *ext_free function.

Discussed with:	trociny, glebius
2013-08-19 11:16:53 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Gleb Smirnoff
59b9c4f289 Nuke mbstat. It wasn't used for mbuf statistics since FreeBSD 5.
Now that r253351 moved sendfile() stats to a separate struct, the
last field used in mbstat is m_mcfail, which is updated, but never
read or obtained from userland.
2013-07-15 12:18:36 +00:00
Andrey V. Elsukov
05d1f5bce0 Introduce new structure sfstat for collecting sendfile's statistics
and remove corresponding fields from struct mbstat. Use PCPU counters
and SFSTAT_INC() macro for update these statistics.

Discussed with:	glebius
2013-07-15 06:16:57 +00:00
Andre Oppermann
bc4a1b8ccd Make use of the fact that uma_zone_set_max(9) already returns the
rounded limit making a call to uma_zone_get_max(9) unnecessary.

MFC after:	1 day
2013-07-11 12:53:13 +00:00
Andre Oppermann
e0c00adda2 Fix style issues, a typo in "kern.ipc.nmbufs" and correctly plave and
expose the value of the tunable maxmbufmem as "kern.ipc.maxmbufmem"
through sysctl.

Reported by:	smh
MFC after:	1 day
2013-07-11 12:46:35 +00:00
Julian Elischer
4591f0d339 Initialising the new fibnum field to a known value turns out to
be a GOOD IDEA (TM).
Apparently MOST users set this (e.g. tcp and friends) but there are a few
users that just assume that it is a sensible value but then go on to read it.
These include SCTP, pf and the FLOWTABLE option (and maybe others).
2013-05-24 02:18:37 +00:00
Andre Oppermann
2ebcc8ac4a Base the calculation of maxmbufmem in part on kmem_map size
instead of kernel_map size to prevent kernel memory exhaustion
by mbufs and a subsequent panic on physical page allocation
failure.

On architectures without a direct map all mbuf memory (except
for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map.
It is the limiting factor hence.

For architectures with a direct map using the size of kmem_map
is a good proxy of available kernel memory as well.  If it is
much smaller the mbuf limit may be sub-optimal but remains
reasonable, while avoiding panics under exhaustion.

The overall mbuf memory limit calculation may be reconsidered
again later, however due to the many different mbuf sizes and
different backing KVM maps it is a tricky subject.

Found by:	pho's new network stress test
Pointed out by:	alc (kmem_map instead of kernel_map)
Tested by:	pho
2013-04-24 13:54:55 +00:00
Andre Oppermann
371407162b Move the mbuf memory limit calculations from init_param2() to
tunable_mbinit() where it is next to where it is used later.

Change the sysinit level of tunable_mbinit() from SI_SUB_TUNABLES
to SI_SUB_KMEM after the VM is running.  This allows to use better
methods to determine the effectively available physical and virtual
memory available to the kernel.

Update comments.

In a second step it can be merged into mbuf_init().
2013-01-17 21:28:31 +00:00
Pawel Jakub Dawidek
6e0b674628 Configure UMA warnings for the following zones:
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached

Note that those warnings are printed not often than every five minutes and can
be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.

Discussed on:	arch
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-12-07 22:30:30 +00:00
Pawel Jakub Dawidek
45fe0bf7e4 Make use of the fact that uma_zone_set_max(9) already returns actual limit set. 2012-12-07 22:23:53 +00:00
Pawel Jakub Dawidek
4007b61cde More style cleanups. 2012-12-07 22:22:04 +00:00
Pawel Jakub Dawidek
b0b1402537 Style cleanups. 2012-12-07 22:19:41 +00:00
Andre Oppermann
416a434cd0 Complete r243631 by applying the remainder of kern_mbuf.c that got
lost while merging into the commit tree.

MFC after:	1 month
X-MFC-with:	r243631
2012-11-27 23:16:56 +00:00
Andre Oppermann
ead46972a4 Base the mbuf related limits on the available physical memory or
kernel memory, whichever is lower.  The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.

The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline.  In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily.  Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.

At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers.  This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.

Tidy up ordering in init_param2() and check up on some users of
those values calculated here.

Out of the overall mbuf memory limit 2K clusters and 4K (page size)
clusters to get 1/4 each because these are the most heavily used mbuf
sizes.  2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets.  The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit.  When jumbo MTU's are used
these large clusters will end up only on the inbound path.  They are
not used on outbound, there it's still 4K.  Yes, that will stay that
way because otherwise we run into lots of complications in the
stack.  And it really isn't a problem, so don't make a scene.

Normal mbufs (256B) weren't limited at all previously.  This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.

The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster.  Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such a limiting.

NB: Every cluster also has an mbuf associated with it.

Two examples on the revised mbuf sizing limits:

1GB KVM:
 512MB limit for mbufs
 419,430 mbufs
  65,536 2K mbuf clusters
  32,768 4K mbuf clusters
   9,709 9K mbuf clusters
   5,461 16K mbuf clusters

16GB RAM:
 8GB limit for mbufs
 33,554,432 mbufs
  1,048,576 2K mbuf clusters
    524,288 4K mbuf clusters
    155,344 9K mbuf clusters
     87,381 16K mbuf clusters

These defaults should be sufficient for even the most demanding
network loads.

MFC after:	1 month
2012-11-27 21:19:58 +00:00
Alfred Perlstein
79f62ed690 Allow maxusers to scale on machines with large address space.
Some hooks are added to clamp down maxusers and nmbclusters for
small address space systems.

VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on
physical memory.
VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory.

These are set to the old values on i386 to preserve the clamping that was
being done to all arches.

Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override
for the calculation on a MD basis.  Currently no arch defines this.

Reviewed by: peter
MFC after: 2 weeks
2012-11-10 02:08:40 +00:00
Kevin Lo
a2c36a0234 Since the macro dtom() has been removed, fix comments about the dtom.
Reviewed by:	glebius
2012-10-29 10:04:28 +00:00
Navdeep Parhar
812302c3eb Allow nmbjumbop, nmbjumbo9, and nmbjumbo16 to be set directly via loader
tunables.

MFC after:	1 month
2012-08-23 21:32:02 +00:00
Ed Schouten
60ae52f785 Use ISO C99 integer types in sys/kern where possible.
There are only about 100 occurences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.
2010-06-21 09:55:56 +00:00
Alan Cox
3153e878dd Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t.  The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes.  Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures.  The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map.  The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

  kmem_alloc_contig() can now be used to allocate kernel memory with
  non-default memory attributes on amd64 and i386.

  vm_page_alloc() and the device pager will set the memory attributes
  for the real or fictitious page according to the object's default
  memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386.  In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by:	re (kib)
2009-07-12 23:31:20 +00:00
Alan Cox
e999111ae7 This change is the next step in implementing the cache control functionality
required by video card drivers.  Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures.  In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig().  These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.

In collaboration with:	jhb
2009-06-26 04:47:43 +00:00
Kip Macy
5b204a113c define helper routines for deferred mbuf initialization 2009-06-19 21:14:39 +00:00
Alan Cox
c45c00344f Utilize the new function kmem_alloc_contig() to implement the UMA back-end
allocator for the jumbo frames zones.  This change has two benefits: (1) a
custom back-end deallocator is no longer required.  UMA's standard
deallocator suffices. (2) It eliminates a potentially confusing artifact
of using contigmalloc(): The malloc(9) statistics contain bogus information
about the usage of jumbo frames.  Specifically, the malloc(9) statistics
report all jumbo frames in use whereas the UMA zone statistics report the
"truth" about the number in use vs. the number free.
2009-06-18 17:59:04 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
877e88123e Temporary workaround for the limitations of the mbuf flowid field: zero
the field in the mbuf constructors, since otherwise we have no way to
tell if they are valid.  In the future, Kip has plans to add a flag
specifically to indicate validity, which is the preferred model.
2009-01-01 20:03:01 +00:00
Ruslan Ermilov
98bd6a1982 Removed a comment made obsolete by revisions 157927 and 174292. 2008-12-18 15:56:12 +00:00
Bjoern A. Zeeb
629386598e Make sure nmbclusters are initialized before maxsockets
by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE
before the init_maxsockets() SYSINT at SI_ORDER_ANY.

Reviewed by:		rwatson, zec
Sponsored by:		The FreeBSD Foundation
MFC after:		4 weeks
2008-12-10 22:17:09 +00:00
Kip Macy
8aa7a58108 make kern.ipc.nmbclusters actually have a useful effect on nmbclusters et al.
initialize pkthdr in field order
2008-11-09 01:53:06 +00:00
Alan Cox
7630c26507 Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.)  UMA_SLAB_KERNEL is now required by the jumbo frame
allocators.  Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object.  In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT.  This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL.  This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks
2008-04-04 18:41:12 +00:00
Robert Watson
237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
Poul-Henning Kamp
cf827063a9 Give MEXTADD() another argument to make both void pointers to the
free function controlable, instead of passing the KVA of the buffer
storage as the first argument.

Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.

Likely break the only non-convetional user of the API, after informing
the relevant committer.

Update the mbuf(9) manual page, which was already out of sync on
this point.

Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.

This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.

This does not affect the memory layout or size of mbufs.

Parental oversight by:	sam and rwatson.

No MFC is anticipated.
2008-02-01 19:36:27 +00:00
Randall Stewart
6b4959bf2f - fix tab to space issue, hmm maybe I should use vi. 2007-12-15 23:14:53 +00:00
Randall Stewart
cf70a46b47 - Puts default limits on 4k/9k and 16k zones for mbufs all based
on 1/2 of each of the successive limits tied to the limit for
  2k clusters.
- Adds real functionality in so that doing a sysctl to change these
  actually changes them :-)

MFC after:	1 week
2007-12-05 15:29:44 +00:00
Alan Cox
ba63339a0a Introduce an UMA backend page allocator for the jumbo frame zones that
allocates physically contiguous memory.

MFC after: 3 months
Requested and reviewed by: Kip Macy
Tested by: Andrew Gallatin and Pyun YongHyeon
2007-12-04 07:06:08 +00:00
David E. O'Brien
ef44c8d2a3 style(9) 2007-10-26 16:33:47 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Kip Macy
457869b973 This patch adds an M_NOFREE flag which allows one to mark an mbuf as
not being independently freeable. This allows one to embed an mbuf in
the cluster itself. This confers the benefits of the packet zone on
all cluster sizes. Embedded mbufs currently suffer from the same
limitation that packet zone mbufs do in that one cannot disconnect
them and pass them around independently of the cluster. It would
likely be possible to eliminate this limitation in the future by
adding a second reference for the mbuf itself.

Approved by: re(gnn)
2007-10-06 21:42:39 +00:00
Kip Macy
629b9e0853 Allow drivers to free an mbuf without having the mbuf be touched if
the driver has already freed any attached tags

Approved by: re(gnn)
2007-10-06 21:13:55 +00:00
David Malone
041b706b2f Despite several examples in the kernel, the third argument of
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.

Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.

In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported.  In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.
2007-06-04 18:25:08 +00:00
Kip Macy
0f4d9d04ea Fix mb_ctor_clust and mb_dtor_clust to reference the appropriate zone,
simplify setting refcnt

Reviewed by: andre, rwatson, and glebius
MFC after: 3 days
2007-04-04 21:27:01 +00:00
Mohan Srinivasan
6c125b8df6 Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@
2007-01-25 01:05:23 +00:00
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
Andre Oppermann
a855e2b4c0 Remove VLAN mtag UMA zones and initialize ether_vtag and tso_segsz packet
header fields to zero on mbuf allocation.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-09-17 13:44:32 +00:00
Robert Watson
b37ffd3189 Move some functions and definitions from uipc_socket2.c to uipc_socket.c:
- Move sonewconn(), which creates new sockets for incoming connections on
  listen sockets, so that all socket allocate code is together in
  uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
  socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
  to sysctl.h and remove lots of scattered implementations in various
  IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
  reasons.  Statisticize soalloc() and sodealloc() as they are now
  required only in uipc_socket.c, and are internal to the socket
  implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after:	1 month
2006-06-10 14:34:07 +00:00
Paul Saab
4f590175b7 Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.
2006-04-21 09:25:40 +00:00
Andre Oppermann
a7bd90ef93 Properly handle the case when the packet secondary zone can't allocate
further mbuf clusters to attach to mbufs.

Reported by:	kris
Tested by:	kris
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-03-08 14:05:38 +00:00
Gleb Smirnoff
73bb09f2d0 One more grammar nit.
Submitted by:	ru
2006-02-27 07:22:32 +00:00
Gleb Smirnoff
fcf9061858 Fix several typos and trim spaces at eol.
PR:		kern/93759
Submitted by:	Antoine Brodin <antoine.brodin laposte.net>
2006-02-26 11:44:28 +00:00
Andre Oppermann
ec63cb90a3 Replace the 4k fixed sized jumbo mbuf clusters with PAGE_SIZE sized
jumbo mbuf clusters.  To make the variable size clear they are named
MJUMPAGESIZE.

Having jumbo clusters with the native PAGE_SIZE is more useful than
a fixed 4k size according the device driver writers using this API.

The 9k and 16k jumbo mbuf clusters remain unchanged.

Requested by:	glebius, gallatin
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-02-17 14:14:15 +00:00
Gleb Smirnoff
75ee267c22 Merge the //depot/user/yar/vlan branch into CVS. It contains some collective
work by yar, thompsa and myself. The checksum offloading part also involves
work done by Mihail Balikov.

The most important changes:

o   Instead of global linked list of all vlan softc use a per-trunk
  hash. The size of hash is dynamically adjusted, depending on
  number of entries. This changes struct ifnet, replacing counter
  of vlans with a pointer to trunk structure. This change is an
  improvement for setups with big number of VLANs, several interfaces
  and several CPUs. It is a small regression for a setup with a single
  VLAN interface.
    An alternative to dynamic hash is a per-trunk static array with
  4096 entries, which is a compile time option - VLAN_ARRAY. In my
  experiments the array is not an improvement, probably because such
  a big trunk structure doesn't fit into CPU cache.
o   Introduce an UMA zone for VLAN tags. Since drivers depend on it,
  the zone is declared in kern_mbuf.c, not in optional vlan(4) driver.
  This change is a big improvement for any setup utilizing vlan(4).
o   Use rwlock(9) instead of mutex(9) for locking. We are the first
  ones to do this! :)
o   Some drivers can do hardware VLAN tagging + hardware checksum
  offloading. Add an infrastructure for this. Whenever vlan(4) is
  attached to a parent or parent configuration is changed, the flags
  on vlan(4) interface are updated.

In collaboration with:	yar, thompsa
In collaboration with:	Mihail Balikov <mihail.balikov interbgc.com>
2006-01-30 13:45:15 +00:00
Andre Oppermann
fd2413398c In mb_zinit_pack() explicitly ignore the return value of uma_zalloc_arg().
The success of the cluster allocation is checked through a field in the
mbuf structure.  This change is non-functional but properly silences code
inspection tools.

Found by:	Coverity Prevent(tm)
Coverity ID:	CID807
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-01-23 15:49:01 +00:00
Andre Oppermann
36ae3fd3c3 Hide the 4k mbuf clusters if the normal clusters are defined to be
4k already.

This unbreaks tinderbox.

Submitted by:	ru
2005-12-10 15:21:04 +00:00
Andre Oppermann
d5269a636b Add an API for jumbo mbuf cluster allocation and also provide
4k clusters in addition to 9k and 16k ones.

 struct mbuf *m_getjcl(int how, short type, int flags, int size)
 void *m_cljget(struct mbuf *m, int how, int size)

m_getjcl() returns an mbuf with a cluster of the specified size attached
like m_getcl() does for 2k clusters.

m_cljget() is different from m_clget() as it can allocate clusters
without attaching them to an mbuf.  In that case the return value
is the pointer to the cluster of the requested size.  If an mbuf was
specified, it gets the cluster attached to it and the return value
can be safely ignored.

For size both take MCLBYTES, MJUM4BYTES, MJUM9BYTES, MJUM16BYTES.

Reviewed by:	glebius
Tested by:	glebius
Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-12-08 13:13:06 +00:00
Gleb Smirnoff
49d46b616e Fix panic string in last revision. 2005-11-06 16:47:59 +00:00
Andre Oppermann
cd5bb63b3d Free only those mbuf+clusters back to the packet zone that were allocated
from there.  All others get broken up and free'd individually to the mbuf
and cluster zones.

The packet zone is a secondary zone to the mbuf zone.  There is currently
a limitation in UMA which prevents decreasing the packet zone stock when
the mbuf and cluster zone are drained and all their members are part of
packets.  When this is fixed this change may be reverted.
2005-11-05 19:43:55 +00:00
Andre Oppermann
a5f7708723 Fix a logic error introduced with mandatory mbuf cluster refcounting and
freeing of mbufs+clusters back to the packet zone.
2005-11-04 17:20:53 +00:00
Andre Oppermann
56a4e45ab3 Mandatory mbuf cluster reference counting and groundwork for UMA
based jumbo 9k and jumbo 16k cluster support.

All mbuf's with external storage attached are mandatory reference
counted.  For clusters and jumbo clusters UMA provides the refcnt
storage directly.  It does not have to be separatly allocated.  Any
other type of external storage gets its own refcnt allocated from
an UMA mbuf refcnt zone instead of normal kernel malloc.

The refcount API MEXT_ADD_REF() and MEXT_REM_REF() is no longer
publically accessible.  The proper m_* functions have to be used.

mb_ctor_clust() and mb_dtor_clust() both handle normal 2K as well
as 9k and 16k clusters.

Clusters and jumbo clusters may be obtained without attaching it
immideatly to an mbuf.  This is for high performance cluster
allocation in network drivers where mbufs are attached after the
cluster has been filled.

Tested by:	rwatson
Sponsored by:	TCP/IP Optimizations Fundraise 2005
2005-11-02 16:20:36 +00:00
Robert Watson
32a6bd9510 No longer maintain mbstat statistics for the mbuf allocator, UMA
statistics and libmemstat(3) are now used to track mbuf statistics.

MFC after:	1 month
2005-09-27 20:28:43 +00:00
Robert Watson
68352adfe7 Define four constants, MBUF_{,MEM,CLUSTER,PACKET,TAG}_MEM_NAME, which
are string names for their respective UMA zones and malloc types, and
are passed into uma_zcreate() and MALLOC_DEFINE().  Export them
outside of _KERNEL in mbuf.h so that netstat can reference them.

Change the names to improve consistency, with each zone/type
associated with the mbuf allocator being prefixed mbuf_.

MFC after:	1 week
2005-07-17 14:04:03 +00:00
Mike Silbersack
a7b844d2be Fix the false memory modified after free messages some users have been
reporting - in my previous change, I missed the case where a mbuf
from the packet zone was freed back to the mbuf/packet keg, where
it was subsequently put into the mbuf zone and found not to contain
the expected trash.  This change adds the necessary trash_dtor call inside
mb_fini_pack so that everything is correct.

Thanks for Bosko for finding the bug and showing me how secondary zones
work.

Approved by:	re (dwhite)
2005-06-29 08:18:26 +00:00
Mike Silbersack
121f050976 Change the mbuf, mbuf cluster, and mbuf packet allocation routines so that
the UMA "trash" allocator is used - this ensures that any writes to a freed
mbuf should provoke a panic.

Only enabled under INVARIANTS, of course.

Approved by:	re (scottl)
2005-06-23 04:33:39 +00:00
Bosko Milekic
8076cb5289 Well, it seems that I pre-maturely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out.  This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h
2005-02-16 21:45:59 +00:00
Bosko Milekic
3d2a3ff25e Optimize the way reference counting is performed with Mbufs. We
do not need to perform an extra memory fetch in the Packet (Mbuf+Cluster)
constructor to initialize the reference counter anymore.  The reference
counts are located in a separate memory region (in the slab header,
because this zone is UMA_ZONE_REFCNT), so the memory fetch resulted very
often in a cache miss.  Additionally, and perhaps more significantly,
optimize the free mbuf+cluster (packet) case, which is very common, to
no longer require an atomic operation on free (to verify the reference
counter) if the reference on the cluster has never been increased (also
very common).  Reduces an atomic on mbuf free on average.

Original patch submitted by: Gerrit Nagelhout <gnagelhout@sandvine.com>
2005-02-10 22:23:02 +00:00
Bosko Milekic
737cd9525b Update copyright, remove "all rights reserved" (since they are not
all reserved, as the lisence makes clear), and strike the third clause
(now this is a 2-clause liberal BSDL as are the rest of files I hold
copyright over).
2005-02-01 03:17:52 +00:00
Brian Somers
a04946cf6e CTASSERT that MSZIE is a power of 2 (otherwise dtom() breaks)
Ask uma_zcreate() to align mbufs to MSIZE bytes (otherwise dtom() breaks)

As it happens, uma_zalloc_arg() always returned mbufs aligned to MSIZE
anyway, but that was an implementation side-effect....

KASSERT -> CTASSERT suggested by: dd@
Approved by:	silence on -net
2004-09-20 08:52:04 +00:00
Brian Feldman
b23f72e98a * Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
  or not.
* Allow uma_zone constructors and initialation functions to return either
  success or error.  Almost all of the ones in the tree currently return
  success unconditionally, but mbuf is a notable exception: the packet
  zone constructor wants to be able to fail if it cannot suballocate an
  mbuf cluster, and the mbuf allocators want to be able to fail in general
  in a MAC kernel if the MAC mbuf initializer fails.  This fixes the
  panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
  the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
2004-08-02 00:18:36 +00:00
Bosko Milekic
6bc72ab95a Fix a couple of bugs in the mbuf and packet ctors. In the latter case,
nextpkt within the m_hdr was not being initialized to NULL for
!M_PKTHDR cases.  *Maybe* this will fix weird socket buffer
inconsistency panics, but we'll see.
2004-06-01 16:17:10 +00:00
Bosko Milekic
099a0e588c Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
  - Better layering between slab <-> zone caches; introduce
    Keg structure which splits off slab cache away from the
    zone structure and allows multiple zones to be stacked
    on top of a single Keg (single type of slab cache);
    perhaps we should look into defining a subset API on
    top of the Keg for special use by malloc(9),
    for example.
  - UMA_ZONE_REFCNT zones can now be added, and reference
    counters automagically allocated for them within the end
    of the associated slab structures.  uma_find_refcnt()
    does a kextract to fetch the slab struct reference from
    the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
  - integrates mbuf & cluster allocations with extended UMA
    and provides caches for commonly-allocated items; defines
    several zones (two primary, one secondary) and two kegs.
  - change up certain code paths that always used to do:
    m_get() + m_clget() to instead just use m_getcl() and
    try to take advantage of the newly defined secondary
    Packet zone.
  - netstat(1) and systat(1) quickly hacked up to do basic
    stat reporting but additional stats work needs to be
    done once some other details within UMA have been taken
    care of and it becomes clearer to how stats will work
    within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used.  The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
   - One report of 'ips' (ServeRAID) driver acting really
     slow in conjunction with mbuma.  Need more data.
     Latest report is that ips is equally sucking with
     and without mbuma.
   - Giant leak in NFS code sometimes occurs, can't
     reproduce but currently analyzing; brueffer is
     able to reproduce but THIS IS NOT an mbuma-specific
     problem and currently occurs even WITHOUT mbuma.
   - Issues in network locking: there is at least one
     code path in the rip code where one or more locks
     are acquired and we end up in m_prepend() with
     M_WAITOK, which causes WITNESS to whine from within
     UMA.  Current temporary solution: force all UMA
     allocations to be M_NOWAIT from within UMA for now
     to avoid deadlocks unless WITNESS is defined and we
     can determine with certainty that we're not holding
     any locks when we're M_WAITOK.
   - I've seen at least one weird socketbuffer empty-but-
     mbuf-still-attached panic.  I don't believe this
     to be related to mbuma but please keep your eyes
     open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
    rwatson,
    brueffer,
    Ketrien I. Saihr-Kesenchedra,
    ...
Reviewed by: Lots of people (for different parts)
2004-05-31 21:46:06 +00:00