.\" Copyright (c) 2001, Matthew Dillon. Terms and conditions are those of
.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in
.\" the source tree.
.\"
.\" $FreeBSD$
.\"
.Dd May 25, 2001
.Dt TUNING 7
.Os FreeBSD
.Sh NAME
.Nm tuning
.Nd performance tuning under FreeBSD
.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP
.Pp
When using
.Xr disklabel 8
to lay out your filesystems on a hard disk it is important to remember
that hard drives can transfer data much more quickly from outer tracks
than they can from inner tracks. To take advantage of this you should
try to pack your smaller filesystems and swap closer to the outer tracks,
follow with the larger filesystems, and end with the largest filesystems.
It is also important to size system standard filesystems such that you
will not be forced to resize them later as you scale the machine up.
I usually create, in order, a 128M root, 1G swap, 128M /var, 128M /var/tmp,
3G /usr, and use any remaining space for /home.
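.Pp
The ordering described above can be sketched in a few lines of Python.
This is only an illustration: the
.Fn plan_layout
helper is hypothetical, the sizes are the ones quoted above, and the sketch
assumes that lower disk offsets correspond to the outer tracks, which is
typical but worth checking for your drive.
.Bd -literal -offset indent
```python
# Fixed partitions in the order given in the text (sizes in megabytes).
# Small filesystems and swap come first so they land on the faster tracks.
LAYOUT_MB = [
    ("/", 128),
    ("swap", 1024),
    ("/var", 128),
    ("/var/tmp", 128),
    ("/usr", 3072),
]

def plan_layout(disk_mb, layout=LAYOUT_MB, last="/home"):
    """Return (name, offset_mb, size_mb) tuples; `last` gets what is left."""
    plan, offset = [], 0
    for name, size in layout:
        plan.append((name, offset, size))
        offset += size
    remaining = disk_mb - offset
    if remaining <= 0:
        raise ValueError("disk too small for the fixed partitions")
    plan.append((last, offset, remaining))
    return plan

# Example: a hypothetical 20G disk.
for name, offset, size in plan_layout(20 * 1024):
    print(f"{name:10s} offset={offset:6d}M size={size:6d}M")
```
.Ed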
.Pp
You should typically size your swap space to approximately 2x main memory.
If you do not have a lot of RAM, though, you will generally want a lot
more swap. It is not recommended that you configure any less than
256M of swap on a system and you should keep in mind future memory
expansion when sizing the swap partition.
The kernel's VM paging algorithms are tuned to perform best when there is
at least 2x swap versus main memory. Configuring too little swap can lead
to inefficiencies in the VM page scanning code as well as create issues
later on if you add more memory to your machine. Finally, on larger systems
with multiple SCSI disks (or multiple IDE disks operating on different
controllers), we strongly recommend that you configure swap on each drive
(up to four drives). The swap partitions on the drives should be
approximately the same size. The kernel can handle arbitrary sizes but
internal data structures scale to 4 times the largest swap partition. Keeping
the swap partitions near the same size will allow the kernel to optimally
stripe swap space across the N disks. Do not worry about overdoing it a
little; swap space is the saving grace of
.Ux
and even if you do not normally use much swap, it can give you more time to
recover from a runaway program before being forced to reboot.
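.Pp
As a rough illustration of the sizing rules above, the following Python
sketch combines the 2x rule, the 256M floor, and the advice to spread
equally sized swap partitions over at most four drives.
The
.Fn swap_plan
helper and its parameter names are hypothetical and only restate the rules
of thumb; they are not part of any system tool.
.Bd -literal -offset indent
```python
MIN_SWAP_MB = 256     # floor recommended in the text
MAX_SWAP_DRIVES = 4   # the text suggests swap on up to four drives

def swap_plan(ram_mb, drives=1, planned_ram_mb=None):
    """Return (total_mb, per_drive_mb) following the 2x-RAM rule of thumb."""
    # Size for the memory you expect to have later, not only for today.
    target_ram = max(ram_mb, planned_ram_mb or 0)
    total = max(2 * target_ram, MIN_SWAP_MB)
    used = max(1, min(drives, MAX_SWAP_DRIVES))
    per_drive = -(-total // used)  # ceiling division keeps partitions equal
    return per_drive * used, per_drive

print(swap_plan(64))                               # small machine
print(swap_plan(512, drives=3))                    # three disks
print(swap_plan(512, drives=8, planned_ram_mb=1024))
```
.Ed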
.Pp
How you size your
.Em /var
partition depends heavily on what you intend to use the machine for. This
partition is primarily used to hold mailboxes, the print spool, and log
files. Some people even make
.Em /var/log
its own partition (but except for extreme cases it is not worth the waste
of a partition id). If your machine is intended to act as a mail
or print server,
or you are running a heavily visited web server, you should consider
creating a much larger partition, perhaps a gigabyte or more. It is very easy
to underestimate log file storage requirements.
.Pp
Sizing
.Em /var/tmp
depends on the kind of temporary file usage you think you will need. 128M is
the minimum we recommend. Also note that you usually want to make
.Em /tmp
a softlink to
.Em /var/tmp .
Dedicating a partition for temporary file storage is important for
two reasons: first, it reduces the possibility of filesystem corruption
in a crash, and second, it reduces the chance that a runaway process which
fills up [/var]/tmp will blow up more critical subsystems (mail,
logging, etc). Filling up [/var]/tmp is a very common problem.
.Pp
In the old days there were differences between /tmp and /var/tmp,
but the introduction of /var (and /var/tmp) led to massive confusion
by program writers so today programs haphazardly use one or the
other and thus no real distinction can be made between the two. So
it makes sense to have just one temporary directory. You can do the
softlink either way. The one thing you do not want to do is leave /tmp
on the root partition where it might cause root to fill up or possibly
corrupt root in a crash/reboot situation.
.Pp
The
.Em /usr
partition holds the bulk of the files required to support the system and
a subdirectory within it called
.Em /usr/local
holds the bulk of the files installed from the
.Xr ports 7
hierarchy. If you do not use ports all that much and do not intend to keep
system source (/usr/src) on the machine, you can get away with
a 1 gigabyte /usr partition. However, if you install a lot of ports
(especially window managers and Linux-emulated binaries), we recommend
at least a 2 gigabyte /usr and if you also intend to keep system source
on the machine, we recommend a 3 gigabyte /usr. Do not underestimate the
amount of space you will need in this partition; it can creep up and
surprise you!
.Pp
The
.Em /home
partition is typically used to hold user-specific data. I usually size it
to the remainder of the disk.
.Pp
Why partition at all? Why not create one big
.Em /
partition and be done with it? Then I don't have to worry about undersizing
things! Well, there are several reasons this is not a good idea. First,
each partition has different operational characteristics and separating them
allows the filesystem to tune itself to those characteristics. For example,
the root and /usr partitions are read-mostly, with very little writing, while
a lot of reading and writing could occur in /var and /var/tmp. By properly
partitioning your system, fragmentation introduced in the smaller, more
heavily write-loaded partitions will not bleed over into the mostly-read
partitions. Additionally, keeping the write-loaded partitions closer to
the edge of the disk (i.e. before the really big partitions instead of after
in the partition table) will increase I/O performance in the partitions
where you need it the most. Now it is true that you might also need I/O
performance in the larger partitions, but they are so large that shifting
them more towards the edge of the disk will not lead to a significant
performance improvement whereas moving /var to the edge can have a huge impact.
Finally, there are safety concerns. Having a small neat root partition that
is essentially read-only gives it a greater chance of surviving a bad crash
intact.
.Pp
Properly partitioning your system also allows you to tune
.Xr newfs 8
and
.Xr tunefs 8
parameters. Tuning
.Xr newfs 8
requires more experience but can lead to significant improvements in
performance. There are three parameters that are relatively safe to
tune:
.Em blocksize ,
.Em bytes/inode ,
and
.Em cylinders/group .
.Pp
.Fx
performs best when using 8K or 16K filesystem block sizes. The default
filesystem block size is 8K. For larger partitions it is usually a good
idea to use a 16K block size. This also requires you to specify a larger
fragment size. We recommend always using a fragment size that is 1/8
the block size (less testing has been done on other fragment size factors).
The
.Xr newfs 8
options for this would be
.Em newfs -f 2048 -b 16384 ... .
Using a block size larger than 16K can cause fragmentation of the buffer
cache and lead to lower performance.
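.Pp
The 1/8 relationship between fragment size and block size is easy to get
wrong by hand.
The Python sketch below derives the fragment size from the block size and
prints the matching option string; the
.Fn newfs_block_args
helper is hypothetical and only uses the
.Fl f
and
.Fl b
options quoted above.
.Bd -literal -offset indent
```python
def newfs_block_args(block_size=16384):
    """Build the -f/-b arguments with fragment size = block size / 8."""
    if block_size not in (8192, 16384):
        raise ValueError("the text recommends 8K or 16K blocks")
    frag_size = block_size // 8
    return ["-f", str(frag_size), "-b", str(block_size)]

# Prints: newfs -f 2048 -b 16384 ...
print("newfs", *newfs_block_args(16384), "...")
```
.Ed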
.Pp
If a large partition is intended to be used to hold fewer, larger files, such
as database files, you can increase the
.Em bytes/inode
ratio which reduces the number of inodes (maximum number of files and
directories that can be created) for that partition. Decreasing the number
of inodes in a filesystem can greatly reduce
.Xr fsck 8
recovery times after a crash. Do not use this option
unless you are actually storing large files on the partition, because if you
overcompensate you can wind up with a filesystem that has lots of free
space remaining but cannot accommodate any more files. Using
32768, 65536, or 262144 bytes/inode is recommended. You can go higher but
it will have only incremental effects on fsck recovery times. For
example,
.Em newfs -i 32768 ... .
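.Pp
To get a feel for how the
.Em bytes/inode
ratio limits the number of files, the following Python sketch estimates the
inode count for a partition.
It is a simplification: the real
.Xr newfs 8
allocates inodes per cylinder group and rounds accordingly, the 30G
partition is a made-up example, and the 4096 entry is included only as an
assumed baseline for comparison.
.Bd -literal -offset indent
```python
def approx_inodes(partition_bytes, bytes_per_inode):
    """Rough inode count: one inode per bytes_per_inode of partition space."""
    return partition_bytes // bytes_per_inode

GIB = 1024 ** 3
for density in (4096, 32768, 65536, 262144):
    print(density, approx_inodes(30 * GIB, density))
```
.Ed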
.Pp
Finally, increasing the
.Em cylinders/group
ratio has the effect of packing the inodes closer together. This can increase
directory performance and also decrease fsck times. If you use this option
at all, we recommend maxing it out. Use
.Em newfs -c 999
and newfs will error out and tell you what the maximum is, then use that.
.Pp
.Xr tunefs 8
may be used to further tune a filesystem. This command can be run in
single-user mode without having to reformat the filesystem. However, this
is possibly the most abused program in the system. Many people attempt to
increase available filesystem space by setting the min-free percentage to 0.
This can lead to severe filesystem fragmentation and we do not recommend
that you do this. Really the only tunefs option worthwhile here is turning on
.Em softupdates
with
.Em tunefs -n enable /filesystem .
(Note: In 5.x softupdates can be turned on using the -U option to newfs).
Softupdates drastically improves meta-data performance, mainly file
creation and deletion. We recommend enabling softupdates on all of your
filesystems. There are two downsides to softupdates that you should be
aware of: First, softupdates guarantees filesystem consistency in the
case of a crash but could very easily be several seconds (even a minute!)
behind updating the physical disk. If you crash you may lose more work
than otherwise. Secondly, softupdates delays the freeing of filesystem
blocks. If you have a filesystem (such as the root filesystem) which is
close to full, doing a major update of it, e.g.
.Em make installworld ,
can run it out of space and cause the update to fail.
.Sh STRIPING DISKS
In larger systems you can stripe partitions from several drives together
to create a much larger overall partition. Striping can also improve
the performance of a filesystem by splitting I/O operations across two
or more disks. The
.Xr vinum 8
and
.Xr ccd 4
utilities may be used to create simple striped filesystems. Generally
speaking, striping smaller partitions such as the root and /var/tmp,
or essentially read-only partitions such as /usr is a complete waste of
time. You should only stripe partitions that require serious I/O performance,
typically /var, /home, or custom partitions used to hold databases and web
pages. Choosing the proper stripe size is also
important. Filesystems tend to store meta-data on power-of-2 boundaries
and you usually want to reduce seeking rather than increase seeking. This
means you want to use a large off-center stripe size such as 1152 sectors
so sequential I/O does not seek both disks and so meta-data is distributed
across both disks rather than concentrated on a single disk. If
you really need to get sophisticated, we recommend using a real hardware
RAID controller from the list of
.Fx
supported controllers.
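.Pp
The effect of an off-center stripe size can be illustrated with a small
Python sketch.
It models a simple stripe in which a sector lives on disk
(sector / stripe size) modulo the number of disks, and it assumes, purely
for illustration, that meta-data sits at every multiple of 32768 sectors
on a two-disk stripe.
The
.Fn disks_hit
helper is hypothetical and does not reflect how
.Xr ccd 4
or
.Xr vinum 8
are implemented internally.
.Bd -literal -offset indent
```python
def disks_hit(stripe_sectors, ndisks, step_sectors, count):
    """Disks that hold offsets 0, step, 2*step, ... in a simple stripe."""
    return {(k * step_sectors // stripe_sectors) % ndisks
            for k in range(count)}

# Assumed for illustration: meta-data every 32768 sectors, two disks.
print(sorted(disks_hit(1024, 2, 32768, 64)))  # power-of-2 stripe: [0]
print(sorted(disks_hit(1152, 2, 32768, 64)))  # off-center stripe: [0, 1]
```
.Ed
With the power-of-2 stripe every meta-data offset in this model falls on
the first disk, while the 1152 sector stripe spreads them over both.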
.Sh SYSCTL TUNING
.Pp
There are several hundred
.Xr sysctl 8
variables in the system, including many that appear to be candidates for
tuning but actually aren't. In this document we will only cover the ones
that have the greatest effect on the system.
.Pp
The
.Em kern.ipc.shm_use_phys
sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on). Setting
this parameter to 1 will cause all SysV shared memory segments to be
mapped to unpageable physical RAM. This feature only has an effect if you
are either (A) mapping small amounts of shared memory across many (hundreds)
of processes, or (B) mapping large amounts of shared memory across any
number of processes. This feature allows the kernel to remove a great deal
of internal memory management page-tracking overhead at the cost of wiring
the shared memory into core, making it unswappable.
.Pp
The
.Em vfs.vmiodirenable
sysctl defaults to 0 (off) (though soon it will default to 1) and may be
set to 0 (off) or 1 (on). This parameter controls how directories are cached
by the system. Most directories are small and use but a single fragment
(typically 1K) in the filesystem and even less (typically 512 bytes) in
the buffer cache. However, when operating in the default mode the buffer
cache will only cache a fixed number of directories even if you have a huge
amount of memory. Turning on this sysctl allows the buffer cache to use
the VM Page Cache to cache the directories. The advantage is that all of
memory is now available for caching directories. The disadvantage is that
the minimum in-core memory used to cache a directory is the physical page
size (typically 4K) rather than 512 bytes. We recommend turning this option
on if you are running any services which manipulate large numbers of files.
Such services can include web caches, large mail systems, and news systems.
Turning on this option will generally not reduce performance even with the
wasted memory but you should experiment to find out.
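.Pp
The memory cost of this trade-off can be estimated with simple arithmetic.
The Python sketch below uses the typical sizes quoted above (512 bytes per
directory in the buffer cache, 4K per directory in the VM Page Cache); the
workload of 100,000 small directories is a made-up figure for illustration.
.Bd -literal -offset indent
```python
def dir_cache_bytes(num_dirs, per_dir_bytes):
    """Minimum in-core memory needed to cache num_dirs small directories."""
    return num_dirs * per_dir_bytes

BUFFER_CACHE_UNIT = 512   # typical per-directory cost quoted in the text
VM_PAGE_SIZE = 4096       # typical physical page size quoted in the text

dirs = 100_000            # hypothetical workload size
print(dir_cache_bytes(dirs, BUFFER_CACHE_UNIT) // 1024, "KB")  # 50000 KB
print(dir_cache_bytes(dirs, VM_PAGE_SIZE) // 1024, "KB")       # 400000 KB
```
.Ed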
.Pp
There are various buffer-cache and VM page cache related sysctls. We do
not recommend messing around with these at all. As of
.Fx 4.3 ,
the VM system does an extremely good job tuning itself.
.Pp
The
.Em net.inet.tcp.sendspace
and
.Em net.inet.tcp.recvspace
sysctls are of particular interest if you are running network intensive
applications. These control the amount of send and receive buffer space
allowed for any given TCP connection. The default is 16K. You can often
improve bandwidth utilization by increasing the default at the cost of
eating up more kernel memory for each connection. We do not recommend
increasing the defaults if you are serving hundreds or thousands of
simultaneous connections because it is possible to quickly run the system
out of memory due to stalled connections building up. But if you need
high bandwidth over fewer connections, especially if you have
gigabit ethernet, increasing these defaults can make a huge difference.
You can adjust the buffer size for incoming and outgoing data separately.
For example, if your machine is primarily doing web serving you may want
to decrease the recvspace in order to be able to increase the sendspace
without eating too much kernel memory. Note that the route table, see
.Xr route 8 ,
can be used to introduce route-specific send and receive buffer size
defaults. As an additional management tool you can use pipes in your
firewall rules, see
.Xr ipfw 8 ,
to limit the bandwidth going to or from particular IP blocks or ports.
For example, if you have a T1 you might want to limit your web traffic
to 70% of the T1's bandwidth in order to leave the remainder available
for mail and interactive use. Normally a heavily loaded web server
will not introduce significant latencies into other services even if
the network link is maxed out, but enforcing a limit can smooth things
out and lead to longer term stability. Many people also enforce artificial
bandwidth limitations in order to ensure that they are not charged for
using too much bandwidth.
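.Pp
The T1 example above translates into concrete numbers as follows.
The Python sketch assumes the nominal T1 line rate of 1.544 Mbit/s and
ignores framing and protocol overhead; the
.Fn share_kbit
helper is hypothetical, and the resulting figure is only a starting point
for whatever bandwidth limit you configure.
.Bd -literal -offset indent
```python
T1_BITS_PER_SEC = 1_544_000  # nominal T1 line rate

def share_kbit(link_bps, percent):
    """Bandwidth cap in Kbit/s for a given percentage of the link."""
    return link_bps * percent // 100 // 1000

web_cap = share_kbit(T1_BITS_PER_SEC, 70)
rest = share_kbit(T1_BITS_PER_SEC, 30)
# Prints: web cap: 1080 Kbit/s, remainder: 463 Kbit/s
print(f"web cap: {web_cap} Kbit/s, remainder: {rest} Kbit/s")
```
.Ed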
.Pp
We recommend that you turn on (set to 1) and leave on the
.Em net.inet.tcp.always_keepalive
control. The default is usually off. This introduces a small amount of
additional network bandwidth but guarantees that dead TCP connections
will eventually be recognized and cleared. Dead TCP connections are a
particular problem on systems accessed by users operating over dialups,
because users often disconnect their modems without properly closing active
connections.
.Pp
The
.Em kern.ipc.somaxconn
sysctl limits the size of the listen queue for accepting new TCP connections.
The default value of 128 is typically too low for robust handling of new
connections in a heavily loaded web server environment. For such environments,
we recommend increasing this value to 1024 or higher. The service daemon
may itself limit the listen queue size (e.g. sendmail, apache) but will
often have a directive in its configuration file to adjust the queue size up.
Larger listen queues also do a better job of fending off denial of service
attacks.
.Sh KERNEL CONFIG TUNING
.Pp
There are a number of kernel options that you may have to fiddle with in
a large scale system. In order to change these options you need to be
able to compile a new kernel from source. The
.Xr config 8
manual page and the handbook are good starting points for learning how to
do this. Generally the first thing you do when creating your own custom
kernel is to strip out all the drivers and services you don't use. Removing
things like
.Em INET6
and drivers you don't have will reduce the size of your kernel, sometimes
by a megabyte or more, leaving more memory available for applications.
.Pp
The
.Em maxusers
kernel option defaults to an incredibly low value. For most modern machines,
you probably want to increase this value to 64, 128, or 256. We do not
recommend going above 256 unless you need a huge number of file descriptors.
Network buffers are also affected but can be controlled with a separate
kernel option. Do not increase maxusers just to get more network mbufs.
.Pp
.Em NMBCLUSTERS
may be adjusted to increase the number of network mbufs the system is
willing to allocate. Each cluster represents approximately 2K of memory,
so a value of 1024 represents 2M of kernel memory reserved for network
buffers. You can do a simple calculation to figure out how many you need.
If you have a web server which maxes out at 1000 simultaneous connections,
and each connection eats a 16K receive and 16K send buffer, you need
approximately 32MB worth of network buffers to deal with it. A good rule of
thumb is to multiply by 2, so 32MB x 2 = 64MB, and 64MB / 2K = 32768. So for
this case you would want to set NMBCLUSTERS to 32768. We recommend values
between 1024 and 4096 for machines with moderate amounts of memory, and
between 4096 and 32768 for machines with greater amounts of memory. Under no
circumstances should you specify an arbitrarily high value for this
parameter, because it could lead to a boot-time crash. The -m option to
.Xr netstat 1
may be used to observe network cluster use.
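.Pp
The calculation above can be written down as a short Python sketch.
The
.Fn nmbclusters
helper is hypothetical; it uses the 2K cluster size, the 16K default
buffers, and the factor of 2 from the text.
Note that the text rounds 1000 connections up to 32MB, which yields 32768,
whereas the exact arithmetic for 1000 connections gives 32000.
.Bd -literal -offset indent
```python
CLUSTER_BYTES = 2048  # each mbuf cluster is roughly 2K per the text

def nmbclusters(connections, send_bytes=16384, recv_bytes=16384, safety=2):
    """Clusters needed for the worst case, times a safety factor."""
    need = connections * (send_bytes + recv_bytes) * safety
    return -(-need // CLUSTER_BYTES)  # ceiling division

print(nmbclusters(1000))  # 32000
print(nmbclusters(1024))  # 32768
```
.Ed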
.Pp
More and more programs are using the
.Fn sendfile
system call to transmit files over the network. The
.Em NSFBUFS
kernel parameter controls the number of filesystem buffers
.Fn sendfile
is allowed to use to perform its work. This parameter nominally scales
with
.Em maxusers
so you should not need to mess with this parameter except under extreme
circumstances.
.Pp
.Em SCSI_DELAY
and
.Em IDE_DELAY
may be used to reduce system boot times. The defaults are fairly high and
can be responsible for 15+ seconds of delay in the boot process. Reducing
SCSI_DELAY to 5 seconds usually works (especially with modern drives).
Reducing IDE_DELAY also works but you have to be a little more careful.
.Pp
There are a number of
.Em XXX_CPU
options that can be commented out. If you only want the kernel to run
on a Pentium class CPU, you can easily remove
.Em I386_CPU
and
.Em I486_CPU ,
but only remove
.Em I586_CPU
if you are sure your CPU is being recognized as a Pentium II or better.
Some clones may be recognized as a Pentium or even a 486 and not be able
to boot without those options. If it works, great! The operating system
will be able to make better use of higher-end CPU features for mmu, task
switching, timebase, and even device operations. Additionally, higher-end
CPUs support 4MB MMU pages which the kernel uses to map the kernel itself
into memory, which increases its efficiency under heavy syscall loads.
.Sh IDE WRITE CACHING
As of
.Fx 4.3 ,
IDE write caching is turned off by default. This will reduce write bandwidth
to IDE disks but is considered necessary due to serious data consistency
issues introduced by hard drive vendors. Basically the problem is that
IDE drives lie about when a write completes. With IDE write caching turned
on, IDE hard drives will not only write data to disk out of order, they
will sometimes delay some of the blocks indefinitely when under heavy disk
loads. A crash or power failure can result in serious filesystem
corruption. So our default is to be safe. If you are willing to risk
filesystem corruption, you can return to the old behavior by setting the
hw.ata.wc
kernel variable back to 1. This must be done from the boot loader at boot
time. Please see
.Xr ata 4
and
.Xr loader 8 .
.Pp
There is a new experimental feature for IDE hard drives called hw.ata.tags
(you also set this in the bootloader) which allows write caching to be safely
turned on. This brings SCSI tagging features to IDE drives. As of this
writing only IBM DPTA and DTLA drives support the feature.
.Sh CPU, MEMORY, DISK, NETWORK
The type of tuning you do depends heavily on where your system begins to
bottleneck as load increases. If your system runs out of CPU (idle times
are perpetually 0%) then you need to consider upgrading the CPU or moving to
an SMP motherboard (multiple CPUs), or perhaps you need to revisit the
programs that are causing the load and try to optimize them. If your system
is paging to swap a lot you need to consider adding more memory. If your
system is saturating the disk you typically see high CPU idle times and
total disk saturation.
.Xr systat 1
can be used to monitor this. There are many solutions to saturated disks:
increasing memory for caching, mirroring disks, distributing operations across
several machines, and so forth. If disk performance is an issue and you
are using IDE drives, switching to SCSI can help a great deal. While modern
IDE drives compare with SCSI in raw sequential bandwidth, the moment you
start seeking around the disk SCSI drives usually win.
.Pp
Finally, you might run out of network suds. The first line of defense for
improving network performance is to make sure you are using switches instead
of hubs, especially these days where switches are almost as cheap. Hubs
have severe problems under heavy loads due to collision backoff and one bad
host can severely degrade the entire LAN. Second, optimize the network path
as much as possible. For example, in
.Xr firewall 7
we describe a firewall protecting internal hosts with a topology where
the externally visible hosts are not routed through it. Use 100BaseT rather
than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs.
Most bottlenecks occur at the WAN link (e.g. modem, T1, DSL, whatever).
If expanding the link is not an option it may be possible to use ipfw's
.Sy DUMMYNET
feature to implement peak shaving or other forms of traffic shaping to
prevent the overloaded service (such as web services) from affecting other
services (such as email), or vice versa. In home installations this could
be used to give interactive traffic (your browser, ssh logins) priority
over services you export from your box (web services, email).
.Sh SEE ALSO
.Pp
.Xr ata 4 ,
.Xr boot 8 ,
.Xr ccd 4 ,
.Xr config 8 ,
.Xr disklabel 8 ,
.Xr firewall 7 ,
.Xr fsck 8 ,
.Xr hier 7 ,
.Xr ifconfig 8 ,
.Xr ipfw 8 ,
.Xr loader 8 ,
.Xr login.conf 5 ,
.Xr netstat 1 ,
.Xr newfs 8 ,
.Xr ports 7 ,
.Xr route 8 ,
.Xr sysctl 8 ,
.Xr systat 1 ,
.Xr tunefs 8 ,
.Xr vinum 8
.Sh HISTORY
The
.Nm
manual page was originally written by
.An Matthew Dillon
and first appeared
in
.Fx 4.3 ,
May 2001.