<!-- $Id: kerneldebug.sgml,v 1.6 1996/01/03 11:10:30 gclarkii Exp $ -->
<!-- The FreeBSD Documentation Project -->

<chapt><heading>Kernel Debugging<label id="kerneldebug"></heading>

<p><em>Contributed by &a.paul; and &a.joerg;</em>

<sect><heading>Debugging a kernel crash dump with kgdb</heading>

<p>Here are some instructions for getting kernel debugging
working on a crash dump. They assume that you have enough swap
space for a crash dump. If you have multiple swap
partitions and the first one is too small to hold the dump,
you can configure your kernel to use an alternate dump device
(in the <tt>config kernel</tt> line), or
you can specify an alternate using the <tt>dumpon(8)</tt> command.
Dumps to non-swap devices,
tapes for example, are currently not supported. Configure your
kernel using <tt>config -g</tt>.
See <ref id="kernelconfig" name="Kernel Configuration"> for
details on configuring the FreeBSD kernel.

Use the <tt>dumpon(8)</tt> command to tell the kernel where to dump
to (note that this will have to be done after configuring the
partition in question as swap space via <tt>swapon(8)</tt>). This is
normally arranged via <tt>/etc/sysconfig</tt> and <tt>/etc/rc</tt>.
Alternatively, you can
hard-code the dump device via the `dump' clause in the `config' line
of your kernel config file. This is deprecated; use it only if you
want a crash dump from a kernel that crashes during booting.
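As a rough sketch of the <tt>dumpon(8)</tt> approach done by hand, assuming
a swap partition named <tt>/dev/sd0b</tt> (a hypothetical device name,
substitute your own) that is large enough to hold the dump:
<tscreen><verb>
swapon /dev/sd0b   # configure the partition as swap space first
dumpon /dev/sd0b   # then select it as the dump device
</verb></tscreen>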
|
|
|
|
<em><bf>Note:</bf> In the following, the term `<tt>kgdb</tt>' refers
to <tt>gdb</tt> run in `kernel debug mode'. This can be accomplished by
either starting <tt>gdb</tt> with the option <tt>-k</tt>, or by linking
and starting it under the name <tt>kgdb</tt>. This is not
done by default, however.</em>

When the kernel has been built, make a copy of it, say
<tt>kernel.debug</tt>, and then run <tt>strip -x</tt> on the
original. Install the original as normal. You may also install
the unstripped kernel, but symbol table lookup time for some
programs will drastically increase, and since
the whole kernel is loaded entirely at boot time and cannot be
swapped out later, several megabytes of
physical memory will be wasted.
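The copy-and-strip step might look like the following sketch. The
directory name <tt>WHATEVER</tt> is a placeholder for your kernel
configuration name, and <tt>make install</tt> is assumed to be the way
you normally install a kernel:
<tscreen><verb>
cd /sys/compile/WHATEVER
cp kernel kernel.debug   # keep the unstripped copy for kgdb
strip -x kernel          # strip the original
make install             # install the stripped kernel as usual
</verb></tscreen>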
|
|
|
|
If you are testing a new kernel, for example by typing the new
kernel's name at the boot prompt, but need to boot a different
one in order to get your system up and running again, boot it
only into single user state using the <tt>-s</tt> flag at the
boot prompt, and then perform the following steps:
<tscreen><verb>
fsck -p
mount -a -t ufs       # so your file system for /var/crash is writable
savecore -N /kernel.panicked /var/crash
exit                  # ...to multi-user
</verb></tscreen>
This instructs <tt>savecore(8)</tt> to use another kernel for symbol name
extraction. It would otherwise default to the currently running kernel
and most likely not do anything at all since the crash dump and the
kernel symbols differ.

Now, after a crash dump, go to <tt>/sys/compile/WHATEVER</tt> and run
<tt>kgdb</tt>. From <tt>kgdb</tt> do:
<tscreen><verb>
symbol-file kernel.debug
exec-file /var/crash/kernel.0
core-file /var/crash/vmcore.0
</verb></tscreen>
and voila, you can debug the crash dump using the kernel sources
just like you can for any other program.

Here's a script log of a <tt>kgdb</tt> session illustrating the
procedure. Long
lines have been folded to improve readability, and the lines are
numbered for reference. Despite this, it's a real-world error
trace taken during the development of the pcvt console driver.
<tscreen><verb>
1:Script started on Fri Dec 30 23:15:22 1994
2:uriah # cd /sys/compile/URIAH
3:uriah # kgdb kernel /var/crash/vmcore.1
4:Reading symbol data from /usr/src/sys/compile/URIAH/kernel...done.
5:IdlePTD 1f3000
6:panic: because you said to!
7:current pcb at 1e3f70
8:Reading in symbols for ../../i386/i386/machdep.c...done.
9:(kgdb) where
10:#0 boot (arghowto=256) (../../i386/i386/machdep.c line 767)
11:#1 0xf0115159 in panic ()
12:#2 0xf01955bd in diediedie () (../../i386/i386/machdep.c line 698)
13:#3 0xf010185e in db_fncall ()
14:#4 0xf0101586 in db_command (-266509132, -266509516, -267381073)
15:#5 0xf0101711 in db_command_loop ()
16:#6 0xf01040a0 in db_trap ()
17:#7 0xf0192976 in kdb_trap (12, 0, -272630436, -266743723)
18:#8 0xf019d2eb in trap_fatal (...)
19:#9 0xf019ce60 in trap_pfault (...)
20:#10 0xf019cb2f in trap (...)
21:#11 0xf01932a1 in exception:calltrap ()
22:#12 0xf0191503 in cnopen (...)
23:#13 0xf0132c34 in spec_open ()
24:#14 0xf012d014 in vn_open ()
25:#15 0xf012a183 in open ()
26:#16 0xf019d4eb in syscall (...)
27:(kgdb) up 10
28:Reading in symbols for ../../i386/i386/trap.c...done.
29:#10 0xf019cb2f in trap (frame={tf_es = -260440048, tf_ds = 16, tf_\
30:edi = 3072, tf_esi = -266445372, tf_ebp = -272630356, tf_isp = -27\
31:2630396, tf_ebx = -266427884, tf_edx = 12, tf_ecx = -266427884, tf\
32:_eax = 64772224, tf_trapno = 12, tf_err = -272695296, tf_eip = -26\
33:6672343, tf_cs = -266469368, tf_eflags = 66066, tf_esp = 3072, tf_\
34:ss = -266427884}) (../../i386/i386/trap.c line 283)
35:283 (void) trap_pfault(&frame, FALSE);
36:(kgdb) frame frame->tf_ebp frame->tf_eip
37:Reading in symbols for ../../i386/isa/pcvt/pcvt_drv.c...done.
38:#0 0xf01ae729 in pcopen (dev=3072, flag=3, mode=8192, p=(struct p\
39:roc *) 0xf07c0c00) (../../i386/isa/pcvt/pcvt_drv.c line 403)
40:403 return ((*linesw[tp->t_line].l_open)(dev, tp));
41:(kgdb) list
42:398
43:399 tp->t_state |= TS_CARR_ON;
44:400 tp->t_cflag |= CLOCAL; /* cannot be a modem (:-) */
45:401
46:402 #if PCVT_NETBSD || (PCVT_FREEBSD >= 200)
47:403 return ((*linesw[tp->t_line].l_open)(dev, tp));
48:404 #else
49:405 return ((*linesw[tp->t_line].l_open)(dev, tp, flag));
50:406 #endif /* PCVT_NETBSD || (PCVT_FREEBSD >= 200) */
51:407 }
52:(kgdb) print tp
53:Reading in symbols for ../../i386/i386/cons.c...done.
54:$1 = (struct tty *) 0x1bae
55:(kgdb) print tp->t_line
56:$2 = 1767990816
57:(kgdb) up
58:#1 0xf0191503 in cnopen (dev=0x00000000, flag=3, mode=8192, p=(st\
59:ruct proc *) 0xf07c0c00) (../../i386/i386/cons.c line 126)
60: return ((*cdevsw[major(dev)].d_open)(dev, flag, mode, p));
61:(kgdb) up
62:#2 0xf0132c34 in spec_open ()
63:(kgdb) up
64:#3 0xf012d014 in vn_open ()
65:(kgdb) up
66:#4 0xf012a183 in open ()
67:(kgdb) up
68:#5 0xf019d4eb in syscall (frame={tf_es = 39, tf_ds = 39, tf_edi =\
69: 2158592, tf_esi = 0, tf_ebp = -272638436, tf_isp = -272629788, tf\
70:_ebx = 7086, tf_edx = 1, tf_ecx = 0, tf_eax = 5, tf_trapno = 582, \
71:tf_err = 582, tf_eip = 75749, tf_cs = 31, tf_eflags = 582, tf_esp \
72:= -272638456, tf_ss = 39}) (../../i386/i386/trap.c line 673)
73:673 error = (*callp->sy_call)(p, args, rval);
74:(kgdb) up
75:Initial frame selected; you cannot go up.
76:(kgdb) quit
77:uriah # exit
78:exit
79:
80:Script done on Fri Dec 30 23:18:04 1994
</verb></tscreen>
Comments to the above script:

<descrip>
<tag/line 6:/ This is a dump taken from within DDB (see below), hence the
panic comment ``because you said to!'', and a rather long
stack trace; the initial reason for going into DDB was
a page fault trap, though.
<tag/line 20:/ This is the location of function <tt>trap()</tt>
in the stack trace.
<tag/line 36:/ Force usage of a new stack frame; this is no longer
necessary now. The stack frames are supposed to point to
the right locations now, even in case of a trap.
(I don't have a new core dump handy &lt;g&gt;, my kernel
has not panicked for a rather long time.)
From looking at the code in source line 403,
there's a high probability that either the pointer
access for ``tp'' was messed up, or the array access was
out of bounds.
<tag/line 52:/ The pointer looks suspicious, but happens to be a valid
address.
<tag/line 56:/ However, it obviously points to garbage, so we have found our
error! (For those unfamiliar with that particular piece
of code: <tt>tp->t_line</tt> refers to the line discipline
of the console device here, which must be a rather small integer
number.)
</descrip>


<sect><heading>Post-mortem analysis of a dump</heading>

<p>What do you do if a kernel dumped core but you did not expect
it, and it's therefore not compiled using <tt>config -g</tt>?
Not everything is lost here. Don't panic!

Of course, you still need to enable crash dumps. See above
for the options you have for doing this.

Go to your kernel compile directory, and edit the line
containing <tt>COPTFLAGS?=-O</tt>. Add the <tt>-g</tt> option
there (but <em>don't</em> change anything on the level of
optimization). If you do already know roughly the probable
location of the failing piece of code (e.g., the <tt>pcvt</tt>
driver in the example above), remove all the object files for
this code. Rebuild the kernel. Due to the time stamp change on
the Makefile, some other object files will be rebuilt as well,
for example <tt>trap.o</tt>. With a bit of luck, the added
<tt>-g</tt> option won't change anything for the generated
code, so you'll finally get a new kernel with similar code to
the faulting one but some debugging symbols. You should at
least verify the old and new sizes with the <tt>size(1)</tt> command. If
there is a mismatch, you probably need to give up here.
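As a sketch of these steps, assuming the compile directory is
<tt>/sys/compile/WHATEVER</tt> (a placeholder), that the suspected code is
the <tt>pcvt</tt> driver and its object files match <tt>pcvt_*.o</tt>
(an assumption about the file names), and that the kernel that crashed is
still installed as <tt>/kernel</tt>:
<tscreen><verb>
cd /sys/compile/WHATEVER
vi Makefile            # add -g to the COPTFLAGS?=-O line
rm pcvt_*.o            # object files of the suspected code
make                   # rebuild the kernel
size kernel /kernel    # compare the new kernel with the one that crashed
</verb></tscreen>
The sizes reported for the two kernels should agree before you go on.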
|
|
|
|
Go and examine the dump as described above. The debugging
symbols might be incomplete for some places, as can be seen in
the stack trace in the example above where some functions are
displayed without line numbers and argument lists. If you need
more debugging symbols, remove the appropriate object files,
rebuild the kernel, and
repeat the <tt>kgdb</tt> session until you know enough.

All this is not guaranteed to work, but it will do fine in
most cases.

<sect><heading>On-line kernel debugging using DDB</heading>

<p>While <tt>kgdb</tt> as an offline debugger provides a very
high level of user interface, there are some things it cannot do.
The most important ones are breakpointing and single-stepping
kernel code.

If you need to do low-level debugging on your kernel, there's
an on-line debugger available called DDB. It allows
setting breakpoints, single-stepping kernel functions, examining
and changing kernel variables, etc. However, it cannot
access kernel source files, and only has access to the global
and static symbols, not to the full debug information like
<tt>kgdb</tt>.

To configure your kernel to include DDB, add the option line
<tscreen><verb>
options DDB
</verb></tscreen>
to your config file, and rebuild. (See <ref id="kernelconfig"
name="Kernel Configuration"> for details on configuring the
FreeBSD kernel. Note that if you have an older version of the
boot blocks, your debugger symbols might not be loaded at all.
Update the boot blocks; the recent ones load the DDB symbols
automagically.)

Once your DDB kernel is running, there are several ways to
enter DDB. The first, and earliest, way is to type the boot
flag <tt>-d</tt> right at the boot prompt. The kernel will
start up in debug mode and enter DDB prior to any device
probing. Hence you are even able to debug the device
probe/attach functions.

The second scenario is a hot-key on the keyboard, usually
Ctrl-Alt-ESC. For syscons, this can be remapped, and some of
the distributed maps do this, so watch out.
There's an option
available for serial consoles
that allows the use of a serial line BREAK on the console line to
enter DDB (``<tt>options BREAK_TO_DEBUGGER</tt>''
in the kernel config file). It is not the default since there are a lot of
crappy serial adapters around that gratuitously generate a
BREAK condition, for example when pulling the cable.

The third way is that any panic condition will branch to DDB if
the kernel is configured to use it.
For this reason, it is not wise to
configure a kernel with DDB for a machine running unattended.

The DDB commands roughly resemble some <tt>gdb</tt> commands. The first thing you
probably need to do is to set a breakpoint:
<tscreen><verb>
b function-name
b address
</verb></tscreen>

Numbers are taken as hexadecimal by default, but to make them
distinct from symbol names, hexadecimal numbers starting with the
letters <tt>a</tt>-<tt>f</tt> need to be preceded with
<tt>0x</tt> (for other numbers, this is optional). Simple
expressions are allowed, for example: <tt>function-name + 0x103</tt>.

To continue the operation of an interrupted kernel, simply type
<tscreen><verb>
c
</verb></tscreen>
To get a stack trace, use
<tscreen><verb>
trace
</verb></tscreen>
Note that when entering DDB via a hot-key, the kernel is currently
servicing an interrupt, so the stack trace might not be of much use
to you.

If you want to remove a breakpoint, use
<tscreen><verb>
del
del address-expression
</verb></tscreen>
The first form will be accepted immediately after a breakpoint hit,
and deletes the current breakpoint. The second form can remove any
breakpoint, but you need to specify the exact address, which can be
obtained from
<tscreen><verb>
show b
</verb></tscreen>
To single-step the kernel, try
<tscreen><verb>
s
</verb></tscreen>
This will step into functions, but you can make DDB trace them until
the matching return statement is reached by
<tscreen><verb>
n
</verb></tscreen>
<bf>Note:</bf> this is different from <tt>gdb</tt>'s `next' statement; it's like
<tt>gdb</tt>'s `finish'.

To examine data from memory, use (for example):
<tscreen><verb>
x/wx 0xf0133fe0,40
x/hd db_symtab_space
x/bc termbuf,10
x/s stringbuf
</verb></tscreen>
for word/halfword/byte access, and
hexadecimal/decimal/character/string display. The number after the comma is the object count.
To display the next 0x10 items, simply use
<tscreen><verb>
x ,10
</verb></tscreen>
Similarly, use
<tscreen><verb>
x/ia foofunc,10
</verb></tscreen>
to disassemble the first 0x10 instructions of <tt>foofunc</tt>, and display
them along with their offset from the beginning of <tt>foofunc</tt>.

To modify the memory, use the write command:
<tscreen><verb>
w/b termbuf 0xa 0xb 0
w/w 0xf0010030 0 0
</verb></tscreen>
The command modifier (<tt>b</tt>/<tt>h</tt>/<tt>w</tt>)
specifies the size of the data to be written. The first
following expression is the address to write to, and the remainder
is interpreted as data to write to successive memory locations.

If you need to know the current registers, use
<tscreen><verb>
show reg
</verb></tscreen>
Alternatively, you can display a single register value by e.g.
<tscreen><verb>
p $eax
</verb></tscreen>
and modify it by
<tscreen><verb>
set $eax new-value
</verb></tscreen>

Should you need to call some kernel functions from DDB, simply
say
<tscreen><verb>
call func(arg1, arg2, ...)
</verb></tscreen>
The return value will be printed.

For a <tt>ps(1)</tt> style summary of all running processes, use
<tscreen><verb>
ps
</verb></tscreen>

Now you have examined why your kernel failed, and you wish to
reboot. Remember that, depending on the severity of previous
malfunctioning, not all parts of the kernel might still be working
as expected. Perform one of the following actions to shut down and
reboot your system:
<tscreen><verb>
call diediedie()
</verb></tscreen>

will cause your kernel to dump core and reboot, so you can
later analyze the core on a higher level with <tt>kgdb</tt>. This
command usually must be followed by another
`<tt>continue</tt>' statement.
There is now an alias for this: `<tt>panic</tt>'.

<tscreen><verb>
call boot(0)
</verb></tscreen>
attempts to shut down the running system cleanly, <tt>sync()</tt>
all disks, and finally reboot. As long as the disk and file system
interfaces of the kernel are not damaged, this might be a good way
to get an almost clean shutdown.

<tscreen><verb>
call cpu_reset()
</verb></tscreen>
is the final way out of disaster and almost the same as hitting
the Big Red Button.

If you need a short command summary, simply type
<tscreen><verb>
help
</verb></tscreen>
However, it's highly recommended to have a printed copy of the
<tt>ddb(4)</tt> manual page ready for a debugging session.
Remember that it's hard to read the on-line manual while
single-stepping the kernel.


<sect><heading>Debugging a console driver</heading>

<p>Since you need a console driver to run DDB on, things are more
complicated if the console driver itself is failing. In that case,
remember that you can use a serial console (either with modified boot
blocks, or by specifying <tt><bf>-h</bf></tt> at the <tt>Boot:</tt>
prompt), and hook up a standard
terminal to your first serial port. DDB works on any configured
console driver, including, of course, a serial console.