
                   Kernel debugging FAQ for FreeBSD

$Id: kernel-debug.FAQ,v 1.3 1995/01/02 12:01:59 joerg Exp $


*** Debugging a kernel crash dump with kgdb ***

  Here are some instructions for getting kernel debugging working on a
  crash dump.  They assume that you have enough swap space for a crash
  dump.  If you have multiple swap partitions and the first one is too
  small to hold the dump, you can configure your kernel to use an
  alternate dump device (in the ``kernel'' line).  Dumps to non-swap
  devices (e.g. tapes) are currently not supported.

  Configure your kernel using config -g.

  Remember that you need to specify ``options DODUMP'' in your config
  file in order to get kernel core dumps.

  When the kernel has been built, make a copy of it, say kernel.debug,
  and then run strip -x on the original.  Install the original as
  normal.  You may also install the unstripped kernel, but symbol table
  lookup time for some programs might increase drastically.

  If you are testing a new kernel (e.g. by typing the new kernel's
  name at the boot prompt), but need to boot a different one in order
  to get your system up and running again, boot it only into single
  user mode (the -s flag at the boot prompt), and then perform the
  following steps:

  fsck -p
  mount -a -t ufs       # so your file system for /var/crash is writable
  savecore -N /kernel.panicked /var/crash
  exit                  # ...to multi-user

  This instructs savecore to use another kernel for symbol name
  extraction; it would default to the currently running kernel
  otherwise.

  Now, after a crash dump, go to /sys/compile/WHATEVER and run
  kgdb. From kgdb do:

  symbol-file kernel.debug
  exec-file /var/crash/system.0
  core-file /var/crash/ram.0

  and voila, you can debug the crash dump using the kernel sources
  just like you can for any other program.

  If your kernel panicked due to a trap (perhaps the most common case
  for getting a core dump), the following trick might help you.  Examine
  the stack (`where') and look for the stack frame in the function
  trap().  Go `up' to that frame, and then type:

  frame frame->tf_ebp frame->tf_eip

  This tells kgdb to go to the stack frame explicitly named by a
  frame pointer and an instruction pointer, which is the location where
  the trap occurred.  There are still some bugs in kgdb (you can go
  `up' from there, but not `down'; the stack trace still remains as
  it was before you moved to this frame), but generally this method
  will lead you much closer to the failing piece of code.
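
  The trap frame fields are plain integers, and kgdb prints them as
  signed decimals, so they do not look like the hexadecimal addresses
  in the stack trace.  In the session log below, tf_eip is shown as
  -266672343, which is the same 32-bit value as the address 0xf01ae729
  that kgdb reports after the `frame' command.  The following C
  fragment is only a sketch of that conversion; the function names are
  made up for illustration and are not part of kgdb or the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Reinterpret a trap frame field that kgdb printed as a signed 32-bit
 * decimal as the unsigned 32-bit address it stands for.
 * (Hypothetical helper, for illustration only.)
 */
static uint32_t
tf_field_to_addr(int32_t printed)
{
	return (uint32_t)printed;
}

/*
 * Format the address the way the stack trace prints it, e.g.
 * "0xf01ae729".  The buffer must hold "0x", 8 hex digits and a NUL.
 */
static void
format_addr(int32_t printed, char *buf, size_t len)
{
	assert(buf != NULL && len >= 11);
	snprintf(buf, len, "0x%08lx",
	    (unsigned long)tf_field_to_addr(printed));
}
```

  Calling format_addr(-266672343, buf, sizeof(buf)) leaves the string
  0xf01ae729 in buf, which makes it easier to match a trap frame
  against the addresses shown by `where'.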

  Here's a script log of a kgdb session illustrating the above.  Long
  lines have been folded to improve readability, and the lines are
  numbered for reference.  Apart from that, it's a real-world error
  trace taken during the development of the pcvt console driver.

   1:Script started on Fri Dec 30 23:15:22 1994
   2:uriah # cd /sys/compile/URIAH
   3:uriah # kgdb kernel /var/crash/vmcore.1 
   4:Reading symbol data from /usr/src/sys/compile/URIAH/kernel...done.
   5:IdlePTD 1f3000
   6:panic: because you said to!
   7:current pcb at 1e3f70
   8:Reading in symbols for ../../i386/i386/machdep.c...done.
   9:(kgdb) where
  10:#0  boot (arghowto=256) (../../i386/i386/machdep.c line 767)
  11:#1  0xf0115159 in panic ()
  12:#2  0xf01955bd in diediedie () (../../i386/i386/machdep.c line 698)
  13:#3  0xf010185e in db_fncall ()
  14:#4  0xf0101586 in db_command (-266509132, -266509516, -267381073)
  15:#5  0xf0101711 in db_command_loop ()
  16:#6  0xf01040a0 in db_trap ()
  17:#7  0xf0192976 in kdb_trap (12, 0, -272630436, -266743723)
  18:#8  0xf019d2eb in trap_fatal (...)
  19:#9  0xf019ce60 in trap_pfault (...)
  20:#10 0xf019cb2f in trap (...)
  21:#11 0xf01932a1 in exception:calltrap ()
  22:#12 0xf0191503 in cnopen (...)
  23:#13 0xf0132c34 in spec_open ()
  24:#14 0xf012d014 in vn_open ()
  25:#15 0xf012a183 in open ()
  26:#16 0xf019d4eb in syscall (...)
  27:(kgdb) up 10
  28:Reading in symbols for ../../i386/i386/trap.c...done.
  29:#10 0xf019cb2f in trap (frame={tf_es = -260440048, tf_ds = 16, tf_\
  30:edi = 3072, tf_esi = -266445372, tf_ebp = -272630356, tf_isp = -27\
  31:2630396, tf_ebx = -266427884, tf_edx = 12, tf_ecx = -266427884, tf\
  32:_eax = 64772224, tf_trapno = 12, tf_err = -272695296, tf_eip = -26\
  33:6672343, tf_cs = -266469368, tf_eflags = 66066, tf_esp = 3072, tf_\
  34:ss = -266427884}) (../../i386/i386/trap.c line 283)
  35:283                             (void) trap_pfault(&frame, FALSE);
  36:(kgdb) frame frame->tf_ebp frame->tf_eip
  37:Reading in symbols for ../../i386/isa/pcvt/pcvt_drv.c...done.
  38:#0  0xf01ae729 in pcopen (dev=3072, flag=3, mode=8192, p=(struct p\
  39:roc *) 0xf07c0c00) (../../i386/isa/pcvt/pcvt_drv.c line 403)
  40:403             return ((*linesw[tp->t_line].l_open)(dev, tp));
  41:(kgdb) list
  42:398        
  43:399             tp->t_state |= TS_CARR_ON;
  44:400             tp->t_cflag |= CLOCAL;  /* cannot be a modem (:-) */
  45:401     
  46:402     #if PCVT_NETBSD || (PCVT_FREEBSD >= 200)
  47:403             return ((*linesw[tp->t_line].l_open)(dev, tp));
  48:404     #else
  49:405             return ((*linesw[tp->t_line].l_open)(dev, tp, flag));
  50:406     #endif /* PCVT_NETBSD || (PCVT_FREEBSD >= 200) */
  51:407     }
  52:(kgdb) print tp
  53:Reading in symbols for ../../i386/i386/cons.c...done.
  54:$1 = (struct tty *) 0x1bae
  55:(kgdb) print tp->t_line
  56:$2 = 1767990816
  57:(kgdb) up
  58:#1  0xf0191503 in cnopen (dev=0x00000000, flag=3, mode=8192, p=(st\
  59:ruct proc *) 0xf07c0c00) (../../i386/i386/cons.c line 126)
  60:       return ((*cdevsw[major(dev)].d_open)(dev, flag, mode, p));
  61:(kgdb) up
  62:#2  0xf0132c34 in spec_open ()
  63:(kgdb) up
  64:#3  0xf012d014 in vn_open ()
  65:(kgdb) up
  66:#4  0xf012a183 in open ()
  67:(kgdb) up
  68:#5  0xf019d4eb in syscall (frame={tf_es = 39, tf_ds = 39, tf_edi =\
  69: 2158592, tf_esi = 0, tf_ebp = -272638436, tf_isp = -272629788, tf\
  70:_ebx = 7086, tf_edx = 1, tf_ecx = 0, tf_eax = 5, tf_trapno = 582, \
  71:tf_err = 582, tf_eip = 75749, tf_cs = 31, tf_eflags = 582, tf_esp \
  72:= -272638456, tf_ss = 39}) (../../i386/i386/trap.c line 673)
  73:673             error = (*callp->sy_call)(p, args, rval);
  74:(kgdb) up
  75:Initial frame selected; you cannot go up.
  76:(kgdb) quit
  77:uriah # exit
  78:exit
  79:
  80:Script done on Fri Dec 30 23:18:04 1994

  Comments to the above script:
  
  line  6:  this is a dump taken from within DDB (see below), hence the
            panic comment ``because you said to!'', and a rather long
            stack trace; the initial reason for going into DDB has been
            a page fault trap though
  
  line 20:  the location of function ``trap()'' in the stack trace
  
  line 36:  force usage of a new stack frame, kgdb responds and displays
            the source line where the trap happened; from looking at the
            code, there's a high probability that either the pointer
            access for ``tp'' was messed up, or the array access was
            out of bounds
  
  line 52:  the pointer looks suspicious, but happens to be a valid
            address...
  
  line 56:  ... but obviously points to garbage, so we have found our
            error, sigh!  [For those unfamiliar with that particular
            piece of code: tp->t_line refers to the line discipline of
            the console device here, which must be a rather small
            integer number.]
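
  The failing statement uses tp->t_line as an index into a table of
  function pointers, so a garbage value such as 1767990816 sends the
  call far outside the table.  The following C fragment is a minimal
  sketch of a range check in front of such a call.  It is not the
  actual pcvt fix (the session does not show one); the table contents,
  its size and all function names here are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the line discipline switch. */
typedef int (*l_open_fn)(int dev);

struct linesw_entry {
	l_open_fn l_open;
};

/* Dummy open routine; a real one would set up the line discipline. */
static int
dummy_open(int dev)
{
	(void)dev;
	return 0;
}

/* Two entries only, to keep the sketch small. */
static struct linesw_entry linesw[] = {
	{ dummy_open },
	{ dummy_open },
};
#define NLINESW (sizeof(linesw) / sizeof(linesw[0]))

/* Return 1 if t_line may be used as an index into linesw[], else 0. */
static int
line_discipline_ok(long t_line)
{
	return t_line >= 0 && (size_t)t_line < NLINESW;
}

/* Call the open routine only if the index is sane; return -1 if not. */
static int
checked_l_open(long t_line, int dev)
{
	if (!line_discipline_ok(t_line))
		return -1;
	assert(linesw[t_line].l_open != NULL);
	return (*linesw[t_line].l_open)(dev);
}
```

  With the value from line 56, checked_l_open(1767990816L, 3072)
  returns -1 instead of jumping through a wild pointer.  Note that the
  real problem in the session is the bad ``tp'' pointer itself; a check
  like this would only catch the symptom.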
  

*** Post-mortem analysis of a dump ***

  What to do if a kernel dumped core but you didn't expect it, and it's
  therefore not compiled using config -g?

  Not everything is lost here.  Don't panic. :-)

  Of course, you still need to configure all your kernels with the
  DODUMP option being set, otherwise you won't get a core dump at all.
  (This is for safety reasons in the default kernels, to avoid them
  trying to dump e.g. during system installation where there's no
  FreeBSD partition at all and valuable data on the disk could be
  destroyed.)

  Go to your kernel compile directory, and edit the line containing
  COPTFLAGS?=-O.  Add the `-g' option there (but DON'T change anything
  on the level of optimization).  If you do already know roughly the
  probable location of the failing piece of code (e.g., the `pcvt'
  driver in the example above), remove all the object files for this
  code.  Rebuild the kernel.  Due to the time stamp change on the
  Makefile, some other object files will be rebuilt as well, e.g.
  trap.o.  With a bit of luck, the added -g option won't change
  anything for the generated code, so you'll finally get a new kernel
  with code similar to the faulting one, plus some debugging symbols.
  You should at least compare the old and new sizes with the `size'
  command; if they differ, you probably need to give up here.

  Go and examine the dump as described above.  The debugging symbols
  might be incomplete for some places (as can be seen in the stack trace
  in the example above: some functions are displayed without line
  numbers and argument lists).  If you need more debugging symbols,
  remove the appropriate object files and repeat the kgdb session until
  you know enough.

  None of this is guaranteed to work, but it will most likely do fine.


*** On-line kernel debugging using DDB ***

  While kgdb as an offline debugger provides a very high-level user
  interface (e.g. it can look up source files, display C structures
  etc.), there are some things it cannot do.  The most important ones
  are setting breakpoints in and single-stepping kernel code.

  If you need to do low-level debugging on your kernel, there's an
  on-line debugger available called DDB.  It allows you to set
  breakpoints, single-step kernel functions, examine and change kernel
  variables etc.  It cannot, however, access kernel source files, and
  it only has access to the global and static symbols, not to the full
  debug information (including type and line number information) that
  kgdb uses.

  To configure your kernel to include DDB, add the option lines

        options DDB
        options "SYMTAB_SPACE=XXXX"

  to your config file, and rebuild.  XXXX is the amount of space to be
  reserved in a global array that DDB examines to find its symbols at
  run time.  It must be large enough to hold all symbols, but should
  not be much larger, to avoid wasting space.  100000 bytes is a good
  first guess for a ``normal'' kernel.  The link stage will tell you
  about the usage of the symtab space; you'll see something like:

  dbsym: need 98765; avail 100000

  If the amount of allocated space is too small, the above message is
  accompanied by the following error message:

  not enough room in db_symtab array

  and the link stage fails.  You then need to increase the number,
  reconfig and recompile.  If your config(8) has been compiled to not
  remove the old compile directory before continuing (this is a
  compile-time option [CONFIG_DONT_CLOBBER]), you need to remove
  db_aout.o prior to recompilation; this is the only file being
  affected by the SYMTAB_SPACE option.
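
  If you build kernels from a script, the dbsym line can be checked
  mechanically.  The following C fragment is only a sketch: it assumes
  the message has exactly the form shown above, and it assumes that a
  table needing exactly as much space as is available still fits (the
  text above does not say).  The function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a line of the form "dbsym: need N; avail M".
 * Returns 1 if the symbols fit, 0 if SYMTAB_SPACE is too small,
 * and -1 if the line does not look like a dbsym message.
 * ASSUMPTION: need == avail is treated as fitting.
 */
static int
symtab_fits(const char *line, long *need, long *avail)
{
	assert(line != NULL && need != NULL && avail != NULL);
	if (sscanf(line, " dbsym: need %ld; avail %ld", need, avail) != 2)
		return -1;
	return *need <= *avail;
}
```

  For the example line above, symtab_fits() returns 1 and reports a
  need of 98765 against 100000 available; a result of 0 means that
  SYMTAB_SPACE has to be increased before the link stage can succeed.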


  Once your DDB kernel is running, there are several ways to enter
  DDB.  The first (and earliest) way is to set the boot flag `-d'
  (right at the boot prompt).  The kernel will start up in debug mode
  and enter DDB prior to any device probing, so you can even debug
  the device probe/attach functions.

  The second way is a hot-key on the keyboard, usually Ctrl-Alt-ESC.
  (For syscons, this can be remapped, and some of the distributed
  maps do this, so watch out.)  There are patches available for a
  COMCONSOLE kernel; ask me (joerg@FreeBSD.org) for them.

  The third way is that any panic condition will branch to DDB if the
  kernel is configured to use it.  (Thus it is not wise to configure a
  kernel with DDB for a machine running unattended.)


  The DDB commands roughly resemble some gdb commands.  The first thing
  you probably need to do is set a breakpoint:

  b function-name
  b address

  Numbers are taken as hexadecimal by default, but to make them
  distinct from symbol names, hex numbers starting with the letters
  `a' - `f' need to be preceded by `0x' (for other numbers, this is
  optional).  Simple expressions are allowed, e.g. ``function-name +
  0x103''.
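
  The number rule can be illustrated with a small C fragment.  This is
  not DDB's own lexer, just a sketch of the rule as stated above; the
  type and function names are hypothetical, and expressions such as
  ``function-name + 0x103'' are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

enum tok_kind { TOK_NUMBER, TOK_SYMBOL, TOK_INVALID };

/*
 * Classify a single token the way the rule describes it:
 *  - a token that does not start with a decimal digit is taken as a
 *    symbol name (this is why "f0133fe0" needs a leading "0x"),
 *  - everything else is read as a hexadecimal number, with or
 *    without a "0x" prefix.
 * On TOK_NUMBER the parsed value is stored in *value.
 */
static enum tok_kind
classify_token(const char *tok, unsigned long *value)
{
	char *end;

	assert(tok != NULL && value != NULL);
	if (!isdigit((unsigned char)tok[0]))
		return TOK_SYMBOL;
	/* Base 16 makes strtoul() accept an optional "0x" prefix. */
	*value = strtoul(tok, &end, 16);
	if (*end != '\0')
		return TOK_INVALID;	/* trailing junk, e.g. "10zz" */
	return TOK_NUMBER;
}
```

  Under this rule, ``10'' and ``0x10'' both yield the value 16,
  ``0xf0133fe0'' is a number, and ``f0133fe0'' is treated as a symbol
  name.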

  To continue the operation of an interrupted kernel, simply type

  c

  To get a stack trace, use

  trace

  Note that when entering DDB via a hot-key, the kernel is currently
  servicing an interrupt, so the stack trace might not be of much use
  to you.

  If you want to remove a breakpoint, use

  del
  del address-expression

  The first form will be accepted immediately after a breakpoint hit,
  and deletes the current breakpoint.  The second form can remove any
  breakpoint, but you need to specify the exact address, as it can be
  obtained from

  show b

  To single-step the kernel, try

  s

  This will step into functions, but you can make DDB trace them until
  the matching return statement is reached by

  n

  NOTE: this is different from gdb's ``next'' statement; it's like
  gdb's ``finish''.

  To examine data from memory, use e.g.

  x/wx 0xf0133fe0,40
  x/hd db_symtab_space
  x/bc termbuf,10
  x/s stringbuf

  for word/halfword/byte access, and hexadecimal/decimal/character/
  string display.  The number after the comma is the object count.
  To display the next 0x10 items, simply use

  x ,10

  Similarly, use

  x/ia foofunc,10

  to disassemble the first 0x10 instructions of foofunc, and display
  them along with their offset from the beginning of foofunc.

  To modify the memory, use the write command:

  w/b termbuf 0xa 0xb 0
  w/w 0xf0010030 0 0

  The command modifier (b/h/w) specifies the size of the data to be
  written.  The first expression following it is the address to write
  to; the remainder is interpreted as data to write to successive
  memory locations.

  If you need to know the current registers, use

  show reg

  Alternatively, you can display a single register value by e.g.

  print $eax

  and modify it by

  set $eax new-value


  Should you need to call some kernel functions from DDB, simply
  say

  call func(arg1, arg2, ...)

  The return value will be printed.

  For a ps-style summary of all running processes, use

  ps


  Well, you've now examined why your kernel failed, and you wish to
  reboot.  Remember that, depending on the severity of previous
  malfunctioning, not all parts of the kernel might still be working
  as expected.  Perform one of the following actions to shut down and
  reboot your system:


  call diediedie()

  (must usually be followed by another ``c[ontinue]'' statement),
  will cause your kernel to dump core and reboot, so you can later
  analyze the core on a higher level with kgdb.


  call boot(0)

  might be a good way to cleanly shut down the running system, sync()
  all disks, and finally reboot.  As long as the disk and file system
  interfaces of the kernel are not damaged, this gives you an almost
  clean shutdown.


  call cpu_reset()

  ...is the final way out of the disaster, almost the same as hitting
  the Big Red Button.


*** What to do if I want to debug a console driver? ***

  Since you need a console driver to run DDB on, things are more
  complicated if the console driver itself is flaky.  You might
  remember the ``options COMCONSOLE'' line, and hook up a standard
  terminal to your first serial port.  DDB works on any configured
  console driver, so of course it also works on a COMCONSOLE.


  Paul Richards, FreeBSD core team member. (paul@FreeBSD.org)
  J"org Wunsch (joerg@FreeBSD.org)