Introducing DTrace
Bryan Cantrill (bmc at kiowa.eng.sun.com)
From: Bryan Cantrill (bmc at kiowa.eng.sun.com)
Subject: Introducing DTrace
Newsgroups: comp.unix.solaris
Date: 2003-11-14 22:02:13 PST
All, The DTrace cat has been pawing its way out of the metaphorical bag for quite some time now, with Gavin Maltby posting some especially enticing examples here on comp.unix.solaris. Now that Solaris Express 11/03 is publicly available for download, we on the DTrace engineering team wanted to take a moment to post some slightly more elaborate examples of this new technology. Before we get started, note that there is quite a bit of publicly available documentation on DTrace at BigAdmin: http://www.sun.com/bigadmin/content/dtrace Also, note that this post isn't meant to be a comprehensive tutorial -- it's really more of a quick demonstration of how DTrace might be used to solve problems that you may come across in your day-to-day workings with Solaris. On with the show: let's say you're looking at mpstat(1) output on your multiuser server. You might see something like this: CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 12 1 27 3504 338 206 765 27 114 65 1 337 9 19 41 31 13 1 19 5725 98 68 723 22 108 120 0 692 3 17 20 61 14 0 57 3873 224 192 670 22 75 86 1 805 7 10 35 48 15 36 7 1551 42 28 689 6 68 59 0 132 2 9 35 54 16 14 7 7209 504 457 1031 37 125 244 0 459 6 30 4 60 17 5 5 4960 150 108 817 37 98 154 0 375 6 26 6 62 18 5 6 6085 1687 1661 741 60 76 248 0 434 3 33 0 64 19 0 15 10037 72 41 876 23 100 291 1 454 2 19 9 71 20 12 5 5890 746 711 992 32 122 216 2 960 10 33 4 53 21 60 5 1567 729 713 467 15 80 59 0 376 2 35 10 53 22 0 6 4378 315 291 751 17 84 142 1 312 3 16 1 80 23 0 6 12119 33 3 874 20 82 384 1 513 4 24 11 62 And well you may wonder (as perhaps you often have) -- what the hell is causing all of those cross calls, anyway? (Cross calls appear in the "xcal" column; see mpstat(1).) Using DTrace, investigating this is a snap: # dtrace -n xcalls'{@[execname] = count()}' dtrace: description 'xcalls' matched 4 probes [ letting this run for a few seconds ] ^C mozilla-bin 1 lockd 1 in.mpathd 2 nsrmmd 5 grep 6 chmod 6 cat 6 nwadmin 13 ls 24 in.tftpd 28 nsrindexd 34 fsflush 38 cut 42 find 42 mkdir 66 rm 76 ipop3d 78 scp 79 inetd 96 dtrace 111 nawk 118 imapd-simmonmt 126 rmdir 132 sshd 138 rpc.rstatd 159 mv 398 save 1292 gzip 1315 get_all2 1678 sched 1712 nfsd 3709 tar 27054 So the bulk of these are coming from tar(1) processes. To explore the cross calls coming from tar specifically, we could add a predicate to the above: # dtrace -n xcalls'/execname == "tar"/{@[pid] = count()}' dtrace: description 'xcalls' matched 4 probes [ again letting this run for a few seconds ] ^C 938339 507 938415 784 938229 2438 938390 3282 938363 6464 937067 13984 But now we notice a strange thing: none of these processes seems to be around by the time we're back on the shell, e.g.: # pflags 937067 pflags: cannot examine 937067: no such process or core file So is someone launching a bunch of short-lived tar processes? To explore this as a rough first cut, we could just print the pid whenever a tar process is launched: # dtrace -n syscall::exec\*:return'/execname == "tar"/{trace(pid)}' CPU ID FUNCTION:NAME 18 112 exece:return 954538 21 112 exece:return 954519 17 112 exece:return 954584 21 112 exece:return 954579 21 112 exece:return 954607 22 112 exece:return 954632 21 112 exece:return 954658 20 112 exece:return 954684 20 112 exece:return 954705 15 112 exece:return 954727 14 112 exece:return 954763 16 112 exece:return 954775 12 112 exece:return 954798 18 112 exece:return 954818 22 112 exece:return 954873 22 112 exece:return 954889 23 112 exece:return 954893 14 112 exece:return 954916 14 112 exece:return 954971 ... Seeing that this is happening with some frequency (the above has output every second or so), we want to explore more generally what is being exec(2)'d on the machine: # dtrace -n syscall::exec\*:entry'{@[copyinstr(arg0)] = count()}' [ letting this run for a bit ] ^C /usr/lib/mail.local 1 /usr/bin/true 1 /usr/bin/pgrep 1 /usr/bin/wc 1 /usr/bin/touch 1 /usr/bin/sleep 1 /usr/sbin/modinfo 4 /bin/awk 4 /bin/sh 4 /usr/bin/chmod 14 /usr/bin/cat 14 /usr/bin/tar 14 ./do1 14 /usr/bin/cut 15 /usr/bin/grep 15 /usr/bin/find 15 /usr/bin/ls 15 /usr/bin/mv 15 /usr/bin/rm 16 /usr/bin/mkdir 17 /usr/bin/gzip 29 /usr/bin/nawk 44 /usr/bin/rmdir 48 /home/ellenp/rmdir 48 /home1/netscape/rmdir 48 /usr/dt/bin/rmdir 96 /usr/openwin/bin/rmdir 96 There's a bunch that's interesting here -- but for now we're interested in those tar processes. In particular, when someone is exec'ing a tar process, we want to know who exactly they are. To do this, we write the following DTrace script, and call it "whotar.d": syscall::exec*:entry /copyinstr(arg0) == "/usr/bin/tar"/ { @[execname] = count(); } Running this points us straight to the culprit: # dtrace -s whotar.d [ again letting this run for a bit ] ^C get_all2 22 So get_all2 is the problem. We can now use traditional tools to hone in on this: # ptree `pgrep get_all2` 100578 /usr/sbin/inetd -s 900870 in.rlogind 900872 -csh 901382 /bin/sh get_all3 901383 /bin/sh ./get_all2 80/80acc080 90/00fccc90 959045 /bin/sh ./get_all2 80/80acc080 90/00fccc90 959046 tar xvf - explorer.80c23d8e.ptgw02is-2003.10.27.09.35/README 959047 gzip -dc /net/pamer.central/proactive/rawdata/8e/80c23d8e 901384 /bin/sh ./get_all2 90/80bdbb90 a0/1274eda0 959565 /bin/sh ./get_all2 90/80bdbb90 a0/1274eda0 959566 tar xvf - explorer.8089ed9d.curly-2002.05.06.10.00/README exp 959567 gzip -dc /net/pamer.central/proactive/rawdata/9d/8089ed9d 901385 /bin/sh ./get_all2 a0/80aabda0 b0/1d6c05b0 959619 mv disks/diskinfo disks/format.out disks/mount.out etc/release 901386 /bin/sh ./get_all2 b0/80b264b0 c0/335419c0 959612 /bin/sh ./get_all2 b0/80b264b0 c0/335419c0 959613 tar xvf - explorer.82a86abd.seq22-2002.05.31.21.49/README exp 959614 gzip -dc /net/pamer.central/proactive/rawdata/bd/82a86abd So that explains where the tar processes (and thus the cross calls) are coming from! And just how short-lived _are_ these tar processes? We can answer this question with DTrace, natch. To do this, we'll use an associative array and the "timestamp" variable. We are going to linearly quantize the result from 0 seconds to 60 seconds. We'll also use a "tick-1sec" probe to automatically stop after a minute. The script (which we call "longtar.d"): syscall::exec*:return /execname == "tar"/ { start[pid] = timestamp; } fbt::exit:entry /start[pid]/ { @["tar time, in seconds"] = lquantize((timestamp - start[pid]) / 1000000000, 0, 60); start[pid] = 0; } tick-1sec /n++ >= 60/ { exit(0); } Running this: # dtrace -s longtar.d dtrace: script 'longtar.d' matched 4 probes CPU ID FUNCTION:NAME 23 38110 :tick-1sec tar time, in seconds value ------------- Distribution ------------- count < 0 | 0 0 |@ 1 1 |@@@@@ 4 2 |@@@@@@@@@@@ 8 3 |@@@@@@@ 5 4 |@@@@@@@ 5 5 |@@ 2 6 | 0 7 |@@@@ 3 8 | 0 This highlights just how difficult this problem would be to find using traditional (process-centric) tools: each tar process is only alive for (roughly) two seconds! And yet, taken together, the tar processes are accounting for the vast majority of cross calls on the system. This is a good example of using DTrace because the problem here is _not_ a single application -- it's an insurgent army of tar processes that are slipping away before they can be noticed. Indeed, running prstat(1) on this machine doesn't show tar -- or _any_ of the get_all2 progeny processes -- as top resource-consuming processes. Using DTrace, however, one can quickly root-cause aberrant systemic behavior of this nature. Of course, this is but a fraction of what DTrace can do. We encourage you to check out the AnswerBook at BigAdmin: http://www.sun.com/bigadmin/content/dtrace Or better yet, go to: http://wwws.sun.com/software/solaris/solaris-express/get.html There you can download Solaris Express 11/03 and try DTrace yourself! Enjoy, Bryan Cantrill, Mike Shapiro and Adam Leventhal DTrace Engineering Team ---------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. bmc@eng.sun.com (650) 786-3652
Last changes: Friday, February 09, 2007 03:39:23 PM,
:P 2004 filibeto.org, site statistics