The Art of Debugging

Debugging embedded software is an art. There is no guaranteed method for finding every bug, but there are ways to maximize our chances. In this article I will share some that have helped me over the years. This article does not address the basics of using debugging tools such as GDB or in-circuit emulators; readers are assumed to possess these fundamental skills.

Best Defense is a Good Offense

One of the best ways to eliminate bugs is to not introduce them at all. Code review is a great way to find issues before they become bugs. A code review is not an overview but a line-by-line review of the source code, a process usually led by the code author. It is important not to take shortcuts or skip code that is deemed trivial. Every line of code should be reviewed. Before attending the review, the reviewers should acquire sufficient knowledge about the problem the code is solving. Without this knowledge the reviewers will not be able to critique the logic behind the implementation.

Divide and Conquer

Once the software is tested and bugs are found, the developers have to locate the bugs and eliminate them. The first step is to isolate where a bug was first introduced. Many use the bisection method to narrow the search. An earlier known-good version, together with the version where the bug was found, forms the starting range. The middle version of this range divides it into two halves, and is tested. If the bug is found in the middle version, then the next range is the first half (from the known-good version to the middle version); otherwise, the next range is the second half (from the middle version to the known-bad version). Bisection is then repeated on the new range until the version where the bug was first introduced is found. Once found, a source code compare tool, such as Beyond Compare or WinMerge, is used to compare the buggy version and the one right before it. More often than not, the root cause will be found in the difference.
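
The bisection search described above can be sketched in C. This is only an illustration under stated assumptions: versions are numbered with consecutive integers, the bug stays present in every version after the one that introduced it, and the names find_first_bad, has_bug_fn, and bug_since are hypothetical, not part of any tool. In practice the callback would build and test the given version; here a stand-in simply compares version numbers.

```c
#include <stdbool.h>

/* Hypothetical callback: returns true if the given version shows the bug.
 * In real use this would check out, build, and test that version. */
typedef bool (*has_bug_fn)(int version, void *ctx);

/*
 * Return the first version that shows the bug.
 * Assumptions: good < bad, version `good` passes, version `bad` fails,
 * and once the bug appears it remains in every later version.
 */
int find_first_bad(int good, int bad, has_bug_fn has_bug, void *ctx)
{
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;  /* middle version of the range */
        if (has_bug(mid, ctx)) {
            bad = mid;   /* bug already present: keep the first half */
        } else {
            good = mid;  /* still clean: keep the second half */
        }
    }
    return bad;  /* `good` is the last clean version, `bad` the first buggy one */
}

/* Stand-in test for demonstration only: pretends the bug was introduced
 * in the version number that ctx points to. */
bool bug_since(int version, void *ctx)
{
    return version >= *(const int *)ctx;
}
```

Each pass halves the range, so about 7 builds are enough to search 100 versions. Some version control systems automate the same search; Git, for example, provides the git bisect command.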

Ask the Right Questions

The compare tool will help highlight the changes that may have introduced the bug, but it does not find and explain the root cause.  The root cause is found by asking questions that will lead the developer down the right path of investigation.  Some example questions include:

  • What mechanisms can lead to this bug?
  • What recent changes could lead to this bug?
  • Could hardware have caused this bug?
  • Are the assumptions in code correct, and satisfied?
  • Are there any corner cases unaddressed by the code?

Often developers will quickly jump to conclusions based on their hunches. While a hunch is not a bad thing, the logic behind the hunch should be vetted before committing resources to investigate it. All possible root causes are guilty until proven innocent. Effort should be focused more on explaining how the code could fail, rather than why the code should work. As different possible root causes are vetted, it is important to spot and drill down into conflicting observations, as their final resolution often leads to root cause discovery.

Send in Your Spies

Some bugs, especially those in embedded systems, can be very hard to find because developers have little or no observability. In these situations, one may have to create software traps to capture the faulting sequence. In other words, send in your own spies to help see what’s going on. However, one has to be mindful that spy code can mask the bug. This scenario is particularly likely if the bug is related to stack overflow, improper synchronization, or execution timing. In these situations, the addition of spy code may make the bug go away, making bug trapping difficult. However, in my experience, the majority of the time the spy code will help uncover unexpected execution sequences that lead to the root cause. When well-written spy code fails to gather meaningful data, one should be looking for a stack overflow or a race condition as the root cause.
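
One common form of spy code is a small in-memory trace buffer that records recent events so the sequence leading up to a fault can be inspected afterward. The sketch below is a minimal example, not a definitive implementation. The names trace_log, trace_get_recent, trace_entry_t, and TRACE_DEPTH are hypothetical, and the code assumes a single execution context: it does not protect against an interrupt or another thread logging at the same time.

```c
#include <stdint.h>

#define TRACE_DEPTH 64u  /* number of entries kept; must be a power of two */

typedef struct {
    uint16_t event_id;  /* which spy point fired (values chosen by the developer) */
    uint32_t data;      /* one word of context, e.g. a state or a pointer value */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_DEPTH];
static uint32_t trace_count;  /* total events logged; wraparound is ignored here */

/* Record one event, overwriting the oldest entry once the buffer is full.
 * Kept short so that it disturbs timing and stack usage as little as possible. */
void trace_log(uint16_t event_id, uint32_t data)
{
    trace_entry_t *e = &trace_buf[trace_count & (TRACE_DEPTH - 1u)];
    e->event_id = event_id;
    e->data = data;
    trace_count++;
}

/* Copy out the entry that is `age` events old (0 = most recent).
 * Returns 1 on success, 0 if that entry was never logged or was overwritten. */
int trace_get_recent(uint32_t age, trace_entry_t *out)
{
    uint32_t stored = (trace_count < TRACE_DEPTH) ? trace_count : TRACE_DEPTH;
    if (age >= stored) {
        return 0;
    }
    *out = trace_buf[(trace_count - 1u - age) & (TRACE_DEPTH - 1u)];
    return 1;
}
```

Calls such as trace_log(3, state) would be placed at the points of interest, and the buffer read back with a debugger or dumped after the fault. Writing to RAM rather than printing over a serial port keeps the timing impact small, which lowers (but does not remove) the risk of masking the bug.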

Don’t Get Boxed In

A software developer can be stuck on a hard bug for a long time, with no idea what the root cause could be. The lack of new insights can feel like being trapped inside a box. Before frustration sets in, the developer should consider walking away from the problem and returning to it later with a fresher perspective. Another way to avoid getting boxed in is to talk to other developers. Frequently, casual conversations can turn up new ideas. Also, when the evidence points to the improbable as the root cause, one must give the improbable consideration. I have run into a few undiscovered hardware bugs that all along I thought were software related. Lastly, developers should never assume that they are looking at just one bug. If the evidence hints that there is more than one bug, consider the possibility and cast the net wider to gather more data.

(The above article is solely the expressed opinion of the author and does not necessarily reflect the position of his current and past employers)