{Previous tutorial}[link:files/doc/tutorials/04-EventPropagation_rdoc.html]
{Next tutorial}[link:files/doc/tutorials/06-Overview_rdoc.html]
= Representing and handling errors

One thing about robotics, and plan execution in particular, is that Murphy's
law applies quite well. There are several reasons for this. The first is that
the models used by planning (and therefore the plans it builds) are (i) too
simple to completely reflect reality, (ii) badly parametrized and (iii)
represent dynamic agents, which may themselves be able to make decisions. So, in
essence, the rule of thumb is that a plan will fail during its execution.

Because Roby represents and executes all the activities of a given system, the
representation of errors becomes very powerful: when an error appears
somewhere, it is quite easy to determine what its consequences are.

What this tutorial will show is:
* how parts of the error conditions are encoded in the task structure.
* how exceptions that come from the code itself (like NoMethodError ...) are
  handled.

== Where do errors come from?
=== Task structure as a constraint representation

Some (not all) task relations also define a set of constraints on the plan
execution. For instance, the +realized_by+ relation defines a set of
_desirable_ and a set of _forbidden_ events (the +success+ and +failure+
options of TaskStructure#realized_by). If none of the desirable events are
reachable (i.e. none will _ever_ be emitted, see
Roby::EventGenerator#unreachable?), or if one of the forbidden events is
emitted, a ChildFailedError error is generated.
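
The check described above can be sketched in plain Ruby. This is only an
illustration of the rule, not Roby's implementation: <tt>ToyChild</tt> and
<tt>check_dependency</tt> are hypothetical names, and only the +success+ and
+failure+ option names come from the text.

```ruby
# Toy model of a child task: which events it has emitted so far, and which
# ones are known to be unreachable. This is NOT a Roby class.
ToyChild = Struct.new(:emitted, :unreachable)

# Returns :child_failed when the constraint is violated, :ok otherwise.
# +success+ and +failure+ mirror the option names mentioned in the text.
def check_dependency(child, success:, failure:)
  # One of the forbidden events has been emitted.
  return :child_failed if failure.any? { |ev| child.emitted.include?(ev) }
  # None of the desirable events can be emitted anymore. An empty +success+
  # set is treated as "no constraint", which is an assumption of this sketch.
  if !success.empty? && success.all? { |ev| child.unreachable.include?(ev) }
    return :child_failed
  end
  :ok
end

running = ToyChild.new([:start], [])            # nothing wrong yet
failed  = ToyChild.new([:start, :failed], [])   # forbidden event emitted
stuck   = ToyChild.new([:start], [:success])    # success can never happen

[running, failed, stuck].each do |child|
  puts check_dependency(child, success: [:success], failure: [:failed])
end
```

The first child satisfies the constraint. The other two violate it, one
through a forbidden event and one through an unreachable desirable event.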

For instance, in the first tutorial, an error was raised because the +failed+
event of ComputePath was emitted while ComputePath was a child of MoveTo:
  $ scripts/shell
  >> move_to! :x => 10, :y => 10
  => MoveTo{goal => Vector3D(x=10.000000,y=10.000000,z=0.000000)}:0x48350370[]
  >>
  !Roby::ChildFailedError
  !at [336040:01:45.419/186] in the failed event of ComputePath:0x483502e0
  !block not supplied (ArgumentError)
  !  /home/doudou/dev/roby/lib/roby/thread_task.rb:51:in `instance_eval',
  !    /home/doudou/dev/roby/lib/roby/thread_task.rb:61:in `value',
  !    /home/doudou/dev/roby/lib/roby/thread_task.rb:61:in the polling handler,
  !    /home/doudou/system/powerpc-linux/ruby-1.8.6/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `gem_original_require',
  !    /home/doudou/system/powerpc-linux/ruby-1.8.6/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `require',
  !    scripts/run:3
  !
  !The failed relation is
  !  MoveTo:0x48350370
  !    owners: Roby::Distributed
  !    arguments: {:goal=>Vector3D(x=10.000000,y=10.000000,z=0.000000)}
  !  realized_by ComputePath:0x483502e0
  !    owners: Roby::Distributed
  !    arguments: {:max_speed=>1.0,
  !     :goal=>Vector3D(x=10.000000,y=10.000000,z=0.000000)}
  !The following tasks have been killed:
  !  ComputePath:0x483502e0
  !  MoveTo:0x48350370

In the case of the PlannedBy relation that we saw in the previous tutorial, the
error is that no plan can be found. A PlanningFailedError is generated in that
case.

What those two types of errors have in common is that the error can be
associated with one of the plan objects (event or task). They are <i>localized
errors</i> and are subclasses of Roby::LocalizedError. The benefit is that
their impact on the plan execution can be assessed. It is therefore possible to
handle the error at the plan level and continue executing what can still be
executed.

=== Errors generated by the code itself

Here, the problem is no longer plan-specific errors, but errors that appear
because of bugs in the code itself. From that point of view, the code running
in a Roby controller is split into two parts:

An error in the <i>framework code</i> is the really problematic case: it means
that there is a bug in the execution engine itself. In that case, Roby tries to
shut down as cleanly as possible by killing all tasks that are being executed.

The <i>user code</i> is the part of the code which is tied to events and tasks:
event commands, event handlers, polling blocks. For those, it is possible to
generate a Roby::LocalizedError as in the previous case and to handle the error
at the plan level. Failed event commands generate a Roby::CommandFailedError,
failed event handlers a Roby::EventHandlerError. Polling blocks emit +failed+
with the exception raised by the poller as context (see the error message
above).

Let's try. Add the following event handler in the definition of MoveTo
(<tt>tasks/move_to.rb</tt>).

  on :start do
    raise
  end

Start (or restart) the controller and launch a <tt>move_to!</tt> action in the
shell. The following should happen:

  !Roby::EventHandlerError: user code raised an exception at [336641:28:04.607/23] in the start event of MoveTo:0x2b4330b4fae8
  !
  !
  ! (RuntimeError)
  !./tasks/move_to.rb:10:in event handler for 'start',
  !  /home/joyeux/system/rubygems/lib/rubygems/custom_require.rb:27:in `gem_original_require',
  !  /home/joyeux/system/rubygems/lib/rubygems/custom_require.rb:27:in `require',
  !  scripts/run:3
  !The following tasks have been killed:
  !  MoveTo:0x2b4330b4fae8

Something equivalent would happen with a task-level event handler (i.e. one
defined on the task object instead of the task model). Remove the model-level
handler we just added and try adding the following to the planning method in
<tt>planners/PathPlan/main.rb</tt>. Execute, and see the result!

  move.on :start do
    raise
  end

Now, what happens during execution: how does Roby react to that error? What we
can see in the relation display is the following two successive steps (don't
forget to uncheck <tt>View/Hide finalized</tt>).

link:../../images/replay_handler_error_0.png
link:../../images/replay_handler_error_1.png

From Roby's point of view, the event has already happened when the event
handlers are called. Therefore, the event propagation should go on (the
temporal structure is well-formed). However, an error occurred and has not been
handled, so the MoveTo task cannot be kept running. Stopping it is the job of
the garbage collection process, which queues the 'stop' event to be executed
during the next cycle. The MoveTo task is therefore stopped at the next cycle,
and the tasks that are now useless are stopped as well.

For event commands, it all depends on where the exception is raised. If 'emit'
has already been called, then the event will be emitted and propagated.
Otherwise, it counts as a cancellation of the event command.
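
The difference between the two cases can be illustrated with a toy event
object. This is a sketch in plain Ruby: <tt>ToyEvent</tt> and its methods are
hypothetical stand-ins, not Roby's event generator.

```ruby
# Hypothetical stand-in for an event with a command block. NOT Roby API.
class ToyEvent
  attr_reader :errors

  def initialize(&command)
    @command = command
    @emitted = false
    @errors  = []
  end

  # Called from inside the command to mark the event as emitted.
  def emit
    @emitted = true
  end

  def emitted?
    @emitted
  end

  # Runs the command. An exception is recorded in both cases, but whether
  # the event counts as emitted depends on where the exception was raised.
  def call
    begin
      instance_eval(&@command)
    rescue StandardError => e
      @errors << e
    end
    emitted?
  end
end

late  = ToyEvent.new { emit; raise "raised after emit" }
early = ToyEvent.new { raise "raised before emit"; emit }

puts "late:  emitted=#{late.call}, errors=#{late.errors.size}"
puts "early: emitted=#{early.call}, errors=#{early.errors.size}"
```

In the first case the event is emitted and an error is reported. In the second
case the command is cancelled and only the error remains.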

== Handling errors

Now that we have seen how errors are detected and represented, we can tackle
the problem of handling them. There are three ways to do that in Roby, which
are presented below.

As we saw in the fifth tutorial, the forwarding relation represents an event
generalization (the target represents a superset of the situations represented
by the source). This makes it possible to represent <i>fault modes</i>, i.e.
specific fault situations that are classified through the forwarding relation
(see figure below), the target of these forwarding relations being, of course,
the +failed+ event. This is used during error handling to generalize the event
handlers: an event handler which applies to a given erroneous situation also
applies to all the situations that are subsets of it.

link:../../images/task_event_generalization.png

*Example*: the +blocked+ event is a particular fault mode during the movement.
A more complex forwarding network would make it possible to represent the
relationships between the different types of faults recognized by the system.
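
As a rough illustration of this generalization, the following plain-Ruby sketch
walks a forwarding table to find every handler that applies to a fault. The
table, the <tt>:slipping</tt> event and the helper names are hypothetical;
only +blocked+ and +failed+ come from the text.

```ruby
# Hypothetical forwarding table: source event => more general target event.
# It must not contain cycles, otherwise the walk below would never end.
FORWARDS = {
  slipping: :blocked,  # made-up fault mode, more specific than :blocked
  blocked:  :failed
}.freeze

# Hypothetical handlers, indexed by the event they are attached to.
HANDLERS = {
  blocked: "try to back up",
  failed:  "report the failure"
}.freeze

# Returns the event followed by every event it generalizes to.
def generalizations(event, forwards = FORWARDS)
  chain = [event]
  chain << forwards[chain.last] while forwards.key?(chain.last)
  chain
end

# A handler applies to an event if it is attached to the event itself or to
# one of its generalizations.
def applicable_handlers(event)
  generalizations(event).map { |ev| HANDLERS[ev] }.compact
end

p generalizations(:slipping)
p applicable_handlers(:slipping)
p applicable_handlers(:failed)
```

A handler attached to +failed+ is therefore found for every fault mode, while
a handler attached to +blocked+ is only found for +blocked+ and its own
specializations.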

=== Repairing during events propagation

If a child fails, for instance because of a spurious problem, it is possible to
restart the failing child directly in the event handler of 'failed', replacing
the failed task with the new one. This is as simple as:

  on(:failed) do
    plan.respawn(self)
  end

Let's try it. Add the following to the definition of +TrackPath+ to simulate an
error:
  attr_accessor :should_pass
  event :start do
    if !should_pass
      forward :start, self, :failed, :delay => 0.2
    end
    emit :start
  end

The event command schedules a forwarding from +start+ to +failed+, delayed by
0.2 seconds, if +should_pass+ is false (the default). In that case, +failed+ is
emitted 0.2 seconds after the path tracking has been started.
    
Then, the error handler itself:

  on :failed do
    if !should_pass
      Robot.info "respawning ..."
      new_task = plan.respawn(self)
      new_task.should_pass = true
    end
  end

This handler replaces the failed TrackPath with a copy of itself and schedules
it for starting. Then, +should_pass+ is set to true on the new task to avoid
further errors. Look at the relation display to see how it worked. Note that
doing this on the +failed+ event itself is a bad idea in general, as +failed+
is also emitted when the task gets interrupted.

The next figure is an example of how it works on a real robot. As a workaround
for a spurious error in the +TrackSpeedStart+ task, known to be harmless, an
event handler is defined on this task model, which restarts the task online.

link:../../images/repair_event_propagation.png

=== Asynchronous repairs

Sometimes, repairing the plan requires a few actions. While those actions are
performed, we do not know yet whether the plan *can* be repaired, only that the
necessary measures are being taken to assess and/or repair it.

In Roby's plans, asynchronous repairs are represented as <i>plan repairs</i>
(Roby::Plan#add_repair). Plan repairs are tasks which are associated with a
task's event. While the plan repair is running, errors whose failure point is
the associated event are simply ignored by the system. Once the task has
finished, normal error detection and handling resumes.
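
The masking behaviour can be sketched as follows. This is a toy model in plain
Ruby: <tt>ToyPlan</tt>, <tt>ToyRepair</tt> and the signature given here to
<tt>add_repair</tt> are assumptions made for the illustration and do not
reproduce Roby::Plan.

```ruby
# Toy repair task: only its running state matters here. NOT a Roby class.
ToyRepair = Struct.new(:name, :running)

# Toy plan which remembers which repair is attached to which failure point.
class ToyPlan
  def initialize
    @repairs = {} # failure point ([task name, event]) => repair task
  end

  # Hypothetical signature, chosen for this sketch only.
  def add_repair(failure_point, repair)
    @repairs[failure_point] = repair
  end

  # Returns the failure points that must still be treated as errors, i.e.
  # the ones which are not covered by a repair that is currently running.
  def pending_errors(failure_points)
    failure_points.reject do |point|
      repair = @repairs[point]
      repair && repair.running
    end
  end
end

plan   = ToyPlan.new
repair = ToyRepair.new("wait_and_check", true)
plan.add_repair(["TrackPath", :blocked], repair)

errors = [["TrackPath", :blocked], ["ComputePath", :failed]]
during = plan.pending_errors(errors) # the repair is running
repair.running = false
after  = plan.pending_errors(errors) # the repair has finished

p during
p after
```

While the repair runs, only the unrelated error is reported. Once it has
finished, the original error is visible again unless the repair actually fixed
the plan.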

To automate the process of installing plan repairs, an ErrorHandling relation
exists, which defines the set of possible plan repairs for a given task and
event. Roby::TaskEventGenerator#handle_with makes it easy to add a new plan
repair by associating the receiving task event with the (pending) task.

Here is a simple example: http://roby.rubyforge.org/videos/rflex_repaired.avi.
In this video, the microcontroller which drives the robot's motors can give us
spurious <tt>BRAKES_ON</tt> messages. The problem is that the Roby controller
must determine whether the message is spurious or whether the brakes have
actually been set, for instance by means of an emergency switch. To do that, an
error handler is set up, which waits for a few seconds and then tests the
<tt>BRAKES_ON</tt> state of the robot. If the brakes are reported as off, then
the robot can start moving again. Otherwise, the error was a legitimate one and
should be handled by other means.

Let's simulate the same kind of problem in the PathPlan controller. What we will
do is the following:
* add a 'blocked' fault event to the model of TrackPath, and make the polling
  block of TrackPath emit 'blocked' randomly.
* have a 'repair' task wait 2 seconds and either (randomly) respawn the path
  tracking after those two seconds, or emit +failed+.

The first point is done by adding the following to the definition of TrackPath:
  event :blocked
  forward :blocked => :failed

and then those three lines to the polling block:
  if rand < 0.05
    emit :blocked
  end

A new RepairTask model has to be added. Open <tt>tasks/repair_task.rb</tt> and
add the following:
  class RepairTask < Roby::Task
    terminates

    event :start do
      Robot.info "repair will succeed in 2 seconds"
      forward :start, self, :success, :delay => 2
      emit :start
    end

    on :success do
      plan.respawn(failed_task)
    end
  end

Finally, the repair handler must be added to the plan. Edit the +move_to+
method in <tt>planners/PathPlan/main.rb</tt> and add the following line before
the last line of the method:
  track.event(:blocked).handle_with(RepairTask.new)

Run as usual and see what happens ...

Another, more complex example is the "P3d repaired" video presented
{here}[files/doc/videos_rdoc.html].

=== Exception propagation

This is the third error handling paradigm available in Roby. It is akin to
classical exception propagation: 
* task models can define per-type exception handlers using Roby::Task#on_exception
* when an error occurs and is not handled by a plan repair, the error is
  propagated up in the +realized_by+ relation, searching for a matching
  exception handler.
* if an exception handler is found, it is called with the error. If the
  exception handler raises, or if it calls #pass_exception, the propagation
  resumes. Otherwise, the system stops propagating the exception. In addition
  to following the +realized_by+ relation, the +planned_by+ relation is used
  to check whether planning activities can repair the error as well (see
  example below).
* if no handler accepted the error, it is passed to a global error handler defined
  by Roby.on_exception.

link:../../images/exception_propagation_5.png
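
The propagation rules listed above can be sketched with a toy task hierarchy.
This is plain Ruby written for illustration: <tt>ToyTask</tt> only borrows the
name +on_exception+, each task has a single parent, a handler declines by
returning <tt>:pass</tt> (a simplification of #pass_exception), and the
+planned_by+ relation is not modeled.

```ruby
# Hypothetical task with at most one parent. NOT Roby::Task.
class ToyTask
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name, @parent = name, parent
    @handlers = []
  end

  # Registers a handler for errors of the given class. In this sketch, the
  # block returns :pass to decline the error.
  def on_exception(error_class, &block)
    @handlers << [error_class, block]
  end

  # True if one matching handler accepted the error. A handler that raises
  # is treated as having passed the error on.
  def handle(error)
    @handlers.any? do |error_class, block|
      next false unless error.is_a?(error_class)
      begin
        block.call(error) != :pass
      rescue StandardError
        false
      end
    end
  end
end

# Walks up from the task where the error originated. Returns the name of the
# task that handled it, or :global if it reached the global handler.
def propagate(error, origin, global_handler)
  task = origin
  while task
    return task.name if task.handle(error)
    task = task.parent
  end
  global_handler.call(error)
  :global
end

class ToyChildFailed < StandardError; end

mission = ToyTask.new("Mission")
move    = ToyTask.new("MoveTo", mission)
track   = ToyTask.new("TrackPath", move)

move.on_exception(ToyChildFailed)    { |_error| :pass }    # declines
mission.on_exception(ToyChildFailed) { |_error| :handled } # accepts

unhandled = []
global = ->(error) { unhandled << error }

p propagate(ToyChildFailed.new, track, global)
p propagate(RuntimeError.new("unexpected"), track, global)
```

The first error is declined by MoveTo and accepted by Mission. The second one
matches no handler and ends up in the global handler.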

== Unhandled errors

Once the exception propagation phase is finished, the plan analysis (i.e.
constraint verification) is re-run once to verify that the exception handlers
have indeed repaired the errors. If errors are still found, they cannot be
handled anymore.

This set of errors, together with the errors that have not been handled
before, determines a set of tasks that can be dangerous for the whole system.
The garbage collection kicks in and takes the necessary actions to remove these
tasks from the plan. One necessity is to kill all tasks which were depending on
the faulty activities: all tasks that are parents of the faulty tasks in any
relation are forcefully garbage collected. In the exception propagation example
above, all tasks which have a number will be killed and removed from the plan.
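
Computing the set of tasks to kill amounts to walking the parent links upwards
from the faulty tasks. The sketch below does this on a made-up plan in plain
Ruby. The task names in <tt>PARENTS</tt> and the helper are hypothetical, and
all relations are merged into one table for simplicity.

```ruby
require "set"

# Hypothetical plan: child task => parents, all relations merged together.
PARENTS = {
  "ComputePath" => ["MoveTo"],
  "TrackPath"   => ["MoveTo"],
  "MoveTo"      => ["Mission"],
  "Monitor"     => []
}.freeze

# Returns the faulty tasks plus all their direct and indirect parents.
# Including the faulty tasks themselves matches the "tasks have been killed"
# list shown in the error messages earlier in this tutorial.
def tasks_to_kill(faulty, parents = PARENTS)
  killed = Set.new
  queue  = faulty.dup
  until queue.empty?
    task = queue.shift
    # Set#add? returns nil when the task was already visited.
    next unless killed.add?(task)
    queue.concat(parents.fetch(task, []))
  end
  killed
end

p tasks_to_kill(["ComputePath"]).to_a.sort
```

Tasks that do not depend on the faulty one, such as the made-up Monitor task
or the sibling TrackPath, are left alone by this walk.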

= Next tutorial

This tutorial presented one of the two most singular features of Roby: an
extensive way to represent and handle errors. Among the mechanisms shown, the
error handling relation is the most powerful, as it makes it possible to
represent error handling directly in the plan and would for instance work in a
multi-robot context, even without communication between the two robots.

{The next tutorial}[link:files/doc/tutorials/06-Overview_rdoc.html] is not
really a tutorial. It is an overview of important Roby features that these
tutorials did not cover. Again, my PhD thesis should still be considered one of
the most central design documents for understanding the system.