Analyzing Failure Effects and Recovery in ERLANG /OTP systems
- Project manager ?%
- Bengt Jonsson,
Department of Computer Systems, Department of Information
Technology, Uppsala University.
- Ph.D. student 80%
- Jan
Nyström,
Department of Computer Systems, Department of Information
Technology, Uppsala University.
- Industrial contact
- Ulf Wiger,
ATM Multiservice Networks,
Ericsson Telecom AB.
- Thomas Arts,
Computer Science Laboratory, Ericsson Utveckling AB.
ERLANG is a concurrent functional language, which
has been successfully used for the development of complex telecommunication
software within Ericsson. An important feature of
ERLANG, which allows to build highly concurrent and still
very robust systems, is its in-built support for recovery from failures.
An ERLANG system typically creates a large number
of processes.
OTP provides support for organising the processes of an
ERLANG system into trees, in which parent
processes monitor the failure status of their children and are responsible
for recovery, typically meaning to restart chrashed children.
The purpose of thus project is to support the analysis of the ability of
ERLANG systems to tolerate and recover crashes of
individual processes. The questions that we seek to answer are:
-
What is the effect of a process crash on the overall system?
-
What recovery will happen upon the crash of a process?
-
What are the potentially dangerous scenarios in which recovery may not be
guaranteed?
The questions will be addressed on the basis of the code of an
Erland system.
The idea of the project is to answer the above questions by developing
a tool that can extract relevant system structure from an
ERLANG system, and thereafter analyze this structure
to find potential "holes" in the recovery mechanisms.
In our work, we will assume that applications are written using the
libraries provided in OTP.
The system structure that is relevant for recovery is based on the
process structure, represented as a tree.
In this tree, each node
represents a process, and the processes that it may potentially create are
represented as its children. Along with the tree will be recorded
information that is relevant for recovery, such as links for propagating
crash information, types of certain processes that are responsible for
restart (there are predefined such types in OTP), etc.
The tree will be
visualised graphically, with possibilities for animation
to show creation of the processes at setup.
The extracted structure should be subject to analysis to answer
the questions
-
What is the effect of a process crash on the overall system? This
can be visualized on the extracted graph, using the additional information in
the tree.
-
What recovery will happen upon the crash of a process? Often, the answer can
be provided by inspecting the attributes of the tree.
For instance, if one keeps the attributes of the
standard behaviours, such as maximal restart frequency, in the model
one can use the knowledge of these behaviours and, e.g.,
show what parts, if any, are restarted on the failure of a particular
process.
-
What are the potentially dangerous scenarios in which recovery may not be
guaranteed? In connection with previous anslyses, certain crashes from which
recovery is not obvious can be investigated, possible by invoking
other types of anlysis.
The project will proceed in a number stages presented below.
After a meeting with Ericsson on December 21, the earlier plan has been changed
to include a visit to at the development team at
Ericsson ATM Multiservice Networks, in order to apply the
current implementation to existing code in the AXD 301 system, and
to have better feedback from Erlang system developers.
- Stage1
- In a system well written system, i.e. where the
OTP System Architecture Support Libraries(SASL) have been
used, as well as the generic modules and behaviours, identify the
static processes created and in what order they will be created. By
static processes is meant the processes that are created in the
beginning of the system and exist continuously throughout the systems
lifespan. The processes should be annotated by behaviour where that is
known, i.e. where OTP standard behaviours such as ``generic server''
is used. The structure will be presented graphically
- Stage2
- A) Turn the ``unsafe'' creation tree from
prototype of Stage1 into a safe over-approximation of the
possible creation trees. Define what a safe over-approximation is in
terms of semantics with side-effects for ERLANG .
B) Add animation of start and failure behaviours to the graphical
interface.
- Stage3
- Investigate the possible ways in which an Erlang system may
not recover from a process crash.
Determine how to look for such scenarios on the basis of the analyzed
structures extracted in Stage 2. The idea is to use this analysis to
guide a programmer to the interesting scenarios which must be more
thoroughly tested and analyzed (by some means).
- Stage4
- Visit to the Ericsson ATM Multiservice Networks, in order to apply the
current implementation to existing code in the AXD 301 system, and
to have better feedback from Erlang system developers. This visit must
be prepared by improving the scalability of the current implementation,
in order that it can handle larger code bases.
- Stage5
- Extend the analysis of Stage 2 to detect also
processes that are dynamically created ("worker processes"),
and which processes are linked. Analyse which processes
communicate with each other. Develop mechanisms for specifying systems
in the same manner that they are presented when abstracted, i.e., as
trees with annotated edges and nodes.
- Stage6
- Provide a mechanism to compare abstracted system
behaviour and specified behaviour. Trough the tracing mechanism
provided by the ERLANG runtime system check(possibly having
annotated the code with ``names'' to identify what processes are
created) that the test runs are within specified behaviour
of the system.
- 2001-05-01
- Stage1 prototype finished.
- 2001-08-28
- ASTEC Seminar presenting Stage1.
- 2001-09-01
- Presentation of Stage 1 prototype at Erlang Workshop in Firenze,
Principles,
Logics, and Implementations of high-level programming languages.
- 2001-11-30
- Stage2 A) prototype finished.
- 2001-12-15
- DoCS Technical report describing the Stage2 A)
- 2002-02
- ASTEC Seminar presenting Stage2.
- 2002-01-15
- Stage2 B) prototype finished.
- 2002-03-01
- Stage 3 results presented in report, which is submitted to
significant conference.
- 2002-03
- Stage 4: Visit to the Ericsson ATM Multiservice Networks
development team. The visit will be preceded by improving
scalability of the analysis engine.
- 2002-05-31
- Stage 3 and 4, and 5 results and prototypes presented at ASTEC seminar.
- 2002-10
- Ph.D. thesis finished and defended, possibly including results
of Stage 6.
We will organise the work as follows:
- Stage1)
- Our process structure tree is constructed using an instrumented evaluator for
Core ERLANG . Core ERLANG [Carlsson00,Carlsson01b] is an intermediate
format used in the OTP ERLANG compiler where syntactic sugar has
been removed and
a restricted set of constructs and formats are used. There are also
constructs in Core ERLANG that are not present in ERLANG , such as,
``let'' and ``letrec'' which are used to replace explicit matching and local
functions generated by list expressions.
The reason, as to why we instrumented an evaluator, rather then trying
a static program analysis, is that when starting an application and
generating the static processes the programmer has the use of, and in
many cases also need for, the full expressive power of ERLANG . Hence we
would have to deal with the full language in order to determine what
processes where generated by an application.
Our reason for writing an evaluator of Core ERLANG as
opposed to ERLANG itself (for which there already exist an evaluator
in the distribution of ERLANG ) is that it was much simpler to deal
with code where matters such as macros, files inclusion and records
had already been taken care of and it has fewer constructs.
The analysis does so far consist of evaluating a call to
application:start/1
, where the result will contain the value
computed, collected ``side-effects'' and the resulting environment.
The collection of ``side-effects'', or rather processes creation and
management. The
resulting environment will contain the modules and applications loaded
during the evaluation.
For more detailed description read the PLI'01[Nyström] contribution.
- Stage2)
- A) Construct a abstracting collecting semantics of Core ERLANG based on
the denotational semantics for Core ERLANG presented in
[Carlsson01a], use this a basis for making a abstracting evaluator
that collects all the alternative result of the different executions.
B) The behaviour of the application is given by the process creation
tree and animation can easily be made of what happens on a process
failure.
- Stage3)
- Design and implement a technique, which analyzes the trees generated
in Stage 2 and calculates things like:
- for each possible process failure, which other processes will be
informed and/or terminated? What actions will the informed processes
(typically supervisors) perform? Are there potential inconsistencies in the
handling of a particular process that may cause deadlocks in the
recovery mechanism?
-
Checks that certain design guidelines are satisfied.
- Stage4)
- Adapt the implementation in order that it is able to scale to applications
typically found in the AXD 301 system, which includes implementing
techniques for representing large collections of potential trees, handling
some important OTP libraries. During the visit at Ericsson, the
implementation will be exercised on applications in the current
AXD 301 system. The work will be discussed with designers in the team.
- Stage5)
-
In order to handle dynamically created "worker processes", one must consider
the problems that such worker processes typically have no registered
names, and that many instances (e.g., one per handled call)
of them are typically created. Approaches to handling these problems include
to name processes after their initial Erlang function, and to identify
sets of "similar" processes as "collection nodes" in the generated tree.
Inspiration may be obtained form techniques used to analyze
pointer structures in program analysis.
- Stage6)
-
This problem could potentially be handled by implementing an "observer"
process, which collects information gathered during a test run, and
condenses it into a process tree, of the same form as used before.
The condensation is similar to that used in code analysis.
-
Jan Nyström and Bengt Jonsson:
Extracting the Process Structure of Erlang Applications
in: Erlang Workshop, Firenze, Sept. 2.
ERLANG [Armstrong,Barklund] is a simple call by value
functional language with
semi-lightweight processes that allows a high degree of
concurrency. Processes communicates asynchronously via message passing
and the distribution is non-transparent. The ERLANG runtime system
supports detection of process, as well as ERLANG node, failures. The
above, together with the code loading mechanisms support for hot-swap of
code, provides a good framework for building fault-tolerant concurrent
soft real-time systems that can be run with high availability.
Typical systems that benefit from ERLANG is telecommunication
systems[Blau,Däcker,Hinde], but ERLANG has also been used to
construct the Bluetail Mail Robustifier[Bluetail] that
handles traffic management for clusters of mail server in a robust way.
Coupled with the ERLANG programming language is Open
Telecom Platform([OTP]) which a set of
libraries originally intended to aid the construction of telecommunication
systems. The libraries encode ``design patterns'' used to construct
telecommunication systems and effectively supplies code skeletons
where the actual behaviour is realised as a call-back module.
Here we will present the state of the art in areas connected to the
work proposed in this project.
A possible approach would be to use abstract interpretation, in the
context of ERLANG this has been explored by Huch in [Huch].
In Huch work the erlang system is viewed as as a set of
expression evaluation in the context of the identity of the process
executing the expression and the message queue. The abstract consists
of truncating the terms in the expression at a predefined depth.
It is mentioned in the paper how one could tailor the interpretation
so that for selected terms the terms are either kept as they are or
truncated at a greater depth. The interpretation which can handle
small problems. The interpretation can only handle tail recursive
programs and does not handle exceptions, links, nor process termination.
Another interesting method of coming to grips with concurrent systems
is to reduce them to a finite state models by means of extracting a control
skeleton from the code.
Holzmann has extracted finite non-deterministic SPIN models for a
C with threads, by abstracting away statements and procedure not
connected to the properties under inspection, in the tools
FeaVer[Holzmanna] and AX[Holzmannb]. This method
relies on the user to define the abstractions, i.e. deciding what
procedures and variables are unimportant, and consequently is not
formally sound.
A similar but more soundly founded approach is applied by Corbet and
colleagues for JAVA in the Bandera projec[Corbet] where they use
shape analysis to determine what variables are only accessible from
one thread.
A more theoretically oriented approach is followed by Nielson et
al[Nielson], which derive finite-state control skeleton from a
program in Concurrent ML where values abstract to their types.
An important subproblem is the construction of call-graphs for the
modules in the system under inspection, notably is resent work by
Agrawal[Agrawal] where the call-graph is construct on demand
rather then constructing the entire call-graph, which has been case in
most earlier work on call-graphs.
The project runs in continuous contact with the AXD 301 development group
at Ericsson Telecom AB. Contact person: Ulf Wiger. Discussions are also
conducted regularly with the Computer Science Lab. and Ericsson UA. Contact
person: Thomas Arts.
- Agrawal
-
Agrawal, G., 2000: Demand-Driven Construction of Call
Graphs, In Proceedings of 9th International Conference in Compiler
Construction, ed. D.A. Watt , LNCS Vol.1781, pp.125-140, Springer-Verlag, 2000.
- Armstrong
-
Armstrong, J., Virding, R., Wikström, C., Williams, M., 1993:
Concurrent Programming in Erlang, 2nd ed., Prentice
Hall, 1996.
- Barklund
-
Barklund, J., Virding, R., 1999: Specification of the
Standard Erlang programming language, draft version 0.7, http://www.erlang.org/download/erl_spec47.ps.gz, June 1999.
- Blau
-
Blau, S., Rooth, J., Axell, J., Hellstrand, F., Buhrgard,
M., Westin, T., Wicklund, G., 1999:
AXD 310: A new generation ATM switching system,
Computer Networks, No.31, pp.559-582, Elsevier Science
B.V., 1999.
- Bluetail
-
Product information on Bluetail Mail Robustifier,
http://www.bluetail.com/products/bmr/.
- Carlsson00
-
Carlsson, R, Gustavsson, B., Johansson, E., Lindgren, T.,
Nyström, S.-O., Pettersson, M., Virding, R., 2000: Core Erlang
1.0 language specification, Technical Report 2000-030,
Department of Information Technology, Uppsala University, 2000.
- Carlsson01a
-
Carlsson, R., 2001: Core Erlang introduction and
formal sematics, Manuscript, Private communication, 2001.
- Carlsson01b
-
Carlsson, R., 2001: An Introduction to Core Erlang,
Proceedings of Erlang Workshop, 2001.
- Corbet
-
Corbet, J.C., 2000: Using Shape Analysis to Reduce
Finite-State Models of Concurrent Java Programs, ACM Transactions on
Software Engineering and Methodology,
Vol.9, No.1, pp.51-93, January 2000.
p
- Däcker
-
Däcker, B., 2000: Concurrent Functional Programming for
Telecommunications: A Case Study of Technology Introduction,
Licentiate thesis, Computer Communication System Laboratory,
Department Of Teleinformatics, Royal Institute of Technology, Sweden,
2000.
- Hinde
-
Hinde, S., 2000:
Use of Erlang/OTP as a Service Creation Tool for IN
Services,
Proceedings of Sixth International Erlang/OTP Users
Conference, http://www.erlang.se/euc/00/one2one.pdf,
2000.
- Holzmanna
-
Holzmann, G.J., Smith, M.H., 2000: Automating software
feature verification, Bell Labs Technical Journal, April-June 2000,
Special Issue on Software Complexity, 2000.
- Holzmannb
-
Holzmann, G.J., 2000: Logic Verification of ANSI-C Code
with SPIN, In Proceedings of 7th International SPIN
Workshop, eds. K. Havelund, J. Penix and W. Visser, LNCS Vol.1885,
pp.131-148, Springer-Verlag, 2000.
- Huch
-
Huch, F., 1999: Verification of Erlang Programs using
Abstract Interpretation and Model Checking, Proceedings of ICFP'99,
ACM SIGPLAN Notices, Vol.34, No.9, pp.261-272, ACM Press, 1999.
- Nielson
-
Nielson, H.R., Amtoft, T., Nielson, F., 1998: Behaviour
Analysis and Safety Conditions: a Case Study in CML, LNCS Vol.1382,
pp.255-269, Springer-Verlag, 1998.
- Nyström
-
Nyström, J., Jonsson, B., 2001:Extracting the Process
Structure of Erlang Applications, Proceedings of Erlang Workshop, 2001.
- OTP
-
OTP, 2000: OTP Documentation,
http://www.erlang.org/doc/current/doc/index.html.
jann@DoCS.UU.SE