Create a Domain Specific Language with JJTree

This article was originally posted on JRoller in 2013. The content is still valid, though the referenced JavaCC version is not outdated.

This blog describes how to parse a text file containing a home grown language into a POJOS using JJTree. The approach could be used to describe a specific topic (domain specific language, short DSL) or parse a specific data format.
The complete code could be found on Bitbucket https://bitbucket.org/skfcz/jroller

The Task

In my actual case I want to describe the structure of a binary file and then generate the POJOs, the parser and (in the distant future) the writer for that format.
The format used to describe the structure looks like this:

/**
 * The Segment header ...
 */
segment LSGSegment {
    { segmentType: 1, compressed: true};

    /** Base Node Element */
    entity BaseNodeElement extends ElementHeader {
        ObjectTypeID : 0x10dd1035, 0x2ac8, 0x11d1, 0x9b, 0x6b, 0x00, 0x80, 0xc7, 0xbb, 0x59, 0x97;
        var BaseNodeData baseNodeData;
    }

    /**  Base Node Data */
    entity BaseNodeData {         
        var I16 versionNumber;
        var U32 nodeFlags;
        var I32 attributeCount;
        var I32[] attributeObjectID { size:"entryCount" };
    }
...

The characteristics of the language are:

It is yet another curly braces language
Each file contains one to more segments containing entities.
Each entity contains a number of variables.
The variables are either simple types or types defined in the same segment.
Segments, entities and variables could carry a JSON like key/value description.
Segemnts and entities may carry a JavaDoc like comment.

Tools

There are several tools to read in files like this. For a start one could use the standard java.io.Reader combined with a lot of regular expression and Java code and implement a reader from scratch. In most cases this becomes pretty unmaintainable code, that's why specific parser generator tools where invented to create parsers code.
Typical candidates for Java are currently :

Each tool will help you to create a parser, each has its advantages and disadvantages. Before I explain why I prefer EMACS over vi, here are my reasons to prefer JavaCC:

	ANTLR	JavaCC	Xtext
runtime dependency	yes	no	yes
number of JARs	1	1	>>10
Generates matching POJOS	no	no	yes
Generates matching ECLIPSE Editor	no	no	yes

As I prefer lightweight solution, JavaCC is my personal favourite. If you want to generate parser, POJOs and an ECLIPSE editor, Xtext seems to be a good solution. The drawback is that you end up with half of the ECLIPSE JARs in your class path.

JJTree

The JavaCC tool is used to create a parser from a grammar description. In most case writing the grammar description itself is pretty straight forward. But the parser itself is not the target, you want to add some busines logic to use the parsed data. With JavaCC this busines logic is placed in the very same file containing the grammar description, so your steps to build a parser looks like this:

The combination of grammar and busines logic within the .jj file is ok for simple languages, but could result in a mixture from hell. Using JJTree offers a solution to that problem, as the grammar description and the busines logic are written in separate files, your steps to build a parse look like this:

The trick is that JJtree generates an intermediate representation of the parsed file, called Abstract Syntax Tree (AST). The handwritten busines logic is applied after parsing on the AST. Off course this comes at a price, the AST stays in memory until it is discarded. Therefore JJTree based parser are acceptable if you intend to parse small to medium sized description file, but a no go e.g. for a 1 GB text representation.
The real benefit is the better separation of concern, as the grammar contains no busines logic, this is concentrated in the code used to traverse the AST after parsing.

Grammar Description

The JavaCC tool uses a grammar description as input. This grammar file describes how the files to read looks like and what actions should be performed while reading. Grammar files have the following layout:

options, describing how JavaCC should work. Contained in options { ... }
parser, describing some Java code used for the parser itself. Contained in PARSER_BEGIN( $parsername) ... PARSER_END($parsername)
tokens, describing the tokens used for the language. Contained in options { ... }
grammar, describing how the tokens are combined to the language. The grammar is defined by "productions", their definition is similar to a java method definition.

The reference on the JavaCC homepage gives you the complete backrgound. For a head start you will find a number of ready to use grammar files here : http://java.net/projects/javacc/downloads/directory/contrib/grammars. This repository is a perfect quarry to find either the grammar you need or at least some parts for reuse. In case you have problems to define your own grammar, there is a quite helpful, low traffic mailing list on JavaCC and the book "Generating Parsers with JavaCC" from Tom Copeland. In this blog I will only present only parts of the JavaCC grammar file.

Grammars for JJTree

The grammars used for JJTree do not look different to those used for JavaCC, there are just some additional options available. When writing grammars I stick to the following rules:

Root Node

The root node of the grammar needs to return the AST node. This is needed later on to traverse the AST tree.

SimpleNode JoveDL():
{}
{
   ( (FormalComment())?  ( SegmentDeclaration() | BlockDeclaration() ) )+
  <EOF>

 { return jjtThis;}
}

Zero to one Token per a production

Even with JJTree you have to define which tokens you need later on and want to keep in an AST node. E.g. to keep the name of an entity one assigns the collect id Token and saves it in the jjtThis.value:

void SegmentDeclaration():
{Token id = null;}
{
  <SEGMENT> id = <IDENTIFIER> <LBRACE>
    ( GenericParameterDeclaration() <SEMICOLON> )?
    ( (FormalComment())? EntityDeclaration() )+   
    <RBRACE>
    { jjtThis.value = id.image; }
}

If you define you productions to contain more than one token value, you will have to work with collections of some kind. And this tends to be messy, as different types of nodes have collections of different length, some no collection at all ...
On the downside you will have to create individual AST nodes for each an every item to preser. If for examplle a variable is defined as array like this var UChar[] version you will have need an AST node for the identifier "version" and the the array definition.

void VariableDeclaration() :
{Token id = null; }
{
    <VAR> TypeDeclaration() ( ArrayDeclaration() )? id = <IDENTIFIER>  ( GenericParameterDeclaration() )? <SEMICOLON>
    { jjtThis.value = id.image; }
}
void ArrayDeclaration() : 
{}
{ "[]" }

The generated AST for "var UChar[] version { size:80};" looks like below, the ArrayDeclaraion is an empty node used as marker:

VariableDeclaration, version
   TypeDeclaration, null
       SimpleTypeDeclaration, UChar
   ArrayDeclaration
   GenericParameterDeclaration, null
       ParameterNameAndValue, size
           ParameterValue, 80

Parse Early

Literals like integers or doubles are completley described by your tokens definition. Therefore it is a safe bet to parse them from the string representation into the their respective type in the AST node:

void ParameterValue() : 
{Token vals = null;}
{    
      vals = <INTEGER_LITERAL> { jjtThis.value =  Integer.parseInt( vals.image);}
    | vals = <FLOATING_POINT_LITERAL> { jjtThis.value =  Double.parseDouble( vals.image);}
    | vals = <STRING_LITERAL> { jjtThis.value =  vals.image;}
    | vals = <BOOLEAN_LITERAL> { jjtThis.value =  Boolean.parseBoolean(vals.image);}
}

Running JJTree

After defining the grammar file one need to run JJTree and JavaCC to create your java files. JJTree inserts the Java Code used to create the AST into a new file (".jj"), this is taken as input for JavaCC to create the parser itself. Depending on your build process running JJTree and JavaCC could be done either using the command line, ANT or Maven. The following is some ANT code to create the needed files:

<target name="javacc" depends="init">
        <echo>Delete automatically generated files</echo>
        <delete >
            <fileset dir="src/de/groygroy/javacc/">
                <exclude name="jovedl.jjt" />
                <include name="*.java" />
                <include name="jovedl.jj" />                
            </fileset>
        </delete>
        <echo>Generate Parsers</echo>
        <java classname="jjtree" fork="true" dir="${basedir}/src/de/groygroy/javacc/p1">
            <jvmarg line="${run.jvmargs}" />
            <classpath>
                <path path="javacc-5.0/bin/lib/javacc.jar" />
            </classpath>
            <arg value="jovedl.jjt" />
        </java>
        <java classname="javacc" fork="true" dir="${basedir}/src/de/groygroy/javacc/p1">
            <jvmarg line="${run.jvmargs}" />
            <classpath>
                 <path path="javacc-5.0/bin/lib/javacc.jar" />
            </classpath>
            <arg value="jovedl.jj" />
        </java>
    </target>

Running JJTree and JavaCC creates a number of Java classes:

JavaCharStream.java is a JavaCC specific stream class
${parserName}.java contains the generated parser
${parserName}Constants.java contains constants used for the parser
Token.java is the JavaCC specific Token represntation
${parserName}TokenManager.java contains the tokenizer
TokenMgrError.java is the JavaCC specific exception for the TokenManager
Node.java and SimpleNode.java are basic classes for the AST nodes
AST??? files contain a node class for each declared production
${parserName}TreeConstants contains constants used for the AST generation
${parserName}Visitor contains the interface for the AST Visitor

To run the generated parser in your code you need only three lines of code:

    
    InputStream in = new FileInputStream( "data/simple.jdl" );
    JoveDLParser parser = new JoveDLParser( in );
    SimpleNode node = parser.JoveDL();

Traversing the AST

Now it is time to traverse the complete AST and e.g. fill some POJOs with the found values. You will find the interface for a visitor class, due to the MULTI=true; option containing one method per declared production:

public interface JoveDLParserVisitor
{
  public Object visit(SimpleNode node, java.util.Map<String, Object> data);
  public Object visit(ASTJoveDL node, java.util.Map<String, Object> data);
  public Object visit(ASTFormalComment node, java.util.Map<String, Object> data);
  public Object visit(ASTBlockDeclaration node, java.util.Map<String, Object> data);
  public Object visit(ASTSegmentDeclaration node, java.util.Map<String, Object> data);
  ...
}

After the parser created the AST, you could run an arbitrary number of vistors of the tree. Let’s start with a minimal implementation printing the AST including the preserved Tokens values. Each method is started on one type of AST Node, it prints the node type ( node.toString() ) and the value contained in the node due to the code found in the grammar (node.jjtGetValue()). After print to log, the Visitor (this) is passed on the children node of the given currently visited node using "node.childrenAccept".

public class LoggerVisitor implements JoveDLParserVisitor {

    public static Logger log = LoggerFactory.getLogger( LoggerVisitor.class );
    private int indent = 0;
   
    public Object visit( SimpleNode node, Map<String, Object> data ) {
        log.debug( indent( 1 ) + node.toString() + ", " + node.jjtGetValue() );
        node.childrenAccept( this, data );
        indent( -1 );
        return data;

    }

    public Object visit( ASTJoveDL node, Map<String, Object> data ) {
        log.debug( indent( 1 ) + node.toString() + ", " + node.jjtGetValue() );
        node.childrenAccept( this, data );
        indent( -1 );
        return data;
    }

    ...

    private String indent( int delta ) {
        indent += delta;
        StringBuffer buff = new StringBuffer();

        for ( int i = 0; i < indent; i++ ) {
            buff.append( "    " );
        }
        return buff.toString();
    }

To use the visitor it is started on the top node after parsing the file:

InputStream in = new FileInputStream( "data/simple.jdl" );

JoveDLParser parser = new JoveDLParser( in );

SimpleNode node = parser.JoveDL();

// print AST using Visitor
JoveDLParserVisitor visitor  = new LoggerVisitor();
visitor.visit( node, null);

As result it prints the complete AST including the collected tokens ( accessed using jjtGetValue() ):

49 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -     JoveDL, null
49 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -         FormalComment, /**
 * The file header contains metadata and the location 
 * of  the table of content
 */
49 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -         BlockDeclaration, FileHeader
49 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -             EntityDeclaration, FileHeader
49 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                 VariableDeclaration, version
49 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                     TypeDeclaration, null
50 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                         SimpleTypeDeclaration, UChar
50 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                     ArrayDeclaration, null
50 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                     GenericParameterDeclaration, null
50 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                         ParameterNameAndValue, size
50 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                             ParameterValue, 80
50 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                 VariableDeclaration, byteOrder
50 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                     TypeDeclaration, null
50 [main] DEBUG de.groygroy.javacc.p1.LoggerVisitor  -                         SimpleTypeDeclaration, UChar

Using the AST

Now you have parsed your file into an AST and know how to traverse the AST. As a last step you have to implement some business actions in the visitor. Typically you have defined a class like Block, Segment, and Entity which should represent the parsed file.
Why not use the created AST??? nodes themself ? First off all, those files are automatically created using JJTree. Any handcrafted changes to the AST files are lost with the next run of JJTree. Secondly those classes represent the grammar of the file, which is not desirable for all caes. Using a separate set of POJOs offers a better separation of concern and more flexibility.

For this demonstration we create POJOs representing Segments and Entities. You could use different strategies to handles those objects in the Visitor. One is to use visitor variables (smells a bit like a FORTRAN 77 common block), another is return the values from the visitor methodes. A third option is to used the data paramter passed from visitor methode. to visitor method. In the example it is a simple key / value map, but you could use any other types defined by the VISITOR_DATA_TYPE option.
I prefer to create and fill object in their respective visitor method and then pass it to its "parent" method via the return value. This collects the returned values and uses them.

public EntityDeclaration visit( ASTEntityDeclaration node, Map<String, Object> data ) {
        // Use the entity name from the AST node and create a new POJOs
        String entityName = ( String ) node.jjtGetValue();
        EntityDeclaration retval = new EntityDeclaration();
        retval.setName( entityName );
        log.info( "created " + retval );
        return retval;
}

public SegmentDeclaration visit( ASTSegmentDeclaration node, Map<String, Object> data ) {
        String segmentName = ( String ) node.jjtGetValue();

        SegmentDeclaration retval = new SegmentDeclaration();
        retval.setName( segmentName );

        // Travers through the children nodes and 
        for ( int i = 0; i < node.jjtGetNumChildren(); i++ ) {
            Node child = node.jjtGetChild( i );

            // If the child is an EntityDeclaration child node visit it
            // and put the returned entities in the list of defined entities
            if ( child instanceof ASTEntityDeclaration ) {
                Object val = child.jjtAccept( this, data );
                retval.getEntities().add( ( EntityDeclaration ) val );
            } else {
                // found another AST child, e.g. a variable declaration
                // no need to descend any further
            }
        }

        log.info( "created " + retval );
        return retval;
    }

As you see, the visit method for the entities never traverses to its child nodes and that for the segment declaration visits only those for the entity declaration.
Using and collecting the visitor methods return value works as long the desired value is contained in a child node of the AST. For the described file format this is true with the exception of the JavaDoc like comment. Comments are typically described before their respective entity. That’s why we have to collect the comments value and use it later on. The grammar allows comments on blocks, segments and entity. The modified visitor preserves the comment in the lastComment variable. That variable is used for segments and entities, but reset to null on blocks, segments and entities to avoid saving e.g. an entity comment into the next segment.

public class CollectorVisitor implements JoveDLParserVisitor {
    
   private String lastComment;

   public Object visit( ASTFormalComment node, Map<String, Object> data ) {
        lastComment = ( String ) node.jjtGetValue();
        node.childrenAccept( this, data );
        return data;
   }
   public Object visit( ASTBlockDeclaration node, Map<String, Object> data ) {
        lastComment = null; // Set lastComment to null, otherwise a block comment   
                            // might become an entity comment
        node.childrenAccept( this, data );
        return data;
    }

    public SegmentDeclaration visit( ASTSegmentDeclaration node, Map<String, Object> data ) {
        String segmentName = ( String ) node.jjtGetValue();        

        SegmentDeclaration retval = new SegmentDeclaration();
        retval.setName( segmentName );
        retval.setComment( lastComment );
        lastComment = null;

         ...
        return retval;
    }

   public EntityDeclaration visit( ASTEntityDeclaration node, Map<String, Object> data ) {
        // Use the entity name from the AST node and create a new POJOs
        
        String entityName = ( String ) node.jjtGetValue();
        EntityDeclaration retval = new EntityDeclaration();
        retval.setName( entityName );
        retval.setComment( lastComment );
        lastComment = null;
        ...
        return retval;
    }

Resume

Using the demonstrated approach one could create parsers for DSLs or other file formats, with a reasonable effort and the chance to create maintainable code.

Carsten Zerbst

Dieses Blog durchsuchen