A few days ago, in an attempt to understand how languages are written, I tried to write one myself

I was not really looking forward to writing a tokenizer/lexer from scratch, and was more interested in the implementation of the language.

So I decided to use a parser generator,
after trying out Jison (javascript) and Antlr (Java), I chose Antlr

Antlr’s support for generating targets in
multiple languages (Java, Javascript, Ruby, Python) made me choose it over Jison.

What this language should support?

  • Type Inference
  • Variables (Int, String, Boolean, Null)
  • Printing to stdout
  • Branching with conditionals like if, else if and if
  • Loops while (and should handle nested loops)
  • Arithmetic operations
  • Basic string operations
  • Implicit type casting

Grammar

First, I define a tokenizing rules and grammar in BNF format.
BNF

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
grammar Baritsu;

@header {
package com.shawnrebello.baritsu.antlr;
}

program : list_of_expressions;



list_of_expressions : function_definition TERMINATOR
| list_of_statements
| list_of_expressions list_of_expressions
| TERMINATOR
;


function_definition : DEF ID block ;

block : (NEWLINE | COMMENT)* '{' list_of_statements '}' (NEWLINE | COMMENT)*
;

list_of_statements : (TERMINATOR)* (statement)+;

statement : variable_declaration (TERMINATOR)+
| variable_assignment (TERMINATOR)+
| print_statement (TERMINATOR)+
| loop_statement (TERMINATOR)*
| branch_statement (TERMINATOR)*
| primitive (TERMINATOR)+
;

variable_declaration: DEF ID ('=' expr)? ;

variable_assignment: ID '=' expr ;

print_statement : PRINT expr #printStat
| PUTS expr #putsStat
;

loop_statement : WHILE expr block # whileStat
;
branch_statement : IF expr block (ELSEIF expr)? block ELSE? block # ifStat;

expr : ID '(' argList? ')' # functionCallExpr
| expr '[' expr ']' # arrayExpr
| '-' expr # negateExpr
| '!' expr # notExpr
| expr '*' expr # multiplyExpr
| expr '/' expr # divideExpr
| expr ('+') expr # addExpr
| expr ('-') expr # subExpr
| expr '==' expr # eqExpr
| expr '<' expr # lessThanExpr
| expr '>' expr # greaterThanExpr
| expr '<=' expr # lessThanEqExpr
| expr '>=' expr # greaterThanEqExpr
| ID # idExpr
| primitive # primitiveExpr
| '(' expr ')' # parenExpr
;

argList : expr (',' expr)* ;

TERMINATOR : (SEMICOLON | NEWLINE | SINGLELINECOMMENT)+ ;

primitive : STRING # string
| INT # int
| BOOLEAN # bool
| NULL # null
;

NULL : 'null';
WHILE: 'while';
IF: 'if';
ELSE: 'else';
ELSEIF: 'else if';
STRING : '"' ALPHANUMERICARRAY '"'
{setText(getText().substring(1, getText().length()-1));};
BOOLEAN: 'true' | 'false';
PRINT : 'print';
PUTS : 'puts';
FOR: 'for';
DEF : 'def' ;
ID : LETTER (LETTER | [0-9])* ;
DO : 'do';
END : 'end';
SEMICOLON: ';';
NEWLINE: '\n';
fragment
ALPHANUMERICARRAY: (LETTER | INT | WS)+ ;
LETTER: [a-zA-Z];
INT: [0-9]+;
COMMENT: (MULTILINECOMMENT | SINGLELINECOMMENT ) -> channel(HIDDEN);
MULTILINECOMMENT: '/*' .*? '*/';
SINGLELINECOMMENT: '//' .*? NEWLINE;
WS: [ \t\r]+ -> skip;

github source

I then use Antlr to generate a parser generator for this grammar.

I initially started using listener pattern, but then switched to the visitor pattern for more flexibility.

To know the difference, this link is helpful.

The Compiler class initializes the lexer, the parser and starts the EvalVisitor‘s visit.

The methods in this class define what happens when a grammar rule is matched

Let us walk through one such method,

  • def x = 42 + 5
  • when variable_assignment from the grammar is matched the method visitVariable_assignment is called.
  • the text of the ID (x) will be variableName
  • the right hand side (42 + 5), will be then visited
  • when the value of this RHS expression is
    computed, it is assigned to a list of global ``Variable's in theEnvironment` scope .
  • I’ve used the Environment class, to group variables in the same context
  • The Variable class allows variable declaration, getting and updating values.
  • A map <String, Value> scoped to the Environment, stores the variables
  • The primitives in this language are defined in the class Value
  • Each Value instance has the datatype, and the data
  • The Value class also exposes methods for typecasting and arithmetic operations

Notes:

  • I had to define a function to evaluate the truthyness of non-boolean primitives (for eg: is the integer 0 true or false?)
  • A getType function returns the inferred datatype of the primitive, this is used when typecasting (For eg: def “There are “ + 13 + “ original Cylon models”)
  • A null type was added
  • An undefined type had to be added to handle invalid data

Final notes

Implementing this language has been a fun learning experience,
this language only has a small subset of features offered by the popular general purpose programming languages, and inspite of this, I learnt much more than expected.
Make me better appreciate, the hard work that goes into implementing and supporting a language.

You can check out the source code of this language here..

Thanks for reading!