To better understand computer architecture and programming languages, I wanted to build a language the way real ones are built. I aimed to make an assembly language that a processor could execute, and a compiler to convert high-level code into that assembly language. For fun, the high-level language would consist entirely of emojis. In other words, I wanted to make an emoji-based programming language.
I architected and built this project solo, handling all aspects of the language stack: the virtual machine, the assembly instruction set, and the compiler.
I began by simulating a simple 32-bit processor in C++ to serve as the execution environment for my custom assembly language.
Virtual Machine Architecture:
32 General-Purpose Registers: Implemented as an array for fast access.
1 MB of RAM: Simulated as a one-dimensional byte array.
Control Registers: A Program Counter (PC) and Instruction Register (IR) to manage the fetch-decode-execute cycle.
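To make the fetch-decode-execute cycle concrete, here is a minimal sketch of such a virtual machine. The register count, RAM size, PC, and IR come from the description above; the opcodes (HALT, ADDI, ADD), the one-byte-per-field instruction encoding, the little-endian word layout, and the encode helper are assumptions made for this example and are not the project's actual instruction set.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical opcodes, chosen only for this sketch.
enum Opcode : std::uint8_t { HALT = 0, ADDI = 1, ADD = 2 };

// Hypothetical encoding: [opcode | dest | src | src-or-immediate], one byte each.
constexpr std::uint32_t encode(Opcode op, std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    return (std::uint32_t(op) << 24) | (std::uint32_t(a) << 16) |
           (std::uint32_t(b) << 8) | std::uint32_t(c);
}

struct VM {
    std::array<std::uint32_t, 32> regs{};  // 32 general-purpose registers
    std::vector<std::uint8_t> ram;         // 1 MB of RAM as a flat byte array
    std::uint32_t pc = 0;                  // Program Counter
    std::uint32_t ir = 0;                  // Instruction Register

    VM() : ram(1u << 20, 0) {}

    // Store a 32-bit word at addr (little-endian is an assumption).
    void storeWord(std::uint32_t addr, std::uint32_t value) {
        assert(addr + 4 <= ram.size());
        for (int i = 0; i < 4; ++i) ram[addr + i] = (value >> (8 * i)) & 0xFF;
    }

    std::uint32_t loadWord(std::uint32_t addr) const {
        assert(addr + 4 <= ram.size());
        std::uint32_t value = 0;
        for (int i = 0; i < 4; ++i) value |= std::uint32_t(ram[addr + i]) << (8 * i);
        return value;
    }

    void run() {
        while (pc + 4 <= ram.size()) {
            ir = loadWord(pc);  // fetch
            pc += 4;
            auto op = static_cast<std::uint8_t>(ir >> 24);  // decode
            auto a = static_cast<std::uint8_t>((ir >> 16) & 0xFF);
            auto b = static_cast<std::uint8_t>((ir >> 8) & 0xFF);
            auto c = static_cast<std::uint8_t>(ir & 0xFF);
            switch (op) {  // execute
                case ADDI: regs[a % 32] = regs[b % 32] + c; break;
                case ADD:  regs[a % 32] = regs[b % 32] + regs[c % 32]; break;
                default:   return;  // HALT or unknown opcode stops the machine
            }
        }
    }
};
```

Loading a few encoded words at address 0 with storeWord and calling run() is enough to step through a small program. A fixed-width encoding keeps the decode step to a handful of shifts and masks.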
Assembly Language Design: I designed a RISC-style instruction set inspired by Nios II. Some features of the assembly language:
Load/Store Architecture: Data must be explicitly moved between memory and registers using load and store instructions, simplifying the processor's design.
Clear Labeling: I introduced a unique syntax using a > prefix for branch labels (e.g., >loop) to make the generated assembly code more readable and easier to debug.
Figure: Example of a subroutine in the language.
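As a separate illustration of the > label convention, the sketch below shows one way an assembler could map each label to an address in a first pass. The function name collectLabels and the assumption that every non-label line is a single 4-byte instruction are hypothetical; the write-up does not describe how labels are actually resolved.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// First pass of a hypothetical two-pass assembler: record the address of each
// ">label" line. Assumes every non-empty, non-label line is one 4-byte instruction.
std::unordered_map<std::string, std::uint32_t>
collectLabels(const std::vector<std::string>& lines) {
    std::unordered_map<std::string, std::uint32_t> labels;
    std::uint32_t address = 0;
    for (const std::string& line : lines) {
        if (line.empty()) continue;
        if (line[0] == '>') {
            std::string name = line.substr(1);
            assert(labels.count(name) == 0 && "duplicate label");
            labels[name] = address;  // the label names the next instruction
        } else {
            address += 4;            // an instruction occupies one word
        }
    }
    return labels;
}
```

A second pass could then replace each label reference in a branch instruction with the recorded address.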
The compiler's job is to translate the high-level emoji code into my custom assembly. I broke this complex problem down into three distinct phases.
Phase 1: Tokenizing
I wrote a tokenizer that scans the source code character by character (or, in this case, emoji by emoji) and groups the symbols into meaningful tokens (e.g., if, (, variable, ==, number, )).
This involved implementing state-machine logic to correctly parse multi-character tokens, like distinguishing the number 40 from the digits 4 and 0.
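The grouping logic can be sketched as a small scanner. To keep the example short, it uses ASCII characters as stand-ins for emojis (the real tokenizer would first have to decode multi-byte emoji sequences); the TokenKind and Token types and the tokenize function are names invented for this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Number, Word, Symbol };

struct Token {
    TokenKind kind;
    std::string text;
};

// ASCII stand-in for the emoji tokenizer: groups characters into tokens.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(src[k]); };
    while (i < src.size()) {
        if (std::isspace(at(i))) { ++i; continue; }
        const std::size_t start = i;
        if (std::isdigit(at(i))) {
            // Number state: keep consuming digits so "40" is one token, not "4" and "0".
            while (i < src.size() && std::isdigit(at(i))) ++i;
            tokens.push_back({TokenKind::Number, src.substr(start, i - start)});
        } else if (std::isalpha(at(i))) {
            // Word state: keywords and variable names.
            while (i < src.size() && std::isalnum(at(i))) ++i;
            tokens.push_back({TokenKind::Word, src.substr(start, i - start)});
        } else if (src[i] == '=' && i + 1 < src.size() && src[i + 1] == '=') {
            // Two-character operator: look ahead one character.
            i += 2;
            tokens.push_back({TokenKind::Symbol, "=="});
        } else {
            // Any other character is a single-character symbol.
            ++i;
            tokens.push_back({TokenKind::Symbol, src.substr(start, 1)});
        }
        assert(i > start);  // every branch must consume at least one character
    }
    return tokens;
}
```

For the input if (x == 40), this produces six tokens: if, (, x, ==, 40, and ).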
Phase 2: Syntax Organization (Shunting Yard Algorithm)
To handle operator precedence (e.g., multiplication before addition), I implemented the Shunting Yard Algorithm to convert the infix expressions from the source code into postfix notation (Reverse Polish Notation).
For example, 4 + 5 * 7 becomes 4 5 7 * +. Postfix needs no parentheses or precedence rules, so a stack-based machine can evaluate it unambiguously in a single left-to-right pass.
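The conversion can be sketched as follows for the four basic arithmetic operators, all treated as left-associative. The function names precedence and toPostfix and the use of plain strings as tokens are assumptions for this example; the project's own implementation works on emoji tokens and may handle more operators.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Precedence of a binary operator; 0 means "not an operator".
int precedence(const std::string& op) {
    if (op == "*" || op == "/") return 2;
    if (op == "+" || op == "-") return 1;
    return 0;
}

// Shunting-yard conversion from infix tokens to postfix tokens.
std::vector<std::string> toPostfix(const std::vector<std::string>& infix) {
    std::vector<std::string> output;
    std::vector<std::string> ops;  // operator stack
    auto popToOutput = [&] {
        output.push_back(ops.back());
        ops.pop_back();
    };
    for (const std::string& tok : infix) {
        if (tok == "(") {
            ops.push_back(tok);
        } else if (tok == ")") {
            // Pop operators until the matching "(" is found, then discard it.
            while (!ops.empty() && ops.back() != "(") popToOutput();
            assert(!ops.empty() && "unbalanced ')'");
            ops.pop_back();
        } else if (precedence(tok) > 0) {
            // Left-associative: pop operators of equal or higher precedence first.
            while (!ops.empty() && precedence(ops.back()) >= precedence(tok)) popToOutput();
            ops.push_back(tok);
        } else {
            output.push_back(tok);  // operand: number or variable
        }
    }
    while (!ops.empty()) {
        assert(ops.back() != "(" && "unbalanced '('");
        popToOutput();
    }
    return output;
}
```

Because "(" has precedence 0 in this sketch, it is never popped by an incoming operator and stays on the stack until its closing parenthesis arrives.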
Phase 3: Code Generation
This is the core of the compiler. I developed code generation routines that traverse the organized tokens and emit the corresponding assembly instructions.
Variables: I implemented a symbol table to manage variable memory addresses. Assignments generate store instructions.
Arithmetic: Postfix expressions are evaluated by generating instructions to push values onto a stack and call arithmetic subroutines (e.g., stackAdd).
Control Flow (if, for, while): This was the most complex part. I implemented a recursive system to manage nested scopes. For an if statement, the condition is evaluated, and a conditional branch is emitted. The compiler then processes the inner block of code, finally resolving the branch target address when it encounters the closing }.
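The arithmetic and control-flow steps above can be sketched together. This is an illustration under stated assumptions, not the project's generator: stackAdd is named in the write-up, but the push, call, and brz mnemonics, the stackSub, stackMul, and stackDiv subroutines, the endifN label scheme, and the CodeGen class are all hypothetical. The sketch also uses an explicit stack of pending labels in place of the recursive system described above; both approaches close the innermost open block first.

```cpp
#include <cassert>
#include <string>
#include <vector>

class CodeGen {
public:
    // Emit stack-machine code for a postfix expression.
    void emitExpression(const std::vector<std::string>& postfix) {
        for (const std::string& tok : postfix) {
            const std::string routine = subroutineFor(tok);
            if (routine.empty()) {
                lines_.push_back("push " + tok);      // operand (hypothetical mnemonic)
            } else {
                lines_.push_back("call " + routine);  // operator subroutine
            }
        }
    }

    // Start of "if (condition) {": evaluate the condition, then branch past
    // the block when it is false. The label is emitted later, at the "}".
    void beginIf(const std::vector<std::string>& conditionPostfix) {
        emitExpression(conditionPostfix);
        const std::string label = "endif" + std::to_string(nextLabel_++);
        lines_.push_back("brz " + label);  // hypothetical: branch if the result is zero
        pending_.push_back(label);
    }

    // Closing "}": resolve the innermost pending branch target.
    void endBlock() {
        assert(!pending_.empty() && "unmatched '}'");
        lines_.push_back(">" + pending_.back());
        pending_.pop_back();
    }

    const std::vector<std::string>& lines() const { return lines_; }

private:
    static std::string subroutineFor(const std::string& op) {
        if (op == "+") return "stackAdd";
        if (op == "-") return "stackSub";  // hypothetical name
        if (op == "*") return "stackMul";  // hypothetical name
        if (op == "/") return "stackDiv";  // hypothetical name
        return "";                         // not an operator
    }

    std::vector<std::string> lines_;    // generated assembly, one line each
    std::vector<std::string> pending_;  // labels of blocks that are still open
    int nextLabel_ = 0;
};
```

Calling beginIf twice and then endBlock twice emits the inner label before the outer one, which is what lets nested blocks close in the right order.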
The project succeeded: the result is a fully functional, end-to-end programming language.
Final Product: The compiler and virtual machine were integrated into a single C++ program. The final executable can read a source file written in the emoji language, compile it to custom assembly, and execute it, while also outputting the generated assembly code for inspection.
Language Capabilities: The language supports essential programming constructs: variable assignment, arithmetic operations, and complex control flow (if, else, while, for), including nested structures.
Technical Achievements:
Deep Systems Understanding: This project gave me a hands-on view of how programming languages work at the lowest level, from parsing to code generation to execution.
Problem Decomposition: I successfully managed the complexity of building a compiler by breaking it into well-defined, manageable stages (tokenize, parse, generate).
Architectural Design: Designing a coherent instruction set and a virtual machine to execute it required careful planning and a clear understanding of computer architecture.
Conclusion: Building a complete language stack from scratch was one of my most challenging and rewarding projects. It demystified the "magic" of compilers and gave me a profound appreciation for the design of modern languages. The skills I gained in parsing, code generation, and virtual machine design are directly applicable to fields like interpreter engineering, static analysis, and even embedded systems development.