Why I Am Creating a Programming Language

Matthew Roever
Published in Level Up Coding
10 min read · Feb 1, 2020

The need for a fast, safe, and easily understood language.

Syntax highlighted code
Photo by Chris Ried on Unsplash

From a naive perspective, a computer is nothing more than a very fast version of the old pocket calculators at the hardware level. However, computers are actually extremely complex. The x64 architecture, which will be my primary focus initially, contains hundreds of instructions. Intel’s documentation for it (the Intel 64 and IA-32 Architectures Software Developer’s Manual) is currently a mere 5,038 pages long.

Now for the real question: Why am I crazy enough to want to make my own programming language?

I believe that a programming language should focus on the following criteria:

· Safety: The language must be inherently safe. People make mistakes. Programmers make mistakes. As long as unsafe code compiles there will always be bugs and security flaws. I recognize that unsafe code is necessary for some tasks, but most code can be safe. In the few cases where code must be unsafe it can be marked and released from the safety constraints.

· Speed: Time matters. I personally feel that Python is used in too many places. It might be easy to train new programmers and quickly prototype with Python, but with a top speed of only 1 mph it’s not going to win the race.

· Simplicity: Code must express intent. The verbosity of C/C++ and Java doesn’t add extra detail; it just adds more typing and reading.

· Standard Library Support: This is the one category I must praise Python for. No language has better standard libraries or more easily accessible support packages. Whatever you want to do, there is a Python library to do it.

· General purpose: There should never be a task the language cannot handle.

The reason I feel these five categories are the most important stems from my experience with C++ and my nontraditional background. I am a civil engineer by trade. I started programming because I was interested in computer graphics and in the methods used by engineering software that cannot be performed by hand.

I have decided to make my own programming language because of my personal interest in the topic and because no existing language meets all of my needs. Rust goes a long way to fix memory safety, but using it for computer graphics would be futile. Most of your code would be placed in unsafe blocks, defeating the reason to use Rust in the first place. Below, I will elaborate on the concepts I feel are necessary for a viable language. My goal is not to simply build a toy language that can only perform addition. This is a long-term project to create a language that is a safe alternative to C/C++ while providing the ease of Python.

Easy to Learn

Python is one of the most frequently used and taught languages because it is beginner friendly and has ample educational resources. An experienced C/C++ developer can easily learn Rust, but for novice programmers the difficulty curve is probably too great. The greatest limitation for systems languages is the belief that they are too complex. Without a clear grasp of how processors work, the rules of systems languages can seem strict and obscure.

When designing a language, the syntax will initially define its difficulty to learn. Exotic usage of operators may support concise construction, but less readable code is harder to learn. Programming languages with syntax that is close to English and mathematics are the easiest to learn.

The static type systems of C-like languages are great for compilers, but they can overwhelm new programmers. Signed and unsigned integers, floating-point numbers, and byte lengths are a hurdle without enough experience to know which is needed. Dynamic type systems like the one used in Python make for cleaner code, but type errors that could be caught by a compiler are pushed to runtime. Dynamic typing also requires perpetual use of duck typing, either manually or implicitly by an interpreter. To get the best of both worlds, I plan to use an inferred type system. An inferred type system is a static type system where every unspecified type is defined as auto. The compiler then selects the best type after analyzing the code.
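To give a feel for what inference looks like in an existing language, here is a minimal Rust sketch. It is not the syntax of the planned language, and the function name is made up for illustration. Only the function signature carries type annotations; the compiler works out every local type.

```rust
/// Sums the squares of the even numbers in `values`.
/// (Hypothetical helper: only the signature is annotated.)
fn sum_of_even_squares(values: &[i64]) -> i64 {
    // `Vec<_>` asks the compiler to fill in the element type (i64 here).
    let evens: Vec<_> = values.iter().copied().filter(|v| *v % 2 == 0).collect();

    // `0` has no explicit type; it is inferred as i64 because `total`
    // is added to i64 values and returned as i64.
    let mut total = 0;
    for v in evens {
        total += v * v;
    }
    total
}
```

Calling `sum_of_even_squares(&[1, 2, 3, 4])` adds 2² and 4². The types are still fixed at compile time, so a mismatch such as passing a slice of strings is rejected before the program runs.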

C++ is a verbose language which leads to extensive boilerplate code (a major headache for me). By designing a concise language and eliminating the viral inclusion tendencies of C, most boilerplate code can be eliminated. Code that is present simply because a language said it must be there to prevent some undesirable behavior is annoying. Clarity is necessary for a language to be adopted in the mainstream.

Memory Safety

Memory safety violations are the leading cause of security vulnerabilities. A report from Microsoft determined that roughly 70% of the CVEs it assigns each year are caused by memory safety violations. The most common causes included heap corruption, heap out-of-bounds access, use after free, uninitialized use, and type confusion. Other than uninitialized use, which can be caught by static analysis tools in most cases, the leading causes of memory unsafety stem from the misuse of pointers.

To eliminate sources of memory unsafety, the capabilities of pointers must change. First, pointers and arrays must be decoupled. Arrays will store sets of data, and pointers will be addresses in memory where an object is stored, but the two will not be combined.

Arrays will store their size and the compiler will ensure out-of-bounds access never occurs. Arrays will be designed based on the variable-length arrays and flexible array members from C. This will enable the creation of compact data structures such as std::vector; use of which will be encouraged over raw arrays.
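Rust slices already behave roughly this way: a slice carries its length, and access can be checked against it. The sketch below is an analogy in an existing language, not the planned design, and both function names are hypothetical.

```rust
/// Returns the element at `index`, or `None` when the index is out of bounds.
fn checked_read(data: &[i32], index: usize) -> Option<i32> {
    // `get` compares `index` against the length stored with the slice.
    data.get(index).copied()
}

/// Averages `len` elements starting at `start`, or returns `None`
/// if the window is empty or would run past the end of `data`.
fn window_average(data: &[f64], start: usize, len: usize) -> Option<f64> {
    let end = start.checked_add(len)?; // guard against integer overflow
    let window = data.get(start..end)?; // None if the range leaves the slice
    if window.is_empty() {
        return None;
    }
    Some(window.iter().sum::<f64>() / window.len() as f64)
}
```

Neither function can read past the end of its input, because every access goes through the stored length. A compiler that can prove an index is in range is free to drop the check.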

Pointers will only be a memory address where data is stored. There will be no option to index a pointer the way C allows. Pointers will also track the number of times the object they point to is referenced. Once the reference counter hits zero, the object will be automatically freed. This prevents a pointer from ever entering an invalid state where references still exist after the memory has been freed, which eliminates use after free, double frees, and most memory leaks (two objects that point at each other in a cycle could still leak).
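Reference counting and the cycle caveat can be seen in Rust’s `Rc` and `Weak` types. This is a sketch of the general technique rather than the planned pointer design; the `Node` type and the helper functions are invented for the example. The back-link from child to parent is weak, so the two nodes cannot keep each other alive.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

/// Hypothetical tree node used only for this illustration.
struct Node {
    name: String,
    // Strong link: keeps the child alive while the parent exists.
    child: RefCell<Option<Rc<Node>>>,
    // Weak link: does not add to the parent's reference count.
    parent: RefCell<Weak<Node>>,
}

fn new_node(name: &str) -> Rc<Node> {
    Rc::new(Node {
        name: name.to_string(),
        child: RefCell::new(None),
        parent: RefCell::new(Weak::new()),
    })
}

/// Links the nodes and returns their strong counts as (parent, child).
fn link(parent: &Rc<Node>, child: &Rc<Node>) -> (usize, usize) {
    *parent.child.borrow_mut() = Some(Rc::clone(child));
    *child.parent.borrow_mut() = Rc::downgrade(parent);
    (Rc::strong_count(parent), Rc::strong_count(child))
}

/// Returns the parent's name, or `None` if the parent has been freed.
fn parent_name(node: &Rc<Node>) -> Option<String> {
    node.parent.borrow().upgrade().map(|p| p.name.clone())
}
```

After `link`, the child has two strong owners (the caller and the parent) while the parent still has one. Dropping the caller’s handle to the parent frees it immediately, and `parent_name` on the child then returns `None` instead of touching freed memory. Had the back-link been a strong `Rc`, the pair would form exactly the kind of cycle that still leaks.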

References are a similar concept to pointers and are often pointers under the hood. Tracking the lifetime of references will be performed using static analysis tools to prevent their usage after the object the reference refers to has been destroyed.

Multithreading

Multithreading is an important part of any modern programming language. CPUs are becoming increasingly parallel. As the number of cores increases it is necessary for our programming languages to prioritize concurrency. When performing concurrent algorithms, memory must be accessed in a safe manner. Data races are not acceptable. Explicit synchronization of data with mutexes, like in C++, is not enough. It places the burden on the programmer without support from the compiler.

Threads must be easy to launch, and they must be supported by the runtime. A good example of this is Go. In Go, any function can be run concurrently by placing the keyword go before the function call. Channels in Go provide a powerful mechanism for sharing data between threads: data is handed off from one thread to another without the need for explicit synchronization. But this doesn’t handle all forms of data sharing required for effective concurrency. In addition to channels, atomic data types allow values to be shared between threads without costly (mutex) synchronization. Hardware access specifications guarantee access in a safe and consistent manner. Lastly, to avoid explicit mutexes, regional synchronization will enable the sharing of structured data. When a thread accesses a region, it must declare its intent: read-only access or exclusive write permissions. Declaring intent will implicitly perform any locking or blocking required before the data can be used.
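The three sharing mechanisms described above have rough counterparts in Rust’s standard library, sketched below with hypothetical function names. Note the assumption: `RwLock` is only the nearest existing analog to regional synchronization. It makes the read or write intent explicit at the call site, whereas the planned design would perform the locking implicitly.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, RwLock};
use std::thread;

/// Channel: each worker hands its result to the caller, no mutex involved.
fn squares_via_channel(inputs: Vec<u64>) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for n in inputs {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            tx.send(n * n).expect("receiver is still alive");
        }));
    }
    drop(tx); // the channel closes once every worker's sender is dropped
    let mut results: Vec<u64> = rx.iter().collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    results.sort_unstable(); // arrival order depends on thread scheduling
    results
}

/// Atomic: several threads bump one counter without a lock.
fn count_with_atomic(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    counter.load(Ordering::Relaxed)
}

/// Region with declared intent: `write()` is exclusive, `read()` is shared.
fn update_then_read(region: &RwLock<Vec<i32>>, value: i32) -> usize {
    {
        let mut data = region.write().expect("lock poisoned");
        data.push(value);
    } // the exclusive guard is released here
    let len = region.read().expect("lock poisoned").len();
    len
}
```

In all three cases the compiler rejects code that tries to touch the shared data without going through the channel, atomic, or lock, which is the kind of compiler support that plain mutexes in C++ do not provide.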

Object Oriented Programming

I will be using the standard object-oriented principles that underlie most programming languages. The only major deviation will be my handling of inheritance. I plan to use single inheritance, which limits a derived class to only one base class. In multiple inheritance languages such as C++, a class can inherit from multiple base classes. I personally feel that multiple inheritance makes programs more complex and harder to debug without providing substantial design flexibility.

Single and multiple class inheritance flowchart
Figure 1: Single and multiple class inheritance — by author

The ability to inherit from any class in C++ is a flaw. Inheritance should only be permitted for classes that were explicitly designed to be a base class. Rather than having virtual functions, classes should be declared as a virtual class to be used as a base.

Functional Programming

Functional programming is a declarative style of programming. The language defines what to do, rather than how to do it (imperative programming, think C). Functional languages usually treat data as immutable (constant) and functions as first-class objects. Data is constant to prevent side effects, which are modifications of any form to existing data. By ensuring all existing data is constant, any piece of data can be used in parallel without fear of conflict. A first-class function can be passed as an argument and returned as a result like any other value, and a higher-order function is one that takes or returns a function. First-class and higher-order functions come from lambda calculus. When combined with immutable data they provide a different way of looking at a problem. Functional programming is especially helpful in multithreaded applications and the creation of decision trees. I plan to implement hybrid imperative-declarative capabilities like those in Scala, though biased more towards imperative design.
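A small Rust sketch of these ideas, with function names chosen purely for illustration: one function takes a function as an argument, one returns a function, and one transforms data without modifying its input.

```rust
/// Higher-order function: takes a function `f` as an argument.
fn apply_twice<F: Fn(i32) -> i32>(f: F, x: i32) -> i32 {
    f(f(x))
}

/// Higher-order function: returns a new function that remembers `n`.
fn make_adder(n: i32) -> impl Fn(i32) -> i32 {
    move |x| x + n
}

/// Side-effect free: `values` is only read, and a new Vec is returned.
fn scaled(values: &[i32], factor: i32) -> Vec<i32> {
    values.iter().map(|v| v * factor).collect()
}
```

For example, `apply_twice(make_adder(3), 10)` adds 3 twice. Because `scaled` never writes to its input, any number of threads could call it on the same slice at once without a conflict.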

Contracts and Traits

A contract defines what something can be, while a trait defines what something is. Contracts and traits come in two forms: narrow and wide.

A contract is wide when there are no limitations placed on the range of permitted values, while it is narrow when restrictions are in place. For example, saying a parameter cannot be nullptr is a narrow contract.

A trait is wide when there are no limitations placed on the data type that is permitted, while it is narrow when the type of data that can be used is limited. For example, saying a type must be move constructible for use in a template is a narrow trait.

C++ and Rust have trait systems that allow the programmer to place restrictions on the types used for generic (template) classes. However, restrictions on parameters are relegated to the comments (which everyone reads… right?). A programmer currently has two options when placing a restriction on a parameter: 1) hope the person calling the function has read the documentation and performs the proper checks, or 2) check the input to make sure it is valid. If neither the function author nor the user performs checking, the program will contain a lingering bug that may be hard to track down. However, if the function author and user both perform checking to ensure the parameter is valid, the overhead of verification is paid twice. By placing contracts on the parameters of a function, the function author documents the input limitations and provides the compiler with the necessary details to automatically ensure the parameter is always valid. As a bonus, defining a narrow contract in the function declaration rather than in the comments will enable the compiler to make more optimizations, including the removal of redundant parameter validation checks.
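Today, the closest one can get to a parameter contract in an existing language is usually a wrapper type that validates once at construction. The Rust sketch below takes that approach; the `Percent` type and `apply_discount` function are hypothetical, and this is a workaround in a current language rather than the contract syntax the new language would provide.

```rust
/// A percentage guaranteed to lie in 0..=100 (hypothetical type).
/// In a real crate the inner field would be private to its module,
/// so `Percent::new` would be the only way to obtain a value.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Percent(u8);

impl Percent {
    /// The range check happens here, exactly once.
    fn new(value: u8) -> Option<Percent> {
        if value <= 100 {
            Some(Percent(value))
        } else {
            None
        }
    }
}

/// The narrow contract lives in the signature: an out-of-range discount
/// cannot be passed in, so this function does no validation of its own.
fn apply_discount(price_cents: u64, discount: Percent) -> u64 {
    // Widen to u128 so the multiplication cannot overflow.
    let cut = (u128::from(price_cents) * u128::from(discount.0) / 100) as u64;
    price_cents - cut // cut <= price_cents because discount <= 100
}
```

The caller validates once with `Percent::new`, and every function that accepts a `Percent` can rely on the range without checking again. That addresses the double-verification problem, though at the cost of a new type per contract, which a built-in contract system would avoid.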

Pattern Matching

Pattern matching is the process of taking an input value and comparing it to a predetermined value or range to decide what actions to take. This includes chains of if-else statements, switch statements from C and Java, and match from Rust and Scala. Determining what something is, then proceeding accordingly is a major part of all programs. Therefore, the language should make it easy. I prefer the match syntax of Rust and Scala the most because it is clean and to the point.
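For reference, this is roughly what the Rust form of match looks like. The `Shape` enum and both functions are invented for the example. The first function matches on exact values and ranges; the second matches on the structure of a value, with guards for special cases.

```rust
/// Matches a number against a single value, a range, and a catch-all.
fn classify(n: u32) -> &'static str {
    match n {
        0 => "zero",
        1..=9 => "single digit",
        _ => "large",
    }
}

/// Hypothetical shape type used only for this illustration.
enum Shape {
    Circle { radius: f64 },
    Rectangle { width: f64, height: f64 },
}

/// Matches on which variant a value is, then on its contents.
fn describe(shape: &Shape) -> &'static str {
    match shape {
        Shape::Circle { radius } if *radius == 0.0 => "point",
        Shape::Circle { .. } => "circle",
        Shape::Rectangle { width, height } if width == height => "square",
        Shape::Rectangle { .. } => "rectangle",
    }
}
```

The compiler checks that every possible input is covered, so removing the `_` arm or one of the `Shape` arms is a compile error rather than a silent fall-through.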

Write Once, Run Anywhere

The compiler will support direct compilation to a target platform (.dll, .dylib, .exe, etc.) but that is not its primary mode of operation. By default, the compiler will produce a near-assembly bytecode styled after the Java bytecode. The runtime library distribution will contain an assembler that converts the bytecode into the machine-specific instruction set of the host. This eliminates the need for developers to build multiple versions of their program for each instruction set and platform, and removes the burden on customers to select the proper version for their system. Additionally, since the program is compiled for a specific machine, additional optimizations can be made for the customer’s hardware. This can include intrinsic operation support verification or substitution, and security fixes applied on a processor-by-processor basis.

The local assembler will be able to compile and cache an entire program upfront for one-time compilation, or it may run in just-in-time mode similar to the .NET Framework. Note that all bytecode will be compiled to machine instructions; there will be no interpreter or garbage collector.

Standard Library Support

Standard library support can make or break a language. I am inclined to say if Python supports it, so will I. Since this covers about every topic you might write a program for, here is my list of the most important topics the standard library must support:

· Algorithms

· Computer graphics — based on the Vulkan API and possibly a CPU variant, the library will enable the creation of graphical user interfaces, games, and GPU general purpose compute

· Containers — vector, string, lists, queues, trees, hash tables, graphs, and many more

· Cryptography — algorithms for local and network communication encryption

· Machine Learning — one of the key drivers of Python usage, having a simple yet powerful ML library will help drive adoption

· Mathematics — general functions, matrices, signal processing, etc.

· Networking — foundational components for server-side development: from basic sockets to SSL clients

Objectives

My goal is to produce an open source, fully documented language. A compiler will initially be produced to work with Visual Studio Code. A custom closed source compiler in the vein of Visual Studio will also be produced to support development. The closed source compiler will incorporate the free version (which will support all language features and optimizations — no gimmicks, the free version is the full version) while also providing side-by-side compilation of other programming languages to aid transition. Code bases with millions of lines of C/C++ code cannot be easily converted; not that anyone would want to convert it. Side-by-side compilation will enable gradual transition and selective safety injection into unsafe code.

My initial focus will be on x64 Windows development. In the long term, support for Linux, macOS, Android, and iOS will also be added. Support for arm64 and for embedded instruction sets such as RISC-V and MIPS will also come later.

Every week I will be releasing an article covering how compilers work and are produced. I will cover everything from language design and the creation of compiler components (with code on GitHub) to assembly generation, standard library design, and various related topics.

This article is part of an ongoing series about compilers. The goal is to produce a new programming language. Next in series:

I am a Civil Engineer that codes. I write about compilers, computer graphics, and entrepreneurship.