Babel Tower of Programming Languages

Mikhail Vladimirov
Sep 8, 2023

The Fundamental Theorem of Software Engineering states:

We can solve any problem by introducing an extra level of indirection
…except for the problem of too many levels of indirection.

Given that a new level of indirection can be seen as a new DSL (Domain-Specific Language), since it introduces a new way of expressing the desired logic of a program, I would generalize this theorem as:

We can solve any problem by introducing a new programming language
…except for the problem of too many programming languages.

And the problem of too many programming languages is already very real.

As of today, Wikipedia lists almost seven hundred programming languages, excluding dialects of BASIC, esoteric programming languages, and markup languages.

One could say that there is no problem, and that the more choices we have, the better, but… in order to be efficient, a general-purpose programming language has to have a comprehensive ecosystem of reusable solutions to generic problems, so that developers can focus on tasks specific to their project, rather than re-implementing common things, such as binary search, over and over again.

Back in 1991, Richard Stallman wrote in his famous “Why Software Should Be Free” essay:

Free software would require far fewer programmers to satisfy the demand, because of increased software productivity at all levels:
• Wider use of each program that is developed.
• The ability to adapt existing programs for customization instead of starting from scratch.
• Better education of programmers.
• The elimination of duplicate development effort.

He saw obstructed (non-free) software as a source of material harm, because:

• Fewer people use the program.
• None of the users can adapt or fix the program.
• Other developers cannot learn from the program, or base new work on it.

Fast forward to 2023: free software is everywhere, and even Microsoft has finally embraced open source. Virtually any task that somebody has already solved has an open-source solution published on GitHub, and this solution is supposed to be reusable by you, either as is or with slight modifications… as long as it is written in the same programming language as your project!

And here is the Catch-22.

In 1991, software developers were isolated from each other by large corporations, each with its own proprietary software ecosystem. Nowadays, developers are still isolated, but now by the non-interoperable ecosystems of different programming languages, and the overall picture is about the same: developers still have to code the same things over and over again, no longer under different copyrights, but in different languages.

However, this is not the biggest problem.

The most valuable asset of a mainstream programming language is not its clear syntax, state-of-the-art compiler, or brilliant development team. It is its ecosystem of tools and libraries, and the community of enthusiasts donating their precious time for free to keep this ecosystem up to date.

Such enthusiasts are a scarce resource, and this scarcity effectively limits the number of up-to-date ecosystems we can have at any given moment: four, five, maybe a bit more, but definitely not dozens; and this limit is already exhausted. So now, in order to create a new ecosystem for an absolutely brilliant, but not yet popular, programming language, we would have to sacrifice the existing ecosystem of some probably not-so-good, but extremely popular language, wasting the countless efforts invested in it. But without an ecosystem, a brilliant language has little chance of ever becoming popular. This applies not only to completely new programming languages, but also to major upgrades of existing languages that break backward compatibility.

The more mainstream languages we have, the lower the chances that a new, better mainstream language will emerge, or that an existing mainstream language will undertake a major upgrade.

Virtually every non-trivial software project uses more than one language. Even if you believe your project is 100% JavaScript or Rust, it still has build scripts, deployment configurations, resource files, etc., written in a variety of languages. This can be a problem, as you need to make your JavaScript- or Rust-oriented toolchain somehow consume these languages, make your IDE edit and highlight them, and so on.

Things get even weirder when you need to mix several languages in a single file. This is a much more common situation than one might think. Consider a Java file full of SQL statements, or simply one with Javadoc comments that, in turn, have HTML tags inside.

A normal programming language has a strict and well-defined syntax structure, often described by a so-called grammar, and every code file written in the language has to fully comply with this structure. This makes embedding one language into another tricky and often cumbersome.

Developers wrap code written in other languages into comments, string literals, or other similar syntax entities of the host language that can contain arbitrary text. However, this approach has a number of important drawbacks:

• even inside such syntax entities, certain special characters still have to be escaped, which means that the embedded code has to be altered before insertion, at least in some cases;
• text editors don’t recognise such embedded code blocks and cannot highlight their syntax;
• toolchains don’t recognise them either, and thus cannot check the syntax and semantics of the embedded code at build time.
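
For instance, here is the string-literal approach sketched in Python; any mainstream host language behaves the same way:

```python
# SQL embedded in a string literal: the quotes inside the query collide
# with the host language's own quoting, so the snippet must be altered
# (here, by escaping the inner quotes) before it can be pasted in.
query = 'SELECT name FROM users WHERE status = \'active\''

# To the Python toolchain this is just an opaque string: the editor
# offers no SQL highlighting, and a typo such as "SELEKT" would surface
# only at run time, when the string finally reaches a database.
print(query)
```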

Existing programming languages don’t offer syntax for embedding code written in arbitrary other languages; nevertheless, developers do this all the time, effectively abusing syntax entities of the host languages.

Complex software projects are rarely built directly on top of a core programming language, but rather on top of higher-level libraries, frameworks, and other components that effectively extend the core language, turning it into a sort of DSL (domain-specific language) more suitable for the particular subject area of the project. This allows solving tasks that are common for this area with more concise and much simpler code.

However, such a DSL still has to follow the syntax of the core language, as libraries cannot modify the language grammar. This can be quite limiting. For example, a math framework that allows writing formulas in LaTeX syntax, and thus copying formulas back and forth between code and scientific papers, would be impossible in mainstream languages, unless the formulas were embedded as string literals; but even then, it would still be necessary to escape special characters, such as backslashes.
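
A minimal sketch of this limitation in Python: even with the formula relegated to a string literal, every backslash has to be doubled, so the formula can no longer be copied verbatim from a paper:

```python
# The LaTeX source we would like to copy verbatim from a paper:
#   \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}

# In an ordinary string literal, every backslash must be doubled:
formula = "\\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}"

# Raw strings avoid the doubling, but still cannot end with a
# backslash, and still give no LaTeX highlighting or build-time checks:
formula_raw = r"\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}"

assert formula == formula_raw
```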

While all mainstream languages offer powerful DSL-building mechanisms, such as user-defined functions, classes, templates, macros, etc., the language grammar is immutable; thus any DSL built on top of a language is syntactically the same language, just with extended functionality.
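
To make this concrete, here is a toy symbolic-math DSL sketched in Python via operator overloading (the Sym class is made up for illustration). However math-like it reads, every formula is still an ordinary Python expression and must obey Python’s grammar; nothing resembling \frac{a}{b} can ever be written:

```python
class Sym:
    """A toy symbolic-expression node for a Python-embedded math DSL."""
    def __init__(self, text: str):
        self.text = text
    def __add__(self, other: "Sym") -> "Sym":
        return Sym(f"({self.text} + {other.text})")
    def __truediv__(self, other: "Sym") -> "Sym":
        return Sym(f"({self.text} / {other.text})")
    def __repr__(self) -> str:
        return self.text

a, b = Sym("a"), Sym("b")

# The DSL reads almost like math, but it is still Python syntax:
print((a + b) / b)   # prints: ((a + b) / b)
```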

While the problems described above are common to all mainstream programming languages, there are a few examples of how these problems could be addressed.

The first example is the famous Babel package for JavaScript. It allows developers to write code in modern JavaScript versions and then compile it into older JavaScript to preserve compatibility with popular browsers that still use outdated JavaScript implementations.

While Babel originally emerged as a solution to a very particular problem, the approach itself is very powerful. The set of source languages for Babel, i.e. the languages Babel is able to compile, doesn’t have to be limited to the official JavaScript versions. It could in theory include JavaScript spin-offs, such as TypeScript, or even languages completely independent from JavaScript, such as Python or even C++. This would allow mixing different languages in a single project and compiling all of them into the same target language. Also, in theory, Babel could be made extensible, so that for a particular project, developers could extend the syntax of a language compiled by Babel, which would allow building DSLs that differ syntactically from the base language.

Another notable example is TeX. Being basically a markup language, it has built-in programming capabilities, so it can be considered a programming language. The uniqueness of TeX is that it has survived for decades with virtually no changes to the core language, and it not only survives, but still holds the leading position in its niche.

During these decades, dramatic events happened in computer typesetting: ASCII was replaced with an insane zoo of text encodings, which was finally replaced with the single UTF-8 encoding. The output format changed from TeX’s own DVI to PostScript, and then to PDF. TrueType fonts became the de facto standard, and various raster and vector graphic formats became popular and then sank into oblivion. Colors came onto the scene in various models, such as RGB and CMYK.

While other typesetting systems were updated again and again, trying to chase the ever-changing landscape of computer typesetting, TeX handled these things almost purely via packages, with virtually no core changes.

This enormous flexibility of TeX is mostly based on two commands: \catcode and \afterassignment. The former allows redefining any special character recognized by TeX, and the latter allows a user-defined macro to read code that follows its invocation.

Together, these two commands make it possible to define a macro that completely reprograms the TeX parser into something different, and can later turn it back to normal. Why would one ever need to make changes to the core interpreter, if a package can arbitrarily change the interpreter’s behavior, as well as the syntax of the language being interpreted?

While in the case of TeX, reprogramming the interpreter is done via unreadable dirty hacks, the same idea could be implemented in any interpreted language. Let’s consider a simplified model of a programming language interpreter that uses two input streams: CMDIN, from which the interpreter reads the program to be interpreted; and STDIN, from which the interpreter reads the program’s own input.

The interpreter reads program instructions from CMDIN one by one and executes them. When it encounters the READ instruction, it reads some data from STDIN and makes this data available to the program being interpreted, so the program can assign it to a variable or use it in some other way. This is basically how normal interpreters work.
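
Here is a minimal sketch of such an interpreter in Python. The toy instruction set (READ, PRINT) and the line-per-instruction format are made up for illustration; only the two-stream structure matters:

```python
def run(cmdin, stdin):
    """Interpret a toy line-per-instruction language.

    cmdin -- the program to be interpreted, one instruction per line
    stdin -- the program's own input data, one value per line
    """
    stdin = iter(stdin)
    variables = {}
    for line in cmdin:
        op, _, arg = line.strip().partition(" ")
        if op == "READ":      # READ x: read a line from STDIN into x
            variables[arg] = next(stdin)
        elif op == "PRINT":   # PRINT x: write variable x to the output
            print(variables[arg])

run(cmdin=["READ name", "PRINT name"], stdin=["Alice"])   # prints: Alice
```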

Now, let’s consider adding another instruction: CMDREAD, which is very similar to READ, but reads from CMDIN rather than from STDIN. We also add the CMDUNREAD instruction, which “unreads” data into CMDIN, so that subsequent reads from CMDIN will return this data.

These two instructions would allow implementing powerful macros that not only expand into arbitrary generated code, but can also consume and rewrite the code that follows their invocations, effectively reprogramming the interpreter.
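
Extending the sketch above (still a hypothetical toy language, not any real interpreter), CMDREAD and CMDUNREAD can be modeled by turning CMDIN into a double-ended queue that instructions themselves can consume from and prepend to; a SET instruction is added purely so the demo can build a line of code as text:

```python
from collections import deque

def run(cmdin, stdin):
    """Interpret the toy language, now with CMDREAD / CMDUNREAD."""
    cmdin = deque(cmdin)             # the program is now a mutable stream
    stdin = iter(stdin)
    variables = {}
    while cmdin:
        op, _, arg = cmdin.popleft().strip().partition(" ")
        if op == "READ":             # READ x: next STDIN line into x
            variables[arg] = next(stdin)
        elif op == "CMDREAD":        # CMDREAD x: next *program* line into x
            variables[arg] = cmdin.popleft()
        elif op == "CMDUNREAD":      # CMDUNREAD x: push x back onto CMDIN
            cmdin.appendleft(variables[arg])
        elif op == "SET":            # SET x text: assign literal text to x
            name, _, text = arg.partition(" ")
            variables[name] = text
        elif op == "PRINT":          # PRINT x: write variable x to output
            print(variables[arg])

program = [
    "CMDREAD raw",                   # consume the next program line as data...
    "this line is not valid code at all",
    "SET injected PRINT raw",        # ...build a new instruction as text...
    "CMDUNREAD injected",            # ...and prepend it to the program stream
]
run(program, stdin=[])               # prints: this line is not valid code at all
```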

In particular, it would be possible to include snippets written in other languages without any modifications, and even without escaping special characters. It would also allow extending or modifying the core language without making any changes to the interpreter or the language standard.
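
Under the same toy model, reusing run() from the sketch above, a line of raw LaTeX travels through CMDIN untouched: CMDREAD consumes it as plain data before the parser ever sees it, so nothing inside it needs escaping:

```python
program = [
    "CMDREAD formula",
    r"\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}",   # verbatim LaTeX, zero escaping
    "PRINT formula",
]
run(program, stdin=[])   # prints the formula exactly as written

# (The r-prefix above is only needed because this demo embeds the toy
# program in Python source; a program read from a file would need no
# escaping at all.)
```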

This was a high-level overview of certain problems. Subsequent articles will suggest how these problems could be addressed. Stay tuned.
