# My toy language
My playing-around language. This is a practice run to establish an LLVM spike and figure out how to structure things before I take a serious run at the language I wish I could program in. Or maybe I'll approach it iteratively. We'll see.
For starters, I'm shooting for a language about as high-level as Python but trading some of its little-used dynamicity for speed and machine-code compilation. A good language for prototyping but which runs fast and deploys easily. A language easier to reason about than Python. A language at least as terse.
Everything is vague and loosely held at the moment. Consider this a sketch.
Everything is an expression (including `if` and, for now, assignment). This gives terseness and simplicity (no special ternary expression needed; you can factor an `a =` out of a match clause).

Error handling shouldn't demand the visual noise (`.unwrap()` and `?` all over the place in Rust) that makes programs hard to read and just lays a lot of profitless bookkeeping on the user. "Just add question marks until the compiler is happy!" In that case, have you really improved anything over just throwing exceptions? Examine OCaml 5's typed effects.

What I first care about when exposed to a new language (what determines the sort of software architecture you can do with it) are its structures: data structures, code structures (classes, functions, multimethods, modules). Design those first, as the rest is, relatively speaking, window dressing: syntax and loops and such will follow.
Focus especially on data structures. When reading a program (and reading is where the majority of programming time is spent), nothing is more illuminating than data structures. "A program well-represented is 3/4 solved" and all that. The more the in-program representation of types and such apes the appearance of state-of-the-art documentation, the more we've succeeded.
The overarching language-design strategy is to identify good things, figure out how to have them all, and, if that is impossible, figure out which things are in tension with each other and come up with the best tradeoff between them.
We could marry `match`'s comprehensiveness guarantees with multiple dispatch by defining a bunch of types, then saying "these types make up an enum" (thus creating the equivalent of variant types), then throwing a fit if somebody defines a method on one of the types but not the rest. (Multimethods could hopefully take the place of match statements.) Basically, promote the variant constructors to the top level. This also solves the problem in OCaml where TFunction needs to take a FunctionType but the tightest we can constrain it to is texpr.
```
type BoolType
type TipeVar {var:int}
type FunctionType {args:tipe[], return:tipe}

enum tipe is
    BoolType or TipeVar or FunctionType

fun show t:tipe
```
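The enum-comprehensiveness idea could be prototyped like this (Python standing in for the eventual language; `Enum`, `Multimethod`, and `on` are names invented for the sketch, not anything that exists yet):

```python
# Sketch of the comprehensiveness check: types are registered as members of
# an enum, and a multimethod throws a fit if some member lacks an
# implementation.
class Enum:
    def __init__(self, *members):
        self.members = set(members)

class Multimethod:
    def __init__(self, name, enum):
        self.name, self.enum, self.impls = name, enum, {}

    def on(self, typ):
        def register(f):
            self.impls[typ] = f
            return f
        return register

    def check_comprehensive(self):
        missing = self.enum.members - set(self.impls)
        if missing:
            names = sorted(t.__name__ for t in missing)
            raise TypeError(f"{self.name} not defined for: {names}")

class BoolType: pass
class TipeVar: pass
class FunctionType: pass

tipe = Enum(BoolType, TipeVar, FunctionType)
show = Multimethod("show", tipe)

@show.on(BoolType)
def _(t): return "bool"

@show.on(TipeVar)
def _(t): return "tipe var"

# show.check_comprehensive() would raise here: no `show` for FunctionType.
```

The check could run at module load (or, in a compiled setting, at compile time) rather than on first dispatch.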
Sketches. Don't take them too seriously.
```
to add a:int and b:int
    a + b
```

```
to count-words str:text
    num = 0
    for char in str do
        if char == ' ' then
            num = num + 1  # Might lower to `set num to (num + 1)`.
```

```
to count-words string:text
    sum((if char == ' ' 1 else 0) for char in string, 0)
```

```
fun num_words(string str)
    length(char for char in str if char == ' ')
```

```
fun sum(iterable Iterable<T>, default T)
    total = 0  -- All symbols are infix operators.
    for item in iterable
        total = total + item
    else
        default
```

```
fn documented_types(big_doodad:BigDoodad)
    sub_doodad:PigglyWiggly = big_doodad.owner  -- Compiler proves that documented types are right.
```
```
type tipe is
    double_type or
    int_type or
    string_type with length:int or
    string_ptr_type

type expr is
    double with value:float or
    int with value:int or
    call with function_name:text and args:expr[] or
    string with value:text

match expression to
    call with function_name and args in  -- Trying to make it look like a function definition, because it works similarly: there's a block inside with bound params. I also like that you can copy and paste the line from the type definition.
        generate_call function_name with args
    int with value in
        generate_int value
```
```
-- Scoping:

fun foo
    if goo
        g = 8
    g  -- Could be unbound. But compiler can complain. (Assuming function scope.)

fun foo
    if goo
        g = 8
    else
        g = 9
    g  -- Compiler can tell this is bound, assuming function scope.

fun foo
    print g  -- Compiler can complain about read before write.
    g = 9

fun foo
    result = match bar to
        thing with thong and thang
            nums = [1 2 3 4 5 6]
            each nums giving (to $0 * 2)  -- like map()
        bing with bong
            capitalize bong
            total = reduce bong from 0 adding ((accum, cur) => accum + (ord cur))
            [total]
    if result isa bing
        print total  -- Compile error: could be unbound. Compiler isn't smart
                     -- enough to prove it's bound. Acceptable? If compiler
                     -- remains that stupid, could satisfy it by adding a
                     -- total=0 initializer at the top of the function.
```
I feel like functions should be areas of "free play". A language should scale to arbitrarily large programs but not arbitrarily large functions (which is what block scoping helps allow). Programs can get unboundedly large, but functions shouldn't. Functions, not blocks, are the main units of programmatic decomposition. Blocks aren't first-class in any language I can think of; they're always parametrized, making them functions.

Anyway, the cost of going with function scoping is that you have to declare nonlocals you want to write (or have a special `=` operator for that). That's a rarity. (It's also something of an antipattern: it promotes hidden mutations that would better be expressed through return values. Maybe don't even support it.) The dividend is that you don't need to declare vars (or read the usually-just-noise declarations). Declarations exist solely to broaden the scope of a var, assuming a language in which vars are otherwise assumed to be local to the block they're mentioned in.

It's important to allow shadowing of nonlocals in a function; otherwise, you can't paste functions around without reading them carefully to make sure they don't overwrite nonlocals.

A read in an `if` is fine if the `if` has the same or a logically equivalent condition as one which previously wrote. For `if`s which write and are nested, the whole chain of nested `if` conditions (up to the point where the writes and the reads share a common encloser) will have to be satisfied... or something. Maybe I can just use unification and get some more mileage out of the unifier I'm going to need anyway for Hindley-Milner.
Maybe WYSIWYG, grouping-wise: `6+1 * 3+1` = 28, because the spacing shows the grouping: (6+1) * (3+1).

And `collect (item :: rest) some_list` could be abbreviated as `collect item::rest some_list`, which scans better.
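A toy sketch of that grouping rule (Python standing in; tight, space-free chunks are evaluated with ordinary precedence via `eval`, and the loose, space-separated operators are folded left to right; a real design would presumably still rank the loose operators against each other):

```python
# Spaces mark loose grouping: "6+1" binds tightly; the space-separated
# operators between chunks apply afterward, left to right.
def loose_eval(expr: str) -> int:
    parts = expr.split()
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    result = eval(parts[0])  # tight chunk, normal precedence inside it
    for op, chunk in zip(parts[1::2], parts[2::2]):
        result = ops[op](result, eval(chunk))
    return result

assert loose_eval("6+1 * 3+1") == 28  # (6+1) * (3+1)
```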
If you're going to have significant whitespace, have significant whitespace. :-)
Right now, "Hello, world" is hard-coded into the compiler in the form of AST expressions. There's no parser yet, because I don't know what I want the syntax to look like.

To install dependencies:

```
opam install ctypes-foreign llvm ounit2 ppx_deriving dune
```

Also install the LLVM headers.

To compile and run "Hello, world", run `make run`.

You might have to `opam upgrade` things. If you update just one thing, it'll say "Hey, you've got to recompile the rest because your system changed". Let it.
Be able to declare externals so we don't need any C.

Compile all the way to an executable without needing other shell commands. (Call LLVM's linker, `lld::elf::link`, from your code.)

Be able to do multiple statements.

Study what I've written to get an understanding of the LLVM API.

Add ifs (multiple basic blocks).
Vars. A var assigned to from a function and not declared nonlocal/global/whatever-else-I-add is scoped to that function. There is no block-level scoping. This avoids having to make or read var declarations, read over `let` keywords, or slide `:=` operators around as a function evolves. In short, we borrow Python's assignment heuristic. Upsides: you can move code from an outer function to an inner one without changing it, as long as it only reads variables. (You don't have to add `^` annotations like `print(foo^)` (if `^` meant "look in the enclosing scope").) Writes to a var from outside can be accomplished by declaring the var `nonlocal`. This is better than a special assignment operator because you can do it once and for all rather than changing every writing reference you've copied and pasted from the outer function.
It's worth contrasting var declarations with import statements. The latter are useful while reading code. They tell the reader where deeply imported local symbols are defined, in case they want to go read the code or docs. They shorten callsites if you deeply import. Var declarations don't provide as much information: just "it belongs to this block" and sometimes "it is of this type". We're dispensing with the strictures of the former for ease of expression, and the second we will accomplish some other way, perhaps `someVar:someType = nonObviousReturnValue()`.
Function scoping (or in fact any scoping in which the introduction of a new var does not introduce a new block) does make it more challenging to detect reads from undefined vars. (I don't want runtime errors for something so common and, because the cause is branches, possibly undetected until a rare branch fires.) The compiler will have to check every branch of every `if` in a function to make sure every one defines all non-pre-existing vars read afterward. Same for loops that could `break` before assigning to a var. If the compiler can't prove we write before read, it throws an error. You can satisfy it by initializing the var before the `if` or loop. Hopefully the compiler will get smarter and smarter, letting us remove more and more such legalistic initializations. But in any case it's a reduction from saying `let` in front of each first write of a var.
A var with the same name as one in an enclosing function or module, if assigned to from the inner function, shadows the outer one.
To implement: for each function, find the vars written to within it. Put those into a set. (This is faster than spidering all over the function anew for each ref.) As we codegen each var ref, look into those sets to figure out what stack frame to (if not already done) add the var to.
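A sketch of that pass, using Python's `ast` module on Python source as a stand-in for our own AST:

```python
import ast

# One pass over a function collecting every name it assigns to, so each
# later var ref can be resolved against the function's write-set.
def written_vars(func_src: str) -> set:
    func = ast.parse(func_src).body[0]  # the function def
    writes = set()
    # NB: ast.walk also descends into nested functions, which a real
    # implementation would treat as separate frames with their own sets.
    for node in ast.walk(func):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            writes.add(node.id)
    return writes

src = """
def foo():
    if goo:
        g = 8
    h = g + 1
"""
assert written_vars(src) == {"g", "h"}  # goo is only read, so it's nonlocal
```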
We'll put all the `alloca` instructions in the entry block of the function so we can depend on `mem2reg` to convert them to register accesses.
Sum types can be represented by alloca-ing enough space for the largest alternative and also making space (whether as a separate pseudovar or part of a struct that contains the enum and the var value) for the enum value. Make sure it works nicely with nested enumerations, like `foo:(int|Snoo); Snoo = double|string`.
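Back-of-the-envelope sketch of that layout (the payload sizes and tag width below are made-up assumptions, and alignment/padding is ignored):

```python
# Tagged-union size = tag + size of the largest alternative. A nested enum
# like Snoo is just an alternative whose own size includes its own tag.
SIZES = {"int": 8, "double": 8, "string": 16}  # assumed payload bytes
TAG = 4  # assumed tag width

def union_size(alternatives):
    def size_of(alt):
        return union_size(alt) if isinstance(alt, list) else SIZES[alt]
    return TAG + max(size_of(a) for a in alternatives)

snoo = ["double", "string"]                 # Snoo = double|string
assert union_size(snoo) == 4 + 16           # tag + string payload
assert union_size(["int", snoo]) == 4 + 20  # foo:(int|Snoo); Snoo dominates
```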
Complain on the possibility of undefined var reads. Maybe for this it could be good enough just to assert that every read in a CFG node has a write in a dominator or in the same node but before the read. (With Lengauer-Tarjan, we can find dominators in O(m log n).) Or assert that every read hits a write via every path back to the entry block. Definitely write this against the CFG, lest we get some control structure later that has a nonobvious, nonlexical relationship to control flow. Or would it be possible and simpler to have a big match clause where each syntactical construct makes sure a given var is found in each of its (potential) branches? We could cache for speed.
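A sketch of the dominator flavor of that check (plain iterative dominators rather than Lengauer-Tarjan, so slower but short; the CFG encoding is made up for the sketch):

```python
# Each CFG node carries an ordered list of ("read", var) / ("write", var).
# A read is OK if some strict dominator writes the var, or an earlier op in
# the same node does. Deliberately conservative: a var written in *both*
# arms of an if, but in no dominator, still gets flagged.
def dominators(succ, entry):
    nodes = set(succ)
    preds = {n: [] for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            preds[s].append(n)
    dom = {n: set(nodes) for n in nodes}  # start from "dominated by all"
    dom[entry] = {entry}
    changed = True
    while changed:  # iterate to the fixed point
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def undefined_reads(succ, entry, ops):
    dom = dominators(succ, entry)
    errors = []
    for n in succ:
        defined = {v for d in dom[n] - {n}
                     for kind, v in ops[d] if kind == "write"}
        for kind, v in ops[n]:
            if kind == "write":
                defined.add(v)
            elif v not in defined:
                errors.append((n, v))
    return errors

# Entry A branches to B/C; only B writes g; the join D reads it.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
ops = {"A": [], "B": [("write", "g")], "C": [], "D": [("read", "g")]}
assert undefined_reads(succ, "A", ops) == [("D", "g")]
```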
When making example CFGs for trying out analysis ideas, using `rand()` calls as the condition in `if` expressions comes in handy: it lets you pick best-case and worst-case paths at your discretion without painstakingly constructing code that does whichever you were looking for. Sometimes it lets you halve the amount of code just by choosing different branches each time around a loop.
Support other types than int. I don't see any other way to do this than to write an honest-to-goodness constraint solver for type inference. Perhaps it's possible that we wouldn't need one just for inference, but we'd sure need one for type checking. If somebody is inconsistent with types, we need that unification failure to notice the error. Here's a great summary of unification for type inference, even including some OCaml code: https://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/lec26-type-inference/type-inference.htm. Start with local inference for practice, then expand. A unification of a set of pairs is a single set of var bindings that unifies every pair.
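For reference, a minimal unification sketch along those lines (Python rather than OCaml; type vars are strings like `'a`, compound types are tuples like `("fun", arg, ret)`; the occurs check is omitted, which a real implementation needs):

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("'")

def resolve(t, subst):
    # Chase a type var through the substitution to its current binding.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst):
    t1, t2 = resolve(t1, subst), resolve(t2, subst)
    if t1 == t2:
        return subst
    if is_var(t1):
        return {**subst, t1: t2}
    if is_var(t2):
        return {**subst, t2: t1}
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and len(t1) == len(t2) and t1[0] == t2[0]):
        for a, b in zip(t1[1:], t2[1:]):  # unify components pairwise
            subst = unify(a, b, subst)
        return subst
    raise TypeError(f"cannot unify {t1} with {t2}")

# ('a -> int) against (bool -> 'b): one substitution unifies both pairs.
s = unify(("fun", "'a", "int"), ("fun", "bool", "'b"), {})
assert s == {"'a": "bool", "'b": "int"}
```

An inconsistent use of types, like unifying `int` with `bool`, shows up as the `TypeError`, which is exactly the unification failure we'd report as a type error.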
Lambdas. If we keep the `fun foo(a, b, c)` syntax, it can desugar to `foo = fun(a, b, c)`.

`initialize()`.

Closures, perhaps represented as `{env, func_ptr}` and generating `call func_ptr(env, )`.

Raise an error if a function returns a different type than declared. (This should be taken care of by the unifier.)
GC
Loops. All looping constructs can be lowered to an infinite loop plus a break statement. Infinite loops can be lowered to recursion with TCO. Not sure about the breaks.
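That chain of lowerings, sketched in Python (loop state threaded explicitly; a `break` is just returning instead of making the tail call, which is one answer to the "not sure about the breaks" question):

```python
# while cond: body  ==>  infinite loop + break  ==>  tail recursion.
def loop(state, cond, body):
    if not cond(state):      # the lowered `break`
        return state
    return loop(body(state), cond, body)  # tail call; TCO makes it a jump

# `i = 0; total = 0; while i < 10: total += i; i += 1`
final = loop({"i": 0, "total": 0},
             lambda s: s["i"] < 10,
             lambda s: {"i": s["i"] + 1, "total": s["total"] + s["i"]})
assert final["total"] == 45
```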
If everything is an expr, what should a loop's value be? Its last statement seems arbitrary and rarely useful. What if it were a list of the values of its last statements? That would often be useful, plus it serves the purpose of a list comp!

```
for x in range(8)
    x**2
```

We can prove whether anybody does anything with the value and avoid collecting it if not. Of course, this will require all terminal statements of a loop to have the same type. Is that going to be too annoying? Maybe we could require that only if you're doing anything with the value.
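The intended semantics, sketched in Python (collection would be skipped whenever analysis proves the value is unused):

```python
# A loop's value is the list of its body's final values, so a bare loop
# doubles as a list comprehension.
def eval_loop(iterable, body):
    return [body(x) for x in iterable]

assert eval_loop(range(8), lambda x: x**2) == [0, 1, 4, 9, 16, 25, 36, 49]
```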
Decide on dispatch. Will it be hard for a human to find where a function's code is? https://www.ocaml.org/releases/4.11/htmlman/lablexamples.html lays out one cohesive system of kwargs + inference and even suggests some naming conventions.
Do non-primitive types, like ML enums. We won't have type erasure on enums because we'll have to be able to distinguish among variants in `match` clauses.
We might be able to get away with just recursion for type inference, as long as we explicitly declare function args and return types (at least temporarily): https://mukulrathi.co.uk/create-your-own-programming-language/intro-to-type-checking/
Make protos unnecessary (except for externals, I guess).
Design the language.
Maybe we don't need a `match` variant, since we can do the same by making our functions piecewise.

Exceptions. Undesirable in a pure language because they can pop out any time unexpectedly. So require you to catch them, and in the direct-calling stack frame? That makes them basically inferred option types without exceptions' distinguishing characteristic: multi-frame stack unwinding.
If we have exceptions, the `catch` clauses should probably look like `match` clauses and be able to pattern-match and destructure thrown values. A thrown value can be any old object with a constructor: anything that can be matched against.
For applications that cannot tolerate exceptions, we can have a `no_except` signifier applied to a function that cues the compiler to throw an error if any exception that could be raised within the function is not caught.
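A sketch of that `no_except` check over a made-up per-function summary (each function lists what it raises, catches, and calls; assumes an acyclic call graph for brevity, so a real version would memoize and handle recursion):

```python
def escaping(fn, fns):
    # Exceptions that can propagate out of fn: its own raises plus whatever
    # escapes its callees, minus what fn catches.
    info = fns[fn]
    raised = set(info["raises"])
    for callee in info["calls"]:
        raised |= escaping(callee, fns)
    return raised - set(info["catches"])

def check_no_except(fn, fns):
    leaks = escaping(fn, fns)
    if leaks:
        raise TypeError(f"no_except function {fn} may raise {sorted(leaks)}")

fns = {
    "main": {"raises": [], "catches": ["IOError"], "calls": ["read"]},
    "read": {"raises": ["IOError", "EOFError"], "catches": [], "calls": []},
}
# check_no_except("main", fns) raises: EOFError escapes past main's catch.
```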
Or typed effects or something: https://github.com/ocamllabs/ocaml-effects-tutorial
Think about separate compilation. This will be in tension with global type inference.