Erythro was born of a small challenge between me and a friend - a programming competition had been brewing for several months, and with the 2020 virus lockdown we'd both been left with nothing else to do, so the challenge was formalised:
The first person to create a language specification and a working compiler for said specification, that achieves the following requirements:
- is x86_64 native
- is self-hosted (the compiler can compile itself)
- supports constant folding (any arithmetic in the program gets folded into a constant)
- produces a valid ELF64 executable by default, but optionally produces an executable of the format defined below.
- can cross-compile to AARCH64 (tested with a Raspberry Pi 3B)
- compiles a mini-kernel for the chosen architecture if the argument --compile-example-program is passed
- outputs GDB-compatible debugging symbols by default, but with the option to not include them
wins the competition.
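As a rough illustration of the constant folding requirement, here is a minimal sketch in C. The AST layout and the names `Node`, `MakeNode` and `Fold` are invented for this example and are not taken from either compiler; it only handles addition, subtraction and multiplication, and it ignores overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical AST shape, invented for this sketch. */
enum NodeKind { NODE_LITERAL, NODE_VARIABLE, NODE_ADD, NODE_SUB, NODE_MUL };

struct Node {
    enum NodeKind Kind;
    int64_t Value;              /* only meaningful for NODE_LITERAL */
    struct Node *Left, *Right;  /* only set for NODE_ADD, NODE_SUB and NODE_MUL */
};

struct Node *MakeNode(enum NodeKind Kind, int64_t Value, struct Node *Left, struct Node *Right) {
    struct Node *New = malloc(sizeof *New);
    assert(New != NULL);
    New->Kind = Kind;
    New->Value = Value;
    New->Left = Left;
    New->Right = Right;
    return New;
}

/* Folds bottom-up: a binary node whose children are both literals is
 * rewritten in place as a literal. Overflow handling is left out. */
struct Node *Fold(struct Node *Tree) {
    if (Tree->Kind == NODE_LITERAL || Tree->Kind == NODE_VARIABLE)
        return Tree;

    Tree->Left = Fold(Tree->Left);
    Tree->Right = Fold(Tree->Right);
    if (Tree->Left->Kind != NODE_LITERAL || Tree->Right->Kind != NODE_LITERAL)
        return Tree;  /* something is only known at run time; leave it alone */

    int64_t Left = Tree->Left->Value, Right = Tree->Right->Value;
    switch (Tree->Kind) {
        case NODE_ADD: Tree->Value = Left + Right; break;
        case NODE_SUB: Tree->Value = Left - Right; break;
        case NODE_MUL: Tree->Value = Left * Right; break;
        default: return Tree;
    }

    /* The children are no longer needed once the result is known. */
    free(Tree->Left);
    free(Tree->Right);
    Tree->Left = Tree->Right = NULL;
    Tree->Kind = NODE_LITERAL;
    return Tree;
}
```

With this sketch, folding the tree for 2 + 3 * 4 collapses it to the single literal 14, while x + 3 * 4 only has its right-hand side collapsed to 12.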
There is no prize, only the utility of having these tools ready for when we need them, because we both wanted to do this for our own separate projects - I wanted this for Chroma, destoer wanted this for his Nintendo 64 emulator.
This is a draft specification, but any changes will be incremental, in raised boxes like this one, above the real text. This means that any major revisions to the spec will appear in place of the old one, but there will be a record going back in time as you scroll down.
The first idea for a custom executable format is using a Matroska container (.mkv file). This means that, theoretically, the file could both run in our kernels and be played as a video in any program supporting MKV, including any video players that are created on top of our kernels.
A video player inside this custom Matroska format, which is capable of playing itself as a video, is an amazing concept.
There's just something about a well-written piece of code, in a language that manages to make it look nice. I'm of the humble opinion that for some things, C++ manages this really well. For other things, Java is the way to go. I'd even go as far as to say that Python can make some godawful code look nice.
When it comes to functionality, though, there is one thing that stands out, especially for things like this: C. It's quite possible that it'll make the code look terrible; loops and ifs nested 150 layers deep, but nothing beats the solidity of that platform.
This language is designed to be used for my kernel/OS: Chroma. Thus, I want it to have a feature set that makes it specifically adept at that. I'll be taking inspiration from a great many things, and the combination, mix and recipe will be documented here.
Some people would say that a feature set like Rust's is ideal for this: complete memory safety, easy and quick to iterate. To that I say no. Just, no.
So, let's have a quick look at features that I'd like to have in this language. To do that, let's first have a look at some parts of osdev that I find, quite frankly, annoying.
This isn't going to be a whistle-stop tour - I'm going to get as in-depth as I possibly can, because this is an extremely important step.
The single most important thing for me, in this language, is easy access to inline asm. The GCC way of doing this is with this ghastly mess:
So, take this example pulled straight from the Chroma source:
Take a long, hard look at this code. Try to understand it from what I've given you; the comments should let you infer what the ASM does even if you can't read assembly. Then, find the error.
That's right. This function, which is meant to read a hardware port and return the result as an integer, actually wants the data as input, to output it to the port. This syntax is terrible: hard to read, write and understand. For things like this, where there is absolutely no escaping ASM, an alternative is desperately needed.
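For comparison, here is a minimal sketch of how the two directions differ in GCC's extended asm. It is not the Chroma code: the function names `ReadPortByte` and `WritePortByte` are made up for illustration, and it assumes an x86 or x86_64 target built with GCC or Clang. The whole difference between reading and writing comes down to whether the value operand sits in the output list (`"=a"`, before the second colon) or the input list (`"a"`, after it), which is exactly the kind of detail that is easy to get wrong.

```c
#include <stdint.h>

/* Hypothetical helper: read one byte FROM a port.
 * The value is an OUTPUT operand ("=a"), the port an input. */
uint8_t ReadPortByte(uint16_t Port) {
    uint8_t Data;
    __asm__ __volatile__(
        "inb %[port], %[value]"
        : [value] "=a" (Data)   /* outputs: al receives the byte */
        : [port] "Nd" (Port)    /* inputs: immediate 0-255 or dx */
    );
    return Data;
}

/* Hypothetical helper: write one byte TO a port.
 * Here both the value and the port are INPUT operands. */
void WritePortByte(uint16_t Port, uint8_t Data) {
    __asm__ __volatile__(
        "outb %[value], %[port]"
        :                       /* no outputs */
        : [value] "a" (Data), [port] "Nd" (Port)
    );
}
```

Both instructions are privileged, so these helpers are only meaningful in ring 0 (or with I/O permissions granted); calling them from an ordinary user-mode program will fault.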
For my solution, we have to escape the idea that this language will ever be general purpose. Having easy access to raw assembly language is powerful, and should not be used lightly. Nonetheless, let's also look at the section of code that switches the code segment into the newly formed GDT entry:
Sure, this is readable, and with the \ns you can easily tell what is supposed to be happening here. However, that's a lot of typing for something that's incredibly common in osdev.
Sure, calling this a "solution" is a stretch, but here's how Erythro is going to tackle it:
There are more details to consider here - consider [value] "a" ((uint8_t) Data). It casts the 32-bit Data variable to 8 bits, and passes that to the [value] variable in the ASM.
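To make the cast concrete, here is a small sketch that can run in user mode. It reuses the same `[value] "a" ((uint8_t) Data)` operand, but feeds it to a harmless `movzbl` instead of a port instruction. The function name `LowByteViaAsm` and the `[result]` operand are made up for this example, and it assumes an x86 or x86_64 target built with GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: shows that the cast happens in C, before the asm runs.
 * Only the low 8 bits of Data reach the instruction, in al. */
uint32_t LowByteViaAsm(uint32_t Data) {
    uint32_t Result;
    __asm__ __volatile__(
        "movzbl %[value], %[result]"      /* zero-extend the byte to 32 bits */
        : [result] "=r" (Result)
        : [value] "a" ((uint8_t) Data)    /* 32-bit Data truncated to 8 bits */
    );
    return Result;
}

/* Usage check: the upper 24 bits are discarded by the cast. */
void LowByteExample(void) {
    assert(LowByteViaAsm(0x12345678u) == 0x78u);
}
```

The point is that the compiler has to evaluate an arbitrary C expression and place its result in a register before the asm runs, which is the extra work being weighed up here.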
To do something like that would take a lot more effort than I have for this full challenge, so I'll have to rule out casting in ASM snippets. However, since this is all lexed in the compiler, I should be able to simply use the variable's name in the asm:
And with that small adjustment, it all becomes a lot easier. The ghastly multi-line GCC behemoth above becomes streamlined:
This also introduces another feature I want:
One exceedingly annoying thing about C, despite it being "one of the lowest level languages you can use" (apart from, of course, assembly language, raw machine code, and butterflies!), is that it completely cuts you off from accessing the CPU directly. This makes sense in the modern world - security, right?
WRONG. Well, not wrong, just annoying. Erythro is designed specifically for writing kernels, which by default run in ring 0. Thus, I want registers. I want them now.
One notable side-effect of the mov ax, ds syntax of the asm sections is that ax and ds are now not usable as variable names. As a hunch, I guess this is why GCC requires they be in %register form. However, I actually want these register names to be keywords - so that I can access them in code! Thus, this is perfect.
These registers represent the (uint64_t) versions of these registers.
There are variants:
represent the (uint32_t) casts of the 64 bit versions - that is, they are the lower 32 bits.
represent the (uint16_t) casts of the 64 bit versions - that is, they are the lower 16 bits.
represent the (uint8_t) casts of the 64 bit versions - that is, they are the lower 8 bits.
r8 is the 64 bit (quad-word) version,
r8d is the 32 bit (double-word) version,
r8w is the 16 bit (word) version,
r8b is the 8 bit (byte) version.
This feels really intuitive to me, so I'll steal that too.
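The "narrower names are casts of the 64-bit value" idea can be sketched in plain C. The struct and function names below are made up for illustration; the field comments use the r8 family from above as the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four views of one 64-bit register value. */
struct RegisterViews {
    uint64_t Quad;    /* r8:  the full 64 bits */
    uint32_t Double;  /* r8d: the lower 32 bits */
    uint16_t Word;    /* r8w: the lower 16 bits */
    uint8_t  Byte;    /* r8b: the lower 8 bits */
};

/* Each narrower view is simply a truncating cast of the 64-bit value. */
struct RegisterViews ViewsOf(uint64_t Register) {
    struct RegisterViews Views;
    Views.Quad   = Register;
    Views.Double = (uint32_t) Register;
    Views.Word   = (uint16_t) Register;
    Views.Byte   = (uint8_t) Register;
    return Views;
}

/* Usage check with a value whose bytes are all distinct. */
void ViewsExample(void) {
    struct RegisterViews Views = ViewsOf(0x1122334455667788ull);
    assert(Views.Double == 0x55667788u);
    assert(Views.Word == 0x7788u);
    assert(Views.Byte == 0x88u);
}
```

This only models reading the narrower names. What a write to a narrower name does to the upper bits of the full register is a separate question that this sketch does not cover.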