Announcing C0 2.0

May 28, 2020

In this post, I’ll walk through the key additions and enhancements made to the CC0 compiler, as well as sharing some of the discussions we had along the way.

C0 is designed to be used by people who may not have a lot of programming experiencing. When adding features, it’s important to weigh the added expressiveness with the added complexity. My personal opinion on this matter is that C0 should not be more difficult to program in than C. So that’s why we’ve added things like printf and backtraces, as these are available in C either natively or through easily accessible tools e.g. valgrind.

This post is a little long so here is a table of contents:

Changelong
Implementation

Changelong

First, here is the complete list of changes:

Compiler

Added unused variable warnings
Added unused expression result warnings
Added dead code/unreachable code warning
Added a check to prevent C1 files from being overwritten if specified as the target of -o
Added a check to prevent compiling executable files
Added printf/format, where the format string must be a string constant.
Modify lvalue rules for dereference from lval ::= *lval | .. to lval ::= *(t*)lval | *lval | ..
Colorized the output of various error messages
Fixed some typos

C0VM

Added C1 bytecode generation
Implemented runtime for tagged pointers

C0 Libraries/Runtime

Added backtraces which are printed on abnormal program execution
Added a callstack bound
<parse>: Fixed a bug where parse_int would not sign extend the number, causing it to fail incorrectly on valid inputs like 0x80000000 (should parse as -1, but was parsed as 2147483648).
<img>: Improved error message if the user tried to read an invalid/corrupted PNG in image_load
Made the maximum array size, the maximum callstack size, and the maximum number of entries printed in the stacktrace configurable via environment variables
Colorized the output of various error messages

Implementation

Warnings

A lot of the changes made revolve around making it easier to debug C0 programs. The first of these is the addition of warnings about unused variables, unused expressions, and for unreachable code.

int main() {
  int x = 3;
  int y = 5;

  if (y == 3) {
    y == y + 1;
    return y;
  }
  else {
    return -1;
  }

  return 4;
}

In this program, the variable x is never used. Sometimes this isn’t an issue - maybe the code isn’t finished and the author plans to use it later. But this does frequently indicate ‘forgetting’ to use a value.

This code also has the statement with no effect y == y + 1. This is almost certainly an error, the author probably intended y = y + 1 instead.

Finally, the last return 4 is unreachable. In a simple program like this it is obvious, but in a larger program, the author may expect that code to be executed and be surprised when it never is.

All these things are easy for the compiler to recognize, but they can be difficult for a human because they are often very subtle. Previous versions of CC0 would silently accept the above program, but now it will produce the following:

foo.c0: 6.5-6.15: warning: expression result unused
  | 
6 |     y == y + 1;
  |     ~~~~~~~~~~ 
foo.c0: 2.3-2.12: warning: value is never used
  | 
2 |   int x = 3;
  |   ~~~~~~~~~ 
foo.c0: 13.3-13.12: warning: unreachable code
   | 
13 |   return 4;
   |   ~~~~~~~~~ 

Hopefully this will make debugging a little less tedious.

Adding `printf`/`format`

When something goes wrong in a program, many programmers instinctively start debugging with print statements. However, currently in C0 this can be very tedious. Consider the following code which loops over an image and computes some value:

int[] compute(image_t img, int width, int height) {
  int[] A = alloc_array(int, width * height);

  for (int row = 0; i < height; row++) {
    for (int col = 0; col < width; col++) {
      int i = f(row, col, width, height);
      int x = g(img, row, col);

      A[i] = x;
    }
  }

  return A;
}

Suppose this function is returning the wrong result. To start debugging, we may want to inspect the values of the variables at each iteration.

print("row = ");
printint(row);
print(", col = ");
printint(col);
print(", i = ");
printint(i);
print(", x = ");
printint(x);
print("\n");

I think it’s obvious that this is very cumbersome. With printf it becomes very concise:

printf("row = %d, col = %d, i = %d, x = %d\n", row, col, i, x);

There is also format which returns string instead of printing things out directly.

It’s important to note that printf/format are not real functions. They are essentially just syntactic sugar for a series of print/printint/printchar statements. Because of this, you must pass a string literal as the format string (to ensure type safety), and you may not take the address of either function (it’s impossible since the variadic type of printf cannot be written down in C0).

Implementing this is not as elegant as one would hope. You can’t explicitly write a prototype for printf in a library header, but it needs to be declared as a symbol to make sure users don’t try to shadow it. This requires watching for conio to be loaded, and then hardcoding printf as a symbol. Then the typechecker needs to be intercepted when examining function calls, and when taking the address of a symbol. The format string is also parsed at compile time to ensure the argument types match up.

This also doesn’t really have the full power of C’s printf as you can’t specify field widths, print a number in hex, etc. At first I was hoping to just use printf directly by desugaring a printf call into a series of printf calls where each call takes exactly 2 arguments. This is more difficult than it seems because of C0VM and the C0FFI. The FFI works by creating a wrapper for every native function which unboxes arguments from the C0VM c0_value representation to their C representation. This would mean you would need to create separate wrappers for printing strings, ints, and chars. That itself is not really an issue, but wrappers are generated directly from C0 header files, which means that you would need to then hide these extra functions from user code. This solution ends up being pretty inelegant and adds a lot of complexity to the compiler.

In the end, the current functionality implemented is a good balance between simplicity of implementation and usefulness.

Casting in an lvalue

Currently the grammar for lvalues looks like:

lvalue ::= *lvalue 
         | lvalue[expr]
         | lvalue.field
         | identifier
         | (lvalue)

For example, expressions like *alloc(int) are illegal. This is nonsensical so it makes sense why it would not be allowed.

However doing something like the following is fairly common, but is currently illegal:

*(string*)x = "hello";
(*(int*)(record->count))++;

To make this work we make a small modification to the lvalue rules:

lvalue ::= *(t*)lvalue | ...

We also considered

lvalue ::= *lvalue | ... => lvalue ::= *expr | ...

But this was rejected since it then allows nonsensical statements such as *alloc(int) = ... from before

C1 bytecode generation

As the name suggests, CC0’s bytecode generator didn’t generate bytecode for C1 programs. This means that programs which use generic (void*) pointers or function pointers couldn’t be run on C0VM. However, supporting this features adds only 4 additional instructions to C0VM, so I implemented it.

[*] = tagged pointer

0xB6 invokedynamic   S, v1, v2, ..., vn, f:i32 -> S, v
                    (idx = f & 0xFFFF, t = (f >> 31) & 0x1,
                           function_pool[idx] => g, g(v1,...,vn) = v if t = 0
                             native_pool[idx] => g, g(v1,...,vn) = v if t = 1)

0xC0 checktag <c1,c2> S, a:[*] -> S, a:*
    (cast from void*: if a has tag (c1<<8|c2), otherwise c0_memory_error())
0xC1 hastag <c1,c2> S, a:[*] -> S, x:i32
    (\hastag: if a has tag (c1<<8|c2) then 1 else 0)
0xC2 addtag <c1,c2> S, a:* -> S, a:[*]
    (cast to void*: convert from a regular pointer to a tagged pointer)

Function pointers use invokedynamic. The information needed to identify a function is just its index in the function pool or native pool. This information is statically known, so it is encoded into the index as the highest order bit.

This becomes a little counterintuitive when storing function pointers in the heap - the name suggests that they take up 8 bytes, but since they are actually only 32-bit integers they are only 4. In addition, by the typing rules we know that every function pointer call looks like

(*e)(v1, v2, ...)

But in the case of C0VM e should not be dereferenced since it’s just an integer.

This also means that we load/store function pointers using imload/imstore. This may seem a little strange since semantically we are loading pointers not integers. But this was also judged to be ok since C0VM opcodes aren’t really intended to semantically resemble the original program, and instead represent the concrete execution strategy (e.g. string ~ char* in C0VM but not in C0)

Generic pointers are a little trickier. A generic pointer needs to carry around two pieces of information, the actual pointer, and the original type/tag (represented as integer). This is represented by the following struct:

struct c0_tagged_ptr {
  void* p; // the "real pointer". always non-NULL
  int tag; // types used in casts are mapped to numbers
};

This seems easy so far, but problems show up when dealing with operator ==.

int* ip = alloc(int);
void* p1 = (void*)ip;
void* p2 = (void*)ip;

assert(p1 == p2);

The cast to void* allocates a new c0_tagged_ptr struct each time, so naively comparing pointers will cause the assert to fail. When comparing tagged pointers, we really just want to check if the c0_tagged_ptr::p’s are equal. It may seem like we can get away with adding a case C0_TAGGED_PTR to

enum c0_val_kind { C0_INTEGER, C0_POINTER };

but this doesn’t work because we lose all this information when we store a void* in the heap (e.g. in a struct), since amload/amstore just reads/writes 8 bytes.

To solve this problem we use the highest 1 bit of a pointer to represent if it is really a tagged pointer. As the address space is 48 bits wide at most, this will never conflict with an address. We don’t want to use the lower bits because indexing into the string pool can lead to arbitrary offsets.

This technically does drop 32-bit compatability with C0VM. So I guess you won’t be able to run C0VM on your Samsung Smart Fridge anymore.

Backtraces

I mentioned how printf is very useful in situations where you know a function is producing the wrong result and you want to track down why. However, true to C, C0 programs sometimes will crash with nothing more than “attempt to dereference a NULL pointer” or “array index out of bounds”. This does not give any information as to where the problem is occuring.

Technically programs like valgrind or gdb/lldb can be used to get a stack trace, but due to the name mangling CC0 does it can be hard to read, especially for novice programmers.

When programs are compiled with -d the stack trace will be recorded, producing output like the following:

Clac top level
clac>> 1 0 /
c0rt: division by 0
c0rt: in a function called from:
          eval (clac.c0: 237.39-237.50)
          top_level (clac-main.c0: 52.12-52.26)
          main (clac-main.c0: 82.3-82.15)
          (program start)

In this example of an RPN interpreter, invalid divisions were not guarded against. The runtime error displays where the error is happening.

Maximum stack size

Currently a C0 program such as the following will exhaust stack space and segfault:

int foo(int i) { return i + foo(i + 1); }
int main() { return foo(0); }

This just produces the message Segmentation fault with no further explanation. This can be misleading as students are frequently told in class that segfaults are caused by dereferencing NULL pointers or by accessing arrays out of bounds.

When compiled with -d, CC0 will track the callstack size. If it gets too large, then the program will abort:

c0rt: Maximum callstack size exceeded (is 86306, change $C0_STACK_LIMIT to adjust)

The stack size can be adjusted via environment variables. Other parameters can be adjusted via environment variables such as the maximum allocation size ($C0_MAX_ARRAYSIZE) and the maximum backtrace length ($C0_BACKTRACE_LIMIT).

This means that some programs may now crash when compiled with cc0 which run fine when run on C0VM. Although this does create a difference in the two implementations, this was judged to not be an issue because the two had already diverged (e.g. functions in C0VM can only declare at most 256 variables).

Command line enhancements

The name of the produced executable can be set with the -o flag. There are a couple of ways this can go wrong:

If the file name is omitted, then it’s likely the output file will be a source file e.g.
```
% cc0 -o foo.c0
```
Currently CC0 will detect this and refuse to overwrite foo.c0, but if the file extension was .c1 it would overwrite the given file. This data loss is very painful, so CC0 has been patched to detect this too.
If the -o flag is omitted, then it’s possible CC0 will try to read a binary file. This can lead to confusing error messages as the lexer will print out a confusing series of bytes when it tries to report an error. (This usually happens in conjunction with the problem above, so it shouldn’t happen anymore)

Finally, using ANSI escape codes, the output produced by CC0 is now colorized. Reading a bunch of monospace white text can really test your focus, so colors should help group information and enhance readability

Library bug fixes

These are relatively minor things. Some students reported that parse_int was failing when it shouldn’t. This was because parse_int was using strtol. So an input like 0x80000000 was not parsed as a negative number, but as (long)INT32_MAX + 1. That ends up being larger than INT32_MAX so parse_int failed.

In some cases PNG files would get corrupted when loading them with image_load. It used to just fail with png internal error. The error message has been updated with some more details as to what went wrong.

So that wraps up this update to C0. Currently these changes are available from my Andrew public directory, but they should be deployed globally on Andrew later this summer.

I think this should remove a lot of the tediousness from the language that simplicity can bring. Learning computer science is already quite challenging - hopefully this helps makes things less frustrating.

As a final note, this article basically contains everything I wanted to add to the language. I am very interested in hearing what other people’s opinions are. Feel free to contact me with any suggestions or other feedback!