Law & Order in Software Engineering with C*

Penned on the 2nd day of December, 2020. It was a Wednesday.

Last updated on the 3rd day of December, 2020. It was a Thursday.

Note
This paper details the concepts of Law & Order, and a programming language called C* that implements them. It has no relation to the Connection Machine, nor to the associated superset of ANSI C by the same name.

Software engineering over the last fifteen to twenty-five years has become stunted. Less and less attention is paid to quality of work, and incredible amounts of resources are spent solving second-order problems. Executives and engineers alike shoulder the blame for this, and the result is simply a whole lot of bad code. Peter Welch has mused much about the deplorable state of code quality in his semi-hyperbolic rant titled Programming Sucks. Bad code in a vacuum would not be a problem, but code does not stay in a vacuum: all of the code we have works together, and bad code makes it nearly impossible to write good code in an economically cohesive fashion. If some bad code is sitting lower in the hierarchy, it can mess things up, lowering the quality baseline of all code running on a person’s machine. Programs need Law & Order, and where they need it most is with systems.

Systems programming is one of the most difficult branches of software engineering, so difficult that many developers today do not even understand manual memory management. The advantage of C with regard to systems programming is widely misunderstood; it has less to do with performance, portability and applications, and more to do with communicating complex systems. Stephen Kell argues this thoroughly in his paper titled Some Were Meant for C. Relatedly, the shortcomings of the many, many attempts to make systems programming better stem from this same misunderstanding of C’s power. The C language forces the programmer to make the composition of their code explicit, and this has very positive effects for what Kell calls communicativity in systems design. C also has unparalleled ability to integrate with existing code; Kell explains this in a quote from Richard Gabriel:

In the worse-is-better world, integration is linking your .o files together, freely intercalling functions, and using the same basic data representations. You don’t have a foreign loader, you don’t coerce types across function-call boundaries, you don’t make one language dominant, and you don’t make the woes of your implementation technology impact the entire system.
—Richard Gabriel, 1994

Ken Thompson gave one of the earliest known reflections on this problem with software in his Turing Award lecture, titled Reflections on Trusting Trust. His moral at the end suggests that real-life law would soon be looked to for setting boundaries on the practice of malware development. However, in the decades that followed this lecture, the World Wide Web came to be and changed just about everything. It is probably more pragmatic now to bring law & order to the code that needs it, instead of waiting for the code to show up at a courthouse. So, this is what I will do.

With this in mind, only one more requirement is left to be illuminated: the concept of the total system. The total system is the system in development which the programmer, with all of their tools, has unfettered control over. There are three main components in a total system: laws, pacts, and marshalling. With this idea, it becomes possible to define arbitrary constraints about data and enforce those constraints at compile time. This could be thought of as a radical form of programming by contract, but there are elements for scaling these contracts and dealing with violations that are quite new.

A law defines an expression about a data type which must be satisfied in order for compilation to succeed. For example, it is possible to say that all integer primitives may not be negative:

/* declare a new law that applies to all integers (int, short, long).
 * the subject is addressed using a sole underscore ‘_’ pronoun. this law
 * is given the name ‘no_neg_int’.
 */
law no_neg_int : int, short, long
{
	_ >= 0;
};
/* this law can be applied to further types after its definition like so */
law no_neg_int : char;

laws carry all of the same powers that laws of physics would. It is by every conception impossible for a program to defy its laws within the context of the total system. When dealing with data that comes from outside the total system, laws do not apply. Instead, the programmer may define a pact.

pacts function like laws, but only apply at the border of a system. The constraints of pacts may be violated by the foreign data they apply to, and in the event of violations, special code blocks may be written to handle such cases. This handling is called marshalling, which is distinguished here from mere serialisation: in marshalling, the data is not just serialised but validated against the constraints of the pacts that apply to it.
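
Here is a minimal sketch of a pact; its declaration is written to mirror the law syntax shown above, since the final spelling is not fixed here, and the name ‘no_null_str’ is illustrative:

/* declare a pact over strings arriving from outside the total system.
 * the form mirrors the law declaration above; the name is illustrative.
 */
pact no_null_str : char *
{
	_ != NULL;
};

Unlike a law, a violation here does not fail compilation; it routes execution into marshalling code at the system border, described next.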

marshals are used to deal with illegal input from foreign callers. For every publicly-callable function with constrained inputs, marshalling is required. Marshalling code is not executed for calls from within the total system, and is skipped if the constraints, as checked at the call boundary, are satisfied. Additional restrictions apply within marshal scope:

  1. no calls to non-pure functions are permitted
  2. only the parameter being marshalled may be identified in code
  3. the marshalled parameter cannot be modified

Here is an example of some marshalling code:

int myfunc( int a, char * b )
{
	marshal a
	{
		/* any violation of a’s constraints is rejected outright */
		return 1;
	}

	marshal b
	{
		/* NULL is tolerable here; the normal code below has a
		 * special case for it, so accept it and continue
		 */
		if(b == NULL)
		{
			continue;
		}

		/* any other constraint violation is rejected */
		return 127;
	}

	/* normal code where a and b are well-defined */

	return 0;
}

The C* programming language is an attempt at revisiting ANSI C, as standardised by the American National Standards Institute in 1989, reinforcing its strengths, and adding these new tools to the language. C* will not add ‘object-oriented’ concepts to the language like C++. Indeed, it is a departure not only from C++’s drive towards multi-paradigm language development, but even from the notion of an ‘unboxed’ programming language entirely. The language will add some minor things to improve the details of data representation, but it is otherwise quite faithfully ANSI C, down to the hoisting of locals and the lack of line comments.
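
As a small reminder of the conventions in question, here is an illustrative sketch of what faithful ANSI C looks like:

int sum( const int * vals, unsigned long n )
{
	/* ANSI C hoists all locals to the top of the enclosing block,
	 * and block comments like this one are the only kind available
	 */
	unsigned long i;
	int total;

	total = 0;

	for(i = 0; i < n; ++i)
	{
		total += vals[i];
	}

	return total;
}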

That said, C* also introduces two new features for basic reflection: tagging, and the areaof operator. Tags are string literals that functions can add to variables that they create, and they can be reflected upon in the context of laws, pacts and normal code as constant expressions. This is useful for ensuring the origin of some data, e.g. a pointer returned by malloc() that must be either a valid pointer or NULL. The areaof operator is similar to the sizeof operator, except it gives information about pointer types and the memory they point to. The value of areaof is the number of bytes allocated in the memory region the pointer points to. areaof is also an lvalue, meaning it can be set and modified as necessary. These tools are not meant to guarantee certain behaviour, but to make it possible to know important details about data and decide what to do with it in laws or in code.
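
Here is a minimal sketch of how these could read in code. The exact spelling of tag reflection is not fixed here, so the tagged() test, and malloc() tagging its result “heap”, are illustrative assumptions:

#include <stdlib.h>
#include <string.h>

void wipe( void )
{
	char * p;

	p = malloc( 64 );

	/* suppose malloc() tags its result “heap”; the tag name and the
	 * tagged() test are illustrative, as their spelling is not fixed
	 */
	if(p != NULL && tagged( p, "heap" ))
	{
		/* areaof gives the byte count of the pointed-to region */
		memset( p, 0, areaof p );

		/* areaof is an lvalue, so the recorded size may be adjusted */
		areaof p = 32;
	}

	free( p );
}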

It must be pointed out that it is possible to program with respect to law & order idiomatically not only in ANSI C, but even in other programming languages like JavaScript or Python. No programming language to date is designed for this, but that does not make it impossible in practice.
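
For instance, here is a minimal sketch of that discipline in plain ANSI C, restating the no_neg_int law from earlier as an ordinary boundary check; the helper name is illustrative:

/* the constraint of the no_neg_int law, restated as a plain predicate */
static int no_neg_int( int x )
{
	return x >= 0;
}

/* a publicly-callable function marshals its own inputs by hand at the
 * border, once, before any normal code runs
 */
int myfunc( int a )
{
	if(!no_neg_int( a ))
	{
		/* reject illegal foreign input, as a marshal block would */
		return 1;
	}

	/* normal code where a is well-defined */
	return 0;
}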

As mentioned earlier, many attempts have been made over the last ten to fifteen years to make systems programming better. Rust is popularly chief among these. It fails to improve systems programming in several dimensions: first, the lack of integration ability, as argued (without prejudice to Rust) by Stephen Kell; second, a relentless devotion to counterproductive metaprogramming, making Rust an ‘unboxed’ language like ‘modern’ C++; and third, the lack of coherent tools for dealing with the outside world on its self-prescribed terms of ‘safety’, since it simply does not cope with the absence of safety in the software beyond its borders.

Rust continues a trend started with Lisps back in the twentieth century: marrying the language to a package manager. It also takes many cues from JavaScript’s npm, making its ecosystem very online. This approach comes with the same drawbacks that the npm and Lisp ecosystems suffer from. In the npm case, there is no mechanism besides one’s own eyes for adjudicating the quality of a package, which causes quality to suffer. As in the Lisp case, Rust packaging does not integrate well with codebases that are not already written in Rust via its package manager. Bindings need to be written, which, as Sergey Davidoff describes, are unsafe by design. Most importantly, Rust’s ecosystem is plainly hostile to language migration, as Kell explains:

Like any language, C persists partly because replacing code is costly. But perversely, the implementation technologies favoured by more modern languages offer especially unfavourable effort/reward curves for migration. Migration all at once is rarely economical; function-by-function is probably the desired granularity.
—Stephen Kell

New materials and tools, whatever their merits, would hardly justify rebuilding the pyramids of Egypt. Why rewrite code at all? Primitive tools can still produce great work. The proposition of rewriting codebases is an expensive one, and an increase in bugs is highly likely in the short term.
—Stephen Kell

At the same time, metaprogramming has become one of the biggest resource sinks in the field of software engineering. It amounts to endless instances of “the General Problem” identified in this XKCD, where a programmer, instead of passing the salt, devises a system to pass arbitrary condiments, claiming it will “save time in the long run”. In doing so, he adds a layer of abstraction to the whole thing, and with it a load of complexity.

My Ethos for Sustainable Computing expounds upon the woes of metaprogramming in detail, but it suffices to say that no abstraction ever comes for free. While it is possible to avoid burdening the runtime with abstractions, and even to save most of the effort they incur at build time, as Rust has surely shown, what is not so appreciated is the human cost involved in such endeavours.

Not a single living person knows how everything in your five-year-old MacBook actually works. Why do we tell you to turn it off and on again? Because we don’t have the slightest clue what’s wrong with it, and it’s really easy to induce coma in computers and have their built-in team of automatic doctors try to figure it out for us.
—Peter Welch, Programming Sucks

This load of what I call in my Ethos “immaculate complexity” is unsustainable and will collapse in the timeframe of the next fifty years. There is simply no real value backing much of the complexity we devote so much time and money to, and just like a bubble economy things will eventually crash down in a ‘correction’ to their real value. This will probably be very destructive collaterally for the software industry as we know it.

On the topic of safety, the Microsoft Security Response Center provides a primer on the supposed benefits of using Rust over C or C++:

What separates Rust from C and C++ is its strong safety guarantees. Unless explicitly opted-out of through usage of the “unsafe” keyword, Rust is completely memory safe, meaning that the issues we illustrated in the previous post are impossible to express.
—MSRC blog

Put another way, Rust makes it impossible to express much about manual memory management. A Stanford CS 242 lecture confirms this conception, calling it “memory containment”. This haemorrhaging of vocabulary is needlessly destructive: it cordons off such code into ‘unsafe’ blocks and provides nothing to guard the safety boundary from the incidence of bugs. Rather than quarantining memory-managing code and treating it like the apocalypse, C* opts to embrace it, because it provides mechanisms to cope with such issues at every level.
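
As an illustrative sketch of that embrace, the pact and marshal forms from earlier can be combined with areaof; this assumes areaof yields zero for a null pointer, and the names are illustrative:

/* a pact over heap pointers arriving from outside the total system:
 * they must point at a region of known, non-zero size
 */
pact sane_heap_ptr : void *
{
	areaof _ > 0;
};

void consume( void * p )
{
	marshal p
	{
		/* NULL violates the pact but is tolerable; the normal code
		 * below has a special case for it
		 */
		if(p == NULL)
		{
			continue;
		}

		/* any other violation, e.g. a zero-sized region, is refused */
		return;
	}

	/* normal code where p is well-defined */
}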

Sergey Davidoff wrote a Medium article titled How Rust’s standard library was vulnerable for years and nobody noticed. After demonstrating how common and easy to find such vulnerabilities are, he explains how the same problems plague Rust crates, and even the language’s standard library. His conclusion is that well-behaved code depends on a great deal of verification. This is a concession that Rust’s safety mechanisms utterly fail in the places they are needed most: unsafe code.

It is more important than ever to get a handle on the burgeoning complexity of the world’s digital systems. Doing so requires marrying good theory with the practical realities of engineering. This is what Law & Order, and the C* programming language, are designed to do. These tools make it clear how it can be realistic to solve the fundamental problems Thompson spoke of in his Turing Award lecture about “trusting trust”. Within a system, trusting one’s self is enough; outside a system, whatever comes in is handled cautiously and judiciously. By applying C’s unrivalled powers of communicativity and integration, systems programming can be brought into this new paradigm painlessly and economically. This will mark the beginning of the end of vulnerabilities as they are known today.

Until next time,
Άλέξανδερ Νιχολί