Rafael rambling: 2012

Sunday, October 21, 2012

Security and Objects

Mobile code is one of the great challenges for software security. Lets say you are writing an email application. The idea that people could send little apps to each other in email messages might seem like a potentially interesting feature: users could build polls, schedule meetings, play games, share interactive documents. Kind of cool.

And if the platform you are building upon supports reflectively evaluating code, it could be as easy as something like this (in OO pseudocode):

    define load_message(message)
        ...
        eval(message.code)

Of course it can't be that easy. What if the code in the message does something like:

    new stdlib.io.File("/").delete()

The standard way to avoid the vulnerability is to put the code in a so-called sandbox. It sounds very secure, but in practice this usually amounts to gathering up a list of "dangerous" call sites and inserting in each some code to check if the caller has permission to proceed. So the implementation of delete would include code along the lines of:

define delete()
    if VM.callStackContainsEvilCode?()
            raise YouShallNotPassException()

        ...

This is fraught with problems. It requires runtime support for inspecting the call stack and a system for declaring that certain code has some level of authorization while some other code has a lower level. Not to mention the busywork of going trough the code and peppering that little snippet over every suspect call site. If you miss one — say, for instance, a method that gets the addresses of all contacts on your email application — and you have a security bug on your hands.

A better way ?

Perhaps there is a better way. Take another look at the offending line: new stdlib.io.File("/").delete(). It is only able to call the dangerous delete() method because it has a reference to a file object pointing to the root of the filesystem. And it only has that reference because it could reach for the File class on a global namespace. What if there was no global namespace?

It might seem weird, but it's not that hard to imagine a programming system lacking a global namespace. Many object-oriented languages, following Smalltalk's lead, have a notion of a metaclass, an object that represents a class. Many of them (also following Smalltak) also get by without a "new" operator. Objects are created by calling a method — usually named new() — on the metaclass object.

We are very close now. The last step, unfortunately not taken by most common languages, is to avoid anchoring the metaclass object onto a global namespace. The result is that code can only create objects of the classes it holds a reference to. And it only has a reference if it is given one via a method or constructor parameter.

Proceeding recursively, we end up with a stratified program. There is an entry point that receives a reference to the entire standard library, and each call site decides how much authority to grant each callee. On our example, when we evaluate external code we can grant very little authority, meaning we can pass the evaluated code just a handful of references. Care must be taken so that none of them will direct or indirectly provide a way to create a File. In a way, object design becomes security policy.

And we get very fine-grained control over such policy. We could, for instance, grant loaded code authority to write on a designated directory just by passing it a reference to the Directory object for that directory. Our choices get even more interesting when we realize we can pass references to proxies instead of real objects in order to attenuate authority. Continuing with our example, hoping it doesn't get too contrived, we could build a proxy for the Directory that checks if callers exceed a given quota of disk space.

Research

I have mentioned above that most common languages don't fit this post's description. But there are languages that do, a prime example is E. In fact, there is a whole area of research for dealing with security in this manner, it's called "object capability security".

I'm not really a security guy, I got interested in the area due to the implications for language and system design. If you got interested, for any reason, please check out Mark Miller's work. He is the creator of the E language and the javascript-based Caja project. His thesis is very readable.

Wednesday, August 29, 2012

A Practitioner's requests for Programming Language Research

I have been interested in programming languages and programming language theory for some time, CTM is my favorite technical book, and I even managed to wade through the first chapters in Pierce's TAPL (some day I'll get through the rest, some day...). But it's not often that I can connect that interest with by job as practicing programmer. This post is an attempt to forget about my particular research interests and try to list my daily pain points and wants as a user of programming languages.

If this is to make any sense, I have to start outlining the context: mostly back-end-related web development, currently with a content-based website, with past consulting stints at more and less enterprisey environments. I won't presume to represent any particular segment of the programming population, but I would be surprised if my peeves are unique.

Semi-automatic mapping between data representations.

Most programs receive data in one or more external forms — database rows, JSON documents, ASN.1 records, form data and many others — transform them to internal representations for computation and write them out again. Reflection enables generic marshaling libraries so we can escape the drudge of writing and maintaining all that conversion code by hand. There is a whole research area for doing about the same in typed functional programming — polytypic programming — but I know next to nothing about it, I'm afraid.

All this is great, but in practice a completely generic mapping algorithm is rarely sufficient. It forces one of the two representations to bend to conform to the other: who hasn't seen horrible schema-generated classes or ugly auto-generated xml? On the other hand, when we try to sophisticate the mapping to give developers more flexibility, we get ever more cumbersome configuration options cluttering our code. We need something better and I hope PLT can help; it could be an elegantly terse language for writing the mappings, it could be an ingenious abstraction mechanism for composing generic mapping with custom instructions, or it could be something else entirely.

Library versioning

The issues around library versioning and dependency management are well known and have been for a long time, generation after generation of technology creating it's own version dependency hell. The situation is so dire, just enabling the coexistence of many versions of a library on a single machine is hailed as a major breakthrough.

A language based approach can afford to be much more ambitious. Most module systems I'm familiar with handle compatibility by having client-supplied rules preside over version numbers defined by the library provider. The provider, in turn, must decide wholesale for his entire library what level of compatibility he thinks each release can maintain. In practice there is a lot of guessing going on both on the client and the provider side, mediated by the very lossy medium that is the version number.

Perhaps things need not be so complicated if the language gets in on the action. Imagine if we could indicate to our compiler "this little change in this method here is supposed to be compatible, while this other change over there is not" and the information would trickle down from caller to caller all the way to the public API surface. The compiler would of course flag a call from a compatible method to a non-compatible method, and perhaps a theorem prover/counterexample generator could pinpoint places where the developer unwittingly broke compatibility. If our modules are first-class, we can even contemplate having several versions of them interoperating in runtime to satisfy transitive dependencies differences.

Data evolution

If the behaviour evolution problem described on the item above is solved, the next issue to tackle is the evolution of data. In a way it's a harder problem to face, as it's not enough to track what changed, we need to know why it changed (simple example: the User record has a surname String field on version one; on version two this field is gone and a new last_name field is there; should it be considered a rename?).

In the context of server-side web applications I don't see a particular need for in-memory application data evolution (clusters alleviate the need for hot-deploys). But migrating externally stored data is a real necessity. As far as I know, the state of the art now amounts to filtering and sorting manually written database scripts to run at deploy time, perhaps with some syntactic sugar sprinkled on top.

But if we note that on the moment the data declaration code is being changed, the developer knows why it's changing, we just need a way to record this intent and tie it to the mapping and change structures from the previous two items, and voilà. Things sure sound easy on blog posts, don't them?

Tooling

I've written about this before, so I'll just briefly lament what a shame it is we are trapped between monstrous cockpit-like IDEs and dumb-as-a-rock text editors, when the obvious path of a language-aware editor seems to interest almost no one. By the way, I wish the Light Table guys all the luck.

Non-issues

From my biased perspective as a practitioner on my particular domain, there are some problems tackled by programming language research that I don't see as particularly pressing.

Sequential performance: Cache is king.
Parallel performance: On the server, multicore parallelism is very well exploited by request-level concurrency and virtualization.
Concurrency: STM and Actors are nice but a very large portion of the concurrency issues in practice are simply delegated to the database (and I haven't the foggiest idea what concurrency support those guys crave for, but I bet many are content with locks, monitors and semaphores).
Correctness: We have bugs, of course, but they seldom present themselves as classical broken invariants. What are they then? Faults of omission, unintended interactions between separate systems, API usage misunderstandings. I'm skeptical on language help for these, but I'd be glad to be proven wrong.

Seems like a good time to repeat that on this post I'm trying to connect my particular needs as a developer on my current domain to what programming language research might accomplish. I'd expect the issues and non-issues to be different for other domains, or even for other developers in a similar domain but with a different background.

Cake

Who doesn't love cake? No, not this cake, this cake. Hmm.

Saturday, July 07, 2012

Types and Bugs

There are certain discussions in our biz that are so played out they provoke instant boredom upon encounter. A major one is the old dynamic vs. static skirmish, recently resurfaced in a blog post by Evan Farrer. Which is a shame as the post is quite interesting, describing his results transliterating well-tested code in a dynamic language to a static language to see if the type system found any bugs. Which it did.

The full-length paper is a great read as well. He describes his translation methodology and gives some detail on each bug found. At first it may seem the author could be stacking the odds towards the static language as the translation was manually done by himself, but I found his description of the process pretty convincing of its fairness. The choice of Python as a source language probably helped given the pythonic inclination towards straightforward code that avoids sophisticated abstraction and metaprogramming mechanisms.

But the real meat of the paper is in the description of the bugs he found. Upon a not particularly discriminating reading, a clear pattern jumped out. Most of the bugs fell into one of two categories:

Assuming that a variable always references a valid value when it can contain a null value.
Referencing constructs that no longer exist in the source.

The first category is also the largest, comprising several places where the original code could be coerced into letting a variable be set to a null value, usually by just leaving it uninitialized, and a subsequent call would attempt to derefence it assuming it contained a valid value. Haskell's type system avoids the problem as it simply doesn't have any notion of null. Code that has to deal with optional values must do it trough algebraic data types.

How the second category of bug comes about is easy to guess from the projects histories: some method or variable was present but changed, perhaps it was renamed or subsumed, and not all references were updated to reflect the change. Even pervasive unit tests can't hope to catch these kinds of regressions, as the problem is found on the integration between units of code; the units themselves are just fine. A type system helps when the change affects the signature of the referenced construct, which is often but not always.

If the study's findings are generalizable and my observations are correct, these are the main takeaways:

If you have a type system at your service, it's prudent to structure code such that behavior-breaking changes are reflected on the types.
End-to-end integration tests are a necessary complement to both a suite of unit tests and a type system. In my experience how far should these tests stray off the happy path is a difficult engineering trade-off.
If your type system allows nulls — such as Java's, for instance — its role in bug prevention is greatly diminished. The proportion of null-dereference bugs on the analysed code bases helps to makes it clear just how big a mistake it is to allow nulls in a programming language.