Tuesday, August 30, 2011

Language design

Programming languages are our principal tools, the things with work with every day. Most programmers (real ones) have a particular relation to their favorite language, but, they're also very curious about new languages, new paradigms (or at least paradigms, they haven't test so far), new concepts …

This is probably the reason why it is so hard too have clear answers from programmers to questions like “what is the best programming language ?” or “what is the worst programming language ?”

So, I'll try to make a small survey of ideas on language design that I found most useful when trying to classify languages. I'll also add my personal visions on what is a good language.

Semantics ?
Lot of people began classification of programming languages using execution model (compiled or not) or paradigm (functional, object oriented, declarative, procedural … ) This is probably a good idea, but I won't do it that way.

From my own point of view there are two major groups of programming languages: those with practical approach and those with theoretic foundations. This is how I understand the evolution of programming languages.

The practical way follow the history of computer evolution: early computers have a minimal interface (switches, punched cards … ) and programmers where using direct machine code. As computers were able to treat bigger programs, programmers start to use assemblers that transforms an “almost” human-readable source code into binary machine code. And the evolution continue in the direction of more abstractions: conditional jump were replaced by if-then-else constructions, while and for loops simplified basic algorithms and so on. This where the birth of structured programming. The basis of this evolution is to abstract recurring coding constructions using a simpler syntax. Starting from machine code, the practical way has reached structured and object oriented programming paradigms.

On the other side, the theoretical way, take birth in mathematical models for computing like lambda-calculus. This branch of language evolution take a completely different direction: we start with an abstract concept, often with a very formal definitions, and try to build a concrete programming language. The evolution is now reverted, since the goal is to found lower level translation of abstract syntax, rather than building new syntax for known constructions.

Both evolution produce a lot languages like Cobol, Fortran or C/C++ for the practical way and Lisp, Scheme, Prolog or ML family for the theoretical way. The interesting fact is that the two ways are converging ! Practical languages tends to introduce concepts with theoretical foundations (like in Java or C#) and theoretical languages are more and more able to produce efficient concrete machine code.

So, why starting a classification like this ? Just because the way a language was built has a great impact on how the language can be used.

Languages with well founded semantics are good in implementing clever algorithms and theoretical data manipulation (symbolic processing, compiler, code analysis tools … ) On the other hand, more practical languages (like C) are more suited for system programming and any other activities involving lower level manipulations.
Dynamic or Static ?
An other important discriminating criterion is the nature of the language execution model. Rather than talking about compilation versus interpretation (or virtual machine, JIT … ) I prefer to push the distinction in the field of dynamic versus static languages (there are compilers for dynamic languages and interpreter for static ones.)

Again, using one kind or the other is a matter of goals. Dynamic languages are often considered as toys or scripting languages, but this is too restrictive since modern languages (such as Perl, Python or Ruby) are probably as good as other languages for writing applications.

In fact, this is more a matter of program design and software conception. Dynamic programming tends to encourage programming by accretion: you build a small kernel of functionality, and then make it grows. The dynamic nature of such languages, let you modify or add behavior to existing code without complete rewriting.

On the other hand, when you need (or you already have) a well established design, static languages are more suited: they enforced your design choices by making them mandatory. The usual benefits are faster error detection and probably better performances.

So, to summarized, dynamic languages are more suited for prototyping or unguided programming (you know, when you code, but don't really know where're you going … ), while static languages offer better context for bigger projects involving more conceptions. But, in my opinion, this is more a matter of taste !
Hidden Complexity
Now, we are on the dangerous field of what is good and bad in programming language. Hidden complexity is about language's simple constructions that involved heavy computation.

We can find a good example in OCaml's operators on lists. You have two operators that are often seen as similar: :: and @. The first one is a constructor for the list type while the second one is in fact a shorthand for a function on two lists (append.) So, the issue is that the former operator looks like a simple inexpensive operation, while it can induce very high complexity cost like in the following example:

let rec rev = function
  | [] -> []
  | h::t -> (rev t) @ [h]

This function has a quadratic complexity (O(n²)) due to the append operator while this is not evident (if you don't count the cost of the append, this function has a linear complexity.)

Most of the time, hidden complexity appears in magic operators, the kind of operators that seems to solve a lot of issues.

A very common source of hidden complexity is operator overloading and implicit operations. C++ is full of this kind of traps: when objects are passed by copy (that is, the default way of passing objects to a function) a special constructor is invoked, which is a simple memory copy unless someone has redefined it (“Hey ! Who add an Ackerman function to the copy-constructor of this object !”)
Keywords or symbols ?
Again, a matter of taste: language's syntax can be keyword oriented (Pascal) or symbol oriented (C++.)

For example, a Pascal's function will look like this:

PROCEDURE swap(VAR a: integer, VAR b: integer)
VAR
  c : integer;
BEGIN
  c := a;
  a := b;
  b:= c;
END

While in C++ we would have something like:

void swap(int &a, int &b)
{
  int c;
  c = a;
  a = b;
  b = c;
}

The C/C++ syntax relies more on symbol than keywords ({…} against begin … end for example.) What is best ? I tend to prefer the symbol way rather than the keyword one, but its a matter of taste. Of course, it's also a matter of balance, you probably can find some purely symbolic languages and you'll probably find it too criptic (have you ever opened a sendmail.cf file ?)

In fact, it doesn't need to be purely symbolic to be cryptic, C++ syntax is a good example, an average C++ piece of code is full of symbolic details which probably can't be understand with fast-reading since details can change semantics (and I don't speak about operators overloading … )

Here are my own rules about syntax:
  1. one meaning for one symbol
  2. symbol should only be used for immediate operations (no hidden complexity)
  3. uncommon operations deserve keywords rather than symbol
  4. default behavior and usual operations should have the simplest syntax (or no specific syntax at all)
In my opinion, the C (not C++) syntax  can be considered as a good example of a language well balanced between symbols and keywords while C++ is seminal example of all possible mistakes in language design (at least about syntax.) The Pascal syntax tends to be a little bit verbose and have a pedantic back-taste (this is that kind of languages that prefer to say integer rather than int to be sure you understand them well, in case you've missed some point, you  they are teaching languages.)
Humorous classification
As a so called conclusion, here is a draft for a programming languages classification:
  • Rudimentary languages: they're here for decades, they're probably older than you, and you don't want to use them but some times you can't avoid them (BASIC, COBOL … )
  • Sedimentary languages: built using a geological process of accretion, they were designed as stack of concepts in hope every cases where covered … (C++ … )
  • Alimentary languages: you learn them and use them because programming is a job, job brings you money and money buys you food. You probably don't like them but in fact, you don't like what you're doing with them (maybe you don't like programming at all) (Java, C#, VB, Php … )
  • |-|4><0r languages: you need to be l33t to understand them … (Brainfuck, whitespace … )
  • Will-be-dead-next-years-for-decades languages: we all agree, these languages are alive abominations escaped from Jurassic ages, but they're still there and used ! (COBOL, FORTRAN … )
  • Teaching languages: sometimes, you feel like if the compiler stares at you with a cold and calm face and a wooden stick in its hand while the rest of the class is holding its breath. When such time comes, you know that you, naughty boy, haven't declared your thrown exceptions, or have used a deprecated method … (Pascal, Java … )
  • We've-got-anything-you-want languages: have you ever dream of a procedural structured functional language with object oriented extensions, concurrent programming primitives, embedded assembly, static validation and dynamic assertion testing ? Ada is there for you …