C! - system oriented programming - syntax explanation
Another article about the C! programming language is up on LSE's blog: go read it ! It presents the basics of C!'s syntax and is a follow-up to my previous article introducing the language (also on LSE's blog.)
Wednesday, June 27, 2012
Back To C: tips and tricks
Note: I began this article months ago, but due to a heavy schedule (combined with my natural laziness), I kept postponing the writing. As a result, the article may be a little rambling; I apologize for that, but I hope you'll find it interesting.
The C programming language is probably one of the most widely used and widely known programming languages. Even if it has some obscure syntactic tricks, it's a simple language, syntactically speaking. But it's probably one of the hardest to learn and to use correctly !
There are several reasons for that:
- since it is intended to be a low-level tool, it lets you do a lot of things, things that are most of the time considered ugly but are necessary in some cases.
- its original behavior does not include modern notions of type checking and other kinds of verification; you're just playing with a machine gun without any safety lock …
- expressive power with a simple syntax often means more complex work: C belongs to the same family as assembly or the lambda-calculus; you can do anything you want, but you have to code the whole thing completely from scratch.
One of the most important things about coding in C is to understand how the language has evolved, what it was meant to be, and what it really is now. Back in the first Unix days, Dennis Ritchie presented his language as a high-level portable assembler, a tool for system, kernel and low-level programming. But since then, standards have appeared (ANSI, ISO …) and the language is no longer the original low-level tool.
Using C nowadays is a difficult exercise and the path to working code is full of traps and dead-ends.
I'll try to lay down some of my « tricks » and experiences. I've been using C for almost 15 years now, and I've been teaching it for about half that time; I've seen a lot of things, but I can still be surprised every day.
Listen to your compiler !
Historically, C compilers have been rather silent. Basically, most bad uses are not really errors; they may even be legitimate in some cases. So, most compilers prefer to emit warnings rather than errors.
This is why you should activate all warnings, and even better, activate « warnings as errors ».
A good example is the warning about assignment in a condition; take a look at the following code:
while ( i = read(fd, buf + off, SIZE - off) )
  off += i;
This code will trigger a warning (« warning: suggest parentheses around assignment used as truth value » in gcc.) Of course, my example is a legitimate use of assignment in a condition. But the usual confusion of « = » with « == » is one of the most common errors in C, and probably one of the most perverse bugs.
This warning is probably far better than putting the l-value on the right-hand side, which only works if one of the operands is not an l-value (hey, you never compare two variables ?)
As the message says, you just have to add extra parentheses to avoid the warning, like this:
while ( (i = read(fd, buf + off, SIZE - off)) )
  off += i;
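For context, here is a hedged sketch of the surrounding code this loop might live in; the buffer size SIZE, the descriptor fd and the offset off are my assumptions, not taken from a real program:

#include <unistd.h>

#define SIZE 4096

/* fill buf from fd until the buffer is full, end-of-file or error;
   returns the number of bytes read (error handling kept minimal) */
size_t fill_buffer(int fd, char buf[SIZE])
{
  ssize_t i;
  size_t off = 0;

  while ( (i = read(fd, buf + off, SIZE - off)) > 0 )
    off += i;
  return off;
}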
Even if this read-loop example may seem naive, it reflects the way you should use your C compiler: activate all warnings and alter your code so that legitimate uses won't trigger any messages, rather than shutting the screaming off !
Oh, and if you want good warnings and error messages, I strongly recommend using the clang compiler (based on LLVM); it gives the best error messages I've seen so far.
And, for my pleasure, here is another classical example of test and assignment combined in an if statement:
int fd;

if ( (fd = open("some_file", O_RDONLY)) == -1 ) {
  perror("my apps (opening some_file)");
  exit(3);
}
/* continue using fd */
Identify elements syntactically
Have you ever masked an enum with a variable ? Or fought with an error for hours just because a variable, a structure field or a function was replaced by a macro ? In C, identifiers are not syntactically separated, so you can write horrors like this:
#include <stdio.h>

enum e_example { A, B };

enum e_example f(void)
{
  enum e_example x;
  float A = 42.0;

  x = A;
  return x;
}

int main()
{
  printf("> %d\n", f());
  return 0;
}
What does this code print ? 42, of course ! Why ? Simply because the float variable A masks the enum constant A. The ugly part is that your compiler won't warn you and won't complain.
So, there's no solution on the compiler side; we have to protect ourselves from that kind of error. A usual solution is to adopt a syntactic convention: each kind of symbol gets its own marker. For example, you can write enum constants like A_E or, in order to have different names for each enum definition, you can prefix your constants with the name of the type.
Basically, you should have a dedicated syntax for type names, macro constants, enum members and any other ambiguous identifiers. Thus, my previous enum should be written:
enum e_example { A_E, B_E }; /* using _E as enum prefix for example */
Just keep in mind that identifiers should stay relatively short in order to preserve readability and avoid typing annoyances (you won't enjoy typing e_example_A more than once.)
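Applied to the masking example above, the convention makes the accidental capture impossible (a quick sketch of mine, not from the original program):

#include <stdio.h>

enum e_example { A_E, B_E };

enum e_example f(void)
{
  enum e_example x;
  float A = 42.0;  /* a local variable can no longer mask the constant */

  x = A_E;         /* unambiguously the enum constant */
  (void)A;
  return x;
}

int main(void)
{
  printf("> %d\n", f());  /* prints "> 0" as intended */
  return 0;
}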
Understanding Sequence Points
There's a notion of sequence points in the language: any code between two sequence points can be evaluated in any order. The only thing you know is that all side effects syntactically before a sequence point take place before going further, and that no side effect syntactically after the sequence point will begin before it.
So, this notion obviously forbids ambiguous code like i++ * i++ or i = i++. In fact, such code is not strictly forbidden (sometimes the compiler may issue a warning, but that's all); it belongs to the infamous category of undefined behavior, and you don't want to use it.
But that's not all. What you should understand and enforce is that no more than one modification of the same location should occur between two sequence points, and also that whenever a memory location is modified between two points, the only legitimate access to it should be within the scope of the modification (i.e. you're fetching the value to compute the new value.)
So, you should not write something like t[i] = i++, but you can (of course) write i = i + 1.
Now, what constructions mark those sequence points ?
- Full expressions (an expression that is not a sub-expression of another)
- Sequential operators: ||, && and ,
- Function calls (all elements necessary for the call, like parameters, will be computed before the call itself.)
That's all, meaning that in the expression f() + g(), you don't know whether f() will be called before g() !
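As a small demonstration (my own sketch, not tied to a particular compiler), the two trace lines below may appear in either order depending on how your compiler schedules the calls:

#include <stdio.h>

static int f(void) { printf("f called\n"); return 1; }
static int g(void) { printf("g called\n"); return 2; }

int main(void)
{
  int r = f() + g();  /* the evaluation order of f() and g() is unspecified */

  printf("%d\n", r);
  return 0;
}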
Here are my personal rules to avoid those ambiguous cases:
- Prefer pure functions in expressions
- Keep track of the global state modified by your functions, so you can establish which functions can't be used in the same sub-expression
- Each function should have a bounded scope: a function must do one thing, and if you can't avoid global side effects, limit them to one or two global states per function.
- Prefer local static state to global state, so that modifications stay bound to a single function.
- Prefer pointers over implicit references for function parameters (C++)
The last point may disturb you: implicit references (in the sense of C++ or Pascal) are used, both at the call site and in the body of the function, like non-reference arguments, hiding the persistence of the modification. This poses no threat when used with intended modifications, but inside expressions it can lead to the kind of bug you fight against for hours. Using an explicit pointer requires a reference operator at the call site, indicating the possible modification.
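To make the point concrete, here is a minimal sketch (the helper next_id is hypothetical, invented for the illustration): with a pointer, the & at the call site flags the side effect that a C++ reference would hide.

#include <stdio.h>

/* hypothetical helper: increments the counter and returns the new value */
static int next_id(int *counter)
{
  *counter += 1;
  return *counter;
}

int main(void)
{
  int counter = 0;
  int id = next_id(&counter);  /* the & warns you that counter may change */

  printf("%d %d\n", id, counter);
  return 0;
}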
Function prototypes have to be explicit
In C, pointers are used for a lot of things: arrays, mutable arguments, data structures …
So you can pass a pointer to a function for a lot of reasons, and it is interesting to find a way to differentiate the various usages of pointers as parameters.
I'll give you a quick overview of my coding style:
- Pointer as array: when the pointer parameter is in fact an array, I always use the empty brackets rather than the star, for example:
float vect_sum(float tab[], size_t len)
{
  float res = 0;
  float *end = tab + len;

  for (; tab != end; ++tab)
    res += *tab;
  return res;
}
- Pointer as reference (mutable argument): when passing a pointer to a variable in order to propagate modifications of the variable's value to the calling context, I explicitly use a star:

void swap(int *a, int *b)
{
  int c;

  c = *a;
  *a = *b;
  *b = c;
}
- Data structure: most (all) linked data structures (lists, trees …) have a pointer as entry point. In fact, the structure is the pointer (for example, the linked list is the pointer, not the structure.) So, I hide the pointer in the typedef rather than letting it be visible:

typedef struct s_list *list;
struct s_list
{
  list next;
  int content;
};

size_t list_len(list l)
{
  size_t r = 0;

  for (; l; l = l->next)
    ++r;
  return r;
}
The last point is often disturbing for students, and it needs some explanation. First, consider the logic of a linked list: a list is recursively defined as either an empty list or a pair of an element and a list. Thus, the NULL pointer representing the empty list is a list, meaning that the list is the pointer, not the structure (a pair can be viewed as a structure or as a pointer to a structure; to be coherent we must choose the latter definition.)
So, your list is a pointer, and thus the type list must reflect this fact.
This is where the usual argument comes in: « but, if the pointer is not obvious in the prototype, how do I know that I must use an arrow rather than a dot to access the structure members ? » There are two answers to this question:
- First, when dealing with the code manipulating the list, you know what you're doing, so there's no question !
- You must be coherent: if you include the star in the typedef, you never typedef a bare struct. So the case to flag is not when you should use an arrow, but when you should use a dot !
/* we're using the previous list definition */
list fun_add(list l, int x)
{
  list t;

  t = malloc(sizeof (struct s_list));
  t->content = x;
  t->next = l;
  return t;
}

/* note that l is passed as a pointer to a list, not a list */
void proc_add(list *l, int x)
{
  list t;

  t = malloc(sizeof (struct s_list));
  t->content = x;
  t->next = *l;
  *l = t;
}
The star in the second version indicates that the function may modify the head of the list (in fact, it will.)
Forbidding typedefs of structures has another positive aspect: when passing a structure to a function (or returning one), the structure is copied (and thus duplicated), inducing a (small) overhead at the function call (or return) and bigger memory usage. Hiding the fact that the value is a structure induces what I call hidden complexity: what looks like a simple operation in fact has a non-negligible cost. Thus, once again, the good strategy is to leave the fact that the value is a structure visible.
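As an illustration (a sketch of my own, with an arbitrary 4 KB structure), compare the two functions below: the first copies the whole structure at each call, the second copies only an address, and the star keeps that cost visible in the prototype:

#include <stddef.h>

struct big
{
  int data[1024];
};

/* by value: the whole structure (4 KB here) is copied at each call */
int sum_by_value(struct big b)
{
  int s = 0;
  size_t i;

  for (i = 0; i < 1024; ++i)
    s += b.data[i];
  return s;
}

/* by pointer: only a pointer is copied, and the star makes that explicit */
int sum_by_ptr(const struct big *b)
{
  int s = 0;
  size_t i;

  for (i = 0; i < 1024; ++i)
    s += b->data[i];
  return s;
}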
The special case of strings: in C, strings have no specific type; you use pointers to characters. Normally, you can view a string as an array of characters, but the usual convention is to describe strings as char* rather than char[], so strings are the only case where I use the star rather than the bracket syntax. This also solves the issue of arrays of strings: you can't use char[][] (that would be a two-dimensional array, but since the compiler needs the size of a line to correctly translate indexes, you can't write such a type); the solution is to mix star and brackets, as in the following example:
/* argv is an array of strings */
int main(int argc, char *argv[])
{
  /* … */
  return 0;
}
Conclusion
I hope these tricks were useful or interesting to you. As I said in the opening, programming in C requires careful concentration and some understanding of the underlying semantics of the language. These tricks are not magic recipes: if you want to be a C programmer, you'll need practice. The first rule of a programmer should be: code, and when you think you have coded enough, code again !
By coding I mean: try out ideas, implement toys, compare the behaviors of various implementations of the same idea, take a look at the generated ASM code (yes, you should do that …) And don't be afraid of pointers: pointers are your friends, you need them, they are useful in a lot of situations, but you have to be nice with them and treat them properly (or you'll pay !)
As a second rule, I'll propose: find a convenient coding style ! Coding styles are often useful to avoid misuses of data structures or ambiguous situations (such as the examples on enums), but when building a coding style, don't focus on code presentation, organization, comments or forbidden features. The most important parts of a coding style are the naming convention and a coherent way of describing function prototypes.
Wednesday, May 9, 2012
C! Programming Language - LSE blog articles
I wrote a new article presenting the C! programming language.
C! on LSE's blog
C! is a system oriented programming language. Started as a simple revisited syntax for C, it now offers simple objects and similar features.
Tuesday, August 30, 2011
Language design
Programming languages are our principal tools, the things we work with every day. Most programmers (real ones) have a particular relationship with their favorite language, but they're also very curious about new languages, new paradigms (or at least paradigms they haven't tried so far), new concepts …
This is probably the reason why it is so hard to get clear answers from programmers to questions like “what is the best programming language ?” or “what is the worst programming language ?”
So, I'll try to make a small survey of the ideas on language design that I have found most useful when trying to classify languages. I'll also add my personal vision of what a good language is.
Semantics ?
Lots of people begin classifying programming languages by execution model (compiled or not) or by paradigm (functional, object oriented, declarative, procedural …) This is probably a good idea, but I won't do it that way.
From my own point of view, there are two major groups of programming languages: those with a practical approach and those with theoretical foundations. This is how I understand the evolution of programming languages.
The practical way follows the history of computer evolution: early computers had minimal interfaces (switches, punched cards …) and programmers were writing direct machine code. As computers became able to handle bigger programs, programmers started to use assemblers, which transform an “almost” human-readable source code into binary machine code. And the evolution continued in the direction of more abstraction: conditional jumps were replaced by if-then-else constructions, while and for loops simplified basic algorithms, and so on. This was the birth of structured programming. The basis of this evolution is to abstract recurring coding constructions behind a simpler syntax. Starting from machine code, the practical way has reached the structured and object oriented programming paradigms.
On the other side, the theoretical way was born in mathematical models of computing like the lambda-calculus. This branch of language evolution takes a completely different direction: we start from an abstract concept, often with a very formal definition, and try to build a concrete programming language. The evolution is reversed, since the goal is to find lower-level translations of an abstract syntax, rather than to build new syntax for known constructions.
Both evolutions produced a lot of languages: Cobol, Fortran or C/C++ for the practical way, and Lisp, Scheme, Prolog or the ML family for the theoretical way. The interesting fact is that the two ways are converging ! Practical languages tend to introduce concepts with theoretical foundations (as in Java or C#), and theoretical languages are more and more able to produce efficient concrete machine code.
So, why start a classification like this ? Just because the way a language was built has a great impact on how it can be used.
Languages with well-founded semantics are good for implementing clever algorithms and theoretical data manipulation (symbolic processing, compilers, code analysis tools …) On the other hand, more practical languages (like C) are better suited to system programming and any other activity involving lower-level manipulations.
Dynamic or Static ?
Another important discriminating criterion is the nature of the language's execution model. Rather than talking about compilation versus interpretation (or virtual machines, JIT …), I prefer to push the distinction into the field of dynamic versus static languages (there are compilers for dynamic languages and interpreters for static ones.)
Again, using one kind or the other is a matter of goals. Dynamic languages are often considered toys or scripting languages, but this is too restrictive, since modern ones (such as Perl, Python or Ruby) are probably as good as other languages for writing applications.
In fact, this is more a matter of program design and software conception. Dynamic languages tend to encourage programming by accretion: you build a small kernel of functionality, and then make it grow. The dynamic nature of such languages lets you modify or add behavior to existing code without a complete rewrite.
On the other hand, when you need (or already have) a well-established design, static languages are better suited: they enforce your design choices by making them mandatory. The usual benefits are faster error detection and probably better performance.
So, to summarize: dynamic languages are better suited to prototyping or unguided programming (you know, when you code but don't really know where you're going …), while static languages offer a better context for bigger projects involving more design work. But, in my opinion, this is more a matter of taste !
Hidden Complexity
Now, we are on the dangerous ground of what is good and bad in a programming language. Hidden complexity is about simple-looking language constructions that involve heavy computation.
We can find a good example in OCaml's operators on lists. There are two operators that are often seen as similar: :: and @. The first one is a constructor for the list type, while the second one is in fact shorthand for a function on two lists (append.) The issue is that the latter operator looks like a simple, inexpensive operation, while it can induce a very high complexity cost, as in the following example:
let rec rev = function
  | [] -> []
  | h :: t -> (rev t) @ [h]
This function has quadratic complexity (O(n²)) due to the append operator, but this is not evident (if you don't count the cost of the append, this function looks linear.)
Most of the time, hidden complexity appears in magic operators, the kind of operators that seem to solve a lot of issues.
A very common source of hidden complexity is operator overloading and implicit operations. C++ is full of this kind of trap: when objects are passed by copy (that is, the default way of passing objects to a function), a special constructor is invoked, which is a simple memory copy unless someone has redefined it (“Hey ! Who added an Ackermann function to the copy constructor of this object ?!”)
Keywords or symbols ?
Again, a matter of taste: a language's syntax can be keyword oriented (Pascal) or symbol oriented (C++.)
For example, a Pascal procedure will look like this:
PROCEDURE swap(VAR a: integer; VAR b: integer);
VAR
  c: integer;
BEGIN
  c := a;
  a := b;
  b := c
END;
While in C++ we would have something like:
void swap(int &a, int &b)
{
  int c;

  c = a;
  a = b;
  b = c;
}
The C/C++ syntax relies more on symbols than on keywords ({…} against begin … end, for example.) Which is best ? I tend to prefer the symbol way rather than the keyword one, but it's a matter of taste. Of course, it's also a matter of balance: you can probably find some purely symbolic languages, and you'll probably find them too cryptic (have you ever opened a sendmail.cf file ?)
In fact, a language doesn't need to be purely symbolic to be cryptic; C++ syntax is a good example: an average piece of C++ code is full of symbolic details which probably can't be understood by fast reading, since details can change the semantics (and I don't even speak about operator overloading …)
Here are my own rules about syntax:
- one meaning for one symbol
- symbol should only be used for immediate operations (no hidden complexity)
- uncommon operations deserve keywords rather than symbols
- default behavior and usual operations should have the simplest syntax (or no specific syntax at all)
In my opinion, the C (not C++) syntax can be considered a good example of a language well balanced between symbols and keywords, while C++ is the seminal example of all possible mistakes in language design (at least regarding syntax.) The Pascal syntax tends to be a little bit verbose and has a pedantic after-taste (it's the kind of language that prefers to say integer rather than int to be sure you understand it well, in case you've missed some point: they are teaching languages.)
Humorous classification
As a so-called conclusion, here is a draft of a programming languages classification:
- Rudimentary languages: they've been here for decades, they're probably older than you, and you don't want to use them, but sometimes you can't avoid them (BASIC, COBOL …)
- Sedimentary languages: built by a geological process of accretion, they were designed as stacks of concepts in the hope that every case was covered … (C++ …)
- Alimentary languages: you learn them and use them because programming is a job, a job brings you money, and money buys you food. You probably don't like them, but in fact you don't like what you're doing with them (maybe you don't like programming at all) (Java, C#, VB, Php …)
- |-|4><0r languages: you need to be l33t to understand them … (Brainfuck, whitespace … )
- Will-be-dead-next-year-for-decades languages: we all agree, these languages are living abominations escaped from the Jurassic ages, but they're still there and still used ! (COBOL, FORTRAN …)
- Teaching languages: sometimes, you feel as if the compiler stares at you with a cold and calm face and a wooden stick in its hand, while the rest of the class holds its breath. When such a time comes, you know that you, naughty boy, haven't declared your thrown exceptions, or have used a deprecated method … (Pascal, Java …)
- We've-got-anything-you-want languages: have you ever dreamed of a procedural structured functional language with object oriented extensions, concurrent programming primitives, embedded assembly, static validation and dynamic assertion testing ? Ada is there for you …
- …
Saturday, July 30, 2011
Rule 0 - Do It !
So, I've already listed some rules about programming (do I need to remind you that the most important one is don't trust the maxims ?), but I recently came up with a new one, some kind of rule 0, the rule above the rules. The rule is pretty simple:
Do It !
Simple, isn't it ?
So what do I mean by Do It ? There are several topics where this rule can apply; I'll cover the ones that seem most important to me: learning programming, fixing bugs and rewriting/refactoring.
For the beginner:
Learning programming (or even learning a new programming language) can be separated in two parts: studying theoretical principles, syntax and structure on one hand, and practicing on the other.
Many beginners fall into the over-studying symptom: they spend a lot of time trying to understand inner concepts and to recall every specific syntax construction before practicing. But when it comes to code, they just make the same mistakes and face the same issues as if they hadn't spent all that time understanding how it works.
Why ? Just think of it: if I explain to you what it is to ride a bicycle, I can explain a lot of things, even show various videos and diagrams, but until you get yourself on a real bike and fall face first to the ground, you won't be able to ride it.
Programming is no exception. You need to practice in order to learn. A damn good example is pointers. Pointers are a mandatory concept in many languages, and probably in many traditional algorithms, so a programmer cannot avoid pointer manipulation. When facing them for the first time, they appear as strange entities with complex usage and semantics: they are values by themselves, but they also seem to be containers; you can obtain a pointer from various operations (referencing a variable, allocating memory …) All of this with various syntaxes and special cases (the worst of all is to begin learning pointers with different languages using different syntaxes, say Pascal and C …)
The fact is simply that you need to practice to learn. I've tried a lot of ways to explain pointer concepts to various students and never found a better way than practicing. The usual box-and-arrow diagrams always lack some important point. Explaining the idea of pointers as indices into the big array that is memory won't help much more. But a bunch of pointer and pointer-arithmetic exercises will always do the job.
Practicing breaks another beginner's symptom: the fear of difficulty. When facing a new kind of problem, measuring the real difficulty of solving it is quite hard. Trying to solve it in your head or with some useless diagram on a sheet of paper won't help you; you should give it a try. A good example is writing a memory allocator (a pretty good exercise to understand pointers and memory management): if you have never tried it before, you can't figure out how one should do it. But as you begin to code, you'll find that there's no mystery nor magic, and (if you stay on the simple path, of course; good generic and efficient allocators are not so simple) you'll see that the task is really affordable.
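To give you an idea of how affordable the simple path is, here is a minimal sketch of mine: a bump allocator over a fixed buffer, with no free and no efficiency claims, just the core idea:

#include <stddef.h>

static unsigned char heap[1 << 16];  /* a fixed 64 KB "heap" */
static size_t brk_off = 0;           /* current allocation offset */

void *my_alloc(size_t n)
{
  void *p;

  n = (n + 7) & ~(size_t)7;          /* round up for 8-byte alignment */
  if (brk_off + n > sizeof heap)
    return NULL;                     /* out of memory */
  p = heap + brk_off;
  brk_off += n;
  return p;
}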
So, to summarize: don't waste your time over-thinking your code, just code, do it !
Bug fixing:
When building a project, there are always two possible tasks: moving on to the next missing feature, or fixing the existing ones. So, should I push the project toward some fixed goal (achieving a functionality milestone), or should I fix issues in what was already written ?
My advice is to never postpone a bug fix, even a minor one. Bugs, incomplete cases and things like that are bombs waiting to blow all your work down. There are several reasons not to postpone corrections:
- It's simpler to fix something when the code is still fresh in your head
- The bug fix may induce deeper modifications than it seems
- Building a new part of a project on rotten bases will always end up in a big crash
You can extend this idea to real bug fixes: you should not let ugly workarounds sit in your code base. Workarounds are far more pernicious than bugs, in the sense that they hide the error without (most of the time) really fixing it.
Rewriting/Refactoring:
To illustrate this part, I'll give a personal example. I'm currently working on an experimental language (mostly a C variant, but with a cleaner syntax and some syntactic sugar); the project began when it appeared that we couldn't escape the use of the C language for writing kernels. We gave a try to a lot of languages and faced various issues, mainly because modern languages are oriented toward userland programming, not kernel programming. After a long and trollish discussion on our mailing list, I came up with the idea of rewriting the C syntax and extending it with various comfort syntactic sugar and higher-level extensions.
So, I laid down a quick version of an AST (Abstract Syntax Tree) for a kind of C language, wrote a pretty printer for it and then a parser. When it was clear that the idea worked (and facing enthusiastic messages on the list), I went to the next step and wrote down type checking. At that point, the original AST was not completely suited, in the sense that it was an immutable value (I was using OCaml for the project), but I found a way to circumvent the difficulties and continue along this axis.
We then began to write code generation (thanks to the fact that we were writing a compiler-to-compiler rather than a real compiler, this wasn't so hard in the first place.) But as the project evolved, my badly designed AST gave us more and more difficulties. And then, one day (after almost a year of code), it appeared to me that if we wanted to push the project to its end, I had to change my data structure. So, how to do it ? The task seemed unaffordable, since the AST was the skeleton of the project.
I took out my favorite text editor, allowed myself five minutes of reflection, and decided that if I couldn't rewrite parsing and typing within a few days, I would give up and try to fix the original code. Guess what ? A week later, 90% of the project was rewritten, and some parts were even more advanced than in the original version ! Of course, the first two days were a total rush; I ended up writing 5000 lines of code in 48 hours ! But, thanks to the new design and to the experience gathered on the previous work, most of the task was straightforward.
Two weeks later (yes, this is a really recent story ;), our compiler is now more functional than the previous version !
What did I learn ? First, that I'm still capable of writing a very big amount of code in a few days (that's good for my self-esteem ;) Second, that rewriting is far simpler than it may seem at first glance. There are several reasons you should give rewriting a try:
- The previous work wasn't a waste of time; it points out to you where the difficulties are
- A better design and data structure will greatly simplify the rest of the code
- If you fail, you can still go back to the previous work
- There will always be external parts that can be reused
- If you feel the need to do it, things will only get worse if you postpone it !
Of course, rewriting is not always the only solution: this example is quite extreme, in the sense that the whole project was based on the part that I wanted to change, and many things inside the project had to be rewritten.
Refactoring is, most of the time, a better way to fix a bad design. The main idea is to rewrite your code chunk by chunk while still preserving the state of the project. There are a lot of patterns to do that (you can find good books on the subject; Marc Espie, a person I consider a real expert programmer, pointed me to the book "Refactoring" by Martin Fowler. You should probably read it; I will as soon as I find time for it.)
Conclusion:
So, my new rule, Do It !, is all about going straight to the point rather than wasting your time in useless preparation or postponing an important fix or rewrite. There's no better way to do something than actually doing it !
Saturday, July 9, 2011
Functional Programming ?
Do you know functional programming ?
Functional programming (FP) is an interesting subject for a lot of reasons. Those of us who have used (and still use) functional programming languages (FPL) often point out two disappointing facts:
- Lots of programmers don't really understand what it is
- Despite the fact that FP and FPL are pleasant to use and also quite efficient, they are still not widely used
Am I an FP fan ? The quick answer is "no more"; the longer one is more complicated. I have used OCaml for more than 10 years now; I've found it very useful and pleasant many times. But I've also found a lot of unsatisfactory points. It is hard to really explain why, but on bigger projects the absence of constraints on global program structure often leads to unreadable code, some syntax choices are annoying ... Other FPLs haven't convinced me yet ...
So, here is my humble opinion (how I hate this expression ...)
First, I'll attempt to define FP and what an FPL should be. You should right now forget the erroneous idea that FP is about recursion and recursive functions !
FP arose in the early days of formal languages (before computers, in fact); more precisely, the λ-calculus is (at least for me) the starting point of FP. So what is the main characteristic of the λ-calculus ? Functions as first-class values ! In fact, functions are the only values in the λ-calculus !
First-class values are language entities that can be passed to (and returned by) functions. So, the λ-calculus can directly manipulate functions; this is what we call higher-order. Thus, the first requirement of an FPL is higher-order.
The λ-calculus, being a mathematical model, has no notion of mutable entities or side effects (in fact, mutability and side effects can be expressed in pure functional semantics.) Thus, languages inspired by the λ-calculus try to avoid these notions to match the original model, but it's a matter of style.
Another aspect of FP and FPLs is the notion that everything evaluates to a value: we should always return something, and there's no difference between statements and expressions as in imperative languages. This is strongly linked to the absence of side effects: you don't modify a state, you return a new value.
What about loops in FP ? The fact that in FP we prefer recursion over loops is a consequence of everything is an expression: a loop does not return a value. But FOR-loops (i.e. bounded iterations) can exist without mutable entities and thus fit in the model (we can somehow handle the fact that the loop returns nothing, since in an FPL nothing can be something); on the other hand, integrating WHILE-loops into the semantics of a pure FPL is far more difficult (we need to express a changing state.)
So, to summarize: an FPL is a language with higher-order functions, where everything is an expression, and where we prefer to limit mutable entities and side effects. Everything else comes from these facts (partial evaluation and curried functions, recursion rather than loops, and even lazy evaluation.)
Is there a good FPL ? This is a hard and dangerous question. I tend to prefer strongly typed FPLs, and thus my list of good FPLs will be limited to the ML family (and derived languages.) My favorite is OCaml, for various reasons: I learnt FP with it, it is not too purist (in OCaml you have imperative features and even objects), it has a good module approach (functors ...) and the resulting code works well. I found Haskell interesting but too extremist for my taste. Lisp and Scheme derivatives haven't convinced me (but again, I prefer typed languages ...)
Aside from pure FP aspects, many FPLs have other interesting features:
- Pattern matching: this is probably the most impressive feature; it lets you describe complex cases and extract values. Combined with variants, it offers the best way to traverse abstract syntax trees.
- Module systems: to organize your code, modules are very useful, and some FPLs provide a very powerful framework. For example, the module system of OCaml lets you define nested modules, independent module interfaces and the powerful functors (functions over modules.)
- Lazy evaluation: a lazy value is a mix between a function and an expression; like a function, it is not evaluated when defined, but like an expression, it is evaluated only once. This lets you define expressions that will be evaluated only when needed, with the resulting value memoized. A good example is infinite generated lists: elements of the list are generated only when traversed, but then you don't need to recompute them at each access.
Now, what about the lack of FPL users ? Since I'm myself a disappointed user of FPLs, I can understand their relatively low audience.
First of all, most FPLs are research projects, and as such are never quite finished: the language changes too often, and priority is put on new features rather than on production necessities ... Thus, it is hard to develop a long and stable project using an FPL.
On the side of the languages themselves, there are also some disturbing points. Even compiled FPLs have an original syntax designed for interpreters; thus programs written in those syntaxes are often badly organized. For example, there's no explicit entry point in OCaml; you must carefully organize your code so that the execution flow goes directly where you want.
From a syntactic point of view, language history has shown that languages with explicitly delimited blocks and instructions fare better than overly permissive ones. Most FPLs are syntactically permissive languages.
So, I think that despite being very interesting programming languages, FPLs are not a good choice, for practical reasons. These reasons are not tied to the FP model but to the implementations of this model. If we want to see FPLs regarded as being as usable as other mainstream languages some day, we need a more stable language with something like a C syntax.
Sunday, June 26, 2011
Marwan's Programming Guiding Rules
Ok, now that I have exposed my view on maxims, I can write my own ! (but remember, don't trust the maxims !)
I'll start with pure coding guidelines (and keep the rest for later.)
copy/paste is evil !
(with its corollary: copying code implies copying bugs)
This one is a classic (at least for my students): whenever you want to copy/paste some code, you should reorganize it instead. Most of the time, a function (or something similar) will turn out to be useful later, and even if there ends up being only one copy, you will probably find a better way to write it.
Let's take an example: here are two functions on linked lists, a simple add-at-front and a sorted insertion, written with copy/paste:
typedef struct s_list *t_list;
struct s_list
{
  t_list next;
  int data;
};

t_list add(int x, t_list l)
{
  t_list t;

  t = malloc(sizeof (struct s_list));
  t->data = x;
  t->next = l;
  return t;
}

void insert(int x, t_list *l)
{
  if (*l == NULL || x < (*l)->data) {
    t_list t = malloc(sizeof (struct s_list));
    t->data = x;
    t->next = *l;
    *l = t;
  } else
    insert(x, &((*l)->next));
}
Now, the version of insert without copy/paste:
void insert(int x, t_list *l)
{
  if (*l == NULL || x < (*l)->data)
    *l = add(x, *l);
  else
    insert(x, &((*l)->next));
}
First, the latter version is shorter. But that's not all: in the second version, the reference to the size of the struct appears in only one place. If you eventually want to use a specific allocator (such as a recycling pool), you have one change to make, not two or more.
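For example (a sketch of mine, with a deliberately naive recycling pool; it reuses the list definition above), add can switch allocators with a single changed line, while the copy/paste version would need the same edit in every copy:

#include <stdlib.h>

/* freed cells are kept on this list and reused before calling malloc */
static t_list pool = NULL;

static t_list cell_alloc(void)
{
  if (pool) {
    t_list t = pool;

    pool = pool->next;
    return t;
  }
  return malloc(sizeof (struct s_list));
}

t_list add(int x, t_list l)
{
  t_list t;

  t = cell_alloc();  /* the only line that changed */
  t->data = x;
  t->next = l;
  return t;
}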
Another toy example, with "one copy" (no need to write a function): we implement a FIFO queue using a circular list (this is only the push operation):
// with copy/paste
void push(int x, t_list *q)
{
  t_list t;

  if (*q) {
    t = malloc(sizeof (struct s_list));
    t->data = x;
    t->next = (*q)->next;
    (*q)->next = t;
    *q = t;
  } else {
    t = malloc(sizeof (struct s_list));
    t->data = x;
    t->next = t;
    *q = t;
  }
}

// without
void push(int x, t_list *q)
{
  t_list t;

  t = malloc(sizeof (struct s_list));
  t->data = x;
  if (*q) {
    t->next = (*q)->next;
    (*q)->next = t;
  } else
    t->next = t;
  *q = t;
}
Again, the code is smaller and simpler to verify. Those two examples are very basic, and most programmers will probably use the second form intuitively, but in more complex situations, taking avoid copy/paste as a guideline may save you from a long, nightmarish bug hunt.
Any programming feature is good as long as it fits your need.
There's no reason not to use a feature of your programming language as long as you understand it and it does the job. If it's possible, it means that somehow it was intended by the language designers for a good reason (or else they should have found a way to forbid it.)
When teaching programming, I often hear students arguing about bad practices, rules they have read on the net that they don't really understand. Most of the time, these rules only make sense in a precise context, or express a taste rather than a reasoned motivation.
The main goal of a programmer is to make the code work; it's not to produce shiny pieces of code or perfect models of how to code. This leads us to my next rule:
The end justifies the means.
To understand this one, we must define the end and the means. The end is your goal: what the program should do. But it's also all the requirements linked to it: should it be fast ? should it be robust and reliable ? what will its life cycle be (one-shot, or a high-availability permanent run) ? will it be extended ? ...
The means deal with how you will produce it, and they're strongly connected to the end: don't spend too much time on code for a one-shot script, just make it work; on the other hand, remove all quick-and-dirty implementations from a production release !
Building a program is, most of the time, part of the professional work you're paid for. So, this is not a contest of best programming practices, and there's no jury prize for a nice attempt: you have to make it work the way your future users expect it to work. If they ask for a simple script that will save them hours of boring manipulation, they don't want to wait months until you find it finally presentable, they want it now !
The important point is to define the correct goals, and make the right efforts to achieve these goals.
This also means that any trick is fine as long as it makes your code work. Again, students sometimes seem disappointed by some hacks they find in a piece of code. They see it as cheating, as if finding a way to get around a difficulty were not elegant and should not be tolerated !
Coding is no game; you can have fun doing it, but there's no place for a ridiculous honor code. You will find that satisfying users' expectations will probably force you to produce code as clean as it can be, and that you won't need arbitrary rules to guide you.
Take for example the use of goto, and the particular case of a goto into the body of a loop: goto is generally considered evil, and a goto inside a loop is probably the most heretical thing you can do. But there are cases where it can be useful. The best known example is Duff's Device, a smart piece of loop unwinding; take a look at it, it's worth the read. I will then use a simpler example, not focused on optimization, that I find more striking.
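For reference, here is a sketch of the device (my own transcription: it copies to ordinary memory, whereas the original wrote to a memory-mapped output register, and it assumes count > 0):

/* Duff's device: the switch jumps into the middle of the unrolled loop */
void duff_copy(short *to, const short *from, int count)
{
  int n = (count + 7) / 8;

  switch (count % 8) {
  case 0: do { *to++ = *from++;
  case 7:      *to++ = *from++;
  case 6:      *to++ = *from++;
  case 5:      *to++ = *from++;
  case 4:      *to++ = *from++;
  case 3:      *to++ = *from++;
  case 2:      *to++ = *from++;
  case 1:      *to++ = *from++;
          } while (--n > 0);
  }
}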
Now for my example. The idea is simple: you have a list or an array and want to print each element separated by a ';'. By separated we really mean that the separator goes between the elements and not after each one; thus the last element will not be followed by a ';'. There are several ways to do that; let's take a first example:
void printArray(int tab[], size_t count)
{
  size_t i;

  for (i = 0; i < count; ++i) {
    if (i > 0) // not the first time
      printf(";");
    printf("%d", tab[i]);
  }
}
Ok, this is not the best version you can write, but it makes obvious the fact that the first case is handled separately. We could also have done it like this:
void printArray(int tab[], size_t count)
{
  if (count) {
    size_t i;

    printf("%d", tab[0]);
    for (i = 1; i < count; ++i)
      printf(";%d", tab[i]);
  }
}
In this version, the treatment of the first case is less obvious, and we need to check the count. Now, take a look at this one:
void printArray(int tab[], size_t count)
{
  if (count) {
    size_t i = 0;

    goto start;
    for (; i < count; ++i) {
      printf(";");
    start:
      printf("%i", tab[i]);
    }
  }
}
We use a goto to skip the first printf in the loop. The result is better than the first version, and it makes the specificity of the first case apparent. Of course, this is a toy example, and the goto version is not dramatically better in any way, but it illustrates the fact that dirty code can be useful and readable.
Keep it simple, stupid !
One of the classics ! Prefer the simplest path. When you can't figure out yourself how your code works, it means that it's too complex and probably won't work. Here are some other maxims and quotes on the subject:
"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it ?" (Brian Kernighan, "The Elements of Programming Style", 2nd edition, chapter 2)
Smarter code means smarter bugs ! (me ;)
"There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies." (C.A.R. Hoare)So, all those rules deal finally with the same issues: making your programs work ! That's the only things that really matter in programming.