Tuesday, July 31, 2012

Rules won't help …

As I mentioned earlier, mantras and empirical rules are often more misleading than helpful in programming. In fact, I think that directing your programming activities through a set of established rules is more dangerous than you might think.

It doesn't mean that you'll never achieve your goal if you follow those rules, but following them will give you a false sense of security.


Rules to failure …


Using rules to conduct some activities is natural and comfortable. But are we sure that following those rules will produce the desired effect?

Take, for example, recipes (cooking ones.) We've all experienced it before: you want to make a cake, you choose one in your cookbook (one with a very nice photo and not much work), you follow the recipe carefully, and you end up with some sort of ugly dead creature from outer space that tastes like burned plastic.

What went wrong? After several similar failures, it appears that changing various elements of the recipe (time, quantities, even ingredients …) provides a far more satisfactory result. Why?

There are plenty of reasons. Of course, you can blame the book (and you may be right), but there's more. A good recipe, designed and written by a cooking expert (and maybe reviewed by some more cooking experts), is in fact the description of a chemical experiment. If you've ever run one, you know that too much approximation may lead to dramatic (and sometimes dangerous) unexpected results. Cooking is less dangerous (except for your taste buds), but the logic is the same: a wrong combination of manipulations, approximate timing and approximate measuring leads to inedible meals and cakes.

At that point, you may think that the problem is that you didn't do the job by the book and that you'll need to be more careful next time. But again, that's false. You won't be able to achieve a decent result without adapting the recipe.

Let's consider another example: on some packaged sandwiches (you know, those tasteless triangular things you can buy in supermarkets), I once found opening instructions: « twist and pull », with a curved arrow indicating the twisting direction. I tried for about five minutes to perform the operation and finally got a « shaken sandwich » (not without using some cutting tools to open the damned thing.)

All of that because there's the rule, the intention behind it, the way you understand it, the way you try to apply it, and the real action to perform. In fact, there's probably more than one way to do it, and some are simpler for you than the one described by the rule.

So? And what about programming, then?


Programming is, in the end, nothing more than writing rules for the computer.

Thus, it may seem legitimate to apply some rules to yourself when designing and programming?

But again, there are some hidden issues there.

There are a lot of design methodologies and software conception processes. All those tools are supposed to lead you along the best path to achieve your primary goals. They are collections of best practices and rationalized procedures, reputed to have brought success to many other projects.

That's bullshit! The success of most projects is due to the people in the project, not to how they are organized. Of course, a well-suited methodology can help you and may even prevent some important mistakes. But you can still end up with a bunch of crap; it will be the best-designed bunch of crap of all time, but still a bunch of crap.

Next lesson: how to fail using a programming design methodology:

  1. Choose a small random project with well-known solutions (in order to compare with your production), preferably a project more oriented toward algorithms than data management. For example: « Your program will help a fictitious character lost in a maze to get out by the shortest path. You'll have access to the whole map. » You'll recognize a classic application for path-finding algorithms.
  2. Now, choose a method. I'll stay abstract on that matter (due to a lack of space.) Basically, you'll have an outer specification phase, a data-model specification, then an interaction specification, and finally you'll get to the more technical aspects and the code.
  3. So, our program takes what looks like a map and a description of the character, and somehow leads the character to the exit. The maze is provided as a bitmap with distinct colors for rooms and for the tunnels linking rooms.
  4. The main element that appears in the description seems to be the character (it comes first, and the subject is to lead it through the maze.) We'll use it as our central element.
  5. Following our method, we design a class diagram based on the idea that the main character is moving on an (almost) open field (the original bitmap.)
At that point, you may have noticed that we're in trouble. While a careful examination of the subject triggers some ideas about path-finding in graphs, we're stuck with bitmap exploration.

The error is obvious: we didn't choose the most efficient way to solve the problem. Basically, we should have noticed that even if the input comes in the form of a bitmap, there's a notion of rooms (graph vertices) and tunnels (graph edges.) I made this mistake on purpose, to show you that a design method won't help you solve a problem; it only helps you organize the process of building a solution.

This is a common mistake: project organization methodologies, design tools and any other software engineering processes are guidelines that keep you on track on big projects, but they will never save you from bad ideas.

False security

This notion of rule-guided design provides a false sense of security. You're respecting a well-established software process, thus you shouldn't fail. Experience proves the contrary: the only things a methodology can prevent are dispersion and diversion from the original concept. It keeps you on track and helps you evaluate progress (and thus, possible delays.)

This is the main issue with relying on rules: just like with cooking recipes, following them won't automatically lead you to success.

And just like in cooking, if you don't have good programmers to implement your design, even if the design is the best piece of conception possible, your project won't work.

I'm not talking about managed projects versus tech leads; I'm just saying that rules and methodologies (and thus management) are not guarantees of success; they're only there to keep the project on track.


Security, cryptography and laws.


Let's move to another related subject: security.

Playing with rules.

Non-technical people often think that security can be enforced through security laws. This is somewhat like playing a strategy game.

For example, when you play chess, you have a finite set of rules, and normally you can't play outside of these rules. These rules are constraints: at each point of the game, there's a finite set of possible moves. Based on this fact, you can use the rules to protect yourself against your opponent's attack, for example by moving your king out of the range of your opponent's menacing piece. Since you both play by the same rules, he won't be able to conclude his attack (provided that you evaluate all possible situations.)

If your opponent doesn't respect the rules, he's a cheater, and cheating implies losing the game, unless no one sees it. So, the rules protect you (you know the range of possible attacks) and they still protect you when the other player is not fair.

Real life doesn't follow this scheme: if your opponent has an important goal, breaking the rules won't stop him.

Laws against efficient technical solutions?

There are recurring debates in many countries about regulating cryptography or software reverse engineering. The common idea is that if you forbid the use of strong cryptography, criminals and terrorists won't be able to exchange information securely. Bullshit!

Accessing strong cryptography, even when it's prohibited by law, is not impossible for someone who plans to perform criminal actions. In fact, it is relatively simple to build a sufficiently safe cipher with basic mathematical knowledge (a homemade cipher based on XORing data with a one-time pad is quite simple, as long as you can transmit the key sequence to your buddy.)

The real result of constraining the use of strong cryptography is that it compromises normal users' security! Since they can't use efficient tools (or must rely on tools with known vulnerabilities), they expose their communications and their data to security threats!

Another common mistake concerns reverse engineering (or, more generally, analyzing software security.) For example, if someone decides to break the security protection of a credit card system for his own profit, his main crime is not breaking the security protection, it's exploiting the security breach to steal money! He is already on the wrong side of the law; he won't care about another minor crime.

On the other hand, a security researcher who finds vulnerabilities in a credit card protection is helping us! Proving that there's a breach in a system that a lot of users trust is far more important than prohibiting breaking the protection.

Once again, the point is that protection and safety don't come from rules or laws; they come from good and efficient software.

Secure programming recipes.

Another classical subject in software security is secure programming. There are plenty of books on the subject, but unfortunately, most of them are recipe books: they show you well-known mistakes to avoid and predefined code patterns that normally won't fail. The result is that we still have a lot of bad programs out there with plenty of (possibly yet-to-be-discovered) vulnerable code.

I can sum up what I think secure programming should be:
  • Your program must only do what was specified in the first place
  • Your program must not do anything outside of the scope of its specification
  • Your program must be free of bugs.
The first two points imply that a program can only be safe if its specification includes what must be done and what must not be done. A secure specification requires that you define the expected behavior in terms of authorized and unauthorized aspects.

The last point is the hard part; once again it emphasizes the need for technical quality. But it is also important to note that a bug-free program is not safe by itself. A good example is the telnet protocol: the protocol is flawed by design (transmitting all the data sufficient for authentication in clear text is stupid), meaning that even if you're certain that your implementation is bug-free, using telnet will never be safe (at least in a wide-open environment.)

And, again about recipes for secure programming, be aware that following them is not sufficient, for two reasons: first, they don't cover all possible mistakes; second, even when you follow them you can open some breach.

Let's take a basic example. When copying C strings, you should not use the infamous strcpy function; prefer strncpy. The reason is quite obvious: the first form assumes that the destination points to a memory chunk that can hold all the characters of the source parameter, while the latter takes a bound for the copy. The issue at stake is the well-known buffer overflow attack (by providing a very long string, you can overwrite the return address of the function.)

But using the latter form is not entirely safe either. Of course, you can pass a bound that is too small, but there's another issue: reading the manual of strncpy, we can see that if the source string is wider than the bound, no terminating null character will be added to the destination. What's the issue? A further use of the destination string may overflow again, offering an opportunity to access sensitive data, or once again to overwrite important values. This second form of error may not lead to a new security issue, but it can, and it will be far more complex to track down since it won't break directly where the mistake was made.

The following toy example demonstrates the problem:

#include <stdio.h>
#include <string.h>

void copy_and_print(char *text)
{
  char                  secret[] = "Hidden text.";
  char                  buf[32];

  // first copy the original text into buf
  strncpy(buf,text,32);
  // Then print it using printf.
  printf("copied text: %s\n",buf);
  // use secret to avoid warning.
  fprintf(stderr, "  secret text: %s\n", secret);
}

int main()
{
  // prepare a long piece of text
  // wider than the 32 bytes used in our function
  char                  text[34];
  for (size_t i=0; i<33; ++i)
    text[i] = 'a';
  text[33] = 0;

  // So try our function
  copy_and_print(text);
  // What happened?

  return 0;
}

If your stack layout and compiler are similar to mine, it will print the hidden message twice …

About constraining languages

Since I'm passionate about programming, I've tried and learnt many programming languages. While some languages didn't fit me for reasons of taste, or because I had no use for them, some left me with a strange sensation of resistance, you know, when you feel like you're moving through molasses.

It took me some time to realize that this was related to languages that try to force best practices on you. The first example (for me) was Pascal. Pascal is a language designed for teaching and for enforcing the good practices of structured programming (against evil gotos and unguarded loops), and Pascal is somehow very restrictive.

The next in line was Java; once again, Java has as one of its goals promoting good object-oriented methodology.

What happens when you code (or at least when I code ;) in these languages? You're forced to circumvent the constraints to be able to achieve your goals. Some simple (safe) pieces of code become huge and complex plumbing because the direct way of doing it is prohibited by the language.

A curious aspect of such languages is that they tend to prefer keywords over symbols. In more recent languages, it can also appear as constraints on code formatting (like fascist languages using indentation to delimit blocks.)

Conclusion


So, what was this all about? I'm deeply convinced that while using guidelines may provide some help in keeping a process on track, it won't help you build better software. At best, you know that you'll be able to go as far as your idea can lead you; if the idea was bad, guidelines won't save you from failure.

This idea is not only software related: rules, laws and the like can only protect you in games! In real life, the only safeguards are hard work and technical skills.

Wednesday, June 27, 2012

New article on LSE's blog

Another article on LSE's blog about the C! programming language, check it out!

C! - system oriented programming - syntax explanation

The article presents the basics of C! syntax and is a follow-up to my previous article introducing the language (also on LSE's blog.)

Back To C: tips and tricks

Note: I began this article months ago, but due to a heavy schedule (combined with my natural laziness), I postponed writing it. So the article may be a little rambling; I apologize for that, but I hope you'll find it interesting.

The C programming language is probably one of the most widely used and most widely known programming languages. Even if there are some obscure syntactic tricks, it's a simple language, syntactically speaking. But it's probably one of the hardest to learn and to use correctly!

There are several reasons for that:
  • since it is intended to be a low-level tool, it lets you do a lot of things, things that are most of the time considered ugly but are necessary in some cases.
  • its original behavior does not include modern notions of type checking and other kinds of verification; you're just playing with a machine gun without a safety lock …
  • expressive power with a simple syntax often means more complex work: C belongs to the same family of languages as assembly or the lambda calculus; you can do anything you want, but you have to code the whole thing completely from scratch.

One of the most important things about coding in C is to understand how it has evolved, what it was meant to be and what it really is now. Back in the first Unix days, Dennis Ritchie presented his language as a high-level portable assembler, a tool for system, kernel and low-level programming. But since then, standards have appeared (ANSI, ISO …) and the language is no longer the original low-level tool.

Using C nowadays is a difficult exercise and the path to working code is full of traps and dead-ends.

I'll try to lay down some of my « tricks » and experience. I've been using C for almost 15 years now, and I've been teaching it for about half that time. I've seen a lot of things, but I can still be surprised every day.

Listen to your compiler !



Historically, C compilers are a little bit silent. Basically, most bad uses are not really errors; they may even be legitimate in some cases. So most compilers prefer to emit warnings rather than errors.

This is why you should activate all warnings, and even better, activate « warnings as errors ».

A good example is the warning about assignment in a condition; take a look at the following code:

while ( i = read(fd, buf + off, SIZE - off) )
  off += i;

This code triggers a warning (« warning: suggest parentheses around assignment used as truth value » in gcc.) Of course, my example is a legitimate use of assignment in a condition. But the usual confusion of « = » instead of « == » is one of the most common errors in C, and probably one of the most perverse bugs.

This warning is probably far better than putting the l-value on the right-hand side, which only works if one of the operands is not an l-value (hey, you never compare two variables?)

As the message says, you just have to add extra parentheses to avoid the warning, like this:

while ( (i = read(fd, buf + off, SIZE - off)) )
  off += i;

Even if this example may seem naive, it reflects the way you should use your C compiler: activate all warnings and adapt your code so that legitimate uses won't trigger any messages, rather than shutting the screaming off!

Oh, and if you want good warnings and error messages, I strongly recommend using the clang compiler (based on LLVM); it gives the best error messages I've seen so far.

And, for my pleasure, here is another classical example of test and assignment combined in an if statement:

int fd;
if ( (fd = open("some_file",O_RDONLY)) == -1)
  {
    perror("my apps (opening some_file)");
    exit(3);
  }
/* continue using fd */

Identify elements syntactically


Have you ever masked an enum with a variable? Or fought with an error for hours just because a variable, a structure field or a function was silently replaced by a macro? In C, identifiers are not syntactically separated, so you can produce horrors like this:

#include <stdio.h>

enum e_example { A, B };

enum e_example f(void)
{
  enum e_example        x;
  float                 A = 42.0;
  x = A;
  return x;
}

int main()
{
  printf("> %d\n", f());
  return 0;
}

What does this code print? 42, of course! Why? Simply because the float variable A masks the enum constant A. The ugly part is that your compiler won't warn you and won't complain.

So, since there's no solution on the compiler side, we have to protect ourselves from that kind of error. A usual solution is to adopt syntactic conventions: each kind of symbol gets its own marker. For example, you can write enum constants like A_E or, in order to have different names for each enum definition, you can prefix your constants with the name of the type.

Basically, you should have a dedicated syntax for type names, macro constants, enum members and any other ambiguous identifiers. Thus, my previous enum should be written:

enum e_example { A_E, B_E }; /* using _E as enum prefix for example */

Just keep in mind that identifiers should be kept relatively short in order to preserve readability and avoid typing annoyances (you won't enjoy typing e_example_A more than once.)

Understanding Sequence Points


There's a notion of sequence points in the language: code between two sequence points can be evaluated in any order. The only things you know are that all side effects syntactically before a sequence point take place before going further, and that no side effect syntactically after the sequence point will begin before it.

So, this notion obviously forbids ambiguous code like i++ * i++ or i = i++. In fact, such code is not strictly forbidden (sometimes the compiler may issue a warning, but that's all); it belongs to the infamous category of undefined behavior, and you don't want to use it.

But that's not all. What you should understand and enforce is that no more than one modification of the same location should occur between two sequence points, and also that whenever a memory location is modified between two points, the only legitimate access to it is within the scope of the modification (i.e. fetching the value to compute the new value.)

So, you should not write something like t[i] = i++, but you can write (of course) i = i + 1.

Now, which constructions mark those sequence points?
  • Full expressions (an expression that is not a sub-expression of another)
  • Sequential operators: ||, && and ,
  • Function calls (all elements necessary for the call, like parameters, are computed before the call itself.)
That's all, meaning that in the expression f() + g() you don't know whether f() will be called before g()!

Here are my personal rules to avoid those ambiguous cases:

  • Prefer pure functions in expressions
  • Keep track of the global states modified by your functions, so you can establish which functions can't be used in the same sub-expression
  • Each function should have a bounded scope: a function must do one thing, and if you can't avoid global side effects, they must be limited to one or two global states per function.
  • Prefer local static states to global states, so that modifications are bound to a single function.
  • Prefer pointers over implicit references for function arguments (C++)
The last point may disturb you. Implicit references (in the sense of C++ or Pascal) are used, both at the call site and in the body of the function, like non-reference arguments, hiding the persistence of the modification. This poses no threat when used with intended modifications, but inside expressions it can lead to the kind of bug you fight against for hours. Using an explicit pointer requires an explicit operator (& at the call site), indicating the possible modification.

Function prototypes have to be explicit

In C, pointers are used for a lot of things: arrays, mutable arguments, data structures …

So you can pass a pointer to a function for a lot of reasons, and it is interesting to find a way to differentiate the various usages of pointers as parameters.

I'll give you a quick overview of my coding style:
  • Pointer as array: when the pointer parameter is in fact an array, I always use empty brackets rather than the star, for example:

  • float vect_sum(float tab[], size_t len)
    {
      float res = 0;
      float *end = tab + len;
      for (; tab != end; ++tab)
        res += *tab;
      return res;
    }
    

  • Pointer as reference (mutable argument): when passing a pointer to a variable in order to propagate modifications of the variable's value back to the calling context, I explicitly use a star:

  • void swap(int *a, int *b)
    {
      int c;
      c = *a;
      *a = *b;
      *b = c;
    }
    

  • Data structure: most (all) linked data structures (lists, trees …) have a pointer as entry point. In fact, the structure is the pointer (for example, the linked list is the pointer, not the structure.) So, I hide the pointer in the typedef rather than letting it be visible:

  • typedef struct s_list *list;
    struct s_list
    {
      list next;
      int  content;
    };
    size_t list_len(list l)
    {
      size_t r = 0;
      for (; l; l = l->next)
        ++r;
      return r;
    }

The last point is often disturbing for students and needs some explanation. First, consider the logic of a linked list: a list is recursively defined as either an empty list or a pair of an element and a list. Thus, the NULL pointer representing the empty list is a list, meaning that the list is the pointer, not the structure (a pair can be viewed as a structure or as a pointer to a structure; to be coherent we must choose the latter definition.)
So, your list is a pointer, and thus the type list must reflect this fact.
This is where the usual argument comes in: « but if the pointer is not obvious in the prototype, how do I know that I must use an arrow rather than a dot to access a structure member? » There are two answers to this question:
  • First, when dealing with the code manipulating the list, you know what you're doing, so there's no question!
  • You must be coherent: if you include the star in the typedef, you won't do a typedef on a plain struct. So you don't flag the case where you should use an arrow, but the case where you should use a dot!
Combining the rules for pointer as reference and pointer as data structure, you get a coherent strategy for defining functions that modify the content of the data structure and functions that modify the structure itself (especially when modifying the entry point.) The following example shows a functional add (adding an element at the head of a linked list without modifying the given pointer) and a procedural add (again adding at the head, but modifying the given pointer.)

#include <stdlib.h>

/* we're using the previous list definition */

list fun_add(list l, int x)
{
  list t;
  t = malloc(sizeof (struct s_list));
  t->content = x;
  t->next = l;
  return t;
}

/* note that l is passed as a pointer to a list, not a list */
void proc_add(list *l, int x)
{
  list t;
  t = malloc(sizeof (struct s_list));
  t->content = x;
  t->next = *l;
  *l = t;
}

The star in the second version indicates that the function may modify the head of the list (in fact, it will.)

Forbidding typedefs of structures has another positive aspect: when passing a structure to a function (or when returning one), the structure is copied (and thus duplicated), inducing a (small) overhead at the function call (or return) and bigger memory usage. Hiding the fact that the value is a structure induces what I call hidden complexity: what looks like a simple operation has in fact a non-negligible cost. Thus, once again, the good strategy is to leave the fact that the value is a structure visible.

The special case of strings: in C, strings have no specific type; you use pointers to characters. Normally, you could view a string as an array of characters, but the usual convention is to describe strings as char* rather than char[], so strings are the only case where I use the star rather than the bracket syntax. This also solves the issue of arrays of strings: you can't use char[][] (that would be a two-dimensional array, and since the compiler needs the size of a row to correctly translate indexes, you can't write such a type); the solution is to mix star and brackets, as in the following example:

/* argv is an array of strings */
int main(int argc, char *argv[])
{
  /* ... */
  return 0;
}

Conclusion

I hope these tricks were useful or interesting to you. As I said in the opening, programming in C requires careful concentration and a bit of understanding of the underlying semantics of the language.

These tricks are not magical recipes; if you want to be a C programmer, you'll need practice. The first rule of a programmer should be: code, and when you think you have coded enough, code again!

By coding I mean: try out ideas, implement toys, compare the behaviors of various implementations of the same idea, take a look at the generated ASM code (yes, you should do that …) And don't be afraid of pointers: pointers are your friends, you need them, they are useful in a lot of situations, but you've got to be nice to them and treat them properly (or you'll pay!)

As a second rule, I'd propose: find a convenient coding style! Coding styles are often useful to avoid misuse of data structures or ambiguous situations (such as the enum examples), but when building a coding style, don't focus on code presentation, organization, comments or forbidden features. The most important parts of a coding style are the naming convention and a coherent way of describing function prototypes.

Wednesday, May 9, 2012

C! Programming Language - LSE blog articles

I wrote a new article presenting the C! programming language.

C! on LSE's blog

C! is a system-oriented programming language. Started as a simple revisited syntax for C, it now offers simple objects and similar features.