---
pubDate = 2018-11-18
tags = ['language', 'obfuscation', 'programmation']
[author]
name = "ache"
email = "ache@ache.one"
[[alt_lang]]
lang = "fr"
url = "/articles/bizarreries-du-langage-c"
---

The quirks of the C language
============================

![The C Programming Language logo](res/c_language.svg)

C is a language with a simple syntax. The only complexity of this language comes from the fact that it acts in a machine-like way.
However, a part of the C syntax is almost never taught. Let's tackle these mysterious cases! 🧞

:::note
To understand this post, it is necessary to have a basic knowledge of a language with a syntax and operation close to C.
:::

Table of contents
----------

The uncommon operators
----------------------

There are two operators in the C language that are almost never used. The first is the comma operator.
In C, the comma is used to separate the items of a declaration or the arguments of a function call. In short, it is a punctuation element.
But not only that! It is also an operator.

### The comma operator

The following statement, although useless, is perfectly valid:

~~~c
printf("%d", (5,3) );
~~~

It prints 3. The `,` operator is used to juxtapose expressions: its operands are evaluated from left to right, and the value of the whole expression is the value of the last one.

This operator is very useful in a `for` loop to update several variables in the same iteration step.
For example, to increment `i` and decrement `j` in the same iteration of a `for` loop, we can do:

~~~c
for( ; i < j ; i++, j-- ) {
    // [...]
}
~~~

Or again, in a small `if`, to keep things short:

~~~c
if( argc > 2 && argv[2][0] == '0' )
    action = 4, color = false;
~~~

Here, we assign both `action` and `color`. Normally, two assignments under a single `if` would require curly braces.

We can also use the comma operator to remove parentheses.

~~~c
while( c = getchar(), c != EOF && c != '\n' ) {
    // [...]
}
// Is strictly equivalent to:
while( (c = getchar()) != EOF && c != '\n' ) {
    // [...]
}
~~~

But above all, do not abuse this operator!
You can very quickly end up with unreadable code. This remark also applies to the next operator!

### The ternary operator

The "ternary", to its friends. It is the only operator of the language that takes three operands.
It is used to simplify conditional expressions.

For example, to print the minimum of two numbers without the ternary operator, we could do:

~~~c
if (a < b)
    printf("%d", a);
else
    printf("%d", b);
~~~

Or simply:

~~~c
int min = a;
if( b < a)
    min = b;
printf("%d", min);
~~~

And using the ternary operator:

~~~c
printf("%d", a < b ? a : b);
~~~

[...]

>~~~c
>static int b64_d[] = {
>    ['A'] =  0, ['B'] =  1, ['C'] =  2, ['D'] =  3, ['E'] =  4,
>    ['F'] =  5, ['G'] =  6, ['H'] =  7, ['I'] =  8, ['J'] =  9,
>    ['K'] = 10, ['L'] = 11, ['M'] = 12, ['N'] = 13, ['O'] = 14,
>    ['P'] = 15, ['Q'] = 16, ['R'] = 17, ['S'] = 18, ['T'] = 19,
>    ['U'] = 20, ['V'] = 21, ['W'] = 22, ['X'] = 23, ['Y'] = 24,
>    ['Z'] = 25, ['a'] = 26, ['b'] = 27, ['c'] = 28, ['d'] = 29,
>    ['e'] = 30, ['f'] = 31, ['g'] = 32, ['h'] = 33, ['i'] = 34,
>    ['j'] = 35, ['k'] = 36, ['l'] = 37, ['m'] = 38, ['n'] = 39,
>    ['o'] = 40, ['p'] = 41, ['q'] = 42, ['r'] = 43, ['s'] = 44,
>    ['t'] = 45, ['u'] = 46, ['v'] = 47, ['w'] = 48, ['x'] = 49,
>    ['y'] = 50, ['z'] = 51, ['0'] = 52, ['1'] = 53, ['2'] = 54,
>    ['3'] = 55, ['4'] = 56, ['5'] = 57, ['6'] = 58, ['7'] = 59,
>    ['8'] = 60, ['9'] = 61, ['+'] = 62, ['/'] = 63, ['='] = 64
>};
>~~~

Source: [Taurre](https://openclassrooms.com/forum/sujet/defis-8-tout-en-base64-19054?page=1#message-6921633)

[^array]: `{0}` can also be used for a number. Any value initializing a simple variable (pointer, number, ...) can optionally take braces.

    ~~~c
    int a = {11};
    float pi = {3.1415926};
    char* s = {"unicorn"};
    ~~~

    The main feature of the array is the use of commas between the braces.

The compound literals
---------------------

Since we are talking about arrays, there is a simple syntax for single-use arrays.
I would like to use this array:

~~~c
int arr[5] = {5, 4, 5, 2, 1};
printf("%d", arr[i]); // With i set to something >=0 and <5
~~~

However, I only use this array once... It's a bit annoying to have to introduce an identifier just for that.

Well, I can write this instead:

~~~c
printf("%d", ((int[]){5,4,5,2,1}) [i] ); // With i set to something >=0 and <5
~~~

It's not very readable, but in many cases this syntax is useful. For example with a structure:

~~~c
// To send our message:
send_msg( (message){ .dst="192.168.11.1", .msg="Code 11"} );

// To print the distance between two points
printf("%d", distance( (point){1, 2}, (point){2, 3} ) );

// Or on Linux, in system programming
execvp( "bash" , (char*[]){"bash", "-c", "ls", NULL} );
~~~

We call these expressions *compound literals* (a term that is a pain to translate into any other language).

Introduction to VLAs
--------------------

Variable Length Arrays (VLAs) are arrays whose length is only known at runtime.
If you have never encountered a VLA, this should surprise you:

~~~c
int n = 11;

int arr[n];

for(int i = 0 ; i < n ; i++)
    arr[i] = 0;
~~~

A lot of teachers would have rejected that code. We have been taught that the size of an array must be known at compile time.
VLAs are the exception.

Introduced with the C99 standard, VLAs have a bad reputation. There are several reasons for this, which I won't go into here[^reasons].
I'm just going to talk about the non-intuitive behaviors introduced with VLAs.

But first, let's see what a VLA is and how to use one (the normal way).
A VLA is defined with the same syntax as a classic array, except that the size of the array is a non-constant expression.

~~~c
int n = 50;
int arr[n];
double arr2[2*n];
unsigned int arr3[foo()]; // with foo a function defined elsewhere
~~~

A variable length array can neither be initialised nor declared `static`.
Thus, the last two declarations below are invalid:

~~~c
int n = 30;
int arr[n] = {0};    // Invalid: a VLA cannot be initialised
static int arr2[n];  // Invalid: a VLA cannot be static
~~~

In the parameter list of a function, we can use this syntax to refer to a VLA:

~~~c
void bar(int n, int arr[n]) {
}
~~~

In C, however, the size of the first dimension of an array parameter is of little interest, since it is ignored.
A more realistic case is passing a two-dimensional VLA, where the second dimension *must* be specified:

~~~c
void foo( int n, int m, int arr[][m]) {
}
~~~

Note that it is possible to use the character `*` (yet another use ...) instead of the size of one or more dimensions of a VLA, but *only* within a prototype.

~~~c
void foo(int, int, int[][*]);
~~~

Well, after that short introduction, let's talk about the interesting cases: the quirks and eccentricities that VLAs have introduced!

[^reasons]: I can nevertheless give you two references: [“Is it safe to use variable-length arrays?”](https://stackoverflow.com/questions/7326547/is-it-safe-to-use-variable-length-arrays) from Stack Overflow and this article: [“The Linux Kernel Is Now VLA-Free”](https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kills-The-VLA).

The VLAs exceptions
-------------------

The best-known deviant behavior of VLAs is their relation to `sizeof`.

`sizeof` is a unary operator that yields the size of a type, given either an expression or the name of a type surrounded by parentheses.

~~~c
/* How sizeof works using examples */
float a;
size_t b = 0;

printf("%zu", sizeof(char));   // Prints 1
printf("%zu", sizeof(int));    // Prints 4
printf("%zu", sizeof a);       // Prints 4
printf("%zu", sizeof(a*2.));   // Prints 8
printf("%zu", sizeof b++);     // Prints 8
~~~

The first result is not very surprising: the size of a `char` is defined to be 1, so `sizeof(char)` must yield 1 (as per the C standard).
The second one is the size of `int`[^system].
The third one is the size of the type of the expression `a` (which is `float`[^system]).
The fourth is the size of a `double`[^system] (the type of `a*2.`[^float]).
The last one is the size of the type `size_t`, since it is the type of the expression `b++`[^system].

Here, the value of the expression does not matter, since `sizeof` does not care about it either.
The value of a `sizeof` expression is determined at compile time.
The operations inside the `sizeof` operand are not executed (so `b` is not incremented).
Since the expression must be valid, its type is known at compile time.

~~~c
int n = 5;
printf("%zu", sizeof(char[++n])); // Prints 6
~~~

*Ouch*! Here come the VLAs.
In the type `char[++n]`, `++n` is a non-constant expression. So the array is a VLA.
To know the size of the array, the expression inside the brackets must be evaluated at runtime.
Thus, `n` now holds 6 and `sizeof` reports that an array of `char` declared with this expression would have a size of `6`.

This is not very intuitive, since VLAs introduce an *exception to the rule* that the expression passed to `sizeof` is not evaluated.

Another odd behaviour introduced by VLAs is the execution of expressions related to the size of a VLA in the definition of a function. Thus:

~~~c
int foo( char arr[printf("bar")] ) {
  printf("%zu", sizeof arr);
}
~~~

Assuming that the output calls do not fail, calling this function prints `bar` followed by the size of a pointer (`bar8` on a typical 64-bit system).
The `printf("bar")` expression is evaluated first, and only then is the body of the function executed.
Note that `sizeof arr` does not give 3 here, because an array parameter is always adjusted to a pointer.

Note that the standardisation of VLAs brought other exceptions: as already stated, VLAs cannot be allocated statically (quite logical), nor can they be used as members of a structure (GCC supports it anyway, as an extension).
Even some jumps are forbidden when a VLA is involved: for example, a `goto` may not jump into the scope of a VLA.

[^system]: On my computer.
[^float]: Here, the implementation follows the IEEE 754 standard, where a single-precision floating-point number takes 4 bytes and a double-precision one takes 8. `2.` has type `double`, so `a*2.` has the type of its larger operand, `double`.


A flexible array
----------------

You may never have heard of “flexible array members”. That is normal: they address a very specific and uncommon problem.

The goal is to allocate a structure with one member (an array) whose size is unknown at compile time, all in one contiguous block of memory[^why].

VLAs are of no help here because, as already stated, they are forbidden as structure members. We must use dynamic allocation.

We could write this:

~~~c
struct foo {
    int* arr;
};
~~~

And use it like this:

~~~c
struct foo* contiguous = malloc( sizeof(struct foo) );
if (contiguous) {
    contiguous->arr = malloc( N * sizeof *contiguous->arr );
    if (contiguous->arr) {
        contiguous->arr[0] = 11;
    }
}
~~~

But here the array may not be next to the structure in memory. Moreover, if we copy the structure, `arr` points to the same array in the copy and in the original.
To avoid that, we must copy the structure, allocate a new array and copy the contents into it.

Let's see another way.

~~~c
struct foo {
    /* ... At least one other member, because the C standard says so ... */
    int flexiArr[];
};
~~~

Here, the member `flexiArr` is a flexible array member. Such an array must be the last member of the structure and must not specify a size[^zero].
It is used like this:

~~~c
struct foo* contiguous = malloc( sizeof(struct foo) + N * sizeof *contiguous->flexiArr );
if (contiguous) {
    contiguous->flexiArr[0] = 11;
}
~~~

This syntax addresses both a need for portability on architectures imposing a particular alignment (the array is contiguous to the structure) and the need to show a semantic link between the array and the structure (the array belongs to the structure).

[^why]: We may want this space to be contiguous for many reasons. One is to make better use of the processor cache. Another is the handling of network layers, which is well suited to flexible array members.
[^zero]: Prior to the specification of flexible array members in C99, it was common practice to use arrays of size one to replicate the concept.

A labels history
----------------

In C, if there's one thing we shouldn't talk about, it's *labels*. We use them with the `goto` statement. The forbidden one!
To hide it, we replace `goto` with more explicit named statements like `break` or `continue`.
So we no longer use `goto`, and we no longer learn what a label is when we learn the C syntax.

This is how `goto` and a label are used:

~~~c
int main(void) {
    goto end;

end:
    return 0;
}
~~~

Basically, a label is a name given to a statement.
Nowadays, we mostly meet labels in `switch` statements.

~~~c
switch( action ) {
case 0:
    do_action0();
case 1:
    do_action1();
    break;
case 2:
    do_action2();
    break;
default:
    do_action3();
}
~~~

Here each `case` and the `default` are in fact labels. Except that you can't use them with `goto`, since they have no name to refer to.

:::question
Why are you telling us this?
:::

Firstly, it's good to know that it's called a label. Secondly, because I'm going to tell you about a classic: *Duff's device*.

It's a kind of unrolled and optimised loop. The goal is to reduce the number of loop checks (as well as the number of decrements).

Here is the historical version written by Tom Duff:

~~~c
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
~~~

It doesn't matter what `register` means. Also, `to` is a special pointer (which is why it is never incremented), but it doesn't really matter.
Here, what I want to tell you about is that `do-while` loop in the middle of a `switch`.

The test we try to avoid is `--n > 0`. Normally, `n` would actually be `count`, and we would have to test `count` times.
The same goes for its decrement.
That's to say:

~~~c
while( count-- > 0 )
    *to = *from++;
~~~

By dividing by 8 (an arbitrary number), we also divide the number of tests and decrements by 8.
However, if `count` is not divisible by 8, we have a problem: a whole number of rounds of 8 does not perform the right number of copies.
It would be nice to be able to jump directly to the 3rd instruction of the round if only 6 copies remain.

And this is where labels can help us! Thanks to the `switch`, we can jump directly to the right instruction.
We only have to label every instruction with the number of instructions remaining in the round.
Then we jump to that instruction with the `switch` statement.

It is very rare to have to use this type of trick. It's mostly an optimisation from another time.
But since I wanted to talk about that syntax, it was necessary to talk about labels (or was it?).

Complex numbers
---------------

Once again, we will study a syntax introduced by C99. More precisely, three complex number types were introduced.
The main one is `double _Complex` (the other two, `float _Complex` and `long double _Complex`, follow the same pattern; I will only write about the `double` version).

Thus, in C, it is possible to declare a complex number like this:

~~~c
double complex point = 2 + 3 * I;
~~~

Here we find the special macros `complex` and `I` (defined in the `<complex.h>` header). The former is used to create a complex type, while the latter is the imaginary unit, used to write the imaginary part of a complex number.

In memory, a complex variable takes up twice as much space as the real type on which it is based.
A complex variable is used like a normal variable. The arithmetic is intuitive since it is based on the real type.

Note that it is recommended to use the macro `CMPLX` (available since C11) to initialise a complex number:

~~~c
double complex cplx = CMPLX(2, 3);
~~~

This gives better handling of the cases where the imaginary part (the one multiplied by `I`) is `NAN`, `INFINITY` or even plus or minus 0.

The `<complex.h>` header offers us a really simple way to use complex numbers.
Indeed, many common functions for manipulating complex numbers are available.

Generic macros
--------------

There is a way in C to have macros that are defined differently depending on the type of one of their arguments.
This syntax is however "new" since it dates from the C11 standard.

This genericity is achieved with generic selections, based on the syntax ` _Generic ( /* ... */ )`.

To understand the syntax, let's look at a simplistic example:

~~~c
#include <stdio.h>
#include <limits.h>

#define MAXIMUM_OF(x) _Generic ((x), \
                            char: CHAR_MAX, \
                            int: INT_MAX, \
                            long: LONG_MAX \
                          )

int main(int argc, char* argv[]) {
    int i = 0;
    long l = 0;
    char c = 0;

    printf("%i\n", MAXIMUM_OF(i));
    printf("%d\n", MAXIMUM_OF(c));
    printf("%ld\n", MAXIMUM_OF(l));
    return 0;
}
~~~

Here we print the maximum value that can be stored in each of the types we use.
This is something that would not have been possible without this new keyword, `_Generic`.

To use this syntax, we write the keyword `_Generic` and pass it two parameters.
The first is an expression whose type determines the expression that is finally selected.
The second is a sequence of type and expression associations (type: expression), separated by commas.

In the end, only the expression associated with the type of the first expression is evaluated.

A real-world example could be:

~~~c
int powInt(int,int);

#define POW(x,y) _Generic ((y), double: pow, float: powf, long double: powl, int: powInt)((x), (y))
~~~

There's not much more to say, except that it's possible to have the word `default` in the list of types, which will then match all the types that are not mentioned.
So a cleaner definition of the `POW` macro from earlier could be:

~~~c
int powIntuInt(int a, unsigned int b);
double powIntInt(int a, int b);

double powFltInt(double a,int b) { return pow (a,b); }
double powfFltInt(float a,int b) { return powf(a,b); }
double powlFltInt(long double a,int b) { return powl(a,b); }

#define POWINT(x) _Generic((x), double: powFltInt, \
                                float : powfFltInt, \
                                long double: powlFltInt, \
                                unsigned int: powIntuInt, \
                                default: powIntInt)

#define POW(x,y) _Generic ((y), double: pow, float: powf, long double: powl, default: POWINT((x)) )((x), (y))
~~~

Too special characters
----------------------

I'm going to mention one more thing. Let's go back in time again, to a time when not all characters were as accessible as they are today on so many types of keyboards.
Keyboards didn't necessarily have compose keys. Thus, on some of them, it was impossible to type the `#` character.

The `#` character could then be replaced by the sequence `??=`. And for each character used in the C language but missing from the keyboard, there was a `??`-based sequence called a trigraph.
Another, *more readable* version, based on 2 characters, is called a digraph.

Here is a table summarising the trigraph and digraph sequences and the characters they represent.

| Character | Digraph | Trigraph |
| --------- | ------- | -------- |
| `#` | `%:` | `??=` |
| `[` | `<:` | `??(` |
| `]` | `:>` | `??)` |
| `{` | `<%` | `??<` |
| `}` | `%>` | `??>` |
| `\` | | `??/` |
| `^` | | `??'` |
| `\|` | | `??!` |
| `~` | | `??-` |
| `##` | `%:%:` | |

The **main** difference between digraphs and trigraphs shows up inside a string: a trigraph is replaced everywhere, even inside a string literal, whereas a digraph is left untouched there.

~~~c
puts("??= is a hashtag"); // The trigraph is replaced: prints "# is a hashtag"
puts("%:");               // The digraph is not replaced: prints "%:"
~~~

These medieval mechanisms are still valid today in C. [^C23] So this line of code is perfectly valid:

~~~c
??=define FIRST arr<:0]
~~~

The only use of this syntax nowadays is to obfuscate source code very easily.
By combining a ternary with a trigraph and a digraph, you get absolutely unreadable code 😉

~~~c
printf("%d", a ?5??((arr):>:0);
~~~

[^C23]: In C23, trigraphs have been removed from the language and no longer work.

**Never use them in serious code**.

## To conclude

That's it, I hope you've learned something from this post.
Don't forget to use these syntaxes sparingly.

I would like to thank [Taurre](https://zestedesavoir.com/membres/voir/Taurre/) for validating this article in French, but also for his pedagogy on the forums for years, as well as [blo yhg](https://zestedesavoir.com/membres/voir/blo%20yhg/) for his careful proofreading.

Note that you can (re)discover a lot of code abusing the C language syntax at [IOCCC](http://ioccc.org/winners.html). 😈