From e089388ab53d8e015f0cc997b3cee0fb1d62bca6 Mon Sep 17 00:00:00 2001
From: ache
Date: Sat, 22 Apr 2023 00:08:23 +0200
Subject: Support for multilang

---
 articles/bizarreries-du-langage-c.md |   4 +
 articles/c-language-quirks.md        | 850 +++++++++++++++++++++++++++++++++++
 2 files changed, 854 insertions(+)
 create mode 100644 articles/c-language-quirks.md

diff --git a/articles/bizarreries-du-langage-c.md b/articles/bizarreries-du-langage-c.md
index 6bfc7af..fecc440 100644
--- a/articles/bizarreries-du-langage-c.md
+++ b/articles/bizarreries-du-langage-c.md
@@ -7,6 +7,10 @@ tags = ['langage', 'obfusquation', 'programmation']
 name = "ache"
 email = "ache@ache.one"

+[[alt_lang]]
+lang = "en"
+url = "/articles/c-language-quirks"
+
 ---


diff --git a/articles/c-language-quirks.md b/articles/c-language-quirks.md
new file mode 100644
index 0000000..1b5d0de
--- /dev/null
+++ b/articles/c-language-quirks.md
@@ -0,0 +1,850 @@
+---
+
+pubDate = 2018-11-18
+tags = ['language', 'obfuscation', 'programmation']
+
+[author]
+name = "ache"
+email = "ache@ache.one"
+
+[[alt_lang]]
+lang = "fr"
+url = "/articles/bizarreries-du-langage-c"
+
+---
+
+
+
+The quirks of the C language
+============================
+
+![The C Programming Language logo](res/c_language.svg)
+C is a language with a simple syntax.
+The only complexity of this language comes from the fact that it behaves in a machine-like way.
+However, part of the C syntax is almost never taught.
+Let's tackle these mysterious cases! 🧞
+
+:::attention
+To understand this post, you need a basic knowledge of a language whose syntax and behaviour are close to C.
+:::
+
+
+Summary
+-------
+
+
+The uncommon operators
+----------------------
+
+There are two operators in the C language that are almost never used.
+The first is the comma operator.
+In C, the comma is used to separate the elements of a definition or the arguments of a function. In short, it is a punctuation element.
+But not only that! It is also an operator.
+
+
+### The comma operator
+
+The following statement, although pointless, is perfectly valid:
+
+~~~c
+printf("%d", (5,3) );
+~~~
+
+It prints 3.
+The operator `,` is used to juxtapose expressions.
+
+The value of the complete expression is equal to the value of the last expression.
+
+This operator is very useful in a `for` loop to update several variables at each iteration.
+For example, to increment `i` and decrement `j` in the same iteration of a `for` loop, we can do:
+
+~~~c
+for( ; i < j ; i++, j-- ) {
+    // [...]
+}
+~~~
+
+Or again, in a small `if`, to simplify:
+
+~~~c
+if( argc > 2 && argv[2][0] == '0' )
+    action = 4, color = false;
+~~~
+
+Here, we assign `action` and `color`.
+Normally, to perform two assignments, we would need curly braces.
+
+We can also use the comma operator to remove parentheses.
+
+~~~c
+while( c = getchar(), c != EOF && c != '\n' ) {
+    // [...]
+}
+// Is strictly equivalent to:
+while( (c = getchar()) != EOF && c != '\n' ) {
+    // [...]
+}
+~~~
+
+
+But above all, do not abuse this operator!
+You can quickly end up with unreadable code.
+This remark is also valid for the next operator!
+
+### The ternary operator
+
+The "ternary", to its friends.
+It is the only operator of the language that takes 3 operands.
+It is used to simplify conditional expressions.
+
+For example, to print the minimum of 2 numbers without the ternary, we could do:
+
+~~~c
+if (a < b)
+    printf("%d", a);
+else
+    printf("%d", b);
+~~~
+
+Or simply:
+
+~~~c
+int min = a;
+if( b < a)
+    min = b;
+printf("%d", min);
+~~~
+
+And using the ternary operator:
+
+~~~c
+printf("%d", a < b ? a : b);
+~~~
+
+[...]
+
+>~~~c
+>static int b64_d[] = {
+>  ['A'] =  0, ['B'] =  1, ['C'] =  2, ['D'] =  3, ['E'] =  4,
+>  ['F'] =  5, ['G'] =  6, ['H'] =  7, ['I'] =  8, ['J'] =  9,
+>  ['K'] = 10, ['L'] = 11, ['M'] = 12, ['N'] = 13, ['O'] = 14,
+>  ['P'] = 15, ['Q'] = 16, ['R'] = 17, ['S'] = 18, ['T'] = 19,
+>  ['U'] = 20, ['V'] = 21, ['W'] = 22, ['X'] = 23, ['Y'] = 24,
+>  ['Z'] = 25, ['a'] = 26, ['b'] = 27, ['c'] = 28, ['d'] = 29,
+>  ['e'] = 30, ['f'] = 31, ['g'] = 32, ['h'] = 33, ['i'] = 34,
+>  ['j'] = 35, ['k'] = 36, ['l'] = 37, ['m'] = 38, ['n'] = 39,
+>  ['o'] = 40, ['p'] = 41, ['q'] = 42, ['r'] = 43, ['s'] = 44,
+>  ['t'] = 45, ['u'] = 46, ['v'] = 47, ['w'] = 48, ['x'] = 49,
+>  ['y'] = 50, ['z'] = 51, ['0'] = 52, ['1'] = 53, ['2'] = 54,
+>  ['3'] = 55, ['4'] = 56, ['5'] = 57, ['6'] = 58, ['7'] = 59,
+>  ['8'] = 60, ['9'] = 61, ['+'] = 62, ['/'] = 63, ['='] = 64
+>};
+>~~~
+Source: [Taurre](https://openclassrooms.com/forum/sujet/defis-8-tout-en-base64-19054?page=1#message-6921633)
+
+
+[^array]:
+    `{0}` can also be used for a scalar. The initializer of any simple variable (pointer, number, ...) may optionally be enclosed in braces.
+
+    ~~~c
+    int a = {11};
+    float pi = {3.1415926};
+    char* s = {"unicorn"};
+    ~~~
+
+    What distinguishes an array initializer is the use of commas between the braces.
+
+
+The compound literals
+---------------------
+
+
+Since we are talking about arrays, there is a simple syntax for single-use arrays.
+
+I would like to use this array:
+~~~c
+int arr[5] = {5, 4, 5, 2, 1};
+printf("%d", arr[i]); // With i set to something >=0 and <5
+~~~
+
+
+However, I only use this array once...
+It's a bit annoying to have to introduce an identifier just for that.
+
+
+Well, I can do this:
+~~~c
+printf("%d", ((int[]){5,4,5,2,1}) [i] ); // With i set to something >=0 and <5
+~~~
+
+It's not very readable, but in many cases this syntax is useful.
+For example with a structure:
+
+~~~c
+
+// To send our message:
+send_msg( (message){ .dst="192.168.11.1", .msg="Code 11"} );
+
+// To print the distance between two points
+printf("%d", distance( (point){1, 2}, (point){2, 3} ) );
+
+// Or on Linux, in system programming
+execvp( "bash" , (char*[]){"bash", "-c", "ls", NULL} );
+~~~
+
+We call these expressions *compound literals* (a term that is a pain to translate into any other language).
+
+
+Introduction to VLAs
+--------------------
+
+Variable Length Arrays (VLAs) are arrays whose length is only known at runtime.
+If you have never encountered a VLA, this should surprise you:
+
+~~~c
+int n = 11;
+
+int arr[n];
+
+for(int i = 0 ; i < n ; i++)
+    arr[i] = 0;
+~~~
+
+Many teachers would have disapproved of that code.
+We have been taught that an array must have a size known at compile time.
+VLAs are the exception.
+Introduced with the C99 standard, VLAs have a bad reputation.
+There are several reasons for this, which I won't go into here[^reasons].
+
+I'm just going to talk about the non-intuitive behaviours introduced with VLAs.
+
+But first, let's see how to use a VLA the normal way.
+A VLA is defined with the same syntax as a classic array, except that its size is a non-constant expression.
+
+~~~c
+int n = 50;
+int arr[n];
+
+double arr2[2*n];
+
+unsigned int arr3[foo()]; // with foo a function defined elsewhere
+~~~
+
+A variable length array can neither be initialised nor declared `static`.
+Thus, both of these declarations are incorrect:
+
+~~~c
+int n = 30;
+
+int arr[n] = {0};
+static int arr2[n];
+~~~
+
+In a function's parameter list, we can use this syntax to refer to a VLA:
+
+~~~c
+void bar(int n, int arr[n]) {
+
+}
+~~~
+
+In C, however, the size of the first dimension of an array parameter is not really of interest to the function.
+A more realistic case is passing a two-dimensional VLA, where the second dimension *must* be specified:
+
+~~~c
+void foo( int n, int m, int arr[][m]) {
+
+}
+~~~
+
+Note that it is possible to use the character `*` (yet another use ...) instead of the size of one or more dimensions of a VLA, but *only* within a prototype.
+
+~~~c
+void foo(int, int, int[][*]);
+~~~
+
+Well, after that short introduction, let's talk about the interesting cases:
+the quirks and eccentricities that VLAs have introduced!
+
+
+[^reasons]: I can nevertheless give you two references: [“Is it safe to use variable-length arrays?”](https://stackoverflow.com/questions/7326547/is-it-safe-to-use-variable-length-arrays) from Stack Overflow and this article: [“The Linux Kernel Is Now VLA-Free”](https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kills-The-VLA).
+
+
+The VLA exceptions
+------------------
+
+The best-known deviant behaviour of VLAs is their relationship with `sizeof`.
+
+`sizeof` is a unary operator that yields the size of a type, given either an expression or the name of a type surrounded by parentheses.
+
+~~~c
+/* How sizeof works using examples */
+float a;
+size_t b = 0;
+
+printf("%zu", sizeof(char));   // Prints 1
+printf("%zu", sizeof(int));    // Prints 4
+printf("%zu", sizeof a);       // Prints 4
+printf("%zu", sizeof(a*2.));   // Prints 8
+printf("%zu", sizeof b++);     // Prints 8
+~~~
+
+The first result is not very surprising: the size of a `char` is defined to be 1, so `sizeof(char)` must yield 1 (as per the C standard).
+The second one is the size of `int`[^system].
+The third one is the size of the type of the expression `a` (which is `float`[^system]).
+The fourth is the size of a `double`[^system] (the type of `a*2.`[^float]).
+The last one is the size of the type `size_t` since it is the type of the expression `b++`[^system].
+
+Here, we don't care about the value of the expression, and neither does `sizeof`.
+The value of the `sizeof` expression is determined at compile time.
+The operations inside the `sizeof` operand aren't executed.
+Since the expression must be valid, its type is determined at compile time.
+
+~~~c
+int n = 5;
+printf("%zu", sizeof(char[++n])); // Prints 6
+~~~
+
+*Ouch*! Here come the VLAs.
+In the type `char[++n]`, `++n` is a non-constant expression.
+So the array is a VLA.
+To know the size of the array, the expression inside the brackets must be evaluated.
+Thus, `n` now holds 6 and `sizeof` indicates that an array of `char` declared with this expression would have a size of `6`.
+
+This is not very intuitive, since VLAs have introduced an *exception to the rule* that the expression passed to `sizeof` is not executed.
+
+Another odd behaviour introduced by VLAs is the execution of expressions related to the size of a VLA in the definition of a function. Thus:
+
+~~~c
+void foo( char arr[printf("bar")] ) {
+    printf("%zu", sizeof arr);
+}
+~~~
+
+Assuming that the displays do not cause an error, calling this function will display `bar` followed by the size of a pointer (`bar8` on a typical 64-bit system), because the parameter `arr` is adjusted to a `char*`.
+The `printf("bar")` expression is evaluated first, and only then is the body of the function executed.
+
+
+Note that VLAs come with other restrictions: as already stated, they cannot be allocated statically (quite logical), and they cannot be used as members of a structure (GCC supports it anyway, as an extension).
+Even some jumps (`goto`, `switch`) into the scope of a VLA are forbidden.
+
+[^system]: On my computer.
+[^float]: Here, the implementation follows the IEEE 754 standard, where a single-precision floating-point number takes 4 bytes and a double-precision one takes 8. `2.` has type `double`, so `a*2.` has the type of its larger operand.
+
+
+A flexible array
+----------------
+
+You may never have heard of “flexible array members”.
+That's normal: they address a very specific and uncommon problem.
+
+The objective is to allocate a structure with one field (an array) whose size is unknown at compile time, all in one contiguous space[^why].
+
+VLAs are no help here because, as already stated, they are forbidden as structure fields.
+We must use dynamic allocation.
+We could write this:
+
+~~~c
+struct foo {
+    int* arr;
+};
+~~~
+
+And use it like this:
+
+~~~c
+struct foo* contiguous = malloc( sizeof(struct foo) );
+if (contiguous) {
+    contiguous->arr = malloc( N * sizeof *contiguous->arr );
+    if (contiguous->arr) {
+        contiguous->arr[0] = 11;
+    }
+}
+~~~
+
+But here the array may not be next to the structure in memory.
+As a consequence, if we copy the structure, the `arr` pointer of the copy has the same value as the original's: both point to the same array.
+To avoid that, we must copy the structure, allocate a new array and copy the elements.
+Let's see another way.
+
+~~~c
+struct foo {
+    /* ... At least another field because the C standard says so ... */
+    int flexiArr[];
+};
+~~~
+
+Here, the field `flexiArr` is a flexible array member.
+Such an array must be the last member of the structure and must not specify a size[^zero]. It is used like this:
+
+~~~c
+struct foo* contiguous = malloc( sizeof(struct foo) + N * sizeof *contiguous->flexiArr );
+if (contiguous) {
+    contiguous->flexiArr[0] = 11;
+}
+~~~
+
+
+This syntax responds as much to a need for portability on architectures imposing a particular alignment (the array is contiguous to the structure) as to the need to show a semantic link between the array and the structure (the array belongs to the structure).
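To make the copy argument above concrete, here is a minimal sketch that is not part of the original article. The `int_vec` type and the helpers `int_vec_new` and `int_vec_clone` are hypothetical names chosen for illustration, and the size computation deliberately ignores overflow. Because the header and the elements live in one allocation, a deep copy boils down to a single `memcpy`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for illustration: a length header followed by its elements. */
struct int_vec {
    size_t len;    /* number of elements stored in data */
    int    data[]; /* flexible array member: last member, no size given */
};

/* Allocate the header and len zeroed elements in one contiguous block.
 * Returns NULL if the allocation fails. */
struct int_vec *int_vec_new(size_t len) {
    struct int_vec *v = malloc(sizeof *v + len * sizeof v->data[0]);
    if (v) {
        v->len = len;
        memset(v->data, 0, len * sizeof v->data[0]);
    }
    return v;
}

/* Deep copy: header and elements are contiguous, so one memcpy is enough.
 * Returns NULL if the allocation fails. */
struct int_vec *int_vec_clone(const struct int_vec *src) {
    assert(src != NULL);
    size_t bytes = sizeof *src + src->len * sizeof src->data[0];
    struct int_vec *copy = malloc(bytes);
    if (copy)
        memcpy(copy, src, bytes);
    return copy;
}
```

With the pointer-based `struct foo` from earlier, the same clone would need a second allocation and a second copy; here the clone owns its own elements, so writing to the copy leaves the original untouched, and a single `free` releases everything.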
+
+
+[^why]: We may want this space to be contiguous for many reasons.
+One is to optimise the use of the processor cache.
+Another is the handling of network layers, which is well suited to the use of flexible arrays.
+
+[^zero]: Prior to the specification of flexible array members in C99, it was common practice to use arrays of size one to replicate the concept.
+
+
+A story of labels
+-----------------
+
+In C, if there's one thing we shouldn't talk about, it's *labels*.
+They are used with the `goto` statement. The forbidden one!
+
+To avoid them, we replace `goto` with more explicit statements such as `break` or `continue`.
+So we no longer use `goto`, and we never learn what a label is when we learn the C syntax.
+
+This is how `goto` and a label are used:
+
+
+~~~c
+int main(void) {
+    goto end;
+
+
+end: return 0;
+}
+~~~
+
+Basically, a label is a name given to a statement.
+Nowadays, we mainly use them in `switch` statements.
+
+~~~c
+
+switch( action ) {
+    case 0:
+        do_action0();
+    case 1:
+        do_action1();
+        break;
+    case 2:
+        do_action2();
+        break;
+    default:
+        do_action3();
+}
+~~~
+
+Here each `case` and the `default` are in fact labels.
+Except that you can't use them with `goto` since there is no name to refer to.
+
+
+:::question
+Why are you telling us this?
+:::
+
+Firstly, it's good to know that it's called a label.
+Secondly, because I'm going to tell you about a classic:
+*Duff's device*.
+
+It is a kind of unrolled, optimised loop.
+The goal is to reduce the number of loop tests (as well as the number of decrements).
+
+Here is the historical version written by Tom Duff:
+
+~~~c
+{
+    register n = (count + 7) / 8;
+    switch (count % 8) {
+    case 0: do { *to = *from++;
+    case 7:      *to = *from++;
+    case 6:      *to = *from++;
+    case 5:      *to = *from++;
+    case 4:      *to = *from++;
+    case 3:      *to = *from++;
+    case 2:      *to = *from++;
+    case 1:      *to = *from++;
+            } while (--n > 0);
+    }
+}
+~~~
+
+
+It doesn't matter what `register` means.
+Also, `to` is a peculiar pointer (it is never incremented), but that doesn't really matter here.
+
+Here, what I want to tell you about is that `do-while` loop in the middle of a `switch`.
+
+The test we try to avoid is `--n > 0`.
+
+In a plain loop, `n` would simply be `count`,
+and we would have to perform the test `count` times.
+The same goes for the decrement.
+
+
+That is to say:
+
+~~~c
+while( count-- > 0 )
+    *to = *from++;
+~~~
+
+By dividing by 8 (an arbitrary number), we also divide the number of tests and decrements by 8.
+However, if `count` is not divisible by 8, we have a problem: the first pass must not execute all 8 instructions.
+It would be nice to be able to jump directly to the 3rd instruction if only 6 instructions are left.
+
+And this is where labels can help us!
+Thanks to the `switch`, we can jump directly to the right instruction.
+
+We only have to label every instruction with the number of instructions remaining to do.
+Then we jump to that instruction with the `switch` statement.
+
+
+It is very rare to have to use this type of trick.
+It's mostly an optimisation from another time.
+But since I wanted to talk about this syntax, I had to talk about labels (or did I?).
+
+
+Complex numbers
+---------------
+
+Once again, we will study a syntax introduced by C99.
+More precisely, three complex number types were introduced.
+The type of a complex number is `double _Complex` (the other two follow the same pattern; I will only write about the `double` version).
+
+Thus in C, it is possible to declare a complex number like this:
+
+
+~~~c
+double complex point = 2 + 3 * I;
+~~~
+
+Here we find the special macros `complex` and `I` (defined in the `<complex.h>` header).
+The former is used to create a complex type while the latter is used to define the imaginary part of a complex number.
+
+In memory, a complex variable takes up twice as much space as the real type on which it is based.
+A complex variable is used like a normal variable.
+The arithmetic is intuitive since it is based on the real type.
+Note that it is recommended to use the macro `CMPLX` to initialise a complex number:
+
+
+~~~c
+double complex cplx = CMPLX(2, 3);
+~~~
+
+
+This gives better handling of the cases where the imaginary part (the one multiplied by `I`) would be `NAN`, `INFINITY` or a positive or negative zero.
+
+The `<complex.h>` header offers us a really simple way to use complex numbers.
+Indeed, many common functions for manipulating complex numbers are available.
+
+
+Generic macros
+--------------
+
+There is a way in C to have macros that are defined differently depending on the type of one of their arguments.
+This syntax is however "new" since it dates from the C11 standard.
+
+This genericity is achieved with generic selections based on the syntax ` _Generic ( /* ... */ )`.
+To understand the syntax, let's look at a simplistic example:
+
+~~~c
+#include <stdio.h>
+#include <limits.h>
+
+#define MAXIMUM_OF(x) _Generic ((x), \
+                          char: CHAR_MAX, \
+                          int: INT_MAX,   \
+                          long: LONG_MAX  \
+                      )
+
+int main(int argc, char* argv[]) {
+    int i = 0;
+    long l = 0;
+    char c = 0;
+
+    printf("%i\n", MAXIMUM_OF(i));
+    printf("%d\n", MAXIMUM_OF(c));
+    printf("%ld\n", MAXIMUM_OF(l));
+    return 0;
+}
+~~~
+
+
+Here we print the maximum that can be stored by each of the types we use.
+This is something that would not have been possible without the new keyword `_Generic`.
+To use this syntax, we pass 2 parameters to the keyword `_Generic`.
+The first is an expression whose type determines which expression is finally selected.
+The second is a sequence of associations between a type and an expression (type: expression), separated by commas.
+In the end, only the expression designated by the type of the first expression is evaluated.
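The selection rules above can be probed with a small sketch, which is not from the original article. The `TYPE_NAME` macro and the two helper functions are hypothetical names made up for illustration, and a C11 compiler is assumed. The sketch relies on two facts of the language: the first operand of `_Generic` is only inspected for its type and never evaluated, and a character constant such as `'a'` has type `int` in C.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper macro: maps the type of its argument to a string. */
#define TYPE_NAME(x) _Generic((x), \
    char:    "char",               \
    int:     "int",                \
    long:    "long",               \
    double:  "double",             \
    default: "something else")

/* Returns the name selected for a character constant.
 * In C, 'a' has type int, not char. */
const char *name_of_char_constant(void) {
    return TYPE_NAME('a');
}

/* Returns the value of n after n++ has been used as the first operand.
 * Only the type of n++ matters, so the increment never happens. */
int counter_after_selection(void) {
    int n = 0;
    const char *name = TYPE_NAME(n++);
    assert(strcmp(name, "int") == 0);
    return n;
}
```

Some compilers warn about the side effect in `TYPE_NAME(n++)` precisely because it is never executed. A type that is not listed, such as `float`, falls through to the `default` association.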
+
+A real-world example could be:
+
+~~~c
+#include <math.h>
+
+int powInt(int,int);
+
+#define POW(x,y) _Generic ((y), double: pow, float: powf, long double: powl, int: powInt)((x), (y))
+~~~
+
+
+There's not much more to say, except that it's possible to have the word `default` in the list of types, which will then match all unmentioned types.
+So a cleaner definition of the `POW` macro from earlier could be:
+
+~~~c
+int powIntuInt(int a, unsigned int b);
+double powIntInt(int a, int b);
+double powFltInt(double a,int b) { return pow (a,b); }
+double powfFltInt(float a,int b) { return powf(a,b); }
+double powlFltInt(long double a,int b) { return powl(a,b); }
+
+
+#define POWINT(x) _Generic((x), double: powFltInt,        \
+                                float : powfFltInt,       \
+                                long double: powlFltInt,  \
+                                unsigned int: powIntuInt, \
+                                default: powIntInt)
+#define POW(x,y) _Generic ((y), double: pow, float: powf, long double: powl, default: POWINT((x)) )((x), (y))
+~~~
+
+
+Too special characters
+----------------------
+
+
+Let's go back in time once more, to mention one last thing:
+a time when not all characters were as accessible as they are today on so many types of keyboards.
+
+Keyboards didn't necessarily have compose keys.
+Thus, it could be impossible to type the `#` character.
+
+The `#` character could then be replaced by the sequence `??=`.
+For each character used in the C language but missing from the keyboard, there was a `??`-based sequence called a trigraph.
+Another version, based on 2 *more readable* characters, is called a digraph.
+
+Here is a table summarising the trigraph and digraph sequences and the characters they represent.
+
+| Character | Digraph | Trigraph |
+| --------- | ------- | -------- |
+| `#`       | `%:`    | `??=`    |
+| `[`       | `<:`    | `??(`    |
+| `]`       | `:>`    | `??)`    |
+| `{`       | `<%`    | `??<`    |
+| `}`       | `%>`    | `??>`    |
+| `\`       |         | `??/`    |
+| `^`       |         | `??'`    |
+| `\|`      |         | `??!`    |
+| `~`       |         | `??-`    |
+| `##`      | `%:%:`  |          |
+
+
+The **main** difference between digraphs and trigraphs shows up inside a string: trigraphs are replaced there, digraphs are not.
+
+~~~c
+puts("??= is a hashtag"); // the trigraph is replaced: prints "# is a hashtag" when trigraphs are enabled
+puts("%:");               // the digraph is left alone: prints "%:"
+~~~
+
+These medieval mechanisms are still valid today in C. [^C23]
+So this line of code is perfectly valid:
+~~~c
+??=define FIRST arr<:0]
+~~~
+
+The only use of this syntax nowadays is to obfuscate source code very easily.
+Combine a ternary with a trigraph and a digraph, and you get absolutely unreadable code 😉
+
+~~~c
+printf("%d", a ?5??((arr):>:0);
+~~~
+
+[^C23]: In C23, trigraphs have been removed from the language.
+
+
+**Never use them in serious code**.
+
+## To conclude
+
+
+That's it, I hope you've learned something from this post. Don't forget to use these syntaxes sparingly.
+
+I would like to thank [Taurre](https://zestedesavoir.com/membres/voir/Taurre/) for validating this article, but also for his pedagogy on the forums for years, as well as [blo yhg](https://zestedesavoir.com/membres/voir/blo%20yhg/) for his careful proofreading.
+
+Note that you can (re)discover a lot of code abusing the C language syntax at [IOCCC](http://ioccc.org/winners.html). 😈
+