#include <assert.h>

int main(){
    int x;
    if (x <= 42){
        assert(x != 12345);
    }
}
is of course UB in C, even though under an "uninitialized is random" model the program is valid and never asserts (which is what the model checker concludes). (Even at -O1, clang gets rid of the whole function, including the ret instruction. I'm surprised it does not at least leave a ud2 behind for an empty function to help debugging, since that would cost nothing: https://godbolt.org/z/eK8cz3EPe )
I gave a talk about Frama-C a few years ago. It's very interesting technology, but not too useful at any scale (http://oirase.annexia.org/tmp/2020-frama-c-tech-talk.mp4) One aim we had with Bounds Checking GCC back in '95 was to make it possible to use with real programs. Although it was quite slow, any open source program of the era could use it (https://www.doc.ic.ac.uk/~phjk/BoundsChecking.html).
int x;
if (x < 42) { assert (x != 12345); }
isn't to have a checker that is clever enough to know that the assert will not go off, but to be informed that the automatic variable x is being accessed without being initialized!
Bugs like:
int main(){
    char buffer[10];
    buffer[10] = 0;
}
are so rare they are hardly worth bothering with. The more usual case is:

int get(char* buffer)
{
    return buffer[10];
}

void test() {
    char buffer[10];
    get(buffer);
}
I.e. the array bounds for buffer get lost in the function call. I have proposed a fix: https://www.digitalmars.com/articles/C-biggest-mistake.html
that is compatible with existing code.
And yet, nobody cares. Oh well! Instead, we have overly complex solutions like this:
https://developers.redhat.com/articles/2022/09/17/gccs-new-f...
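For a rough sense of the idea (this is not the syntax from the article, just an approximation in today's C of "carry the length along with the pointer"):

#include <assert.h>
#include <stddef.h>

typedef struct { char *ptr; size_t len; } char_slice;

static char slice_get(char_slice s, size_t i) {
    assert(i < s.len);            /* the bound travels with the pointer, so it can be checked */
    return s.ptr[i];
}

int main(void) {
    char buffer[10] = {0};
    char_slice s = { buffer, sizeof buffer };
    return slice_get(s, 10);      /* out of bounds: the assert fires instead of a silent read */
}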
No, it does not. It probably does constraint-based verification, or looks for "proofs" of possible error conditions or asserts. In the case shown, it is trivial to prove that an uninitialized variable could equal the asserted value.
I'm no expert at parsing what exactly is being tested, but it basically looks only able to prove that an individual function doesn't overrun buffers, and only if you tell it to assume that integer overflows can't happen (!). So I'm not impressed.
Intelligence is knowing that the assert will never go off.
Wisdom is knowing that it might.
[1]: https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Non-own...
extern int nondet (void);
int x = nondet();
if (x < 42) { assert (x != 12345); }
BTW, it's simple to implement.
T f(args)[..] { body }
This would be hard to parse. Then what about: T[..] f(args) { body }
But if adding [] after T is allowed in function declaration, then why not for arrays as well: T[..] a;
So the checker treats them as defined, but with an unknown value. You could have written this instead:
extern int unknown_int_value(void);
int x = unknown_int_value();
And leave unknown_int_value undefined (so it's not visible to the analyzer), or write a function and use x as a parameter. I suspect CBMC does this to have a convenient syntax for this frequent scenario. Apparently, it's used quite often, as in these examples: https://model-checking.github.io/cbmc-training/cbmc/overview...
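A minimal self-contained harness along those lines might look like this (my sketch, not from the thread; assuming the tool is run with its default assertion checking, e.g. cbmc harness.c):

#include <assert.h>

extern int unknown_int_value(void);    /* deliberately left without a body */

int main(void) {
    int x = unknown_int_value();       /* the analyzer treats the result as an unconstrained int */
    if (x < 42)
        assert(x != 12345);            /* holds: no int satisfies x < 42 && x == 12345 */
    return 0;
}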
It seems that CBMC is not intended to check production sources directly against C semantics, but to prove things about programs written in a C-like syntax.
Any invocation of undefined behavior should be considered an assertion failure as far as any model checker should be concerned. Compilers can--and will--treat undefined behavior as license to alter the semantics of your program without any constraint, and most instances of undefined behavior are clearly programmer error (there is no good reason to read uninitialized memory, for example).
Reading an uninitialized variable is not "subtle" undefined behavior. It's one of the most readily accessible examples not only of the kinds of undefined behavior that exist, but also of the ways compilers will mutilate your code just because you did it. To be honest, if your understanding of something as simple as the consequences of reading uninitialized memory is shaky while you're trying to prove code correct, that will completely undermine any trust I have in the validity of your proofs.
Your operating system likely feels entitled to assume that if you never wrote to this page of RAM you don't care what exactly is in it. After all what kind of lunatic reads a bunch of unknown data, says "Yeah, that's coincidentally what I wanted" and just leaves it unmodified? No, almost anybody would write data they want to keep instead. So, if you never wrote to this particular page of RAM and your OS finds it convenient to swap that page for a different one, no harm no foul right? But now the contents of your uninitialized variable changed!
The C standard text is barely sufficient for the purposes of compilers, but wholly insufficient for what a model checker would need to know.
That's an assumption on your part.
One very good reason might be to ensure that memory really is there given the optimistic allocation schemes and to force a page to be read into core. You could also write to it to get the same effect. Another good reason to do a read on uninitialized memory is to see what's still there from the previous program that ran. Now arguably that's a nefarious reason but it is still a good one and it may help uncover bugs in the underlying system.
Of course the toy example is a toy, and of course it does not "run" anything, but this is a sound technique: if there is an execution that violates a property, it will be found.
I highly suggest reading the paper that introduced it [1], it is remarkably clear!
[1] Edmund Clarke, Daniel Kroening, and Flavio Lerda. 2004. A Tool for Checking ANSI-C Programs. In 10th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS) (LNCS), 2004, Barcelona, Spain. Springer, 168–176. https://doi.org/10.1007/978-3-540-24730-2_15
In general, the things your compiler thinks are UB are not the same things your OS or CPU thinks are undefined.
This isn't perfect, of course. In this case, the compiler can rightly treat x as uninitialized, meaning that its value could be <= 42 at one point, and not be <= 42 at another point. Since the uninitialized variable isn't "pinned", it could technically be different values in different locations.
CBMC's machine model works differently. In this case, x is assigned a non-deterministic value. The branch condition creates a refinement of this value within the scope of that statement. If the branch is taken, then by SMT rules, x can't equal 12345, because it was already refined as being <= 42.
On its own, a model checker can miss situations like these. It's why I recommend -Wall -Werror -Wpedantic in conjunction with CBMC. The compiler should catch this as a warning, and it should be upgraded as an error.
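To make the "refinement" concrete, here is roughly how the branch condition can be written as an explicit assumption (a sketch, assuming CBMC's __CPROVER_assume intrinsic and the convention that a bodiless nondet_* function returns an arbitrary value):

#include <assert.h>

int nondet_int(void);              /* no body: CBMC picks a non-deterministic int */

int main(void) {
    int x = nondet_int();
    __CPROVER_assume(x <= 42);     /* the taken branch becomes a constraint on x */
    assert(x != 12345);            /* provable: x <= 42 already rules out 12345 */
    return 0;
}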
And yes, it does happen in practice; most famously it was mentioned in the "The strange details of std::string at Facebook" talk at CppCon 2016: https://youtu.be/kPR8h4-qZdk?t=1150&si=2R358wniZfxTJLmc
In hardware design, verification is done simultaneously with design, and semiconductor companies would be bankrupt if they did not verify the hell out of their designs before committing millions in manufacturing these chips.
Even among hobbyists this is getting traction with yosys. Perhaps it's time for programmers to adopt this kind of tooling so there will be less buggy software released...
Edit: nevermind, I mixed up MADV_DONTNEED with MADV_FREE
Depends on how you initialize them. If you do it all in one go, then yes. If you partially initialize it, then no: some of the padding bytes are guaranteed to be zero. (Observing this and maintaining this invariant is an exercise for the reader.)
CBMC works best with a contract-based programming style, where contracts are enforced by assertions and shadow functions exist to simplify model checking.
A single reply on HN is hardly a place where idiomatic styles can be expounded, but it is quite possible to build a resource-oriented idiomatic style that is reinforced with CBMC.
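As a rough illustration of that style (my own sketch, not code from the thread), pairing each acquire with a release and stating the contract with assertions gives CBMC something concrete to check, e.g. with --pointer-check and --memory-leak-check:

#include <assert.h>
#include <stdlib.h>

typedef struct widget { int value; } widget;

widget *widget_create(int value) {
    widget *w = malloc(sizeof *w);
    if (w != NULL)
        w->value = value;
    return w;                      /* contract: NULL, or a valid, initialized widget */
}

void widget_release(widget *w) {
    assert(w != NULL);             /* contract: callers never pass NULL */
    free(w);
}

int main(void) {
    widget *w = widget_create(7);
    if (w == NULL)
        return 1;
    assert(w->value == 7);
    widget_release(w);             /* dropping this line should be flagged as a leak */
    return 0;
}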
int get(int n, char (*buffer)[n]) { return (*buffer)[10]; }

int main() { char buffer[10]; get(10, &buffer); }
Then a run-time bounds checker or a model checker can find the bug: https://godbolt.org/z/a4ExKMn1s
cbmc /tmp/walter.c --bounds-check --pointer-check --function test
...
[get.pointer_dereference.5] line 3 dereference failure: pointer outside object bounds in buffer[(signed long int)10]: FAILURE
...
There are many footguns. People love their guns. Makes them feel powerful.

Well, duh: if you remove the most egregious footguns from C, you end up basically with Pascal, and here is Why Pascal Is Not My Favorite Programming Language.
And when you try to explain to them that C's abstract machine has the semantics of "when you omit a safety check, you actually make a promise to the compiler that this check is redundant, that you'll make sure in some other way that in no possible execution will the illegal values ever show up in this place, and the compiler will actually hold you to this promise", they throw tantrums and try to argue that no, it doesn't have these semantics, and even if it has, it's not the semantics originally intended by K&R, and it is the authors of the C compilers who are wrong (even though the authors of C compilers are the people who actually write both the C standards and the C compilers...)
Must have something to do with imprinting: I learned C as my 3rd language, and from the start I was taught that its UB is absolutely insane, so I've never felt the "nah, it should just do what I meant, dangit" denial about it, only "this is sad; how can I even know, while writing code, when I am making a baseless promise, especially when it's made by omitting something?" But like you said, people like their footguns.
- UB sanitizer https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
- Cerberus Semantics https://www.cl.cam.ac.uk/~pes20/cerberus/
What matters is if enough people in WG14 with voting rights care.
When the variable definition and its use are together in the same basic block, it's immediately obvious that the variable has a next-use on entry into the block, and that its definition hasn't supplied it with a value.
From there it gets complicated. GCC has a history of emitting overly zealous diagnostics in this area, warning about possibly uninitialized variables that can be proven not to be.
The difficulty is equivalent to the halting problem:
#include <stdio.h>

extern int invokes_undefined_behavior(void (*)(void));

void f(void)
{
    int x;
    if (invokes_undefined_behavior(f))
        x = 'A';
    putchar(x);
}
Which anyway was not an issue in the PL/I variants that predated C, nor in the Mesa/Modula-2 descendants from the mid-70s.
EDIT:
Walter's example with 1978's Modula-2.
MODULE Example;
FROM InOut IMPORT WriteString, Write, WriteLn;

PROCEDURE get(buffer : ARRAY OF CHAR): CHAR;
BEGIN
    RETURN buffer[10]; (* This will boom unless compiled with bounds checking disabled, in which case who knows *)
END get;

VAR
    buffer : ARRAY [0..9] OF CHAR;

BEGIN
    WriteString('The out of bounds character is');
    Write(get(buffer));
    WriteLn;
END Example.
No sane OS will do this. Any page that's handed to a process that was last written by a different process must be zero'd (or otherwise have every address initialized) by the OS to avoid leaking information across process boundaries. You could, in theory, have a page that was munmap'd by /this/ process be handed back to the same process to fill a request for a different virtual address without zeroing it, but I can't imagine that any OS tracks the last writer to enable this "optimization" in the few cases it would apply.
int get(char* buffer, int i)
{
    return buffer[i];
}

#include <stdio.h>

void test() {
    char buffer[10];
    get(buffer, getc(stdin));
}
sizeof(a)
sizeof(a)/sizeof(a[0])
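Presumably the point of the sizeof fragments above: with a plain pointer parameter the element count is gone, while a pointer to array still carries it. A small illustration (an assumed example, not code from the thread):

#include <stdio.h>

static void by_pointer(char *a) {
    printf("%zu\n", sizeof(a));                      /* sizeof(char *), not 10 */
}

static void by_array_pointer(char (*a)[10]) {
    printf("%zu\n", sizeof(*a) / sizeof((*a)[0]));   /* 10: the bound survives the call */
}

int main(void) {
    char buffer[10] = {0};
    by_pointer(buffer);
    by_array_pointer(&buffer);
    return 0;
}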
I don't think accepting the random initial value means that the value may continue to change until the first time you set a value.
Eh, I think most C programmers' frustrations with UB stem from knowing that the C standard has fundamental flaws and that modern compilers abuse that fact for "optimizations" on UB. This paper covers the topic pretty well [0].
I fully support Casey Muratori's viewpoint that undefined behavior should not exist in the standard[1]. Instead, the C standard should enumerate all valid behaviors compliant compilers can implement. This would allow compilers to make unintuitive optimizations for the platforms that need them, but still allow programmers to be certain that their program semantics will not change in different versions of the said compiler.
[0] https://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_20... [1] https://youtu.be/dyI0CwK386E?si=vsqJ8uWHY8xkGmFm
Reasoning about nondeterministic values should be reserved for situations where the behavior is something other than undefined. For instance, accessing an uninitialized structure member that came from malloc isn't undefined behavior. It's (I think) unspecified behavior. In an implementation that has trap representations for some types, it could hit a trap.
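A small example of the distinction being drawn (my sketch; the standardese details vary a bit between C revisions):

#include <stdio.h>
#include <stdlib.h>

struct pair { int a; int b; };

int main(void) {
    int x;                        /* automatic, never assigned, address never taken:
                                     reading it is undefined behavior */
    /* printf("%d\n", x); */      /* so this line stays commented out */

    struct pair *p = malloc(sizeof *p);
    if (p == NULL) return 1;
    p->a = 1;                     /* p->b stays indeterminate */
    printf("%d\n", p->a);
    /* printf("%d\n", p->b); */   /* indeterminate/unspecified value; could trap on
                                     implementations with trap representations */
    free(p);
    return 0;
}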
int get(int n, char buf[n])
{
    return buf[10];
}
It's only if you want to use sizeof that you need a pointer to array. But if you have n, why would you need to query the size of your array?

Weird things happen to uninitialized variables during optimization. Basically, nothing can be assumed about the value, because using an uninitialized variable results in undefined behavior. The original un-optimized generated code may very well have some default value, maybe. Probably not. But the optimizer can decide that this value is represented by register R0 here and R4 there, and since the variable is uninitialized, there's no point in doing a spill or a load. It really comes down to how the optimization passes were written.
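A tiny sketch of why the value can't be treated as pinned (this program has undefined behavior, so any output, including both lines or neither, is a conforming result):

#include <stdio.h>

int main(void) {
    int x;                         /* never assigned */
    if (x <= 42) puts("x <= 42");
    if (x >  42) puts("x > 42");   /* the optimizer may read "x" from different registers
                                      for each test, so both or neither line may print */
    return 0;
}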
Practically speaking, -Wall -Werror should catch this. Any use of a tool like CBMC should be part of a defense in depth strategy for code safety.
It does in clang.
$ cc -Wall -Werror bar.c
bar.c:5:11: error: variable 'x' is uninitialized when used here [-Werror,-Wuninitialized]
if (x <= 42){
^
bar.c:4:12: note: initialize the variable 'x' to silence this warning
int x;
^
= 0
$ cc --version
clang version 16.0.6
It also does in gcc.

$ gcc -Wall -Werror bar.c
bar.c: In function 'main':
bar.c:5:10: error: 'x' is used uninitialized [-Werror=uninitialized]
5 | if (x <= 42){
| ^
bar.c:4:11: note: 'x' was declared here
4 | int x;
| ^
cc1: all warnings being treated as errors
$ gcc --version
gcc 12.2.0
Do they actually know that? I don't think so. Let's take [0] for instance (the author is an experienced C programmer who loves the language and wrote a re-implementation of bc in it):
There seemed to be a lot of misunderstandings; I could not get a handle on what this person thought UB meant.
I finally figured it out: this person’s definition of UB was not “the language spec can’t guarantee anything.” Instead, it was “compilers can assume UB does not exist and optimize accordingly.”
Wat.
Yep. Apparently, that was news to him, even though C implementations have been behaving like that for about 30 years already, and you can read the rest of the post for the "this is evil, we the users must do something about it" take. And the proposed "something" is not "we should instead use a language with actually defined semantics", oh no. It's "use compiler flags to force more reasonable behaviour, hopefully" and "somebody should write boringcc; unfortunately, I am myself a bit too busy for that". Well, despite numerous pleas and several attempts, nobody has managed to write boringcc, which is telling of something; I'm just not sure of what exactly.

So... I think it is about imprinting: "Oh, it's a wonderful language; well, it would be if it were actually implemented the way I used to think it is implemented (and I still think it should be implemented that way), but still, it's a wonderful language if only not for that pesky reality" is forcing unfounded expectations onto reality, and where do these expectations even come from in the first place?

And yeah, I fully agree with you that the C standard should've probably done that. But it didn't happen, and it can't happen because of backwards compatibility [1]. But even then, C programs would still be non-portable, in the sense that you have to use #ifdefs to tinker with platform-specific behaviour for anything interesting, because the C standard even today leaves a lot of stuff completely up to the implementation; see [2] for an especially appalling example, even without touching UB.
[0] https://gavinhoward.com/2023/08/the-scourge-of-00ub/
[1] https://thephd.dev/your-c-compiler-and-standard-library-will...
[2] https://thephd.dev/conformance-should-mean-something-fputc-a...
Whether or not this is a sane OS I leave an exercise to the reader, but it is nonetheless the property of a common OS.
LLVM has a "freeze" command to stop propagation of undefined values (although I think that command was added later than the first version), so that the value is "pinned" as you say. However, the "undef" command, if you do not use "freeze", will not do this.
I think that the C compiler should "pin" such undefined values where they are used, but I don't know which compilers have an option to do this. (Perhaps CBMC should also have such a switch, so that you can use the same options that you will use with the C compiler.)
With this ex.3 file, the optimizer should be allowed to result in a function that does nothing and has an unspecified return value (which might or might not be the same each time it is executed). (If optimizations are disabled, then it should actually compile the conditional branch instruction.)
But I was also talking about UB in the context of already existing code! My argument is that compiler writers are breaking existing code when they could very well avoid breaking it.
We have tried to convince compiler writers to not do this, but they have refused. So that's why I said that users must do something: because compiler writers won't.
And yes, I knew compilers would take every advantage that they could before that; my surprise was that someone considered UB's sole purpose to be for the benefit of compiler writers at the expense of everyone else, including non-programmers who suffer catastrophic consequences for security bugs.
For something I don't use?
How about this: I would suggest the CBMC developers read HN comments about their stuff, when it comes up.
John Regehr has a blog post that touches upon why this is the case [0]:
> I’ll assume you’re familiar with the Proposal for Friendly C and perhaps also Dan Bernstein’s recent call for a Boring C compiler. Both proposals are reactions to creeping exploitation of undefined behaviors as C/C++ compilers get better optimizers. In contrast, we want old code to just keep working, with latent bugs remaining latent.
> After publishing the Friendly C Proposal, I spent some time discussing its design with people, and eventually I came to the depressing conclusion that there’s no way to get a group of C experts — even if they are knowledgable, intelligent, and otherwise reasonable — to agree on the Friendly C dialect. There are just too many variations, each with its own set of performance tradeoffs, for consensus to be possible.
Granted, at the end he says that it's not that it's impossible, it's just not for him, so it's still possible that someone else would succeed, but they'd have an uphill battle for sure.
(there's also usually a write, unless the allocation-head-metadata is on a different page than the parts of the allocation you're acting spooky with)
Edit: after watching the video elsewhere in thread, it was indeed crossing page boundary, but that behavior has to be a kernel bug since the documentation says "accesses" are sufficient to repopulate it.
use of the variable as an lvalue is not undefined
So, it will find a counter-example.
> However, subsequent writes to pages in the range will succeed and then kernel cannot free those dirtied pages, so that the caller can always see just written data. If there is no subsequent write, the kernel can free the pages at any time. Once pages in the range have been freed, the caller will see zero-fill-on-demand pages upon subsequent page references.
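A hypothetical way to poke at that behavior (a Linux-specific sketch; MADV_FREE needs a reasonably recent kernel, and whether the page is actually dropped depends on memory pressure):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    memset(p, 'A', len);            /* dirty the page */
    madvise(p, len, MADV_FREE);     /* tell the kernel it may reclaim it lazily */

    /* Under memory pressure the kernel may drop the page; a later read then sees
       zero-fill-on-demand. Without pressure, the old 'A's may still be there. */
    printf("%d\n", p[0]);           /* 65 ('A') or 0, depending on the kernel's choice */

    munmap(p, len);
    return 0;
}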
The sensible way for developers to become aware of possible improvements they could make is for them to constantly scan the rest of the internet.
The programmer will have to add code to keep track of the length of the array, and add code to check the value of the index. The good thing is the analyzer will tell you where you'll need to add those checks.
With the proposal I made, none of this is necessary. It just works. Over two decades of experience with this approach in D shows that it just works.
This led to the creation of function contracts and shadow functions. When evaluating functions, we actually want any calls that it makes to go to shadow functions instead of real functions. When we write a function, we include a contract that includes anything it may return, any side-effects it may have, and any memory it may touch / allocate / initialize. We then write a twin function -- its shadow function -- that technically follows the same contract in the most simplified terms, randomly choosing whether to succeed or fail. From CBMC's perspective, it's equivalent enough for model checking. But, it removes additional stack depth or recursion.
A good example of this would be a shadow function for the POSIX read function. Its contract should verify that the descriptor it is passed is valid. It should also assert that the memory region it is passed is bounded by the size it is given. The shadow function picks a non-deterministic state based on how it should set errno and how it should return. This state follows the POSIX standard, but the underlying system calls don't matter. Likewise, depending on the return code, it should initialize a number of bytes in the read buffer with non-deterministic values.
I used this shadow read function to find a buffer overflow in networking code and also to find a DOS infinite loop error in a tagged binary file format reader we were using.
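For a rough idea of the shape of such a shadow function (this is my sketch, not the one described above; it assumes CBMC's __CPROVER_assume/__CPROVER_assert/__CPROVER_w_ok intrinsics and the bodiless nondet_* convention, and it simplifies the errno handling):

#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

ssize_t nondet_ssize_t(void);
int     nondet_int(void);
char    nondet_char(void);

/* Shadow of read(2): same contract as far as callers can observe, no real I/O. */
ssize_t shadow_read(int fd, void *buf, size_t count) {
    __CPROVER_assert(fd >= 0, "descriptor must be valid");
    __CPROVER_assert(__CPROVER_w_ok(buf, count), "buffer must be writable for count bytes");

    ssize_t rc = nondet_ssize_t();
    __CPROVER_assume(rc >= -1 && rc <= (ssize_t)count);   /* POSIX: -1 on error, else 0..count */

    if (rc < 0) {
        errno = nondet_int();        /* a real shadow would constrain this to the POSIX errno values */
    } else {
        for (ssize_t i = 0; i < rc; i++)
            ((char *)buf)[i] = nondet_char();   /* only the first rc bytes become defined */
    }
    return rc;
}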
CBMC isn't perfect, but when coupled with a good linter and -Wall / -Werror / -Wpedantic, it is a very useful layer in a defense in depth strategy for safer software.
And I'm not talking about a default value; this doesn't imply a default value, even though assigning an address will come with some value. But I do get that defining the name and type is a separate thing that doesn't necessarily require assigning an address any more than assigning a value.
The difference is, any default value would be arbitrary. 0 is just a value that has a meaning that the compiler can't know. It's almost as arbitrary as 37.
But picking the next available address to attach to the symbol is not like that.
The compiler always picks that itself and the programmer always has to ask what it is when they want to know it. The compiler can safely do that much right at declare time without making any assumptions. The value is still random, which is fine because 0 is almost the same as random, in that it's random if 0 would actually be the best default value for this particular item vs 1, -1, size, -size, size/2, etc.
Oh.
I feel like I should have more to say than that, but... oh.
Changing n later has no impact on the size of the array, and the compiler is perfectly able to remember the original size.
I interpreted kazinator's original comment as essentially saying "This looks like something I might use myself, but for this dealbreaker issue."
The machine model tracks the array with its size and compares this to any index. This is loaded into an SMT solver. This is not a standard static analyzer. It's compiling the code to an abstract machine model and then running through this model with an SMT solver.
CBMC isn't perfect, but the strategies you are talking about are meant to defeat a linter or a shallow analyzer in a compiler. CBMC isn't a linter.
> The compiler always picks that itself
One of the best analogies I heard, years ago on IRC, is that good source code is like select cuts of meat and fat, and the compiler is like a meat grinder. What comes out of the meat grinder is hamburger. It bears little resemblance to the source code, syntax, or semantics. But, it is meat.
I'll talk about register machines with a stack, since code is generated this way on most architectures. But that's not necessarily a thing that can be relied upon. The compiler will preserve "relevant meaning" once the optimization passes are finished. Side-effects and return values (if used by the caller, in the case of LTO) may be preserved, but that's largely it.

That being said, let's talk about what a variable is, from the perspective of a compiler. A variable is decomposed into a value, which may be referenced, may be changed, or may need to be preserved. The optimizer aggressively strips away anything that isn't immediately needed. If the compiler needs to preserve the value of the register, then it can choose several ways of doing this. If the value is fixed, then it may just become a constant in code. Fixed values can also be produced synthetically. If the value is variable, set at runtime and referenced later, then it could be set in one register and transferred to another register. If neither is possible, then the value could be spilled to the stack. But this spill location is not necessarily assigned to this variable. This spill location could be used for another variable later in the function when this one falls out of scope. Spill locations aren't necessarily fixed. On many ABIs, there are scratch registers, registers that must be preserved between calls (the called function must preserve them if used), and registers that can be overwritten after a call. How a value is preserved depends entirely on whether it must be preserved, and which is the most efficient way to do so.
If a variable in source code is referenced but never assigned, then the optimizer can optimize away any "storage" (i.e. stack or a register) for this variable. When the variable is "read", the value used is arbitrary, if at all. The "read" has no meaning and has been optimized away. The value isn't fixed to a random value either at compile time or at runtime. It's whatever flotsam happens to be at that particular place in code at that particular time. If the value is "read" once in one location and once again in another location, even on the next line of source code, these values could be the same or entirely different. There's no telling what the compiler will do, because the optimizer is really free to interpret this in any way. The only way to ensure that a value is preserved by the optimizer is to explicitly assign this variable a value in the source code. This assignment is respected, assuming that the variable is referenced afterward. Beyond that, we can't make any meaningful assumptions.
I'm not a user of CBMC. I read about it here, I wrote a comment here, and that's the end of it.
But for a minute I will bite. The submission has no link to the actual project. I found that by a web search. The project has no project trappings: no mailing list, or anything of the sort. The page says, "For questions about CBMC, contact Daniel Kroening". The name is a link to his home page. His home page has no obvious contact information. There are links to papers; probably his e-mail address is given in some of them. Let's click on "A Tool for Checking ANSI-C Programs". Hey, e-mail links in that page, before we even get to the PDF. But ... from 2004. Is k-----g@cs.cmu.edu from two decades ago going to work, given that he's at Oxford?
Suppose I hunt down his current e-mail address. What do I say? "Dear Sir, I feel really awkward to be contacting you about your CBMC program, which I have never actually used, but believe me when I say I have a valuable suggestion ..."
If I used the thing, I'd fix it myself and send a patch!
x = 42
y = w
x = 43
Search the web for "next-use information".

I would not mail if it is only about uninitialized variables.
SMT cannot solve cases where it simply cannot know what the length of an array is. For example, when array is allocated in code not available to the solver, or is set by the environment.
I remember trying out SPARK a few years ago, which advertised its compile-time checking abilities. I tried using an integer equation that used an OR, and it gave up on it.
If you need to add in range checking code to help the solver along, then the range check itself is a source of bugs, and the range limit has to be made available somewhere and be correct.
In my proposal, the array length is set at the time of creation of the array, and the length is carried along with the pointer. The solver's role then becomes one of eliminating the bounds check in the cases where it can prove the index is within range. The user doesn't have to do anything.
We've been using it in the implementation of the D compiler for a very long time now, and problems with buffer overflows are a faded memory.
P.S. I also added manual integer overflow checks when passing the size to malloc(), no matter how unlikely an overflow would be :-)
Sure it can. It sets the array length to a non-deterministic unsigned integer, and then finds a counter-example where this array length is invalid. CBMC will also make the array pointer itself non-deterministic if you haven't refined it. So, the pointer could be NULL, could be pointing to invalid memory, etc.
Don't think of values as fixed numbers, but rather as functions shaping a non-deterministic value. The range check made upstream becomes part of this function. The goal of the SMT solver is to find a counter-example that crashes the machine model. If you want a rather leaky abstraction, imagine that by transforming the source code into an SMT equation, it's effectively turning it inside out and transforming it into a logic equation with modulo math.
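A small sketch of that idea (an assumed example: the early range check is the "function shaping the value", and something like cbmc file.c --function harness --bounds-check --pointer-check should then prove the access safe, while deleting the check should produce a counter-example):

#include <stddef.h>

size_t nondet_size_t(void);           /* no body: any possible index */

int get(const char *buf, size_t len, size_t i) {
    if (buf == NULL || i >= len)      /* this check constrains i for everything below */
        return -1;
    return buf[i];                    /* provably within [0, len) here */
}

int harness(void) {
    char buffer[10] = {0};
    return get(buffer, sizeof buffer, nondet_size_t());
}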
I agree with you that it would be nice if bounds were included in C arrays. But, we have to work with what we have. Unfortunately, especially in firmware that requires a commercial C compiler, we are stuck with C. In those cases, if we want safer software, we need tooling like this.
There might be some behaviors that sanitizers don't currently catch but are used for optimizations as well, and boringcc/friendly C might be useful for those? Not entirely sure what those might be, if they even exist, though; IIRC sanitizers don't currently flag strict aliasing violations, but -fno-strict-aliasing exists so there's no need for a new compiler/dialect (for GCC/Clang?)
I agree to a certain extent, but at what point are you putting in more effort than it would take to migrate to a stricter language that offers more safety by default?
I'd say that the added effort is minimal. The compiler spits out more errors. We think a little more carefully about how we write functions so CBMC doesn't complain. There are no silver bullets.
>While SMT techniques have been traditionally used to support deductive software verification, they are now finding applications in other areas of computer science such as, for instance, planning, model checking and automated test generation. Typical theories of interest in these applications include formalizations of arithmetic, arrays, bit vectors, algebraic datatypes, equality with uninterpreted functions, and various combinations of these.
I do recommend standardizing on hardware and toolchain for projects like these, as it can ensure that the machine code is matched.
The third phase of my work will round the corner with machine code analysis. Currently, I'm working on constructive proofs and equivalence proofs of imported C code to handle the pieces that CBMC doesn't do so well, as part of my second phase. So far, I can extract efficient C from Lean 4, but I want to import C directly into Lean 4 to prove it equivalent to extracted code. Hand-written C is easier to read than the line noise I can extract.
In particular, model checking doesn't fare well with data structures, algorithms, and cryptography. These can be torn down quite a bit, and at least we can verify that the algorithms don't leak memory or make bad integer assumptions. But, if we want to verify other properties, this requires constructive proofs.
Between 80 and 95 percent of the time, depending on the type of code being written, CBMC is enough. I'm now solving for that 5 - 20 percent of code that CBMC can't easily tackle.
To the contrary, that's what I'm using it for in most of my projects. It found interesting algorithmic bugs in my ctl find function (a 3-way comparison callback missing cases), and it is used to crack poor hashes. My tiny-regex matcher also profited from it.
Also a lot of baremetal firmware.
What do you mean by standardizing on hardware and toolchain? I am currently choosing ghidra and cbmc, and it seems like the approach is applicable to any compiler or arch that ghidra supports without too much change. I have only been doing x86-64 and gcc so far though
I agree that if one isn't going to enhance C, one is going to have to resort to these tools.
C gets new features now and then. Why not add something incredibly useful, like the slice proposal? Instead, C23 got enhanced with the crazy Unicode identifiers. Richard Cattermole has been adding them to D's C support, requiring 6000 lines of code!!
https://github.com/dlang/dmd/pull/15307
The entire C parser is 6000 lines of code:
https://github.com/dlang/dmd/blob/master/compiler/src/dmd/cp...
In the meantime, check out my github. https://github.com/nanolith
Currently, I'm feeding commits into libcparse, which is the start of this effort. That stream is currently about 180 days out of phase with what's in my local repositories, but it is slowly trickling in. The first step of this second phase will be to use libcparse to verify itself via constructive proofs in Lean. Currently, and by that I mean what hits in March or April, libcparse has a mostly C18 compliant lexical scanner for the C preprocessor. The goal is to have a SAX-like C18 parser that can detect all C declarations and definitions. Tooling will include a utility that imports C declarations and definitions into both Lean 4 and Coq. This is a moonlighting effort for an upcoming book I intend to write.
OpenBSD, in particular, is adamant about testing their OS on real hardware for platforms that they support. Their definition of "working" is on real hardware. I think that's a reasonable goal.
It helps with analysis like this, as you don't have to worry about different compilers or different hardware. GCC, in particular, can produce different output depending on which hardware it was compiled on, unless a specific configuration is locked down. If there is a flaw in how it does vector operations on a particular Xeon processor, this won't be caught if testing locally on a Core i5 processor that produces different code.
I wondered if you were implementing a normalization (which is complicated), but that doesn't seem to be the case. I have read the PR, and it's clear that it solves multiple issues at once: no start-continue distinction (a bug), a code generator, and the table itself (which is just an inefficiently formatted block of data). 6000 lines of code just for Unicode identifiers is simply an exaggeration.
6000 is also a round number, I did that deliberately to imply the number was not exact.
Actually, this spec security bug was added with C11. But nobody really implemented it until gcc decided that this new "feature" needed to be supported, around C23 time. C23 tried to mitigate that mess by disallowing non-identifiable normalization variants, but still does not care about the Unicode security guidelines.
[1] https://github.com/dlang/dmd/pull/15307/files#diff-3677bcc89...