compiler/cmm/CmmPipeline.hs

Note [Sinking after stack layout]

In the past we considered running the sinking pass before stack layout as well, but after making some measurements we realized that:

a) running sinking only before stack layout produces slower
   code than running sinking only after stack layout

b) running sinking both before and after stack layout produces
   code that has the same performance as when running sinking
   only after stack layout.

In other words, sinking before stack layout doesn’t buy us anything.

An interesting question is “why is it better to run sinking after stack layout”? It seems that the major reason is the stores and loads generated by stack layout. Consider this code before stack layout:

c1E:
    _c1C::P64 = R3;
    _c1B::P64 = R2;
    _c1A::P64 = R1;
    I64[(young<c1D> + 8)] = c1D;
    call stg_gc_noregs() returns to c1D, args: 8, res: 8, upd: 8;
c1D:
    R3 = _c1C::P64;
    R2 = _c1B::P64;
    R1 = _c1A::P64;
    call (P64[(old + 8)])(R3, R2, R1) args: 8, res: 0, upd: 8;

The stack layout pass will save all local variables live across a call (_c1C, _c1B and _c1A in this example) on the stack just before making the call and reload them from the stack after returning from it:

c1E:
    _c1C::P64 = R3;
    _c1B::P64 = R2;
    _c1A::P64 = R1;
    I64[Sp - 32] = c1D;
    P64[Sp - 24] = _c1A::P64;
    P64[Sp - 16] = _c1B::P64;
    P64[Sp - 8] = _c1C::P64;
    Sp = Sp - 32;
    call stg_gc_noregs() returns to c1D, args: 8, res: 8, upd: 8;
c1D:
    _c1A::P64 = P64[Sp + 8];
    _c1B::P64 = P64[Sp + 16];
    _c1C::P64 = P64[Sp + 24];
    R3 = _c1C::P64;
    R2 = _c1B::P64;
    R1 = _c1A::P64;
    Sp = Sp + 32;
    call (P64[Sp])(R3, R2, R1) args: 8, res: 0, upd: 8;

If we don’t run the sinking pass after stack layout, we are basically left with this code. However, running sinking on this code can lead to significant improvements:

c1E:
    I64[Sp - 32] = c1D;
    P64[Sp - 24] = R1;
    P64[Sp - 16] = R2;
    P64[Sp - 8] = R3;
    Sp = Sp - 32;
    call stg_gc_noregs() returns to c1D, args: 8, res: 8, upd: 8;
c1D:
    R3 = P64[Sp + 24];
    R2 = P64[Sp + 16];
    R1 = P64[Sp + 8];
    Sp = Sp + 32;
    call (P64[Sp])(R3, R2, R1) args: 8, res: 0, upd: 8;

Now we only have 9 assignments instead of 15.
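
The effect on this example can be reproduced with a small toy model. The sketch below is not GHC's sinking pass: the `Expr` and `Stmt` types, the `sink` function and the `clobbers` check are all hypothetical names invented for illustration. It assumes that every local register is assigned once and used at most once later in the same block. It delays each local assignment and substitutes it at its use, flushing a delayed assignment only when a later statement would overwrite something it reads.

```haskell
-- Toy model of sinking inside one basic block. All types and names are
-- hypothetical illustrations, not GHC's Cmm representation.
-- Assumption: each local register is assigned once and used at most once
-- later in the same block.
import Data.List (partition)

data Expr
  = Local String   -- local register, e.g. _c1A
  | Global String  -- global register, e.g. R1
  | Load Int       -- P64[Sp + offset]
  | Label String   -- code label, e.g. c1D
  deriving (Eq, Show)

data Stmt
  = AssignLocal String Expr   -- _x = e
  | AssignGlobal String Expr  -- R = e
  | Store Int Expr            -- [Sp + offset] = e
  | AdjustSp Int              -- Sp = Sp + n
  deriving (Eq, Show)

-- Delayed local assignments, oldest first.
type Pending = [(String, Expr)]

-- Replace a local by its delayed definition and drop that definition.
substExpr :: Pending -> Expr -> (Expr, Pending)
substExpr pending (Local x) =
  case lookup x pending of
    Just e  -> (e, filter ((/= x) . fst) pending)
    Nothing -> (Local x, pending)
substExpr pending e = (e, pending)

substStmt :: Pending -> Stmt -> (Stmt, Pending)
substStmt p (AssignGlobal g e) = let (e', p') = substExpr p e in (AssignGlobal g e', p')
substStmt p (Store o e)        = let (e', p') = substExpr p e in (Store o e', p')
substStmt p s                  = (s, p)

-- Would executing the statement change the value of a delayed expression?
-- Deliberately conservative: any store or Sp change invalidates any load.
clobbers :: Stmt -> Expr -> Bool
clobbers (AssignGlobal g _) (Global g') = g == g'
clobbers (Store _ _)        (Load _)    = True
clobbers (AdjustSp _)       (Load _)    = True
clobbers _                  _           = False

sink :: [Stmt] -> [Stmt]
sink = go []
  where
    emit = map (uncurry AssignLocal)
    -- Definitions that were never used are emitted unchanged at the end.
    go pending [] = emit pending
    go pending (AssignLocal x e : rest) =
      let (e', pending') = substExpr pending e
      in go (pending' ++ [(x, e')]) rest
    go pending (s : rest) =
      let (s', pending')  = substStmt pending s
          (flushed, kept) = partition (clobbers s' . snd) pending'
      in emit flushed ++ s' : go kept rest

-- The two blocks from the example, after stack layout (calls omitted).
blockE, blockD :: [Stmt]
blockE =
  [ AssignLocal "_c1C" (Global "R3"), AssignLocal "_c1B" (Global "R2")
  , AssignLocal "_c1A" (Global "R1"), Store (-32) (Label "c1D")
  , Store (-24) (Local "_c1A"), Store (-16) (Local "_c1B")
  , Store (-8) (Local "_c1C"), AdjustSp (-32) ]
blockD =
  [ AssignLocal "_c1A" (Load 8), AssignLocal "_c1B" (Load 16)
  , AssignLocal "_c1C" (Load 24), AssignGlobal "R3" (Local "_c1C")
  , AssignGlobal "R2" (Local "_c1B"), AssignGlobal "R1" (Local "_c1A")
  , AdjustSp 32 ]
```

In this model `blockE` and `blockD` contain 8 and 7 assignments, while `sink blockE` and `sink blockD` contain 5 and 4, which matches the 15 and 9 counted above.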

There is one case when running sinking before stack layout could be beneficial. Consider this:

L1:
    x = y
    call f() returns L2
L2: …x…y…

Since both x and y are live across a call to f, they will be stored on the stack during stack layout and restored after the call:

L1:
    x = y
    P64[Sp - 24] = L2
    P64[Sp - 16] = x
    P64[Sp - 8] = y
    Sp = Sp - 24
    call f() returns L2
L2:
    y = P64[Sp + 16]
    x = P64[Sp + 8]
    Sp = Sp + 24
    …x…y…

However, if we run sinking before stack layout we would propagate the definition of x to its use sites (both x and y must be local registers for this to be possible; global registers cannot be floated past a call):

L1:
    x = y
    call f() returns L2
L2: …y…y…

This makes x dead at the call to f(). If we ran stack layout now, we would generate fewer stores and loads:

L1:
    x = y
    P64[Sp - 16] = L2
    P64[Sp - 8] = y
    Sp = Sp - 16
    call f() returns L2
L2:
    y = P64[Sp + 8]
    Sp = Sp + 16
    …y…y…

But since we don’t see any benefits from running sinking before stack layout, this situation probably doesn’t arise too often in practice.
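
The frame sizes in this example follow from a simple count: one word for the return address plus one word per local live across the call. The sketch below illustrates that arithmetic only. `frameBytes`, `propagateCopy` and `usesInL2` are hypothetical names, a 64-bit word is assumed, and every local used in L2 is assumed to be defined before the call.

```haskell
-- Toy frame-size arithmetic for the L1/L2 example. Names are hypothetical;
-- this is not how GHC's stack layout pass is implemented.
import Data.List (nub)

-- Assumption: 64-bit target, so each stack slot is 8 bytes.
wordSize :: Int
wordSize = 8

-- Model the effect of sinking the copy "dst = src" into its uses:
-- every use of dst becomes a use of src.
propagateCopy :: (String, String) -> [String] -> [String]
propagateCopy (dst, src) = map (\v -> if v == dst then src else v)

-- One slot for the return address plus one per distinct local used after
-- the call (assumed to be defined before the call, hence live across it).
frameBytes :: [String] -> Int
frameBytes usesAfterCall = wordSize * (1 + length (nub usesAfterCall))

-- Locals used in block L2 of the example.
usesInL2 :: [String]
usesInL2 = ["x", "y"]
```

Here `frameBytes usesInL2` gives the 24-byte frame of the first layout, and `frameBytes (propagateCopy ("x", "y") usesInL2)` gives the 16-byte frame obtained once x is no longer live across the call.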

Note [inconsistent-pic-reg]

On x86/Darwin, PIC is implemented by inserting a sequence like

call 1f
1: popl %reg

at the proc entry point, and then referring to labels as offsets from %reg. If we don’t split proc points, then we could have many entry points in a proc that would need this sequence, and each entry point would then get a different value for %reg. If there are any join points, then at the join point we don’t have a consistent value for %reg, so we don’t know how to refer to labels.

Hence, on x86/Darwin, we have to split proc points, and then each proc point will get its own PIC initialisation sequence.

This isn’t an issue on x86/ELF, where the sequence is

call 1f
1: popl %reg
addl $_GLOBAL_OFFSET_TABLE_+(.-1b), %reg

so %reg always has a consistent value: the address of _GLOBAL_OFFSET_TABLE_, regardless of which entry point we arrived via.

Note [unreachable blocks]

The control-flow optimiser sometimes leaves unreachable blocks behind containing junk code. These aren’t necessarily a problem, but removing them is good because it might save time in the native code generator later.
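
Removing unreachable blocks amounts to a reachability walk from the entry block. The following is a minimal sketch over a simplified graph, where each block is reduced to its label and the labels it can jump to. The `Graph` type and the functions `reachable` and `removeUnreachable` are hypothetical and do not correspond to GHC's own graph representation or API.

```haskell
-- Toy unreachable-block removal. The graph representation and all names
-- are hypothetical illustrations, not GHC's CmmGraph.
import Data.Maybe (fromMaybe)

type Label = String

-- Each block is reduced to its label and the labels of its successors.
type Graph = [(Label, [Label])]

-- Depth-first walk from the entry label, returning labels in visit order.
reachable :: Graph -> Label -> [Label]
reachable g entry = go [entry] []
  where
    go [] seen = reverse seen
    go (l : todo) seen
      | l `elem` seen = go todo seen
      | otherwise     = go (fromMaybe [] (lookup l g) ++ todo) (l : seen)

-- Keep only the blocks reachable from the entry, in their original order.
removeUnreachable :: Label -> Graph -> Graph
removeUnreachable entry g = filter ((`elem` live) . fst) g
  where
    live = reachable g entry

-- Example: "junk" jumps to "c" but nothing jumps to "junk".
example :: Graph
example =
  [ ("entry", ["a", "b"])
  , ("a", ["c"])
  , ("b", ["c"])
  , ("c", [])
  , ("junk", ["c"])
  ]
```

With this input, `removeUnreachable "entry" example` keeps the blocks `entry`, `a`, `b` and `c` and drops `junk`. The list-based lookups are quadratic and only suitable for illustration.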