My first: LLVM program

2020-04-11

I recently started reading through Stephen Diehl’s ‘Implementing a JIT Compiled Language with Haskell and LLVM’. I thought it might be fun to write a quick experience report from my getting started with LLVM and remembering how to build and link stuff manually.

Setup

I’m using Ubuntu under WSL on Windows 10. This works nicely with VS Code’s remoting system - I can use the Linux terminal directly inside VS Code on my Windows machine.

VSCode running LLVM inside WSL

(Side note: I forked the Kaleidoscope repository to get it to compile under Stackage LTS 14.27. In any case, you won’t need any Haskell code to follow this particular article!)

To get started, let’s install LLVM 9 and accoutrements.

sudo apt install llvm-9 llvm-9-dev clang-9

A trivial LLVM program

Let’s write a simple program that prints the character ? to stdout. To do this, we can use the putchar function provided by libc.

program.ll

declare i32 @putchar(i32)

define void @main() {
    call i32 @putchar(i32 63)
    ret void
}

We’ve just written our first program in the LLVM IR!

Hopefully it’s fairly clear what’s happening here.

Running the LLVM code

LLVM comes with a handy interpeter lli which lets us run our program directly. This is useful for testing that it works as expected without having to compile it.

$ lli-9 program.ll
?

Looks like our program is working! Now let’s see how we can compile it down to a proper exectuble.

(Note: on my machine, it also prints an error could not mmap JIT marker. It’s not clear exactly why.)

IR formats: text and bitcode

There are in fact two formats of LLVM IR. What we’ve written above uses the text representation (extension .ll). There is another binary representation (extension .bc) known as ‘bitcode’.

Let’s see quickly how we can convert between the two formats. First, we can use llvm-as turn text into bitcode:

$ ls
program.ll

$ llvm-as-9 program.ll

$ ls
program.bc program.ll

And to go the other way, from bitcode to text, we use llvm-dis:

$ ls
program.bc

$ llvm-dis-9 program.bc

$ ls
program.bc program.ll

You might have noticed that these programs are called the ‘LLVM assembler’ and ‘LLVM disassembler’. This is a bit of a misnomer, since there is no actual native assembly involved at any point - only different formats of the LLVM IR.

Converting IR to assembly

Next, we might want to convert our program from (platform-independent) LLVM IR to (native) assembly. For this, we can use llc.

By default, llc compiles to textual assembly code (.s). For example:

$ llc program.ll

$ cat program.s
        .text
        .file   "program.ll"
        .globl  main                    # -- Begin function main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        pushq   %rax
        .cfi_def_cfa_offset 16
        movl    $63, %edi
        callq   putchar
        popq    %rax
        .cfi_def_cfa_offset 8
        retq
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function

        .section        ".note.GNU-stack","",@progbits

Object files

llc can also be used to output object files (.o). These contain essentially the same content as the textual assembly, but in binary format. In fact, the difference between .s and .o files is essentially the same as the difference between .ll and .bc files.

To do this, use llc --filetype=obj <filename>:

$ llc-9 --filetype=obj program.ll

$ ls
program.ll program.o

Target triples

Of course, there are many different platforms with many different kinds of assembly. llc automatically determines which to use. To see which platform is being targeted by default, run llc --version.

$ llc-9 --version
LLVM (http://llvm.org/):
  LLVM version 9.0.0

  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver1

  Registered Targets:
    aarch64    - AArch64 (little endian)
    aarch64_32 - AArch64 (little endian ILP32)
    aarch64_be - AArch64 (big endian)
[...]

You can, however, also configure the target assembly by changing parts of the target triplet. You can override the whole thing at once using the -mtriple option, or set the architecture using the -mcpu/-march options. For example, let’s compile for 64-bit WebAssembly:

$ llc-9 --march wasm64 program.ll

$ cat program.s
	.text
	.file	"program.ll"
	.section	.text.main,"",@
	.globl	main                    # -- Begin function main
	.type	main,@function
main:                                   # @main
	.functype	main () -> ()
# %bb.0:
	i32.const	63
	i32.call	putchar
	drop
                                        # fallthrough-return-void
	end_function
.Lfunc_end0:
	.size	main, .Lfunc_end0-main
                                        # -- End function

	.functype	putchar (i32) -> (i32)

Linking the assembly

Though we have now compiled our LLVM IR to machine code in the form of object files (.o), these cannot be executed. They must be linked first to produce executables.

There are a few different ways we can do this.

Linking with GCC

We can use GCC to do the linking for us. Though GCC is often just seen as a ‘C compiler’, it’s rather more magical than that and can accept all kinds of weird inputs (including, among other things: Go (!), D and Ada).

In our case, we can actually give GCC the textual assembly:

$ gcc -o program program.s

or the binary object file:

$ gcc -o program program.o

In both cases, it will figure out how to do the right thing to produce an executable binary.

$ ./program
?

Linking with Clang

It’s not just GCC that we can use to do our linking. LLVM has a closely associated compiler, Clang, which provides very similar functionality. Lucky for us (in this case) it works exactly the same way as gcc.

$ clang-9 -o program program.s
[or]
$ clang-9 -o program program.o

$ ./program
?

Clang can also go one better. Thanks to its close link with the LLVM project, it can also consume LLVM IR directly:

$ clang-9 -o program program.ll

$ ./program
?

Manually linking

If you want to see what’s going on inside GCC during linking, you can pass the --verbose option (-v for short). Here’s what I get. The GCC linker is actually called ld, but you’ll notice that a program called collect2 is being called instead. It turns out that this is mostly a complicated alias for ld.

$ gcc --verbose -o program program.s | grep crt
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
COLLECT_GCC_OPTIONS='-v' '-o' 'program' '-mtune=generic' '-march=x86-64'
 as -v --64 -o /tmp/ccpdizFX.o program.s
GNU assembler version 2.30 (x86_64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.30
COMPILER_PATH=/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/7/../../../../lib/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/7/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-o' 'program' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/7/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/7/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper -plugin-opt=-fresolution=/tmp/ccBjbrLv.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie -z now -z relro -o program /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/7/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/7 -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/7/../../.. /tmp/ccpdizFX.o -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib/gcc/x86_64-linux-gnu/7/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'program' '-mtune=generic' '-march=x86-64'
ross@golem-windows:/mnt/d/Development/Haskell/kaleidoscope/src/chapter1$ gcc --verbose -o program program.s
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
COLLECT_GCC_OPTIONS='-v' '-o' 'program' '-mtune=generic' '-march=x86-64'
 as -v --64 -o /tmp/ccY5JS3x.o program.s
GNU assembler version 2.30 (x86_64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.30
COMPILER_PATH=/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/7/:/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/7/../../../../lib/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/7/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-o' 'program' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/7/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/7/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper -plugin-opt=-fresolution=/tmp/ccFWGnuo.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie -z now -z relro -o program /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/7/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/7 -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/7/../../.. /tmp/ccY5JS3x.o -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib/gcc/x86_64-linux-gnu/7/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'program' '-mtune=generic' '-march=x86-64'

Once upon a time, linking your object file was perhaps as simple as calling:

$ ld -o output /lib/crt0.o program.o -lc

This tells ld to how to link your object file (program.o):

Unfortunately, a bit of cruft has stacked up since then. Here’s the smallest command I managed to get working.

$ ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o program /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/7/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/7 -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/7/../../.. program.o -lc /usr/lib/gcc/x86_64-linux-gnu/7/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/crtn.o

Yikes! Probably best to just use gcc.

Clang provides a similar --verbose option. Here’s the output on my system:

$ clang-9 -v -o program program.oclang version 9.0.0-2~ubuntu18.04.2 (tags/RELEASE_900/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
 "/usr/bin/ld" -z relro --hash-style=gnu --build-id --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o program /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0/../../../x86_64-linux-gnu/crt1.o /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0/../../../x86_64-linux-gnu/crti.o /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0/crtbegin.o -L/usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0/../../../x86_64-linux-gnu -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0/../../.. -L/usr/lib/llvm-9/bin/../lib -L/lib -L/usr/lib program.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0/crtend.o /usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0/../../../x86_64-linux-gnu/crtn.o

Amusingly, despite LLVM having its own linker lld, it seems that Clang actually just calls the GNU linker ld under the hood. lld is more or less undocumented, so I didn’t try to get it working.

Conclusion

So there you go - a quick introduction to writing and compiling a program in LLVM IR. You can also find more detailed docs on the LLVM website.

Cheatsheet

File formats