Compiling a Compiler – GCC

From all the work i have been doing on compilers recently i have started to realise a lot of people won’t know how to compile up GCC or even how that works. Since how are you meant to compile a compiler written in a language that itself implements. Well the technique is called bootstrapping. But this is something people try and think its more complicated that it usually is. To understand how it works we have to think how was GCC built; well its as simple as it NEED’s a C compiler :-) . But how did that first c-compiler get there in the first place, well it goes back years of people made the first c compilers in Assembler and then it bootstrapped up!

Within GCC when you start the build we bootstrap 3 times, which means we build it 3 times but why you ask? The generated code for the first compiler run, is by the host compile so the compiler itself cannot be trusted, then the 2nd is done with the newly built compiler and the 3rd for good measure to guarantee we’ve compiled with the new compiler and GCC itself should be running off its own code. Then you can look at something like the GHC the Glasgow Haskell compiler which is written in its own language, but the thing they don’t say is that it is written in a some-what simplified dialect of Haskell and bootstraps off a small amount of C code if you don’t already have a Haskell implementation already present. Which is similar to the bootstrapping languages of GCC which use C90 to maintain language portability.

Then you can look at something like what if someone writes a new operating system how do you port a compiler and libc over to the new system. Well this is done by a mixture of cross complication and complicated kernel work which can take quite some time to get going correctly.

But I digress as usual now onto a little tutorial how do we compile GCC well yes it uses the ./configure, make, make install build system. But it doesn’t quite work like this so we do it slightly different.

First to run experimental I prefer to use git:

# to get my gccpy work use this git repo it merges with gcc master at least weekly into master

$ git clone git://crules.org:gcc-dev.git

# or run the gcc git mirror:

$ git clone git://gcc.gnu.org/git/gcc.git

Now compiling time:

$ aptitude install libgmp3-dev libmpfr-dev

# GCC now requires libmpc which you may need to compile and install from:

# http://www.multiprecision.org/index.php?prog=mpc

$ cd gcc

$ mkdir build

$ cd build

Useful configure options

$ ../configure

–enable-gold | use the GOLD linker

–disable-bootstrap | useful for gcc developers not recommended for users or for your first build

–enable-languages=c,c++,python,gccgo …

….

–with-gmp-lib=/usr/local/lib

–with-mpfr-lib=/usr/local/include

-with-mpc/mpfr …. the same optiosn to point to different library locations.

Now to build:

$ make

$ make install

And you should be good to go. We have to do our build in a separate directory due to the size of GCC some components don’t like being built directly.

Reference Counted Garbage Collector!

Garbage collection is a subject we all wish we knew more about, many software developers think they understand the concept fully and think building it should then be a trivial task. Building a ref-counted garbage collector, was much much more painful than i ever expected in crules. I am going to focus on how this work in an interpreted language and maybe a few months on how it will work in GccPy an Ahead of time implementation of Python on GCC.

So what is reference counting, take for instance the expression in a dynamic language:

x = 2 + 3 + 4 + 5;

So what happens in this expression in an interpreter? You can tell ‘x’ is a variable which will store the result of the expression ‘2 + 3 + 4 + 5′. Well to understand this simple example you first have to look at how this expression is evaluated, for instance these values ‘2,3….’ are still literals at this parse stage. What that means is they are numbers but dont paticularly have any meaning within the language yet, for instance python and crules need to ‘fold’ this literals into objects @see

`extern crl_symbol_obj * crl_rr_literal_fold( crl_symbol_obj * lit, crl_context_table * context )` in rr_bin_eval.c

In this case it looks at the literal object and then looks up for the apropriate object definition to create the Integer Object so the language knows how to do addition subtraction and some other member functions etc.

So ok lets look at this at a more semantic level what will happen is:

t1 = 2;
t2 = 3;
t3 = 4;
t4 = 5;

These address’s ‘t#’ are not actually accessible within the language its just a way of showing these are folded and we have addresses to them. Then we will have:

t5 = t1 + t2;
t6 = t5 + t3;
t7 = t6 + t4;

And now t7 should have the result of: 14 . And now finally:

x = t7;

So ok now you see how this is evaluated within at least a semantic context, what is ref counting?! So you can notice to get that simple result of 14 we had to create quite a few tmp objects as ‘t#’ But how does the runtime, tell which are temporary and which are still needed. Its as simple as adding in a ’signed long refcount’ to your Intermediate Representation.

In crules the notion of a context is the key to understanding how it all works so for each of these folded objects we create them with:

`Crl_Symbol_Init_Ctx( retval, context );` which is a pre-processor macro to initialize the crl_symbol_obj * retval, within this context. So what that is need for is when the garbage collector is invoked we need to be able to sweep over something to look for dead references in other words any objects with reference count ‘<= 0′ i say <= 0, and not simply zero because of how context poping and pushing can work and you can have a ref count of -2 in some cases but that’s besides the point.

So when this object is created within a context the garbage collection is able to sweep over a stack of context’s and look decide what needs deleted, so lets
look at the ref counting itself now.

Semantic | Refcount
——————-
t1       | 1
t2       | 1
t3       | 1
t4       | 1
t5       | 0
t6       | 0
t7       | 0

This shows the resulted ref counts after this is run:

t1 = 2;
t2 = 3;
t3 = 4;
t4 = 5;

Now lets look at what happens for:

t5 = t1 + t2

Semantic | Refcount
——————-
t1       | 2
t2       | 2
t3       | 1
t4       | 1
t5       | 1
t6       | 0
t7       | 0

So this is where it gets a little more tricky once that t5 has been evaluated t1 and t2 are de-referenced again.

Semantic | Refcount
——————-
t1       | 1
t2       | 1
t3       | 1
t4       | 1
t5       | 1
t6       | 0
t7       | 0

So the same idea happens for t6 and t7, and then we have the resultant table of:

Semantic | Refcount
——————-
t1       | 1
t2       | 1
t3       | 1
t4       | 1
t5       | 1
t6       | 1
t7       | 1

But now we know by common sense t1->t6 are all gabage its only t7 which is of any interest since this is what the variable ‘x’ now points to. But this is kind of down to how you implement it how you dec-rement those ref counts when i run in crules they are handled after each use.

Take a peek at rr_bin_eval.c and rr_math.c and rr_runtime.c in crl_rr_evaluate_expression( … )

So ok we should now have the table of:

Semantic | Refcount
——————-
t1       | 0
t2       | 0
t3       | 0
t4       | 0
t5       | 0
t6       | 0
t7       | 1

The key thing to remember is we dont have ref counts on the variable ‘x’ its what x points to! This is the whole key idea of how a dynamic type system works.

Now in crules the garbage collector is invoked when we have executed a full procedure, so in this case if i was to feed in directly this expression into the interpreter the garbage collector would be invoked if this was part of a function or procedure it would be executed once we hit the end of the block.

So now what comes in the idea of Mark-Sweep, this is probably what you understand of a mark sweep algorithm it sweeps over the stack and heap looking for dead refernces. So ok that makes sense but in the end its horrible vague and if you are to do that its an incredibly linear way of doing things. In essence crules has a mark sweep algorithm. But what really happens is i sweep and mark then free.

So to invoke the garbage collector in crules we call:

`extern void crl_garbage_invoke_sweep( crl_context_table * const );`

So what happens in gg_garbage.c, loops through each context in the context_table stack and looks through the symbol_stack and checks for objects of <= 0 ref count and marks them as garbage and sets the vector entry to NULL so it cant be collected again and result in a double free etc. Then once that is once it automaticaly invokes the garbage collector and runs over a vector of marked objects and free’s them one by one this could be speed up with pthreads but i havent got around to doing that.

This does not mean if your writing a builting object type or module as a dynamic lib .so, you have to invoke the garbage collection yourself you dont and if your writing crules code you dont have to do anything its all automatic except for the real intrinsic bits and pieces to the runtime within the interpreter.

Ok this is a good overview of ref counting but there is more lets take the example:

x = 2;
y = x;
z = y;

So as before we fold the literal ‘2′:

t1 = 2;
x  = t1;            t1 – ref count 1
y  = x = t1;        t1 – ref count 2
z  = y = t1;        t2 – ref count 3

Remember what we check is what the accessors point to, we never have a ref count on x or y or z its the value t1. So we have 3 objects simply pointing to the same object, We cannot free t1 at all! or we will surely break something! ;)

We can only safely delete this if each of these accessors went out of scope in the current context example if we stopped the session. we can decrement. But to understand what i mean more deeply we have to look at how functions and argument passing works and how the context stack works.

So i will finish this in a 2nd part tutorial soon :) in a review of how crules internals work. I hope some of you are still awake!

Onwards and Upwards

So its been quite some time since I’ve blogged and i will be meaning to write an update post every Sunday from now on maybe not this Sunday but hey we’ll see. For the next few months and the foreseeable future I’ll be talking a lot about GCC . So lets start from the beginning…

I am currently still a university student, I finished working at SAP in July 2009, so I took 3 Months and worked on Crules full-time for that time until I went back to university and I still worked at it full time since i was only having to do 3 Modules this year and even then i only had to do 2, since i messed up a few in 2nd year due to several personal reasons. I was taking a year to fix up my 2nd year so I could go back to final year and finish it! Although my old Job at SAP are pushing me to apply for a PhD with them, and i am not sure if i want to do that yet but we’ll see! Anyways i still worked at my language Crules since i had a lot of free time and the module material was not interesting to me what so ever!

Then I get this email “Hello From Google” and my god I wish I could put up what it read but i cant for some legal reason i think, and basically it all sounded as if they were offering my a job at Google in California I was so excited the best day of my life I can possibly remember, the amount of work i have done in my own time to learn compiler design and language design, Unix and system administration all on my own may have paid off! So yeah I go to interviews but they didn’t quite go so well, for you see they send you this email but really its asking you to apply for a job with them since they head hunt out people with passions and interests already relating to google in house technologies so you can bet they are going to be good employees. The interviews were tough but enjoyable I messed up some very basic questions and kick my self so hard now. But hey they didn’t go so well but they said they were keeping me on file for something more suitable mainly more programming related instead of the Google.com team. So I wasn’t complaining it was a good result for someone at the age of 21 without a degree with a language project and 1.5 years software developer and System-admin experience. Where things look up for me….

So then came around Oggcamp which is taking over the Mantle of Lug Radio Live, a UK conference on Linux Free Software and Free Thinking! It was a great weekend and i even joined in a gave a talk on my Programming Language the video apparently should be online eventually, but I’ll post up the slides here! I got on well and even got asked if i was interested in giving a talk at the new FlossUK Unix user group conference. Although i have lost the guys email I think he still has mine so if your reading this let me know! Here some photos of me looking freaked out

So yeah I gave a talk, i think it went well although i was annoyed that i could show off some more interesting features in the new release since it was was <sigsev> in some garbage collection bits and bobs. But i shown some more basic old features and I’ll endeavour to give a talk next year and show things off properly. But hey its my first proper talk on my language and no one left on me and i was voted into the 2nd talk room out of the 3 and none left on me! But yeah enough of boring things, most of all we drank ALOT of beer here is some of us with out drinking trousers on at @methoddan’s ratholeradio road-show gig:

It was really good to see some friends again like @Nybill, @Windigo, @yaMatt (Thanks for letting me stay with you mate!), @Dickturpin ( hows that £1 i lent you rofl! <in-joke> ) ,@bobobex (Thanks for pushing me to do my talk!), @corenominal ( Funniest dude ever! ) @Fabsh, @Methoddan, @crhisp (Great bloke to have a beer with!), @jontheniceguy

So since i came back from Oggcamp i was knackered i had already started the whole move in Crules to fully object orientated, with a really cool Builtin type C api which is really nice as well as some embedding features, with a fully automated Ref-Counted Garbage collector. Took a helluva’ lot of time to get going and implemented took 3 weeks planning and alot more time fixing and implementing! I just need some time now some time to fix regressions and write a lot of documentation.

So ok then i finished university for Summer fair enough but I needed a job besides my free lance bits and pieces since i hadn’t heard back from google. So i had already applied for Google Summer of code for 2 projects and god accepted into the one I wanted to do which so happens was one of my original projects i started quite a while back! An implementation of Python as a GCC Front-end I’ve taken a lot of Flak from the PyPy project in the beginner but this is something I want to blog about but I’ll leave that for another day I wanted to write something semi comprehensible for most people before i start talking about in-depth thing in the Crules API or the GCCPY Runtime Library or Parsing the Python Language etc. I hope this is semi coherent and updates you on why i’ve been quiet!

Made my @bobobex

Made my @bobobex