Hey guys time for an overview of Gccpy,:
Gccpy is a Python front-end to gcc from my last years GSOC (2010), I was mentored by Ian Lance Taylor
who continues to be a great inspiration to me with his work on gccgo and his help on gcc-help to the community
of users and developers.
An overview of what the project aims to achive is creating an AOT compiled version of Python using GCC as a framework for
middle-end, back-end optimization aswell as protable code-generation. Creating AOT languages has
been generally aimed for more ‘low-level’ languages such as C/C++/Fortran where the language requires
strong typing and other kinds declarative features; which gives rise to much less dynamic
features which languages like Python/PHP/Perl take for granted. The reason these more ‘high-level’
languages are able to do such things is due to the fact they are generaly implemented as interpreted
languages and this allows for much of this dynamic logic to take place at runtime when a program is passed
through their respective interpreters. But gccpy tackles this by generating code which allows for and efficient
runtime as well as the dynamic features required.
Personally I feel this is a very strong and important style of implementing dynamic languages; which needs to be shown
and proven to be an effective and strong way of implementing new and up-comming languages and why GCC is the right platform
to do so.
Gccpy is a Python front-end to gcc from my last years GSOC (2010), I was mentored by Ian Lance Taylor who continues to be a great inspiration to me with his work on gccgo and his help on gcc-help to the community of users and developers.
An overview of what the project aims to achieve is creating an AOT compiled version of Python using GCC as a framework for middle-end, back-end optimization as well as protable code-generation. Creating AOT languages has been generally aimed for more ‘low-level’ languages such as C/C++/Fortran where the language requires strong typing and other kinds declarative features; which gives rise to much less dynamic features which languages like Python/PHP/Perl take for granted. The reason these more ‘high-level’ languages are able to do such things is due to the fact they are generally implemented as interpreted languages and this allows for much of this dynamic logic to take place at runtime when a program is passed through their respective interpreters. But gccpy tackles this by generating code which allows for and efficient runtime as well as the dynamic features required.
Personally I feel this is a very strong and important style of implementing dynamic languages; which needs to be shown and proven to be an effective and strong way of implementing new and up-coming languages and why GCC is the right platform to do so.
So ok this was the overview of my Gsoc application, but i think its about the best i can word it. But first lets discuss how this actually works over several blog posts i will demonstrate some of the basic ideas which lets us compile python. From the title of this post i will quickly demonstrate the ideas behind dynamic typing:
Lets look at how we can easily generate code for some C code for example:
-
-
int foobar (void)
-
{
-
int x = 1, y = 2;
-
return x + y;
-
}
-
We could generate an IR of:
-
-
int foobar (void)
-
{
-
-
int x, y, T.1;
-
-
x = 1;
-
y = 2;
-
-
T.1 = x + y;
-
-
return T.1;
-
}
-
And the i386 code of:
-
-
.globl foobar
-
foobar:
-
subl $12, %esp # get some stack space
-
mov $1, %esp # x
-
mov $2, 4(%esp) # y
-
mov 4(%esp), 8(%esp) # setting up a very highlevel/*un-optimized* addition
-
addl %esp, 8(%esp) # T.1
-
mov 8(%esp), %eax # the return
-
addl $12, %esp # fix the stack
-
ret
-
This is where some interpreters/runtimes start to try and become much more like a ‘virtual machine’ like Java they implement their language by having a runtime which runs code that is in a virtual inscrution set. So when they parse their language with a front-end they generate this virtual instruction set for the given program but then they ‘compile/assemble’ this to bytecode which is a similar akin to C where we assemble the target code to an object code before linking into an executeable format. But really the byte code is nothing more than a binary form of the instruction set to optimize execution of the instruction set.
An example why many people belive generating efficient code for dynamic languages can be difficult is take for example:
-
-
def foo (x):
-
x.append (1)
-
return x + [ 1, 2, 3 ]
-
So what happens here in an abstract point of view you can’t assume anything about this code due to dynamic typing compared to something like
-
-
def List foo (List x):
-
x.append (1)
-
return x + [ 1, 2, 3 ]
-
Having storage specifiers insantly makes this set of code much more declaritive and gives rise to many more assumptions able to be made; which in turn gives a compiler more ‘hints’ on what it can do to generate code. Of course in the example above this is just a hypothetical language just to demonstrate the idea. So to implement dynamic typing we have to analyse what it actually is.
Lets take a normal/regular python session:
-
-
>>> x = 1
-
>>> x = "string"
-
>>> y = x
-
>>> x = 2
-
So what is actually happening now line by line by showing what each identifier is assigned to what data.
-
-
>>> x = 1 # x = 1 | y = NULL
-
>>> x = "string" # x = "string" | y = NULL
-
>>> y = x # x = "string" | y = "string"
-
>>> x = 2 # x = 2 | y = "string"
-
So why is this actually a problem, traditionly take for example code like:
-
-
int x = 1
-
x = 1.5555
-
x = "string"
-
When a c-compiler would run over that code it would give all manar of warnings about type conversion and invalid types being assigned. But why is this since a compiler will want to generate efficient code it will reserve the space valid for an integer on the stack which on an i386 32bit processor would be 32 bits usualy and would asuse subl $4, %esp to have space on the stack for that integer, but the problem arises if we were to then want to put in data which is greater than the size previously allocated for the given initial data. So you will have overflow and corruption of data. So how can you combat that to make dynamic typing work, the method or approach i have taken for gccpy takes much inspiration in how object orientation works. Every piece of static data give in a program is wraped into a gpy_object_t structure at runtime so in turn every type in gccpy is implemented via a gpy_object_t type, so for example the previous python session could be represented in GIMPLE via something like:
-
-
gpy_object_t * x = fold_integer (1)
-
incr_ref_count (x)
-
-
decr_ref_count (x)
-
x = fold_string ("string")
-
-
gpy_object_t * y = x
-
incr_ref_count (y)
-
-
x = fold_integer (2)
-
incr_ref_count (x)
-
The basic idea how dynamic typing is not what an identifier with a specified storage specifier holds its what an identifier points to. So when `x = fold_integer (1)` we should look at what the gpy_object_t structure looks like currently its in many ways similar to how PY_object works in the cpython implemetation but is a little more specific and streamlined to gccpy’s needs.
-
-
typedef struct gpy_object_t {
-
enum GPY_OBJECT_T T;
-
union {
-
gpy_object_state_t * object_state;
-
struct gpy_callable__t * call;
-
gpy_literal_t * literal;
-
} o ;
-
} gpy_object_t ;
-
This structure is quite open to be used in many areas of how gccpy works but what we are interested in is the:
-
-
gpy_object_state_t * object_state;
-
This is the part where it stores the static data defined be it an integer or a class defined in the source code.
-
-
typedef struct gpy_rr_object_state_t {
-
char * obj_t_ident;
-
signed long ref_count;
-
void * self;
-
struct gpy_typedef_t * definition;
-
} gpy_object_state_t ;
-
This structure is whats used to hold the object_state it holds the object type identifier as a string the reference count for the garbage collector the pointer to a structure in memory which is the actual data for example int or FILE * etc, and also a pointer to the objects definition structure. Each object has its own definition and each definition requires several hooks:
-
-
typedef struct gpy_typedef_t {
-
char * identifier;
-
size_t builtin_type_size;
-
gpy_object_t * (*init_hook)(struct gpy_typedef_t *, gpy_object_t **);
-
void (*destroy_hook)(gpy_object_t *);
-
void (*print_hook)(gpy_object_t * , FILE *, bool);
-
struct gpy_number_prot_t * binary_protocol;
-
struct gpy_builtin_method_def_t * methods;
-
} gpy_typedef_t ;
-
It has the identifier the size of the sturcture of which holds the actual data the initilization hook which returns the object state when you initialize an object the destroy hook for the garbage collector, a print hook for the print keyword to print the data. Now things get more interesting, the binary protocol is whats used to allow for dynamic binary operators so things like:
-
-
x = 2 + 1.5
-
concat = "foo" + "bar"
-
Can allow for mixed type binary operations, by having hooks for each type of binary operation be it addition subtraction etc. Finaly there is a table of member methods which allows for dot accesors like:
-
-
list.append (…)
-
list.index (…)
-
As append and index are both member methods to the builtin type List.