CAST

What Is

CAST parses C code into an abstract syntax tree (AST), lets you break it, then vomit it out as code. The parser does C99.

Library Overview

Everything is in the module C.

  • There's the parser (Parser).
  • There's the tree (Node and its subclasses).
  • That's it.

Usage

You call Parser#parse, and it gives you a tree of Node objects. Watch:

require 'cast/cast'

## create a parser
parser = C::Parser.new

## (optional) set some settings...
parser.pos.filename = "toy.c"      # used for error messages
parser.type_names << 'LinkedList'  # treat these words as types

## gimme a tree
ugly_c_code = open("toy.c"){|f| f.read}
tree = parser.parse(ugly_c_code)

## what's the tree look like?
p tree

If there's a parse error, #parse raises a ParseError (which has a nice error message in #message).

The Parser

Here's a quiz: what does "a * b;" do?

I bet you said "why you l4m3r n00b, that's a statement that multiplies a by b and throws away the answer -- now go take your meaningless snippetage to your computing 101 class and let me finish hurting this JavaTM programmer." Well, you'd be both mean and wrong. It was, of course, a trick question. I didn't say if any of a and b are types! If only a is a type, it's actually a declaration. And if b is a type, it's a syntax error.

So, the parser's gonna need to know which identifiers are type names. This is one of the bits of state that a Parser keeps. Here's the complete list (um, of two):

  • #type_names -- a Set of Strings.
  • #pos -- the Node::Pos this parser will start parsing at.

A Node::Pos has three read-write atts: #filename, #line_num, #col_num. Default is nil, 1, 0.

The Nodes

There are a couple of Node classes:

  • Node
    • TranslationUnit
    • Declaration
    • Declarator
    • FunctionDef
    • Parameter
    • Enumerator
    • MemberInit
    • Member
    • Statement
      • Block
      • If
      • Switch
      • While
      • For
      • Goto
      • Continue
      • Break
      • Return
      • ExpressionStatement
    • Label
      • PlainLabel
      • Default
      • Case
  • Node
    • Expression
      • Comma
      • Conditional
      • Variable
      • UnaryExpression
        • PostfixExpression
          • Index
          • Call
          • Dot
          • Arrow
          • PostInc
          • PostDec
        • PrefixExpression
          • Cast
          • Address
          • Dereference
          • Sizeof
          • Plus
          • Minus
          • PreInc
          • PreDec
          • BitNot
          • Not
  • Node
    • Expression
      • BinaryExpression
        • Add
        • Subtract
        • Multiply
        • Divide
        • Mod
        • Equal
        • NotEqual
        • Less
        • More
        • LessOrEqual
        • MoreOrEqual
        • BitAnd
        • BitOr
        • BitXor
        • ShiftLeft
        • ShiftRight
        • And
        • Or
  • Node
    • Expression
      • AssignmentExpression
        • Assign
        • MultiplyAssign
        • DivideAssign
        • ModAssign
        • AddAssign
        • SubtractAssign
        • ShiftLeftAssign
        • ShiftRightAssign
        • BitAndAssign
        • BitXorAssign
        • BitOrAssign
      • Literal
        • StringLiteral
        • CharLiteral
        • CompoundLiteral
        • IntLiteral
        • FloatLiteral
  • Node
    • Type
      • IndirectType
        • Pointer
        • Array
        • Function
      • DirectType
        • Struct
        • Union
        • Enum
        • CustomType
        • PrimitiveType
          • Void
          • Int
          • Float
          • Char
          • Bool
          • Complex
          • Imaginary
    • NodeList
      • NodeArray
      • NodeChain

The bold ones are abstract.

The last 2 (the NodeLists) represent lists of nodes. Methodwise, they try to behave like normal ruby ::Arrays. Implementationwise, a NodeChain is a doubly linked list, whereas a NodeArray is an array. NodeChains may be more efficient when adding things at the beginning of a LARGE list.

Attributes

Each Node object has:

  • #parent -- return the parent in the tree (a Node or nil).
  • #pos, #pos= -- the position in the source file (a Node::Pos).
  • #to_s -- return the code for the tree (a String).
  • #inspect -- return a pretty string for inspection. Try it.
  • #match?(str), #=~(str) -- return true iff str parses as a Node equal to this one.
  • #detach -- remove this node from the tree (parent becomes nil) and return it.
  • #detached?, #attached? -- return true if parent is nil or non-nil respectively.
  • #replace_with(node) -- replace this node with node in the tree.
  • #swap_with(node) -- exchange this node with node in their trees.
  • #insert_prev(*nodes), #insert_next(*nodes) -- insert nodes before this node in the parent list. Parent must be a NodeList! Useful for adding statements before a node in a block, for example.
  • #Foo? -- (where Foo is a module name) return self.is_a?(Foo).

The #Foo? method is a convienience for a common need. Example:

## make a tree
ast = C::Parser.new.parse(code_string)

## print all global variables
ast.entities.each do |node|
  node.Declaration? or next
  node.declarators.each do |decl|
    unless decl.type.Function?
      puts "#{decl.name}: #{decl.type}"
    end
  end
end

If you're a duck-typing purist, then sorry for the cardiac arrest you're now experiencing. CAST does pay attention to the class of Node objects for quite a few things. This is the cleanest way to distinguish, e.g., an Add from a Subtract (which both have the same methods but represent very different things). It just seems impractical (and unnecessary) to allow duck typing in this situation.

The #=~ method lets you do:

if declarator.type =~ 'const int *'
  puts "Ooh, a const int pointer!"
end

This is not the same as "declarator.type.to_s == 'const int *'"; that'd require you to guess how to_s formats its strings (most notably, the whitespace).

Fields and children

Each concrete Node class has a member for each bit of important C stuff it pertains to. I know you peeked at the big list below, so you know the kind of thing I mean.

But these aren't defined as attrs as you normally do in Ruby -- they're fields. If a node has a field foo, it means there's a setter #foo= and getter #foo. (A field foo? means the setter is #foo= and the getter is #foo?.) Some fields are even more special: child fields. These form the tree structure, and always have a Node or nil value.

Why divulge these bizarre internal secrets? Because these Node methods are defined in terms of fields and children:

  • #==, #eql? -- Equality is checked recursively. That is, all fields (including children) must be equal.
  • #dup, #clone -- Children are copied recursively (other fields and attrs as normal).

Then there's the tree-twiddling methods, which only ever yield/return/affect (non-nil) children.

  • #next, #prev -- return the next/prev sibling.
  • #list_next, #list_prev -- like #next/#prev, but also requires the parent to be NodeList. I'll be honest; I don't remember why I added these methods. They may well suddenly disappear.
  • #each, #reverse_each -- Yield all (non-nil) children. Node includes Enumerable, so you get the free chocolates too.
  • #depth_first, #reverse_depth_first -- Walk the tree in that order, yielding two args (event, node) at each node. event is :down on the way down, :up on the way up. If the block throws :prune, it won't descend any further.
  • #preorder, #reverse_preorder, #postorder, #reverse_postorder -- Walk the tree depth first, yielding nodes in the given order. For the preorders, if the block throws :prune, it won't descend any further.
  • #node_after(child), #node_before(child) -- return the node before/after child (same as child.next).
  • #remove_node(child) -- remove child from this node (same as child.detach).
  • #replace_node(child, new_child) -- replace child with yeah you guessed it (same as child.replace_with(newchild)).

If you're walking the tree looking for nodes to move around, don't forget that modifying the tree during traversal is a criminal offence.

And now, the episode you've been waiting for: THE FIELD LIST! (Cue music and dim lights.)

Notes about the table:

  • If no default is listed, it is false if the field name ends in a '?', nil otherwise.
  • nil is always allowed for a child field.
  • There are some conventions we follow religiously to help you:
    • Field names that end in '?' are always true-or-false.
    • NodeList-valued fields always default to an empty NodeArray or NodeChain, so you can tack things on there with <<, without worrying about needing to create the list first.
    • A field is Node-valued if and only if it is a child field.
    • The rows go yellow, green, yellow, green, ... .

Class Field Child? Type or possible values Default Comments
TranslationUnit entities Y NodeList NodeChain[] He lives at the root of a parsed file.
Declaration storage :typedef, :extern, :static, :auto, :register There are also methods to query the storage more humanely:
  • #typedef? -- true iff storage == :typedef
  • #extern? -- true iff storage == :extern
  • #static? -- true iff storage == :static
  • #auto? -- true iff storage == :auto
  • #register? -- true iff storage == :register
type Y DirectType
declarators Y NodeList NodeArray[]
inline? true, false
Declarator indirect_type Y IndirectType What on earth is a "declarator", you ask? Consider "int i, *ip;". This is a Declaration with two Declarators:
    Declaration
        type: Int
        declarators: 
            - Declarator
                name: "i"
            - Declarator
                indirect_type: Pointer
                name: "ip"
The indirect_type of the ip Declarator is a Pointer to nil. "'Pointer to nil' my foot -- I want the type of the stupid variable!" Here:
  • #type -- return the type, the whole type, and nothing but the type. This is a clone; modifying it won't modify the tree.
So calling #type on the ip Declarator gives:
    Pointer
        type: Int
name String
init Y Expression
num_bits Y Integer
FunctionDef storage :extern, :static Just like Declaration, there's also:
  • #extern? -- return true iff storage == :extern
  • #static? -- return true iff storage == :static
There's also a pseudo-field:
  • #prototype? -- same as !no_prototype?
  • #prototype=(val) -- same as no_prototype = !val
no_prototype? means that no prototype was given. That means parameter types weren't given in the parens, but in the "old-style" declaration list. Example:
int main(argc, argv)
    int argc;
    char **argv;
{
    return 0;
}
int main(int argc, char **argv) {
    return 0;
}
No prototype. Prototype.
Everyone tells you to use prototypes. That's because no type checking is done when calling a function declared without a prototype.
inline? true, false
type Y Type
name String
def Y Block Block.new
no_prototype? true, false
Parameter register? true, false Used in Functions.
type Y Type
name String
Enumerator name String Used in Enums.
val Y Expression
MemberInit member Y NodeList of Member-or-Expression Used in CompoundLiterals.
init Y Expression
Member name String Used in MemberInits.
Block labels Y NodeList of Label NodeArray[]
stmts Y NodeList of Statement NodeArray[]
If labels Y NodeList of Label NodeArray[]
cond Y Expression
then Y Statement
else Y Statement
Switch labels Y NodeList of Label NodeArray[]
cond Y Expression
stmt Y Statement
While labels Y NodeList of Label NodeArray[] do? means it's a do-while loop.
do? true, false
cond Y Expression
stmt Y Statement
For labels Y NodeList of Label NodeArray[]
init Y Expression or Declaration
cond Y Expression
iter Y Expression
stmt Y Statement
Goto labels Y NodeList of Label NodeArray[]
target String
Continue labels Y NodeList of Label NodeArray[]
Break labels Y NodeList of Label NodeArray[]
Return labels Y NodeList of Label NodeArray[]
expr Y Expression
ExpressionStatement labels Y NodeList of Label NodeArray[]
expr Y Expression
PlainLabel name String
Default
Case expr Y Expression
Comma exprs Y NodeList of Expression
Conditional cond Y Expression
then Y Expression
else Y Expression
Variable name String
Index expr Y Expression
index Y Expression
Call expr Y Expression
args Y NodeList of Expression-or-Type
Dot expr Y Expression
member Y String
Arrow expr Y Expression
member Y String
PostInc expr Y Expression
PostDec expr Y Expression
Cast type Y Type
expr Y Expression
Address expr Y Expression
Dereference expr Y Expression
Sizeof expr Y Type or Expression
Positive expr Y Expression
Negative expr Y Expression
PreInc expr Y Expression
PreDec expr Y Expression
BitNot expr Y Expression
Not expr Y Expression
Add expr1 Y Expression
expr2 Y Expression
Subtract expr1 Y Expression
expr2 Y Expression
Multiply expr1 Y Expression
expr2 Y Expression
Divide expr1 Y Expression
expr2 Y Expression
Mod expr1 Y Expression
expr2 Y Expression
Equal expr1 Y Expression
expr2 Y Expression
NotEqual expr1 Y Expression
expr2 Y Expression
Less expr1 Y Expression
expr2 Y Expression
More expr1 Y Expression
expr2 Y Expression
LessOrEqual expr1 Y Expression
expr2 Y Expression
MoreOrEqual expr1 Y Expression
expr2 Y Expression
BitAnd expr1 Y Expression
expr2 Y Expression
BitOr expr1 Y Expression
expr2 Y Expression
BitXor expr1 Y Expression
expr2 Y Expression
ShiftLeft expr1 Y Expression
expr2 Y Expression
ShiftRight expr1 Y Expression
expr2 Y Expression
And expr1 Y Expression
expr2 Y Expression
Or expr1 Y Expression
expr2 Y Expression
Assign lval Y Expression
rval Y Expression
MultiplyAssign lval Y Expression
rval Y Expression
DivideAssign lval Y Expression
rval Y Expression
ModAssign lval Y Expression
rval Y Expression
AddAssign lval Y Expression
rval Y Expression
SubtractAssign lval Y Expression
rval Y Expression
ShiftLeftAssign lval Y Expression
rval Y Expression
ShiftRightAssign lval Y Expression
rval Y Expression
BitAndAssign lval Y Expression
rval Y Expression
BitXorAssign lval Y Expression
rval Y Expression
BitOrAssign lval Y Expression
rval Y Expression
StringLiteral val String The String in val is the literal string entered. "\n" isn't converted to a newline, for instance.
CharLiteral val String The String in val is the literal string entered. '\n' isn't converted to a newline, for instance.
CompoundLiteral type Y Type

Here's an example. (struct S){1, .x = 2, .y [3] .z = 4} parses as:

CompoundLiteral
    type: Struct
        name: "S"
    member_inits: 
        - MemberInit
            init: IntLiteral
                val: 1
        - MemberInit
            member: 
                - Member
                    name: "x"
            init: IntLiteral
                val: 2
        - MemberInit
            member: 
                - Member
                    name: "y"
                - IntLiteral
                    val: 3
                - Member
                    name: "z"
            init: IntLiteral
                val: 4
"That's legal syntax!?" Yep. Look it up.
member_inits Y NodeList of MemberInit NodeArray[]
IntLiteral val Integer

Also:

  • #dec? -- return true iff format == :dec
  • #hex? -- return true iff format == :hex
  • #oct? -- return true iff format == :oct
format :dec, :hex, :oct :dec
FloatLiteral val Float
Pointer const? true, false
restrict? true, false
volatile? true, false
type Y Type
Array const? true, false
restrict? true, false
volatile? true, false
type Y Type
length Y Expression
Function const? true, false
restrict? true, false
volatile? true, false
type Y Type
params Y NodeList of Parameter NodeArray[]
var_args? true, false
Struct const? true, false
restrict? true, false
volatile? true, false
name String
members Y NodeList of Member NodeArray[]
Union const? true, false
restrict? true, false
volatile? true, false
name String
members Y NodeList of Member NodeArray[]
Enum const? true, false
restrict? true, false
volatile? true, false
name String
members Y NodeList of Enumerator
CustomType const? true, false This is for typedef'd names.
restrict? true, false
volatile? true, false
name String
Void const? true, false const void!? Yes, think about: const void *.
restrict? true, false
volatile? true, false
Int const? true, false longness sounds silly, so here are some less silly methods:
  • #short? -- return true iff longness == -1
  • #plain? -- return true iff longness == 0
  • #long? -- return true iff longness == 1
  • #long_long? -- return true iff longness == 2
Oh, and look, a pseudo-field:
  • #signed? -- same as !unsigned?
  • #signed=(val) -- same as unsigned = !val
restrict? true, false
volatile? true, false
longness -1, 0, 1, 2 0
unsigned? true, false
Float const? true, false Less silly-sounding longness substitutes:
  • #plain? -- return true iff longness == 0
  • #double? -- return true iff longness == 1
  • #long_double? -- return true iff longness == 2
restrict? true, false
volatile? true, false
longness 0, 1, 2 0
Char const? true, false Also:
  • #signed? -- return true iff signed == true
  • #unsigned? -- return true iff signed == false
  • #plain? -- return true iff signed == nil
Yes, C99 says that char, signed char, and unsigned char are 3 distinct types (unlike with int -- go figure). Like Martian chalk and Venusian cheese: completely different, but you can fit 'em each in one byte. Mmm, Martian chalk...
restrict? true, false
volatile? true, false
signed true, false, nil
Bool const? true, false This is the rarely seen _Bool type.
restrict? true, false
volatile? true, false
Complex const? true, false

This is the rarely seen _Complex type.

  • #plain? -- return true iff longness == 0
  • #double? -- return true iff longness == 1
  • #long_double? -- return true iff longness == 2
restrict? true, false
volatile? true, false
longness 0, 1, 2 0
Imaginary const? true, false

This is the rarely seen _Imaginary type.

  • #plain? -- return true iff longness == 0
  • #double? -- return true iff longness == 1
  • #long_double? -- return true iff longness == 2
restrict? true, false
volatile? true, false
longness 0, 1, 2 0

Node Construction

Wanna make a Node? Take your pick:

  • #new(field1, field2, ...) -- fields are in the order listed above.
  • #new(:field1 => val, :field2 => val, ...) -- field order doesn't matter this way.
  • #new_at(pos, *args) -- set the pos to that given, and pass args to #new.

They're for losers, though. What you really want to do is make Nodes by parsing C code. Each class -- even the abstract classes like Statement -- has a .parse method:

funcdef = C::FunctionDef.parse <<EOS
void frobnicate(int karma) {
  use_waffle_iron();
}

stmt = C::Statement.parse('while (not_looking) paint_car();')

Need to tell it to treat WaffleIron as a type name? All those parse methods use C.default_parser:

C.default_parser.type_names << 'WaffleIron'
type = C::Type.parse('WaffleIron')

Alternatively, you could've given parse your own parser:

parser = C::Parser.new
parser.type_names << 'WaffleIron'
type = C::Type.parse('WaffleIron', parser)

In fact, there's also C.parse(str, parser=nil), which is an alias for C::TranslationUnit.parse(str, parser).

ast = C.parse(STDIN)

Yes, all that talk in the intro about doing parser = C::Parser.new; parser.parse(...) was actually all a charade to make you type more. I so own you.

Open Issues

  • Is it okay to bastardize the =~ operator like that?
  • Should binary operators have a list of expressions rather than just 2? That'd allow to_s to format the strings better in some cases and make it consistent with Comma.
  • At the moment CAST chokes on preprocessor #-lines. Ruby makes it trivial to filter these out before passing the string to Parser#parse, and this makes it obvious when you forget to run a file through the preprocessor (which you need to do to get type names at least once). Is this stupid? Ideally we should probably have a builtin preprocessor and use that.
  • Should a Member be allowed in MemberInit#member? It'd be the common case, but special cases make uniform treatment more arduous. Is "uniform treatment" an issue?

Vote now.

To Do

  • Stop embarrasing yourself and write the parser in C.
  • Make it -wd friendly.
  • Fix the "TODO" bits in c.y. These are for rarely used C constructs, like the declaration int arr[*];.
  • Add a comment node. Might make it useful for doc extraction. Anyone want this? Not all comments will be caught, though. Comments can appear between any two tokens, and I don't really want to add comment fields to every node. They'd probably just appear between toplevel entities (in TranslationUnit#entities) and between statements (in Block#stmts).
  • Make it rdoc-able.

If any of these affect you greatly, kick me to make it happen faster.

Contact

I'm not really sure what people are going to try to use this for. If there's some functionality you think would make a good addition, or think I've made a mess of this poor puppy, give me a yell.

You can spam me at george.ogata@gmail.com. It'd help if you prefixed the subject with "[cast] " so I can easily distinguish CAST spam from fake Rolex spam.