# CASTY

Casty is a fork of CAST, a C parser and abstract syntax tree for Ruby.

## Example

    require 'casty'

    source = File.read('file.c')
    ast = C.parse(source)
    ast.entities.each do |declaration|
      declaration.declarators.each do |declarator|
        puts "#{declarator.name}: #{declarator.type}"
      end
    end

Or in irb:

    irb> ast = C.parse('int main(void) { return 0; }')
    => TranslationUnit
        entities:
            - FunctionDef
                type: Function
                    type: Int
                    params: []
                name: "main"
                def: Block
                    stmts:
                        - Return
                            expr: IntLiteral
                                val: 0

    irb> puts ast
    int main(void) {
        return 0;
    }
    => nil

## Nodes

`C.parse` returns a tree of `Node` objects. Here's the class hierarchy:

Node
- TranslationUnit
- Comment
- Declaration
- Declarator
- FunctionDef
- Parameter
- Enumerator
- MemberInit
- Member
- Statement
  - Block
  - If
  - Switch
  - While
  - For
  - Goto
  - Continue
  - Break
  - Return
  - ExpressionStatement
- Label
  - PlainLabel
  - Default
  - Case
- Type
  - IndirectType
    - Pointer
    - Array
    - Function
  - DirectType
    - Struct
    - Union
    - Enum
    - CustomType
    - PrimitiveType
      - Void
      - Int
      - Float
      - Char
      - Bool
      - Complex
      - Imaginary

- Expression
  - Comma
  - Conditional
  - Variable
  - UnaryExpression
    - PostfixExpression
      - Index
      - Call
      - Dot
      - Arrow
      - PostInc
      - PostDec
    - PrefixExpression
      - Cast
      - Address
      - Dereference
      - Sizeof
      - Positive
      - Negative
      - PreInc
      - PreDec
      - BitNot
      - Not
  - BinaryExpression
    - Add
    - Subtract
    - Multiply
    - Divide
    - Mod
    - Equal
    - NotEqual
    - Less
    - More
    - LessOrEqual
    - MoreOrEqual
    - BitAnd
    - BitOr
    - BitXor
    - ShiftLeft
    - ShiftRight
    - And
    - Or

  - AssignmentExpression
    - Assign
    - MultiplyAssign
    - DivideAssign
    - ModAssign
    - AddAssign
    - SubtractAssign
    - ShiftLeftAssign
    - ShiftRightAssign
    - BitAndAssign
    - BitXorAssign
    - BitOrAssign
  - Literal
    - StringLiteral
    - CharLiteral
    - CompoundLiteral
    - IntLiteral
    - FloatLiteral
- NodeList
  - NodeArray
  - NodeChain

The highlighted ones are abstract.

The last 2 (`NodeList`s) represent lists of `Node`s. They quack like standard Ruby `Array`s. `NodeChain` is a doubly linked list; `NodeArray` is an array.

### Node Methods

* `parent`: return the parent in the tree (a `Node` or nil).
* `pos`, `pos=`: the position in the source file (a `Node::Pos`).
* `to_s`: return the code for the tree (a `String`).
* `inspect`: return a pretty string for inspection, makes irb fun.
* `match?(str)`, `=~(str)`: return true iff str parses as a `Node` equal to this one.
* `detach`: remove this node from the tree (parent becomes nil) and return it.
* `detached?`, `attached?`: return true if parent is nil or non-nil respectively.
* `replace_with(node)`: replace this node with node in the tree.
* `swap_with(node)`: exchange this node with node in their trees.
* `insert_prev(*nodes)`, `insert_next(*nodes)`: insert nodes before or after this node, respectively, in the parent list. Parent must be a `NodeList`! Useful for adding statements before a node in a block, for example.
* `Foo?`: (where `Foo` is a module name) return `self.is_a?(Foo)`. This is a convenience for a common need. Example:

        # print all global variables
        ast.entities.each do |node|
          node.Declaration? or next
          node.declarators.each do |decl|
            unless decl.type.Function?
              puts "#{decl.name}: #{decl.type}"
            end
          end
        end
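
The `Foo?` predicates can be pictured with a small stand-alone sketch. It assumes a toy `Toy::Node` class and a `method_missing` hook; both names are hypothetical and only illustrate the idea, since casty's real implementation may work differently.

```ruby
# Toy stand-ins for a few node classes; these are NOT casty's classes.
module Toy
  class Node
    # Answer `Foo?` with is_a?(Toy::Foo) when a constant Toy::Foo exists.
    def method_missing(name, *args)
      target = predicate_target(name)
      return super if target.nil? || !args.empty?
      is_a?(target)
    end

    # Keep respond_to? consistent with method_missing.
    def respond_to_missing?(name, include_private = false)
      !predicate_target(name).nil? || super
    end

    private

    # Map :Declaration? to Toy::Declaration, or nil if there is no such constant.
    def predicate_target(name)
      str = name.to_s
      return nil unless str.end_with?('?')
      base = str.chomp('?')
      return nil unless base =~ /\A[A-Z]\w*\z/
      Toy.const_defined?(base, false) ? Toy.const_get(base, false) : nil
    end
  end

  class Declaration < Node; end
  class Type < Node; end
  class Function < Type; end
end

decl = Toy::Declaration.new
func = Toy::Function.new
puts decl.Declaration?   # true
puts func.Type?          # true (Function is a subclass of Type)
puts func.Declaration?   # false
```

Because the check is a plain `is_a?`, asking an abstract class name such as `Type?` also matches every subclass, which is what makes `decl.type.Function?` style filters short to write.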

The `=~` method lets you do:

    if declarator.type =~ 'const int *'
      puts "Ooh, a const int pointer!"
    end

This is not the same as `declarator.type.to_s == 'const int *'`; that'd require you to guess how `to_s` formats its strings (most notably, the whitespace).

### Fields and Children

The big table down below lists the *fields* of each `Node`. A field is an attribute which:

* is used in equality checks (`==` and `eql?`).
* is copied recursively by `dup` and `clone`.

Fields listed as *children* form the tree structure. They only have a `Node` or `nil` value, and are yielded/returned/affected by the traversal methods:

* `next`, `prev`: return the next/prev sibling.
* `list_next`, `list_prev`: like `next`/`prev`, but also requires the parent to be `NodeList`. I'll be honest; I don't remember why I added these methods. They may well suddenly disappear.
* `each`, `reverse_each`: Yield all (non-nil) children. `Node` includes `Enumerable`, so, you know.
* `depth_first`, `reverse_depth_first`: Walk the tree in that order, yielding two args (event, node) at each node. event is `:down` on the way down, `:up` on the way up. If the block throws `:prune`, it won't descend any further.
* `preorder`, `reverse_preorder`, `postorder`, `reverse_postorder`: Walk the tree depth first, yielding nodes in the given order. For the preorders, if the block throws `:prune`, it won't descend any further.
* `node_after(child)`, `node_before(child)`: return the node after/before child (same as `child.next`/`child.prev`).
* `remove_node(child)`: remove child from this node (same as `child.detach`).
* `replace_node(child, new_child)`: replace child with yeah you guessed it (same as `child.replace_with(new_child)`).

Note: don't modify the tree during traversal!

Other notes about the table:

* Field names that end in '?' are always true-or-false.
* If no default is listed:
  * it is false if the field name ends in a '?'
  * it is a `NodeArray` if it is a `NodeList`.
  * it is `nil` otherwise.

| Class | Field | Type / values | Default | Comments |
|---|---|---|---|---|
| TranslationUnit | entities \* | NodeList | NodeChain[] | The root of a parsed file. |
| Declaration | storage | :typedef, :extern, :static, :auto, :register | | Also:<br>`#typedef?` -- true iff `storage == :typedef`<br>`#extern?` -- true iff `storage == :extern`<br>`#static?` -- true iff `storage == :static`<br>`#auto?` -- true iff `storage == :auto`<br>`#register?` -- true iff `storage == :register` |
| | type \* | DirectType | | |
| | declarators \* | NodeList | NodeArray[] | |
| | inline? | true, false | | |
| Declarator | indirect_type \* | IndirectType | | See the Declarator note below the table. |
| | name | String | | |
| | init \* | Expression | | |
| | num_bits \* | Integer | | |
| FunctionDef | storage | :extern, :static | | Also:<br>`#extern?` -- return true iff `storage == :extern`<br>`#static?` -- return true iff `storage == :static`<br>`#prototype?` -- same as `!no_prototype?`<br>`#prototype=(val)` -- same as `no_prototype = !val`<br>See the FunctionDef note below the table. |
| | inline? | true, false | | |
| | type \* | Type | | |
| | name | String | | |
| | def \* | Block | Block.new | |
| | no_prototype? | true, false | | |
| Parameter | register? | true, false | | Used in Functions. |
| | type \* | Type | | |
| | name | String | | |
| Enumerator | name | String | | Used in Enums. |
| | val \* | Expression | | |
| MemberInit | member \* | NodeList of (Member or Expression) | | Used in CompoundLiterals. |
| | init \* | Expression | | |
| Member | name | String | | Used in MemberInits. |
| Block | labels \* | NodeList of Label | NodeArray[] | |
| | stmts \* | NodeList of (Statement or Declaration or Comment) | NodeArray[] | |
| If | labels \* | NodeList of Label | NodeArray[] | |
| | cond \* | Expression | | |
| | then \* | Statement | | |
| | else \* | Statement | | |
| Switch | labels \* | NodeList of Label | NodeArray[] | |
| | cond \* | Expression | | |
| | stmt \* | Statement | | |
| While | labels \* | NodeList of Label | NodeArray[] | `do?` means it's a do-while loop. |
| | do? | true, false | | |
| | cond \* | Expression | | |
| | stmt \* | Statement | | |
| For | labels \* | NodeList of Label | NodeArray[] | |
| | init \* | Expression or Declaration | | |
| | cond \* | Expression | | |
| | iter \* | Expression | | |
| | stmt \* | Statement | | |
| Goto | labels \* | NodeList of Label | NodeArray[] | |
| | target | String | | |
| Continue | labels \* | NodeList of Label | NodeArray[] | |
| Break | labels \* | NodeList of Label | NodeArray[] | |
| Return | labels \* | NodeList of Label | NodeArray[] | |
| | expr \* | Expression | | |
| ExpressionStatement | labels \* | NodeList of Label | NodeArray[] | |
| | expr \* | Expression | | |
| PlainLabel | name | String | | |
| Default | | | | |
| Case | expr \* | Expression | | |
| Comma | exprs \* | NodeList of Expression | | |
| Conditional | cond \* | Expression | | |
| | then \* | Expression | | |
| | else \* | Expression | | |
| Variable | name | String | | |
| Index | expr \* | Expression | | |
| | index \* | Expression | | |
| Call | expr \* | Expression | | |
| | args \* | NodeList of (Expression or Type) | | |
| Dot | expr \* | Expression | | |
| | member \* | String | | |
| Arrow | expr \* | Expression | | |
| | member \* | String | | |
| PostInc | expr \* | Expression | | |
| PostDec | expr \* | Expression | | |
| Cast | type \* | Type | | |
| | expr \* | Expression | | |
| Address | expr \* | Expression | | |
| Dereference | expr \* | Expression | | |
| Sizeof | expr \* | Type or Expression | | |
| Positive | expr \* | Expression | | |
| Negative | expr \* | Expression | | |
| PreInc | expr \* | Expression | | |
| PreDec | expr \* | Expression | | |
| BitNot | expr \* | Expression | | |
| Not | expr \* | Expression | | |
| Add | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| Subtract | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| Multiply | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| Divide | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| Mod | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| Equal | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| NotEqual | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| Less | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| More | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| LessOrEqual | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| MoreOrEqual | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| BitAnd | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| BitOr | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| BitXor | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| ShiftLeft | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| ShiftRight | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| And | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| Or | expr1 \* | Expression | | |
| | expr2 \* | Expression | | |
| Assign | lval \* | Expression | | |
| | rval \* | Expression | | |
| MultiplyAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| DivideAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| ModAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| AddAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| SubtractAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| ShiftLeftAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| ShiftRightAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| BitAndAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| BitXorAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| BitOrAssign | lval \* | Expression | | |
| | rval \* | Expression | | |
| StringLiteral | val | String | | The String in val is the literal string entered. `"\n"` isn't converted to a newline, for instance. |
| CharLiteral | val | String | | The String in val is the literal string entered. `'\n'` isn't converted to a newline, for instance. |
| CompoundLiteral | type \* | Type | | See the CompoundLiteral note below the table. |
| | member_inits \* | NodeList of MemberInit | NodeArray[] | |
| IntLiteral | val | Integer | | Also:<br>`#dec?` -- return true iff `format == :dec`<br>`#hex?` -- return true iff `format == :hex`<br>`#oct?` -- return true iff `format == :oct` |
| | format | :dec, :hex, :oct | :dec | |
| FloatLiteral | val | Float | | |
| Pointer | const? | true, false | | |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | type \* | Type | | |
| Array | const? | true, false | | |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | type \* | Type | | |
| | length \* | Expression | | |
| Function | const? | true, false | | |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | type \* | Type | | |
| | params \* | NodeList of Parameter | NodeArray[] | |
| | var_args? | true, false | | |
| Struct | const? | true, false | | |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | name | String | | |
| | members \* | NodeList of Member | NodeArray[] | |
| Union | const? | true, false | | |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | name | String | | |
| | members \* | NodeList of Member | NodeArray[] | |
| Enum | const? | true, false | | |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | name | String | | |
| | members \* | NodeList of Enumerator | | |
| CustomType | const? | true, false | | For typedef'd names. |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | name | String | | |
| Void | const? | true, false | | const is for things like `const void *`. |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| Int | const? | true, false | | Also:<br>`#short?` -- return true iff `longness == -1`<br>`#plain?` -- return true iff `longness == 0`<br>`#long?` -- return true iff `longness == 1`<br>`#long_long?` -- return true iff `longness == 2`<br>`#signed?` -- same as `!unsigned?`<br>`#signed=(val)` -- same as `unsigned = !val` |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | longness | -1, 0, 1, 2 | 0 | |
| | unsigned? | true, false | | |
| Float | const? | true, false | | Also:<br>`#plain?` -- return true iff `longness == 0`<br>`#double?` -- return true iff `longness == 1`<br>`#long_double?` -- return true iff `longness == 2` |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | longness | 0, 1, 2 | 0 | |
| Char | const? | true, false | | Also:<br>`#signed?` -- return true iff `signed == true`<br>`#unsigned?` -- return true iff `signed == false`<br>`#plain?` -- return true iff `signed == nil`<br>Yes, C99 says that `char`, `signed char`, and `unsigned char` are 3 distinct types (unlike with `int` -- go figure). Like Martian chalk and Venusian cheese: completely different, but you can fit 'em each in one byte. |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | signed | true, false, nil | | |
| Bool | const? | true, false | | This is the rarely seen `_Bool` type. |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| Complex | const? | true, false | | This is the rarely seen `_Complex` type.<br>`#plain?` -- return true iff `longness == 0`<br>`#double?` -- return true iff `longness == 1`<br>`#long_double?` -- return true iff `longness == 2` |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | longness | 0, 1, 2 | 0 | |
| Imaginary | const? | true, false | | This is the rarely seen `_Imaginary` type.<br>`#plain?` -- return true iff `longness == 0`<br>`#double?` -- return true iff `longness == 1`<br>`#long_double?` -- return true iff `longness == 2` |
| | restrict? | true, false | | |
| | volatile? | true, false | | |
| | longness | 0, 1, 2 | 0 | |
| BlockExpression | block \* | Block | Block.new | Only if the block_expressions extension is enabled. See "Extensions" section below. |

**Declarator.** What's a "declarator"? Consider `int i, *ip;`. This is a `Declaration` with two `Declarator`s:

    Declaration
        type: Int
        declarators:
            - Declarator
                name: "i"
            - Declarator
                indirect_type: Pointer
                name: "ip"

The `indirect_type` of the `ip` `Declarator` is a `Pointer` to nil. To get the complete type of the variable use:

* `#type` -- return the complete type. This is a clone; modifying it won't modify the tree.

So calling `#type` on the `ip` `Declarator` gives:

    Pointer
        type: Int

**FunctionDef.** `no_prototype?` means that no prototype was given. That means parameter types weren't given in the parens, but in the "old-style" declaration list. Example:

| No prototype | Prototype |
|---|---|
| `int main(argc, argv) int argc; char **argv; { return 0; }` | `int main(int argc, char **argv) { return 0; }` |

Everyone tells you to use prototypes. That's because no type checking is done when calling a function declared without a prototype.

**CompoundLiteral.** Here's an example:

    (struct S){1, .x = 2, .y [3] .z = 4}

parses as:

    CompoundLiteral
        type: Struct
            name: "S"
        member_inits:
            - MemberInit
                init: IntLiteral
                    val: 1
            - MemberInit
                member:
                    - Member
                        name: "x"
                init: IntLiteral
                    val: 2
            - MemberInit
                member:
                    - Member
                        name: "y"
                    - IntLiteral
                        val: 3
                    - Member
                        name: "z"
                init: IntLiteral
                    val: 4
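
The `depth_first` traversal described under Fields and Children (two-argument yields, `:down`/`:up` events, and `throw :prune` to skip a subtree) can be pictured with a small stand-alone sketch. `ToyNode` is a hypothetical class invented for illustration, not casty's `Node`, and whether casty still yields `:up` for a pruned node is an assumption made here.

```ruby
# Toy tree node; a stand-in for illustration, not casty's Node class.
class ToyNode
  attr_reader :name, :children

  def initialize(name, *children)
    @name = name
    @children = children
  end

  # Yield (:down, node) before visiting the children and (:up, node) after.
  # If the block throws :prune on the way down, the children are skipped.
  # (Assumption of this sketch: the pruned node still gets its :up event.)
  def depth_first(&block)
    descend = false
    catch(:prune) do
      block.call(:down, self)
      descend = true # only reached when the block did not throw :prune
    end
    children.each { |child| child.depth_first(&block) } if descend
    block.call(:up, self)
    self
  end
end

tree = ToyNode.new('root',
                   ToyNode.new('a', ToyNode.new('a1')),
                   ToyNode.new('b'))

events = []
tree.depth_first do |event, node|
  events << "#{event} #{node.name}"
  # Skip everything below node 'a'.
  throw :prune if event == :down && node.name == 'a'
end
puts events
```

With this toy tree the sketch prints `down root`, `down a`, `up a`, `down b`, `up b`, `up root`, one per line; `a1` is never visited because the walk was pruned at `a`. Using `catch`/`throw` rather than a return value keeps the block free to return anything it likes.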

## Parser

`C.parse` will use the default parser (`C.default_parser`), but you can also manage your own parser(s) if you need finer control over state. Parser state consists of:

* `type_names`: a Set of Strings. As a parser eats `typedef`s, this grows.
* `pos`: the `Node::Pos` this parser will start parsing at.

A `Node::Pos` has three read-write attributes: `filename`, `line_num`, `col_num`. Default is nil, 1, 0.

Note that the type names the parser has seen affect the parser! For example, consider:

    a * b;

* If only `a` is a type, this is a declaration.
* If neither `a` nor `b` is a type, this is a multiplication statement.
* Otherwise, it's a syntax error.

You may append type names implicitly, by parsing `typedef`s, or explicitly like this:

    parser.type_names << 'Thing' << 'OtherThing'

### Parsing Snippets

`C.parse` will parse the toplevel C construct, a `C::TranslationUnit`, but you can also parse other snippets of C:

    C::Statement.parse('while (not_looking) { paint_car(); }')
    C::Type.parse('void *(*)(int *(*)[][2], ...)')

This works for both concrete and abstract `Node` subclasses. A `Parser` may be given as an optional second argument.

### Extensions to C99

* `Type`s are allowed as function arguments. This is needed to parse C99 macros like `va_arg()`.
* `Block`s in parentheses are allowed as expressions ([a gcc extension][gcc-block-expressions]). You need to call `parser.enable_block_expressions` first. They appear as `BlockExpression` nodes.

[gcc-block-expressions]: http://gcc.gnu.org/onlinedocs/gcc-4.2.1/gcc/Statement-Exprs.html#Statement-Exprs

## Parsing Full Programs

This can be tricky for a number of reasons. Here are the issues you'll likely encounter.

### Preprocessing

Directives that start with `#` are not handled by the `Parser`, as they're external to the C grammar. CAST ships with a `Preprocessor`, which wraps the preprocessor used to build your Ruby interpreter.

    cpp = C::Preprocessor.new
    cpp.include_path << '/usr/include' << '/usr/local/include'
    cpp.macros['DEBUG'] = '1'
    cpp.macros['max(a, b)'] = '((a) > (b) ? (a) : (b))'
    cpp.preprocess(code)

Note, however, that preprocessors tend to leave vendor-specific extensions in their output. GNU `cpp`, for example, leaves "linemarkers" (lines that begin with `#`) in the output which you'll need to filter out manually before feeding it to a `Parser`.

### Built-in types

Preprocessed system headers may refer to types that only the compiler knows about. Mac OS 10.5's system `cpp`, for instance, assumes the compiler will recognize types such as `__darwin_va_list`. One way to handle this is to add such names to the parser's `type_names` before parsing.

### Syntactic Extensions

Some code may take advantage of compiler-specific extensions to the syntax. For example, `gcc` supports inline assembly via directives like:

    asm("movl %1, %%eax;"
        "movl %%eax, %0;"
        :"=r"(y)
        :"r"(x)
        :"%eax");

Such code is fairly rare, so there is no direct support in CAST for this. You'll need to manually massage such constructs out of the `Parser` input. Or send me patches. Delicious patches.

## Contributing

* [Bug reports](http://github.com/Vasfed/cast/issues)
* [Source](http://github.com/Vasfed/cast)
* Patches: Fork on GitHub, send pull request.
  * Include tests where practical.
  * Leave the version alone, or bump it in a separate commit.

## Copyright

Copyright (c) George Ogata. See LICENSE for details.