|
CAST
What Is
CAST parses C code into an abstract syntax tree (AST), lets you
break it, then vomit it out as code. The parser does C99.
Library Overview
Everything is in the module C.
- There's the parser (Parser).
- There's the tree (Node and its subclasses).
- That's it.
Usage
You call Parser#parse, and it gives you a tree of
Node objects. Watch:
require 'cast/cast'
parser = C::Parser.new
parser.pos.filename = "toy.c"
parser.type_names << 'LinkedList'
ugly_c_code = open("toy.c"){|f| f.read}
tree = parser.parse(ugly_c_code)
p tree
If there's a parse error, #parse raises a
ParseError (which has a nice error message in
#message).
The Parser
Here's a quiz: what does "a * b;" do?
I bet you said "why you l4m3r n00b, that's a statement that
multiplies a by b and throws away the answer --
now go take your meaningless snippetage to your computing 101 class
and let me finish hurting this JavaTM programmer." Well,
you'd be both mean and wrong. It was, of course, a trick question.
I didn't say if any of a and b are types! If only
a is a type, it's actually a declaration. And if
b is a type, it's a syntax error.
So, the parser's gonna need to know which identifiers are type
names. This is one of the bits of state that a Parser
keeps. Here's the complete list (um, of two):
-
#type_names -- a Set of Strings.
-
#pos -- the Node::Pos this parser will start
parsing at.
A Node::Pos has three read-write atts: #filename,
#line_num, #col_num. Default is nil, 1,
0.
The Nodes
There are a couple of Node classes:
The bold ones are abstract.
The last 2 (the NodeLists) represent lists of nodes.
Methodwise, they try to behave like normal ruby ::Arrays.
Implementationwise, a NodeChain is a doubly linked list,
whereas a NodeArray is an array. NodeChains may
be more efficient when adding things at the beginning of a LARGE
list.
Attributes
Each Node object has:
-
#parent -- return the parent in the tree (a Node
or nil).
-
#pos, #pos= -- the position in the source file
(a Node::Pos).
-
#to_s -- return the code for the tree (a
String).
-
#inspect -- return a pretty string for inspection.
Try it.
-
#match?(str), #=~(str) -- return true iff
str parses as a Node equal to this one.
-
#detach -- remove this node from the tree (parent becomes
nil) and return it.
-
#detached?, #attached? -- return true if parent
is nil or non-nil respectively.
-
#replace_with(node) -- replace this node with
node in the tree.
-
#swap_with(node) -- exchange this node with node
in their trees.
-
#insert_prev(*nodes),
#insert_next(*nodes) -- insert nodes before this
node in the parent list. Parent must be a NodeList!
Useful for adding statements before a node in a block, for
example.
-
#Foo? -- (where Foo is a module
name) return self.is_a?(Foo).
The #Foo? method is a convienience for a common
need. Example:
ast = C::Parser.new.parse(code_string)
ast.entities.each do |node|
node.Declaration? or next
node.declarators.each do |decl|
unless decl.type.Function?
puts "#{decl.name}: #{decl.type}"
end
end
end
If you're a duck-typing purist, then sorry for the cardiac arrest
you're now experiencing. CAST does pay attention to the
class of Node objects for quite a few things. This is the
cleanest way to distinguish, e.g., an Add from a
Subtract (which both have the same methods but represent
very different things). It just seems impractical (and unnecessary)
to allow duck typing in this situation.
The #=~ method lets you do:
if declarator.type =~ 'const int *'
puts "Ooh, a const int pointer!"
end
This is not the same as "declarator.type.to_s == 'const
int *'"; that'd require you to guess how to_s formats its
strings (most notably, the whitespace).
Fields and children
Each concrete Node class has a member for each bit of
important C stuff it pertains to. I know you peeked at the big list
below, so you know the kind of thing I mean.
But these aren't defined as attrs as you normally do in
Ruby -- they're fields. If a node has a field
foo, it means there's a setter #foo= and getter
#foo. (A field foo? means the setter is
#foo= and the getter is #foo?.) Some fields are
even more special: child fields. These form the
tree structure, and always have a Node or nil
value.
Why divulge these bizarre internal secrets? Because these
Node methods are defined in terms of fields and children:
-
#==, #eql? -- Equality is checked recursively.
That is, all fields (including children) must be equal.
-
#dup, #clone -- Children are copied recursively
(other fields and attrs as normal).
Then there's the tree-twiddling methods, which only ever
yield/return/affect (non-nil) children.
-
#next, #prev -- return the next/prev sibling.
-
#list_next, #list_prev -- like
#next/#prev, but also requires the parent to be
NodeList. I'll be honest; I don't remember why I added
these methods. They may well suddenly disappear.
-
#each, #reverse_each -- Yield all (non-nil)
children. Node includes Enumerable, so you get the free
chocolates too.
-
#depth_first, #reverse_depth_first -- Walk the
tree in that order, yielding two args (event, node) at
each node. event is :down on the way down,
:up on the way up. If the block throws
:prune, it won't descend any further.
-
#preorder, #reverse_preorder,
#postorder, #reverse_postorder -- Walk the tree
depth first, yielding nodes in the given order. For the
preorders, if the block throws :prune, it won't descend
any further.
-
#node_after(child), #node_before(child) --
return the node before/after child (same as
child.next).
-
#remove_node(child) -- remove child from this
node (same as child.detach).
-
#replace_node(child, new_child) -- replace
child with yeah you guessed it (same as
child.replace_with(newchild)).
If you're walking the tree looking for nodes to move around, don't
forget that modifying the tree during traversal is a criminal
offence.
And now, the episode you've been waiting for: THE FIELD LIST! (Cue
music and dim lights.)
Notes about the table:
-
If no default is listed, it is false if the field name
ends in a '?', nil otherwise.
-
nil is always allowed for a child field.
-
There are some conventions we follow religiously to help you:
-
Field names that end in '?' are always true-or-false.
-
NodeList-valued fields always default to an
empty NodeArray or NodeChain, so you can
tack things on there with <<, without worrying
about needing to create the list first.
-
A field is Node-valued if and only if it is
a child field.
-
The rows go yellow, green, yellow, green, ... .
TranslationUnit |
entities |
Y |
NodeList |
NodeChain[] |
Declaration |
storage |
|
:typedef, :extern, :static, :auto, :register |
|
type |
Y |
DirectType |
|
declarators |
Y |
NodeList |
NodeArray[] |
inline? |
|
true, false |
|
Declarator |
indirect_type |
Y |
IndirectType |
|
name |
|
String |
|
init |
Y |
Expression |
|
num_bits |
Y |
Integer |
|
FunctionDef |
storage |
|
:extern, :static |
|
inline? |
|
true, false |
|
type |
Y |
Type |
|
name |
|
String |
|
def |
Y |
Block |
Block.new |
no_prototype? |
|
true, false |
|
Parameter |
register? |
|
true, false |
|
type |
Y |
Type |
|
name |
|
String |
|
Enumerator |
name |
|
String |
|
val |
Y |
Expression |
|
MemberInit |
member |
Y |
NodeList of Member-or-Expression |
|
init |
Y |
Expression |
|
Member |
name |
|
String |
|
Block |
labels |
Y |
NodeList of Label |
NodeArray[] |
stmts |
Y |
NodeList of Statement |
NodeArray[] |
If |
labels |
Y |
NodeList of Label |
NodeArray[] |
cond |
Y |
Expression |
|
then |
Y |
Statement |
|
else |
Y |
Statement |
|
Switch |
labels |
Y |
NodeList of Label |
NodeArray[] |
cond |
Y |
Expression |
|
stmt |
Y |
Statement |
|
While |
labels |
Y |
NodeList of Label |
NodeArray[] |
do? |
|
true, false |
|
cond |
Y |
Expression |
|
stmt |
Y |
Statement |
|
For |
labels |
Y |
NodeList of Label |
NodeArray[] |
init |
Y |
Expression or Declaration |
|
cond |
Y |
Expression |
|
iter |
Y |
Expression |
|
stmt |
Y |
Statement |
|
Goto |
labels |
Y |
NodeList of Label |
NodeArray[] |
target |
|
String |
|
Continue |
labels |
Y |
NodeList of Label |
NodeArray[] |
Break |
labels |
Y |
NodeList of Label |
NodeArray[] |
Return |
labels |
Y |
NodeList of Label |
NodeArray[] |
expr |
Y |
Expression |
|
ExpressionStatement |
labels |
Y |
NodeList of Label |
NodeArray[] |
expr |
Y |
Expression |
|
PlainLabel |
name |
|
String |
|
Default |
|
|
|
|
Case |
expr |
Y |
Expression |
|
Comma |
exprs |
Y |
NodeList of Expression |
|
Conditional |
cond |
Y |
Expression |
|
then |
Y |
Expression |
|
else |
Y |
Expression |
|
Variable |
name |
|
String |
|
Index |
expr |
Y |
Expression |
|
index |
Y |
Expression |
|
Call |
expr |
Y |
Expression |
|
args |
Y |
NodeList of Expression-or-Type |
|
Dot |
expr |
Y |
Expression |
|
member |
Y |
String |
|
Arrow |
expr |
Y |
Expression |
|
member |
Y |
String |
|
PostInc |
expr |
Y |
Expression |
|
PostDec |
expr |
Y |
Expression |
|
Cast |
type |
Y |
Type |
|
expr |
Y |
Expression |
|
Address |
expr |
Y |
Expression |
|
Dereference |
expr |
Y |
Expression |
|
Sizeof |
expr |
Y |
Type or Expression |
|
Positive |
expr |
Y |
Expression |
|
Negative |
expr |
Y |
Expression |
|
PreInc |
expr |
Y |
Expression |
|
PreDec |
expr |
Y |
Expression |
|
BitNot |
expr |
Y |
Expression |
|
Not |
expr |
Y |
Expression |
|
Add |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
Subtract |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
Multiply |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
Divide |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
Mod |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
Equal |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
NotEqual |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
Less |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
More |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
LessOrEqual |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
MoreOrEqual |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
BitAnd |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
BitOr |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
BitXor |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
ShiftLeft |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
ShiftRight |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
And |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
Or |
expr1 |
Y |
Expression |
|
expr2 |
Y |
Expression |
|
Assign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
MultiplyAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
DivideAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
ModAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
AddAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
SubtractAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
ShiftLeftAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
ShiftRightAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
BitAndAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
BitXorAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
BitOrAssign |
lval |
Y |
Expression |
|
rval |
Y |
Expression |
|
StringLiteral |
val |
|
String |
|
CharLiteral |
val |
|
String |
|
CompoundLiteral |
type |
Y |
Type |
|
member_inits |
Y |
NodeList of MemberInit |
NodeArray[] |
IntLiteral |
val |
|
Integer |
|
format |
|
:dec, :hex, :oct |
:dec |
FloatLiteral |
val |
|
Float |
|
Pointer |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
type |
Y |
Type |
|
Array |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
type |
Y |
Type |
|
length |
Y |
Expression |
|
Function |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
type |
Y |
Type |
|
params |
Y |
NodeList of Parameter |
NodeArray[] |
var_args? |
|
true, false |
|
Struct |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
name |
|
String |
|
members |
Y |
NodeList of Member |
NodeArray[] |
Union |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
name |
|
String |
|
members |
Y |
NodeList of Member |
NodeArray[] |
Enum |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
name |
|
String |
|
members |
Y |
NodeList of Enumerator |
|
CustomType |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
name |
|
String |
|
Void |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
Int |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
longness |
|
-1, 0, 1, 2 |
0 |
unsigned? |
|
true, false |
|
Float |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
longness |
|
0, 1, 2 |
0 |
Char |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
signed |
|
true, false, nil |
|
Bool |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
Complex |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
longness |
|
0, 1, 2 |
0 |
Imaginary |
const? |
|
true, false |
|
restrict? |
|
true, false |
|
volatile? |
|
true, false |
|
longness |
|
0, 1, 2 |
0 |
Node Construction
Wanna make a Node? Take your pick:
-
#new(field1, field2, ...) -- fields are in the order
listed above.
-
#new(:field1 => val, :field2 => val, ...) -- field order
doesn't matter this way.
-
#new_at(pos, *args) -- set the pos to that
given, and pass args to #new.
They're for losers, though. What you really want to do is make
Nodes by parsing C code. Each class -- even the abstract
classes like Statement -- has a .parse method:
funcdef = C::FunctionDef.parse <<EOS
void frobnicate(int karma) {
use_waffle_iron();
}
stmt = C::Statement.parse('while (not_looking) paint_car();')
Need to tell it to treat WaffleIron as a type name? All
those parse methods use C.default_parser:
C.default_parser.type_names << 'WaffleIron'
type = C::Type.parse('WaffleIron')
Alternatively, you could've given parse your own
parser:
parser = C::Parser.new
parser.type_names << 'WaffleIron'
type = C::Type.parse('WaffleIron', parser)
In fact, there's also C.parse(str, parser=nil), which is an
alias for C::TranslationUnit.parse(str, parser).
ast = C.parse(STDIN)
Yes, all that talk in the intro about doing parser =
C::Parser.new; parser.parse(...) was actually all a charade to
make you type more. I so own you.
Open Issues
-
Is it okay to bastardize the =~ operator like that?
-
Should binary operators have a list of expressions rather than
just 2? That'd allow to_s to format the strings better in some
cases and make it consistent with Comma.
-
At the moment CAST chokes on preprocessor #-lines. Ruby makes it
trivial to filter these out before passing the string to
Parser#parse, and this makes it obvious when you forget
to run a file through the preprocessor (which you need to do to
get type names at least once). Is this stupid? Ideally we should
probably have a builtin preprocessor and use that.
-
Should a Member be allowed in MemberInit#member?
It'd be the common case, but special cases make uniform treatment
more arduous. Is "uniform treatment" an issue?
Vote now.
To Do
-
Stop embarrasing yourself and write the parser in C.
-
Make it -wd friendly.
-
Fix the "TODO" bits in c.y. These are for rarely used C
constructs, like the declaration int arr[*];.
-
Add a comment node. Might make it useful for doc extraction.
Anyone want this? Not all comments will be caught, though.
Comments can appear between any two tokens, and I don't really
want to add comment fields to every node. They'd probably just
appear between toplevel entities (in
TranslationUnit#entities) and between statements (in
Block#stmts).
-
Make it rdoc-able.
If any of these affect you greatly, kick me to make it happen
faster.
Contact
I'm not really sure what people are going to try to use this for. If
there's some functionality you think would make a good addition, or
think I've made a mess of this poor puppy, give me a yell.
You can spam me at george.ogata@gmail.com.
It'd help if you prefixed the subject with "[cast] " so I can easily
distinguish CAST spam from fake Rolex spam.
|