README.md in sequitur-0.1.01 vs README.md in sequitur-0.1.02

- old
+ new

@@ -13,11 +13,11 @@ [Sequitur algorithm home](http://sequitur.info/) [Wikipedia](http://en.wikipedia.org/wiki/Sequitur_algorithm) ### The theory in a nutshell ### Given a sequence of input tokens (say, characters), the Sequitur algorithm -will represent that input sequence as a set of rules. As the algorithm detects +will represent that input sequence as a set of rules. As the algorithm detects automatically repeated token patterns, the resulting rule set can encode repetitions in the input in a very compact way. Of interest is the fact that the algorithm runs in time linear in the length of the input sequence. **Can you give a simple example?** @@ -44,11 +44,11 @@ P2 : P1 c. P3 : P2 d. ``` Translated in plain English: -- Rule (start) tells that the input consists of the sequence of P_1 P_2 P_3 patterns followed by the letter e. +- Rule (start) tells that the input consists of the sequence of P1 P2 P3 patterns followed by the letter e. - Rule (P1) represents the sequence 'ab'. - Rule (P2) represents the pattern encoded by P1 (thus 'ab') then 'c'. In other words, it represents the string 'abc'. - Rule (P3) represents the pattern encoded by P2 then d. It is thus equivalent to 'abcd'. @@ -76,10 +76,11 @@ ```` The demo illustrates how easy it is to run the algorithm on a string. However, the next question is how can you make good use of the algorithm's result. +**Printing the resulting rules** The very first natural step is to be able to print out the (grammar) rules. Here's how: ```ruby @@ -102,9 +103,90 @@ # Rendered output is: # start : P1 P2 P3 P3 e. # P1 : a b. # P2 : P1 c. # P3 : P2 d. +``` + +## Understanding the algorithm's results +The Sequitur algorithm generates a -simplified- context-free grammar, therefore we dedicate this section +to the terminology about context-free grammars. As the Internet provides tons of information can be found +on the subject, we limit ourselves to the minimal terminology of interest when using the sequitur gem. + +First of all, what is a **grammar**? To simplify the matter, one can see a grammar as a set of +grammar rules. These rules are called production rules or more briefly **productions**. + +In a context-free grammar, productions have the form: +```` +P : body. +``` + +Where: +- The colon ':' character separates the head (= left-hand side) and the body (right-hand side, *rhs* in short) +of the rule. +- The left-hand side consists just of one symbol, P. P is a categorized as a *nonterminal symbol* and for our purposes +a nonterminal symbol can be seen as the "name" of the production. By contrast, a terminal symbol is just one element +from the input sequence (symbols as defined in formal grammar theory shouldn't be confused with Ruby's `Symbol` class). +- the body is a sequence -possibly empty- of *symbols* (terminal or nonterminal). + +Basically, a production rule tells that P is equivalent to the sequence of symbols found in the +right-hand side of the production. A nonterminal symbol that appears in the rhs of a production can be +seen as a reference to the production with same name. + + +## The Sequitur API + +Recall the above example: a single call to the `Sequitur#build_from` factory method +suffices to construct a grammar object. + +```ruby + require 'sequitur' + + input_sequence = 'ababcabcdabcde' + grammar = Sequitur.build_from(input_sequence) +``` + +The return value `grammar` is a `Sequitur::SequiturGrammar` instance. + +Unsurprisingly, the `Sequitur::SequiturGrammar` class defines an accessor method called 'productions' +that returns the productions of the grammar as an array of `Sequitur::Production` objects. + +```ruby + # Count the number of productions in the grammar + puts grammar.productions.size # => 4 + + # Retrieve all productions of the grammar + all_prods = grammar.productions + + # Retrieve the start production + start_prod = grammar.production[0] +``` + +Once we have a grip on a production, it is easy to access its right-hand side through the `Production#rhs` method. +It returns an array of symbols. + +```ruby + # ...Continuing the same example + # Retrieve the right-hand side of the production + prod_body = start_prod.rhs # Return an Array object +``` + +The RHS of a production is a sequence (i.e. Array) of symbols. +How are the grammar symbols implemented? +-Terminal symbols are directly originating from the input sequence. They are inserted "as is" in the +RHS. For instance, if the input sequence consists of integer values (i.e. Finum instances), then they +will be inserted in the RHS of productions. +-Non-terminal symbols are implemented as `Sequitur::ProductionRef` objects. + +A ProductionRef is reference to a Production object. The latter one can be accessed through the `ProductionRef#production` method. + + +### Installation ### +The sequitur gem installation is fairly standard. +If your project has a `Gemfile` file, add `sequitur` to it. Otherwise, install the gem like this: + +```bash +$[sudo] gem install sequitur ``` ### TODO: Add more documentation ### \ No newline at end of file