DataFrame.md in red_amber-0.1.5

- old
+ new

@@ -1,25 +1,25 @@
 # DataFrame
 
-Class `RedAmber::DataFrame` represents 2D-data. `DataFrame` consists with:
+Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists of:
 - A collection of data which have same data type within. We call it `Vector`.
 - A label is attached to `Vector`. We call it `key`.
 - A `Vector` and associated `key` is grouped as a `variable`.
 - `variable`s with the same vector length are aligned and arranged to be a `DataFrame`.
 - Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `observation`.
 
 ![dataframe model image](doc/../image/dataframe_model.png)
 
 ## Constructors and saving
 
-### `new` from a columnar Hash
+### `new` from a Hash
 
   ```ruby
   RedAmber::DataFrame.new(x: [1, 2, 3])
   ```
 
-### `new` from a schema (by Hash) and rows (by Array)
+### `new` from a schema (by Hash) and data (by Array)
 
   ```ruby
   RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
   ```
 
@@ -50,11 +50,11 @@
 - from a string buffer
 
 - from a URI
 
   ```ruby
-  uri = URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv")
+  uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
   RedAmber::DataFrame.load(uri)
   ```
 
 - from a Parquet file
 
@@ -76,11 +76,11 @@
   dataframe.save("file.parquet")
   ```
 
 ## Properties
 
-### `table`
+### `table`, `to_arrow`
 
 - Reader of Arrow::Table object inside.
 
 ### `size`, `n_obs`, `n_rows`
   
@@ -91,20 +91,57 @@
 - Returns num of keys (num of variables).
  
 ### `shape`
  
 - Returns shape in an Array[n_rows, n_cols].
- 
+
+### `variables`
+
+- Returns pairs of key names and Vectors in a Hash.
+
+  It is convenient to use in a block when both the key and the vector are required. We can write:
+
+  ```ruby
+    # update numeric variables
+    df.assign do
+      variables.select.with_object({}) do |(key, vector), assigner|
+        assigner[key] = vector * -1 if vector.numeric?
+      end
+    end
+  ```
+
+  Instead of:
+  ```ruby
+    df.assign do
+      assigner = {}
+      vectors.each_with_index do |vector, i|
+        assigner[keys[i]] = vector * -1 if vector.numeric?
+      end
+      assigner
+    end
+  ```
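
The key/vector pattern above can be tried without RedAmber. The following is a minimal plain-Ruby sketch, assuming a Hash of Arrays as a stand-in for `variables`; the `numeric` lambda is a hypothetical substitute for `Vector#numeric?` and is not part of the library.

```ruby
# Plain-Ruby stand-in: a Hash of key => Array plays the role of `variables`.
variables = {
  index: [0, 1, 2],
  float: [0.0, 1.1, 2.2],
  string: %w[A B C]
}

# Hypothetical substitute for Vector#numeric? (an assumption for this sketch).
numeric = ->(values) { values.all? { |value| value.is_a?(Numeric) } }

# Build the assigner Hash from key/values pairs in a single pass.
assigner = variables.each_with_object({}) do |(key, values), result|
  result[key] = values.map { |value| value * -1 } if numeric.call(values)
end

p assigner.keys
```

Only the numeric entries end up in `assigner`, which mirrors how the block passed to `assign` returns a Hash of the variables to update.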
+
 ### `keys`, `var_names`, `column_names`
   
 - Returns key names in an Array.
 
+  When we iterate over vectors, `Vector#key` is useful to get the key of each vector inside the DataFrame.
+
+  ```ruby
+    # update numeric variables, another solution
+    df.assign do
+      vectors.each_with_object({}) do |vector, assigner|
+        assigner[vector.key] = vector * -1 if vector.numeric?
+      end
+    end
+  ```
+
 ### `types`
   
 - Returns types of vectors in an Array of Symbols.
 
-### `data_types`
+### `type_classes`
 
 - Returns types of vectors in an Array of `Arrow::DataType`.
 
 ### `vectors`
 
@@ -165,11 +202,11 @@
   6 :body_mass_g       uint16    95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
   7 :sex               string     3 {"male"=>168, "female"=>165, nil=>11}
   8 :year              uint16     3 {2007=>110, 2008=>114, 2009=>120}
   ```
 
-  - limit: limits variable number to show. Default value is 10.
+  - limit: limit of variables to show. Default value is 10.
   - tally: max level to use tally mode.
   - elements: max num of element to show values in each observations.
 
 ### `inspect`
 
@@ -222,12 +259,21 @@
   df[:a]
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
   [1, 2, 3]
   ```
-  This may be useful to use in a block of DataFrame manipulations.
+  The `#v` method also returns a Vector for a key.
 
+  ```ruby
+  df.v(:a)
+  # =>
+  #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
+  [1, 2, 3]
+  ```
+
+  This may be useful in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`.
+
 ### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
 
 - Select an obs. by index: `df[0]`
 - Select obs. by indices in a Range: `df[1..2]`
 
@@ -265,17 +311,17 @@
     1 :a  uint8      1 [1]
     2 :b  string     1 ["A"]
     3 :c  double     1 [1.0]
     ```
 
-### Select rows from top or bottom
+### Select rows from top or from bottom
 
   `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
 
 ## Sub DataFrame manipulations
 
-### `pick`
+### `pick  ` - pick up variables by key label -
 
   Pick up some variables (columns) to create a sub DataFrame.
 
   ![pick method image](doc/../image/dataframe/pick.png)
 
@@ -311,21 +357,22 @@
  - Keys or booleans by a block
 
     `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
 
     ```ruby
+    # It is ok to write `keys ...` in the block, not `penguins.keys ...`
     penguins.pick { keys.map { |key| key.end_with?('mm') } }
     # =>
     #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
     Vectors : 3 numeric
     # key                type   level data_preview
     1 :bill_length_mm    double   165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
     2 :bill_depth_mm     double    81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
     3 :flipper_length_mm int64     56 [181, 186, 195, nil, 193, ... ], 2 nils
     ```
 
-### `drop`
+### `drop  ` - pick and drop -
 
   Drop some variables (columns) to create a remainder DataFrame.
 
   ![drop method image](doc/../image/dataframe/drop.png)
 
@@ -350,29 +397,29 @@
   booleans_invert = booleans.map(&:!) # => [false, true, true]
   df.pick(booleans) == df.drop(booleans_invert) # => true
   ```
 - Difference between `pick`/`drop` and `[]`
 
-  If `pick` or `drop` will select single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`.
+  If `pick` or `drop` selects a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`. This behavior may be useful in a block of DataFrame manipulations.
 
   ```ruby
   df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
-  df[:a]
-  # =>
-  #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
-  [1, 2, 3]
-
   df.pick(:a) # or
   df.drop(:b, :c)
   # =>
   #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
   Vector : 1 numeric
   # key type  level data_preview
   1 :a  uint8     3 [1, 2, 3]
+
+  df[:a]
+  # =>
+  #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
+  [1, 2, 3]
   ```
 
-### `slice`
+### `slice  `  - to cut vertically is slice -
 
   Slice and select observations (rows) to create a sub DataFrame.
 
   ![slice method image](doc/../image/dataframe/slice.png)
 
@@ -486,21 +533,21 @@
     ```ruby
     # remove all observations containing nil
     removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
     removed.tdr
     # =>
-    RedAmber::DataFrame : 342 x 8 Vectors
+    RedAmber::DataFrame : 333 x 8 Vectors
     Vectors : 5 numeric, 3 strings
     # key                type   level data_preview
-    1 :species           string     3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
-    2 :island            string     3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
-    3 :bill_length_mm    double   164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
-    4 :bill_depth_mm     double    80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
-    5 :flipper_length_mm int64     55 [181, 186, 195, 193, 190, ... ]
-    6 :body_mass_g       int64     94 [3750, 3800, 3250, 3450, 3650, ... ]
-    7 :sex               string     3 {"male"=>168, "female"=>165, ""=>9}
-    8 :year              int64      3 {2007=>109, 2008=>114, 2009=>119}
+    1 :species           string     3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
+    2 :island            string     3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
+    3 :bill_length_mm    double   163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
+    4 :bill_depth_mm     double    79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
+    5 :flipper_length_mm uint8     54 [181, 186, 195, 193, 190, ... ]
+    6 :body_mass_g       uint16    93 [3750, 3800, 3250, 3450, 3650, ... ]
+    7 :sex               string     2 {"male"=>168, "female"=>165}
+    8 :year              uint16     3 {2007=>103, 2008=>113, 2009=>117}    
     ```
 
 - Keys or booleans by a block
 
     `remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
@@ -581,11 +628,11 @@
 
   Symbol key and String key are distinguished.
 
 ### `assign`
 
-  Assign new variables (columns) and create a updated DataFrame.
+  Assign new or updated variables (columns) and create an updated DataFrame.
 
   - Variables with new keys will append new variables at bottom (right in the table).
   - Variables with existing keys will update corresponding vectors.
 
     ![assign method image](doc/../image/dataframe/assign.png)
@@ -647,32 +694,135 @@
     Vectors : 2 numeric, 1 string
     # key     type   level data_preview
     1 :index  int8       5 [0, -1, -2, -3, nil], 1 nil
     2 :float  double     5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
     3 :string string     5 ["A", "B", "C", "D", nil], 1 nil
+
+    # Or it's shorter like this:
+    df.assign do
+      variables.select.with_object({}) do |(key, vector), assigner|
+        assigner[key] = vector * -1 if vector.numeric?
+      end
+    end
+    # => same as above
     ```
 
 - Key type
 
   Symbol key and String key are considered as the same key.
 
 ## Updating
 
-- [ ] Update elements matching a condition
+### `sort`
 
-- [ ] Clamp
+  `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature.
+    - :key, "key" or "+key" denotes ascending order
+    - "-key" denotes descending order
 
-- [ ] Sort rows
+  ```ruby
+  df = RedAmber::DataFrame.new({
+        index:  [1, 1, 0, nil, 0],
+        string: ['C', 'B', nil, 'A', 'B'],
+        bool:   [nil, true, false, true, false],
+      })
+  df.sort(:index, '-bool').tdr(tally: 0)
+  # =>
+  RedAmber::DataFrame : 5 x 3 Vectors
+  Vectors : 1 numeric, 1 string, 1 boolean
+  # key     type    level data_preview
+  1 :index  uint8       3 [0, 0, 1, 1, nil], 1 nil
+  2 :string string      4 [nil, "B", "B", "C", "A"], 1 nil
+  3 :bool   boolean     3 [false, false, true, nil, true], 1 nil
+  ```
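
To make the sort key notation concrete, here is a plain-Ruby sketch that parses keys such as `:index` or `'-bool'` and sorts an Array of row Hashes. It does not call RedAmber or Red Arrow. The helper names (`parse_sort_key`, `rank`, `compare`, `sort_rows`) are hypothetical, and placing nil last regardless of direction is an assumption inferred from the output shown above.

```ruby
# Parse a sort key such as :index, "index", "+index" or "-index"
# into [key_symbol, order].
def parse_sort_key(spec)
  name = spec.to_s
  order = name.start_with?('-') ? :descending : :ascending
  [name.delete_prefix('+').delete_prefix('-').to_sym, order]
end

# Rank booleans so that false < true; other values are compared as they are.
def rank(value)
  case value
  when true then 1
  when false then 0
  else value
  end
end

# Compare two values for one sort key. nil sorts last in either direction
# (an assumption taken from the example output, not from the library source).
def compare(a, b, order)
  return 0 if a.nil? && b.nil?
  return 1 if a.nil?
  return -1 if b.nil?

  result = rank(a) <=> rank(b)
  order == :descending ? -result : result
end

# Sort rows by several keys; ties keep their original order.
def sort_rows(rows, *specs)
  keys = specs.map { |spec| parse_sort_key(spec) }
  rows.each_with_index.sort do |(row_a, i), (row_b, j)|
    result = 0
    keys.each do |key, order|
      result = compare(row_a[key], row_b[key], order)
      break unless result.zero?
    end
    result.zero? ? i <=> j : result
  end.map(&:first)
end

rows = [
  { index: 1,   string: 'C', bool: nil },
  { index: 1,   string: 'B', bool: true },
  { index: 0,   string: nil, bool: false },
  { index: nil, string: 'A', bool: true },
  { index: 0,   string: 'B', bool: false }
]

sorted = sort_rows(rows, :index, '-bool')
p sorted.map { |row| row[:string] }
```

With the same data as the `sort` example, the sketch orders the rows the same way as the `tdr` output above.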
 
+- [ ] Clamp
+
 - [ ] Clear data
 
 ## Treat na data
 
-- [ ] Drop na (NaN, nil)
+### `remove_nil`
 
-- [ ] Replace na with value
+  Remove any observations containing nil.
 
-- [ ] Interpolate na with convolution array
+## Grouping
+
+### `group(aggregating_keys, function, target_keys)`
+
+  Creates a grouped DataFrame by `aggregating_keys`, applies `function` to each group, and returns the results for `target_keys`. Aggregated key names follow the `function(key)` style.
+
+  (The current implementation is not intuitive. Needs improvement.)
+
+  ```ruby
+  ds = Datasets::Rdatasets.new('dplyr', 'starwars')
+  starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
+  starwars.tdr(11)
+  # =>
+  RedAmber::DataFrame : 87 x 11 Vectors
+  Vectors : 3 numeric, 8 strings
+  #  key         type   level data_preview
+  1  :name       string    87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
+  2  :height     uint16    46 [172, 167, 96, 202, 150, ... ], 6 nils
+  3  :mass       double    39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
+  4  :hair_color string    13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
+  5  :skin_color string    31 ["fair", "gold", "white, blue", "white", "light", ... ]
+  6  :eye_color  string    15 ["blue", "yellow", "red", "yellow", "brown", ... ]
+  7  :birth_year double    37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
+  8  :sex        string     5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
+  9  :gender     string     3 {"masculine"=>66, "feminine"=>17, nil=>4}
+  10 :homeworld  string    49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
+  11 :species    string    38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
+
+  grouped = starwars.group(:species, :mean, [:mass, :height])
+  # =>
+  #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
+  Vectors : 2 numeric, 1 string
+  # key             type   level data_preview
+  1 :"mean(mass)"   double    27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
+  2 :"mean(height)" double    32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
+  3 :species        string    38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
+
+  count = starwars.group(:species, :count, :species)[:"count(species)"]
+  df = grouped.slice(count > 1)
+  # =>
+  #<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
+  Vectors : 2 numeric, 1 string
+  # key             type   level data_preview
+  1 :"mean(mass)"   double     8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
+  2 :"mean(height)" double     8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
+  3 :species        string     8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
+
+  df.table
+  # =>
+  #<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
+	mean(mass)	mean(height)	species
+  0	 82.781818	  176.645161	Human  
+  1	 69.750000	  131.200000	Droid  
+  2	124.000000	  231.000000	Wookiee
+  3	 74.000000	  208.666667	Gungan 
+  4	 80.000000	  173.000000	Zabrak 
+  5	 55.000000	  179.000000	Twi'lek
+  6	 53.100000	  168.000000	Mirialan
+  7	 88.000000	  221.000000	Kaminoan
+  ```
+
+  Available functions are:
+
+  - [ ] all                 
+  - [ ] any
+  - [ ] approximate_median
+  - ✓ count
+  - [ ] count_distinct
+  - [ ] distinct
+  - ✓ max
+  - ✓ mean
+  - ✓ min
+  - [ ] min_max
+  - ✓ product
+  - ✓ stddev
+  - ✓ sum
+  - [ ] tdigest
+  - ✓ variance
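
The idea behind `group` with `:mean` and the `function(key)` naming can be sketched in plain Ruby. This is not the RedAmber implementation: `group_mean` is a hypothetical helper, rows are plain Hashes, and skipping nil before averaging is an assumption. The data is the first five observations from the starwars preview above.

```ruby
# Group rows by group_key and compute the mean of each target key.
# Result keys follow the "function(key)" naming style, e.g. :"mean(mass)".
def group_mean(rows, group_key, target_keys)
  rows.group_by { |row| row[group_key] }.map do |group_value, members|
    means = target_keys.to_h do |key|
      values = members.map { |row| row[key] }.compact # nil is skipped (assumption)
      mean = values.empty? ? nil : values.sum.to_f / values.size
      [:"mean(#{key})", mean]
    end
    means.merge(group_key => group_value)
  end
end

rows = [
  { species: 'Human', mass: 77.0,  height: 172 },
  { species: 'Droid', mass: 75.0,  height: 167 },
  { species: 'Droid', mass: 32.0,  height: 96 },
  { species: 'Human', mass: 136.0, height: 202 },
  { species: 'Human', mass: 49.0,  height: 150 }
]

grouped = group_mean(rows, :species, %i[mass height])
grouped.each { |row| p row }
```

Each resulting Hash holds the aggregated values first and the group key last, in the same order as the columns of the `group` output shown above.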
 
 ## Combining DataFrames
 
 - [ ]  obs