DataFrame.md in red_amber-0.2.3

- old
+ new

@@ -3,11 +3,12 @@
 Class `RedAmber::DataFrame` represents 2D data. A `DataFrame` consists of:
 - A collection of data which have the same data type. We call it a `Vector`.
 - A label attached to a `Vector`. We call it a `key`.
 - A `Vector` and its associated `key`, grouped as a `variable`.
 - `variable`s with the same vector length, aligned and arranged to be a `DataFrame`.
-- Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `observation`.
+  - Each `key` in a `DataFrame` must be unique.
+- The `Vector`s in a `DataFrame` hold a set of related data at the same position. We call it a `record` or an `observation`.
 
 ![dataframe model image](doc/../image/dataframe_model.png)
 
 ## Constructors and saving
 
@@ -92,17 +93,17 @@
 
 ## Properties
 
 ### `table`, `to_arrow`
 
-- Reader of Arrow::Table object inside.
+- Returns the Arrow::Table object in the DataFrame.
 
-### `size`, `n_obs`, `n_rows`
+### `size`, `n_records`, `n_obs`, `n_rows`
   
-- Returns size of Vector (num of observations).
- 
-### `n_keys`, `n_vars`, `n_cols`,
+- Returns size of Vector (num of records).
+
+### `n_keys`, `n_variables`, `n_vars`, `n_cols`
   
 - Returns num of keys (num of variables).
  
 ### `shape`
  
@@ -136,21 +137,12 @@
 
 ### `keys`, `var_names`, `column_names`
   
 - Returns key names in an Array.
 
-  When we use it with vectors, Vector#key is useful to get the key inside of DataFrame.
+  Each key must be unique in the DataFrame.
 
-  ```ruby
-    # update numeric variables, another solution
-    df.assign do
-      vectors.each_with_object({}) do |vector, assigner|
-        assigner[vector.key] = vector * -1 if vector.numeric?
-      end
-    end
-  ```
-
 ### `types`
   
 - Returns types of vectors in an Array of Symbols.
 
 ### `type_classes`
@@ -159,29 +151,44 @@
 
 ### `vectors`
 
 - Returns an Array of Vectors.
 
+  When iterating over the vectors, `Vector#key` is useful to get each vector's key in the DataFrame.
+
+  ```ruby
+    # update numeric variables, another solution
+    df.assign do
+      vectors.each_with_object({}) do |vector, assigner|
+        assigner[vector.key] = vector * -1 if vector.numeric?
+      end
+    end
+  ```
+
 ### `indices`, `indexes`
 
-- Returns indexes in an Array.
+- Returns indexes in a Vector.
   Accepts an optional argument `start` as the first index.
 
   ```ruby
   df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
   df.indices
 
   # =>
+  #<RedAmber::Vector(:uint8, size=5):0x0000000000013ed4>
   [0, 1, 2, 3, 4]
 
   df.indices(1)
 
   # =>
+  #<RedAmber::Vector(:uint8, size=5):0x0000000000018fd8>
   [1, 2, 3, 4, 5]
 
   df.indices(:a)
+
   # =>
+  #<RedAmber::Vector(:dictionary, size=5):0x000000000001bd50>
   [:a, :b, :c, :d, :e]
   ```
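The values shown above can be reproduced in plain Ruby by starting from `start` and calling `succ` repeatedly. This is only a sketch of the observable behaviour, not RedAmber's implementation: `indices_sketch` is a hypothetical helper, and it returns an Array rather than a Vector.

```ruby
# Hypothetical helper (not part of RedAmber): builds `size` successive values
# beginning at `start`, using `succ` to step from one value to the next.
def indices_sketch(size, start = 0)
  values = []
  current = start
  size.times do
    values << current
    current = current.succ
  end
  values
end

p indices_sketch(5)      # [0, 1, 2, 3, 4]
p indices_sketch(5, 1)   # [1, 2, 3, 4, 5]
p indices_sketch(5, :a)  # [:a, :b, :c, :d, :e]
```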
 
 ### `to_h`
 
@@ -273,10 +280,11 @@
   require 'red_amber'
   require 'datasets-arrow'
 
   dataset = Datasets::Penguins.new
   # (From 0.2.2) `DataFrame.new` accepts an object which responds to `to_arrow`.
+  # In older versions, pass `dataset.to_arrow` in the parentheses instead.
   RedAmber::DataFrame.new(dataset).tdr
 
   # =>
   RedAmber::DataFrame : 344 x 8 Vectors
   Vectors : 5 numeric, 3 strings
@@ -288,30 +296,31 @@
   4 :flipper_length_mm uint8     56 [181, 186, 195, nil, 193, ... ], 2 nils
   5 :body_mass_g       uint16    95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
   6 :sex               string     3 {"male"=>168, "female"=>165, nil=>11}
   7 :year              uint16     3 {2007=>110, 2008=>114, 2009=>120}
   ```
-
+  
+  Options:
   - limit: limit of variables to show. Default value is 10.
-  - tally: max level to use tally mode.
-  - elements: max num of element to show values in each observations.
+  - tally: max level to use tally mode. Default value is 5.
+  - elements: max number of elements to show as values in each record. Default value is 5.
 
 ## Selecting
 
 ### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
 - Key in a Symbol: `df[:symbol]`
 - Key in a String: `df["string"]`
 - Keys in an Array: `df[:symbol1, "string", :symbol2]`
 - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
 
-  Key indeces can be used via `keys[i]` because numbers are used to select observations (rows).
+  Key indices should be used via `keys[i]` because numbers are used to select records (rows). See the next section.
 
 - Keys by a Range:
 
-  If keys are able to represent by Range, it can be included in the arguments. See a example below.
+  If the keys can be represented by a Range, the Range can be included in the arguments. See the example below.
 
-- You can exchange the order of variables (columns).
+- You can also exchange the order of variables (columns).
  
   ```ruby
   hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
   df = RedAmber::DataFrame.new(hash)
   df[:b..:c, "a"]
@@ -323,42 +332,44 @@
   0 A             1.0       1
   1 B             2.0       2
   2 C             3.0       3
   ```
 
-  If `#[]` represents single variable (column), it returns a Vector object.
+  If `#[]` represents a single variable (column), it returns a Vector object.
 
   ```ruby
   df[:a]
 
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
   [1, 2, 3]
   ```
+
   Or `#v` method also returns a Vector for a key.
 
   ```ruby
   df.v(:a)
 
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
   [1, 2, 3]
   ```
 
-  This may be useful to use in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`
+  This method may be useful in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`.
 
-### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
+### Select records (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
 
-- Select a obs. by index: `df[0]`
-- Select obs. by indeces in a Range: `df[1..2]`
+- Select a record by index: `df[0]`
 
-  An end-less or a begin-less Range can be used to represent indeces.
+- Select records by indices in an Array: `df[1, 2]`
 
-- Select obs. by indeces in an Array: `df[1, 2]`
+- Select records by indices in a Range: `df[1..2]`
 
-- You can use float indices.
+  An endless or a beginless Range can be used to represent indices.
 
+- You can use indices in Float.
+
 - Mixed case: `df[2, 0..]`
 
   ```ruby
   hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
   df = RedAmber::DataFrame.new(hash)
@@ -372,13 +383,13 @@
   1       1 A             1.0
   2       2 B             2.0
   3       3 C             3.0
   ```
 
-- Select obs. by a boolean Array or a boolean RedAmber::Vector at same size as self.
+- Select records by a boolean Array or a boolean RedAmber::Vector of the same size as self.
 
-  It returns a sub dataframe with observations at boolean is true.
+  It returns a sub dataframe containing the records where the boolean is true.
 
     ```ruby
     # with the same dataframe `df` above
     df[true, false, nil] # or
     df[[true, false, nil]] # or
@@ -389,19 +400,19 @@
             a b               c
       <uint8> <string> <double>
     1       1 A             1.0
     ```
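As a rough plain-Ruby illustration of the boolean selection above (not RedAmber's implementation), each record can be paired with its boolean and kept only when that boolean is `true`. The names `records` and `booleans` are hypothetical, and treating `nil` like `false` is inferred from the example output.

```ruby
# Records of the dataframe `df` above, written as plain Arrays (an assumption for this sketch).
records  = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
booleans = [true, false, nil]

# The booleans must line up with the records one to one.
raise ArgumentError, 'booleans must be the same size as records' unless booleans.size == records.size

# Keep a record only when its boolean is exactly true; false and nil both drop it.
selected = records.zip(booleans)
                  .select { |_record, flag| flag == true }
                  .map(&:first)

p selected  # [[1, "A", 1.0]]
```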
 
-### Select rows from top or from bottom
+### Select records (rows) from top or from bottom
 
   `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
 
 ## Sub DataFrame manipulations
 
-### `pick  ` - pick up variables by key label -
+### `pick  ` - pick up variables -
 
-  Pick up some columns (variables) to create a sub DataFrame.
+  Pick up some variables (columns) to create a sub DataFrame.
 
   ![pick method image](doc/../image/dataframe/pick.png)
 
 - Keys as arguments
 
@@ -489,13 +500,13 @@
     341           50.4          15.7               222
     342           45.2          14.8               212
     343           49.9          16.1               213
     ```
 
-### `drop  ` - pick and drop -
+### `drop  ` - counterpart of pick -
 
-  Drop some columns (variables) to create a remainer DataFrame.
+  Drop some variables (columns) to create a remainder DataFrame.
 
   ![drop method image](doc/../image/dataframe/drop.png)
 
 - Keys as arguments
 
@@ -555,24 +566,24 @@
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
   [1, 2, 3]
   ```
 
-### `slice  `  - to cut vertically is slice -
+### `slice  `  - slice and select records -
 
-  Slice and select rows (observations) to create a sub DataFrame.
+  Slice and select records (rows) to create a sub DataFrame.
 
   ![slice method image](doc/../image/dataframe/slice.png)
 
 - Indices as arguments
 
     `slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
 
     Negative index from the tail like Ruby's Array is also acceptable.
 
     ```ruby
-    # returns 5 obs. at start and 5 obs. from end
+    # returns 5 records at start and 5 records from end
     penguins.slice(0...5, -5..-1)
 
     # =>
     #<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
       species  island    bill_length_mm bill_depth_mm flipper_length_mm ...     year
@@ -663,22 +674,22 @@
     #<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
 	    a	b	         c
     0	1	A	  1.000000
     ``` 
 
-### `remove`
+### `remove` - counterpart of slice -
 
-  Slice and reject rows (observations) to create a remainer DataFrame.
+  Slice and reject records (rows) to create a remainder DataFrame.
 
   ![remove method image](doc/../image/dataframe/remove.png)
 
 - Indices as arguments
 
     `remove(indices)` accepts indices as arguments. Indices should be Integers or Ranges of Integers.
 
     ```ruby
-    # returns 6th to 339th obs.
+    # returns 6th to 339th records
     penguins.remove(0...5, -5..-1)
 
     # =>
     #<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4>
         species  island    bill_length_mm bill_depth_mm flipper_length_mm ...     year
@@ -697,11 +708,11 @@
 - Booleans as an argument
 
   `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
 
     ```ruby
-    # remove all observation contains nil
+    # remove all records containing nil
     removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
     removed
 
     # =>
     #<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac>
@@ -783,11 +794,11 @@
     1   (nil) C             3.0
     ```
 
 ### `rename`
 
-  Rename keys (column names) to create a updated DataFrame.
+  Rename keys (variable/column names) to create an updated DataFrame.
 
   ![rename method image](doc/../image/dataframe/rename.png)
 
 - Key pairs as arguments
 
@@ -818,11 +829,11 @@
 
   Symbol key and String key are distinguished.
 
 ### `assign`
 
-  Assign new or updated columns (variables) and create a updated DataFrame.
+  Assign new or updated variables (columns) and create an updated DataFrame.
 
   - Variables with new keys will append new columns from the right.
   - Variables with existing keys will update corresponding vectors.
 
     ![assign method image](doc/../image/dataframe/assign.png)
@@ -1007,11 +1018,11 @@
 
 ## Updating
 
 ### `sort`
 
-  `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature。
+  `sort` accepts parameters as sort_keys thanks to a Red Arrow feature.
     - :key, "key" or "+key" denotes ascending order
     - "-key" denotes descending order
 
   ```ruby
   df = RedAmber::DataFrame.new(
@@ -1038,11 +1049,11 @@
 
 ## Treat na data
 
 ### `remove_nil`
 
-  Remove any observations containing nil.
+  Remove any records containing nil.
 
 ## Grouping
 
 ### `group(group_keys)`
 
@@ -1208,11 +1219,11 @@
   The leftmost column is created by original keys. Key name of the column is
   named by parameter `:name`. If `:name` is not specified, `:NAME` is used for the key.
 
 ### `to_long(*keep_keys)`
 
-  Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
+  Creates a 'long' (possibly tidy) DataFrame from a 'wide' DataFrame.
 
   - Parameter `keep_keys` specifies the key names to keep.
 
   ```ruby
   import_cars.to_long(:Year)
@@ -1255,11 +1266,11 @@
   24     2021 VW                      35215
   ```
 
 ### `to_wide`
 
-  Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
+  Creates a 'wide' (possibly messy) DataFrame from a 'long' DataFrame.
 
   - Option `:name` is the key of the column which will be expanded **to key names**.
     The default value is `:NAME` if it is not specified.
   - Option `:value` is the key of the column which will be expanded **to values**.
     The default value is `:VALUE` if it is not specified.
@@ -1280,12 +1291,280 @@
   4     2021    22535    35905    18211         51722    35215
   ```
 
 ## Combine
 
-- [ ] Combining dataframes
+### `join`
+![dataframe joining image](doc/../image/dataframe/join.png)
 
-- [ ] Join
+  Use the specific `*_join` methods below.
+
+  - `other` is a DataFrame or an Arrow::Table.
+  - `join_keys` are the keys shared by self and other that are used to match records.
+  - If `join_keys` is empty, the common keys in self and other are chosen (natural join).
+  - If self and other share more keys than `join_keys`, the duplicated keys are renamed with `suffix`.
+
+  ```ruby
+  df = DataFrame.new(
+    KEY: %w[A B C],
+    X1: [1, 2, 3]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000012a70>
+    KEY           X1
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+
+  other = DataFrame.new(
+    KEY: %w[A B D],
+    X2: [true, false, nil]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000017034>
+    KEY      X2
+    <string> <boolean>
+  0 A        true
+  1 B        false
+  2 D        (nil)
+  ```
+
+#### Mutating joins
+
+##### `inner_join(other, join_keys = nil, suffix: '.1')`
+
+  Join data, leaving only the matching records.
+
+  ```ruby
+  df.inner_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000001e2bc>     
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  ```
+
+##### `full_join(other, join_keys = nil, suffix: '.1')`
+
+  Join data, leaving all records.
+
+  ```ruby
+  df.full_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 C              3 (nil)
+  3 D          (nil) (nil)
+  ```
+
+##### `left_join(other, join_keys = nil, suffix: '.1')`
+
+  Join matching values to self from other.
+
+  ```ruby
+  df.left_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 C              3 (nil)
+  ```
+
+##### `right_join(other, join_keys = nil, suffix: '.1')`
+
+  Join matching values from self to other.
+
+  ```ruby
+  df.right_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 D          (nil) (nil)
+  ```
+
+#### Filtering joins
+
+##### `semi_join(other, join_keys = nil, suffix: '.1')`
+
+  Return records of self that have a match in other.
+
+  ```ruby
+  df.semi_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000029fcc>
+    KEY           X1
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  ```
+
+##### `anti_join(other, join_keys = nil, suffix: '.1')`
+
+  Return records of self that do not have a match in other.
+
+  ```ruby
+  df.anti_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 1 x 2 Vectors, 0x0000000000029fcc>
+    KEY           X1
+    <string> <uint8>
+  0 C              3
+  ```
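The matching rules of the six joins above can be sketched in plain Ruby on arrays of hashes. This is an illustration of the semantics only, not how RedAmber or Red Arrow implements joins. It assumes a single join key `:KEY` whose values are unique on each side, and all variable names are hypothetical.

```ruby
# Same data as `df` and `other` above, one Hash per record.
left  = [{ KEY: 'A', X1: 1 }, { KEY: 'B', X1: 2 }, { KEY: 'C', X1: 3 }]
right = [{ KEY: 'A', X2: true }, { KEY: 'B', X2: false }, { KEY: 'D', X2: nil }]

# Lookup tables by join key (assumes key values are unique on each side).
left_by_key  = left.to_h { |rec| [rec[:KEY], rec] }
right_by_key = right.to_h { |rec| [rec[:KEY], rec] }

# inner join: only records whose key appears on both sides
inner = left.select { |rec| right_by_key.key?(rec[:KEY]) }
            .map { |rec| rec.merge(right_by_key[rec[:KEY]]) }

# left join: every record of left, with nil where right has no match
left_joined = left.map do |rec|
  rec.merge(X2: nil).merge(right_by_key.fetch(rec[:KEY], {}))
end

# right join: every record of right, with nil where left has no match
right_joined = right.map do |rec|
  { KEY: rec[:KEY], X1: nil }.merge(left_by_key.fetch(rec[:KEY], {})).merge(rec)
end

# full join: the left join plus the records of right that found no match
full = left_joined + right_joined.reject { |rec| left_by_key.key?(rec[:KEY]) }

# semi join / anti join: filter the records of left, no variables are added
semi = left.select { |rec| right_by_key.key?(rec[:KEY]) }
anti = left.reject { |rec| right_by_key.key?(rec[:KEY]) }

p inner.map { |rec| rec[:KEY] }  # ["A", "B"]
p full.map { |rec| rec[:KEY] }   # ["A", "B", "C", "D"]
p anti.map { |rec| rec[:KEY] }   # ["C"]
```

The mutating joins add variables from the other side, while the filtering joins only decide which records of self survive.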
+
+## Set operations
+![dataframe set and binding image](doc/../image/dataframe/set_and_bind.png)
+
+  Keys in self and other must be the same in set operations.
+
+  ```ruby
+  df = DataFrame.new(
+    KEY1: %w[A B C],
+    KEY2: [1, 2, 3]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000012a70>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+
+  other = DataFrame.new(
+    KEY1: %w[A B D],
+    KEY2: [1, 4, 5]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000017034>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              4
+  2 D              5
+  ```
+
+##### `intersect(other)`
+
+  Select records appearing in both self and other.
+
+  ```ruby
+  df.intersect(other)
+  #=>
+  #<RedAmber::DataFrame : 1 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  ```
+
+##### `union(other)`
+
+  Select records appearing in self or other.
+
+  ```ruby
+  df.union(other)
+  #=>
+  #<RedAmber::DataFrame : 5 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+  3 B              4
+  4 D              5
+  ```
+
+##### `difference(other)`
+
+  Select records appearing in self but not in other.
+
+  It has an alias `setdiff`.
+
+  ```ruby
+  df.difference(other)
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  0 B              2
+  1 C              3
+  ```
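For comparison, Ruby's own Array operators `&`, `|` and `-` give the same record sets when each record is written as an Array. This is a sketch of the semantics with hypothetical variable names, not RedAmber's implementation.

```ruby
# Each record is written as [KEY1, KEY2]; the data is taken from the example above.
df_records    = [['A', 1], ['B', 2], ['C', 3]]
other_records = [['A', 1], ['B', 4], ['D', 5]]

intersect  = df_records & other_records  # records in both
union      = df_records | other_records  # records in either, without duplicates
difference = df_records - other_records  # records in df only

p intersect   # [["A", 1]]
p union       # [["A", 1], ["B", 2], ["C", 3], ["B", 4], ["D", 5]]
p difference  # [["B", 2], ["C", 3]]
```

A record counts as a match only when every value in it is equal, which is why `["B", 2]` and `["B", 4]` are treated as different records.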
+
+## Binding
+
+### `concatenate(other)`
+
+  Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self.
+
+  The alias is `concat`.
+
+  An array of DataFrames or Tables is also acceptable as other.
+
+  ```ruby
+  df
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000022cb8>
+          x y
+    <uint8> <string>
+  0       1 A
+  1       2 B
+  
+  other
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001f6d0>
+          x y
+    <uint8> <string>
+  0       3 C
+  1       4 D
+
+  df.concatenate(other)
+  #=>
+  #<RedAmber::DataFrame : 4 x 2 Vectors, 0x0000000000022574>
+          x y
+    <uint8> <string>
+  0       1 A
+  1       2 B
+  2       3 C
+  3       4 D
+  ```
+
+### `merge(other)`
+
+  Merge another DataFrame or Table onto the right side of self, adding its variables as new columns. The number of records of other must be the same as self.
+
+  ```ruby
+  df
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000009150>
+          x       y
+    <uint8> <uint8>
+  0       1       3
+  1       2       4
+
+  other
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000008a0c>
+    a        b
+    <string> <string>
+  0 A        C
+  1 B        D
+
+  df.merge(other)
+  #=>
+  #<RedAmber::DataFrame : 2 x 4 Vectors, 0x000000000000cb70>
+          x       y a        b
+    <uint8> <uint8> <string> <string>
+  0       1       3 A        C
+  1       2       4 B        D
+  ```
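Judging from the example outputs above, `concatenate` stacks records below self while `merge` places variables side by side. A plain-Ruby sketch of that difference follows, modelling a DataFrame as a Hash of key to Array. The model, the variable names and the overlap check are assumptions made for illustration, not RedAmber's implementation.

```ruby
# concatenate-like: both sides have the same keys, and the records of
# `other` are appended below those of `df`.
df    = { x: [1, 2], y: %w[A B] }
other = { x: [3, 4], y: %w[C D] }
concatenated = df.to_h { |key, values| [key, values + other.fetch(key)] }
# concatenated now holds x: [1, 2, 3, 4] and y: ["A", "B", "C", "D"]

# merge-like: both sides have the same number of records, and the
# variables of `wide_right` are added to the right of `wide_left`.
wide_left  = { x: [1, 2], y: [3, 4] }
wide_right = { a: %w[A B], b: %w[C D] }

# Assumption: keys must stay unique, so overlapping keys are rejected here.
overlap = wide_left.keys & wide_right.keys
raise ArgumentError, "duplicated keys: #{overlap}" unless overlap.empty?

merged = wide_left.merge(wide_right)

p concatenated.keys  # [:x, :y]
p merged.keys        # [:x, :y, :a, :b]
```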
 
 ## Encoding
 
 - [ ] One-hot encoding