doc/DataFrame.md in red_amber-0.2.2 vs doc/DataFrame.md in red_amber-0.2.3
- old
+ new
@@ -3,11 +3,12 @@
Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists of:
- A collection of data which have same data type within. We call it `Vector`.
- A label is attached to `Vector`. We call it `key`.
- A `Vector` and associated `key` is grouped as a `variable`.
- `variable`s with same vector length are aligned and arranged to be a `DataFrame`.
-- Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `observation`.
+ - Each `key` in a `DataFrame` must be unique.
+- The values at the same position in each `Vector` of a `DataFrame` form a set of related data. We call it a `record` or `observation`.
![dataframe model image](doc/../image/dataframe_model.png)
## Constructors and saving
@@ -92,17 +93,17 @@
## Properties
### `table`, `to_arrow`
-- Reader of Arrow::Table object inside.
+- Returns the Arrow::Table object in the DataFrame.
-### `size`, `n_obs`, `n_rows`
+### `size`, `n_records`, `n_obs`, `n_rows`
-- Returns size of Vector (num of observations).
-
-### `n_keys`, `n_vars`, `n_cols`,
+- Returns the size of each Vector (the number of records).
+
+### `n_keys`, `n_variables`, `n_vars`, `n_cols`
- Returns num of keys (num of variables).
### `shape`
@@ -136,21 +137,12 @@
### `keys`, `var_names`, `column_names`
- Returns key names in an Array.
- When we use it with vectors, Vector#key is useful to get the key inside of DataFrame.
+ Each key must be unique in the DataFrame.
- ```ruby
- # update numeric variables, another solution
- df.assign do
- vectors.each_with_object({}) do |vector, assigner|
- assigner[vector.key] = vector * -1 if vector.numeric?
- end
- end
- ```
-
### `types`
- Returns types of vectors in an Array of Symbols.
### `type_classes`
@@ -159,29 +151,44 @@
### `vectors`
- Returns an Array of Vectors.
+ When iterating over the vectors, `Vector#key` is useful to get each vector's key in the DataFrame.
+
+ ```ruby
+ # update numeric variables, another solution
+ df.assign do
+ vectors.each_with_object({}) do |vector, assigner|
+ assigner[vector.key] = vector * -1 if vector.numeric?
+ end
+ end
+ ```
+
### `indices`, `indexes`
-- Returns indexes in an Array.
+- Returns indexes in a Vector.
Accepts an option `start` as the first of indexes.
```ruby
df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
df.indices
# =>
+ #<RedAmber::Vector(:uint8, size=5):0x0000000000013ed4>
[0, 1, 2, 3, 4]
df.indices(1)
# =>
+ #<RedAmber::Vector(:uint8, size=5):0x0000000000018fd8>
[1, 2, 3, 4, 5]
df.indices(:a)
+
# =>
+ #<RedAmber::Vector(:dictionary, size=5):0x000000000001bd50>
[:a, :b, :c, :d, :e]
```
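The `start` option can be pictured in plain Ruby. This is only a sketch that assumes `indices(start)` amounts to taking `size` successive values from `start`, which is what the outputs above suggest; `sketch_indices` is a hypothetical helper, not part of RedAmber, and it returns an Array rather than a Vector.

```ruby
# Plain-Ruby sketch (no red_amber required).
# `sketch_indices` is a hypothetical helper used only for illustration.
def sketch_indices(size, start = 0)
  values = []
  current = start
  size.times do
    values << current
    current = current.succ # Integer#succ and Symbol#succ both exist in core Ruby
  end
  values
end

p sketch_indices(5)     # => [0, 1, 2, 3, 4]
p sketch_indices(5, 1)  # => [1, 2, 3, 4, 5]
p sketch_indices(5, :a) # => [:a, :b, :c, :d, :e]
```

Any object that responds to `succ` would work as `start` in this sketch; whether RedAmber accepts the same range of objects is not stated in this document.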
### `to_h`
@@ -273,10 +280,11 @@
require 'red_amber'
require 'datasets-arrow'
dataset = Datasets::Penguins.new
# (From 0.2.2) responsible to the object which has `to_arrow` method.
+ # With older versions, pass `dataset.to_arrow` in the parentheses instead.
RedAmber::DataFrame.new(dataset).tdr
# =>
RedAmber::DataFrame : 344 x 8 Vectors
Vectors : 5 numeric, 3 strings
@@ -288,30 +296,31 @@
4 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
5 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
6 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
7 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
```
-
+
+ Options:
- limit: limit of variables to show. Default value is 10.
- - tally: max level to use tally mode.
- - elements: max num of element to show values in each observations.
+ - tally: max level to use tally mode. Default value is 5.
+ - elements: max num of elements to show as values in each record. Default value is 5.
## Selecting
### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
- Key in a Symbol: `df[:symbol]`
- Key in a String: `df["string"]`
- Keys in an Array: `df[:symbol1, "string", :symbol2]`
- Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
- Key indeces can be used via `keys[i]` because numbers are used to select observations (rows).
+ Key indices must be passed via `keys[i]` because bare numbers are used to select records (rows). See the next section.
- Keys by a Range:
- If keys are able to represent by Range, it can be included in the arguments. See a example below.
+ If the keys can be represented by a Range, the Range can be included in the arguments. See the example below.
-- You can exchange the order of variables (columns).
+- You can also exchange the order of variables (columns).
```ruby
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
df = RedAmber::DataFrame.new(hash)
df[:b..:c, "a"]
@@ -323,42 +332,44 @@
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
```
- If `#[]` represents single variable (column), it returns a Vector object.
+ If `#[]` represents a single variable (column), it returns a Vector object.
```ruby
df[:a]
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
[1, 2, 3]
```
+
Or `#v` method also returns a Vector for a key.
```ruby
df.v(:a)
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
[1, 2, 3]
```
- This may be useful to use in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`
+ This method is useful inside a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`.
-### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
+### Select records (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
-- Select a obs. by index: `df[0]`
-- Select obs. by indeces in a Range: `df[1..2]`
+- Select a record by index: `df[0]`
- An end-less or a begin-less Range can be used to represent indeces.
+- Select records by indices in an Array: `df[1, 2]`
-- Select obs. by indeces in an Array: `df[1, 2]`
+- Select records by indices in a Range: `df[1..2]`
-- You can use float indices.
+ An endless or a beginless Range can be used to represent indices.
+- Indices can also be given as Floats.
+
- Mixed case: `df[2, 0..]`
```ruby
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
df = RedAmber::DataFrame.new(hash)
@@ -372,13 +383,13 @@
1 1 A 1.0
2 2 B 2.0
3 3 C 3.0
```
-- Select obs. by a boolean Array or a boolean RedAmber::Vector at same size as self.
+- Select records by a boolean Array or a boolean RedAmber::Vector of the same size as self.
- It returns a sub dataframe with observations at boolean is true.
+ It returns a sub DataFrame containing the records where the boolean is true.
```ruby
# with the same dataframe `df` above
df[true, false, nil] # or
df[[true, false, nil]] # or
@@ -389,19 +400,19 @@
a b c
<uint8> <string> <double>
1 1 A 1.0
```
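The boolean selection above can be mimicked in plain Ruby. This is a sketch, not RedAmber's implementation: it assumes `nil` in the mask is treated like `false`, as the single-record output above suggests, and it models records as plain Arrays.

```ruby
# Plain-Ruby sketch of boolean record selection (no red_amber required).
# Each inner Array stands for one record of the `df` used above.
rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
mask = [true, false, nil]

# The mask must have one entry per record.
raise ArgumentError, 'mask size must equal number of records' unless mask.size == rows.size

# Keep a record only when its mask entry is truthy; nil and false both drop it.
selected = rows.zip(mask).filter_map { |row, keep| row if keep }

p selected # => [[1, "A", 1.0]]
```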
-### Select rows from top or from bottom
+### Select records (rows) from top or from bottom
`head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
## Sub DataFrame manipulations
-### `pick ` - pick up variables by key label -
+### `pick ` - pick up variables -
- Pick up some columns (variables) to create a sub DataFrame.
+ Pick up some variables (columns) to create a sub DataFrame.
![pick method image](doc/../image/dataframe/pick.png)
- Keys as arguments
@@ -489,13 +500,13 @@
341 50.4 15.7 222
342 45.2 14.8 212
343 49.9 16.1 213
```
-### `drop ` - pick and drop -
+### `drop ` - counterpart of pick -
- Drop some columns (variables) to create a remainer DataFrame.
+ Drop some variables (columns) to create a remainder DataFrame.
![drop method image](doc/../image/dataframe/drop.png)
- Keys as arguments
@@ -555,24 +566,24 @@
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
[1, 2, 3]
```
-### `slice ` - to cut vertically is slice -
+### `slice ` - slice and select records -
- Slice and select rows (observations) to create a sub DataFrame.
+ Slice and select records (rows) to create a sub DataFrame.
![slice method image](doc/../image/dataframe/slice.png)
- Indices as arguments
`slice(indeces)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
Negative index from the tail like Ruby's Array is also acceptable.
```ruby
- # returns 5 obs. at start and 5 obs. from end
+ # returns 5 records at start and 5 records from end
penguins.slice(0...5, -5..-1)
# =>
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
@@ -663,22 +674,22 @@
#<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
a b c
0 1 A 1.000000
```
-### `remove`
+### `remove` - counterpart of slice -
- Slice and reject rows (observations) to create a remainer DataFrame.
+ Slice and reject records (rows) to create a remainder DataFrame.
![remove method image](doc/../image/dataframe/remove.png)
- Indices as arguments
`remove(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer.
```ruby
- # returns 6th to 339th obs.
+ # returns 6th to 339th records
penguins.remove(0...5, -5..-1)
# =>
#<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
@@ -697,11 +708,11 @@
- Booleans as an argument
`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
```ruby
- # remove all observation contains nil
+ # remove all records that contain nil
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
removed
# =>
#<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac>
@@ -783,11 +794,11 @@
1 (nil) C 3.0
```
### `rename`
- Rename keys (column names) to create a updated DataFrame.
+ Rename keys (variable/column names) to create an updated DataFrame.
![rename method image](doc/../image/dataframe/rename.png)
- Key pairs as arguments
@@ -818,11 +829,11 @@
Symbol key and String key are distinguished.
### `assign`
- Assign new or updated columns (variables) and create a updated DataFrame.
+ Assign new or updated variables (columns) and create an updated DataFrame.
- Variables with new keys will append new columns from the right.
- Variables with exisiting keys will update corresponding vectors.
![assign method image](doc/../image/dataframe/assign.png)
@@ -1007,11 +1018,11 @@
## Updating
### `sort`
- `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature。
+ `sort` accepts parameters as sort_keys thanks to a Red Arrow feature.
- :key, "key" or "+key" denotes ascending order
- "-key" denotes descending order
```ruby
df = RedAmber::DataFrame.new(
@@ -1038,11 +1049,11 @@
## Treat na data
### `remove_nil`
- Remove any observations containing nil.
+ Remove any records containing nil.
## Grouping
### `group(group_keys)`
@@ -1208,11 +1219,11 @@
The leftmost column is created by original keys. Key name of the column is
named by parameter `:name`. If `:name` is not specified, `:NAME` is used for the key.
### `to_long(*keep_keys)`
- Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
+ Creates a 'long' (possibly tidy) DataFrame from a 'wide' DataFrame.
- Parameter `keep_keys` specifies the key names to keep.
```ruby
import_cars.to_long(:Year)
@@ -1255,11 +1266,11 @@
24 2021 VW 35215
```
### `to_wide`
- Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
+ Creates a 'wide' (possibly messy) DataFrame from a 'long' DataFrame.
- Option `:name` is the key of the column which will be expanded **to key names**.
The default value is `:NAME` if it is not specified.
- Option `:value` is the key of the column which will be expanded **to values**.
The default value is `:VALUE` if it is not specified.
@@ -1280,12 +1291,280 @@
4 2021 22535 35905 18211 51722 35215
```
## Combine
-- [ ] Combining dataframes
+### `join`
+![dataframe joining image](doc/../image/dataframe/join.png)
-- [ ] Join
+ Use one of the specific `*_join` methods below.
+
+ - `other` is a DataFrame or an Arrow::Table.
+ - `join_keys` are the keys shared by self and other that are used to match records.
+ - If `join_keys` is empty, the keys common to self and other are chosen (natural join).
+ - If there are common keys other than `join_keys`, the duplicated keys are renamed with `suffix`.
+
+ ```ruby
+ df = DataFrame.new(
+ KEY: %w[A B C],
+ X1: [1, 2, 3]
+ )
+ #=>
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000012a70>
+ KEY X1
+ <string> <uint8>
+ 0 A 1
+ 1 B 2
+ 2 C 3
+
+ other = DataFrame.new(
+ KEY: %w[A B D],
+ X2: [true, false, nil]
+ )
+ #=>
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000017034>
+ KEY X2
+ <string> <boolean>
+ 0 A true
+ 1 B false
+ 2 D (nil)
+ ```
+
+#### Mutating joins
+
+##### `inner_join(other, join_keys = nil, suffix: '.1')`
+
+ Join data, leaving only the matching records.
+
+ ```ruby
+ df.inner_join(other, :KEY)
+ #=>
+ #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000001e2bc>
+ KEY X1 X2
+ <string> <uint8> <boolean>
+ 0 A 1 true
+ 1 B 2 false
+ ```
+
+##### `full_join(other, join_keys = nil, suffix: '.1')`
+
+ Join data, leaving all records.
+
+ ```ruby
+ df.full_join(other, :KEY)
+ #=>
+ #<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000029fcc>
+ KEY X1 X2
+ <string> <uint8> <boolean>
+ 0 A 1 true
+ 1 B 2 false
+ 2 C 3 (nil)
+ 3 D (nil) (nil)
+ ```
+
+##### `left_join(other, join_keys = nil, suffix: '.1')`
+
+ Join matching values to self from other.
+
+ ```ruby
+ df.left_join(other, :KEY)
+ #=>
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000029fcc>
+ KEY X1 X2
+ <string> <uint8> <boolean>
+ 0 A 1 true
+ 1 B 2 false
+ 2 C 3 (nil)
+ ```
+
+##### `right_join(other, join_keys = nil, suffix: '.1')`
+
+ Join matching values from self to other.
+
+ ```ruby
+ df.right_join(other, :KEY)
+ #=>
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000029fcc>
+ KEY X1 X2
+ <string> <uint8> <boolean>
+ 0 A 1 true
+ 1 B 2 false
+ 2 D (nil) (nil)
+ ```
+
+#### Filtering joins
+
+##### `semi_join(other, join_keys = nil, suffix: '.1')`
+
+ Return records of self that have a match in other.
+
+ ```ruby
+ df.semi_join(other, :KEY)
+ #=>
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000029fcc>
+ KEY X1
+ <string> <uint8>
+ 0 A 1
+ 1 B 2
+ ```
+
+##### `anti_join(other, join_keys = nil, suffix: '.1')`
+
+ Return records of self that do not have a match in other.
+
+ ```ruby
+ df.anti_join(other, :KEY)
+ #=>
+ #<RedAmber::DataFrame : 1 x 2 Vectors, 0x0000000000029fcc>
+ KEY X1
+ <string> <uint8>
+ 0 C 3
+ ```
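The difference between the join flavors can be sketched in plain Ruby with records modeled as Hashes. This illustrates the semantics only and is not how RedAmber or Arrow implement joins; the variable names are made up for the sketch, and it assumes the key values in `right` are unique.

```ruby
# Plain-Ruby sketch of join semantics (no red_amber required).
left  = [{ KEY: 'A', X1: 1 }, { KEY: 'B', X1: 2 }, { KEY: 'C', X1: 3 }]
right = [{ KEY: 'A', X2: true }, { KEY: 'B', X2: false }, { KEY: 'D', X2: nil }]

# Index the right-hand records by join key (assumes unique keys in `right`).
right_by_key = right.to_h { |row| [row[:KEY], row] }

# Mutating joins add columns.
# inner: keep only the records whose key appears on both sides.
inner = left.filter_map do |row|
  match = right_by_key[row[:KEY]]
  row.merge(match) if match
end

# left: keep every record of `left`; unmatched records get nil for X2.
left_joined = left.map do |row|
  row.merge(X2: nil).merge(right_by_key.fetch(row[:KEY], {}))
end

# Filtering joins never add columns; they only keep or drop records of `left`.
semi = left.select { |row| right_by_key.key?(row[:KEY]) }
anti = left.reject { |row| right_by_key.key?(row[:KEY]) }

p inner.map { |row| row[:KEY] }       # => ["A", "B"]
p left_joined.map { |row| row[:X2] }  # => [true, false, nil]
p semi.map { |row| row[:KEY] }        # => ["A", "B"]
p anti.map { |row| row[:KEY] }        # => ["C"]
```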
+
+## Set operations
+![dataframe set and binding image](doc/../image/dataframe/set_and_bind.png)
+
+ Keys in self and other must be the same in set operations.
+
+ ```ruby
+ df = DataFrame.new(
+ KEY1: %w[A B C],
+ KEY2: [1, 2, 3]
+ )
+ #=>
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000012a70>
+ KEY1 KEY2
+ <string> <uint8>
+ 0 A 1
+ 1 B 2
+ 2 C 3
+
+ other = DataFrame.new(
+ KEY1: %w[A B D],
+ KEY2: [1, 4, 5]
+ )
+ #=>
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000017034>
+ KEY1 KEY2
+ <string> <uint8>
+ 0 A 1
+ 1 B 4
+ 2 D 5
+ ```
+
+##### `intersect(other)`
+
+ Select records appearing in both self and other.
+
+ ```ruby
+ df.intersect(other)
+ #=>
+ #<RedAmber::DataFrame : 1 x 2 Vectors, 0x0000000000029fcc>
+ KEY1 KEY2
+ <string> <uint8>
+ 0 A 1
+ ```
+
+##### `union(other)`
+
+ Select records appearing in self or other.
+
+ ```ruby
+ df.union(other)
+ #=>
+ #<RedAmber::DataFrame : 5 x 2 Vectors, 0x0000000000029fcc>
+ KEY1 KEY2
+ <string> <uint8>
+ 0 A 1
+ 1 B 2
+ 2 C 3
+ 3 B 4
+ 4 D 5
+ ```
+
+##### `difference(other)`
+
+ Select records appearing in self but not in other.
+
+ It has an alias `setdiff`.
+
+ ```ruby
+ df.difference(other)
+ #=>
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000029fcc>
+ KEY1 KEY2
+ <string> <uint8>
+ 0 B 2
+ 1 C 3
+ ```
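The three set operations correspond to Ruby's own Array operators when each record is modeled as an Array. The following plain-Ruby sketch uses the same data as `df` and `other` above; it illustrates the semantics and makes no claim about how RedAmber computes the results.

```ruby
# Plain-Ruby sketch of record-wise set operations (no red_amber required).
# Each inner Array stands for one record [KEY1, KEY2].
self_rows  = [['A', 1], ['B', 2], ['C', 3]]
other_rows = [['A', 1], ['B', 4], ['D', 5]]

intersect  = self_rows & other_rows # records present in both
union      = self_rows | other_rows # records present in either, duplicates removed
difference = self_rows - other_rows # records present only in self

p intersect  # => [["A", 1]]
p union      # => [["A", 1], ["B", 2], ["C", 3], ["B", 4], ["D", 5]]
p difference # => [["B", 2], ["C", 3]]
```

A record counts as shared only when every value matches, which is why `["B", 2]` and `["B", 4]` are treated as different records.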
+
+## Binding
+
+### `concatenate(other)`
+
+ Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self.
+
+ The alias is `concat`.
+
+ An array of DataFrames or Tables is also acceptable as other.
+
+ ```ruby
+ df
+ #=>
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000022cb8>
+ x y
+ <uint8> <string>
+ 0 1 A
+ 1 2 B
+
+ other
+ #=>
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001f6d0>
+ x y
+ <uint8> <string>
+ 0 3 C
+ 1 4 D
+
+ df.concatenate(other)
+ #=>
+ #<RedAmber::DataFrame : 4 x 2 Vectors, 0x0000000000022574>
+ x y
+ <uint8> <string>
+ 0 1 A
+ 1 2 B
+ 2 3 C
+ 3 4 D
+ ```
+
+### `merge(other)`
+
+ Merge another DataFrame or Table onto the right side of self, so that the variables of other are appended as new columns. The number of records in other must be the same as in self.
+
+ ```ruby
+ df
+ #=>
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000009150>
+ x y
+ <uint8> <uint8>
+ 0 1 3
+ 1 2 4
+
+ other
+ #=>
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000008a0c>
+ a b
+ <string> <string>
+ 0 A C
+ 1 B D
+
+ df.merge(other)
+ #=>
+ #<RedAmber::DataFrame : 2 x 4 Vectors, 0x000000000000cb70>
+ x y a b
+ <uint8> <uint8> <string> <string>
+ 0 1 3 A C
+ 1 2 4 B D
+ ```
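The two binding directions can be contrasted in plain Ruby by modeling a DataFrame as a Hash of columns. This is a sketch under that assumption, not RedAmber code. The duplicate-key check in the column-wise case is also an assumption, motivated by the rule that each key in a DataFrame must be unique.

```ruby
# Plain-Ruby sketch of row-wise vs column-wise binding (no red_amber required).
# A "frame" here is a Hash of key => Array of values.

# Row-wise (like `concatenate`): same keys, values of `other` go below.
df_cols    = { x: [1, 2], y: %w[A B] }
other_cols = { x: [3, 4], y: %w[C D] }
concatenated = df_cols.to_h { |key, values| [key, values + other_cols.fetch(key)] }

# Column-wise (like `merge`): same number of records, keys of `other` go to the right.
left_cols  = { x: [1, 2], y: [3, 4] }
right_cols = { a: %w[A B], b: %w[C D] }
shared = left_cols.keys & right_cols.keys
raise ArgumentError, "shared keys: #{shared}" unless shared.empty? # assumed rule
merged = left_cols.merge(right_cols)

p concatenated[:x] # => [1, 2, 3, 4]
p concatenated[:y] # => ["A", "B", "C", "D"]
p merged.keys      # => [:x, :y, :a, :b]
```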
## Encoding
- [ ] One-hot encoding