doc/DataFrame.md in red_amber-0.1.4 vs doc/DataFrame.md in red_amber-0.1.5
- old
+ new
@@ -1,25 +1,25 @@
# DataFrame
-Class `RedAmber::DataFrame` represents 2D-data. `DataFrame` consists with:
+Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists of:
- A collection of data which have same data type within. We call it `Vector`.
- A label is attached to `Vector`. We call it `key`.
- A `Vector` and associated `key` is grouped as a `variable`.
- `variable`s with same vector length are aligned and arranged to be a `DataFrame`.
- Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `observation`.
![dataframe model image](doc/../image/dataframe_model.png)
## Constructors and saving
-### `new` from a columnar Hash
+### `new` from a Hash
```ruby
RedAmber::DataFrame.new(x: [1, 2, 3])
```
-### `new` from a schema (by Hash) and rows (by Array)
+### `new` from a schema (by Hash) and data (by Array)
```ruby
RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
```
@@ -50,11 +50,11 @@
- from a string buffer
- from a URI
```ruby
- uri = URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv")
+ uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
RedAmber::DataFrame.load(uri)
```
- from a Parquet file
@@ -76,11 +76,11 @@
dataframe.save("file.parquet")
```
## Properties
-### `table`
+### `table`, `to_arrow`
- Reader of Arrow::Table object inside.
### `size`, `n_obs`, `n_rows`
@@ -91,20 +91,57 @@
- Returns num of keys (num of variables).
### `shape`
- Returns shape in an Array[n_rows, n_cols].
-
+
+### `variables`
+
+- Returns key names and Vectors pair in a Hash.
+
+ It is convenient to use in a block when both the key and the vector are required. We will write:
+
+ ```ruby
+ # update numeric variables
+ df.assign do
+ variables.select.with_object({}) do |(key, vector), assigner|
+ assigner[key] = vector * -1 if vector.numeric?
+ end
+ end
+ ```
+
+ Instead of:
+ ```ruby
+ df.assign do
+ assigner = {}
+ vectors.each_with_index do |vector, i|
+ assigner[keys[i]] = vector * -1 if vector.numeric?
+ end
+ assigner
+ end
+ ```
+
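The `select.with_object({})` idiom used above can be tried without RedAmber. The following is a minimal sketch in plain Ruby, assuming a Hash of `key => Array` as a stand-in for `variables`; the sample data and the use of `Array#all?(Numeric)` in place of `Vector#numeric?` are assumptions made for illustration only.

```ruby
# Plain Hash standing in for DataFrame#variables (key => values).
# The data here is hypothetical.
variables = { index: [0, 1, 2], float: [0.0, 1.1, 2.2], string: %w[A B C] }

# Hash#select without a block returns an Enumerator; with_object threads
# a memo Hash through the iteration and returns it at the end.
assigner =
  variables.select.with_object({}) do |(key, values), memo|
    # Array#all?(Numeric) plays the role of Vector#numeric? in this sketch.
    memo[key] = values.map { |v| v * -1 } if values.all?(Numeric)
  end

# The more common spelling gives the same result.
same = variables.each_with_object({}) do |(key, values), memo|
  memo[key] = values.map { |v| v * -1 } if values.all?(Numeric)
end
```

Only the numeric entries end up in `assigner`; the `:string` entry is skipped because the block never assigns it.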
### `keys`, `var_names`, `column_names`
- Returns key names in an Array.
+ When we use it with vectors, Vector#key is useful to get the key inside of DataFrame.
+
+ ```ruby
+ # update numeric variables, another solution
+ df.assign do
+ vectors.each_with_object({}) do |vector, assigner|
+ assigner[vector.key] = vector * -1 if vector.numeric?
+ end
+ end
+ ```
+
### `types`
- Returns types of vectors in an Array of Symbols.
-### `data_types`
+### `type_classes`
- Returns types of vector in an Array of `Arrow::DataType`.
### `vectors`
@@ -165,11 +202,11 @@
6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
```
- - limit: limits variable number to show. Default value is 10.
+ - limit: limit of variables to show. Default value is 10.
- tally: max level to use tally mode.
- elements: max num of element to show values in each observations.
### `inspect`
@@ -222,12 +259,21 @@
df[:a]
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
[1, 2, 3]
```
- This may be useful to use in a block of DataFrame manipulations.
+ The `#v` method also returns a Vector for a key.
+ ```ruby
+ df.v(:a)
+ # =>
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
+ [1, 2, 3]
+ ```
+
+ This may be useful in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`.
+
### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
- Select an obs. by index: `df[0]`
- Select obs. by indices in a Range: `df[1..2]`
@@ -265,17 +311,17 @@
1 :a uint8 1 [1]
2 :b string 1 ["A"]
3 :c double 1 [1.0]
```
-### Select rows from top or bottom
+### Select rows from top or from bottom
`head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
## Sub DataFrame manipulations
-### `pick`
+### `pick ` - pick up variables by key label -
Pick up some variables (columns) to create a sub DataFrame.
![pick method image](doc/../image/dataframe/pick.png)
@@ -311,21 +357,22 @@
- Keys or booleans by a block
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
```ruby
+ # It is ok to write `keys ...` in the block, not `penguins.keys ...`
penguins.pick { keys.map { |key| key.end_with?('mm') } }
# =>
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
Vectors : 3 numeric
# key type level data_preview
1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
```
-### `drop`
+### `drop ` - pick and drop -
Drop some variables (columns) to create a remainder DataFrame.
![drop method image](doc/../image/dataframe/drop.png)
@@ -350,29 +397,29 @@
booleans_invert = booleans.map(&:!) # => [false, true, true]
df.pick(booleans) == df.drop(booleans_invert) # => true
```
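The complement relation between `pick` and `drop` with boolean arrays can be checked on a plain Hash. This is only a sketch: the `pick` and `drop` methods below are hypothetical stand-ins operating on a `key => Array` Hash, not RedAmber's implementation.

```ruby
# Hypothetical stand-ins for DataFrame#pick / #drop on a plain Hash.
def pick(columns, mask)
  # Keep the keys whose mask entry is true.
  columns.keys.zip(mask).select { |_key, keep| keep }
         .to_h { |key, _keep| [key, columns[key]] }
end

def drop(columns, mask)
  # Remove the keys whose mask entry is true.
  columns.keys.zip(mask).reject { |_key, remove| remove }
         .to_h { |key, _remove| [key, columns[key]] }
end

columns = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2.0, 3.0] }
booleans = [true, false, false]
booleans_invert = booleans.map(&:!)

picked = pick(columns, booleans)
dropped = drop(columns, booleans_invert)
```

Picking with a mask and dropping with the inverted mask select the same columns, which is the equality the example above relies on.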
- Difference between `pick`/`drop` and `[]`
- If `pick` or `drop` will select single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`.
+ If `pick` or `drop` will select a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`. This behavior may be useful to use in a block of DataFrame manipulations.
```ruby
df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
- df[:a]
- # =>
- #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
- [1, 2, 3]
-
df.pick(:a) # or
df.drop(:b, :c)
# =>
#<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
Vector : 1 numeric
# key type level data_preview
1 :a uint8 3 [1, 2, 3]
+
+ df[:a]
+ # =>
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
+ [1, 2, 3]
```
-### `slice`
+### `slice ` - to cut vertically is slice -
Slice and select observations (rows) to create a sub DataFrame.
![slice method image](doc/../image/dataframe/slice.png)
@@ -486,21 +533,21 @@
```ruby
# remove all observation contains nil
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
removed.tdr
# =>
- RedAmber::DataFrame : 342 x 8 Vectors
+ RedAmber::DataFrame : 333 x 8 Vectors
Vectors : 5 numeric, 3 strings
# key type level data_preview
- 1 :species string 3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
- 2 :island string 3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
- 3 :bill_length_mm double 164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
- 4 :bill_depth_mm double 80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
- 5 :flipper_length_mm int64 55 [181, 186, 195, 193, 190, ... ]
- 6 :body_mass_g int64 94 [3750, 3800, 3250, 3450, 3650, ... ]
- 7 :sex string 3 {"male"=>168, "female"=>165, ""=>9}
- 8 :year int64 3 {2007=>109, 2008=>114, 2009=>119}
+ 1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
+ 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
+ 3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
+ 4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
+ 5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
+ 6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
+ 7 :sex string 2 {"male"=>168, "female"=>165}
+ 8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
```
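The block `vectors.map(&:is_nil).reduce(&:|)` builds one nil mask per vector and combines them element-wise with OR. A rough plain-Ruby equivalent, assuming Arrays in place of Vectors and hypothetical sample values, might look like this:

```ruby
# Plain Arrays standing in for Vectors; the values are hypothetical.
columns = {
  bill_length_mm: [39.1, nil, 40.3],
  sex: ['male', 'female', nil]
}

# One boolean mask per column: true where the value is nil.
nil_masks = columns.values.map { |values| values.map(&:nil?) }

# Element-wise OR across the masks: true where any column has nil.
any_nil = nil_masks.reduce { |acc, mask| acc.zip(mask).map { |a, b| a | b } }

# Remove the observations (rows) flagged by the combined mask.
kept = columns.transform_values do |values|
  values.reject.with_index { |_value, i| any_nil[i] }
end
```

An observation is removed as soon as any one of its values is nil, which is why the resulting row count can be smaller than the count of nils in any single column would suggest.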
- Keys or booleans by a block
`remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
@@ -581,11 +628,11 @@
Symbol key and String key are distinguished.
### `assign`
- Assign new variables (columns) and create a updated DataFrame.
+ Assign new or updated variables (columns) and create an updated DataFrame.
- Variables with new keys are appended at the bottom (right in the table).
- Variables with existing keys will update corresponding vectors.
![assign method image](doc/../image/dataframe/assign.png)
@@ -647,32 +694,135 @@
Vectors : 2 numeric, 1 string
# key type level data_preview
1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
+
+ # Or it's shorter like this:
+ df.assign do
+ variables.select.with_object({}) do |(key, vector), assigner|
+ assigner[key] = vector * -1 if vector.numeric?
+ end
+ end
+ # => same as above
```
- Key type
Symbol key and String key are considered as the same key.
## Updating
-- [ ] Update elements matching a condition
+### `sort`
-- [ ] Clamp
+ `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature.
+ - :key, "key" or "+key" denotes ascending order
+ - "-key" denotes descending order
-- [ ] Sort rows
+ ```ruby
+ df = RedAmber::DataFrame.new({
+ index: [1, 1, 0, nil, 0],
+ string: ['C', 'B', nil, 'A', 'B'],
+ bool: [nil, true, false, true, false],
+ })
+ df.sort(:index, '-bool').tdr(tally: 0)
+ # =>
+ RedAmber::DataFrame : 5 x 3 Vectors
+ Vectors : 1 numeric, 1 string, 1 boolean
+ # key type level data_preview
+ 1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
+ 2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
+ 3 :bool boolean 3 [false, false, true, nil, true], 1 nil
+ ```
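To make the sort-key convention concrete, here is a plain-Ruby sketch that reproduces the ordering shown above on an Array of row Hashes. It is not how RedAmber or Red Arrow implements sorting: `parse_sort_key`, `rank`, and `sort_rows` are hypothetical helpers, and placing nils last in both directions is an assumption read off the example output.

```ruby
# Hypothetical helper: "-key" means descending; :key, "key", "+key" ascending.
def parse_sort_key(key)
  name = key.to_s
  return [name[1..].to_sym, :descending] if name.start_with?('-')

  [name.delete_prefix('+').to_sym, :ascending]
end

# Booleans are not comparable in Ruby, so map them to 0/1 for ordering.
def rank(value)
  case value
  when true then 1
  when false then 0
  else value
  end
end

def sort_rows(rows, *sort_keys)
  specs = sort_keys.map { |key| parse_sort_key(key) }
  sorted = rows.each_with_index.to_a.sort do |(a, i), (b, j)|
    result = 0
    specs.each do |name, order|
      x = a[name]
      y = b[name]
      result =
        if x.nil? && y.nil?
          0
        elsif x.nil?
          1 # assumption: nil sorts last regardless of direction
        elsif y.nil?
          -1
        else
          cmp = rank(x) <=> rank(y)
          order == :descending ? -cmp : cmp
        end
      break unless result.zero?
    end
    # Fall back to the original position so ties keep their order.
    result.zero? ? i <=> j : result
  end
  sorted.map(&:first)
end

rows = [
  { index: 1,   string: 'C', bool: nil },
  { index: 1,   string: 'B', bool: true },
  { index: 0,   string: nil, bool: false },
  { index: nil, string: 'A', bool: true },
  { index: 0,   string: 'B', bool: false }
]
sorted = sort_rows(rows, :index, '-bool')
```

With these rows, `sorted` has the same `index`, `string`, and `bool` columns as the `tdr` output above.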
+- [ ] Clamp
+
- [ ] Clear data
## Treat na data
-- [ ] Drop na (NaN, nil)
+### `remove_nil`
-- [ ] Replace na with value
+ Remove any observations containing nil.
-- [ ] Interpolate na with convolution array
+## Grouping
+
+### `group(aggregating_keys, function, target_keys)`
+
+ Creates a grouped DataFrame by `aggregating_keys`, applies `function` to each group, and returns the results for `target_keys`. Aggregated key names follow the `function(key)` style.
+
+ (The current implementation is not intuitive. Needs improvement.)
+
+ ```ruby
+ ds = Datasets::Rdatasets.new('dplyr', 'starwars')
+ starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
+ starwars.tdr(11)
+ # =>
+ RedAmber::DataFrame : 87 x 11 Vectors
+ Vectors : 3 numeric, 8 strings
+ # key type level data_preview
+ 1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
+ 2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
+ 3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
+ 4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
+ 5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
+ 6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
+ 7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
+ 8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
+ 9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
+ 10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
+ 11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
+
+ grouped = starwars.group(:species, :mean, [:mass, :height])
+ # =>
+ #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
+ Vectors : 2 numeric, 1 string
+ # key type level data_preview
+ 1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
+ 2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
+ 3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
+
+ count = starwars.group(:species, :count, :species)[:"count(species)"]
+ df = grouped.slice(count > 1)
+ # =>
+ #<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
+ Vectors : 2 numeric, 1 string
+ # key type level data_preview
+ 1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
+ 2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
+ 3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
+
+ df.table
+ # =>
+ #<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
+ mean(mass) mean(height) species
+ 0 82.781818 176.645161 Human
+ 1 69.750000 131.200000 Droid
+ 2 124.000000 231.000000 Wookiee
+ 3 74.000000 208.666667 Gungan
+ 4 80.000000 173.000000 Zabrak
+ 5 55.000000 179.000000 Twi'lek
+ 6 53.100000 168.000000 Mirialan
+ 7 88.000000 221.000000 Kaminoan
+ ```
+
+ Available functions are:
+
+ - [ ] all
+ - [ ] any
+ - [ ] approximate_median
+ - ✓ count
+ - [ ] count_distinct
+ - [ ] distinct
+ - ✓ max
+ - ✓ mean
+ - ✓ min
+ - [ ] min_max
+ - ✓ product
+ - ✓ stddev
+ - ✓ sum
+ - [ ] tdigest
+ - ✓ variance
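As a rough illustration of what `group(:species, :mean, [:mass, :height])` computes and how the `function(key)` names are formed, the sketch below groups an Array of row Hashes in plain Ruby. `group_mean` is a hypothetical helper, not part of RedAmber; the rows are the first five observations visible in the `data_preview` above, and skipping nils when averaging is an assumption.

```ruby
# Hypothetical helper: group rows by one key and average the target keys.
def group_mean(rows, group_key, target_keys)
  rows.group_by { |row| row[group_key] }.map do |group_value, members|
    means = target_keys.to_h do |key|
      # Assumption: nils are ignored; an all-nil group yields nil.
      values = members.map { |row| row[key] }.compact
      mean = values.empty? ? nil : values.sum.to_f / values.size
      [:"mean(#{key})", mean]
    end
    means.merge(group_key => group_value)
  end
end

# First five observations taken from the preview above.
rows = [
  { name: 'Luke Skywalker', height: 172, mass: 77.0,  species: 'Human' },
  { name: 'C-3PO',          height: 167, mass: 75.0,  species: 'Droid' },
  { name: 'R2-D2',          height: 96,  mass: 32.0,  species: 'Droid' },
  { name: 'Darth Vader',    height: 202, mass: 136.0, species: 'Human' },
  { name: 'Leia Organa',    height: 150, mass: 49.0,  species: 'Human' }
]

grouped = group_mean(rows, :species, %i[mass height])
droid = grouped.find { |row| row[:species] == 'Droid' }
```

Each result row carries the aggregated keys first and the grouping key last, mirroring the column order in the output above. The means differ from the documented ones because only five rows are used here.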
## Combining DataFrames
- [ ] obs