doc/DataFrame.md in red_amber-0.1.6 vs doc/DataFrame.md in red_amber-0.1.7
- old
+ new
@@ -7,12 +7,10 @@
- `variable`s with the same vector length are aligned and arranged to form a `DataFrame`.
- Each `Vector` in a `DataFrame` contains a set of related data at the same position. We call it an `observation`.
![dataframe model image](doc/../image/dataframe_model.png)
-(No change in this model in v0.1.6 .)
-
## Constructors and saving
### `new` from a Hash
```ruby
@@ -35,10 +33,12 @@
### `new` from a Rover::DataFrame
```ruby
+ require 'rover'
+
rover = Rover::DataFrame.new(x: [1, 2, 3])
RedAmber::DataFrame.new(rover)
```
### `load` (class method)
@@ -59,10 +59,12 @@
```
- from a Parquet file
```ruby
+ require 'parquet'
+
dataframe = RedAmber::DataFrame.load("file.parquet")
```
### `save` (instance method)
@@ -73,10 +75,12 @@
- to a URI
- to a Parquet file
```ruby
+ require 'parquet'
+
dataframe.save("file.parquet")
```
## Properties
@@ -173,16 +177,45 @@
## Output
### `to_s`
+`to_s` returns a preview of the Table.
+
+```ruby
+puts penguins.to_s
+
+# =>
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
+ 5 Adelie Torgersen 36.7 19.3 193 ... 2007
+ : : : : : : ... :
+342 Gentoo Biscoe 50.4 15.7 222 ... 2009
+343 Gentoo Biscoe 45.2 14.8 212 ... 2009
+344 Gentoo Biscoe 49.9 16.1 213 ... 2009
+```
+### `inspect`
+
+`inspect` uses `to_s` output and also shows shape and object_id.
+
+
### `summary`, `describe` (not implemented)
### `to_rover`
- Returns a `Rover::DataFrame`.
+```ruby
+require 'rover'
+
+penguins.to_rover
+```
+
### `to_iruby`
- Show the DataFrame as a Table in Jupyter Notebook or Jupyter Lab with IRuby.
### `tdr(limit = 10, tally: 5, elements: 5)`
@@ -194,10 +227,11 @@
require 'red_amber'
require 'datasets-arrow'
penguins = Datasets::Penguins.new.to_arrow
RedAmber::DataFrame.new(penguins).tdr
+
# =>
RedAmber::DataFrame : 344 x 8 Vectors
Vectors : 5 numeric, 3 strings
# key type level data_preview
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
@@ -212,26 +246,10 @@
- limit: limit of variables to show. Default value is 10.
- tally: max level to use tally mode.
- elements: max number of elements to show in the data preview of each variable.
-### `inspect`
-
-- Returns the information of self as `tdr(3)`, and also shows object id.
-
- ```ruby
- puts penguins.inspect
- # =>
- #<RedAmber::DataFrame : 344 x 8 Vectors, 0x000000000000f0b4>
- Vectors : 5 numeric, 3 strings
- # key type level data_preview
- 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
- 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
- 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
- ... 5 more Vectors ...
- ```
-
## Selecting
### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
- Key in a Symbol: `df[:symbol]`
- Key in a String: `df["string"]`
@@ -248,31 +266,34 @@
```ruby
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
df = RedAmber::DataFrame.new(hash)
df[:b..:c, "a"]
+
# =>
- #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000b02c>
- Vectors : 2 numeric, 1 string
- # key type level data_preview
- 1 :b string 3 ["A", "B", "C"]
- 2 :c double 3 [1.0, 2.0, 3.0]
- 3 :a uint8 3 [1, 2, 3]
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000328fc>
+ b c a
+ <string> <double> <uint8>
+ 1 A 1.0 1
+ 2 B 2.0 2
+ 3 C 3.0 3
```
If `#[]` selects a single variable (column), it returns a Vector object.
```ruby
df[:a]
+
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
[1, 2, 3]
```
The `#v` method also returns a Vector for a key.
```ruby
df.v(:a)
+
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
[1, 2, 3]
```
@@ -292,18 +313,20 @@
- Mixed case: `df[2, 0..]`
```ruby
hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
df = RedAmber::DataFrame.new(hash)
- df[:b..:c, "a"].tdr(tally_level: 0)
+ df[2, 0..]
+
# =>
- RedAmber::DataFrame : 4 x 3 Vectors
- Vectors : 2 numeric, 1 string
- # key type level data_preview
- 1 :a uint8 3 [3, 1, 2, 3]
- 2 :b string 3 ["C", "A", "B", "C"]
- 3 :c double 3 [3.0, 1.0, 2.0, 3.0]
+ #<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000033270>
+ a b c
+ <uint8> <string> <double>
+ 1 3 C 3.0
+ 2 1 A 1.0
+ 3 2 B 2.0
+ 4 3 C 3.0
```
- Select obs. by a boolean Array or a boolean RedAmber::Vector of the same size as self.
It returns a sub dataframe with the observations where the boolean is true.
@@ -311,17 +334,16 @@
```ruby
# with the same dataframe `df` above
df[true, false, nil] # or
df[[true, false, nil]] # or
df[RedAmber::Vector.new([true, false, nil])]
+
# =>
- #<RedAmber::DataFrame : 1 x 3 Vectors, 0x000000000000f1a4>
- Vectors : 2 numeric, 1 string
- # key type level data_preview
- 1 :a uint8 1 [1]
- 2 :b string 1 ["A"]
- 3 :c double 1 [1.0]
+ #<RedAmber::DataFrame : 1 x 3 Vectors, 0x00000000000353e0>
+ a b c
+ <uint8> <string> <double>
+ 1 1 A 1.0
```
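The row-selection rules for `[]` can be modelled in plain Ruby. This is only a sketch of the semantics, not RedAmber's implementation: `select_rows` is a hypothetical helper, and observations are held as plain Arrays instead of Arrow columns.

```ruby
# Plain-Ruby model of row selection by `[]` (hypothetical helper, not RedAmber code).
ROWS = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]].freeze

def select_rows(rows, *selectors)
  selectors = selectors.flatten
  if selectors.all? { |s| [true, false, nil].include?(s) }
    # Boolean mask: keep rows whose flag is true; false and nil are dropped.
    rows.zip(selectors).select { |_, flag| flag }.map(&:first)
  else
    # Integers and Ranges: expand to positions, keeping order and duplicates.
    positions = (0...rows.size).to_a
    selectors.flat_map { |s| s.is_a?(Range) ? positions[s] : [positions[s]] }
             .map { |i| rows[i] }
  end
end

by_index = select_rows(ROWS, 2, (0..))          # mixed case, like df[2, 0..]
by_mask  = select_rows(ROWS, [true, false, nil]) # boolean case

p by_index # four rows: the third row, then rows one to three
p by_mask  # only the first row
```

In this model a nil flag behaves like false, which matches the one-row result shown for `df[true, false, nil]`.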
### Select rows from top or from bottom
`head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
@@ -338,47 +360,68 @@
`pick(keys)` accepts keys as arguments in an Array.
```ruby
penguins.pick(:species, :bill_length_mm)
+
# =>
- #<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000f924>
- Vectors : 1 numeric, 1 string
- # key type level data_preview
- 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
- 2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
+ #<RedAmber::DataFrame : 344 x 2 Vectors, 0x0000000000035ebc>
+ species bill_length_mm
+ <string> <double>
+ 1 Adelie 39.1
+ 2 Adelie 39.5
+ 3 Adelie 40.3
+ 4 Adelie (nil)
+ 5 Adelie 36.7
+ : : :
+ 342 Gentoo 50.4
+ 343 Gentoo 45.2
+ 344 Gentoo 49.9
```
- Booleans as an argument
`pick(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
```ruby
penguins.pick(penguins.types.map { |type| type == :string })
+
# =>
- #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f938>
- Vectors : 3 strings
- # key type level data_preview
- 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
- 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
- 3 :sex string 3 {"male"=>168, "female"=>165, ""=>11}
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x00000000000387ac>
+ species island sex
+ <string> <string> <string>
+ 1 Adelie Torgersen male
+ 2 Adelie Torgersen female
+ 3 Adelie Torgersen female
+ 4 Adelie Torgersen (nil)
+ 5 Adelie Torgersen female
+ : : : :
+ 342 Gentoo Biscoe male
+ 343 Gentoo Biscoe female
+ 344 Gentoo Biscoe male
```
- Keys or booleans by a block
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
```ruby
- # It is ok to write `keys ...` in the block, not `penguins.keys ...`
penguins.pick { keys.map { |key| key.end_with?('mm') } }
+
# =>
- #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
- Vectors : 3 numeric
- # key type level data_preview
- 1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
- 2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
- 3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
+ #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003dd4c>
+ bill_length_mm bill_depth_mm flipper_length_mm
+ <double> <double> <uint8>
+ 1 39.1 18.7 181
+ 2 39.5 17.4 186
+ 3 40.3 18.0 195
+ 4 (nil) (nil) (nil)
+ 5 36.7 19.3 193
+ : : : :
+ 342 50.4 15.7 222
+ 343 45.2 14.8 212
+ 344 49.9 16.1 213
```
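The three calling styles of `pick` can be imitated with a plain Hash of columns. The sketch below is an assumption-laden model, not RedAmber code: `pick_columns` is a hypothetical helper, and its block receives the keys as a parameter, whereas RedAmber evaluates the block in the context of the DataFrame.

```ruby
# Plain-Ruby model of `pick` (hypothetical helper; columns are kept in a Hash).
def pick_columns(columns, selector = nil, &block)
  raise ArgumentError, 'give arguments or a block, not both' if selector && block

  keys = columns.keys
  selector = block.call(keys) if block
  picked =
    if selector.all? { |s| [true, false, nil].include?(s) }
      # Boolean mask: one flag per key, keep the keys flagged true.
      raise ArgumentError, 'booleans must match the number of keys' unless selector.size == keys.size

      keys.zip(selector).select { |_, flag| flag }.map(&:first)
    else
      # Keys given as Symbols or Strings.
      selector.map(&:to_sym)
    end
  columns.slice(*picked)
end

# First three observations of the penguins data shown above (a subset of columns).
penguins_head = {
  species: %w[Adelie Adelie Adelie],
  island: %w[Torgersen Torgersen Torgersen],
  bill_length_mm: [39.1, 39.5, 40.3],
  bill_depth_mm: [18.7, 17.4, 18.0],
  flipper_length_mm: [181, 186, 195],
  year: [2007, 2007, 2007]
}

by_keys  = pick_columns(penguins_head, %i[species bill_length_mm])
by_block = pick_columns(penguins_head) { |keys| keys.map { |key| key.to_s.end_with?('mm') } }

p by_keys.keys
p by_block.keys
```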
### `drop ` - pick and drop -
Drop some variables (columns) to create a remainder DataFrame.
@@ -412,17 +455,21 @@
```ruby
df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
df.pick(:a) # or
df.drop(:b, :c)
+
# =>
- #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
- Vector : 1 numeric
- # key type level data_preview
- 1 :a uint8 3 [1, 2, 3]
+ #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000003f4bc>
+ a
+ <uint8>
+ 1 1
+ 2 2
+ 3 3
df[:a]
+
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
[1, 2, 3]
```
@@ -439,35 +486,47 @@
A negative index counting from the tail, as in Ruby's Array, is also acceptable.
```ruby
# returns 5 obs. at start and 5 obs. from end
penguins.slice(0...5, -5..-1)
+
# =>
- #<RedAmber::DataFrame : 10 x 8 Vectors, 0x000000000000f230>
- Vectors : 5 numeric, 3 strings
- # key type level data_preview
- 1 :species string 2 {"Adelie"=>5, "Gentoo"=>5}
- 2 :island string 2 {"Torgersen"=>5, "Biscoe"=>5}
- 3 :bill_length_mm double 9 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
- ... 5 more Vectors ...
+ #<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
+ 5 Adelie Torgersen 36.7 19.3 193 ... 2007
+ : : : : : : ... :
+ 8 Gentoo Biscoe 50.4 15.7 222 ... 2009
+ 9 Gentoo Biscoe 45.2 14.8 212 ... 2009
+ 10 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Booleans as an argument
`slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
```ruby
vector = penguins[:bill_length_mm]
penguins.slice(vector >= 40)
+
# =>
- #<RedAmber::DataFrame : 242 x 8 Vectors, 0x000000000000f2bc>
- Vectors : 5 numeric, 3 strings
- # key type level data_preview
- 1 :species string 3 {"Adelie"=>51, "Chinstrap"=>68, "Gentoo"=>123}
- 2 :island string 3 {"Torgersen"=>18, "Biscoe"=>139, "Dream"=>85}
- 3 :bill_length_mm double 115 [40.3, 42.0, 41.1, 42.5, 46.0, ... ]
- ... 5 more Vectors ...
+ #<RedAmber::DataFrame : 242 x 8 Vectors, 0x0000000000043d3c>
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 1 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 2 Adelie Torgersen 42.0 20.2 190 ... 2007
+ 3 Adelie Torgersen 41.1 17.6 182 ... 2007
+ 4 Adelie Torgersen 42.5 20.7 197 ... 2007
+ 5 Adelie Torgersen 46.0 21.5 194 ... 2007
+ : : : : : : ... :
+ 240 Gentoo Biscoe 50.4 15.7 222 ... 2009
+ 241 Gentoo Biscoe 45.2 14.8 212 ... 2009
+ 242 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Indices or booleans by a block
`slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
@@ -480,26 +539,32 @@
max = vector.mean + vector.std
vector.to_a.map { |e| (min..max).include? e }
end
# =>
- #<RedAmber::DataFrame : 204 x 8 Vectors, 0x000000000000f30c>
- Vectors : 5 numeric, 3 strings
- # key type level data_preview
- 1 :species string 3 {"Adelie"=>82, "Chinstrap"=>33, "Gentoo"=>89}
- 2 :island string 3 {"Torgersen"=>31, "Biscoe"=>112, "Dream"=>61}
- 3 :bill_length_mm double 90 [39.1, 39.5, 40.3, 39.3, 38.9, ... ]
- ... 5 more Vectors ...
+ #<RedAmber::DataFrame : 204 x 8 Vectors, 0x0000000000047a40>
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 4 Adelie Torgersen 39.3 20.6 190 ... 2007
+ 5 Adelie Torgersen 38.9 17.8 181 ... 2007
+ : : : : : : ... :
+ 202 Gentoo Biscoe 47.2 13.7 214 ... 2009
+ 203 Gentoo Biscoe 46.8 14.3 215 ... 2009
+ 204 Gentoo Biscoe 45.2 14.8 212 ... 2009
```
- Notice: nil option
- `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row.
```ruby
hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
table = Arrow::Table.new(hash)
table.slice([true, false, nil])
+
# =>
#<Arrow::Table:0x7fdfe44b9e18 ptr=0x555e9fe744d0>
a b c
0 1 A 1.000000
1 (null) (null) (null)
@@ -507,10 +572,11 @@
- Whereas in RedAmber, nil in the booleans given to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table.filter` method.
```ruby
RedAmber::DataFrame.new(table).slice([true, false, nil]).table
+
# =>
#<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
a b c
0 1 A 1.000000
```
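The difference between the two null-selection behaviors can be sketched in plain Ruby. `filter_rows` and its `null_selection:` keyword are hypothetical names used only for this illustration; they mimic the `:emit_null` and `:drop` options described above and are not part of Arrow or RedAmber.

```ruby
# Plain-Ruby model of the two null-selection behaviors (hypothetical helper).
def filter_rows(rows, mask, null_selection: :drop)
  rows.zip(mask).each_with_object([]) do |(row, flag), kept|
    if flag
      kept << row
    elsif flag.nil? && null_selection == :emit_null
      kept << Array.new(row.size) # a row of nils stands in for the null row
    end
  end
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [3, 'C', 3.0]]
mask = [true, false, nil]

emit_null = filter_rows(rows, mask, null_selection: :emit_null) # like Arrow::Table#slice
dropped   = filter_rows(rows, mask)                             # like DataFrame#slice

p emit_null
p dropped
```

With `:emit_null` the nil flag yields an all-nil row, while with `:drop` it is skipped like a false flag.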
@@ -526,40 +592,48 @@
`remove(indices)` accepts indices as arguments. Indices should be an Integer or a Range of Integer.
```ruby
# returns 6th to 339th obs.
penguins.remove(0...5, -5..-1)
+
# =>
- #<RedAmber::DataFrame : 334 x 8 Vectors, 0x000000000000f320>
- Vectors : 5 numeric, 3 strings
- # key type level data_preview
- 1 :species string 3 {"Adelie"=>147, "Chinstrap"=>68, "Gentoo"=>119}
- 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>124}
- 3 :bill_length_mm double 162 [39.3, 38.9, 39.2, 34.1, 42.0, ... ]
- ... 5 more Vectors ...
+ #<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4>
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 1 Adelie Torgersen 39.3 20.6 190 ... 2007
+ 2 Adelie Torgersen 38.9 17.8 181 ... 2007
+ 3 Adelie Torgersen 39.2 19.6 195 ... 2007
+ 4 Adelie Torgersen 34.1 18.1 193 ... 2007
+ 5 Adelie Torgersen 42.0 20.2 190 ... 2007
+ : : : : : : ... :
+ 332 Gentoo Biscoe 44.5 15.7 217 ... 2009
+ 333 Gentoo Biscoe 48.8 16.2 222 ... 2009
+ 334 Gentoo Biscoe 47.2 13.7 214 ... 2009
```
- Booleans as an argument
`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
```ruby
# remove all observations containing nil
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
- removed.tdr
+ removed
+
# =>
- RedAmber::DataFrame : 333 x 8 Vectors
- Vectors : 5 numeric, 3 strings
- # key type level data_preview
- 1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
- 2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
- 3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
- 4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
- 5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
- 6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
- 7 :sex string 2 {"male"=>168, "female"=>165}
- 8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
+ #<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac>
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 4 Adelie Torgersen 36.7 19.3 193 ... 2007
+ 5 Adelie Torgersen 39.3 20.6 190 ... 2007
+ : : : : : : ... :
+ 331 Gentoo Biscoe 50.4 15.7 222 ... 2009
+ 332 Gentoo Biscoe 45.2 14.8 212 ... 2009
+ 333 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Indices or booleans by a block
`remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
@@ -569,47 +643,59 @@
vector = self[:bill_length_mm]
min = vector.mean - vector.std
max = vector.mean + vector.std
vector.to_a.map { |e| (min..max).include? e }
end
+
# =>
- #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000000f370>
- Vectors : 5 numeric, 3 strings
- # key type level data_preview
- 1 :species string 3 {"Adelie"=>70, "Chinstrap"=>35, "Gentoo"=>35}
- 2 :island string 3 {"Torgersen"=>21, "Biscoe"=>56, "Dream"=>63}
- 3 :bill_length_mm double 75 [nil, 36.7, 34.1, 37.8, 37.8, ... ], 2 nils
- ... 5 more Vectors ...
+ #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000004de40>
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 1 Adelie Torgersen (nil) (nil) (nil) ... 2007
+ 2 Adelie Torgersen 36.7 19.3 193 ... 2007
+ 3 Adelie Torgersen 34.1 18.1 193 ... 2007
+ 4 Adelie Torgersen 37.8 17.1 186 ... 2007
+ 5 Adelie Torgersen 37.8 17.3 180 ... 2007
+ : : : : : : ... :
+ 138 Gentoo Biscoe (nil) (nil) (nil) ... 2009
+ 139 Gentoo Biscoe 50.4 15.7 222 ... 2009
+ 140 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Notice for nil
- When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
```ruby
df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
booleans = df[:a] < 2
+ booleans
+
# =>
#<RedAmber::Vector(:boolean, size=3):0x000000000000f410>
[true, false, nil]
booleans_invert = booleans.to_a.map(&:!) # => [false, true, true]
+
df.slice(booleans) == df.remove(booleans_invert) # => true
```
+
- Whereas `Vector#invert` returns nil for nil elements. This will give a different result.
```ruby
booleans.invert
+
# =>
#<RedAmber::Vector(:boolean, size=3):0x000000000000f488>
[false, true, nil]
df.remove(booleans.invert)
- #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000000f474>
- Vectors : 2 numeric, 1 string
- # key type level data_preview
- 1 :a uint8 2 [1, nil], 1 nil
- 2 :b string 2 ["A", "C"]
- 3 :c double 2 [1.0, 3.0]
+
+ # =>
+ #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000005df98>
+ a b c
+ <uint8> <string> <double>
+ 1 1 A 1.0
+ 2 (nil) C 3.0
```
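The two ways of inverting a boolean mask can be compared in plain Ruby. This is a sketch of the semantics only: `slice_rows` and `remove_rows` are hypothetical helpers over Arrays of rows, and the nil-preserving inversion is written by hand to imitate what the text says about `Vector#invert`.

```ruby
# Plain-Ruby model of `slice` and `remove` with a boolean mask (hypothetical helpers).
def slice_rows(rows, mask)
  rows.zip(mask).select { |_, flag| flag }.map(&:first)
end

def remove_rows(rows, mask)
  # nil counts as false here, so a row with a nil flag is kept.
  rows.zip(mask).reject { |_, flag| flag }.map(&:first)
end

rows = [[1, 'A', 1.0], [2, 'B', 2.0], [nil, 'C', 3.0]]
booleans = [true, false, nil] # the result of `a < 2` in the example above

ruby_invert    = booleans.map(&:!)                         # nil becomes true
nil_preserving = booleans.map { |b| b.nil? ? nil : !b }    # nil stays nil

same      = slice_rows(rows, booleans) == remove_rows(rows, ruby_invert)
different = remove_rows(rows, nil_preserving)

p ruby_invert
p nil_preserving
p same
p different
```

Because `!nil` is true in Ruby, the Ruby-style inversion removes the row with the nil flag, while the nil-preserving inversion keeps it.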
### `rename`
Rename keys (column names) to create an updated DataFrame.
@@ -619,19 +705,20 @@
- Key pairs as arguments
`rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
```ruby
- h = { 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] }
- df = RedAmber::DataFrame.new(h)
+ df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
df.rename(:age => :age_in_1993)
+
# =>
- #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
- Vectors : 1 numeric, 1 string
- # key type level data_preview
- 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
- 2 :age_in_1993 uint8 3 [68, 49, 28]
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000060838>
+ name age_in_1993
+ <string> <uint8>
+ 1 Yasuko 68
+ 2 Rui 49
+ 3 Hinata 28
```
- Key pairs by a block
`rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. The block is called in the context of self.
@@ -653,29 +740,33 @@
`assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
```ruby
df = RedAmber::DataFrame.new(
- 'name' => %w[Yasuko Rui Hinata],
- 'age' => [68, 49, 28])
+ name: %w[Yasuko Rui Hinata],
+ age: [68, 49, 28])
+ df
+
# =>
- #<RedAmber::DataFrame : 3 x 2 Vectors, 0x000000000000f8fc>
- Vectors : 1 numeric, 1 string
- # key type level data_preview
- 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
- 2 :age uint8 3 [68, 49, 28]
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
+ name age
+ <string> <uint8>
+ 1 Yasuko 68
+ 2 Rui 49
+ 3 Hinata 28
# update :age and add :brother
assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
df.assign(assigner)
+
# =>
- #<RedAmber::DataFrame : 3 x 3 Vectors, 0x000000000000f960>
- Vectors : 1 numeric, 2 strings
- # key type level data_preview
- 1 :name string 3 ["Yasuko", "Rui", "Hinata"]
- 2 :age uint8 3 [97, 78, 57]
- 3 :brother string 3 ["Santa", nil, "Momotaro"], 1 nil
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
+ name age brother
+ <string> <uint8> <string>
+ 1 Yasuko 97 Santa
+ 2 Rui 78 (nil)
+ 3 Hinata 57 Momotaro
```
- Key pairs by a block
`assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. The block is called in the context of self.
@@ -683,40 +774,48 @@
```ruby
df = RedAmber::DataFrame.new(
index: [0, 1, 2, 3, nil],
float: [0.0, 1.1, 2.2, Float::NAN, nil],
string: ['A', 'B', 'C', 'D', nil])
+ df
+
# =>
- #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f8c0>
- Vectors : 2 numeric, 1 string
- # key type level data_preview
- 1 :index uint8 5 [0, 1, 2, 3, nil], 1 nil
- 2 :float double 5 [0.0, 1.1, 2.2, NaN, nil], 1 NaN, 1 nil
- 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
+ index float string
+ <uint8> <double> <string>
+ 1 0 0.0 A
+ 2 1 1.1 B
+ 3 2 2.2 C
+ 4 3 NaN D
+ 5 (nil) (nil) (nil)
# update numeric variables
df.assign do
assigner = {}
vectors.each_with_index do |v, i|
assigner[keys[i]] = v * -1 if v.numeric?
end
assigner
end
+
# =>
- #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f924>
- Vectors : 2 numeric, 1 string
- # key type level data_preview
- 1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
- 2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
- 3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000006e000>
+ index float string
+ <int8> <double> <string>
+ 1 0 -0.0 A
+ 2 -1 -1.1 B
+ 3 -2 -2.2 C
+ 4 -3 NaN D
+ 5 (nil) (nil) (nil)
# Or it's shorter like this:
df.assign do
variables.select.with_object({}) do |(key, vector), assigner|
assigner[key] = vector * -1 if vector.numeric?
end
end
+
# => same as above
```
- Key type
@@ -734,18 +833,21 @@
df = RedAmber::DataFrame.new({
index: [1, 1, 0, nil, 0],
string: ['C', 'B', nil, 'A', 'B'],
bool: [nil, true, false, true, false],
})
- df.sort(:index, '-bool').tdr(tally: 0)
+ df.sort(:index, '-bool')
+
# =>
- RedAmber::DataFrame : 5 x 3 Vectors
- Vectors : 1 numeric, 1 string, 1 boolean
- # key type level data_preview
- 1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
- 2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
- 3 :bool boolean 3 [false, false, true, nil, true], 1 nil
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000009b03c>
+ index string bool
+ <uint8> <string> <boolean>
+ 1 0 (nil) false
+ 2 0 B false
+ 3 1 B true
+ 4 1 C (nil)
+ 5 (nil) A true
```
- [ ] Clamp
- [ ] Clear data
@@ -756,71 +858,21 @@
Remove any observations containing nil.
## Grouping
-### `group(aggregating_keys, function, target_keys)`
+### `group(aggregating_keys)`
- (This is a temporary API and may change in the future version.)
+ (
+ This API will change in a future version. In particular, I want to change:
+ - The order of the columns in the result (aggregation_keys should come first)
+ - DataFrame#group will accept a block (heronshoes/red_amber #28)
+ )
- Create grouped dataframe by `aggregation_keys` and apply `function` to each group and returns in `target_keys`. Aggregated key name is `function(key)` style.
+ `group` creates an object of the `Group` class. A `Group` accepts the functions below as methods.
+ Each method accepts `summary_keys` as its arguments.
- (The current implementation is not intuitive. Needs improvement.)
-
- ```ruby
- ds = Datasets::Rdatasets.new('dplyr', 'starwars')
- starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
- starwars.tdr(11)
- # =>
- RedAmber::DataFrame : 87 x 11 Vectors
- Vectors : 3 numeric, 8 strings
- # key type level data_preview
- 1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
- 2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
- 3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
- 4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
- 5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", .. . ]
- 6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
- 7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
- 8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
- 9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
- 10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
- 11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
-
- grouped = starwars.group(:species, :mean, [:mass, :height])
- # =>
- #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
- Vectors : 2 numeric, 1 string
- # key type level data_preview
- 1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
- 2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
- 3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
-
- count = starwars.group(:species, :count, :species)[:"count(species)"]
- df = grouped.slice(count > 1)
- # =>
- #<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
- Vectors : 2 numeric, 1 string
- # key type level data_preview
- 1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
- 2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
- 3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ... ]
-
- df.table
- # =>
- #<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
- mean(mass) mean(height) species
- 0 82.781818 176.645161 Human
- 1 69.750000 131.200000 Droid
- 2 124.000000 231.000000 Wookiee
- 3 74.000000 208.666667 Gungan
- 4 80.000000 173.000000 Zabrak
- 5 55.000000 179.000000 Twi'lek
- 6 53.100000 168.000000 Mirialan
- 7 88.000000 221.000000 Kaminoan
- ```
-
Available functions are:
- [ ] all
- [ ] any
- [ ] approximate_median
@@ -835,13 +887,119 @@
- ✓ stddev
- ✓ sum
- [ ] tdigest
- ✓ variance
+ For each group of `aggregation_keys`, the aggregation `function` is applied, and a new dataframe is returned with aggregated keys according to `summary_keys`.
+ Aggregated key names are in the `function(summary_key)` style.
+
+ This is an example of grouping the famous Star Wars dataset.
+
+ ```ruby
+ starwars =
+ RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
+ starwars
+
+ # =>
+ #<RedAmber::DataFrame : 87 x 12 Vectors, 0x00000000000773bc>
+ species name height mass hair_color skin_color eye_color ... homeworld
+ <string> <string> <int64> <double> <string> <string> <string> ... <string>
+ Human 1 Luke Skywalker 172 77.0 blond fair blue ... Tatooine
+ Droid 2 C-3PO 167 75.0 NA gold yellow ... Tatooine
+ Droid 3 R2-D2 96 32.0 NA white, blue red ... Naboo
+ Human 4 Darth Vader 202 136.0 none white yellow ... Tatooine
+ Human 5 Leia Organa 150 49.0 brown light brown ... Alderaan
+ : : : : : : : : ... :
+ Droid 85 BB8 (nil) (nil) none none black ... NA
+ NA 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
+ Human 87 Padmé Amidala 165 45.0 brown light brown ... Naboo
+
+ starwars.tdr(12)
+
+ # =>
+ RedAmber::DataFrame : 87 x 12 Vectors
+ Vectors : 4 numeric, 8 strings
+ # key type level data_preview
+ 1 :"" int64 87 [1, 2, 3, 4, 5, ... ]
+ 2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
+ 3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
+ 4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
+ 5 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ]
+ 6 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
+ 7 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
+ 8 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
+ 9 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, "NA"=>4}
+ 10 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4}
+ 11 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ]
+ 12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
+ ```
+
+ We can aggregate for `:species` and calculate the mean of `:mass` and `:height`.
+
+ ```ruby
+ grouped = starwars.group(:species).mean(:mass, :height)
+ grouped
+
+ # =>
+ #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000008e620>
+ mean(mass) mean(height) species
+ <double> <double> <string>
+ 1 82.8 176.6 Human
+ 2 69.8 131.2 Droid
+ 3 124.0 231.0 Wookiee
+ 4 74.0 173.0 Rodian
+ 5 1358.0 175.0 Hutt
+ : : : :
+ 36 159.0 216.0 Kaleesh
+ 37 80.0 206.0 Pau'an
+ 38 80.0 188.0 Kel Dor
+ ```
+
+ Select rows for count > 1.
+
+ ```ruby
+ count = starwars.group(:species).count(:species)[:'count(species)'] # => Vector
+ grouped = grouped.slice(count > 1)
+
+ # =>
+ #<RedAmber::DataFrame : 9 x 3 Vectors, 0x0000000000098260>
+ mean(mass) mean(height) species
+ <double> <double> <string>
+ 1 82.8 176.6 Human
+ 2 69.8 131.2 Droid
+ 3 124.0 231.0 Wookiee
+ 4 74.0 208.7 Gungan
+ 5 48.0 181.3 NA
+ : : : :
+ 7 55.0 179.0 Twi'lek
+ 8 53.1 168.0 Mirialan
+ 9 88.0 221.0 Kaminoan
+ ```
+
+ Assemble the result and change the order of columns.
+
+ ```ruby
+ grouped.assign(count: count[count > 1]).pick { [2,3,0,1].map{ |i| keys[i] } }
+
+ # =>
+ #<RedAmber::DataFrame : 9 x 4 Vectors, 0x0000000000141838>
+ species count mean(mass) mean(height)
+ <string> <uint8> <double> <double>
+ 1 Human 35 82.8 176.6
+ 2 Droid 6 69.8 131.2
+ 3 Wookiee 2 124.0 231.0
+ 4 Gungan 3 74.0 208.7
+ 5 NA 4 48.0 181.3
+ : : : : :
+ 7 Twi'lek 2 55.0 179.0
+ 8 Mirialan 2 53.1 168.0
+ 9 Kaminoan 2 88.0 221.0
+ ```
+
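The grouping step can also be pictured with Ruby's own `Enumerable#group_by`. The sketch below is not how RedAmber or Arrow compute the result: `group_mean` is a hypothetical helper, the rows are the first five observations of the dataset shown above, and skipping nil values in the mean is an assumption of this sketch.

```ruby
# Plain-Ruby model of `group(...).mean(...)` (hypothetical helper, not RedAmber code).
def group_mean(rows, group_key, *summary_keys)
  rows.group_by { |row| row[group_key] }.map do |group_value, members|
    result = {}
    summary_keys.each do |key|
      values = members.map { |row| row[key] }.compact # assumption: nil is skipped
      # Name the aggregated key in the `function(summary_key)` style.
      result[:"mean(#{key})"] = values.empty? ? nil : values.sum.to_f / values.size
    end
    result[group_key] = group_value # the grouping key comes last, as in the output above
    result
  end
end

# First five observations of the starwars data (only the columns used here).
rows = [
  { species: 'Human', mass: 77.0,  height: 172 },
  { species: 'Droid', mass: 75.0,  height: 167 },
  { species: 'Droid', mass: 32.0,  height: 96 },
  { species: 'Human', mass: 136.0, height: 202 },
  { species: 'Human', mass: 49.0,  height: 150 }
]

grouped = group_mean(rows, :species, :mass, :height)
grouped.each { |row| p row }
```

The means differ from the table above because only five observations are used here; the point is the shape of the result, one row per group with `mean(mass)`, `mean(height)` and `species` keys.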
## Combining DataFrames
-- [ ] obs
+- [ ] Combining rows to a dataframe
- [ ] Add vars
- [ ] Inner join
@@ -850,5 +1008,7 @@
## Encoding
- [ ] One-hot encoding
## Iteration (not implemented)
+
+- [ ] each_rows