doc/DataFrame.md in red_amber-0.2.1 vs doc/DataFrame.md in red_amber-0.2.2
- old
+ new
@@ -12,34 +12,42 @@
## Constructors and saving
### `new` from a Hash
```ruby
- RedAmber::DataFrame.new(x: [1, 2, 3])
+ df = RedAmber::DataFrame.new(x: [1, 2, 3], y: %w[A B C])
```
### `new` from a schema (by Hash) and data (by Array)
```ruby
- RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
+ RedAmber::DataFrame.new({x: :uint8, y: :string}, [[1, "A"], [2, "B"], [3, "C"]])
```
### `new` from an Arrow::Table
```ruby
- table = Arrow::Table.new(x: [1, 2, 3])
+ table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C])
RedAmber::DataFrame.new(table)
```
+### `new` from an Object which responds to `to_arrow`
+
+ ```ruby
+ require "datasets-arrow"
+ dataset = Datasets::Penguins.new
+ RedAmber::DataFrame.new(dataset)
+ ```
+
### `new` from a Rover::DataFrame
```ruby
require 'rover'
- rover = Rover::DataFrame.new(x: [1, 2, 3])
+ rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C])
RedAmber::DataFrame.new(rover)
```
### `load` (class method)
@@ -61,11 +69,11 @@
- from a Parquet file
```ruby
require 'parquet'
- dataframe = RedAmber::DataFrame.load("file.parquet")
+ df = RedAmber::DataFrame.load("file.parquet")
```
### `save` (instance method)
- to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file
@@ -77,11 +85,11 @@
- to a Parquet file
```ruby
require 'parquet'
- dataframe.save("file.parquet")
+ df.save("file.parquet")
```
## Properties
### `table`, `to_arrow`
@@ -208,19 +216,19 @@
puts penguins.to_s
# =>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
<string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 39.1 18.7 181 ... 2007
- 2 Adelie Torgersen 39.5 17.4 186 ... 2007
- 3 Adelie Torgersen 40.3 18.0 195 ... 2007
- 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
- 5 Adelie Torgersen 36.7 19.3 193 ... 2007
+ 0 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 1 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 2 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 3 Adelie Torgersen (nil) (nil) (nil) ... 2007
+ 4 Adelie Torgersen 36.7 19.3 193 ... 2007
: : : : : : ... :
-342 Gentoo Biscoe 50.4 15.7 222 ... 2009
-343 Gentoo Biscoe 45.2 14.8 212 ... 2009
-344 Gentoo Biscoe 49.9 16.1 213 ... 2009
+341 Gentoo Biscoe 50.4 15.7 222 ... 2009
+342 Gentoo Biscoe 45.2 14.8 212 ... 2009
+343 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
### `inspect`
`inspect` uses `to_s` output and also shows shape and object_id.
@@ -233,15 +241,15 @@
puts penguins.summary.to_s(width: 82) # needs more width to show all stats in this example
# =>
variables count mean std min 25% median 75% max
<dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double>
-1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6
-2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5
-3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0
-4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0
-5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0
+0 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6
+1 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5
+2 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0
+3 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0
+4 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0
```
### `to_rover`
- Returns a `Rover::DataFrame`.
@@ -263,25 +271,26 @@
```ruby
require 'red_amber'
require 'datasets-arrow'
- penguins = Datasets::Penguins.new.to_arrow
- RedAmber::DataFrame.new(penguins).tdr
+ dataset = Datasets::Penguins.new
+ # (From 0.2.2) `new` accepts any object that responds to `to_arrow`.
+ RedAmber::DataFrame.new(dataset).tdr
# =>
RedAmber::DataFrame : 344 x 8 Vectors
Vectors : 5 numeric, 3 strings
# key type level data_preview
- 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
- 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
- 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
- 4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
- 5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
- 6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
- 7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
- 8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
+ 0 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
+ 1 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
+ 2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
+ 3 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
+ 4 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
+ 5 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
+ 6 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
+ 7 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
```
- limit: limit of variables to show. Default value is 10.
- tally: max level to use tally mode.
- elements: max number of elements to show in each observation.
@@ -309,13 +318,13 @@
# =>
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000328fc>
b c a
<string> <double> <uint8>
- 1 A 1.0 1
- 2 B 2.0 2
- 3 C 3.0 3
+ 0 A 1.0 1
+ 1 B 2.0 2
+ 2 C 3.0 3
```
If `#[]` specifies a single variable (column), it returns a Vector object.
```ruby
@@ -357,14 +366,14 @@
# =>
#<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000033270>
a b c
<uint8> <string> <double>
- 1 3 C 3.0
- 2 1 A 1.0
- 3 2 B 2.0
- 4 3 C 3.0
+ 0 3 C 3.0
+ 1 1 A 1.0
+ 2 2 B 2.0
+ 3 3 C 3.0
```
- Select observations by a boolean Array or a boolean RedAmber::Vector of the same size as self.
It returns a sub-DataFrame containing the observations where the boolean is true.
@@ -403,19 +412,19 @@
# =>
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x0000000000035ebc>
species bill_length_mm
<string> <double>
- 1 Adelie 39.1
- 2 Adelie 39.5
- 3 Adelie 40.3
- 4 Adelie (nil)
- 5 Adelie 36.7
+ 0 Adelie 39.1
+ 1 Adelie 39.5
+ 2 Adelie 40.3
+ 3 Adelie (nil)
+ 4 Adelie 36.7
: : :
- 342 Gentoo 50.4
- 343 Gentoo 45.2
- 344 Gentoo 49.9
+ 341 Gentoo 50.4
+ 342 Gentoo 45.2
+ 343 Gentoo 49.9
```
- Indices as arguments
`pick(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
@@ -425,41 +434,41 @@
# =>
#<RedAmber::DataFrame : 344 x 4 Vectors, 0x0000000000055ce4>
species island bill_length_mm year
<string> <string> <double> <uint16>
- 1 Adelie Torgersen 39.1 2007
- 2 Adelie Torgersen 39.5 2007
- 3 Adelie Torgersen 40.3 2007
- 4 Adelie Torgersen (nil) 2007
- 5 Adelie Torgersen 36.7 2007
+ 0 Adelie Torgersen 39.1 2007
+ 1 Adelie Torgersen 39.5 2007
+ 2 Adelie Torgersen 40.3 2007
+ 3 Adelie Torgersen (nil) 2007
+ 4 Adelie Torgersen 36.7 2007
: : : : :
- 342 Gentoo Biscoe 50.4 2009
- 343 Gentoo Biscoe 45.2 2009
- 344 Gentoo Biscoe 49.9 2009
+ 341 Gentoo Biscoe 50.4 2009
+ 342 Gentoo Biscoe 45.2 2009
+ 343 Gentoo Biscoe 49.9 2009
```
- Booleans as arguments
`pick(booleans)` accepts booleans as arguments in an Array. The Array must have the same length as `n_keys`.
```ruby
- penguins.pick(penguins.types.map { |type| type == :string })
+ penguins.pick(penguins.vectors.map(&:string?))
# =>
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x00000000000387ac>
species island sex
<string> <string> <string>
- 1 Adelie Torgersen male
+ 0 Adelie Torgersen male
+ 1 Adelie Torgersen female
2 Adelie Torgersen female
- 3 Adelie Torgersen female
- 4 Adelie Torgersen (nil)
- 5 Adelie Torgersen female
+ 3 Adelie Torgersen (nil)
+ 4 Adelie Torgersen female
: : : :
- 342 Gentoo Biscoe male
- 343 Gentoo Biscoe female
- 344 Gentoo Biscoe male
+ 341 Gentoo Biscoe male
+ 342 Gentoo Biscoe female
+ 343 Gentoo Biscoe male
```
- Keys or booleans by a block
`pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
@@ -469,19 +478,19 @@
# =>
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003dd4c>
bill_length_mm bill_depth_mm flipper_length_mm
<double> <double> <uint8>
- 1 39.1 18.7 181
- 2 39.5 17.4 186
- 3 40.3 18.0 195
- 4 (nil) (nil) (nil)
- 5 36.7 19.3 193
+ 0 39.1 18.7 181
+ 1 39.5 17.4 186
+ 2 40.3 18.0 195
+ 3 (nil) (nil) (nil)
+ 4 36.7 19.3 193
: : : :
- 342 50.4 15.7 222
- 343 45.2 14.8 212
- 344 49.9 16.1 213
+ 341 50.4 15.7 222
+ 342 45.2 14.8 212
+ 343 49.9 16.1 213
```
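The boolean form of `pick` can be pictured in plain Ruby. The sketch below does not use RedAmber and is not its implementation; the `columns` Hash and its values are hypothetical stand-ins for a DataFrame. It only illustrates that the mask holds one boolean per key and that columns with a `true` entry are kept.

```ruby
# Plain-Ruby illustration only; RedAmber's own implementation may differ.
# A tiny "table" as a Hash of column key => values (hypothetical data).
columns = {
  species: %w[Adelie Gentoo],
  island: %w[Torgersen Biscoe],
  bill_length_mm: [39.1, 50.4]
}

# One boolean per key, so the mask length equals the number of keys.
mask = columns.values.map { |values| values.all?(String) }
raise ArgumentError, "mask must match the number of keys" unless mask.size == columns.size

# Keep only the columns whose mask entry is true.
picked_keys = columns.keys.zip(mask).select { |_key, keep| keep }.map(&:first)
picked = columns.slice(*picked_keys)

p picked.keys
```

The length check mirrors the requirement stated above that the booleans match `n_keys`.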
### `drop ` - pick and drop -
Drop some columns (variables) to create a remainder DataFrame.
@@ -524,13 +533,13 @@
# =>
#<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000003f4bc>
a
<uint8>
- 1 1
- 2 2
- 3 3
+ 0 1
+ 1 2
+ 2 3
df[:a]
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
@@ -564,21 +573,21 @@
# returns 5 obs. at start and 5 obs. from end
penguins.slice(0...5, -5..-1)
# =>
#<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
- species island bill_length_mm bill_depth_mm flipper_length_mm ... year
- <string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 39.1 18.7 181 ... 2007
- 2 Adelie Torgersen 39.5 17.4 186 ... 2007
- 3 Adelie Torgersen 40.3 18.0 195 ... 2007
- 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
- 5 Adelie Torgersen 36.7 19.3 193 ... 2007
- : : : : : : ... :
- 8 Gentoo Biscoe 50.4 15.7 222 ... 2009
- 9 Gentoo Biscoe 45.2 14.8 212 ... 2009
- 10 Gentoo Biscoe 49.9 16.1 213 ... 2009
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 0 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 1 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 2 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 3 Adelie Torgersen (nil) (nil) (nil) ... 2007
+ 4 Adelie Torgersen 36.7 19.3 193 ... 2007
+ : : : : : : ... :
+ 7 Gentoo Biscoe 50.4 15.7 222 ... 2009
+ 8 Gentoo Biscoe 45.2 14.8 212 ... 2009
+ 9 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Booleans as an argument
`slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. The booleans must have the same length as `size`.
@@ -589,19 +598,19 @@
# =>
#<RedAmber::DataFrame : 242 x 8 Vectors, 0x0000000000043d3c>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
<string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 40.3 18.0 195 ... 2007
- 2 Adelie Torgersen 42.0 20.2 190 ... 2007
- 3 Adelie Torgersen 41.1 17.6 182 ... 2007
- 4 Adelie Torgersen 42.5 20.7 197 ... 2007
- 5 Adelie Torgersen 46.0 21.5 194 ... 2007
+ 0 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 1 Adelie Torgersen 42.0 20.2 190 ... 2007
+ 2 Adelie Torgersen 41.1 17.6 182 ... 2007
+ 3 Adelie Torgersen 42.5 20.7 197 ... 2007
+ 4 Adelie Torgersen 46.0 21.5 194 ... 2007
: : : : : : ... :
- 240 Gentoo Biscoe 50.4 15.7 222 ... 2009
- 241 Gentoo Biscoe 45.2 14.8 212 ... 2009
- 242 Gentoo Biscoe 49.9 16.1 213 ... 2009
+ 239 Gentoo Biscoe 50.4 15.7 222 ... 2009
+ 240 Gentoo Biscoe 45.2 14.8 212 ... 2009
+ 241 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Indices or booleans by a block
`slice {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
@@ -617,19 +626,19 @@
# =>
#<RedAmber::DataFrame : 204 x 8 Vectors, 0x0000000000047a40>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
<string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 39.1 18.7 181 ... 2007
- 2 Adelie Torgersen 39.5 17.4 186 ... 2007
- 3 Adelie Torgersen 40.3 18.0 195 ... 2007
- 4 Adelie Torgersen 39.3 20.6 190 ... 2007
- 5 Adelie Torgersen 38.9 17.8 181 ... 2007
+ 0 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 1 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 2 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 3 Adelie Torgersen 39.3 20.6 190 ... 2007
+ 4 Adelie Torgersen 38.9 17.8 181 ... 2007
: : : : : : ... :
- 202 Gentoo Biscoe 47.2 13.7 214 ... 2009
- 203 Gentoo Biscoe 46.8 14.3 215 ... 2009
- 204 Gentoo Biscoe 45.2 14.8 212 ... 2009
+ 201 Gentoo Biscoe 47.2 13.7 214 ... 2009
+ 202 Gentoo Biscoe 46.8 14.3 215 ... 2009
+ 203 Gentoo Biscoe 45.2 14.8 212 ... 2009
```
- Notice: nil option
- `Arrow::Table#slice` uses the `filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row.
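The effect of propagating nil can be sketched in plain Ruby. This is an illustration under assumptions, not Arrow's or RedAmber's code: `rows` and `mask` are hypothetical, and the two results only contrast an "emit null" style, where a nil in the mask yields a row of nils, with a "drop" style, where that row is simply omitted.

```ruby
# Plain-Ruby illustration only; no Arrow or RedAmber calls are used.
rows = [[1, "A"], [2, "B"], [3, "C"]]
mask = [true, nil, false]

# "Emit null" style: a nil in the mask produces a row of nils at that position.
emit_null = rows.zip(mask).each_with_object([]) do |(row, flag), out|
  if flag.nil?
    out << Array.new(row.size) # a row filled with nil
  elsif flag
    out << row
  end
end

# "Drop" style: rows with a nil or false mask entry are omitted.
dropped = rows.zip(mask).select { |_row, flag| flag }.map(&:first)

p emit_null
p dropped
```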
@@ -672,19 +681,19 @@
# =>
#<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
<string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 39.3 20.6 190 ... 2007
- 2 Adelie Torgersen 38.9 17.8 181 ... 2007
- 3 Adelie Torgersen 39.2 19.6 195 ... 2007
- 4 Adelie Torgersen 34.1 18.1 193 ... 2007
- 5 Adelie Torgersen 42.0 20.2 190 ... 2007
+ 0 Adelie Torgersen 39.3 20.6 190 ... 2007
+ 1 Adelie Torgersen 38.9 17.8 181 ... 2007
+ 2 Adelie Torgersen 39.2 19.6 195 ... 2007
+ 3 Adelie Torgersen 34.1 18.1 193 ... 2007
+ 4 Adelie Torgersen 42.0 20.2 190 ... 2007
: : : : : : ... :
- 332 Gentoo Biscoe 44.5 15.7 217 ... 2009
- 333 Gentoo Biscoe 48.8 16.2 222 ... 2009
- 334 Gentoo Biscoe 47.2 13.7 214 ... 2009
+ 331 Gentoo Biscoe 44.5 15.7 217 ... 2009
+ 332 Gentoo Biscoe 48.8 16.2 222 ... 2009
+ 333 Gentoo Biscoe 47.2 13.7 214 ... 2009
```
- Booleans as an argument
`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. The booleans must have the same length as `size`.
@@ -696,19 +705,19 @@
# =>
#<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
<string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 39.1 18.7 181 ... 2007
- 2 Adelie Torgersen 39.5 17.4 186 ... 2007
- 3 Adelie Torgersen 40.3 18.0 195 ... 2007
- 4 Adelie Torgersen 36.7 19.3 193 ... 2007
- 5 Adelie Torgersen 39.3 20.6 190 ... 2007
+ 0 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 1 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 2 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 3 Adelie Torgersen 36.7 19.3 193 ... 2007
+ 4 Adelie Torgersen 39.3 20.6 190 ... 2007
: : : : : : ... :
- 331 Gentoo Biscoe 50.4 15.7 222 ... 2009
- 332 Gentoo Biscoe 45.2 14.8 212 ... 2009
- 333 Gentoo Biscoe 49.9 16.1 213 ... 2009
+ 330 Gentoo Biscoe 50.4 15.7 222 ... 2009
+ 331 Gentoo Biscoe 45.2 14.8 212 ... 2009
+ 332 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Indices or booleans by a block
`remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
@@ -725,19 +734,19 @@
# =>
#<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000004de40>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
<string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen (nil) (nil) (nil) ... 2007
- 2 Adelie Torgersen 36.7 19.3 193 ... 2007
- 3 Adelie Torgersen 34.1 18.1 193 ... 2007
- 4 Adelie Torgersen 37.8 17.1 186 ... 2007
- 5 Adelie Torgersen 37.8 17.3 180 ... 2007
+ 0 Adelie Torgersen (nil) (nil) (nil) ... 2007
+ 1 Adelie Torgersen 36.7 19.3 193 ... 2007
+ 2 Adelie Torgersen 34.1 18.1 193 ... 2007
+ 3 Adelie Torgersen 37.8 17.1 186 ... 2007
+ 4 Adelie Torgersen 37.8 17.3 180 ... 2007
: : : : : : ... :
- 138 Gentoo Biscoe (nil) (nil) (nil) ... 2009
- 139 Gentoo Biscoe 50.4 15.7 222 ... 2009
- 140 Gentoo Biscoe 49.9 16.1 213 ... 2009
+ 137 Gentoo Biscoe (nil) (nil) (nil) ... 2009
+ 138 Gentoo Biscoe 50.4 15.7 222 ... 2009
+ 139 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Notice for nil
- When `remove` is used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
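A plain-Ruby sketch of this rule follows. It is an illustration only, with hypothetical `rows` and `mask`, and it does not call RedAmber. Because `!nil` is `true` in Ruby, a row whose mask entry is nil is kept, exactly like a row whose entry is `false`.

```ruby
# Plain-Ruby illustration only; RedAmber's own implementation may differ.
rows = %w[a b c d]
mask = [true, false, nil, true]

# Remove rows whose mask entry is true.
# nil is falsy, so !nil is true and that row is kept.
kept = rows.zip(mask).select { |_row, flag| !flag }.map(&:first)

p kept
```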
@@ -768,12 +777,12 @@
# =>
#<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000005df98>
a b c
<uint8> <string> <double>
- 1 1 A 1.0
- 2 (nil) C 3.0
+ 0 1 A 1.0
+ 1 (nil) C 3.0
```
### `rename`
Rename keys (column names) to create an updated DataFrame.
@@ -790,13 +799,13 @@
# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000060838>
name age_in_1993
<string> <uint8>
- 1 Yasuko 68
- 2 Rui 49
- 3 Hinata 28
+ 0 Yasuko 68
+ 1 Rui 49
+ 2 Hinata 28
```
- Key pairs by a block
`rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. The block is called in the context of self.
@@ -830,13 +839,13 @@
# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
name age
<string> <uint8>
- 1 Yasuko 68
- 2 Rui 49
- 3 Hinata 28
+ 0 Yasuko 68
+ 1 Rui 49
+ 2 Hinata 28
# update :age and add :brother
df.assign do
{
age: age + 29,
@@ -846,13 +855,13 @@
# =>
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
name age brother
<string> <uint8> <string>
- 1 Yasuko 97 Santa
- 2 Rui 78 (nil)
- 3 Hinata 57 Momotaro
+ 0 Yasuko 97 Santa
+ 1 Rui 78 (nil)
+ 2 Hinata 57 Momotaro
```
- Key pairs by a block
`assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`. The block is called in the context of self.
@@ -867,15 +876,15 @@
# =>
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
index float string
<uint8> <double> <string>
- 1 0 0.0 A
- 2 1 1.1 B
- 3 2 2.2 C
- 4 3 NaN D
- 5 (nil) (nil) (nil)
+ 0 0 0.0 A
+ 1 1 1.1 B
+ 2 2 2.2 C
+ 3 3 NaN D
+ 4 (nil) (nil) (nil)
# update :float
# assigner by an Array
df.assign do
vectors.select(&:float?)
@@ -884,15 +893,15 @@
# =>
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
index float string
<uint8> <double> <string>
- 1 0 -0.0 A
- 2 1 -1.1 B
- 3 2 -2.2 C
- 4 3 NaN D
- 5 (nil) (nil) (nil)
+ 0 0 -0.0 A
+ 1 1 -1.1 B
+ 2 2 -2.2 C
+ 3 3 NaN D
+ 4 (nil) (nil) (nil)
# Or we can use assigner by a Hash
df.assign do
vectors.select.with_object({}) do |v, assigner|
assigner[v.key] = -v if v.float?
@@ -919,15 +928,15 @@
# =>
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
new_index index float string
<uint8> <uint8> <double> <string>
- 1 1 0 0.0 A
- 2 2 1 1.1 B
- 3 3 2 2.2 C
- 4 4 3 NaN D
- 5 5 (nil) (nil) (nil)
+ 0 1 0 0.0 A
+ 1 2 1 1.1 B
+ 2 3 2 2.2 C
+ 3 4 3 NaN D
+ 4 5 (nil) (nil) (nil)
```
### `slice_by(key, keep_key: false) { block }`
`slice_by` accepts a key and a block to select rows.
@@ -944,24 +953,24 @@
# =>
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
index float string
<uint8> <double> <string>
- 1 0 0.0 A
- 2 1 1.1 B
- 3 2 2.2 C
- 4 3 NaN D
- 5 (nil) (nil) (nil)
+ 0 0 0.0 A
+ 1 1 1.1 B
+ 2 2 2.2 C
+ 3 3 NaN D
+ 4 (nil) (nil) (nil)
df.slice_by(:string) { ["A", "C"] }
# =>
#<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001b1ac>
index float
<uint8> <double>
- 1 0 0.0
- 2 2 2.2
+ 0 0 0.0
+ 1 2 2.2
```
It is the same behavior as:
```ruby
@@ -975,13 +984,13 @@
# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000069668>
index float
<uint8> <double>
- 1 0 0.0
- 2 1 1.1
- 3 2 2.2
+ 0 0 0.0
+ 1 1 1.1
+ 2 2 2.2
```
When the option `keep_key: true` is used, the column `key` will be preserved.
```ruby
@@ -989,13 +998,13 @@
# =>
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000073c44>
index float string
<uint8> <double> <string>
- 1 0 0.0 A
- 2 1 1.1 B
- 3 2 2.2 C
+ 0 0 0.0 A
+ 1 1 1.1 B
+ 2 2 2.2 C
```
## Updating
### `sort`
@@ -1014,15 +1023,15 @@
# =>
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000009b03c>
index string bool
<uint8> <string> <boolean>
- 1 0 (nil) false
- 2 0 B false
- 3 1 B true
- 4 1 C (nil)
- 5 (nil) A true
+ 0 0 (nil) false
+ 1 0 B false
+ 2 1 B true
+ 3 1 C (nil)
+ 4 (nil) A true
```
- [ ] Clamp
- [ ] Clear data
@@ -1035,11 +1044,11 @@
## Grouping
### `group(group_keys)`
- `group` creates a class `Group` object. `Group` accepts functions below as a method.
+ `group` creates an instance of class `Group`. `Group` accepts functions below as a method.
Method accepts options as `group_keys`.
Available functions are:
- [ ] all
@@ -1062,110 +1071,112 @@
Summary key names are provided by `function(summary_keys)` style.
This is an example of grouping the famous Star Wars dataset.
```ruby
- starwars =
- RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
- starwars
+ uri = URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv")
+ starwars = RedAmber::DataFrame.load(uri)
# =>
#<RedAmber::DataFrame : 87 x 12 Vectors, 0x0000000000005a50>
unnamed1 name height mass hair_color skin_color eye_color ... species
<int64> <string> <int64> <double> <string> <string> <string> ... <string>
- 1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
- 2 2 C-3PO 167 75.0 NA gold yellow ... Droid
- 3 3 R2-D2 96 32.0 NA white, blue red ... Droid
- 4 4 Darth Vader 202 136.0 none white yellow ... Human
- 5 5 Leia Organa 150 49.0 brown light brown ... Human
+ 0 1 Luke Skywalker 172 77.0 blond fair blue ... Human
+ 1 2 C-3PO 167 75.0 NA gold yellow ... Droid
+ 2 3 R2-D2 96 32.0 NA white, blue red ... Droid
+ 3 4 Darth Vader 202 136.0 none white yellow ... Human
+ 4 5 Leia Organa 150 49.0 brown light brown ... Human
: : : : : : : : ... :
- 85 85 BB8 (nil) (nil) none none black ... Droid
- 86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
- 87 87 Padmé Amidala 165 45.0 brown light brown ... Human
+ 84 85 BB8 (nil) (nil) none none black ... Droid
+ 85 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
+ 86 87 Padmé Amidala 165 45.0 brown light brown ... Human
starwars.tdr(12)
# =>
RedAmber::DataFrame : 87 x 12 Vectors
Vectors : 4 numeric, 8 strings
# key type level data_preview
- 1 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ]
- 2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
- 3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
- 4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
- 5 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ]
- 6 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
- 7 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
- 8 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
- 9 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, "NA"=>4}
- 10 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4}
- 11 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ]
- 12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
+ 0 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ]
+ 1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
+ 2 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
+ 3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
+ 4 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ]
+ 5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
+ 6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ]
+ 7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
+ 8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, "NA"=>4}
+ 9 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4}
+ 10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ]
+ 11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
```
We can group by `:species` and calculate the count.
```ruby
- starwars.group(:species).count(:species)
+ starwars.remove { species == "NA" }
+ .group(:species).count(:species)
# =>
- #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0>
+ #<RedAmber::DataFrame : 37 x 2 Vectors, 0x000000000000ffa0>
species count
<string> <int64>
- 1 Human 35
- 2 Droid 6
- 3 Wookiee 2
- 4 Rodian 1
- 5 Hutt 1
+ 0 Human 35
+ 1 Droid 6
+ 2 Wookiee 2
+ 3 Rodian 1
+ 4 Hutt 1
: : :
- 36 Kaleesh 1
- 37 Pau'an 1
- 38 Kel Dor 1
+ 34 Kaleesh 1
+ 35 Pau'an 1
+ 36 Kel Dor 1
```
We can also calculate the mean of `:mass` and `:height` together.
```ruby
- grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
+ grouped = starwars.remove { species == "NA" }
+ .group(:species) { [count(:species), mean(:height, :mass)] }
# =>
- #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc>
- specie s count mean(height) mean(mass)
- <strin g> <int64> <double> <double>
- 1 Human 35 176.6 82.8
- 2 Droid 6 131.2 69.8
- 3 Wookie e 2 231.0 124.0
- 4 Rodian 1 173.0 74.0
- 5 Hutt 1 175.0 1358.0
- : : : : :
- 36 Kalees h 1 216.0 159.0
- 37 Pau'an 1 206.0 80.0
- 38 Kel Dor 1 188.0 80.0
+ #<RedAmber::DataFrame : 37 x 4 Vectors, 0x000000000000fff0>
+ species count mean(height) mean(mass)
+ <string> <int64> <double> <double>
+ 0 Human 35 176.65 82.78
+ 1 Droid 6 131.2 69.75
+ 2 Wookiee 2 231.0 124.0
+ 3 Rodian 1 173.0 74.0
+ 4 Hutt 1 175.0 1358.0
+ : : : : :
+ 34 Kaleesh 1 216.0 159.0
+ 35 Pau'an 1 206.0 80.0
+ 36 Kel Dor 1 188.0 80.0
```
Select rows for count > 1.
```ruby
grouped.slice(grouped[:count] > 1)
# =>
- #<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000004c270>
+ #<RedAmber::DataFrame : 8 x 4 Vectors, 0x000000000001002c>
species count mean(height) mean(mass)
<string> <int64> <double> <double>
- 1 Human 35 176.6 82.8
- 2 Droid 6 131.2 69.8
- 3 Wookiee 2 231.0 124.0
- 4 Gungan 3 208.7 74.0
- 5 NA 4 181.3 48.0
- : : : : :
- 7 Twi'lek 2 179.0 55.0
- 8 Mirialan 2 168.0 53.1
- 9 Kaminoan 2 221.0 88.0
+ 0 Human 35 176.65 82.78
+ 1 Droid 6 131.2 69.75
+ 2 Wookiee 2 231.0 124.0
+ 3 Gungan 3 208.67 74.0
+ 4 Zabrak 2 173.0 80.0
+ 5 Twi'lek 2 179.0 55.0
+ 6 Mirialan 2 168.0 53.1
+ 7 Kaminoan 2 221.0 88.0
```
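The remove, group, summarize and filter steps above can be pictured in plain Ruby. This is a sketch under assumptions, not RedAmber's implementation: `rows` holds only the six characters visible in the preview above, so its counts and means differ from the full dataset, and skipping nil heights when averaging is an assumption of this sketch.

```ruby
# Plain-Ruby illustration only; a small subset of the rows shown above.
rows = [
  { name: "Luke Skywalker", species: "Human", height: 172 },
  { name: "C-3PO",          species: "Droid", height: 167 },
  { name: "R2-D2",          species: "Droid", height: 96 },
  { name: "Darth Vader",    species: "Human", height: 202 },
  { name: "Leia Organa",    species: "Human", height: 150 },
  { name: "Captain Phasma", species: "NA",    height: nil }
]

# Drop the "NA" species, group by species, then count and average per group.
summary = rows
  .reject { |row| row[:species] == "NA" }
  .group_by { |row| row[:species] }
  .map do |species, members|
    heights = members.map { |member| member[:height] }.compact # skip nils
    mean = heights.empty? ? nil : heights.sum.fdiv(heights.size)
    { species: species, count: members.size, mean_height: mean }
  end

# Keep only the groups that have more than one member.
frequent = summary.select { |row| row[:count] > 1 }

p frequent
```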
## Reshape
+![dataframe reshaping image](doc/../image/reshaping_dataframe.png)
+
### `transpose`
Creates a transposed DataFrame from a wide (messy) DataFrame.
```ruby
@@ -1173,30 +1184,31 @@
# =>
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520>
Year Audi BMW BMW_MINI Mercedes-Benz VW
<int64> <int64> <int64> <int64> <int64> <int64>
- 1 2017 28336 52527 25427 68221 49040
- 2 2018 26473 50982 25984 67554 51961
- 3 2019 24222 46814 23813 66553 46794
- 4 2020 22304 35712 20196 57041 36576
- 5 2021 22535 35905 18211 51722 35215
- import_cars.transpose(:Manufacturer)
+ 0 2017 28336 52527 25427 68221 49040
+ 1 2018 26473 50982 25984 67554 51961
+ 2 2019 24222 46814 23813 66553 46794
+ 3 2020 22304 35712 20196 57041 36576
+ 4 2021 22535 35905 18211 51722 35215
+ import_cars.transpose(name: :Manufacturer)
+
# =>
- #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74>
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x0000000000010a2c>
Manufacturer 2017 2018 2019 2020 2021
- <dictionary> <uint32> <uint32> <uint32> <uint16> <uint16>
- 1 Audi 28336 26473 24222 22304 22535
- 2 BMW 52527 50982 46814 35712 35905
- 3 BMW_MINI 25427 25984 23813 20196 18211
- 4 Mercedes-Benz 68221 67554 66553 57041 51722
- 5 VW 49040 51961 46794 36576 35215
+ <string> <uint32> <uint32> <uint32> <uint16> <uint16>
+ 0 Audi 28336 26473 24222 22304 22535
+ 1 BMW 52527 50982 46814 35712 35905
+ 2 BMW_MINI 25427 25984 23813 20196 18211
+ 3 Mercedes-Benz 68221 67554 66553 57041 51722
+ 4 VW 49040 51961 46794 36576 35215
```
The leftmost column is created by original keys. Key name of the column is
- named by parameter `:name`. If `:name` is not specified, `:N` is used for the key.
+ named by parameter `:name`. If `:name` is not specified, `:NAME` is used for the key.
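The reshaping done by `transpose` can be sketched in plain Ruby on a Hash of columns. `transpose_sketch` is a hypothetical helper written for this illustration, not part of RedAmber, and the data is a two-year, two-maker subset of the table above. It assumes the leftmost column supplies the new keys and that the original keys become the values of the `name` column.

```ruby
# Plain-Ruby illustration only; `transpose_sketch` is a hypothetical helper.
wide = {
  Year: [2017, 2018],
  Audi: [28336, 26473],
  BMW: [52527, 50982]
}

def transpose_sketch(table, name: :NAME)
  key, *rest = table.keys              # the leftmost column supplies the new keys
  result = { name => rest.map(&:to_s) } # original keys become a column
  table[key].each_with_index do |new_key, i|
    result[new_key.to_s.to_sym] = rest.map { |k| table[k][i] }
  end
  result
end

transposed = transpose_sketch(wide, name: :Manufacturer)
p transposed
```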
### `to_long(*keep_keys)`
Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
@@ -1204,67 +1216,69 @@
```ruby
import_cars.to_long(:Year)
# =>
- #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
- Year N V
- <uint16> <dictionary> <uint32>
- 1 2017 Audi 28336
- 2 2017 BMW 52527
- 3 2017 BMW_MINI 25427
- 4 2017 Mercedes-Benz 68221
- 5 2017 VW 49040
+ #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000011864>
+ Year NAME VALUE
+ <uint16> <string> <uint32>
+ 0 2017 Audi 28336
+ 1 2017 BMW 52527
+ 2 2017 BMW_MINI 25427
+ 3 2017 Mercedes-Benz 68221
+ 4 2017 VW 49040
: : : :
- 23 2021 BMW_MINI 18211
- 24 2021 Mercedes-Benz 51722
- 25 2021 VW 35215
+ 22 2021 BMW_MINI 18211
+ 23 2021 Mercedes-Benz 51722
+ 24 2021 VW 35215
```
- Option `:name` is the key of the column which came **from key names**.
+ The default value is `:NAME` if it is not specified.
- Option `:value` is the key of the column which came **from values**.
+ The default value is `:VALUE` if it is not specified.
```ruby
import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
# =>
- #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
+ #<RedAmber::DataFrame : 25 x 3 Vectors, 0x000000000001359c>
Year Manufacturer Num_of_imported
- <uint16> <dictionary> <uint32>
- 1 2017 Audi 28336
- 2 2017 BMW 52527
- 3 2017 BMW_MINI 25427
- 4 2017 Mercedes-Benz 68221
- 5 2017 VW 49040
+ <uint16> <string> <uint32>
+ 0 2017 Audi 28336
+ 1 2017 BMW 52527
+ 2 2017 BMW_MINI 25427
+ 3 2017 Mercedes-Benz 68221
+ 4 2017 VW 49040
: : : :
- 23 2021 BMW_MINI 18211
- 24 2021 Mercedes-Benz 51722
- 25 2021 VW 35215
+ 22 2021 BMW_MINI 18211
+ 23 2021 Mercedes-Benz 51722
+ 24 2021 VW 35215
```
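The wide-to-long reshaping can be sketched in plain Ruby. `to_long_sketch` is a hypothetical helper for this illustration and not RedAmber's implementation; it handles a single kept key and uses a small subset of the data above. Every key other than the kept one is spread into a name column and a value column, one output row per original cell.

```ruby
# Plain-Ruby illustration only; `to_long_sketch` is a hypothetical helper.
wide = {
  Year: [2017, 2018],
  Audi: [28336, 26473],
  BMW: [52527, 50982]
}

def to_long_sketch(table, keep_key, name: :NAME, value: :VALUE)
  long = { keep_key => [], name => [], value => [] }
  other_keys = table.keys - [keep_key]
  table[keep_key].each_with_index do |kept, i|
    other_keys.each do |k|
      long[keep_key] << kept     # repeat the kept value for each former column
      long[name] << k.to_s       # former key name
      long[value] << table[k][i] # former cell value
    end
  end
  long
end

long = to_long_sketch(wide, :Year)
p long
```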
### `to_wide`
Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
- Option `:name` is the key of the column which will be expanded **to key names**.
+ The default value is `:NAME` if it is not specified.
- Option `:value` is the key of the column which will be expanded **to values**.
+ The default value is `:VALUE` if it is not specified.
```ruby
import_cars.to_long(:Year).to_wide
# import_cars.to_long(:Year).to_wide(name: :NAME, value: :VALUE)
# is also OK
# =>
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
Year Audi BMW BMW_MINI Mercedes-Benz VW
<uint16> <uint16> <uint16> <uint16> <uint32> <uint16>
- 1 2017 28336 52527 25427 68221 49040
- 2 2018 26473 50982 25984 67554 51961
- 3 2019 24222 46814 23813 66553 46794
- 4 2020 22304 35712 20196 57041 36576
- 5 2021 22535 35905 18211 51722 35215
-
- # == import_cars
+ 0 2017 28336 52527 25427 68221 49040
+ 1 2018 26473 50982 25984 67554 51961
+ 2 2019 24222 46814 23813 66553 46794
+ 3 2020 22304 35712 20196 57041 36576
+ 4 2021 22535 35905 18211 51722 35215
```
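The reverse, long-to-wide reshaping can also be sketched in plain Ruby. `to_wide_sketch` is a hypothetical helper and not RedAmber's implementation; in particular, passing the kept key explicitly is an assumption made to keep the sketch short. Each distinct entry of the name column becomes a new key, and the values are placed in the row of their kept value.

```ruby
# Plain-Ruby illustration only; `to_wide_sketch` is a hypothetical helper.
long = {
  Year: [2017, 2017, 2018, 2018],
  NAME: %w[Audi BMW Audi BMW],
  VALUE: [28336, 52527, 26473, 50982]
}

def to_wide_sketch(table, keep_key, name: :NAME, value: :VALUE)
  kept_values = table[keep_key].uniq
  wide = { keep_key => kept_values }
  table[keep_key].zip(table[name], table[value]).each do |kept, n, v|
    # Create the column on first sight; missing combinations stay nil.
    column = (wide[n.to_sym] ||= Array.new(kept_values.size))
    column[kept_values.index(kept)] = v
  end
  wide
end

wide = to_wide_sketch(long, :Year)
p wide
```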
## Combine
- [ ] Combining dataframes