doc/DataFrame.md in red_amber-0.1.8 vs doc/DataFrame.md in red_amber-0.2.0
- old
+ new
@@ -165,10 +165,15 @@
- Returns an array of row-oriented data without header.
If you need a column-oriented full array, use `.to_h.to_a`
+### `each_row`
+
+ Yields each row as a `{ key => row }` Hash.
+ Returns an Enumerator if no block is given.
+
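The row-wise iteration described for `each_row` can be pictured in plain Ruby, without RedAmber. This is only an illustrative sketch under stated assumptions, not the library's implementation: `RowSketch` and its `each_row` are hypothetical names, and the column data is made up.

```ruby
# Illustrative sketch only: RowSketch is a hypothetical module, not part of RedAmber.
module RowSketch
  # Column-oriented data, keyed by column name (made-up values).
  COLUMNS = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }.freeze

  # Yields one { key => value } Hash per row.
  # Returns an Enumerator when no block is given.
  def self.each_row(columns)
    return enum_for(:each_row, columns) unless block_given?

    n_rows = columns.values.first.to_a.size
    n_rows.times do |i|
      # Pick the i-th element of every column to build the row Hash.
      yield columns.transform_values { |values| values[i] }
    end
  end
end

# With a block: visit each row.
RowSketch.each_row(RowSketch::COLUMNS) { |row| puts "#{row[:name]} is #{row[:age]}" }

# Without a block: an Enumerator that can be chained.
ages = RowSketch.each_row(RowSketch::COLUMNS).map { |row| row[:age] }
p ages # [68, 49, 28]
```

Returning an Enumerator when no block is given lets the caller chain `map`, `select` and similar methods over the rows.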
### `schema`
- Returns column name and data type in a Hash.
### `==`
@@ -200,12 +205,27 @@
### `inspect`
`inspect` uses `to_s` output and also shows shape and object_id.
-### `summary`, `describe` (not implemented)
+### `summary`, `describe`
+`DataFrame#summary` or `DataFrame#describe` shows summary statistics in a DataFrame.
+
+```ruby
+puts penguins.summary.to_s(width: 82) # needs more width to show all stats in this example
+
+# =>
+ variables count mean std min 25% median 75% max
+ <dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double>
+1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6
+2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5
+3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0
+4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0
+5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0
+```
+
### `to_rover`
- Returns a `Rover::DataFrame`.
```ruby
@@ -702,11 +722,11 @@
![rename method image](doc/../image/dataframe/rename.png)
- Key pairs as arguments
- `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`.
+ `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`.
```ruby
df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] )
df.rename(:age => :age_in_1993)
@@ -719,28 +739,32 @@
3 Hinata 28
```
- Key pairs by a block
- `rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. Block is called in the context of self.
+ `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. The block is called in the context of self.
+- Not existing keys
+
+ If the specified `existing_key` does not exist, a `DataFrameArgumentError` is raised.
+
- Key type
Symbol key and String key are distinguished.
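The rules above for `rename` (Hash or Array-of-Arrays key pairs, an error for missing keys, Symbol and String keys treated as different) can be sketched in plain Ruby. This is a hedged illustration, not RedAmber's code: `RenameSketch` is a hypothetical module, and `ArgumentError` stands in for `DataFrameArgumentError`.

```ruby
# Illustrative sketch only: RenameSketch is a hypothetical module, not part of RedAmber.
module RenameSketch
  # Renames keys of a column Hash.
  # key_pairs may be a Hash of { existing_key => new_key }
  # or an Array of Arrays like [[existing_key, new_key], ...].
  def self.rename(columns, key_pairs)
    pairs = key_pairs.to_h # Array of pairs and Hash both become a Hash
    missing = pairs.keys - columns.keys
    # ArgumentError is a stand-in for the library's own error class.
    raise ArgumentError, "key not found: #{missing.inspect}" unless missing.empty?

    columns.transform_keys { |key| pairs.fetch(key, key) }
  end
end

columns = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }

p RenameSketch.rename(columns, { age: :age_in_1993 }).keys # [:name, :age_in_1993]
p RenameSketch.rename(columns, [%i[age age_in_1993]]).keys # [:name, :age_in_1993]

begin
  # 'age' (String) is not the same key as :age (Symbol), so this fails.
  RenameSketch.rename(columns, { 'age' => :age_in_1993 })
rescue ArgumentError => e
  puts "error: #{e.message}"
end
```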
### `assign`
- Assign new or updated variables (columns) and create a updated DataFrame.
+ Assign new or updated columns (variables) and create an updated DataFrame.
- - Variables with new keys will append new variables at bottom (right in the table).
+ - Variables with new keys are appended as new columns on the right.
- Variables with existing keys will update corresponding vectors.
![assign method image](doc/../image/dataframe/assign.png)
- Variables as arguments
- `assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`.
+ `assign(key_pairs)` accepts pairs of key and values as parameters. `key_pairs` should be a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either a `Vector`, an `Array` or an `Arrow::Array`.
```ruby
df = RedAmber::DataFrame.new(
name: %w[Yasuko Rui Hinata],
age: [68, 49, 28])
@@ -767,11 +791,11 @@
3 Hinata 57 Momotaro
```
- Key pairs by a block
- `assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. Block is called in the context of self.
+ `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either a `Vector`, an `Array` or an `Arrow::Array`. The block is called in the context of self.
```ruby
df = RedAmber::DataFrame.new(
index: [0, 1, 2, 3, nil],
float: [0.0, 1.1, 2.2, Float::NAN, nil],
@@ -786,43 +810,63 @@
2 1 1.1 B
3 2 2.2 C
4 3 NaN D
5 (nil) (nil) (nil)
- # update numeric variables
+ # update :float
+ # assigner by an Array
df.assign do
- assigner = {}
- vectors.each_with_index do |v, i|
- assigner[keys[i]] = v * -1 if v.numeric?
- end
- assigner
+ vectors.select(&:float?)
+ .map { |v| [v.key, -v] }
end
# =>
- #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000006e000>
- index float string
- <int8> <double> <string>
- 1 0 -0.0 A
- 2 -1 -1.1 B
- 3 -2 -2.2 C
- 4 -3 NaN D
- 5 (nil) (nil) (nil)
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
+ index float string
+ <uint8> <double> <string>
+ 1 0 -0.0 A
+ 2 1 -1.1 B
+ 3 2 -2.2 C
+ 4 3 NaN D
+ 5 (nil) (nil) (nil)
- # Or it's shorter like this:
+ # Or we can use assigner by a Hash
df.assign do
- variables.select.with_object({}) do |(key, vector), assigner|
- assigner[key] = vector * -1 if vector.numeric?
+ vectors.select.with_object({}) do |v, assigner|
+ assigner[v.key] = -v if v.float?
end
end
# => same as above
```
- Key type
Symbol key and String key are considered as the same key.
+- Empty assignment
+
+ If assigner is empty or nil, returns self.
+
+- Append from left
+
+ The `assign_left` method accepts the same parameters and block as `assign`, but appends new columns from the left.
+
+ ```ruby
+ df.assign_left(new_index: [1, 2, 3, 4, 5])
+
+ # =>
+ #<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
+ new_index index float string
+ <uint8> <uint8> <double> <string>
+ 1 1 0 0.0 A
+ 2 2 1 1.1 B
+ 3 3 2 2.2 C
+ 4 4 3 NaN D
+ 5 5 (nil) (nil) (nil)
+ ```
+
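The column ordering of `assign` and `assign_left` can be sketched with plain Ruby Hashes, which preserve insertion order. This is an illustration under assumptions, not RedAmber's implementation: `AssignSketch` is a hypothetical module, columns are assumed to be keyed by Symbol, and the sketch assumes that `assign_left` updates existing keys in place the same way `assign` does.

```ruby
# Illustrative sketch only: AssignSketch is a hypothetical module, not part of RedAmber.
module AssignSketch
  # Accepts a Hash, an Array of [key, array_like] pairs, or nil.
  # String keys are treated as the same key as Symbol keys.
  def self.normalize(assigner)
    assigner.to_h.transform_keys(&:to_sym) # nil.to_h is {}
  end

  # New keys are appended on the right; existing keys are updated in place.
  def self.assign(columns, assigner)
    pairs = normalize(assigner)
    return columns if pairs.empty? # empty assignment returns the receiver

    columns.merge(pairs)
  end

  # New keys are placed on the left; existing keys are updated in place
  # (assumption of this sketch).
  def self.assign_left(columns, assigner)
    pairs = normalize(assigner)
    return columns if pairs.empty?

    new_pairs = pairs.reject { |key, _| columns.key?(key) }
    new_pairs.merge(columns).merge(pairs)
  end
end

columns = { index: [0, 1, 2], float: [0.0, 1.1, 2.2] }

p AssignSketch.assign(columns, 'float' => [-0.0, -1.1, -2.2], new: [1, 2, 3]).keys
# [:index, :float, :new]
p AssignSketch.assign_left(columns, [[:new_index, [1, 2, 3]]]).keys
# [:new_index, :index, :float]
```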
## Updating
### `sort`
`sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature.
@@ -931,41 +975,41 @@
```ruby
starwars.group(:species).count(:species)
# =>
- #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0>
- species count
- <string> <int64>
- 1 Human 35
- 2 Droid 6
- 3 Wookiee 2
- 4 Rodian 1
- 5 Hutt 1
- : : :
- 36 Kaleesh 1
- 37 Pau'an 1
+ #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0>
+ species count
+ <string> <int64>
+ 1 Human 35
+ 2 Droid 6
+ 3 Wookiee 2
+ 4 Rodian 1
+ 5 Hutt 1
+ : : :
+ 36 Kaleesh 1
+ 37 Pau'an 1
38 Kel Dor 1
```
We can also calculate the mean of `:mass` and `:height` together.
```ruby
grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
# =>
- #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc>
- species count mean(height) mean(mass)
- <string> <int64> <double> <double>
- 1 Human 35 176.6 82.8
- 2 Droid 6 131.2 69.8
- 3 Wookiee 2 231.0 124.0
- 4 Rodian 1 173.0 74.0
- 5 Hutt 1 175.0 1358.0
- : : : : :
- 36 Kaleesh 1 216.0 159.0
- 37 Pau'an 1 206.0 80.0
+ #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc>
+ species count mean(height) mean(mass)
+ <string> <int64> <double> <double>
+ 1 Human 35 176.6 82.8
+ 2 Droid 6 131.2 69.8
+ 3 Wookiee 2 231.0 124.0
+ 4 Rodian 1 173.0 74.0
+ 5 Hutt 1 175.0 1358.0
+ : : : : :
+ 36 Kaleesh 1 216.0 159.0
+ 37 Pau'an 1 206.0 80.0
38 Kel Dor 1 188.0 80.0
```
Select rows for count > 1.
@@ -985,20 +1029,117 @@
7 Twi'lek 2 179.0 55.0
8 Mirialan 2 168.0 53.1
9 Kaminoan 2 221.0 88.0
```
-## Combining DataFrames
+## Reshape
-- [ ] Combining rows to a dataframe
+### `transpose`
-- [ ] Inner join
+ Creates a transposed DataFrame from a wide-type DataFrame.
-- [ ] Left join
+ ```ruby
+ import_cars = RedAmber::DataFrame.load('test/entity/import_cars.tsv')
-## Encoding
+ # =>
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520>
+ Year Audi BMW BMW_MINI Mercedes-Benz VW
+ <int64> <int64> <int64> <int64> <int64> <int64>
+ 1 2021 22535 35905 18211 51722 35215
+ 2 2020 22304 35712 20196 57041 36576
+ 3 2019 24222 46814 23813 66553 46794
+ 4 2018 26473 50982 25984 67554 51961
+ 5 2017 28336 52527 25427 68221 49040
-- [ ] One-hot encoding
+ import_cars.transpose
-## Iteration
+ # =>
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74>
+ name 2021 2020 2019 2018 2017
+ <dictionary> <uint16> <uint16> <uint32> <uint32> <uint32>
+ 1 Audi 22535 22304 24222 26473 28336
+ 2 BMW 35905 35712 46814 50982 52527
+ 3 BMW_MINI 18211 20196 23813 25984 25427
+ 4 Mercedes-Benz 51722 57041 66553 67554 68221
+ 5 VW 35215 36576 46794 51961 49040
+ ```
+
+ The leftmost column is created from the original keys. The key of this
+ column is 'name'.
-- [ ] each_rows
+### `to_long(*keep_keys)`
+
+ Creates a 'long' DataFrame.
+
+ - Parameter `keep_keys` specifies the key names to keep.
+
+ ```ruby
+ import_cars.to_long(:Year)
+
+ # =>
+ #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
+ Year name value
+ <uint16> <dictionary> <uint32>
+ 1 2021 Audi 22535
+ 2 2021 BMW 35905
+ 3 2021 BMW_MINI 18211
+ 4 2021 Mercedes-Benz 51722
+ 5 2021 VW 35215
+ : : : :
+ 23 2017 BMW_MINI 25427
+ 24 2017 Mercedes-Benz 68221
+ 25 2017 VW 49040
+ ```
+
+ - Option `:name` : key of the column which comes **from key names**.
+ - Option `:value` : key of the column which comes **from values**.
+
+ ```ruby
+ import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
+
+ # =>
+ #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
+ Year Manufacturer Num_of_imported
+ <uint16> <dictionary> <uint32>
+ 1 2021 Audi 22535
+ 2 2021 BMW 35905
+ 3 2021 BMW_MINI 18211
+ 4 2021 Mercedes-Benz 51722
+ 5 2021 VW 35215
+ : : : :
+ 23 2017 BMW_MINI 25427
+ 24 2017 Mercedes-Benz 68221
+ 25 2017 VW 49040
+ ```
+
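The wide-to-long reshaping performed by `to_long` can be sketched on a plain column Hash. This is a hedged, plain-Ruby illustration and not RedAmber's implementation: `ToLongSketch` is a hypothetical module, and `WIDE` is a two-row, two-manufacturer subset of the `import_cars` values shown above.

```ruby
# Illustrative sketch only: ToLongSketch is a hypothetical module, not part of RedAmber.
module ToLongSketch
  # Small wide table: one column per manufacturer.
  WIDE = {
    Year: [2021, 2020],
    Audi: [22_535, 22_304],
    BMW: [35_905, 35_712]
  }.freeze

  # keep_keys stay as columns; every other column is melted into
  # a (name, value) pair of columns.
  def self.to_long(columns, *keep_keys, name: :name, value: :value)
    melt_keys = columns.keys - keep_keys
    n_rows = columns.values.first.size
    long = (keep_keys + [name, value]).to_h { |key| [key, []] }

    n_rows.times do |i|
      melt_keys.each do |melt_key|
        keep_keys.each { |key| long[key] << columns[key][i] }
        long[name] << melt_key               # comes from the key names
        long[value] << columns[melt_key][i]  # comes from the values
      end
    end
    long
  end
end

long = ToLongSketch.to_long(ToLongSketch::WIDE, :Year)
p long[:Year]  # [2021, 2021, 2020, 2020]
p long[:name]  # [:Audi, :BMW, :Audi, :BMW]
p long[:value] # [22535, 35905, 22304, 35712]

renamed = ToLongSketch.to_long(ToLongSketch::WIDE, :Year,
                               name: :Manufacturer, value: :Num_of_imported)
p renamed.keys # [:Year, :Manufacturer, :Num_of_imported]
```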
+### `to_wide`
+
+ Creates a 'wide' DataFrame.
+
+ - Option `:name` : key of the column which will be expanded **to key name**.
+ - Option `:value` : key of the column which will be expanded **to values**.
+
+ ```ruby
+ import_cars.to_long(:Year).to_wide
+ # import_cars.to_long(:Year).to_wide(name: :name, value: :value)
+ # is also OK
+
+ # =>
+ #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
+ Year Audi BMW BMW_MINI Mercedes-Benz VW
+ <uint16> <uint16> <uint16> <uint16> <uint32> <uint16>
+ 1 2021 22535 35905 18211 51722 35215
+ 2 2020 22304 35712 20196 57041 36576
+ 3 2019 24222 46814 23813 66553 46794
+ 4 2018 26473 50982 25984 67554 51961
+ 5 2017 28336 52527 25427 68221 49040
+ ```
+
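The long-to-wide expansion performed by `to_wide` is the inverse reshaping. The plain-Ruby sketch below is an illustration under assumptions, not RedAmber's implementation: `ToWideSketch` is a hypothetical module, `LONG` is a small subset of the values shown above, and filling missing cells with `nil` is a choice of this sketch.

```ruby
# Illustrative sketch only: ToWideSketch is a hypothetical module, not part of RedAmber.
module ToWideSketch
  # Small long table: one row per (Year, manufacturer) pair.
  LONG = {
    Year: [2021, 2021, 2020, 2020],
    name: %i[Audi BMW Audi BMW],
    value: [22_535, 35_905, 22_304, 35_712]
  }.freeze

  # The column given by `name` is expanded into new keys and the
  # column given by `value` supplies the cells under those keys.
  def self.to_wide(columns, name: :name, value: :value)
    keep_keys = columns.keys - [name, value]
    new_keys = columns[name].uniq

    # Group cells by the values of the kept columns.
    rows = {}
    columns[name].each_index do |i|
      id = keep_keys.map { |key| columns[key][i] }
      (rows[id] ||= {})[columns[name][i]] = columns[value][i]
    end

    wide = (keep_keys + new_keys).to_h { |key| [key, []] }
    rows.each do |id, cells|
      keep_keys.each_with_index { |key, j| wide[key] << id[j] }
      new_keys.each { |key| wide[key] << cells[key] } # nil when a cell is missing
    end
    wide
  end
end

wide = ToWideSketch.to_wide(ToWideSketch::LONG)
p wide.keys   # [:Year, :Audi, :BMW]
p wide[:Year] # [2021, 2020]
p wide[:Audi] # [22535, 22304]
p wide[:BMW]  # [35905, 35712]
```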
+## Combine
+
+- [ ] Combining dataframes
+
+- [ ] Join
+
+## Encoding
+
+- [ ] One-hot encoding