doc/DataFrame.md in red_amber-0.2.0 vs doc/DataFrame.md in red_amber-0.2.1
- old
+ new
@@ -153,12 +153,30 @@
- Returns an Array of Vectors.
### `indices`, `indexes`
-- Returns all indexes in an Array.
+- Returns indexes in an Array.
+ Accepts an optional `start` argument as the first index.
+ ```ruby
+ df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
+ df.indices
+
+ # =>
+ [0, 1, 2, 3, 4]
+
+ df.indices(1)
+
+ # =>
+ [1, 2, 3, 4, 5]
+
+ df.indices(:a)
+ # =>
+ [:a, :b, :c, :d, :e]
+ ```
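
The `start` option appears to produce consecutive values beginning at the given first index. A minimal plain-Ruby sketch of that idea, assuming each successor is obtained with `succ`; the helper name `indices_from` is hypothetical and is not part of RedAmber:

```ruby
# Hypothetical helper, not RedAmber's implementation:
# build `size` consecutive indexes beginning at `start` by calling `succ`.
def indices_from(size, start = 0)
  current = start
  Array.new(size) do
    value = current
    current = current.succ
    value
  end
end

from_zero   = indices_from(5)      # => [0, 1, 2, 3, 4]
from_one    = indices_from(5, 1)   # => [1, 2, 3, 4, 5]
from_symbol = indices_from(5, :a)  # => [:a, :b, :c, :d, :e]
```

Any object that responds to `succ` (Integer, String, Symbol) works as `start` in this sketch.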
+
### `to_h`
- Returns column-oriented data in a Hash.
### `to_a`, `raw_records`
@@ -370,17 +388,17 @@
## Sub DataFrame manipulations
### `pick ` - pick up variables by key label -
- Pick up some variables (columns) to create a sub DataFrame.
+ Pick up some columns (variables) to create a sub DataFrame.
![pick method image](doc/../image/dataframe/pick.png)
- Keys as arguments
- `pick(keys)` accepts keys as arguments in an Array.
+ `pick(keys)` accepts keys as arguments in an Array or a Range.
```ruby
penguins.pick(:species, :bill_length_mm)
# =>
@@ -396,15 +414,37 @@
342 Gentoo 50.4
343 Gentoo 45.2
344 Gentoo 49.9
```
-- Booleans as a argument
+- Indices as arguments
- `pick(booleans)` accepts booleans as a argument in an Array. Booleans must be same length as `n_keys`.
+ `pick(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
```ruby
+ penguins.pick(0..2, -1)
+
+ # =>
+ #<RedAmber::DataFrame : 344 x 4 Vectors, 0x0000000000055ce4>
+ species island bill_length_mm year
+ <string> <string> <double> <uint16>
+ 1 Adelie Torgersen 39.1 2007
+ 2 Adelie Torgersen 39.5 2007
+ 3 Adelie Torgersen 40.3 2007
+ 4 Adelie Torgersen (nil) 2007
+ 5 Adelie Torgersen 36.7 2007
+ : : : : :
+ 342 Gentoo Biscoe 50.4 2009
+ 343 Gentoo Biscoe 45.2 2009
+ 344 Gentoo Biscoe 49.9 2009
+ ```
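
To see why `pick(0..2, -1)` yields those four columns, the index resolution can be sketched in plain Ruby. This is only an illustration under the assumption that indices map onto the key list the way they do for an Array; `keys_at` is a hypothetical helper, and Float indices are left out because their handling is not described here:

```ruby
# Hypothetical helper: resolve Integer and Range indices into column keys.
def keys_at(keys, *indices)
  indices.flat_map do |index|
    case index
    when Range   then keys[index]          # 0..2 selects the first three keys
    when Integer then [keys.fetch(index)]  # negative values count from the end
    else raise ArgumentError, "unsupported index: #{index.inspect}"
    end
  end
end

keys = %i[species island bill_length_mm bill_depth_mm
          flipper_length_mm body_mass_g sex year]
picked = keys_at(keys, 0..2, -1)
# => [:species, :island, :bill_length_mm, :year]
```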
+
+- Booleans as arguments
+
+ `pick(booleans)` accepts booleans as arguments in an Array. Booleans must be the same length as `n_keys`.
+
+ ```ruby
penguins.pick(penguins.types.map { |type| type == :string })
# =>
#<RedAmber::DataFrame : 344 x 3 Vectors, 0x00000000000387ac>
species island sex
@@ -418,13 +458,13 @@
342 Gentoo Biscoe male
343 Gentoo Biscoe female
344 Gentoo Biscoe male
```
- - Keys or booleans by a block
+- Keys or booleans by a block
- `pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
+ `pick {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
```ruby
penguins.pick { keys.map { |key| key.end_with?('mm') } }
# =>
@@ -442,25 +482,29 @@
344 49.9 16.1 213
```
### `drop ` - pick and drop -
- Drop some variables (columns) to create a remainer DataFrame.
+ Drop some columns (variables) to create a remainder DataFrame.
![drop method image](doc/../image/dataframe/drop.png)
- Keys as arguments
- `drop(keys)` accepts keys as arguments in an Array.
+ `drop(keys)` accepts keys as arguments in an Array or a Range.
-- Booleans as a argument
+- Indices as arguments
- `drop(booleans)` accepts booleans as a argument in an Array. Booleans must be same length as `n_keys`.
+ `drop(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.
+- Booleans as arguments
+
+ `drop(booleans)` accepts booleans as an argument in an Array. Booleans must be the same length as `n_keys`.
+
- Keys or booleans by a block
- `drop {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
+ `drop {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return keys, indices or a boolean Array with the same length as `n_keys`. The block is called in the context of self.
- Notice for nil
When used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
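
The nil rule can be illustrated with plain Ruby Arrays. The following is a sketch of the described behavior, not RedAmber code; the key names are made up for the illustration:

```ruby
keys = %i[a b c]
booleans = [true, nil, false]

# Drop the keys whose flag is truthy; nil behaves like false, so :b is kept.
remaining = keys.reject.with_index { |_key, i| booleans[i] }
# => [:b, :c]

# Ruby itself treats nil and false alike under negation.
negated = [!nil, !false]
# => [true, true]
```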
@@ -491,13 +535,24 @@
# =>
#<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
[1, 2, 3]
```
+ A simple key name is usable as a method of the DataFrame if the key name is acceptable as a method name.
+ It returns the same Vector as `[]`.
+
+ ```ruby
+ df.a
+
+ # =>
+ #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
+ [1, 2, 3]
+ ```
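
One common way to offer key names as methods in Ruby is `method_missing`. The sketch below shows that general technique on a toy class; `TinyFrame` is hypothetical, and whether RedAmber uses this mechanism internally is an assumption not confirmed by this document:

```ruby
# Toy container, not RedAmber::DataFrame.
class TinyFrame
  def initialize(columns)
    @columns = columns
  end

  def [](key)
    @columns.fetch(key)
  end

  # Forward unknown zero-argument calls to [] when the name is a known key.
  def method_missing(name, *args)
    return self[name] if args.empty? && @columns.key?(name)

    super
  end

  def respond_to_missing?(name, include_private = false)
    @columns.key?(name) || super
  end
end

tiny = TinyFrame.new({ a: [1, 2, 3] })
by_method  = tiny.a    # => [1, 2, 3]
by_bracket = tiny[:a]  # => [1, 2, 3]
```

A key such as `:'bill length'` cannot be called this way, which matches the restriction that the key name must be acceptable as a method name.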
+
### `slice ` - to cut vertically is slice -
- Slice and select observations (rows) to create a sub DataFrame.
+ Slice and select rows (observations) to create a sub DataFrame.
![slice method image](doc/../image/dataframe/slice.png)
- Indices as arguments
@@ -524,11 +579,11 @@
10 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
- Booleans as an argument
- `slice(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
+ `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
```ruby
vector = penguins[:bill_length_mm]
penguins.slice(vector >= 40)
@@ -601,11 +656,11 @@
0 1 A 1.000000
```
### `remove`
- Slice and reject observations (rows) to create a remainer DataFrame.
+ Slice and reject rows (observations) to create a remainder DataFrame.
![remove method image](doc/../image/dataframe/remove.png)
- Indices as arguments
@@ -630,11 +685,11 @@
334 Gentoo Biscoe 47.2 13.7 214 ... 2009
```
- Booleans as an argument
- `remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
+ `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. Booleans must be the same length as `size`.
```ruby
# remove all observations that contain nil
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
removed
@@ -658,14 +713,16 @@
`remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices or a boolean Array with the same length as `size`. The block is called in the context of self.
```ruby
penguins.remove do
- vector = self[:bill_length_mm]
- min = vector.mean - vector.std
- max = vector.mean + vector.std
- vector.to_a.map { |e| (min..max).include? e }
+ # We will use another style shown in slice
+ # self.bill_length_mm returns Vector
+ mean = bill_length_mm.mean
+ min = mean - bill_length_mm.std
+ max = mean + bill_length_mm.std
+ bill_length_mm.to_a.map { |e| (min..max).include? e }
end
# =>
#<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000004de40>
species island bill_length_mm bill_depth_mm flipper_length_mm ... year
@@ -678,10 +735,11 @@
: : : : : : ... :
138 Gentoo Biscoe (nil) (nil) (nil) ... 2009
139 Gentoo Biscoe 50.4 15.7 222 ... 2009
140 Gentoo Biscoe 49.9 16.1 213 ... 2009
```
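
The boolean mask built in the block above can be reproduced with plain Arrays. This sketch uses a few bill lengths copied from the output shown in this document; `sample_std` is a hypothetical helper, and the use of the unbiased estimator (ddof = 1) is an assumption about what `std` computes:

```ruby
# Hypothetical helper: unbiased sample standard deviation (ddof = 1 assumed).
def sample_std(values)
  mean = values.sum / values.size.to_f
  Math.sqrt(values.sum { |v| (v - mean)**2 } / (values.size - 1))
end

lengths = [39.1, 39.5, 40.3, nil, 36.7, 50.4]
present = lengths.compact
mean = present.sum / present.size
std = sample_std(present)
range = (mean - std)..(mean + std)

# true marks a row to remove. nil is never inside the range, so it maps to
# false and the row is kept, as in the (nil) rows of the output above.
mask = lengths.map { |e| range.include?(e) }
# => [true, true, true, false, true, false]
```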
+
- Notice for nil
- When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.
```ruby
df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
@@ -770,19 +828,23 @@
age: [68, 49, 28])
df
# =>
#<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
- name age
- <string> <uint8>
- 1 Yasuko 68
- 2 Rui 49
+ name age
+ <string> <uint8>
+ 1 Yasuko 68
+ 2 Rui 49
3 Hinata 28
# update :age and add :brother
- assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
- df.assign(assigner)
+ df.assign do
+ {
+ age: age + 29,
+ brother: ['Santa', nil, 'Momotaro']
+ }
+ end
# =>
#<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
name age brother
<string> <uint8> <string>
@@ -797,11 +859,12 @@
```ruby
df = RedAmber::DataFrame.new(
index: [0, 1, 2, 3, nil],
float: [0.0, 1.1, 2.2, Float::NAN, nil],
- string: ['A', 'B', 'C', 'D', nil])
+ string: ['A', 'B', 'C', 'D', nil]
+ )
df
# =>
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
index float string
@@ -819,17 +882,17 @@
.map { |v| [v.key, -v] }
end
# =>
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
- index float string
- <uint8> <double> <string>
- 1 0 -0.0 A
- 2 1 -1.1 B
- 3 2 -2.2 C
- 4 3 NaN D
- 5 (nil) (nil) (nil)
+ index float string
+ <uint8> <double> <string>
+ 1 0 -0.0 A
+ 2 1 -1.1 B
+ 3 2 -2.2 C
+ 4 3 NaN D
+ 5 (nil) (nil) (nil)
# Or we can use assigner by a Hash
df.assign do
vectors.select.with_object({}) do |v, assigner|
assigner[v.key] = -v if v.float?
@@ -850,11 +913,11 @@
- Append from left
`assign_left` method accepts the same parameters and block as `assign`, but appends new columns on the left side.
```ruby
- df.assign_left(new_index: [1, 2, 3, 4, 5])
+ df.assign_left(new_index: df.indices(1))
# =>
#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
new_index index float string
<uint8> <uint8> <double> <string>
@@ -863,24 +926,92 @@
3 3 2 2.2 C
4 4 3 NaN D
5 5 (nil) (nil) (nil)
```
+### `slice_by(key, keep_key: false) { block }`
+
+`slice_by` accepts a key and a block to select rows.
+
+(Since 0.2.1)
+
+ ```ruby
+ df = RedAmber::DataFrame.new(
+ index: [0, 1, 2, 3, nil],
+ float: [0.0, 1.1, 2.2, Float::NAN, nil],
+ string: ['A', 'B', 'C', 'D', nil]
+ )
+ df
+
+ # =>
+ #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
+ index float string
+ <uint8> <double> <string>
+ 1 0 0.0 A
+ 2 1 1.1 B
+ 3 2 2.2 C
+ 4 3 NaN D
+ 5 (nil) (nil) (nil)
+
+ df.slice_by(:string) { ["A", "C"] }
+
+ # =>
+ #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001b1ac>
+ index float
+ <uint8> <double>
+ 1 0 0.0
+ 2 2 2.2
+ ```
+
+This behaves the same as:
+
+ ```ruby
+ df.slice { [string.index("A"), string.index("C")] }.drop(:string)
+ ```
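
The same select-then-drop logic can be sketched without RedAmber, using a Hash of column Arrays as a stand-in for the DataFrame. `slice_by_sketch` is a hypothetical helper for illustration only and handles just the Array-of-labels form:

```ruby
# Stand-in for the DataFrame above: a Hash of column arrays.
columns = {
  index: [0, 1, 2, 3, nil],
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
  string: ['A', 'B', 'C', 'D', nil]
}

# Select rows whose `key` column matches `labels`, in the order given,
# then drop the key column unless keep_key is true.
def slice_by_sketch(columns, key, labels, keep_key: false)
  rows = labels.map { |label| columns.fetch(key).index(label) }
  picked = columns.transform_values { |values| rows.map { |i| values[i] } }
  keep_key ? picked : picked.reject { |k, _| k == key }
end

sliced = slice_by_sketch(columns, :string, %w[A C])
sliced.keys     # => [:index, :float]
sliced[:index]  # => [0, 2]
sliced[:float]  # => [0.0, 2.2]

kept = slice_by_sketch(columns, :string, %w[A C], keep_key: true)
kept[:string]   # => ["A", "C"]
```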
+
+`slice_by` also accepts a Range.
+
+ ```ruby
+ df.slice_by(:string) { "A".."C" }
+
+ # =>
+ #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000069668>
+ index float
+ <uint8> <double>
+ 1 0 0.0
+ 2 1 1.1
+ 3 2 2.2
+ ```
+
+When the option `keep_key: true` is used, the column `key` will be preserved.
+
+ ```ruby
+ df.slice_by(:string, keep_key: true) { "A".."C" }
+
+ # =>
+ #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000073c44>
+ index float string
+ <uint8> <double> <string>
+ 1 0 0.0 A
+ 2 1 1.1 B
+ 3 2 2.2 C
+ ```
+
## Updating
### `sort`
`sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature.
- :key, "key" or "+key" denotes ascending order
- "-key" denotes descending order
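
The prefix convention listed above can be sketched as a small parser. This is an illustration of the rule only; `parse_sort_key` is a hypothetical helper, not the Red Arrow implementation:

```ruby
# Hypothetical parser: split a sort key into [column, order].
def parse_sort_key(sort_key)
  text = sort_key.to_s
  case text[0]
  when '-' then [text[1..].to_sym, :descending]
  when '+' then [text[1..].to_sym, :ascending]
  else          [text.to_sym, :ascending]
  end
end

parsed = [:index, '-bool', '+string'].map { |key| parse_sort_key(key) }
# => [[:index, :ascending], [:bool, :descending], [:string, :ascending]]
```
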
```ruby
- df = RedAmber::DataFrame.new({
+ df = RedAmber::DataFrame.new(
index: [1, 1, 0, nil, 0],
string: ['C', 'B', nil, 'A', 'B'],
bool: [nil, true, false, true, false],
- })
+ )
df.sort(:index, '-bool')
# =>
#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000009b03c>
index string bool
@@ -1033,106 +1164,107 @@
## Reshape
### `transpose`
- Creates transposed DataFrame for wide type dataframe.
+ Creates transposed DataFrame for the wide (messy) dataframe.
```ruby
import_cars = RedAmber::DataFrame.load('test/entity/import_cars.tsv')
# =>
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520>
Year Audi BMW BMW_MINI Mercedes-Benz VW
<int64> <int64> <int64> <int64> <int64> <int64>
- 1 2021 22535 35905 18211 51722 35215
- 2 2020 22304 35712 20196 57041 36576
+ 1 2017 28336 52527 25427 68221 49040
+ 2 2018 26473 50982 25984 67554 51961
3 2019 24222 46814 23813 66553 46794
- 4 2018 26473 50982 25984 67554 51961
- 5 2017 28336 52527 25427 68221 49040
+ 4 2020 22304 35712 20196 57041 36576
+ 5 2021 22535 35905 18211 51722 35215
+ import_cars.transpose(:Manufacturer)
- import_cars.transpose
-
# =>
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74>
- name 2021 2020 2019 2018 2017
- <dictionary> <uint16> <uint16> <uint32> <uint32> <uint32>
- 1 Audi 22535 22304 24222 26473 28336
- 2 BMW 35905 35712 46814 50982 52527
- 3 BMW_MINI 18211 20196 23813 25984 25427
- 4 Mercedes-Benz 51722 57041 66553 67554 68221
- 5 VW 35215 36576 46794 51961 49040
+ Manufacturer 2017 2018 2019 2020 2021
+ <dictionary> <uint32> <uint32> <uint32> <uint16> <uint16>
+ 1 Audi 28336 26473 24222 22304 22535
+ 2 BMW 52527 50982 46814 35712 35905
+ 3 BMW_MINI 25427 25984 23813 20196 18211
+ 4 Mercedes-Benz 68221 67554 66553 57041 51722
+ 5 VW 49040 51961 46794 36576 35215
```
The leftmost column is created by original keys. Key name of the column is
- named by 'name'.
+ named by parameter `:name`. If `:name` is not specified, `:N` is used for the key.
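
The reshaping can be sketched with a Hash of column Arrays standing in for the DataFrame. `transpose_sketch` is a hypothetical helper, the data is a two-year, two-maker excerpt of the table above, and the sketch does not reproduce RedAmber's key or type conversions:

```ruby
# Excerpt of import_cars as a Hash of column arrays.
wide = {
  Year: [2017, 2018],
  Audi: [28336, 26473],
  BMW: [52527, 50982]
}

# The first column supplies the new keys; the remaining keys become the
# values of a new leftmost column named by `name` (:N when omitted).
def transpose_sketch(wide, name: :N)
  (_header_key, header), *rest = wide.to_a
  transposed = { name => rest.map(&:first) }
  header.each_with_index do |new_key, i|
    transposed[new_key] = rest.map { |_key, values| values[i] }
  end
  transposed
end

transposed = transpose_sketch(wide, name: :Manufacturer)
transposed[:Manufacturer]  # => [:Audi, :BMW]
transposed[2017]           # => [28336, 52527]
transposed[2018]           # => [26473, 50982]
```
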
### `to_long(*keep_keys)`
- Creates a 'long' DataFrame.
+ Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
- Parameter `keep_keys` specifies the key names to keep.
```ruby
import_cars.to_long(:Year)
# =>
#<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
- Year name value
+ Year N V
<uint16> <dictionary> <uint32>
- 1 2021 Audi 22535
- 2 2021 BMW 35905
- 3 2021 BMW_MINI 18211
- 4 2021 Mercedes-Benz 51722
- 5 2021 VW 35215
+ 1 2017 Audi 28336
+ 2 2017 BMW 52527
+ 3 2017 BMW_MINI 25427
+ 4 2017 Mercedes-Benz 68221
+ 5 2017 VW 49040
: : : :
- 23 2017 BMW_MINI 25427
- 24 2017 Mercedes-Benz 68221
- 25 2017 VW 49040
+ 23 2021 BMW_MINI 18211
+ 24 2021 Mercedes-Benz 51722
+ 25 2021 VW 35215
```
- - Option `:name` : key of the column which is come **from key names**.
- - Option `:value` : key of the column which is come **from values**.
+ - Option `:name` is the key of the column which came **from key names**.
+ - Option `:value` is the key of the column which came **from values**.
```ruby
import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
# =>
#<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
Year Manufacturer Num_of_imported
<uint16> <dictionary> <uint32>
- 1 2021 Audi 22535
- 2 2021 BMW 35905
- 3 2021 BMW_MINI 18211
- 4 2021 Mercedes-Benz 51722
- 5 2021 VW 35215
+ 1 2017 Audi 28336
+ 2 2017 BMW 52527
+ 3 2017 BMW_MINI 25427
+ 4 2017 Mercedes-Benz 68221
+ 5 2017 VW 49040
: : : :
- 23 2017 BMW_MINI 25427
- 24 2017 Mercedes-Benz 68221
- 25 2017 VW 49040
+ 23 2021 BMW_MINI 18211
+ 24 2021 Mercedes-Benz 51722
+ 25 2021 VW 35215
```
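
A plain-Ruby sketch of the wide-to-long reshaping, again with a Hash of column Arrays in place of the DataFrame. `to_long_sketch` is a hypothetical helper and the data is an excerpt of the table above:

```ruby
wide = { Year: [2017, 2018], Audi: [28336, 26473], BMW: [52527, 50982] }

# Every non-kept column contributes one output row per input row:
# its key goes to the `name` column and its cell to the `value` column.
def to_long_sketch(wide, *keep_keys, name: :N, value: :V)
  melt_keys = wide.keys - keep_keys
  n_rows = wide.values.first.size
  long = (keep_keys + [name, value]).to_h { |k| [k, []] }
  n_rows.times do |row|
    melt_keys.each do |key|
      keep_keys.each { |k| long[k] << wide[k][row] }
      long[name] << key
      long[value] << wide[key][row]
    end
  end
  long
end

long = to_long_sketch(wide, :Year, name: :Manufacturer, value: :Num_of_imported)
long[:Year]             # => [2017, 2017, 2018, 2018]
long[:Manufacturer]     # => [:Audi, :BMW, :Audi, :BMW]
long[:Num_of_imported]  # => [28336, 52527, 26473, 50982]
```
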
### `to_wide`
- Creates a 'wide' DataFrame.
+ Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
- - Option `:name` : key of the column which will be expanded **to key name**.
- - Option `:value` : key of the column which will be expanded **to values**.
+ - Option `:name` is the key of the column which will be expanded **to key names**.
+ - Option `:value` is the key of the column which will be expanded **to values**.
```ruby
import_cars.to_long(:Year).to_wide
- # import_cars.to_long(:Year).to_wide(name: :name, value: :value)
+ # import_cars.to_long(:Year).to_wide(name: :N, value: :V)
# is also OK
# =>
#<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
Year Audi BMW BMW_MINI Mercedes-Benz VW
<uint16> <uint16> <uint16> <uint16> <uint32> <uint16>
- 1 2021 22535 35905 18211 51722 35215
- 2 2020 22304 35712 20196 57041 36576
+ 1 2017 28336 52527 25427 68221 49040
+ 2 2018 26473 50982 25984 67554 51961
3 2019 24222 46814 23813 66553 46794
- 4 2018 26473 50982 25984 67554 51961
- 5 2017 28336 52527 25427 68221 49040
+ 4 2020 22304 35712 20196 57041 36576
+ 5 2021 22535 35905 18211 51722 35215
+
+ # == import_cars
```
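
The inverse reshaping can be sketched the same way. `to_wide_sketch` is a hypothetical helper; it assumes complete data, that is, every kept-key combination has a value for every name:

```ruby
long = {
  Year: [2017, 2017, 2018, 2018],
  N: %i[Audi BMW Audi BMW],
  V: [28336, 52527, 26473, 50982]
}

# Each distinct combination of kept columns becomes one row; each distinct
# entry of the `name` column becomes a new column filled from `value`.
def to_wide_sketch(long, name: :N, value: :V)
  keep_keys = long.keys - [name, value]
  wide = Hash.new { |hash, key| hash[key] = [] }
  seen = {}
  long[name].each_index do |i|
    id = keep_keys.map { |k| long[k][i] }
    unless seen.key?(id)
      seen[id] = true
      keep_keys.each_with_index { |k, j| wide[k] << id[j] }
    end
    wide[long[name][i]] << long[value][i]
  end
  wide
end

wide_again = to_wide_sketch(long)
wide_again.keys    # => [:Year, :Audi, :BMW]
wide_again[:Year]  # => [2017, 2018]
wide_again[:Audi]  # => [28336, 26473]
wide_again[:BMW]   # => [52527, 50982]
```
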
## Combine
- [ ] Combining dataframes