doc/DataFrame.md in red_amber-0.2.0 vs doc/DataFrame.md in red_amber-0.2.1

- old
+ new

@@ -153,12 +153,30 @@
 - Returns an Array of Vectors.

 ### `indices`, `indexes`

-- Returns all indexes in an Array.
+- Returns indexes in an Array.
+  Accepts an option `start` as the first of indexes.
+  ```ruby
+  df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
+  df.indices
+
+  # =>
+  [0, 1, 2, 3, 4]
+
+  df.indices(1)
+
+  # =>
+  [1, 2, 3, 4, 5]
+
+  df.indices(:a)
+  # =>
+  [:a, :b, :c, :d, :e]
+  ```
+

 ### `to_h`

 - Returns column-oriented data in a Hash.

 ### `to_a`, `raw_records`
@@ -370,17 +388,17 @@
 ## Sub DataFrame manipulations

 ### `pick ` - pick up variables by key label -

-  Pick up some variables (columns) to create a sub DataFrame.
+  Pick up some columns (variables) to create a sub DataFrame.

   ![pick method image](doc/../image/dataframe/pick.png)

 - Keys as arguments

-  `pick(keys)` accepts keys as arguments in an Array.
+  `pick(keys)` accepts keys as arguments in an Array or a Range.

   ```ruby
   penguins.pick(:species, :bill_length_mm)

   # =>
@@ -396,15 +414,37 @@
   342 Gentoo 50.4
   343 Gentoo 45.2
   344 Gentoo 49.9
   ```

-- Booleans as a argument
+- Indices as arguments

-  `pick(booleans)` accepts booleans as a argument in an Array. Booleans must be same length as `n_keys`.
+  `pick(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers.

   ```ruby
+  penguins.pick(0..2, -1)
+
+  # =>
+  #<RedAmber::DataFrame : 344 x 4 Vectors, 0x0000000000055ce4>
+      species  island    bill_length_mm     year
+      <string> <string>        <double> <uint16>
+    1 Adelie   Torgersen           39.1     2007
+    2 Adelie   Torgersen           39.5     2007
+    3 Adelie   Torgersen           40.3     2007
+    4 Adelie   Torgersen          (nil)     2007
+    5 Adelie   Torgersen           36.7     2007
+    : :        :                      :        :
+  342 Gentoo   Biscoe              50.4     2009
+  343 Gentoo   Biscoe              45.2     2009
+  344 Gentoo   Biscoe              49.9     2009
+  ```
+
+- Booleans as arguments
+
+  `pick(booleans)` accepts booleans as arguments in an Array. Booleans must be same length as `n_keys`.
+
+  ```ruby
   penguins.pick(penguins.types.map { |type| type == :string })

   # =>
   #<RedAmber::DataFrame : 344 x 3 Vectors, 0x00000000000387ac>
       species  island    sex
@@ -418,13 +458,13 @@
   342 Gentoo   Biscoe    male
   343 Gentoo   Biscoe    female
   344 Gentoo   Biscoe    male
   ```

- - Keys or booleans by a block
+- Keys or booleans by a block

-  `pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
+  `pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, indices or a boolean Array with a same length as `n_keys`. Block is called in the context of self.

   ```ruby
   penguins.pick { keys.map { |key| key.end_with?('mm') } }

   # =>
@@ -442,25 +482,29 @@
   344 49.9 16.1 213
   ```

 ### `drop ` - pick and drop -

-  Drop some variables (columns) to create a remainer DataFrame.
+  Drop some columns (variables) to create a remainer DataFrame.

   ![drop method image](doc/../image/dataframe/drop.png)

 - Keys as arguments

-  `drop(keys)` accepts keys as arguments in an Array.
+  `drop(keys)` accepts keys as arguments in an Array or a Range.

-- Booleans as a argument
+- Indices as arguments

-  `drop(booleans)` accepts booleans as a argument in an Array. Booleans must be same length as `n_keys`.
+  `drop(indices)` accepts indices as a arguments. Indices should be Integers, Floats or Ranges of Integers.

+- Booleans as arguments
+
+  `drop(booleans)` accepts booleans as an argument in an Array. Booleans must be same length as `n_keys`.
+
 - Keys or booleans by a block

-  `drop {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
+  `drop {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, indices or a boolean Array with a same length as `n_keys`. Block is called in the context of self.

 - Notice for nil

   When used with booleans, nil in booleans is treated as a false. This behavior is aligned with Ruby's `nil#!`.
@@ -491,13 +535,24 @@
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
   [1, 2, 3]
   ```

+  A simple key name is usable as a method of the DataFrame if the key name is acceptable as a method name.
+  It returns a Vector same as `[]`.
+
+  ```ruby
+  df.a
+
+  # =>
+  #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
+  [1, 2, 3]
+  ```
+
 ### `slice ` - to cut vertically is slice -

-  Slice and select observations (rows) to create a sub DataFrame.
+  Slice and select rows (observations) to create a sub DataFrame.

   ![slice method image](doc/../image/dataframe/slice.png)

 - Indices as arguments
@@ -524,11 +579,11 @@
   10 Gentoo Biscoe 49.9 16.1 213 ... 2009
   ```

 - Booleans as an argument

-  `slice(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
+  `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.

   ```ruby
   vector = penguins[:bill_length_mm]
   penguins.slice(vector >= 40)
@@ -601,11 +656,11 @@
   0 1 A 1.000000
   ```

 ### `remove`

-  Slice and reject observations (rows) to create a remainer DataFrame.
+  Slice and reject rows (observations) to create a remainer DataFrame.

   ![remove method image](doc/../image/dataframe/remove.png)

 - Indices as arguments
@@ -630,11 +685,11 @@
   334 Gentoo Biscoe 47.2 13.7 214 ... 2009
   ```

 - Booleans as an argument

-  `remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
+  `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.

   ```ruby
   # remove all observation contains nil
   removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
   removed
@@ -658,14 +713,16 @@
   `remove {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as `size`. Block is called in the context of self.

   ```ruby
   penguins.remove do
-    vector = self[:bill_length_mm]
-    min = vector.mean - vector.std
-    max = vector.mean + vector.std
-    vector.to_a.map { |e| (min..max).include? e }
+    # We will use another style shown in slice
+    # self.bill_length_mm returns Vector
+    mean = bill_length_mm.mean
+    min = mean - bill_length_mm.std
+    max = mean + bill_length_mm.std
+    bill_length_mm.to_a.map { |e| (min..max).include? e }
   end

   # =>
   #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000004de40>
       species  island    bill_length_mm bill_depth_mm flipper_length_mm ...     year
@@ -678,10 +735,11 @@
     : :        :                      :             :                 : ...        :
   138 Gentoo   Biscoe             (nil)         (nil)             (nil) ...     2009
   139 Gentoo   Biscoe              50.4          15.7               222 ...     2009
   140 Gentoo   Biscoe              49.9          16.1               213 ...     2009
   ```
+
 - Notice for nil
   - When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.

   ```ruby
   df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
@@ -770,19 +828,23 @@
     age: [68, 49, 28])
   df

   # =>
   #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
-    name         age
-    <string> <uint8>
-  1 Yasuko        68
-  2 Rui           49
+    name         age
+    <string> <uint8>
+  1 Yasuko        68
+  2 Rui           49
   3 Hinata        28

   # update :age and add :brother
-  assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
-  df.assign(assigner)
+  df.assign do
+    {
+      age: age + 29,
+      brother: ['Santa', nil, 'Momotaro']
+    }
+  end

   # =>
   #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
     name         age brother
     <string> <uint8> <string>
@@ -797,11 +859,12 @@

   ```ruby
   df = RedAmber::DataFrame.new(
     index: [0, 1, 2, 3, nil],
     float: [0.0, 1.1, 2.2, Float::NAN, nil],
-    string: ['A', 'B', 'C', 'D', nil])
+    string: ['A', 'B', 'C', 'D', nil]
+  )
   df

   # =>
   #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
       index    float string
@@ -819,17 +882,17 @@
       .map { |v| [v.key, -v] }
   end

   # =>
   #<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
-      index    float string
-    <uint8> <double> <string>
-  1       0     -0.0 A
-  2       1     -1.1 B
-  3       2     -2.2 C
-  4       3      NaN D
-  5   (nil)    (nil) (nil)
+      index    float string
+    <uint8> <double> <string>
+  1       0     -0.0 A
+  2       1     -1.1 B
+  3       2     -2.2 C
+  4       3      NaN D
+  5   (nil)    (nil) (nil)

   # Or we can use assigner by a Hash
   df.assign do
     vectors.select.with_object({}) do |v, assigner|
       assigner[v.key] = -v if v.float?
@@ -850,11 +913,11 @@
 - Append from left

   `assign_left` method accepts the same parameters and block as `assign`, but append new columns from leftside.

   ```ruby
-  df.assign_left(new_index: [1, 2, 3, 4, 5])
+  df.assign_left(new_index: df.indices(1))

   # =>
   #<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
     new_index   index    float string
       <uint8> <uint8> <double> <string>
@@ -863,24 +926,92 @@
   3         3       2      2.2 C
   4         4       3      NaN D
   5         5   (nil)    (nil) (nil)
   ```

+### `slice_by(key, keep_key: false) { block }`
+
+`slice_by` accepts a key and a block to select rows.
+
+(Since 0.2.1)
+
+  ```ruby
+  df = RedAmber::DataFrame.new(
+    index: [0, 1, 2, 3, nil],
+    float: [0.0, 1.1, 2.2, Float::NAN, nil],
+    string: ['A', 'B', 'C', 'D', nil]
+  )
+  df
+
+  # =>
+  #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
+      index    float string
+    <uint8> <double> <string>
+  1       0      0.0 A
+  2       1      1.1 B
+  3       2      2.2 C
+  4       3      NaN D
+  5   (nil)    (nil) (nil)
+
+  df.slice_by(:string) { ["A", "C"] }
+
+  # =>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001b1ac>
+      index    float
+    <uint8> <double>
+  1       0      0.0
+  2       2      2.2
+  ```
+
+It is the same behavior as;
+
+  ```ruby
+  df.slice { [string.index("A"), string.index("C")] }.drop(:string)
+  ```
+
+`slice_by` also accepts a Range.
+
+  ```ruby
+  df.slice_by(:string) { "A".."C" }
+
+  # =>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000069668>
+      index    float
+    <uint8> <double>
+  1       0      0.0
+  2       1      1.1
+  3       2      2.2
+  ```
+
+When the option `keep_key: true` used, the column `key` will be preserved.
+
+  ```ruby
+  df.slice_by(:string, keep_key: true) { "A".."C" }
+
+  # =>
+  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000073c44>
+      index    float string
+    <uint8> <double> <string>
+  1       0      0.0 A
+  2       1      1.1 B
+  3       2      2.2 C
+  ```
+
 ## Updating

 ### `sort`

   `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature。
     - :key, "key" or "+key" denotes ascending order
     - "-key" denotes descending order

   ```ruby
-  df = RedAmber::DataFrame.new({
+  df = RedAmber::DataFrame.new(
     index: [1, 1, 0, nil, 0],
     string: ['C', 'B', nil, 'A', 'B'],
     bool: [nil, true, false, true, false],
-  })
+  )
   df.sort(:index, '-bool')

   # =>
   #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000009b03c>
       index string   bool
@@ -1033,106 +1164,107 @@
 ## Reshape

 ### `transpose`

-  Creates transposed DataFrame for wide type dataframe.
+  Creates transposed DataFrame for the wide (messy) dataframe.

   ```ruby
   import_cars = RedAmber::DataFrame.load('test/entity/import_cars.tsv')

   # =>
   #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520>
        Year    Audi     BMW BMW_MINI Mercedes-Benz      VW
     <int64> <int64> <int64>  <int64>       <int64> <int64>
-  1    2021   22535   35905    18211         51722   35215
-  2    2020   22304   35712    20196         57041   36576
+  1    2017   28336   52527    25427         68221   49040
+  2    2018   26473   50982    25984         67554   51961
   3    2019   24222   46814    23813         66553   46794
-  4    2018   26473   50982    25984         67554   51961
-  5    2017   28336   52527    25427         68221   49040
+  4    2020   22304   35712    20196         57041   36576
+  5    2021   22535   35905    18211         51722   35215
+  import_cars.transpose(:Manufacturer)

-  import_cars.transpose
-
   # =>
   #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74>
-    name              2021     2020     2019     2018     2017
-    <dictionary>  <uint16> <uint16> <uint32> <uint32> <uint32>
-  1 Audi             22535    22304    24222    26473    28336
-  2 BMW              35905    35712    46814    50982    52527
-  3 BMW_MINI         18211    20196    23813    25984    25427
-  4 Mercedes-Benz    51722    57041    66553    67554    68221
-  5 VW               35215    36576    46794    51961    49040
+    Manufacturer      2017     2018     2019     2020     2021
+    <dictionary>  <uint32> <uint32> <uint32> <uint16> <uint16>
+  1 Audi             28336    26473    24222    22304    22535
+  2 BMW              52527    50982    46814    35712    35905
+  3 BMW_MINI         25427    25984    23813    20196    18211
+  4 Mercedes-Benz    68221    67554    66553    57041    51722
+  5 VW               49040    51961    46794    36576    35215
   ```

   The leftmost column is created by original keys. Key name of the column is
-  named by 'name'.
+  named by parameter `:name`. If `:name` is not specified, `:N` is used for the key.

 ### `to_long(*keep_keys)`

-  Creates a 'long' DataFrame.
+  Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.

   - Parameter `keep_keys` specifies the key names to keep.

   ```ruby
   import_cars.to_long(:Year)

   # =>
   #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
-         Year name             value
+         Year N                    V
      <uint16> <dictionary>  <uint32>
-   1     2021 Audi             22535
-   2     2021 BMW              35905
-   3     2021 BMW_MINI         18211
-   4     2021 Mercedes-Benz    51722
-   5     2021 VW               35215
+   1     2017 Audi             28336
+   2     2017 BMW              52527
+   3     2017 BMW_MINI         25427
+   4     2017 Mercedes-Benz    68221
+   5     2017 VW               49040
    :        : :                    :
-  23     2017 BMW_MINI         25427
-  24     2017 Mercedes-Benz    68221
-  25     2017 VW               49040
+  23     2021 BMW_MINI         18211
+  24     2021 Mercedes-Benz    51722
+  25     2021 VW               35215
   ```

-  - Option `:name` : key of the column which is come **from key names**.
-  - Option `:value` : key of the column which is come **from values**.
+  - Option `:name` is the key of the column which came **from key names**.
+  - Option `:value` is the key of the column which came **from values**.

   ```ruby
   import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)

   # =>
   #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
          Year Manufacturer  Num_of_imported
      <uint16> <dictionary>         <uint32>
-   1     2021 Audi                    22535
-   2     2021 BMW                     35905
-   3     2021 BMW_MINI                18211
-   4     2021 Mercedes-Benz           51722
-   5     2021 VW                      35215
+   1     2017 Audi                    28336
+   2     2017 BMW                     52527
+   3     2017 BMW_MINI                25427
+   4     2017 Mercedes-Benz           68221
+   5     2017 VW                      49040
    :        : :                           :
-  23     2017 BMW_MINI                25427
-  24     2017 Mercedes-Benz           68221
-  25     2017 VW                      49040
+  23     2021 BMW_MINI                18211
+  24     2021 Mercedes-Benz           51722
+  25     2021 VW                      35215
   ```

 ### `to_wide`

-  Creates a 'wide' DataFrame.
+  Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.

-  - Option `:name` : key of the column which will be expanded **to key name**.
-  - Option `:value` : key of the column which will be expanded **to values**.
+  - Option `:name` is the key of the column which will be expanded **to key names**.
+  - Option `:value` is the key of the column which will be expanded **to values**.

   ```ruby
   import_cars.to_long(:Year).to_wide
-  # import_cars.to_long(:Year).to_wide(name: :name, value: :value)
+  # import_cars.to_long(:Year).to_wide(name: :N, value: :V)
   # is also OK

   # =>
   #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
         Year     Audi      BMW BMW_MINI Mercedes-Benz       VW
     <uint16> <uint16> <uint16> <uint16>      <uint32> <uint16>
-  1     2021    22535    35905    18211         51722    35215
-  2     2020    22304    35712    20196         57041    36576
+  1     2017    28336    52527    25427         68221    49040
+  2     2018    26473    50982    25984         67554    51961
   3     2019    24222    46814    23813         66553    46794
-  4     2018    26473    50982    25984         67554    51961
-  5     2017    28336    52527    25427         68221    49040
+  4     2020    22304    35712    20196         57041    36576
+  5     2021    22535    35905    18211         51722    35215
+
+  # == import_cars
   ```

 ## Combine

 - [ ] Combining dataframes
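
A plain-Ruby sketch may help when reading the `to_long` / `to_wide` changes in the last hunk. It does not use RedAmber: `WIDE`, `to_long`, and `to_wide` below are hypothetical stand-ins that operate on an Array of Hashes, and the column names are Symbols rather than dictionary strings. Only the default key names `:N` and `:V` and the sample figures are taken from the diff.

```ruby
# Illustrative only: these helpers mimic the reshaping idea, not RedAmber's code.

# Two rows and two manufacturers from the import_cars table shown in the diff.
WIDE = [
  { Year: 2017, Audi: 28336, BMW: 52527 },
  { Year: 2018, Audi: 26473, BMW: 50982 }
].freeze

# Wide -> long: every column that is not kept becomes one (name, value) row.
def to_long(rows, *keep_keys, name: :N, value: :V)
  rows.flat_map do |row|
    kept = row.slice(*keep_keys)
    (row.keys - keep_keys).map do |key|
      kept.merge(name => key, value => row[key])
    end
  end
end

# Long -> wide: group rows by the remaining columns, then spread each
# name/value pair back out into its own column.
def to_wide(rows, name: :N, value: :V)
  grouped = rows.group_by { |row| row.reject { |k, _| [name, value].include?(k) } }
  grouped.map do |kept, group|
    group.each_with_object(kept.dup) { |row, wide| wide[row[name]] = row[value] }
  end
end

long = to_long(WIDE, :Year)
puts long.size              # prints 4 (2 years x 2 manufacturers)
puts to_wide(long) == WIDE  # prints true: the round trip restores the input
```

The round trip mirrors the `# == import_cars` comment added in 0.2.1: reshaping to long and back with the same `name` and `value` keys is expected to reproduce the original wide table.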