doc/DataFrame.md in red_amber-0.1.8 vs doc/DataFrame.md in red_amber-0.2.0

- old
+ new

@@ -165,10 +165,15 @@ - Returns an array of row-oriented data without header. If you need a column-oriented full array, use `.to_h.to_a` +### `each_row` + + Yields each row as a `{ key => row }` Hash. + Returns an Enumerator if no block is given. + ### `schema` - Returns column names and data types in a Hash. ### `==` @@ -200,12 +205,27 @@ ### `inspect` `inspect` uses `to_s` output and also shows shape and object_id. -### `summary`, `describe` (not implemented) +### `summary`, `describe` +`DataFrame#summary` or `DataFrame#describe` shows summary statistics in a DataFrame. + +```ruby +puts penguins.summary.to_s(width: 82) # needs more width to show all stats in this example + +# => + variables count mean std min 25% median 75% max + <dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double> +1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6 +2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5 +3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0 +4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0 +5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0 +``` + ### `to_rover` - Returns a `Rover::DataFrame`. ```ruby @@ -702,11 +722,11 @@ ![rename method image](doc/../image/dataframe/rename.png) - Key pairs as arguments - `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}`. + `rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. ```ruby df = RedAmber::DataFrame.new( 'name' => %w[Yasuko Rui Hinata], 'age' => [68, 49, 28] ) df.rename(:age => :age_in_1993) @@ -719,28 +739,32 @@ 3 Hinata 28 ``` - Key pairs by a block - `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}`. Block is called in the context of self. 
+ `rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. Block is called in the context of self. +- Nonexistent keys + + If the specified `existing_key` does not exist, a `DataFrameArgumentError` is raised. + - Key type Symbol key and String key are distinguished. ### `assign` - Assign new or updated variables (columns) and create an updated DataFrame. + Assign new or updated columns (variables) and create an updated DataFrame. - - Variables with new keys will append new variables at bottom (right in the table). + - Variables with new keys are appended as new columns on the right. - Variables with existing keys will update corresponding vectors. ![assign method image](doc/../image/dataframe/assign.png) - Variables as arguments - `assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}`. + `assign(key_pairs)` accepts pairs of key and values as parameters. `key_pairs` should be a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`. ```ruby df = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28]) @@ -767,11 +791,11 @@ 3 Hinata 57 Momotaro ``` - Key pairs by a block - `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. Block is called in the context of self. + `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either `Vector`, `Array` or `Arrow::Array`. 
The block is called in the context of self. ```ruby df = RedAmber::DataFrame.new( index: [0, 1, 2, 3, nil], float: [0.0, 1.1, 2.2, Float::NAN, nil], @@ -786,43 +810,63 @@ 2 1 1.1 B 3 2 2.2 C 4 3 NaN D 5 (nil) (nil) (nil) - # update numeric variables + # update :float + # assigner by an Array df.assign do - assigner = {} - vectors.each_with_index do |v, i| - assigner[keys[i]] = v * -1 if v.numeric? - end - assigner + vectors.select(&:float?) + .map { |v| [v.key, -v] } end # => - #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000006e000> - index float string - <int8> <double> <string> - 1 0 -0.0 A - 2 -1 -1.1 B - 3 -2 -2.2 C - 4 -3 NaN D - 5 (nil) (nil) (nil) + #<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc> + index float string + <uint8> <double> <string> + 1 0 -0.0 A + 2 1 -1.1 B + 3 2 -2.2 C + 4 3 NaN D + 5 (nil) (nil) (nil) - # Or it's shorter like this: + # Or we can use assigner by a Hash df.assign do - variables.select.with_object({}) do |(key, vector), assigner| - assigner[key] = vector * -1 if vector.numeric? + vectors.select.with_object({}) do |v, assigner| + assigner[v.key] = -v if v.float? end end # => same as above ``` - Key type Symbol key and String key are considered as the same key. +- Empty assignment + + If the assigner is empty or nil, `assign` returns self. + +- Append from left + + The `assign_left` method accepts the same parameters and block as `assign`, but appends new columns on the left side. 
+ + ```ruby + df.assign_left(new_index: [1, 2, 3, 4, 5]) + + # => + #<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c> + new_index index float string + <uint8> <uint8> <double> <string> + 1 1 0 0.0 A + 2 2 1 1.1 B + 3 3 2 2.2 C + 4 4 3 NaN D + 5 5 (nil) (nil) (nil) + ``` + ## Updating ### `sort` `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature. @@ -931,41 +975,41 @@ ```ruby starwars.group(:species).count(:species) # => - #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0> - species count - <string> <int64> - 1 Human 35 - 2 Droid 6 - 3 Wookiee 2 - 4 Rodian 1 - 5 Hutt 1 - : : : - 36 Kaleesh 1 - 37 Pau'an 1 + #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0> + species count + <string> <int64> + 1 Human 35 + 2 Droid 6 + 3 Wookiee 2 + 4 Rodian 1 + 5 Hutt 1 + : : : + 36 Kaleesh 1 + 37 Pau'an 1 38 Kel Dor 1 ``` We can also calculate the mean of `:mass` and `:height` together. ```ruby grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] } # => - #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc> - species count mean(height) mean(mass) - <string> <int64> <double> <double> - 1 Human 35 176.6 82.8 - 2 Droid 6 131.2 69.8 - 3 Wookiee 2 231.0 124.0 - 4 Rodian 1 173.0 74.0 - 5 Hutt 1 175.0 1358.0 - : : : : : - 36 Kaleesh 1 216.0 159.0 - 37 Pau'an 1 206.0 80.0 + #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc> + species count mean(height) mean(mass) + <string> <int64> <double> <double> + 1 Human 35 176.6 82.8 + 2 Droid 6 131.2 69.8 + 3 Wookiee 2 231.0 124.0 + 4 Rodian 1 173.0 74.0 + 5 Hutt 1 175.0 1358.0 + : : : : : + 36 Kaleesh 1 216.0 159.0 + 37 Pau'an 1 206.0 80.0 38 Kel Dor 1 188.0 80.0 ``` Select rows for count > 1. 
@@ -985,20 +1029,117 @@ 7 Twi'lek 2 179.0 55.0 8 Mirialan 2 168.0 53.1 9 Kaminoan 2 221.0 88.0 ``` -## Combining DataFrames +## Reshape -- [ ] Combining rows to a dataframe +### `transpose` -- [ ] Inner join + Creates a transposed DataFrame from a wide-type DataFrame. -- [ ] Left join + ```ruby + import_cars = RedAmber::DataFrame.load('test/entity/import_cars.tsv') -## Encoding + # => + #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520> + Year Audi BMW BMW_MINI Mercedes-Benz VW + <int64> <int64> <int64> <int64> <int64> <int64> + 1 2021 22535 35905 18211 51722 35215 + 2 2020 22304 35712 20196 57041 36576 + 3 2019 24222 46814 23813 66553 46794 + 4 2018 26473 50982 25984 67554 51961 + 5 2017 28336 52527 25427 68221 49040 -- [ ] One-hot encoding + import_cars.transpose -## Iteration + # => + #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74> + name 2021 2020 2019 2018 2017 + <dictionary> <uint16> <uint16> <uint32> <uint32> <uint32> + 1 Audi 22535 22304 24222 26473 28336 + 2 BMW 35905 35712 46814 50982 52527 + 3 BMW_MINI 18211 20196 23813 25984 25427 + 4 Mercedes-Benz 51722 57041 66553 67554 68221 + 5 VW 35215 36576 46794 51961 49040 + ``` + + The leftmost column is created from the original keys. The key of this column is + 'name'. -- [ ] each_rows +### `to_long(*keep_keys)` + + Creates a 'long' DataFrame. + + - Parameter `keep_keys` specifies the key names to keep. + + ```ruby + import_cars.to_long(:Year) + + # => + #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750> + Year name value + <uint16> <dictionary> <uint32> + 1 2021 Audi 22535 + 2 2021 BMW 35905 + 3 2021 BMW_MINI 18211 + 4 2021 Mercedes-Benz 51722 + 5 2021 VW 35215 + : : : : + 23 2017 BMW_MINI 25427 + 24 2017 Mercedes-Benz 68221 + 25 2017 VW 49040 + ``` + + - Option `:name` : key of the column that comes **from key names**. + - Option `:value` : key of the column that comes **from values**. 
+ + ```ruby + import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported) + + # => + #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700> + Year Manufacturer Num_of_imported + <uint16> <dictionary> <uint32> + 1 2021 Audi 22535 + 2 2021 BMW 35905 + 3 2021 BMW_MINI 18211 + 4 2021 Mercedes-Benz 51722 + 5 2021 VW 35215 + : : : : + 23 2017 BMW_MINI 25427 + 24 2017 Mercedes-Benz 68221 + 25 2017 VW 49040 + ``` + +### `to_wide` + + Creates a 'wide' DataFrame. + + - Option `:name` : key of the column which will be expanded **to key name**. + - Option `:value` : key of the column which will be expanded **to values**. + + ```ruby + import_cars.to_long(:Year).to_wide + # import_cars.to_long(:Year).to_wide(name: :name, value: :value) + # is also OK + + # => + #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0> + Year Audi BMW BMW_MINI Mercedes-Benz VW + <uint16> <uint16> <uint16> <uint16> <uint32> <uint16> + 1 2021 22535 35905 18211 51722 35215 + 2 2020 22304 35712 20196 57041 36576 + 3 2019 24222 46814 23813 66553 46794 + 4 2018 26473 50982 25984 67554 51961 + 5 2017 28336 52527 25427 68221 49040 + ``` + +## Combine + +- [ ] Combining dataframes + +- [ ] Join + +## Encoding + +- [ ] One-hot encoding
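
The `to_long` / `to_wide` round trip documented in this diff can be illustrated without RedAmber. The following is a minimal plain-Ruby sketch, assuming a table is a Hash of `{ key => Array }` columns of equal length. The helpers `sketch_to_long` and `sketch_to_wide` are hypothetical names made up for this illustration; they are not part of the RedAmber API and say nothing about how the library implements reshaping. The sample values are the 2021 and 2020 rows for Audi and BMW from the `import_cars` table above.

```ruby
# Plain-Ruby sketch of long/wide reshaping; it does not call RedAmber.
# Assumption: a table is a Hash of { key => Array } columns of equal length.

# Hypothetical helper: melt every column except keep_key into (name, value) rows.
def sketch_to_long(table, keep_key, name: :name, value: :value)
  melted_keys = table.keys - [keep_key]
  long = { keep_key => [], name => [], value => [] }
  table[keep_key].each_with_index do |kept, i|
    melted_keys.each do |key|
      long[keep_key] << kept          # repeat the kept value for each melted column
      long[name] << key               # former column key becomes a value
      long[value] << table[key][i]    # former cell becomes a value
    end
  end
  long
end

# Hypothetical helper: spread (name, value) rows back into one column per name.
# Assumes each (keep_key, name) pair occurs once, grouped by keep_key.
def sketch_to_wide(long, keep_key, name: :name, value: :value)
  wide = { keep_key => long[keep_key].uniq }
  long[name].uniq.each { |key| wide[key] = [] }
  long[name].each_with_index do |key, i|
    wide[key] << long[value][i]
  end
  wide
end

wide = { Year: [2021, 2020], Audi: [22535, 22304], BMW: [35905, 35712] }
long = sketch_to_long(wide, :Year)
restored = sketch_to_wide(long, :Year)
```

Here `long` holds one row per (Year, name) pair, mirroring the shape of the `to_long(:Year)` output, and `restored` has the same columns as `wide`, mirroring `to_long(:Year).to_wide`.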