doc/DataFrame.md in red_amber-0.1.7 vs doc/DataFrame.md in red_amber-0.1.8

- old
+ new

@@ -858,20 +858,14 @@
   Remove any observations containing nil.
 ## Grouping
-### `group(aggregating_keys)`
+### `group(group_keys)`
-  (
-  This API will change in the future version. Especcially I want to change:
-  - Order of the column of the result (aggregation_keys should be the first)
-  - DataFrame#group will accept a block (heronshoes/red_amber #28)
-  )
-
   `group` creates a class `Group` object. `Group` accepts functions below as a method.
-  Method accepts options as `summary_keys`.
+  Method accepts options as `group_keys`.
   Available functions are:
   - [ ] all
   - [ ] any
@@ -887,41 +881,41 @@
   - ✓ stddev
   - ✓ sum
   - [ ] tdigest
   - ✓ variance
-  For the each group of `aggregation_keys`, the aggregation `function` is applied and returns a new dataframe with aggregated keys according to `summary_keys`.
-  Aggregated key name is `function(summary_key)` style.
+  For the each group of `group_keys`, the aggregation `function` is applied and returns a new dataframe with aggregated keys according to `summary_keys`.
+  Summary key names are provided by `function(summary_keys)` style.
   This is an example of grouping of famous STARWARS dataset.
   ```ruby
   starwars = RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
   starwars
   # =>
-  #<RedAmber::DataFrame : 87 x 12 Vectors, 0x00000000000773bc>
-  species name height mass hair_color skin_color eye_color ... homeworld
-  <string> <string> <int64> <double> <string> <string> <string> ... <string>
-  Human 1 Luke Skywalker 172 77.0 blond fair blue ... Tatooine
-  Droid 2 C-3PO 167 75.0 NA gold yellow ... Tatooine
-  Droid 3 R2-D2 96 32.0 NA white, blue red ... Naboo
-  Human 4 Darth Vader 202 136.0 none white yellow ... Tatooine
-  Human 5 Leia Organa 150 49.0 brown light brown ... Alderaan
-  : : : : : : : : ... :
-  Droid 85 BB8 (nil) (nil) none none black ... NA
-  NA 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
-  Human 87 Padmé Amidala 165 45.0 brown light brown ... Naboo
+  #<RedAmber::DataFrame : 87 x 12 Vectors, 0x0000000000005a50>
+  unnamed1 name height mass hair_color skin_color eye_color ... species
+  <int64> <string> <int64> <double> <string> <string> <string> ... <string>
+  1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
+  2 2 C-3PO 167 75.0 NA gold yellow ... Droid
+  3 3 R2-D2 96 32.0 NA white, blue red ... Droid
+  4 4 Darth Vader 202 136.0 none white yellow ... Human
+  5 5 Leia Organa 150 49.0 brown light brown ... Human
+  : : : : : : : : ... :
+  85 85 BB8 (nil) (nil) none none black ... Droid
+  86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
+  87 87 Padmé Amidala 165 45.0 brown light brown ... Human
   starwars.tdr(12)
   # =>
   RedAmber::DataFrame : 87 x 12 Vectors
   Vectors : 4 numeric, 8 strings
   # key type level data_preview
-  1 :"" int64 87 [1, 2, 3, 4, 5, ... ]
+  1 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ]
   2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
   3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils
   4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
   5 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ]
   6 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ]
@@ -931,84 +925,80 @@
   10 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4}
   11 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ]
   12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ]
   ```
-  We can aggregate for `:species` and calculate the mean of `:mass` and `:height`.
+  We can group by `:species` and calculate the count.
   ```ruby
-  grouped = starwars.group(:species).mean(:mass, :height)
-  grouped
+  starwars.group(:species).count(:species)
   # =>
-  #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000008e620>
-  mean(mass) mean(height) species
-  <double> <double> <string>
-  1 82.8 176.6 Human
-  2 69.8 131.2 Droid
-  3 124.0 231.0 Wookiee
-  4 74.0 173.0 Rodian
-  5 1358.0 175.0 Hutt
-  : : : :
-  36 159.0 216.0 Kaleesh
-  37 80.0 206.0 Pau'an
-  38 80.0 188.0 Kel Dor
+  #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0>
+  species count
+  <string> <int64>
+  1 Human 35
+  2 Droid 6
+  3 Wookiee 2
+  4 Rodian 1
+  5 Hutt 1
+  : : :
+  36 Kaleesh 1
+  37 Pau'an 1
+  38 Kel Dor 1
   ```
-  Select rows for count > 1.
-
+  We can also calculate the mean of `:mass` and `:height` together.
+
   ```ruby
-  count = starwars.group(:species).count(:species)[:'count(species)'] # => Vector
-  grouped = grouped.slice(count > 1)
+  grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] }
   # =>
-  #<RedAmber::DataFrame : 9 x 3 Vectors, 0x0000000000098260>
-  mean(mass) mean(height) species
-  <double> <double> <string>
-  1 82.8 176.6 Human
-  2 69.8 131.2 Droid
-  3 124.0 231.0 Wookiee
-  4 74.0 208.7 Gungan
-  5 48.0 181.3 NA
-  : : : :
-  7 55.0 179.0 Twi'lek
-  8 53.1 168.0 Mirialan
-  9 88.0 221.0 Kaminoan
+  #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc>
+  species count mean(height) mean(mass)
+  <string> <int64> <double> <double>
+  1 Human 35 176.6 82.8
+  2 Droid 6 131.2 69.8
+  3 Wookiee 2 231.0 124.0
+  4 Rodian 1 173.0 74.0
+  5 Hutt 1 175.0 1358.0
+  : : : : :
+  36 Kaleesh 1 216.0 159.0
+  37 Pau'an 1 206.0 80.0
+  38 Kel Dor 1 188.0 80.0
   ```
-  Assemble the result and change the order of columns.
-
-  ```ruby
-  grouped.assign(count: count[count > 1]).pick { [2,3,0,1].map{ |i| keys[i] } }
+  Select rows for count > 1.
+  ```ruby
+  grouped.slice(grouped[:count] > 1)
+
   # =>
-  #<RedAmber::DataFrame : 9 x 4 Vectors, 0x0000000000141838>
-  species count mean(mass) mean(height)
-  <string> <uint8> <double> <double>
-  1 Human 35 82.8 176.6
-  2 Droid 6 69.8 131.2
-  3 Wookiee 2 124.0 231.0
-  4 Gungan 3 74.0 208.7
-  5 NA 4 48.0 181.3
-  : : : : :
-  7 Twi'lek 2 55.0 179.0
-  8 Mirialan 2 53.1 168.0
-  9 Kaminoan 2 88.0 221.0
+  #<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000004c270>
+  species count mean(height) mean(mass)
+  <string> <int64> <double> <double>
+  1 Human 35 176.6 82.8
+  2 Droid 6 131.2 69.8
+  3 Wookiee 2 231.0 124.0
+  4 Gungan 3 208.7 74.0
+  5 NA 4 181.3 48.0
+  : : : : :
+  7 Twi'lek 2 179.0 55.0
+  8 Mirialan 2 168.0 53.1
+  9 Kaminoan 2 221.0 88.0
   ```
 ## Combining DataFrames
 - [ ] Combining rows to a dataframe
-- [ ] Add vars
-
 - [ ] Inner join
 - [ ] Left join
 ## Encoding
 - [ ] One-hot encoding
-## Iteration (not impremented)
+## Iteration
 - [ ] each_rows
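Note on the grouping change above: the 0.1.8 side of this diff groups by a key, counts rows, averages columns, and then keeps only groups with count > 1. A rough plain-Ruby sketch of that sequence may help when reading the diff without RedAmber installed. Everything here is an assumption for illustration: the rows are a hand-made sample rather than the starwars dataset, `mean_of` and `summarize` are hypothetical helpers and not RedAmber API, and skipping nils in the mean is assumed to match what the documented output implies.

```ruby
# Plain-Ruby approximation of group-by aggregation; this is NOT RedAmber code.
# Hypothetical sample rows (not the starwars dataset).
ROWS = [
  { species: 'Human', height: 172, mass: 77.0 },
  { species: 'Droid', height: 167, mass: 75.0 },
  { species: 'Droid', height: 96,  mass: 32.0 },
  { species: 'Human', height: 202, mass: nil },
  { species: 'Hutt',  height: 175, mass: 1358.0 }
].freeze

# Mean of the non-nil values; nil when no values remain (assumed behavior).
def mean_of(values)
  present = values.compact
  return nil if present.empty?

  present.sum.to_f / present.size
end

# Hypothetical helper: one summary row per distinct value of group_key,
# with the group key first, then the count, then one mean per summary key.
def summarize(rows, group_key, summary_keys)
  rows.group_by { |row| row[group_key] }.map do |key, members|
    summary = { group_key => key, count: members.size }
    summary_keys.each do |k|
      summary[:"mean(#{k})"] = mean_of(members.map { |row| row[k] })
    end
    summary
  end
end

grouped = summarize(ROWS, :species, %i[height mass])

# Keep only groups with more than one member, like the "count > 1" step.
frequent = grouped.select { |row| row[:count] > 1 }
frequent.each { |row| p row }
```

The sketch puts the group key first and names the summary columns in `function(key)` style only to mirror the column order and naming shown on the `+` side of the diff; it says nothing about how RedAmber implements grouping internally.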