doc/DataFrame.md in red_amber-0.1.4 vs doc/DataFrame.md in red_amber-0.1.5

- old
+ new

@@ -1,25 +1,25 @@
 # DataFrame
-Class `RedAmber::DataFrame` represents 2D-data. `DataFrame` consists with:
+Class `RedAmber::DataFrame` represents 2D-data. A `DataFrame` consists with:
 - A collection of data which have same data type within. We call it `Vector`.
 - A label is attached to `Vector`. We call it `key`.
 - A `Vector` and associated `key` is grouped as a `variable`.
 - `variable`s with same vector length are aligned and arranged to be a `DaTaFrame`.
 - Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `observation`.
 ![dataframe model image](doc/../image/dataframe_model.png)
 ## Constructors and saving
-### `new` from a columnar Hash
+### `new` from a Hash
 ```ruby
 RedAmber::DataFrame.new(x: [1, 2, 3])
 ```
-### `new` from a schema (by Hash) and rows (by Array)
+### `new` from a schema (by Hash) and data (by Array)
 ```ruby
 RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
 ```
@@ -50,11 +50,11 @@
 - from a string buffer
 - from a URI
   ```ruby
-  uri = URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv")
+  uri = URI("uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
   RedAmber::DataFrame.load(uri)
   ```
 - from a Parquet file
@@ -76,11 +76,11 @@
   dataframe.save("file.parquet")
   ```
 ## Properties
-### `table`
+### `table`, `to_arrow`
 - Reader of Arrow::Table object inside.
 ### `size`, `n_obs`, `n_rows`
@@ -91,20 +91,57 @@
 - Returns num of keys (num of variables).
 ### `shape`
 - Returns shape in an Array[n_rows, n_cols].
-
+
+### `variables`
+
+- Returns key names and Vectors pair in a Hash.
+
+  It is convenient to use in a block when both key and vector required. We will write:
+
+  ```ruby
+  # update numeric variables
+  df.assign do
+    variables.select.with_object({}) do |(key, vector), assigner|
+      assigner[key] = vector * -1 if vector.numeric?
+    end
+  end
+  ```
+
+  Instead of:
+  ```ruby
+  df.assign do
+    assigner = {}
+    vectors.each_with_index do |vector, i|
+      assigner[keys[i]] = vector * -1 if vector.numeric?
+    end
+    assigner
+  end
+  ```
+
 ### `keys`, `var_names`, `column_names`
 - Returns key names in an Array.
+  When we use it with vectors, Vector#key is useful to get the key inside of DataFrame.
+
+  ```ruby
+  # update numeric variables, another solution
+  df.assign do
+    vectors.each_with_object({}) do |vector, assigner|
+      assigner[vector.key] = vector * -1 if vector.numeric?
+    end
+  end
+  ```
+
 ### `types`
 - Returns types of vectors in an Array of Symbols.
-### `data_types`
+### `type_classes`
 - Returns types of vector in an Array of `Arrow::DataType`.
 ### `vectors`
@@ -165,11 +202,11 @@
   6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
   7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
   8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
   ```
-  - limit: limits variable number to show. Default value is 10.
+  - limit: limit of variables to show. Default value is 10.
   - tally: max level to use tally mode.
   - elements: max num of element to show values in each observations.
 ### `inspect`
@@ -222,12 +259,21 @@
   df[:a]
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
   [1, 2, 3]
   ```
-  This may be useful to use in a block of DataFrame manipulations.
+  Or `#v` method also returns a Vector for a key.
+  ```ruby
+  df.v(:a)
+  # =>
+  #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
+  [1, 2, 3]
+  ```
+
+  This may be useful to use in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`
+
 ### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
 - Select a obs. by index: `df[0]`
 - Select obs.
by indeces in a Range: `df[1..2]`
@@ -265,17 +311,17 @@
   1 :a uint8 1 [1]
   2 :b string 1 ["A"]
   3 :c double 1 [1.0]
   ```
-### Select rows from top or bottom
+### Select rows from top or from bottom
 `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
 ## Sub DataFrame manipulations
-### `pick`
+### `pick ` - pick up variables by key label -
 Pick up some variables (columns) to create a sub DataFrame.
 ![pick method image](doc/../image/dataframe/pick.png)
@@ -311,21 +357,22 @@
 - Keys or booleans by a block
   `pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, or a boolean Array with a same length as `n_keys`. Block is called in the context of self.
   ```ruby
+  # It is ok to write `keys ...` in the block, not `penguins.keys ...`
   penguins.pick { keys.map { |key| key.end_with?('mm') } }
   # =>
   #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000000f1cc>
   Vectors : 3 numeric
   # key type level data_preview
   1 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
   2 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
   3 :flipper_length_mm int64 56 [181, 186, 195, nil, 193, ... ], 2 nils
   ```
-### `drop`
+### `drop ` - pick and drop -
 Drop some variables (columns) to create a remainer DataFrame.
 ![drop method image](doc/../image/dataframe/drop.png)
@@ -350,29 +397,29 @@
   booleans_invert = booleans.map(&:!) # => [false, true, true]
   df.pick(booleans) == df.drop(booleans_invert) # => true
   ```
 - Difference between `pick`/`drop` and `[]`
-  If `pick` or `drop` will select single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`.
+  If `pick` or `drop` will select a single variable (column), it returns a `DataFrame` with one variable. In contrast, `[]` returns a `Vector`.
   This behavior may be useful to use in a block of DataFrame manipulations.
   ```ruby
   df = RedAmber::DataFrame.new(a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3])
-  df[:a]
-  # =>
-  #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
-  [1, 2, 3]
-
   df.pick(:a) # or df.drop(:b, :c)
   # =>
   #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000000f280>
   Vector : 1 numeric
   # key type level data_preview
   1 :a uint8 3 [1, 2, 3]
+
+  df[:a]
+  # =>
+  #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
+  [1, 2, 3]
   ```
-### `slice`
+### `slice ` - to cut vertically is slice -
 Slice and select observations (rows) to create a sub DataFrame.
 ![slice method image](doc/../image/dataframe/slice.png)
@@ -486,21 +533,21 @@
   ```ruby
   # remove all observation contains nil
   removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
   removed.tdr
   # =>
-  RedAmber::DataFrame : 342 x 8 Vectors
+  RedAmber::DataFrame : 333 x 8 Vectors
   Vectors : 5 numeric, 3 strings
   # key type level data_preview
-  1 :species string 3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
-  2 :island string 3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
-  3 :bill_length_mm double 164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
-  4 :bill_depth_mm double 80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
-  5 :flipper_length_mm int64 55 [181, 186, 195, 193, 190, ... ]
-  6 :body_mass_g int64 94 [3750, 3800, 3250, 3450, 3650, ... ]
-  7 :sex string 3 {"male"=>168, "female"=>165, ""=>9}
-  8 :year int64 3 {2007=>109, 2008=>114, 2009=>119}
+  1 :species string 3 {"Adelie"=>146, "Chinstrap"=>68, "Gentoo"=>119}
+  2 :island string 3 {"Torgersen"=>47, "Biscoe"=>163, "Dream"=>123}
+  3 :bill_length_mm double 163 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
+  4 :bill_depth_mm double 79 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
+  5 :flipper_length_mm uint8 54 [181, 186, 195, 193, 190, ... ]
+  6 :body_mass_g uint16 93 [3750, 3800, 3250, 3450, 3650, ... ]
+  7 :sex string 2 {"male"=>168, "female"=>165}
+  8 :year uint16 3 {2007=>103, 2008=>113, 2009=>117}
   ```
 - Keys or booleans by a block
   `remove {block}` is also acceptable.
   We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as `size`. Block is called in the context of self.
@@ -581,11 +628,11 @@
   Symbol key and String key are distinguished.
 ### `assign`
-  Assign new variables (columns) and create a updated DataFrame.
+  Assign new or updated variables (columns) and create a updated DataFrame.
 - Variables with new keys will append new variables at bottom (right in the table).
 - Variables with exisiting keys will update corresponding vectors.
 ![assign method image](doc/../image/dataframe/assign.png)
@@ -647,32 +694,135 @@
   Vectors : 2 numeric, 1 string
   # key type level data_preview
   1 :index int8 5 [0, -1, -2, -3, nil], 1 nil
   2 :float double 5 [-0.0, -1.1, -2.2, NaN, nil], 1 NaN, 1 nil
   3 :string string 5 ["A", "B", "C", "D", nil], 1 nil
+
+  # Or it ’s shorter like this:
+  df.assign do
+    variables.select.with_object({}) do |(key, vector), assigner|
+      assigner[key] = vector * -1 if vector.numeric?
+    end
+  end
+  # => same as above
   ```
 - Key type
   Symbol key and String key are considered as the same key.
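
The block form of `assign` in the hunk above collects a Hash of key and updated values, then returns it. As a rough sketch only, the same accumulation pattern can be tried in plain Ruby without RedAmber. Here a Hash of Arrays stands in for `variables`, and `numeric_column?` is a hypothetical helper introduced for illustration, not part of the library's API.

```ruby
# Plain-Ruby stand-in for DataFrame#variables (assumption): key => column values.
variables = {
  index: [0, 1, 2, 3, nil],
  float: [0.0, 1.1, 2.2, Float::NAN, nil],
  string: ['A', 'B', 'C', 'D', nil]
}

# Hypothetical helper: treat a column as numeric when every non-nil value is Numeric.
def numeric_column?(values)
  values.compact.all?(Numeric)
end

# Build the "assigner" Hash: only numeric columns are included, each negated.
assigner = variables.each_with_object({}) do |(key, values), memo|
  next unless numeric_column?(values)

  memo[key] = values.map { |v| v.nil? ? nil : v * -1 }
end

p assigner.keys
```

Only the numeric columns end up in the resulting Hash, which mirrors how the documented block updates numeric variables and leaves the string variable untouched.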
 ## Updating
-- [ ] Update elements matching a condition
+### `sort`
-- [ ] Clamp
+  `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature。
+  - :key, "key" or "+key" denotes ascending order
+  - "-key" denotes descending order
-- [ ] Sort rows
+  ```ruby
+  df = RedAmber::DataFrame.new({
+    index: [1, 1, 0, nil, 0],
+    string: ['C', 'B', nil, 'A', 'B'],
+    bool: [nil, true, false, true, false],
+  })
+  df.sort(:index, '-bool').tdr(tally: 0)
+  # =>
+  RedAmber::DataFrame : 5 x 3 Vectors
+  Vectors : 1 numeric, 1 string, 1 boolean
+  # key type level data_preview
+  1 :index uint8 3 [0, 0, 1, 1, nil], 1 nil
+  2 :string string 4 [nil, "B", "B", "C", "A"], 1 nil
+  3 :bool boolean 3 [false, false, true, nil, true], 1 nil
+  ```
+- [ ] Clamp
+
 - [ ] Clear data
 ## Treat na data
-- [ ] Drop na (NaN, nil)
+### `remove_nil`
-- [ ] Replace na with value
+  Remove any observations containing nil.
-- [ ] Interpolate na with convolution array
+## Grouping
+
+### `group(aggregating_keys, function, target_keys)`
+
+  Create grouped dataframe by `aggregation_keys` and apply `function` to each group and returns in `target_keys`. Aggregated key name is `function(key)` style.
+
+  (The current implementation is not intuitive. Needs improvement.)
+
+  ```ruby
+  ds = Datasets::Rdatasets.new('dplyr', 'starwars')
+  starwars = RedAmber::DataFrame.new(ds.to_table.to_h)
+  starwars.tdr(11)
+  # =>
+  RedAmber::DataFrame : 87 x 11 Vectors
+  Vectors : 3 numeric, 8 strings
+  # key type level data_preview
+  1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ]
+  2 :height uint16 46 [172, 167, 96, 202, 150, ... ], 6 nils
+  3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils
+  4 :hair_color string 13 ["blond", nil, nil, "none", "brown", ... ], 5 nils
+  5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", .. . ]
+  6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ...
]
+  7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils
+  8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, nil=>4}
+  9 :gender string 3 {"masculine"=>66, "feminine"=>17, nil=>4}
+  10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ], 10 nils
+  11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ], 4 nils
+
+  grouped = starwars.group(:species, :mean, [:mass, :height])
+  # =>
+  #<RedAmber::DataFrame : 38 x 3 Vectors, 0x000000000000fbf4>
+  Vectors : 2 numeric, 1 string
+  # key type level data_preview
+  1 :"mean(mass)" double 27 [82.78181818181818, 69.75, 124.0, 74.0, 1358.0, ... ], 6 nils
+  2 :"mean(height)" double 32 [176.6451612903226, 131.2, 231.0, 173.0, 175.0, ... ]
+  3 :species string 38 ["Human", "Droid", "Wookiee", "Rodian", "Hutt", ... ], 1 nil
+
+  count = starwars.group(:species, :count, :species)[:"count(species)"]
+  df = grouped.slice(count > 1)
+  # =>
+  #<RedAmber::DataFrame : 8 x 3 Vectors, 0x000000000000fc44>
+  Vectors : 2 numeric, 1 string
+  # key type level data_preview
+  1 :"mean(mass)" double 8 [82.78181818181818, 69.75, 124.0, 74.0, 80.0, ... ]
+  2 :"mean(height)" double 8 [176.6451612903226, 131.2, 231.0, 208.66666666666666, 173.0, ... ]
+  3 :species string 8 ["Human", "Droid", "Wookiee", "Gungan", "Zabrak", ...
]
+
+  df.table
+  # =>
+  #<Arrow::Table:0x1165593c8 ptr=0x7fb3db144c70>
+  mean(mass) mean(height) species
+  0 82.781818 176.645161 Human
+  1 69.750000 131.200000 Droid
+  2 124.000000 231.000000 Wookiee
+  3 74.000000 208.666667 Gungan
+  4 80.000000 173.000000 Zabrak
+  5 55.000000 179.000000 Twi'lek
+  6 53.100000 168.000000 Mirialan
+  7 88.000000 221.000000 Kaminoan
+  ```
+
+  Available functions are:
+
+  - [ ] all
+  - [ ] any
+  - [ ] approximate_median
+  - ✓ count
+  - [ ] count_distinct
+  - [ ] distinct
+  - ✓ max
+  - ✓ mean
+  - ✓ min
+  - [ ] min_max
+  - ✓ product
+  - ✓ stddev
+  - ✓ sum
+  - [ ] tdigest
+  - ✓ variance
 ## Combining DataFrames
 - [ ] obs