README.md in red_amber-0.1.7 vs README.md in red_amber-0.1.8

- old
+ new

@@ -54,11 +54,11 @@ ```ruby require 'red_amber' # require 'red-amber' is also OK. require 'datasets-arrow' arrow = Datasets::Penguins.new.to_arrow -RedAmber::DataFrame.new(arrow) +penguins = RedAmber::DataFrame.new(arrow) # => #<RedAmber::DataFrame : 344 x 8 Vectors, 0x0000000000013790> species island bill_length_mm bill_depth_mm flipper_length_mm ... year <string> <string> <double> <double> <uint8> ... <uint16> @@ -76,32 +76,75 @@ ### DataFrame model ![dataframe model of RedAmber](doc/image/dataframe_model.png) For example, `DataFrame#pick` accepts keys as an argument and returns a sub DataFrame. +![pick method image](doc/image/dataframe/pick.png) + ```ruby -df = penguins.pick(:body_mass_g) +penguins.keys +# => +[:species, + :island, + :bill_length_mm, + :bill_depth_mm, + :flipper_length_mm, + :body_mass_g, + :sex, + :year] + +df = penguins.pick(:species, :island, :body_mass_g) df # => -#<RedAmber::DataFrame : 344 x 1 Vector, 0x0000000000015cc0> - body_mass_g - <uint16> - 1 3750 - 2 3800 - 3 3250 - 4 (nil) - 5 3450 - : : -342 5750 -343 5200 +#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003cc1c> + species island body_mass_g + <string> <string> <uint16> + 1 Adelie Torgersen 3750 + 2 Adelie Torgersen 3800 + 3 Adelie Torgersen 3250 + 4 Adelie Torgersen (nil) + 5 Adelie Torgersen 3450 + : : : : +342 Gentoo Biscoe 5750 +343 Gentoo Biscoe 5200 +344 Gentoo Biscoe 5400 +``` + +`DataFrame#drop` drops some columns to create a remainer DataFrame. + +![drop method image](doc/image/dataframe/drop.png) + +You can specify by keys or a boolean array (same size as n_keys). + +```ruby +# Same as df.drop(:species, :island) +df = df.drop(true, true, false) + +# => +#<RedAmber::DataFrame : 344 x 1 Vector, 0x0000000000048760> + body_mass_g + <uint16> + 1 3750 + 2 3800 + 3 3250 + 4 (nil) + 5 3450 + : : +342 5750 +343 5200 344 5400 ``` +Arrow data is immutable, so these methods always return an new object. + `DataFrame#assign` creates new variables (column in the table). +![assign method image](doc/image/dataframe/assign.png) + ```ruby +# New column is created because ':body_mass_kg' is a new key. df.assign(:body_mass_kg => df[:body_mass_g] / 1000.0) # => #<RedAmber::DataFrame : 344 x 2 Vectors, 0x00000000000212f0> body_mass_g body_mass_kg @@ -115,16 +158,101 @@ 342 5750 5.8 343 5200 5.2 344 5400 5.4 ``` +`DataFrame#slice` selects rows (observations) to create a sub DataFrame. + +![slice method image](doc/image/dataframe/slice.png) + +```ruby +# returns 5 rows at the start and 5 rows from the end +penguins.slice(0...5, -5..-1) + +# => +#<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4> + species island bill_length_mm bill_depth_mm flipper_length_mm ... year + <string> <string> <double> <double> <uint8> ... <uint16> + 1 Adelie Torgersen 39.1 18.7 181 ... 2007 + 2 Adelie Torgersen 39.5 17.4 186 ... 2007 + 3 Adelie Torgersen 40.3 18.0 195 ... 2007 + 4 Adelie Torgersen (nil) (nil) (nil) ... 2007 + 5 Adelie Torgersen 36.7 19.3 193 ... 2007 + : : : : : : ... : + 8 Gentoo Biscoe 50.4 15.7 222 ... 2009 + 9 Gentoo Biscoe 45.2 14.8 212 ... 2009 +10 Gentoo Biscoe 49.9 16.1 213 ... 2009 +``` + +`DataFrame#remove` rejects rows (observations) to create a remainer DataFrame. + +![remove method image](doc/image/dataframe/remove.png) + +```ruby +# penguins[:bill_length_mm] < 40 returns a boolean Vector +penguins.remove(penguins[:bill_length_mm] < 40) + +# => +#<RedAmber::DataFrame : 244 x 8 Vectors, 0x000000000007d6f4> + species island bill_length_mm bill_depth_mm flipper_length_mm ... year + <string> <string> <double> <double> <uint8> ... <uint16> + 1 Adelie Torgersen 40.3 18.0 195 ... 2007 + 2 Adelie Torgersen (nil) (nil) (nil) ... 2007 + 3 Adelie Torgersen 42.0 20.2 190 ... 2007 + 4 Adelie Torgersen 41.1 17.6 182 ... 2007 + 5 Adelie Torgersen 42.5 20.7 197 ... 2007 + : : : : : : ... : +242 Gentoo Biscoe 50.4 15.7 222 ... 2009 +243 Gentoo Biscoe 45.2 14.8 212 ... 2009 +244 Gentoo Biscoe 49.9 16.1 213 ... 2009 +``` + DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove`, `rename` and `assign` accept a block. -This is an exaple to eliminate observations (row in the table) containing nil. +This example is usage of block to update numeric columns. ```ruby -# remove all observation contains nil +df = RedAmber::DataFrame.new( + integer: [0, 1, 2, 3, nil], + float: [0.0, 1.1, 2.2, Float::NAN, nil], + string: ['A', 'B', 'C', 'D', nil], + boolean: [true, false, true, false, nil]) +df + +# => +#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000003131c> + integer float string boolean + <uint8> <double> <string> <boolean> +1 0 0.0 A true +2 1 1.1 B false +3 2 2.2 C true +4 3 NaN D false +5 (nil) (nil) (nil) (nil) + +df.assign do + vectors.each_with_object({}) do |v, h| + h[v.key] = -v if v.numeric? + end +end + +# => +#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000009a1b4> + integer float string boolean + <uint8> <double> <string> <boolean> +1 0 -0.0 A true +2 255 -1.1 B false +3 254 -2.2 C true +4 253 NaN D false +5 (nil) (nil) (nil) (nil) +``` + +Negate (-@) method of unsigned integer Vector returns complement. + +Next example is to eliminate observations (row in the table) containing nil. + +```ruby +# remove all observations containing nil nil_removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) } nil_removed.tdr # => RedAmber::DataFrame : 342 x 8 Vectors Vectors : 5 numeric, 3 strings @@ -143,30 +271,92 @@ ```ruby penguins.remove_nil # => same result as above ``` +`DataFrame#group` method can be used for the grouping tasks. + +```ruby +starwars = RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv")) +starwars + +# => +#<RedAmber::DataFrame : 87 x 12 Vectors, 0x000000000000607c> + unnamed1 name height mass hair_color skin_color eye_color ... species + <int64> <string> <int64> <double> <string> <string> <string> ... <string> + 1 1 Luke Skywalker 172 77.0 blond fair blue ... Human + 2 2 C-3PO 167 75.0 NA gold yellow ... Droid + 3 3 R2-D2 96 32.0 NA white, blue red ... Droid + 4 4 Darth Vader 202 136.0 none white yellow ... Human + 5 5 Leia Organa 150 49.0 brown light brown ... Human + : : : : : : : : ... : +85 85 BB8 (nil) (nil) none none black ... Droid +86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA +87 87 Padmé Amidala 165 45.0 brown light brown ... Human + +grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] } +grouped.slice { v(:count) > 1 } + +# => +#<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000006e848> + species count mean(height) mean(mass) + <string> <int64> <double> <double> +1 Human 35 176.6 82.8 +2 Droid 6 131.2 69.8 +3 Wookiee 2 231.0 124.0 +4 Gungan 3 208.7 74.0 +5 NA 4 181.3 48.0 +: : : : : +7 Twi'lek 2 179.0 55.0 +8 Mirialan 2 168.0 53.1 +9 Kaminoan 2 221.0 88.0 +``` + See [DataFrame.md](doc/DataFrame.md) for details. ## `RedAmber::Vector` Class `RedAmber::Vector` represents a series of data in the DataFrame. +Method `RedAmber::DataFrame#[key]` returns a Vector with the key `key`. ```ruby penguins[:bill_length_mm] # => #<RedAmber::Vector(:double, size=344):0x000000000000f8fc> [39.1, 39.5, 40.3, nil, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, ... ] ``` Vectors accepts some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html). +This is an element-wise comparison and returns a boolean Vector of same size. + +![unary element-wise](doc/image/vector/unary_element_wise.png) + +```ruby +penguins[:bill_length_mm] < 40 + +# => +#<RedAmber::Vector(:boolean, size=344):0x000000000007e7ac> +[true, true, false, nil, true, true, true, true, true, false, true, true, false, ... ] +``` + +Next example returns aggregated result. + +![unary aggregation](doc/image/vector/unary_aggregation.png) + +```ruby +penguins[:bill_length_mm].mean +43.92192982456141 +# => + +``` + See [Vector.md](doc/Vector.md) for details. ## Jupyter notebook -[47 Examples of Red Amber](doc/47_examples_of_red_amber.ipynb) +[53 Examples of Red Amber](doc/examples_of_red_amber.ipynb) ## Development ```shell git clone https://github.com/heronshoes/red_amber.git