README.md in red_amber-0.2.1 vs README.md in red_amber-0.2.2
- old
+ new
@@ -1,15 +1,18 @@
# RedAmber
[![Gem Version](https://badge.fury.io/rb/red_amber.svg)](https://badge.fury.io/rb/red_amber)
[![Ruby](https://github.com/heronshoes/red_amber/actions/workflows/test.yml/badge.svg)](https://github.com/heronshoes/red_amber/actions/workflows/test.yml)
+[![Discussions](https://img.shields.io/github/discussions/heronshoes/red_amber)](https://github.com/heronshoes/red_amber/discussions)
A simple dataframe library for Ruby.
- Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) [![Gitter Chat](https://badges.gitter.im/red-data-tools/en.svg)](https://gitter.im/red-data-tools/en)
- Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
+![screenshot from jupyterlab](doc/image/screenshot.png)
+
## Requirements
Supported Ruby version is >= 2.7.
Since v0.2.0, this library uses pattern matching which is an experimental feature in 2.7 . It is usable but a warning message will be shown in 2.7 .
@@ -55,353 +58,153 @@
## Docker image and Jupyter Notebook
[RubyData Docker Stacks](https://github.com/RubyData/docker-stacks) is available as a ready-to-run Docker image containing Jupyter and useful data tools as well as RedAmber (Thanks to @mrkn).
-Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/RubyData/docker-stacks/master?filepath=red-amber.ipynb).
-[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/RubyData/docker-stacks/master?filepath=red-amber.ipynb)
+Also you can try the contents of this README interactively by [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=README.ipynb).
+[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=red-amber.ipynb)
+## Data frame in `RedAmber`
-## `RedAmber::DataFrame`
+Class `RedAmber::DataFrame` represents a set of data in 2D-shape.
+The entity is a Red Arrow's Table object.
-It represents a set of data in 2D-shape. The entity is a Red Arrow's Table object.
-
![dataframe model of RedAmber](doc/image/dataframe_model.png)
+Load the library.
+
```ruby
require 'red_amber' # require 'red-amber' is also OK.
-require 'datasets-arrow'
-
-arrow = Datasets::Penguins.new.to_arrow
-penguins = RedAmber::DataFrame.new(arrow)
-
-# =>
-#<RedAmber::DataFrame : 344 x 8 Vectors, 0x0000000000013790>
- species island bill_length_mm bill_depth_mm flipper_length_mm ... year
- <string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 39.1 18.7 181 ... 2007
- 2 Adelie Torgersen 39.5 17.4 186 ... 2007
- 3 Adelie Torgersen 40.3 18.0 195 ... 2007
- 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
- 5 Adelie Torgersen 36.7 19.3 193 ... 2007
- : : : : : : ... :
-342 Gentoo Biscoe 50.4 15.7 222 ... 2009
-343 Gentoo Biscoe 45.2 14.8 212 ... 2009
-344 Gentoo Biscoe 49.9 16.1 213 ... 2009
+include RedAmber
```
-For example, `DataFrame#pick` accepts keys as arguments and returns a sub DataFrame.
+### Example: diamonds dataset
-![pick method image](doc/image/dataframe/pick.png)
-
```ruby
-penguins.keys
-# =>
-[:species,
- :island,
- :bill_length_mm,
- :bill_depth_mm,
- :flipper_length_mm,
- :body_mass_g,
- :sex,
- :year]
+require 'datasets-arrow' # to load sample data
-df = penguins.pick(:species, :island, :body_mass_g)
-df
+dataset = Datasets::Diamonds.new
+diamonds = DataFrame.new(dataset) # from v0.2.2, should be `dataset.to_arrow` if older.
# =>
-#<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003cc1c>
- species island body_mass_g
- <string> <string> <uint16>
- 1 Adelie Torgersen 3750
- 2 Adelie Torgersen 3800
- 3 Adelie Torgersen 3250
- 4 Adelie Torgersen (nil)
- 5 Adelie Torgersen 3450
- : : : :
-342 Gentoo Biscoe 5750
-343 Gentoo Biscoe 5200
-344 Gentoo Biscoe 5400
+#<RedAmber::DataFrame : 53940 x 10 Vectors, 0x000000000000f668>
+ carat cut color clarity depth table price x ... z
+ <double> <string> <string> <string> <double> <double> <uint16> <double> ... <double>
+ 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 ... 2.43
+ 1 0.21 Premium E SI1 59.8 61.0 326 3.89 ... 2.31
+ 2 0.23 Good E VS1 56.9 65.0 327 4.05 ... 2.31
+ 3 0.29 Premium I VS2 62.4 58.0 334 4.2 ... 2.63
+ 4 0.31 Good J SI2 63.3 58.0 335 4.34 ... 2.75
+ : : : : : : : : : ... :
+53937 0.7 Very Good D SI1 62.8 60.0 2757 5.66 ... 3.56
+53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 ... 3.74
+53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 ... 3.64
```
-`DataFrame#drop` drops some columns to create a remainer DataFrame.
+For example, we can compute mean prices per 'cut' for the data larger than 1 carat.
-![drop method image](doc/image/dataframe/drop.png)
-
-You can specify by keys or a boolean array of same size as n_keys.
-
```ruby
-# Same as df.drop(:species, :island)
-df = df.drop(true, true, false)
+df = diamonds
+ .slice { carat > 1 }
+ .group(:cut)
+ .mean(:price) # `pick` prior to `group` is not required if `:price` is specified here.
+ .sort('-mean(price)')
# =>
-#<RedAmber::DataFrame : 344 x 1 Vector, 0x0000000000048760>
- body_mass_g
- <uint16>
- 1 3750
- 2 3800
- 3 3250
- 4 (nil)
- 5 3450
- : :
-342 5750
-343 5200
-344 5400
+#<RedAmber::DataFrame : 5 x 2 Vectors, 0x000000000000f67c>
+ cut mean(price)
+ <string> <double>
+0 Ideal 8674.23
+1 Premium 8487.25
+2 Very Good 8340.55
+3 Good 7753.6
+4 Fair 7177.86
```
-Arrow data is immutable, so these methods always return an new object.
+Arrow data is immutable, so these methods always return new objects.
+Next example will rename a column and create a new column by simple calcuration.
-`DataFrame#assign` creates new columns or update existing columns.
-
-![assign method image](doc/image/dataframe/assign.png)
-
```ruby
-# New column is created because ':body_mass_kg' is a new key.
-df.assign(:body_mass_kg => df[:body_mass_g] / 1000.0)
+usdjpy = 110.0
-# =>
-#<RedAmber::DataFrame : 344 x 2 Vectors, 0x00000000000212f0>
- body_mass_g body_mass_kg
- <uint16> <double>
- 1 3750 3.8
- 2 3800 3.8
- 3 3250 3.3
- 4 (nil) (nil)
- 5 3450 3.5
- : : :
-342 5750 5.8
-343 5200 5.2
-344 5400 5.4
-```
+df.rename('mean(price)': :mean_price_USD)
+ .assign(:mean_price_JPY) { mean_price_USD * usdjpy }
-`DataFrame#slice` selects rows (observations) to create a sub DataFrame.
-
-![slice method image](doc/image/dataframe/slice.png)
-
-```ruby
-# returns 5 rows at the start and 5 rows from the end
-penguins.slice(0...5, -5..-1)
-
# =>
-#<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
- species island bill_length_mm bill_depth_mm flipper_length_mm ... year
- <string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 39.1 18.7 181 ... 2007
- 2 Adelie Torgersen 39.5 17.4 186 ... 2007
- 3 Adelie Torgersen 40.3 18.0 195 ... 2007
- 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
- 5 Adelie Torgersen 36.7 19.3 193 ... 2007
- : : : : : : ... :
- 8 Gentoo Biscoe 50.4 15.7 222 ... 2009
- 9 Gentoo Biscoe 45.2 14.8 212 ... 2009
-10 Gentoo Biscoe 49.9 16.1 213 ... 2009
+#<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000000f71c>
+ cut mean_price_USD mean_price_JPY
+ <string> <double> <double>
+0 Ideal 8674.23 954164.93
+1 Premium 8487.25 933597.34
+2 Very Good 8340.55 917460.37
+3 Good 7753.6 852896.11
+4 Fair 7177.86 789564.12
```
-`DataFrame#remove` rejects rows (observations) to create a remainer DataFrame.
+### Example: starwars dataset
-![remove method image](doc/image/dataframe/remove.png)
+Next example is `starwars` dataset reading from the downloaded CSV file. Followed by minimum data cleansing.
```ruby
-# penguins[:bill_length_mm] < 40 returns a boolean Vector
-penguins.remove(penguins[:bill_length_mm] < 40)
+uri = URI('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv')
-# =>
-#<RedAmber::DataFrame : 244 x 8 Vectors, 0x000000000007d6f4>
- species island bill_length_mm bill_depth_mm flipper_length_mm ... year
- <string> <string> <double> <double> <uint8> ... <uint16>
- 1 Adelie Torgersen 40.3 18.0 195 ... 2007
- 2 Adelie Torgersen (nil) (nil) (nil) ... 2007
- 3 Adelie Torgersen 42.0 20.2 190 ... 2007
- 4 Adelie Torgersen 41.1 17.6 182 ... 2007
- 5 Adelie Torgersen 42.5 20.7 197 ... 2007
- : : : : : : ... :
-242 Gentoo Biscoe 50.4 15.7 222 ... 2009
-243 Gentoo Biscoe 45.2 14.8 212 ... 2009
-244 Gentoo Biscoe 49.9 16.1 213 ... 2009
-```
+starwars = DataFrame.load(uri)
-DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove`, `rename` and `assign` accept a block.
-
-Previous example is also OK with a block.
-
-```ruby
-penguins.remove { bill_length_mm < 40 }
-```
-
-Next example is an usage of block to update a column.
-
-```ruby
-df = RedAmber::DataFrame.new(
- integer: [0, 1, 2, 3, nil],
- float: [0.0, 1.1, 2.2, Float::NAN, nil],
- string: ['A', 'B', 'C', 'D', nil],
- boolean: [true, false, true, false, nil])
-df
-
-# =>
-#<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000003131c>
- integer float string boolean
- <uint8> <double> <string> <boolean>
-1 0 0.0 A true
-2 1 1.1 B false
-3 2 2.2 C true
-4 3 NaN D false
-5 (nil) (nil) (nil) (nil)
-
-df.assign do
- vectors.select(&:float?).map { |v| [v.key, -v] }
- # => returns [[:float], [-0.0, -1.1, -2.2, NAN, nil]]
-end
-
-# =>
-#<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000e270c>
- index float string
- <uint8> <double> <string>
-1 0 -0.0 A
-2 1 -1.1 B
-3 2 -2.2 C
-4 3 NaN D
-5 (nil) (nil) (nil)
-```
-
-Next example is to eliminate rows containing nil.
-
-```ruby
-# remove all observations containing nil
-nil_removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
-nil_removed.tdr
-
-# =>
-RedAmber::DataFrame : 342 x 8 Vectors
-Vectors : 5 numeric, 3 strings
-# key type level data_preview
-1 :species string 3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
-2 :island string 3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
-3 :bill_length_mm double 164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
-4 :bill_depth_mm double 80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
-5 :flipper_length_mm int64 55 [181, 186, 195, 193, 190, ... ]
-6 :body_mass_g int64 94 [3750, 3800, 3250, 3450, 3650, ... ]
-7 :sex string 3 {"male"=>168, "female"=>165, ""=>9}
-8 :year int64 3 {2007=>109, 2008=>114, 2009=>119}
-```
-
-For this frequently needed task, we can do it much simpler.
-
-```ruby
-penguins.remove_nil # => same result as above
-```
-
-`DataFrame#summary` shows summary statistics in a DataFrame.
-
-```ruby
-puts penguins.summary.to_s(width: 82)
-
-# =>
- variables count mean std min 25% median 75% max
- <dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double>
-1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6
-2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5
-3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0
-4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0
-5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0
-```
-
-`DataFrame#group` method can be used for the grouping tasks.
-
-```ruby
-starwars = RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))
starwars
+ .drop(0) # delete unnecessary index column
+ .remove { species == "NA" } # delete unnecessary rows
+ .group(:species) { [count(:species), mean(:height, :mass)] }
+ .slice { count > 1 }
# =>
-#<RedAmber::DataFrame : 87 x 12 Vectors, 0x000000000000607c>
- unnamed1 name height mass hair_color skin_color eye_color ... species
- <int64> <string> <int64> <double> <string> <string> <string> ... <string>
- 1 1 Luke Skywalker 172 77.0 blond fair blue ... Human
- 2 2 C-3PO 167 75.0 NA gold yellow ... Droid
- 3 3 R2-D2 96 32.0 NA white, blue red ... Droid
- 4 4 Darth Vader 202 136.0 none white yellow ... Human
- 5 5 Leia Organa 150 49.0 brown light brown ... Human
- : : : : : : : : ... :
-85 85 BB8 (nil) (nil) none none black ... Droid
-86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA
-87 87 Padmé Amidala 165 45.0 brown light brown ... Human
-
-starwars.group(:species) { [count(:species), mean(:height, :mass)] }
- .slice { count > 1 }
-
-# =>
-#<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000006e848>
+#<RedAmber::DataFrame : 8 x 4 Vectors, 0x000000000000f848>
species count mean(height) mean(mass)
<string> <int64> <double> <double>
-1 Human 35 176.6 82.8
-2 Droid 6 131.2 69.8
-3 Wookiee 2 231.0 124.0
-4 Gungan 3 208.7 74.0
-5 NA 4 181.3 48.0
-6 Zabrak 2 173.0 80.0
-7 Twi'lek 2 179.0 55.0
-8 Mirialan 2 168.0 53.1
-9 Kaminoan 2 221.0 88.0
+0 Human 35 176.65 82.78
+1 Droid 6 131.2 69.75
+2 Wookiee 2 231.0 124.0
+3 Gungan 3 208.67 74.0
+4 Zabrak 2 173.0 80.0
+5 Twi'lek 2 179.0 55.0
+6 Mirialan 2 168.0 53.1
+7 Kaminoan 2 221.0 88.0
```
See [DataFrame.md](doc/DataFrame.md) for other examples and details.
-## `RedAmber::Vector`
+### `Vector` for 1D data object in column
Class `RedAmber::Vector` represents a series of data in the DataFrame.
-Method `RedAmber::DataFrame#[key]` returns a Vector with the key `key`.
-```ruby
-penguins[:bill_length_mm]
-# =>
-#<RedAmber::Vector(:double, size=344):0x000000000000f8fc>
-[39.1, 39.5, 40.3, nil, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, ... ]
-```
-
-Vectors accepts some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).
-
-This is an element-wise comparison and returns a boolean Vector of same size.
-
-![unary element-wise](doc/image/vector/unary_element_wise.png)
-
-```ruby
-penguins[:bill_length_mm] < 40
-
-# =>
-#<RedAmber::Vector(:boolean, size=344):0x000000000007e7ac>
-[true, true, false, nil, true, true, true, true, true, false, true, true, false, ... ]
-```
-
-Next example returns aggregated result.
-
-![unary aggregation](doc/image/vector/unary_aggregation.png)
-
-```ruby
-penguins[:bill_length_mm].mean
-43.92192982456141
-# =>
-
-```
-
See [Vector.md](doc/Vector.md) for details.
## Jupyter notebook
-[71 Examples of Red Amber](doc/examples_of_red_amber.ipynb) shows more examples in jupyter notebook.
+[73 Examples of Red Amber](binder/examples_of_red_amber.ipynb) shows more examples in jupyter notebook.
+You can try this notebook on [Binder](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb).
+[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/heronshoes/docker-stacks/RedAmber-binder?filepath=examples_of_red_amber.ipynb)
+
+
## Development
```shell
git clone https://github.com/heronshoes/red_amber.git
cd red_amber
bundle install
bundle exec rake test
```
+## Community
+
I will appreciate if you could help to improve this project. Here are a few ways you can help:
+- Let's talk in the [discussions](https://github.com/heronshoes/red_amber/discussions). [![Discussions](https://img.shields.io/github/discussions/heronshoes/red_amber)](https://github.com/heronshoes/red_amber/discussions)
+ - Browse Q and A, how to use, tips, etc.
+ - Ask questions you’re wondering about.
+ - Share ideas. The idea may be promoted to issues or pull requests.
- [Report bugs or suggest new features](https://github.com/heronshoes/red_amber/issues)
- Fix bugs and [submit pull requests](https://github.com/heronshoes/red_amber/pulls)
- Write, clarify, or fix documentation
## License