README.md in red_amber-0.1.6 vs README.md in red_amber-0.1.7

- old
+ new

@@ -1,27 +1,32 @@
 # RedAmber
+[![Gem Version](https://badge.fury.io/rb/red_amber.svg)](https://badge.fury.io/rb/red_amber)
+[![Ruby](https://github.com/heronshoes/red_amber/actions/workflows/test.yml/badge.svg)](https://github.com/heronshoes/red_amber/actions/workflows/test.yml)
+
 A simple dataframe library for Ruby (experimental).
-- Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow)
+- Powered by [Red Arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) [![Gitter Chat](https://badges.gitter.im/red-data-tools/en.svg)](https://gitter.im/red-data-tools/en)
 - Inspired by the dataframe library [Rover-df](https://github.com/ankane/rover)
 ## Requirements
 ```ruby
 gem 'red-arrow', '>= 8.0.0'
-gem 'red-parquet', '>= 8.0.0' # if you use IO from/to parquet
-gem 'rover-df', '~> 0.3.0' # if you use IO from/to Rover::DataFrame
+
+gem 'red-parquet', '>= 8.0.0' # Optional, if you use IO from/to parquet
+gem 'rover-df', '~> 0.3.0' # Optional, if you use IO from/to Rover::DataFrame
 ```
 ## Installation
 Install requirements before you install Red Amber.
 - Apache Arrow GLib (>= 8.0.0)
-- Apache Parquet GLib (>= 8.0.0)
+- Apache Parquet GLib (>= 8.0.0) # If you use IO from/to parquet
+
 See [Apache Arrow install document](https://arrow.apache.org/install/).
 Minimum installation example for the latest Ubuntu is in the ['Prepare the Apache Arrow' section in ci test](https://github.com/heronshoes/red_amber/blob/master/.github/workflows/test.yml) of Red Amber.
 Add this line to your Gemfile:
@@ -40,98 +45,78 @@
 ```shell
 gem install red_amber
 ```
-(From v0.1.6)
-
-RedAmber uses TDR mode for `#inspect` and `#to_iruby` by default. If you prefer Table mode, please set the environment variable
-`RED_AMBER_OUTPUT_MODE` to `"table"`. See [TDR section](#TDR) for detail.
-
 ## `RedAmber::DataFrame`
 Represents a set of data in 2D-shape. The entity is a Red Arrow's Table object.
 ```ruby
 require 'red_amber' # require 'red-amber' is also OK.
 require 'datasets-arrow'
 arrow = Datasets::Penguins.new.to_arrow
-penguins = RedAmber::DataFrame.new(arrow)
-penguins.table
+RedAmber::DataFrame.new(arrow)
 # =>
-#<Arrow::Table:0x111271098 ptr=0x7f9118b3e0b0>
- species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
- 0 Adelie Torgersen 39.100000 18.700000 181 3750 male 2007
- 1 Adelie Torgersen 39.500000 17.400000 186 3800 female 2007
- 2 Adelie Torgersen 40.300000 18.000000 195 3250 female 2007
- 3 Adelie Torgersen (null) (null) (null) (null) (null) 2007
- 4 Adelie Torgersen 36.700000 19.300000 193 3450 female 2007
- 5 Adelie Torgersen 39.300000 20.600000 190 3650 male 2007
- 6 Adelie Torgersen 38.900000 17.800000 181 3625 female 2007
- 7 Adelie Torgersen 39.200000 19.600000 195 4675 male 2007
- 8 Adelie Torgersen 34.100000 18.100000 193 3475 (null) 2007
- 9 Adelie Torgersen 42.000000 20.200000 190 4250 (null) 2007
-...
-334 Gentoo Biscoe 46.200000 14.100000 217 4375 female 2009
-335 Gentoo Biscoe 55.100000 16.000000 230 5850 male 2009
-336 Gentoo Biscoe 44.500000 15.700000 217 4875 (null) 2009
-337 Gentoo Biscoe 48.800000 16.200000 222 6000 male 2009
-338 Gentoo Biscoe 47.200000 13.700000 214 4925 female 2009
-339 Gentoo Biscoe (null) (null) (null) (null) (null) 2009
-340 Gentoo Biscoe 46.800000 14.300000 215 4850 female 2009
-341 Gentoo Biscoe 50.400000 15.700000 222 5750 male 2009
-342 Gentoo Biscoe 45.200000 14.800000 212 5200 female 2009
-343 Gentoo Biscoe 49.900000 16.100000 213 5400 male 2009
+#<RedAmber::DataFrame : 344 x 8 Vectors, 0x0000000000013790>
+ species island bill_length_mm bill_depth_mm flipper_length_mm ... year
+ <string> <string> <double> <double> <uint8> ... <uint16>
+ 1 Adelie Torgersen 39.1 18.7 181 ... 2007
+ 2 Adelie Torgersen 39.5 17.4 186 ... 2007
+ 3 Adelie Torgersen 40.3 18.0 195 ... 2007
+ 4 Adelie Torgersen (nil) (nil) (nil) ... 2007
+ 5 Adelie Torgersen 36.7 19.3 193 ... 2007
+ : : : : : : ... :
+342 Gentoo Biscoe 50.4 15.7 222 ... 2009
+343 Gentoo Biscoe 45.2 14.8 212 ... 2009
+344 Gentoo Biscoe 49.9 16.1 213 ... 2009
 ```
-By default, RedAmber shows self by compact transposed style. This unfamiliar style (TDR) is designed for
-the exploratory data processing. It keeps Vectors as row vectors, shows keys and types at a glance, shows levels
-for the 'factor-like' variables and shows the number of abnormal values like NaN and nil.
-
-```ruby
-penguins
-
-# =>
-RedAmber::DataFrame : 344 x 8 Vectors
-Vectors : 5 numeric, 3 strings
-# key type level data_preview
-1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
-2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
-3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
-4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
-5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
-6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
-7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
-8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
-```
-
 ### DataFrame model
 ![dataframe model of RedAmber](doc/image/dataframe_model.png)
 For example, `DataFrame#pick` accepts keys as an argument and returns a sub DataFrame.
 ```ruby
 df = penguins.pick(:body_mass_g)
+df
+
 # =>
-#<RedAmber::DataFrame : 344 x 1 Vector, 0x000000000000fa14>
-Vector : 1 numeric
-# key type level data_preview
-1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
+#<RedAmber::DataFrame : 344 x 1 Vector, 0x0000000000015cc0>
+ body_mass_g
+ <uint16>
+ 1 3750
+ 2 3800
+ 3 3250
+ 4 (nil)
+ 5 3450
+ : :
+342 5750
+343 5200
+344 5400
 ```
 `DataFrame#assign` creates new variables (column in the table).
 ```ruby
 df.assign(:body_mass_kg => df[:body_mass_g] / 1000.0)
+
 # =>
-#<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000fa28>
-Vectors : 2 numeric
-# key type level data_preview
-1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
-2 :body_mass_kg double 95 [3.75, 3.8, 3.25, nil, 3.45, ... ], 2 nils
+#<RedAmber::DataFrame : 344 x 2 Vectors, 0x00000000000212f0>
+ body_mass_g body_mass_kg
+ <uint16> <double>
+ 1 3750 3.8
+ 2 3800 3.8
+ 3 3250 3.3
+ 4 (nil) (nil)
+ 5 3450 3.5
+ : : :
+342 5750 5.8
+343 5200 5.2
+344 5400 5.4
 ```
 DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove`, `rename` and `assign` accept a block.
 This is an exaple to eliminate observations (row in the table) containing nil.
@@ -176,22 +161,12 @@
 Vectors accepts some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).
 See [Vector.md](doc/Vector.md) for details.
-## TDR
+## Jupyter notebook
-I named the data frame representation style in the model above as TDR (Transposed DataFrame Representation).
-
-This library can be used with both TDR mode and usual Table mode.
-If you set the environment variable `RED_AMBER_OUTPUT_MODE` to `"table"`, output style by `inspect` and `to_iruby` is the Table mode. Other value including nil will output TDR style.
-
-You can switch the mode in Ruby like this.
-```ruby
-ENV['RED_AMBER_OUTPUT_STYLE'] = 'table' # => Table mode
-```
-
-For more detail information about TDR, see [TDR.md](doc/tdr.md).
+[47 Examples of Red Amber](doc/47_examples_of_red_amber.ipynb)
 ## Development
 ```shell
 git clone https://github.com/heronshoes/red_amber.git