README.md in red_amber-0.1.3 vs README.md in red_amber-0.1.4
- old
+ new
@@ -21,138 +21,30 @@
gem 'red_amber'
```
And then execute:
- $ bundle install
+```shell
+bundle install
+```
Or install it yourself as:
- $ gem install red_amber
+```shell
+gem install red_amber
+```
## `RedAmber::DataFrame`
-### Constructors and saving
+Represents a set of data in 2D-shape.
-- [x] `new` from a columnar Hash
- - `RedAmber::DataFrame.new(x: [1, 2, 3])`
-
-- [x] `new` from a schema (by Hash) and rows (by Array)
- - `RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])`
-
-- [x] `new` from an Arrow::Table
- - `RedAmber::DataFrame.new(Arrow::Table.new(x: [1, 2, 3]))`
-
-- [x] `new` from a Rover::DataFrame
- - `RedAmber::DataFrame.new(Rover::DataFrame.new(x: [1, 2, 3]))`
-
-- [x] `load` (class method)
-
- - [x] from a [`.arrow`, `.arrows`, `.csv`, `.csv.gz`, `.tsv`] file
- - `RedAmber::DataFrame.load("test/entity/with_header.csv")`
-
- - [x] from a string buffer
-
- - [x] from a URI
- - `RedAmber::DataFrame.load(URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv"))`
-
- - [x] from a Parquet file
-
- `red-parquet` gem is required.
-
- ```ruby
- require 'parquet'
- dataframe = RedAmber::DataFrame.load("file.parquet")
- ```
-
-- [x] `save` (instance method)
-
- - [x] to a [`.arrow`, `.arrows`, `.csv`, `.csv.gz`, `.tsv`] file
-
- - [x] to a string buffer
-
- - [x] to a URI
-
- - [x] to a Parquet file
-
- `red-parquet` gem is required.
-
- ```ruby
- require 'parquet'
- dataframe.save("file.parquet")
- ```
-
-### Properties
-
-- [x] `table`
-
- Reader of Arrow::Table object inside.
-
-- [x] `n_rows`, `nrow`, `size`, `length`
-
- Returns num of rows (data size).
-
-- [x] `n_columns`, `ncol`, `width`
-
- Returns num of columns (num of vectors).
-
-- [x] `shape`
-
- Returns shape in an Array[n_rows, n_cols].
-
-- [x] `column_names`, `keys`
-
- Returns num of column names by an Array.
-
-- [x] `types`
-
- Returns types of columns by an Array of Symbols.
-
-- [x] `data_types`
-
- Returns types of columns by an Array of `Arrow::DataType`.
-
-- [x] `vectors`
-
- Returns an Array of Vectors.
-
-- [x] `to_h`
-
- Returns column-oriented data in a Hash.
-
-- [x] `to_a`, `raw_records`
-
- Returns an array of row-oriented data without header. If you need a column-oriented full array, use `.to_h.to_a`
-
-- [x] `schema`
-
- Returns column name and data type in a Hash.
-
-- [x] `==`
-
-- [x] `empty?`
-
-### Output
-
-- [x] `to_s`
-
-- [ ] summary, describe
-
-- [x] `to_rover`
-
- Returns a `Rover::DataFrame`.
-
-- [x] `inspect(tally_level: 5, max_element: 5)`
-
- Shows some information about self in a transposed style.
-
```ruby
require 'red_amber'
require 'datasets-arrow'
penguins = Datasets::Penguins.new.to_arrow
-RedAmber::DataFrame.new(penguins)
+puts RedAmber::DataFrame.new(penguins).tdr
# =>
RedAmber::DataFrame : 344 x 8 Vectors
Vectors : 5 numeric, 3 strings
# key type level data_preview
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
@@ -163,260 +55,60 @@
6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
```
- - tally_level: max level to use tally mode
- - max_element: max num of element to show values in each row
+### DataFrame model
+![dataframe model of RedAmber](doc/image/dataframe_model.png)
-### Selecting
+For example, `DataFrame#pick` accepts keys as an argument and returns a sub DataFrame.
-- [x] Select columns by `[]` as `[key]`, `[keys]`, `[keys[index]]`
- - Key in a Symbol: `df[:symbol]`
- - Key in a String: `df["string"]`
- - Keys in an Array: `df[:symbol1, "string", :symbol2]`
- - Keys in indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
- - Keys in a Range:
- A end-less Range can be used to represent keys.
-
```ruby
-hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
-df = RedAmber::DataFrame.new(hash)
-df[:b..:c, "a"]
+df = penguins.pick(:body_mass_g)
# =>
-RedAmber::DataFrame : 3 x 3 Vectors
-Vectors : 2 numeric, 1 string
-# key type level data_preview
-1 :b string 3 ["A", "B", "C"]
-2 :c double 3 [1.0, 2.0, 3.0]
-3 :a uint8 3 [1, 2, 3]
+#<RedAmber::DataFrame : 344 x 1 Vector, 0x000000000000fa14>
+Vector : 1 numeric
+# key type level data_preview
+1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
```
-- [x] Select rows by `[]` as `[index]`, `[range]`, `[array]`
- - Select a row by index: `df[0]`
- - Select rows by indeces in a Range: `df[1..2]`
- - Select rows by indeces in an Array: `df[1, 2]`
- - Mixed case: `df[2, 0..]`
+`DataFrame#assign` can accept a block and create new variables.
-- [x] Select rows from top or bottom
+```ruby
+df.assign do
+ {:body_mass_kg => penguins[:body_mass_g] / 1000.0}
+end
+# =>
+#<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000fa28>
+Vectors : 2 numeric
+# key type level data_preview
+1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
+2 :body_mass_kg double 95 [3.75, 3.8, 3.25, nil, 3.45, ... ], 2 nils
+```
- `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
+Other DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove` and `rename` also accept a block.
-- [ ] slice
+See [DataFrame.md](doc/DataFrame.md) for details.
-### Updating
-- [ ] Add a new column
-
-- [ ] Update a single element
-
-- [ ] Update multiple elements
-
-- [ ] Update all elements
-
-- [ ] Update elements matching a condition
-
-- [ ] Clamp
-
-- [ ] Delete columns
-
-- [ ] Rename a column
-
-- [ ] Sort rows
-
-- [ ] Clear data
-
-### Treat na data
-
-- [ ] Drop na (NaN, nil)
-
-- [ ] Replace na with value
-
-- [ ] Interpolate na with convolution array
-
-### Combining DataFrames
-
-- [ ] Add rows
-
-- [ ] Add columns
-
-- [ ] Inner join
-
-- [ ] Left join
-
-### Encoding
-
-- [ ] One-hot encoding
-
-### Iteration (not impremented)
-
-### Filtering (not impremented)
-
-
## `RedAmber::Vector`
-### Constructor
-- [x] Create from a column in a DataFrame
+Class `RedAmber::Vector` represents a series of data in the DataFrame.
-- [x] New from an Array
-
-### Properties
-
-- [x] `to_s`
-
-- [x] `values`, `to_a`, `entries`
-
-- [x] `size`, `length`, `n_rows`, `nrow`
-
-- [x] `type`
-
-- [x] `data_type`
-
-- [ ] `each`
-
-- [ ] `chunked?`
-
-- [ ] `n_chunks`
-
-- [ ] `each_chunk`
-
-- [x] `tally`
-
-- [x] `n_nils`, `n_nans`
-
- - `n_nulls` is an alias of `n_nils`
-
-- [x] `inspect(limit: 80)`
-
- - `limit` sets size limit to display long array.
-
-### Functions
-#### Unary aggregations: vector.func => scalar
-
-| Method |Boolean|Numeric|String|Options|Remarks|
-| ----------- | --- | --- | --- | --- | --- |
-| ✓ `all` | ✓ | | | ✓ ScalarAggregate| |
-| ✓ `any` | ✓ | | | ✓ ScalarAggregate| |
-| ✓ `approximate_median`| |✓| | ✓ ScalarAggregate| alias `median`|
-| ✓ `count` | ✓ | ✓ | ✓ | ✓ Count | |
-| ✓ `count_distinct`| ✓ | ✓ | ✓ | ✓ Count |alias `count_uniq`|
-|[ ]`index` | [ ] | [ ] | [ ] |[ ] Index | |
-| ✓ `max` | ✓ | ✓ | ✓ | ✓ ScalarAggregate| |
-| ✓ `mean` | ✓ | ✓ | | ✓ ScalarAggregate| |
-| ✓ `min` | ✓ | ✓ | ✓ | ✓ ScalarAggregate| |
-|[ ]`min_max` | [ ] | [ ] | [ ] |[ ] ScalarAggregate| |
-|[ ]`mode` | | [ ] | |[ ] Mode | |
-| ✓ `product` | ✓ | ✓ | | ✓ ScalarAggregate| |
-|[ ]`quantile`| | [ ] | |[ ] Quantile| |
-|[ ]`stddev` | | ✓ | |[ ] Variance| |
-| ✓ `sum` | ✓ | ✓ | | ✓ ScalarAggregate| |
-|[ ]`tdigest` | | [ ] | |[ ] TDigest | |
-|[ ]`variance`| | ✓ | |[ ] Variance| |
-
-
-Options can be used as follows.
-See the [document of C++ function](https://arrow.apache.org/docs/cpp/compute.html) for detail.
-
```ruby
-double = RedAmber::Vector.new([1, 0/0.0, -1/0.0, 1/0.0, nil, ""])
-#=>
-#<RedAmber::Vector(:double, size=6):0x000000000000f910>
-[1.0, NaN, -Infinity, Infinity, nil, 0.0]
-
-double.count #=> 5
-double.count(opts: {mode: :only_valid}) #=> 5, default
-double.count(opts: {mode: :only_null}) #=> 1
-double.count(opts: {mode: :all}) #=> 6
-
-boolean = RedAmber::Vector.new([true, true, nil])
-#=>
-#<RedAmber::Vector(:boolean, size=3):0x000000000000f924>
-[true, true, nil]
-
-boolean.all #=> true
-boolean.all(opts: {skip_nulls: true}) #=> true
-boolean.all(opts: {skip_nulls: false}) #=> false
+penguins[:species]
+# =>
+#<RedAmber::Vector(:string, size=344):0x000000000000f8e8>
+["Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", ... ]
```
-#### Unary element-wise: vector.func => vector
+Vectors accepts some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).
-| Method |Boolean|Numeric|String|Options|Remarks|
-| ------------ | --- | --- | --- | --- | ----- |
-| ✓ `-@` | | ✓ | | |as `-vector`|
-| ✓ `negate` | | ✓ | | |`-@` |
-| ✓ `abs` | | ✓ | | | |
-|[ ]`acos` | | [ ] | | | |
-|[ ]`asin` | | [ ] | | | |
-| ✓ `atan` | | ✓ | | | |
-| ✓ `bit_wise_not`| | (✓) | | |integer only|
-|[ ]`ceil` | | ✓ | | | |
-| ✓ `cos` | | ✓ | | | |
-|[ ]`floor` | | ✓ | | | |
-| ✓ `invert` | ✓ | | | |`!`, alias `not`|
-|[ ]`ln` | | [ ] | | | |
-|[ ]`log10` | | [ ] | | | |
-|[ ]`log1p` | | [ ] | | | |
-|[ ]`log2` | | [ ] | | | |
-|[ ]`round` | | [ ] | |[ ] Round| |
-|[ ]`round_to_multiple`| | [ ] | |[ ] RoundToMultiple| |
-| ✓ `sign` | | ✓ | | | |
-| ✓ `sin` | | ✓ | | | |
-| ✓ `tan` | | ✓ | | | |
-|[ ]`trunc` | | ✓ | | | |
+See [Vector.md](doc/Vector.md) for details.
-#### Binary element-wise: vector.func(vector) => vector
+## TDR concept
-| Method |Boolean|Numeric|String|Options|Remarks|
-| ----------------- | --- | --- | --- | --- | ----- |
-| ✓ `add` | | ✓ | | | `+` |
-| ✓ `atan2` | | ✓ | | | |
-| ✓ `and_kleene` | ✓ | | | | `&` |
-| ✓ `and_org ` | ✓ | | | |`and` in Red Arrow|
-| ✓ `and_not` | ✓ | | | | |
-| ✓ `and_not_kleene`| ✓ | | | | |
-| ✓ `bit_wise_and` | | (✓) | | |integer only|
-| ✓ `bit_wise_or` | | (✓) | | |integer only|
-| ✓ `bit_wise_xor` | | (✓) | | |integer only|
-| ✓ `divide` | | ✓ | | | `/` |
-| ✓ `equal` | ✓ | ✓ | ✓ | |`==`, alias `eq`|
-| ✓ `greater` | ✓ | ✓ | ✓ | |`>`, alias `gt`|
-| ✓ `greater_equal` | ✓ | ✓ | ✓ | |`>=`, alias `ge`|
-| ✓ `is_finite` | | ✓ | | | |
-| ✓ `is_inf` | | ✓ | | | |
-| ✓ `is_na` | ✓ | ✓ | ✓ | | |
-| ✓ `is_nan` | | ✓ | | | |
-|[ ]`is_nil` | ✓ | ✓ | ✓ |[ ] Null|alias `is_null`|
-| ✓ `is_valid` | ✓ | ✓ | ✓ | | |
-| ✓ `less` | ✓ | ✓ | ✓ | |`<`, alias `lt`|
-| ✓ `less_equal` | ✓ | ✓ | ✓ | |`<=`, alias `le`|
-|[ ]`logb` | | [ ] | | | |
-|[ ]`mod` | | [ ] | | | `%` |
-| ✓ `multiply` | | ✓ | | | `*` |
-| ✓ `not_equal` | ✓ | ✓ | ✓ | |`!=`, alias `ne`|
-| ✓ `or_kleene` | ✓ | | | | `\|` |
-| ✓ `or_org` | ✓ | | | |`or` in Red Arrow|
-| ✓ `power` | | ✓ | | | `**` |
-| ✓ `subtract` | | ✓ | | | `-` |
-| ✓ `shift_left` | | (✓) | | |`<<`, integer only|
-| ✓ `shift_right` | | (✓) | | |`>>`, integer only|
-| ✓ `xor` | ✓ | | | | `^` |
-
-##### (Not impremented)
-- [ ] sort, sort_index
-- [ ] argmin, argmax
-- [ ] (array functions)
-- [ ] (strings functions)
-- [ ] (temporal functions)
-- [ ] (conditional functions)
-- [ ] (index functions)
-- [ ] (other functions)
-
-### Coerce (not impremented)
-
-### Updating (not impremented)
-
-### DSL in a block for faster calculation ?
-
+I named the data frame representation style in the model above as TDR (Transposed DataFrame Representation). See [TDR.md](doc/tdr.md) for details.
## Development
```shell
git clone https://github.com/heronshoes/red_amber.git