doc/DataFrame.md in red_amber-0.2.2 vs doc/DataFrame.md in red_amber-0.2.3

- old
+ new

@@ -3,11 +3,12 @@
 Class `RedAmber::DataFrame` represents 2D-data.
 A `DataFrame` consists of:
 - A collection of data which have same data type within. We call it `Vector`.
 - A label is attached to `Vector`. We call it `key`.
 - A `Vector` and associated `key` is grouped as a `variable`.
 - `variable`s with same vector length are aligned and arranged to be a `DataFrame`.
-- Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `observation`.
+  - Each `key` in a `DataFrame` must be unique.
+- Each `Vector` in a `DataFrame` contains a set of relating data at same position. We call it `record` or `observation`.
 ![dataframe model image](doc/../image/dataframe_model.png)
 ## Constructors and saving
@@ -92,17 +93,17 @@
 ## Properties
 ### `table`, `to_arrow`
-- Reader of Arrow::Table object inside.
+- Returns Arrow::Table object in the DataFrame.

-### `size`, `n_obs`, `n_rows`
+### `size`, `n_records`, `n_obs`, `n_rows`

-- Returns size of Vector (num of observations).
-
-### `n_keys`, `n_vars`, `n_cols`,
+- Returns size of Vector (num of records).
+
+### `n_keys`, `n_variables`, `n_vars`, `n_cols`
 - Returns num of keys (num of variables).
 ### `shape`
@@ -136,21 +137,12 @@
 ### `keys`, `var_names`, `column_names`
 - Returns key names in an Array.
-  When we use it with vectors, Vector#key is useful to get the key inside of DataFrame.
+  Each key must be unique in the DataFrame.

-  ```ruby
-  # update numeric variables, another solution
-  df.assign do
-    vectors.each_with_object({}) do |vector, assigner|
-      assigner[vector.key] = vector * -1 if vector.numeric?
-    end
-  end
-  ```
-
 ### `types`
 - Returns types of vectors in an Array of Symbols.
 ### `type_classes`
@@ -159,29 +151,44 @@
 ### `vectors`
 - Returns an Array of Vectors.
+  When we use it, Vector#key is useful to get the key in the DataFrame.
+
+  ```ruby
+  # update numeric variables, another solution
+  df.assign do
+    vectors.each_with_object({}) do |vector, assigner|
+      assigner[vector.key] = vector * -1 if vector.numeric?
+    end
+  end
+  ```
+
 ### `indices`, `indexes`
-- Returns indexes in an Array.
+- Returns indexes in a Vector.
   Accepts an option `start` as the first of indexes.
   ```ruby
   df = RedAmber::DataFrame.new(x: [1, 2, 3, 4, 5])
   df.indices
   # =>
+  #<RedAmber::Vector(:uint8, size=5):0x0000000000013ed4>
   [0, 1, 2, 3, 4]
   df.indices(1)
   # =>
+  #<RedAmber::Vector(:uint8, size=5):0x0000000000018fd8>
   [1, 2, 3, 4, 5]
   df.indices(:a)
+  # =>
+  #<RedAmber::Vector(:dictionary, size=5):0x000000000001bd50>
   [:a, :b, :c, :d, :e]
   ```
 ### `to_h`
@@ -273,10 +280,11 @@
   require 'red_amber'
   require 'datasets-arrow'
   dataset = Datasets::Penguins.new
   # (From 0.2.2) responsible to the object which has `to_arrow` method.
+  # If older, it should be `dataset.to_arrow` in the parentheses.
   RedAmber::DataFrame.new(dataset).tdr
   # =>
   RedAmber::DataFrame : 344 x 8 Vectors
   Vectors : 5 numeric, 3 strings
@@ -288,30 +296,31 @@
   4 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
   5 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
   6 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
   7 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
   ```
-
+
+
   Options:
   - limit: limit of variables to show. Default value is 10.
-  - tally: max level to use tally mode.
-  - elements: max num of element to show values in each observations.
+  - tally: max level to use tally mode. Default value is 5.
+  - elements: max num of elements to show values in each record. Default value is 5.
 ## Selecting
 ### Select variables (columns in a table) by `[]` as `[key]`, `[keys]`, `[keys[index]]`
 - Key in a Symbol: `df[:symbol]`
 - Key in a String: `df["string"]`
 - Keys in an Array: `df[:symbol1, "string", :symbol2]`
 - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`
-  Key indeces can be used via `keys[i]` because numbers are used to select observations (rows).
+  Key indices should be used via `keys[i]` because numbers are used to select records (rows). See next section.
 - Keys by a Range:
-  If keys are able to represent by Range, it can be included in the arguments. See a example below.
+  If keys can be represented by a Range, it can be included in the arguments. See an example below.

-- You can exchange the order of variables (columns).
+- You can also exchange the order of variables (columns).
   ```ruby
   hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
   df = RedAmber::DataFrame.new(hash)
   df[:b..:c, "a"]
@@ -323,42 +332,44 @@
   0 A 1.0 1
   1 B 2.0 2
   2 C 3.0 3
   ```
-  If `#[]` represents single variable (column), it returns a Vector object.
+  If `#[]` represents a single variable (column), it returns a Vector object.
   ```ruby
   df[:a]
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
   [1, 2, 3]
   ```
+  Or `#v` method also returns a Vector for a key.
   ```ruby
   df.v(:a)
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f140>
   [1, 2, 3]
   ```
-  This may be useful to use in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`
+  This method may be useful to use in a block of DataFrame manipulation verbs. We can write `v(:a)` rather than `self[:a]` or `df[:a]`

-### Select observations (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`
+### Select records (rows in a table) by `[]` as `[index]`, `[range]`, `[array]`

-- Select a obs. by index: `df[0]`
-- Select obs. by indeces in a Range: `df[1..2]`
+- Select a record by index: `df[0]`

-  An end-less or a begin-less Range can be used to represent indeces.
+- Select records by indices in an Array: `df[1, 2]`

-- Select obs. by indeces in an Array: `df[1, 2]`
+- Select records by indices in a Range: `df[1..2]`

-- You can use float indices.
+  An end-less or a begin-less Range can be used to represent indices.
+- You can use indices in Float.
+
 - Mixed case: `df[2, 0..]`
   ```ruby
   hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
   df = RedAmber::DataFrame.new(hash)
@@ -372,13 +383,13 @@
   1 1 A 1.0
   2 2 B 2.0
   3 3 C 3.0
   ```
-- Select obs. by a boolean Array or a boolean RedAmber::Vector at same size as self.
+- Select records by a boolean Array or a boolean RedAmber::Vector at same size as self.

-  It returns a sub dataframe with observations at boolean is true.
+  It returns a sub dataframe with the records where the boolean is true.
   ```ruby
   # with the same dataframe `df` above
   df[true, false, nil]
   # or
   df[[true, false, nil]]
   # or
@@ -389,19 +400,19 @@
   a b c
   <uint8> <string> <double>
   1 1 A 1.0
   ```
-### Select rows from top or from bottom
+### Select records (rows) from top or from bottom
   `head(n=5)`, `tail(n=5)`, `first(n=1)`, `last(n=1)`
 ## Sub DataFrame manipulations
-### `pick ` - pick up variables by key label -
+### `pick ` - pick up variables -

-  Pick up some columns (variables) to create a sub DataFrame.
+  Pick up some variables (columns) to create a sub DataFrame.
 ![pick method image](doc/../image/dataframe/pick.png)
 - Keys as arguments
@@ -489,13 +500,13 @@
   341 50.4 15.7 222
   342 45.2 14.8 212
   343 49.9 16.1 213
   ```
-### `drop ` - pick and drop -
+### `drop ` - counterpart of pick -

-  Drop some columns (variables) to create a remainer DataFrame.
+  Drop some variables (columns) to create a remainder DataFrame.
 ![drop method image](doc/../image/dataframe/drop.png)
 - Keys as arguments
@@ -555,24 +566,24 @@
   # =>
   #<RedAmber::Vector(:uint8, size=3):0x000000000000f258>
   [1, 2, 3]
   ```
-### `slice ` - to cut vertically is slice -
+### `slice ` - slice and select records -

-  Slice and select rows (observations) to create a sub DataFrame.
+  Slice and select records (rows) to create a sub DataFrame.
 ![slice method image](doc/../image/dataframe/slice.png)
 - Indices as arguments
   `slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. Negative index from the tail like Ruby's Array is also acceptable.
   ```ruby
-  # returns 5 obs. at start and 5 obs. from end
+  # returns 5 records at start and 5 records from end
   penguins.slice(0...5, -5..-1)
   # =>
   #<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4>
   species island bill_length_mm bill_depth_mm flipper_length_mm ... year
@@ -663,22 +674,22 @@
   #<Arrow::Table:0x7fdfe44981c8 ptr=0x555e9febc330>
   a b c
   0 1 A 1.000000
   ```
-### `remove`
+### `remove` - counterpart of slice -

-  Slice and reject rows (observations) to create a remainer DataFrame.
+  Slice and reject records (rows) to create a remainder DataFrame.
 ![remove method image](doc/../image/dataframe/remove.png)
 - Indices as arguments
   `remove(indices)` accepts indices as arguments. Indices should be an Integer or a Range of Integer.
   ```ruby
-  # returns 6th to 339th obs.
+  # returns 6th to 339th records
   penguins.remove(0...5, -5..-1)
   # =>
   #<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4>
   species island bill_length_mm bill_depth_mm flipper_length_mm ... year
@@ -697,11 +708,11 @@
 - Booleans as an argument
   `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`.
   ```ruby
-  # remove all observation contains nil
+  # remove all records containing nil
   removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
   removed
   # =>
   #<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac>
@@ -783,11 +794,11 @@
   1 (nil) C 3.0
   ```
 ### `rename`
-  Rename keys (column names) to create a updated DataFrame.
+  Rename keys (variable/column names) to create an updated DataFrame.
 ![rename method image](doc/../image/dataframe/rename.png)
 - Key pairs as arguments
@@ -818,11 +829,11 @@
   Symbol key and String key are distinguished.
 ### `assign`
-  Assign new or updated columns (variables) and create a updated DataFrame.
+  Assign new or updated variables (columns) and create an updated DataFrame.
 - Variables with new keys will append new columns from the right.
 - Variables with existing keys will update corresponding vectors.
 ![assign method image](doc/../image/dataframe/assign.png)
@@ -1007,11 +1018,11 @@
 ## Updating
 ### `sort`
-  `sort` accepts parameters as sort_keys thanks to the amazing Red Arrow feature。
+  `sort` accepts parameters as sort_keys thanks to Red Arrow's feature.
 - :key, "key" or "+key" denotes ascending order
 - "-key" denotes descending order
   ```ruby
   df = RedAmber::DataFrame.new(
@@ -1038,11 +1049,11 @@
 ## Treat na data
 ### `remove_nil`
-  Remove any observations containing nil.
+  Remove any records containing nil.
 ## Grouping
 ### `group(group_keys)`
@@ -1208,11 +1219,11 @@
   The leftmost column is created by original keys. Key name of the column is named by parameter `:name`. If `:name` is not specified, `:NAME` is used for the key.
 ### `to_long(*keep_keys)`
-  Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
+  Creates a 'long' (may be tidy) DataFrame from a 'wide' DataFrame.
 - Parameter `keep_keys` specifies the key names to keep.
   ```ruby
   import_cars.to_long(:Year)
@@ -1255,11 +1266,11 @@
   24 2021 VW 35215
   ```
 ### `to_wide`
-  Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.
+  Creates a 'wide' (may be messy) DataFrame from a 'long' DataFrame.
 - Option `:name` is the key of the column which will be expanded **to key names**. The default value is `:NAME` if it is not specified.
 - Option `:value` is the key of the column which will be expanded **to values**. The default value is `:VALUE` if it is not specified.
@@ -1280,12 +1291,280 @@
   4 2021 22535 35905 18211 51722 35215
   ```
 ## Combine
-- [ ] Combining dataframes
+### `join`
+![dataframe joining image](doc/../image/dataframe/join.png)

-- [ ] Join
+  You should use specific `*_join` methods below.
+
+  - `other` is a DataFrame or an Arrow::Table.
+  - `join_keys` are keys shared by self and other to match with them.
+    - If `join_keys` are empty, common keys in self and other are chosen (natural join).
+    - If (common keys) > `join_keys`, duplicated keys are renamed by `suffix`.
+
+  ```ruby
+  df = DataFrame.new(
+    KEY: %w[A B C],
+    X1: [1, 2, 3]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000012a70>
+    KEY           X1
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+
+  other = DataFrame.new(
+    KEY: %w[A B D],
+    X2: [true, false, nil]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000017034>
+    KEY      X2
+    <string> <boolean>
+  0 A        true
+  1 B        false
+  2 D        (nil)
+  ```
+
+#### Mutating joins
+
+##### `inner_join(other, join_keys = nil, suffix: '.1')`
+
+  Join data, leaving only the matching records.
+
+  ```ruby
+  df.inner_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000001e2bc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  ```
+
+##### `full_join(other, join_keys = nil, suffix: '.1')`
+
+  Join data, leaving all records.
+
+  ```ruby
+  df.full_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 C              3 (nil)
+  3 D          (nil) (nil)
+  ```
+
+##### `left_join(other, join_keys = nil, suffix: '.1')`
+
+  Join matching values to self from other.
+
+  ```ruby
+  df.left_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 C              3 (nil)
+  ```
+
+##### `right_join(other, join_keys = nil, suffix: '.1')`
+
+  Join matching values from self to other.
+
+  ```ruby
+  df.right_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000029fcc>
+    KEY           X1 X2
+    <string> <uint8> <boolean>
+  0 A              1 true
+  1 B              2 false
+  2 D          (nil) (nil)
+  ```
+
+#### Filtering join
+
+##### `semi_join(other, join_keys = nil, suffix: '.1')`
+
+  Return records of self that have a match in other.
+
+  ```ruby
+  df.semi_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000029fcc>
+    KEY           X1
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  ```
+
+##### `anti_join(other, join_keys = nil, suffix: '.1')`
+
+  Return records of self that do not have a match in other.
+
+  ```ruby
+  df.anti_join(other, :KEY)
+  #=>
+  #<RedAmber::DataFrame : 1 x 2 Vectors, 0x0000000000029fcc>
+    KEY           X1
+    <string> <uint8>
+  0 C              3
+  ```
+
+## Set operations
+![dataframe set and binding image](doc/../image/dataframe/set_and_bind.png)
+
+  Keys in self and other must be same in set operations.
+
+  ```ruby
+  df = DataFrame.new(
+    KEY1: %w[A B C],
+    KEY2: [1, 2, 3]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000012a70>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+
+  other = DataFrame.new(
+    KEY1: %w[A B D],
+    KEY2: [1, 4, 5]
+  )
+  #=>
+  #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000017034>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              4
+  2 D              5
+  ```
+
+##### `intersect(other)`
+
+  Select records appearing in both self and other.
+
+  ```ruby
+  df.intersect(other)
+  #=>
+  #<RedAmber::DataFrame : 1 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  ```
+
+##### `union(other)`
+
+  Select records appearing in self or other.
+
+  ```ruby
+  df.union(other)
+  #=>
+  #<RedAmber::DataFrame : 5 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  0 A              1
+  1 B              2
+  2 C              3
+  3 B              4
+  4 D              5
+  ```
+
+##### `difference(other)`
+
+  Select records appearing in self but not in other.
+
+  It has an alias `setdiff`.
+
+  ```ruby
+  df.difference(other)
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000029fcc>
+    KEY1        KEY2
+    <string> <uint8>
+  1 B              2
+  2 C              3
+  ```
+
+## Binding
+
+### `concatenate(other)`
+
+  Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self.
+
+  The alias is `concat`.
+
+  An array of DataFrames or Tables is also acceptable as other.
+
+  ```ruby
+  df
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000022cb8>
+          x y
+    <uint8> <string>
+  0       1 A
+  1       2 B
+
+  other
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001f6d0>
+          x y
+    <uint8> <string>
+  0       3 C
+  1       4 D
+
+  df.concatenate(other)
+  #=>
+  #<RedAmber::DataFrame : 4 x 2 Vectors, 0x0000000000022574>
+          x y
+    <uint8> <string>
+  0       1 A
+  1       2 B
+  2       3 C
+  3       4 D
+  ```
+
+### `merge(other)`
+
+  Merge another DataFrame or Table into self as additional variables (columns) on the right. The size (number of records) of other must be the same as self.
+
+  ```ruby
+  df
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000009150>
+          x       y
+    <uint8> <uint8>
+  0       1       3
+  1       2       4
+
+  other
+  #=>
+  #<RedAmber::DataFrame : 2 x 2 Vectors, 0x0000000000008a0c>
+    a        b
+    <string> <string>
+  0 A        C
+  1 B        D
+
+  df.merge(other)
+  #=>
+  #<RedAmber::DataFrame : 2 x 4 Vectors, 0x000000000000cb70>
+          x       y a        b
+    <uint8> <uint8> <string> <string>
+  0       1       3 A        C
+  1       2       4 B        D
+  ```

 ## Encoding
 - [ ] One-hot encoding
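
Note on the join semantics added in this diff: the sketch below is plain Ruby and is not RedAmber's implementation. It models each record as a Hash and uses hypothetical helper methods (`inner_join`, `left_join`, `semi_join`, `anti_join`, defined here only for illustration) to show which records each kind of join is expected to keep, assuming a single join key and the same `df` / `other` data as in the `join` section.

```ruby
# Plain-Ruby sketch of join semantics. NOT RedAmber's implementation:
# records are Hashes and the helpers below are hypothetical.

df = [
  { KEY: 'A', X1: 1 },
  { KEY: 'B', X1: 2 },
  { KEY: 'C', X1: 3 }
]

other = [
  { KEY: 'A', X2: true },
  { KEY: 'B', X2: false },
  { KEY: 'D', X2: nil }
]

# Keep only records whose key appears on both sides.
def inner_join(left, right, key)
  index = right.group_by { |r| r[key] }
  left.flat_map { |l| (index[l[key]] || []).map { |r| l.merge(r) } }
end

# Keep every left record; fill the right-hand variables with nil when unmatched.
def left_join(left, right, key)
  index = right.group_by { |r| r[key] }
  blank = (right.flat_map(&:keys).uniq - [key]).to_h { |k| [k, nil] }
  left.flat_map do |l|
    matches = index[l[key]]
    matches ? matches.map { |r| l.merge(r) } : [l.merge(blank)]
  end
end

# Filtering joins: no variables are added, records are only selected.
def semi_join(left, right, key)
  keys = right.map { |r| r[key] }
  left.select { |l| keys.include?(l[key]) }
end

def anti_join(left, right, key)
  keys = right.map { |r| r[key] }
  left.reject { |l| keys.include?(l[key]) }
end

p inner_join(df, other, :KEY) # records A and B, with X1 and X2
p left_join(df, other, :KEY)  # records A, B and C; C has X2 = nil
p semi_join(df, other, :KEY)  # records A and B, X1 only
p anti_join(df, other, :KEY)  # record C only
```

A right join is the left join with the two sides swapped, and a full join keeps the unmatched records from both sides; they are omitted here to keep the sketch short.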