doc/DataFrame.md in red_amber-0.2.1 vs doc/DataFrame.md in red_amber-0.2.2

- old
+ new

@@ -12,34 +12,42 @@ ## Constructors and saving ### `new` from a Hash ```ruby - RedAmber::DataFrame.new(x: [1, 2, 3]) + df = RedAmber::DataFrame.new(x: [1, 2, 3], y: %w[A B C]) ``` ### `new` from a schema (by Hash) and data (by Array) ```ruby - RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]]) + RedAmber::DataFrame.new({x: :uint8, y: :string}, [[1, "A"], [2, "B"], [3, "C"]]) ``` ### `new` from an Arrow::Table ```ruby - table = Arrow::Table.new(x: [1, 2, 3]) + table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C]) RedAmber::DataFrame.new(table) ``` +### `new` from an Object which responds to `to_arrow` + + ```ruby + require "datasets-arrow" + dataset = Datasets::Penguins.new + RedAmber::DataFrame.new(dataset) + ``` + ### `new` from a Rover::DataFrame ```ruby require 'rover' - rover = Rover::DataFrame.new(x: [1, 2, 3]) + rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C]) RedAmber::DataFrame.new(rover) ``` ### `load` (class method) @@ -61,11 +69,11 @@ - from a Parquet file ```ruby require 'parquet' - dataframe = RedAmber::DataFrame.load("file.parquet") + df = RedAmber::DataFrame.load("file.parquet") ``` ### `save` (instance method) - to a `.arrow`, `.arrows`, `.csv`, `.csv.gz` or `.tsv` file @@ -77,11 +85,11 @@ - to a Parquet file ```ruby require 'parquet' - dataframe.save("file.parquet") + df.save("file.parquet") ``` ## Properties ### `table`, `to_arrow` @@ -208,19 +216,19 @@ puts penguins.to_s # => species island bill_length_mm bill_depth_mm flipper_length_mm ... year <string> <string> <double> <double> <uint8> ... <uint16> - 1 Adelie Torgersen 39.1 18.7 181 ... 2007 - 2 Adelie Torgersen 39.5 17.4 186 ... 2007 - 3 Adelie Torgersen 40.3 18.0 195 ... 2007 - 4 Adelie Torgersen (nil) (nil) (nil) ... 2007 - 5 Adelie Torgersen 36.7 19.3 193 ... 2007 + 0 Adelie Torgersen 39.1 18.7 181 ... 2007 + 1 Adelie Torgersen 39.5 17.4 186 ... 2007 + 2 Adelie Torgersen 40.3 18.0 195 ... 2007 + 3 Adelie Torgersen (nil) (nil) (nil) ... 2007 + 4 Adelie Torgersen 36.7 19.3 193 ... 
2007 : : : : : : ... : -342 Gentoo Biscoe 50.4 15.7 222 ... 2009 -343 Gentoo Biscoe 45.2 14.8 212 ... 2009 -344 Gentoo Biscoe 49.9 16.1 213 ... 2009 +341 Gentoo Biscoe 50.4 15.7 222 ... 2009 +342 Gentoo Biscoe 45.2 14.8 212 ... 2009 +343 Gentoo Biscoe 49.9 16.1 213 ... 2009 ``` ### `inspect` `inspect` uses `to_s` output and also shows shape and object_id. @@ -233,15 +241,15 @@ puts penguins.summary.to_s(width: 82) # needs more width to show all stats in this example # => variables count mean std min 25% median 75% max <dictionary> <uint16> <double> <double> <double> <double> <double> <double> <double> -1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6 -2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5 -3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0 -4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0 -5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0 +0 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 48.5 59.6 +1 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 18.7 21.5 +2 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 213.0 231.0 +3 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 4750.0 6300.0 +4 year 344 2008.03 0.82 2007.0 2007.0 2008.0 2009.0 2009.0 ``` ### `to_rover` - Returns a `Rover::DataFrame`. @@ -263,25 +271,26 @@ ```ruby require 'red_amber' require 'datasets-arrow' - penguins = Datasets::Penguins.new.to_arrow - RedAmber::DataFrame.new(penguins).tdr + dataset = Datasets::Penguins.new + # (From 0.2.2) responsible to the object which has `to_arrow` method. + RedAmber::DataFrame.new(dataset).tdr # => RedAmber::DataFrame : 344 x 8 Vectors Vectors : 5 numeric, 3 strings # key type level data_preview - 1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124} - 2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124} - 3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils - 4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... 
], 2 nils - 5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils - 6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils - 7 :sex string 3 {"male"=>168, "female"=>165, nil=>11} - 8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120} + 0 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124} + 1 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124} + 2 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils + 3 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils + 4 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils + 5 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils + 6 :sex string 3 {"male"=>168, "female"=>165, nil=>11} + 7 :year uint16 3 {2007=>110, 2008=>114, 2009=>120} ``` - limit: limit of variables to show. Default value is 10. - tally: max level to use tally mode. - elements: max num of element to show values in each observations. @@ -309,13 +318,13 @@ # => #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000328fc> b c a <string> <double> <uint8> - 1 A 1.0 1 - 2 B 2.0 2 - 3 C 3.0 3 + 0 A 1.0 1 + 1 B 2.0 2 + 2 C 3.0 3 ``` If `#[]` represents single variable (column), it returns a Vector object. ```ruby @@ -357,14 +366,14 @@ # => #<RedAmber::DataFrame : 4 x 3 Vectors, 0x0000000000033270> a b c <uint8> <string> <double> - 1 3 C 3.0 - 2 1 A 1.0 - 3 2 B 2.0 - 4 3 C 3.0 + 0 3 C 3.0 + 1 1 A 1.0 + 2 2 B 2.0 + 3 3 C 3.0 ``` - Select obs. by a boolean Array or a boolean RedAmber::Vector at same size as self. It returns a sub dataframe with observations at boolean is true. 
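The boolean selection described above can be modeled without the gem. This is a minimal plain-Ruby sketch, not RedAmber's implementation: `select_rows` is a hypothetical helper, the data frame is stood in for by a Hash of column Arrays, and dropping rows whose flag is `nil` is an assumption made for illustration.

```ruby
# Plain-Ruby model of row selection by booleans.
# `select_rows` is a hypothetical helper, not part of RedAmber.
def select_rows(columns, booleans)
  size = columns.values.first.size
  # The flags must line up one-to-one with the rows.
  raise ArgumentError, "booleans must have #{size} elements" unless booleans.size == size

  # Keep only positions flagged exactly true (assumption: nil is dropped like false).
  kept = booleans.each_index.select { |i| booleans[i] == true }
  columns.transform_values { |values| values.values_at(*kept) }
end

columns = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2.0, 3.0] }
selected = select_rows(columns, [true, false, nil])
p selected # only the first row survives
```

The length check mirrors the "same size as self" requirement in the text; a real data frame would return a new DataFrame rather than a Hash.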
@@ -403,19 +412,19 @@ # => #<RedAmber::DataFrame : 344 x 2 Vectors, 0x0000000000035ebc> species bill_length_mm <string> <double> - 1 Adelie 39.1 - 2 Adelie 39.5 - 3 Adelie 40.3 - 4 Adelie (nil) - 5 Adelie 36.7 + 0 Adelie 39.1 + 1 Adelie 39.5 + 2 Adelie 40.3 + 3 Adelie (nil) + 4 Adelie 36.7 : : : - 342 Gentoo 50.4 - 343 Gentoo 45.2 - 344 Gentoo 49.9 + 341 Gentoo 50.4 + 342 Gentoo 45.2 + 343 Gentoo 49.9 ``` - Indices as arguments `pick(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. @@ -425,41 +434,41 @@ # => #<RedAmber::DataFrame : 344 x 4 Vectors, 0x0000000000055ce4> species island bill_length_mm year <string> <string> <double> <uint16> - 1 Adelie Torgersen 39.1 2007 - 2 Adelie Torgersen 39.5 2007 - 3 Adelie Torgersen 40.3 2007 - 4 Adelie Torgersen (nil) 2007 - 5 Adelie Torgersen 36.7 2007 + 0 Adelie Torgersen 39.1 2007 + 1 Adelie Torgersen 39.5 2007 + 2 Adelie Torgersen 40.3 2007 + 3 Adelie Torgersen (nil) 2007 + 4 Adelie Torgersen 36.7 2007 : : : : : - 342 Gentoo Biscoe 50.4 2009 - 343 Gentoo Biscoe 45.2 2009 - 344 Gentoo Biscoe 49.9 2009 + 341 Gentoo Biscoe 50.4 2009 + 342 Gentoo Biscoe 45.2 2009 + 343 Gentoo Biscoe 49.9 2009 ``` - Booleans as arguments `pick(booleans)` accepts booleans as arguments in an Array. Booleans must be same length as `n_keys`. 
```ruby - penguins.pick(penguins.types.map { |type| type == :string }) + penguins.pick(penguins.vectors.map(&:string?)) # => #<RedAmber::DataFrame : 344 x 3 Vectors, 0x00000000000387ac> species island sex <string> <string> <string> - 1 Adelie Torgersen male + 0 Adelie Torgersen male + 1 Adelie Torgersen female 2 Adelie Torgersen female - 3 Adelie Torgersen female - 4 Adelie Torgersen (nil) - 5 Adelie Torgersen female + 3 Adelie Torgersen (nil) + 4 Adelie Torgersen female : : : : - 342 Gentoo Biscoe male - 343 Gentoo Biscoe female - 344 Gentoo Biscoe male + 341 Gentoo Biscoe male + 342 Gentoo Biscoe female + 343 Gentoo Biscoe male ``` - Keys or booleans by a block `pick {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return keys, indices or a boolean Array with a same length as `n_keys`. Block is called in the context of self. @@ -469,19 +478,19 @@ # => #<RedAmber::DataFrame : 344 x 3 Vectors, 0x000000000003dd4c> bill_length_mm bill_depth_mm flipper_length_mm <double> <double> <uint8> - 1 39.1 18.7 181 - 2 39.5 17.4 186 - 3 40.3 18.0 195 - 4 (nil) (nil) (nil) - 5 36.7 19.3 193 + 0 39.1 18.7 181 + 1 39.5 17.4 186 + 2 40.3 18.0 195 + 3 (nil) (nil) (nil) + 4 36.7 19.3 193 : : : : - 342 50.4 15.7 222 - 343 45.2 14.8 212 - 344 49.9 16.1 213 + 341 50.4 15.7 222 + 342 45.2 14.8 212 + 343 49.9 16.1 213 ``` ### `drop ` - pick and drop - Drop some columns (variables) to create a remainer DataFrame. @@ -524,13 +533,13 @@ # => #<RedAmber::DataFrame : 3 x 1 Vector, 0x000000000003f4bc> a <uint8> - 1 1 - 2 2 - 3 3 + 0 1 + 1 2 + 2 3 df[:a] # => #<RedAmber::Vector(:uint8, size=3):0x000000000000f258> @@ -564,21 +573,21 @@ # returns 5 obs. at start and 5 obs. from end penguins.slice(0...5, -5..-1) # => #<RedAmber::DataFrame : 10 x 8 Vectors, 0x0000000000042be4> - species island bill_length_mm bill_depth_mm flipper_length_mm ... year - <string> <string> <double> <double> <uint8> ... <uint16> - 1 Adelie Torgersen 39.1 18.7 181 ... 
2007 - 2 Adelie Torgersen 39.5 17.4 186 ... 2007 - 3 Adelie Torgersen 40.3 18.0 195 ... 2007 - 4 Adelie Torgersen (nil) (nil) (nil) ... 2007 - 5 Adelie Torgersen 36.7 19.3 193 ... 2007 - : : : : : : ... : - 8 Gentoo Biscoe 50.4 15.7 222 ... 2009 - 9 Gentoo Biscoe 45.2 14.8 212 ... 2009 - 10 Gentoo Biscoe 49.9 16.1 213 ... 2009 + species island bill_length_mm bill_depth_mm flipper_length_mm ... year + <string> <string> <double> <double> <uint8> ... <uint16> + 0 Adelie Torgersen 39.1 18.7 181 ... 2007 + 1 Adelie Torgersen 39.5 17.4 186 ... 2007 + 2 Adelie Torgersen 40.3 18.0 195 ... 2007 + 3 Adelie Torgersen (nil) (nil) (nil) ... 2007 + 4 Adelie Torgersen 36.7 19.3 193 ... 2007 + : : : : : : ... : + 7 Gentoo Biscoe 50.4 15.7 222 ... 2009 + 8 Gentoo Biscoe 45.2 14.8 212 ... 2009 + 9 Gentoo Biscoe 49.9 16.1 213 ... 2009 ``` - Booleans as an argument `slice(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`. @@ -589,19 +598,19 @@ # => #<RedAmber::DataFrame : 242 x 8 Vectors, 0x0000000000043d3c> species island bill_length_mm bill_depth_mm flipper_length_mm ... year <string> <string> <double> <double> <uint8> ... <uint16> - 1 Adelie Torgersen 40.3 18.0 195 ... 2007 - 2 Adelie Torgersen 42.0 20.2 190 ... 2007 - 3 Adelie Torgersen 41.1 17.6 182 ... 2007 - 4 Adelie Torgersen 42.5 20.7 197 ... 2007 - 5 Adelie Torgersen 46.0 21.5 194 ... 2007 + 0 Adelie Torgersen 40.3 18.0 195 ... 2007 + 1 Adelie Torgersen 42.0 20.2 190 ... 2007 + 2 Adelie Torgersen 41.1 17.6 182 ... 2007 + 3 Adelie Torgersen 42.5 20.7 197 ... 2007 + 4 Adelie Torgersen 46.0 21.5 194 ... 2007 : : : : : : ... : - 240 Gentoo Biscoe 50.4 15.7 222 ... 2009 - 241 Gentoo Biscoe 45.2 14.8 212 ... 2009 - 242 Gentoo Biscoe 49.9 16.1 213 ... 2009 + 239 Gentoo Biscoe 50.4 15.7 222 ... 2009 + 240 Gentoo Biscoe 45.2 14.8 212 ... 2009 + 241 Gentoo Biscoe 49.9 16.1 213 ... 
2009 ``` - Indices or booleans by a block `slice {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as `size`. Block is called in the context of self. @@ -617,19 +626,19 @@ # => #<RedAmber::DataFrame : 204 x 8 Vectors, 0x0000000000047a40> species island bill_length_mm bill_depth_mm flipper_length_mm ... year <string> <string> <double> <double> <uint8> ... <uint16> - 1 Adelie Torgersen 39.1 18.7 181 ... 2007 - 2 Adelie Torgersen 39.5 17.4 186 ... 2007 - 3 Adelie Torgersen 40.3 18.0 195 ... 2007 - 4 Adelie Torgersen 39.3 20.6 190 ... 2007 - 5 Adelie Torgersen 38.9 17.8 181 ... 2007 + 0 Adelie Torgersen 39.1 18.7 181 ... 2007 + 1 Adelie Torgersen 39.5 17.4 186 ... 2007 + 2 Adelie Torgersen 40.3 18.0 195 ... 2007 + 3 Adelie Torgersen 39.3 20.6 190 ... 2007 + 4 Adelie Torgersen 38.9 17.8 181 ... 2007 : : : : : : ... : - 202 Gentoo Biscoe 47.2 13.7 214 ... 2009 - 203 Gentoo Biscoe 46.8 14.3 215 ... 2009 - 204 Gentoo Biscoe 45.2 14.8 212 ... 2009 + 201 Gentoo Biscoe 47.2 13.7 214 ... 2009 + 202 Gentoo Biscoe 46.8 14.3 215 ... 2009 + 203 Gentoo Biscoe 45.2 14.8 212 ... 2009 ``` - Notice: nil option - `Arrow::Table#slice` uses `filter` method with a option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row. @@ -672,19 +681,19 @@ # => #<RedAmber::DataFrame : 334 x 8 Vectors, 0x00000000000487c4> species island bill_length_mm bill_depth_mm flipper_length_mm ... year <string> <string> <double> <double> <uint8> ... <uint16> - 1 Adelie Torgersen 39.3 20.6 190 ... 2007 - 2 Adelie Torgersen 38.9 17.8 181 ... 2007 - 3 Adelie Torgersen 39.2 19.6 195 ... 2007 - 4 Adelie Torgersen 34.1 18.1 193 ... 2007 - 5 Adelie Torgersen 42.0 20.2 190 ... 2007 + 0 Adelie Torgersen 39.3 20.6 190 ... 2007 + 1 Adelie Torgersen 38.9 17.8 181 ... 2007 + 2 Adelie Torgersen 39.2 19.6 195 ... 2007 + 3 Adelie Torgersen 34.1 18.1 193 ... 
2007 + 4 Adelie Torgersen 42.0 20.2 190 ... 2007 : : : : : : ... : - 332 Gentoo Biscoe 44.5 15.7 217 ... 2009 - 333 Gentoo Biscoe 48.8 16.2 222 ... 2009 - 334 Gentoo Biscoe 47.2 13.7 214 ... 2009 + 331 Gentoo Biscoe 44.5 15.7 217 ... 2009 + 332 Gentoo Biscoe 48.8 16.2 222 ... 2009 + 333 Gentoo Biscoe 47.2 13.7 214 ... 2009 ``` - Booleans as an argument `remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `size`. @@ -696,19 +705,19 @@ # => #<RedAmber::DataFrame : 333 x 8 Vectors, 0x0000000000049fac> species island bill_length_mm bill_depth_mm flipper_length_mm ... year <string> <string> <double> <double> <uint8> ... <uint16> - 1 Adelie Torgersen 39.1 18.7 181 ... 2007 - 2 Adelie Torgersen 39.5 17.4 186 ... 2007 - 3 Adelie Torgersen 40.3 18.0 195 ... 2007 - 4 Adelie Torgersen 36.7 19.3 193 ... 2007 - 5 Adelie Torgersen 39.3 20.6 190 ... 2007 + 0 Adelie Torgersen 39.1 18.7 181 ... 2007 + 1 Adelie Torgersen 39.5 17.4 186 ... 2007 + 2 Adelie Torgersen 40.3 18.0 195 ... 2007 + 3 Adelie Torgersen 36.7 19.3 193 ... 2007 + 4 Adelie Torgersen 39.3 20.6 190 ... 2007 : : : : : : ... : - 331 Gentoo Biscoe 50.4 15.7 222 ... 2009 - 332 Gentoo Biscoe 45.2 14.8 212 ... 2009 - 333 Gentoo Biscoe 49.9 16.1 213 ... 2009 + 330 Gentoo Biscoe 50.4 15.7 222 ... 2009 + 331 Gentoo Biscoe 45.2 14.8 212 ... 2009 + 332 Gentoo Biscoe 49.9 16.1 213 ... 2009 ``` - Indices or booleans by a block `remove {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as `size`. Block is called in the context of self. @@ -725,19 +734,19 @@ # => #<RedAmber::DataFrame : 140 x 8 Vectors, 0x000000000004de40> species island bill_length_mm bill_depth_mm flipper_length_mm ... year <string> <string> <double> <double> <uint8> ... <uint16> - 1 Adelie Torgersen (nil) (nil) (nil) ... 2007 - 2 Adelie Torgersen 36.7 19.3 193 ... 
2007 - 3 Adelie Torgersen 34.1 18.1 193 ... 2007 - 4 Adelie Torgersen 37.8 17.1 186 ... 2007 - 5 Adelie Torgersen 37.8 17.3 180 ... 2007 + 0 Adelie Torgersen (nil) (nil) (nil) ... 2007 + 1 Adelie Torgersen 36.7 19.3 193 ... 2007 + 2 Adelie Torgersen 34.1 18.1 193 ... 2007 + 3 Adelie Torgersen 37.8 17.1 186 ... 2007 + 4 Adelie Torgersen 37.8 17.3 180 ... 2007 : : : : : : ... : - 138 Gentoo Biscoe (nil) (nil) (nil) ... 2009 - 139 Gentoo Biscoe 50.4 15.7 222 ... 2009 - 140 Gentoo Biscoe 49.9 16.1 213 ... 2009 + 137 Gentoo Biscoe (nil) (nil) (nil) ... 2009 + 138 Gentoo Biscoe 50.4 15.7 222 ... 2009 + 139 Gentoo Biscoe 49.9 16.1 213 ... 2009 ``` - Notice for nil - When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`. @@ -768,12 +777,12 @@ # => #<RedAmber::DataFrame : 2 x 3 Vectors, 0x000000000005df98> a b c <uint8> <string> <double> - 1 1 A 1.0 - 2 (nil) C 3.0 + 0 1 A 1.0 + 1 (nil) C 3.0 ``` ### `rename` Rename keys (column names) to create a updated DataFrame. @@ -790,13 +799,13 @@ # => #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000060838> name age_in_1993 <string> <uint8> - 1 Yasuko 68 - 2 Rui 49 - 3 Hinata 28 + 0 Yasuko 68 + 1 Rui 49 + 2 Hinata 28 ``` - Key pairs by a block `rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays like `[[existing_key, new_key], ... ]`. Block is called in the context of self. 
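The two accepted shapes of key pairs for `rename` (a Hash or an Array of `[existing_key, new_key]` Arrays) can be normalized the same way. The following is a plain-Ruby sketch under stated assumptions: `rename_keys` is a hypothetical helper operating on a Hash of columns, and raising on an unknown key is an illustrative choice, not documented RedAmber behavior.

```ruby
# Plain-Ruby model of renaming keys from key pairs.
# `rename_keys` is a hypothetical helper, not RedAmber's implementation.
def rename_keys(columns, key_pairs)
  # A Hash and an Array of [existing_key, new_key] pairs both convert via to_h.
  mapping = key_pairs.to_h
  unknown = mapping.keys - columns.keys
  # Assumption for this sketch: renaming a key that does not exist is an error.
  raise ArgumentError, "unknown keys: #{unknown.inspect}" unless unknown.empty?

  # Keys without a mapping keep their name; column order is preserved.
  columns.transform_keys { |key| mapping.fetch(key, key) }
end

columns  = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
by_hash  = rename_keys(columns, { age: :age_in_1993 })
by_array = rename_keys(columns, [[:age, :age_in_1993]])
p by_hash.keys
```

Both calls produce the same result, which is why a block may return either form.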
@@ -830,13 +839,13 @@

   # =>
   #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000062804>
     name         age
     <string> <uint8>
-  1 Yasuko        68
-  2 Rui           49
-  3 Hinata        28
+  0 Yasuko        68
+  1 Rui           49
+  2 Hinata        28

   # update :age and add :brother
   df.assign do
     {
       age: age + 29,
@@ -846,13 +855,13 @@

   # =>
   #<RedAmber::DataFrame : 3 x 3 Vectors, 0x00000000000658b0>
     name         age brother
     <string> <uint8> <string>
-  1 Yasuko        97 Santa
-  2 Rui           78 (nil)
-  3 Hinata        57 Momotaro
+  0 Yasuko        97 Santa
+  1 Rui           78 (nil)
+  2 Hinata        57 Momotaro
   ```

 - Key pairs by a block

   `assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and values as a Hash of `{key => array_like}` or an Array of Arrays like `[[key, array_like], ... ]`. `array_like` is either a `Vector`, an `Array` or an `Arrow::Array`. The block is called in the context of self.
@@ -867,15 +876,15 @@

   # =>
   #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
       index    float string
     <uint8> <double> <string>
-  1       0      0.0 A
-  2       1      1.1 B
-  3       2      2.2 C
-  4       3      NaN D
-  5   (nil)    (nil) (nil)
+  0       0      0.0 A
+  1       1      1.1 B
+  2       2      2.2 C
+  3       3      NaN D
+  4   (nil)    (nil) (nil)

   # update :float
   # assigner by an Array
   df.assign do
     vectors.select(&:float?)
@@ -884,15 +893,15 @@

   # =>
   #<RedAmber::DataFrame : 5 x 3 Vectors, 0x00000000000dfffc>
       index    float string
     <uint8> <double> <string>
-  1       0     -0.0 A
-  2       1     -1.1 B
-  3       2     -2.2 C
-  4       3      NaN D
-  5   (nil)    (nil) (nil)
+  0       0     -0.0 A
+  1       1     -1.1 B
+  2       2     -2.2 C
+  3       3      NaN D
+  4   (nil)    (nil) (nil)

   # Or we can use assigner by a Hash
   df.assign do
     vectors.select.with_object({}) do |v, assigner|
       assigner[v.key] = -v if v.float?
@@ -919,15 +928,15 @@

   # =>
   #<RedAmber::DataFrame : 5 x 4 Vectors, 0x000000000001787c>
     new_index   index    float string
       <uint8> <uint8> <double> <string>
-  1         1       0      0.0 A
-  2         2       1      1.1 B
-  3         3       2      2.2 C
-  4         4       3      NaN D
-  5         5   (nil)    (nil) (nil)
+  0         1       0      0.0 A
+  1         2       1      1.1 B
+  2         3       2      2.2 C
+  3         4       3      NaN D
+  4         5   (nil)    (nil) (nil)
   ```

 ### `slice_by(key, keep_key: false) { block }`

   `slice_by` accepts a key and a block to select rows.
@@ -944,24 +953,24 @@

   # =>
   #<RedAmber::DataFrame : 5 x 3 Vectors, 0x0000000000069e60>
       index    float string
     <uint8> <double> <string>
-  1       0      0.0 A
-  2       1      1.1 B
-  3       2      2.2 C
-  4       3      NaN D
-  5   (nil)    (nil) (nil)
+  0       0      0.0 A
+  1       1      1.1 B
+  2       2      2.2 C
+  3       3      NaN D
+  4   (nil)    (nil) (nil)

   df.slice_by(:string) { ["A", "C"] }

   # =>
   #<RedAmber::DataFrame : 2 x 2 Vectors, 0x000000000001b1ac>
       index    float
     <uint8> <double>
-  1       0      0.0
-  2       2      2.2
+  0       0      0.0
+  1       2      2.2
   ```

   It is the same behavior as:

   ```ruby
@@ -975,13 +984,13 @@

   # =>
   #<RedAmber::DataFrame : 3 x 2 Vectors, 0x0000000000069668>
       index    float
     <uint8> <double>
-  1       0      0.0
-  2       1      1.1
-  3       2      2.2
+  0       0      0.0
+  1       1      1.1
+  2       2      2.2
   ```

   When the option `keep_key: true` is used, the column `key` will be preserved.

   ```ruby
@@ -989,13 +998,13 @@

   # =>
   #<RedAmber::DataFrame : 3 x 3 Vectors, 0x0000000000073c44>
       index    float string
     <uint8> <double> <string>
-  1       0      0.0 A
-  2       1      1.1 B
-  3       2      2.2 C
+  0       0      0.0 A
+  1       1      1.1 B
+  2       2      2.2 C
   ```

 ## Updating

 ### `sort`
@@ -1014,15 +1023,15 @@

   # =>
   #<RedAmber::DataFrame : 5 x 3 Vectors, 0x000000000009b03c>
       index string   bool
     <uint8> <string> <boolean>
-  1       0 (nil)    false
-  2       0 B        false
-  3       1 B        true
-  4       1 C        (nil)
-  5   (nil) A        true
+  0       0 (nil)    false
+  1       0 B        false
+  2       1 B        true
+  3       1 C        (nil)
+  4   (nil) A        true
   ```

 - [ ] Clamp

 - [ ] Clear data
@@ -1035,11 +1044,11 @@

 ## Grouping

 ### `group(group_keys)`

-  `group` creates a class `Group` object. `Group` accepts functions below as a method.
+  `group` creates an instance of class `Group`. `Group` accepts functions below as a method.
   Method accepts options as `group_keys`.
Available functions are: - [ ] all @@ -1062,110 +1071,112 @@ Summary key names are provided by `function(summary_keys)` style. This is an example of grouping of famous STARWARS dataset. ```ruby - starwars = - RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv")) - starwars + uri = URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv") + starwars = RedAmber::DataFrame.load(uri) # => #<RedAmber::DataFrame : 87 x 12 Vectors, 0x0000000000005a50> unnamed1 name height mass hair_color skin_color eye_color ... species <int64> <string> <int64> <double> <string> <string> <string> ... <string> - 1 1 Luke Skywalker 172 77.0 blond fair blue ... Human - 2 2 C-3PO 167 75.0 NA gold yellow ... Droid - 3 3 R2-D2 96 32.0 NA white, blue red ... Droid - 4 4 Darth Vader 202 136.0 none white yellow ... Human - 5 5 Leia Organa 150 49.0 brown light brown ... Human + 0 1 Luke Skywalker 172 77.0 blond fair blue ... Human + 1 2 C-3PO 167 75.0 NA gold yellow ... Droid + 2 3 R2-D2 96 32.0 NA white, blue red ... Droid + 3 4 Darth Vader 202 136.0 none white yellow ... Human + 4 5 Leia Organa 150 49.0 brown light brown ... Human : : : : : : : : ... : - 85 85 BB8 (nil) (nil) none none black ... Droid - 86 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA - 87 87 Padmé Amidala 165 45.0 brown light brown ... Human + 84 85 BB8 (nil) (nil) none none black ... Droid + 85 86 Captain Phasma (nil) (nil) unknown unknown unknown ... NA + 86 87 Padmé Amidala 165 45.0 brown light brown ... Human starwars.tdr(12) # => RedAmber::DataFrame : 87 x 12 Vectors Vectors : 4 numeric, 8 strings # key type level data_preview - 1 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ] - 2 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ] - 3 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils - 4 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... 
], 28 nils - 5 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ] - 6 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ] - 7 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ] - 8 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils - 9 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, "NA"=>4} - 10 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4} - 11 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ] - 12 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ] + 0 :unnamed1 int64 87 [1, 2, 3, 4, 5, ... ] + 1 :name string 87 ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", ... ] + 2 :height int64 46 [172, 167, 96, 202, 150, ... ], 6 nils + 3 :mass double 39 [77.0, 75.0, 32.0, 136.0, 49.0, ... ], 28 nils + 4 :hair_color string 13 ["blond", "NA", "NA", "none", "brown", ... ] + 5 :skin_color string 31 ["fair", "gold", "white, blue", "white", "light", ... ] + 6 :eye_color string 15 ["blue", "yellow", "red", "yellow", "brown", ... ] + 7 :birth_year double 37 [19.0, 112.0, 33.0, 41.9, 19.0, ... ], 44 nils + 8 :sex string 5 {"male"=>60, "none"=>6, "female"=>16, "hermaphroditic"=>1, "NA"=>4} + 9 :gender string 3 {"masculine"=>66, "feminine"=>17, "NA"=>4} + 10 :homeworld string 49 ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", ... ] + 11 :species string 38 ["Human", "Droid", "Droid", "Human", "Human", ... ] ``` We can group by `:species` and calculate the count. 
```ruby - starwars.group(:species).count(:species) + starwars.remove { species == "NA" } + .group(:species).count(:species) # => - #<RedAmber::DataFrame : 38 x 2 Vectors, 0x000000000001d6f0> + #<RedAmber::DataFrame : 37 x 2 Vectors, 0x000000000000ffa0> species count <string> <int64> - 1 Human 35 - 2 Droid 6 - 3 Wookiee 2 - 4 Rodian 1 - 5 Hutt 1 + 0 Human 35 + 1 Droid 6 + 2 Wookiee 2 + 3 Rodian 1 + 4 Hutt 1 : : : - 36 Kaleesh 1 - 37 Pau'an 1 - 38 Kel Dor 1 + 34 Kaleesh 1 + 35 Pau'an 1 + 36 Kel Dor 1 ``` We can also calculate the mean of `:mass` and `:height` together. ```ruby - grouped = starwars.group(:species) { [count(:species), mean(:height, :mass)] } + grouped = starwars.remove { species == "NA" } + .group(:species) { [count(:species), mean(:height, :mass)] } # => - #<RedAmber::DataFrame : 38 x 4 Vectors, 0x00000000000407cc> - specie s count mean(height) mean(mass) - <strin g> <int64> <double> <double> - 1 Human 35 176.6 82.8 - 2 Droid 6 131.2 69.8 - 3 Wookie e 2 231.0 124.0 - 4 Rodian 1 173.0 74.0 - 5 Hutt 1 175.0 1358.0 - : : : : : - 36 Kalees h 1 216.0 159.0 - 37 Pau'an 1 206.0 80.0 - 38 Kel Dor 1 188.0 80.0 + #<RedAmber::DataFrame : 37 x 4 Vectors, 0x000000000000fff0> + species count mean(height) mean(mass) + <string> <int64> <double> <double> + 0 Human 35 176.65 82.78 + 1 Droid 6 131.2 69.75 + 2 Wookiee 2 231.0 124.0 + 3 Rodian 1 173.0 74.0 + 4 Hutt 1 175.0 1358.0 + : : : : : + 34 Kaleesh 1 216.0 159.0 + 35 Pau'an 1 206.0 80.0 + 36 Kel Dor 1 188.0 80.0 ``` Select rows for count > 1. 
```ruby grouped.slice(grouped[:count] > 1) # => - #<RedAmber::DataFrame : 9 x 4 Vectors, 0x000000000004c270> + #<RedAmber::DataFrame : 8 x 4 Vectors, 0x000000000001002c> species count mean(height) mean(mass) <string> <int64> <double> <double> - 1 Human 35 176.6 82.8 - 2 Droid 6 131.2 69.8 - 3 Wookiee 2 231.0 124.0 - 4 Gungan 3 208.7 74.0 - 5 NA 4 181.3 48.0 - : : : : : - 7 Twi'lek 2 179.0 55.0 - 8 Mirialan 2 168.0 53.1 - 9 Kaminoan 2 221.0 88.0 + 0 Human 35 176.65 82.78 + 1 Droid 6 131.2 69.75 + 2 Wookiee 2 231.0 124.0 + 3 Gungan 3 208.67 74.0 + 4 Zabrak 2 173.0 80.0 + 5 Twi'lek 2 179.0 55.0 + 6 Mirialan 2 168.0 53.1 + 7 Kaminoan 2 221.0 88.0 ``` ## Reshape +![dataframe reshapeing image](doc/../image/reshaping_dataframe.png) + ### `transpose` Creates transposed DataFrame for the wide (messy) dataframe. ```ruby @@ -1173,30 +1184,31 @@ # => #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000d520> Year Audi BMW BMW_MINI Mercedes-Benz VW <int64> <int64> <int64> <int64> <int64> <int64> - 1 2017 28336 52527 25427 68221 49040 - 2 2018 26473 50982 25984 67554 51961 - 3 2019 24222 46814 23813 66553 46794 - 4 2020 22304 35712 20196 57041 36576 - 5 2021 22535 35905 18211 51722 35215 - import_cars.transpose(:Manufacturer) + 0 2017 28336 52527 25427 68221 49040 + 1 2018 26473 50982 25984 67554 51961 + 2 2019 24222 46814 23813 66553 46794 + 3 2020 22304 35712 20196 57041 36576 + 4 2021 22535 35905 18211 51722 35215 + import_cars.transpose(name: :Manufacturer) + # => - #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000ef74> + #<RedAmber::DataFrame : 5 x 6 Vectors, 0x0000000000010a2c> Manufacturer 2017 2018 2019 2020 2021 - <dictionary> <uint32> <uint32> <uint32> <uint16> <uint16> - 1 Audi 28336 26473 24222 22304 22535 - 2 BMW 52527 50982 46814 35712 35905 - 3 BMW_MINI 25427 25984 23813 20196 18211 - 4 Mercedes-Benz 68221 67554 66553 57041 51722 - 5 VW 49040 51961 46794 36576 35215 + <string> <uint32> <uint32> <uint32> <uint16> <uint16> + 0 Audi 28336 26473 24222 22304 22535 
+  1 BMW              52527    50982    46814    35712    35905
+  2 BMW_MINI         25427    25984    23813    20196    18211
+  3 Mercedes-Benz    68221    67554    66553    57041    51722
+  4 VW               49040    51961    46794    36576    35215
   ```

   The leftmost column is created by original keys. Key name of the column is
-  named by parameter `:name`. If `:name` is not specified, `:N` is used for the key.
+  named by parameter `:name`. If `:name` is not specified, `:NAME` is used for the key.

 ### `to_long(*keep_keys)`

   Creates a 'long' (tidy) DataFrame from a 'wide' DataFrame.
@@ -1204,67 +1216,69 @@

   ```ruby
   import_cars.to_long(:Year)

   # =>
-  #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000012750>
-         Year N                    V
-     <uint16> <dictionary>  <uint32>
-   1     2017 Audi             28336
-   2     2017 BMW              52527
-   3     2017 BMW_MINI         25427
-   4     2017 Mercedes-Benz    68221
-   5     2017 VW               49040
+  #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000011864>
+         Year NAME             VALUE
+     <uint16> <string>      <uint32>
+   0     2017 Audi             28336
+   1     2017 BMW              52527
+   2     2017 BMW_MINI         25427
+   3     2017 Mercedes-Benz    68221
+   4     2017 VW               49040
    :        : :                    :
-  23     2021 BMW_MINI         18211
-  24     2021 Mercedes-Benz    51722
-  25     2021 VW               35215
+  22     2021 BMW_MINI         18211
+  23     2021 Mercedes-Benz    51722
+  24     2021 VW               35215
   ```

   - Option `:name` is the key of the column which came **from key names**.
+    The default value is `:NAME` if it is not specified.

   - Option `:value` is the key of the column which came **from values**.
+    The default value is `:VALUE` if it is not specified.

   ```ruby
   import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)

   # =>
-  #<RedAmber::DataFrame : 25 x 3 Vectors, 0x0000000000017700>
+  #<RedAmber::DataFrame : 25 x 3 Vectors, 0x000000000001359c>
          Year Manufacturer  Num_of_imported
-     <uint16> <dictionary>         <uint32>
-   1     2017 Audi                    28336
-   2     2017 BMW                     52527
-   3     2017 BMW_MINI                25427
-   4     2017 Mercedes-Benz           68221
-   5     2017 VW                      49040
+     <uint16> <string>             <uint32>
+   0     2017 Audi                    28336
+   1     2017 BMW                     52527
+   2     2017 BMW_MINI                25427
+   3     2017 Mercedes-Benz           68221
+   4     2017 VW                      49040
    :        : :                           :
-  23     2021 BMW_MINI                18211
-  24     2021 Mercedes-Benz           51722
-  25     2021 VW                      35215
+  22     2021 BMW_MINI                18211
+  23     2021 Mercedes-Benz           51722
+  24     2021 VW                      35215
   ```

 ### `to_wide`

   Creates a 'wide' (messy) DataFrame from a 'long' DataFrame.

   - Option `:name` is the key of the column which will be expanded **to key names**.
+    The default value is `:NAME` if it is not specified.
   - Option `:value` is the key of the column which will be expanded **to values**.
+    The default value is `:VALUE` if it is not specified.

   ```ruby
   import_cars.to_long(:Year).to_wide
   # import_cars.to_long(:Year).to_wide(name: :NAME, value: :VALUE)
   # is also OK

   # =>
   #<RedAmber::DataFrame : 5 x 6 Vectors, 0x000000000000f0f0>
         Year     Audi      BMW BMW_MINI Mercedes-Benz       VW
     <uint16> <uint16> <uint16> <uint16>      <uint32> <uint16>
-  1     2017    28336    52527    25427         68221    49040
-  2     2018    26473    50982    25984         67554    51961
-  3     2019    24222    46814    23813         66553    46794
-  4     2020    22304    35712    20196         57041    36576
-  5     2021    22535    35905    18211         51722    35215
-
-  # == import_cars
+  0     2017    28336    52527    25427         68221    49040
+  1     2018    26473    50982    25984         67554    51961
+  2     2019    24222    46814    23813         66553    46794
+  3     2020    22304    35712    20196         57041    36576
+  4     2021    22535    35905    18211         51722    35215
   ```

 ## Combine

 - [ ] Combining dataframes