--- title: 127 examples of Red Amber author: heronshoes date: '2023-08-11' format: pdf: code-fold: false jupyter: ruby format: pdf: toc: true fontfamily: libertinus colorlinks: true --- For RedAmber Version 0.5.1, 0.5.2 and Arrow version 12.0.1, 13.0.0 . ## 1. Install Install requirements before you install RedAmber. - Ruby (>= 3.0) - Apache Arrow (>= 12.0.0) - Apache Arrow GLib (>= 12.0.0) - Apache Parquet GLib (>= 12.0.0) # if you need IO from/to Parquet resource. See [Apache Arrow install document](https://arrow.apache.org/install/). - Minimum installation example for the latest Ubuntu: ```shell sudo apt update sudo apt install -y -V ca-certificates lsb-release wget wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb sudo apt update sudo apt install -y -V libarrow-dev sudo apt install -y -V libarrow-glib-dev ``` - On Fedora 38 (Rawhide): ```shell sudo dnf update sudo dnf -y install gcc-c++ libarrow-devel libarrow-glib-devel ruby-devel libyaml-devel - On macOS, you can install Apache Arrow C++ library using Homebrew: ```shell brew install apache-arrow ``` and GLib (C) package with: ```shell brew install apache-arrow-glib ``` If you prepared Apache Arrow, add these lines to your Gemfile: ```ruby gem 'red-arrow', '>= 12.0.0' gem 'red_amber' gem 'red-arrow-numo-narray' # Optional, recommended if you use inputs from Numo::NArray # or use random sampling feature. gem 'red-parquet', '>= 12.0.0' # Optional, if you use IO from/to parquet gem 'red-datasets-arrow' # Optional, recommended if you use Red Datasets gem 'red-arrow-activerecord' # Optional, if you use Active Record gem 'rover-df', # Optional, if you use IO from/to Rover::DataFrame. ``` And then execute `bundle install` or install it yourself as `gem install red_amber`. ## 2. Require ```{ruby} #| tags: [] require 'red_amber' # require 'red-amber' is also OK include RedAmber {RedAmber: VERSION, Arrow: Arrow::VERSION} ``` ## 3. Initialize There are several ways to initialize a DataFrame. ```{ruby} #| tags: [] # From a Hash DataFrame.new(x: [1, 2, 3], y: %w[A B C]) ``` ```{ruby} #| tags: [] # From a schema and a row-oriented array DataFrame.new({ x: :uint8, y: :string }, [[1, 'A'], [2, 'B'], [3, 'C']]) ``` ```{ruby} #| tags: [] # From an Arrow::Table table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C]) DataFrame.new(table) ``` ```{ruby} #| tags: [] # From a Rover::DataFrame require 'rover' rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C]) DataFrame.new(rover) ``` ```{ruby} #| tags: [] # from a datasets in Red Datasets require 'datasets-arrow' dataset = Datasets::Penguins.new penguins = DataFrame.new(dataset) # Since 0.2.2 . If it is older, it must be `dataset.to_arrow`. ``` ```{ruby} #| tags: [] dataset = Datasets::Rdatasets.new('datasets', 'mtcars') mtcars = DataFrame.new(dataset) ``` (New from 0.2.3 with Arrow 10.0.0) It is possible to initialize by objects responsible to `to_arrow` since 0.2.3 . Arrays in Numo::NArray is responsible to `to_arrow` with `red-arrow-numo-narray` gem. This feature is proposed by the Red Data Tools member @kojix2 and implemented by @kou in Arrow 10.0.0 and Red Arrow Numo::NArray 0.0.6. Thanks! ```{ruby} #| tags: [] require 'arrow-numo-narray' DataFrame.new(numo: Numo::DFloat.new(3).rand) ``` Another example by Numo::NArray is [#77. Introduce columns from numo/narray](#77.-Introduce-columns-from-numo/narray). ## 4. Load `RedAmber::DataFrame` delegates `#load` to `Arrow::Table#load`. We can load from `[.arrow, .arrows, .csv, .csv.gz, .tsv]` files. `load` accepts following options: `load(input, format: nil, compression: nil, schema: nil, skip_lines: nil)` - `format` [:arrow_file, :batch, :arrows, :arrow_stream, :stream, :csv, :tsv] - `compression` [:gzip, nil] - `schema` [Arrow::Schema] - `skip_lines` [Regexp] Load from a file 'comecome.csv'; ```{ruby} #| tags: [] file = Tempfile.open(['comecome', '.csv']) do |f| f.puts(<<~CSV) name,age Yasuko,68 Rui,49 Hinata,28 CSV f end DataFrame.load(file) ``` Load from a Buffer; ```{ruby} #| tags: [] DataFrame.load(Arrow::Buffer.new(<<~BUFFER), format: :csv) name,age Yasuko,68 Rui,49 Hinata,28 BUFFER ``` Load from a Buffer skipping comment line; ```{ruby} #| tags: [] DataFrame.load(Arrow::Buffer.new(<<~BUFFER), format: :csv, skip_lines: /^#/) # comment name,age Yasuko,68 Rui,49 Hinata,28 BUFFER ``` ## 5. Load from a URI ```{ruby} #| tags: [] uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv") DataFrame.load(uri) ``` ## 6. Save `#save` accepts same options as `#load`. See [#4. Load](#4.-Load). ```{ruby} #| tags: [] penguins.save("penguins.arrow") penguins.save("penguins.arrows") penguins.save("penguins.csv") penguins.save("penguins.csv.gz") penguins.save("penguins.tsv") penguins.save("penguins.feather") ``` (Since 0.3.0) `DataFrame#save` returns self. ## 7. to_s/inspect `to_s` or `inspect` (it uses to_s inside) shows a preview of the dataframe. It shows first 5 and last 3 rows if it has many rows. Columns are also omitted if line is exceeded 80 letters. ```{ruby} #| tags: [] df = DataFrame.new( x: [1, 2, 3, 4, 5], y: [1, 2, 3, 0/0.0, nil], s: %w[A B C D] << nil, b: [true, false, true, false, nil] ) ``` ```{ruby} #| tags: [] p penguins; nil ``` ## 8. Show table `#table` shows Arrow::Table object. The alias is `#to_arrow`. ```{ruby} #| tags: [] df.table ``` ```{ruby} #| tags: [] penguins.to_arrow ``` ```{ruby} #| tags: [] # This is a Red Arrow's feature puts df.table.to_s(format: :column) ``` ```{ruby} #| tags: [] # This is also a Red Arrow's feature puts df.table.to_s(format: :list) ``` ## 9. TDR TDR means 'Transposed Dataframe Representation'. It shows columns in lateral just the same shape as initializing by a Hash. TDR has some information which is useful for the exploratory data processing. - DataFrame shape: n_rows x n_columns - Data types - Levels: number of unique elements - Data preview: same data is aggregated if level is smaller (tally mode) - Show counts of abnormal element: NaN and nil It is similar to dplyr's (or Polars's) `glimpse()` so we have an alias `#glimpse` (since 0.4.0). ```{ruby} #| tags: [] df.tdr ``` ```{ruby} #| tags: [] penguins.tdr ``` `#tdr` has some options: `limit` : to limit a number of variables to show. Default value is `limit=10`. ```{ruby} #| tags: [] penguins.tdr(3) ``` By default `#tdr` shows 9 variables at maximum. `#tdr(:all)` will show all variables. ```{ruby} #| tags: [] mtcars.tdr(:all) ``` (Since 0.4.0) `#tdra` method is short cut for `#tdr(:all)` ```{ruby} #| tags: [] mtcars.tdra ``` `elements` : max number of elements to show in observations. Default value is `elements: 5`. ```{ruby} #| tags: [] penguins.tdr(elements: 3) # Show first 3 items in data ``` `tally` : max level to use tally mode. Level means size of `tally`ed hash. Default value is `tally: 5`. ```{ruby} #| tags: [] penguins.tdr(tally: 0) # Don't use tally mode ``` `#tdr_str` returns a String. `#tdr` do the same thing as `puts #tdr_str` ```{ruby} #| tags: [] puts penguins.tdr_str ``` (Since 0.4.0) `#glimpse` is an alias for `#tdr`. ```{ruby} #| tags: [] mtcars.glimpse(:all, elements: 10) ``` ## 10. Size and shape ```{ruby} #| tags: [] # same as n_rows, n_obs df.size ``` ```{ruby} #| tags: [] # same as n_cols, n_vars df.n_keys ``` ```{ruby} #| tags: [] # [df.size, df.n_keys], [df.n_rows, df.n_cols] df.shape ``` ## 11. Keys ```{ruby} #| tags: [] df.keys ``` ```{ruby} #| tags: [] penguins.keys ``` ## 12. Types ```{ruby} #| tags: [] df.types ``` ```{ruby} #| tags: [] penguins.types ``` ## 13. Data type classes ```{ruby} #| tags: [] df.type_classes ``` ```{ruby} #| tags: [] penguins.type_classes ``` ## 14. Indices Another example of `indices` is in [66. Custom index](#66.-Custom-index). ```{ruby} #| tags: [] df.indexes # or df.indices ``` (Since 0.2.3) `#indices` returns Vector. ## 15. To an Array or a Hash DataFrame#to_a returns an array of row-oriented data without a header. ```{ruby} #| tags: [] df.to_a ``` If you need a column-oriented array with keys, use `.to_h.to_a` ```{ruby} #| tags: [] df.to_h ``` ```{ruby} #| tags: [] df.to_h.to_a ``` ## 16. Schema Schema is keys and value types pairs as a Hash. ```{ruby} #| tags: [] df.schema ``` ## 17. Vector Each variable (column in the table) is represented by a Vector object. ```{ruby} #| tags: [] df[:x] # This syntax will come later ``` Or create new Vector by the constructor. ```{ruby} #| tags: [] Vector.new(1, 2, 3, 4, 5) ``` ```{ruby} #| tags: [] Vector.new(1..5) ``` ```{ruby} #| tags: [] Vector.new([1, 2, 3], [4, 5]) ``` ```{ruby} #| tags: [] array = Arrow::Array.new([1, 2, 3, 4, 5]) Vector.new(array) ``` (Since 0.4.2) New constructor Vector[*array_like] has introduced. ```{ruby} #| tags: [] Vector[1, 2, 3, 4, 5] ``` ## 18. Vectors Returns an Array of Vectors as a DataFrame. ```{ruby} #| tags: [] df.vectors ``` ## 19. Variables Returns key and Vector pairs as a Hash. ```{ruby} #| tags: [] df.variables ``` ## 20. Select columns by #[ ] `DataFrame#[]` is overloading column operations and row operations. - For columns (variables) - Key in a Symbol: `df[:symbol]` - Key in a String: `df["string"]` - Keys in an Array: `df[:symbol1, "string", :symbol2]` - Keys by indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]` ```{ruby} #| tags: [] # Keys in a Symbol and a String df[:x, 'y'] ``` ```{ruby} #| tags: [] # Keys in a Range df[:x..:y] ``` ```{ruby} #| tags: [] # Keys with a index Range, and a symbol df[df.keys[2..], :x] ``` ## 21. Select rows by #[ ] `DataFrame#[]` is overloading column operations and row operations. - For rows (observations) - Select rows by a Index: `df[index]` - Select rows by Indices: `df[indices]` # Array, Arrow::Array, Vectors are acceptable for indices - Select rows by Ranges: `df[range]` - Select rows by Booleans: `df[booleans]` # Array, Arrow::Array, Vectors are acceptable for booleans ```{ruby} #| tags: [] # indices df[0, 2, 1] ``` ```{ruby} #| tags: [] # including a Range # negative indices are also acceptable df[1..2, -1] ``` ```{ruby} #| tags: [] # booleans # length of boolean should be the same as self df[false, true, true, false, true] ``` ```{ruby} #| tags: [] # Arrow::Array indices = Arrow::UInt8Array.new([0,2,4]) df[indices] ``` ```{ruby} #| tags: [] # By a Vector as indices indices = Vector.new(df.indices) # indices > 1 returns a boolean Vector df[indices > 1] ``` ```{ruby} #| tags: [] # By a Vector as booleans booleans = df[:b] ``` ```{ruby} #| tags: [] df[booleans] ``` ## 22. empty? ```{ruby} #| tags: [] df.empty? ``` ```{ruby} #| tags: [] DataFrame.new ``` ```{ruby} #| tags: [] DataFrame.new.empty? ``` ## 23. Select columns by pick `DataFrame#pick` accepts an Array of keys to pick up columns (variables) and creates a new DataFrame. You can change the order of columns at a same time. The name `pick` comes from the action to pick variables(columns) according to the label keys. ```{ruby} #| tags: [] df.pick(:s, :y) # or df.pick([:s, :y]) # OK too. ``` Or use a boolean Array of lengeh `n_key` to `pick`. This style preserves the order of variables. ```{ruby} #| tags: [] df.pick(false, true, true, false) # or df.pick([false, true, true, false]) # or df.pick(Vector.new([false, true, true, false])) ``` `#pick` also accepts a block in the context of self. Next example is picking up numeric variables. ```{ruby} #| tags: [] # reciever is required with the argument style df.pick(df.vectors.map(&:numeric?)) # with a block df.pick { vectors.map(&:numeric?) } ``` `pick` also accepts numeric indexes. (Since 0.2.1) ```{ruby} #| tags: [] df.pick(0, 3) ``` ## 24. Reject columns by drop `DataFrame#drop` accepts an Array keys to drop columns (variables) to create a remainer DataFrame. The name `drop` comes from the pair word of `pick`. ```{ruby} #| tags: [] df.drop(:x, :b) # df.drop([:x, :b]) #is OK too. ``` Or use a boolean Array of lengeh `n_key` to `drop`. ```{ruby} #| tags: [] df.drop(true, false, false, true) # df.drop([true, false, false, true]) # is OK too ``` `#drop` also accepts a block in the context of self. Next example will drop variables which have nil or NaN values. ```{ruby} #| tags: [] df.drop { vectors.map { |v| v.is_na.any } } ``` Argument style is also acceptable but it requires the reciever 'df'. ```{ruby} #| tags: [] df.drop(df.vectors.map { |v| v.is_na.any }) ``` `drop` also accepts numeric indexes. (Since 0.2.1) ```{ruby} #| tags: [] df.drop(0, 3) ``` ## 25. Pick/drop and nil When `pick` or `drop` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `BasicObject#!`. ```{ruby} #| tags: [] booleans = [true, true, false, nil] booleans_invert = booleans.map(&:!) # => [false, false, true, true] because nil.! is true df.pick(booleans) == df.drop(booleans_invert) ``` ## 26. Vector#invert, #primitive_invert For the boolean Vector; ```{ruby} #| tags: [] vector = Vector.new(booleans) ``` nil is converted to nil by `Vector#invert`. ```{ruby} #| tags: [] vector.invert # or !vector ``` So `df.pick(booleans) != df.drop(booleans.invert)` when booleans have any nils. On the other hand, `Vector#primitive_invert` follows Ruby's `BasicObject#!`'s behavior. Then pick and drop keep 'MECE' behavior. ```{ruby} #| tags: [] vector.primitive_invert ``` ```{ruby} #| tags: [] df.pick(vector) == df.drop(vector.primitive_invert) ``` ## 27. Pick/drop, #[] and #v When `pick` or `drop` select a single column (variable), it returns a `DataFrame` with one column (variable). ```{ruby} #| tags: [] df.pick(:x) # or df.drop(:y, :s, :b) ``` In contrast, when `[]` selects a single column (variable), it returns a `Vector`. ```{ruby} #| tags: [] df[:x] ``` This behavior may be useful to use with DataFrame manipulation verbs (like pick, drop, slice, remove, assign, rename). ```{ruby} #| tags: [] df.pick { keys.select { |key| df[key].numeric? } } ``` `df#v` method is same as `df#[]` to pick a Vector. But a little bit faster and easy to use in the block. ```{ruby} #| tags: [] df.v(:x) ``` ## 28. Slice Another example of `slice` is [#70. Row index label by slice_by](#70.-Row-index-label-by-slice_by). `slice` selects rows (records) to create a subset of a DataFrame. `slice(indeces)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. Negative index from the tail like Ruby's Array is also acceptable. ```{ruby} #| tags: [] # returns 5 rows from the start and 5 rows from the end penguins.slice(0...5, -5..-1) ``` ```{ruby} #| tags: [] # slice accepts Float index # 33% of 344 observations in index => 113.52 th data ?? indexed_penguins = penguins.assign_left { [:index, indexes] } # #assign_left and assigner by Array is 0.2.0 feature indexed_penguins.slice(penguins.size * 0.33) ``` Indices in Vectors or Arrow::Arrays are also acceptable. Another way to select in `slice` is to use booleans. An alias for this feature is `filter`. - Booleans is an Array, Arrow::Array, Vector or their Array. - Each data type must be boolean. - Size of booleans must be same as the size of self. ```{ruby} #| tags: [] # make boolean Vector to check over 40 booleans = penguins[:bill_length_mm] > 40 ``` ```{ruby} #| tags: [] penguins.slice(booleans) ``` `slice` accepts a block. - We can't use both arguments and a block at a same time. - The block should return indeces in any length or a boolean Array with a same length as `size`. - Block is called in the context of self. So reciever 'self' can be omitted in the block. ```{ruby} #| tags: [] # return a DataFrame with bill_length_mm is in 2*std range around mean penguins.slice do min = bill_length_mm.mean - bill_length_mm.std max = bill_length_mm.mean + bill_length_mm.std bill_length_mm.to_a.map { |e| (min..max).include? e } end ``` ## 29. Slice and nil option `Arrow::Table#slice` uses `#filter` method with a option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row. ```{ruby} #| tags: [] hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] } table = Arrow::Table.new(hash) table.slice([true, false, nil]) ``` Whereas in RedAmber, `DataFrame#slice` with booleans containing nil is treated as false. This behavior comes from `Allow::FilterOptions.null_selection_behavior = :drop`. This is a default value for `Arrow::Table.filter` method. ```{ruby} #| tags: [] RedAmber::DataFrame.new(table).slice([true, false, nil]).table ``` ## 30. Remove Slice and reject rows (observations) to create a remainer DataFrame. `#remove(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer. ```{ruby} #| tags: [] # returns 6th to 339th obs. Remainer of penguins.slice(0...5, -5..-1) penguins.remove(0...5, -5..-1) ``` `remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `#size`. ```{ruby} #| tags: [] # remove all observation contains nil removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) } ``` `remove {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as size. Block is called in the context of self. ```{ruby} #| tags: [] # Remove data in 2*std range around mean penguins.remove do vector = self[:bill_length_mm] min = vector.mean - vector.std max = vector.mean + vector.std vector.to_a.map { |e| (min..max).include? e } end ``` ## 31. Remove and nil When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`. ```{ruby} #| tags: [] df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3]) ``` ```{ruby} #| tags: [] booleans = df[:a] < 2 ``` ```{ruby} #| tags: [] booleans_invert = booleans.to_a.map(&:!) ``` ```{ruby} #| tags: [] df.slice(booleans) == df.remove(booleans_invert) ``` Whereas `Vector#invert` returns nil for elements nil. This will bring different result. (See #26) ```{ruby} #| tags: [] booleans.invert ``` ```{ruby} #| tags: [] df.remove(booleans.invert) ``` We have `#primitive_invert` method in Vector. This method returns the same result as `.to_a.map(&:!)` above. ```{ruby} #| tags: [] booleans.primitive_invert ``` ```{ruby} #| tags: [] df.remove(booleans.primitive_invert) ``` ```{ruby} #| tags: [] df.slice(booleans) == df.remove(booleans.primitive_invert) ``` ## 32. Remove nil Remove any observations containing nil. ```{ruby} #| tags: [] penguins.remove_nil ``` The roundabout way for this is to use `#remove`. ```{ruby} #| tags: [] penguins.remove { vectors.map(&:is_nil).reduce(&:|) } ``` ## 33. Rename Rename keys (column names) to create a updated DataFrame. `#rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Array `[[existing_key, new_key], ...]` . ```{ruby} #| tags: [] h = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] } comecome = RedAmber::DataFrame.new(h) ``` ```{ruby} #| tags: [] comecome.rename(age: :age_in_1993) # comecome.rename(:age, :age_in_1993) # is also OK # comecome.rename([:age, :age_in_1993]) # is also OK ``` `#rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Array `[[existing_key, new_key], ...]`. Block is called in the context of self. Symbol key and String key are distinguished. ## 34. Assign Another example of `assign` is [68. Assign revised](#68.-Assign-revised), [#69. Variations of assign](#69.-Variations-of-assign) . Assign new or updated columns (variables) and create a updated DataFrame. - Columns with new keys will append new variables at right (bottom in TDR). - Columns with exisiting keys will update corresponding vectors. `#assign(key_pairs)` accepts pairs of key and array_like values as arguments. The pairs should be a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. ```{ruby} #| tags: [] comecome = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28] ) ``` ```{ruby} #| tags: [] # update :age and add :brother assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] } comecome.assign(assigner) ``` `#assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and array_like values as a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. Block is called in the context of self. ```{ruby} #| tags: [] df = RedAmber::DataFrame.new( index: [0, 1, 2, 3, nil], float: [0.0, 1.1, 2.2, Float::NAN, nil], string: ['A', 'B', 'C', 'D', nil]) ``` ```{ruby} #| tags: [] # update numeric variables df.assign do vectors.select(&:numeric?).map { |v| [v.key, -v] } end ``` In this example, columns :x and :y are updated. Column :x returns complements for #negate method because :x is :uint8 type. ```{ruby} #| tags: [] df.types ``` ## 35. Coerce in Vector Vector has coerce method. ```{ruby} #| tags: [] vector = RedAmber::Vector.new(1,2,3) ``` ```{ruby} #| tags: [] # Vector's `#*` method vector * -1 ``` ```{ruby} #| tags: [] # coerced calculation -1 * vector ``` ```{ruby} #| tags: [] # `@-` operator -vector ``` ## 36. Vector#to_ary `Vector#to_ary` will enable implicit conversion to an Array. ```{ruby} #| tags: [] Array(Vector.new([3, 4, 5])) ``` ```{ruby} #| tags: [] [1, 2] + Vector.new([3, 4, 5]) ``` ```{ruby} #| tags: [] [1, 2, Vector.new([3, 4, 5])].flatten ``` ## 37. Vector#fill_nil `Vector#fill_nil_forward` or `Vector#fill_nil_backward` will propagate the last valid observation forward (or backward). Or preserve nil if all previous values are nil or at the end. ```{ruby} #| tags: [] integer = Vector.new([0, 1, nil, 3, nil]) integer.fill_nil_forward ``` ```{ruby} #| tags: [] integer.fill_nil_backward ``` (Since 0.4.2) `Vector#fill_nil(value)` will fill `value` to `nil` in self. ```{ruby} #| tags: [] integer.fill_nil(-1) ``` If value has upper type, self will automatically upcasted. Int16 will casted into double in next example. ```{ruby} #| tags: [] integer.fill_nil(0.1) ``` ## 38. Vector#all?/any? `Vector#all?` returns true if all elements is true. `Vector#any?` returns true if exists any true. These are unary aggregation function. ```{ruby} #| tags: [] booleans = Vector.new([true, true, nil]) booleans.all? ``` ```{ruby} #| tags: [] booleans.any? ``` If these methods are used with option `skip_nulls: false` nil is considered. ```{ruby} #| tags: [] booleans.all?(skip_nulls: false) ``` ```{ruby} #| tags: [] booleans.any?(skip_nulls: false) ``` ## 39. Vector#count/count_uniq `Vector#count` counts element. `Vector#count_uniq` counts unique element. `#count_distinct` is an alias (Arrow's name). These are unary aggregation function. ```{ruby} #| tags: [] string = Vector.new(%w[A B A]) string.count ``` ```{ruby} #| tags: [] string.count_uniq # count_distinct is also OK ``` ## 40. Vector#stddev/variance These are unary element-wise function. For biased standard deviation; ```{ruby} #| tags: [] integers = Vector.new([1, 2, 3, nil]) integers.stddev ``` For unbiased standard deviation; ```{ruby} #| tags: [] integers.sd ``` For biased variance; ```{ruby} #| tags: [] integers.variance ``` For unbiased variance; ```{ruby} #| tags: [] integers.var ``` ## 41. Vector#negate These are unary element-wise function. ```{ruby} #| tags: [] double = Vector.new([1.0, -2, 3]) double.negate ``` Same as #negate; ```{ruby} #| tags: [] -double ``` ## 42. Vector#round Otions for `#round`; - `:n-digits` The number of digits to show. - `round_mode` Specify rounding mode. This is a unary element-wise function. ```{ruby} #| tags: [] double = RedAmber::Vector.new([15.15, 2.5, 3.5, -4.5, -5.5]) ``` ```{ruby} #| tags: [] double.round ``` ```{ruby} #| tags: [] double.round(mode: :half_to_even) ``` ```{ruby} #| tags: [] double.round(mode: :towards_infinity) ``` ```{ruby} #| tags: [] double.round(mode: :half_up) ``` ```{ruby} #| tags: [] double.round(mode: :half_towards_zero) ``` ```{ruby} #| tags: [] double.round(mode: :half_towards_infinity) ``` ```{ruby} #| tags: [] double.round(mode: :half_to_odd) ``` ```{ruby} #| tags: [] double.round(n_digits: 0) ``` ```{ruby} #| tags: [] double.round(n_digits: 1) ``` ```{ruby} #| tags: [] double.round(n_digits: -1) ``` ## 43. Vector#and/or RedAmber select `and_kleene`/`or_kleene` as default `&`/`|` method. These are unary element-wise function. ```{ruby} #| tags: [] bool_self = Vector.new([true, true, true, false, false, false, nil, nil, nil]) bool_other = Vector.new([true, false, nil, true, false, nil, true, false, nil]) bool_self & bool_other # same as bool_self.and_kleene(bool_other) ``` ```{ruby} #| tags: [] # Ruby's primitive `&&` bool_self && bool_other ``` ```{ruby} #| tags: [] # Arrow's default `and` bool_self.and_org(bool_other) ``` ```{ruby} #| tags: [] bool_self | bool_other # same as bool_self.or_kleene(bool_other) ``` ```{ruby} #| tags: [] # Ruby's primitive `||` bool_self || bool_other ``` ```{ruby} #| tags: [] # Arrow's default `or` bool_self.or_org(bool_other) ``` ## 44. Vector#is_finite/is_nan/is_nil/is_na These are unary element-wise function. ```{ruby} #| tags: [] double = Vector.new([Math::PI, Float::INFINITY, -Float::INFINITY, Float::NAN, nil]) ``` ```{ruby} #| tags: [] double.is_finite ``` ```{ruby} #| tags: [] double.is_inf ``` ```{ruby} #| tags: [] double.is_na ``` ```{ruby} #| tags: [] double.is_nil ``` ```{ruby} #| tags: [] double.is_valid ``` ## 45. Prime-th rows ```{ruby} #| tags: [] # prime-th rows ... Don't ask me what it means. require 'prime' penguins.assign_left(:index, penguins.indices + 1) # since 0.2.0 .slice { Vector.new(Prime.each(size).to_a) - 1 } ``` ## 46. Slice by Enumerator Slice accepts Enumerator. ```{ruby} #| tags: [] # Select every 10 samples penguins.assign_left(index: penguins.indices) # 0.2.0 feature .slice(0.step(by: 10, to: 340)) ``` ```{ruby} #| tags: [] # Select every 2 samples by step 100 penguins.assign_left(index: penguins.indices) # 0.2.0 feature .slice { 0.step(by: 100, to: 300).map { |i| i..(i+1) } } ``` ## 47. Output mode Output mode of `DataFrame#inspect` and `DataFrame#to_iruby` is Table mode by default. If you prefer other mode set the environment variable `RED_AMBER_OUTPUT_MODE` . ```{ruby} #| tags: [] ENV['RED_AMBER_OUTPUT_MODE'] = 'Table' # or nil (default) penguins # Almost same as `puts penguins.to_s` in any mode ``` ```{ruby} #| tags: [] penguins[:species] ``` ```{ruby} #| tags: [] ENV['RED_AMBER_OUTPUT_MODE'] = 'Plain' # Since 0.2.2 penguins ``` ```{ruby} #| tags: [] penguins[:species] ``` ```{ruby} #| tags: [] ENV['RED_AMBER_OUTPUT_MODE'] = 'Minimum' # Since 0.2.2 penguins ``` ```{ruby} #| tags: [] penguins[:species] ``` ```{ruby} #| tags: [] ENV['RED_AMBER_OUTPUT_MODE'] = 'TDR' penguins ``` ```{ruby} #| tags: [] penguins[:species] ``` ```{ruby} #| tags: [] ENV['RED_AMBER_OUTPUT_MODE'] = nil ``` ## 48. Empty key Empty key `:""` will be automatically renamed to `:unnamed1`. If `:unnamed1` was used, `:unnamed1.succ` will be used. (Since 0.1.8) ```{ruby} #| tags: [] df = DataFrame.new("": [1, 2], unnamed1: [3, 4]) ``` ## 49. Grouping `DataFrame#group` takes group_keys as arguments, and creates `Group` class. Group class inspects counts of each unique elements. (Since 0.1.7) ```{ruby} #| tags: [] group = penguins.group(:species) ``` The instance of `Group` class has methods to summary functions. It returns `function(key)` style summarized columns as a result. ```{ruby} #| tags: [] group.count ``` If count result is same in multiple columns, count column is aggregated to one column `:count`. ```{ruby} #| tags: [] penguins.pick(:species, :bill_length_mm, :bill_depth_mm).group(:species).count ``` Grouping key comes first (leftmost) in the columns. ## 50. Grouping with a block `DataFrame#group` takes a block and we can specify multiple functions. Inside the block is the context of instance of Group. So we can use summary functions without the reciever. (Since 0.1.8) ```{ruby} #| tags: [] penguins.group(:species) { [count(:species), mean(:body_mass_g)] } ``` `Group#summarize` accepts same block as `DataFrame#group`. ```{ruby} #| tags: [] group.summarize { [count(:species), mean] } ``` ## 51. Group#count family `Group#count` counts the number of non-nil values in each group. If counts are the same (and do not include NaN or nil), columns for counts are unified. ```{ruby} dataframe = DataFrame.new( x: [*1..6], y: %w[A A B B B C], z: [false, true, false, nil, true, false]) ``` Non-nil counts in column y and z are different. ```{ruby} dataframe.group(:y).count ``` Non-nil counts in column x and y are same, so only one column is emitted. ```{ruby} dataframe.group(:z).count ``` `Group#count_all` returns each record group size as a DataFrame. `Group#group_count` is an alias. ```{ruby} dataframe.group(:y).count_all ``` `Group#count_uniq` count the unique values in each group and return as a DataFrame. `Group#count_distinct` is an alias. ```{ruby} dataframe.group(:y).count_uniq ``` ## 52. Group#one `Group#one` gets one value from each group. ```{ruby} dataframe.group(:y).one ``` ## 53. Group aggregation functions `Group#all` emits aggragated booleans Whether all elements in each group evaluate to true. ```{ruby} dataframe.group(:y).all ``` `Group#any` emits aggragated booleans Whether any elements in each group evaluate to true. ```{ruby} dataframe.group(:y).any ``` `Group#max` computes maximum of values in each group for numeric columns. ```{ruby} dataframe.group(:y).max ``` `Group#mean` computes mean of values in each group for numeric columns. ```{ruby} dataframe.group(:y).mean ``` `Group#median` computes median of values in each group for numeric columns. ```{ruby} dataframe.group(:y).median ``` `Group#min` computes minimum of values in each group for numeric columns. ```{ruby} dataframe.group(:y).min ``` `Group#product` computes product of values in each group for numeric columns. ```{ruby} dataframe.group(:y).product ``` `Group#stddev` computes standrad deviation of values in each group for numeric columns. ```{ruby} dataframe.group(:y).stddev ``` `Group#sum` computes sum of values in each group for numeric columns. ```{ruby} dataframe.group(:y).sum ``` `Group#variance` computes variance of values in each group for numeric columns. ```{ruby} dataframe.group(:y).variance ``` ## 54. Group#grouped_frame `Group#grouped_frame` returns grouped DataFrame only for group keys. The alias is `#none` ```{ruby} dataframe.group(:y).grouped_frame ``` ## 55. Vector#shift `Vector#shift(amount = 1, fill: nil)` Shift vector's values by specified `amount`. Shifted space is filled by value `fill`. (Since 0.1.8) ```{ruby} #| tags: [] vector = RedAmber::Vector.new([1, 2, 3, 4, 5]) vector.shift ``` ```{ruby} #| tags: [] vector.shift(-2) ``` ```{ruby} #| tags: [] vector.shift(fill: Float::NAN) ``` ## 56. From the Pandas cookbook - if-then https://pandas.pydata.org/docs/user_guide/cookbook.html#if-then ```python # by Python Pandas df = pd.DataFrame( {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ) df.loc[df.AAA >= 5, "BBB"] = -1 # returns => AAA BBB CCC 0 4 10 100 1 5 -1 50 2 6 -1 -30 3 7 -1 -50 ``` ```{ruby} #| tags: [] # RedAmber df = DataFrame.new( "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50] # You can omit {} ) df.assign(BBB: df[:BBB].replace(df[:AAA] >= 5, -1)) ``` If you want to replace both :BBB and :CCC ; ```{ruby} #| tags: [] df.assign do replacer = v(:AAA) >= 5 # Boolean Vector { BBB: v(:BBB).replace(replacer, -1), CCC: v(:CCC).replace(replacer, -2) } end ``` ## 57. From the Pandas cookbook - Splitting Split a frame with a boolean criterion https://pandas.pydata.org/docs/user_guide/cookbook.html#splitting ```python # by Python Pandas df = pd.DataFrame( {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ) df[df.AAA <= 5] # returns => AAA BBB CCC 0 4 10 100 1 5 20 50 df[df.AAA > 5] # returns => AAA BBB CCC 2 6 30 -30 3 7 40 -50 ``` ```{ruby} #| tags: [] # RedAmber df = DataFrame.new( # You can omit outer {} "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50] ) df.slice(df[:AAA] <= 5) # df[df[:AAA] <= 5] # is also OK ``` ```{ruby} #| tags: [] df.remove(df[:AAA] <= 5) # df.slice(df[:AAA] > 5) # do the same thing ``` ## 58. From the Pandas cookbook - Building criteria Split a frame with a boolean criterion https://pandas.pydata.org/docs/user_guide/cookbook.html#building-criteria ```python # by Python Pandas df = pd.DataFrame( {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ) # and df.loc[(df["BBB"] < 25) & (df["CCC"] >= -40), "AAA"] # returns a series => 0 4 1 5 Name: AAA, dtype: int64 # or df.loc[(df["BBB"] > 25) | (df["CCC"] >= -40), "AAA"] # returns a series => 0 4 1 5 2 6 3 7 Name: AAA, dtype: int64 ``` ```{ruby} #| tags: [] # RedAmber df = DataFrame.new( # You can omit {} "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50] ) df.slice( (df[:BBB] < 25) & (df[:CCC] >= 40) ).pick(:AAA) ``` ```{ruby} #| tags: [] df.slice( (df[:BBB] > 25) | (df[:CCC] >= 40) ).pick(:AAA) # df[ (df[:BBB] > 25) | (df[:CCC] >= 40) ][:AAA)] # also OK ``` ```python # by Python Pandas # or (with assignment) df.loc[(df["BBB"] > 25) | (df["CCC"] >= 75), "AAA"] = 0.1 df # returns a dataframe => AAA BBB CCC 0 0.1 10 100 1 5.0 20 50 2 0.1 30 -30 3 0.1 40 -50 ``` ```{ruby} #| tags: [] # df.assign(AAA: df[:AAA].replace((df[:BBB] > 25) | (df[:CCC] >= 75), 0.1)) # by one liner booleans = (df[:BBB] > 25) | (df[:CCC] >= 75) replaced = df[:AAA].replace(booleans, 0.1) df.assign(AAA: replaced) ``` ```python # by Python Pandas # Select rows with data closest to certain value using argsort df = pd.DataFrame( {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ) aValue = 43.0 df.loc[(df.CCC - aValue).abs().argsort()] # returns a dataframe => AAA BBB CCC 1 5 20 50 0 4 10 100 2 6 30 -30 3 7 40 -50 ``` ```{ruby} #| tags: [] a_value = 43 df[(df[:CCC] - a_value).abs.sort_indexes] # df.slice (df[:CCC] - a_value).abs.sort_indexes # also OK ``` ```python # by Python Pandas # Dynamically reduce a list of criteria using a binary operators df = pd.DataFrame( {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ) Crit1 = df.AAA <= 5.5 Crit2 = df.BBB == 10.0 Crit3 = df.CCC > -40.0 AllCrit = Crit1 & Crit2 & Crit3 import functools CritList = [Crit1, Crit2, Crit3] AllCrit = functools.reduce(lambda x, y: x & y, CritList) df[AllCrit] # returns a dataframe => AAA BBB CCC 0 4 10 100 ``` ```{ruby} #| tags: [] crit1 = df[:AAA] <= 5.5 crit2 = df[:BBB] == 10.0 crit3 = df[:CCC] >= -40.0 df[crit1 & crit2 & crit3] ``` ## 59. From the Pandas cookbook - Dataframes https://pandas.pydata.org/docs/user_guide/cookbook.html#dataframes ```python # by Python Pandas # Using both row labels and value conditionals df = pd.DataFrame( {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]} ) df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))] # returns => AAA BBB CCC 0 4 10 100 2 6 30 -30 ``` ```{ruby} #| tags: [] # RedAmber df = DataFrame.new( "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50] ) df[(df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }] ``` ```python # by Python Pandas # Use loc for label-oriented slicing and iloc positional slicing GH2904 df = pd.DataFrame( {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}, index=["foo", "bar", "boo", "kar"], ) # There are 2 explicit slicing methods, with a third general case # 1. Positional-oriented (Python slicing style : exclusive of end) # 2. Label-oriented (Non-Python slicing style : inclusive of end) # 3. General (Either slicing style : depends on if the slice contains labels or positions) df.loc["bar":"kar"] # Label # returns => AAA BBB CCC bar 5 20 50 boo 6 30 -30 kar 7 40 -50 # Generic df[0:3] # returns => AAA BBB CCC foo 4 10 100 bar 5 20 50 boo 6 30 -30 df["bar":"kar"] # returns => AAA BBB CCC bar 5 20 50 boo 6 30 -30 kar 7 40 -50 ``` ```{ruby} #| tags: [] # RedAmber does not have row index. Use a new column as indexes. labeled = df.assign_left(index: %w[foo bar boo kar]) # labeled = df.assign(index: %w[foo bar boo kar]).pick { [keys[-1], keys[0...-1]] } # until v0.1.8 ``` ```{ruby} #| tags: [] labeled[1..3] ``` ```{ruby} #| tags: [] labeled.slice do v = v(:index) v.index("bar")..v.index("kar") end ``` `slice_by` returns the same result as above. (Since 0.2.1) ```{ruby} #| tags: [] labeled.slice_by(:index, keep_key: true) { "bar".."kar"} ``` ```python # by Python Pandas # Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment. df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4]) # Note index starts at 1. df2.iloc[1:3] # Position-oriented # returns => AAA BBB CCC 2 5 20 50 3 6 30 -30 df2.loc[1:3] # Label-oriented # returns => AAA BBB CCC 1 4 10 100 2 5 20 50 3 6 30 -30 ``` ```{ruby} #| tags: [] # RedAmber only have an implicit integer index 0...size, # does not happen any ambiguity unless you create a new column and use it for indexes :-). ``` ```python # by Python Pandas # Using inverse operator (~) to take the complement of a mask df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))] # returns => AAA BBB CCC 1 5 20 50 3 7 40 -50 ``` ```{ruby} #| tags: [] # RedAmber offers #! method for boolean Vector. df[!((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i })] # or # df[((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }).invert] ``` If you have `nil` in your data, consider #primitive_invert for consistent result. See example #26. ## 60. From the Pandas cookbook - New columns https://pandas.pydata.org/docs/user_guide/cookbook.html#new-columns ```python # by Python Pandas # Efficiently and dynamically creating new columns using applymap df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]}) df # returns => AAA BBB CCC 0 1 1 2 1 2 1 1 2 1 2 3 3 3 2 1 source_cols = df.columns # Or some subset would work too new_cols = [str(x) + "_cat" for x in source_cols] categories = {1: "Alpha", 2: "Beta", 3: "Charlie"} df[new_cols] = df[source_cols].applymap(categories.get) df # returns => AAA BBB CCC AAA_cat BBB_cat CCC_cat 0 1 1 2 Alpha Alpha Beta 1 2 1 1 Beta Alpha Alpha 2 1 2 3 Alpha Beta Charlie 3 3 2 1 Charlie Beta Alpha ``` ```{ruby} #| tags: [] # RedAmber df = DataFrame.new({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]}) ``` ```{ruby} #| tags: [] categories = {1 => "Alpha", 2 => "Beta", 3 => "Charlie"} # Creating a Hash from keys df.assign do keys.each_with_object({}) do |key, h| h["#{key}_cat"] = v(key).to_a.map { |x| categories[x] } end end # Creating an Array from vectors, from v0.2.0 df.assign do vectors.map do |v| ["#{v.key}_cat", v.to_a.map { |x| categories[x] } ] end end ``` ```python # by Python Pandas # Keep other columns when using min() with groupby df = pd.DataFrame( {"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]} ) df # returns => AAA BBB 0 1 2 1 1 1 2 1 3 3 2 4 4 2 5 5 2 1 6 3 2 7 3 3 # Method 1 : idxmin() to get the index of the minimums df.loc[df.groupby("AAA")["BBB"].idxmin()] # returns => AAA BBB 1 1 1 5 2 1 6 3 2 # Method 2 : sort then take first of each df.sort_values(by="BBB").groupby("AAA", as_index=False).first() # returns => AAA BBB 0 1 1 1 2 1 2 3 2 # Notice the same results, with the exception of the index. ``` ```{ruby} #| tags: [] # RedAmber df = DataFrame.new(AAA: [1, 1, 1, 2, 2, 2, 3, 3], BBB: [2, 1, 3, 4, 5, 1, 2, 3]) ``` ```{ruby} #| tags: [] df.group(:AAA).min # Add `.rename { [keys[-1], :BBB] }` if you want. ``` ## 61. Summary/describe ```{ruby} #| tags: [] penguins.summary # or penguins.describe ``` If you need a variables in row, use `transpose`. (Since 0.2.0) ```{ruby} #| tags: [] penguins.summary.transpose(name: :stats) ``` ## 62. Quantile/Quantiles `Vector#quantile(prob)` returns quantile at probability `prob`. (Since 0.2.0) ```{ruby} #| tags: [] penguins[:bill_depth_mm].quantile # default is prob = 0.5 ``` `Vector#quantiles` accepts an Array for multiple quantiles. Returns a DataFrame. ```{ruby} #| tags: [] penguins[:bill_depth_mm].quantiles([0.05, 0.95]) ``` ## 63. Transpose `DataFrame#transpose` creates transposed DataFrame for wide type dataframe. (Since 0.2.0) ```{ruby} #| tags: [] uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv") import_cars = RedAmber::DataFrame.load(uri) ``` ```{ruby} #| tags: [] import_cars.transpose ``` Default name of created column is `:NAME`. We can name the column from the keys in original by the option `name:`. ```{ruby} #| tags: [] import_cars.transpose(key: :Year, name: :Manufacturer) ``` You can specify index column by option `:key` even if it is in the middle of the original DataFrame. ```{ruby} #| tags: [] # locate `:Year` in the middle df = import_cars.pick(1..2, 0, 3..) ``` ```{ruby} #| tags: [] df.transpose(key: :Year) ``` ## 64. To_long `DataFrame#to_long(*keep_keys)` reshapes wide DataFrame to the long DataFrame. - Parameter `keep_keys` specifies the key names to keep. (Since 0.2.0) ```{ruby} #| tags: [] uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv") import_cars = RedAmber::DataFrame.load(uri) ``` ```{ruby} #| tags: [] import_cars.to_long(:Year) ``` - Option `:name` specify the key of the column which is come **from key names**. Default is `:NAME`. - Option `:value` specify the key of the column which is come **from values**. Default is `:VALUE`. ```{ruby} #| tags: [] import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported) ``` ## 65. To_wide `DataFrame#to_wide(*keep_keys)` reshapes long DataFrame to a wide DataFrame. - Option `:name` specify the key of the column which will be expanded **to key name**. Default is `:NAME`. - Option `:value` specify the key of the column which will be expanded **to values**. Default is `:VALUE`. (Since 0.2.0) ```{ruby} #| tags: [] import_cars.to_long(:Year).to_wide ``` ```{ruby} #| tags: [] import_cars.to_long(:Year).to_wide(name: :NAME, value: :VALUE) # is also OK ``` ## 66. Custom index Another example of `indices` is [14. Indices](#14.-Indices). We can set the start of indices by the option. (Since 0.2.1) ```{ruby} #| tags: [] df = DataFrame.new(x: [0, 1, 2, 3, 4]) df.indices ``` ```{ruby} #| tags: [] df.indices(1) ``` You can put the first value which accepts `#succ` method. ```{ruby} #| tags: [] df.indices("a") ``` ## 67. Method missing `RedAmber::DataFrame` has `#method_missing` to enable to call key names as methods. This feature is limited to what can be called as a method (`:key` is OK, not allowed for the keys `:Key`, `:"key.1"`, `:"1key"`, etc. ). But it will be convenient in many cases. (Since 0.2.1) ```{ruby} #| tags: [] df = DataFrame.new(x: [1, 2, 3]) df.x.sum ``` ```{ruby} #| tags: [] # Some ways to pull a Vector df[:x] # Formal style df.v(:x) # #v method df.x # method ``` ```{ruby} #| tags: [] df.x.sum ``` ## 68. Assign revised Another example of `assign` is [#34. Assign](#34.-Assign), [#69. Variations of assign](#69.-Variations-of-assign) . ```{ruby} #| tags: [] df = DataFrame.new(x: [1, 2, 3]) # Assign by a Hash df.assign(y: df.x / 10.0) ``` ```{ruby} #| tags: [] # Assign by separated key and value df.assign(:y) { x / 10.0 } ``` ```{ruby} #| tags: [] # Separated keys and values df.assign(:y, :z) { [x * 10, x / 10.0] } ``` ## 69. Variations of assign Another example of `assign` is [#34. Assign](#34.-Assign), [#68. Assign revised](#68.-Assign-revised) . ```{ruby} #| tags: [] df = DataFrame.new(x: [1, 2, 3]) ``` ```{ruby} #| tags: [] # Hash args df.assign(y: df[:x] * 10, z: df[:x] / 10.0) # Hash hash = {y: df[:x] * 10, z: df[:x] / 10.0} df.assign(hash) # Array array = [[:y, df[:x] * 10], [:z, df[:x] / 10.0]] df.assign(array) # Array df.assign [ [:y, df[:x] * 10], [:z, df[:x] / 10.0] ] # Hash df.assign({ y: df[:x] * 10, z: df[:x] / 10.0 }) # Block, Hash df.assign { {y: df[:x] * 10, z: df[:x] / 10.0} } # Block, Array df.assign { [[:y, df[:x] * 10], [:z, df[:x] / 10.0]] } # Block, Array, method #df.assign { [:y, x * 10], [:z, x / 10.0]] } # Separated #df.assign(:y, :z) { [x * 10, x / 10.0] } ``` ## 70. Row index label by slice_by Another example of `slice` is [#28. Slice](#28.-Slice). (Since 0.2.1) ```{ruby} #| tags: [] df = DataFrame.new(num: [1.1, 2.2, 3.3, 4.4, 5.5]) .assign_left(:label) { indices("a") } ``` `slice_by(key) { row_selector }` selects rows in column `key` with `row_selector`. ```{ruby} #| tags: [] df.slice_by(:label) { "b".."d" } ``` ```{ruby} #| tags: [] df.slice_by(:label) { ["c", "b", "e"] } ``` If the option `keep_key:` set to `true`, index label column is preserved. ```{ruby} #| tags: [] df.slice_by(:label, keep_key: true) { "b".."d" } ``` ## 71. Simpson's paradox in COVID-19 data https://www.rdocumentation.org/packages/openintro/versions/2.3.0/topics/simpsons_paradox_covid ```{ruby} #| tags: [] require 'datasets-arrow' ds = Datasets::Rdatasets.new('openintro', 'simpsons_paradox_covid') df = RedAmber::DataFrame.new(ds.to_arrow) ``` Create group and count by vaccine status and outcome. ```{ruby} #| tags: [] count = df.group(:vaccine_status, :outcome).count ``` Reshape to human readable wide table. ```{ruby} #| tags: [] all_count = count.to_wide(name: :vaccine_status, value: :count) ``` Compute death or survived ratio for vaccine status. ```{ruby} #| tags: [] all_count.assign do { "vaccinated_%": 100.0 * vaccinated / vaccinated.sum, "unvaccinated_%": 100.0 * unvaccinated / unvaccinated.sum } end ``` Death ratio for vaccinated is higher than unvaccinated. Is it true? Next, do the same thing above for each age group. Temporally create methods. ```{ruby} #| tags: [] def make_covid_table(df) df.group(:vaccine_status, :outcome) .count .to_wide(name: :vaccine_status, value: :count) .assign do { "vaccinated_%": (100.0 * vaccinated / vaccinated.sum).round(n_digits: 3), "unvaccinated_%": (100.0 * unvaccinated / unvaccinated.sum).round(n_digits: 3) } end end ``` ```{ruby} #| tags: [] # under 50 make_covid_table(df[df[:age_group] == "under 50"]) ``` ```{ruby} #| tags: [] # 50 + make_covid_table(df[df[:age_group] == "50 +"]) ``` Death ratio for vaccinated is lower than unvaccinated for grouped subset by age. This is an exaple of "Simpson's paradox" . ```{ruby} #| tags: [] # Vaccine status vs age # 50+ is highly vaccinated. df.group(:vaccine_status, :age_group).count.to_wide(name: :age_group, value: :count) ``` ```{ruby} #| tags: [] # Outcome vs age # 50+ also has higher death rate. df.group(:outcome, :age_group).count.to_wide(name: :age_group, value: :count) ``` ## 72. Clean up dirty data ```{ruby} #| tags: [] file = Tempfile.open(['dirty_data', '.csv']) do |f| f.puts(<<~CSV) height,weight 154.9,52.2 156.8cm,51.1kg 152,49 148.5cm,45.4kg 155cm, ,49.9kg 1.58m,49.8kg 166.8cm,53.6kg CSV f end df = DataFrame.load(file) ``` It was loaded as String Vectors. ```{ruby} #| tags: [] df.schema ``` First for the `:weight` column. Replacing "" to NaN causes casting to Float. ```{ruby} #| tags: [] df.assign do { weight: weight.replace(weight == "", Float::NAN) } end ``` Apply same conversion for `:height` followed by unit conversion by `if_else`. ```{ruby} #| tags: [] df = df.assign do { weight: weight.replace(weight == '', Float::NAN), height: height.replace(height == '', Float::NAN) .then { |h| (h < 10).if_else(h * 100, h) } } end puts df.schema df ``` We got clean data, then compute BMI as a new column. ```{ruby} #| tags: [] df.assign(:BMI) { (weight / height ** 2 * 10000).round(n_digits: 1) } ``` ## 73. From the Pandas cookbook - Multiindexing (Updated on v0.3.0) https://pandas.pydata.org/docs/user_guide/cookbook.html#multiindexing ```python # by Python Pandas # Efficiently and dynamically creating new columns using applymap df = pd.DataFrame( { "row": [0, 1, 2], "One_X": [1.1, 1.1, 1.1], "One_Y": [1.2, 1.2, 1.2], "Two_X": [1.11, 1.11, 1.11], "Two_Y": [1.22, 1.22, 1.22], } ) df # => row One_X One_Y Two_X Two_Y 0 0 1.1 1.2 1.11 1.22 1 1 1.1 1.2 1.11 1.22 2 2 1.1 1.2 1.11 1.22 # As Labelled Index df = df.set_index("row") df # => One_X One_Y Two_X Two_Y row 0 1.1 1.2 1.11 1.22 1 1.1 1.2 1.11 1.22 2 1.1 1.2 1.11 1.22 # With Hierarchical Columns df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns]) df # => One Two X Y X Y row 0 1.1 1.2 1.11 1.22 1 1.1 1.2 1.11 1.22 2 1.1 1.2 1.11 1.22 # Now stack & Reset df = df.stack(0).reset_index(1) df # => level_1 X Y row 0 One 1.10 1.20 0 Two 1.11 1.22 1 One 1.10 1.20 1 Two 1.11 1.22 2 One 1.10 1.20 2 Two 1.11 1.22 # And fix the labels (Notice the label 'level_1' got added automatically) df.columns = ["Sample", "All_X", "All_Y"] df # => Sample All_X All_Y row 0 One 1.10 1.20 0 Two 1.11 1.22 1 One 1.10 1.20 1 Two 1.11 1.22 2 One 1.10 1.20 2 Two 1.11 1.22 ``` (Until 0.2.3) This is an example before `Vector#split_*` has introduced. See [88. Vector#split_columns](#88.-Vector#split_to_columns) . ```{ruby} #| tags: [] df = RedAmber::DataFrame.new( "row": [0, 1, 2], "One_X": [1.1, 1.1, 1.1], "One_Y": [1.2, 1.2, 1.2], "Two_X": [1.11, 1.11, 1.11], "Two_Y": [1.22, 1.22, 1.22], ) ``` ```{ruby} #| tags: [] df_x = df.pick(:row, :One_X, :Two_X) .to_long(:row, name: :Sample, value: :All_X) ``` ```{ruby} #| tags: [] df_y = df.pick(:row, :One_Y, :Two_Y) .to_long(:row, name: :Sample, value: :All_Y) ``` ```{ruby} #| tags: [] df_x.pick(:row) .assign [ [:Sample, df_x[:Sample].each.map { |x| x.split("_").first }], [:All_X, df_x[:All_X]], [:All_Y, df_y[:All_Y]] ] ``` (Since 0.3.0) This example will use `Vector#split_to_columns`. ```{ruby} #| tags: [] df = RedAmber::DataFrame.new( "row": [0, 1, 2], "One_X": [1.1, 1.1, 1.1], "One_Y": [1.2, 1.2, 1.2], "Two_X": [1.11, 1.11, 1.11], "Two_Y": [1.22, 1.22, 1.22], ) ``` ```{ruby} #| tags: [] df.to_long(:row) ``` `Vector#split_to_colums` returns two splitted Vectors. ```{ruby} #| tags: [] df.to_long(:row, name: :Sample) .assign(:Sample, :xy) { v(:Sample).split_to_columns('_') } ``` ```{ruby} #| tags: [] df.to_long(:row, name: :Sample) .assign(:Sample, :xy) { v(:Sample).split_to_columns('_') } .to_wide(name: :xy, value: :VALUE) ``` ## 74. From the Pandas cookbook - Arithmetic https://pandas.pydata.org/docs/user_guide/cookbook.html#arithmetic ```python # by Python Pandas cols = pd.MultiIndex.from_tuples( [(x, y) for x in ["A", "B", "C"] for y in ["O", "I"]] ) df = pd.DataFrame(np.random.randn(2, 6), index=["n", "m"], columns=cols) df # => A B C O I O I O I n 0.469112 -0.282863 -1.509059 -1.135632 1.212112 -0.173215 m 0.119209 -1.044236 -0.861849 -2.104569 -0.494929 1.071804 df = df.div(df["C"], level=1) df # => A B C O I O I O I n 0.387021 1.633022 -1.244983 6.556214 1.0 1.0 m -0.240860 -0.974279 1.741358 -1.963577 1.0 1.0 ``` This is a tentative example. This work may be refined by the coming feature which treats multiple key header easily. ```{ruby} #| tags: [] require "arrow-numo-narray" values = Numo::DFloat.new(6, 2).rand_norm ``` For consistency with the pandas result, we will use same data of them. ```{ruby} #| tags: [] values = [ [0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215], [0.119209, -1.044236, -0.861849, -2.104569, -0.494929, 1.071804] ].transpose ``` ```{ruby} #| tags: [] keys = %w[A B C].product(%w[O I]).map(&:join) ``` ```{ruby} #| tags: [] df = RedAmber::DataFrame.new(index: %w[n m]) .assign(*keys) { values } ``` ```{ruby} #| tags: [] df.assign do assigner = {} %w[A B C].each do |abc| %w[O I].each do |oi| key = "#{abc}#{oi}".to_sym assigner[key] = v(key) / v("C#{oi}".to_sym) end end assigner end ``` ```{ruby} #| tags: [] coords = [["AA", "one"], ["AA", "six"], ["BB", "one"], ["BB", "two"], ["BB", "six"]].transpose df = RedAmber::DataFrame.new(MyData: [11, 22, 33, 44, 55]) .assign_left(:label1, :label2) { coords } ``` ## 75. From the Pandas cookbook - Slicing https://pandas.pydata.org/docs/user_guide/cookbook.html#slicing ```python # by Python Pandas coords = [("AA", "one"), ("AA", "six"), ("BB", "one"), ("BB", "two"), ("BB", "six")] index = pd.MultiIndex.from_tuples(coords) df = pd.DataFrame([11, 22, 33, 44, 55], index, ["MyData"]) df # => MyData AA one 11 six 22 BB one 33 two 44 six 55 ``` To take the cross section of the 1st level and 1st axis the index: ```python # by Python Pandas # Note : level and axis are optional, and default to zero df.xs("BB", level=0, axis=0) # => MyData one 33 two 44 six 55 ``` ```{ruby} #| tags: [] df.slice { label1 == "BB" }.drop(:label1) ``` …and now the 2nd level of the 1st axis. ```python # by Python Pandas df.xs("six", level=1, axis=0) # => MyData AA 22 BB 55 ``` ```{ruby} #| tags: [] df.slice { label2 == "six" }.drop(:label2) ``` ```python # by Python Pandas import itertools index = list(itertools.product(["Ada", "Quinn", "Violet"], ["Comp", "Math", "Sci"])) headr = list(itertools.product(["Exams", "Labs"], ["I", "II"])) indx = pd.MultiIndex.from_tuples(index, names=["Student", "Course"]) cols = pd.MultiIndex.from_tuples(headr) # Notice these are un-named data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)] df = pd.DataFrame(data, indx, cols) df # => Exams Labs I II I II Student Course Ada Comp 70 71 72 73 Math 71 73 75 74 Sci 72 75 75 75 Quinn Comp 73 74 75 76 Math 74 76 78 77 Sci 75 78 78 78 Violet Comp 76 77 78 79 Math 77 79 81 80 Sci 78 81 81 81 ``` ```{ruby} #| tags: [] indexes = %w[Ada Quinn Violet].product(%w[Comp Math Sci]).transpose df = RedAmber::DataFrame.new(%w[Student Course].zip(indexes)) .assign do assigner = {} keys = %w[Exams Labs].product(%w[I II]).map { |a| a.join("/") } keys.each.with_index do |key, x| assigner[key] = (0...9).map { |y| 70 + x + y + (x * y) % 3 } end assigner end ``` ```python # by Python Pandas All = slice(None) df.loc["Violet"] # => Exams Labs I II I II Course Comp 76 77 78 79 Math 77 79 81 80 Sci 78 81 81 81 ``` ```{ruby} #| tags: [] df.slice(df[:Student] == "Violet").drop(:Student) ``` ```python # by Python Pandas df.loc[(All, "Math"), All] # => Exams Labs I II I II Student Course Ada Math 71 73 75 74 Quinn Math 74 76 78 77 Violet Math 77 79 81 80 ``` ```{ruby} #| tags: [] df.slice(df[:Course] == "Math") ``` ```python # by Python Pandas df.loc[(slice("Ada", "Quinn"), "Math"), All] # => Exams Labs I II I II Student Course Ada Math 71 73 75 74 Quinn Math 74 76 78 77 ``` ```{ruby} #| tags: [] df.slice(df[:Course] == "Math") .slice { (v(:Student) == "Ada") | (v(:Student) == "Quinn") } ``` ```python # by Python Pandas df.loc[(All, "Math"), ("Exams")] # => I II Student Course Ada Math 71 73 Quinn Math 74 76 Violet Math 77 79 ``` ```{ruby} #| tags: [] df.slice(df[:Course] == "Math") .pick { [:Student, :Course].concat keys.select { |key| key.to_s.start_with?("Exams") } } ``` ```python # by Python Pandas df.loc[(All, "Math"), (All, "II")] # => Exams Labs II II Student Course Ada Math 73 74 Quinn Math 76 77 Violet Math 79 80 ``` ```{ruby} #| tags: [] df.slice(df[:Course] == "Math") .pick { [:Student, :Course].concat keys.select { |key| key.to_s.end_with?("II") } } ``` ## 76. Vector#map `Vector#map` method accepts a block and return yielded results from the block in a Vector. ```{ruby} #| tags: [] v = Vector.new(1, 2, 3, 4) v.map { |x| x / 100.0 } ``` If no block is given, return a Enumerator. ```{ruby} #| tags: [] v.map ``` If you need ruby's map from a Vector, try `.each.map` . ```{ruby} #| tags: [] v.each.map { |x| x / 100.0 } ``` Alias for `#map` is `#collect` Similar method is `Vector#filter/#select`. ## 77. Introduce columns from numo/narray (Until 0.2.2 w/Arrow 9.0.0) We couldn't construct the DataFrame directly from Numo/NArray, but following trick enables. ```{ruby} #| tags: [] DataFrame.new(index: Array(1..10)) .assign do { x0: Numo::DFloat.new(size).rand_norm(0, 2), x1: Numo::DFloat.new(size).rand_norm(5, 2), x2: Numo::DFloat.new(size).rand_norm(10, 2), y0: Numo::DFloat.new(size).rand_norm(100, 10), y1: Numo::DFloat.new(size).rand_norm(200, 10), y2: Numo::DFloat.new(size).rand_norm(300, 10) } end ``` If you do not need the index column, try this. ```{ruby} #| tags: [] DataFrame.new(_: Array(1..10)) .assign do { x0: Numo::DFloat.new(size).rand_norm(0, 2), x1: Numo::DFloat.new(size).rand_norm(5, 2), x2: Numo::DFloat.new(size).rand_norm(10, 2), y0: Numo::DFloat.new(size).rand_norm(100, 10), y1: Numo::DFloat.new(size).rand_norm(200, 10), y2: Numo::DFloat.new(size).rand_norm(300, 10) } end .drop(:_) ``` (New from 0.2.3 with Aroow 10.0.0) It is possible to initialize by objects responsible to `to_arrow` since 0.2.3 . Arrays in Numo::NArray is responsible to `to_arrow` with `red-arrow-numo-narray` gem. This feature is proposed by the Red Data Tools member @kojix2 and implemented by @kou in Arrow 10.0.0 and Red Arrow Numo::NArray 0.0.6. Thanks! ```{ruby} #| tags: [] require 'arrow-numo-narray' size = 10 DataFrame.new( x0: Numo::DFloat.new(size).rand_norm(0, 2), x1: Numo::DFloat.new(size).rand_norm(5, 2), x2: Numo::DFloat.new(size).rand_norm(10, 2), y0: Numo::DFloat.new(size).rand_norm(100, 10), y1: Numo::DFloat.new(size).rand_norm(200, 10), y2: Numo::DFloat.new(size).rand_norm(300, 10) ) ``` ## 78. Join (mutating joins) (Since 0.2.3) ```{ruby} #| tags: [] df = DataFrame.new( KEY: %w[A B C], X1: [1, 2, 3] ) ``` ```{ruby} #| tags: [] other = DataFrame.new( KEY: %w[A B D], X2: [true, false, nil] ) ``` Inner join will join data leaving only the matching records. ```{ruby} #| tags: [] df.inner_join(other, :KEY) ``` If we omit join keys, common keys are automatically chosen (natural key). ```{ruby} #| tags: [] df.inner_join(other) ``` Full join will join data leaving all records. ```{ruby} #| tags: [] df.full_join(other) ``` Left join will join matching values to self from other (type: left_outer). ```{ruby} #| tags: [] df.left_join(other) ``` Right join will join matching values from self to other (type: right_outer). ```{ruby} #| tags: [] df.right_join(other) ``` Left join will join matching values to self from other. ```{ruby} #| tags: [] df.left_join(other) ``` ## 79. Join (filtering joins) (Since 0.2.3) Semi join will return records of self that have a match in other. ```{ruby} #| tags: [] df.semi_join(other) ``` Anti join will return records of self that do not have a match in other. ```{ruby} #| tags: [] df.anti_join(other) ``` ## 80. Partial joins (Since 0.2.3) ```{ruby} #| tags: [] df2 = DataFrame.new( KEY1: %w[A B C], KEY2: %w[s t u], X: [1, 2, 3] ) ``` ```{ruby} #| tags: [] other2 = DataFrame.new( KEY1: %w[A B D], KEY2: %w[s u v], Y: [3, 2, 1] ) ``` ```{ruby} #| tags: [] # natural join df2.inner_join(other2) # Same as df2.inner_join(other2, [:KEY1, :KEY2]) ``` Partial join enables some part of common keys as join keys. Common keys of other not used as join keys will renamed as `:suffix`. Default suffix is '.1'. ```{ruby} #| tags: [] # partial join df2.inner_join(other2, :KEY1) ``` ```{ruby} #| tags: [] df2.inner_join(other2, :KEY1, suffix: '_') ``` ## 81. Order of record in join Order of records is not guaranteed to be preserved before or after join. This is a similar property to RDB. Records behave like a set. If you want to preserve the order of records, it is recommended to add an index or sort. (Since 0.2.3) ```{ruby} #| tags: [] df2 ``` ```{ruby} #| tags: [] other2 ``` ```{ruby} #| tags: [] df2.full_join(other2, :KEY2) ``` ## 82. Set operations Keys in self and other must be same in set operations. (Since 0.2.3) ```{ruby} #| tags: [] df = DataFrame.new( KEY1: %w[A B C], KEY2: [1, 2, 3] ) ``` ```{ruby} #| tags: [] other = DataFrame.new( KEY1: %w[A B D], KEY2: [1, 4, 5] ) ``` Intersect will select records appearing in both self and other. ```{ruby} #| tags: [] df.intersect(other) ``` Union will select records appearing in both self or other. ```{ruby} #| tags: [] df.union(other) ``` Difference will select records appearing in self but not in other. It has an alias `#setdiff`. ```{ruby} #| tags: [] df.difference(other) ``` ## 83. Join (big method) Undocumented big method `join` supports all mutating joins, filtering joins and set operations. |category|method of RedAmber|:type in join method|requirement| |-|-|-|-| |mutating joins|#inner_join|:inner|| |mutating joins|#full_join|:full_outer|| |mutating joins|#left_join|:left_outer|| |mutating joins|#right_join|:right_outer|| |-|-|:right_semi|| |-|-|:right_anti|| |filtering joins|#semi_join|:left_semi|| |filtering joins|#anti_join|:left_anti|| |set operations|#intersect|:inner|must have same keys with self and other| |set operations|#union|:full_outer|must have same keys with self and other| |set operations|#difference|:left_anti|must have same keys with self and other| (Since 0.2.3) ```{ruby} #| tags: [] df = DataFrame.new( KEY: %w[A B C], X1: [1, 2, 3] ) ``` ```{ruby} #| tags: [] other = DataFrame.new( KEY: %w[A B D], X2: [true, false, nil] ) ``` ```{ruby} #| tags: [] df.join(other, :KEY, type: :inner) # Same as df.inner_join(other) ``` (Since 0.5.0) `#join` will not force ordering of original column by default. ## 84. Force order for #join We can use `:force_order` option to ensure unique order for `join` families. This option is true by default in `#inner_join`, `#full_join`, `#left_join`, `#right_join`, `#semi_join` and `#anti_join`. It will append index to the source and sort after joining. It will cause some degradation in performance. (Since 0.4.0) (Since 0.5.0) `#join` will not force ordering of original column by default. ```{ruby} #| tags: [] df2 = DataFrame.new( KEY1: %w[A B C], KEY2: %w[s t u], X: [1, 2, 3] ) ``` ```{ruby} #| tags: [] right2 = DataFrame.new( KEY1: %w[A B D], KEY2: %w[s u v], Y: [3, 2, 1] ) ``` ```{ruby} #| tags: [] df2.full_join(right2, :KEY2) ``` ```{ruby} #| tags: [] df2.full_join(right2, :KEY2, force_order: false) ``` ```{ruby} #| tags: [] df2.full_join(right2, { left: :KEY2, right: 'KEY2' }) ``` ```{ruby} #| tags: [] df2.full_join(right2, { left: :KEY2, right: 'KEY2' }, force_order: false) ``` ## 85. Binding DataFrames in vertical (concatenate) Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self. The alias is `concat`. (Since 0.2.3) ```{ruby} #| tags: [] df = DataFrame.new(x: [1, 2], y: ['A', 'B']) ``` ```{ruby} #| tags: [] other = DataFrame.new(x: [3, 4], y: ['C', 'D']) ``` ```{ruby} #| tags: [] df.concatenate(other) ``` ## 86. Binding DataFrames in lateral (merge) Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self. (Since 0.2.3) ```{ruby} #| tags: [] df = DataFrame.new(x: [1, 2], y: [3, 4]) ``` ```{ruby} #| tags: [] other = DataFrame.new(a: ['A', 'B'], b: ['C', 'D']) ``` ```{ruby} #| tags: [] df.merge(other) ``` ## 87. Join - larger example by nycflight13 (Since 0.2.3) 'nycflights13' dataset is a large dataset. It will take a while for the first run to fetch and prepare red-datasets cache. ```{ruby} #| tags: [] require 'datasets-arrow' package = 'nycflights13' airlines = DataFrame.new(Datasets::Rdatasets.new(package, 'airlines')) airlines ``` Creating `Datasets::Rdatasets.new('flights', 'airlines')` is very slow because Red Datasets uses Ruby's primitive CSV as csv parser. We can parse csv by Arrow's faster parser. ```{ruby} uri = URI('https://vincentarelbundock.github.io/Rdatasets/csv/nycflights13/flights.csv') flights = DataFrame.load(uri) .pick(%i[month day carrier flight tailnum origin dest air_time distance]) flights ``` ```{ruby} # inner join flights.inner_join(airlines, :carrier) # flights.inner_join(airlines) # natural join (same result) ``` ## 88. Vector#split_to_columns Another example using in the DataFrame operation is in [73. From the Pandas cookbook - Multiindexing](#73.-From-the-Pandas-cookbook---Multiindexing). `self` must be a String type Vector. (Since 0.3.0) ```{ruby} #| tags: [] v = Vector.new(['a b', 'c d', 'e f']) ``` ```{ruby} #| tags: [] v.split_to_columns ``` `#split` accepts `sep` argument as a separator. `sep` is passed to `String#split(sep)`. ```{ruby} #| tags: [] Vector.new('ab', 'cd', 'ef') .split_to_columns('') ``` nil will separated as nil. ```{ruby} #| tags: [] Vector.new(nil, 'c d', 'e f') .split_to_columns ``` ## 89. Vector#split_to_rows `#split_to_rows` will separate strings and flatten into row. (Since 0.3.0) ```{ruby} #| tags: [] v = Vector.new(['a b', 'c d', 'e f']) ``` ```{ruby} #| tags: [] v.split_to_rows ``` ## 90. Vector#merge (Since 0.3.0) `Vector#merge(other)` merges `self` and `other` if they are String Vector. ```{ruby} #| tags: [] vector = Vector.new(%w[a c e]) other = Vector.new(%w[b d f]) vector.merge(other) ``` If `other` is scalar, it will be appended to each elements of `self`. ```{ruby} #| tags: [] vector.merge('x') ``` Option `:sep` is used to concatenating elements. Its default value is ' '. ```{ruby} #| tags: [] vector.merge('x', sep: '') ``` ## 91. Separate a variable (column) in a DataFrame (Since 0.3.0) R's separate operation. https://tidyr.tidyverse.org/reference/separate.html ```{ruby} #| tags: [] df = DataFrame.new(xyz: [nil, 'x.y', 'x.z', 'y.z']) ``` Instead of `separate(:xyz, [:a, :b])` we will do: ```{ruby} #| tags: [] df.assign(:A, :B) { xyz.split_to_columns('.') } .drop(:xyz) ``` If you need :B only, instead of `separate(:xyz, [nil, :B])` we will do: ```{ruby} #| tags: [] df.assign(:A, :B) { xyz.split_to_columns('.') } .pick(:B) ``` When splitted length is not equal, split returns max size of Vector Array filled with nil. ```{ruby} #| tags: [] df = DataFrame.new(xyz: ['x', 'x y', 'x y z', nil]) df.assign(:x, :y, :z) { xyz.split_to_columns } ``` Split limiting max 2 elemnts. ```{ruby} #| tags: [] df.assign(:x, :yz) { xyz.split_to_columns(' ', 2) } ``` Another example: ```{ruby} #| tags: [] df = DataFrame.new(id: 1..3, 'month-year': %w[8-2022 9-2022 10-2022]) .assign(:month, :year) { v(:'month-year').split_to_columns('-') } ``` Split between the letters. ```{ruby} #| tags: [] df = DataFrame.new(id: 1..3, yearmonth: %w[202209 202210 202211]) .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) } ``` ## 92. Unite variables (columns) in a DataFrame (Since 0.3.0) R's unite operation. ```{ruby} #| tags: [] df = DataFrame.new(id: 1..3, year: %w[2022 2022 2022], month: %w[09 10 11]) ``` ```{ruby} #| tags: [] df.assign(:yearmonth) { year.merge(month, sep: '') } .pick(:id, :yearmonth) ``` ```{ruby} #| tags: [] # Or directly create: DataFrame.new(id: 1..3, yearmonth: df.year.merge(df.month, sep: '')) ``` ## 93. Separate variable and lengthen into several rows. (Since 0.3.0) R's separate_rows operation. ```{ruby} #| tags: [] df = DataFrame.new(id: 1..3, yearmonth: %w[202209 202210 202211]) .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) } .drop(:yearmonth) .to_long(:id) ``` Another example with different list size. ```{ruby} #| tags: [] df = DataFrame.new( x: 1..3, y: ['a', 'd,e,f', 'g,h'], z: ['1', '2,3,4', '5,6'], ) ``` ```{ruby} #| tags: [] sizes = df.y.split(',').list_sizes a = sizes.to_a.map.with_index(1) { |n, i| [i] * n }.flatten ``` ```{ruby} #| tags: [] DataFrame.new( x: a, y: df.y.split_to_rows(','), z: df.z.split_to_rows(',') ) ``` Another way to use `#split_to_columns`. ```{ruby} #| tags: [] xy = df.pick(:x, :y) .assign(:y, :y1, :y2) { v(:y).split_to_columns(',') } .to_long(:x, value: :y) .remove_nil ``` ```{ruby} #| tags: [] xz = df.pick(:x, :z) .assign(:z, :z1, :z2) { v(:z).split_to_columns(',') } .to_long(:x, value: :z) .remove_nil ``` ```{ruby} #| tags: [] xy.pick(:x, :y).merge(xz.pick(:z)) ``` Get all combinations of :y and :z. ```{ruby} #| tags: [] df.assign(:y, :y1, :y2) { v(:y).split_to_columns(',') } .to_long(:x, :z, value: :y) .drop(:NAME) .assign(:z, :z1, :z2) { v(:z).split_to_columns(',') } .to_long(:x, :y, value: :z) .drop(:NAME) .drop_nil ``` ## 94. Vector#propagate Spread the return value of an aggregate function as if it is a element-wise function. It has an alias `#expand`. (Since 0.4.0) ```{ruby} #| tags: [] vec = Vector.new(1, 2, 3, 4) vec.propagate(:mean) ``` Block is also available. ```{ruby} #| tags: [] vec.propagate { |v| v.mean.round } ``` ## 95. DataFrame#propagate Returns a Vector such that all elements have value `scalar` and have same size as self. (Since 0.5.0) ```{ruby} #| tags: [] df ``` ```{ruby} #| tags: [] df.assign(:sum_x) { propagate(x.sum) } ``` With a block. ```{ruby} #| tags: [] df.assign(:range) { propagate { x.max - x.min } } ``` ## 96. Vector#sort / #sort_indices `#sort` will arrange values in Vector. Accepts :sort order option: - `:+`, `:ascending` or without argument will sort in increasing order. - `:-` or `:descending` will sort in decreasing order. (Since 0.4.0) ```{ruby} #| tags: [] vector = Vector.new(%w[B D A E C]) vector.sort # same as vector.sort(:+) # same as vector.sort(:ascending) ``` Sort in decreasing order; ```{ruby} #| tags: [] vector.sort(:-) # same as vector.sort(:descending) ``` ## 97. Vector#rank Returns 1-based numerical rank of self. - Nil values are considered greater than any value. - NaN values are considered greater than any value but smaller than nil values. - Sort order can be controlled by the option `order`. * `:ascending` or `+` will compute rank in ascending order (default). * `:descending` or `-` will compute rank in descending order. - Tiebreakers will configure how ties between equal values are handled. * `tie: :first` : Ranks are assigned in order of when ties appear in the input (default). * `tie: :min` : Ties get the smallest possible rank in the sorted order. * `tie: :max` : Ties get the largest possible rank in the sorted order. * `tie: :dense` : The ranks span a dense [1, M] interval where M is the number of distinct values in the input. - Placement of nil and NaN is controlled by the option `null_placement`. * `null_placement: :at_end` : place nulls at end (default). * `null_placement: :at_start` : place nulls at the top of Vector. (Since 0.4.0, revised in 0.5.1) Rank of float Vector; ```{ruby} #| tags: [] float = Vector[1, 0, nil, Float::NAN, Float::INFINITY, -Float::INFINITY, 3, 2] ``` ```{ruby} #| tags: [] # Same as float.rank(:ascending, tie: :first, null_placement: :at_end) float.rank ``` With sort order; ```{ruby} #| tags: [] float.rank(:descending) # or float.rank('-') ``` With null placement; ```{ruby} #| tags: [] float.rank(null_placement: :at_start) ``` Rank of string Vector with tiebreakers; ```{ruby} #| tags: [] string = Vector['A', 'A', nil, nil, 'C', 'B'] ``` ```{ruby} #| tags: [] string.rank # same as string.rank(tie: :first) ``` ```{ruby} #| tags: [] string.rank(tie: :min) ``` ```{ruby} #| tags: [] string.rank(tie: :max) ``` ```{ruby} #| tags: [] string.rank(tie: :dense) ``` ## 98. Vector#sample Pick up elements at random. (Since 0.4.0) Return a randomly selected element. This is one of an aggregation function. ```{ruby} #| tags: [] v = Vector.new('A'..'H') ``` Returns scalar without any arguments. ```{ruby} #| tags: [] v.sample ``` `sample(n)` will pick up `n` elements at random. `n` is a positive number of elements to pick. If n is smaller or equal to size, elements are picked by non-repeating. If n == 1 (in case of `sample(1)`), it returns a Vector of size == 1 not a scalar. ```{ruby} #| tags: [] v.sample(1) ``` Sample same size of self: every element is picked in random order. ```{ruby} #| tags: [] v.sample(8) ``` If n is greater than `size`, some elements are picked repeatedly. ```{ruby} #| tags: [] v.sample(9) ``` `sample(prop)` will pick up elements by proportion `prop` at random. `prop` must be positive float. - Absolute number of elements to pick:`prop*size` is truncated. - If prop is smaller or equal to 1.0, elements are picked by non-repeating. ```{ruby} #| tags: [] v.sample(0.7) ``` If picked element is only one, it returns a Vector of size == 1 not a scalar. ```{ruby} #| tags: [] v.sample(0.1) ``` Sample same size of self: every element is picked in random order. ```{ruby} #| tags: [] v.sample(1.0) ``` If prop is greater than 1.0, some elements are picked repeatedly. ```{ruby} #| tags: [] # 2 times over sampling sampled = v.sample(2.0) ``` ```{ruby} #| tags: [] sampled.tally ``` ## 99. DataFrame#sample/shuffle (Since 0.5.0) Select records randomly to create a DataFrame. ```{ruby} #| tags: [] penguins.sample(0.1) ``` Returns a DataFrame with shuffled rows. ```{ruby} #| tags: [] penguins.shuffle ``` ## 100. Vector#concatenate Concatenate other array-like to self. (Since 0.4.0) Concatenate to string; ```{ruby} #| tags: [] string = Vector.new(%w[A B]) ``` ```{ruby} #| tags: [] string.concatenate([1, 2]) ``` Concatenate to integer; ```{ruby} #| tags: [] integer = Vector.new(1, 2) ``` ```{ruby} #| tags: [] integer.concatenate(["A", "B"]) ``` ## 101. Vector#resolve Return other as a Vector which is same data type as self. (Since 0.4.0) Integer to String; ```{ruby} #| tags: [] Vector.new('A').resolve([1, 2]) ``` String to Ineger; ```{ruby} #| tags: [] Vector.new(1).resolve(['A']) ``` Upcast to uint16; ```{ruby} #| tags: [] vector = Vector.new(256) ``` Not a uint8 Vector; ```{ruby} #| tags: [] vector.resolve([1, 2]) ``` ## 102. Vector#cast Cast self to `type`. (since 0.4.2) ```{ruby} #| tags: [] vector = Vector.new(1, 2, nil) vector.cast(:int16) ``` ```{ruby} #| tags: [] vector.cast(:double) ``` ```{ruby} #| tags: [] vector.cast(:string) ``` ## 103. Vector#one Get a non-nil element in self. If all elements are nil, return nil. (since 0.4.2) ```{ruby} #| tags: [] vector = Vector.new([nil, 1, 3]) vector.one ``` ## 104. SubFrames `SubFrames` is a new concept of DataFrame collection. It represents ordered subsets of a DataFrame collected by some rules. It includes both grouping and windowing concepts in a unified manner, and also covers broader cases more flexibly. (Since 0.4.0) ```{ruby} #| tags: [] dataframe = DataFrame.new( x: [*1..6], y: %w[A A B B B C], z: [false, true, false, nil, true, false] ) p dataframe; nil ``` ```{ruby} #| tags: [] sf = SubFrames.new(dataframe, [[0, 1], [2, 3, 4], [5]]) ``` Source DataFrame (univarsal set). ```{ruby} #| tags: [] sf.baseframe ``` Size of subsets. ```{ruby} #| tags: [] sf.size ``` Sizes of each subsets. ```{ruby} #| tags: [] sf.sizes ``` `#each` will return an Enumerator or iterates each subset as a DataFrame. ```{ruby} #| tags: [] sf.each ``` ```{ruby} #| tags: [] sf.each.next ``` `SubFrames.new` also accepts a block. ```{ruby} #| tags: [] usf = SubFrames.new(dataframe) { |df| [df.indices] } ``` `#universal?` tests if self is an univarsal set. ```{ruby} #| tags: [] usf.universal? ``` `#empty?` tests if self is an empty set. ```{ruby} #| tags: [] esf = SubFrames.new(dataframe, []) ``` ```{ruby} #| scrolled: true #| tags: [] esf.empty? ``` `#take(n)` takes n sub dataframes and return them by SubFrames. If n >= size, it returns self. ```{ruby} sf.take(2) ``` `#offset_indices` returns indices at the top of each sub DataFrames. ```{ruby} sf.offset_indices ``` `#frames` returns an Array of sub DataFrames. ```{ruby} sf.frames ``` `SubFrames.new` also accepts boolean filters even from the block. ```{ruby} #| tags: [] small = dataframe.x < 4 large = !small small_large = SubFrames.new(dataframe) { [small, large] } ``` ## 105. SubFrames#concatenate `SubFrames#concatenate` (or alias `#concat`) will concatenate SubFrames to create a DataFrame. (Since 0.4.0) ```{ruby} #| tags: [] sf.concatenate ``` ## 106. SubFrames.by_group Create SubFrames by Group object. (Since 0.4.0) ```{ruby} #| tags: [] p dataframe; nil ``` ```{ruby} #| tags: [] group = Group.new(dataframe, [:y]) sf = SubFrames.by_group(group) ``` ## 107. SubFrames.by_indices/.by_filters `SubFrames.by_indices(dataframe, subset_indices)` creates a new SubFrames object from a DataFrame and an array of indices.# ```{ruby} SubFrames.by_indices(dataframe, [[0, 2, 4], [1, 3, 5]]) ``` `SubFrames.by_filters(dataframe, subset_filters)` creates a new SubFrames object from a DataFrame and an array of filters. ```{ruby} #| scrolled: true SubFrames.by_filters(dataframe, [[true, false, true, false, nil, false], [true, true, false, false, nil, false]]) ``` ## 108. SubFrames.by_dataframes `SubFrames.by_dataframes(dataframes)` creates a new SubFrames from an Array of DataFrames. ```{ruby} dataframes = [ DataFrame.new(x: [1, 2, 3], y: %w[A A B], z: [false, true, false]), DataFrame.new(x: [4, 5, 6], y: %w[B B C], z: [nil, true, false]) ] ``` ```{ruby} SubFrames.by_dataframes(dataframes) ``` ## 109. DataFrame#sub_by_value `sub_by_value(*keys)` make subframes by value. It is corresponding to Group processing. Create SubFrames from keys and group by values in columns specified by the key. (Since 0.4.0) ```{ruby} #| tags: [] dataframe.sub_by_value(:y) ``` ## 110. DataFrame#sub_by_window Create SubFrames by window in `size` rolling `from` by `step`. Default values is `from: 0`, `size: nil` and `step: 1`. (Since 0.4.0) ```{ruby} #| tags: [] dataframe.sub_by_window(size: 4, step: 2) ``` ## 111. DataFrame#sub_by_enum Create SubFrames by Grouping/Windowing by posion. The position is specified by `Array`'s enumerator method such as `each_slice` or `each_cons`. (Since 0.4.0) Create a SubFrames object sliced by 3 rows. This is MECE (Mutually Exclusive and Collectively Exhaustive) SubFrames. ```{ruby} #| tags: [] dataframe.sub_by_enum(:each_slice, 3) ``` Create a SubFrames object for each consecutive 3 rows. ```{ruby} #| tags: [] dataframe.sub_by_enum(:each_cons, 4) ``` ## 112. DataFrame#sub_by_kernel Create SubFrames by windowing with a kernel and step. Kernel is a boolean Array and it behaves like a masked window. (Since 0.4.0) ```{ruby} #| tags: [] kernel = [true, false, false, true] dataframe.sub_by_kernel(kernel, step: 2) ``` ## 113. DataFrame#build_subframes Generic builder of sub-dataframe from self. (Sice 0.4.0) ```{ruby} #| tags: [] dataframe.build_subframes([[0, 2, 4], [1, 3, 5]]) ``` `#build_subframes` also accepts a block. ```{ruby} #| tags: [] dataframe.build_subframes do |df| even = df.indices.map(&:even?) [even, !even] end ``` ## 114. SubFrames#aggregate Aggregate SubFrames to create a DataFrame. There are 4 APIs in this method. (Since 0.4.0) - `#aggregate(keys) { columns }` Aggregate SubFrames creating DataFrame with label `keys` and its column values by block. ```{ruby} #| tags: [] sf = dataframe.sub_by_value(:y) ``` ```{ruby} sf.aggregate(:y, :sum_x) { [y.one, x.sum] } # sf.aggregate([:y, :sum_x]) { [y.one, x.sum] } is also acceptable ``` - `#aggregate { key_and_aggregated_values }` Aggregate SubFrames creating DataFrame with pairs of key and aggregated values in Hash from the block. ```{ruby} sf.aggregate do { y: y.one, sum_x: x.sum } end ``` - `#aggregate { [keys, values] }` Aggregate SubFrames creating DataFrame with an Array of key and aggregated value from the block. ```{ruby} #| tags: [] sf.aggregate do [[:y, y.one], [:sum_x, x.sum]] end ``` - `#aggregate(group_keys, aggregations)` Aggregate SubFrames for first values of the columns of `group_keys` and the aggregated results of key-function pairs. ( [Experiment)l] This API may be changed in the future. ```{ruby} #| tags: [] sf.aggregate(:y, { x: :sum, z: :count }) ``` ## 115. SubFrames#map/#assign `#map` Returns a SubFrames containing DataFrames returned by the block. It has an alias `collect`. ```{ruby} sf ``` This example assigns a new column. ```{ruby} sf.map { |df| df.assign(x_plus1: df[:x] + 1) } ``` There is a shortcut of `map { assign }`. We can use `assign(key) { updated_column }`. ```{ruby} sf.assign(:x_plus1) { x + 1 } ``` We can use `assign(keys) { updated_columns }` for multiple columns. ```{ruby} sf.assign(:sum_x, :flac_x) do group_sum = x.sum [[group_sum] * x.size, x / group_sum.to_f] end ``` Also `assign { keys_and_columns }` is possible. ```{ruby} sf.assign do { 'x*z': x * z.if_else(1, 0) } end ``` (Notice) `SubFrames#assign` has a same syntax as `DataFrame#assign`. If you need an Array of DataFrames (not a SubFrames), use `each.map` instead. ```{ruby} sf.each.map { |df| df.assign(x_plus1: df[:x] + 1) } ``` ## 116. SubFrames#select/#reject `#select` returns a SubFrames containing DataFrames selected by the block.# ```{ruby} sf.select { |df| df[:z].any? } ``` `#select` has aliases `#filter` and `#find_all`. `#reject` returns a SubFrames containing truthy DataFrames returned by the block.# ```{ruby} sf.reject { |df| df[:z].any? } ``` ## 117. SubFrames#filter_map It returns a SubFrames containing truthy DataFrames returned by the block. ```{ruby} sf.filter_map do |df| if df.size > 1 df.assign(:y) do y.merge(indices('1'), sep: '') end end end ``` ## 118. Vector#modulo (Since 0.4.1) `#%` is an alias of `#modulo`. ```{ruby} #| tags: [] vector = Vector.new(5, -3, 1) vector % 3 ``` `#%` and `#modulo` is equivalent to `self-divisor*(self/divisor).floor`. ```{ruby} #| tags: [] vector.modulo(-2) ``` ## 119. Vector#mode Compute the 1 most common values and their respective occurence counts. (since 0.5.0) ModeOptions are not supported in 0.5.0 . Only one mode value is returned. ```{ruby} #| tags: [] Vector[true, true, false, nil].mode ``` ```{ruby} #| tags: [] Vector[0, 1, 1, 2, nil].mode ``` ```{ruby} #| tags: [] Vector[1, 0/0.0, -1/0.0, 1/0.0, nil].mode ``` ## 120. Vector#end_with/start_with Check if elements in self ends/starts with a literal pattern. (since 0.5.0) ```{ruby} #| tags: [] v = Vector['array', 'Arrow', 'carrot', nil, 'window'] ``` Emits true if it contains `string`. Emit false if not found. Nil inputs emit nil. ```{ruby} #| tags: [] v.end_with('ow') ``` ```{ruby} #| tags: [] v.start_with('arr') ``` ## 121. Vector#match_substring For each string in self, emit true if it contains a given pattern. (since 0.5.0) ```{ruby} #| tags: [] v = Vector['array', 'Arrow', 'carrot', nil, 'window'] ``` Emits true if it contains `string`. Emit false if not found. Nil inputs emit nil. ```{ruby} #| tags: [] v.match_substring('arr') ``` Otherwise use it with Regexp pattern. It calls `count_substring_regex` in Arrow compute function and uses re2 library. ```{ruby} #| tags: [] v.match_substring(/arr/) ``` You can ignore case if you use regexp with `i` option, or `igfnore_case: true` ```{ruby} #| tags: [] v.match_substring(/arr/i) # same as v.find_substring(/arr/, ignore_case: true) ``` ## 122. Vector#match_like Match elements of self against SQL-style LIKE pattern. The pattern matches a given pattern at any position. - '%' will match any number of characters, - '_' will match exactly one character, and any other character matches itself. - To match a literal '%', '_', or '\', precede the character with a backslash. (since 0.5.0) ```{ruby} #| tags: [] v = Vector['array', 'Arrow', 'carrot', nil, 'window'] ``` You can find indices of a literal string. Emit -1 if not found. Nil inputs emit nil. ```{ruby} #| tags: [] v.match_like('_arr%') ``` You can ignore case if you use the option `igfnore_case: true`. ```{ruby} #| tags: [] v.match_like('arr%', ignore_case: true) ``` ## 123. Vector#find_substring Find first occurrence of substring in string Vector. (since 0.5.1) ```{ruby} #| tags: [] v = Vector['array', 'Arrow', 'carrot', nil, 'window'] ``` You can find indices of a literal string. Emit -1 if not found. Nil inputs emit nil. ```{ruby} #| tags: [] v.find_substring('arr') ``` Otherwise use it with Regexp pattern. It calls `count_substring_regex` in Arrow compute function and uses re2 library. ```{ruby} #| tags: [] v.find_substring(/arr/) ``` You can ignore case if you use regexp with `i` option, or `igfnore_case: true` ```{ruby} #| tags: [] v.find_substring(/arr/i) # same as v.find_substring(/arr/, ignore_case: true) ``` ## 124. Vector#count_substring For each string in self, count occuerences of substring in given pattern. (since 0.5.0) ```{ruby} #| tags: [] v = Vector['amber', 'Amazon', 'banana', nil] ``` You can find indices of a literal string. Emit -1 if not found. Nil inputs emit nil. ```{ruby} #| tags: [] v.count_substring('an') ``` Otherwise use it with Regexp pattern. It calls `count_substring_regex` in Arrow compute function and uses re2 library. ```{ruby} #| tags: [] v.count_substring(/a[mn]/) ``` You can ignore case if you use regexp with `i` option, or `igfnore_case: true` ```{ruby} #| tags: [] v.count_substring(/a[mn]/i) # same as v.find_substring(/arr/, ignore_case: true) ``` ## 125. Grouped DataFrame as a list This API was introduced in 0.2.3, and supply a new DataFrame group (experimental). This additional API will treat a grouped DataFrame as a list of DataFrames. I think this API has pros such as: - API is easy to understand and flexible. - It has good compatibility with Ruby's primitive Enumerables. - We can only use non hash-ed aggregation functions. - Do not need grouped DataFrame state, nor `#ungroup` method. - May be useful for concurrent operations. This feature is implemented by Ruby, so it is pretty slow and experimental. Use original Group API for practical purpose. (Since 0.2.3, experimental feature => This was upgraded to SubFrames feature) ```{ruby} enum = penguins.group(:island).each ``` ```{ruby} enum.to_a ``` ```{ruby} array = enum.map do |df| DataFrame.new(island: [df.island[0]]).assign do df.variables.each_with_object({}) do |(key, vec), hash| next unless vec.numeric? hash["mean(#{key})"] = [vec.mean] end end end ``` ```{ruby} array.reduce { |a, df| a.concat df } ``` ## 126. ArrowFunction helpers `ArrowFunction` module adds two helper method. `ArrowFunction.find(function_name)` returns Arrow Function object in Arrow C++ Compute Functions. ```{ruby} ArrowFunction.find(:mean) ``` To execute this function, ```{ruby} ArrowFunction.find(:mean).execute([[1, 2, 3, 4]]).value.value ``` `ArrowFunction.arrow_doc(function_name)` returns a document of Arrow C++ Compute Function in a string. ```{ruby} puts ArrowFunction.arrow_doc(:mean) ``` ## 127. DataFrame.auto_cast A data set for planetary data in https://nssdc.gsfc.nasa.gov/planetary/factsheet/ is used here. Let's manually copy the data in the html table and get the tab separated text values. ```{ruby} tsv = ' MERCURY VENUS EARTH MOON MARS JUPITER SATURN URANUS NEPTUNE PLUTO Mass (1024kg) 0.330 4.87 5.97 0.073 0.642 1898 568 86.8 102 0.0130 Diameter (km) 4879 12,104 12,756 3475 6792 142,984 120,536 51,118 49,528 2376 Density (kg/m3) 5429 5243 5514 3340 3934 1326 687 1270 1638 1850 Gravity (m/s2) 3.7 8.9 9.8 1.6 3.7 23.1 9.0 8.7 11.0 0.7 Escape Velocity (km/s) 4.3 10.4 11.2 2.4 5.0 59.5 35.5 21.3 23.5 1.3 Rotation Period (hours) 1407.6 -5832.5 23.9 655.7 24.6 9.9 10.7 -17.2 16.1 -153.3 Length of Day (hours) 4222.6 2802.0 24.0 708.7 24.7 9.9 10.7 17.2 16.1 153.3 Distance from Sun (106 km) 57.9 108.2 149.6 0.384* 228.0 778.5 1432.0 2867.0 4515.0 5906.4 Perihelion (106 km) 46.0 107.5 147.1 0.363* 206.7 740.6 1357.6 2732.7 4471.1 4436.8 Aphelion (106 km) 69.8 108.9 152.1 0.406* 249.3 816.4 1506.5 3001.4 4558.9 7375.9 Orbital Period (days) 88.0 224.7 365.2 27.3* 687.0 4331 10,747 30,589 59,800 90,560 Orbital Velocity (km/s) 47.4 35.0 29.8 1.0* 24.1 13.1 9.7 6.8 5.4 4.7 Orbital Inclination (degrees) 7.0 3.4 0.0 5.1 1.8 1.3 2.5 0.8 1.8 17.2 Orbital Eccentricity 0.206 0.007 0.017 0.055 0.094 0.049 0.052 0.047 0.010 0.244 Obliquity to Orbit (degrees) 0.034 177.4 23.4 6.7 25.2 3.1 26.7 97.8 28.3 122.5 Mean Temperature (C) 167 464 15 -20 -65 -110 -140 -195 -200 -225 Surface Pressure (bars) 0 92 1 0 0.01 Unknown* Unknown* Unknown* Unknown* 0.00001 Number of Moons 0 0 1 0 2 79 82 27 14 5 Ring System? No No No No No Yes Yes Yes Yes No Global Magnetic Field? Yes No Yes No No Yes Yes Yes Yes Unknown ' raw_dataframe = DataFrame.load(Arrow::Buffer.new(tsv), format: :tsv) ENV['RED_AMBER_OUTPUT_MODE'] = 'plain' raw_dataframe ``` This dataframe has row oriented calues. So we must transpose the dataframe. ```{ruby} transposed = raw_dataframe.transpose ``` This dataframe has string columns. We can cast each numeric columns, recommended way is to use `#auto_cast`. `#auto_cast` save it in temporally tsv file and re-open it to get a casted dataframe. ```{ruby} transposed.auto_cast ``` There are still some dirts to be cleaned in this dataframe, we don't touch them here. If you are interested, give it a try! - Rename a column 'NAME' to 'Planet_name'. - Remove preceding/trailing spaces in 'Planet_name' values. - Capitalize 'Planet_name' values. - Remove data for 'Moon' and 'Pluto' to create the Table for planets. - Convert 'Unknown*' to nil. - Change 'Yes' / 'No' values to true / false (change column type to boolean). - Remove comma in numeric values. They obstruct to be numeric columns. - Correct cell values which have '*'. They obstruct to be numeric columns. - Add missing '^' to unit in labels.