{ "cells": [ { "cell_type": "markdown", "id": "e355db8b-ebb6-4ea6-97b5-3b9fdadc302c", "metadata": {}, "source": [ "# 53 examples of Red Amber" ] }, { "cell_type": "markdown", "id": "f20f4970-db38-4d96-9a36-d4cf9d007596", "metadata": {}, "source": [ "Last update: August 3, 2022, for RedAmber Version 0.1.8" ] }, { "cell_type": "markdown", "id": "f6e927d0-b59a-4c4e-9f8a-4fa08f9a6b2f", "metadata": {}, "source": [ "## 1. Install" ] }, { "cell_type": "markdown", "id": "85eacfe6-fa11-4749-844f-5914d6cd7dbc", "metadata": {}, "source": [ "Install requirements before you install Red Amber.\n", "\n", "- Apache Arrow GLib (>= 8.0.0)\n", "\n", "- Apache Parquet GLib (>= 8.0.0) # if you need IO from/to Parquet resource.\n", "\n", " See [Apache Arrow install document](https://arrow.apache.org/install/).\n", " \n", " Minimum installation example for the latest Ubuntu is in the ['Prepare the Apache Arrow' section in ci test](https://github.com/heronshoes/red_amber/blob/master/.github/workflows/test.yml) of Red Amber.\n", "\n", "Then add this line to your Gemfile:\n", "```\n", "gem 'red_amber'\n", "```\n", "\n", "And then execute:\n", "```\n", "$ bundle install\n", "```\n", "\n", "Or install it yourself as:\n", "```\n", "$ gem install red_amber\n", "```" ] }, { "cell_type": "markdown", "id": "8c08c45d-0818-4b43-bc65-4d43dd8b6b66", "metadata": {}, "source": [ "## 2. Require" ] }, { "cell_type": "code", "execution_count": 1, "id": "74b76022-03ea-40ae-bac8-fc8743659042", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"0.1.8-HEAD\"" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'red_amber' # require 'red-amber' is also OK\n", "include RedAmber\n", "VERSION" ] }, { "cell_type": "markdown", "id": "d8fb6289-39ea-4fa9-a165-b87ee6d125e9", "metadata": { "tags": [] }, "source": [ "## 3. Initialize" ] }, { "cell_type": "code", "execution_count": 2, "id": "51f81824-626a-4741-a29b-30ea357fe7b5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a Hash\n", "DataFrame.new(x: [1, 2, 3], y: %w[A B C])" ] }, { "cell_type": "code", "execution_count": 3, "id": "20b696eb-c199-444d-a957-e0b1081f1506", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a schema and a column array\n", "DataFrame.new({ x: :uint8, y: :string }, [[1, 'A'], [2, 'B'], [3, 'C']])" ] }, { "cell_type": "code", "execution_count": 4, "id": "21eab151-f977-4474-a6d1-576169e24b26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a Arrow::Table\n", "table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C])\n", "DataFrame.new(table)" ] }, { "cell_type": "code", "execution_count": 5, "id": "aa09d3da-f332-45cd-92ca-712c6a679035", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a Rover::DataFrame\n", "require 'rover'\n", "rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C])\n", "DataFrame.new(rover)" ] }, { "cell_type": "code", "execution_count": 6, "id": "cd2c3677-00fb-48fe-bb94-18bc0815db72", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# from a red-datasets\n", "require 'datasets-arrow'\n", "dataset = Datasets::Penguins.new\n", "penguins = DataFrame.new(dataset.to_arrow)" ] }, { "cell_type": "markdown", "id": "3a2d12b4-7623-42c7-9e32-76cf303c7cea", "metadata": {}, "source": [ "It should be in future version;\n", "```ruby\n", "require 'datasets-red-amber'\n", "penguins = Datasets::Penguins.new.to_red_amber\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "id": "2e4619b7-bf6d-4081-9066-b186da8fdf5b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <32 x 11 vectors>
mpgcyldisphpdratwtqsecvsamgearcarb
21.06160.01103.92.6216.460144
21.06160.01103.92.87517.020144
22.84108.0933.852.3218.611141
21.46258.01103.083.21519.441031
19.76145.01753.622.7715.50156
15.08301.03353.543.5714.60158
21.44121.01094.112.7818.61142
" ], "text/plain": [ "#\n", " mpg cyl disp hp drat wt qsec vs am ... carb\n", " ... \n", " 1 21.0 6 160.0 110 3.9 2.6 16.5 0 1 ... 4\n", " 2 21.0 6 160.0 110 3.9 2.9 17.0 0 1 ... 4\n", " 3 22.8 4 108.0 93 3.9 2.3 18.6 1 1 ... 1\n", " 4 21.4 6 258.0 110 3.1 3.2 19.4 1 0 ... 1\n", " 5 18.7 8 360.0 175 3.2 3.4 17.0 0 0 ... 2\n", " : : : : : : : : : : ... :\n", "30 19.7 6 145.0 175 3.6 2.8 15.5 0 1 ... 6\n", "31 15.0 8 301.0 335 3.5 3.6 14.6 0 1 ... 8\n", "32 21.4 4 121.0 109 4.1 2.8 18.6 1 1 ... 2\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = Datasets::Rdatasets.new('datasets', 'mtcars')\n", "mtcars = DataFrame.new(dataset.to_arrow)" ] }, { "cell_type": "markdown", "id": "e1f77a54-3a43-4d17-bb6f-332ef13832a3", "metadata": {}, "source": [ "## 4. Load" ] }, { "cell_type": "markdown", "id": "0fed4f43-3fbb-43e5-af0d-f93401deea78", "metadata": {}, "source": [ "`RedAmber::DataFrame` delegates `#load` to `Arrow::Table#load`. We can load from `[.arrow, .arrows, .csv, .csv.gz, .tsv]` files." ] }, { "cell_type": "code", "execution_count": 8, "id": "4203e671-0a0a-405c-8482-53a8cd78a891", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame.load(\"test/entity/with_header.csv\")" ] }, { "cell_type": "markdown", "id": "29875147-1371-4575-a565-69c3534c15f2", "metadata": {}, "source": [ "## 5. Load from a URI" ] }, { "cell_type": "code", "execution_count": 9, "id": "916b86e2-e3a2-4ebb-8770-9e8a29c46523", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 7 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
AdelieTorgersen39.118.71813750MALE
AdelieTorgersen39.517.41863800FEMALE
AdelieTorgersen40.318.01953250FEMALE
AdelieTorgersen(nil)(nil)(nil)(nil)
GentooBiscoe50.415.72225750MALE
GentooBiscoe45.214.82125200FEMALE
GentooBiscoe49.916.12135400MALE
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... sex\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... MALE\n", " 2 Adelie Torgersen 39.5 17.4 186 ... FEMALE\n", " 3 Adelie Torgersen 40.3 18.0 195 ... FEMALE\n", " 4 Adelie Torgersen (nil) (nil) (nil) ...\n", " 5 Adelie Torgersen 36.7 19.3 193 ... FEMALE\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... MALE\n", "343 Gentoo Biscoe 45.2 14.8 212 ... FEMALE\n", "344 Gentoo Biscoe 49.9 16.1 213 ... MALE\n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uri = URI(\"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv\")\n", "DataFrame.load(uri)" ] }, { "cell_type": "markdown", "id": "e6abe64d-e97f-437e-9c54-18f9e06e9668", "metadata": {}, "source": [ "## 6. Save" ] }, { "cell_type": "code", "execution_count": 10, "id": "91c0fb62-7990-47f1-9fb6-b0529bc1783f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.save(\"file.arrow\")\n", "penguins.save(\"file.arrows\")\n", "penguins.save(\"file.csv\")\n", "penguins.save(\"file.csv.gz\")\n", "penguins.save(\"file.tsv\")\n", "penguins.save(\"file.feather\")" ] }, { "cell_type": "markdown", "id": "d1d30973-9e2f-406a-9f42-9e6e4c966baf", "metadata": {}, "source": [ "## 7. to_s/inspect" ] }, { "cell_type": "markdown", "id": "a7bc9cb7-eae4-495f-831e-b747e486d0bd", "metadata": {}, "source": [ "`to_s` or `inspect` (it uses to_s inside) shows a preview of the dataframe.\n", "\n", "It shows first 5 and last 3 rows if it has many rows. Columns are also omitted if line is exceeded 80 letters." ] }, { "cell_type": "code", "execution_count": 11, "id": "af6d29ef-2e1c-4a08-a8b2-d69acda79ec5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 2 2.0 B false\n", "3 3 3.0 C true\n", "4 4 NaN D false\n", "5 5 (nil) (nil) (nil)\n", "\n" ] } ], "source": [ "df = DataFrame.new(\n", " x: [1, 2, 3, 4, 5],\n", " y: [1, 2, 3, 0/0.0, nil],\n", " s: %w[A B C D] << nil,\n", " b: [true, false, true, false, nil])\n", "p df; nil" ] }, { "cell_type": "code", "execution_count": 12, "id": "cdff2e60-bd0a-4d12-b348-201a49bbbbbe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n", "\n" ] } ], "source": [ "p penguins; nil" ] }, { "cell_type": "markdown", "id": "cb44df38-58f7-479c-b7a4-c9c305639292", "metadata": {}, "source": [ "## 8. Show table" ] }, { "cell_type": "code", "execution_count": 13, "id": "fc710035-8134-4b18-89fe-8c58b95e0e0e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\tx\t y\ts\tb\n", "0\t1\t 1.000000\tA\ttrue\n", "1\t2\t 2.000000\tB\tfalse\n", "2\t3\t 3.000000\tC\ttrue\n", "3\t4\t NaN\tD\tfalse\n", "4\t5\t (null)\t(null)\t(null)\n" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.table" ] }, { "cell_type": "code", "execution_count": 14, "id": "2634fb7b-194f-4277-94ba-05f39c497ffa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\tspecies\tisland\tbill_length_mm\tbill_depth_mm\tflipper_length_mm\tbody_mass_g\tsex\tyear\n", " 0\tAdelie \tTorgersen\t 39.100000\t 18.700000\t 181\t 3750\tmale\t2007\n", " 1\tAdelie \tTorgersen\t 39.500000\t 17.400000\t 186\t 3800\tfemale\t2007\n", " 2\tAdelie \tTorgersen\t 40.300000\t 18.000000\t 195\t 3250\tfemale\t2007\n", " 3\tAdelie \tTorgersen\t (null)\t (null)\t (null)\t (null)\t(null)\t2007\n", " 4\tAdelie \tTorgersen\t 36.700000\t 19.300000\t 193\t 3450\tfemale\t2007\n", " 5\tAdelie \tTorgersen\t 39.300000\t 20.600000\t 190\t 3650\tmale\t2007\n", " 6\tAdelie \tTorgersen\t 38.900000\t 17.800000\t 181\t 3625\tfemale\t2007\n", " 7\tAdelie \tTorgersen\t 39.200000\t 19.600000\t 195\t 4675\tmale\t2007\n", " 8\tAdelie \tTorgersen\t 34.100000\t 18.100000\t 193\t 3475\t(null)\t2007\n", " 9\tAdelie \tTorgersen\t 42.000000\t 20.200000\t 190\t 4250\t(null)\t2007\n", "...\n", "334\tGentoo \tBiscoe\t 46.200000\t 14.100000\t 217\t 4375\tfemale\t2009\n", "335\tGentoo \tBiscoe\t 55.100000\t 16.000000\t 230\t 5850\tmale\t2009\n", "336\tGentoo \tBiscoe\t 44.500000\t 15.700000\t 217\t 4875\t(null)\t2009\n", "337\tGentoo \tBiscoe\t 48.800000\t 16.200000\t 222\t 6000\tmale\t2009\n", "338\tGentoo \tBiscoe\t 47.200000\t 13.700000\t 214\t 4925\tfemale\t2009\n", "339\tGentoo \tBiscoe\t (null)\t (null)\t (null)\t (null)\t(null)\t2009\n", "340\tGentoo \tBiscoe\t 46.800000\t 14.300000\t 215\t 4850\tfemale\t2009\n", "341\tGentoo \tBiscoe\t 50.400000\t 15.700000\t 222\t 5750\tmale\t2009\n", "342\tGentoo \tBiscoe\t 45.200000\t 14.800000\t 212\t 5200\tfemale\t2009\n", "343\tGentoo \tBiscoe\t 49.900000\t 16.100000\t 213\t 5400\tmale\t2009\n" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.table" ] }, { "cell_type": "code", "execution_count": 15, "id": "9dba2a67-ede7-4663-907b-9b2dd5db1605", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x: uint8\n", "y: double\n", "s: string\n", "b: bool\n", "----\n", "x:\n", " [\n", " [\n", " 1,\n", " 2,\n", " 3,\n", " 4,\n", " 5\n", " ]\n", " ]\n", "y:\n", " [\n", " [\n", " 1,\n", " 2,\n", " 3,\n", " nan,\n", " null\n", " ]\n", " ]\n", "s:\n", " [\n", " [\n", " \"A\",\n", " \"B\",\n", " \"C\",\n", " \"D\",\n", " null\n", " ]\n", " ]\n", "b:\n", " [\n", " [\n", " true,\n", " false,\n", " true,\n", " false,\n", " null\n", " ]\n", " ]\n" ] } ], "source": [ "# This is a Red Arrow's feature\n", "puts df.table.to_s(format: :column)" ] }, { "cell_type": "code", "execution_count": 16, "id": "d1cc17b8-1cfc-4986-9dec-7bca02be32f0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================== 0 ====================\n", "x: 1\n", "y: 1.000000\n", "s: A\n", "b: true\n", "==================== 1 ====================\n", "x: 2\n", "y: 2.000000\n", "s: B\n", "b: false\n", "==================== 2 ====================\n", "x: 3\n", "y: 3.000000\n", "s: C\n", "b: true\n", "==================== 3 ====================\n", "x: 4\n", "y: NaN\n", "s: D\n", "b: false\n", "==================== 4 ====================\n", "x: 5\n", "y: (null)\n", "s: (null)\n", "b: (null)\n" ] } ], "source": [ "# This is also a Red Arrow's feature\n", "puts df.table.to_s(format: :list)" ] }, { "cell_type": "markdown", "id": "16e4ae6b-2399-43f0-be8e-65669b95c7b6", "metadata": {}, "source": [ "## 9. TDR" ] }, { "cell_type": "markdown", "id": "2d14eb4b-9026-4cc5-a71a-598946d40b67", "metadata": {}, "source": [ "TDR means 'Transposed Dataframe Representation'. It shows columns in lateral just the same shape as initializing by a Hash. TDR has some information which is useful for the exploratory data processing.\n", "\n", "- DataFrame shape: n_rows x n_columns\n", "- Data types\n", "- Levels: number of unique elements\n", "- Data preview: same data is aggregated if level is smaller (tally mode)\n", "- Show counts of abnormal element: NaN and nil" ] }, { "cell_type": "code", "execution_count": 17, "id": "8050462f-7c60-41b7-a011-af11763784dc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 5 x 4 Vectors\n", "Vectors : 2 numeric, 1 string, 1 boolean\n", "# key type level data_preview\n", "1 :x uint8 5 [1, 2, 3, 4, 5]\n", "2 :y double 5 [1.0, 2.0, 3.0, NaN, nil], 1 NaN, 1 nil\n", "3 :s string 5 [\"A\", \"B\", \"C\", \"D\", nil], 1 nil\n", "4 :b boolean 3 {true=>2, false=>2, nil=>1}\n" ] } ], "source": [ "# use the same dataframe as #7\n", "df.tdr" ] }, { "cell_type": "code", "execution_count": 18, "id": "bb616ffe-c19a-4b02-a011-601ceb3db656", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] } ], "source": [ "penguins.tdr" ] }, { "cell_type": "markdown", "id": "73b8dc18-079f-4d40-8d0e-239f010550da", "metadata": {}, "source": [ "`#tdr` has some options:\n", "\n", "`limit` : to limit a number of variables to show. Default value is `limit=10`." ] }, { "cell_type": "code", "execution_count": 19, "id": "0962845d-e642-4d2a-9607-43e197b46bc5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", " ... 5 more Vectors ...\n" ] } ], "source": [ "penguins.tdr(3)" ] }, { "cell_type": "markdown", "id": "573606c4-23b9-4b38-8c92-a04f1c1e8781", "metadata": {}, "source": [ "`elements` : max number of elements to show in observations. Default value is `elements: 5`." ] }, { "cell_type": "code", "execution_count": 20, "id": "f957d2bd-e8c0-42a1-a3b4-0a9478e740bf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] } ], "source": [ "penguins.tdr(elements: 3) # Show first 3 items in data" ] }, { "cell_type": "markdown", "id": "d37ece79-1999-49eb-a2d1-831184ee6509", "metadata": {}, "source": [ "`tally` : max level to use tally mode. Level means size of `tally`ed hash. Default value is `tally: 5`." ] }, { "cell_type": "code", "execution_count": 21, "id": "9c1c472c-3d15-4bca-9a1b-7f86c63d3ed8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 [\"Adelie\", \"Adelie\", \"Adelie\", \"Adelie\", \"Adelie\", ... ]\n", "2 :island string 3 [\"Torgersen\", \"Torgersen\", \"Torgersen\", \"Torgersen\", \"Torgersen\", ... ]\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 [\"male\", \"female\", \"female\", nil, \"female\", ... ], 11 nils\n", "8 :year uint16 3 [2007, 2007, 2007, 2007, 2007, ... ]\n" ] } ], "source": [ "penguins.tdr(tally: 0) # Don't use tally mode" ] }, { "cell_type": "markdown", "id": "e3c38037-90a1-4fc5-9904-41fc74085908", "metadata": {}, "source": [ "`#tdr_str` returns a String. `#tdr` do the same thing as `puts #tdr_str`" ] }, { "cell_type": "markdown", "id": "21d68764-1bc1-4915-99b6-5ae938b85999", "metadata": {}, "source": [ "## 10. Size and shape" ] }, { "cell_type": "code", "execution_count": 22, "id": "487399f8-a3ef-467f-aa7f-ecbaee5fcb75", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same as n_rows, n_obs\n", "df.size" ] }, { "cell_type": "code", "execution_count": 23, "id": "dc7441c3-7c85-4ce1-a20e-de8f41f280b4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same as n_cols, n_vars\n", "df.n_keys" ] }, { "cell_type": "code", "execution_count": 24, "id": "3d42fea6-801a-45f4-8e22-ea9d76ae070f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[5, 4]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# [df.size, df.n_keys], [df.n_rows, df.n_cols]\n", "df.shape" ] }, { "cell_type": "markdown", "id": "bc5caa94-325f-4014-9c90-8ac909c2b378", "metadata": {}, "source": [ "## 11. Keys" ] }, { "cell_type": "code", "execution_count": 25, "id": "bb47775f-fed0-42e6-8781-aa8b721d6112", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:x, :y, :s, :b]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.keys" ] }, { "cell_type": "code", "execution_count": 26, "id": "3d540ab0-3e52-47b7-b338-b4e0b3d929cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:species, :island, :bill_length_mm, :bill_depth_mm, :flipper_length_mm, :body_mass_g, :sex, :year]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.keys" ] }, { "cell_type": "markdown", "id": "decc6a61-9994-4d60-9827-b257cafafb70", "metadata": {}, "source": [ "## 12. Types" ] }, { "cell_type": "code", "execution_count": 27, "id": "bf9cd2bc-a213-427e-bc00-f2083b0e0471", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:uint8, :double, :string, :boolean]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.types" ] }, { "cell_type": "code", "execution_count": 28, "id": "b1ecb891-98b5-4919-9f37-1847202007d8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:string, :string, :double, :double, :uint8, :uint16, :string, :uint16]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.types" ] }, { "cell_type": "markdown", "id": "869b3670-62f8-4c23-807b-d6d100a1981e", "metadata": {}, "source": [ "## 13. Data type classes" ] }, { "cell_type": "code", "execution_count": 29, "id": "776ab4db-073b-4b30-931a-8ec77284cdc4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Arrow::UInt8DataType, Arrow::DoubleDataType, Arrow::StringDataType, Arrow::BooleanDataType]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.type_classes" ] }, { "cell_type": "code", "execution_count": 30, "id": "0546a5d0-cab1-4ca8-a2e5-0637d0fd48b6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Arrow::StringDataType, Arrow::StringDataType, Arrow::DoubleDataType, Arrow::DoubleDataType, Arrow::UInt8DataType, Arrow::UInt16DataType, Arrow::StringDataType, Arrow::UInt16DataType]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.type_classes" ] }, { "cell_type": "markdown", "id": "1c2513f6-909e-47fd-a543-66c4f424f44e", "metadata": {}, "source": [ "## 14. Indices" ] }, { "cell_type": "code", "execution_count": 31, "id": "e6e9d7ef-1471-4f23-9210-56045c9fabd5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.indexes\n", "# or\n", "df.indices" ] }, { "cell_type": "markdown", "id": "3908395f-b086-4fbb-9855-e1ce233f0595", "metadata": {}, "source": [ "## 15. To an Array or a Hash" ] }, { "cell_type": "markdown", "id": "22cb724e-cf61-40d9-a58b-9cc793e83645", "metadata": {}, "source": [ "DataFrame#to_a returns an array of row-oriented data without a header." ] }, { "cell_type": "code", "execution_count": 32, "id": "4054daad-9266-4002-8942-c0891050cb4d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[1, 1.0, \"A\", true], [2, 2.0, \"B\", false], [3, 3.0, \"C\", true], [4, NaN, \"D\", false], [5, nil, nil, nil]]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_a" ] }, { "cell_type": "markdown", "id": "f6abae59-fe31-4056-9de8-7c36e35235de", "metadata": {}, "source": [ "If you need a column-oriented array with keys, use `.to_h.to_a`" ] }, { "cell_type": "code", "execution_count": 33, "id": "d3631290-eb74-4d21-a469-86381c668c7f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>[1, 2, 3, 4, 5], :y=>[1.0, 2.0, 3.0, NaN, nil], :s=>[\"A\", \"B\", \"C\", \"D\", nil], :b=>[true, false, true, false, nil]}" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_h" ] }, { "cell_type": "code", "execution_count": 34, "id": "08c45e92-f640-4e62-bc96-ee259d0ecff4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[:x, [1, 2, 3, 4, 5]], [:y, [1.0, 2.0, 3.0, NaN, nil]], [:s, [\"A\", \"B\", \"C\", \"D\", nil]], [:b, [true, false, true, false, nil]]]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_h.to_a" ] }, { "cell_type": "markdown", "id": "39b65fc0-4405-4414-9a74-91c724ef587c", "metadata": {}, "source": [ "## 16. Schema" ] }, { "cell_type": "code", "execution_count": 35, "id": "36db7842-e9b0-4473-84d4-3aef987d427f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>:uint8, :y=>:double, :s=>:string, :b=>:boolean}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.schema" ] }, { "cell_type": "markdown", "id": "3e61237d-ac67-45bb-827c-a769dff61809", "metadata": {}, "source": [ "## 17. Vector" ] }, { "cell_type": "markdown", "id": "27402307-aaad-49c8-88ca-65346668601d", "metadata": {}, "source": [ "Each variable (column in the table) is represented by a Vector object." ] }, { "cell_type": "code", "execution_count": 36, "id": "6c9ba041-231d-4057-a280-acf620b68525", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[:x] # This syntax comes later" ] }, { "cell_type": "markdown", "id": "3e13d06d-b432-45b2-9745-0c6ef9228e23", "metadata": {}, "source": [ "Or create new Vector by the constructor." ] }, { "cell_type": "code", "execution_count": 37, "id": "3e18a4e0-238c-4800-8bda-a88a57dde3e9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new(1, 2, 3, 4, 5)" ] }, { "cell_type": "code", "execution_count": 38, "id": "3bd55d9d-b988-46b2-bc11-e3dc5f4adc6c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new(1..5)" ] }, { "cell_type": "code", "execution_count": 39, "id": "19688e6e-b59b-4a84-8c07-57e87cd0e242", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new([1, 2, 3], [4, 5])" ] }, { "cell_type": "code", "execution_count": 40, "id": "076bd0e2-01ab-4497-9b9b-84f72a4805bc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array = Arrow::Array.new([1, 2, 3, 4, 5])\n", "Vector.new(array)" ] }, { "cell_type": "markdown", "id": "22091661-e78a-4c66-9e48-4c3c676469b4", "metadata": {}, "source": [ "- TODO: `Vector[1..5]` as a constructor" ] }, { "cell_type": "markdown", "id": "b729bdba-87a2-4282-bd0e-319fe17f42da", "metadata": {}, "source": [ "## 18. Vectors" ] }, { "cell_type": "markdown", "id": "f5ddd840-2f84-467b-a9bb-feb769573b69", "metadata": {}, "source": [ "Returns an Array of Vectors in a DataFrame." ] }, { "cell_type": "code", "execution_count": 41, "id": "d3ae03f2-e2fe-4a15-abe1-331185448d61", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[#\n", "[1, 2, 3, 4, 5]\n", ", #\n", "[1.0, 2.0, 3.0, NaN, nil]\n", ", #\n", "[\"A\", \"B\", \"C\", \"D\", nil]\n", ", #\n", "[true, false, true, false, nil]\n", "]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.vectors" ] }, { "cell_type": "markdown", "id": "8ac88ff3-0cb6-43d6-a999-0c2e8c6defb7", "metadata": { "tags": [] }, "source": [ "## 19. Variables\n", "\n", "Returns key and Vector pairs in a Hash." ] }, { "cell_type": "code", "execution_count": 42, "id": "3351a216-6fe5-485e-8686-53c1e754fa2e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>#\n", "[1, 2, 3, 4, 5]\n", ", :y=>#\n", "[1.0, 2.0, 3.0, NaN, nil]\n", ", :s=>#\n", "[\"A\", \"B\", \"C\", \"D\", nil]\n", ", :b=>#\n", "[true, false, true, false, nil]\n", "}" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.variables" ] }, { "cell_type": "markdown", "id": "3b518c1c-eda7-406f-a885-b2344b1726eb", "metadata": {}, "source": [ "## 20. Select columns by #[ ]" ] }, { "cell_type": "markdown", "id": "767b4e49-19eb-4d5f-b030-91bd78f0f5b9", "metadata": {}, "source": [ "`DataFrame#[]` is overloading column operations and row operations.\n", "\n", "- For columns (variables)\n", " - Key in a Symbol: `df[:symbol]`\n", " - Key in a String: `df[\"string\"]`\n", " - Keys in an Array: `df[:symbol1, \"string\", :symbol2]`\n", " - Keys by indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`" ] }, { "cell_type": "code", "execution_count": 43, "id": "ccf60edc-cccf-49e3-a503-1ca532247130", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys in a Symbol and a String\n", "df[:x, 'y']" ] }, { "cell_type": "code", "execution_count": 44, "id": "8500f8c0-ff5a-4537-9f47-03d675e31b18", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys in a Range\n", "df['x'..'y']" ] }, { "cell_type": "code", "execution_count": 45, "id": "db35cae1-35c2-47de-a7e8-906161f21282", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
sbx
Atrue1
Bfalse2
Ctrue3
Dfalse4
(nil)(nil)5
" ], "text/plain": [ "#\n", " s b x\n", " \n", "1 A true 1\n", "2 B false 2\n", "3 C true 3\n", "4 D false 4\n", "5 (nil) (nil) 5\n" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys with a index Range, and a symbol\n", "df[df.keys[2..], :x]" ] }, { "cell_type": "markdown", "id": "03e14403-f7bc-4350-9e7b-715901164331", "metadata": {}, "source": [ "## 21. Select rows by #[ ]\n", "`DataFrame#[]` is overloading column operations and row operations.\n", "\n", "- For rows (observations)\n", " - Select rows by a Index: `df[index]`\n", " - Select rows by Indices: `df[indices]` # Array, Arrow::Array, Vectors are acceptable for indices\n", " - Select rows by Ranges: `df[range]`\n", " - Select rows by Booleans: `df[booleans]` # Array, Arrow::Array, Vectors are acceptable for booleans" ] }, { "cell_type": "code", "execution_count": 46, "id": "e3bc60a7-611e-4fd8-9770-8e0d167d3fee", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
22.0Bfalse
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n", "3 2 2.0 B false\n" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# indices\n", "df[0, 2, 1]" ] }, { "cell_type": "code", "execution_count": 47, "id": "2b8b3801-ae37-4629-9db5-ff937941c895", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
22.0Bfalse
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 2 2.0 B false\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# including a Range\n", "# negative indices are also acceptable\n", "df[1..2, -1]" ] }, { "cell_type": "code", "execution_count": 48, "id": "3f6f8d73-a66c-4773-9bf5-0878c700f2d6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
22.0Bfalse
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 2 2.0 B false\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# booleans\n", "# length of boolean should be the same as self\n", "df[false, true, true, false, true]" ] }, { "cell_type": "code", "execution_count": 49, "id": "abe57279-54fd-48ec-a1a4-c7453211e776", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow::Array\n", "indices = Arrow::UInt8Array.new([0,2,4])\n", "df[indices]" ] }, { "cell_type": "code", "execution_count": 50, "id": "2266611f-23d8-4645-a1e8-b07c2370fb3f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
33.0Ctrue
4NaNDfalse
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 3 3.0 C true\n", "2 4 NaN D false\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# By a Vector as indices\n", "indices = Vector.new(df.indices)\n", "# indices > 1 returns a boolean Vector\n", "df[indices > 1]" ] }, { "cell_type": "code", "execution_count": 51, "id": "0ea2da7e-aeca-4874-be4a-6af563aa378b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, true, false, nil]\n" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# By a Vector as booleans\n", "booleans = df[:b]" ] }, { "cell_type": "code", "execution_count": 52, "id": "9f842890-6359-4266-9a23-2f8f813ef548", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[booleans]" ] }, { "cell_type": "markdown", "id": "98a04874-cb2c-44c0-b410-b330b9d12b0f", "metadata": {}, "source": [ "## 22. empty?" ] }, { "cell_type": "code", "execution_count": 53, "id": "7b1ab319-90a7-4f09-8629-04dcd94076cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.empty?" ] }, { "cell_type": "code", "execution_count": 54, "id": "1e09c32f-20a8-4175-827f-cdb98063535a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame.new.empty?" ] }, { "cell_type": "code", "execution_count": 55, "id": "3f9f8771-87dd-44eb-8aac-6a3ed8b4c183", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(empty DataFrame)" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame.new" ] }, { "cell_type": "markdown", "id": "86b826dd-10e6-4087-9162-b89ac6561a61", "metadata": {}, "source": [ "## 23. Select columns by pick" ] }, { "cell_type": "markdown", "id": "b5aefd22-4e96-4dc5-91d2-e6826256bda6", "metadata": { "tags": [] }, "source": [ "`DataFrame#pick` accepts an Array of keys to pick up columns (variables). You can change the order of columns at a same time." ] }, { "cell_type": "code", "execution_count": 56, "id": "68124521-b823-424d-9e06-d11aa927d618", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
sy
A1.0
B2.0
C3.0
DNaN
(nil)(nil)
" ], "text/plain": [ "#\n", " s y\n", " \n", "1 A 1.0\n", "2 B 2.0\n", "3 C 3.0\n", "4 D NaN\n", "5 (nil) (nil)\n" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(:s, :y)\n", "# or\n", "df.pick([:s, :y]) # OK too." ] }, { "cell_type": "markdown", "id": "a76dca00-da8f-4959-be18-7a1015a9d13c", "metadata": {}, "source": [ "Or use a boolean Array of lengeh `n_key` to `pick`. This style remains the order of variables." ] }, { "cell_type": "code", "execution_count": 57, "id": "b91f8925-529c-43c9-93ba-e21bcac0f2f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(false, true, true, false)\n", "# or\n", "df.pick([false, true, true, false]) # OK" ] }, { "cell_type": "markdown", "id": "5f903182-745b-4923-99d8-14a9b9c6ea4c", "metadata": {}, "source": [ "`#pick` also accepts a block in the context of self.\n", "\n", "Next example is picking up numeric variables." ] }, { "cell_type": "code", "execution_count": 58, "id": "37bb0a49-c38a-484c-91d4-3e23ab43a727", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reciever is required with the argument style\n", "df.pick(df.vectors.map(&:numeric?))\n", "\n", "# with a block\n", "df.pick { vectors.map(&:numeric?) }" ] }, { "cell_type": "markdown", "id": "e51f07c0-54eb-4114-8cd6-63c7780e7248", "metadata": {}, "source": [ "The name `pick` comes from the action to pick variables(columns) according to the label keys." ] }, { "cell_type": "markdown", "id": "7c1815e4-de6c-425e-8602-b8dd66836250", "metadata": {}, "source": [ "## 24. Reject columns by drop" ] }, { "cell_type": "markdown", "id": "d1ab045e-66f9-4922-8bf2-35aee7f2812e", "metadata": { "tags": [] }, "source": [ "`DataFrame#drop` accepts an Array keys to drop columns (variables) to create remainer DataFrame." ] }, { "cell_type": "code", "execution_count": 59, "id": "7ccace08-62b0-4b0b-93fb-81edf673abf7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(:x, :b)\n", "# df.drop([:x, :b]) #is OK too." ] }, { "cell_type": "markdown", "id": "2085b349-95c5-4607-b029-f7c3d630ac1c", "metadata": {}, "source": [ "Or use a boolean Array of lengeh `n_key` to `drop`." ] }, { "cell_type": "code", "execution_count": 60, "id": "785c02f1-1e16-4722-9961-4b49223c8290", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(true, false, false, true)\n", "# df.drop([true, false, false, true]) # is OK too" ] }, { "cell_type": "markdown", "id": "d246161e-02cc-40fb-8921-26b37eb3956f", "metadata": {}, "source": [ "`#drop` also accepts a block in the context of self.\n", "\n", "Next example will drop variables which have nil or NaN values." ] }, { "cell_type": "code", "execution_count": 61, "id": "069932e3-d393-4ede-9eb5-7aac8625e0c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop { vectors.map { |v| v.is_na.any } }" ] }, { "cell_type": "markdown", "id": "88b064d6-7d90-4a0b-b9c8-d92e103269fb", "metadata": {}, "source": [ "Argument style is also acceptable but it requires the reciever 'df'." ] }, { "cell_type": "code", "execution_count": 62, "id": "3003a5c2-0966-4f2c-9643-59e8b546c8aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(df.vectors.map { |v| v.is_na.any })" ] }, { "cell_type": "markdown", "id": "c6fce15c-d4a9-4281-9c07-457e78d3c13e", "metadata": {}, "source": [ "The name `drop` comes from the pair word of `pick`." ] }, { "cell_type": "markdown", "id": "0f6dc86c-828d-4f9f-8b07-fce63c30fdca", "metadata": {}, "source": [ "## 25. Pick/drop and nil" ] }, { "cell_type": "markdown", "id": "0a108878-565b-400e-9a47-a15aae09429c", "metadata": {}, "source": [ "When `pick` or `drop` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `BasicObject#!`." ] }, { "cell_type": "code", "execution_count": 63, "id": "7c01fbb4-9bfa-4afc-8e6b-45c97c0beb03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = [true, true, false, nil]\n", "booleans_invert = booleans.map(&:!) # => [false, false, true, true] because nil.! is true\n", "df.pick(booleans) == df.drop(booleans_invert)" ] }, { "cell_type": "markdown", "id": "12a24264-9b7a-42a1-a541-e292e3876e35", "metadata": {}, "source": [ "## 26. Vector#invert, #primitive_invert" ] }, { "cell_type": "code", "execution_count": 64, "id": "ea352e12-7e8a-43be-b8ac-797adbc47708", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, false, nil]\n" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = Vector.new(booleans)" ] }, { "cell_type": "markdown", "id": "2a0f82e0-157b-4185-9254-0618be291f9b", "metadata": {}, "source": [ "nil is converted to nil by `Vector#invert`." ] }, { "cell_type": "code", "execution_count": 65, "id": "596c521f-12bf-4448-9e5d-e1b4a2c3d896", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, nil]\n" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.invert\n", "# or\n", "!vector" ] }, { "cell_type": "markdown", "id": "a1aec910-3055-4627-a02b-22d45f2ceb70", "metadata": {}, "source": [ "So `df.pick(booleans) != df.drop(booleans.invert)` when booleans have any nils.\n", "\n", "On the other hand, `Vector#primitive_invert` follows Ruby's `BasicObject#!`'s behavior. Then pick and drop keep 'MECE' behavior." ] }, { "cell_type": "code", "execution_count": 66, "id": "4dcaba48-1cea-4ce9-b4a9-b079b43af7ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, true]\n" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.primitive_invert" ] }, { "cell_type": "code", "execution_count": 67, "id": "c7ae4dad-275a-49e0-a0b0-bf3686248070", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(vector) == df.drop(vector.primitive_invert)" ] }, { "cell_type": "markdown", "id": "9a6cec74-43f0-4a72-8262-25b1e311f602", "metadata": {}, "source": [ "## 27. Pick/drop and [ ]" ] }, { "cell_type": "markdown", "id": "32c8f74d-b3ce-4305-9af7-6ea70052c773", "metadata": {}, "source": [ "When `pick` or `drop` select a single column (variable), it returns a `DataFrame` with one column (variable)." ] }, { "cell_type": "code", "execution_count": 68, "id": "e13aee24-cac6-41ad-b8a3-0ec26edbe5d1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(:x) # or\n", "df.drop(:y, :s, :b)" ] }, { "cell_type": "markdown", "id": "3e47b9d2-929e-4674-9690-0a1fdf7b0a7d", "metadata": {}, "source": [ "In contrast, when `[]` selects a single column (variable), it returns a `Vector`." ] }, { "cell_type": "code", "execution_count": 69, "id": "60d228be-7357-434d-9d39-ee72c110e6fe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[:x]" ] }, { "cell_type": "markdown", "id": "6d973934-e08b-4b45-8efb-52f9167e7238", "metadata": {}, "source": [ "This behavior may be useful to use in a block of DataFrame manipulation verbs (like pick, drop, slice, remove, assign, rename)." ] }, { "cell_type": "markdown", "id": "34c9bcb0-889a-4190-b2b8-49765cd059c2", "metadata": {}, "source": [ "## 28. Slice" ] }, { "cell_type": "markdown", "id": "9a428ba8-c306-4ab8-8607-51174e8e6ebe", "metadata": {}, "source": [ "`slice` selects rows (observations) to create a subset of a DataFrame." ] }, { "cell_type": "markdown", "id": "6016d6d4-72d6-4ae2-b7dd-3d526c91ae61", "metadata": {}, "source": [ "`slice(indeces)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. Negative index from the tail like Ruby's Array is also acceptable." ] }, { "cell_type": "code", "execution_count": 70, "id": "9cdce2e4-7876-4be6-bd1f-bc8ab6e6c871", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <10 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", " 8 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", " 9 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "10 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns 5 rows at the start and 5 rows from the end\n", "penguins.slice(0...5, -5..-1)" ] }, { "cell_type": "code", "execution_count": 71, "id": "380ab809-09ae-4e69-a8e6-8d53d1e7822d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <1 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieBiscoe42.219.51974275male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ... year\n", " ... \n", "1 Adelie Biscoe 42.2 19.5 197 4275 ... 2009\n" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# slice accepts Float index\n", "# 33% of 344 observations in index => 113.52 th data ??\n", "penguins.slice(penguins.size * 0.33)" ] }, { "cell_type": "markdown", "id": "8139bb28-89f8-4058-b824-dde33ead0b60", "metadata": {}, "source": [ "Indices in Vectors or Arrow::Arrays are also acceptable." ] }, { "cell_type": "markdown", "id": "6f79db8c-c706-4d60-949b-3f644474d375", "metadata": {}, "source": [ "Another way to select in `slice` is to use booleans.\n", "- Booleans is an Array, Arrow::Array, Vector or their Array.\n", "- Each data type must be boolean.\n", "- Size of booleans must be same as the size of self." ] }, { "cell_type": "code", "execution_count": 72, "id": "f58ca131-7375-4489-90ce-6ba54b898eb5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, nil, false, false, false, false, false, true, false, false, ... ]\n" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# make booleans to check over 40\n", "booleans = penguins[:bill_length_mm] > 40" ] }, { "cell_type": "code", "execution_count": 73, "id": "176ab365-c66a-4712-97b9-4381a536321b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <242 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen42.020.21904250(nil)2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen42.520.71974500male2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 2 Adelie Torgersen 42.0 20.2 190 ... 2007\n", " 3 Adelie Torgersen 41.1 17.6 182 ... 2007\n", " 4 Adelie Torgersen 42.5 20.7 197 ... 2007\n", " 5 Adelie Torgersen 46.0 21.5 194 ... 2007\n", " : : : : : : ... :\n", "240 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "241 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "242 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.slice(booleans)" ] }, { "cell_type": "markdown", "id": "3264a182-6b72-461a-b712-c3b708c53516", "metadata": {}, "source": [ "`slice` accepts a block.\n", "- We can't use both arguments and a block at a same time.\n", "- The block should return indeces in any length or a boolean Array with a same length as `size`.\n", "- Block is called in the context of self. So reciever 'self' can be omitted in the block." ] }, { "cell_type": "code", "execution_count": 74, "id": "c95d3426-0bbb-430e-8d83-6e22434d99ed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <204 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen39.320.61903650male2007
GentooBiscoe47.213.72144925female2009
GentooBiscoe46.814.32154850female2009
GentooBiscoe45.214.82125200female2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " 5 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " : : : : : : ... :\n", "202 Gentoo Biscoe 47.2 13.7 214 ... 2009\n", "203 Gentoo Biscoe 46.8 14.3 215 ... 2009\n", "204 Gentoo Biscoe 45.2 14.8 212 ... 2009\n" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# return a DataFrame with bill_length_mm is in 2*std range around mean\n", "penguins.slice do\n", " vector = self[:bill_length_mm]\n", " min = vector.mean - vector.std\n", " max = vector.mean + vector.std\n", " vector.to_a.map { |e| (min..max).include? e }\n", "end" ] }, { "cell_type": "markdown", "id": "4fa42801-64f5-4432-856b-85c26a68515d", "metadata": {}, "source": [ "## 29. Slice and nil option" ] }, { "cell_type": "markdown", "id": "31017a7e-0923-4283-bc92-246ebe2591c3", "metadata": {}, "source": [ "`Arrow::Table#slice` uses `#filter` method with a option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row." ] }, { "cell_type": "code", "execution_count": 75, "id": "8e4a8108-154b-4621-acd1-704ddf229d61", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\t a\tb\t c\n", "0\t 1\tA\t 1.000000\n", "1\t(null)\t(null)\t (null)\n" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }\n", "table = Arrow::Table.new(hash)\n", "table.slice([true, false, nil])" ] }, { "cell_type": "markdown", "id": "dbb57c5a-e949-42b8-a82c-9affb3fe5b7b", "metadata": {}, "source": [ "Whereas in RedAmber, `DataFrame#slice` with booleans containing nil is treated as false. This behavior comes from `Allow::FilterOptions.null_selection_behavior = :drop`. This is a default value for `Arrow::Table.filter` method." ] }, { "cell_type": "code", "execution_count": 76, "id": "851c3bf6-b9e9-41bd-92c5-5372ed934549", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\ta\tb\t c\n", "0\t1\tA\t 1.000000\n" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "RedAmber::DataFrame.new(table).slice([true, false, nil]).table" ] }, { "cell_type": "markdown", "id": "56398a3d-6146-43af-8b96-fec37730fc49", "metadata": {}, "source": [ "## 30. Remove" ] }, { "cell_type": "markdown", "id": "9e042a97-8a5d-412e-8e4a-fda382225a2d", "metadata": {}, "source": [ "Slice and reject rows (observations) to create a remainer DataFrame." ] }, { "cell_type": "markdown", "id": "2b4cbb97-eef3-4db8-8f25-c44c208ec554", "metadata": {}, "source": [ "`#remove(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer." ] }, { "cell_type": "code", "execution_count": 77, "id": "17e38ab8-886b-4114-bcaf-ee18df7d00cd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <334 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.320.61903650male2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male2007
AdelieTorgersen34.118.11933475(nil)2007
GentooBiscoe44.515.72174875(nil)2009
GentooBiscoe48.816.22226000male2009
GentooBiscoe47.213.72144925female2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " 2 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " 3 Adelie Torgersen 39.2 19.6 195 ... 2007\n", " 4 Adelie Torgersen 34.1 18.1 193 ... 2007\n", " 5 Adelie Torgersen 42.0 20.2 190 ... 2007\n", " : : : : : : ... :\n", "332 Gentoo Biscoe 44.5 15.7 217 ... 2009\n", "333 Gentoo Biscoe 48.8 16.2 222 ... 2009\n", "334 Gentoo Biscoe 47.2 13.7 214 ... 2009\n" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns 6th to 339th obs. Remainer of 1st example of #30\n", "penguins.remove(0...5, -5..-1)" ] }, { "cell_type": "markdown", "id": "def1c1c4-6b60-4864-ae24-c797fbf008a7", "metadata": {}, "source": [ "`remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `#size`." ] }, { "cell_type": "code", "execution_count": 78, "id": "6f169420-7eb2-457f-8d59-7a5c90aa3fa5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove all observation contains nil\n", "removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }" ] }, { "cell_type": "markdown", "id": "5f1864c9-4ae4-4fcd-9840-ea424ef5e27d", "metadata": {}, "source": [ "`remove {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as size. Block is called in the context of self." ] }, { "cell_type": "code", "execution_count": 79, "id": "a6807c65-25e5-4ee1-8d1b-6018c46b3999", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <140 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen34.118.11933475(nil)2007
AdelieTorgersen37.817.11863300(nil)2007
GentooBiscoe(nil)(nil)(nil)(nil)(nil)2009
GentooBiscoe50.415.72225750male2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 2 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 3 Adelie Torgersen 34.1 18.1 193 ... 2007\n", " 4 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " 5 Adelie Torgersen 37.8 17.3 180 ... 2007\n", " : : : : : : ... :\n", "138 Gentoo Biscoe (nil) (nil) (nil) ... 2009\n", "139 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "140 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove data in 2*std range around mean\n", "penguins.remove do\n", " vector = self[:bill_length_mm]\n", " min = vector.mean - vector.std\n", " max = vector.mean + vector.std\n", " vector.to_a.map { |e| (min..max).include? e }\n", "end" ] }, { "cell_type": "markdown", "id": "591e6b22-da98-4336-b22e-c7bc9bcf2ebf", "metadata": {}, "source": [ "## 31. Remove and nil" ] }, { "cell_type": "markdown", "id": "67926d1b-c76e-4cb7-b679-6545d850e7e4", "metadata": {}, "source": [ "When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`." ] }, { "cell_type": "code", "execution_count": 80, "id": "8575614e-f702-4ee4-ac7b-745e9b32e803", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
abc
1A1.0
2B2.0
(nil)C3.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n", "2 2 B 2.0\n", "3 (nil) C 3.0\n" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])" ] }, { "cell_type": "code", "execution_count": 81, "id": "932a5e71-8cef-44e5-a789-ce97329bc001", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil]\n" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = df[:a] < 2" ] }, { "cell_type": "code", "execution_count": 82, "id": "74cf6aa6-8913-433d-97ad-bba2d548afe5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[false, true, true]" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans_invert = booleans.to_a.map(&:!)" ] }, { "cell_type": "code", "execution_count": 83, "id": "5e466a06-cb17-4dc1-a5b0-34bfd3ffb78b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(booleans) == df.remove(booleans_invert)" ] }, { "cell_type": "markdown", "id": "8bca0b06-2d08-4c28-8b4c-4fd088f2d2d3", "metadata": {}, "source": [ "Whereas `Vector#invert` returns nil for elements nil. This will bring different result. (See #26)" ] }, { "cell_type": "code", "execution_count": 84, "id": "077b216f-0a08-413e-95c9-12789d15a9ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, nil]\n" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.invert" ] }, { "cell_type": "code", "execution_count": 85, "id": "b3df62a6-c4a3-44cb-bde6-f6be12b120c8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
abc
1A1.0
(nil)C3.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n", "2 (nil) C 3.0\n" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(booleans.invert)" ] }, { "cell_type": "markdown", "id": "e05f00b6-3bae-4650-8bbc-d4e0692f6f85", "metadata": {}, "source": [ "We have `#primitive_invert` method in Vector. This method returns the same result as `.to_a.map(&:!)` above." ] }, { "cell_type": "code", "execution_count": 86, "id": "296ca3cd-a6da-4603-a576-d8c36a810e4f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, true]\n" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.primitive_invert" ] }, { "cell_type": "code", "execution_count": 87, "id": "ba5b8c0b-b94e-4209-adcd-258ea3b87bfd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <1 x 3 vectors>
abc
1A1.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(booleans.primitive_invert)" ] }, { "cell_type": "code", "execution_count": 88, "id": "2446792f-0b0a-4642-acae-b4fec89261c1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(booleans) == df.remove(booleans.primitive_invert)" ] }, { "cell_type": "markdown", "id": "7c23a4ad-0c17-4178-b58a-abfd8153d49b", "metadata": {}, "source": [ "## 32. Remove nil" ] }, { "cell_type": "markdown", "id": "84c7238b-1029-416f-b495-9d045f77b22c", "metadata": {}, "source": [ "Remove any observations containing nil." ] }, { "cell_type": "code", "execution_count": 89, "id": "de4bb615-d14d-4c90-ab54-db2f375b9f00", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.remove_nil" ] }, { "cell_type": "markdown", "id": "4a4ae8f9-dcf8-4dad-bb77-af076e9cadb5", "metadata": {}, "source": [ "The roundabout way for this is to use `#remove`." ] }, { "cell_type": "code", "execution_count": 90, "id": "27a3da5f-0ea2-4c5d-a6c3-c0e20f2224a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.remove { vectors.map(&:is_nil).reduce(&:|) }" ] }, { "cell_type": "markdown", "id": "4f2a58fd-f033-44f6-9eb4-ed893a2b5d1d", "metadata": {}, "source": [ "## 33. Rename" ] }, { "cell_type": "markdown", "id": "c0d39506-8ae5-48e7-9dd2-acf38d4ec1a9", "metadata": {}, "source": [ "Rename keys (column names) to create a updated DataFrame." ] }, { "cell_type": "markdown", "id": "3f6924ec-e86c-4089-ae40-6783027d3ce0", "metadata": {}, "source": [ "`#rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` ." ] }, { "cell_type": "code", "execution_count": 91, "id": "9396c96d-83d7-4b92-a4ca-27bc9e4d7b9d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }\n", "comecome = RedAmber::DataFrame.new(h)" ] }, { "cell_type": "code", "execution_count": 92, "id": "fad279c6-1ca0-4493-bd69-0e9ef011bff7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage_in_1993
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age_in_1993\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comecome.rename(:age => :age_in_1993)" ] }, { "cell_type": "markdown", "id": "9dabb005-9822-4c4b-aaa5-fa6f28f2ed43", "metadata": {}, "source": [ "`#rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of {existing_key => new_key}. Block is called in the context of self." ] }, { "cell_type": "markdown", "id": "aabbba20-6ef8-4da2-8dc0-0cb243cf3b23", "metadata": {}, "source": [ "Symbol key and String key are distinguished." ] }, { "cell_type": "markdown", "id": "07f98b31-6123-4466-b4f8-f995c7cde474", "metadata": {}, "source": [ "## 34. Assign" ] }, { "cell_type": "markdown", "id": "99f6787f-2b36-4360-b155-1c2d7874d25e", "metadata": {}, "source": [ "Assign new or updated columns (variables) and create a updated DataFrame.\n", "\n", "- Columns with new keys will append new variables at right (bottom in TDR).\n", "- Columns with exisiting keys will update corresponding vectors." ] }, { "cell_type": "markdown", "id": "b4b22da0-4ee2-4196-88e1-1cfea6a72f4d", "metadata": {}, "source": [ "`#assign(key_pairs)` accepts pairs of key and values as arguments. key_pairs should be a Hash of `{key => array}` or `{key => Vector}` ." ] }, { "cell_type": "code", "execution_count": 93, "id": "56dcfed8-a6f9-4d8c-bac3-e8ce7c0674a7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comecome = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28] )" ] }, { "cell_type": "code", "execution_count": 94, "id": "8da8d282-8798-44d5-bb7b-7fa2df922308", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
nameagebrother
Yasuko97Santa
Rui78(nil)
Hinata57Momotaro
" ], "text/plain": [ "#\n", " name age brother\n", " \n", "1 Yasuko 97 Santa\n", "2 Rui 78 (nil)\n", "3 Hinata 57 Momotaro\n" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# update :age and add :brother\n", "assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }\n", "comecome.assign(assigner)" ] }, { "cell_type": "markdown", "id": "e6d3ddfc-b16d-4b20-83df-357e9cdb32e6", "metadata": {}, "source": [ "`#assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and values as a Hash of `{key => array}` or `{key => Vector}`. Block is called in the context of self." ] }, { "cell_type": "code", "execution_count": 95, "id": "8d69edd0-7ad7-4318-8033-1785ce2543db", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
indexfloatstring
00.0A
11.1B
22.2C
3NaND
(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " index float string\n", " \n", "1 0 0.0 A\n", "2 1 1.1 B\n", "3 2 2.2 C\n", "4 3 NaN D\n", "5 (nil) (nil) (nil)\n" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.new(\n", " index: [0, 1, 2, 3, nil],\n", " float: [0.0, 1.1, 2.2, Float::NAN, nil],\n", " string: ['A', 'B', 'C', 'D', nil])" ] }, { "cell_type": "code", "execution_count": 96, "id": "e884af01-d82b-42e7-8e92-62baf19919cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
indexfloatstring
0-0.0A
255-1.1B
254-2.2C
253NaND
(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " index float string\n", " \n", "1 0 -0.0 A\n", "2 255 -1.1 B\n", "3 254 -2.2 C\n", "4 253 NaN D\n", "5 (nil) (nil) (nil)\n" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# update numeric variables\n", "df.assign do\n", " assigner = {}\n", " vectors.each_with_index do |v, i|\n", " assigner[keys[i]] = -v if v.numeric?\n", " end\n", " assigner\n", "end" ] }, { "cell_type": "markdown", "id": "c5c83559-f4d8-4ed2-8b20-5c50eb1faa14", "metadata": {}, "source": [ "## 35. Coerce (Vector)" ] }, { "cell_type": "markdown", "id": "77bdfc69-b728-4335-b76e-e4be92f94310", "metadata": {}, "source": [ "Vector has coerce method." ] }, { "cell_type": "code", "execution_count": 97, "id": "2bfbe584-be54-486b-af32-e76b37c10e49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3]\n" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = RedAmber::Vector.new(1,2,3)" ] }, { "cell_type": "code", "execution_count": 98, "id": "ce35d901-38a8-4f13-b2d1-29b83f6c5438", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1, -2, -3]\n" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Vector's `#*` method\n", "vector * -1" ] }, { "cell_type": "code", "execution_count": 99, "id": "7d5fc2be-f590-4678-92e9-faa27b618266", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1, -2, -3]\n" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# coerced calculation\n", "-1 * vector" ] }, { "cell_type": "code", "execution_count": 100, "id": "fa90a6af-add7-42f2-9707-7d726575aeb6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[255, 254, 253]\n" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# `@-` operator\n", "-vector" ] }, { "cell_type": "markdown", "id": "4820b527-44e9-4738-aa0e-73604078b3b0", "metadata": { "tags": [] }, "source": [ "## 36. to_ary (Vector)" ] }, { "cell_type": "markdown", "id": "8507dcc4-74e3-44ad-aa54-cf43d55f2131", "metadata": {}, "source": [ "`Vector#to_ary` will enable implicit conversion to an Array." ] }, { "cell_type": "code", "execution_count": 101, "id": "b12bd7c8-2981-426c-8ae3-154504a8ea15", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3, 4, 5]" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Array(Vector.new([3, 4, 5]))" ] }, { "cell_type": "code", "execution_count": 102, "id": "c0cb5a98-7cdf-43a8-b2f7-f9df1961c761", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, 4, 5]" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[1, 2] + Vector.new([3, 4, 5])" ] }, { "cell_type": "markdown", "id": "216dde4f-e4d8-4f29-903a-8cbf75de5b8e", "metadata": {}, "source": [ "## 37. Fill nil (Vector)" ] }, { "cell_type": "markdown", "id": "1959d0d7-6d09-4fa5-9365-1e2f7fc35d61", "metadata": {}, "source": [ "`Vector#fill_nil_forward` or `Vector#fill_nil_backward` will\n", "propagate the last valid observation forward (or backward).\n", "Or preserve nil if all previous values are nil or at the end." ] }, { "cell_type": "code", "execution_count": 103, "id": "d003b06a-859f-4de0-9e35-803efac85169", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[0, 1, 1, 3, 3]\n" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integer = Vector.new([0, 1, nil, 3, nil])\n", "integer.fill_nil_forward" ] }, { "cell_type": "code", "execution_count": 104, "id": "c5d74006-d364-4e86-8a5e-9e96e87a96e0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[0, 1, 3, 3, nil]\n" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integer.fill_nil_backward" ] }, { "cell_type": "markdown", "id": "347785a6-eab0-4864-a871-2c320005211e", "metadata": {}, "source": [ "## 38. all?/any? (Vector)" ] }, { "cell_type": "markdown", "id": "f82a6f5d-03d3-4645-85f5-d25999165378", "metadata": {}, "source": [ "`Vector#all?` returns true if all elements is true.\n", "\n", "`Vector#any?` returns true if exists any true.\n", "\n", "These are unary aggregation function." ] }, { "cell_type": "code", "execution_count": 105, "id": "ebad37ad-0a09-48b1-ba3a-4e030a917837", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = Vector.new([true, true, nil])\n", "booleans.all?" ] }, { "cell_type": "code", "execution_count": 106, "id": "97fc24da-03d4-406d-b353-562896775d60", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.any?" ] }, { "cell_type": "markdown", "id": "0ff3b22d-9f7c-42f2-8d18-c89a06af681b", "metadata": {}, "source": [ "If these methods are used with option `skip_nulls: false` nil is considered." ] }, { "cell_type": "code", "execution_count": 107, "id": "3e0e5800-665a-4a05-b2cb-d152f3f077de", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.all?(skip_nulls: false)" ] }, { "cell_type": "code", "execution_count": 108, "id": "3e43f0c4-a254-4735-ac28-de14d2670c67", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.any?(skip_nulls: false)" ] }, { "cell_type": "markdown", "id": "abc71a85-7958-4a21-91cf-8c96c0784525", "metadata": {}, "source": [ "## 39. count/count_uniq (Vector)" ] }, { "cell_type": "markdown", "id": "3d556118-4105-4d12-806d-ba56c6ae3d1b", "metadata": {}, "source": [ "`Vector#count` counts element.\n", "\n", "`Vector#count_uniq` counts unique element. `#count_distinct` is an alias (Arrow's name).\n", "\n", "These are unary aggregation function." ] }, { "cell_type": "code", "execution_count": 109, "id": "2af73e32-1d7e-4f80-b54e-c40ef08b7034", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = Vector.new(%w[A B A])\n", "string.count" ] }, { "cell_type": "code", "execution_count": 110, "id": "fe6d8d85-27b0-438f-b1b4-1b15e9eb05f9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string.count_uniq # count_distinct is also OK" ] }, { "cell_type": "markdown", "id": "70abed9f-665a-4ea7-939e-4b185ee53755", "metadata": {}, "source": [ "## 40. stddev/variance (Vector)" ] }, { "cell_type": "markdown", "id": "965de338-b3be-4d33-92e1-5ad7e2ed18f0", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 111, "id": "0afec200-f377-432b-a260-ae5a0c5ce794", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.816496580927726" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integers = Vector.new([1, 2, 3, nil])\n", "integers.stddev" ] }, { "cell_type": "code", "execution_count": 112, "id": "2e40ac09-cb7f-4978-87e8-53f84f16f7c7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unbiased standard deviation\n", "integers.sd" ] }, { "cell_type": "code", "execution_count": 113, "id": "e6158e3b-4af8-467c-a355-8e9f2e579548", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6666666666666666" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integers.variance" ] }, { "cell_type": "code", "execution_count": 114, "id": "d64d39f2-d979-49f1-9946-65890f40d646", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unbiased variance\n", "integers.var" ] }, { "cell_type": "markdown", "id": "25023f5a-798a-40a5-ab84-a6615602f747", "metadata": {}, "source": [ "## 41. negate (Vector)" ] }, { "cell_type": "markdown", "id": "00ddf322-ef50-40a1-86a6-22bf3d43f007", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 115, "id": "ab5a357a-e98c-40a1-9b89-0b38645e416f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1.0, 2.0, -3.0]\n" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = Vector.new([1.0, -2, 3])\n", "double.negate" ] }, { "cell_type": "code", "execution_count": 116, "id": "8a06c856-d61c-4752-a296-1fa207ffd9a1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1.0, 2.0, -3.0]\n" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as #negate\n", "-double" ] }, { "cell_type": "markdown", "id": "9b145724-d165-4ef3-8a06-2948dd0c7dbb", "metadata": {}, "source": [ "## 42. round (Vector)" ] }, { "cell_type": "markdown", "id": "b780c2f3-935c-4b2f-b18a-b277cf7c24b7", "metadata": {}, "source": [ "Otions for `#round`;\n", "\n", "- `:n-digits` The number of digits to show.\n", "- `round_mode` Specify rounding mode.\n", "\n", "This is a unary element-wise function." ] }, { "cell_type": "code", "execution_count": 117, "id": "e7a069b0-3547-4cd2-a2f0-0740f186b191", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.15, 2.5, 3.5, -4.5, -5.5]\n" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = RedAmber::Vector.new([15.15, 2.5, 3.5, -4.5, -5.5])" ] }, { "cell_type": "code", "execution_count": 118, "id": "5ee84b24-8830-4788-a404-d5e1cca22abf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round" ] }, { "cell_type": "code", "execution_count": 119, "id": "20adb1ad-473c-4245-b959-7848c239fb76", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_to_even)" ] }, { "cell_type": "code", "execution_count": 120, "id": "d2777ad8-2c24-48e4-8f5f-77403e3109ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[16.0, 3.0, 4.0, -5.0, -6.0]\n" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :towards_infinity)" ] }, { "cell_type": "code", "execution_count": 121, "id": "a8ab2735-74cb-4cfe-a5a2-61bfa90c72ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 4.0, -4.0, -5.0]\n" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_up)" ] }, { "cell_type": "code", "execution_count": 122, "id": "3575481c-40ed-405f-a69c-7581d4dce2cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 3.0, -4.0, -5.0]\n" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_towards_zero)" ] }, { "cell_type": "code", "execution_count": 123, "id": "a86e4c5c-aced-4a88-b692-4e26b90f1653", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 4.0, -5.0, -6.0]\n" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_towards_infinity)" ] }, { "cell_type": "code", "execution_count": 124, "id": "73f51bab-ff46-4b99-96a5-8c6547ad9d35", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 3.0, -5.0, -5.0]\n" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_to_odd)" ] }, { "cell_type": "code", "execution_count": 125, "id": "a12c684c-4a63-4dac-a81b-969978812a24", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: 0)" ] }, { "cell_type": "code", "execution_count": 126, "id": "17370f2b-0957-411b-8145-56aa9fc956ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.2, 2.5, 3.5, -4.5, -5.5]\n" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: 1)" ] }, { "cell_type": "code", "execution_count": 127, "id": "53072cff-b28b-4672-b30a-8ca37562bc21", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[20.0, 0.0, 0.0, -0.0, -10.0]\n" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: -1)" ] }, { "cell_type": "markdown", "id": "51dedfce-51c7-4e5b-b890-a90ad9cf7596", "metadata": {}, "source": [ "## 43. and/or (Vector)" ] }, { "cell_type": "markdown", "id": "b2c4869b-6ebf-476c-b2fd-a4b9c0638dc5", "metadata": {}, "source": [ "RedAmber select `and_kleene`/`or_kleene` as default `&`/`|` method.\n", "\n", "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 128, "id": "2d4f5853-1ed9-4d8b-87a9-b5c1faac5fae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, false, false, false, nil, false, nil]\n" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool_self = Vector.new([true, true, true, false, false, false, nil, nil, nil])\n", "bool_other = Vector.new([true, false, nil, true, false, nil, true, false, nil])\n", "\n", "bool_self & bool_other # same as bool_self.and_kleene(bool_other)" ] }, { "cell_type": "code", "execution_count": 129, "id": "236c9733-8d45-467e-b288-e6c18b9c39d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, true, false, nil, true, false, nil]\n" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ruby's primitive `&&`\n", "bool_self && bool_other" ] }, { "cell_type": "code", "execution_count": 130, "id": "4e984a9c-7d9c-465d-bf26-0c685dedd4bf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, false, false, nil, nil, nil, nil]\n" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow's default `and`\n", "bool_self.and_org(bool_other)" ] }, { "cell_type": "code", "execution_count": 131, "id": "0120ebf5-355d-41f5-83d5-49b9802f337b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, true, false, nil, true, nil, nil]\n" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool_self | bool_other # same as bool_self.or_kleene(bool_other)" ] }, { "cell_type": "code", "execution_count": 132, "id": "24ceee23-79df-4fcd-afd8-f3839a087785", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, false, false, false, nil, nil, nil]\n" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ruby's primitive `||`\n", "bool_self || bool_other" ] }, { "cell_type": "code", "execution_count": 133, "id": "c152d04b-71a0-4b18-acd1-b5ab9e413d00", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, nil, true, false, nil, nil, nil, nil]\n" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow's default `or`\n", "bool_self.or_org(bool_other)" ] }, { "cell_type": "markdown", "id": "beede237-c5ed-4e12-a432-ec7e4546d786", "metadata": {}, "source": [ "## 44. is_finite/is_nan/is_nil/is_na (Vector)" ] }, { "cell_type": "markdown", "id": "77418efd-c0d7-4d63-a7db-2d43fafd386e", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 134, "id": "19558f9e-fdc4-46e5-90d0-724e4e8fbd8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[3.141592653589793, Infinity, -Infinity, NaN, nil]\n" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = Vector.new([Math::PI, Float::INFINITY, -Float::INFINITY, Float::NAN, nil])" ] }, { "cell_type": "code", "execution_count": 135, "id": "d90a7168-1f87-4363-9589-c1f161babc7d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, false, false, nil]\n" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_finite" ] }, { "cell_type": "code", "execution_count": 136, "id": "7d88049b-695f-4b0c-a105-8fb5797a58b1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, true, false, nil]\n" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_inf" ] }, { "cell_type": "code", "execution_count": 137, "id": "7d86a7b5-84bf-4031-9811-4076281920cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, false, true, true]\n" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_na" ] }, { "cell_type": "code", "execution_count": 138, "id": "d562f826-7a37-4c57-8f92-777555987246", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, false, false, true]\n" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_nil" ] }, { "cell_type": "code", "execution_count": 139, "id": "e460dc6b-e48f-4462-9ce8-aa6069ebae27", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, true, false]\n" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_valid" ] }, { "cell_type": "markdown", "id": "2cca75eb-f0e8-4f85-89cb-3601512e76b0", "metadata": {}, "source": [ "## 45. Prime-th rows" ] }, { "cell_type": "code", "execution_count": 140, "id": "e0e56ecc-b24c-4a40-b3bb-26bb64eb59ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <68 x 9 vectors>
indexspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
2AdelieTorgersen39.517.41863800female2007
3AdelieTorgersen40.318.01953250female2007
5AdelieTorgersen36.719.31933450female2007
7AdelieTorgersen38.917.81813625female2007
317GentooBiscoe49.415.82164925male2009
331GentooBiscoe50.515.22165000female2009
337GentooBiscoe44.515.72174875(nil)2009
" ], "text/plain": [ "#\n", " index species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 2 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 3 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 4 7 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " 5 11 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " : : : : : : : ... :\n", "66 317 Gentoo Biscoe 49.4 15.8 216 ... 2009\n", "67 331 Gentoo Biscoe 50.5 15.2 216 ... 2009\n", "68 337 Gentoo Biscoe 44.5 15.7 217 ... 2009\n" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# prime-th rows ... Don't ask me what it means.\n", "require 'prime'\n", "penguins_with_index =\n", " penguins.assign do\n", " { index: Vector.new(penguins.indices) + 1 }\n", " end.pick { [keys[-1], keys[0..-2]] }\n", "penguins_with_index.slice { Vector.new(Prime.each(size).to_a) - 1 }" ] }, { "cell_type": "markdown", "id": "c9e8de1a-ad8f-4fdc-a65c-4d3db7123530", "metadata": {}, "source": [ "## 46. Slice by Enumerator" ] }, { "cell_type": "markdown", "id": "32dd53a3-a822-4ae1-afe2-b5aa2bfbd3e3", "metadata": {}, "source": [ "Slice accepts Enumerator as an option." ] }, { "cell_type": "code", "execution_count": 141, "id": "b2a118fa-f3c0-4f31-9b45-6db27ccbebe6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <35 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen37.817.11863300(nil)2007
AdelieBiscoe37.818.31743400female2007
AdelieDream39.516.71783250female2007
GentooBiscoe48.515.02194850female2009
GentooBiscoe50.515.22165000female2009
GentooBiscoe46.814.32154850female2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " 3 Adelie Biscoe 37.8 18.3 174 ... 2007\n", " 4 Adelie Dream 39.5 16.7 178 ... 2007\n", " 5 Adelie Dream 36.5 18.0 182 ... 2007\n", " : : : : : : ... :\n", "33 Gentoo Biscoe 48.5 15.0 219 ... 2009\n", "34 Gentoo Biscoe 50.5 15.2 216 ... 2009\n", "35 Gentoo Biscoe 46.8 14.3 215 ... 2009\n" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select every 10 samples\n", "penguins.slice(0.step by: 10, to: 340)" ] }, { "cell_type": "markdown", "id": "db312c2c-3a7c-4765-bfad-b3313b173a79", "metadata": {}, "source": [ "## 47. Output mode" ] }, { "cell_type": "markdown", "id": "714ed8df-5aa3-4ac4-8b0d-6390aff73c8c", "metadata": {}, "source": [ "Output mode of `#inspect` and `#to_iruby` is Table mode by default. If you prefer TDR mode set the environment variable `RED_AMBER_OUTPUT_MODE` to `\"TDR\"`." ] }, { "cell_type": "code", "execution_count": 142, "id": "a721804b-006e-44c6-8d38-885eae747eaa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Table' # or nil (default)\n", "penguins # Almost same as `puts penguins.to_s` in any mode" ] }, { "cell_type": "code", "execution_count": 143, "id": "e4c9f70c-a4b1-4a81-bbc4-e9b14a6b6cb0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", " ... 5 more Vectors ...\n", "\n" ] } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'TDR'\n", "p penguins; nil # Almost same as `penguins.tdr` in any mode" ] }, { "cell_type": "code", "execution_count": 144, "id": "2786e9a7-e321-43c5-b56e-9f2ca9d62f8b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins" ] }, { "cell_type": "code", "execution_count": 145, "id": "b00c858b-b14a-492b-bc22-d6a707bcc1ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Table\"" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Table'" ] }, { "cell_type": "markdown", "id": "c6781904-c014-4ef0-86c5-8a758ca3ca1c", "metadata": {}, "source": [ "## 48. Empty key" ] }, { "cell_type": "markdown", "id": "38912e38-bddd-47cc-adfd-5c4f780636ea", "metadata": {}, "source": [ "Empty key `:\"\"` will be automatically renamed to `:unnamed1`.\n", "\n", "If `:unnamed1` was used, `:unnamed1.succ` will be used.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 146, "id": "13569004-bb23-45fa-8d11-fe5f367641a6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 2 vectors>
unnamed2unnamed1
13
24
" ], "text/plain": [ "#\n", " unnamed2 unnamed1\n", " \n", "1 1 3\n", "2 2 4\n" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame.new(\"\": [1, 2], unnamed1: [3, 4])" ] }, { "cell_type": "markdown", "id": "a4343343-d59a-4b71-b08f-4b737a6d4258", "metadata": {}, "source": [ "## 49. Grouping" ] }, { "cell_type": "markdown", "id": "8ddf0937-7b0e-4ee6-8116-8e78c308e76f", "metadata": {}, "source": [ "`DataFrame#group` takes group_key as an argument, and creates `Group` class. \n", "\n", "Group class inspects counts of each unique elements. \n", "\n", "(Since 0.1.7)" ] }, { "cell_type": "code", "execution_count": 147, "id": "ee602e52-7988-4fab-b5e3-c466acf01c98", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#{\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}}>" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group = penguins.group(:species)" ] }, { "cell_type": "markdown", "id": "a48ab0e9-60e3-4c07-a27b-9c53783318c1", "metadata": {}, "source": [ "The instance of `Group` class has methods to summary functions.\n", "\n", "It returns `function(key)` style summarized columns as a result." ] }, { "cell_type": "code", "execution_count": 148, "id": "20b23ada-b895-4921-b57b-8d46b451e494", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 8 vectors>
speciescount(island)count(bill_length_mm)count(bill_depth_mm)count(flipper_length_mm)count(body_mass_g)count(sex)count(year)
Adelie152151151151151146152
Chinstrap68686868686868
Gentoo124123123123123119124
" ], "text/plain": [ "#\n", " species count(island) count(bill_length_mm) count(bill_depth_mm) ... count(year)\n", " ... \n", "1 Adelie 152 151 151 ... 152\n", "2 Chinstrap 68 68 68 ... 68\n", "3 Gentoo 124 123 123 ... 124\n" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group.count" ] }, { "cell_type": "markdown", "id": "584d8fa1-1e70-4f31-bdc2-5d4956a28cb5", "metadata": {}, "source": [ "If count result is same in multiple columns, count column is aggregated to one column `:count`." ] }, { "cell_type": "code", "execution_count": 149, "id": "e6936488-9f23-47bd-8492-537c5be1afb3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
speciescount
Adelie151
Chinstrap68
Gentoo123
" ], "text/plain": [ "#\n", " species count\n", " \n", "1 Adelie 151\n", "2 Chinstrap 68\n", "3 Gentoo 123\n" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.pick(:species, :bill_length_mm, :bill_depth_mm).group(:species).count" ] }, { "cell_type": "markdown", "id": "7c90b59b-918c-492c-ac5b-46b62b31d136", "metadata": {}, "source": [ "Grouping key comes first (leftmost) in the columns." ] }, { "cell_type": "markdown", "id": "21c8a6fa-3f95-41ea-817f-ffc1011e8df9", "metadata": {}, "source": [ "## 50. Grouping with a block" ] }, { "cell_type": "markdown", "id": "db89c8b4-8189-410d-abe9-eac8f440065c", "metadata": {}, "source": [ "`DataFrame#group` takes a block and we can specify multiple functions.\n", "\n", "Inside the block is the context of instance of Group. So we can use summary functions without the reciever.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 150, "id": "913f576b-ec86-4e94-af05-7c656ea24cc2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
speciescountmean(body_mass_g)
Adelie1523700.662251655629
Chinstrap683733.0882352941176
Gentoo1245076.016260162602
" ], "text/plain": [ "#\n", " species count mean(body_mass_g)\n", " \n", "1 Adelie 152 3700.7\n", "2 Chinstrap 68 3733.1\n", "3 Gentoo 124 5076.0\n" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.group(:species) { [count(:species), mean(:body_mass_g)] }" ] }, { "cell_type": "markdown", "id": "e6dc3206-d327-4242-9908-253ad4a74759", "metadata": {}, "source": [ "`Group#summarize` accepts same block as `DataFrame#group`." ] }, { "cell_type": "code", "execution_count": 151, "id": "67c7fc55-7b30-469c-bd0c-cda5732863fe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 7 vectors>
speciescountmean(bill_length_mm)mean(bill_depth_mm)mean(flipper_length_mm)mean(body_mass_g)mean(year)
Adelie15238.7913907284768418.346357615894032189.953642384105963700.6622516556292008.0131578947369
Chinstrap6848.8338235294117718.420588235294115195.82352941176473733.08823529411762007.9705882352941
Gentoo12447.50487804878047614.982113821138206217.18699186991875076.0162601626022008.0806451612902
" ], "text/plain": [ "#\n", " species count mean(bill_length_mm) mean(bill_depth_mm) ... mean(year)\n", " ... \n", "1 Adelie 152 38.8 18.3 ... 2008.0\n", "2 Chinstrap 68 48.8 18.4 ... 2008.0\n", "3 Gentoo 124 47.5 15.0 ... 2008.1\n" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group.summarize { [count(:species), mean] }" ] }, { "cell_type": "markdown", "id": "88682c8d-0ab6-4cee-8e59-5b6ad0ae5dac", "metadata": {}, "source": [ "## 51. Vector#shift" ] }, { "cell_type": "markdown", "id": "c6dd1bc0-7146-4884-aac5-9874e829235e", "metadata": {}, "source": [ "`Vector#shift(amount = 1, fill: nil)`\n", "\n", "Shift vector's values by specified `amount`. Shifted space is filled by value `fill`.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 152, "id": "013f2db6-3e1d-481f-a908-57605729b51d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[nil, 1, 2, 3, 4]\n" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = RedAmber::Vector.new([1, 2, 3, 4, 5])\n", "vector.shift" ] }, { "cell_type": "code", "execution_count": 153, "id": "7625acd7-d6a0-4775-b5e0-ca87f95f4f28", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[3, 4, 5, nil, nil]\n" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.shift(-2)" ] }, { "cell_type": "code", "execution_count": 154, "id": "34a9ac2a-2e3f-44bc-8ba7-c4487dc3528e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[NaN, 1.0, 2.0, 3.0, 4.0]\n" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.shift(fill: Float::NAN)" ] }, { "cell_type": "markdown", "id": "39f305a6-5718-4e81-b08f-564fa93861cc", "metadata": { "tags": [] }, "source": [ "## 52. From the Pandas cookbook (if-then)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#if-then" ] }, { "cell_type": "markdown", "id": "4f1d41dd-714a-4c68-af3b-a9080847f833", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df.loc[df.AAA >= 5, \"BBB\"] = -1\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "1 5 -1 50\n", "2 6 -1 -30\n", "3 7 -1 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 155, "id": "24774ccc-8f0f-4ce4-9ba0-bebed8781c38", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
410100
5-150
6-1-30
7-1-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 -1 50\n", "3 6 -1 -30\n", "4 7 -1 -50\n" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50] # You can omit {}\n", ")\n", "\n", "df.assign( BBB: df[:BBB].replace(df[:AAA] >= 5, -1) )" ] }, { "cell_type": "markdown", "id": "c76ca2e3-e239-440b-923b-dfabb1a1c007", "metadata": {}, "source": [ "If you want to replace both :BBB and :CCC ;" ] }, { "cell_type": "code", "execution_count": 156, "id": "3f97227b-cbee-4515-b76d-3514401967d9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
410100
5-1-2
6-1-2
7-1-2
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 -1 -2\n", "3 6 -1 -2\n", "4 7 -1 -2\n" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign do\n", " replacer = v(:AAA) >= 5 # Boolean Vector\n", " {\n", " BBB: v(:BBB).replace(replacer, -1),\n", " CCC: v(:CCC).replace(replacer, -2)\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "4bf560f9-4393-4ed5-9994-24f8a93470fb", "metadata": { "tags": [] }, "source": [ "## 53. From the Pandas cookbook (splitting)\n", "Split a frame with a boolean criterion\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#splitting" ] }, { "cell_type": "markdown", "id": "fcf06cad-4a34-45e2-b8f8-8422914eb319", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df[df.AAA <= 5]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "1 5 20 50\n", "\n", "df[df.AAA > 5]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "2 6 30 -30\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 157, "id": "b08e74d4-aba8-4fb5-a815-5d5384e92f81", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
410100
52050
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 20 50\n" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " # You can omit {}\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df.slice(df[:AAA] <= 5)" ] }, { "cell_type": "code", "execution_count": 158, "id": "caa72796-ff7e-4275-849f-04698114ee08", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
630-30
740-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 6 30 -30\n", "2 7 40 -50\n" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(df[:AAA] <= 5)\n", "# df.slice(df[:AAA] > 5) # do the same thing" ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 3.1.2", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "3.1.2" } }, "nbformat": 4, "nbformat_minor": 5 }