{ "cells": [ { "cell_type": "markdown", "id": "e355db8b-ebb6-4ea6-97b5-3b9fdadc302c", "metadata": {}, "source": [ "# 71 examples of Red Amber" ] }, { "cell_type": "markdown", "id": "f20f4970-db38-4d96-9a36-d4cf9d007596", "metadata": {}, "source": [ "Last update: September 6, 2022, for RedAmber Version 0.2.1" ] }, { "cell_type": "markdown", "id": "f6e927d0-b59a-4c4e-9f8a-4fa08f9a6b2f", "metadata": {}, "source": [ "## 1. Install" ] }, { "cell_type": "markdown", "id": "85eacfe6-fa11-4749-844f-5914d6cd7dbc", "metadata": {}, "source": [ "Install requirements before you install RedAmber.\n", "\n", "- Apache Arrow GLib (>= 9.0.0)\n", "\n", "- Apache Parquet GLib (>= 9.0.0) # if you need IO from/to Parquet resource.\n", "\n", " See [Apache Arrow install document](https://arrow.apache.org/install/).\n", " \n", " Minimum installation example for the latest Ubuntu is in the ['Prepare the Apache Arrow' section in ci test](https://github.com/heronshoes/red_amber/blob/master/.github/workflows/test.yml) of Red Amber.\n", "\n", "Then add this line to your Gemfile:\n", "```\n", "gem 'red_amber'\n", "```\n", "\n", "And then execute:\n", "```\n", "$ bundle install\n", "```\n", "\n", "Or install it yourself as:\n", "```\n", "$ gem install red_amber\n", "```" ] }, { "cell_type": "markdown", "id": "8c08c45d-0818-4b43-bc65-4d43dd8b6b66", "metadata": {}, "source": [ "## 2. Require" ] }, { "cell_type": "code", "execution_count": 1, "id": "74b76022-03ea-40ae-bac8-fc8743659042", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:RedAmber=>\"0.2.1-HEAD\", :Arrow=>\"9.0.0\"}" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'red_amber' # require 'red-amber' is also OK\n", "include RedAmber\n", "{RedAmber: VERSION, Arrow: Arrow::VERSION}" ] }, { "cell_type": "markdown", "id": "d8fb6289-39ea-4fa9-a165-b87ee6d125e9", "metadata": { "tags": [] }, "source": [ "## 3. Initialize" ] }, { "cell_type": "code", "execution_count": 2, "id": "51f81824-626a-4741-a29b-30ea357fe7b5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a Hash\n", "DataFrame.new(x: [1, 2, 3], y: %w[A B C])" ] }, { "cell_type": "code", "execution_count": 3, "id": "20b696eb-c199-444d-a957-e0b1081f1506", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a schema and a row-oriented array\n", "DataFrame.new({ x: :uint8, y: :string }, [[1, 'A'], [2, 'B'], [3, 'C']])" ] }, { "cell_type": "code", "execution_count": 4, "id": "21eab151-f977-4474-a6d1-576169e24b26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From an Arrow::Table\n", "table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C])\n", "DataFrame.new(table)" ] }, { "cell_type": "code", "execution_count": 5, "id": "aa09d3da-f332-45cd-92ca-712c6a679035", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a Rover::DataFrame\n", "require 'rover'\n", "rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C])\n", "DataFrame.new(rover)" ] }, { "cell_type": "code", "execution_count": 6, "id": "cd2c3677-00fb-48fe-bb94-18bc0815db72", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# from a red-datasets\n", "require 'datasets-arrow'\n", "dataset = Datasets::Penguins.new\n", "penguins = DataFrame.new(dataset.to_arrow)" ] }, { "cell_type": "markdown", "id": "3a2d12b4-7623-42c7-9e32-76cf303c7cea", "metadata": {}, "source": [ "It should be in the future version;\n", "```ruby\n", "require 'datasets-red-amber'\n", "penguins = Datasets::Penguins.new.to_red_amber\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "id": "2e4619b7-bf6d-4081-9066-b186da8fdf5b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <32 x 11 vectors>
mpgcyldisphpdratwtqsecvsamgearcarb
21.06160.01103.92.6216.460144
21.06160.01103.92.87517.020144
22.84108.0933.852.3218.611141
21.46258.01103.083.21519.441031
19.76145.01753.622.7715.50156
15.08301.03353.543.5714.60158
21.44121.01094.112.7818.61142
" ], "text/plain": [ "#\n", " mpg cyl disp hp drat wt qsec vs am ... carb\n", " ... \n", " 1 21.0 6 160.0 110 3.9 2.62 16.46 0 1 ... 4\n", " 2 21.0 6 160.0 110 3.9 2.88 17.02 0 1 ... 4\n", " 3 22.8 4 108.0 93 3.85 2.32 18.61 1 1 ... 1\n", " 4 21.4 6 258.0 110 3.08 3.22 19.44 1 0 ... 1\n", " 5 18.7 8 360.0 175 3.15 3.44 17.02 0 0 ... 2\n", " : : : : : : : : : : ... :\n", "30 19.7 6 145.0 175 3.62 2.77 15.5 0 1 ... 6\n", "31 15.0 8 301.0 335 3.54 3.57 14.6 0 1 ... 8\n", "32 21.4 4 121.0 109 4.11 2.78 18.6 1 1 ... 2\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = Datasets::Rdatasets.new('datasets', 'mtcars')\n", "mtcars = DataFrame.new(dataset.to_arrow)" ] }, { "cell_type": "markdown", "id": "e1f77a54-3a43-4d17-bb6f-332ef13832a3", "metadata": {}, "source": [ "## 4. Load" ] }, { "cell_type": "markdown", "id": "0fed4f43-3fbb-43e5-af0d-f93401deea78", "metadata": {}, "source": [ "`RedAmber::DataFrame` delegates `#load` to `Arrow::Table#load`. We can load from `[.arrow, .arrows, .csv, .csv.gz, .tsv]` files." ] }, { "cell_type": "code", "execution_count": 8, "id": "ce9f7237-9196-41a0-ba04-6ef4430a8a0c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file = Tempfile.open(['comecome', '.csv']) do |f|\n", " f.puts(<<~CSV)\n", " name,age\n", " Yasuko,68\n", " Rui,49\n", " Hinata,28\n", " CSV\n", " f\n", "end\n", "\n", "DataFrame.load(file)" ] }, { "cell_type": "markdown", "id": "29875147-1371-4575-a565-69c3534c15f2", "metadata": {}, "source": [ "## 5. Load from a URI" ] }, { "cell_type": "code", "execution_count": 9, "id": "916b86e2-e3a2-4ebb-8770-9e8a29c46523", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 7 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
AdelieTorgersen39.118.71813750MALE
AdelieTorgersen39.517.41863800FEMALE
AdelieTorgersen40.318.01953250FEMALE
AdelieTorgersen(nil)(nil)(nil)(nil)
GentooBiscoe50.415.72225750MALE
GentooBiscoe45.214.82125200FEMALE
GentooBiscoe49.916.12135400MALE
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... sex\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... MALE\n", " 2 Adelie Torgersen 39.5 17.4 186 ... FEMALE\n", " 3 Adelie Torgersen 40.3 18.0 195 ... FEMALE\n", " 4 Adelie Torgersen (nil) (nil) (nil) ...\n", " 5 Adelie Torgersen 36.7 19.3 193 ... FEMALE\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... MALE\n", "343 Gentoo Biscoe 45.2 14.8 212 ... FEMALE\n", "344 Gentoo Biscoe 49.9 16.1 213 ... MALE\n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uri = URI(\"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv\")\n", "DataFrame.load(uri)" ] }, { "cell_type": "markdown", "id": "e6abe64d-e97f-437e-9c54-18f9e06e9668", "metadata": {}, "source": [ "## 6. Save" ] }, { "cell_type": "code", "execution_count": 10, "id": "91c0fb62-7990-47f1-9fb6-b0529bc1783f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.save(\"file.arrow\")\n", "penguins.save(\"file.arrows\")\n", "penguins.save(\"file.csv\")\n", "penguins.save(\"file.csv.gz\")\n", "penguins.save(\"file.tsv\")\n", "penguins.save(\"file.feather\")" ] }, { "cell_type": "markdown", "id": "d1d30973-9e2f-406a-9f42-9e6e4c966baf", "metadata": {}, "source": [ "## 7. to_s/inspect" ] }, { "cell_type": "markdown", "id": "a7bc9cb7-eae4-495f-831e-b747e486d0bd", "metadata": {}, "source": [ "`to_s` or `inspect` (it uses to_s inside) shows a preview of the dataframe.\n", "\n", "It shows first 5 and last 3 rows if it has many rows. Columns are also omitted if line is exceeded 80 letters." ] }, { "cell_type": "code", "execution_count": 11, "id": "af6d29ef-2e1c-4a08-a8b2-d69acda79ec5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 2 2.0 B false\n", "3 3 3.0 C true\n", "4 4 NaN D false\n", "5 5 (nil) (nil) (nil)\n", "\n" ] } ], "source": [ "df = DataFrame.new(\n", " x: [1, 2, 3, 4, 5],\n", " y: [1, 2, 3, 0/0.0, nil],\n", " s: %w[A B C D] << nil,\n", " b: [true, false, true, false, nil])\n", "p df; nil" ] }, { "cell_type": "code", "execution_count": 12, "id": "cdff2e60-bd0a-4d12-b348-201a49bbbbbe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n", "\n" ] } ], "source": [ "p penguins; nil" ] }, { "cell_type": "markdown", "id": "cb44df38-58f7-479c-b7a4-c9c305639292", "metadata": {}, "source": [ "## 8. Show table" ] }, { "cell_type": "code", "execution_count": 13, "id": "fc710035-8134-4b18-89fe-8c58b95e0e0e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\tx\t y\ts\tb\n", "0\t1\t 1.000000\tA\ttrue\n", "1\t2\t 2.000000\tB\tfalse\n", "2\t3\t 3.000000\tC\ttrue\n", "3\t4\t NaN\tD\tfalse\n", "4\t5\t (null)\t(null)\t(null)\n" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.table" ] }, { "cell_type": "code", "execution_count": 14, "id": "2634fb7b-194f-4277-94ba-05f39c497ffa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\tspecies\tisland\tbill_length_mm\tbill_depth_mm\tflipper_length_mm\tbody_mass_g\tsex\tyear\n", " 0\tAdelie \tTorgersen\t 39.100000\t 18.700000\t 181\t 3750\tmale\t2007\n", " 1\tAdelie \tTorgersen\t 39.500000\t 17.400000\t 186\t 3800\tfemale\t2007\n", " 2\tAdelie \tTorgersen\t 40.300000\t 18.000000\t 195\t 3250\tfemale\t2007\n", " 3\tAdelie \tTorgersen\t (null)\t (null)\t (null)\t (null)\t(null)\t2007\n", " 4\tAdelie \tTorgersen\t 36.700000\t 19.300000\t 193\t 3450\tfemale\t2007\n", " 5\tAdelie \tTorgersen\t 39.300000\t 20.600000\t 190\t 3650\tmale\t2007\n", " 6\tAdelie \tTorgersen\t 38.900000\t 17.800000\t 181\t 3625\tfemale\t2007\n", " 7\tAdelie \tTorgersen\t 39.200000\t 19.600000\t 195\t 4675\tmale\t2007\n", " 8\tAdelie \tTorgersen\t 34.100000\t 18.100000\t 193\t 3475\t(null)\t2007\n", " 9\tAdelie \tTorgersen\t 42.000000\t 20.200000\t 190\t 4250\t(null)\t2007\n", "...\n", "334\tGentoo \tBiscoe\t 46.200000\t 14.100000\t 217\t 4375\tfemale\t2009\n", "335\tGentoo \tBiscoe\t 55.100000\t 16.000000\t 230\t 5850\tmale\t2009\n", "336\tGentoo \tBiscoe\t 44.500000\t 15.700000\t 217\t 4875\t(null)\t2009\n", "337\tGentoo \tBiscoe\t 48.800000\t 16.200000\t 222\t 6000\tmale\t2009\n", "338\tGentoo \tBiscoe\t 47.200000\t 13.700000\t 214\t 4925\tfemale\t2009\n", "339\tGentoo \tBiscoe\t (null)\t (null)\t (null)\t (null)\t(null)\t2009\n", "340\tGentoo \tBiscoe\t 46.800000\t 14.300000\t 215\t 4850\tfemale\t2009\n", "341\tGentoo \tBiscoe\t 50.400000\t 15.700000\t 222\t 5750\tmale\t2009\n", "342\tGentoo \tBiscoe\t 45.200000\t 14.800000\t 212\t 5200\tfemale\t2009\n", "343\tGentoo \tBiscoe\t 49.900000\t 16.100000\t 213\t 5400\tmale\t2009\n" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.table" ] }, { "cell_type": "code", "execution_count": 15, "id": "9dba2a67-ede7-4663-907b-9b2dd5db1605", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x: uint8\n", "y: double\n", "s: string\n", "b: bool\n", "----\n", "x:\n", " [\n", " [\n", " 1,\n", " 2,\n", " 3,\n", " 4,\n", " 5\n", " ]\n", " ]\n", "y:\n", " [\n", " [\n", " 1,\n", " 2,\n", " 3,\n", " nan,\n", " null\n", " ]\n", " ]\n", "s:\n", " [\n", " [\n", " \"A\",\n", " \"B\",\n", " \"C\",\n", " \"D\",\n", " null\n", " ]\n", " ]\n", "b:\n", " [\n", " [\n", " true,\n", " false,\n", " true,\n", " false,\n", " null\n", " ]\n", " ]\n" ] } ], "source": [ "# This is a Red Arrow's feature\n", "puts df.table.to_s(format: :column)" ] }, { "cell_type": "code", "execution_count": 16, "id": "d1cc17b8-1cfc-4986-9dec-7bca02be32f0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================== 0 ====================\n", "x: 1\n", "y: 1.000000\n", "s: A\n", "b: true\n", "==================== 1 ====================\n", "x: 2\n", "y: 2.000000\n", "s: B\n", "b: false\n", "==================== 2 ====================\n", "x: 3\n", "y: 3.000000\n", "s: C\n", "b: true\n", "==================== 3 ====================\n", "x: 4\n", "y: NaN\n", "s: D\n", "b: false\n", "==================== 4 ====================\n", "x: 5\n", "y: (null)\n", "s: (null)\n", "b: (null)\n" ] } ], "source": [ "# This is also a Red Arrow's feature\n", "puts df.table.to_s(format: :list)" ] }, { "cell_type": "markdown", "id": "16e4ae6b-2399-43f0-be8e-65669b95c7b6", "metadata": {}, "source": [ "## 9. TDR" ] }, { "cell_type": "markdown", "id": "2d14eb4b-9026-4cc5-a71a-598946d40b67", "metadata": {}, "source": [ "TDR means 'Transposed Dataframe Representation'. It shows columns in lateral just the same shape as initializing by a Hash. TDR has some information which is useful for the exploratory data processing.\n", "\n", "- DataFrame shape: n_rows x n_columns\n", "- Data types\n", "- Levels: number of unique elements\n", "- Data preview: same data is aggregated if level is smaller (tally mode)\n", "- Show counts of abnormal element: NaN and nil" ] }, { "cell_type": "code", "execution_count": 17, "id": "8050462f-7c60-41b7-a011-af11763784dc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 5 x 4 Vectors\n", "Vectors : 2 numeric, 1 string, 1 boolean\n", "# key type level data_preview\n", "1 :x uint8 5 [1, 2, 3, 4, 5]\n", "2 :y double 5 [1.0, 2.0, 3.0, NaN, nil], 1 NaN, 1 nil\n", "3 :s string 5 [\"A\", \"B\", \"C\", \"D\", nil], 1 nil\n", "4 :b boolean 3 {true=>2, false=>2, nil=>1}\n" ] } ], "source": [ "# use the same dataframe as #7\n", "df.tdr" ] }, { "cell_type": "code", "execution_count": 18, "id": "bb616ffe-c19a-4b02-a011-601ceb3db656", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] } ], "source": [ "penguins.tdr" ] }, { "cell_type": "markdown", "id": "73b8dc18-079f-4d40-8d0e-239f010550da", "metadata": {}, "source": [ "`#tdr` has some options:\n", "\n", "`limit` : to limit a number of variables to show. Default value is `limit=10`." ] }, { "cell_type": "code", "execution_count": 19, "id": "0962845d-e642-4d2a-9607-43e197b46bc5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", " ... 5 more Vectors ...\n" ] } ], "source": [ "penguins.tdr(3)" ] }, { "cell_type": "markdown", "id": "573606c4-23b9-4b38-8c92-a04f1c1e8781", "metadata": {}, "source": [ "`elements` : max number of elements to show in observations. Default value is `elements: 5`." ] }, { "cell_type": "code", "execution_count": 20, "id": "f957d2bd-e8c0-42a1-a3b4-0a9478e740bf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] } ], "source": [ "penguins.tdr(elements: 3) # Show first 3 items in data" ] }, { "cell_type": "markdown", "id": "d37ece79-1999-49eb-a2d1-831184ee6509", "metadata": {}, "source": [ "`tally` : max level to use tally mode. Level means size of `tally`ed hash. Default value is `tally: 5`." ] }, { "cell_type": "code", "execution_count": 21, "id": "9c1c472c-3d15-4bca-9a1b-7f86c63d3ed8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 [\"Adelie\", \"Adelie\", \"Adelie\", \"Adelie\", \"Adelie\", ... ]\n", "2 :island string 3 [\"Torgersen\", \"Torgersen\", \"Torgersen\", \"Torgersen\", \"Torgersen\", ... ]\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 [\"male\", \"female\", \"female\", nil, \"female\", ... ], 11 nils\n", "8 :year uint16 3 [2007, 2007, 2007, 2007, 2007, ... ]\n" ] } ], "source": [ "penguins.tdr(tally: 0) # Don't use tally mode" ] }, { "cell_type": "markdown", "id": "e3c38037-90a1-4fc5-9904-41fc74085908", "metadata": {}, "source": [ "`#tdr_str` returns a String. `#tdr` do the same thing as `puts #tdr_str`" ] }, { "cell_type": "markdown", "id": "21d68764-1bc1-4915-99b6-5ae938b85999", "metadata": {}, "source": [ "## 10. Size and shape" ] }, { "cell_type": "code", "execution_count": 22, "id": "487399f8-a3ef-467f-aa7f-ecbaee5fcb75", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same as n_rows, n_obs\n", "df.size" ] }, { "cell_type": "code", "execution_count": 23, "id": "dc7441c3-7c85-4ce1-a20e-de8f41f280b4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same as n_cols, n_vars\n", "df.n_keys" ] }, { "cell_type": "code", "execution_count": 24, "id": "3d42fea6-801a-45f4-8e22-ea9d76ae070f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[5, 4]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# [df.size, df.n_keys], [df.n_rows, df.n_cols]\n", "df.shape" ] }, { "cell_type": "markdown", "id": "bc5caa94-325f-4014-9c90-8ac909c2b378", "metadata": {}, "source": [ "## 11. Keys" ] }, { "cell_type": "code", "execution_count": 25, "id": "bb47775f-fed0-42e6-8781-aa8b721d6112", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:x, :y, :s, :b]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.keys" ] }, { "cell_type": "code", "execution_count": 26, "id": "3d540ab0-3e52-47b7-b338-b4e0b3d929cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:species, :island, :bill_length_mm, :bill_depth_mm, :flipper_length_mm, :body_mass_g, :sex, :year]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.keys" ] }, { "cell_type": "markdown", "id": "decc6a61-9994-4d60-9827-b257cafafb70", "metadata": {}, "source": [ "## 12. Types" ] }, { "cell_type": "code", "execution_count": 27, "id": "bf9cd2bc-a213-427e-bc00-f2083b0e0471", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:uint8, :double, :string, :boolean]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.types" ] }, { "cell_type": "code", "execution_count": 28, "id": "b1ecb891-98b5-4919-9f37-1847202007d8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:string, :string, :double, :double, :uint8, :uint16, :string, :uint16]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.types" ] }, { "cell_type": "markdown", "id": "869b3670-62f8-4c23-807b-d6d100a1981e", "metadata": {}, "source": [ "## 13. Data type classes" ] }, { "cell_type": "code", "execution_count": 29, "id": "776ab4db-073b-4b30-931a-8ec77284cdc4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Arrow::UInt8DataType, Arrow::DoubleDataType, Arrow::StringDataType, Arrow::BooleanDataType]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.type_classes" ] }, { "cell_type": "code", "execution_count": 30, "id": "0546a5d0-cab1-4ca8-a2e5-0637d0fd48b6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Arrow::StringDataType, Arrow::StringDataType, Arrow::DoubleDataType, Arrow::DoubleDataType, Arrow::UInt8DataType, Arrow::UInt16DataType, Arrow::StringDataType, Arrow::UInt16DataType]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.type_classes" ] }, { "cell_type": "markdown", "id": "1c2513f6-909e-47fd-a543-66c4f424f44e", "metadata": { "tags": [] }, "source": [ "## 14. Indices\n", "\n", "Another example of `indices` is [62. Custom index](#62.-Custom-index)." ] }, { "cell_type": "code", "execution_count": 31, "id": "e6e9d7ef-1471-4f23-9210-56045c9fabd5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.indexes\n", "# or\n", "df.indices" ] }, { "cell_type": "markdown", "id": "3908395f-b086-4fbb-9855-e1ce233f0595", "metadata": {}, "source": [ "## 15. To an Array or a Hash" ] }, { "cell_type": "markdown", "id": "22cb724e-cf61-40d9-a58b-9cc793e83645", "metadata": {}, "source": [ "DataFrame#to_a returns an array of row-oriented data without a header." ] }, { "cell_type": "code", "execution_count": 32, "id": "4054daad-9266-4002-8942-c0891050cb4d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[1, 1.0, \"A\", true], [2, 2.0, \"B\", false], [3, 3.0, \"C\", true], [4, NaN, \"D\", false], [5, nil, nil, nil]]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_a" ] }, { "cell_type": "markdown", "id": "f6abae59-fe31-4056-9de8-7c36e35235de", "metadata": {}, "source": [ "If you need a column-oriented array with keys, use `.to_h.to_a`" ] }, { "cell_type": "code", "execution_count": 33, "id": "d3631290-eb74-4d21-a469-86381c668c7f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>[1, 2, 3, 4, 5], :y=>[1.0, 2.0, 3.0, NaN, nil], :s=>[\"A\", \"B\", \"C\", \"D\", nil], :b=>[true, false, true, false, nil]}" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_h" ] }, { "cell_type": "code", "execution_count": 34, "id": "08c45e92-f640-4e62-bc96-ee259d0ecff4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[:x, [1, 2, 3, 4, 5]], [:y, [1.0, 2.0, 3.0, NaN, nil]], [:s, [\"A\", \"B\", \"C\", \"D\", nil]], [:b, [true, false, true, false, nil]]]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_h.to_a" ] }, { "cell_type": "markdown", "id": "39b65fc0-4405-4414-9a74-91c724ef587c", "metadata": {}, "source": [ "## 16. Schema" ] }, { "cell_type": "code", "execution_count": 35, "id": "36db7842-e9b0-4473-84d4-3aef987d427f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>:uint8, :y=>:double, :s=>:string, :b=>:boolean}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.schema" ] }, { "cell_type": "markdown", "id": "3e61237d-ac67-45bb-827c-a769dff61809", "metadata": {}, "source": [ "## 17. Vector" ] }, { "cell_type": "markdown", "id": "27402307-aaad-49c8-88ca-65346668601d", "metadata": {}, "source": [ "Each variable (column in the table) is represented by a Vector object." ] }, { "cell_type": "code", "execution_count": 36, "id": "6c9ba041-231d-4057-a280-acf620b68525", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[:x] # This syntax comes later" ] }, { "cell_type": "markdown", "id": "3e13d06d-b432-45b2-9745-0c6ef9228e23", "metadata": {}, "source": [ "Or create new Vector by the constructor." ] }, { "cell_type": "code", "execution_count": 37, "id": "3e18a4e0-238c-4800-8bda-a88a57dde3e9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new(1, 2, 3, 4, 5)" ] }, { "cell_type": "code", "execution_count": 38, "id": "3bd55d9d-b988-46b2-bc11-e3dc5f4adc6c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new(1..5)" ] }, { "cell_type": "code", "execution_count": 39, "id": "19688e6e-b59b-4a84-8c07-57e87cd0e242", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new([1, 2, 3], [4, 5])" ] }, { "cell_type": "code", "execution_count": 40, "id": "076bd0e2-01ab-4497-9b9b-84f72a4805bc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array = Arrow::Array.new([1, 2, 3, 4, 5])\n", "Vector.new(array)" ] }, { "cell_type": "markdown", "id": "22091661-e78a-4c66-9e48-4c3c676469b4", "metadata": {}, "source": [ "- TODO: `Vector[1..5]` as a constructor" ] }, { "cell_type": "markdown", "id": "b729bdba-87a2-4282-bd0e-319fe17f42da", "metadata": {}, "source": [ "## 18. Vectors" ] }, { "cell_type": "markdown", "id": "f5ddd840-2f84-467b-a9bb-feb769573b69", "metadata": {}, "source": [ "Returns an Array of Vectors in a DataFrame." ] }, { "cell_type": "code", "execution_count": 41, "id": "d3ae03f2-e2fe-4a15-abe1-331185448d61", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[#\n", "[1, 2, 3, 4, 5]\n", ", #\n", "[1.0, 2.0, 3.0, NaN, nil]\n", ", #\n", "[\"A\", \"B\", \"C\", \"D\", nil]\n", ", #\n", "[true, false, true, false, nil]\n", "]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.vectors" ] }, { "cell_type": "markdown", "id": "8ac88ff3-0cb6-43d6-a999-0c2e8c6defb7", "metadata": { "tags": [] }, "source": [ "## 19. Variables\n", "\n", "Returns key and Vector pairs in a Hash." ] }, { "cell_type": "code", "execution_count": 42, "id": "3351a216-6fe5-485e-8686-53c1e754fa2e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>#\n", "[1, 2, 3, 4, 5]\n", ", :y=>#\n", "[1.0, 2.0, 3.0, NaN, nil]\n", ", :s=>#\n", "[\"A\", \"B\", \"C\", \"D\", nil]\n", ", :b=>#\n", "[true, false, true, false, nil]\n", "}" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.variables" ] }, { "cell_type": "markdown", "id": "3b518c1c-eda7-406f-a885-b2344b1726eb", "metadata": {}, "source": [ "## 20. Select columns by #[ ]" ] }, { "cell_type": "markdown", "id": "767b4e49-19eb-4d5f-b030-91bd78f0f5b9", "metadata": {}, "source": [ "`DataFrame#[]` is overloading column operations and row operations.\n", "\n", "- For columns (variables)\n", " - Key in a Symbol: `df[:symbol]`\n", " - Key in a String: `df[\"string\"]`\n", " - Keys in an Array: `df[:symbol1, \"string\", :symbol2]`\n", " - Keys by indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`" ] }, { "cell_type": "code", "execution_count": 43, "id": "ccf60edc-cccf-49e3-a503-1ca532247130", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys in a Symbol and a String\n", "df[:x, 'y']" ] }, { "cell_type": "code", "execution_count": 44, "id": "8500f8c0-ff5a-4537-9f47-03d675e31b18", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys in a Range\n", "df['x'..'y']" ] }, { "cell_type": "code", "execution_count": 45, "id": "db35cae1-35c2-47de-a7e8-906161f21282", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
sbx
Atrue1
Bfalse2
Ctrue3
Dfalse4
(nil)(nil)5
" ], "text/plain": [ "#\n", " s b x\n", " \n", "1 A true 1\n", "2 B false 2\n", "3 C true 3\n", "4 D false 4\n", "5 (nil) (nil) 5\n" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys with a index Range, and a symbol\n", "df[df.keys[2..], :x]" ] }, { "cell_type": "markdown", "id": "03e14403-f7bc-4350-9e7b-715901164331", "metadata": {}, "source": [ "## 21. Select rows by #[ ]\n", "`DataFrame#[]` is overloading column operations and row operations.\n", "\n", "- For rows (observations)\n", " - Select rows by a Index: `df[index]`\n", " - Select rows by Indices: `df[indices]` # Array, Arrow::Array, Vectors are acceptable for indices\n", " - Select rows by Ranges: `df[range]`\n", " - Select rows by Booleans: `df[booleans]` # Array, Arrow::Array, Vectors are acceptable for booleans" ] }, { "cell_type": "code", "execution_count": 46, "id": "e3bc60a7-611e-4fd8-9770-8e0d167d3fee", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
22.0Bfalse
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n", "3 2 2.0 B false\n" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# indices\n", "df[0, 2, 1]" ] }, { "cell_type": "code", "execution_count": 47, "id": "2b8b3801-ae37-4629-9db5-ff937941c895", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
22.0Bfalse
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 2 2.0 B false\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# including a Range\n", "# negative indices are also acceptable\n", "df[1..2, -1]" ] }, { "cell_type": "code", "execution_count": 48, "id": "3f6f8d73-a66c-4773-9bf5-0878c700f2d6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
22.0Bfalse
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 2 2.0 B false\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# booleans\n", "# length of boolean should be the same as self\n", "df[false, true, true, false, true]" ] }, { "cell_type": "code", "execution_count": 49, "id": "abe57279-54fd-48ec-a1a4-c7453211e776", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow::Array\n", "indices = Arrow::UInt8Array.new([0,2,4])\n", "df[indices]" ] }, { "cell_type": "code", "execution_count": 50, "id": "2266611f-23d8-4645-a1e8-b07c2370fb3f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
33.0Ctrue
4NaNDfalse
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 3 3.0 C true\n", "2 4 NaN D false\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# By a Vector as indices\n", "indices = Vector.new(df.indices)\n", "# indices > 1 returns a boolean Vector\n", "df[indices > 1]" ] }, { "cell_type": "code", "execution_count": 51, "id": "0ea2da7e-aeca-4874-be4a-6af563aa378b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, true, false, nil]\n" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# By a Vector as booleans\n", "booleans = df[:b]" ] }, { "cell_type": "code", "execution_count": 52, "id": "9f842890-6359-4266-9a23-2f8f813ef548", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[booleans]" ] }, { "cell_type": "markdown", "id": "98a04874-cb2c-44c0-b410-b330b9d12b0f", "metadata": {}, "source": [ "## 22. empty?" ] }, { "cell_type": "code", "execution_count": 53, "id": "7b1ab319-90a7-4f09-8629-04dcd94076cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.empty?" ] }, { "cell_type": "code", "execution_count": 54, "id": "3f9f8771-87dd-44eb-8aac-6a3ed8b4c183", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(empty DataFrame)" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame.new" ] }, { "cell_type": "code", "execution_count": 55, "id": "1e09c32f-20a8-4175-827f-cdb98063535a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame.new.empty?" ] }, { "cell_type": "markdown", "id": "86b826dd-10e6-4087-9162-b89ac6561a61", "metadata": {}, "source": [ "## 23. Select columns by pick" ] }, { "cell_type": "markdown", "id": "b5aefd22-4e96-4dc5-91d2-e6826256bda6", "metadata": { "tags": [] }, "source": [ "`DataFrame#pick` accepts an Array of keys to pick up columns (variables) and creates a new DataFrame. You can change the order of columns at a same time.\n", "\n", "The name `pick` comes from the action to pick variables(columns) according to the label keys." ] }, { "cell_type": "code", "execution_count": 56, "id": "68124521-b823-424d-9e06-d11aa927d618", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
sy
A1.0
B2.0
C3.0
DNaN
(nil)(nil)
" ], "text/plain": [ "#\n", " s y\n", " \n", "1 A 1.0\n", "2 B 2.0\n", "3 C 3.0\n", "4 D NaN\n", "5 (nil) (nil)\n" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(:s, :y)\n", "# or\n", "df.pick([:s, :y]) # OK too." ] }, { "cell_type": "markdown", "id": "a76dca00-da8f-4959-be18-7a1015a9d13c", "metadata": {}, "source": [ "Or use a boolean Array of lengeh `n_key` to `pick`. This style preserves the order of variables." ] }, { "cell_type": "code", "execution_count": 57, "id": "b91f8925-529c-43c9-93ba-e21bcac0f2f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(false, true, true, false)\n", "# or\n", "df.pick([false, true, true, false])\n", "# or\n", "df.pick(Vector.new([false, true, true, false]))" ] }, { "cell_type": "markdown", "id": "5f903182-745b-4923-99d8-14a9b9c6ea4c", "metadata": {}, "source": [ "`#pick` also accepts a block in the context of self.\n", "\n", "Next example is picking up numeric variables." ] }, { "cell_type": "code", "execution_count": 58, "id": "37bb0a49-c38a-484c-91d4-3e23ab43a727", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reciever is required with the argument style\n", "df.pick(df.vectors.map(&:numeric?))\n", "\n", "# with a block\n", "df.pick { vectors.map(&:numeric?) }" ] }, { "cell_type": "markdown", "id": "1692e8c3-aafa-4951-84a6-ff26e4f8bbde", "metadata": {}, "source": [ "`pick` also accepts numeric indexes.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": 59, "id": "dd929ef0-8af4-4bb6-963a-b3f1abd73bf0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xb
1true
2false
3true
4false
5(nil)
" ], "text/plain": [ "#\n", " x b\n", " \n", "1 1 true\n", "2 2 false\n", "3 3 true\n", "4 4 false\n", "5 5 (nil)\n" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(0, 3)" ] }, { "cell_type": "markdown", "id": "7c1815e4-de6c-425e-8602-b8dd66836250", "metadata": {}, "source": [ "## 24. Reject columns by drop" ] }, { "cell_type": "markdown", "id": "d1ab045e-66f9-4922-8bf2-35aee7f2812e", "metadata": { "tags": [] }, "source": [ "`DataFrame#drop` accepts an Array keys to drop columns (variables) to create a remainer DataFrame.\n", "\n", "The name `drop` comes from the pair word of `pick`." ] }, { "cell_type": "code", "execution_count": 60, "id": "7ccace08-62b0-4b0b-93fb-81edf673abf7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(:x, :b)\n", "# df.drop([:x, :b]) #is OK too." ] }, { "cell_type": "markdown", "id": "2085b349-95c5-4607-b029-f7c3d630ac1c", "metadata": {}, "source": [ "Or use a boolean Array of lengeh `n_key` to `drop`." ] }, { "cell_type": "code", "execution_count": 61, "id": "785c02f1-1e16-4722-9961-4b49223c8290", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(true, false, false, true)\n", "# df.drop([true, false, false, true]) # is OK too" ] }, { "cell_type": "markdown", "id": "d246161e-02cc-40fb-8921-26b37eb3956f", "metadata": {}, "source": [ "`#drop` also accepts a block in the context of self.\n", "\n", "Next example will drop variables which have nil or NaN values." ] }, { "cell_type": "code", "execution_count": 62, "id": "069932e3-d393-4ede-9eb5-7aac8625e0c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop { vectors.map { |v| v.is_na.any } }" ] }, { "cell_type": "markdown", "id": "88b064d6-7d90-4a0b-b9c8-d92e103269fb", "metadata": {}, "source": [ "Argument style is also acceptable but it requires the reciever 'df'." ] }, { "cell_type": "code", "execution_count": 63, "id": "3003a5c2-0966-4f2c-9643-59e8b546c8aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(df.vectors.map { |v| v.is_na.any })" ] }, { "cell_type": "markdown", "id": "e30c2b72-fc05-405c-97dc-7e31b88abebf", "metadata": {}, "source": [ "`drop` also accepts numeric indexes.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": 64, "id": "0b6eebe0-145d-4660-8be4-e6f09eb656aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(0, 3)" ] }, { "cell_type": "markdown", "id": "0f6dc86c-828d-4f9f-8b07-fce63c30fdca", "metadata": {}, "source": [ "## 25. Pick/drop and nil" ] }, { "cell_type": "markdown", "id": "0a108878-565b-400e-9a47-a15aae09429c", "metadata": {}, "source": [ "When `pick` or `drop` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `BasicObject#!`." ] }, { "cell_type": "code", "execution_count": 65, "id": "7c01fbb4-9bfa-4afc-8e6b-45c97c0beb03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = [true, true, false, nil]\n", "booleans_invert = booleans.map(&:!) # => [false, false, true, true] because nil.! is true\n", "df.pick(booleans) == df.drop(booleans_invert)" ] }, { "cell_type": "markdown", "id": "12a24264-9b7a-42a1-a541-e292e3876e35", "metadata": {}, "source": [ "## 26. Vector#invert, #primitive_invert\n", "\n", "For the boolean Vector;" ] }, { "cell_type": "code", "execution_count": 66, "id": "ea352e12-7e8a-43be-b8ac-797adbc47708", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, false, nil]\n" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = Vector.new(booleans)" ] }, { "cell_type": "markdown", "id": "2a0f82e0-157b-4185-9254-0618be291f9b", "metadata": {}, "source": [ "nil is converted to nil by `Vector#invert`." ] }, { "cell_type": "code", "execution_count": 67, "id": "596c521f-12bf-4448-9e5d-e1b4a2c3d896", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, nil]\n" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.invert\n", "# or\n", "!vector" ] }, { "cell_type": "markdown", "id": "a1aec910-3055-4627-a02b-22d45f2ceb70", "metadata": {}, "source": [ "So `df.pick(booleans) != df.drop(booleans.invert)` when booleans have any nils.\n", "\n", "On the other hand, `Vector#primitive_invert` follows Ruby's `BasicObject#!`'s behavior. Then pick and drop keep 'MECE' behavior." ] }, { "cell_type": "code", "execution_count": 68, "id": "4dcaba48-1cea-4ce9-b4a9-b079b43af7ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, true]\n" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.primitive_invert" ] }, { "cell_type": "code", "execution_count": 69, "id": "c7ae4dad-275a-49e0-a0b0-bf3686248070", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(vector) == df.drop(vector.primitive_invert)" ] }, { "cell_type": "markdown", "id": "9a6cec74-43f0-4a72-8262-25b1e311f602", "metadata": {}, "source": [ "## 27. Pick/drop and [ ]" ] }, { "cell_type": "markdown", "id": "32c8f74d-b3ce-4305-9af7-6ea70052c773", "metadata": {}, "source": [ "When `pick` or `drop` select a single column (variable), it returns a `DataFrame` with one column (variable)." ] }, { "cell_type": "code", "execution_count": 70, "id": "e13aee24-cac6-41ad-b8a3-0ec26edbe5d1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(:x) # or\n", "df.drop(:y, :s, :b)" ] }, { "cell_type": "markdown", "id": "3e47b9d2-929e-4674-9690-0a1fdf7b0a7d", "metadata": {}, "source": [ "In contrast, when `[]` selects a single column (variable), it returns a `Vector`." ] }, { "cell_type": "code", "execution_count": 71, "id": "60d228be-7357-434d-9d39-ee72c110e6fe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[:x]" ] }, { "cell_type": "markdown", "id": "6d973934-e08b-4b45-8efb-52f9167e7238", "metadata": {}, "source": [ "This behavior may be useful to use with DataFrame manipulation verbs (like pick, drop, slice, remove, assign, rename)." ] }, { "cell_type": "code", "execution_count": 72, "id": "6beefc5a-dc47-42cc-a283-456073c4251e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick { keys.select { |key| df[key].numeric? } }" ] }, { "cell_type": "markdown", "id": "34c9bcb0-889a-4190-b2b8-49765cd059c2", "metadata": {}, "source": [ "## 28. Slice\n", "\n", "Another example of `slice` is [#66. Row index label by slice_by](#66.-Row-index-label-by-slice_by)." ] }, { "cell_type": "markdown", "id": "9a428ba8-c306-4ab8-8607-51174e8e6ebe", "metadata": {}, "source": [ "`slice` selects rows (observations) to create a subset of a DataFrame." ] }, { "cell_type": "markdown", "id": "6016d6d4-72d6-4ae2-b7dd-3d526c91ae61", "metadata": {}, "source": [ "`slice(indeces)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. Negative index from the tail like Ruby's Array is also acceptable." ] }, { "cell_type": "code", "execution_count": 73, "id": "9cdce2e4-7876-4be6-bd1f-bc8ab6e6c871", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <10 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", " 8 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", " 9 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "10 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns 5 rows from the start and 5 rows from the end\n", "penguins.slice(0...5, -5..-1)" ] }, { "cell_type": "code", "execution_count": 74, "id": "93c3f6f0-7bc9-4909-8f32-20e8c1ddfd3a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <1 x 9 vectors>
indexspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
113AdelieBiscoe42.219.51974275male2009
" ], "text/plain": [ "#\n", " index species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", "1 113 Adelie Biscoe 42.2 19.5 197 ... 2009\n" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# slice accepts Float index\n", "# 33% of 344 observations in index => 113.52 th data ??\n", "indexed_penguins = penguins.assign_left { [:index, indexes] } # #assign_left and assigner by Array is 0.2.0 feature\n", "indexed_penguins.slice(penguins.size * 0.33)" ] }, { "cell_type": "markdown", "id": "8139bb28-89f8-4058-b824-dde33ead0b60", "metadata": {}, "source": [ "Indices in Vectors or Arrow::Arrays are also acceptable." ] }, { "cell_type": "markdown", "id": "6f79db8c-c706-4d60-949b-3f644474d375", "metadata": {}, "source": [ "Another way to select in `slice` is to use booleans.\n", "- Booleans is an Array, Arrow::Array, Vector or their Array.\n", "- Each data type must be boolean.\n", "- Size of booleans must be same as the size of self." ] }, { "cell_type": "code", "execution_count": 75, "id": "f58ca131-7375-4489-90ce-6ba54b898eb5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, nil, false, false, false, false, false, true, false, false, ... ]\n" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# make boolean Vector to check over 40\n", "booleans = penguins[:bill_length_mm] > 40" ] }, { "cell_type": "code", "execution_count": 76, "id": "176ab365-c66a-4712-97b9-4381a536321b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <242 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen42.020.21904250(nil)2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen42.520.71974500male2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 2 Adelie Torgersen 42.0 20.2 190 ... 2007\n", " 3 Adelie Torgersen 41.1 17.6 182 ... 2007\n", " 4 Adelie Torgersen 42.5 20.7 197 ... 2007\n", " 5 Adelie Torgersen 46.0 21.5 194 ... 2007\n", " : : : : : : ... :\n", "240 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "241 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "242 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.slice(booleans)" ] }, { "cell_type": "markdown", "id": "3264a182-6b72-461a-b712-c3b708c53516", "metadata": {}, "source": [ "`slice` accepts a block.\n", "- We can't use both arguments and a block at a same time.\n", "- The block should return indeces in any length or a boolean Array with a same length as `size`.\n", "- Block is called in the context of self. So reciever 'self' can be omitted in the block." ] }, { "cell_type": "code", "execution_count": 77, "id": "c95d3426-0bbb-430e-8d83-6e22434d99ed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <204 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen39.320.61903650male2007
GentooBiscoe47.213.72144925female2009
GentooBiscoe46.814.32154850female2009
GentooBiscoe45.214.82125200female2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " 5 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " : : : : : : ... :\n", "202 Gentoo Biscoe 47.2 13.7 214 ... 2009\n", "203 Gentoo Biscoe 46.8 14.3 215 ... 2009\n", "204 Gentoo Biscoe 45.2 14.8 212 ... 2009\n" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# return a DataFrame with bill_length_mm is in 2*std range around mean\n", "penguins.slice do\n", " min = bill_length_mm.mean - bill_length_mm.std\n", " max = bill_length_mm.mean + bill_length_mm.std\n", " bill_length_mm.to_a.map { |e| (min..max).include? e }\n", "end" ] }, { "cell_type": "markdown", "id": "4fa42801-64f5-4432-856b-85c26a68515d", "metadata": {}, "source": [ "## 29. Slice and nil option" ] }, { "cell_type": "markdown", "id": "31017a7e-0923-4283-bc92-246ebe2591c3", "metadata": {}, "source": [ "`Arrow::Table#slice` uses `#filter` method with a option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row." ] }, { "cell_type": "code", "execution_count": 78, "id": "8e4a8108-154b-4621-acd1-704ddf229d61", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\t a\tb\t c\n", "0\t 1\tA\t 1.000000\n", "1\t(null)\t(null)\t (null)\n" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }\n", "table = Arrow::Table.new(hash)\n", "table.slice([true, false, nil])" ] }, { "cell_type": "markdown", "id": "dbb57c5a-e949-42b8-a82c-9affb3fe5b7b", "metadata": {}, "source": [ "Whereas in RedAmber, `DataFrame#slice` with booleans containing nil is treated as false. This behavior comes from `Allow::FilterOptions.null_selection_behavior = :drop`. This is a default value for `Arrow::Table.filter` method." ] }, { "cell_type": "code", "execution_count": 79, "id": "851c3bf6-b9e9-41bd-92c5-5372ed934549", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\ta\tb\t c\n", "0\t1\tA\t 1.000000\n" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "RedAmber::DataFrame.new(table).slice([true, false, nil]).table" ] }, { "cell_type": "markdown", "id": "56398a3d-6146-43af-8b96-fec37730fc49", "metadata": {}, "source": [ "## 30. Remove" ] }, { "cell_type": "markdown", "id": "9e042a97-8a5d-412e-8e4a-fda382225a2d", "metadata": {}, "source": [ "Slice and reject rows (observations) to create a remainer DataFrame." ] }, { "cell_type": "markdown", "id": "2b4cbb97-eef3-4db8-8f25-c44c208ec554", "metadata": {}, "source": [ "`#remove(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer." ] }, { "cell_type": "code", "execution_count": 80, "id": "17e38ab8-886b-4114-bcaf-ee18df7d00cd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <334 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.320.61903650male2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male2007
AdelieTorgersen34.118.11933475(nil)2007
GentooBiscoe44.515.72174875(nil)2009
GentooBiscoe48.816.22226000male2009
GentooBiscoe47.213.72144925female2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " 2 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " 3 Adelie Torgersen 39.2 19.6 195 ... 2007\n", " 4 Adelie Torgersen 34.1 18.1 193 ... 2007\n", " 5 Adelie Torgersen 42.0 20.2 190 ... 2007\n", " : : : : : : ... :\n", "332 Gentoo Biscoe 44.5 15.7 217 ... 2009\n", "333 Gentoo Biscoe 48.8 16.2 222 ... 2009\n", "334 Gentoo Biscoe 47.2 13.7 214 ... 2009\n" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns 6th to 339th obs. Remainer of 1st example of #30\n", "penguins.remove(0...5, -5..-1)" ] }, { "cell_type": "markdown", "id": "def1c1c4-6b60-4864-ae24-c797fbf008a7", "metadata": {}, "source": [ "`remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `#size`." ] }, { "cell_type": "code", "execution_count": 81, "id": "6f169420-7eb2-457f-8d59-7a5c90aa3fa5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove all observation contains nil\n", "removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }" ] }, { "cell_type": "markdown", "id": "5f1864c9-4ae4-4fcd-9840-ea424ef5e27d", "metadata": {}, "source": [ "`remove {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as size. Block is called in the context of self." ] }, { "cell_type": "code", "execution_count": 82, "id": "a6807c65-25e5-4ee1-8d1b-6018c46b3999", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <140 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen34.118.11933475(nil)2007
AdelieTorgersen37.817.11863300(nil)2007
GentooBiscoe(nil)(nil)(nil)(nil)(nil)2009
GentooBiscoe50.415.72225750male2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 2 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 3 Adelie Torgersen 34.1 18.1 193 ... 2007\n", " 4 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " 5 Adelie Torgersen 37.8 17.3 180 ... 2007\n", " : : : : : : ... :\n", "138 Gentoo Biscoe (nil) (nil) (nil) ... 2009\n", "139 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "140 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove data in 2*std range around mean\n", "penguins.remove do\n", " vector = self[:bill_length_mm]\n", " min = vector.mean - vector.std\n", " max = vector.mean + vector.std\n", " vector.to_a.map { |e| (min..max).include? e }\n", "end" ] }, { "cell_type": "markdown", "id": "591e6b22-da98-4336-b22e-c7bc9bcf2ebf", "metadata": {}, "source": [ "## 31. Remove and nil" ] }, { "cell_type": "markdown", "id": "67926d1b-c76e-4cb7-b679-6545d850e7e4", "metadata": {}, "source": [ "When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`." ] }, { "cell_type": "code", "execution_count": 83, "id": "8575614e-f702-4ee4-ac7b-745e9b32e803", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
abc
1A1.0
2B2.0
(nil)C3.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n", "2 2 B 2.0\n", "3 (nil) C 3.0\n" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])" ] }, { "cell_type": "code", "execution_count": 84, "id": "932a5e71-8cef-44e5-a789-ce97329bc001", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil]\n" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = df[:a] < 2" ] }, { "cell_type": "code", "execution_count": 85, "id": "74cf6aa6-8913-433d-97ad-bba2d548afe5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[false, true, true]" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans_invert = booleans.to_a.map(&:!)" ] }, { "cell_type": "code", "execution_count": 86, "id": "5e466a06-cb17-4dc1-a5b0-34bfd3ffb78b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(booleans) == df.remove(booleans_invert)" ] }, { "cell_type": "markdown", "id": "8bca0b06-2d08-4c28-8b4c-4fd088f2d2d3", "metadata": {}, "source": [ "Whereas `Vector#invert` returns nil for elements nil. This will bring different result. (See #26)" ] }, { "cell_type": "code", "execution_count": 87, "id": "077b216f-0a08-413e-95c9-12789d15a9ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, nil]\n" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.invert" ] }, { "cell_type": "code", "execution_count": 88, "id": "b3df62a6-c4a3-44cb-bde6-f6be12b120c8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
abc
1A1.0
(nil)C3.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n", "2 (nil) C 3.0\n" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(booleans.invert)" ] }, { "cell_type": "markdown", "id": "e05f00b6-3bae-4650-8bbc-d4e0692f6f85", "metadata": {}, "source": [ "We have `#primitive_invert` method in Vector. This method returns the same result as `.to_a.map(&:!)` above." ] }, { "cell_type": "code", "execution_count": 89, "id": "296ca3cd-a6da-4603-a576-d8c36a810e4f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, true]\n" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.primitive_invert" ] }, { "cell_type": "code", "execution_count": 90, "id": "ba5b8c0b-b94e-4209-adcd-258ea3b87bfd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <1 x 3 vectors>
abc
1A1.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(booleans.primitive_invert)" ] }, { "cell_type": "code", "execution_count": 91, "id": "2446792f-0b0a-4642-acae-b4fec89261c1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(booleans) == df.remove(booleans.primitive_invert)" ] }, { "cell_type": "markdown", "id": "7c23a4ad-0c17-4178-b58a-abfd8153d49b", "metadata": {}, "source": [ "## 32. Remove nil" ] }, { "cell_type": "markdown", "id": "84c7238b-1029-416f-b495-9d045f77b22c", "metadata": {}, "source": [ "Remove any observations containing nil." ] }, { "cell_type": "code", "execution_count": 92, "id": "de4bb615-d14d-4c90-ab54-db2f375b9f00", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.remove_nil" ] }, { "cell_type": "markdown", "id": "4a4ae8f9-dcf8-4dad-bb77-af076e9cadb5", "metadata": {}, "source": [ "The roundabout way for this is to use `#remove`." ] }, { "cell_type": "code", "execution_count": 93, "id": "27a3da5f-0ea2-4c5d-a6c3-c0e20f2224a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.remove { vectors.map(&:is_nil).reduce(&:|) }" ] }, { "cell_type": "markdown", "id": "4f2a58fd-f033-44f6-9eb4-ed893a2b5d1d", "metadata": {}, "source": [ "## 33. Rename" ] }, { "cell_type": "markdown", "id": "c0d39506-8ae5-48e7-9dd2-acf38d4ec1a9", "metadata": {}, "source": [ "Rename keys (column names) to create a updated DataFrame." ] }, { "cell_type": "markdown", "id": "3f6924ec-e86c-4089-ae40-6783027d3ce0", "metadata": {}, "source": [ "`#rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Array `[[existing_key, new_key], ...]` ." ] }, { "cell_type": "code", "execution_count": 94, "id": "9396c96d-83d7-4b92-a4ca-27bc9e4d7b9d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }\n", "comecome = RedAmber::DataFrame.new(h)" ] }, { "cell_type": "code", "execution_count": 95, "id": "fad279c6-1ca0-4493-bd69-0e9ef011bff7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage_in_1993
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age_in_1993\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comecome.rename(:age => :age_in_1993)\n", "# comecome.rename(:age, :age_in_1993) # is also OK\n", "# comecome.rename([:age, :age_in_1993]) # is also OK" ] }, { "cell_type": "markdown", "id": "9dabb005-9822-4c4b-aaa5-fa6f28f2ed43", "metadata": {}, "source": [ "`#rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Array `[[existing_key, new_key], ...]`. Block is called in the context of self." ] }, { "cell_type": "markdown", "id": "aabbba20-6ef8-4da2-8dc0-0cb243cf3b23", "metadata": {}, "source": [ "Symbol key and String key are distinguished." ] }, { "cell_type": "markdown", "id": "07f98b31-6123-4466-b4f8-f995c7cde474", "metadata": {}, "source": [ "## 34. Assign" ] }, { "cell_type": "markdown", "id": "99f6787f-2b36-4360-b155-1c2d7874d25e", "metadata": {}, "source": [ "Another example of `assign` is [64. Assign revised](#64.-Assign-revised), [#65. Variations of assign](#65.-Variations-of-assign) .\n", "\n", "Assign new or updated columns (variables) and create a updated DataFrame.\n", "\n", "- Columns with new keys will append new variables at right (bottom in TDR).\n", "- Columns with exisiting keys will update corresponding vectors." ] }, { "cell_type": "markdown", "id": "b4b22da0-4ee2-4196-88e1-1cfea6a72f4d", "metadata": {}, "source": [ "`#assign(key_pairs)` accepts pairs of key and array_like values as arguments. The pairs should be a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. " ] }, { "cell_type": "code", "execution_count": 96, "id": "56dcfed8-a6f9-4d8c-bac3-e8ce7c0674a7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comecome = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28] )" ] }, { "cell_type": "code", "execution_count": 97, "id": "8da8d282-8798-44d5-bb7b-7fa2df922308", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
nameagebrother
Yasuko97Santa
Rui78(nil)
Hinata57Momotaro
" ], "text/plain": [ "#\n", " name age brother\n", " \n", "1 Yasuko 97 Santa\n", "2 Rui 78 (nil)\n", "3 Hinata 57 Momotaro\n" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# update :age and add :brother\n", "assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }\n", "comecome.assign(assigner)" ] }, { "cell_type": "markdown", "id": "e6d3ddfc-b16d-4b20-83df-357e9cdb32e6", "metadata": {}, "source": [ "`#assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and array_like values as a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. Block is called in the context of self." ] }, { "cell_type": "code", "execution_count": 98, "id": "8d69edd0-7ad7-4318-8033-1785ce2543db", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
indexfloatstring
00.0A
11.1B
22.2C
3NaND
(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " index float string\n", " \n", "1 0 0.0 A\n", "2 1 1.1 B\n", "3 2 2.2 C\n", "4 3 NaN D\n", "5 (nil) (nil) (nil)\n" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.new(\n", " index: [0, 1, 2, 3, nil],\n", " float: [0.0, 1.1, 2.2, Float::NAN, nil],\n", " string: ['A', 'B', 'C', 'D', nil])" ] }, { "cell_type": "code", "execution_count": 99, "id": "e884af01-d82b-42e7-8e92-62baf19919cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
indexfloatstring
0-0.0A
255-1.1B
254-2.2C
253NaND
(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " index float string\n", " \n", "1 0 -0.0 A\n", "2 255 -1.1 B\n", "3 254 -2.2 C\n", "4 253 NaN D\n", "5 (nil) (nil) (nil)\n" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# update numeric variables\n", "df.assign do\n", " vectors.select(&:numeric?).map { |v| [v.key, -v] }\n", "end" ] }, { "cell_type": "markdown", "id": "7b8e2090-628f-4b17-8929-cbb5e0285dff", "metadata": {}, "source": [ "In this example, columns :x and :y are updated. Column :x returns complements for #negate method because :x is :uint8 type." ] }, { "cell_type": "code", "execution_count": 100, "id": "9452f8db-5f23-4044-ac87-ac5695fae8ae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:uint8, :double, :string]" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.types" ] }, { "cell_type": "markdown", "id": "c5c83559-f4d8-4ed2-8b20-5c50eb1faa14", "metadata": {}, "source": [ "## 35. Coerce (Vector)" ] }, { "cell_type": "markdown", "id": "77bdfc69-b728-4335-b76e-e4be92f94310", "metadata": {}, "source": [ "Vector has coerce method." ] }, { "cell_type": "code", "execution_count": 101, "id": "2bfbe584-be54-486b-af32-e76b37c10e49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3]\n" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = RedAmber::Vector.new(1,2,3)" ] }, { "cell_type": "code", "execution_count": 102, "id": "ce35d901-38a8-4f13-b2d1-29b83f6c5438", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1, -2, -3]\n" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Vector's `#*` method\n", "vector * -1" ] }, { "cell_type": "code", "execution_count": 103, "id": "7d5fc2be-f590-4678-92e9-faa27b618266", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1, -2, -3]\n" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# coerced calculation\n", "-1 * vector" ] }, { "cell_type": "code", "execution_count": 104, "id": "fa90a6af-add7-42f2-9707-7d726575aeb6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[255, 254, 253]\n" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# `@-` operator\n", "-vector" ] }, { "cell_type": "markdown", "id": "4820b527-44e9-4738-aa0e-73604078b3b0", "metadata": { "tags": [] }, "source": [ "## 36. to_ary (Vector)" ] }, { "cell_type": "markdown", "id": "8507dcc4-74e3-44ad-aa54-cf43d55f2131", "metadata": {}, "source": [ "`Vector#to_ary` will enable implicit conversion to an Array." ] }, { "cell_type": "code", "execution_count": 105, "id": "b12bd7c8-2981-426c-8ae3-154504a8ea15", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3, 4, 5]" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Array(Vector.new([3, 4, 5]))" ] }, { "cell_type": "code", "execution_count": 106, "id": "c0cb5a98-7cdf-43a8-b2f7-f9df1961c761", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, 4, 5]" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[1, 2] + Vector.new([3, 4, 5])" ] }, { "cell_type": "markdown", "id": "216dde4f-e4d8-4f29-903a-8cbf75de5b8e", "metadata": {}, "source": [ "## 37. Fill nil (Vector)" ] }, { "cell_type": "markdown", "id": "1959d0d7-6d09-4fa5-9365-1e2f7fc35d61", "metadata": {}, "source": [ "`Vector#fill_nil_forward` or `Vector#fill_nil_backward` will\n", "propagate the last valid observation forward (or backward).\n", "Or preserve nil if all previous values are nil or at the end." ] }, { "cell_type": "code", "execution_count": 107, "id": "d003b06a-859f-4de0-9e35-803efac85169", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[0, 1, 1, 3, 3]\n" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integer = Vector.new([0, 1, nil, 3, nil])\n", "integer.fill_nil_forward" ] }, { "cell_type": "code", "execution_count": 108, "id": "c5d74006-d364-4e86-8a5e-9e96e87a96e0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[0, 1, 3, 3, nil]\n" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integer.fill_nil_backward" ] }, { "cell_type": "markdown", "id": "347785a6-eab0-4864-a871-2c320005211e", "metadata": {}, "source": [ "## 38. all?/any? (Vector)" ] }, { "cell_type": "markdown", "id": "f82a6f5d-03d3-4645-85f5-d25999165378", "metadata": {}, "source": [ "`Vector#all?` returns true if all elements is true.\n", "\n", "`Vector#any?` returns true if exists any true.\n", "\n", "These are unary aggregation function." ] }, { "cell_type": "code", "execution_count": 109, "id": "ebad37ad-0a09-48b1-ba3a-4e030a917837", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = Vector.new([true, true, nil])\n", "booleans.all?" ] }, { "cell_type": "code", "execution_count": 110, "id": "97fc24da-03d4-406d-b353-562896775d60", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.any?" ] }, { "cell_type": "markdown", "id": "0ff3b22d-9f7c-42f2-8d18-c89a06af681b", "metadata": {}, "source": [ "If these methods are used with option `skip_nulls: false` nil is considered." ] }, { "cell_type": "code", "execution_count": 111, "id": "3e0e5800-665a-4a05-b2cb-d152f3f077de", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.all?(skip_nulls: false)" ] }, { "cell_type": "code", "execution_count": 112, "id": "3e43f0c4-a254-4735-ac28-de14d2670c67", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.any?(skip_nulls: false)" ] }, { "cell_type": "markdown", "id": "abc71a85-7958-4a21-91cf-8c96c0784525", "metadata": {}, "source": [ "## 39. count/count_uniq (Vector)" ] }, { "cell_type": "markdown", "id": "3d556118-4105-4d12-806d-ba56c6ae3d1b", "metadata": {}, "source": [ "`Vector#count` counts element.\n", "\n", "`Vector#count_uniq` counts unique element. `#count_distinct` is an alias (Arrow's name).\n", "\n", "These are unary aggregation function." ] }, { "cell_type": "code", "execution_count": 113, "id": "2af73e32-1d7e-4f80-b54e-c40ef08b7034", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = Vector.new(%w[A B A])\n", "string.count" ] }, { "cell_type": "code", "execution_count": 114, "id": "fe6d8d85-27b0-438f-b1b4-1b15e9eb05f9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string.count_uniq # count_distinct is also OK" ] }, { "cell_type": "markdown", "id": "70abed9f-665a-4ea7-939e-4b185ee53755", "metadata": {}, "source": [ "## 40. stddev/variance (Vector)" ] }, { "cell_type": "markdown", "id": "965de338-b3be-4d33-92e1-5ad7e2ed18f0", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 115, "id": "0afec200-f377-432b-a260-ae5a0c5ce794", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.816496580927726" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integers = Vector.new([1, 2, 3, nil])\n", "integers.stddev" ] }, { "cell_type": "code", "execution_count": 116, "id": "2e40ac09-cb7f-4978-87e8-53f84f16f7c7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unbiased standard deviation\n", "integers.sd" ] }, { "cell_type": "code", "execution_count": 117, "id": "e6158e3b-4af8-467c-a355-8e9f2e579548", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6666666666666666" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integers.variance" ] }, { "cell_type": "code", "execution_count": 118, "id": "d64d39f2-d979-49f1-9946-65890f40d646", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unbiased variance\n", "integers.var" ] }, { "cell_type": "markdown", "id": "25023f5a-798a-40a5-ab84-a6615602f747", "metadata": {}, "source": [ "## 41. negate (Vector)" ] }, { "cell_type": "markdown", "id": "00ddf322-ef50-40a1-86a6-22bf3d43f007", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 119, "id": "ab5a357a-e98c-40a1-9b89-0b38645e416f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1.0, 2.0, -3.0]\n" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = Vector.new([1.0, -2, 3])\n", "double.negate" ] }, { "cell_type": "code", "execution_count": 120, "id": "8a06c856-d61c-4752-a296-1fa207ffd9a1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1.0, 2.0, -3.0]\n" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as #negate\n", "-double" ] }, { "cell_type": "markdown", "id": "9b145724-d165-4ef3-8a06-2948dd0c7dbb", "metadata": {}, "source": [ "## 42. round (Vector)" ] }, { "cell_type": "markdown", "id": "b780c2f3-935c-4b2f-b18a-b277cf7c24b7", "metadata": {}, "source": [ "Otions for `#round`;\n", "\n", "- `:n-digits` The number of digits to show.\n", "- `round_mode` Specify rounding mode.\n", "\n", "This is a unary element-wise function." ] }, { "cell_type": "code", "execution_count": 121, "id": "e7a069b0-3547-4cd2-a2f0-0740f186b191", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.15, 2.5, 3.5, -4.5, -5.5]\n" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = RedAmber::Vector.new([15.15, 2.5, 3.5, -4.5, -5.5])" ] }, { "cell_type": "code", "execution_count": 122, "id": "5ee84b24-8830-4788-a404-d5e1cca22abf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round" ] }, { "cell_type": "code", "execution_count": 123, "id": "20adb1ad-473c-4245-b959-7848c239fb76", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_to_even)" ] }, { "cell_type": "code", "execution_count": 124, "id": "d2777ad8-2c24-48e4-8f5f-77403e3109ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[16.0, 3.0, 4.0, -5.0, -6.0]\n" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :towards_infinity)" ] }, { "cell_type": "code", "execution_count": 125, "id": "a8ab2735-74cb-4cfe-a5a2-61bfa90c72ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 4.0, -4.0, -5.0]\n" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_up)" ] }, { "cell_type": "code", "execution_count": 126, "id": "3575481c-40ed-405f-a69c-7581d4dce2cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 3.0, -4.0, -5.0]\n" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_towards_zero)" ] }, { "cell_type": "code", "execution_count": 127, "id": "a86e4c5c-aced-4a88-b692-4e26b90f1653", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 4.0, -5.0, -6.0]\n" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_towards_infinity)" ] }, { "cell_type": "code", "execution_count": 128, "id": "73f51bab-ff46-4b99-96a5-8c6547ad9d35", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 3.0, -5.0, -5.0]\n" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_to_odd)" ] }, { "cell_type": "code", "execution_count": 129, "id": "a12c684c-4a63-4dac-a81b-969978812a24", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: 0)" ] }, { "cell_type": "code", "execution_count": 130, "id": "17370f2b-0957-411b-8145-56aa9fc956ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.2, 2.5, 3.5, -4.5, -5.5]\n" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: 1)" ] }, { "cell_type": "code", "execution_count": 131, "id": "53072cff-b28b-4672-b30a-8ca37562bc21", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[20.0, 0.0, 0.0, -0.0, -10.0]\n" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: -1)" ] }, { "cell_type": "markdown", "id": "51dedfce-51c7-4e5b-b890-a90ad9cf7596", "metadata": {}, "source": [ "## 43. and/or (Vector)" ] }, { "cell_type": "markdown", "id": "b2c4869b-6ebf-476c-b2fd-a4b9c0638dc5", "metadata": {}, "source": [ "RedAmber select `and_kleene`/`or_kleene` as default `&`/`|` method.\n", "\n", "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 132, "id": "2d4f5853-1ed9-4d8b-87a9-b5c1faac5fae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, false, false, false, nil, false, nil]\n" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool_self = Vector.new([true, true, true, false, false, false, nil, nil, nil])\n", "bool_other = Vector.new([true, false, nil, true, false, nil, true, false, nil])\n", "\n", "bool_self & bool_other # same as bool_self.and_kleene(bool_other)" ] }, { "cell_type": "code", "execution_count": 133, "id": "236c9733-8d45-467e-b288-e6c18b9c39d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, true, false, nil, true, false, nil]\n" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ruby's primitive `&&`\n", "bool_self && bool_other" ] }, { "cell_type": "code", "execution_count": 134, "id": "4e984a9c-7d9c-465d-bf26-0c685dedd4bf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, false, false, nil, nil, nil, nil]\n" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow's default `and`\n", "bool_self.and_org(bool_other)" ] }, { "cell_type": "code", "execution_count": 135, "id": "0120ebf5-355d-41f5-83d5-49b9802f337b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, true, false, nil, true, nil, nil]\n" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool_self | bool_other # same as bool_self.or_kleene(bool_other)" ] }, { "cell_type": "code", "execution_count": 136, "id": "24ceee23-79df-4fcd-afd8-f3839a087785", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, false, false, false, nil, nil, nil]\n" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ruby's primitive `||`\n", "bool_self || bool_other" ] }, { "cell_type": "code", "execution_count": 137, "id": "c152d04b-71a0-4b18-acd1-b5ab9e413d00", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, nil, true, false, nil, nil, nil, nil]\n" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow's default `or`\n", "bool_self.or_org(bool_other)" ] }, { "cell_type": "markdown", "id": "beede237-c5ed-4e12-a432-ec7e4546d786", "metadata": {}, "source": [ "## 44. is_finite/is_nan/is_nil/is_na (Vector)" ] }, { "cell_type": "markdown", "id": "77418efd-c0d7-4d63-a7db-2d43fafd386e", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 138, "id": "19558f9e-fdc4-46e5-90d0-724e4e8fbd8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[3.141592653589793, Infinity, -Infinity, NaN, nil]\n" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = Vector.new([Math::PI, Float::INFINITY, -Float::INFINITY, Float::NAN, nil])" ] }, { "cell_type": "code", "execution_count": 139, "id": "d90a7168-1f87-4363-9589-c1f161babc7d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, false, false, nil]\n" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_finite" ] }, { "cell_type": "code", "execution_count": 140, "id": "7d88049b-695f-4b0c-a105-8fb5797a58b1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, true, false, nil]\n" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_inf" ] }, { "cell_type": "code", "execution_count": 141, "id": "7d86a7b5-84bf-4031-9811-4076281920cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, false, true, true]\n" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_na" ] }, { "cell_type": "code", "execution_count": 142, "id": "d562f826-7a37-4c57-8f92-777555987246", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, false, false, true]\n" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_nil" ] }, { "cell_type": "code", "execution_count": 143, "id": "e460dc6b-e48f-4462-9ce8-aa6069ebae27", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, true, false]\n" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_valid" ] }, { "cell_type": "markdown", "id": "2cca75eb-f0e8-4f85-89cb-3601512e76b0", "metadata": {}, "source": [ "## 45. Prime-th rows" ] }, { "cell_type": "code", "execution_count": 144, "id": "0751e820-a22d-45b5-9005-df523d2353be", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <68 x 9 vectors>
indexspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
2AdelieTorgersen39.517.41863800female2007
3AdelieTorgersen40.318.01953250female2007
5AdelieTorgersen36.719.31933450female2007
7AdelieTorgersen38.917.81813625female2007
317GentooBiscoe49.415.82164925male2009
331GentooBiscoe50.515.22165000female2009
337GentooBiscoe44.515.72174875(nil)2009
" ], "text/plain": [ "#\n", " index species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 2 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 3 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 4 7 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " 5 11 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " : : : : : : : ... :\n", "66 317 Gentoo Biscoe 49.4 15.8 216 ... 2009\n", "67 331 Gentoo Biscoe 50.5 15.2 216 ... 2009\n", "68 337 Gentoo Biscoe 44.5 15.7 217 ... 2009\n" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# prime-th rows ... Don't ask me what it means.\n", "require 'prime'\n", "penguins.assign_left(:index, Vector.new(penguins.indices) + 1) # since 0.2.0\n", " .slice { Vector.new(Prime.each(size).to_a) - 1 }" ] }, { "cell_type": "markdown", "id": "c9e8de1a-ad8f-4fdc-a65c-4d3db7123530", "metadata": {}, "source": [ "## 46. Slice by Enumerator" ] }, { "cell_type": "markdown", "id": "32dd53a3-a822-4ae1-afe2-b5aa2bfbd3e3", "metadata": {}, "source": [ "Slice accepts Enumerator." ] }, { "cell_type": "code", "execution_count": 145, "id": "b2a118fa-f3c0-4f31-9b45-6db27ccbebe6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <35 x 9 vectors>
indexspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.71813750male2007
10AdelieTorgersen37.817.11863300(nil)2007
20AdelieBiscoe37.818.31743400female2007
30AdelieDream39.516.71783250female2007
320GentooBiscoe48.515.02194850female2009
330GentooBiscoe50.515.22165000female2009
340GentooBiscoe46.814.32154850female2009
" ], "text/plain": [ "#\n", " index species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 0 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 10 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " 3 20 Adelie Biscoe 37.8 18.3 174 ... 2007\n", " 4 30 Adelie Dream 39.5 16.7 178 ... 2007\n", " 5 40 Adelie Dream 36.5 18.0 182 ... 2007\n", " : : : : : : : ... :\n", "33 320 Gentoo Biscoe 48.5 15.0 219 ... 2009\n", "34 330 Gentoo Biscoe 50.5 15.2 216 ... 2009\n", "35 340 Gentoo Biscoe 46.8 14.3 215 ... 2009\n" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select every 10 samples\n", "penguins.assign_left(index: penguins.indices) # 0.2.0 feature\n", " .slice(0.step by: 10, to: 340)" ] }, { "cell_type": "markdown", "id": "db312c2c-3a7c-4765-bfad-b3313b173a79", "metadata": {}, "source": [ "## 47. Output mode" ] }, { "cell_type": "markdown", "id": "714ed8df-5aa3-4ac4-8b0d-6390aff73c8c", "metadata": {}, "source": [ "Output mode of `#inspect` and `#to_iruby` is Table mode by default. If you prefer TDR mode set the environment variable `RED_AMBER_OUTPUT_MODE` to `\"TDR\"`." ] }, { "cell_type": "code", "execution_count": 146, "id": "a721804b-006e-44c6-8d38-885eae747eaa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Table' # or nil (default)\n", "penguins # Almost same as `puts penguins.to_s` in any mode" ] }, { "cell_type": "code", "execution_count": 147, "id": "e4c9f70c-a4b1-4a81-bbc4-e9b14a6b6cb0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", " ... 5 more Vectors ...\n", "\n" ] } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'TDR'\n", "p penguins; nil # Almost same as `penguins.tdr` in any mode" ] }, { "cell_type": "code", "execution_count": 148, "id": "2786e9a7-e321-43c5-b56e-9f2ca9d62f8b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins" ] }, { "cell_type": "code", "execution_count": 149, "id": "b00c858b-b14a-492b-bc22-d6a707bcc1ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Table\"" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Table'" ] }, { "cell_type": "markdown", "id": "c6781904-c014-4ef0-86c5-8a758ca3ca1c", "metadata": {}, "source": [ "## 48. Empty key" ] }, { "cell_type": "markdown", "id": "38912e38-bddd-47cc-adfd-5c4f780636ea", "metadata": {}, "source": [ "Empty key `:\"\"` will be automatically renamed to `:unnamed1`.\n", "\n", "If `:unnamed1` was used, `:unnamed1.succ` will be used.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 150, "id": "13569004-bb23-45fa-8d11-fe5f367641a6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 2 vectors>
unnamed2unnamed1
13
24
" ], "text/plain": [ "#\n", " unnamed2 unnamed1\n", " \n", "1 1 3\n", "2 2 4\n" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame.new(\"\": [1, 2], unnamed1: [3, 4])" ] }, { "cell_type": "markdown", "id": "a4343343-d59a-4b71-b08f-4b737a6d4258", "metadata": {}, "source": [ "## 49. Grouping" ] }, { "cell_type": "markdown", "id": "8ddf0937-7b0e-4ee6-8116-8e78c308e76f", "metadata": {}, "source": [ "`DataFrame#group` takes group_key as an argument, and creates `Group` class. \n", "\n", "Group class inspects counts of each unique elements. \n", "\n", "(Since 0.1.7)" ] }, { "cell_type": "code", "execution_count": 151, "id": "ee602e52-7988-4fab-b5e3-c466acf01c98", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#{\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}}>" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group = penguins.group(:species)" ] }, { "cell_type": "markdown", "id": "a48ab0e9-60e3-4c07-a27b-9c53783318c1", "metadata": {}, "source": [ "The instance of `Group` class has methods to summary functions.\n", "\n", "It returns `function(key)` style summarized columns as a result." ] }, { "cell_type": "code", "execution_count": 152, "id": "20b23ada-b895-4921-b57b-8d46b451e494", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 8 vectors>
speciescount(island)count(bill_length_mm)count(bill_depth_mm)count(flipper_length_mm)count(body_mass_g)count(sex)count(year)
Adelie152151151151151146152
Chinstrap68686868686868
Gentoo124123123123123119124
" ], "text/plain": [ "#\n", " species count(island) count(bill_length_mm) count(bill_depth_mm) ... count(year)\n", " ... \n", "1 Adelie 152 151 151 ... 152\n", "2 Chinstrap 68 68 68 ... 68\n", "3 Gentoo 124 123 123 ... 124\n" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group.count" ] }, { "cell_type": "markdown", "id": "584d8fa1-1e70-4f31-bdc2-5d4956a28cb5", "metadata": {}, "source": [ "If count result is same in multiple columns, count column is aggregated to one column `:count`." ] }, { "cell_type": "code", "execution_count": 153, "id": "e6936488-9f23-47bd-8492-537c5be1afb3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
speciescount
Adelie151
Chinstrap68
Gentoo123
" ], "text/plain": [ "#\n", " species count\n", " \n", "1 Adelie 151\n", "2 Chinstrap 68\n", "3 Gentoo 123\n" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.pick(:species, :bill_length_mm, :bill_depth_mm).group(:species).count" ] }, { "cell_type": "markdown", "id": "7c90b59b-918c-492c-ac5b-46b62b31d136", "metadata": {}, "source": [ "Grouping key comes first (leftmost) in the columns." ] }, { "cell_type": "markdown", "id": "21c8a6fa-3f95-41ea-817f-ffc1011e8df9", "metadata": {}, "source": [ "## 50. Grouping with a block" ] }, { "cell_type": "markdown", "id": "db89c8b4-8189-410d-abe9-eac8f440065c", "metadata": {}, "source": [ "`DataFrame#group` takes a block and we can specify multiple functions.\n", "\n", "Inside the block is the context of instance of Group. So we can use summary functions without the reciever.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 154, "id": "913f576b-ec86-4e94-af05-7c656ea24cc2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
speciescountmean(body_mass_g)
Adelie1523700.662251655629
Chinstrap683733.0882352941176
Gentoo1245076.016260162602
" ], "text/plain": [ "#\n", " species count mean(body_mass_g)\n", " \n", "1 Adelie 152 3700.66\n", "2 Chinstrap 68 3733.09\n", "3 Gentoo 124 5076.02\n" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.group(:species) { [count(:species), mean(:body_mass_g)] }" ] }, { "cell_type": "markdown", "id": "e6dc3206-d327-4242-9908-253ad4a74759", "metadata": {}, "source": [ "`Group#summarize` accepts same block as `DataFrame#group`." ] }, { "cell_type": "code", "execution_count": 155, "id": "67c7fc55-7b30-469c-bd0c-cda5732863fe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 7 vectors>
speciescountmean(bill_length_mm)mean(bill_depth_mm)mean(flipper_length_mm)mean(body_mass_g)mean(year)
Adelie15238.7913907284768418.346357615894032189.953642384105963700.6622516556292008.0131578947369
Chinstrap6848.8338235294117718.420588235294115195.82352941176473733.08823529411762007.9705882352941
Gentoo12447.50487804878047614.982113821138206217.18699186991875076.0162601626022008.0806451612902
" ], "text/plain": [ "#\n", " species count mean(bill_length_mm) mean(bill_depth_mm) ... mean(year)\n", " ... \n", "1 Adelie 152 38.79 18.35 ... 2008.01\n", "2 Chinstrap 68 48.83 18.42 ... 2007.97\n", "3 Gentoo 124 47.5 14.98 ... 2008.08\n" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group.summarize { [count(:species), mean] }" ] }, { "cell_type": "markdown", "id": "88682c8d-0ab6-4cee-8e59-5b6ad0ae5dac", "metadata": {}, "source": [ "## 51. Vector#shift" ] }, { "cell_type": "markdown", "id": "c6dd1bc0-7146-4884-aac5-9874e829235e", "metadata": {}, "source": [ "`Vector#shift(amount = 1, fill: nil)`\n", "\n", "Shift vector's values by specified `amount`. Shifted space is filled by value `fill`.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 156, "id": "013f2db6-3e1d-481f-a908-57605729b51d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[nil, 1, 2, 3, 4]\n" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = RedAmber::Vector.new([1, 2, 3, 4, 5])\n", "vector.shift" ] }, { "cell_type": "code", "execution_count": 157, "id": "7625acd7-d6a0-4775-b5e0-ca87f95f4f28", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[3, 4, 5, nil, nil]\n" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.shift(-2)" ] }, { "cell_type": "code", "execution_count": 158, "id": "34a9ac2a-2e3f-44bc-8ba7-c4487dc3528e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[NaN, 1.0, 2.0, 3.0, 4.0]\n" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.shift(fill: Float::NAN)" ] }, { "cell_type": "markdown", "id": "39f305a6-5718-4e81-b08f-564fa93861cc", "metadata": { "tags": [] }, "source": [ "## 52. From the Pandas cookbook (if-then)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#if-then" ] }, { "cell_type": "markdown", "id": "4f1d41dd-714a-4c68-af3b-a9080847f833", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df.loc[df.AAA >= 5, \"BBB\"] = -1\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "1 5 -1 50\n", "2 6 -1 -30\n", "3 7 -1 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 159, "id": "24774ccc-8f0f-4ce4-9ba0-bebed8781c38", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
410100
5-150
6-1-30
7-1-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 -1 50\n", "3 6 -1 -30\n", "4 7 -1 -50\n" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50] # You can omit {}\n", ")\n", "\n", "df.assign(BBB: df[:BBB].replace(df[:AAA] >= 5, -1))" ] }, { "cell_type": "markdown", "id": "c76ca2e3-e239-440b-923b-dfabb1a1c007", "metadata": {}, "source": [ "If you want to replace both :BBB and :CCC ;" ] }, { "cell_type": "code", "execution_count": 160, "id": "3f97227b-cbee-4515-b76d-3514401967d9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
410100
5-1-2
6-1-2
7-1-2
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 -1 -2\n", "3 6 -1 -2\n", "4 7 -1 -2\n" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign do\n", " replacer = v(:AAA) >= 5 # Boolean Vector\n", " {\n", " BBB: v(:BBB).replace(replacer, -1),\n", " CCC: v(:CCC).replace(replacer, -2)\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "4bf560f9-4393-4ed5-9994-24f8a93470fb", "metadata": { "tags": [] }, "source": [ "## 53. From the Pandas cookbook (Splitting)\n", "Split a frame with a boolean criterion\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#splitting" ] }, { "cell_type": "markdown", "id": "fcf06cad-4a34-45e2-b8f8-8422914eb319", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df[df.AAA <= 5]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "1 5 20 50\n", "\n", "df[df.AAA > 5]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "2 6 30 -30\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 161, "id": "b08e74d4-aba8-4fb5-a815-5d5384e92f81", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
410100
52050
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 20 50\n" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " # You can omit {}\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df.slice(df[:AAA] <= 5)\n", "# df[df[:AAA] <= 5] # is also OK" ] }, { "cell_type": "code", "execution_count": 162, "id": "caa72796-ff7e-4275-849f-04698114ee08", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
630-30
740-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 6 30 -30\n", "2 7 40 -50\n" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(df[:AAA] <= 5)\n", "# df.slice(df[:AAA] > 5) # do the same thing" ] }, { "cell_type": "markdown", "id": "ba7588e9-bbac-4547-a56c-3eea9f819460", "metadata": { "tags": [] }, "source": [ "## 54. From the Pandas cookbook (Building criteria)\n", "Split a frame with a boolean criterion\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#building-criteria" ] }, { "cell_type": "markdown", "id": "c6d01b08-1af7-47b0-a9e4-f27ab41fe24e", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "\n", "# and\n", "df.loc[(df[\"BBB\"] < 25) & (df[\"CCC\"] >= -40), \"AAA\"]\n", "\n", "# returns a series =>\n", "0 4\n", "1 5\n", "Name: AAA, dtype: int64\n", "\n", "# or\n", "df.loc[(df[\"BBB\"] > 25) | (df[\"CCC\"] >= -40), \"AAA\"]\n", "\n", "# returns a series =>\n", "0 4\n", "1 5\n", "2 6\n", "3 7\n", "Name: AAA, dtype: int64\n", "```" ] }, { "cell_type": "code", "execution_count": 163, "id": "46066e96-91e5-4a96-9840-7e4ce6f06818", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 1 vector>
AAA
4
5
" ], "text/plain": [ "#\n", " AAA\n", " \n", "1 4\n", "2 5\n" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " # You can omit {}\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df.slice( (df[:BBB] < 25) & (df[:CCC] >= 40) ).pick(:AAA)" ] }, { "cell_type": "code", "execution_count": 164, "id": "dc8304c2-be28-420e-b2d5-a4b636eaac8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 1 vector>
AAA
4
5
6
7
" ], "text/plain": [ "#\n", " AAA\n", " \n", "1 4\n", "2 5\n", "3 6\n", "4 7\n" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice( (df[:BBB] > 25) | (df[:CCC] >= 40) ).pick(:AAA)\n", "# df[ (df[:BBB] > 25) | (df[:CCC] >= 40) ][:AAA)] # also OK" ] }, { "cell_type": "markdown", "id": "79616705-f497-4bb4-ad1d-5ee93c0093ce", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# or (with assignment)\n", "df.loc[(df[\"BBB\"] > 25) | (df[\"CCC\"] >= 75), \"AAA\"] = 0.1\n", "df\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "0 0.1 10 100\n", "1 5.0 20 50\n", "2 0.1 30 -30\n", "3 0.1 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 165, "id": "7ab3b044-5a0f-4a38-8a42-aed1228b6462", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
0.110100
5.02050
0.130-30
0.140-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 0.1 10 100\n", "2 5.0 20 50\n", "3 0.1 30 -30\n", "4 0.1 40 -50\n" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# df.assign(AAA: df[:AAA].replace((df[:BBB] > 25) | (df[:CCC] >= 75), 0.1)) # by one liner\n", "\n", "booleans = (df[:BBB] > 25) | (df[:CCC] >= 75)\n", "replaced = df[:AAA].replace(booleans, 0.1)\n", "df.assign(AAA: replaced)" ] }, { "cell_type": "markdown", "id": "6cd4a47c-f619-462b-94c0-4c488761d5b0", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Select rows with data closest to certain value using argsort\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "aValue = 43.0\n", "df.loc[(df.CCC - aValue).abs().argsort()]\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "1 5 20 50\n", "0 4 10 100\n", "2 6 30 -30\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 166, "id": "13cb1d45-2d13-4708-ad76-57efd72e609b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
52050
410100
630-30
740-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 5 20 50\n", "2 4 10 100\n", "3 6 30 -30\n", "4 7 40 -50\n" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a_value = 43\n", "df[(df[:CCC] - a_value).abs.sort_indexes]\n", "# df.slice (df[:CCC] - a_value).abs.sort_indexes # also OK" ] }, { "cell_type": "markdown", "id": "9330117b-40e0-4574-8900-7857622daad4", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Dynamically reduce a list of criteria using a binary operators\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "Crit1 = df.AAA <= 5.5\n", "Crit2 = df.BBB == 10.0\n", "Crit3 = df.CCC > -40.0\n", "AllCrit = Crit1 & Crit2 & Crit3\n", "\n", "import functools\n", "\n", "CritList = [Crit1, Crit2, Crit3]\n", "AllCrit = functools.reduce(lambda x, y: x & y, CritList)\n", "df[AllCrit]\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "```" ] }, { "cell_type": "code", "execution_count": 167, "id": "40336e62-d411-4655-8ec5-8d30876ada47", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <1 x 3 vectors>
AAABBBCCC
410100
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crit1 = df[:AAA] <= 5.5\n", "crit2 = df[:BBB] == 10.0\n", "crit3 = df[:CCC] >= -40.0\n", "df[crit1 & crit2 & crit3]" ] }, { "cell_type": "markdown", "id": "62e03375-573d-4368-a26e-1be3a4d58cf8", "metadata": { "tags": [] }, "source": [ "## 55. From the Pandas cookbook (Dataframes)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#dataframes" ] }, { "cell_type": "markdown", "id": "462ce855-6e78-49a6-8b5f-48a5864b6397", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Using both row labels and value conditionals\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))] \n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "2 6 30 -30\n", "```" ] }, { "cell_type": "code", "execution_count": 168, "id": "9da915f1-c772-4071-b6d6-292d8ffea857", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
410100
630-30
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 6 30 -30\n" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df[(df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }]" ] }, { "cell_type": "markdown", "id": "26bfea62-12e4-4a01-976e-6184ecafa2fd", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Use loc for label-oriented slicing and iloc positional slicing GH2904\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]},\n", " index=[\"foo\", \"bar\", \"boo\", \"kar\"],\n", ")\n", "\n", "# There are 2 explicit slicing methods, with a third general case\n", "# 1. Positional-oriented (Python slicing style : exclusive of end)\n", "# 2. Label-oriented (Non-Python slicing style : inclusive of end)\n", "# 3. General (Either slicing style : depends on if the slice contains labels or positions)\n", "\n", "df.loc[\"bar\":\"kar\"] # Label\n", "# returns =>\n", " AAA BBB CCC\n", "bar 5 20 50\n", "boo 6 30 -30\n", "kar 7 40 -50\n", "\n", "# Generic\n", "df[0:3]\n", "# returns =>\n", " AAA BBB CCC\n", "foo 4 10 100\n", "bar 5 20 50\n", "boo 6 30 -30\n", "\n", "df[\"bar\":\"kar\"]\n", "# returns =>\n", " AAA BBB CCC\n", "bar 5 20 50\n", "boo 6 30 -30\n", "kar 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 169, "id": "ccabc137-33d3-47e8-ad09-88a285765380", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 4 vectors>
indexAAABBBCCC
foo410100
bar52050
boo630-30
kar740-50
" ], "text/plain": [ "#\n", " index AAA BBB CCC\n", " \n", "1 foo 4 10 100\n", "2 bar 5 20 50\n", "3 boo 6 30 -30\n", "4 kar 7 40 -50\n" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber does not have row index. Use a new column as indexes.\n", "labeled = df.assign_left(index: %w[foo bar boo kar])\n", "# labeled = df.assign(index: %w[foo bar boo kar]).pick { [keys[-1], keys[0...-1]] } # until v0.1.8" ] }, { "cell_type": "code", "execution_count": 170, "id": "f0871131-725e-4e33-a3cc-1fccd610a4b2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
indexAAABBBCCC
bar52050
boo630-30
kar740-50
" ], "text/plain": [ "#\n", " index AAA BBB CCC\n", " \n", "1 bar 5 20 50\n", "2 boo 6 30 -30\n", "3 kar 7 40 -50\n" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled[1..3]" ] }, { "cell_type": "code", "execution_count": 171, "id": "dffc55f0-481e-4076-aa3f-89f1294655b9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
indexAAABBBCCC
bar52050
boo630-30
kar740-50
" ], "text/plain": [ "#\n", " index AAA BBB CCC\n", " \n", "1 bar 5 20 50\n", "2 boo 6 30 -30\n", "3 kar 7 40 -50\n" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.slice do\n", " v = v(:index)\n", " v.index(\"bar\")..v.index(\"kar\")\n", "end" ] }, { "cell_type": "markdown", "id": "ee4f4498-c9d8-4cb9-8b1b-faa19f968f30", "metadata": {}, "source": [ "`slice_by` returns the same result as above.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": 172, "id": "97c76bdc-6c1b-4b7d-92bf-d5108fe479a0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
indexAAABBBCCC
bar52050
boo630-30
kar740-50
" ], "text/plain": [ "#\n", " index AAA BBB CCC\n", " \n", "1 bar 5 20 50\n", "2 boo 6 30 -30\n", "3 kar 7 40 -50\n" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.slice_by(:index, keep_key: true) { \"bar\"..\"kar\"}" ] }, { "cell_type": "markdown", "id": "d72b067c-d15d-499b-8d2a-8b79f5287e97", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment.\n", "df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4]) # Note index starts at 1.\n", "\n", "df2.iloc[1:3] # Position-oriented\n", "# returns =>\n", " AAA BBB CCC\n", "2 5 20 50\n", "3 6 30 -30\n", "\n", "df2.loc[1:3] # Label-oriented\n", "# returns =>\n", " AAA BBB CCC\n", "1 4 10 100\n", "2 5 20 50\n", "3 6 30 -30\n", "```" ] }, { "cell_type": "code", "execution_count": 173, "id": "f023a23a-ea1d-415c-b395-195801351433", "metadata": {}, "outputs": [], "source": [ "# RedAmber only have an implicit integer index 0...size,\n", "# does not happen any ambiguity unless you create a new column and use it for indexes :-)." ] }, { "cell_type": "markdown", "id": "5cada26c-a37a-4eff-bfa5-66635d278671", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Using inverse operator (~) to take the complement of a mask\n", "df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "1 5 20 50\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 174, "id": "29b81efe-5b9b-4640-89b1-422ae94cf01d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
52050
740-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 5 20 50\n", "2 7 40 -50\n" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RedAmber offers #! method for boolean Vector.\n", "df[!((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i })]\n", "\n", "# or\n", "# df[((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }).invert]" ] }, { "cell_type": "markdown", "id": "4fb7103f-00ed-4e74-81e5-7d480004e681", "metadata": {}, "source": [ "If you have `nil` in your data, consider #primitive_invert for consistent result. See example #26." ] }, { "cell_type": "markdown", "id": "a1780372-b566-41f5-84d6-6213e3f9efa7", "metadata": { "tags": [] }, "source": [ "## 56. From the Pandas cookbook (New columns)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#new-columns" ] }, { "cell_type": "markdown", "id": "5f5f9d8b-7550-44f6-9f79-64f940f5000a", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Efficiently and dynamically creating new columns using applymap\n", "df = pd.DataFrame({\"AAA\": [1, 2, 1, 3], \"BBB\": [1, 1, 2, 2], \"CCC\": [2, 1, 3, 1]})\n", "df\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 1 1 2\n", "1 2 1 1\n", "2 1 2 3\n", "3 3 2 1\n", "\n", "source_cols = df.columns # Or some subset would work too\n", "new_cols = [str(x) + \"_cat\" for x in source_cols]\n", "categories = {1: \"Alpha\", 2: \"Beta\", 3: \"Charlie\"}\n", "df[new_cols] = df[source_cols].applymap(categories.get)\n", "df\n", "\n", "# returns =>\n", " AAA BBB CCC AAA_cat BBB_cat CCC_cat\n", "0 1 1 2 Alpha Alpha Beta\n", "1 2 1 1 Beta Alpha Alpha\n", "2 1 2 3 Alpha Beta Charlie\n", "3 3 2 1 Charlie Beta Alpha\n", "```" ] }, { "cell_type": "code", "execution_count": 175, "id": "265d63e6-0c01-4080-8d5b-c3153be595a5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
112
211
123
321
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 1 1 2\n", "2 2 1 1\n", "3 1 2 3\n", "4 3 2 1\n" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RedAmber\n", "df = DataFrame.new({\"AAA\": [1, 2, 1, 3], \"BBB\": [1, 1, 2, 2], \"CCC\": [2, 1, 3, 1]})" ] }, { "cell_type": "code", "execution_count": 176, "id": "4be751d7-69a8-4400-b094-9932bf3d577b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 6 vectors>
AAABBBCCCAAA_catBBB_catCCC_cat
112AlphaAlphaBeta
211BetaAlphaAlpha
123AlphaBetaCharlie
321CharlieBetaAlpha
" ], "text/plain": [ "#\n", " AAA BBB CCC AAA_cat BBB_cat CCC_cat\n", " \n", "1 1 1 2 Alpha Alpha Beta\n", "2 2 1 1 Beta Alpha Alpha\n", "3 1 2 3 Alpha Beta Charlie\n", "4 3 2 1 Charlie Beta Alpha\n" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories = {1 => \"Alpha\", 2 => \"Beta\", 3 => \"Charlie\"}\n", "\n", "# Creating a Hash from keys\n", "df.assign do\n", " keys.each_with_object({}) do |key, h|\n", " h[\"#{key}_cat\"] = v(key).to_a.map { |x| categories[x] }\n", " end\n", "end\n", "\n", "# Creating an Array from vectors, from v0.2.0\n", "df.assign do\n", " vectors.map do |v|\n", " [\"#{v.key}_cat\", v.to_a.map { |x| categories[x] } ]\n", " end\n", "end" ] }, { "cell_type": "markdown", "id": "aa8214e5-a897-406e-bb5e-95ad9fea0cdd", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Keep other columns when using min() with groupby\n", "df = pd.DataFrame(\n", " {\"AAA\": [1, 1, 1, 2, 2, 2, 3, 3], \"BBB\": [2, 1, 3, 4, 5, 1, 2, 3]}\n", ")\n", "df\n", "\n", "# returns =>\n", " AAA BBB\n", "0 1 2\n", "1 1 1\n", "2 1 3\n", "3 2 4\n", "4 2 5\n", "5 2 1\n", "6 3 2\n", "7 3 3\n", "\n", "# Method 1 : idxmin() to get the index of the minimums\n", "df.loc[df.groupby(\"AAA\")[\"BBB\"].idxmin()]\n", "\n", "# returns =>\n", " AAA BBB\n", "1 1 1\n", "5 2 1\n", "6 3 2\n", "\n", "# Method 2 : sort then take first of each\n", "df.sort_values(by=\"BBB\").groupby(\"AAA\", as_index=False).first()\n", "\n", "# returns =>\n", " AAA BBB\n", "0 1 1\n", "1 2 1\n", "2 3 2\n", "\n", "# Notice the same results, with the exception of the index.\n", "```" ] }, { "cell_type": "code", "execution_count": 177, "id": "5dd5f7dc-21a3-4397-9ef6-c5bc630ff858", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <8 x 2 vectors>
AAABBB
12
11
13
24
25
21
32
33
" ], "text/plain": [ "#\n", " AAA BBB\n", " \n", "1 1 2\n", "2 1 1\n", "3 1 3\n", "4 2 4\n", "5 2 5\n", "6 2 1\n", "7 3 2\n", "8 3 3\n" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RedAmber\n", "df = DataFrame.new(AAA: [1, 1, 1, 2, 2, 2, 3, 3], BBB: [2, 1, 3, 4, 5, 1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 178, "id": "000c1632-8faf-407f-bb7d-e583ef573442", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
AAAmin(BBB)
11
21
32
" ], "text/plain": [ "#\n", " AAA min(BBB)\n", " \n", "1 1 1\n", "2 2 1\n", "3 3 2\n" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.group(:AAA).min\n", "\n", "# Add `.rename { [keys[-1], :BBB] }` if you want." ] }, { "cell_type": "markdown", "id": "5aa5b4ab-2fb3-4a22-804e-55d6351dd427", "metadata": { "tags": [] }, "source": [ "## 57. Summary/describe" ] }, { "cell_type": "code", "execution_count": 179, "id": "610be94e-b7ce-43f5-a5c1-ddef745d6bac", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 9 vectors>
variablescountmeanstdmin25%median75%max
bill_length_mm34243.921929824561415.459583713926531532.139.22544.38200000000000548.559.6
bill_depth_mm34217.1511695906432741.974793156816781413.115.617.3218.721.5
flipper_length_mm342200.9152046783625814.061713679356888172.0190.0197.0213.0231.0
body_mass_g3424201.754385964912801.95453569809552700.03550.04031.54750.06300.0
year3442008.02906976744180.81835592548370412007.02007.02008.02009.02009.0
" ], "text/plain": [ "#\n", " variables count mean std min 25% median ... max\n", " ... \n", "1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 ... 59.6\n", "2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 ... 21.5\n", "3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 ... 231.0\n", "4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 ... 6300.0\n", "5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 ... 2009.0\n" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.summary\n", "# or\n", "penguins.describe" ] }, { "cell_type": "markdown", "id": "c06cbb0e-fd75-4567-88df-bd9902fc94e2", "metadata": {}, "source": [ "If you need a variables in row, use `transpose`. (Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 180, "id": "42406c73-6480-4a30-a940-930bf6804fff", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <8 x 6 vectors>
statsbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gyear
count342.0342.0342.0342.0344.0
mean43.9219298245614117.151169590643274200.915204678362584201.7543859649122008.0290697674418
std5.45958371392653151.974793156816781414.061713679356888801.95453569809550.8183559254837041
min32.113.1172.02700.02007.0
25%39.22515.6190.03550.02007.0
median44.38200000000000517.32197.04031.52008.0
75%48.518.7213.04750.02009.0
max59.621.5231.06300.02009.0
" ], "text/plain": [ "#\n", " stats bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year\n", " \n", "1 count 342.0 342.0 342.0 342.0 344.0\n", "2 mean 43.92 17.15 200.92 4201.75 2008.03\n", "3 std 5.46 1.97 14.06 801.95 0.82\n", "4 min 32.1 13.1 172.0 2700.0 2007.0\n", "5 25% 39.23 15.6 190.0 3550.0 2007.0\n", "6 median 44.38 17.32 197.0 4031.5 2008.0\n", "7 75% 48.5 18.7 213.0 4750.0 2009.0\n", "8 max 59.6 21.5 231.0 6300.0 2009.0\n" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.summary.transpose(name: :stats)" ] }, { "cell_type": "markdown", "id": "982a735e-c887-4c15-be82-4f3a28138fa7", "metadata": {}, "source": [ "## 58. Quantile" ] }, { "cell_type": "markdown", "id": "d0832e36-7281-4d3c-a35d-5a89aecc341e", "metadata": {}, "source": [ "`Vector#quantile(prob)` returns quantile at probability `prob`.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 181, "id": "88c6ed5d-b41e-4c17-b93f-47ce15702974", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17.3" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins[:bill_depth_mm].quantile # default is prob = 0.5" ] }, { "cell_type": "markdown", "id": "07368c10-65d4-4081-ace7-12bebca493c2", "metadata": {}, "source": [ "`Vector#quantiles` accepts an Array for multiple quantiles. Returns a DataFrame." ] }, { "cell_type": "code", "execution_count": 182, "id": "3cbac413-fc36-42a2-abd3-1343e3b88467", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 2 vectors>
probsquantiles
0.0513.9
0.9520.0
" ], "text/plain": [ "#\n", " probs quantiles\n", " \n", "1 0.05 13.9\n", "2 0.95 20.0\n" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins[:bill_depth_mm].quantiles([0.05, 0.95])" ] }, { "cell_type": "markdown", "id": "54800864-c33d-4fec-818b-1a12a7e4d015", "metadata": {}, "source": [ "## 59. Transpose" ] }, { "cell_type": "markdown", "id": "517281cf-3f29-4725-816c-5a866e9cc9cc", "metadata": {}, "source": [ "`DataFrame#transpose` creates transposed DataFrame for wide type dataframe.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 183, "id": "a2fe66cb-e86d-402e-8724-eced2420e3d0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
YearAudiBMWBMW_MINIMercedes-BenzVW
20172833652527254276822149040
20182647350982259846755451961
20192422246814238136655346794
20202230435712201965704136576
20212253535905182115172235215
" ], "text/plain": [ "#\n", " Year Audi BMW BMW_MINI Mercedes-Benz VW\n", " \n", "1 2017 28336 52527 25427 68221 49040\n", "2 2018 26473 50982 25984 67554 51961\n", "3 2019 24222 46814 23813 66553 46794\n", "4 2020 22304 35712 20196 57041 36576\n", "5 2021 22535 35905 18211 51722 35215\n" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uri = URI(\"https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv\")\n", "import_cars = RedAmber::DataFrame.load(uri)" ] }, { "cell_type": "code", "execution_count": 184, "id": "b4ef2b48-33de-4982-b082-1966749fcf65", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
N20172018201920202021
Audi2833626473242222230422535
BMW5252750982468143571235905
BMW_MINI2542725984238132019618211
Mercedes-Benz6822167554665535704151722
VW4904051961467943657635215
" ], "text/plain": [ "#\n", " N 2017 2018 2019 2020 2021\n", " \n", "1 Audi 28336 26473 24222 22304 22535\n", "2 BMW 52527 50982 46814 35712 35905\n", "3 BMW_MINI 25427 25984 23813 20196 18211\n", "4 Mercedes-Benz 68221 67554 66553 57041 51722\n", "5 VW 49040 51961 46794 36576 35215\n" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.transpose" ] }, { "cell_type": "markdown", "id": "556e5c2d-bb01-4e16-9bf6-bdfd302d5b2a", "metadata": {}, "source": [ "You can specify index column by option `:key` even if it is in the middle of the original DataFrame." ] }, { "cell_type": "code", "execution_count": 185, "id": "31edb594-93fb-493f-8036-14e8273596ed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 5 vectors>
AudiBMWYearBMW_MINIMercedes-Benz
283365252720172542768221
264735098220182598467554
242224681420192381366553
223043571220202019657041
225353590520211821151722
" ], "text/plain": [ "#\n", " Audi BMW Year BMW_MINI Mercedes-Benz\n", " \n", "1 28336 52527 2017 25427 68221\n", "2 26473 50982 2018 25984 67554\n", "3 24222 46814 2019 23813 66553\n", "4 22304 35712 2020 20196 57041\n", "5 22535 35905 2021 18211 51722\n" ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# locate `:Year` in the middle\n", "df = import_cars.pick(1..2, 0, 3..)" ] }, { "cell_type": "code", "execution_count": 186, "id": "ffa7cecc-5298-49d6-b483-b2b5cf2e1820", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 6 vectors>
N20172018201920202021
Audi2833626473242222230422535
BMW5252750982468143571235905
BMW_MINI2542725984238132019618211
Mercedes-Benz6822167554665535704151722
" ], "text/plain": [ "#\n", " N 2017 2018 2019 2020 2021\n", " \n", "1 Audi 28336 26473 24222 22304 22535\n", "2 BMW 52527 50982 46814 35712 35905\n", "3 BMW_MINI 25427 25984 23813 20196 18211\n", "4 Mercedes-Benz 68221 67554 66553 57041 51722\n" ] }, "execution_count": 186, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.transpose(key: :Year)" ] }, { "cell_type": "markdown", "id": "48178c99-6c68-4588-b515-fcddaeb6741b", "metadata": {}, "source": [ "Name the column from the keys in original by the option `name:`." ] }, { "cell_type": "code", "execution_count": 187, "id": "31679a91-11b4-42a8-95cc-58644a3003ba", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 6 vectors>
Manufacturer20172018201920202021
Audi2833626473242222230422535
BMW5252750982468143571235905
BMW_MINI2542725984238132019618211
Mercedes-Benz6822167554665535704151722
" ], "text/plain": [ "#\n", " Manufacturer 2017 2018 2019 2020 2021\n", " \n", "1 Audi 28336 26473 24222 22304 22535\n", "2 BMW 52527 50982 46814 35712 35905\n", "3 BMW_MINI 25427 25984 23813 20196 18211\n", "4 Mercedes-Benz 68221 67554 66553 57041 51722\n" ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.transpose(key: :Year, name: :Manufacturer)" ] }, { "cell_type": "markdown", "id": "8c0e6fd9-2355-43dc-bfdb-254fdbd405fb", "metadata": {}, "source": [ "## 60. To_long" ] }, { "cell_type": "markdown", "id": "8970bf89-6841-44aa-8faa-8c2e97842a8d", "metadata": {}, "source": [ "`DataFrame#to_long(*keep_keys)` reshapes wide DataFrame to the long DataFrame.\n", "\n", "- Parameter `keep_keys` specifies the key names to keep.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 188, "id": "f188cf35-3364-45aa-ad3c-1e079ed7b1a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
YearAudiBMWBMW_MINIMercedes-BenzVW
20172833652527254276822149040
20182647350982259846755451961
20192422246814238136655346794
20202230435712201965704136576
20212253535905182115172235215
" ], "text/plain": [ "#\n", " Year Audi BMW BMW_MINI Mercedes-Benz VW\n", " \n", "1 2017 28336 52527 25427 68221 49040\n", "2 2018 26473 50982 25984 67554 51961\n", "3 2019 24222 46814 23813 66553 46794\n", "4 2020 22304 35712 20196 57041 36576\n", "5 2021 22535 35905 18211 51722 35215\n" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uri = URI(\"https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv\")\n", "import_cars = RedAmber::DataFrame.load(uri)" ] }, { "cell_type": "code", "execution_count": 189, "id": "bee776ae-7a82-41aa-8a31-c53af6b9ad9f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <25 x 3 vectors>
YearNV
2017Audi28336
2017BMW52527
2017BMW_MINI25427
2017Mercedes-Benz68221
2021BMW_MINI18211
2021Mercedes-Benz51722
2021VW35215
" ], "text/plain": [ "#\n", " Year N V\n", " \n", " 1 2017 Audi 28336\n", " 2 2017 BMW 52527\n", " 3 2017 BMW_MINI 25427\n", " 4 2017 Mercedes-Benz 68221\n", " 5 2017 VW 49040\n", " : : : :\n", "23 2021 BMW_MINI 18211\n", "24 2021 Mercedes-Benz 51722\n", "25 2021 VW 35215\n" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.to_long(:Year)" ] }, { "cell_type": "markdown", "id": "771aac21-e739-4eea-9d75-cc99f691f87e", "metadata": {}, "source": [ "- Option `:name` specify the key of the column which is come **from key names**. Default is `:N`.\n", "- Option `:value` specify the key of the column which is come **from values**. Default is `:V`." ] }, { "cell_type": "code", "execution_count": 190, "id": "b7f7aeab-d545-4a28-82ce-db844029cc9c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <25 x 3 vectors>
YearManufacturerNum_of_imported
2017Audi28336
2017BMW52527
2017BMW_MINI25427
2017Mercedes-Benz68221
2021BMW_MINI18211
2021Mercedes-Benz51722
2021VW35215
" ], "text/plain": [ "#\n", " Year Manufacturer Num_of_imported\n", " \n", " 1 2017 Audi 28336\n", " 2 2017 BMW 52527\n", " 3 2017 BMW_MINI 25427\n", " 4 2017 Mercedes-Benz 68221\n", " 5 2017 VW 49040\n", " : : : :\n", "23 2021 BMW_MINI 18211\n", "24 2021 Mercedes-Benz 51722\n", "25 2021 VW 35215\n" ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)" ] }, { "cell_type": "markdown", "id": "44361299-695f-4aef-8847-9b8a9dddc3d1", "metadata": {}, "source": [ "## 61. To_wide" ] }, { "cell_type": "markdown", "id": "7fa18ac8-99e1-4d55-bed4-cc02469496b5", "metadata": {}, "source": [ "`DataFrame#to_wide(*keep_keys)` reshapes long DataFrame to a wide DataFrame.\n", "\n", "- Option `:name` specify the key of the column which will be expanded **to key name**. Default is `:N`.\n", "- Option `:value` specify the key of the column which will be expanded **to values**. Default is `:V`.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 191, "id": "1d59c42f-c5ec-4df2-9fbe-592092ad2f8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
YearAudiBMWBMW_MINIMercedes-BenzVW
20172833652527254276822149040
20182647350982259846755451961
20192422246814238136655346794
20202230435712201965704136576
20212253535905182115172235215
" ], "text/plain": [ "#\n", " Year Audi BMW BMW_MINI Mercedes-Benz VW\n", " \n", "1 2017 28336 52527 25427 68221 49040\n", "2 2018 26473 50982 25984 67554 51961\n", "3 2019 24222 46814 23813 66553 46794\n", "4 2020 22304 35712 20196 57041 36576\n", "5 2021 22535 35905 18211 51722 35215\n" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.to_long(:Year).to_wide" ] }, { "cell_type": "code", "execution_count": 192, "id": "84b7909a-200c-4aec-b192-25016399a7c4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
YearAudiBMWBMW_MINIMercedes-BenzVW
20172833652527254276822149040
20182647350982259846755451961
20192422246814238136655346794
20202230435712201965704136576
20212253535905182115172235215
" ], "text/plain": [ "#\n", " Year Audi BMW BMW_MINI Mercedes-Benz VW\n", " \n", "1 2017 28336 52527 25427 68221 49040\n", "2 2018 26473 50982 25984 67554 51961\n", "3 2019 24222 46814 23813 66553 46794\n", "4 2020 22304 35712 20196 57041 36576\n", "5 2021 22535 35905 18211 51722 35215\n" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.to_long(:Year).to_wide(name: :N, value: :V)\n", "# is also OK" ] }, { "cell_type": "markdown", "id": "f2919aae-12ba-448b-a370-7910845fa470", "metadata": {}, "source": [ "## 62. Custom index\n", "\n", "Another example of `indices` is [14. Indices](#14.-Indices)." ] }, { "cell_type": "markdown", "id": "0a623030-3aad-459d-a04b-23dc52ea4088", "metadata": {}, "source": [ "We can set the start of indices by the option.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": 193, "id": "c33c39f6-b48c-45d7-a755-43c33da21b9d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4]" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame.new(x: [0, 1, 2, 3, 4])\n", "df.indices" ] }, { "cell_type": "code", "execution_count": 194, "id": "fe797234-3759-4cd6-a9fc-02919afb9fed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, 4, 5]" ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.indices(1)" ] }, { "cell_type": "code", "execution_count": 195, "id": "ab17dad0-de80-412d-a890-d0f85ee630ef", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"a\", \"b\", \"c\", \"d\", \"e\"]" ] }, "execution_count": 195, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.indices(\"a\")" ] }, { "cell_type": "code", "execution_count": 196, "id": "200f96e1-b2d4-451b-ac0a-34ab7aa0cec0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xdate
02022-09-04
12022-09-05
22022-09-06
32022-09-07
42022-09-08
" ], "text/plain": [ "#\n", " x date\n", " \n", "1 0 2022-09-04\n", "2 1 2022-09-05\n", "3 2 2022-09-06\n", "4 3 2022-09-07\n", "5 4 2022-09-08\n" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign(:date) { indices(Date.parse(\"2022/9/4\")) }" ] }, { "cell_type": "markdown", "id": "a3e52c9f-704a-4032-bde9-a834b2a3e3f2", "metadata": {}, "source": [ "You can put the first value which accepts `#succ` method." ] }, { "cell_type": "markdown", "id": "cda03584-a0f3-4e16-bc34-b33bf83bf156", "metadata": { "tags": [] }, "source": [ "## 63. Method missing\n", "\n", "`RedAmber::DataFrame` has `#method_missing` to enable to call key names as methods.\n", "\n", "This feature is limited to what can be called as a method (`:key` is OK, not allowed for the keys `:Key`, `:\"key.1\"`, `:\"1key\"`, etc. ). But it will be convenient in many cases.\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": 197, "id": "1e12170d-73d9-4adc-acca-a0d4ea697f24", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 197, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame.new(x: [1, 2, 3])\n", "df.x.sum" ] }, { "cell_type": "code", "execution_count": 198, "id": "b80ea144-b668-4df1-a28a-2c19728f4365", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3]\n" ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Some ways to pull a Vector\n", "df[:x] # Formal style\n", "\n", "df.v(:x) # #v method\n", "\n", "df.x # method" ] }, { "cell_type": "code", "execution_count": 199, "id": "c8dca8e8-ccef-4485-a871-a251a0809d16", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.x.sum" ] }, { "cell_type": "markdown", "id": "cea71378-f0af-4c00-a7a9-6d5e1160ebe6", "metadata": { "tags": [] }, "source": [ "## 64. Assign revised\n", "\n", "Another example of `assign` is [#34. Assign](#34.-Assign), [#65. Variations of assign](#65.-Variations-of-assign) ." ] }, { "cell_type": "code", "execution_count": 200, "id": "6a395669-0b37-4bbf-9852-fcc434dbfbeb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
10.1
20.2
30.3
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 0.1\n", "2 2 0.2\n", "3 3 0.3\n" ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame.new(x: [1, 2, 3])\n", "\n", "# Assign by a Hash\n", "df.assign(y: df.x / 10.0)" ] }, { "cell_type": "code", "execution_count": 201, "id": "3099b006-9a9a-4f27-9286-e00affcf1f2d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
10.1
20.2
30.3
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 0.1\n", "2 2 0.2\n", "3 3 0.3\n" ] }, "execution_count": 201, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Assign by separated key and value\n", "df.assign(:y) { x / 10.0 }" ] }, { "cell_type": "code", "execution_count": 202, "id": "aa14333d-0fd0-43a9-ad3c-1a694b03d91e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
xyz
1100.1
2200.2
3300.3
" ], "text/plain": [ "#\n", " x y z\n", " \n", "1 1 10 0.1\n", "2 2 20 0.2\n", "3 3 30 0.3\n" ] }, "execution_count": 202, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Separated keys and values\n", "df.assign(:y, :z) { [x * 10, x / 10.0] }" ] }, { "cell_type": "markdown", "id": "a33a9394-3d89-4855-a2bc-4a2a95ad3f08", "metadata": {}, "source": [ "## 65. Variations of assign\n", "\n", "Another example of `assign` is [#34. Assign](#34.-Assign), [#64. Assign revised](#64.-Assign-revised) ." ] }, { "cell_type": "code", "execution_count": 203, "id": "59013a14-8c0d-4dec-b936-5a997ae4ca95", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 1 vector>
x
1
2
3
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n" ] }, "execution_count": 203, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame.new(x: [1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 204, "id": "c30ba70a-0185-4b23-826d-56f810baad93", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
xyz
1100.1
2200.2
3300.3
" ], "text/plain": [ "#\n", " x y z\n", " \n", "1 1 10 0.1\n", "2 2 20 0.2\n", "3 3 30 0.3\n" ] }, "execution_count": 204, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Hash args\n", "df.assign(y: df[:x] * 10, z: df[:x] / 10.0)\n", "\n", "# Hash\n", "hash = {y: df[:x] * 10, z: df[:x] / 10.0}\n", "df.assign(hash)\n", "\n", "# Array\n", "array = [[:y, df[:x] * 10], [:z, df[:x] / 10.0]]\n", "df.assign(array)\n", "\n", "# Array\n", "df.assign [\n", " [:y, df[:x] * 10],\n", " [:z, df[:x] / 10.0]\n", "]\n", "\n", "# Hash\n", "df.assign({\n", " y: df[:x] * 10,\n", " z: df[:x] / 10.0\n", "})\n", "\n", "# Block, Hash\n", "df.assign { {y: df[:x] * 10, z: df[:x] / 10.0} }\n", "\n", "# Block, Array\n", "df.assign { [[:y, df[:x] * 10], [:z, df[:x] / 10.0]] }\n", "\n", "# Block, Array, method\n", "#df.assign { [:y, x * 10], [:z, x / 10.0]] }\n", "\n", "# Separated\n", "#df.assign(:y, :z) { [x * 10, x / 10.0] }" ] }, { "cell_type": "markdown", "id": "0472a570-c0e5-4ec8-bd26-23914a48f23d", "metadata": { "tags": [] }, "source": [ "## 66. Row index label by slice_by\n", "\n", "Another example of `slice` is [#28. Slice](#28.-Slice).\n", "\n", "(Since 0.2.1)" ] }, { "cell_type": "code", "execution_count": 205, "id": "d7a6cb36-53e7-4503-88c0-301531d8b877", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
labelnum
a1.1
b2.2
c3.3
d4.4
e5.5
" ], "text/plain": [ "#\n", " label num\n", " \n", "1 a 1.1\n", "2 b 2.2\n", "3 c 3.3\n", "4 d 4.4\n", "5 e 5.5\n" ] }, "execution_count": 205, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame.new(num: [1.1, 2.2, 3.3, 4.4, 5.5])\n", " .assign_left(:label) { indices(\"a\") }" ] }, { "cell_type": "markdown", "id": "86377e12-acf7-4529-baf6-46e54a86aa5a", "metadata": {}, "source": [ "`slice_by(key) { row_selector }` selects rows in column `key` with `row_selector`." ] }, { "cell_type": "code", "execution_count": 206, "id": "2455da80-64af-4a4e-936f-4b7651dcf5ba", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 1 vector>
num
2.2
3.3
4.4
" ], "text/plain": [ "#\n", " num\n", " \n", "1 2.2\n", "2 3.3\n", "3 4.4\n" ] }, "execution_count": 206, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice_by(:label) { \"b\"..\"d\" }" ] }, { "cell_type": "code", "execution_count": 207, "id": "366587f4-0fde-4204-bdea-c3dc5c58b155", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 1 vector>
num
3.3
2.2
5.5
" ], "text/plain": [ "#\n", " num\n", " \n", "1 3.3\n", "2 2.2\n", "3 5.5\n" ] }, "execution_count": 207, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice_by(:label) { [\"c\", \"b\", \"e\"] }" ] }, { "cell_type": "markdown", "id": "c074498f-6e27-45e3-9f9a-f1c0d92fb742", "metadata": {}, "source": [ "If the option `keep_key:` set to `true`, index label column is preserved." ] }, { "cell_type": "code", "execution_count": 208, "id": "115b01f9-3722-4a2e-a6d0-7feadb058659", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
labelnum
b2.2
c3.3
d4.4
" ], "text/plain": [ "#\n", " label num\n", " \n", "1 b 2.2\n", "2 c 3.3\n", "3 d 4.4\n" ] }, "execution_count": 208, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice_by(:label, keep_key: true) { \"b\"..\"d\" }" ] }, { "cell_type": "markdown", "id": "8285cb78-5ef5-4fd4-baa1-d862f92418d5", "metadata": {}, "source": [ "## 67. Simpson's paradox in COVID-19 data\n", "\n", "https://www.rdocumentation.org/packages/openintro/versions/2.3.0/topics/simpsons_paradox_covid" ] }, { "cell_type": "code", "execution_count": 209, "id": "47e2d348-6a36-4dfb-8697-959ef1442c11", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <268166 x 3 vectors>
age_groupvaccine_statusoutcome
under 50vaccinateddeath
under 50vaccinateddeath
under 50vaccinateddeath
under 50vaccinateddeath
50 +unvaccinatedsurvived
50 +unvaccinatedsurvived
50 +unvaccinatedsurvived
" ], "text/plain": [ "#\n", " age_group vaccine_status outcome\n", " \n", " 1 under 50 vaccinated death\n", " 2 under 50 vaccinated death\n", " 3 under 50 vaccinated death\n", " 4 under 50 vaccinated death\n", " 5 under 50 vaccinated death\n", " : : : :\n", "268164 50 + unvaccinated survived\n", "268165 50 + unvaccinated survived\n", "268166 50 + unvaccinated survived\n" ] }, "execution_count": 209, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'datasets-arrow'\n", "\n", "ds = Datasets::Rdatasets.new('openintro', 'simpsons_paradox_covid')\n", "df = RedAmber::DataFrame.new(ds.to_arrow)" ] }, { "cell_type": "markdown", "id": "3d28b955-a57e-42fb-ae2f-647ee760ad47", "metadata": {}, "source": [ "Create group and count by vaccine status and outcome." ] }, { "cell_type": "code", "execution_count": 210, "id": "1b5d9c45-a749-4e04-9f8a-5eeb0652771b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
vaccine_statusoutcomecount(age_group)
vaccinateddeath481
unvaccinateddeath253
vaccinatedsurvived116633
unvaccinatedsurvived150799
" ], "text/plain": [ "#\n", " vaccine_status outcome count(age_group)\n", " \n", "1 vaccinated death 481\n", "2 unvaccinated death 253\n", "3 vaccinated survived 116633\n", "4 unvaccinated survived 150799\n" ] }, "execution_count": 210, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count = df.group(:vaccine_status, :outcome).count" ] }, { "cell_type": "markdown", "id": "a6b6f501-f795-4fbf-b135-17609c5faebb", "metadata": {}, "source": [ "Reshape to human readable wide table." ] }, { "cell_type": "code", "execution_count": 211, "id": "fcec849e-4144-4035-9ce6-45d4bc124f11", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
outcomevaccinatedunvaccinated
death481253
survived116633150799
" ], "text/plain": [ "#\n", " outcome vaccinated unvaccinated\n", " \n", "1 death 481 253\n", "2 survived 116633 150799\n" ] }, "execution_count": 211, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_count = count.to_wide(name: :vaccine_status, value: :\"count(age_group)\")" ] }, { "cell_type": "markdown", "id": "caccb2cc-64a3-4b89-b604-eca4780c3c52", "metadata": {}, "source": [ "Compute death or survived ratio for vaccine status." ] }, { "cell_type": "code", "execution_count": 212, "id": "de42a142-38f5-426b-b9ed-706dba19015c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 5 vectors>
outcomevaccinatedunvaccinatedvaccinated_%unvaccinated_%
death4812530.41071093122940040.167491989513545
survived11663315079999.589289068770699.83250801048645
" ], "text/plain": [ "#\n", " outcome vaccinated unvaccinated vaccinated_% unvaccinated_%\n", " \n", "1 death 481 253 0.41 0.17\n", "2 survived 116633 150799 99.59 99.83\n" ] }, "execution_count": 212, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_count.assign do\n", " {\n", " \"vaccinated_%\": 100.0 * vaccinated / vaccinated.sum,\n", " \"unvaccinated_%\": 100.0 * unvaccinated / unvaccinated.sum\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "a18599a7-a21e-45a5-aa55-0170414ba3d0", "metadata": {}, "source": [ "Death ratio for vaccinated is higher than unvaccinated. Is it true?\n", "\n", "Next, do the same thing above for each age group. Temporally create methods." ] }, { "cell_type": "code", "execution_count": 213, "id": "296b4570-7746-4444-8c0a-3bc4a7bdf929", "metadata": {}, "outputs": [ { "data": { "text/plain": [ ":make_covid_table" ] }, "execution_count": 213, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def make_covid_table(df)\n", " df.group(:vaccine_status, :outcome)\n", " .count\n", " .to_wide(name: :vaccine_status, value: :\"count(age_group)\")\n", " .assign do\n", " {\n", " \"vaccinated_%\": (100.0 * vaccinated / vaccinated.sum).round(n_digits: 3),\n", " \"unvaccinated_%\": (100.0 * unvaccinated / unvaccinated.sum).round(n_digits: 3)\n", " }\n", " end\n", "end" ] }, { "cell_type": "code", "execution_count": 214, "id": "0443fe4f-e023-4496-9538-3d8772debfb6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 5 vectors>
outcomevaccinatedunvaccinatedvaccinated_%unvaccinated_%
death21480.0230.033
survived8978614756499.97799.967
" ], "text/plain": [ "#\n", " outcome vaccinated unvaccinated vaccinated_% unvaccinated_%\n", " \n", "1 death 21 48 0.02 0.03\n", "2 survived 89786 147564 99.98 99.97\n" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# under 50\n", "make_covid_table(df[df[:age_group] == \"under 50\"])" ] }, { "cell_type": "code", "execution_count": 215, "id": "3c451727-5d61-495d-aae9-fb8486b6067f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 5 vectors>
outcomevaccinatedunvaccinatedvaccinated_%unvaccinated_%
death4602051.6855.959
survived26847323598.31594.041
" ], "text/plain": [ "#\n", " outcome vaccinated unvaccinated vaccinated_% unvaccinated_%\n", " \n", "1 death 460 205 1.69 5.96\n", "2 survived 26847 3235 98.32 94.04\n" ] }, "execution_count": 215, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 50 +\n", "make_covid_table(df[df[:age_group] == \"50 +\"])" ] }, { "cell_type": "markdown", "id": "78c3c909-5abf-4989-bd27-e86543a14431", "metadata": {}, "source": [ "Death ratio for vaccinated is lower than unvaccinated for grouped subset by age. This is an exaple of \"Simpson's paradox\" ." ] }, { "cell_type": "markdown", "id": "184a5b21-dabd-4428-84d4-aa3e3f8d4666", "metadata": {}, "source": [ "## 68. Clean up dirty data" ] }, { "cell_type": "code", "execution_count": 216, "id": "3f34f465-7cdc-4bea-bffc-1ba2f741820d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <8 x 2 vectors>
heightweight
154.952.2
156.8cm51.1kg
15249
148.5cm45.4kg
155cm
49.9kg
1.58m49.8kg
166.8cm53.6kg
" ], "text/plain": [ "#\n", " height weight\n", " \n", "1 154.9 52.2\n", "2 156.8cm 51.1kg\n", "3 152 49\n", "4 148.5cm 45.4kg\n", "5 155cm\n", "6 49.9kg\n", "7 1.58m 49.8kg\n", "8 166.8cm 53.6kg\n" ] }, "execution_count": 216, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.load(\"../test/entity/dirty_data.csv\")" ] }, { "cell_type": "markdown", "id": "b7704f5e-8e8d-439c-bbd5-535d6e9f8911", "metadata": {}, "source": [ "It was loaded as String datatypes." ] }, { "cell_type": "code", "execution_count": 217, "id": "e7abb6f9-8ca3-4868-aa0a-80f721475235", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:height=>:string, :weight=>:string}" ] }, "execution_count": 217, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.schema" ] }, { "cell_type": "markdown", "id": "f7bc2ef6-7d08-4d65-8439-b02fb9c6e5dc", "metadata": {}, "source": [ "First for the `:weight` column. Replacing \"\" to NaN causes casting to Float." ] }, { "cell_type": "code", "execution_count": 218, "id": "6435e95f-db41-4d26-84bd-4880de3fbd32", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <8 x 2 vectors>
heightweight
154.952.2
156.8cm51.1
15249.0
148.5cm45.4
155cmNaN
49.9
1.58m49.8
166.8cm53.6
" ], "text/plain": [ "#\n", " height weight\n", " \n", "1 154.9 52.2\n", "2 156.8cm 51.1\n", "3 152 49.0\n", "4 148.5cm 45.4\n", "5 155cm NaN\n", "6 49.9\n", "7 1.58m 49.8\n", "8 166.8cm 53.6\n" ] }, "execution_count": 218, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign do\n", " {\n", " weight: weight.replace(weight == \"\", Float::NAN)\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "fcc1cd2b-7ff1-4e64-a6a9-0026209455cb", "metadata": {}, "source": [ "Apply same conversion for `:height` followed by unit conversion by `if_else`." ] }, { "cell_type": "code", "execution_count": 219, "id": "c88379f4-b063-42e9-8f76-676f364a79af", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <8 x 2 vectors>
heightweight
154.952.2
156.851.1
152.049.0
148.545.4
155.0NaN
NaN49.9
158.049.8
166.853.6
" ], "text/plain": [ "#\n", " height weight\n", " \n", "1 154.9 52.2\n", "2 156.8 51.1\n", "3 152.0 49.0\n", "4 148.5 45.4\n", "5 155.0 NaN\n", "6 NaN 49.9\n", "7 158.0 49.8\n", "8 166.8 53.6\n" ] }, "execution_count": 219, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.assign do\n", " {\n", " weight: weight.replace(weight == '', Float::NAN),\n", " height: height.replace(height == '', Float::NAN)\n", " .then { |h| (h < 10).if_else(h * 100, h) }\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "8f16dc4c-2ecd-4f36-b7ba-8a05487b1a26", "metadata": {}, "source": [ "We got clean data, then compute BMI as a new column." ] }, { "cell_type": "code", "execution_count": 220, "id": "bb061dab-008c-431c-9f7d-c74aa33e5f5a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <8 x 3 vectors>
heightweightBMI
154.952.221.8
156.851.120.8
152.049.021.2
148.545.420.6
155.0NaNNaN
NaN49.9NaN
158.049.819.9
166.853.619.3
" ], "text/plain": [ "#\n", " height weight BMI\n", " \n", "1 154.9 52.2 21.8\n", "2 156.8 51.1 20.8\n", "3 152.0 49.0 21.2\n", "4 148.5 45.4 20.6\n", "5 155.0 NaN NaN\n", "6 NaN 49.9 NaN\n", "7 158.0 49.8 19.9\n", "8 166.8 53.6 19.3\n" ] }, "execution_count": 220, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign(:BMI) { (weight / height ** 2 * 10000).round(n_digits: 1) }" ] }, { "cell_type": "markdown", "id": "8fe6cc1a-530d-4883-8641-7aebd97ebd16", "metadata": {}, "source": [ "## 69. From the Pandas cookbook (Multiindexing)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#multiindexing" ] }, { "cell_type": "markdown", "id": "a451139d-9b7a-4cbd-b1f1-4b99ac6f81b8", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Efficiently and dynamically creating new columns using applymap\n", "\n", "df = pd.DataFrame(\n", " {\n", " \"row\": [0, 1, 2],\n", " \"One_X\": [1.1, 1.1, 1.1],\n", " \"One_Y\": [1.2, 1.2, 1.2],\n", " \"Two_X\": [1.11, 1.11, 1.11],\n", " \"Two_Y\": [1.22, 1.22, 1.22],\n", " }\n", ")\n", "df\n", "\n", "# =>\n", " row One_X One_Y Two_X Two_Y\n", "0 0 1.1 1.2 1.11 1.22\n", "1 1 1.1 1.2 1.11 1.22\n", "2 2 1.1 1.2 1.11 1.22\n", "\n", "# As Labelled Index\n", "df = df.set_index(\"row\")\n", "df\n", "\n", "# =>\n", " One_X One_Y Two_X Two_Y\n", "row \n", "0 1.1 1.2 1.11 1.22\n", "1 1.1 1.2 1.11 1.22\n", "2 1.1 1.2 1.11 1.22\n", "\n", "# With Hierarchical Columns\n", "df.columns = pd.MultiIndex.from_tuples([tuple(c.split(\"_\")) for c in df.columns])\n", "df\n", "\n", "# =>\n", " One Two \n", " X Y X Y\n", "row \n", "0 1.1 1.2 1.11 1.22\n", "1 1.1 1.2 1.11 1.22\n", "2 1.1 1.2 1.11 1.22\n", "\n", "# Now stack & Reset\n", "df = df.stack(0).reset_index(1)\n", "df\n", "\n", "# =>\n", " level_1 X Y\n", "row \n", "0 One 1.10 1.20\n", "0 Two 1.11 1.22\n", "1 One 1.10 1.20\n", "1 Two 1.11 1.22\n", "2 One 1.10 1.20\n", "2 Two 1.11 1.22\n", "\n", "# And fix the labels (Notice the label 'level_1' got added automatically)\n", "df.columns = [\"Sample\", \"All_X\", \"All_Y\"]\n", "df\n", "\n", "# =>\n", " Sample All_X All_Y\n", "row \n", "0 One 1.10 1.20\n", "0 Two 1.11 1.22\n", "1 One 1.10 1.20\n", "1 Two 1.11 1.22\n", "2 One 1.10 1.20\n", "2 Two 1.11 1.22\n", "```" ] }, { "cell_type": "markdown", "id": "6af6de4e-3875-4e05-8642-20d07bf3a363", "metadata": {}, "source": [ "This is a tentative example. This work may be refined by the coming feature such as splitting column and combining DataFrames." ] }, { "cell_type": "code", "execution_count": 221, "id": "1193c48a-528b-4e22-a8f7-24dba63e3fa7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 5 vectors>
rowOne_XOne_YTwo_XTwo_Y
01.11.21.111.22
11.11.21.111.22
21.11.21.111.22
" ], "text/plain": [ "#\n", " row One_X One_Y Two_X Two_Y\n", " \n", "1 0 1.1 1.2 1.11 1.22\n", "2 1 1.1 1.2 1.11 1.22\n", "3 2 1.1 1.2 1.11 1.22\n" ] }, "execution_count": 221, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.new(\n", " \"row\": [0, 1, 2],\n", " \"One_X\": [1.1, 1.1, 1.1],\n", " \"One_Y\": [1.2, 1.2, 1.2],\n", " \"Two_X\": [1.11, 1.11, 1.11],\n", " \"Two_Y\": [1.22, 1.22, 1.22],\n", ")" ] }, { "cell_type": "code", "execution_count": 222, "id": "9a9bcbd1-112f-4e58-b065-90fec7270a0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <6 x 3 vectors>
rowSampleAll_X
0One_X1.1
0Two_X1.11
1One_X1.1
1Two_X1.11
2One_X1.1
2Two_X1.11
" ], "text/plain": [ "#\n", " row Sample All_X\n", " \n", "1 0 One_X 1.1\n", "2 0 Two_X 1.11\n", "3 1 One_X 1.1\n", "4 1 Two_X 1.11\n", "5 2 One_X 1.1\n", "6 2 Two_X 1.11\n" ] }, "execution_count": 222, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = df.pick(:row, :One_X, :Two_X)\n", " .to_long(:row, name: :Sample, value: :All_X)" ] }, { "cell_type": "code", "execution_count": 223, "id": "949e64a9-5f45-481b-b4b4-dc3bc4da4326", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <6 x 3 vectors>
rowSampleAll_Y
0One_Y1.2
0Two_Y1.22
1One_Y1.2
1Two_Y1.22
2One_Y1.2
2Two_Y1.22
" ], "text/plain": [ "#\n", " row Sample All_Y\n", " \n", "1 0 One_Y 1.2\n", "2 0 Two_Y 1.22\n", "3 1 One_Y 1.2\n", "4 1 Two_Y 1.22\n", "5 2 One_Y 1.2\n", "6 2 Two_Y 1.22\n" ] }, "execution_count": 223, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = df.pick(:row, :One_Y, :Two_Y)\n", " .to_long(:row, name: :Sample, value: :All_Y)" ] }, { "cell_type": "code", "execution_count": 224, "id": "d2f5959c-c490-45fe-bcbd-3b8325760f6d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <6 x 4 vectors>
rowSampleAll_XAll_Y
0One1.11.2
0Two1.111.22
1One1.11.2
1Two1.111.22
2One1.11.2
2Two1.111.22
" ], "text/plain": [ "#\n", " row Sample All_X All_Y\n", " \n", "1 0 One 1.1 1.2\n", "2 0 Two 1.11 1.22\n", "3 1 One 1.1 1.2\n", "4 1 Two 1.11 1.22\n", "5 2 One 1.1 1.2\n", "6 2 Two 1.11 1.22\n" ] }, "execution_count": 224, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x.pick(:row)\n", " .assign [\n", " [:Sample, x[:Sample].each.map { |x| x.split(\"_\").first }],\n", " [:All_X, x[:All_X]],\n", " [:All_Y, y[:All_Y]]\n", " ]" ] }, { "cell_type": "markdown", "id": "7e65c417-692c-4145-96dd-deef12f46ed4", "metadata": { "tags": [] }, "source": [ "## 70. From the Pandas cookbook (Arithmetic)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#arithmetic" ] }, { "cell_type": "markdown", "id": "47c3ac27-37f4-4e6b-a5f7-95d818e21cdc", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "cols = pd.MultiIndex.from_tuples(\n", " [(x, y) for x in [\"A\", \"B\", \"C\"] for y in [\"O\", \"I\"]]\n", ")\n", "\n", "df = pd.DataFrame(np.random.randn(2, 6), index=[\"n\", \"m\"], columns=cols)\n", "df\n", "\n", "# =>\n", " A B C \n", " O I O I O I\n", "n 0.469112 -0.282863 -1.509059 -1.135632 1.212112 -0.173215\n", "m 0.119209 -1.044236 -0.861849 -2.104569 -0.494929 1.071804\n", "\n", "df = df.div(df[\"C\"], level=1)\n", "df\n", "\n", "# =>\n", " A B C \n", " O I O I O I\n", "n 0.387021 1.633022 -1.244983 6.556214 1.0 1.0\n", "m -0.240860 -0.974279 1.741358 -1.963577 1.0 1.0\n", "```" ] }, { "cell_type": "markdown", "id": "56ce988f-7d7d-4bbd-8919-be3eb4242b77", "metadata": {}, "source": [ "This is a tentative example. This work may be refined by the coming feature which treats multiple key header easily." ] }, { "cell_type": "code", "execution_count": 225, "id": "a9e14b16-c7d8-4539-90f9-25eb89f4789f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[-0.5812548058448793, -0.16835433970036223], [0.5868954392364965, -0.5951422503374125], [-0.8028018065626139, -0.32610562028036544], [0.28292153294248895, 1.6842676580208051], [0.9184993853758983, -0.048538359637834545], [-0.46445256370104565, -0.9921944543521136]]" ] }, "execution_count": 225, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require \"Numo/NArray\"\n", "\n", "values = Numo::DFloat.new(6, 2).rand_norm.to_a" ] }, { "cell_type": "markdown", "id": "03aa2096-4411-4665-9bfc-dfedc41e75af", "metadata": {}, "source": [ "For consistency with the pandas result, we will use same data of them." ] }, { "cell_type": "code", "execution_count": 226, "id": "a470ced5-dadb-48cb-a43f-8161ac3df769", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[0.469112, 0.119209], [-0.282863, -1.044236], [-1.509059, -0.861849], [-1.135632, -2.104569], [1.212112, -0.494929], [-0.173215, 1.071804]]" ] }, "execution_count": 226, "metadata": {}, "output_type": "execute_result" } ], "source": [ "values = [\n", " [0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215],\n", " [0.119209, -1.044236, -0.861849, -2.104569, -0.494929, 1.071804]\n", "].transpose" ] }, { "cell_type": "code", "execution_count": 227, "id": "4288a84f-c3db-4fa7-b73e-9e21077304c5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"AO\", \"AI\", \"BO\", \"BI\", \"CO\", \"CI\"]" ] }, "execution_count": 227, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keys = %w[A B C].product(%w[O I]).map(&:join)" ] }, { "cell_type": "code", "execution_count": 228, "id": "a1199c40-1508-4573-a427-cd7f7d580701", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 7 vectors>
indexAOAIBOBICOCI
n0.469112-0.282863-1.509059-1.1356321.212112-0.173215
m0.119209-1.044236-0.861849-2.104569-0.4949291.071804
" ], "text/plain": [ "#\n", " index AO AI BO BI CO CI\n", " \n", "1 n 0.47 -0.28 -1.51 -1.14 1.21 -0.17\n", "2 m 0.12 -1.04 -0.86 -2.1 -0.49 1.07\n" ] }, "execution_count": 228, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.new(index: %w[n m])\n", " .assign(*keys) { values }" ] }, { "cell_type": "code", "execution_count": 229, "id": "bab052e6-5b26-40ba-95b6-c87ac444cb80", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 7 vectors>
indexAOAIBOBICOCI
n0.387020341354594231.6330167710648613-1.244983136871846.5561989435095111.01.0
m-0.24086081033845258-0.97427887934734341.7413588615740843-1.963576362842461.01.0
" ], "text/plain": [ "#\n", " index AO AI BO BI CO CI\n", " \n", "1 n 0.39 1.63 -1.24 6.56 1.0 1.0\n", "2 m -0.24 -0.97 1.74 -1.96 1.0 1.0\n" ] }, "execution_count": 229, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign do\n", " assigner = {}\n", " %w[A B C].each do |abc|\n", " %w[O I].each do |oi|\n", " key = \"#{abc}#{oi}\".to_sym\n", " assigner[key] = v(key) / v(\"C#{oi}\".to_sym)\n", " end\n", " end\n", " assigner\n", "end" ] }, { "cell_type": "markdown", "id": "b7b483a4-5d68-492f-836f-1105420d859f", "metadata": {}, "source": [ "## 71. From the Pandas cookbook (Slicing)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#slicing" ] }, { "cell_type": "markdown", "id": "282fe127-5b0b-45a6-9f04-0309495b0d7a", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "coords = [(\"AA\", \"one\"), (\"AA\", \"six\"), (\"BB\", \"one\"), (\"BB\", \"two\"), (\"BB\", \"six\")]\n", "index = pd.MultiIndex.from_tuples(coords)\n", "df = pd.DataFrame([11, 22, 33, 44, 55], index, [\"MyData\"])\n", "df\n", "\n", "# =>\n", " MyData\n", "AA one 11\n", " six 22\n", "BB one 33\n", " two 44\n", " six 55\n", "```" ] }, { "cell_type": "code", "execution_count": 230, "id": "24de5e41-c775-472d-ab93-24ef9981febe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
label1label2MyData
AAone11
AAsix22
BBone33
BBtwo44
BBsix55
" ], "text/plain": [ "#\n", " label1 label2 MyData\n", " \n", "1 AA one 11\n", "2 AA six 22\n", "3 BB one 33\n", "4 BB two 44\n", "5 BB six 55\n" ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coords = [[\"AA\", \"one\"], [\"AA\", \"six\"], [\"BB\", \"one\"], [\"BB\", \"two\"], [\"BB\", \"six\"]].transpose\n", "df = RedAmber::DataFrame.new(MyData: [11, 22, 33, 44, 55])\n", " .assign_left(:label1, :label2) { coords }" ] }, { "cell_type": "markdown", "id": "5876deb1-8574-48c0-af6a-00447ba5ebe7", "metadata": {}, "source": [ "To take the cross section of the 1st level and 1st axis the index:\n", "\n", "```python\n", "# by Python Pandas\n", "# Note : level and axis are optional, and default to zero\n", "df.xs(\"BB\", level=0, axis=0)\n", "\n", "# =>\n", " MyData\n", "one 33\n", "two 44\n", "six 55\n", "```" ] }, { "cell_type": "code", "execution_count": 231, "id": "6941b8ef-6e40-4103-ab93-bf370d80ebf8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
label2MyData
one33
two44
six55
" ], "text/plain": [ "#\n", " label2 MyData\n", " \n", "1 one 33\n", "2 two 44\n", "3 six 55\n" ] }, "execution_count": 231, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice { label1 == \"BB\" }.drop(:label1)" ] }, { "cell_type": "markdown", "id": "cf62d474-5a15-4af3-a0fc-99455bf0a205", "metadata": {}, "source": [ "…and now the 2nd level of the 1st axis.\n", "\n", "```python\n", "# by Python Pandas\n", "df.xs(\"six\", level=1, axis=0)\n", "\n", "# =>\n", " MyData\n", "AA 22\n", "BB 55\n", "```" ] }, { "cell_type": "code", "execution_count": 232, "id": "9968d610-cadc-4b0a-8364-8c95e6ed6b0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 2 vectors>
label1MyData
AA22
BB55
" ], "text/plain": [ "#\n", " label1 MyData\n", " \n", "1 AA 22\n", "2 BB 55\n" ] }, "execution_count": 232, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice { label2 == \"six\" }.drop(:label2)" ] }, { "cell_type": "markdown", "id": "b7d9d33d-6959-492d-b24f-9952b36cf9f6", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "import itertools\n", "\n", "index = list(itertools.product([\"Ada\", \"Quinn\", \"Violet\"], [\"Comp\", \"Math\", \"Sci\"]))\n", "headr = list(itertools.product([\"Exams\", \"Labs\"], [\"I\", \"II\"]))\n", "indx = pd.MultiIndex.from_tuples(index, names=[\"Student\", \"Course\"])\n", "cols = pd.MultiIndex.from_tuples(headr) # Notice these are un-named\n", "data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]\n", "df = pd.DataFrame(data, indx, cols)\n", "df\n", "\n", "# =>\n", " Exams Labs \n", " I II I II\n", "Student Course \n", "Ada Comp 70 71 72 73\n", " Math 71 73 75 74\n", " Sci 72 75 75 75\n", "Quinn Comp 73 74 75 76\n", " Math 74 76 78 77\n", " Sci 75 78 78 78\n", "Violet Comp 76 77 78 79\n", " Math 77 79 81 80\n", " Sci 78 81 81 81\n", "```" ] }, { "cell_type": "code", "execution_count": 233, "id": "78e5753a-8d9c-4132-9671-6b09b3a06a20", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <9 x 6 vectors>
StudentCourseExams/IExams/IILabs/ILabs/II
AdaComp70717273
AdaMath71737574
AdaSci72757575
QuinnComp73747576
VioletComp76777879
VioletMath77798180
VioletSci78818181
" ], "text/plain": [ "#\n", " Student Course Exams/I Exams/II Labs/I Labs/II\n", " \n", "1 Ada Comp 70 71 72 73\n", "2 Ada Math 71 73 75 74\n", "3 Ada Sci 72 75 75 75\n", "4 Quinn Comp 73 74 75 76\n", "5 Quinn Math 74 76 78 77\n", "6 Quinn Sci 75 78 78 78\n", "7 Violet Comp 76 77 78 79\n", "8 Violet Math 77 79 81 80\n", "9 Violet Sci 78 81 81 81\n" ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indexes = %w[Ada Quinn Violet].product(%w[Comp Math Sci]).transpose\n", "df = RedAmber::DataFrame.new(%w[Student Course].zip(indexes))\n", " .assign do\n", " assigner = {}\n", " keys = %w[Exams Labs].product(%w[I II]).map { |a| a.join(\"/\") } \n", " keys.each.with_index do |key, x|\n", " assigner[key] = (0...9).map { |y| 70 + x + y + (x * y) % 3 }\n", " end\n", " assigner\n", " end" ] }, { "cell_type": "markdown", "id": "2fc29a2d-fbe8-42f9-83dc-1ddfb7563ce0", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "All = slice(None)\n", "\n", "df.loc[\"Violet\"]\n", "\n", "# =>\n", " Exams Labs \n", " I II I II\n", "Course \n", "Comp 76 77 78 79\n", "Math 77 79 81 80\n", "Sci 78 81 81 81\n", "```" ] }, { "cell_type": "code", "execution_count": 234, "id": "1c7662a7-3208-40ee-9f64-235c68ef57b3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 5 vectors>
CourseExams/IExams/IILabs/ILabs/II
Comp76777879
Math77798180
Sci78818181
" ], "text/plain": [ "#\n", " Course Exams/I Exams/II Labs/I Labs/II\n", " \n", "1 Comp 76 77 78 79\n", "2 Math 77 79 81 80\n", "3 Sci 78 81 81 81\n" ] }, "execution_count": 234, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(df[:Student] == \"Violet\").drop(:Student)" ] }, { "cell_type": "markdown", "id": "af5bb6db-cf2d-4887-9d73-7da651f9a3ea", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df.loc[(All, \"Math\"), All]\n", "\n", "# =>\n", " Exams Labs \n", " I II I II\n", "Student Course \n", "Ada Math 71 73 75 74\n", "Quinn Math 74 76 78 77\n", "Violet Math 77 79 81 80\n", "```" ] }, { "cell_type": "code", "execution_count": 235, "id": "28baea84-76c8-4473-b1ff-0d8f6fde2cca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 6 vectors>
StudentCourseExams/IExams/IILabs/ILabs/II
AdaMath71737574
QuinnMath74767877
VioletMath77798180
" ], "text/plain": [ "#\n", " Student Course Exams/I Exams/II Labs/I Labs/II\n", " \n", "1 Ada Math 71 73 75 74\n", "2 Quinn Math 74 76 78 77\n", "3 Violet Math 77 79 81 80\n" ] }, "execution_count": 235, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(df[:Course] == \"Math\")" ] }, { "cell_type": "markdown", "id": "e12bcd4f-be10-4ee4-b55e-dc57e3de9698", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df.loc[(slice(\"Ada\", \"Quinn\"), \"Math\"), All]\n", "\n", "# =>\n", " Exams Labs \n", " I II I II\n", "Student Course \n", "Ada Math 71 73 75 74\n", "Quinn Math 74 76 78 77\n", "```" ] }, { "cell_type": "code", "execution_count": 236, "id": "11387d5b-a470-4ebf-8260-55db8f5f829a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 6 vectors>
StudentCourseExams/IExams/IILabs/ILabs/II
AdaMath71737574
QuinnMath74767877
" ], "text/plain": [ "#\n", " Student Course Exams/I Exams/II Labs/I Labs/II\n", " \n", "1 Ada Math 71 73 75 74\n", "2 Quinn Math 74 76 78 77\n" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(df[:Course] == \"Math\")\n", " .slice { (v(:Student) == \"Ada\") | (v(:Student) == \"Quinn\") }" ] }, { "cell_type": "markdown", "id": "61270946-535c-4a42-a155-a1c0ca025416", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df.loc[(All, \"Math\"), (\"Exams\")]\n", "\n", "# =>\n", " I II\n", "Student Course \n", "Ada Math 71 73\n", "Quinn Math 74 76\n", "Violet Math 77 79\n", "```" ] }, { "cell_type": "code", "execution_count": 237, "id": "7dbdd733-6cbb-48e1-ab8d-45587d455b8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
StudentCourseExams/IExams/II
AdaMath7173
QuinnMath7476
VioletMath7779
" ], "text/plain": [ "#\n", " Student Course Exams/I Exams/II\n", " \n", "1 Ada Math 71 73\n", "2 Quinn Math 74 76\n", "3 Violet Math 77 79\n" ] }, "execution_count": 237, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(df[:Course] == \"Math\")\n", " .pick {\n", " [:Student, :Course].concat keys.select { |key| key.to_s.start_with?(\"Exams\") }\n", " }" ] }, { "cell_type": "markdown", "id": "54be92ae-32b8-44f8-926c-190ab7592364", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df.loc[(All, \"Math\"), (All, \"II\")]\n", "\n", "# =>\n", " Exams Labs\n", " II II\n", "Student Course \n", "Ada Math 73 74\n", "Quinn Math 76 77\n", "Violet Math 79 80\n", "```" ] }, { "cell_type": "code", "execution_count": 238, "id": "5323da02-4dbe-43ac-89b8-3f9de0ae25c1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
StudentCourseExams/IILabs/II
AdaMath7374
QuinnMath7677
VioletMath7980
" ], "text/plain": [ "#\n", " Student Course Exams/II Labs/II\n", " \n", "1 Ada Math 73 74\n", "2 Quinn Math 76 77\n", "3 Violet Math 79 80\n" ] }, "execution_count": 238, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(df[:Course] == \"Math\")\n", " .pick {\n", " [:Student, :Course].concat keys.select { |key| key.to_s.end_with?(\"II\") }\n", " }" ] }, { "cell_type": "code", "execution_count": 239, "id": "e89cde6e-9c79-4dc6-856e-d44a6cc1025c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, [4, 5]]" ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[1,2,3] << [4,5]" ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 3.1.1", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "3.1.1" } }, "nbformat": 4, "nbformat_minor": 5 }