{ "cells": [ { "cell_type": "markdown", "id": "e355db8b-ebb6-4ea6-97b5-3b9fdadc302c", "metadata": {}, "source": [ "# 61 examples of Red Amber" ] }, { "cell_type": "markdown", "id": "f20f4970-db38-4d96-9a36-d4cf9d007596", "metadata": {}, "source": [ "Last update: August 14, 2022, for RedAmber Version 0.2.0" ] }, { "cell_type": "markdown", "id": "f6e927d0-b59a-4c4e-9f8a-4fa08f9a6b2f", "metadata": {}, "source": [ "## 1. Install" ] }, { "cell_type": "markdown", "id": "85eacfe6-fa11-4749-844f-5914d6cd7dbc", "metadata": {}, "source": [ "Install requirements before you install Red Amber.\n", "\n", "- Apache Arrow GLib (>= 8.0.0)\n", "\n", "- Apache Parquet GLib (>= 8.0.0) # if you need IO from/to Parquet resource.\n", "\n", " See [Apache Arrow install document](https://arrow.apache.org/install/).\n", " \n", " Minimum installation example for the latest Ubuntu is in the ['Prepare the Apache Arrow' section in ci test](https://github.com/heronshoes/red_amber/blob/master/.github/workflows/test.yml) of Red Amber.\n", "\n", "Then add this line to your Gemfile:\n", "```\n", "gem 'red_amber'\n", "```\n", "\n", "And then execute:\n", "```\n", "$ bundle install\n", "```\n", "\n", "Or install it yourself as:\n", "```\n", "$ gem install red_amber\n", "```" ] }, { "cell_type": "markdown", "id": "8c08c45d-0818-4b43-bc65-4d43dd8b6b66", "metadata": {}, "source": [ "## 2. Require" ] }, { "cell_type": "code", "execution_count": 1, "id": "74b76022-03ea-40ae-bac8-fc8743659042", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:RedAmber=>\"0.2.0\", :Arrow=>\"9.0.0\"}" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'red_amber' # require 'red-amber' is also OK\n", "include RedAmber\n", "{RedAmber: VERSION, Arrow: Arrow::VERSION}" ] }, { "cell_type": "markdown", "id": "d8fb6289-39ea-4fa9-a165-b87ee6d125e9", "metadata": { "tags": [] }, "source": [ "## 3. Initialize" ] }, { "cell_type": "code", "execution_count": 2, "id": "51f81824-626a-4741-a29b-30ea357fe7b5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a Hash\n", "DataFrame.new(x: [1, 2, 3], y: %w[A B C])" ] }, { "cell_type": "code", "execution_count": 3, "id": "20b696eb-c199-444d-a957-e0b1081f1506", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a schema and a row-oriented array\n", "DataFrame.new({ x: :uint8, y: :string }, [[1, 'A'], [2, 'B'], [3, 'C']])" ] }, { "cell_type": "code", "execution_count": 4, "id": "21eab151-f977-4474-a6d1-576169e24b26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From an Arrow::Table\n", "table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C])\n", "DataFrame.new(table)" ] }, { "cell_type": "code", "execution_count": 5, "id": "aa09d3da-f332-45cd-92ca-712c6a679035", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
xy
1A
2B
3C
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 A\n", "2 2 B\n", "3 3 C\n" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# From a Rover::DataFrame\n", "require 'rover'\n", "rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C])\n", "DataFrame.new(rover)" ] }, { "cell_type": "code", "execution_count": 6, "id": "cd2c3677-00fb-48fe-bb94-18bc0815db72", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# from a red-datasets\n", "require 'datasets-arrow'\n", "dataset = Datasets::Penguins.new\n", "penguins = DataFrame.new(dataset.to_arrow)" ] }, { "cell_type": "markdown", "id": "3a2d12b4-7623-42c7-9e32-76cf303c7cea", "metadata": {}, "source": [ "It should be in the future version;\n", "```ruby\n", "require 'datasets-red-amber'\n", "penguins = Datasets::Penguins.new.to_red_amber\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "id": "2e4619b7-bf6d-4081-9066-b186da8fdf5b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <32 x 11 vectors>
mpgcyldisphpdratwtqsecvsamgearcarb
21.06160.01103.92.6216.460144
21.06160.01103.92.87517.020144
22.84108.0933.852.3218.611141
21.46258.01103.083.21519.441031
19.76145.01753.622.7715.50156
15.08301.03353.543.5714.60158
21.44121.01094.112.7818.61142
" ], "text/plain": [ "#\n", " mpg cyl disp hp drat wt qsec vs am ... carb\n", " ... \n", " 1 21.0 6 160.0 110 3.9 2.62 16.46 0 1 ... 4\n", " 2 21.0 6 160.0 110 3.9 2.88 17.02 0 1 ... 4\n", " 3 22.8 4 108.0 93 3.85 2.32 18.61 1 1 ... 1\n", " 4 21.4 6 258.0 110 3.08 3.22 19.44 1 0 ... 1\n", " 5 18.7 8 360.0 175 3.15 3.44 17.02 0 0 ... 2\n", " : : : : : : : : : : ... :\n", "30 19.7 6 145.0 175 3.62 2.77 15.5 0 1 ... 6\n", "31 15.0 8 301.0 335 3.54 3.57 14.6 0 1 ... 8\n", "32 21.4 4 121.0 109 4.11 2.78 18.6 1 1 ... 2\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = Datasets::Rdatasets.new('datasets', 'mtcars')\n", "mtcars = DataFrame.new(dataset.to_arrow)" ] }, { "cell_type": "markdown", "id": "e1f77a54-3a43-4d17-bb6f-332ef13832a3", "metadata": {}, "source": [ "## 4. Load" ] }, { "cell_type": "markdown", "id": "0fed4f43-3fbb-43e5-af0d-f93401deea78", "metadata": {}, "source": [ "`RedAmber::DataFrame` delegates `#load` to `Arrow::Table#load`. We can load from `[.arrow, .arrows, .csv, .csv.gz, .tsv]` files." ] }, { "cell_type": "code", "execution_count": 8, "id": "4203e671-0a0a-405c-8482-53a8cd78a891", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame.load(\"../test/entity/with_header.csv\")" ] }, { "cell_type": "markdown", "id": "29875147-1371-4575-a565-69c3534c15f2", "metadata": {}, "source": [ "## 5. Load from a URI" ] }, { "cell_type": "code", "execution_count": 9, "id": "916b86e2-e3a2-4ebb-8770-9e8a29c46523", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 7 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
AdelieTorgersen39.118.71813750MALE
AdelieTorgersen39.517.41863800FEMALE
AdelieTorgersen40.318.01953250FEMALE
AdelieTorgersen(nil)(nil)(nil)(nil)
GentooBiscoe50.415.72225750MALE
GentooBiscoe45.214.82125200FEMALE
GentooBiscoe49.916.12135400MALE
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... sex\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... MALE\n", " 2 Adelie Torgersen 39.5 17.4 186 ... FEMALE\n", " 3 Adelie Torgersen 40.3 18.0 195 ... FEMALE\n", " 4 Adelie Torgersen (nil) (nil) (nil) ...\n", " 5 Adelie Torgersen 36.7 19.3 193 ... FEMALE\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... MALE\n", "343 Gentoo Biscoe 45.2 14.8 212 ... FEMALE\n", "344 Gentoo Biscoe 49.9 16.1 213 ... MALE\n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uri = URI(\"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv\")\n", "DataFrame.load(uri)" ] }, { "cell_type": "markdown", "id": "e6abe64d-e97f-437e-9c54-18f9e06e9668", "metadata": {}, "source": [ "## 6. Save" ] }, { "cell_type": "code", "execution_count": 10, "id": "91c0fb62-7990-47f1-9fb6-b0529bc1783f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.save(\"file.arrow\")\n", "penguins.save(\"file.arrows\")\n", "penguins.save(\"file.csv\")\n", "penguins.save(\"file.csv.gz\")\n", "penguins.save(\"file.tsv\")\n", "penguins.save(\"file.feather\")" ] }, { "cell_type": "markdown", "id": "d1d30973-9e2f-406a-9f42-9e6e4c966baf", "metadata": {}, "source": [ "## 7. to_s/inspect" ] }, { "cell_type": "markdown", "id": "a7bc9cb7-eae4-495f-831e-b747e486d0bd", "metadata": {}, "source": [ "`to_s` or `inspect` (it uses to_s inside) shows a preview of the dataframe.\n", "\n", "It shows first 5 and last 3 rows if it has many rows. Columns are also omitted if line is exceeded 80 letters." ] }, { "cell_type": "code", "execution_count": 11, "id": "af6d29ef-2e1c-4a08-a8b2-d69acda79ec5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 2 2.0 B false\n", "3 3 3.0 C true\n", "4 4 NaN D false\n", "5 5 (nil) (nil) (nil)\n", "\n" ] } ], "source": [ "df = DataFrame.new(\n", " x: [1, 2, 3, 4, 5],\n", " y: [1, 2, 3, 0/0.0, nil],\n", " s: %w[A B C D] << nil,\n", " b: [true, false, true, false, nil])\n", "p df; nil" ] }, { "cell_type": "code", "execution_count": 12, "id": "cdff2e60-bd0a-4d12-b348-201a49bbbbbe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n", "\n" ] } ], "source": [ "p penguins; nil" ] }, { "cell_type": "markdown", "id": "cb44df38-58f7-479c-b7a4-c9c305639292", "metadata": {}, "source": [ "## 8. Show table" ] }, { "cell_type": "code", "execution_count": 13, "id": "fc710035-8134-4b18-89fe-8c58b95e0e0e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\tx\t y\ts\tb\n", "0\t1\t 1.000000\tA\ttrue\n", "1\t2\t 2.000000\tB\tfalse\n", "2\t3\t 3.000000\tC\ttrue\n", "3\t4\t NaN\tD\tfalse\n", "4\t5\t (null)\t(null)\t(null)\n" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.table" ] }, { "cell_type": "code", "execution_count": 14, "id": "2634fb7b-194f-4277-94ba-05f39c497ffa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\tspecies\tisland\tbill_length_mm\tbill_depth_mm\tflipper_length_mm\tbody_mass_g\tsex\tyear\n", " 0\tAdelie \tTorgersen\t 39.100000\t 18.700000\t 181\t 3750\tmale\t2007\n", " 1\tAdelie \tTorgersen\t 39.500000\t 17.400000\t 186\t 3800\tfemale\t2007\n", " 2\tAdelie \tTorgersen\t 40.300000\t 18.000000\t 195\t 3250\tfemale\t2007\n", " 3\tAdelie \tTorgersen\t (null)\t (null)\t (null)\t (null)\t(null)\t2007\n", " 4\tAdelie \tTorgersen\t 36.700000\t 19.300000\t 193\t 3450\tfemale\t2007\n", " 5\tAdelie \tTorgersen\t 39.300000\t 20.600000\t 190\t 3650\tmale\t2007\n", " 6\tAdelie \tTorgersen\t 38.900000\t 17.800000\t 181\t 3625\tfemale\t2007\n", " 7\tAdelie \tTorgersen\t 39.200000\t 19.600000\t 195\t 4675\tmale\t2007\n", " 8\tAdelie \tTorgersen\t 34.100000\t 18.100000\t 193\t 3475\t(null)\t2007\n", " 9\tAdelie \tTorgersen\t 42.000000\t 20.200000\t 190\t 4250\t(null)\t2007\n", "...\n", "334\tGentoo \tBiscoe\t 46.200000\t 14.100000\t 217\t 4375\tfemale\t2009\n", "335\tGentoo \tBiscoe\t 55.100000\t 16.000000\t 230\t 5850\tmale\t2009\n", "336\tGentoo \tBiscoe\t 44.500000\t 15.700000\t 217\t 4875\t(null)\t2009\n", "337\tGentoo \tBiscoe\t 48.800000\t 16.200000\t 222\t 6000\tmale\t2009\n", "338\tGentoo \tBiscoe\t 47.200000\t 13.700000\t 214\t 4925\tfemale\t2009\n", "339\tGentoo \tBiscoe\t (null)\t (null)\t (null)\t (null)\t(null)\t2009\n", "340\tGentoo \tBiscoe\t 46.800000\t 14.300000\t 215\t 4850\tfemale\t2009\n", "341\tGentoo \tBiscoe\t 50.400000\t 15.700000\t 222\t 5750\tmale\t2009\n", "342\tGentoo \tBiscoe\t 45.200000\t 14.800000\t 212\t 5200\tfemale\t2009\n", "343\tGentoo \tBiscoe\t 49.900000\t 16.100000\t 213\t 5400\tmale\t2009\n" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.table" ] }, { "cell_type": "code", "execution_count": 15, "id": "9dba2a67-ede7-4663-907b-9b2dd5db1605", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x: uint8\n", "y: double\n", "s: string\n", "b: bool\n", "----\n", "x:\n", " [\n", " [\n", " 1,\n", " 2,\n", " 3,\n", " 4,\n", " 5\n", " ]\n", " ]\n", "y:\n", " [\n", " [\n", " 1,\n", " 2,\n", " 3,\n", " nan,\n", " null\n", " ]\n", " ]\n", "s:\n", " [\n", " [\n", " \"A\",\n", " \"B\",\n", " \"C\",\n", " \"D\",\n", " null\n", " ]\n", " ]\n", "b:\n", " [\n", " [\n", " true,\n", " false,\n", " true,\n", " false,\n", " null\n", " ]\n", " ]\n" ] } ], "source": [ "# This is a Red Arrow's feature\n", "puts df.table.to_s(format: :column)" ] }, { "cell_type": "code", "execution_count": 16, "id": "d1cc17b8-1cfc-4986-9dec-7bca02be32f0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================== 0 ====================\n", "x: 1\n", "y: 1.000000\n", "s: A\n", "b: true\n", "==================== 1 ====================\n", "x: 2\n", "y: 2.000000\n", "s: B\n", "b: false\n", "==================== 2 ====================\n", "x: 3\n", "y: 3.000000\n", "s: C\n", "b: true\n", "==================== 3 ====================\n", "x: 4\n", "y: NaN\n", "s: D\n", "b: false\n", "==================== 4 ====================\n", "x: 5\n", "y: (null)\n", "s: (null)\n", "b: (null)\n" ] } ], "source": [ "# This is also a Red Arrow's feature\n", "puts df.table.to_s(format: :list)" ] }, { "cell_type": "markdown", "id": "16e4ae6b-2399-43f0-be8e-65669b95c7b6", "metadata": {}, "source": [ "## 9. TDR" ] }, { "cell_type": "markdown", "id": "2d14eb4b-9026-4cc5-a71a-598946d40b67", "metadata": {}, "source": [ "TDR means 'Transposed Dataframe Representation'. It shows columns in lateral just the same shape as initializing by a Hash. TDR has some information which is useful for the exploratory data processing.\n", "\n", "- DataFrame shape: n_rows x n_columns\n", "- Data types\n", "- Levels: number of unique elements\n", "- Data preview: same data is aggregated if level is smaller (tally mode)\n", "- Show counts of abnormal element: NaN and nil" ] }, { "cell_type": "code", "execution_count": 17, "id": "8050462f-7c60-41b7-a011-af11763784dc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 5 x 4 Vectors\n", "Vectors : 2 numeric, 1 string, 1 boolean\n", "# key type level data_preview\n", "1 :x uint8 5 [1, 2, 3, 4, 5]\n", "2 :y double 5 [1.0, 2.0, 3.0, NaN, nil], 1 NaN, 1 nil\n", "3 :s string 5 [\"A\", \"B\", \"C\", \"D\", nil], 1 nil\n", "4 :b boolean 3 {true=>2, false=>2, nil=>1}\n" ] } ], "source": [ "# use the same dataframe as #7\n", "df.tdr" ] }, { "cell_type": "code", "execution_count": 18, "id": "bb616ffe-c19a-4b02-a011-601ceb3db656", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] } ], "source": [ "penguins.tdr" ] }, { "cell_type": "markdown", "id": "73b8dc18-079f-4d40-8d0e-239f010550da", "metadata": {}, "source": [ "`#tdr` has some options:\n", "\n", "`limit` : to limit a number of variables to show. Default value is `limit=10`." ] }, { "cell_type": "code", "execution_count": 19, "id": "0962845d-e642-4d2a-9607-43e197b46bc5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", " ... 5 more Vectors ...\n" ] } ], "source": [ "penguins.tdr(3)" ] }, { "cell_type": "markdown", "id": "573606c4-23b9-4b38-8c92-a04f1c1e8781", "metadata": {}, "source": [ "`elements` : max number of elements to show in observations. Default value is `elements: 5`." ] }, { "cell_type": "code", "execution_count": 20, "id": "f957d2bd-e8c0-42a1-a3b4-0a9478e740bf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] } ], "source": [ "penguins.tdr(elements: 3) # Show first 3 items in data" ] }, { "cell_type": "markdown", "id": "d37ece79-1999-49eb-a2d1-831184ee6509", "metadata": {}, "source": [ "`tally` : max level to use tally mode. Level means size of `tally`ed hash. Default value is `tally: 5`." ] }, { "cell_type": "code", "execution_count": 21, "id": "9c1c472c-3d15-4bca-9a1b-7f86c63d3ed8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 [\"Adelie\", \"Adelie\", \"Adelie\", \"Adelie\", \"Adelie\", ... ]\n", "2 :island string 3 [\"Torgersen\", \"Torgersen\", \"Torgersen\", \"Torgersen\", \"Torgersen\", ... ]\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 [\"male\", \"female\", \"female\", nil, \"female\", ... ], 11 nils\n", "8 :year uint16 3 [2007, 2007, 2007, 2007, 2007, ... ]\n" ] } ], "source": [ "penguins.tdr(tally: 0) # Don't use tally mode" ] }, { "cell_type": "markdown", "id": "e3c38037-90a1-4fc5-9904-41fc74085908", "metadata": {}, "source": [ "`#tdr_str` returns a String. `#tdr` do the same thing as `puts #tdr_str`" ] }, { "cell_type": "markdown", "id": "21d68764-1bc1-4915-99b6-5ae938b85999", "metadata": {}, "source": [ "## 10. Size and shape" ] }, { "cell_type": "code", "execution_count": 22, "id": "487399f8-a3ef-467f-aa7f-ecbaee5fcb75", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same as n_rows, n_obs\n", "df.size" ] }, { "cell_type": "code", "execution_count": 23, "id": "dc7441c3-7c85-4ce1-a20e-de8f41f280b4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same as n_cols, n_vars\n", "df.n_keys" ] }, { "cell_type": "code", "execution_count": 24, "id": "3d42fea6-801a-45f4-8e22-ea9d76ae070f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[5, 4]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# [df.size, df.n_keys], [df.n_rows, df.n_cols]\n", "df.shape" ] }, { "cell_type": "markdown", "id": "bc5caa94-325f-4014-9c90-8ac909c2b378", "metadata": {}, "source": [ "## 11. Keys" ] }, { "cell_type": "code", "execution_count": 25, "id": "bb47775f-fed0-42e6-8781-aa8b721d6112", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:x, :y, :s, :b]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.keys" ] }, { "cell_type": "code", "execution_count": 26, "id": "3d540ab0-3e52-47b7-b338-b4e0b3d929cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:species, :island, :bill_length_mm, :bill_depth_mm, :flipper_length_mm, :body_mass_g, :sex, :year]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.keys" ] }, { "cell_type": "markdown", "id": "decc6a61-9994-4d60-9827-b257cafafb70", "metadata": {}, "source": [ "## 12. Types" ] }, { "cell_type": "code", "execution_count": 27, "id": "bf9cd2bc-a213-427e-bc00-f2083b0e0471", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:uint8, :double, :string, :boolean]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.types" ] }, { "cell_type": "code", "execution_count": 28, "id": "b1ecb891-98b5-4919-9f37-1847202007d8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:string, :string, :double, :double, :uint8, :uint16, :string, :uint16]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.types" ] }, { "cell_type": "markdown", "id": "869b3670-62f8-4c23-807b-d6d100a1981e", "metadata": {}, "source": [ "## 13. Data type classes" ] }, { "cell_type": "code", "execution_count": 29, "id": "776ab4db-073b-4b30-931a-8ec77284cdc4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Arrow::UInt8DataType, Arrow::DoubleDataType, Arrow::StringDataType, Arrow::BooleanDataType]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.type_classes" ] }, { "cell_type": "code", "execution_count": 30, "id": "0546a5d0-cab1-4ca8-a2e5-0637d0fd48b6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Arrow::StringDataType, Arrow::StringDataType, Arrow::DoubleDataType, Arrow::DoubleDataType, Arrow::UInt8DataType, Arrow::UInt16DataType, Arrow::StringDataType, Arrow::UInt16DataType]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.type_classes" ] }, { "cell_type": "markdown", "id": "1c2513f6-909e-47fd-a543-66c4f424f44e", "metadata": {}, "source": [ "## 14. Indices" ] }, { "cell_type": "code", "execution_count": 31, "id": "e6e9d7ef-1471-4f23-9210-56045c9fabd5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.indexes\n", "# or\n", "df.indices" ] }, { "cell_type": "markdown", "id": "3908395f-b086-4fbb-9855-e1ce233f0595", "metadata": {}, "source": [ "## 15. To an Array or a Hash" ] }, { "cell_type": "markdown", "id": "22cb724e-cf61-40d9-a58b-9cc793e83645", "metadata": {}, "source": [ "DataFrame#to_a returns an array of row-oriented data without a header." ] }, { "cell_type": "code", "execution_count": 32, "id": "4054daad-9266-4002-8942-c0891050cb4d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[1, 1.0, \"A\", true], [2, 2.0, \"B\", false], [3, 3.0, \"C\", true], [4, NaN, \"D\", false], [5, nil, nil, nil]]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_a" ] }, { "cell_type": "markdown", "id": "f6abae59-fe31-4056-9de8-7c36e35235de", "metadata": {}, "source": [ "If you need a column-oriented array with keys, use `.to_h.to_a`" ] }, { "cell_type": "code", "execution_count": 33, "id": "d3631290-eb74-4d21-a469-86381c668c7f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>[1, 2, 3, 4, 5], :y=>[1.0, 2.0, 3.0, NaN, nil], :s=>[\"A\", \"B\", \"C\", \"D\", nil], :b=>[true, false, true, false, nil]}" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_h" ] }, { "cell_type": "code", "execution_count": 34, "id": "08c45e92-f640-4e62-bc96-ee259d0ecff4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[:x, [1, 2, 3, 4, 5]], [:y, [1.0, 2.0, 3.0, NaN, nil]], [:s, [\"A\", \"B\", \"C\", \"D\", nil]], [:b, [true, false, true, false, nil]]]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_h.to_a" ] }, { "cell_type": "markdown", "id": "39b65fc0-4405-4414-9a74-91c724ef587c", "metadata": {}, "source": [ "## 16. Schema" ] }, { "cell_type": "code", "execution_count": 35, "id": "36db7842-e9b0-4473-84d4-3aef987d427f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>:uint8, :y=>:double, :s=>:string, :b=>:boolean}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.schema" ] }, { "cell_type": "markdown", "id": "3e61237d-ac67-45bb-827c-a769dff61809", "metadata": {}, "source": [ "## 17. Vector" ] }, { "cell_type": "markdown", "id": "27402307-aaad-49c8-88ca-65346668601d", "metadata": {}, "source": [ "Each variable (column in the table) is represented by a Vector object." ] }, { "cell_type": "code", "execution_count": 36, "id": "6c9ba041-231d-4057-a280-acf620b68525", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[:x] # This syntax comes later" ] }, { "cell_type": "markdown", "id": "3e13d06d-b432-45b2-9745-0c6ef9228e23", "metadata": {}, "source": [ "Or create new Vector by the constructor." ] }, { "cell_type": "code", "execution_count": 37, "id": "3e18a4e0-238c-4800-8bda-a88a57dde3e9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new(1, 2, 3, 4, 5)" ] }, { "cell_type": "code", "execution_count": 38, "id": "3bd55d9d-b988-46b2-bc11-e3dc5f4adc6c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new(1..5)" ] }, { "cell_type": "code", "execution_count": 39, "id": "19688e6e-b59b-4a84-8c07-57e87cd0e242", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vector.new([1, 2, 3], [4, 5])" ] }, { "cell_type": "code", "execution_count": 40, "id": "076bd0e2-01ab-4497-9b9b-84f72a4805bc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array = Arrow::Array.new([1, 2, 3, 4, 5])\n", "Vector.new(array)" ] }, { "cell_type": "markdown", "id": "22091661-e78a-4c66-9e48-4c3c676469b4", "metadata": {}, "source": [ "- TODO: `Vector[1..5]` as a constructor" ] }, { "cell_type": "markdown", "id": "b729bdba-87a2-4282-bd0e-319fe17f42da", "metadata": {}, "source": [ "## 18. Vectors" ] }, { "cell_type": "markdown", "id": "f5ddd840-2f84-467b-a9bb-feb769573b69", "metadata": {}, "source": [ "Returns an Array of Vectors in a DataFrame." ] }, { "cell_type": "code", "execution_count": 41, "id": "d3ae03f2-e2fe-4a15-abe1-331185448d61", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[#\n", "[1, 2, 3, 4, 5]\n", ", #\n", "[1.0, 2.0, 3.0, NaN, nil]\n", ", #\n", "[\"A\", \"B\", \"C\", \"D\", nil]\n", ", #\n", "[true, false, true, false, nil]\n", "]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.vectors" ] }, { "cell_type": "markdown", "id": "8ac88ff3-0cb6-43d6-a999-0c2e8c6defb7", "metadata": { "tags": [] }, "source": [ "## 19. Variables\n", "\n", "Returns key and Vector pairs in a Hash." ] }, { "cell_type": "code", "execution_count": 42, "id": "3351a216-6fe5-485e-8686-53c1e754fa2e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{:x=>#\n", "[1, 2, 3, 4, 5]\n", ", :y=>#\n", "[1.0, 2.0, 3.0, NaN, nil]\n", ", :s=>#\n", "[\"A\", \"B\", \"C\", \"D\", nil]\n", ", :b=>#\n", "[true, false, true, false, nil]\n", "}" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.variables" ] }, { "cell_type": "markdown", "id": "3b518c1c-eda7-406f-a885-b2344b1726eb", "metadata": {}, "source": [ "## 20. Select columns by #[ ]" ] }, { "cell_type": "markdown", "id": "767b4e49-19eb-4d5f-b030-91bd78f0f5b9", "metadata": {}, "source": [ "`DataFrame#[]` is overloading column operations and row operations.\n", "\n", "- For columns (variables)\n", " - Key in a Symbol: `df[:symbol]`\n", " - Key in a String: `df[\"string\"]`\n", " - Keys in an Array: `df[:symbol1, \"string\", :symbol2]`\n", " - Keys by indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`" ] }, { "cell_type": "code", "execution_count": 43, "id": "ccf60edc-cccf-49e3-a503-1ca532247130", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys in a Symbol and a String\n", "df[:x, 'y']" ] }, { "cell_type": "code", "execution_count": 44, "id": "8500f8c0-ff5a-4537-9f47-03d675e31b18", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys in a Range\n", "df['x'..'y']" ] }, { "cell_type": "code", "execution_count": 45, "id": "db35cae1-35c2-47de-a7e8-906161f21282", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
sbx
Atrue1
Bfalse2
Ctrue3
Dfalse4
(nil)(nil)5
" ], "text/plain": [ "#\n", " s b x\n", " \n", "1 A true 1\n", "2 B false 2\n", "3 C true 3\n", "4 D false 4\n", "5 (nil) (nil) 5\n" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keys with a index Range, and a symbol\n", "df[df.keys[2..], :x]" ] }, { "cell_type": "markdown", "id": "03e14403-f7bc-4350-9e7b-715901164331", "metadata": {}, "source": [ "## 21. Select rows by #[ ]\n", "`DataFrame#[]` is overloading column operations and row operations.\n", "\n", "- For rows (observations)\n", " - Select rows by a Index: `df[index]`\n", " - Select rows by Indices: `df[indices]` # Array, Arrow::Array, Vectors are acceptable for indices\n", " - Select rows by Ranges: `df[range]`\n", " - Select rows by Booleans: `df[booleans]` # Array, Arrow::Array, Vectors are acceptable for booleans" ] }, { "cell_type": "code", "execution_count": 46, "id": "e3bc60a7-611e-4fd8-9770-8e0d167d3fee", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
22.0Bfalse
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n", "3 2 2.0 B false\n" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# indices\n", "df[0, 2, 1]" ] }, { "cell_type": "code", "execution_count": 47, "id": "2b8b3801-ae37-4629-9db5-ff937941c895", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
22.0Bfalse
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 2 2.0 B false\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# including a Range\n", "# negative indices are also acceptable\n", "df[1..2, -1]" ] }, { "cell_type": "code", "execution_count": 48, "id": "3f6f8d73-a66c-4773-9bf5-0878c700f2d6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
22.0Bfalse
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 2 2.0 B false\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# booleans\n", "# length of boolean should be the same as self\n", "df[false, true, true, false, true]" ] }, { "cell_type": "code", "execution_count": 49, "id": "abe57279-54fd-48ec-a1a4-c7453211e776", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow::Array\n", "indices = Arrow::UInt8Array.new([0,2,4])\n", "df[indices]" ] }, { "cell_type": "code", "execution_count": 50, "id": "2266611f-23d8-4645-a1e8-b07c2370fb3f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
xysb
33.0Ctrue
4NaNDfalse
5(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 3 3.0 C true\n", "2 4 NaN D false\n", "3 5 (nil) (nil) (nil)\n" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# By a Vector as indices\n", "indices = Vector.new(df.indices)\n", "# indices > 1 returns a boolean Vector\n", "df[indices > 1]" ] }, { "cell_type": "code", "execution_count": 51, "id": "0ea2da7e-aeca-4874-be4a-6af563aa378b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, true, false, nil]\n" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# By a Vector as booleans\n", "booleans = df[:b]" ] }, { "cell_type": "code", "execution_count": 52, "id": "9f842890-6359-4266-9a23-2f8f813ef548", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 4 vectors>
xysb
11.0Atrue
33.0Ctrue
" ], "text/plain": [ "#\n", " x y s b\n", " \n", "1 1 1.0 A true\n", "2 3 3.0 C true\n" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[booleans]" ] }, { "cell_type": "markdown", "id": "98a04874-cb2c-44c0-b410-b330b9d12b0f", "metadata": {}, "source": [ "## 22. empty?" ] }, { "cell_type": "code", "execution_count": 53, "id": "7b1ab319-90a7-4f09-8629-04dcd94076cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.empty?" ] }, { "cell_type": "code", "execution_count": 54, "id": "1e09c32f-20a8-4175-827f-cdb98063535a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame.new.empty?" ] }, { "cell_type": "code", "execution_count": 55, "id": "3f9f8771-87dd-44eb-8aac-6a3ed8b4c183", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(empty DataFrame)" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DataFrame.new" ] }, { "cell_type": "markdown", "id": "86b826dd-10e6-4087-9162-b89ac6561a61", "metadata": {}, "source": [ "## 23. Select columns by pick" ] }, { "cell_type": "markdown", "id": "b5aefd22-4e96-4dc5-91d2-e6826256bda6", "metadata": { "tags": [] }, "source": [ "`DataFrame#pick` accepts an Array of keys to pick up columns (variables) and creates a new DataFrame. You can change the order of columns at a same time." ] }, { "cell_type": "code", "execution_count": 56, "id": "68124521-b823-424d-9e06-d11aa927d618", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
sy
A1.0
B2.0
C3.0
DNaN
(nil)(nil)
" ], "text/plain": [ "#\n", " s y\n", " \n", "1 A 1.0\n", "2 B 2.0\n", "3 C 3.0\n", "4 D NaN\n", "5 (nil) (nil)\n" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(:s, :y)\n", "# or\n", "df.pick([:s, :y]) # OK too." ] }, { "cell_type": "markdown", "id": "a76dca00-da8f-4959-be18-7a1015a9d13c", "metadata": {}, "source": [ "Or use a boolean Array of lengeh `n_key` to `pick`. This style remains the order of variables." ] }, { "cell_type": "code", "execution_count": 57, "id": "b91f8925-529c-43c9-93ba-e21bcac0f2f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(false, true, true, false)\n", "# or\n", "df.pick([false, true, true, false]) # OK" ] }, { "cell_type": "markdown", "id": "5f903182-745b-4923-99d8-14a9b9c6ea4c", "metadata": {}, "source": [ "`#pick` also accepts a block in the context of self.\n", "\n", "Next example is picking up numeric variables." ] }, { "cell_type": "code", "execution_count": 58, "id": "37bb0a49-c38a-484c-91d4-3e23ab43a727", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reciever is required with the argument style\n", "df.pick(df.vectors.map(&:numeric?))\n", "\n", "# with a block\n", "df.pick { vectors.map(&:numeric?) }" ] }, { "cell_type": "markdown", "id": "e51f07c0-54eb-4114-8cd6-63c7780e7248", "metadata": {}, "source": [ "The name `pick` comes from the action to pick variables(columns) according to the label keys." ] }, { "cell_type": "markdown", "id": "7c1815e4-de6c-425e-8602-b8dd66836250", "metadata": {}, "source": [ "## 24. Reject columns by drop" ] }, { "cell_type": "markdown", "id": "d1ab045e-66f9-4922-8bf2-35aee7f2812e", "metadata": { "tags": [] }, "source": [ "`DataFrame#drop` accepts an Array keys to drop columns (variables) to create a remainer DataFrame." ] }, { "cell_type": "code", "execution_count": 59, "id": "7ccace08-62b0-4b0b-93fb-81edf673abf7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(:x, :b)\n", "# df.drop([:x, :b]) #is OK too." ] }, { "cell_type": "markdown", "id": "2085b349-95c5-4607-b029-f7c3d630ac1c", "metadata": {}, "source": [ "Or use a boolean Array of lengeh `n_key` to `drop`." ] }, { "cell_type": "code", "execution_count": 60, "id": "785c02f1-1e16-4722-9961-4b49223c8290", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
ys
1.0A
2.0B
3.0C
NaND
(nil)(nil)
" ], "text/plain": [ "#\n", " y s\n", " \n", "1 1.0 A\n", "2 2.0 B\n", "3 3.0 C\n", "4 NaN D\n", "5 (nil) (nil)\n" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(true, false, false, true)\n", "# df.drop([true, false, false, true]) # is OK too" ] }, { "cell_type": "markdown", "id": "d246161e-02cc-40fb-8921-26b37eb3956f", "metadata": {}, "source": [ "`#drop` also accepts a block in the context of self.\n", "\n", "Next example will drop variables which have nil or NaN values." ] }, { "cell_type": "code", "execution_count": 61, "id": "069932e3-d393-4ede-9eb5-7aac8625e0c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop { vectors.map { |v| v.is_na.any } }" ] }, { "cell_type": "markdown", "id": "88b064d6-7d90-4a0b-b9c8-d92e103269fb", "metadata": {}, "source": [ "Argument style is also acceptable but it requires the reciever 'df'." ] }, { "cell_type": "code", "execution_count": 62, "id": "3003a5c2-0966-4f2c-9643-59e8b546c8aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(df.vectors.map { |v| v.is_na.any })" ] }, { "cell_type": "markdown", "id": "c6fce15c-d4a9-4281-9c07-457e78d3c13e", "metadata": {}, "source": [ "The name `drop` comes from the pair word of `pick`." ] }, { "cell_type": "markdown", "id": "0f6dc86c-828d-4f9f-8b07-fce63c30fdca", "metadata": {}, "source": [ "## 25. Pick/drop and nil" ] }, { "cell_type": "markdown", "id": "0a108878-565b-400e-9a47-a15aae09429c", "metadata": {}, "source": [ "When `pick` or `drop` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `BasicObject#!`." ] }, { "cell_type": "code", "execution_count": 63, "id": "7c01fbb4-9bfa-4afc-8e6b-45c97c0beb03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = [true, true, false, nil]\n", "booleans_invert = booleans.map(&:!) # => [false, false, true, true] because nil.! is true\n", "df.pick(booleans) == df.drop(booleans_invert)" ] }, { "cell_type": "markdown", "id": "12a24264-9b7a-42a1-a541-e292e3876e35", "metadata": {}, "source": [ "## 26. Vector#invert, #primitive_invert\n", "\n", "For the boolean Vector;" ] }, { "cell_type": "code", "execution_count": 64, "id": "ea352e12-7e8a-43be-b8ac-797adbc47708", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, false, nil]\n" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = Vector.new(booleans)" ] }, { "cell_type": "markdown", "id": "2a0f82e0-157b-4185-9254-0618be291f9b", "metadata": {}, "source": [ "nil is converted to nil by `Vector#invert`." ] }, { "cell_type": "code", "execution_count": 65, "id": "596c521f-12bf-4448-9e5d-e1b4a2c3d896", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, nil]\n" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.invert\n", "# or\n", "!vector" ] }, { "cell_type": "markdown", "id": "a1aec910-3055-4627-a02b-22d45f2ceb70", "metadata": {}, "source": [ "So `df.pick(booleans) != df.drop(booleans.invert)` when booleans have any nils.\n", "\n", "On the other hand, `Vector#primitive_invert` follows Ruby's `BasicObject#!`'s behavior. Then pick and drop keep 'MECE' behavior." ] }, { "cell_type": "code", "execution_count": 66, "id": "4dcaba48-1cea-4ce9-b4a9-b079b43af7ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, true]\n" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.primitive_invert" ] }, { "cell_type": "code", "execution_count": 67, "id": "c7ae4dad-275a-49e0-a0b0-bf3686248070", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(vector) == df.drop(vector.primitive_invert)" ] }, { "cell_type": "markdown", "id": "9a6cec74-43f0-4a72-8262-25b1e311f602", "metadata": {}, "source": [ "## 27. Pick/drop and [ ]" ] }, { "cell_type": "markdown", "id": "32c8f74d-b3ce-4305-9af7-6ea70052c773", "metadata": {}, "source": [ "When `pick` or `drop` select a single column (variable), it returns a `DataFrame` with one column (variable)." ] }, { "cell_type": "code", "execution_count": 68, "id": "e13aee24-cac6-41ad-b8a3-0ec26edbe5d1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 1 vector>
x
1
2
3
4
5
" ], "text/plain": [ "#\n", " x\n", " \n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick(:x) # or\n", "df.drop(:y, :s, :b)" ] }, { "cell_type": "markdown", "id": "3e47b9d2-929e-4674-9690-0a1fdf7b0a7d", "metadata": {}, "source": [ "In contrast, when `[]` selects a single column (variable), it returns a `Vector`." ] }, { "cell_type": "code", "execution_count": 69, "id": "60d228be-7357-434d-9d39-ee72c110e6fe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3, 4, 5]\n" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[:x]" ] }, { "cell_type": "markdown", "id": "6d973934-e08b-4b45-8efb-52f9167e7238", "metadata": {}, "source": [ "This behavior may be useful to use with DataFrame manipulation verbs (like pick, drop, slice, remove, assign, rename)." ] }, { "cell_type": "code", "execution_count": 70, "id": "6beefc5a-dc47-42cc-a283-456073c4251e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 2 vectors>
xy
11.0
22.0
33.0
4NaN
5(nil)
" ], "text/plain": [ "#\n", " x y\n", " \n", "1 1 1.0\n", "2 2 2.0\n", "3 3 3.0\n", "4 4 NaN\n", "5 5 (nil)\n" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pick { keys.select { |key| df[key].numeric? } }" ] }, { "cell_type": "markdown", "id": "34c9bcb0-889a-4190-b2b8-49765cd059c2", "metadata": {}, "source": [ "## 28. Slice" ] }, { "cell_type": "markdown", "id": "9a428ba8-c306-4ab8-8607-51174e8e6ebe", "metadata": {}, "source": [ "`slice` selects rows (observations) to create a subset of a DataFrame." ] }, { "cell_type": "markdown", "id": "6016d6d4-72d6-4ae2-b7dd-3d526c91ae61", "metadata": {}, "source": [ "`slice(indeces)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. Negative index from the tail like Ruby's Array is also acceptable." ] }, { "cell_type": "code", "execution_count": 71, "id": "9cdce2e4-7876-4be6-bd1f-bc8ab6e6c871", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <10 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", " 8 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", " 9 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "10 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns 5 rows from the start and 5 rows from the end\n", "penguins.slice(0...5, -5..-1)" ] }, { "cell_type": "code", "execution_count": 72, "id": "93c3f6f0-7bc9-4909-8f32-20e8c1ddfd3a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <1 x 9 vectors>
indexspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
113AdelieBiscoe42.219.51974275male2009
" ], "text/plain": [ "#\n", " index species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", "1 113 Adelie Biscoe 42.2 19.5 197 ... 2009\n" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# slice accepts Float index\n", "# 33% of 344 observations in index => 113.52 th data ??\n", "indexed_penguins = penguins.assign_left { [:index, indexes] } # #assign_left and assigner by Array is 0.2.0 feature\n", "indexed_penguins.slice(penguins.size * 0.33)" ] }, { "cell_type": "markdown", "id": "8139bb28-89f8-4058-b824-dde33ead0b60", "metadata": {}, "source": [ "Indices in Vectors or Arrow::Arrays are also acceptable." ] }, { "cell_type": "markdown", "id": "6f79db8c-c706-4d60-949b-3f644474d375", "metadata": {}, "source": [ "Another way to select in `slice` is to use booleans.\n", "- Booleans is an Array, Arrow::Array, Vector or their Array.\n", "- Each data type must be boolean.\n", "- Size of booleans must be same as the size of self." ] }, { "cell_type": "code", "execution_count": 73, "id": "f58ca131-7375-4489-90ce-6ba54b898eb5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, true, nil, false, false, false, false, false, true, false, false, ... ]\n" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# make booleans to check over 40\n", "booleans = penguins[:bill_length_mm] > 40" ] }, { "cell_type": "code", "execution_count": 74, "id": "176ab365-c66a-4712-97b9-4381a536321b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <242 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen42.020.21904250(nil)2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen42.520.71974500male2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 2 Adelie Torgersen 42.0 20.2 190 ... 2007\n", " 3 Adelie Torgersen 41.1 17.6 182 ... 2007\n", " 4 Adelie Torgersen 42.5 20.7 197 ... 2007\n", " 5 Adelie Torgersen 46.0 21.5 194 ... 2007\n", " : : : : : : ... :\n", "240 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "241 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "242 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.slice(booleans)" ] }, { "cell_type": "markdown", "id": "3264a182-6b72-461a-b712-c3b708c53516", "metadata": {}, "source": [ "`slice` accepts a block.\n", "- We can't use both arguments and a block at a same time.\n", "- The block should return indeces in any length or a boolean Array with a same length as `size`.\n", "- Block is called in the context of self. So reciever 'self' can be omitted in the block." ] }, { "cell_type": "code", "execution_count": 75, "id": "c95d3426-0bbb-430e-8d83-6e22434d99ed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <204 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen39.320.61903650male2007
GentooBiscoe47.213.72144925female2009
GentooBiscoe46.814.32154850female2009
GentooBiscoe45.214.82125200female2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " 5 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " : : : : : : ... :\n", "202 Gentoo Biscoe 47.2 13.7 214 ... 2009\n", "203 Gentoo Biscoe 46.8 14.3 215 ... 2009\n", "204 Gentoo Biscoe 45.2 14.8 212 ... 2009\n" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# return a DataFrame with bill_length_mm is in 2*std range around mean\n", "penguins.slice do\n", " vector = self[:bill_length_mm]\n", " min = vector.mean - vector.std\n", " max = vector.mean + vector.std\n", " vector.to_a.map { |e| (min..max).include? e }\n", "end" ] }, { "cell_type": "markdown", "id": "4fa42801-64f5-4432-856b-85c26a68515d", "metadata": {}, "source": [ "## 29. Slice and nil option" ] }, { "cell_type": "markdown", "id": "31017a7e-0923-4283-bc92-246ebe2591c3", "metadata": {}, "source": [ "`Arrow::Table#slice` uses `#filter` method with a option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row." ] }, { "cell_type": "code", "execution_count": 76, "id": "8e4a8108-154b-4621-acd1-704ddf229d61", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\t a\tb\t c\n", "0\t 1\tA\t 1.000000\n", "1\t(null)\t(null)\t (null)\n" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }\n", "table = Arrow::Table.new(hash)\n", "table.slice([true, false, nil])" ] }, { "cell_type": "markdown", "id": "dbb57c5a-e949-42b8-a82c-9affb3fe5b7b", "metadata": {}, "source": [ "Whereas in RedAmber, `DataFrame#slice` with booleans containing nil is treated as false. This behavior comes from `Allow::FilterOptions.null_selection_behavior = :drop`. This is a default value for `Arrow::Table.filter` method." ] }, { "cell_type": "code", "execution_count": 77, "id": "851c3bf6-b9e9-41bd-92c5-5372ed934549", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "\ta\tb\t c\n", "0\t1\tA\t 1.000000\n" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "RedAmber::DataFrame.new(table).slice([true, false, nil]).table" ] }, { "cell_type": "markdown", "id": "56398a3d-6146-43af-8b96-fec37730fc49", "metadata": {}, "source": [ "## 30. Remove" ] }, { "cell_type": "markdown", "id": "9e042a97-8a5d-412e-8e4a-fda382225a2d", "metadata": {}, "source": [ "Slice and reject rows (observations) to create a remainer DataFrame." ] }, { "cell_type": "markdown", "id": "2b4cbb97-eef3-4db8-8f25-c44c208ec554", "metadata": {}, "source": [ "`#remove(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer." ] }, { "cell_type": "code", "execution_count": 78, "id": "17e38ab8-886b-4114-bcaf-ee18df7d00cd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <334 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.320.61903650male2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male2007
AdelieTorgersen34.118.11933475(nil)2007
GentooBiscoe44.515.72174875(nil)2009
GentooBiscoe48.816.22226000male2009
GentooBiscoe47.213.72144925female2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " 2 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " 3 Adelie Torgersen 39.2 19.6 195 ... 2007\n", " 4 Adelie Torgersen 34.1 18.1 193 ... 2007\n", " 5 Adelie Torgersen 42.0 20.2 190 ... 2007\n", " : : : : : : ... :\n", "332 Gentoo Biscoe 44.5 15.7 217 ... 2009\n", "333 Gentoo Biscoe 48.8 16.2 222 ... 2009\n", "334 Gentoo Biscoe 47.2 13.7 214 ... 2009\n" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# returns 6th to 339th obs. Remainer of 1st example of #30\n", "penguins.remove(0...5, -5..-1)" ] }, { "cell_type": "markdown", "id": "def1c1c4-6b60-4864-ae24-c797fbf008a7", "metadata": {}, "source": [ "`remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `#size`." ] }, { "cell_type": "code", "execution_count": 79, "id": "6f169420-7eb2-457f-8d59-7a5c90aa3fa5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove all observation contains nil\n", "removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }" ] }, { "cell_type": "markdown", "id": "5f1864c9-4ae4-4fcd-9840-ea424ef5e27d", "metadata": {}, "source": [ "`remove {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as size. Block is called in the context of self." ] }, { "cell_type": "code", "execution_count": 80, "id": "a6807c65-25e5-4ee1-8d1b-6018c46b3999", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <140 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen34.118.11933475(nil)2007
AdelieTorgersen37.817.11863300(nil)2007
GentooBiscoe(nil)(nil)(nil)(nil)(nil)2009
GentooBiscoe50.415.72225750male2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 2 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 3 Adelie Torgersen 34.1 18.1 193 ... 2007\n", " 4 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " 5 Adelie Torgersen 37.8 17.3 180 ... 2007\n", " : : : : : : ... :\n", "138 Gentoo Biscoe (nil) (nil) (nil) ... 2009\n", "139 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "140 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove data in 2*std range around mean\n", "penguins.remove do\n", " vector = self[:bill_length_mm]\n", " min = vector.mean - vector.std\n", " max = vector.mean + vector.std\n", " vector.to_a.map { |e| (min..max).include? e }\n", "end" ] }, { "cell_type": "markdown", "id": "591e6b22-da98-4336-b22e-c7bc9bcf2ebf", "metadata": {}, "source": [ "## 31. Remove and nil" ] }, { "cell_type": "markdown", "id": "67926d1b-c76e-4cb7-b679-6545d850e7e4", "metadata": {}, "source": [ "When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`." ] }, { "cell_type": "code", "execution_count": 81, "id": "8575614e-f702-4ee4-ac7b-745e9b32e803", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
abc
1A1.0
2B2.0
(nil)C3.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n", "2 2 B 2.0\n", "3 (nil) C 3.0\n" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])" ] }, { "cell_type": "code", "execution_count": 82, "id": "932a5e71-8cef-44e5-a789-ce97329bc001", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil]\n" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = df[:a] < 2" ] }, { "cell_type": "code", "execution_count": 83, "id": "74cf6aa6-8913-433d-97ad-bba2d548afe5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[false, true, true]" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans_invert = booleans.to_a.map(&:!)" ] }, { "cell_type": "code", "execution_count": 84, "id": "5e466a06-cb17-4dc1-a5b0-34bfd3ffb78b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(booleans) == df.remove(booleans_invert)" ] }, { "cell_type": "markdown", "id": "8bca0b06-2d08-4c28-8b4c-4fd088f2d2d3", "metadata": {}, "source": [ "Whereas `Vector#invert` returns nil for elements nil. This will bring different result. (See #26)" ] }, { "cell_type": "code", "execution_count": 85, "id": "077b216f-0a08-413e-95c9-12789d15a9ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, nil]\n" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.invert" ] }, { "cell_type": "code", "execution_count": 86, "id": "b3df62a6-c4a3-44cb-bde6-f6be12b120c8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
abc
1A1.0
(nil)C3.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n", "2 (nil) C 3.0\n" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(booleans.invert)" ] }, { "cell_type": "markdown", "id": "e05f00b6-3bae-4650-8bbc-d4e0692f6f85", "metadata": {}, "source": [ "We have `#primitive_invert` method in Vector. This method returns the same result as `.to_a.map(&:!)` above." ] }, { "cell_type": "code", "execution_count": 87, "id": "296ca3cd-a6da-4603-a576-d8c36a810e4f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, true]\n" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.primitive_invert" ] }, { "cell_type": "code", "execution_count": 88, "id": "ba5b8c0b-b94e-4209-adcd-258ea3b87bfd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <1 x 3 vectors>
abc
1A1.0
" ], "text/plain": [ "#\n", " a b c\n", " \n", "1 1 A 1.0\n" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(booleans.primitive_invert)" ] }, { "cell_type": "code", "execution_count": 89, "id": "2446792f-0b0a-4642-acae-b4fec89261c1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice(booleans) == df.remove(booleans.primitive_invert)" ] }, { "cell_type": "markdown", "id": "7c23a4ad-0c17-4178-b58a-abfd8153d49b", "metadata": {}, "source": [ "## 32. Remove nil" ] }, { "cell_type": "markdown", "id": "84c7238b-1029-416f-b495-9d045f77b22c", "metadata": {}, "source": [ "Remove any observations containing nil." ] }, { "cell_type": "code", "execution_count": 90, "id": "de4bb615-d14d-4c90-ab54-db2f375b9f00", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.remove_nil" ] }, { "cell_type": "markdown", "id": "4a4ae8f9-dcf8-4dad-bb77-af076e9cadb5", "metadata": {}, "source": [ "The roundabout way for this is to use `#remove`." ] }, { "cell_type": "code", "execution_count": 91, "id": "27a3da5f-0ea2-4c5d-a6c3-c0e20f2224a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <333 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n", " : : : : : : ... :\n", "331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.remove { vectors.map(&:is_nil).reduce(&:|) }" ] }, { "cell_type": "markdown", "id": "4f2a58fd-f033-44f6-9eb4-ed893a2b5d1d", "metadata": {}, "source": [ "## 33. Rename" ] }, { "cell_type": "markdown", "id": "c0d39506-8ae5-48e7-9dd2-acf38d4ec1a9", "metadata": {}, "source": [ "Rename keys (column names) to create a updated DataFrame." ] }, { "cell_type": "markdown", "id": "3f6924ec-e86c-4089-ae40-6783027d3ce0", "metadata": {}, "source": [ "`#rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Array `[[existing_key, new_key], ...]` ." ] }, { "cell_type": "code", "execution_count": 92, "id": "9396c96d-83d7-4b92-a4ca-27bc9e4d7b9d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }\n", "comecome = RedAmber::DataFrame.new(h)" ] }, { "cell_type": "code", "execution_count": 93, "id": "fad279c6-1ca0-4493-bd69-0e9ef011bff7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage_in_1993
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age_in_1993\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comecome.rename(:age => :age_in_1993)\n", "# comecome.rename(:age, :age_in_1993) # is also OK\n", "# comecome.rename([:age, :age_in_1993]) # is also OK" ] }, { "cell_type": "markdown", "id": "9dabb005-9822-4c4b-aaa5-fa6f28f2ed43", "metadata": {}, "source": [ "`#rename {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Array `[[existing_key, new_key], ...]`. Block is called in the context of self." ] }, { "cell_type": "markdown", "id": "aabbba20-6ef8-4da2-8dc0-0cb243cf3b23", "metadata": {}, "source": [ "Symbol key and String key are distinguished." ] }, { "cell_type": "markdown", "id": "07f98b31-6123-4466-b4f8-f995c7cde474", "metadata": {}, "source": [ "## 34. Assign" ] }, { "cell_type": "markdown", "id": "99f6787f-2b36-4360-b155-1c2d7874d25e", "metadata": {}, "source": [ "Assign new or updated columns (variables) and create a updated DataFrame.\n", "\n", "- Columns with new keys will append new variables at right (bottom in TDR).\n", "- Columns with exisiting keys will update corresponding vectors." ] }, { "cell_type": "markdown", "id": "b4b22da0-4ee2-4196-88e1-1cfea6a72f4d", "metadata": {}, "source": [ "`#assign(key_pairs)` accepts pairs of key and array_like values as arguments. The pairs should be a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. " ] }, { "cell_type": "code", "execution_count": 94, "id": "56dcfed8-a6f9-4d8c-bac3-e8ce7c0674a7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
nameage
Yasuko68
Rui49
Hinata28
" ], "text/plain": [ "#\n", " name age\n", " \n", "1 Yasuko 68\n", "2 Rui 49\n", "3 Hinata 28\n" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comecome = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28] )" ] }, { "cell_type": "code", "execution_count": 95, "id": "8da8d282-8798-44d5-bb7b-7fa2df922308", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
nameagebrother
Yasuko97Santa
Rui78(nil)
Hinata57Momotaro
" ], "text/plain": [ "#\n", " name age brother\n", " \n", "1 Yasuko 97 Santa\n", "2 Rui 78 (nil)\n", "3 Hinata 57 Momotaro\n" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# update :age and add :brother\n", "assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }\n", "comecome.assign(assigner)" ] }, { "cell_type": "markdown", "id": "e6d3ddfc-b16d-4b20-83df-357e9cdb32e6", "metadata": {}, "source": [ "`#assign {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return pairs of key and array_like values as a Hash of `{key => array_like}` or an Array of Array `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. Block is called in the context of self." ] }, { "cell_type": "code", "execution_count": 96, "id": "8d69edd0-7ad7-4318-8033-1785ce2543db", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
indexfloatstring
00.0A
11.1B
22.2C
3NaND
(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " index float string\n", " \n", "1 0 0.0 A\n", "2 1 1.1 B\n", "3 2 2.2 C\n", "4 3 NaN D\n", "5 (nil) (nil) (nil)\n" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = RedAmber::DataFrame.new(\n", " index: [0, 1, 2, 3, nil],\n", " float: [0.0, 1.1, 2.2, Float::NAN, nil],\n", " string: ['A', 'B', 'C', 'D', nil])" ] }, { "cell_type": "code", "execution_count": 97, "id": "e884af01-d82b-42e7-8e92-62baf19919cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 3 vectors>
indexfloatstring
0-0.0A
255-1.1B
254-2.2C
253NaND
(nil)(nil)(nil)
" ], "text/plain": [ "#\n", " index float string\n", " \n", "1 0 -0.0 A\n", "2 255 -1.1 B\n", "3 254 -2.2 C\n", "4 253 NaN D\n", "5 (nil) (nil) (nil)\n" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# update numeric variables\n", "df.assign do\n", " vectors.select(&:numeric?).map { |v| [v.key, -v] }\n", "end" ] }, { "cell_type": "markdown", "id": "7b8e2090-628f-4b17-8929-cbb5e0285dff", "metadata": {}, "source": [ "In this example, columns :x and :y are updated. Column :x returns complements for #negate method because :x is :uint8 type." ] }, { "cell_type": "code", "execution_count": 98, "id": "9452f8db-5f23-4044-ac87-ac5695fae8ae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[:uint8, :double, :string]" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.types" ] }, { "cell_type": "markdown", "id": "c5c83559-f4d8-4ed2-8b20-5c50eb1faa14", "metadata": {}, "source": [ "## 35. Coerce (Vector)" ] }, { "cell_type": "markdown", "id": "77bdfc69-b728-4335-b76e-e4be92f94310", "metadata": {}, "source": [ "Vector has coerce method." ] }, { "cell_type": "code", "execution_count": 99, "id": "2bfbe584-be54-486b-af32-e76b37c10e49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[1, 2, 3]\n" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = RedAmber::Vector.new(1,2,3)" ] }, { "cell_type": "code", "execution_count": 100, "id": "ce35d901-38a8-4f13-b2d1-29b83f6c5438", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1, -2, -3]\n" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Vector's `#*` method\n", "vector * -1" ] }, { "cell_type": "code", "execution_count": 101, "id": "7d5fc2be-f590-4678-92e9-faa27b618266", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1, -2, -3]\n" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# coerced calculation\n", "-1 * vector" ] }, { "cell_type": "code", "execution_count": 102, "id": "fa90a6af-add7-42f2-9707-7d726575aeb6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[255, 254, 253]\n" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# `@-` operator\n", "-vector" ] }, { "cell_type": "markdown", "id": "4820b527-44e9-4738-aa0e-73604078b3b0", "metadata": { "tags": [] }, "source": [ "## 36. to_ary (Vector)" ] }, { "cell_type": "markdown", "id": "8507dcc4-74e3-44ad-aa54-cf43d55f2131", "metadata": {}, "source": [ "`Vector#to_ary` will enable implicit conversion to an Array." ] }, { "cell_type": "code", "execution_count": 103, "id": "b12bd7c8-2981-426c-8ae3-154504a8ea15", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3, 4, 5]" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Array(Vector.new([3, 4, 5]))" ] }, { "cell_type": "code", "execution_count": 104, "id": "c0cb5a98-7cdf-43a8-b2f7-f9df1961c761", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, 4, 5]" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[1, 2] + Vector.new([3, 4, 5])" ] }, { "cell_type": "markdown", "id": "216dde4f-e4d8-4f29-903a-8cbf75de5b8e", "metadata": {}, "source": [ "## 37. Fill nil (Vector)" ] }, { "cell_type": "markdown", "id": "1959d0d7-6d09-4fa5-9365-1e2f7fc35d61", "metadata": {}, "source": [ "`Vector#fill_nil_forward` or `Vector#fill_nil_backward` will\n", "propagate the last valid observation forward (or backward).\n", "Or preserve nil if all previous values are nil or at the end." ] }, { "cell_type": "code", "execution_count": 105, "id": "d003b06a-859f-4de0-9e35-803efac85169", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[0, 1, 1, 3, 3]\n" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integer = Vector.new([0, 1, nil, 3, nil])\n", "integer.fill_nil_forward" ] }, { "cell_type": "code", "execution_count": 106, "id": "c5d74006-d364-4e86-8a5e-9e96e87a96e0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[0, 1, 3, 3, nil]\n" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integer.fill_nil_backward" ] }, { "cell_type": "markdown", "id": "347785a6-eab0-4864-a871-2c320005211e", "metadata": {}, "source": [ "## 38. all?/any? (Vector)" ] }, { "cell_type": "markdown", "id": "f82a6f5d-03d3-4645-85f5-d25999165378", "metadata": {}, "source": [ "`Vector#all?` returns true if all elements is true.\n", "\n", "`Vector#any?` returns true if exists any true.\n", "\n", "These are unary aggregation function." ] }, { "cell_type": "code", "execution_count": 107, "id": "ebad37ad-0a09-48b1-ba3a-4e030a917837", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans = Vector.new([true, true, nil])\n", "booleans.all?" ] }, { "cell_type": "code", "execution_count": 108, "id": "97fc24da-03d4-406d-b353-562896775d60", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.any?" ] }, { "cell_type": "markdown", "id": "0ff3b22d-9f7c-42f2-8d18-c89a06af681b", "metadata": {}, "source": [ "If these methods are used with option `skip_nulls: false` nil is considered." ] }, { "cell_type": "code", "execution_count": 109, "id": "3e0e5800-665a-4a05-b2cb-d152f3f077de", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.all?(skip_nulls: false)" ] }, { "cell_type": "code", "execution_count": 110, "id": "3e43f0c4-a254-4735-ac28-de14d2670c67", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booleans.any?(skip_nulls: false)" ] }, { "cell_type": "markdown", "id": "abc71a85-7958-4a21-91cf-8c96c0784525", "metadata": {}, "source": [ "## 39. count/count_uniq (Vector)" ] }, { "cell_type": "markdown", "id": "3d556118-4105-4d12-806d-ba56c6ae3d1b", "metadata": {}, "source": [ "`Vector#count` counts element.\n", "\n", "`Vector#count_uniq` counts unique element. `#count_distinct` is an alias (Arrow's name).\n", "\n", "These are unary aggregation function." ] }, { "cell_type": "code", "execution_count": 111, "id": "2af73e32-1d7e-4f80-b54e-c40ef08b7034", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = Vector.new(%w[A B A])\n", "string.count" ] }, { "cell_type": "code", "execution_count": 112, "id": "fe6d8d85-27b0-438f-b1b4-1b15e9eb05f9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string.count_uniq # count_distinct is also OK" ] }, { "cell_type": "markdown", "id": "70abed9f-665a-4ea7-939e-4b185ee53755", "metadata": {}, "source": [ "## 40. stddev/variance (Vector)" ] }, { "cell_type": "markdown", "id": "965de338-b3be-4d33-92e1-5ad7e2ed18f0", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 113, "id": "0afec200-f377-432b-a260-ae5a0c5ce794", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.816496580927726" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integers = Vector.new([1, 2, 3, nil])\n", "integers.stddev" ] }, { "cell_type": "code", "execution_count": 114, "id": "2e40ac09-cb7f-4978-87e8-53f84f16f7c7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unbiased standard deviation\n", "integers.sd" ] }, { "cell_type": "code", "execution_count": 115, "id": "e6158e3b-4af8-467c-a355-8e9f2e579548", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6666666666666666" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integers.variance" ] }, { "cell_type": "code", "execution_count": 116, "id": "d64d39f2-d979-49f1-9946-65890f40d646", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unbiased variance\n", "integers.var" ] }, { "cell_type": "markdown", "id": "25023f5a-798a-40a5-ab84-a6615602f747", "metadata": {}, "source": [ "## 41. negate (Vector)" ] }, { "cell_type": "markdown", "id": "00ddf322-ef50-40a1-86a6-22bf3d43f007", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 117, "id": "ab5a357a-e98c-40a1-9b89-0b38645e416f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1.0, 2.0, -3.0]\n" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = Vector.new([1.0, -2, 3])\n", "double.negate" ] }, { "cell_type": "code", "execution_count": 118, "id": "8a06c856-d61c-4752-a296-1fa207ffd9a1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[-1.0, 2.0, -3.0]\n" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as #negate\n", "-double" ] }, { "cell_type": "markdown", "id": "9b145724-d165-4ef3-8a06-2948dd0c7dbb", "metadata": {}, "source": [ "## 42. round (Vector)" ] }, { "cell_type": "markdown", "id": "b780c2f3-935c-4b2f-b18a-b277cf7c24b7", "metadata": {}, "source": [ "Otions for `#round`;\n", "\n", "- `:n-digits` The number of digits to show.\n", "- `round_mode` Specify rounding mode.\n", "\n", "This is a unary element-wise function." ] }, { "cell_type": "code", "execution_count": 119, "id": "e7a069b0-3547-4cd2-a2f0-0740f186b191", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.15, 2.5, 3.5, -4.5, -5.5]\n" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = RedAmber::Vector.new([15.15, 2.5, 3.5, -4.5, -5.5])" ] }, { "cell_type": "code", "execution_count": 120, "id": "5ee84b24-8830-4788-a404-d5e1cca22abf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round" ] }, { "cell_type": "code", "execution_count": 121, "id": "20adb1ad-473c-4245-b959-7848c239fb76", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_to_even)" ] }, { "cell_type": "code", "execution_count": 122, "id": "d2777ad8-2c24-48e4-8f5f-77403e3109ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[16.0, 3.0, 4.0, -5.0, -6.0]\n" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :towards_infinity)" ] }, { "cell_type": "code", "execution_count": 123, "id": "a8ab2735-74cb-4cfe-a5a2-61bfa90c72ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 4.0, -4.0, -5.0]\n" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_up)" ] }, { "cell_type": "code", "execution_count": 124, "id": "3575481c-40ed-405f-a69c-7581d4dce2cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 3.0, -4.0, -5.0]\n" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_towards_zero)" ] }, { "cell_type": "code", "execution_count": 125, "id": "a86e4c5c-aced-4a88-b692-4e26b90f1653", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 4.0, -5.0, -6.0]\n" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_towards_infinity)" ] }, { "cell_type": "code", "execution_count": 126, "id": "73f51bab-ff46-4b99-96a5-8c6547ad9d35", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 3.0, 3.0, -5.0, -5.0]\n" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(mode: :half_to_odd)" ] }, { "cell_type": "code", "execution_count": 127, "id": "a12c684c-4a63-4dac-a81b-969978812a24", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.0, 2.0, 4.0, -4.0, -6.0]\n" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: 0)" ] }, { "cell_type": "code", "execution_count": 128, "id": "17370f2b-0957-411b-8145-56aa9fc956ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[15.2, 2.5, 3.5, -4.5, -5.5]\n" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: 1)" ] }, { "cell_type": "code", "execution_count": 129, "id": "53072cff-b28b-4672-b30a-8ca37562bc21", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[20.0, 0.0, 0.0, -0.0, -10.0]\n" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.round(n_digits: -1)" ] }, { "cell_type": "markdown", "id": "51dedfce-51c7-4e5b-b890-a90ad9cf7596", "metadata": {}, "source": [ "## 43. and/or (Vector)" ] }, { "cell_type": "markdown", "id": "b2c4869b-6ebf-476c-b2fd-a4b9c0638dc5", "metadata": {}, "source": [ "RedAmber select `and_kleene`/`or_kleene` as default `&`/`|` method.\n", "\n", "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 130, "id": "2d4f5853-1ed9-4d8b-87a9-b5c1faac5fae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, false, false, false, nil, false, nil]\n" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool_self = Vector.new([true, true, true, false, false, false, nil, nil, nil])\n", "bool_other = Vector.new([true, false, nil, true, false, nil, true, false, nil])\n", "\n", "bool_self & bool_other # same as bool_self.and_kleene(bool_other)" ] }, { "cell_type": "code", "execution_count": 131, "id": "236c9733-8d45-467e-b288-e6c18b9c39d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, true, false, nil, true, false, nil]\n" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ruby's primitive `&&`\n", "bool_self && bool_other" ] }, { "cell_type": "code", "execution_count": 132, "id": "4e984a9c-7d9c-465d-bf26-0c685dedd4bf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, nil, false, false, nil, nil, nil, nil]\n" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow's default `and`\n", "bool_self.and_org(bool_other)" ] }, { "cell_type": "code", "execution_count": 133, "id": "0120ebf5-355d-41f5-83d5-49b9802f337b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, true, false, nil, true, nil, nil]\n" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool_self | bool_other # same as bool_self.or_kleene(bool_other)" ] }, { "cell_type": "code", "execution_count": 134, "id": "24ceee23-79df-4fcd-afd8-f3839a087785", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, false, false, false, nil, nil, nil]\n" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ruby's primitive `||`\n", "bool_self || bool_other" ] }, { "cell_type": "code", "execution_count": 135, "id": "c152d04b-71a0-4b18-acd1-b5ab9e413d00", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, nil, true, false, nil, nil, nil, nil]\n" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Arrow's default `or`\n", "bool_self.or_org(bool_other)" ] }, { "cell_type": "markdown", "id": "beede237-c5ed-4e12-a432-ec7e4546d786", "metadata": {}, "source": [ "## 44. is_finite/is_nan/is_nil/is_na (Vector)" ] }, { "cell_type": "markdown", "id": "77418efd-c0d7-4d63-a7db-2d43fafd386e", "metadata": {}, "source": [ "These are unary element-wise function." ] }, { "cell_type": "code", "execution_count": 136, "id": "19558f9e-fdc4-46e5-90d0-724e4e8fbd8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[3.141592653589793, Infinity, -Infinity, NaN, nil]\n" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double = Vector.new([Math::PI, Float::INFINITY, -Float::INFINITY, Float::NAN, nil])" ] }, { "cell_type": "code", "execution_count": 137, "id": "d90a7168-1f87-4363-9589-c1f161babc7d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, false, false, false, nil]\n" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_finite" ] }, { "cell_type": "code", "execution_count": 138, "id": "7d88049b-695f-4b0c-a105-8fb5797a58b1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, true, true, false, nil]\n" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_inf" ] }, { "cell_type": "code", "execution_count": 139, "id": "7d86a7b5-84bf-4031-9811-4076281920cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, false, true, true]\n" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_na" ] }, { "cell_type": "code", "execution_count": 140, "id": "d562f826-7a37-4c57-8f92-777555987246", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[false, false, false, false, true]\n" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_nil" ] }, { "cell_type": "code", "execution_count": 141, "id": "e460dc6b-e48f-4462-9ce8-aa6069ebae27", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[true, true, true, true, false]\n" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "double.is_valid" ] }, { "cell_type": "markdown", "id": "2cca75eb-f0e8-4f85-89cb-3601512e76b0", "metadata": {}, "source": [ "## 45. Prime-th rows" ] }, { "cell_type": "code", "execution_count": 142, "id": "0751e820-a22d-45b5-9005-df523d2353be", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <68 x 9 vectors>
indexspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
2AdelieTorgersen39.517.41863800female2007
3AdelieTorgersen40.318.01953250female2007
5AdelieTorgersen36.719.31933450female2007
7AdelieTorgersen38.917.81813625female2007
317GentooBiscoe49.415.82164925male2009
331GentooBiscoe50.515.22165000female2009
337GentooBiscoe44.515.72174875(nil)2009
" ], "text/plain": [ "#\n", " index species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 2 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 3 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " 4 7 Adelie Torgersen 38.9 17.8 181 ... 2007\n", " 5 11 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " : : : : : : : ... :\n", "66 317 Gentoo Biscoe 49.4 15.8 216 ... 2009\n", "67 331 Gentoo Biscoe 50.5 15.2 216 ... 2009\n", "68 337 Gentoo Biscoe 44.5 15.7 217 ... 2009\n" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# prime-th rows ... Don't ask me what it means.\n", "require 'prime'\n", "penguins.assign_left(:index, Vector.new(penguins.indices) + 1) # since 0.2.0\n", " .slice { Vector.new(Prime.each(size).to_a) - 1 }" ] }, { "cell_type": "markdown", "id": "c9e8de1a-ad8f-4fdc-a65c-4d3db7123530", "metadata": {}, "source": [ "## 46. Slice by Enumerator" ] }, { "cell_type": "markdown", "id": "32dd53a3-a822-4ae1-afe2-b5aa2bfbd3e3", "metadata": {}, "source": [ "Slice accepts Enumerator as an option." ] }, { "cell_type": "code", "execution_count": 143, "id": "b2a118fa-f3c0-4f31-9b45-6db27ccbebe6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <35 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen37.817.11863300(nil)2007
AdelieBiscoe37.818.31743400female2007
AdelieDream39.516.71783250female2007
GentooBiscoe48.515.02194850female2009
GentooBiscoe50.515.22165000female2009
GentooBiscoe46.814.32154850female2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 37.8 17.1 186 ... 2007\n", " 3 Adelie Biscoe 37.8 18.3 174 ... 2007\n", " 4 Adelie Dream 39.5 16.7 178 ... 2007\n", " 5 Adelie Dream 36.5 18.0 182 ... 2007\n", " : : : : : : ... :\n", "33 Gentoo Biscoe 48.5 15.0 219 ... 2009\n", "34 Gentoo Biscoe 50.5 15.2 216 ... 2009\n", "35 Gentoo Biscoe 46.8 14.3 215 ... 2009\n" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select every 10 samples\n", "penguins.slice(0.step by: 10, to: 340)" ] }, { "cell_type": "markdown", "id": "db312c2c-3a7c-4765-bfad-b3313b173a79", "metadata": {}, "source": [ "## 47. Output mode" ] }, { "cell_type": "markdown", "id": "714ed8df-5aa3-4ac4-8b0d-6390aff73c8c", "metadata": {}, "source": [ "Output mode of `#inspect` and `#to_iruby` is Table mode by default. If you prefer TDR mode set the environment variable `RED_AMBER_OUTPUT_MODE` to `\"TDR\"`." ] }, { "cell_type": "code", "execution_count": 144, "id": "a721804b-006e-44c6-8d38-885eae747eaa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <344 x 8 vectors>
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
AdelieTorgersen39.118.71813750male2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen(nil)(nil)(nil)(nil)(nil)2007
GentooBiscoe50.415.72225750male2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male2009
" ], "text/plain": [ "#\n", " species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n", " ... \n", " 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n", " 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n", " 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n", " 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n", " 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n", " : : : : : : ... :\n", "342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n", "343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n", "344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Table' # or nil (default)\n", "penguins # Almost same as `puts penguins.to_s` in any mode" ] }, { "cell_type": "code", "execution_count": 145, "id": "e4c9f70c-a4b1-4a81-bbc4-e9b14a6b6cb0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", " ... 5 more Vectors ...\n", "\n" ] } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'TDR'\n", "p penguins; nil # Almost same as `penguins.tdr` in any mode" ] }, { "cell_type": "code", "execution_count": 146, "id": "2786e9a7-e321-43c5-b56e-9f2ca9d62f8b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RedAmber::DataFrame : 344 x 8 Vectors\n", "Vectors : 5 numeric, 3 strings\n", "# key type level data_preview\n", "1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n", "2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n", "3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n", "4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n", "5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n", "6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n", "7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n", "8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins" ] }, { "cell_type": "code", "execution_count": 147, "id": "b00c858b-b14a-492b-bc22-d6a707bcc1ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Table\"" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ENV['RED_AMBER_OUTPUT_MODE'] = 'Table'" ] }, { "cell_type": "markdown", "id": "c6781904-c014-4ef0-86c5-8a758ca3ca1c", "metadata": {}, "source": [ "## 48. Empty key" ] }, { "cell_type": "markdown", "id": "38912e38-bddd-47cc-adfd-5c4f780636ea", "metadata": {}, "source": [ "Empty key `:\"\"` will be automatically renamed to `:unnamed1`.\n", "\n", "If `:unnamed1` was used, `:unnamed1.succ` will be used.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 148, "id": "13569004-bb23-45fa-8d11-fe5f367641a6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 2 vectors>
unnamed2unnamed1
13
24
" ], "text/plain": [ "#\n", " unnamed2 unnamed1\n", " \n", "1 1 3\n", "2 2 4\n" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame.new(\"\": [1, 2], unnamed1: [3, 4])" ] }, { "cell_type": "markdown", "id": "a4343343-d59a-4b71-b08f-4b737a6d4258", "metadata": {}, "source": [ "## 49. Grouping" ] }, { "cell_type": "markdown", "id": "8ddf0937-7b0e-4ee6-8116-8e78c308e76f", "metadata": {}, "source": [ "`DataFrame#group` takes group_key as an argument, and creates `Group` class. \n", "\n", "Group class inspects counts of each unique elements. \n", "\n", "(Since 0.1.7)" ] }, { "cell_type": "code", "execution_count": 149, "id": "ee602e52-7988-4fab-b5e3-c466acf01c98", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#{\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}}>" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group = penguins.group(:species)" ] }, { "cell_type": "markdown", "id": "a48ab0e9-60e3-4c07-a27b-9c53783318c1", "metadata": {}, "source": [ "The instance of `Group` class has methods to summary functions.\n", "\n", "It returns `function(key)` style summarized columns as a result." ] }, { "cell_type": "code", "execution_count": 150, "id": "20b23ada-b895-4921-b57b-8d46b451e494", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 8 vectors>
speciescount(island)count(bill_length_mm)count(bill_depth_mm)count(flipper_length_mm)count(body_mass_g)count(sex)count(year)
Adelie152151151151151146152
Chinstrap68686868686868
Gentoo124123123123123119124
" ], "text/plain": [ "#\n", " species count(island) count(bill_length_mm) count(bill_depth_mm) ... count(year)\n", " ... \n", "1 Adelie 152 151 151 ... 152\n", "2 Chinstrap 68 68 68 ... 68\n", "3 Gentoo 124 123 123 ... 124\n" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group.count" ] }, { "cell_type": "markdown", "id": "584d8fa1-1e70-4f31-bdc2-5d4956a28cb5", "metadata": {}, "source": [ "If count result is same in multiple columns, count column is aggregated to one column `:count`." ] }, { "cell_type": "code", "execution_count": 151, "id": "e6936488-9f23-47bd-8492-537c5be1afb3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
speciescount
Adelie151
Chinstrap68
Gentoo123
" ], "text/plain": [ "#\n", " species count\n", " \n", "1 Adelie 151\n", "2 Chinstrap 68\n", "3 Gentoo 123\n" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.pick(:species, :bill_length_mm, :bill_depth_mm).group(:species).count" ] }, { "cell_type": "markdown", "id": "7c90b59b-918c-492c-ac5b-46b62b31d136", "metadata": {}, "source": [ "Grouping key comes first (leftmost) in the columns." ] }, { "cell_type": "markdown", "id": "21c8a6fa-3f95-41ea-817f-ffc1011e8df9", "metadata": {}, "source": [ "## 50. Grouping with a block" ] }, { "cell_type": "markdown", "id": "db89c8b4-8189-410d-abe9-eac8f440065c", "metadata": {}, "source": [ "`DataFrame#group` takes a block and we can specify multiple functions.\n", "\n", "Inside the block is the context of instance of Group. So we can use summary functions without the reciever.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 152, "id": "913f576b-ec86-4e94-af05-7c656ea24cc2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 3 vectors>
speciescountmean(body_mass_g)
Adelie1523700.662251655629
Chinstrap683733.0882352941176
Gentoo1245076.016260162602
" ], "text/plain": [ "#\n", " species count mean(body_mass_g)\n", " \n", "1 Adelie 152 3700.66\n", "2 Chinstrap 68 3733.09\n", "3 Gentoo 124 5076.02\n" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.group(:species) { [count(:species), mean(:body_mass_g)] }" ] }, { "cell_type": "markdown", "id": "e6dc3206-d327-4242-9908-253ad4a74759", "metadata": {}, "source": [ "`Group#summarize` accepts same block as `DataFrame#group`." ] }, { "cell_type": "code", "execution_count": 153, "id": "67c7fc55-7b30-469c-bd0c-cda5732863fe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 7 vectors>
speciescountmean(bill_length_mm)mean(bill_depth_mm)mean(flipper_length_mm)mean(body_mass_g)mean(year)
Adelie15238.7913907284768418.346357615894032189.953642384105963700.6622516556292008.0131578947369
Chinstrap6848.8338235294117718.420588235294115195.82352941176473733.08823529411762007.9705882352941
Gentoo12447.50487804878047614.982113821138206217.18699186991875076.0162601626022008.0806451612902
" ], "text/plain": [ "#\n", " species count mean(bill_length_mm) mean(bill_depth_mm) ... mean(year)\n", " ... \n", "1 Adelie 152 38.79 18.35 ... 2008.01\n", "2 Chinstrap 68 48.83 18.42 ... 2007.97\n", "3 Gentoo 124 47.5 14.98 ... 2008.08\n" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group.summarize { [count(:species), mean] }" ] }, { "cell_type": "markdown", "id": "88682c8d-0ab6-4cee-8e59-5b6ad0ae5dac", "metadata": {}, "source": [ "## 51. Vector#shift" ] }, { "cell_type": "markdown", "id": "c6dd1bc0-7146-4884-aac5-9874e829235e", "metadata": {}, "source": [ "`Vector#shift(amount = 1, fill: nil)`\n", "\n", "Shift vector's values by specified `amount`. Shifted space is filled by value `fill`.\n", "\n", "(Since 0.1.8)" ] }, { "cell_type": "code", "execution_count": 154, "id": "013f2db6-3e1d-481f-a908-57605729b51d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[nil, 1, 2, 3, 4]\n" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector = RedAmber::Vector.new([1, 2, 3, 4, 5])\n", "vector.shift" ] }, { "cell_type": "code", "execution_count": 155, "id": "7625acd7-d6a0-4775-b5e0-ca87f95f4f28", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[3, 4, 5, nil, nil]\n" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.shift(-2)" ] }, { "cell_type": "code", "execution_count": 156, "id": "34a9ac2a-2e3f-44bc-8ba7-c4487dc3528e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#\n", "[NaN, 1.0, 2.0, 3.0, 4.0]\n" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.shift(fill: Float::NAN)" ] }, { "cell_type": "markdown", "id": "39f305a6-5718-4e81-b08f-564fa93861cc", "metadata": { "tags": [] }, "source": [ "## 52. From the Pandas cookbook (if-then)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#if-then" ] }, { "cell_type": "markdown", "id": "4f1d41dd-714a-4c68-af3b-a9080847f833", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df.loc[df.AAA >= 5, \"BBB\"] = -1\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "1 5 -1 50\n", "2 6 -1 -30\n", "3 7 -1 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 157, "id": "24774ccc-8f0f-4ce4-9ba0-bebed8781c38", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
410100
5-150
6-1-30
7-1-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 -1 50\n", "3 6 -1 -30\n", "4 7 -1 -50\n" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50] # You can omit {}\n", ")\n", "\n", "df.assign(BBB: df[:BBB].replace(df[:AAA] >= 5, -1))" ] }, { "cell_type": "markdown", "id": "c76ca2e3-e239-440b-923b-dfabb1a1c007", "metadata": {}, "source": [ "If you want to replace both :BBB and :CCC ;" ] }, { "cell_type": "code", "execution_count": 158, "id": "3f97227b-cbee-4515-b76d-3514401967d9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
410100
5-1-2
6-1-2
7-1-2
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 -1 -2\n", "3 6 -1 -2\n", "4 7 -1 -2\n" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign do\n", " replacer = v(:AAA) >= 5 # Boolean Vector\n", " {\n", " BBB: v(:BBB).replace(replacer, -1),\n", " CCC: v(:CCC).replace(replacer, -2)\n", " }\n", "end" ] }, { "cell_type": "markdown", "id": "4bf560f9-4393-4ed5-9994-24f8a93470fb", "metadata": { "tags": [] }, "source": [ "## 53. From the Pandas cookbook (Splitting)\n", "Split a frame with a boolean criterion\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#splitting" ] }, { "cell_type": "markdown", "id": "fcf06cad-4a34-45e2-b8f8-8422914eb319", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df[df.AAA <= 5]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "1 5 20 50\n", "\n", "df[df.AAA > 5]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "2 6 30 -30\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 159, "id": "b08e74d4-aba8-4fb5-a815-5d5384e92f81", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
410100
52050
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 5 20 50\n" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " # You can omit {}\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df.slice(df[:AAA] <= 5)\n", "# df[df[:AAA] <= 5] # is also OK" ] }, { "cell_type": "code", "execution_count": 160, "id": "caa72796-ff7e-4275-849f-04698114ee08", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
630-30
740-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 6 30 -30\n", "2 7 40 -50\n" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.remove(df[:AAA] <= 5)\n", "# df.slice(df[:AAA] > 5) # do the same thing" ] }, { "cell_type": "markdown", "id": "ba7588e9-bbac-4547-a56c-3eea9f819460", "metadata": { "tags": [] }, "source": [ "## 54. From the Pandas cookbook (Building criteria)\n", "Split a frame with a boolean criterion\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#building-criteria" ] }, { "cell_type": "markdown", "id": "c6d01b08-1af7-47b0-a9e4-f27ab41fe24e", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "\n", "# and\n", "df.loc[(df[\"BBB\"] < 25) & (df[\"CCC\"] >= -40), \"AAA\"]\n", "\n", "# returns a series =>\n", "0 4\n", "1 5\n", "Name: AAA, dtype: int64\n", "\n", "# or\n", "df.loc[(df[\"BBB\"] > 25) | (df[\"CCC\"] >= -40), \"AAA\"]\n", "\n", "# returns a series =>\n", "0 4\n", "1 5\n", "2 6\n", "3 7\n", "Name: AAA, dtype: int64\n", "```" ] }, { "cell_type": "code", "execution_count": 161, "id": "46066e96-91e5-4a96-9840-7e4ce6f06818", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 1 vector>
AAA
4
5
" ], "text/plain": [ "#\n", " AAA\n", " \n", "1 4\n", "2 5\n" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " # You can omit {}\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df.slice( (df[:BBB] < 25) & (df[:CCC] >= 40) ).pick(:AAA)" ] }, { "cell_type": "code", "execution_count": 162, "id": "dc8304c2-be28-420e-b2d5-a4b636eaac8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 1 vector>
AAA
4
5
6
7
" ], "text/plain": [ "#\n", " AAA\n", " \n", "1 4\n", "2 5\n", "3 6\n", "4 7\n" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.slice( (df[:BBB] > 25) | (df[:CCC] >= 40) ).pick(:AAA)\n", "# df[ (df[:BBB] > 25) | (df[:CCC] >= 40) ][:AAA)] # also OK" ] }, { "cell_type": "markdown", "id": "79616705-f497-4bb4-ad1d-5ee93c0093ce", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# or (with assignment)\n", "df.loc[(df[\"BBB\"] > 25) | (df[\"CCC\"] >= 75), \"AAA\"] = 0.1\n", "df\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "0 0.1 10 100\n", "1 5.0 20 50\n", "2 0.1 30 -30\n", "3 0.1 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 163, "id": "7ab3b044-5a0f-4a38-8a42-aed1228b6462", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
0.110100
5.02050
0.130-30
0.140-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 0.1 10 100\n", "2 5.0 20 50\n", "3 0.1 30 -30\n", "4 0.1 40 -50\n" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# df.assign(AAA: df[:AAA].replace((df[:BBB] > 25) | (df[:CCC] >= 75), 0.1)) # by one liner\n", "\n", "booleans = (df[:BBB] > 25) | (df[:CCC] >= 75)\n", "replaced = df[:AAA].replace(booleans, 0.1)\n", "df.assign(AAA: replaced)" ] }, { "cell_type": "markdown", "id": "6cd4a47c-f619-462b-94c0-4c488761d5b0", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Select rows with data closest to certain value using argsort\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "aValue = 43.0\n", "df.loc[(df.CCC - aValue).abs().argsort()]\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "1 5 20 50\n", "0 4 10 100\n", "2 6 30 -30\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 164, "id": "13cb1d45-2d13-4708-ad76-57efd72e609b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
52050
410100
630-30
740-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 5 20 50\n", "2 4 10 100\n", "3 6 30 -30\n", "4 7 40 -50\n" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a_value = 43\n", "df[(df[:CCC] - a_value).abs.sort_indexes]\n", "# df.slice (df[:CCC] - a_value).abs.sort_indexes # also OK" ] }, { "cell_type": "markdown", "id": "9330117b-40e0-4574-8900-7857622daad4", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Dynamically reduce a list of criteria using a binary operators\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "Crit1 = df.AAA <= 5.5\n", "Crit2 = df.BBB == 10.0\n", "Crit3 = df.CCC > -40.0\n", "AllCrit = Crit1 & Crit2 & Crit3\n", "\n", "import functools\n", "\n", "CritList = [Crit1, Crit2, Crit3]\n", "AllCrit = functools.reduce(lambda x, y: x & y, CritList)\n", "df[AllCrit]\n", "\n", "# returns a dataframe =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "```" ] }, { "cell_type": "code", "execution_count": 165, "id": "40336e62-d411-4655-8ec5-8d30876ada47", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <1 x 3 vectors>
AAABBBCCC
410100
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crit1 = df[:AAA] <= 5.5\n", "crit2 = df[:BBB] == 10.0\n", "crit3 = df[:CCC] >= -40.0\n", "df[crit1 & crit2 & crit3]" ] }, { "cell_type": "markdown", "id": "62e03375-573d-4368-a26e-1be3a4d58cf8", "metadata": { "tags": [] }, "source": [ "## 55. From the Pandas cookbook (Dataframes)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#dataframes" ] }, { "cell_type": "raw", "id": "55cee55f-7caf-423c-863a-187f96fa3072", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Using both row labels and value conditionals\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]}\n", ")\n", "df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))] \n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 4 10 100\n", "2 6 30 -30\n", "```" ] }, { "cell_type": "code", "execution_count": 166, "id": "272396f6-f689-43d5-ba34-e7c6ac0e1c28", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
410100
630-30
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 4 10 100\n", "2 6 30 -30\n" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber\n", "df = DataFrame.new(\n", " \"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]\n", ")\n", "\n", "df[(df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }]" ] }, { "cell_type": "markdown", "id": "26bfea62-12e4-4a01-976e-6184ecafa2fd", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Use loc for label-oriented slicing and iloc positional slicing GH2904\n", "df = pd.DataFrame(\n", " {\"AAA\": [4, 5, 6, 7], \"BBB\": [10, 20, 30, 40], \"CCC\": [100, 50, -30, -50]},\n", " index=[\"foo\", \"bar\", \"boo\", \"kar\"],\n", ")\n", "\n", "# There are 2 explicit slicing methods, with a third general case\n", "# 1. Positional-oriented (Python slicing style : exclusive of end)\n", "# 2. Label-oriented (Non-Python slicing style : inclusive of end)\n", "# 3. General (Either slicing style : depends on if the slice contains labels or positions)\n", "\n", "df.loc[\"bar\":\"kar\"] # Label\n", "# returns =>\n", " AAA BBB CCC\n", "bar 5 20 50\n", "boo 6 30 -30\n", "kar 7 40 -50\n", "\n", "# Generic\n", "df[0:3]\n", "# returns =>\n", " AAA BBB CCC\n", "foo 4 10 100\n", "bar 5 20 50\n", "boo 6 30 -30\n", "\n", "df[\"bar\":\"kar\"]\n", "# returns =>\n", " AAA BBB CCC\n", "bar 5 20 50\n", "boo 6 30 -30\n", "kar 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 167, "id": "ccabc137-33d3-47e8-ad09-88a285765380", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 4 vectors>
indexAAABBBCCC
foo410100
bar52050
boo630-30
kar740-50
" ], "text/plain": [ "#\n", " index AAA BBB CCC\n", " \n", "1 foo 4 10 100\n", "2 bar 5 20 50\n", "3 boo 6 30 -30\n", "4 kar 7 40 -50\n" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Red Amber does not have row index. Use a new column as indexes.\n", "labeled = df.assign_left(index: %w[foo bar boo kar])\n", "# labeled = df.assign(index: %w[foo bar boo kar]).pick { [keys[-1], keys[0...-1]] } # until v0.1.8" ] }, { "cell_type": "code", "execution_count": 168, "id": "f0871131-725e-4e33-a3cc-1fccd610a4b2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
indexAAABBBCCC
bar52050
boo630-30
kar740-50
" ], "text/plain": [ "#\n", " index AAA BBB CCC\n", " \n", "1 bar 5 20 50\n", "2 boo 6 30 -30\n", "3 kar 7 40 -50\n" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled[1..3]" ] }, { "cell_type": "code", "execution_count": 169, "id": "dffc55f0-481e-4076-aa3f-89f1294655b9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 4 vectors>
indexAAABBBCCC
bar52050
boo630-30
kar740-50
" ], "text/plain": [ "#\n", " index AAA BBB CCC\n", " \n", "1 bar 5 20 50\n", "2 boo 6 30 -30\n", "3 kar 7 40 -50\n" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.slice do\n", " v = v(:index)\n", " v.index(\"bar\")..v.index(\"kar\")\n", "end" ] }, { "cell_type": "markdown", "id": "d72b067c-d15d-499b-8d2a-8b79f5287e97", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment.\n", "df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4]) # Note index starts at 1.\n", "\n", "df2.iloc[1:3] # Position-oriented\n", "# returns =>\n", " AAA BBB CCC\n", "2 5 20 50\n", "3 6 30 -30\n", "\n", "df2.loc[1:3] # Label-oriented\n", "# returns =>\n", " AAA BBB CCC\n", "1 4 10 100\n", "2 5 20 50\n", "3 6 30 -30\n", "```" ] }, { "cell_type": "code", "execution_count": 170, "id": "f023a23a-ea1d-415c-b395-195801351433", "metadata": {}, "outputs": [], "source": [ "# RedAmber only have an implicit integer index 0...size,\n", "# does not happen any ambiguity unless you create a new column and use it for indexes :-)." ] }, { "cell_type": "markdown", "id": "5cada26c-a37a-4eff-bfa5-66635d278671", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Using inverse operator (~) to take the complement of a mask\n", "df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "1 5 20 50\n", "3 7 40 -50\n", "```" ] }, { "cell_type": "code", "execution_count": 171, "id": "29b81efe-5b9b-4640-89b1-422ae94cf01d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 3 vectors>
AAABBBCCC
52050
740-50
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 5 20 50\n", "2 7 40 -50\n" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RedAmber offers #! method for boolean Vector.\n", "df[!((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i })]\n", "\n", "# or\n", "# df[((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }).invert]" ] }, { "cell_type": "markdown", "id": "4fb7103f-00ed-4e74-81e5-7d480004e681", "metadata": {}, "source": [ "If you have `nil` in your data, consider #primitive_invert for consistent result. See example #26." ] }, { "cell_type": "markdown", "id": "a1780372-b566-41f5-84d6-6213e3f9efa7", "metadata": { "tags": [] }, "source": [ "## 56. From the Pandas cookbook (New columns)\n", "\n", "https://pandas.pydata.org/docs/user_guide/cookbook.html#new-columns" ] }, { "cell_type": "markdown", "id": "5f5f9d8b-7550-44f6-9f79-64f940f5000a", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Efficiently and dynamically creating new columns using applymap\n", "df = pd.DataFrame({\"AAA\": [1, 2, 1, 3], \"BBB\": [1, 1, 2, 2], \"CCC\": [2, 1, 3, 1]})\n", "df\n", "\n", "# returns =>\n", " AAA BBB CCC\n", "0 1 1 2\n", "1 2 1 1\n", "2 1 2 3\n", "3 3 2 1\n", "\n", "source_cols = df.columns # Or some subset would work too\n", "new_cols = [str(x) + \"_cat\" for x in source_cols]\n", "categories = {1: \"Alpha\", 2: \"Beta\", 3: \"Charlie\"}\n", "df[new_cols] = df[source_cols].applymap(categories.get)\n", "df\n", "\n", "# returns =>\n", " AAA BBB CCC AAA_cat BBB_cat CCC_cat\n", "0 1 1 2 Alpha Alpha Beta\n", "1 2 1 1 Beta Alpha Alpha\n", "2 1 2 3 Alpha Beta Charlie\n", "3 3 2 1 Charlie Beta Alpha\n", "```" ] }, { "cell_type": "code", "execution_count": 172, "id": "265d63e6-0c01-4080-8d5b-c3153be595a5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 3 vectors>
AAABBBCCC
112
211
123
321
" ], "text/plain": [ "#\n", " AAA BBB CCC\n", " \n", "1 1 1 2\n", "2 2 1 1\n", "3 1 2 3\n", "4 3 2 1\n" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RedAmber\n", "df = DataFrame.new({\"AAA\": [1, 2, 1, 3], \"BBB\": [1, 1, 2, 2], \"CCC\": [2, 1, 3, 1]})" ] }, { "cell_type": "code", "execution_count": 173, "id": "4be751d7-69a8-4400-b094-9932bf3d577b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <4 x 6 vectors>
AAABBBCCCAAA_catBBB_catCCC_cat
112AlphaAlphaBeta
211BetaAlphaAlpha
123AlphaBetaCharlie
321CharlieBetaAlpha
" ], "text/plain": [ "#\n", " AAA BBB CCC AAA_cat BBB_cat CCC_cat\n", " \n", "1 1 1 2 Alpha Alpha Beta\n", "2 2 1 1 Beta Alpha Alpha\n", "3 1 2 3 Alpha Beta Charlie\n", "4 3 2 1 Charlie Beta Alpha\n" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories = {1 => \"Alpha\", 2 => \"Beta\", 3 => \"Charlie\"}\n", "\n", "# Creating a Hash from keys\n", "df.assign do\n", " keys.each_with_object({}) do |key, h|\n", " h[\"#{key}_cat\"] = v(key).to_a.map { |x| categories[x] }\n", " end\n", "end\n", "\n", "# Creating an Array from vectors, from v0.2.0\n", "df.assign do\n", " vectors.map do |v|\n", " [\"#{v.key}_cat\", v.to_a.map { |x| categories[x] } ]\n", " end\n", "end" ] }, { "cell_type": "markdown", "id": "aa8214e5-a897-406e-bb5e-95ad9fea0cdd", "metadata": {}, "source": [ "```python\n", "# by Python Pandas\n", "# Keep other columns when using min() with groupby\n", "df = pd.DataFrame(\n", " {\"AAA\": [1, 1, 1, 2, 2, 2, 3, 3], \"BBB\": [2, 1, 3, 4, 5, 1, 2, 3]}\n", ")\n", "df\n", "\n", "# returns =>\n", " AAA BBB\n", "0 1 2\n", "1 1 1\n", "2 1 3\n", "3 2 4\n", "4 2 5\n", "5 2 1\n", "6 3 2\n", "7 3 3\n", "\n", "# Method 1 : idxmin() to get the index of the minimums\n", "df.loc[df.groupby(\"AAA\")[\"BBB\"].idxmin()]\n", "\n", "# returns =>\n", " AAA BBB\n", "1 1 1\n", "5 2 1\n", "6 3 2\n", "\n", "# Method 2 : sort then take first of each\n", "df.sort_values(by=\"BBB\").groupby(\"AAA\", as_index=False).first()\n", "\n", "# returns =>\n", " AAA BBB\n", "0 1 1\n", "1 2 1\n", "2 3 2\n", "\n", "# Notice the same results, with the exception of the index.\n", "```" ] }, { "cell_type": "code", "execution_count": 174, "id": "5dd5f7dc-21a3-4397-9ef6-c5bc630ff858", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <8 x 2 vectors>
AAABBB
12
11
13
24
25
21
32
33
" ], "text/plain": [ "#\n", " AAA BBB\n", " \n", "1 1 2\n", "2 1 1\n", "3 1 3\n", "4 2 4\n", "5 2 5\n", ": : :\n", "7 3 2\n", "8 3 3\n" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RedAmber\n", "df = DataFrame.new(AAA: [1, 1, 1, 2, 2, 2, 3, 3], BBB: [2, 1, 3, 4, 5, 1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 175, "id": "000c1632-8faf-407f-bb7d-e583ef573442", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <3 x 2 vectors>
AAAmin(BBB)
11
21
32
" ], "text/plain": [ "#\n", " AAA min(BBB)\n", " \n", "1 1 1\n", "2 2 1\n", "3 3 2\n" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.group(:AAA).min\n", "\n", "# Add `.rename { [keys[-1], :BBB] }` if you want." ] }, { "cell_type": "markdown", "id": "5aa5b4ab-2fb3-4a22-804e-55d6351dd427", "metadata": { "tags": [] }, "source": [ "## 57. Summary/describe" ] }, { "cell_type": "code", "execution_count": 176, "id": "610be94e-b7ce-43f5-a5c1-ddef745d6bac", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 9 vectors>
variablescountmeanstdmin25%median75%max
bill_length_mm34243.921929824561415.459583713926531532.139.22544.38200000000000548.559.6
bill_depth_mm34217.1511695906432741.974793156816781413.115.617.3218.721.5
flipper_length_mm342200.9152046783625814.061713679356888172.0190.0197.0213.0231.0
body_mass_g3424201.754385964912801.95453569809552700.03550.04031.54750.06300.0
year3442008.02906976744180.81835592548370412007.02007.02008.02009.02009.0
" ], "text/plain": [ "#\n", " variables count mean std min 25% median ... max\n", " ... \n", "1 bill_length_mm 342 43.92 5.46 32.1 39.23 44.38 ... 59.6\n", "2 bill_depth_mm 342 17.15 1.97 13.1 15.6 17.32 ... 21.5\n", "3 flipper_length_mm 342 200.92 14.06 172.0 190.0 197.0 ... 231.0\n", "4 body_mass_g 342 4201.75 801.95 2700.0 3550.0 4031.5 ... 6300.0\n", "5 year 344 2008.03 0.82 2007.0 2007.0 2008.0 ... 2009.0\n" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins.summary\n", "# or\n", "penguins.describe" ] }, { "cell_type": "markdown", "id": "982a735e-c887-4c15-be82-4f3a28138fa7", "metadata": {}, "source": [ "## 58. Quantile" ] }, { "cell_type": "markdown", "id": "d0832e36-7281-4d3c-a35d-5a89aecc341e", "metadata": {}, "source": [ "`Vector#quantile(prob)` returns quantile at probability `prob`.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 177, "id": "88c6ed5d-b41e-4c17-b93f-47ce15702974", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17.3" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins[:bill_depth_mm].quantile # default value is prob = 0.5" ] }, { "cell_type": "markdown", "id": "07368c10-65d4-4081-ace7-12bebca493c2", "metadata": {}, "source": [ "`Vector#quantiles` accepts an Array for multiple quantiles. Returns a DataFrame." ] }, { "cell_type": "code", "execution_count": 178, "id": "3cbac413-fc36-42a2-abd3-1343e3b88467", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <2 x 2 vectors>
probsquantiles
0.0513.9
0.9520.0
" ], "text/plain": [ "#\n", " probs quantiles\n", " \n", "1 0.05 13.9\n", "2 0.95 20.0\n" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguins[:bill_depth_mm].quantiles([0.05, 0.95])" ] }, { "cell_type": "markdown", "id": "54800864-c33d-4fec-818b-1a12a7e4d015", "metadata": {}, "source": [ "## 59. Transpose" ] }, { "cell_type": "markdown", "id": "517281cf-3f29-4725-816c-5a866e9cc9cc", "metadata": {}, "source": [ "`DataFrame#transpose` creates transposed DataFrame for wide type dataframe.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 179, "id": "a2fe66cb-e86d-402e-8724-eced2420e3d0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
YearAudiBMWBMW_MINIMercedes-BenzVW
20212253535905182115172235215
20202230435712201965704136576
20192422246814238136655346794
20182647350982259846755451961
20172833652527254276822149040
" ], "text/plain": [ "#\n", " Year Audi BMW BMW_MINI Mercedes-Benz VW\n", " \n", "1 2021 22535 35905 18211 51722 35215\n", "2 2020 22304 35712 20196 57041 36576\n", "3 2019 24222 46814 23813 66553 46794\n", "4 2018 26473 50982 25984 67554 51961\n", "5 2017 28336 52527 25427 68221 49040\n" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars = RedAmber::DataFrame.load('../test/entity/import_cars.tsv')" ] }, { "cell_type": "code", "execution_count": 180, "id": "b4ef2b48-33de-4982-b082-1966749fcf65", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
name20212020201920182017
Audi2253522304242222647328336
BMW3590535712468145098252527
BMW_MINI1821120196238132598425427
Mercedes-Benz5172257041665536755468221
VW3521536576467945196149040
" ], "text/plain": [ "#\n", " name 2021 2020 2019 2018 2017\n", " \n", "1 Audi 22535 22304 24222 26473 28336\n", "2 BMW 35905 35712 46814 50982 52527\n", "3 BMW_MINI 18211 20196 23813 25984 25427\n", "4 Mercedes-Benz 51722 57041 66553 67554 68221\n", "5 VW 35215 36576 46794 51961 49040\n" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.transpose" ] }, { "cell_type": "markdown", "id": "556e5c2d-bb01-4e16-9bf6-bdfd302d5b2a", "metadata": {}, "source": [ "You can specify index column by option `:key` even if it is at the middle of the original DataFrame." ] }, { "cell_type": "code", "execution_count": 181, "id": "31edb594-93fb-493f-8036-14e8273596ed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
AudiBMWYearBMW_MINIMercedes-BenzVW
22535359052021182115172235215
22304357122020201965704136576
24222468142019238136655346794
26473509822018259846755451961
28336525272017254276822149040
" ], "text/plain": [ "#\n", " Audi BMW Year BMW_MINI Mercedes-Benz VW\n", " \n", "1 22535 35905 2021 18211 51722 35215\n", "2 22304 35712 2020 20196 57041 36576\n", "3 24222 46814 2019 23813 66553 46794\n", "4 26473 50982 2018 25984 67554 51961\n", "5 28336 52527 2017 25427 68221 49040\n" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = import_cars.pick { [keys[1..2], keys[0], keys[3..]] }" ] }, { "cell_type": "code", "execution_count": 182, "id": "ffa7cecc-5298-49d6-b483-b2b5cf2e1820", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
name20212020201920182017
Audi2253522304242222647328336
BMW3590535712468145098252527
BMW_MINI1821120196238132598425427
Mercedes-Benz5172257041665536755468221
VW3521536576467945196149040
" ], "text/plain": [ "#\n", " name 2021 2020 2019 2018 2017\n", " \n", "1 Audi 22535 22304 24222 26473 28336\n", "2 BMW 35905 35712 46814 50982 52527\n", "3 BMW_MINI 18211 20196 23813 25984 25427\n", "4 Mercedes-Benz 51722 57041 66553 67554 68221\n", "5 VW 35215 36576 46794 51961 49040\n" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.transpose(key: :Year)" ] }, { "cell_type": "markdown", "id": "8c0e6fd9-2355-43dc-bfdb-254fdbd405fb", "metadata": {}, "source": [ "## 60. To_long" ] }, { "cell_type": "markdown", "id": "8970bf89-6841-44aa-8faa-8c2e97842a8d", "metadata": {}, "source": [ "`DataFrame#to_long(*keep_keys)` reshapes wide DataFrame to a longer DataFrame.\n", "\n", "- Parameter `keep_keys` specifies the key names to keep.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 183, "id": "f188cf35-3364-45aa-ad3c-1e079ed7b1a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
YearAudiBMWBMW_MINIMercedes-BenzVW
20212253535905182115172235215
20202230435712201965704136576
20192422246814238136655346794
20182647350982259846755451961
20172833652527254276822149040
" ], "text/plain": [ "#\n", " Year Audi BMW BMW_MINI Mercedes-Benz VW\n", " \n", "1 2021 22535 35905 18211 51722 35215\n", "2 2020 22304 35712 20196 57041 36576\n", "3 2019 24222 46814 23813 66553 46794\n", "4 2018 26473 50982 25984 67554 51961\n", "5 2017 28336 52527 25427 68221 49040\n" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars = RedAmber::DataFrame.load('../test/entity/import_cars.tsv')" ] }, { "cell_type": "code", "execution_count": 184, "id": "bee776ae-7a82-41aa-8a31-c53af6b9ad9f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <25 x 3 vectors>
Yearnamevalue
2021Audi22535
2021BMW35905
2021BMW_MINI18211
2021Mercedes-Benz51722
2017BMW_MINI25427
2017Mercedes-Benz68221
2017VW49040
" ], "text/plain": [ "#\n", " Year name value\n", " \n", " 1 2021 Audi 22535\n", " 2 2021 BMW 35905\n", " 3 2021 BMW_MINI 18211\n", " 4 2021 Mercedes-Benz 51722\n", " 5 2021 VW 35215\n", " : : : :\n", "23 2017 BMW_MINI 25427\n", "24 2017 Mercedes-Benz 68221\n", "25 2017 VW 49040\n" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.to_long(:Year)" ] }, { "cell_type": "markdown", "id": "771aac21-e739-4eea-9d75-cc99f691f87e", "metadata": {}, "source": [ "- Option `:name` : key of the column which is come **from key names**.\n", "- Option `:value` : key of the column which is come **from values**." ] }, { "cell_type": "code", "execution_count": 185, "id": "b7f7aeab-d545-4a28-82ce-db844029cc9c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <25 x 3 vectors>
YearManufacturerNum_of_imported
2021Audi22535
2021BMW35905
2021BMW_MINI18211
2021Mercedes-Benz51722
2017BMW_MINI25427
2017Mercedes-Benz68221
2017VW49040
" ], "text/plain": [ "#\n", " Year Manufacturer Num_of_imported\n", " \n", " 1 2021 Audi 22535\n", " 2 2021 BMW 35905\n", " 3 2021 BMW_MINI 18211\n", " 4 2021 Mercedes-Benz 51722\n", " 5 2021 VW 35215\n", " : : : :\n", "23 2017 BMW_MINI 25427\n", "24 2017 Mercedes-Benz 68221\n", "25 2017 VW 49040\n" ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)" ] }, { "cell_type": "markdown", "id": "44361299-695f-4aef-8847-9b8a9dddc3d1", "metadata": {}, "source": [ "## 61. To_wide" ] }, { "cell_type": "markdown", "id": "7fa18ac8-99e1-4d55-bed4-cc02469496b5", "metadata": {}, "source": [ "`DataFrame#to_wide(*keep_keys)` reshapes long DataFrame to a wider DataFrame.\n", "\n", "- Option `:name` : key of the column which will be expanded **to key name**.\n", "- Option `:value` : key of the column which will be expanded **to values**.\n", "\n", "(Since 0.2.0)" ] }, { "cell_type": "code", "execution_count": 186, "id": "1d59c42f-c5ec-4df2-9fbe-592092ad2f8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "RedAmber::DataFrame <5 x 6 vectors>
YearAudiBMWBMW_MINIMercedes-BenzVW
20212253535905182115172235215
20202230435712201965704136576
20192422246814238136655346794
20182647350982259846755451961
20172833652527254276822149040
" ], "text/plain": [ "#\n", " Year Audi BMW BMW_MINI Mercedes-Benz VW\n", " \n", "1 2021 22535 35905 18211 51722 35215\n", "2 2020 22304 35712 20196 57041 36576\n", "3 2019 24222 46814 23813 66553 46794\n", "4 2018 26473 50982 25984 67554 51961\n", "5 2017 28336 52527 25427 68221 49040\n" ] }, "execution_count": 186, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import_cars.to_long(:Year).to_wide" ] }, { "cell_type": "code", "execution_count": 187, "id": "84b7909a-200c-4aec-b192-25016399a7c4", "metadata": {}, "outputs": [], "source": [ "# import_cars.to_long(:Year).to_wide(name: :name, value: :value)\n", "# is also OK" ] }, { "cell_type": "code", "execution_count": null, "id": "c92c2fe0-23a5-4153-bd03-7a07664b4d9e", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Ruby 3.1.2", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "3.1.2" } }, "nbformat": 4, "nbformat_minor": 5 }