{
"cells": [
{
"cell_type": "markdown",
"id": "e355db8b-ebb6-4ea6-97b5-3b9fdadc302c",
"metadata": {},
"source": [
"# 61 examples of Red Amber"
]
},
{
"cell_type": "markdown",
"id": "f20f4970-db38-4d96-9a36-d4cf9d007596",
"metadata": {},
"source": [
"Last update: August 14, 2022, for RedAmber Version 0.2.0"
]
},
{
"cell_type": "markdown",
"id": "f6e927d0-b59a-4c4e-9f8a-4fa08f9a6b2f",
"metadata": {},
"source": [
"## 1. Install"
]
},
{
"cell_type": "markdown",
"id": "85eacfe6-fa11-4749-844f-5914d6cd7dbc",
"metadata": {},
"source": [
"Install requirements before you install Red Amber.\n",
"\n",
"- Apache Arrow GLib (>= 8.0.0)\n",
"\n",
"- Apache Parquet GLib (>= 8.0.0) # if you need IO from/to Parquet resource.\n",
"\n",
" See [Apache Arrow install document](https://arrow.apache.org/install/).\n",
" \n",
" Minimum installation example for the latest Ubuntu is in the ['Prepare the Apache Arrow' section in ci test](https://github.com/heronshoes/red_amber/blob/master/.github/workflows/test.yml) of Red Amber.\n",
"\n",
"Then add this line to your Gemfile:\n",
"```\n",
"gem 'red_amber'\n",
"```\n",
"\n",
"And then execute:\n",
"```\n",
"$ bundle install\n",
"```\n",
"\n",
"Or install it yourself as:\n",
"```\n",
"$ gem install red_amber\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "8c08c45d-0818-4b43-bc65-4d43dd8b6b66",
"metadata": {},
"source": [
"## 2. Require"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "74b76022-03ea-40ae-bac8-fc8743659042",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{:RedAmber=>\"0.2.0\", :Arrow=>\"9.0.0\"}"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"require 'red_amber' # require 'red-amber' is also OK\n",
"include RedAmber\n",
"{RedAmber: VERSION, Arrow: Arrow::VERSION}"
]
},
{
"cell_type": "markdown",
"id": "d8fb6289-39ea-4fa9-a165-b87ee6d125e9",
"metadata": {
"tags": []
},
"source": [
"## 3. Initialize"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "51f81824-626a-4741-a29b-30ea357fe7b5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 2 vectors>
"
],
"text/plain": [
"#\n",
" x y\n",
" \n",
"1 1 A\n",
"2 2 B\n",
"3 3 C\n"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# From a Hash\n",
"DataFrame.new(x: [1, 2, 3], y: %w[A B C])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "20b696eb-c199-444d-a957-e0b1081f1506",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 2 vectors> "
],
"text/plain": [
"#\n",
" x y\n",
" \n",
"1 1 A\n",
"2 2 B\n",
"3 3 C\n"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# From a schema and a row-oriented array\n",
"DataFrame.new({ x: :uint8, y: :string }, [[1, 'A'], [2, 'B'], [3, 'C']])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "21eab151-f977-4474-a6d1-576169e24b26",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 2 vectors> "
],
"text/plain": [
"#\n",
" x y\n",
" \n",
"1 1 A\n",
"2 2 B\n",
"3 3 C\n"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# From an Arrow::Table\n",
"table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C])\n",
"DataFrame.new(table)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "aa09d3da-f332-45cd-92ca-712c6a679035",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 2 vectors> "
],
"text/plain": [
"#\n",
" x y\n",
" \n",
"1 1 A\n",
"2 2 B\n",
"3 3 C\n"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# From a Rover::DataFrame\n",
"require 'rover'\n",
"rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C])\n",
"DataFrame.new(rover)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cd2c3677-00fb-48fe-bb94-18bc0815db72",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <344 x 8 vectors> species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | (nil) | (nil) | (nil) | (nil) | (nil) | 2007 |
⋮ |
Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male | 2009 |
Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female | 2009 |
Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male | 2009 |
"
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n",
" 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n",
" 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n",
" 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n",
" 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n",
" : : : : : : ... :\n",
"342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n",
"343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n",
"344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# from a red-datasets\n",
"require 'datasets-arrow'\n",
"dataset = Datasets::Penguins.new\n",
"penguins = DataFrame.new(dataset.to_arrow)"
]
},
{
"cell_type": "markdown",
"id": "3a2d12b4-7623-42c7-9e32-76cf303c7cea",
"metadata": {},
"source": [
"It should be in the future version;\n",
"```ruby\n",
"require 'datasets-red-amber'\n",
"penguins = Datasets::Penguins.new.to_red_amber\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2e4619b7-bf6d-4081-9066-b186da8fdf5b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <32 x 11 vectors> mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|
21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
⋮ |
19.7 | 6 | 145.0 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |
15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |
21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.6 | 1 | 1 | 4 | 2 |
"
],
"text/plain": [
"#\n",
" mpg cyl disp hp drat wt qsec vs am ... carb\n",
" ... \n",
" 1 21.0 6 160.0 110 3.9 2.62 16.46 0 1 ... 4\n",
" 2 21.0 6 160.0 110 3.9 2.88 17.02 0 1 ... 4\n",
" 3 22.8 4 108.0 93 3.85 2.32 18.61 1 1 ... 1\n",
" 4 21.4 6 258.0 110 3.08 3.22 19.44 1 0 ... 1\n",
" 5 18.7 8 360.0 175 3.15 3.44 17.02 0 0 ... 2\n",
" : : : : : : : : : : ... :\n",
"30 19.7 6 145.0 175 3.62 2.77 15.5 0 1 ... 6\n",
"31 15.0 8 301.0 335 3.54 3.57 14.6 0 1 ... 8\n",
"32 21.4 4 121.0 109 4.11 2.78 18.6 1 1 ... 2\n"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset = Datasets::Rdatasets.new('datasets', 'mtcars')\n",
"mtcars = DataFrame.new(dataset.to_arrow)"
]
},
{
"cell_type": "markdown",
"id": "e1f77a54-3a43-4d17-bb6f-332ef13832a3",
"metadata": {},
"source": [
"## 4. Load"
]
},
{
"cell_type": "markdown",
"id": "0fed4f43-3fbb-43e5-af0d-f93401deea78",
"metadata": {},
"source": [
"`RedAmber::DataFrame` delegates `#load` to `Arrow::Table#load`. We can load from `[.arrow, .arrows, .csv, .csv.gz, .tsv]` files."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4203e671-0a0a-405c-8482-53a8cd78a891",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 2 vectors> name | age |
---|
Yasuko | 68 |
Rui | 49 |
Hinata | 28 |
"
],
"text/plain": [
"#\n",
" name age\n",
" \n",
"1 Yasuko 68\n",
"2 Rui 49\n",
"3 Hinata 28\n"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame.load(\"../test/entity/with_header.csv\")"
]
},
{
"cell_type": "markdown",
"id": "29875147-1371-4575-a565-69c3534c15f2",
"metadata": {},
"source": [
"## 5. Load from a URI"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "916b86e2-e3a2-4ebb-8770-9e8a29c46523",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <344 x 7 vectors> species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
Adelie | Torgersen | (nil) | (nil) | (nil) | (nil) | |
⋮ |
Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | MALE |
Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | FEMALE |
Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | MALE |
"
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... sex\n",
" ... \n",
" 1 Adelie Torgersen 39.1 18.7 181 ... MALE\n",
" 2 Adelie Torgersen 39.5 17.4 186 ... FEMALE\n",
" 3 Adelie Torgersen 40.3 18.0 195 ... FEMALE\n",
" 4 Adelie Torgersen (nil) (nil) (nil) ...\n",
" 5 Adelie Torgersen 36.7 19.3 193 ... FEMALE\n",
" : : : : : : ... :\n",
"342 Gentoo Biscoe 50.4 15.7 222 ... MALE\n",
"343 Gentoo Biscoe 45.2 14.8 212 ... FEMALE\n",
"344 Gentoo Biscoe 49.9 16.1 213 ... MALE\n"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uri = URI(\"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv\")\n",
"DataFrame.load(uri)"
]
},
{
"cell_type": "markdown",
"id": "e6abe64d-e97f-437e-9c54-18f9e06e9668",
"metadata": {},
"source": [
"## 6. Save"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "91c0fb62-7990-47f1-9fb6-b0529bc1783f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.save(\"file.arrow\")\n",
"penguins.save(\"file.arrows\")\n",
"penguins.save(\"file.csv\")\n",
"penguins.save(\"file.csv.gz\")\n",
"penguins.save(\"file.tsv\")\n",
"penguins.save(\"file.feather\")"
]
},
{
"cell_type": "markdown",
"id": "d1d30973-9e2f-406a-9f42-9e6e4c966baf",
"metadata": {},
"source": [
"## 7. to_s/inspect"
]
},
{
"cell_type": "markdown",
"id": "a7bc9cb7-eae4-495f-831e-b747e486d0bd",
"metadata": {},
"source": [
"`to_s` or `inspect` (it uses to_s inside) shows a preview of the dataframe.\n",
"\n",
"It shows first 5 and last 3 rows if it has many rows. Columns are also omitted if line is exceeded 80 letters."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "af6d29ef-2e1c-4a08-a8b2-d69acda79ec5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#\n",
" x y s b\n",
" \n",
"1 1 1.0 A true\n",
"2 2 2.0 B false\n",
"3 3 3.0 C true\n",
"4 4 NaN D false\n",
"5 5 (nil) (nil) (nil)\n",
"\n"
]
}
],
"source": [
"df = DataFrame.new(\n",
" x: [1, 2, 3, 4, 5],\n",
" y: [1, 2, 3, 0/0.0, nil],\n",
" s: %w[A B C D] << nil,\n",
" b: [true, false, true, false, nil])\n",
"p df; nil"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "cdff2e60-bd0a-4d12-b348-201a49bbbbbe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n",
" 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n",
" 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n",
" 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n",
" 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n",
" : : : : : : ... :\n",
"342 Gentoo Biscoe 50.4 15.7 222 ... 2009\n",
"343 Gentoo Biscoe 45.2 14.8 212 ... 2009\n",
"344 Gentoo Biscoe 49.9 16.1 213 ... 2009\n",
"\n"
]
}
],
"source": [
"p penguins; nil"
]
},
{
"cell_type": "markdown",
"id": "cb44df38-58f7-479c-b7a4-c9c305639292",
"metadata": {},
"source": [
"## 8. Show table"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "fc710035-8134-4b18-89fe-8c58b95e0e0e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"\tx\t y\ts\tb\n",
"0\t1\t 1.000000\tA\ttrue\n",
"1\t2\t 2.000000\tB\tfalse\n",
"2\t3\t 3.000000\tC\ttrue\n",
"3\t4\t NaN\tD\tfalse\n",
"4\t5\t (null)\t(null)\t(null)\n"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.table"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2634fb7b-194f-4277-94ba-05f39c497ffa",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"\tspecies\tisland\tbill_length_mm\tbill_depth_mm\tflipper_length_mm\tbody_mass_g\tsex\tyear\n",
" 0\tAdelie \tTorgersen\t 39.100000\t 18.700000\t 181\t 3750\tmale\t2007\n",
" 1\tAdelie \tTorgersen\t 39.500000\t 17.400000\t 186\t 3800\tfemale\t2007\n",
" 2\tAdelie \tTorgersen\t 40.300000\t 18.000000\t 195\t 3250\tfemale\t2007\n",
" 3\tAdelie \tTorgersen\t (null)\t (null)\t (null)\t (null)\t(null)\t2007\n",
" 4\tAdelie \tTorgersen\t 36.700000\t 19.300000\t 193\t 3450\tfemale\t2007\n",
" 5\tAdelie \tTorgersen\t 39.300000\t 20.600000\t 190\t 3650\tmale\t2007\n",
" 6\tAdelie \tTorgersen\t 38.900000\t 17.800000\t 181\t 3625\tfemale\t2007\n",
" 7\tAdelie \tTorgersen\t 39.200000\t 19.600000\t 195\t 4675\tmale\t2007\n",
" 8\tAdelie \tTorgersen\t 34.100000\t 18.100000\t 193\t 3475\t(null)\t2007\n",
" 9\tAdelie \tTorgersen\t 42.000000\t 20.200000\t 190\t 4250\t(null)\t2007\n",
"...\n",
"334\tGentoo \tBiscoe\t 46.200000\t 14.100000\t 217\t 4375\tfemale\t2009\n",
"335\tGentoo \tBiscoe\t 55.100000\t 16.000000\t 230\t 5850\tmale\t2009\n",
"336\tGentoo \tBiscoe\t 44.500000\t 15.700000\t 217\t 4875\t(null)\t2009\n",
"337\tGentoo \tBiscoe\t 48.800000\t 16.200000\t 222\t 6000\tmale\t2009\n",
"338\tGentoo \tBiscoe\t 47.200000\t 13.700000\t 214\t 4925\tfemale\t2009\n",
"339\tGentoo \tBiscoe\t (null)\t (null)\t (null)\t (null)\t(null)\t2009\n",
"340\tGentoo \tBiscoe\t 46.800000\t 14.300000\t 215\t 4850\tfemale\t2009\n",
"341\tGentoo \tBiscoe\t 50.400000\t 15.700000\t 222\t 5750\tmale\t2009\n",
"342\tGentoo \tBiscoe\t 45.200000\t 14.800000\t 212\t 5200\tfemale\t2009\n",
"343\tGentoo \tBiscoe\t 49.900000\t 16.100000\t 213\t 5400\tmale\t2009\n"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.table"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "9dba2a67-ede7-4663-907b-9b2dd5db1605",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"x: uint8\n",
"y: double\n",
"s: string\n",
"b: bool\n",
"----\n",
"x:\n",
" [\n",
" [\n",
" 1,\n",
" 2,\n",
" 3,\n",
" 4,\n",
" 5\n",
" ]\n",
" ]\n",
"y:\n",
" [\n",
" [\n",
" 1,\n",
" 2,\n",
" 3,\n",
" nan,\n",
" null\n",
" ]\n",
" ]\n",
"s:\n",
" [\n",
" [\n",
" \"A\",\n",
" \"B\",\n",
" \"C\",\n",
" \"D\",\n",
" null\n",
" ]\n",
" ]\n",
"b:\n",
" [\n",
" [\n",
" true,\n",
" false,\n",
" true,\n",
" false,\n",
" null\n",
" ]\n",
" ]\n"
]
}
],
"source": [
"# This is a Red Arrow's feature\n",
"puts df.table.to_s(format: :column)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "d1cc17b8-1cfc-4986-9dec-7bca02be32f0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================== 0 ====================\n",
"x: 1\n",
"y: 1.000000\n",
"s: A\n",
"b: true\n",
"==================== 1 ====================\n",
"x: 2\n",
"y: 2.000000\n",
"s: B\n",
"b: false\n",
"==================== 2 ====================\n",
"x: 3\n",
"y: 3.000000\n",
"s: C\n",
"b: true\n",
"==================== 3 ====================\n",
"x: 4\n",
"y: NaN\n",
"s: D\n",
"b: false\n",
"==================== 4 ====================\n",
"x: 5\n",
"y: (null)\n",
"s: (null)\n",
"b: (null)\n"
]
}
],
"source": [
"# This is also a Red Arrow's feature\n",
"puts df.table.to_s(format: :list)"
]
},
{
"cell_type": "markdown",
"id": "16e4ae6b-2399-43f0-be8e-65669b95c7b6",
"metadata": {},
"source": [
"## 9. TDR"
]
},
{
"cell_type": "markdown",
"id": "2d14eb4b-9026-4cc5-a71a-598946d40b67",
"metadata": {},
"source": [
"TDR means 'Transposed Dataframe Representation'. It shows columns in lateral just the same shape as initializing by a Hash. TDR has some information which is useful for the exploratory data processing.\n",
"\n",
"- DataFrame shape: n_rows x n_columns\n",
"- Data types\n",
"- Levels: number of unique elements\n",
"- Data preview: same data is aggregated if level is smaller (tally mode)\n",
"- Show counts of abnormal element: NaN and nil"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "8050462f-7c60-41b7-a011-af11763784dc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RedAmber::DataFrame : 5 x 4 Vectors\n",
"Vectors : 2 numeric, 1 string, 1 boolean\n",
"# key type level data_preview\n",
"1 :x uint8 5 [1, 2, 3, 4, 5]\n",
"2 :y double 5 [1.0, 2.0, 3.0, NaN, nil], 1 NaN, 1 nil\n",
"3 :s string 5 [\"A\", \"B\", \"C\", \"D\", nil], 1 nil\n",
"4 :b boolean 3 {true=>2, false=>2, nil=>1}\n"
]
}
],
"source": [
"# use the same dataframe as #7\n",
"df.tdr"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "bb616ffe-c19a-4b02-a011-601ceb3db656",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RedAmber::DataFrame : 344 x 8 Vectors\n",
"Vectors : 5 numeric, 3 strings\n",
"# key type level data_preview\n",
"1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n",
"2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n",
"3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n",
"4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n",
"5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n",
"6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n",
"7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n",
"8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n"
]
}
],
"source": [
"penguins.tdr"
]
},
{
"cell_type": "markdown",
"id": "73b8dc18-079f-4d40-8d0e-239f010550da",
"metadata": {},
"source": [
"`#tdr` has some options:\n",
"\n",
"`limit` : to limit a number of variables to show. Default value is `limit=10`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "0962845d-e642-4d2a-9607-43e197b46bc5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RedAmber::DataFrame : 344 x 8 Vectors\n",
"Vectors : 5 numeric, 3 strings\n",
"# key type level data_preview\n",
"1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n",
"2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n",
"3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n",
" ... 5 more Vectors ...\n"
]
}
],
"source": [
"penguins.tdr(3)"
]
},
{
"cell_type": "markdown",
"id": "573606c4-23b9-4b38-8c92-a04f1c1e8781",
"metadata": {},
"source": [
"`elements` : max number of elements to show in observations. Default value is `elements: 5`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "f957d2bd-e8c0-42a1-a3b4-0a9478e740bf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RedAmber::DataFrame : 344 x 8 Vectors\n",
"Vectors : 5 numeric, 3 strings\n",
"# key type level data_preview\n",
"1 :species string 3 {\"Adelie\"=>152, \"Chinstrap\"=>68, \"Gentoo\"=>124}\n",
"2 :island string 3 {\"Torgersen\"=>52, \"Biscoe\"=>168, \"Dream\"=>124}\n",
"3 :bill_length_mm double 165 [39.1, 39.5, 40.3, ... ], 2 nils\n",
"4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, ... ], 2 nils\n",
"5 :flipper_length_mm uint8 56 [181, 186, 195, ... ], 2 nils\n",
"6 :body_mass_g uint16 95 [3750, 3800, 3250, ... ], 2 nils\n",
"7 :sex string 3 {\"male\"=>168, \"female\"=>165, nil=>11}\n",
"8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}\n"
]
}
],
"source": [
"penguins.tdr(elements: 3) # Show first 3 items in data"
]
},
{
"cell_type": "markdown",
"id": "d37ece79-1999-49eb-a2d1-831184ee6509",
"metadata": {},
"source": [
"`tally` : max level to use tally mode. Level means size of `tally`ed hash. Default value is `tally: 5`."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "9c1c472c-3d15-4bca-9a1b-7f86c63d3ed8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RedAmber::DataFrame : 344 x 8 Vectors\n",
"Vectors : 5 numeric, 3 strings\n",
"# key type level data_preview\n",
"1 :species string 3 [\"Adelie\", \"Adelie\", \"Adelie\", \"Adelie\", \"Adelie\", ... ]\n",
"2 :island string 3 [\"Torgersen\", \"Torgersen\", \"Torgersen\", \"Torgersen\", \"Torgersen\", ... ]\n",
"3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils\n",
"4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils\n",
"5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils\n",
"6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils\n",
"7 :sex string 3 [\"male\", \"female\", \"female\", nil, \"female\", ... ], 11 nils\n",
"8 :year uint16 3 [2007, 2007, 2007, 2007, 2007, ... ]\n"
]
}
],
"source": [
"penguins.tdr(tally: 0) # Don't use tally mode"
]
},
{
"cell_type": "markdown",
"id": "e3c38037-90a1-4fc5-9904-41fc74085908",
"metadata": {},
"source": [
"`#tdr_str` returns a String. `#tdr` do the same thing as `puts #tdr_str`"
]
},
{
"cell_type": "markdown",
"id": "21d68764-1bc1-4915-99b6-5ae938b85999",
"metadata": {},
"source": [
"## 10. Size and shape"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "487399f8-a3ef-467f-aa7f-ecbaee5fcb75",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# same as n_rows, n_obs\n",
"df.size"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "dc7441c3-7c85-4ce1-a20e-de8f41f280b4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# same as n_cols, n_vars\n",
"df.n_keys"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "3d42fea6-801a-45f4-8e22-ea9d76ae070f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[5, 4]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# [df.size, df.n_keys], [df.n_rows, df.n_cols]\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"id": "bc5caa94-325f-4014-9c90-8ac909c2b378",
"metadata": {},
"source": [
"## 11. Keys"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "bb47775f-fed0-42e6-8781-aa8b721d6112",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[:x, :y, :s, :b]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.keys"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "3d540ab0-3e52-47b7-b338-b4e0b3d929cb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[:species, :island, :bill_length_mm, :bill_depth_mm, :flipper_length_mm, :body_mass_g, :sex, :year]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.keys"
]
},
{
"cell_type": "markdown",
"id": "decc6a61-9994-4d60-9827-b257cafafb70",
"metadata": {},
"source": [
"## 12. Types"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "bf9cd2bc-a213-427e-bc00-f2083b0e0471",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[:uint8, :double, :string, :boolean]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.types"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "b1ecb891-98b5-4919-9f37-1847202007d8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[:string, :string, :double, :double, :uint8, :uint16, :string, :uint16]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.types"
]
},
{
"cell_type": "markdown",
"id": "869b3670-62f8-4c23-807b-d6d100a1981e",
"metadata": {},
"source": [
"## 13. Data type classes"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "776ab4db-073b-4b30-931a-8ec77284cdc4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Arrow::UInt8DataType, Arrow::DoubleDataType, Arrow::StringDataType, Arrow::BooleanDataType]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.type_classes"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "0546a5d0-cab1-4ca8-a2e5-0637d0fd48b6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Arrow::StringDataType, Arrow::StringDataType, Arrow::DoubleDataType, Arrow::DoubleDataType, Arrow::UInt8DataType, Arrow::UInt16DataType, Arrow::StringDataType, Arrow::UInt16DataType]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.type_classes"
]
},
{
"cell_type": "markdown",
"id": "1c2513f6-909e-47fd-a543-66c4f424f44e",
"metadata": {},
"source": [
"## 14. Indices"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "e6e9d7ef-1471-4f23-9210-56045c9fabd5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3, 4]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.indexes\n",
"# or\n",
"df.indices"
]
},
{
"cell_type": "markdown",
"id": "3908395f-b086-4fbb-9855-e1ce233f0595",
"metadata": {},
"source": [
"## 15. To an Array or a Hash"
]
},
{
"cell_type": "markdown",
"id": "22cb724e-cf61-40d9-a58b-9cc793e83645",
"metadata": {},
"source": [
"DataFrame#to_a returns an array of row-oriented data without a header."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "4054daad-9266-4002-8942-c0891050cb4d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[1, 1.0, \"A\", true], [2, 2.0, \"B\", false], [3, 3.0, \"C\", true], [4, NaN, \"D\", false], [5, nil, nil, nil]]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.to_a"
]
},
{
"cell_type": "markdown",
"id": "f6abae59-fe31-4056-9de8-7c36e35235de",
"metadata": {},
"source": [
"If you need a column-oriented array with keys, use `.to_h.to_a`"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "d3631290-eb74-4d21-a469-86381c668c7f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{:x=>[1, 2, 3, 4, 5], :y=>[1.0, 2.0, 3.0, NaN, nil], :s=>[\"A\", \"B\", \"C\", \"D\", nil], :b=>[true, false, true, false, nil]}"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.to_h"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "08c45e92-f640-4e62-bc96-ee259d0ecff4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[:x, [1, 2, 3, 4, 5]], [:y, [1.0, 2.0, 3.0, NaN, nil]], [:s, [\"A\", \"B\", \"C\", \"D\", nil]], [:b, [true, false, true, false, nil]]]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.to_h.to_a"
]
},
{
"cell_type": "markdown",
"id": "39b65fc0-4405-4414-9a74-91c724ef587c",
"metadata": {},
"source": [
"## 16. Schema"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "36db7842-e9b0-4473-84d4-3aef987d427f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{:x=>:uint8, :y=>:double, :s=>:string, :b=>:boolean}"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.schema"
]
},
{
"cell_type": "markdown",
"id": "3e61237d-ac67-45bb-827c-a769dff61809",
"metadata": {},
"source": [
"## 17. Vector"
]
},
{
"cell_type": "markdown",
"id": "27402307-aaad-49c8-88ca-65346668601d",
"metadata": {},
"source": [
"Each variable (column in the table) is represented by a Vector object."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "6c9ba041-231d-4057-a280-acf620b68525",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[1, 2, 3, 4, 5]\n"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[:x] # This syntax comes later"
]
},
{
"cell_type": "markdown",
"id": "3e13d06d-b432-45b2-9745-0c6ef9228e23",
"metadata": {},
"source": [
"Or create new Vector by the constructor."
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "3e18a4e0-238c-4800-8bda-a88a57dde3e9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[1, 2, 3, 4, 5]\n"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Vector.new(1, 2, 3, 4, 5)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "3bd55d9d-b988-46b2-bc11-e3dc5f4adc6c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[1, 2, 3, 4, 5]\n"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Vector.new(1..5)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "19688e6e-b59b-4a84-8c07-57e87cd0e242",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[1, 2, 3, 4, 5]\n"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Vector.new([1, 2, 3], [4, 5])"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "076bd0e2-01ab-4497-9b9b-84f72a4805bc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[1, 2, 3, 4, 5]\n"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array = Arrow::Array.new([1, 2, 3, 4, 5])\n",
"Vector.new(array)"
]
},
{
"cell_type": "markdown",
"id": "22091661-e78a-4c66-9e48-4c3c676469b4",
"metadata": {},
"source": [
"- TODO: `Vector[1..5]` as a constructor"
]
},
{
"cell_type": "markdown",
"id": "b729bdba-87a2-4282-bd0e-319fe17f42da",
"metadata": {},
"source": [
"## 18. Vectors"
]
},
{
"cell_type": "markdown",
"id": "f5ddd840-2f84-467b-a9bb-feb769573b69",
"metadata": {},
"source": [
"Returns an Array of Vectors in a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "d3ae03f2-e2fe-4a15-abe1-331185448d61",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[#\n",
"[1, 2, 3, 4, 5]\n",
", #\n",
"[1.0, 2.0, 3.0, NaN, nil]\n",
", #\n",
"[\"A\", \"B\", \"C\", \"D\", nil]\n",
", #\n",
"[true, false, true, false, nil]\n",
"]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.vectors"
]
},
{
"cell_type": "markdown",
"id": "8ac88ff3-0cb6-43d6-a999-0c2e8c6defb7",
"metadata": {
"tags": []
},
"source": [
"## 19. Variables\n",
"\n",
"Returns key and Vector pairs in a Hash."
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "3351a216-6fe5-485e-8686-53c1e754fa2e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{:x=>#\n",
"[1, 2, 3, 4, 5]\n",
", :y=>#\n",
"[1.0, 2.0, 3.0, NaN, nil]\n",
", :s=>#\n",
"[\"A\", \"B\", \"C\", \"D\", nil]\n",
", :b=>#\n",
"[true, false, true, false, nil]\n",
"}"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.variables"
]
},
{
"cell_type": "markdown",
"id": "3b518c1c-eda7-406f-a885-b2344b1726eb",
"metadata": {},
"source": [
"## 20. Select columns by #[ ]"
]
},
{
"cell_type": "markdown",
"id": "767b4e49-19eb-4d5f-b030-91bd78f0f5b9",
"metadata": {},
"source": [
"`DataFrame#[]` is overloading column operations and row operations.\n",
"\n",
"- For columns (variables)\n",
" - Key in a Symbol: `df[:symbol]`\n",
" - Key in a String: `df[\"string\"]`\n",
" - Keys in an Array: `df[:symbol1, \"string\", :symbol2]`\n",
" - Keys by indeces: `df[df.keys[0]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "ccf60edc-cccf-49e3-a503-1ca532247130",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 2 vectors> "
],
"text/plain": [
"#\n",
" x y\n",
" \n",
"1 1 1.0\n",
"2 2 2.0\n",
"3 3 3.0\n",
"4 4 NaN\n",
"5 5 (nil)\n"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keys in a Symbol and a String\n",
"df[:x, 'y']"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "8500f8c0-ff5a-4537-9f47-03d675e31b18",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 2 vectors> "
],
"text/plain": [
"#\n",
" x y\n",
" \n",
"1 1 1.0\n",
"2 2 2.0\n",
"3 3 3.0\n",
"4 4 NaN\n",
"5 5 (nil)\n"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keys in a Range\n",
"df['x'..'y']"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "db35cae1-35c2-47de-a7e8-906161f21282",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 3 vectors> s | b | x |
---|
A | true | 1 |
B | false | 2 |
C | true | 3 |
D | false | 4 |
(nil) | (nil) | 5 |
"
],
"text/plain": [
"#\n",
" s b x\n",
" \n",
"1 A true 1\n",
"2 B false 2\n",
"3 C true 3\n",
"4 D false 4\n",
"5 (nil) (nil) 5\n"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keys with a index Range, and a symbol\n",
"df[df.keys[2..], :x]"
]
},
{
"cell_type": "markdown",
"id": "03e14403-f7bc-4350-9e7b-715901164331",
"metadata": {},
"source": [
"## 21. Select rows by #[ ]\n",
"`DataFrame#[]` is overloading column operations and row operations.\n",
"\n",
"- For rows (observations)\n",
" - Select rows by a Index: `df[index]`\n",
" - Select rows by Indices: `df[indices]` # Array, Arrow::Array, Vectors are acceptable for indices\n",
" - Select rows by Ranges: `df[range]`\n",
" - Select rows by Booleans: `df[booleans]` # Array, Arrow::Array, Vectors are acceptable for booleans"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "e3bc60a7-611e-4fd8-9770-8e0d167d3fee",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 4 vectors> x | y | s | b |
---|
1 | 1.0 | A | true |
3 | 3.0 | C | true |
2 | 2.0 | B | false |
"
],
"text/plain": [
"#\n",
" x y s b\n",
" \n",
"1 1 1.0 A true\n",
"2 3 3.0 C true\n",
"3 2 2.0 B false\n"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# indices\n",
"df[0, 2, 1]"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "2b8b3801-ae37-4629-9db5-ff937941c895",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 4 vectors> x | y | s | b |
---|
2 | 2.0 | B | false |
3 | 3.0 | C | true |
5 | (nil) | (nil) | (nil) |
"
],
"text/plain": [
"#\n",
" x y s b\n",
" \n",
"1 2 2.0 B false\n",
"2 3 3.0 C true\n",
"3 5 (nil) (nil) (nil)\n"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# including a Range\n",
"# negative indices are also acceptable\n",
"df[1..2, -1]"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "3f6f8d73-a66c-4773-9bf5-0878c700f2d6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 4 vectors> x | y | s | b |
---|
2 | 2.0 | B | false |
3 | 3.0 | C | true |
5 | (nil) | (nil) | (nil) |
"
],
"text/plain": [
"#\n",
" x y s b\n",
" \n",
"1 2 2.0 B false\n",
"2 3 3.0 C true\n",
"3 5 (nil) (nil) (nil)\n"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# booleans\n",
"# length of boolean should be the same as self\n",
"df[false, true, true, false, true]"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "abe57279-54fd-48ec-a1a4-c7453211e776",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 4 vectors> x | y | s | b |
---|
1 | 1.0 | A | true |
3 | 3.0 | C | true |
5 | (nil) | (nil) | (nil) |
"
],
"text/plain": [
"#\n",
" x y s b\n",
" \n",
"1 1 1.0 A true\n",
"2 3 3.0 C true\n",
"3 5 (nil) (nil) (nil)\n"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Arrow::Array\n",
"indices = Arrow::UInt8Array.new([0,2,4])\n",
"df[indices]"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "2266611f-23d8-4645-a1e8-b07c2370fb3f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 4 vectors> x | y | s | b |
---|
3 | 3.0 | C | true |
4 | NaN | D | false |
5 | (nil) | (nil) | (nil) |
"
],
"text/plain": [
"#\n",
" x y s b\n",
" \n",
"1 3 3.0 C true\n",
"2 4 NaN D false\n",
"3 5 (nil) (nil) (nil)\n"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# By a Vector as indices\n",
"indices = Vector.new(df.indices)\n",
"# indices > 1 returns a boolean Vector\n",
"df[indices > 1]"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "0ea2da7e-aeca-4874-be4a-6af563aa378b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[true, false, true, false, nil]\n"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# By a Vector as booleans\n",
"booleans = df[:b]"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "9f842890-6359-4266-9a23-2f8f813ef548",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <2 x 4 vectors> "
],
"text/plain": [
"#\n",
" x y s b\n",
" \n",
"1 1 1.0 A true\n",
"2 3 3.0 C true\n"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[booleans]"
]
},
{
"cell_type": "markdown",
"id": "98a04874-cb2c-44c0-b410-b330b9d12b0f",
"metadata": {},
"source": [
"## 22. empty?"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "7b1ab319-90a7-4f09-8629-04dcd94076cb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"false"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.empty?"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "1e09c32f-20a8-4175-827f-cdb98063535a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame.new.empty?"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "3f9f8771-87dd-44eb-8aac-6a3ed8b4c183",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(empty DataFrame)"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame.new"
]
},
{
"cell_type": "markdown",
"id": "86b826dd-10e6-4087-9162-b89ac6561a61",
"metadata": {},
"source": [
"## 23. Select columns by pick"
]
},
{
"cell_type": "markdown",
"id": "b5aefd22-4e96-4dc5-91d2-e6826256bda6",
"metadata": {
"tags": []
},
"source": [
"`DataFrame#pick` accepts an Array of keys to pick up columns (variables) and creates a new DataFrame. You can change the order of columns at a same time."
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "68124521-b823-424d-9e06-d11aa927d618",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 2 vectors> s | y |
---|
A | 1.0 |
B | 2.0 |
C | 3.0 |
D | NaN |
(nil) | (nil) |
"
],
"text/plain": [
"#\n",
" s y\n",
" \n",
"1 A 1.0\n",
"2 B 2.0\n",
"3 C 3.0\n",
"4 D NaN\n",
"5 (nil) (nil)\n"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pick(:s, :y)\n",
"# or\n",
"df.pick([:s, :y]) # OK too."
]
},
{
"cell_type": "markdown",
"id": "a76dca00-da8f-4959-be18-7a1015a9d13c",
"metadata": {},
"source": [
"Or use a boolean Array of lengeh `n_key` to `pick`. This style remains the order of variables."
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "b91f8925-529c-43c9-93ba-e21bcac0f2f7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 2 vectors> y | s |
---|
1.0 | A |
2.0 | B |
3.0 | C |
NaN | D |
(nil) | (nil) |
"
],
"text/plain": [
"#\n",
" y s\n",
" \n",
"1 1.0 A\n",
"2 2.0 B\n",
"3 3.0 C\n",
"4 NaN D\n",
"5 (nil) (nil)\n"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pick(false, true, true, false)\n",
"# or\n",
"df.pick([false, true, true, false]) # OK"
]
},
{
"cell_type": "markdown",
"id": "5f903182-745b-4923-99d8-14a9b9c6ea4c",
"metadata": {},
"source": [
"`#pick` also accepts a block in the context of self.\n",
"\n",
"Next example is picking up numeric variables."
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "37bb0a49-c38a-484c-91d4-3e23ab43a727",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 2 vectors> "
],
"text/plain": [
"#\n",
" x y\n",
" \n",
"1 1 1.0\n",
"2 2 2.0\n",
"3 3 3.0\n",
"4 4 NaN\n",
"5 5 (nil)\n"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# reciever is required with the argument style\n",
"df.pick(df.vectors.map(&:numeric?))\n",
"\n",
"# with a block\n",
"df.pick { vectors.map(&:numeric?) }"
]
},
{
"cell_type": "markdown",
"id": "e51f07c0-54eb-4114-8cd6-63c7780e7248",
"metadata": {},
"source": [
"The name `pick` comes from the action to pick variables(columns) according to the label keys."
]
},
{
"cell_type": "markdown",
"id": "7c1815e4-de6c-425e-8602-b8dd66836250",
"metadata": {},
"source": [
"## 24. Reject columns by drop"
]
},
{
"cell_type": "markdown",
"id": "d1ab045e-66f9-4922-8bf2-35aee7f2812e",
"metadata": {
"tags": []
},
"source": [
"`DataFrame#drop` accepts an Array keys to drop columns (variables) to create a remainer DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "7ccace08-62b0-4b0b-93fb-81edf673abf7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 2 vectors> y | s |
---|
1.0 | A |
2.0 | B |
3.0 | C |
NaN | D |
(nil) | (nil) |
"
],
"text/plain": [
"#\n",
" y s\n",
" \n",
"1 1.0 A\n",
"2 2.0 B\n",
"3 3.0 C\n",
"4 NaN D\n",
"5 (nil) (nil)\n"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop(:x, :b)\n",
"# df.drop([:x, :b]) #is OK too."
]
},
{
"cell_type": "markdown",
"id": "2085b349-95c5-4607-b029-f7c3d630ac1c",
"metadata": {},
"source": [
"Or use a boolean Array of lengeh `n_key` to `drop`."
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "785c02f1-1e16-4722-9961-4b49223c8290",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 2 vectors> y | s |
---|
1.0 | A |
2.0 | B |
3.0 | C |
NaN | D |
(nil) | (nil) |
"
],
"text/plain": [
"#\n",
" y s\n",
" \n",
"1 1.0 A\n",
"2 2.0 B\n",
"3 3.0 C\n",
"4 NaN D\n",
"5 (nil) (nil)\n"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop(true, false, false, true)\n",
"# df.drop([true, false, false, true]) # is OK too"
]
},
{
"cell_type": "markdown",
"id": "d246161e-02cc-40fb-8921-26b37eb3956f",
"metadata": {},
"source": [
"`#drop` also accepts a block in the context of self.\n",
"\n",
"Next example will drop variables which have nil or NaN values."
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "069932e3-d393-4ede-9eb5-7aac8625e0c0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 1 vector> "
],
"text/plain": [
"#\n",
" x\n",
" \n",
"1 1\n",
"2 2\n",
"3 3\n",
"4 4\n",
"5 5\n"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop { vectors.map { |v| v.is_na.any } }"
]
},
{
"cell_type": "markdown",
"id": "88b064d6-7d90-4a0b-b9c8-d92e103269fb",
"metadata": {},
"source": [
"Argument style is also acceptable but it requires the reciever 'df'."
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "3003a5c2-0966-4f2c-9643-59e8b546c8aa",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 1 vector> "
],
"text/plain": [
"#\n",
" x\n",
" \n",
"1 1\n",
"2 2\n",
"3 3\n",
"4 4\n",
"5 5\n"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop(df.vectors.map { |v| v.is_na.any })"
]
},
{
"cell_type": "markdown",
"id": "c6fce15c-d4a9-4281-9c07-457e78d3c13e",
"metadata": {},
"source": [
"The name `drop` comes from the pair word of `pick`."
]
},
{
"cell_type": "markdown",
"id": "0f6dc86c-828d-4f9f-8b07-fce63c30fdca",
"metadata": {},
"source": [
"## 25. Pick/drop and nil"
]
},
{
"cell_type": "markdown",
"id": "0a108878-565b-400e-9a47-a15aae09429c",
"metadata": {},
"source": [
"When `pick` or `drop` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `BasicObject#!`."
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "7c01fbb4-9bfa-4afc-8e6b-45c97c0beb03",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans = [true, true, false, nil]\n",
"booleans_invert = booleans.map(&:!) # => [false, false, true, true] because nil.! is true\n",
"df.pick(booleans) == df.drop(booleans_invert)"
]
},
{
"cell_type": "markdown",
"id": "12a24264-9b7a-42a1-a541-e292e3876e35",
"metadata": {},
"source": [
"## 26. Vector#invert, #primitive_invert\n",
"\n",
"For the boolean Vector;"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "ea352e12-7e8a-43be-b8ac-797adbc47708",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[true, true, false, nil]\n"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector = Vector.new(booleans)"
]
},
{
"cell_type": "markdown",
"id": "2a0f82e0-157b-4185-9254-0618be291f9b",
"metadata": {},
"source": [
"nil is converted to nil by `Vector#invert`."
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "596c521f-12bf-4448-9e5d-e1b4a2c3d896",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[false, false, true, nil]\n"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector.invert\n",
"# or\n",
"!vector"
]
},
{
"cell_type": "markdown",
"id": "a1aec910-3055-4627-a02b-22d45f2ceb70",
"metadata": {},
"source": [
"So `df.pick(booleans) != df.drop(booleans.invert)` when booleans have any nils.\n",
"\n",
"On the other hand, `Vector#primitive_invert` follows Ruby's `BasicObject#!`'s behavior. Then pick and drop keep 'MECE' behavior."
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "4dcaba48-1cea-4ce9-b4a9-b079b43af7ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[false, false, true, true]\n"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector.primitive_invert"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "c7ae4dad-275a-49e0-a0b0-bf3686248070",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pick(vector) == df.drop(vector.primitive_invert)"
]
},
{
"cell_type": "markdown",
"id": "9a6cec74-43f0-4a72-8262-25b1e311f602",
"metadata": {},
"source": [
"## 27. Pick/drop and [ ]"
]
},
{
"cell_type": "markdown",
"id": "32c8f74d-b3ce-4305-9af7-6ea70052c773",
"metadata": {},
"source": [
"When `pick` or `drop` select a single column (variable), it returns a `DataFrame` with one column (variable)."
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "e13aee24-cac6-41ad-b8a3-0ec26edbe5d1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 1 vector> "
],
"text/plain": [
"#\n",
" x\n",
" \n",
"1 1\n",
"2 2\n",
"3 3\n",
"4 4\n",
"5 5\n"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pick(:x) # or\n",
"df.drop(:y, :s, :b)"
]
},
{
"cell_type": "markdown",
"id": "3e47b9d2-929e-4674-9690-0a1fdf7b0a7d",
"metadata": {},
"source": [
"In contrast, when `[]` selects a single column (variable), it returns a `Vector`."
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "60d228be-7357-434d-9d39-ee72c110e6fe",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[1, 2, 3, 4, 5]\n"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[:x]"
]
},
{
"cell_type": "markdown",
"id": "6d973934-e08b-4b45-8efb-52f9167e7238",
"metadata": {},
"source": [
"This behavior may be useful to use with DataFrame manipulation verbs (like pick, drop, slice, remove, assign, rename)."
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "6beefc5a-dc47-42cc-a283-456073c4251e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 2 vectors> "
],
"text/plain": [
"#\n",
" x y\n",
" \n",
"1 1 1.0\n",
"2 2 2.0\n",
"3 3 3.0\n",
"4 4 NaN\n",
"5 5 (nil)\n"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pick { keys.select { |key| df[key].numeric? } }"
]
},
{
"cell_type": "markdown",
"id": "34c9bcb0-889a-4190-b2b8-49765cd059c2",
"metadata": {},
"source": [
"## 28. Slice"
]
},
{
"cell_type": "markdown",
"id": "9a428ba8-c306-4ab8-8607-51174e8e6ebe",
"metadata": {},
"source": [
"`slice` selects rows (observations) to create a subset of a DataFrame."
]
},
{
"cell_type": "markdown",
"id": "6016d6d4-72d6-4ae2-b7dd-3d526c91ae61",
"metadata": {},
"source": [
"`slice(indeces)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. Negative index from the tail like Ruby's Array is also acceptable."
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "9cdce2e4-7876-4be6-bd1f-bc8ab6e6c871",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <10 x 8 vectors> species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | (nil) | (nil) | (nil) | (nil) | (nil) | 2007 |
⋮ |
Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male | 2009 |
Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female | 2009 |
Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male | 2009 |
"
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n",
" 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n",
" 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n",
" 4 Adelie Torgersen (nil) (nil) (nil) ... 2007\n",
" 5 Adelie Torgersen 36.7 19.3 193 ... 2007\n",
" : : : : : : ... :\n",
" 8 Gentoo Biscoe 50.4 15.7 222 ... 2009\n",
" 9 Gentoo Biscoe 45.2 14.8 212 ... 2009\n",
"10 Gentoo Biscoe 49.9 16.1 213 ... 2009\n"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# returns 5 rows from the start and 5 rows from the end\n",
"penguins.slice(0...5, -5..-1)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "93c3f6f0-7bc9-4909-8f32-20e8c1ddfd3a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <1 x 9 vectors> index | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|
113 | Adelie | Biscoe | 42.2 | 19.5 | 197 | 4275 | male | 2009 |
"
],
"text/plain": [
"#\n",
" index species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
"1 113 Adelie Biscoe 42.2 19.5 197 ... 2009\n"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# slice accepts Float index\n",
"# 33% of 344 observations in index => 113.52 th data ??\n",
"indexed_penguins = penguins.assign_left { [:index, indexes] } # #assign_left and assigner by Array is 0.2.0 feature\n",
"indexed_penguins.slice(penguins.size * 0.33)"
]
},
{
"cell_type": "markdown",
"id": "8139bb28-89f8-4058-b824-dde33ead0b60",
"metadata": {},
"source": [
"Indices in Vectors or Arrow::Arrays are also acceptable."
]
},
{
"cell_type": "markdown",
"id": "6f79db8c-c706-4d60-949b-3f644474d375",
"metadata": {},
"source": [
"Another way to select in `slice` is to use booleans.\n",
"- Booleans is an Array, Arrow::Array, Vector or their Array.\n",
"- Each data type must be boolean.\n",
"- Size of booleans must be same as the size of self."
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "f58ca131-7375-4489-90ce-6ba54b898eb5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[false, false, true, nil, false, false, false, false, false, true, false, false, ... ]\n"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# make booleans to check over 40\n",
"booleans = penguins[:bill_length_mm] > 40"
]
},
{
"cell_type": "code",
"execution_count": 74,
"id": "176ab365-c66a-4712-97b9-4381a536321b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <242 x 8 vectors> species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | (nil) | 2007 |
Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 |
Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male | 2007 |
⋮ |
Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male | 2009 |
Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female | 2009 |
Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male | 2009 |
"
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 40.3 18.0 195 ... 2007\n",
" 2 Adelie Torgersen 42.0 20.2 190 ... 2007\n",
" 3 Adelie Torgersen 41.1 17.6 182 ... 2007\n",
" 4 Adelie Torgersen 42.5 20.7 197 ... 2007\n",
" 5 Adelie Torgersen 46.0 21.5 194 ... 2007\n",
" : : : : : : ... :\n",
"240 Gentoo Biscoe 50.4 15.7 222 ... 2009\n",
"241 Gentoo Biscoe 45.2 14.8 212 ... 2009\n",
"242 Gentoo Biscoe 49.9 16.1 213 ... 2009\n"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.slice(booleans)"
]
},
{
"cell_type": "markdown",
"id": "3264a182-6b72-461a-b712-c3b708c53516",
"metadata": {},
"source": [
"`slice` accepts a block.\n",
"- We can't use both arguments and a block at a same time.\n",
"- The block should return indeces in any length or a boolean Array with a same length as `size`.\n",
"- Block is called in the context of self. So reciever 'self' can be omitted in the block."
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "c95d3426-0bbb-430e-8d83-6e22434d99ed",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <204 x 8 vectors> species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
⋮ |
Gentoo | Biscoe | 47.2 | 13.7 | 214 | 4925 | female | 2009 |
Gentoo | Biscoe | 46.8 | 14.3 | 215 | 4850 | female | 2009 |
Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female | 2009 |
"
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n",
" 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n",
" 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n",
" 4 Adelie Torgersen 39.3 20.6 190 ... 2007\n",
" 5 Adelie Torgersen 38.9 17.8 181 ... 2007\n",
" : : : : : : ... :\n",
"202 Gentoo Biscoe 47.2 13.7 214 ... 2009\n",
"203 Gentoo Biscoe 46.8 14.3 215 ... 2009\n",
"204 Gentoo Biscoe 45.2 14.8 212 ... 2009\n"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# return a DataFrame with bill_length_mm is in 2*std range around mean\n",
"penguins.slice do\n",
" vector = self[:bill_length_mm]\n",
" min = vector.mean - vector.std\n",
" max = vector.mean + vector.std\n",
" vector.to_a.map { |e| (min..max).include? e }\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "4fa42801-64f5-4432-856b-85c26a68515d",
"metadata": {},
"source": [
"## 29. Slice and nil option"
]
},
{
"cell_type": "markdown",
"id": "31017a7e-0923-4283-bc92-246ebe2591c3",
"metadata": {},
"source": [
"`Arrow::Table#slice` uses `#filter` method with a option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This will propagate nil at the same row."
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "8e4a8108-154b-4621-acd1-704ddf229d61",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"\t a\tb\t c\n",
"0\t 1\tA\t 1.000000\n",
"1\t(null)\t(null)\t (null)\n"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }\n",
"table = Arrow::Table.new(hash)\n",
"table.slice([true, false, nil])"
]
},
{
"cell_type": "markdown",
"id": "dbb57c5a-e949-42b8-a82c-9affb3fe5b7b",
"metadata": {},
"source": [
"Whereas in RedAmber, `DataFrame#slice` with booleans containing nil is treated as false. This behavior comes from `Allow::FilterOptions.null_selection_behavior = :drop`. This is a default value for `Arrow::Table.filter` method."
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "851c3bf6-b9e9-41bd-92c5-5372ed934549",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"\ta\tb\t c\n",
"0\t1\tA\t 1.000000\n"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"RedAmber::DataFrame.new(table).slice([true, false, nil]).table"
]
},
{
"cell_type": "markdown",
"id": "56398a3d-6146-43af-8b96-fec37730fc49",
"metadata": {},
"source": [
"## 30. Remove"
]
},
{
"cell_type": "markdown",
"id": "9e042a97-8a5d-412e-8e4a-fda382225a2d",
"metadata": {},
"source": [
"Slice and reject rows (observations) to create a remainer DataFrame."
]
},
{
"cell_type": "markdown",
"id": "2b4cbb97-eef3-4db8-8f25-c44c208ec554",
"metadata": {},
"source": [
"`#remove(indeces)` accepts indeces as arguments. Indeces should be an Integer or a Range of Integer."
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "17e38ab8-886b-4114-bcaf-ee18df7d00cd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <334 x 8 vectors> species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | (nil) | 2007 |
⋮ |
Gentoo | Biscoe | 44.5 | 15.7 | 217 | 4875 | (nil) | 2009 |
Gentoo | Biscoe | 48.8 | 16.2 | 222 | 6000 | male | 2009 |
Gentoo | Biscoe | 47.2 | 13.7 | 214 | 4925 | female | 2009 |
"
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 39.3 20.6 190 ... 2007\n",
" 2 Adelie Torgersen 38.9 17.8 181 ... 2007\n",
" 3 Adelie Torgersen 39.2 19.6 195 ... 2007\n",
" 4 Adelie Torgersen 34.1 18.1 193 ... 2007\n",
" 5 Adelie Torgersen 42.0 20.2 190 ... 2007\n",
" : : : : : : ... :\n",
"332 Gentoo Biscoe 44.5 15.7 217 ... 2009\n",
"333 Gentoo Biscoe 48.8 16.2 222 ... 2009\n",
"334 Gentoo Biscoe 47.2 13.7 214 ... 2009\n"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# returns 6th to 339th obs. Remainer of 1st example of #30\n",
"penguins.remove(0...5, -5..-1)"
]
},
{
"cell_type": "markdown",
"id": "def1c1c4-6b60-4864-ae24-c797fbf008a7",
"metadata": {},
"source": [
"`remove(booleans)` accepts booleans as a argument in an Array, a Vector or an Arrow::BooleanArray . Booleans must be same length as `#size`."
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "6f169420-7eb2-457f-8d59-7a5c90aa3fa5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <333 x 8 vectors> species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
⋮ |
Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male | 2009 |
Gentoo | Biscoe | 45.2 | 14.8 | 212 | 5200 | female | 2009 |
Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male | 2009 |
"
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n",
" 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n",
" 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n",
" 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n",
" 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n",
" : : : : : : ... :\n",
"331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n",
"332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n",
"333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# remove all observation contains nil\n",
"removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }"
]
},
{
"cell_type": "markdown",
"id": "5f1864c9-4ae4-4fcd-9840-ea424ef5e27d",
"metadata": {},
"source": [
"`remove {block}` is also acceptable. We can't use both arguments and a block at a same time. The block should return indeces or a boolean Array with a same length as size. Block is called in the context of self."
]
},
{
"cell_type": "code",
"execution_count": 80,
"id": "a6807c65-25e5-4ee1-8d1b-6018c46b3999",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <140 x 8 vectors> species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|
Adelie | Torgersen | (nil) | (nil) | (nil) | (nil) | (nil) | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | (nil) | 2007 |
Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | (nil) | 2007 |
⋮ |
Gentoo | Biscoe | (nil) | (nil) | (nil) | (nil) | (nil) | 2009 |
Gentoo | Biscoe | 50.4 | 15.7 | 222 | 5750 | male | 2009 |
Gentoo | Biscoe | 49.9 | 16.1 | 213 | 5400 | male | 2009 |
"
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen (nil) (nil) (nil) ... 2007\n",
" 2 Adelie Torgersen 36.7 19.3 193 ... 2007\n",
" 3 Adelie Torgersen 34.1 18.1 193 ... 2007\n",
" 4 Adelie Torgersen 37.8 17.1 186 ... 2007\n",
" 5 Adelie Torgersen 37.8 17.3 180 ... 2007\n",
" : : : : : : ... :\n",
"138 Gentoo Biscoe (nil) (nil) (nil) ... 2009\n",
"139 Gentoo Biscoe 50.4 15.7 222 ... 2009\n",
"140 Gentoo Biscoe 49.9 16.1 213 ... 2009\n"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Remove data in 2*std range around mean\n",
"penguins.remove do\n",
" vector = self[:bill_length_mm]\n",
" min = vector.mean - vector.std\n",
" max = vector.mean + vector.std\n",
" vector.to_a.map { |e| (min..max).include? e }\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "591e6b22-da98-4336-b22e-c7bc9bcf2ebf",
"metadata": {},
"source": [
"## 31. Remove and nil"
]
},
{
"cell_type": "markdown",
"id": "67926d1b-c76e-4cb7-b679-6545d850e7e4",
"metadata": {},
"source": [
"When `remove` used with booleans, nil in booleans is treated as false. This behavior is aligned with Ruby's `nil#!`."
]
},
{
"cell_type": "code",
"execution_count": 81,
"id": "8575614e-f702-4ee4-ac7b-745e9b32e803",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 3 vectors> "
],
"text/plain": [
"#\n",
" a b c\n",
" \n",
"1 1 A 1.0\n",
"2 2 B 2.0\n",
"3 (nil) C 3.0\n"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])"
]
},
{
"cell_type": "code",
"execution_count": 82,
"id": "932a5e71-8cef-44e5-a789-ce97329bc001",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[true, false, nil]\n"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans = df[:a] < 2"
]
},
{
"cell_type": "code",
"execution_count": 83,
"id": "74cf6aa6-8913-433d-97ad-bba2d548afe5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[false, true, true]"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans_invert = booleans.to_a.map(&:!)"
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "5e466a06-cb17-4dc1-a5b0-34bfd3ffb78b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.slice(booleans) == df.remove(booleans_invert)"
]
},
{
"cell_type": "markdown",
"id": "8bca0b06-2d08-4c28-8b4c-4fd088f2d2d3",
"metadata": {},
"source": [
"In contrast, `Vector#invert` returns nil for nil elements, which leads to a different result. (See #26)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "077b216f-0a08-413e-95c9-12789d15a9ba",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[false, true, nil]\n"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans.invert"
]
},
{
"cell_type": "code",
"execution_count": 86,
"id": "b3df62a6-c4a3-44cb-bde6-f6be12b120c8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <2 x 3 vectors> "
],
"text/plain": [
"#\n",
" a b c\n",
" \n",
"1 1 A 1.0\n",
"2 (nil) C 3.0\n"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.remove(booleans.invert)"
]
},
{
"cell_type": "markdown",
"id": "e05f00b6-3bae-4650-8bbc-d4e0692f6f85",
"metadata": {},
"source": [
"Vector also has a `#primitive_invert` method. It returns the same result as `.to_a.map(&:!)` above."
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "296ca3cd-a6da-4603-a576-d8c36a810e4f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[false, true, true]\n"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans.primitive_invert"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "ba5b8c0b-b94e-4209-adcd-258ea3b87bfd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"RedAmber::DataFrame <1 x 3 vectors> "
],
"text/plain": [
"#\n",
" a b c\n",
" \n",
"1 1 A 1.0\n"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.remove(booleans.primitive_invert)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"id": "2446792f-0b0a-4642-acae-b4fec89261c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.slice(booleans) == df.remove(booleans.primitive_invert)"
]
},
{
"cell_type": "markdown",
"id": "7c23a4ad-0c17-4178-b58a-abfd8153d49b",
"metadata": {},
"source": [
"## 32. Remove nil"
]
},
{
"cell_type": "markdown",
"id": "84c7238b-1029-416f-b495-9d045f77b22c",
"metadata": {},
"source": [
"Remove any observations containing nil."
]
},
{
"cell_type": "code",
"execution_count": 90,
"id": "de4bb615-d14d-4c90-ab54-db2f375b9f00",
"metadata": {},
"outputs": [
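{
"data": {
"text/html": [
"RedAmber::DataFrame <333 x 8 vectors>\n",
"<table>\n",
"<tr><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th><th>year</th></tr>\n",
"<tr><td>Adelie</td><td>Torgersen</td><td>39.1</td><td>18.7</td><td>181</td><td>3750</td><td>male</td><td>2007</td></tr>\n",
"<tr><td>Adelie</td><td>Torgersen</td><td>39.5</td><td>17.4</td><td>186</td><td>3800</td><td>female</td><td>2007</td></tr>\n",
"<tr><td>Adelie</td><td>Torgersen</td><td>40.3</td><td>18.0</td><td>195</td><td>3250</td><td>female</td><td>2007</td></tr>\n",
"<tr><td>Adelie</td><td>Torgersen</td><td>36.7</td><td>19.3</td><td>193</td><td>3450</td><td>female</td><td>2007</td></tr>\n",
"<tr><td>⋮</td></tr>\n",
"<tr><td>Gentoo</td><td>Biscoe</td><td>50.4</td><td>15.7</td><td>222</td><td>5750</td><td>male</td><td>2009</td></tr>\n",
"<tr><td>Gentoo</td><td>Biscoe</td><td>45.2</td><td>14.8</td><td>212</td><td>5200</td><td>female</td><td>2009</td></tr>\n",
"<tr><td>Gentoo</td><td>Biscoe</td><td>49.9</td><td>16.1</td><td>213</td><td>5400</td><td>male</td><td>2009</td></tr>\n",
"</table>"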
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n",
" 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n",
" 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n",
" 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n",
" 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n",
" : : : : : : ... :\n",
"331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n",
"332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n",
"333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.remove_nil"
]
},
{
"cell_type": "markdown",
"id": "4a4ae8f9-dcf8-4dad-bb77-af076e9cadb5",
"metadata": {},
"source": [
"A more roundabout way to do this is to use `#remove`."
]
},
{
"cell_type": "code",
"execution_count": 91,
"id": "27a3da5f-0ea2-4c5d-a6c3-c0e20f2224a3",
"metadata": {},
"outputs": [
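{
"data": {
"text/html": [
"RedAmber::DataFrame <333 x 8 vectors>\n",
"<table>\n",
"<tr><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th><th>year</th></tr>\n",
"<tr><td>Adelie</td><td>Torgersen</td><td>39.1</td><td>18.7</td><td>181</td><td>3750</td><td>male</td><td>2007</td></tr>\n",
"<tr><td>Adelie</td><td>Torgersen</td><td>39.5</td><td>17.4</td><td>186</td><td>3800</td><td>female</td><td>2007</td></tr>\n",
"<tr><td>Adelie</td><td>Torgersen</td><td>40.3</td><td>18.0</td><td>195</td><td>3250</td><td>female</td><td>2007</td></tr>\n",
"<tr><td>Adelie</td><td>Torgersen</td><td>36.7</td><td>19.3</td><td>193</td><td>3450</td><td>female</td><td>2007</td></tr>\n",
"<tr><td>⋮</td></tr>\n",
"<tr><td>Gentoo</td><td>Biscoe</td><td>50.4</td><td>15.7</td><td>222</td><td>5750</td><td>male</td><td>2009</td></tr>\n",
"<tr><td>Gentoo</td><td>Biscoe</td><td>45.2</td><td>14.8</td><td>212</td><td>5200</td><td>female</td><td>2009</td></tr>\n",
"<tr><td>Gentoo</td><td>Biscoe</td><td>49.9</td><td>16.1</td><td>213</td><td>5400</td><td>male</td><td>2009</td></tr>\n",
"</table>"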
],
"text/plain": [
"#\n",
" species island bill_length_mm bill_depth_mm flipper_length_mm ... year\n",
" ... \n",
" 1 Adelie Torgersen 39.1 18.7 181 ... 2007\n",
" 2 Adelie Torgersen 39.5 17.4 186 ... 2007\n",
" 3 Adelie Torgersen 40.3 18.0 195 ... 2007\n",
" 4 Adelie Torgersen 36.7 19.3 193 ... 2007\n",
" 5 Adelie Torgersen 39.3 20.6 190 ... 2007\n",
" : : : : : : ... :\n",
"331 Gentoo Biscoe 50.4 15.7 222 ... 2009\n",
"332 Gentoo Biscoe 45.2 14.8 212 ... 2009\n",
"333 Gentoo Biscoe 49.9 16.1 213 ... 2009\n"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguins.remove { vectors.map(&:is_nil).reduce(&:|) }"
]
},
{
"cell_type": "markdown",
"id": "4f2a58fd-f033-44f6-9eb4-ed893a2b5d1d",
"metadata": {},
"source": [
"## 33. Rename"
]
},
{
"cell_type": "markdown",
"id": "c0d39506-8ae5-48e7-9dd2-acf38d4ec1a9",
"metadata": {},
"source": [
"Rename keys (column names) to create an updated DataFrame."
]
},
{
"cell_type": "markdown",
"id": "3f6924ec-e86c-4089-ae40-6783027d3ce0",
"metadata": {},
"source": [
"`#rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays `[[existing_key, new_key], ...]`."
]
},
{
"cell_type": "code",
"execution_count": 92,
"id": "9396c96d-83d7-4b92-a4ca-27bc9e4d7b9d",
"metadata": {},
"outputs": [
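{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 2 vectors>\n",
"<table>\n",
"<tr><th>name</th><th>age</th></tr>\n",
"<tr><td>Yasuko</td><td>68</td></tr>\n",
"<tr><td>Rui</td><td>49</td></tr>\n",
"<tr><td>Hinata</td><td>28</td></tr>\n",
"</table>"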
],
"text/plain": [
"#\n",
" name age\n",
" \n",
"1 Yasuko 68\n",
"2 Rui 49\n",
"3 Hinata 28\n"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"h = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }\n",
"comecome = RedAmber::DataFrame.new(h)"
]
},
{
"cell_type": "code",
"execution_count": 93,
"id": "fad279c6-1ca0-4493-bd69-0e9ef011bff7",
"metadata": {},
"outputs": [
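{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 2 vectors>\n",
"<table>\n",
"<tr><th>name</th><th>age_in_1993</th></tr>\n",
"<tr><td>Yasuko</td><td>68</td></tr>\n",
"<tr><td>Rui</td><td>49</td></tr>\n",
"<tr><td>Hinata</td><td>28</td></tr>\n",
"</table>"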
],
"text/plain": [
"#\n",
" name age_in_1993\n",
" \n",
"1 Yasuko 68\n",
"2 Rui 49\n",
"3 Hinata 28\n"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"comecome.rename(:age => :age_in_1993)\n",
"# comecome.rename(:age, :age_in_1993) # is also OK\n",
"# comecome.rename([:age, :age_in_1993]) # is also OK"
]
},
{
"cell_type": "markdown",
"id": "9dabb005-9822-4c4b-aaa5-fa6f28f2ed43",
"metadata": {},
"source": [
"`#rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays `[[existing_key, new_key], ...]`. The block is called in the context of self."
]
},
{
"cell_type": "markdown",
"id": "aabbba20-6ef8-4da2-8dc0-0cb243cf3b23",
"metadata": {},
"source": [
"Symbol keys and String keys are distinguished."
]
},
{
"cell_type": "markdown",
"id": "07f98b31-6123-4466-b4f8-f995c7cde474",
"metadata": {},
"source": [
"## 34. Assign"
]
},
{
"cell_type": "markdown",
"id": "99f6787f-2b36-4360-b155-1c2d7874d25e",
"metadata": {},
"source": [
"Assign new or updated columns (variables) and create an updated DataFrame.\n",
"\n",
"- Columns with new keys will append new variables on the right (at the bottom in TDR).\n",
"- Columns with existing keys will update the corresponding vectors."
]
},
{
"cell_type": "markdown",
"id": "b4b22da0-4ee2-4196-88e1-1cfea6a72f4d",
"metadata": {},
"source": [
"`#assign(key_pairs)` accepts pairs of key and array_like values as arguments. The pairs should be a Hash of `{key => array_like}` or an Array of Arrays `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`."
]
},
{
"cell_type": "code",
"execution_count": 94,
"id": "56dcfed8-a6f9-4d8c-bac3-e8ce7c0674a7",
"metadata": {},
"outputs": [
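{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 2 vectors>\n",
"<table>\n",
"<tr><th>name</th><th>age</th></tr>\n",
"<tr><td>Yasuko</td><td>68</td></tr>\n",
"<tr><td>Rui</td><td>49</td></tr>\n",
"<tr><td>Hinata</td><td>28</td></tr>\n",
"</table>"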
],
"text/plain": [
"#\n",
" name age\n",
" \n",
"1 Yasuko 68\n",
"2 Rui 49\n",
"3 Hinata 28\n"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"comecome = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28] )"
]
},
{
"cell_type": "code",
"execution_count": 95,
"id": "8da8d282-8798-44d5-bb7b-7fa2df922308",
"metadata": {},
"outputs": [
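{
"data": {
"text/html": [
"RedAmber::DataFrame <3 x 3 vectors>\n",
"<table>\n",
"<tr><th>name</th><th>age</th><th>brother</th></tr>\n",
"<tr><td>Yasuko</td><td>97</td><td>Santa</td></tr>\n",
"<tr><td>Rui</td><td>78</td><td>(nil)</td></tr>\n",
"<tr><td>Hinata</td><td>57</td><td>Momotaro</td></tr>\n",
"</table>"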
],
"text/plain": [
"#\n",
" name age brother\n",
" \n",
"1 Yasuko 97 Santa\n",
"2 Rui 78 (nil)\n",
"3 Hinata 57 Momotaro\n"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# update :age and add :brother\n",
"assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }\n",
"comecome.assign(assigner)"
]
},
{
"cell_type": "markdown",
"id": "e6d3ddfc-b16d-4b20-83df-357e9cdb32e6",
"metadata": {},
"source": [
"`#assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of key and array_like values as a Hash of `{key => array_like}` or an Array of Arrays `[[key, array_like], ... ]`. `array_like` is one of `Vector`, `Array` or `Arrow::Array`. The block is called in the context of self."
]
},
{
"cell_type": "code",
"execution_count": 96,
"id": "8d69edd0-7ad7-4318-8033-1785ce2543db",
"metadata": {},
"outputs": [
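{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 3 vectors>\n",
"<table>\n",
"<tr><th>index</th><th>float</th><th>string</th></tr>\n",
"<tr><td>0</td><td>0.0</td><td>A</td></tr>\n",
"<tr><td>1</td><td>1.1</td><td>B</td></tr>\n",
"<tr><td>2</td><td>2.2</td><td>C</td></tr>\n",
"<tr><td>3</td><td>NaN</td><td>D</td></tr>\n",
"<tr><td>(nil)</td><td>(nil)</td><td>(nil)</td></tr>\n",
"</table>"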
],
"text/plain": [
"#\n",
" index float string\n",
" \n",
"1 0 0.0 A\n",
"2 1 1.1 B\n",
"3 2 2.2 C\n",
"4 3 NaN D\n",
"5 (nil) (nil) (nil)\n"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = RedAmber::DataFrame.new(\n",
" index: [0, 1, 2, 3, nil],\n",
" float: [0.0, 1.1, 2.2, Float::NAN, nil],\n",
" string: ['A', 'B', 'C', 'D', nil])"
]
},
{
"cell_type": "code",
"execution_count": 97,
"id": "e884af01-d82b-42e7-8e92-62baf19919cb",
"metadata": {},
"outputs": [
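{
"data": {
"text/html": [
"RedAmber::DataFrame <5 x 3 vectors>\n",
"<table>\n",
"<tr><th>index</th><th>float</th><th>string</th></tr>\n",
"<tr><td>0</td><td>-0.0</td><td>A</td></tr>\n",
"<tr><td>255</td><td>-1.1</td><td>B</td></tr>\n",
"<tr><td>254</td><td>-2.2</td><td>C</td></tr>\n",
"<tr><td>253</td><td>NaN</td><td>D</td></tr>\n",
"<tr><td>(nil)</td><td>(nil)</td><td>(nil)</td></tr>\n",
"</table>"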
],
"text/plain": [
"#\n",
" index float string\n",
" \n",
"1 0 -0.0 A\n",
"2 255 -1.1 B\n",
"3 254 -2.2 C\n",
"4 253 NaN D\n",
"5 (nil) (nil) (nil)\n"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# update numeric variables\n",
"df.assign do\n",
" vectors.select(&:numeric?).map { |v| [v.key, -v] }\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "7b8e2090-628f-4b17-8929-cbb5e0285dff",
"metadata": {},
"source": [
"In this example, columns :index and :float are updated. Column :index returns complements for the `#negate` method because :index is of :uint8 type, so the negated values wrap around (1 becomes 255)."
]
},
{
"cell_type": "code",
"execution_count": 98,
"id": "9452f8db-5f23-4044-ac87-ac5695fae8ae",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[:uint8, :double, :string]"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.types"
]
},
{
"cell_type": "markdown",
"id": "c5c83559-f4d8-4ed2-8b20-5c50eb1faa14",
"metadata": {},
"source": [
"## 35. Coerce (Vector)"
]
},
{
"cell_type": "markdown",
"id": "77bdfc69-b728-4335-b76e-e4be92f94310",
"metadata": {},
"source": [
"Vector has a `#coerce` method, so a Numeric can appear on the left-hand side of an operation with a Vector."
]
},
{
"cell_type": "code",
"execution_count": 99,
"id": "2bfbe584-be54-486b-af32-e76b37c10e49",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[1, 2, 3]\n"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vector = RedAmber::Vector.new(1,2,3)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"id": "ce35d901-38a8-4f13-b2d1-29b83f6c5438",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[-1, -2, -3]\n"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Vector's `#*` method\n",
"vector * -1"
]
},
{
"cell_type": "code",
"execution_count": 101,
"id": "7d5fc2be-f590-4678-92e9-faa27b618266",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[-1, -2, -3]\n"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# coerced calculation\n",
"-1 * vector"
]
},
{
"cell_type": "code",
"execution_count": 102,
"id": "fa90a6af-add7-42f2-9707-7d726575aeb6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[255, 254, 253]\n"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# `-@` operator (unary minus); the values wrap around because the vector is :uint8\n",
"-vector"
]
},
{
"cell_type": "markdown",
"id": "4820b527-44e9-4738-aa0e-73604078b3b0",
"metadata": {
"tags": []
},
"source": [
"## 36. to_ary (Vector)"
]
},
{
"cell_type": "markdown",
"id": "8507dcc4-74e3-44ad-aa54-cf43d55f2131",
"metadata": {},
"source": [
"`Vector#to_ary` enables implicit conversion to an Array."
]
},
{
"cell_type": "code",
"execution_count": 103,
"id": "b12bd7c8-2981-426c-8ae3-154504a8ea15",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3, 4, 5]"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Array(Vector.new([3, 4, 5]))"
]
},
{
"cell_type": "code",
"execution_count": 104,
"id": "c0cb5a98-7cdf-43a8-b2f7-f9df1961c761",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1, 2, 3, 4, 5]"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[1, 2] + Vector.new([3, 4, 5])"
]
},
{
"cell_type": "markdown",
"id": "216dde4f-e4d8-4f29-903a-8cbf75de5b8e",
"metadata": {},
"source": [
"## 37. Fill nil (Vector)"
]
},
{
"cell_type": "markdown",
"id": "1959d0d7-6d09-4fa5-9365-1e2f7fc35d61",
"metadata": {},
"source": [
"`Vector#fill_nil_forward` propagates the last valid observation forward, and\n",
"`Vector#fill_nil_backward` propagates the next valid observation backward.\n",
"A nil is preserved when there is no valid value to propagate: leading nils for the forward fill, trailing nils for the backward fill."
]
},
{
"cell_type": "code",
"execution_count": 105,
"id": "d003b06a-859f-4de0-9e35-803efac85169",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[0, 1, 1, 3, 3]\n"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"integer = Vector.new([0, 1, nil, 3, nil])\n",
"integer.fill_nil_forward"
]
},
{
"cell_type": "code",
"execution_count": 106,
"id": "c5d74006-d364-4e86-8a5e-9e96e87a96e0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[0, 1, 3, 3, nil]\n"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"integer.fill_nil_backward"
]
},
{
"cell_type": "markdown",
"id": "347785a6-eab0-4864-a871-2c320005211e",
"metadata": {},
"source": [
"## 38. all?/any? (Vector)"
]
},
{
"cell_type": "markdown",
"id": "f82a6f5d-03d3-4645-85f5-d25999165378",
"metadata": {},
"source": [
"`Vector#all?` returns true if all elements are true.\n",
"\n",
"`Vector#any?` returns true if any element is true.\n",
"\n",
"These are unary aggregation functions."
]
},
{
"cell_type": "code",
"execution_count": 107,
"id": "ebad37ad-0a09-48b1-ba3a-4e030a917837",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans = Vector.new([true, true, nil])\n",
"booleans.all?"
]
},
{
"cell_type": "code",
"execution_count": 108,
"id": "97fc24da-03d4-406d-b353-562896775d60",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans.any?"
]
},
{
"cell_type": "markdown",
"id": "0ff3b22d-9f7c-42f2-8d18-c89a06af681b",
"metadata": {},
"source": [
"If these methods are used with the option `skip_nulls: false`, nil is taken into account instead of being skipped."
]
},
{
"cell_type": "code",
"execution_count": 109,
"id": "3e0e5800-665a-4a05-b2cb-d152f3f077de",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"false"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans.all?(skip_nulls: false)"
]
},
{
"cell_type": "code",
"execution_count": 110,
"id": "3e43f0c4-a254-4735-ac28-de14d2670c67",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"true"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"booleans.any?(skip_nulls: false)"
]
},
{
"cell_type": "markdown",
"id": "abc71a85-7958-4a21-91cf-8c96c0784525",
"metadata": {},
"source": [
"## 39. count/count_uniq (Vector)"
]
},
{
"cell_type": "markdown",
"id": "3d556118-4105-4d12-806d-ba56c6ae3d1b",
"metadata": {},
"source": [
"`Vector#count` counts elements.\n",
"\n",
"`Vector#count_uniq` counts unique elements. `#count_distinct` is an alias (Arrow's name).\n",
"\n",
"These are unary aggregation functions."
]
},
{
"cell_type": "code",
"execution_count": 111,
"id": "2af73e32-1d7e-4f80-b54e-c40ef08b7034",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string = Vector.new(%w[A B A])\n",
"string.count"
]
},
{
"cell_type": "code",
"execution_count": 112,
"id": "fe6d8d85-27b0-438f-b1b4-1b15e9eb05f9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string.count_uniq # count_distinct is also OK"
]
},
{
"cell_type": "markdown",
"id": "70abed9f-665a-4ea7-939e-4b185ee53755",
"metadata": {},
"source": [
"## 40. stddev/variance (Vector)"
]
},
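{
"cell_type": "markdown",
"id": "965de338-b3be-4d33-92e1-5ad7e2ed18f0",
"metadata": {},
"source": [
"These are unary aggregation functions. `#stddev` and `#variance` return the population (biased) values, while `#sd` and `#var` return the unbiased ones."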
]
},
{
"cell_type": "code",
"execution_count": 113,
"id": "0afec200-f377-432b-a260-ae5a0c5ce794",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.816496580927726"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"integers = Vector.new([1, 2, 3, nil])\n",
"integers.stddev"
]
},
{
"cell_type": "code",
"execution_count": 114,
"id": "2e40ac09-cb7f-4978-87e8-53f84f16f7c7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Unbiased standard deviation\n",
"integers.sd"
]
},
{
"cell_type": "code",
"execution_count": 115,
"id": "e6158e3b-4af8-467c-a355-8e9f2e579548",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6666666666666666"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"integers.variance"
]
},
{
"cell_type": "code",
"execution_count": 116,
"id": "d64d39f2-d979-49f1-9946-65890f40d646",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Unbiased variance\n",
"integers.var"
]
},
{
"cell_type": "markdown",
"id": "25023f5a-798a-40a5-ab84-a6615602f747",
"metadata": {},
"source": [
"## 41. negate (Vector)"
]
},
{
"cell_type": "markdown",
"id": "00ddf322-ef50-40a1-86a6-22bf3d43f007",
"metadata": {},
"source": [
"This is a unary element-wise function."
]
},
{
"cell_type": "code",
"execution_count": 117,
"id": "ab5a357a-e98c-40a1-9b89-0b38645e416f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[-1.0, 2.0, -3.0]\n"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"double = Vector.new([1.0, -2, 3])\n",
"double.negate"
]
},
{
"cell_type": "code",
"execution_count": 118,
"id": "8a06c856-d61c-4752-a296-1fa207ffd9a1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[-1.0, 2.0, -3.0]\n"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Same as #negate\n",
"-double"
]
},
{
"cell_type": "markdown",
"id": "9b145724-d165-4ef3-8a06-2948dd0c7dbb",
"metadata": {},
"source": [
"## 42. round (Vector)"
]
},
{
"cell_type": "markdown",
"id": "b780c2f3-935c-4b2f-b18a-b277cf7c24b7",
"metadata": {},
"source": [
"Options for `#round`:\n",
"\n",
"- `:n_digits` The number of digits to round to.\n",
"- `:mode` The rounding mode, such as `:half_to_even` or `:half_up`.\n",
"\n",
"This is a unary element-wise function."
]
},
{
"cell_type": "code",
"execution_count": 119,
"id": "e7a069b0-3547-4cd2-a2f0-0740f186b191",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[15.15, 2.5, 3.5, -4.5, -5.5]\n"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"double = RedAmber::Vector.new([15.15, 2.5, 3.5, -4.5, -5.5])"
]
},
{
"cell_type": "code",
"execution_count": 120,
"id": "5ee84b24-8830-4788-a404-d5e1cca22abf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[15.0, 2.0, 4.0, -4.0, -6.0]\n"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"double.round"
]
},
{
"cell_type": "code",
"execution_count": 121,
"id": "20adb1ad-473c-4245-b959-7848c239fb76",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[15.0, 2.0, 4.0, -4.0, -6.0]\n"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"double.round(mode: :half_to_even)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"id": "d2777ad8-2c24-48e4-8f5f-77403e3109ea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[16.0, 3.0, 4.0, -5.0, -6.0]\n"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"double.round(mode: :towards_infinity)"
]
},
{
"cell_type": "code",
"execution_count": 123,
"id": "a8ab2735-74cb-4cfe-a5a2-61bfa90c72ac",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[15.0, 3.0, 4.0, -4.0, -5.0]\n"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"double.round(mode: :half_up)"
]
},
{
"cell_type": "code",
"execution_count": 124,
"id": "3575481c-40ed-405f-a69c-7581d4dce2cf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#\n",
"[15.0, 2.0, 3.0, -4.0, -5.0]\n"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"double.round(mode: :half_towards_zero)"
]
},
{
"cell_type": "code",
"execution_count": 125,
"id": "a86e4c5c-aced-4a88-b692-4e26b90f1653",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#