docs/toolkit.md in timescaledb-0.2.3 vs docs/toolkit.md in timescaledb-0.2.4
- old
+ new
@@ -91,11 +91,11 @@
Now, let's add the model `app/models/measurement.rb`:
```ruby
class Measurement < ActiveRecord::Base
- self.primary_key = 'device_id'
+ self.primary_key = nil
acts_as_hypertable time_column: "ts"
end
```
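
If you haven't created the table yet, here is a minimal sketch of a migration
that could back this model. The column names match the examples below, and the
`create_hypertable` call is plain TimescaleDB SQL:
```ruby
class CreateMeasurements < ActiveRecord::Migration[7.0]
  def up
    create_table :measurements, id: false do |t|
      t.integer :device_id
      t.decimal :val
      t.timestamp :ts
    end
    # Convert the plain table into a hypertable partitioned by time
    execute "SELECT create_hypertable('measurements', 'ts')"
  end

  def down
    drop_table :measurements
  end
end
```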
@@ -166,26 +166,35 @@
```
The final query for the example above looks like this:
```sql
-SELECT device_id, sum(abs_delta) as volatility
+SELECT device_id, SUM(abs_delta) AS volatility
FROM (
SELECT device_id,
- abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
+ ABS(
+ val - LAG(val) OVER (
+ PARTITION BY device_id ORDER BY ts)
+ ) AS abs_delta
FROM "measurements"
-) as calc_delta
+) AS calc_delta
GROUP BY device_id
```
The plain SQL version is much harder to understand at a glance, so now
let's reproduce the same example using the toolkit pipelines:
```ruby
Measurement
- .select("device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility")
- .group("device_id")
+ .select(<<-SQL).group("device_id")
+ device_id,
+ timevector(ts, val)
+ -> sort()
+ -> delta()
+ -> abs()
+ -> sum() as volatility
+ SQL
```
As you can see, the pipeline version is much easier to read and digest. Now,
let's take a look at how we can generate these queries using the scopes injected
by the `acts_as_time_vector` macro.
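
The records returned by the pipeline query behave like any other relation;
`device_id` and `volatility` become readable attributes on each row (a usage
sketch, not from the original docs):
```ruby
relation = Measurement
  .select("device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility")
  .group("device_id")

relation.each do |row|
  puts "#{row.device_id}: #{row.volatility}"
end
```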
@@ -196,11 +205,11 @@
Let's start changing the model to add the `acts_as_time_vector` that is
here to allow us to not repeat the parameters of the `timevector(ts, val)` call.
```ruby
class Measurement < ActiveRecord::Base
- self.primary_key = 'device_id'
+ self.primary_key = nil
acts_as_hypertable time_column: "ts"
acts_as_time_vector segment_by: "device_id",
value_column: "val",
@@ -222,12 +231,18 @@
acts_as_time_vector segment_by: "device_id",
value_column: "val",
time_column: "ts"
scope :volatility, -> do
- select("device_id, timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility")
- .group("device_id")
+ select(<<-SQL).group("device_id")
+ device_id,
+ timevector(#{time_column}, #{value_column})
+ -> sort()
+ -> delta()
+ -> abs()
+ -> sum() as volatility
+ SQL
end
end
```
Now we have created the `volatility` scope, which always groups by `device_id`.
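
With that in place, computing the volatility per device becomes a one-liner
(a usage sketch; the attribute names follow from the `select` above):
```ruby
Measurement.volatility.each do |row|
  puts "#{row.device_id}: #{row.volatility}"
end
```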
@@ -246,11 +261,16 @@
value_column: "val",
time_column: "ts"
scope :volatility, -> (columns=segment_by_column) do
_scope = select([*columns,
- "timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility"
+ "timevector(#{time_column},
+ #{value_column})
+ -> sort()
+ -> delta()
+ -> abs()
+ -> sum() as volatility"
].join(", "))
_scope = _scope.group(columns) if columns
_scope
end
end
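
# Hypothetical usage of the parameterized scope (not from the original docs):
#
#   Measurement.volatility                 # group by the default segment_by column
#   Measurement.volatility(nil)            # a single global volatility, no grouping
#   Measurement.volatility(%w[device_id])  # group by an explicit list of columns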
@@ -359,11 +379,11 @@
Now, let's compare the time it takes to compute the volatility:
```ruby
Benchmark.bm do |x|
x.report("ruby") { pp Measurement.volatility_by_device_id }
- x.report("sql") { pp Measurement.volatility("device_id").map(&:attributes) }
+ x.report("sql") { pp Measurement.volatility("device_id").map(&:attributes) }
end
# user system total real
# ruby 0.612439 0.061890 0.674329 ( 0.727590)
# sql 0.001142 0.000301 0.001443 ( 0.060301)
```
@@ -377,13 +397,106 @@
!!!warning
Note that the previous numbers were measured using localhost.
Now, using a remote connection between different regions,
Ruby looks even ~500 times slower than SQL.
- user system total real
- ruby 0.716321 0.041640 0.757961 ( 6.388881)
- sql 0.001156 0.000177 0.001333 ( 0.161270)
+ user system total real
+ ruby 0.716321 0.041640 0.757961 ( 6.388881)
+ sql 0.001156 0.000177 0.001333 ( 0.161270)
+Let's recap what's time consuming here. The `find_all` step is not optimized
+for this job and consumes most of the time: it fetches every row and converts
+it into an ActiveRecord model, which carries thousands of methods.
+
+That is very convenient, but here we only need the raw attribute values.
+
+Let's optimize the fetch by plucking arrays of values grouped by device.
+
+```ruby
+class Measurement < ActiveRecord::Base
+ # ...
+ scope :values_from_devices, -> {
+    ordered_values = select(:val, :device_id).order(:ts)
+    # Build { device_id => [val, val, ...] } with the values ordered by time
+    Hash[
+      from(ordered_values)
+        .group(:device_id)
+        .pluck("device_id, array_agg(val)")
+    ]
+ }
+end
+```
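+
+Hypothetically, for three devices the result would look like this (the values
+are made up for illustration):
+
+```ruby
+Measurement.values_from_devices
+# => { 1 => [20.0, 21.5, 19.8], 2 => [1.0, 1.2], 3 => [5.5] }
+```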
+
+Now, let's create a method for processing volatility.
+
+```ruby
+class Volatility
+  def self.process(values)
+    previous = nil
+    deltas = values.map do |value|
+      # The first value has no predecessor, so its delta is nil
+      delta = (value - previous).abs if previous
+      previous = value
+      delta
+    end
+    # deltas => [nil, 1, 1, ...]: drop the leading nil before summing
+    deltas.shift
+    deltas.sum
+  end
+
+  def self.process_values(map)
+    map.transform_values(&method(:process))
+  end
+end
+```
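+
+For example, with the values `[2, 1, 2]` the absolute deltas are `1` and `1`,
+so the volatility is `2`:
+
+```ruby
+Volatility.process([2, 1, 2])
+# => 2
+Volatility.process_values({ 1 => [2, 1, 2], 2 => [5, 5] })
+# => { 1 => 2, 2 => 0 }
+```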
+
+Now, let's change the benchmark to expose the time for fetching and processing:
+
+
+```ruby
+volatilities = nil
+
+ActiveRecord::Base.logger = nil
+Benchmark.bm do |x|
+  x.report("ruby") { Measurement.volatility_by_device_id }
+  x.report("sql") { Measurement.volatility("device_id").map(&:attributes) }
+ x.report("fetch") { volatilities = Measurement.values_from_devices }
+ x.report("process") { Volatility.process_values(volatilities) }
+end
+```
+
+Checking the results:
+
+ user system total real
+ ruby 0.683654 0.036558 0.720212 ( 0.743942)
+ sql 0.000876 0.000096 0.000972 ( 0.054234)
+ fetch 0.078045 0.003221 0.081266 ( 0.116693)
+ process 0.067643 0.006473 0.074116 ( 0.074122)
+
+Much better: fetching plus processing now takes about 190ms of real time, almost
+four times faster than the naive Ruby version, though still around 140ms slower
+than computing everything in SQL.
+
+
+If we break down the SQL part a bit further with `EXPLAIN ANALYSE`, we can see
+where the database itself spends the time:
+
+```sql
+EXPLAIN ANALYSE
+ SELECT device_id, array_agg(val)
+ FROM (
+ SELECT val, device_id
+ FROM measurements
+ ORDER BY ts ASC
+ ) subquery
+ GROUP BY device_id;
+```
+
+Checking the planning and execution time makes it clear how much time the
+database needs for this query alone, isolating the network and the ActiveRecord
+layer.
+
+    Planning Time: 17.761 ms
+    Execution Time: 36.302 ms
+
+So, of the **116ms** spent fetching the data, only **54ms** (planning plus
+execution) was spent in the database; the remaining **62ms** was consumed by the
+network and the ORM.
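+
+If you prefer to stay in Ruby, ActiveRecord relations also expose an `explain`
+helper that prints a similar plan (timing details require `explain(:analyze)`,
+which is only available on recent Rails versions); a sketch reusing the same
+query:
+
+```ruby
+ordered_values = Measurement.select(:val, :device_id).order(:ts)
+puts Measurement
+  .from(ordered_values)
+  .group(:device_id)
+  .select("device_id, array_agg(val)")
+  .explain
+```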
[1]: https://github.com/timescale/timescaledb-toolkit
[2]: https://timescale.com
[3]: https://www.postgresql.org/docs/14/runtime-config-client.html#GUC-SEARCH-PATH
[4]: https://github.com/timescale/timescaledb-toolkit/blob/main/docs/README.md#a-note-on-tags-