# InstDataShipper

This gem is intended to facilitate easy upload of LTI datasets to Instructure Hosted Data.

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'inst_data_shipper'
```

Then run the migrations:

```
bundle exec rake db:migrate
```

## Usage

### Dumper

The main tool provided by this Gem is the `InstDataDumper::Dumper` class. It is used to define a "Dump", which is a combination of tasks and schema.

It is assumed that a `Dumper` class definition is the source of truth for all tables that it manages, and that no other processes affect the tables' data or schema. You can break this assumption, but you should understand how the `incremental` logic works and what will and will not trigger a full table upload.

Dumpers have an `export_genre` method that determines which Dumps to look at when calculating incrementals.
- At a high level, the Hosted Data backend will look for a past dump of the same genre. If not found, a full upload of all tables is triggered. If found, each table's schema is compared; any tables with mismatched schemas (determined by hashing) will do a full upload.
- Note that `Proc`s in the schema are not included in the hash calculation. If you change a `Proc` implementation and need to trigger a full upload of the table, you'll need to change something else too (such as the `version`).

Here is an example `Dumper` implementation, wrapped in an ActiveJob job:
```ruby
class HostedDataPushJob < ApplicationJob
  # The schema serves two purposes: defining the schema and mapping data
  SCHEMA = InstDataShipper::SchemaBuilder.build do
    # You can augment the Table-builder DSL with custom methods like so:
    extend_table_builder do
      # It may be useful to define custom column-definition helpers:
      def custom_column(*args, from: nil, **kwargs, &blk)
        # In this example, the helper reads the value from a `data` jsonb column - without it, you'd need
        #   to define `from: ->(row) { row.data["<column name>"] }` on each column that needs to read from the jsonb
        from ||= args[0].to_s
        from = ->(row) { row.data[from] } if from.is_a?(String)
        column(*args, **kwargs, from: from, &blk)
      end

      # `extend_table_builder` uses `class_eval`, so you could alternatively write your helpers
      #   in a Concern or Module and include them like normal:
      include SomeConcern
    end

    table(ALocalModel, "<table description>") do
      # If you define a table as incremental, it'll only export changes made since the start of the last successful Dumper run.
      # The first argument ("scope") can be interpreted in different ways:
      #   If exporting a local model it may be a: (default: `updated_at`)
      #     Proc that will receive a Relation and return a Relation (use `incremental_since`)
      #     String of a column to compare with `incremental_since`
      #   If exporting a Canvas report it may be a: (default: `updated_after`)
      #     Proc that will receive report params and return modified report params (use `incremental_since`)
      #     String of a report param to set to `incremental_since`
      # `on:` is passed to Hosted Data and is used as the unique key. It may be an array to form a composite key.
      # `if:` may be a Proc or a Symbol (of a method on the Dumper)
      incremental "updated_at", on: [:id], if: ->() {}

      # Schemas may declaratively define the data source.
      #   This can be used for basic schemas where there's a 1:1 mapping between source table and destination table,
      #   and there is no conditional logic that needs to be performed.
      # In order to apply these statements, your Dumper must call `auto_enqueue_from_schema`.
      source :local_table
      # A Proc can also be passed.
      #   The below is equivalent to the above:
      source ->(table_def) { import_local_table(table_def[:model] || table_def[:warehouse_name]) }

      # You may manually note a version on the table.
      #   Note that if a version is present, the version value replaces the hash comparison when calculating incrementals,
      #   so you must change the version whenever the schema changes enough to trigger a full upload.
      version "1.0.0"

      column :name_in_destinations, :maybe_optional_sql_type, "Optional description of column"

      # The type may usually be omitted if the `table()` is passed a Model class, but strings are an exception to this
      column :name, :"varchar(128)"

      # `from:` may be...
      #   A Symbol of a method to be called on the record:
      column :sis_type, :"varchar(32)", from: :some_model_method
      #   A String of a column to read from the record:
      column :sis_type, :"varchar(32)", from: "sis_source_type"
      #   A Proc to be called with each record:
      column :sis_type, :"varchar(32)", from: ->(rec) { ... }
      #   Not specified. Will default to using the Schema Column Name as a String ("sis_type" in this case):
      column :sis_type, :"varchar(32)"
    end

    table("my_table", model: ALocalModel) do
      # ...
    end

    table("proserv_student_submissions_csv") do
      column :canvas_id, :bigint, from: "canvas user id"
      column :sis_id, :"varchar(64)", from: "sis user id"
      column :name, :"varchar(64)", from: "user name"
      column :submission_id, :bigint, from: "submission id"
    end
  end

  Dumper = InstDataShipper::Dumper.define(schema: SCHEMA, include: [
    InstDataShipper::DataSources::LocalTables,
    InstDataShipper::DataSources::CanvasReports,
  ]) do
    import_local_table(ALocalModel)
    import_canvas_report_by_terms("proserv_student_submissions_csv", terms: Term.all.pluck(:canvas_id))

    # If the report_name/Model don't directly match the Schema, a schema_name: parameter may be passed:
    import_local_table(SomeModel, schema_name: "my_table")
    import_canvas_report_by_terms("some_report", terms: Term.all.pluck(:canvas_id), schema_name: "my_table")

    # Iterate through the Tables defined in the Schema and apply any defined `source` statements.
    # This is the default behavior if `define()` is called without a block.
    auto_enqueue_from_schema
  end

  def perform
    Dumper.perform_dump([
      "hosted-data://<token>@<hosted_data_domain>?table_prefix=example",
      "s3://<access_key_id>:<secret_access_key>@<region>/<bucket>/<optional_path>",
    ])
  end
end
```

`Dumper`s may also be formed as a normal Ruby subclass:
```ruby
class HostedDataPushJob < ApplicationJob
  SCHEMA = InstDataShipper::SchemaBuilder.build do
    # ...
  end

  class Dumper < InstDataShipper::Dumper
    include InstDataShipper::DataSources::LocalTables
    include InstDataShipper::DataSources::CanvasReports

    def enqueue_tasks
      import_local_table(ALocalModel)
      import_canvas_report_by_terms("proserv_student_submissions_csv", terms: Term.all.pluck(:canvas_id))
      # auto_enqueue_from_schema
    end

    def table_schemas
      SCHEMA
    end
  end

  def perform
    Dumper.perform_dump([
      "hosted-data://<token>@<hosted_data_domain>?table_prefix=example",
      "s3://<access_key_id>:<secret_access_key>@<region>/<bucket>/<optional_path>",
    ])
  end
end
```

### Destinations

This Gem is mainly designed for use with Hosted Data, but it tries to abstract that a little to allow for other destinations/backends. Out of the box, support for Hosted Data and S3 is included.

Destinations are passed as URI-formatted strings. Passing Hashes is also supported, but the format/keys are destination-specific.

Destinations blindly accept URI Fragments (the `#` chunk at the end of the URI). These options are not used internally, but will be made available as `dest.user_config`. Ideally these are in the same format as query parameters (`x=1&y=2`, which it will try to parse into a Hash), but they can be any string.
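For example, a fragment can carry arbitrary per-destination options that your own code can read back via `dest.user_config`. The sketch below reuses the `Dumper` from the earlier example; the fragment keys (`env`, `owner`, `nightly`) are made up purely for illustration:

```ruby
# A minimal sketch - the fragment values here are hypothetical and are not acted on by the
#   shipper itself; they are only surfaced back to your code as `dest.user_config`.
Dumper.perform_dump([
  # Query-parameter-style fragment: the Gem will try to parse it into a Hash
  "hosted-data://<token>@<hosted_data_domain>?table_prefix=example#env=staging&owner=data-team",
  # Any other string is passed through as-is
  "s3://<access_key_id>:<secret_access_key>@<region>/<bucket>/<optional_path>#nightly",
])
```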
#### Hosted Data

`hosted-data://<token>@<hosted_data_domain>`

##### Optional Parameters:

- `table_prefix`: An optional string to prefix onto each table name in the schema when declaring the schema in Hosted Data

#### S3

`s3://<access_key_id>:<secret_access_key>@<region>/<bucket>/<optional_path>`

##### Optional Parameters:

_None_

## Development

When adding to or updating this gem, make sure you do the following:

- Update the yardoc comments where necessary, and confirm the changes by running `yard server --reload`
- Write specs
- If you modify the model or migration templates, run `bundle exec rake update_test_schema` to update them in the Rails Dummy application (and commit those changes)

## Docs

Docs can be generated using [yard](https://yardoc.org/).

To view the docs:

- Clone this gem's repository
- `bundle install`
- `yard server --reload`

The yard server will give you a URL you can visit to view the docs.