# Historiographer

Losing data sucks. Every time you update or destroy a record in Rails, you lose the old data.

Historiographer fixes this problem in a better way than existing auditing gems.

## Existing auditing gems for Rails suck

The Audited gem has some serious flaws.

1. The `versions` table quickly grows too large to query
2. It doesn't provide the indexes you need from your primary tables
3. It doesn't provide out-of-the-box snapshots

## How does Historiographer solve these problems?

Historiographer introduces the concept of _history tables:_ append-only tables that have the same structure and indexes as your primary table.

If you have a `posts` table:

| id  | title          |
| :-- | :------------- |
| 1   | My Great Post  |
| 2   | My Second Post |

You'll also have a `post_histories` table:

| id  | post_id | title          | history_started_at | history_ended_at | history_user_id |
| :-- | :------ | :------------- | :----------------- | :--------------- | :-------------- |
| 1   | 1       | My Great Post  | '2019-11-08'       | NULL             | 1               |
| 2   | 2       | My Second Post | '2019-11-08'       | NULL             | 1               |

If you change the title of the 1st post:

`Post.find(1).update(title: "Title With Better SEO", history_user_id: current_user.id)`

You'll expect your `posts` table to be updated directly:

| id  | title                 |
| :-- | :-------------------- |
| 1   | Title With Better SEO |
| 2   | My Second Post        |

But also, your `post_histories` table will be updated:

| id  | post_id | title                 | history_started_at | history_ended_at | history_user_id |
| :-- | :------ | :-------------------- | :----------------- | :--------------- | :-------------- |
| 1   | 1       | My Great Post         | '2019-11-08'       | '2019-11-09'     | 1               |
| 2   | 2       | My Second Post        | '2019-11-08'       | NULL             | 1               |
| 3   | 1       | Title With Better SEO | '2019-11-09'       | NULL             | 1               |

A few things have happened here:

1. The primary table (`posts`) is updated directly
2. The existing history for `post_id=1` is stamped with a `history_ended_at`, so that we can see when the post had the title "My Great Post"
3. A new history record is appended to the table containing a complete snapshot of the record, and a `NULL` `history_ended_at`. That's because this is the current history.
4. A record of _who_ made the change is saved (`history_user_id`). You can join to your users table to see more data.

## Snapshots

Snapshots are particularly useful for two key use cases:

### 1. Time Travel & Auditing

When you need to see exactly what your data looked like at a specific point in time: not just individual records, but entire object graphs with all their associations. This is invaluable for:

- Debugging production issues ("What did the entire order look like when this happened?")
- Compliance requirements ("Show me the exact state of this patient's record on January 1st")
- Auditing complex workflows ("What was the state of this loan application when it was approved?")

### 2. Machine Learning & Analytics

When you need immutable snapshots of data for:

- Training data versioning
- Feature engineering
- Model validation
- A/B test analysis
- Ensuring reproducibility of results

### Taking Snapshots

You can take a snapshot of a record and all its associated records:

```ruby
post = Post.find(1)
post.snapshot(history_user_id: current_user.id)
```

This will:

1. Create a history record for the post
2. Create history records for all associated records (comments, author, etc.)
3. Link these history records together with a shared `snapshot_id`

You can retrieve the latest snapshot using:

```ruby
post = Post.find(1)
snapshot = post.latest_snapshot

# Access associated records from the snapshot
snapshot.comments # Returns CommentHistory records
snapshot.author   # Returns AuthorHistory record
```

Snapshots are immutable: you cannot modify history records that are part of a snapshot.
This guarantees that your historical data remains unchanged, which is crucial for both auditing and machine learning applications.

### Snapshot-Only Mode

If you want to only track snapshots and not record every individual change, you can configure Historiographer to operate in snapshot-only mode:

```ruby
Historiographer::Configuration.mode = :snapshot_only
```

In this mode:

- Regular updates/changes will not create history records
- Only explicit calls to `snapshot` will create history records
- Each snapshot still captures the complete state of the record and its associations

This can be useful when:

- You only care about specific points in time rather than every change
- You want to reduce the number of history records created
- You need to capture the state of complex object graphs at specific moments
- You're versioning training data for machine learning models
- You need to maintain immutable audit trails at specific checkpoints

## Single Table Inheritance (STI)

Historiographer fully supports Single Table Inheritance, both with the default `type` column and with custom inheritance columns.

### Default STI with `type` column

```ruby
class Post < ActiveRecord::Base
  include Historiographer
end

class PrivatePost < Post
end

# The history classes follow the same inheritance pattern:
class PostHistory < ActiveRecord::Base
  include Historiographer::History
end

class PrivatePostHistory < PostHistory
end
```

History records automatically maintain the correct STI type:

```ruby
private_post = PrivatePost.create(title: "Secret", history_user_id: current_user.id)
private_post.snapshot

# History records are the correct subclass
history = PostHistory.last
history.is_a?(PrivatePostHistory) #=> true
history.type #=> "PrivatePostHistory"
```

### Custom Inheritance Columns

You can also use a custom column for STI instead of the default `type`:

```ruby
class MLModel < ActiveRecord::Base
  self.inheritance_column = :model_type
  include Historiographer
end

class XGBoost < MLModel
  self.table_name = "ml_models"
end

# History classes use the same custom column
class MLModelHistory < MLModel
  self.inheritance_column = :model_type
  self.table_name = "ml_model_histories"
end

class XGBoostHistory < MLModelHistory
end
```

Migration for custom inheritance column:

```ruby
create_table :ml_models do |t|
  t.string :name
  t.string :model_type # Custom inheritance column
  t.jsonb :parameters
  t.timestamps

  t.index :model_type
end

create_table :ml_model_histories do |t|
  t.histories # Includes all columns from parent table
end
```

The custom inheritance column works just like the default `type`:

```ruby
model = XGBoost.create(name: "My Model", history_user_id: current_user.id)
model.snapshot

# History records maintain the correct subclass
history = MLModelHistory.last
history.is_a?(XGBoostHistory) #=> true
history.model_type #=> "XGBoostHistory"
```

### STI and Snapshots: Perfect for Model Versioning

Single Table Inheritance combined with Historiographer's snapshot feature is particularly powerful for versioning machine learning models and other complex systems that need immutable historical records. Here's why:

1. **Type-Safe History**: When you snapshot an ML model, both the model and its parameters are preserved with their exact implementation type. This ensures that when you retrieve historical versions, you get back exactly the right subclass with its specific behavior:

   ```ruby
   # Create and configure an XGBoost model
   model = XGBoost.create(
     name: "Customer Churn Predictor v1",
     parameters: { max_depth: 3, eta: 0.1 },
     history_user_id: current_user.id
   )

   # Take a snapshot before training
   model.snapshot

   # Update the model after training
   model.update(
     name: "Customer Churn Predictor v2",
     parameters: { max_depth: 5, eta: 0.2 },
     history_user_id: current_user.id
   )

   # Later, retrieve the exact pre-training version
   historical_model = MLModel.latest_snapshot
   historical_model.is_a?(XGBoostHistory) #=> true
   historical_model.parameters #=> { max_depth: 3, eta: 0.1 }
   ```

2. **Implementation Versioning**: Different model types often have different parameters, preprocessing steps, or scoring methods. STI ensures these differences are preserved in history:

   ```ruby
   class XGBoost < MLModel
     def predict(data)
       # XGBoost-specific prediction logic
     end
   end

   class RandomForest < MLModel
     def predict(data)
       # RandomForest-specific prediction logic
     end
   end

   # Your historical records maintain these implementation differences
   old_model = MLModel.latest_snapshot
   old_model.predict(data) # Uses the exact prediction logic from that point in time
   ```

3. **Reproducibility**: Essential for ML workflows where you need to reproduce results or audit model behavior:

   ```ruby
   # Create model and snapshot at each significant stage
   model = XGBoost.create(name: "Risk Scorer v1", history_user_id: current_user.id)

   # Snapshot after initial configuration
   model.snapshot(metadata: { stage: "configuration" })

   # Snapshot after training
   model.update(parameters: trained_parameters)
   model.snapshot(metadata: { stage: "post_training" })

   # Snapshot after validation
   model.update(parameters: validated_parameters)
   model.snapshot(metadata: { stage: "validated" })

   # Later, you can retrieve any version to reproduce results
   initial_version = model.histories.find_by(metadata: { stage: "configuration" })
   trained_version = model.histories.find_by(metadata: { stage: "post_training" })
   ```

This combination of STI and snapshots is particularly valuable for:

- Model governance and compliance
- A/B testing different model types
- Debugging model behavior
- Reproducing historical predictions
- Maintaining audit trails for regulatory requirements

## Namespaced Models

When using namespaced models, Rails handles foreign key naming differently than with non-namespaced models. For example, if you have a model namespaced like this:

```ruby
module EasyML
  class Dataset < ActiveRecord::Base
    self.table_name = "easy_ml_datasets"
  end
end
```

Rails will expect foreign keys to be formatted using just the model name (without the namespace) like this:

```ruby
:dataset_id
```

Therefore, when creating history migrations for namespaced models, you need to specify the foreign key name explicitly:

```ruby
class CreateEasyMLDatasetHistories < ActiveRecord::Migration
  def change
    create_table :easy_ml_dataset_histories do |t|
      # Use the class-based key instead of the table-derived easy_ml_dataset_id
      t.histories(foreign_key: :dataset_id)
    end
  end
end
```

This ensures that the foreign key relationships are properly established between your namespaced models and their history tables.

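To make the naming mismatch concrete, here is a rough plain-Ruby sketch of the two conventions involved. Both helper methods are hypothetical, written only for illustration; they approximate the class-name-based and table-name-based derivations and are not taken from Rails or Historiographer.

```ruby
# Plain-Ruby approximation of the two naming conventions described above.
# Both helpers are hypothetical and exist only to illustrate the mismatch.

# Key derived from the class name, with the namespace dropped.
def class_based_foreign_key(class_name)
  base = class_name.split("::").last
  snake = base.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
  "#{snake}_id"
end

# Key derived from the table name (naive singularization: strip one trailing "s").
def table_based_foreign_key(table_name)
  "#{table_name.sub(/s\z/, '')}_id"
end

puts class_based_foreign_key("EasyML::Dataset")   # dataset_id
puts table_based_foreign_key("easy_ml_datasets")  # easy_ml_dataset_id
```

The two results differ, which is why the history migration for a namespaced model passes `foreign_key:` explicitly rather than relying on a default.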
## Getting Started

Once you include the `Historiographer` module in an ActiveRecord model, you can insert, update, or delete data as you normally would.

```ruby
class Post < ActiveRecord::Base
  include Historiographer
end

class PostHistory < ActiveRecord::Base
  self.table_name = "post_histories"
  include Historiographer::History
end
```

### History Modes

Historiographer supports two modes of operation:

1. **:histories mode** (default) - Records history for every change to a record
2. **:snapshot_only mode** - Only records history when explicitly taking snapshots

You can configure the mode globally:

```ruby
# In an initializer
Historiographer::Configuration.mode = :histories # Default mode
# or
Historiographer::Configuration.mode = :snapshot_only
```

Or per model using `historiographer_mode`:

```ruby
class Post < ActiveRecord::Base
  include Historiographer
  historiographer_mode :snapshot_only # Only record history when .snapshot is called
end

class Comment < ActiveRecord::Base
  include Historiographer
  historiographer_mode :histories # Record history for every change (default)
end
```

The class-level mode setting takes precedence over the global configuration. This allows you to:

- Have different history tracking strategies for different models
- Set most models to use snapshots while keeping detailed history for critical models
- Optimize storage by only tracking detailed history where needed

For example:

```ruby
# Global setting for most models
Historiographer::Configuration.mode = :snapshot_only

class Order < ActiveRecord::Base
  include Historiographer
  # Uses global :snapshot_only mode
end

class Payment < ActiveRecord::Base
  include Historiographer
  historiographer_mode :histories # Override to record histories of every change
end
```

## Create A Migration

You need a separate table to store histories for each model.
So if you have a `Post` model:

```ruby
class CreatePosts < ActiveRecord::Migration
  def change
    create_table :posts do |t|
      t.string :title, null: false
      t.boolean :enabled
    end
    add_index :posts, :enabled
  end
end
```

You should create a table named `post_histories`:

```ruby
require "historiographer/postgres_migration"

class CreatePostHistories < ActiveRecord::Migration
  def change
    create_table :post_histories do |t|
      t.histories
    end
  end
end
```

The `t.histories` method will automatically create a table with the following columns:

- `id` (because every model has a primary key)
- `post_id` (because this is the foreign key)
- `title` (because it was on the original model)
- `enabled` (because it was on the original model)
- `history_started_at` (to denote when this history became the canonical version)
- `history_ended_at` (to denote when this history was no longer the canonical version, if it has stopped being the canonical version)
- `history_user_id` (to denote the user that made this change, if one is known)

Additionally it will add indices on:

- The same columns that had indices on the original model (e.g. `enabled`)
- `history_started_at`, `history_ended_at`, and `history_user_id`

## Models

The primary model should include `Historiographer`:

```ruby
class Post < ActiveRecord::Base
  include Historiographer
end
```

You should also make a `PostHistory` class if you're going to query `PostHistory` from Rails:

```ruby
class PostHistory < ActiveRecord::Base
  self.table_name = "post_histories"
  include Historiographer::History
end
```

The `Post` class will acquire a `histories` method, and the `PostHistory` model will gain a `post` method:

```ruby
p = Post.first
p.histories.first.class
# => PostHistory

p.histories.first.post == p
# => true
```

## Creating, Updating, and Destroying Data:

You can just use normal ActiveRecord methods, and all will record histories:

```ruby
Post.create(title: "My Great Title", history_user_id: current_user.id)
Post.find_by(title: "My Great Title").update(title: "A New Title", history_user_id: current_user.id)
Post.update_all(title: "They're all the same!", history_user_id: current_user.id)
Post.last.destroy!(history_user_id: current_user.id)
Post.destroy_all(history_user_id: current_user.id)
```

The history classes have a `current` method, which only finds current history records. These records will also be the same as the data in the primary table.

```ruby
p = Post.first
p.current_history

PostHistory.current
```

### What to do when generated index names are too long

Sometimes the generated index names are too long. Just like with standard Rails migrations, you can override the name of the index to fix this problem.
To do so, use the `index_names` argument to override individual index names:

```ruby
require "historiographer/postgres_migration"

class CreatePostHistories < ActiveRecord::Migration
  def change
    create_table :post_histories do |t|
      t.histories index_names: {
        title: "my_index_name",
        [:compound, :index] => "my_compound_index_name"
      }
    end
  end
end
```

## MySQL Install

For contributors on OSX, you may have difficulty installing mysql:

```
gem install mysql2 -v '0.4.10' --source 'https://rubygems.org/' -- --with-ldflags=-L/usr/local/opt/openssl/lib --with-cppflags=-I/usr/local/opt/openssl/include
```

## Copyright

Copyright (c) 2016-2020 brettshollenberger. See LICENSE.txt for further details.