README.md in eps-0.2.1 vs README.md in eps-0.3.0
- old
+ new
@@ -2,13 +2,11 @@
Machine learning for Ruby
- Build predictive models quickly and easily
- Serve models built in Ruby, Python, R, and more
-- Supports regression (linear regression) and classification (naive Bayes)
-- Automatically handles categorical features
-- Works great with the SciRuby ecosystem (Daru & IRuby)
+- No prior knowledge of machine learning required :tada:
Check out [this post](https://ankane.org/rails-meet-data-science) for more info on machine learning with Rails
[![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)
@@ -18,12 +16,16 @@
```ruby
gem 'eps'
```
-To speed up training on large datasets, you can also [add GSL](#training-performance).
+On Mac, also install OpenMP:
+```sh
+brew install libomp
+```
+
## Getting Started
Create a model
```ruby
@@ -41,164 +43,123 @@
```ruby
model.predict(bedrooms: 2, bathrooms: 1)
```
-> Pass an array of hashes make multiple predictions at once
+Store the model
-The target can be numeric (regression) or categorical (classification).
-
-## Building Models
-
-### Training and Test Sets
-
-When building models, it’s a good idea to hold out some data so you can see how well the model will perform on unseen data. To do this, we split our data into two sets: training and test. We build the model with the training set and later evaluate it on the test set.
-
```ruby
-split_date = Date.parse("2018-06-01")
-train_set, test_set = houses.partition { |h| h.sold_at < split_date }
+File.write("model.pmml", model.to_pmml)
```
-If your data doesn’t have a time associated with it, you can split it randomly.
+Load the model
```ruby
-rng = Random.new(1) # seed random number generator
-train_set, test_set = houses.partition { rng.rand < 0.7 }
+pmml = File.read("model.pmml")
+model = Eps::Model.load_pmml(pmml)
```
-### Outliers and Missing Data
+A few notes:
-Next, decide what to do with outliers and missing data. There are a number of methods for handling them, but the easiest is to remove them.
+- The target can be numeric (regression) or categorical (classification)
+- Pass an array of hashes to `predict` to make multiple predictions at once (see the example after these notes)
+- Models are stored in [PMML](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language), a standard for model storage
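+
+For example, to predict prices for several houses at once (a minimal sketch; the second hash is illustrative):
+
+```ruby
+model.predict([
+  {bedrooms: 2, bathrooms: 1},
+  {bedrooms: 4, bathrooms: 2.5}
+])
+```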
-```ruby
-train_set.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
-```
+## Building Models
-### Feature Engineering
+### Goal
-Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.
+Often, the goal of building a model is to make good predictions on future data. To help achieve this, Eps splits the data into training and validation sets if you have 30+ data points. It uses the training set to build the model and the validation set to evaluate the performance.
+If your data has a time associated with it, it’s highly recommended to use that field for the split.
+
```ruby
-{state: "CA"}
+Eps::Model.new(data, target: :price, split: :listed_at)
```
-> Categorical features generate coefficients for each distinct value except for one
+Otherwise, the split is random. There are a number of [other options](#validation-options) as well.
-Convert any ids to strings so they’re treated as categorical features.
+Performance is reported in the summary.
-```ruby
-{city_id: city_id.to_s}
-```
+- For regression, it reports validation RMSE (root mean squared error) - lower is better
+- For classification, it reports validation accuracy - higher is better
-For times, create features like day of week and hour of day with:
+Typically, the best way to improve performance is feature engineering.
-```ruby
-{weekday: time.wday.to_s, hour: time.hour.to_s}
-```
+### Feature Engineering
-In practice, your code may look like:
+Features are extremely important for model performance. Features can be:
-```ruby
-def features(house)
- {
- bedrooms: house.bedrooms,
- city_id: house.city_id.to_s,
- month: house.sold_at.strftime("%b")
- }
-end
+1. numeric
+2. categorical
+3. text
-train_features = train_set.map { |h| features(h) }
-```
+#### Numeric
-> We use a method for features so it can be used across training, evaluation, and prediction
+For numeric features, use any numeric type.
-We also need to prepare the target variable.
-
```ruby
-def target(house)
- house.price
-end
-
-train_target = train_set.map { |h| target(h) }
+{bedrooms: 4, bathrooms: 2.5}
```
-### Training
+#### Categorical
-Now, let’s train the model.
+For categorical features, use strings or booleans.
```ruby
-model = Eps::Model.new(train_features, train_target)
-puts model.summary
+{state: "CA", basement: true}
```
-For regression, the summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).
+Convert any ids to strings so they’re treated as categorical features.
-### Evaluation
+```ruby
+{city_id: city_id.to_s}
+```
-When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.
+For dates, create features like day of week and month.
```ruby
-test_features = test_set.map { |h| features(h) }
-test_target = test_set.map { |h| target(h) }
-model.evaluate(test_features, test_target)
+{weekday: sold_on.strftime("%a"), month: sold_on.strftime("%b")}
```
-For regression, this returns:
+For times, create features like day of week and hour of day.
-- RMSE - Root mean square error
-- MAE - Mean absolute error
-- ME - Mean error
-
-We want to minimize the RMSE and MAE and keep the ME around 0.
-
-For classification, this returns:
-
-- Accuracy
-
-We want to maximize the accuracy.
-
-### Finalize
-
-Now that we have an idea of how the model will perform, we want to retrain the model with all of our data. Treat outliers and missing data the same as you did with the training set.
-
```ruby
-# outliers and missing data
-houses.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
-
-# training
-all_features = houses.map { |h| features(h) }
-all_target = houses.map { |h| target(h) }
-model = Eps::Model.new(all_features, all_target)
+{weekday: listed_at.strftime("%a"), hour: listed_at.hour.to_s}
```
-We now have a model that’s ready to serve.
+#### Text
-## Serving Models
+For text features, use strings with multiple words.
-Once the model is trained, we need to store it. Eps uses PMML - [Predictive Model Markup Language](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) - a standard for storing models. A great option is to write the model to a file with:
-
```ruby
-File.write("model.pmml", model.to_pmml)
+{description: "a beautiful house on top of a hill"}
```
-> You may need to add `nokogiri` to your Gemfile
+This creates features based on word count (term frequency).
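+
+Conceptually, each word in the string becomes a count feature. A rough sketch of term frequency in plain Ruby (not Eps internals):
+
+```ruby
+"a beautiful house on top of a hill".split(/\s+/).tally
+# => {"a"=>2, "beautiful"=>1, "house"=>1, "on"=>1, "top"=>1, "of"=>1, "hill"=>1}
+```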
-To load a model, use:
+You can specify text features explicitly with:
```ruby
-pmml = File.read("model.pmml")
-model = Eps::Model.load_pmml(pmml)
+Eps::Model.new(data, target: :price, text_features: [:description])
```
-Now we can use it to make predictions.
+You can set advanced options with:
```ruby
-model.predict(bedrooms: 2, bathrooms: 1)
+text_features: {
+ description: {
+ min_occurences: 5,
+ max_features: 1000,
+ min_length: 1,
+ case_sensitive: true,
+ tokenizer: /\s+/,
+ stop_words: ["and", "the"]
+ }
+}
```
-To continuously train models, we recommend [storing them in your database](#database-storage).
-
## Full Example
We recommend putting all the model code in a single file. This makes it easy to rebuild the model as needed.
In Rails, we recommend creating an `app/ml_models` directory. Be sure to restart Spring after creating the directory so files are autoloaded.
@@ -210,66 +171,40 @@
Here’s what a complete model in `app/ml_models/price_model.rb` may look like:
```ruby
class PriceModel < Eps::Base
def build
- houses = House.all.to_a
+ houses = House.all
- # divide into training and test set
- split_date = Date.parse("2018-06-01")
- train_set, test_set = houses.partition { |h| h.sold_at < split_date }
-
- # handle outliers and missing values
- train_set = preprocess(train_set)
-
# train
- train_features = train_set.map { |v| features(v) }
- train_target = train_set.map { |v| target(v) }
- model = Eps::Model.new(train_features, train_target)
+ data = houses.map { |v| features(v) }
+ model = Eps::Model.new(data, target: :price, split: :listed_at)
puts model.summary
- # evaluate
- test_features = test_set.map { |v| features(v) }
- test_target = test_set.map { |v| target(v) }
- metrics = model.evaluate(test_features, test_target)
- puts "Test RMSE: #{metrics[:rmse]}"
- # for classification, use:
- # puts "Test accuracy: #{(100 * metrics[:accuracy]).round}%"
-
- # finalize
- houses = preprocess(houses)
- all_features = houses.map { |h| features(h) }
- all_target = houses.map { |h| target(h) }
- model = Eps::Model.new(all_features, all_target)
-
- # save
+ # save to file
File.write(model_file, model.to_pmml)
- @model = nil # reset for future predictions
+
+ # ensure reloads from file
+ @model = nil
end
def predict(house)
model.predict(features(house))
end
private
- def preprocess(train_set)
- train_set.reject { |h| h.bedrooms.nil? || h.price < 10000 }
- end
-
def features(house)
{
bedrooms: house.bedrooms,
city_id: house.city_id.to_s,
- month: house.sold_at.strftime("%b")
+ month: house.listed_at.strftime("%b"),
+ listed_at: house.listed_at,
+ price: house.price
}
end
- def target(house)
- house.price
- end
-
def model
@model ||= Eps::Model.load_pmml(File.read(model_file))
end
def model_file
@@ -296,94 +231,50 @@
We recommend monitoring how well your models perform over time. To do this, save your predictions to the database. Then, compare them with:
```ruby
actual = houses.map(&:price)
-estimated = houses.map(&:estimated_price)
-Eps.metrics(actual, estimated)
+predicted = houses.map(&:predicted_price)
+Eps.metrics(actual, predicted)
```
-This returns the same evaluation metrics as model building. For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0. For accuracy, alert if it drops below a certain threshold.
+For RMSE and MAE, alert if they rise above a certain threshold. For ME, alert if it moves too far away from 0. For accuracy, alert if it drops below a certain threshold.
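+
+A minimal sketch of such a check for regression, reusing `actual` and `predicted` from above (the threshold is illustrative and depends on your data):
+
+```ruby
+metrics = Eps.metrics(actual, predicted)
+raise "RMSE too high: #{metrics[:rmse]}" if metrics[:rmse] > 50_000
+```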
## Other Languages
-Eps makes it easy to serve models from other languages. You can build models in R, Python, and others and serve them in Ruby without having to worry about how to deploy or run another language.
+Eps makes it easy to serve models from other languages. You can build models in Python, R, and others and serve them in Ruby without having to worry about how to deploy or run another language.
-Eps can serve linear regression and Naive bayes models. Check out [Scoruby](https://github.com/asafschers/scoruby) to serve other models.
+Eps can serve LightGBM, linear regression, and naive Bayes models. Check out [ONNX Runtime](https://github.com/ankane/onnxruntime) and [Scoruby](https://github.com/asafschers/scoruby) to serve other models.
-### R
-
-To create a model in R, install the [pmml](https://cran.r-project.org/package=pmml) package
-
-```r
-install.packages("pmml")
-```
-
-For regression, run:
-
-```r
-library(pmml)
-
-model <- lm(dist ~ speed, cars)
-
-# save model
-data <- toString(pmml(model))
-write(data, file="model.pmml")
-```
-
-For classification, run:
-
-```r
-library(pmml)
-library(e1071)
-
-model <- naiveBayes(Species ~ ., iris)
-
-# save model
-data <- toString(pmml(model, predictedField="Species"))
-write(data, file="model.pmml")
-```
-
### Python
To create a model in Python, install the [sklearn2pmml](https://github.com/jpmml/sklearn2pmml) package
```sh
pip install sklearn2pmml
```
-For regression, run:
+And check out the examples:
-```python
-from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
-from sklearn.linear_model import LinearRegression
+- [LightGBM Regression](test/support/python/lightgbm_regression.py)
+- [LightGBM Classification](test/support/python/lightgbm_classification.py)
+- [Linear Regression](test/support/python/linear_regression.py)
+- [Naive Bayes](test/support/python/naive_bayes.py)
-x = [1, 2, 3, 5, 6]
-y = [5 * xi + 3 for xi in x]
+### R
-model = LinearRegression()
-model.fit([[xi] for xi in x], y)
+To create a model in R, install the [pmml](https://cran.r-project.org/package=pmml) package
-# save model
-sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
+```r
+install.packages("pmml")
```
-For classification, run:
+And check out the examples:
-```python
-from sklearn2pmml import sklearn2pmml, make_pmml_pipeline
-from sklearn.naive_bayes import GaussianNB
+- [Linear Regression](test/support/r/linear_regression.R)
+- [Naive Bayes](test/support/r/naive_bayes.R)
-x = [1, 2, 3, 5, 6]
-y = ["ham", "ham", "ham", "spam", "spam"]
-
-model = GaussianNB()
-model.fit([[xi] for xi in x], y)
-
-sklearn2pmml(make_pmml_pipeline(model), "model.pmml")
-```
-
### Verifying
It’s important for features to be implemented consistently when serving models created in other languages. We highly recommend verifying this programmatically. Create a CSV file with ids and predictions from the original model.
house_id | prediction
@@ -411,42 +302,63 @@
putc "✓"
end
```
-## Database Storage
+## Data
-The database is another place you can store models. It’s good if you retrain models automatically.
+A number of data formats are supported. You can pass the target variable separately.
-> We recommend adding monitoring and guardrails as well if you retrain automatically
+```ruby
+x = [{x: 1}, {x: 2}, {x: 3}]
+y = [1, 2, 3]
+Eps::Model.new(x, y)
+```
-Create an ActiveRecord model to store the predictive model.
+Or pass arrays of arrays
-```sh
-rails g model Model key:string:uniq data:text
+```ruby
+x = [[1, 2], [2, 0], [3, 1]]
+y = [1, 2, 3]
+Eps::Model.new(x, y)
```
-Store the model with:
+### Daru
+Eps works well with Daru data frames.
+
```ruby
-store = Model.where(key: "price").first_or_initialize
-store.update(data: model.to_pmml)
+df = Daru::DataFrame.from_csv("houses.csv")
+Eps::Model.new(df, target: "price")
```
-Load the model with:
+### CSVs
+When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
+
```ruby
-data = Model.find_by!(key: "price").data
-model = Eps::Model.load_pmml(data)
+CSV.table("data.csv").map { |row| row.to_h }
```
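+
+The resulting array of hashes can then be passed straight to `Eps::Model` (a sketch, assuming the CSV has a `price` column):
+
+```ruby
+data = CSV.table("data.csv").map { |row| row.to_h }
+Eps::Model.new(data, target: :price)
+```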
-## Training Performance
+## Algorithms
-Speed up training on large datasets with GSL.
+Pass an algorithm with:
-First, [install GSL](https://www.gnu.org/software/gsl/). With Homebrew, you can use:
+```ruby
+Eps::Model.new(data, algorithm: :linear_regression)
+```
+Eps supports:
+
+- LightGBM (default)
+- Linear Regression
+- Naive Bayes
+
+### Linear Regression
+
+To speed up training on large datasets with linear regression, [install GSL](https://www.gnu.org/software/gsl/). With Homebrew, you can use:
+
```sh
brew install gsl
```
Then, add this line to your application’s Gemfile:
@@ -455,68 +367,96 @@
gem 'gsl', group: :development
```
It only needs to be available in environments used to build the model.
-> This only speeds up regression, not classification
+## Validation Options
-## Data
+Pass your own validation set with:
-A number of data formats are supported. You can pass the target variable separately.
+```ruby
+Eps::Model.new(data, validation_set: validation_set)
+```
+Split on a specific value
+
```ruby
-x = [{x: 1}, {x: 2}, {x: 3}]
-y = [1, 2, 3]
-Eps::Model.new(x, y)
+Eps::Model.new(data, split: {column: :listed_at, value: Date.parse("2019-01-01")})
```
-Or pass arrays of arrays
+Specify the validation set size (the default is `0.25`, which is 25%)
```ruby
-x = [[1, 2], [2, 0], [3, 1]]
-y = [1, 2, 3]
-Eps::Model.new(x, y)
+Eps::Model.new(data, split: {validation_size: 0.2})
```
-## Daru
+## Database Storage
-Eps works well with Daru data frames.
+The database is another place you can store models. It’s good if you retrain models automatically.
-```ruby
-df = Daru::DataFrame.from_csv("houses.csv")
-Eps::Model.new(df, target: "price")
+> We recommend adding monitoring and guardrails as well if you retrain automatically
+
+Create an ActiveRecord model to store the predictive model.
+
+```sh
+rails g model Model key:string:uniq data:text
```
-To split into training and test sets, use:
+Store the model with:
```ruby
-rng = Random.new(1) # seed random number generator
-train_index = houses.map { rng.rand < 0.7 }
-train_set = houses.where(train_index)
-test_set = houses.where(train_index.map { |v| !v })
+store = Model.where(key: "price").first_or_initialize
+store.update(data: model.to_pmml)
```
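+
+If you retrain automatically, one possible guardrail is to only store the new model when it performs acceptably on held-out data (a sketch; `holdout`, the `features` helper, and the threshold are illustrative):
+
+```ruby
+predicted = holdout.map { |h| model.predict(features(h)) }
+metrics = Eps.metrics(holdout.map(&:price), predicted)
+
+# only promote the new model if it beats the threshold
+if metrics[:rmse] < 50_000
+  store = Model.where(key: "price").first_or_initialize
+  store.update(data: model.to_pmml)
+end
+```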
-## CSVs
+Load the model with:
-When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.
-
```ruby
-CSV.table("data.csv").map { |row| row.to_h }
+data = Model.find_by!(key: "price").data
+model = Eps::Model.load_pmml(data)
```
## Jupyter & IRuby
You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://ankane.org/jupyter-rails).
-## Reference
+## Upgrading
-Get an extended summary with standard error, t-values, and r-squared
+## 0.3.0
-```ruby
-model.summary(extended: true)
-```
+Eps 0.3.0 brings a number of improvements, including support for LightGBM and automatic validation. There are a number of breaking changes to be aware of:
-## Upgrading
+- LightGBM is now the default for new models. On Mac, run:
+
+ ```sh
+ brew install libomp
+ ```
+
+ Pass the `algorithm` option to use linear regression or naive Bayes.
+
+ ```ruby
+ Eps::Model.new(data, algorithm: :linear_regression) # or :naive_bayes
+ ```
+
+- Validation happens automatically by default. You no longer need to create training and test sets manually. If you were splitting on a time, use:
+
+ ```ruby
+ Eps::Model.new(data, split: {column: :listed_at, value: Date.parse("2019-01-01")})
+ ```
+
+ Or randomly, use:
+
+ ```ruby
+ Eps::Model.new(data, split: {validation_size: 0.3})
+ ```
+
+ To continue splitting manually, use:
+
+ ```ruby
+ Eps::Model.new(data, validation_set: test_set)
+ ```
+
+- It’s no longer possible to load models in JSON or PFA formats. Retrain models and save them as PMML.
## 0.2.0
Eps 0.2.0 brings a number of improvements, including support for classification.