README.md in eps-0.1.0 vs README.md in eps-0.1.1

- old
+ new

@@ -4,11 +4,14 @@

 - Build models quickly and easily
 - Serve models built in Ruby, Python, R, and more
 - Automatically handles categorical variables
 - No external dependencies
+- Works great with the SciRuby ecosystem (Daru & IRuby)

+[![Build Status](https://travis-ci.org/ankane/eps.svg?branch=master)](https://travis-ci.org/ankane/eps)
+
 ## Installation

 Add this line to your application’s Gemfile:

 ```ruby
@@ -56,10 +59,18 @@
 ```ruby
 split_date = Date.parse("2018-06-01")
 train_set, test_set = houses.partition { |h| h.sold_at < split_date }
 ```

+### Outliers and Missing Data
+
+Next, decide what to do with outliers and missing data. There are a number of methods for handling them, but the easiest is to remove them.
+
+```ruby
+train_set.reject! { |h| h.bedrooms.nil? || h.price < 10000 }
+```
+
 ### Feature Engineering

 Selecting features for a model is extremely important for performance. Features can be numeric or categorical. For categorical features, there’s no need to create dummy variables - just pass the data as strings.

 ```ruby
@@ -85,46 +96,70 @@
 ```ruby
 def features(house)
   {
     bedrooms: house.bedrooms,
     city_id: house.city_id.to_s,
-    month: house.sold_at.strftime("%b"),
-    price: house.price
+    month: house.sold_at.strftime("%b")
   }
 end

-train_data = train_set.map { |h| features(h) }
+train_features = train_set.map { |h| features(h) }
 ```

+> We use a method for features so it can be used across training, evaluation, and prediction
+
+We also need to prepare the target variable.
+
+```ruby
+def target(house)
+  house.price
+end
+
+train_target = train_set.map { |h| target(h) }
+```
+
 ### Training

-Once we have some features, let’s train the model.
+Now, let’s train the model.

 ```ruby
-model = Eps::Regressor.new(train_data, target: :price)
+model = Eps::Regressor.new(train_features, train_target)
 puts model.summary
 ```

 The summary includes the coefficients and their significance. The lower the p-value, the more significant the feature is. p-values below 0.05 are typically considered significant. It also shows the adjusted r-squared, which is a measure of how well the model fits the data. The higher the number, the better the fit. Here’s a good explanation of why it’s [better than r-squared](https://www.quora.com/What-is-the-difference-between-R-squared-and-Adjusted-R-squared).

 ### Evaluation

 When you’re happy with the model, see how well it performs on the test set. This gives us an idea of how well it’ll perform on unseen data.

 ```ruby
-test_data = test_set.map { |h| features(h) }
-model.evaluate(test_data)
+test_features = test_set.map { |h| features(h) }
+test_target = test_set.map { |h| target(h) }
+model.evaluate(test_features, test_target)
 ```

 This returns:

-- RSME - Root mean square error
+- RMSE - Root mean square error
 - MAE - Mean absolute error
 - ME - Mean error

 We want to minimize the RMSE and MAE and keep the ME around 0.

+### Finalize
+
+Now that we have an idea of how the model will perform, we want to retrain the model with all of our data.
+
+```ruby
+all_features = houses.map { |h| features(h) }
+all_target = houses.map { |h| target(h) }
+model = Eps::Regressor.new(all_features, all_target)
+```
+
+We now have a model that’s ready to serve.
+
 ## Serving Models

 Once the model is trained, all we need are the coefficients to make predictions. You can dump them as a Ruby object or JSON. For Ruby, use:

 ```ruby
@@ -176,10 +211,12 @@
 ```ruby
 data = File.read("model.pmml")
 model = Eps::Regressor.load_pmml(data)
 ```

+> Loading PMML requires Nokogiri to be installed
+
 [PFA](http://dmg.org/pfa/) - Portable Format for Analytics

 ```ruby
 data = File.read("model.pfa")
 model = Eps::Regressor.load_pfa(data)
@@ -323,9 +360,13 @@
 When importing data from CSV files, be sure to convert numeric fields. The `table` method does this automatically.

 ```ruby
 CSV.table("data.csv").map { |row| row.to_h }
 ```
+
+## Jupyter & IRuby
+
+You can use [IRuby](https://github.com/SciRuby/iruby) to run Eps in [Jupyter](https://jupyter.org/) notebooks. Here’s how to get [IRuby working with Rails](https://github.com/ankane/shorts/blob/master/Jupyter-Rails.md).

 ## Reference

 Get coefficients