---
title: JLDrill's Scheduling Strategy
in_menu: true
sort_info: 85
---

The Goal of JLDrill
===================

JLDrill has two types of scheduling strategies. The first is for short term acquisition of new material. In this mode the item is repeatedly shown to the user until the user can remember it correctly a number of times in a row. The second mode is for long term review. In this mode JLDrill uses spaced repetition to occasionally review an item with the user over days, weeks or months.

The goal of JLDrill is to maximize the learning rate of items using a spaced repetition algorithm. In the following two sections I will describe what I mean by that.

What is Spaced Repetition?
--------------------------

Spaced repetition is the presentation of material separated by spaces of increasingly large duration. These spaces are measured in terms of minutes, hours, days, and months.

Initially a user sees a new item and tries to remember it. If the user can remember the item, the system will schedule the item for review at a later time. After some time passes (a space) the user reviews the item again. If they get it right again, the system schedules a new review at an even later time. Each time the user remembers the item, the space between reviews gets longer. But if the user forgets the item, then the system starts again with a very short space between reviews.

The technique of spaced repetition is based on the concept of a "forgetting curve". When someone memorizes something, immediately after memorizing it the chance of remembering is very high. As time passes, though, the chance of remembering the item falls. The speed with which people forget something is called the forgetting curve.

The first time you see something, the curve is very steep and you are likely to forget it after only a short time. Each time you remember an item, though, the speed of forgetting it slows down. That is, the forgetting curve becomes less steep.
Ideally we want a very shallow forgetting curve so that even if we don't see an item for several months, we are still likely to remember it. The intent of a spaced repetition algorithm is to schedule reviews of an item so that the forgetting curve becomes less steep. But we also want to minimize the number of reviews so that we don't waste time reviewing something we already know.

Learning Rate
-------------

Your ability to remember an item improves every time you correctly remember it. But it takes time to remember an item. Every time you review an item it might take you 10 or 20 seconds. A database of 5,000 items (a reasonable amount for learning a language) would take roughly 14 to 28 hours to review even once. Obviously we can't review every item all the time. We want to pick and choose the items that require review the most.

Since the forgetting curve gets less steep every time you remember an item, you can increase the amount of time between reviews. Maybe the first time you wait 1 day. The second time, 2 days. Each time you get it right, you double the time. In this way you would be able to go a whole month without reviewing the item after reviewing it only 6 times (an investment of a little over a minute).

But if you forget the item, you have to start at the beginning, meaning you have wasted all the previous effort. So you have to minimize the number of mistakes you make. This means reviewing with short spaces so that you have a high probability of remembering. Obviously there is a balance to maintain.

The goal of JLDrill is to maximize the number of items learned in a given amount of time. Since there is always a chance that you will forget an item, we have to define what "learned" means. In this context it means being able to remember the item after a space of 30 days.
I will define the "learning rate" to be the number of items "learned" (i.e., items that have been successfully remembered after a space of at least 30 days) divided by the total amount of time invested in those items. The goal of JLDrill is to maximize this learning rate.

Note: JLDrill doesn't actually measure the learning rate right now. It certainly should.

Details of JLDrill's Strategies
===============================

In this section I will describe JLDrill's strategies in more detail. First we need some basic definitions. There are three types of vocabulary items in JLDrill:

- A new item is an item that you have never seen before.
- A working item is an item that you have seen, but haven't completely memorized yet. You may have memorized it once before, but if so, you have since forgotten it.
- A review item is an item that you have memorized and that you are likely to remember in the future.

In the program, these items are organized into 3 analogous sets: the "new set", the "working set" and the "review set". Initially items are moved from the new set into the working set. The working set is of a limited size, usually only 10 or 15 items. You simply review the items in the working set a number of times until you demonstrate that you have memorized them. Once an item has been memorized, JLDrill moves it into the review set.

JLDrill is organized a little bit differently than most other spaced repetition programs. In other programs, the activity centers on reviewing items in what JLDrill calls the review set. JLDrill, on the other hand, focuses on the working set. It tries to keep the working set full. Once you move an item to the review set, JLDrill will drill you on items from the review set until you make a mistake. The item you forgot is moved into the working set so that it can be re-memorized.

However, it is not beneficial to review items from the review set exclusively.
As you may have noticed in the previous section, there are diminishing returns for reviewing items. JLDrill keeps track of how often you get items correct in the review set and once you reach a rate of about 90% it stops drilling items. Instead, when a working set item is memorized, it is replaced with an item from the new set. In this way, JLDrill tries to focus activity on learning new and forgotten items, while retaining a recall rate in the review set of about 90%. Details of how it does this follow.

Learning Working Set Items
--------------------------

The purpose of the working set is to acquire new or forgotten items. Theoretically, one could use spaced repetition to acquire new items. However, this gets to be a bit problematic because very short spaces between items might not be convenient for the user. The user is dedicated to the application while using it, so we might as well review items continuously.

JLDrill simply creates a set of items (by default 10, but configurable by the user). Each item is presented to the user once in random order. Then they are randomized and presented again. If the user gets the item correct a number of times in a row (6 by default), the item is "promoted" to the review set.

Since JLDrill is designed to drill Japanese vocabulary, there are 3 levels in the working set. In the first level, the user is shown the kanji and reading for a word and must guess the meaning. In the second level the user is shown the kanji and must guess the meaning and reading. In the third level, the user is shown the meaning and must guess the kanji and reading. Each level must be answered correctly a number of times (by default 2) before it is promoted to the next level. After successfully answering the third level the requisite number of times, the item is promoted to the review set. If the user makes a mistake, the item goes back to level one.

The forgetting curve for new items is very steep.
The exact amount of time required between reviews depends a lot on the user and the type of items involved. The user can manipulate the space between reviews by altering the size of the working set. Since each item in the set is reviewed once before they are all shown again, having fewer items means the items will be repeated faster.

Reviewing Items in the Review Set
---------------------------------

JLDrill tries to maximize the retention of learned items while minimizing the cost of review by trying to keep the chance that you can remember items in the review set at 90% or above. It does this by roughly ordering the items by their probability of success and only offering items for review when the probability drops below 90%.

In most spaced repetition programs items are scheduled for review. When the item has waited long enough, the item is reviewed. The algorithms that do this make a lot of assumptions about the shape of the forgetting curve in various circumstances.

JLDrill takes a much simpler approach. JLDrill simply tries to roughly order the items by the probability that each item will be forgotten. It then offers the items for review in that order until the user demonstrates that they can correctly guess around 90% of them. At that point it stops offering items for review, and new items are used instead.

The algorithm for ordering the items is also very simple. Recall that the probability of remembering an item is a curve. At the start, the chance of remembering an item is 100%. As time passes the chance of remembering falls. This curve is not linear, but the part of the curve from 100% to 90% is very, very close to linear.

The JLDrill algorithm creates a potential schedule for each item. It does this by multiplying the time that elapsed since the item was last successfully reviewed by a factor. It then sorts the items based on the percentage of time that has elapsed in that potential schedule duration.
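
The ordering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not JLDrill's actual code: the names `ReviewItem`, `make_potential_schedule` and `review_order` are hypothetical, and the multiplication factor is fixed at 2 for simplicity.

```python
from dataclasses import dataclass


@dataclass
class ReviewItem:
    """Hypothetical record for one review set item (not JLDrill's own type)."""
    name: str
    potential_schedule_days: float  # duration of the item's potential schedule
    days_waited: float              # time elapsed since the last correct review

    def fraction_elapsed(self) -> float:
        """Fraction of the potential schedule that has already elapsed."""
        return self.days_waited / self.potential_schedule_days


def make_potential_schedule(previous_wait_days: float, factor: float) -> float:
    """Potential schedule: the time waited before a correct answer, times a factor."""
    return previous_wait_days * factor


def review_order(items):
    """Items with the largest fraction of their schedule elapsed come first."""
    return sorted(items, key=lambda item: item.fraction_elapsed(), reverse=True)


# Assumed factor of 2.0 for every item; the wait times are made-up values.
items = [
    ReviewItem("slow", make_potential_schedule(40.0, 2.0), days_waited=4.0),    # 5% elapsed
    ReviewItem("fast", make_potential_schedule(3.0, 2.0), days_waited=3.0),     # 50% elapsed
    ReviewItem("medium", make_potential_schedule(10.0, 2.0), days_waited=5.0),  # 25% elapsed
]
ordered_names = [item.name for item in review_order(items)]
print(ordered_names)  # prints ['fast', 'medium', 'slow']
```

The sort key is the only thing that matters here: the potential schedule is never used as a due date, only as the denominator for the fraction elapsed.
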
The factor used for multiplying depends on how much time has elapsed (see below), but for the following example, let's say the factor is 2.

For example, imagine there are two items. Item A has waited 5 days since the last review. Item B has waited 50 days since the last review. Both items are guessed correctly and a new schedule is made, each for twice the previous wait time. Item A is scheduled for 10 days in the future. Item B is scheduled for 100 days in the future. After waiting one more day, Item A is 10% through its schedule, while Item B is 1% through its schedule. Item A is sorted before Item B.

Note that neither item is likely to be reviewed on its actual scheduled date. The schedule is only used for creating a rough estimate of the items' probability of success. Each time the application is used, the items will be presented one after another (highest percentage of time used first) until an actual measurement of 90% success is achieved.

Prioritizing Review
-------------------

Because JLDrill doesn't rely on estimating the exact time an item should be reviewed, the amount of time it chooses for its potential schedule is less important. This potential schedule is only used to order the items, not to determine when they will be reviewed. In fact, for long duration items the exact order isn't even very important because the chance of remembering them drops off very slowly.

For example, imagine an item will drop from a 100% recall rate to a 90% recall rate in 30 days. Since this part of the forgetting curve is very close to linear we can say that the chance of forgetting increases by 1% every three days. So even if the item is delayed for 9 days, the chance of remembering only drops to 87%. In other words, getting the schedule wrong by 30% affects the chance of remembering by only 3%. As long as these kinds of items are roughly ordered, any deviations from perfection will be unnoticeable.

Short duration items are more problematic.
Since the slope of the curve is very steep, a delay of even a few days can easily put the item in the non-linear part of the forgetting curve. Thus it is important to increasingly prioritize short duration items as time passes. JLDrill does this by ordering the items by percentage of time elapsed in the potential schedule. For example, if there are 2 items, one with a potential schedule of 2 days and another with a potential schedule of 10 days, after waiting one day, the first has waited 50% of its schedule while the other has waited 10%. After one more day, the first item is at 100% of its schedule while the second is only at 20%. In this way, short duration items bubble to the top of the priority list as time passes.

Determining the Potential Schedule
----------------------------------

Again, getting the exact scheduling time is not necessary. JLDrill orders the items roughly by percentage chance of remembering and increasingly prioritizes short duration items as time passes. Then the items are drilled until a 90% success rate is achieved. Because of this, as long as the potential schedule is consistent, it doesn't have to be correct.

JLDrill measures the amount of time between correct answers and creates a new potential schedule by multiplying that time by a factor. Note that this potential schedule is almost certainly wrong. However, it is consistently wrong and will place the items in roughly the correct order.

The factor it multiplies by depends on the time that has elapsed. An item that was reviewed less than a day ago will have a factor of 2. In other words, when such an item is correctly remembered, the new potential schedule is twice the duration that it had waited. But for items that waited longer, the factor is reduced linearly, until at 180 days the factor is 1. In other words, items that have waited 180 days or more will be scheduled for the same amount of time they have waited.

There are several reasons for changing the multiplication factor.
The first is that it is useful to review known items on a regular basis, and backing off the multiplication achieves this. Old items are reviewed about every 180 days. The second is that the cost of reviewing an old item is small, while the penalty for forgetting an old item is great. So it makes sense to spend a little bit of extra effort on old items.

New items present a problem. They have not been scheduled before and it is difficult to determine how they relate to each other and to the old items already in the schedule. Because of this, JLDrill does not use a fixed starting point for the first schedule. Instead it uses a value anywhere from 0 to 5 days depending on how many times the item was incorrectly remembered in the working set. The more times it was incorrect, the closer to 0 the item will be scheduled. This creates a step function for the first interval.

But this also causes a problem. Items that were guessed correctly almost every time will end up with the exact same potential schedule. This means that they will be presented to the user in the same order every time. It is desirable to mix up the items every time they are presented. To achieve this, the schedule is always varied randomly by ±10%. This creates separation between the items and allows other items to be inserted over time. This variation is used even when rescheduling old items. This allows items which are guessed identically to end up in completely different places after only a few generations of scheduling.

Finally, if a user neglects to study, the amount of time since the last review can be very large. As such, the new potential schedule will be even larger. Even when an item has sat for a long time, there is a small chance that the item will be remembered. In this case the item will be scheduled far into the future, depriving the user of practice.
Because of this, the length of time used for generating the new potential schedule is limited to 25% more than the previous potential schedule. For example, let's say an item is scheduled for 1 day in the future. But the user waits 10 days before reviewing the item. Even if the user gets the item correct, the new potential schedule will only be 2.5 days (1 day plus 25% of one day, times the multiplication factor of 2) instead of 20 days.

Success Rate Estimation
-----------------------

The key feature of JLDrill's algorithm is that it measures when the user is getting a 90% success rate and stops reviewing. This is what allows JLDrill to operate without having to precisely schedule each item.

Originally JLDrill used a Bayesian estimate of the probability that the current items were at or above a 90% success rate. Because the items in the list are changing over trials (grossly increasing) a lower limit was placed on this estimate. When the estimate reached 90% (i.e., 90% confidence that the rate was 90% or above) review was halted.

This proved to be troublesome, though. The problem is that for the most part, when the user is reviewing every day, the items in the list don't fall too much below 90%. They hover around 75-90%. Unfortunately, for items whose actual success rate is only 80%, the Bayesian estimate reaches 90% after only around 24 reviews on average. When the daily review gets large, this means that this false stopping point will be hit often.

The other problem is that when an item is guessed incorrectly, the Bayesian estimate always drops a long way. That's because we are testing that the value is 90% or above. If you get 8 right in a row and get one wrong, your confidence that it's 90% or above drops quickly. In practice the estimate became nearly equivalent to a simple rule: getting 9 or 10 right in a row means that you are at 90% confidence. It is not exactly the same, but you can prove to yourself that it's so close that it doesn't matter.
In the end I decided to modify the algorithm to be something that the user could relate to, rather than the Bayesian estimate. I collect the last 10 results. When there is a 90% success rate I start a countdown. If the user maintains a 90% success rate for 10 more items, the review is finished.

This is almost exactly equivalent to running the Bayesian estimate twice: once it reaches 90% confidence, reset the confidence and start again, and once it reaches 90% confidence again the review is finished. The rationale for this approach is that we get to a 90% confidence that we are in the 90+% area. We then start our estimation procedure again and test that we remain in the 90+% range. This gets us over most of the false peaks.

As I said, I decided simply to keep track of the last 10 results and require a 90% rate to be maintained for 10 items. I believe this is more understandable for the end user. It is also almost identical in result to using the Bayesian estimate (although the approach is *slightly* more likely to finish earlier). Finally, the code is much more straightforward since it doesn't require any complicated math.

Kanji and Meaning Problems
--------------------------

Starting in JLDrill 0.4.1, Kanji problems and Meaning problems are reviewed with separate schedules. In previous releases there was only one schedule and the type of review was randomly determined. In the new scheduler, if any of the problem types are guessed wrong, all of them are restarted from the beginning. This allows you to ensure that all aspects of the vocabulary are remembered properly. But this causes some issues for scheduling.

The most obvious problem is that there are now 2 problems scheduled for each item. If the user sees them close together, they are likely to remember the answer from the last review. In order to combat this, the problems are scheduled with different random variations. Since each schedule is varied by ±10% it doesn't take long until the problems are far apart in the schedule.
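
How quickly two independently varied schedules drift apart can be explored with a small simulation. This is a rough sketch under simplifying assumptions, not JLDrill's code: the function names are hypothetical, every review is assumed to be answered correctly exactly on schedule, the factor is held at 2, and the 5 day starting point is just an example value.

```python
import random


def jittered(schedule_days: float, rng: random.Random, spread: float = 0.10) -> float:
    """Vary a schedule by a random amount within +/- spread (10% by default)."""
    return schedule_days * rng.uniform(1.0 - spread, 1.0 + spread)


def simulate(generations: int, seed: int, first_days: float = 5.0, factor: float = 2.0):
    """Follow the kanji and meaning schedules of one item over several correct reviews.

    Assumption: each review happens exactly when scheduled and the factor never changes.
    """
    rng = random.Random(seed)
    kanji = meaning = first_days
    history = []
    for _ in range(generations):
        # Each problem type gets its own independent random variation.
        kanji = jittered(kanji * factor, rng)
        meaning = jittered(meaning * factor, rng)
        history.append((kanji, meaning))
    return history


for generation, (kanji, meaning) in enumerate(simulate(generations=5, seed=1), start=1):
    gap = abs(kanji - meaning)
    print(f"generation {generation}: kanji {kanji:6.1f} d, meaning {meaning:6.1f} d, gap {gap:5.1f} d")
```

Running it with different seeds gives a feel for how large the gap between the two problems typically becomes after a few generations.
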
But the other issue is that the item is still being reviewed twice as often as it was previously. This means that the recall rate will be above 90%. This is especially problematic when there are items mixed in which don't have kanji. Those items are only reviewed once. The order of the items may be incorrect because the recall rate is different. This is an issue that hasn't been addressed yet.

What Percentage Level to Stop Reviewing
---------------------------------------

JLDrill chooses a 90% rate to stop reviewing. This was chosen based on personal preference and claimed success rates of other spaced repetition algorithms. However, this rate should be chosen such that the weighted cost of relearning is equal to the cost of reviewing. In other words, the cost of relearning the word over again multiplied by the chance that the word is forgotten is equal to the cost of re-reviewing the word so that it isn't forgotten. This creates a balance.

This value is quite difficult to calculate and is made more difficult because the amount of time between reviews in JLDrill is not predictable. However, I have found that in practice 90% seems to work well. More work needs to be done in this area.
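
The balance described in this section can be written as a small equation. The symbols are introduced here only for illustration and are not something JLDrill measures: let $p_f$ be the chance that the word is forgotten, $C_{\text{relearn}}$ the cost of relearning it, and $C_{\text{review}}$ the cost of one review.

$$
p_f \cdot C_{\text{relearn}} = C_{\text{review}}
\quad\Longrightarrow\quad
p_f = \frac{C_{\text{review}}}{C_{\text{relearn}}},
\qquad
\text{target recall rate} = 1 - p_f
$$

With purely hypothetical numbers, a review costing 15 seconds and a relearning cost of 150 seconds give $p_f = 15 / 150 = 0.1$, which corresponds to a 90% recall rate. If relearning cost only 75 seconds, the balance would move to $p_f = 0.2$, or an 80% recall rate.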