Todd W. Schneider

Electability of 2016 Presidential Candidates as Implied by Betting Markets

It’s fairly commonplace these days for news outlets to reference prediction markets as part of the election cycle. We often hear about betting odds on who will win the primary or be the next president, but I haven’t seen many commentators use prediction markets to infer the electability of each candidate.

With that in mind, I took the betting odds for the 2016 US presidential election from Betfair and used them to calculate the perceived electability of each candidate. Electability is defined as a candidate’s conditional probability of winning the presidency, given that the candidate earns his or her party’s nomination.

Presidential betting market odds and electabilities

Interactive table (updated throughout the election season) with columns: Candidate, Win Nomination, Win Presidency, Electability if Nominated

“Electability” refers to a candidate’s conditional probability of winning the presidency, given that the candidate wins his or her party’s nomination

I’m no political analyst, and the data above will continue to update throughout the election season, so anything I write about it here may be outdated almost immediately. Still, as of this writing on September 15, 2015, betting markets perceive Hillary Clinton as the most electable of the declared candidates, with a 57%–58% chance of winning the presidency if she receives the Democratic nomination. Betting markets also imply that the Democrats are the favorites overall, with about a 57% chance of winning the presidency. That’s roughly the same as Clinton’s electability, so it appears that Clinton is considered averagely electable compared to the Democratic party as a whole.

On the Republican side, Jeb Bush has the best odds of winning the nomination, but his electability range of 47%–49% means he’s considered a slight underdog in the general election should he win the nomination. Still, that’s better than Marco Rubio (36%–40%) and Scott Walker (33%–42%), whose lower electabilities imply they would be bigger underdogs if nominated. The big surprise to me is that Donald Trump has a fairly high electability range relative to the other Republicans, at 47%–56%. Maybe the implication is something like, “if there’s an unanticipated factor that enables the surprising result of Trump winning the nomination, then that same factor will work in his favor in the general election,” but then that logic should apply to other longshot candidates as well, which it seems not to, so perhaps other caveats apply.

Why are the probabilities given as ranges?

Usually when you read something in the news like “according to [bookmaker], candidate A has a 25% chance of winning the primary”, that’s not quite the complete story. The bookmaker might well have posted odds on A to win the primary at 3:1, which means you could bet $1 on A to win the primary, and if you’re correct then you’ll collect $4 from the bookmaker for a profit of $3. Such a bet has positive expected value if and only if you believe the candidate’s probability of winning the primary is greater than 25%. But traditional bookmakers typically don’t let you take the other side of their posted odds. In other words, you probably couldn’t bet $3 on A to lose the nomination, and receive a $1 profit if you’re correct.
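To make that arithmetic concrete, here’s a small sketch in R (the function names are just for illustration):

implied_prob <- function(profit, stake = 1) stake / (stake + profit)
implied_prob(3)  # 3:1 odds imply a 25% break-even probability

# expected value of a $1 bet at 3:1, given your own estimate p of the candidate's chances
bet_ev <- function(p, profit = 3, stake = 1) p * profit - (1 - p) * stake
bet_ev(0.30)  # positive, so the bet is attractive if you believe p is 30%
bet_ev(0.25)  # zero: 25% is exactly the break-even probability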

Betting markets like Betfair, though, do allow you to bet in either direction, but not at the same odds. Maybe you can bet on candidate A to win the nomination at a 25% risk-neutral probability, but if you want to bet on A to lose the nomination, you might only be able to do so at a 20% risk-neutral probability, which means you could risk $4 for a potential $1 profit if A loses the nomination, or 1:4 odds. The difference between where you can buy and sell is known as the bid-offer spread, and it reflects, among other things, compensation for market-makers.

The probabilities in the earlier table are given as ranges because they reflect this bid-offer spread. If candidate A’s bid-offer is 20%–25%, and you think that A’s true probability is 30%, then betting on A at 25% seems like an attractive option, or if you think that A’s true probability is 15% then betting against A at 20% is also attractive. But if you think A’s true probability falls between 20% and 25%, then you probably don’t have any bets to make, though you might consider becoming a market-maker yourself by placing a bid or offer at an intermediate level and waiting for someone else to come along and take the opposite position.

A hypothetical example calculation of electability

Betfair offers betting markets on the outcome of the general election, and the outcomes of the Democratic and Republican primary elections. Although Betfair does not offer betting markets of the form “candidate A to win the presidency, if and only if A wins the primary”, bettors can place simultaneous bets on A’s primary and general election outcomes in a ratio such that the bettor will break even if A loses the primary, and make or lose money only in the scenario where A wins the primary.

Let’s continue the example with our hypothetical candidate A, who has a bid-offer of 20%–25% in the primary and, let’s say, a bid-offer of 11%–12.5% in the general election. If we bet $25 on A to win the general election at a 12.5% probability, then our profit across scenarios looks like this:

Bet $25 on candidate A to win the general election at 12.5% probability (7:1 odds)

Scenario Amount at Risk Payout from general bet Profit
A loses primary $25 $0 -$25
A wins primary, loses general $25 $0 -$25
A wins primary, wins general $25 $200 $175

We want our profit to be $0 in the “loses primary” scenario, so we can add a hedging bet that will pay us a profit of $25 if A loses the primary. That bet is placed at a 20% probability, which means our odds ratio is 1:4, so we have to risk $100 in order to profit $25 in case A loses the primary. Now we have a total of $125 at risk: $25 on A to win the presidency, and $100 on A to lose the nomination. The scenarios look like this:

Bet $25 on candidate A to win the general election at 12.5% probability (7:1 odds) and $100 on A to lose the primary at 20% probability (1:4 odds)

Scenario Amount at risk Payout from primary bet Payout from general bet Profit
A loses primary $125 $125 $0 $0
A wins primary, loses general $125 $0 $0 -$125
A wins primary, wins general $125 $0 $200 $75

We’ve constructed our bets so that if A loses the primary, then we neither make nor lose money, but if A wins the primary, then we need A’s probability of winning the election to be greater than 62.5% in order to make our bet positive expected value, since 0.625 * 75 + 0.375 * -125 = 0. As an exercise for the reader, you can go through similar logic to show that if you want to bet on A to lose the presidential election but have 0 profit in case A loses the primary, then you need A’s conditional probability of winning the general election to be lower than 44% in order to make the bet positive expected value. In this example then, A’s electability range is 44%–62.5%.
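Putting the example together: the two ends of the electability range are just ratios of the general election and primary quotes. Here’s a small R sketch of the calculation implied by the scenarios above:

electability_range <- function(primary_bid, primary_offer, general_bid, general_offer) {
  c(low  = general_bid / primary_offer,   # lay A in the general at the bid, back A in the primary at the offer
    high = general_offer / primary_bid)   # back A in the general at the offer, lay A in the primary at the bid
}
electability_range(primary_bid = 0.20, primary_offer = 0.25,
                   general_bid = 0.11, general_offer = 0.125)
#   low  high
# 0.440 0.625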


This analysis does not take into account the total amount of money available to bet on each candidate. As of September 2015, Betfair has handled over $1 million of bets on the 2016 election, but the markets on some candidates are not as deep as others. If you actually tried to place bets in the fashion described above, you might find that there isn’t enough volume to fully hedge your exposure to primary results, or you might have to accept significantly worse odds in order to fill your bets.

It’s possible that someone might try to manipulate the odds by bidding up or selling down some combination of candidates. Given the amount of attention paid to prediction markets in the media, and the relatively modest amount of money wagered on them, it’s probably not even a bad idea from a manipulator’s perspective. In 2012 someone tried to do this to make it look like Mitt Romney was gaining momentum, but enough bettors stepped in to take the other sides of those bets and Romney’s odds fell back to where they started. Even though that attempt failed, people might try it again, and if/when they do, they might even succeed, in which case betting market data might only reflect what the manipulators want it to, as opposed to the wisdom of the crowds.

The electability calculation ignores the scenario where a candidate loses the primary but wins the general election. I don’t think this has ever happened on the national level, but it happened in Connecticut in 2006, and it probably has a non-zero probability of happening nationally. If it were to happen, and you had placed bets on the candidate to win the primary and lose the election, you might find that your supposedly safe “hedge” wasn’t so safe after all (on the other hand, you might get lucky and hit on both of your bets…). Some have speculated that Donald Trump in particular might run as an independent candidate if he doesn’t receive the Republican nomination, so whatever (probably small) probability the market assigns to the scenario of “Trump loses the Republican nomination but wins the presidency” would inflate his electability.

There are probably more caveats to list; for example, I’ve failed to consider any trading fees or commissions incurred when placing bets. Additionally, though I have no proof, as mentioned earlier I’d guess that candidates who are longshots to win the primaries probably have higher implied electabilities, due to the implicit assumption that if something dramatic enough were to happen to win them the primary, the same factor would probably help their odds in the general election.

Despite all of these caveats, I believe that the implied electability numbers do represent to some degree how bettors expect the candidates to perform in the general election, and I wonder if there should be betting markets set up that allow people to wager directly on these conditional probabilities, rather than having to place a series of bets to mimic the payout structure.

A Statistical Analysis of the LearnedLeague Trivia Competition

And an attempt to predict gender based on trivia knowledge

LearnedLeague bills itself as “the greatest web-based trivia league in all of civilized earth.” Having been fortunate enough to partake in the past 3 seasons, I’m inclined to agree.

LearnedLeague players, known as “LLamas”, answer trivia questions drawn from 18 assorted categories, and one of the many neat things about LearnedLeague is that it provides detailed statistics into your performance by category. Personally I was surprised at how quickly my own stats began to paint a startlingly accurate picture of my trivia knowledge: strength in math, business, sports, and geography, coupled with weakness in classical music, art, and literature. Here are my stats through 3 seasons of LearnedLeague play:


My personal category stats through 3 seasons of LearnedLeague. The “Lg%” column represents the average correct % for all LearnedLeague players, who are known colloquially as “LLamas”

It stands to reason that performance in some of these categories should be correlated. For example, people who are good at TV trivia are probably likely to be better than average at movie trivia, so we’d expect a positive correlation between performance in the TV and film categories. It’s harder to guess at what categories might be negatively correlated. Maybe some of the more scholarly pursuits, like art and literature, would be negatively correlated with some of the more, er, plebeian categories like popular music and food/drink?

With the LearnedLeague Commissioner’s approval, I collected aggregate category stats for all recently active LLamas so that I could investigate correlations between category performance and look for other interesting trends. My dataset and code are all available on GitHub, though profile names have been anonymized.

Correlated categories

I analyzed a total of 2,689 players, representing active LLamas who have answered at least 400 total questions. Each player has 19 associated numbers: a correct rate for each of the 18 categories, plus an overall correct rate. For each of the 153 pairs of categories, I calculated the correlation coefficient between player performance in those categories.
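Here’s a minimal sketch of that calculation, assuming a data frame called players with one correct-rate column per category plus an overall column (the column names are hypothetical):

category_cols <- setdiff(names(players), "overall")  # the 18 category correct rates
pairs <- combn(category_cols, 2)                     # 2 x 153 matrix of category pairs
rho <- apply(pairs, 2, function(pair) cor(players[[pair[1]]], players[[pair[2]]]))
correlations <- data.frame(category_1 = pairs[1, ], category_2 = pairs[2, ], rho = rho)
correlations <- correlations[order(-correlations$rho), ]  # sorted from most to least correlated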

The pairs with the highest correlation were:

  1. Geography & World History, ρ = 0.860
  2. Film & Television, ρ = 0.803
  3. American History & World History, ρ = 0.802
  4. Art & Literature, ρ = 0.795
  5. Geography & Language, ρ = 0.773

And the pairs with the lowest correlation:

  1. Math & Television, ρ = 0.126
  2. Math & Theatre, ρ = 0.135
  3. Math & Pop Music, ρ = 0.137
  4. Math & Film, ρ = 0.148
  5. Math & Art, ρ = 0.256

The scatterplots of the most and least correlated pairs look as follows. Each dot represents one player, and I’ve added linear regression trendlines:

Most correlated: Geography & World History

Least correlated: Math & Television

The full list of 153 correlations is available in this Google spreadsheet. At first I was a bit surprised to see that every category pair showed a positive correlation, but upon further reflection it shouldn’t be that surprising: some people are just better at trivia, and they’ll tend to do well in all categories (none other than Ken Jennings himself is an active LLama!).

The most correlated pairs make some intuitive sense, though we should always be wary of hindsight bias. Still, it’s pretty easy to tell believable stories about the highest correlations: people who know a lot about world history probably know where places are (i.e. geography), people who watch TV also watch movies, and so on. I must say, though, that the low correlation between knowledge of math and the pop culture categories of TV, theatre, pop music, and film doesn’t do much to dispel mathematicians’ reclusive images! The only category that math shows an above-average correlation to is science, so perhaps it’s true that mathematicians just live off in their own world?

You can view a scatterplot for any pair of categories by selecting them from the menus below. There’s also a bar graph that ranks the other categories by their correlation to your chosen category:

Turn on javascript (or click through from RSS) to view scatterplots and bar graphs for additional categories.



Predicting gender from trivia category performance

LLamas optionally provide a bit of demographic information, including gender, location, and college(s) attended. It’s not lost on me that my category performance is pretty stereotypically “male.” For better or worse, my top 3 categories—business, math, and sports—are often thought of as male-dominated fields. That got me to wondering: does performance across categories predict gender?

It’s important to note that LearnedLeague members are a highly self-selected bunch, and in no way representative of the population at large. It would be wrong to extrapolate from LearnedLeague results to make a broader statement about how men and women differ in their trivia knowledge. At the same time, predictive analysis can be fun, so I used R’s rpart package to train a recursive partitioning decision tree model which predicts a player’s gender based on category statistics. Recursive partitioning trees are known to have a tendency to overfit data, so I used R’s prune() function to snip off some of the less important splits from the full tree model:

decision tree

The labels on each leaf node report the actual fraction of the predicted gender in that bucket. For example, following from the top of the tree to the right: of the players who got at least 42% of their games/sport questions correct, and less than 66% of their theatre questions correct, 85% were male

The decision tree uses only 4 of the 18 categories available to it: games/sport, theatre, math, and food/drink, suggesting that these are the most important categories for predicting gender. Better performance in games/sport and math makes a player more likely to be male, while better performance in theatre and food/drink makes a player more likely to be female.

How accurate is the decision tree model?

The dataset includes 2,093 males and 595 females, and the model correctly categorizes gender for 2,060 of them, giving an overall accuracy rate of 77%. Note that there are more males in the dataset than there are correct predictions from the model, so in fact the ultra-naive model of “always guess male” would actually achieve a higher overall accuracy rate than the decision tree. However, as noted in this review of decision trees, “such a model would be literally accurate but practically worthless.” In order to avoid this pitfall, I manually assigned prior probabilities of 50% each to male and female. This ensures that the decision tree makes an equal effort to predict male and female genders, rather than spending most of its effort getting all of the males correct, which would maximize the number of total correct predictions.
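Here’s a minimal sketch of the fitting step, assuming a data frame called llamas with a gender column, an overall correct rate, and one correct-rate column per category (the column names and the pruning threshold are illustrative); the equal priors are passed via rpart’s parms argument:

library(rpart)

full_tree <- rpart(gender ~ . - overall, data = llamas, method = "class",
                   parms = list(prior = c(0.5, 0.5)))  # equal priors for male and female

pruned_tree <- prune(full_tree, cp = 0.02)  # snip off the less important splits

# confusion matrix of actual vs. predicted gender on the training data
table(actual = llamas$gender, predicted = predict(pruned_tree, type = "class"))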

With the equal priors assigned, the model correctly predicts gender for 75% of the males and 82% of the females. Here’s the table of actual and predicted gender counts:

Predicted Male Predicted Female Total
Actual Male 1,570 523 2,093
Actual Female 105 490 595
Total 1,675 1,013 2,688

Ranking the categories by gender preference

Another way to think about the categories’ relationship with gender is to calculate what I’ll call a “gender preference” for each category. The methodology for a single category, sketched in code after this list, is:

  1. Take each player’s performance in that category and adjust it by the player’s overall correct rate
    • E.g. the % of math questions I get correct minus the % of all questions I get correct
  2. Calculate the average of this value for each gender
  3. Take the difference between the male average and the female average
  4. The result is the category’s (male-female) preference, where a positive number indicates male preference, and a negative number indicates female preference
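A minimal R sketch of these steps, using the same hypothetical llamas data frame as above (gender is assumed to be coded “Male”/“Female”):

category_cols <- setdiff(names(llamas), c("gender", "overall"))
relative <- llamas[, category_cols] - llamas$overall  # step 1: category rate minus overall rate
by_gender <- aggregate(relative, by = list(gender = llamas$gender), FUN = mean)  # step 2
preference <- by_gender[by_gender$gender == "Male", category_cols] -
  by_gender[by_gender$gender == "Female", category_cols]  # steps 3 and 4
sort(unlist(preference))  # negative values lean female, positive values lean male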

Calculating this number for each category produces a relatively easy to interpret graph that ranks categories from most “feminine” to “masculine”:

category preferences

The chart shows the difference between men and women’s average relative performance for each category. For example, women average 8.1% higher correct rate in theatre compared to their overall correct rate, and men average 5.5% worse correct rate in theatre compared to their overall average, so the difference is (-5.5 - 8.1) = -13.6%

Similar to the results from the decision tree, this methodology shows that theatre and food/drink are most indicative of female players, while games/sport and math are most associated with male players.


The dataset and scripts I used for this post are available on GitHub. If you’re interested in LearnedLeague, this article provides a good overview, and you can always try your hand at a random selection of sample questions.

Mortgages Are About Math: Open-Source Loan-Level Analysis of Fannie and Freddie

[M]ortgages were acknowledged to be the most mathematically complex securities in the marketplace. The complexity arose entirely out of the option the homeowner has to prepay his loan; it was poetic that the single financial complexity contributed to the marketplace by the common man was the Gordian knot giving the best brains on Wall Street a run for their money. Ranieri’s instincts that had led him to build an enormous research department had been right: Mortgages were about math.

The money was made, therefore, with ever more refined tools of analysis.

—Michael Lewis, Liar’s Poker (1989)

Fannie Mae and Freddie Mac began reporting loan-level credit performance data in 2013 at the direction of their regulator, the Federal Housing Finance Agency. The stated purpose of releasing the data was to “increase transparency, which helps investors build more accurate credit performance models in support of potential risk-sharing initiatives.”

The so-called government-sponsored enterprises went through a nearly $200 billion government bailout during the financial crisis, motivated in large part by losses on loans that they guaranteed, so I figured there must be something interesting in the loan-level data. I decided to dig in with some geographic analysis, an attempt to identify the loan-level characteristics most predictive of default rates, and more. As part of my efforts, I wrote code to transform the raw data into a more useful PostgreSQL database format, and some R scripts for analysis. The code for processing and analyzing the data is all available on GitHub.

Default rate by month

At the time of Fed Chairman Bernanke’s 2007 statement that the problems in the subprime market seemed likely to be contained, it really did seem like agency loans were unaffected by the problems observed in subprime loans, which were expected to have higher default rates. About a year later, defaults on Fannie and Freddie loans increased dramatically, and the government was forced to bail out both companies to the tune of nearly $200 billion

The “medium data” revolution

It should not be overlooked that in the not-so-distant past, i.e. when I worked as a mortgage analyst, an analysis of loan-level mortgage data would have cost a lot of money. Between licensing data and paying for expensive computers to analyze it, you could have easily incurred costs north of a million dollars per year. Today, in addition to Fannie and Freddie making their data freely available, we’re in the midst of what I might call the “medium data” revolution: personal computers are so powerful that my MacBook Air is capable of analyzing the entire 215 GB of data, representing some 38 million loans, 1.6 billion observations, and over $7.1 trillion of origination volume. Furthermore, I did everything with free, open-source software. I chose PostgreSQL and R, but there are plenty of other free options you could choose for storage and analysis.

Both agencies released data for 30-year, fully amortizing, fixed-rate mortgages, which are considered standard in the U.S. mortgage market. Each loan has some static characteristics which never change for the life of the loan, e.g. geographic information, the amount of the loan, and a few dozen others. Each loan also has a series of monthly observations, with values that can change from one month to the next, e.g. the loan’s balance, its delinquency status, and whether it prepaid in full.

The PostgreSQL schema, then, is split into 2 main tables, called loans and monthly_observations. Beyond the data provided by Fannie and Freddie, I also found it helpful to pull in some external data sources, most notably the FHFA’s home price indexes and Freddie Mac’s mortgage rate survey data.

A fuller glossary of the data is available in an appendix at the bottom of this post.

What can we learn from the loan-level data?

I started by calculating simple cumulative default rates for each origination year, defining a “defaulted” loan as one that became at least 60 days delinquent at some point in its life. Note that not all 60+ day delinquent loans actually turn into foreclosures where the borrower has to leave the house, but missing at least 2 payments typically indicates a serious level of distress.
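Here’s a sketch of that aggregation, querying the PostgreSQL database from R. The database name and the origination_year column are assumptions about the loaded schema; first_serious_dq_date is described in the appendix below:

library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "agency_loan_level")  # hypothetical database name

vintage_defaults <- dbGetQuery(con, "
  SELECT origination_year,
         AVG((first_serious_dq_date IS NOT NULL)::int) AS cumulative_default_rate
  FROM loans
  GROUP BY origination_year
  ORDER BY origination_year
")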

Loans originated from 2005-2008 performed dramatically worse than loans that came before them! That should be an extraordinarily unsurprising statement to anyone who was even slightly aware of the U.S. mortgage crisis that began in 2007:

Cumulative default rates by vintage

About 4% of loans originated from 1999 to 2003 became seriously delinquent at some point in their lives. The 2004 vintage showed some performance deterioration, and then the vintages from 2005 through 2008 show significantly worse performance: more than 15% of all loans originated in those years became distressed.

From 2009 through the present, performance has been much better, with fewer than 2% of loans defaulting. Of course, part of that is because it takes time for a loan to default, so the most recent vintages will tend to have lower cumulative default rates while their loans are still young. But as we’ll see later, there was also a dramatic shift in lending standards, so the loans made since 2009 have been of much higher credit quality.

Geographic performance

Default rates increased everywhere during the bubble years, but some states fared far worse than others. I took every loan originated between 2005 and 2007, broadly considered to be the height of reckless mortgage lending, bucketed loans by state, and calculated the cumulative default rate of loans in each state. Mouse over the map to see individual state data:

4 states in particular jump out as the worst performers: California, Florida, Arizona, and Nevada. Just about every state experienced significantly higher than normal default rates during the mortgage crisis, but these 4 states, often labeled the “sand states”, experienced the worst of it.

I also used the data to make more specific maps at the county-level; default rates within different metropolitan areas can show quite a bit of variation. California jumps out as having the most interesting map: the highest default rates in California came from inland counties, most notably in the Central Valley and Inland Empire regions. These exurban areas, like Stockton, Modesto, and Riverside, experienced the largest increases in home prices leading up to the crisis, and subsequently the largest collapses.

The map clearly shows the central parts of California with the highest default rates, and the coastal parts with generally better default rates:

The major California metropolitan areas with the highest default rates were:

  1. Modesto – 40%
  2. Stockton – 37%
  3. Riverside-San Bernardino-Ontario (Inland Empire) – 33%

And the major metropolitan areas with the lowest default rates:

  1. San Francisco – 4.3%
  2. San Jose – 7.6%
  3. Santa Ana-Anaheim-Irvine (Orange County) – 11%

It’s less than 100 miles from San Francisco to Modesto and Stockton, and only 35 miles from Anaheim to Riverside, yet we see such dramatically different default rates between the inland regions and their relatively more affluent coastal counterparts.

The inland cities, with more land available to allow expansion, experienced the most overbuilding, the most aggressive lenders, and the highest levels of speculators looking to get rich quick by flipping houses, so perhaps it’s not that surprising that when the housing market turned south, they also experienced the highest default rates. Not coincidentally, California has also led the nation in “housing bubble” searches on Google Trends every year since 2004.

The county-level map of Florida does not show as much variation as the California map:

Although the regions in the panhandle had somewhat lower default rates than central and south Florida, there were also significantly fewer loans originated in the panhandle. The Tampa, Orlando, and Miami/Fort Lauderdale/West Palm Beach metropolitan areas made up the bulk of Florida mortgage originations, and all had very high default rates. The worst performing metropolitan areas in Florida were:

  1. Miami – 40%
  2. Port St. Lucie – 39%
  3. Cape Coral/Fort Myers – 38%

Arizona and Nevada have very few counties, so their maps don’t look very interesting, and each state is dominated by a single metropolitan area: Phoenix experienced a 31% cumulative default rate, and Las Vegas a 42% cumulative default rate.

Modeling mortgage defaults

The dataset includes lots of variables for each individual loan beyond geographic location, and many of these variables seem like they should correlate to mortgage performance. Perhaps most obviously, credit scores were developed specifically for the purpose of assessing default risk, so it would be awfully surprising if credit scores weren’t correlated to default rates.

Some of the additional variables include the amount of the loan, the interest rate, the loan-to-value ratio (LTV), debt-to-income ratio (DTI), the purpose of the loan (purchase, refinance), the type of property, and whether the loan was originated directly by a lender or by a third party. All of these things seem like they might have some predictive value for modeling default rates.

We can also combine loan data with other data sources to calculate additional variables. In particular, we can use the FHFA’s home price data to calculate current loan-to-value ratios for every loan in the dataset. For example, say a loan started at an 80 LTV, but the home’s value has since declined by 25%. If the balance on the loan has remained unchanged, then the new current LTV would be 80 / (1 – 0.25) ≈ 106.7. An LTV over 100 means the borrower is “underwater”: the value of the house is now less than the amount owed on the loan. If the borrower does not believe that home prices will recover for a long time, the borrower might rationally decide to “walk away” from the loan.
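A tiny sketch of that calculation (the helper function is just for illustration):

current_ltv <- function(original_ltv, balance_ratio = 1, home_price_ratio) {
  # balance_ratio: current balance / original balance
  # home_price_ratio: current home value / value at origination, e.g. from an FHFA index
  original_ltv * balance_ratio / home_price_ratio
}
current_ltv(80, home_price_ratio = 1 - 0.25)  # 106.7: an underwater loan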

Another calculated variable is called spread at origination (SATO), which is the difference between the loan’s interest rate, and the prevailing market rate at the time of origination. Typically borrowers with weaker credit get higher rates, so we’d expect a larger value of SATO to correlate to higher default rates.

Even before formulating any specific model, I find it helpful to look at graphs of aggregated data. I took every monthly observation from 2009-11, bucketed along several dimensions, and calculated default rates. Note that we’re now looking at transition rates from current to defaulted, as opposed to the cumulative default rates in the previous section. Transition rates are a more natural quantity to model, since when we make future projections we have to predict not only how many loans will default, but when they’ll default.
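For reference, a monthly transition rate converts to an annualized rate by compounding; the exact convention behind the charts isn’t spelled out here, but the standard calculation is:

annualized_rate <- function(monthly_rate) 1 - (1 - monthly_rate)^12
annualized_rate(0.01)  # a 1% monthly default rate is roughly an 11.4% annualized rate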

Here are graphs of annualized default rates as a function of credit score and current LTV:

Default rate by FICO

The above graph shows FICO credit score on the x-axis, and annualized default rate on the y-axis. For example, loans with FICO score 650 defaulted at a rate of about 12% per year, while loans with FICO 750 defaulted around 4.5% per year

Default rate by current LTV

Clearly both of these variables are highly correlated with default rates, and in the directions we would expect: higher credit scores correlate to lower default rates, and higher loan-to-value ratios correlate to higher default rates.

The dataset cannot tell us why any borrowers defaulted. Some probably came upon financial hardship due to the economic recession and were unable to pay their bills. Others might have been taken advantage of by unscrupulous mortgage brokers, and could never afford their monthly payments. And, yes, some also “strategically” defaulted — meaning they could have paid their mortgages, but chose not to.

The fact that current LTV is so highly correlated to default rates leads me to suspect that strategic defaults were fairly common in the depths of the recession. But why might some people walk away from loans that they’re capable of paying?

As an example, say a borrower has a $300,000 loan at a 6% interest rate against a home that had since declined in value to $200,000, for an LTV of 150. The monthly payment on such a mortgage is $1,800. Assuming a price/rent ratio of 18, approximately the national average, then the borrower could rent a similar home for $925 per month, a savings of over $10,000 per year. Of course strategically defaulting would greatly damage the borrower’s credit, making it potentially much more difficult to get new loans in the future, but for such a large monthly savings, the borrower might reasonably decide not to pay.
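The arithmetic behind that example, as a quick sketch:

monthly_payment <- function(principal, annual_rate, years = 30) {
  r <- annual_rate / 12
  principal * r / (1 - (1 + r)^(-12 * years))
}
payment <- monthly_payment(300000, 0.06)  # about $1,799 per month
rent <- 200000 / 18 / 12                  # price/rent ratio of 18 implies roughly $926 per month
12 * (payment - rent)                     # roughly $10,500 saved per year by renting instead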

A Cox proportional hazards model helps give us a sense of which variables have the largest relative impact on default rates. The model assumes that there’s a baseline default rate (the “hazard rate”), and that the independent variables have a multiplicative effect on that baseline rate. I calibrated a Cox model on a random subset of loans using R’s coxph() function:


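# counting-process form: each monthly observation covers the interval
# (loan_age - 1, loan_age], with defaulted = 1 if the loan defaulted in that month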
formula = Surv(loan_age - 1, loan_age, defaulted) ~
          credit_score + ccltv + dti + loan_purpose + channel + sato

cox_model = coxph(formula, data = monthly_default_data)

> summary(cox_model)
coxph(formula = Surv(loan_age - 1, loan_age, defaulted) ~ credit_score +
    ccltv + dti + loan_purpose + channel + sato, data = monthly_default_data)

  n= 17866852, number of events= 94678

                    coef  exp(coef)   se(coef)       z Pr(>|z|)
credit_score  -9.236e-03  9.908e-01  8.387e-05 -110.12   <2e-16
ccltv          2.259e-02  1.023e+00  1.582e-04  142.81   <2e-16
dti            2.092e-02  1.021e+00  4.052e-04   51.62   <2e-16
loan_purposeR  4.655e-01  1.593e+00  9.917e-03   46.94   <2e-16
channelTPO     1.573e-01  1.170e+00  9.682e-03   16.25   <2e-16
sato           3.563e-01  1.428e+00  1.284e-02   27.75   <2e-16

The categorical variables, loan_purpose and channel, are the easiest to interpret because we can just look at the exp(coef) column to see their effect. In the case of loan_purpose, loans that were made for refinances multiply the default rate by 1.593 compared to loans that were made for purchases. For channel, loans that were made by third party originators, e.g. mortgage brokers, increase the hazard rate by 17% compared to loans that were originated directly by lenders.

The coefficients for the continuous variables are harder to compare because they each have their own independent scales: credit scores range from roughly 600 to 800, LTVs from 30 to 150, DTIs from 20 to 60, and SATO from -1 to 1. Again I find graphs the easiest way to interpret the results. We can use R’s predict() function to generate hazard rate multipliers for each independent variable, while holding all the other variables constant:
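Here’s a sketch of that step, varying current LTV while holding the other covariates at typical values. The specific values, and the baseline factor codes (“P” for purchase, “R” for retail), are assumptions for illustration:

newdata <- data.frame(credit_score = 720, ccltv = seq(30, 150, by = 5), dti = 35,
                      loan_purpose = "P", channel = "R", sato = 0)

# type = "risk" returns exp(linear predictor), i.e. the hazard rate relative to
# a loan with average covariate values
multipliers <- predict(cox_model, newdata = newdata, type = "risk")
plot(newdata$ccltv, multipliers, type = "l",
     xlab = "Current LTV", ylab = "Hazard rate multiplier")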

Hazard rate multipliers

Remember that the y-axis here shows a multiplier of the base default rate, not the default rate itself. So, for example, the average current LTV in the dataset is 82, which has a multiplier of 1. If we were looking at two loans, one of which had current LTV 82, the other a current LTV of 125, then the model predicts that the latter loan’s monthly default rate is 2.65 times the default rate of the former.

All of the variables behave directionally as we’d expect: higher LTV, DTI, and SATO are all associated with higher hazard rates, while higher credit scores are associated with lower hazard rates. The graph of hazard rate multipliers shows that current LTV and credit score have larger magnitude impact on defaults than DTI and SATO. Again the model tells us nothing about why borrowers default, but it does suggest that home price-adjusted LTVs and credit scores are the most important predictors of default rates.

There is plenty of opportunity to develop more advanced default models. Many techniques, including Cox proportional hazards models and logistic regression, are popular because they have relatively simple functional forms that behave well mathematically, and there are existing software packages that make it easy to calibrate parameters. On the other hand, these models can fall short because they have no meaningful connection to the actual underlying dynamics of mortgage borrowers.

So-called agent-based models attempt to model the behavior of individual borrowers at the micro-level, then simulate many agents interacting and making individual decisions, before aggregating into a final prediction. The agent-based approach can be computationally much more complicated, but at least in my opinion it seems like a model based on traditional statistical techniques will never explain phenomena like the housing bubble and financial crisis, whereas a well-formulated agent-based model at least has a fighting chance.

Why are defaults so much lower today?

We saw earlier that recently originated loans have defaulted at a much lower rate than loans originated during the bubble years. For one thing, home prices bottomed out sometime around 2012 and have rebounded some since then. The partial home price recovery causes current LTVs to decline, which as we’ve seen already, should correlate to lower default rates.

Perhaps more importantly, though, it appears that Fannie and Freddie have adopted significantly stricter lending standards starting in 2009. The average FICO score used to be 720, but since 2009 it has been more like 765. Furthermore, if we look 2 standard deviations from the mean, we see that the low end of the FICO spectrum used to reach down to about 600, but since 2009 there have been very few loans with FICO less than 680.

Tighter agency standards, coupled with a complete shutdown in the non-agency mortgage market, including both subprime and Alt-A lending, mean that there is very little credit available to borrowers with low credit scores (a far more difficult question is whether this is a good or bad thing!).

Average FICO by origination year

Since 2009, Fannie and Freddie have made significantly fewer loans to borrowers with credit scores below 680. There is some discussion about when, if ever, Fannie and Freddie will begin to loosen their credit standards and provide more loans to borrowers with lower credit scores

What next?

There are many more things we could study in the dataset. Long before investors worried about default rates on agency mortgages, they worried about voluntary prepayments due to refinancing and housing turnover. When interest rates go down, many mortgage borrowers refinance their loans to lower their monthly payments. For mortgage investors, investment returns can depend heavily on how well they project prepayments.

I’m sure some astronomical number of human-hours have been spent modeling prepayments, dating back to the 1970s when mortgage securitization started to become a big industry. Historically the models were calibrated against aggregated pool-level data, which was okay, but does not offer as much potential as loan-level data. With more loan-level data available, and faster computers to process it, I’d imagine that many on Wall Street are already hard at work using this relatively new data to refine their prepayment models.

Fannie and Freddie continue to improve their datasets, recently adding data for actual losses suffered on defaulted loans. In other words, when the bank has to foreclose and sell a house, how much money do the agencies typically lose? This loss severity number is itself a function of many variables, including home prices, maintenance costs, legal costs, and others. Severity will also be extremely important for mortgage investors in the proposed new world where Fannie and Freddie might no longer provide full guarantees against loss of principal.

Beyond Wall Street, I’d hope that the open-source nature of the data helps provide a better “early detection” system than we saw in the most recent crisis. A lot of people were probably generally aware that the mortgage market was in trouble as early as 2007, but unless you had access to specialized data and systems to analyze it, there was no way for most people to really know what was going on.

There’s still room for improvement: Fannie and Freddie could expand their datasets to include more than just 30-year fixed-rate loans. There are plenty of other types of loans, including 15-year terms and loans with adjustable interest rates. 30-year fixed-rate loans continue to be the standard of the U.S. mortgage market, but it would still be good to release data for all of Fannie and Freddie’s loans.

It’d also be nice if Fannie and Freddie released the data in a more timely manner instead of lagged by several months to a year. The lag before releasing the data reduces its effectiveness as a tool for monitoring the general health of the economy, but again it’s much better than only a few years ago when there was no readily available data at all. In the end, the trend toward free and open data, combined with the ever-increasing availability of computing power, will hopefully provide a clearer picture of the mortgage market, and possibly even prevent another financial crisis.

Appendix: data glossary

Mortgage data is available to download from Fannie Mae and Freddie Mac’s websites, and the full scripts I used to load and process the data are available on GitHub

Each loan has an origination record, which includes static data that will never change for the life of the loan. Each loan also has a set of monthly observations, which record values at every month of the loan’s life. The PostgreSQL database has 2 main tables: loans and monthly_observations.

Beyond the data provided by Fannie and Freddie, I found it helpful to add columns to the loans table for what we might call calculated characteristics. For example, I found that it was helpful to have a column on the loans table called first_serious_dq_date. This column would be populated with the first month in which a loan was 60 days delinquent, or null if the loan has never been 60 days delinquent. There’s no new information added by the column, but it’s convenient to have it available in the loans table as opposed to the monthly_observations table because loans is a significantly smaller table, and so if we can avoid database joins to monthly_observations for some analysis then that makes things faster and easier.
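Here’s a sketch of how such a column might be populated, reusing the database connection from earlier and assuming that dq_status stores the number of months delinquent (both are assumptions about the loaded schema):

dbExecute(con, "
  UPDATE loans
  SET first_serious_dq_date = dq.first_dq
  FROM (
    SELECT loan_id, MIN(date) AS first_dq
    FROM monthly_observations
    WHERE dq_status >= 2  -- two payments behind, i.e. 60+ days delinquent
    GROUP BY loan_id
  ) dq
  WHERE loans.id = dq.loan_id
")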

I also collected home price data from the FHFA, and mortgage rate data from Freddie Mac

Selected columns from the loans table:

  • credit_score, also referred to as FICO
  • original_upb, short for original unpaid balance; the amount of the loan
  • oltv and ocltv, short for original (combined) loan-to-value ratio. Amount of the loan divided by the value of the home at origination, expressed as a percentage. Combined loan-to-value includes any additional liens on the property
  • dti, debt-to-income ratio. From Freddie Mac’s documentation: the sum of the borrower’s monthly debt payments […] divided by the total monthly income used to underwrite the borrower
  • sato, short for spread at origination, the difference between the loan’s interest rate and the prevailing market rate at the time the loan was made
  • property_state
  • msa, metropolitan statistical area
  • hpi_index_id, references the FHFA home price index (HPI) data. If the loan’s metropolitan statistical area has its own home price index, use the MSA index, otherwise use the state-level index. Additionally if the FHFA provides a purchase-only index, use purchase-only, otherwise use purchase and refi
  • occupancy_status (owner, investor, second home)
  • channel (retail, broker, correspondent)
  • loan_purpose (purchase, refinance)
  • mip, mortgage insurance premium
  • first_serious_dq_date, the first date on which the loan was observed to be at least 60 days delinquent. Null if the loan was never observed to be delinquent
  • id and loan_sequence_number: loan_sequence_number is the unique string ID assigned by Fannie and Freddie; id is a unique integer designed to save space in the monthly_observations table

Selected columns from the monthly_observations table:

  • loan_id, for joining against the loans table (loans.id = monthly_observations.loan_id)
  • date
  • current_upb, current unpaid balance
  • previous_upb, the unpaid balance in the previous month
  • loan_age
  • dq_status and previous_dq_status

More info is available in the documentation provided by Fannie Mae and Freddie Mac