A Tale of Twenty-Two Million Citi Bike Rides: Analyzing the NYC Bike Share System

In the conclusion of my post analyzing NYC taxi and Uber trips, I noted that Citi Bike, New York City’s bike share system, also releases public data, totaling 22.2 million rides from July 2013 through November 2015. With the recent news that the Citi Bike system topped 10 million rides in 2015, making it one of the world’s largest bike shares, it seemed like an opportune time to investigate the publicly available data.

Much like with the taxi and Uber post, I’ve split the analysis into sections, covering visualization, the relationship between cyclist age, gender, and Google Maps time estimates, modeling the impact of the weather on Citi Bike ridership, and more:

Visualization: One Day in the Life of the Citi Bike Share System
The Data
Age, Gender, and the Accuracy of Google Maps Cycling Time Estimates
Anonymizing Data is Hard!
Magical Transports
Quantifying the Impact of the Weather on Citi Bike Activity

Code to download, process, and analyze the data is available on GitHub.

One Day in the Life of the Citi Bike Share System

I took Citi Bike trips from Wednesday, September 16, 2015, and created an animation using the Torque.js library from CartoDB, assuming that every trip followed the recommended cycling directions from Google Maps. There were a total of 51,179 trips that day, but I excluded trips that started and ended at the same station, leaving 47,969 trips in the visualization. Every blue dot on the map represents a single Citi Bike trip, and the small orange dots represent the 493 Citi Bike stations scattered throughout the city:

If you stare at the animation for a bit, you start to see some trends. My personal favorite spots to watch are the bridges that connect Brooklyn to Lower Manhattan. In the morning, beginning around 8 AM, you see a steady volume of bikes crossing from Brooklyn into Manhattan over the Brooklyn, Manhattan, and Williamsburg bridges. In the middle of the day, the bridges are generally less busy, then starting around 5:30 PM, we see the blue dots streaming from Manhattan back into Brooklyn, as riders leave their Manhattan offices to head back to their Brooklyn homes.

We can observe this phenomenon directly from the data, by looking at an hourly graph of trips that travel between Manhattan and the outer boroughs:

manhattan and outer boroughs

Sure enough, in the mornings there are more rides from Brooklyn to Manhattan than vice versa, while in the evenings there are more people riding from Manhattan to Brooklyn. For what it’s worth, most Citi Bike trips start and end in Manhattan. The overall breakdown since the program’s expansion in August 2015:

88% of trips start and end in Manhattan
8% of trips start and end in an outer borough
4% of trips travel between Manhattan and an outer borough

There are other distinct commuting patterns in the animation: the stretch of 1st Avenue heading north from 59th Street has very little Citi Bike traffic in the morning, but starting around 5 PM the volume picks up as people presumably head home from their Midtown offices to the Upper East Side.

Similarly, if we look during the morning rush at the parallel stretches of 1st and 2nd avenues stretching from the Lower East Side through Murray Hill, there’s clearly more volume heading north along 1st Avenue heading into Midtown. In the evening there’s more volume heading south along 2nd Avenue, as workers head home to the residential neighborhoods.

If we take all trips since Citi Bike’s expansion in August 2015, and again assume everyone followed Google Maps cycling directions, we can see which road segments throughout the city are most traveled by Citi Bikes. Here’s a map showing the most popular roads, where the thickness and brightness of the lines are based on the number of Citi Bikes that traveled that segment (click here to view higher resolution):

This map is reminiscent of the maps of taxi pickups and drop offs from my previous post, but they’re actually a bit different. The taxi maps were made of individual dots, where each dot was a pickup or drop off, while the Citi Bike map above counts each trip as a series of line segments, from the trip’s starting point to its finish.

The map shows a handful of primary routes for cyclists: 8th and 9th avenues heading uptown and downtown, respectively, on the west side, and 1st and 2nd avenues heading uptown and downtown, respectively, on the east side. The single road segment most trafficked by Citi Bikes lies along 8th Avenue, from W 28th Street to W 29th Street. Other main bike routes include Broadway, cutting diagonally across Midtown Manhattan, and the west side bike path along the Hudson River.

Remember that the map and animation assume people follow Google Maps cycling directions, which is definitely not always true. Google Maps seems to express strong preference for roads that have protected bike paths, which is why, for example, 8th Avenue has lots of traffic heading uptown, but 6th Avenue has very little. Both avenues head northbound, but only 8th Avenue has a protected bike path.

The Data

Unlike taxis, Citi Bikes cannot pick up and drop off at any arbitrary point in the city. Instead, riders can pick up and drop off bikes at finite number of stations across the city. Citi Bikes haven’t reached the ubiquity of taxis—in 2015 there were likely about 175 million taxi trips, 35 million Uber trips, and 10 million Citi Bike rides—but the bike share has plans to continue its expansion in the coming years.

Citi Bike makes data available for every individual trip in the system. Each trip record includes:

Station locations for where the ride started and ended
Timestamps for when the ride started and ended
Rider gender
Rider birth year
Whether the rider is an annual Citi Bike subscriber or a short-term customer
A unique identifier for the bike used

Here’s a graph of monthly usage since the program’s inception in June 2013:

monthly trips

Not surprisingly, there are dramatically fewer Citi Bike rides during the cold winter months. We’ll attempt to quantify the weather’s impact on Citi Bike ridership later in this post. The August 2015 increase in rides corresponds to the system’s first major expansion, which added nearly 2,000 bikes and 150 stations across Brooklyn, Queens, and Manhattan.

The system gets more usage on weekdays than on weekends, and if we look at trips by hour of the day, we can see that weekday riders primarily use Citi Bikes to commute to and from work, with peak hours from 8–9 AM and 5–7 PM. Weekend riders, on the other hand, prefer a more leisurely schedule, with most weekend rides occurring in the mid afternoon hours:

trips by hour

Age, Gender, and the Accuracy of Google Maps Cycling Time Estimates

The age and gender demographic data can be combined with Google Maps cycling directions to address a host of interesting questions, including:

How fast do Citi Bike riders tend to travel?
How accurate are Google Maps cycling time estimates?
How do age and gender impact biking speed?

For each trip, we’ll proxy the trip’s average speed by taking the distance traveled according to Google Maps, and dividing by the amount of time the trip took. This probably understates the rider’s actual average bike speed, since the trip includes time spent unlocking the bike from the origin station, adjusting it, perhaps checking a phone for directions or dealing with other distractions, and returning the bike at the destination station.

Additionally, it assumes the rider follows Google Maps directions. If the rider actually took a longer route than the one suggested by Google, that would be more distance traveled, and we would underestimate the average trip speed. On the other hand, if the rider took a more direct route than suggested by Google, it’s possible we might overestimate the trip speed.

We have no idea about any individual rider’s intent: some riders are probably trying to get from point A to point B as quickly as safely possible, while others might want to take a scenic route which happens to start at point A and end at point B. The latter group will almost certainly not follow a direct route, and so we’ll end up calculating a very slow average speed for these trips, even if the riders were pedaling hard the entire time.

Accordingly, for an analysis of bike speed, I restricted to the following subset of trips, which I at least weakly claim is more likely to include riders who are trying to get from point A to point B quickly:

Weekdays, excluding holidays
Rush hour (7–10 AM, 5–8 PM)
Annual subscribers
Average trip speed between 4 and 35 miles per hour (to avoid faulty data)

I then bucketed into cohorts defined by age, gender, and distance traveled, and calculated average trip speeds:

average trip speeds

The average speed across all such trips is 8.3 miles per hour, and the graph makes clear that younger riders tend to travel faster than older riders, men tend to travel faster than women, and trips covering longer distances have higher average speeds than shorter distance trips.

It’s also interesting to compare actual trip times to estimated times from Google Maps. Google Maps knows, for example, that the average speed along a wide, protected bike path will be faster than the speed along a narrow cross street that has no dedicated bike lane. I took the same cohorts and calculated the average difference between actual travel time and Google Maps estimated travel time:

average difference between actual and google maps

If everyone took exactly the amount of time estimated by Google Maps cycling directions, we’d see a series of flat lines at 0. However, every bucket has a positive difference, meaning that actual trip times are slower than predicted by Google Maps, by an average of 92 seconds. As mentioned earlier, part of that is because Google Maps estimates don’t account for time spent transacting at Citi Bike stations, and we can’t guarantee that every rider in our dataset was even trying to get from point A to B quickly.

I ran a linear regression in R to model the difference between actual and estimated travel time as a function of gender, age, and distance traveled. The point of the regression isn’t so much to make any accurate predictions—it’d be especially bad to extrapolate the regression for longer distance trips—but more to understand the relative magnitude of each variable’s impact:

lm(formula = difference_in_seconds ~ gender + age + distance_in_miles,
   data = rush_hour_data)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         34.03      0.338846   100.4   <2e-16
genderMale         -87.13      0.192166  -453.4   <2e-16
age                  2.25      0.007327   306.3   <2e-16
distance_in_miles   25.69      0.072994   352.0   <2e-16
---

Residual standard error: 213.6 on 7226125 degrees of freedom
Multiple R-squared:  0.05475,	Adjusted R-squared:  0.05475
F-statistic: 1.395e+05 on 3 and 7226125 DF,  p-value: < 2.2e-16

The regression’s low R^2 of 0.055 reiterates that the data has lots of variance, and for any given trip the model is unlikely to produce a particularly accurate estimate. But the model at least gives us a simple formula to make a crude estimate of how long a Citi Bike subscriber’s rush hour trip will take relative to the Google Maps estimate:

Start with 34
If male, subtract 87
Add (2.2 * age in years)
Add (25.7 * trip distance in miles)

The result is the average number of seconds between actual and Google Maps estimated trip times, with a positive number indicating a slower than estimated trip, and a negative number indicating a faster than estimated trip. Yes, it means that for every year you get older, you’re liable to be 2.2 seconds slower on your regular Citi Bike commute route!

Anonymizing Data is Hard!

In my post about taxi data, I included a section about data privacy, noting that precise pick up and drop off coordinates might reveal potentially sensitive information about where people live, work, and socialize. Citi Bike data does not have the same issues with precise coordinates, since all Citi Bike trips have to start and end at one of the 493 fixed stations.

But unlike the taxi data, Citi Bike includes demographic information about its riders, namely gender, birth year, and subscriber status. At first glance that might not seem too revealing, but it turns out that it’s enough to uniquely identify many Citi Bike trips. If you know the following information about an individual Citi Bike trip:

The rider is an annual subscriber
Their gender
Their birth year
The station where they picked up a Citi Bike
The date and time they picked up the bike, rounded to the nearest hour

Then you can uniquely identify that individual trip 84% of the time! That means you can find out where and when the rider dropped off the bike, which might be sensitive information. Because men account for 77% of all subscriber trips, it’s even easier to uniquely identify rides by women: if we restrict to female riders, then 92% of trips can be uniquely identified. It’s also easier to identify riders who are significantly younger or older than average:

uniquely identifiable

If instead of knowing the trip’s starting time to the nearest hour you only knew it to the nearest day, then you’d be able to identify 28% of all trips, but still 49% of trips by women.

On some level this shouldn’t be too surprising: a famous paper by Latanya Sweeney showed that 87% of the U.S. population is uniquely identified by birthdate, gender, and ZIP code. We probably have a bias toward underestimating how easy it is to identify people from what seems like limited data, and I hope that people think about that when they decide what data should be made publicly available.

Magical Transports

Disclaimer: I know nothing about the logistics of running a bike share system. I’d imagine, though, that one of the big issues is making sure that there are bikes available at stations where people want to pick them up. If station A starts the day with lots of bikes, but people take them out to other stations and nobody returns any bikes to A, then A will run out of bikes, and that’s bad.

The bike share operator could transport additional bikes to A to meet demand, but that costs time/money, so the operator probably wants to avoid it as much as possible. The data lets us measure how often bikes “magically” transport from one station to another, even though no one took a ride. I took each bike drop off, and calculated the percentage of rides where the bike’s next trip started at a different station from where the previous trip dropped off:

monthly transports

From July 2013 through March 2015, around 13% of bikes were somehow transported from their drop off stations to different stations before being ridden again. Since April 2015, though, that rate has decreased to about 4%. I have no idea why: my first guess was that there were more total bikes added to the system, but the number of bikes in use did not change in March 2015. There were no stations added or removed around then either, so that seems like an unlikely explanation. Maybe the operator developed a smarter system to allocate bikes, which resulted in a lower transfer percentage?

Different neighborhoods have different transfer patterns, too. Bikes dropped off in Manhattan’s East Village have a much higher chance of being transported if they’re dropped off in the evening:

east village transports

While transfers are more likely in Fort Greene, Brooklyn for bikes dropped off in the morning:

fort greene transports

And in Midtown, Manhattan, drop offs at morning or evening rush hour are more likely to be transported:

midtown transports

Add it all up and I’m not exactly sure what it means, but it seems like something that could be pursued further. The Citi Bike program has plans to continue its expansion in 2016, I wonder how the new stations will impact the transport rate?

Quantifying the Impact of the Weather on Citi Bike Activity

We saw earlier that there are many more Citi Bike rides in the summer than in the winter. It’s not surprising: anyone with a modicum of common sense knows that it’s not very pleasant to bike when it’s freezing cold. Similarly, biking is probably less popular on rainy and snowy days. This got me wondering: how well is Citi Bike’s daily ridership predicted by the weather?

I downloaded daily Central Park weather data from the National Climatic Data Center and joined it to the Citi Bike data in an effort to model the relationship between Citi Bike usage and the weather. The weather data includes a few variables, most notably:

Daily max temperature
Daily precipitation
Daily snow depth

Even before I began investigating the data, I suspected that a linear regression would not be appropriate for the weather model, for two main reasons:

Our dependent variable, total number of trips per day, is by definition positive. A standard linear regression can’t be guaranteed to produce a positive number
The relationship between bike rides and the weather is probably nonlinear. For example, I’d guess the change in ridership between 40 degree and 60 degree days is probably a larger magnitude than the change in ridership between 60 degree and 80 degree days

We could use a linear model with log transformations to deal with problem 1, but even then we’d be stuck with the nonlinearity issue. Let’s confirm though that the relationship between weather and ridership is in fact nonlinear:

temperature

This graph makes it pretty clear that there’s a nonlinear relationship between rides and max daily temperature. The number of trips ramps up quickly between 30 and 60 degrees, but above 60 degrees or so there’s a much weaker relationship between ridership and temperature. Let’s look at rainy days:

precipitation

And snowy days:

snow depth

Rain and snow are, not surprisingly, both correlated with lower ridership. The linearity of the relationships is less clear—there are also fewer observations in the dataset compared to “normal” days—but intuitively I have to believe that there’s a diminishing marginal effect of both, i.e. the difference between no rain and 0.1 inches of rain is more significant than the difference between 0.5 and 0.6 inches.

To calibrate the model, instead of using R’s lm() function, we’ll use the nlsLM() function from the minpack.lm package, which implements the Levenberg–Marquardt algorithm to minimize squared error for a nonlinear model.

For the nonlinear regression, we first need to specify the form of the model, which I chose to look like this:

model formula

The d variables are known values for a given date d, β variables are calibrated parameters, and the capitalized functions are intermediaries that are strictly speaking unnecessary, i.e. we could write the whole model on a single line, but I find the intermediate functions make things easier to reason about. Let’s step through the model specification, one line at a time:

d_trips is the number of Citi Bike trips on date d, the dependent variable in our model. We’re breaking trips into two components: a baseline component, which is a function of the date, and a weather component, which is a function of the weather on that date.
The Baseline(d) function uses an exponent, which guarantees that it will produce a positive output. It has 3 calibrated parameters: a constant, an adjustment for days that are non-holiday weekdays, and a fudge factor for dates in the “post-expansion era”, defined as after August 25, 2015, when Citi Bike added nearly 150 stations to the system.
The Weather(d) function uses every mortgage prepayment modeler’s favorite formula: the s-curve. I readily admit I have no “deep” reason for picking this functional form, but s-curves often behave well in nonlinear models, and the earlier temperature graph kind of looked like an s-curve might fit it well.
The input to the s-curve, WeatherFactor(d), is a linear combination of the maximum temperature, precipitation, and snow depth on date d.

The input data is available here as a csv, and you can see the exact R commands, output, and parameter values here, but the short version is that the model calibrates to what seem like reasonable parameters. Assuming we hold all other variables constant, the model predicts:

Raising the daily max temperature from 40 to 60 degrees increases ridership by 12,100 trips, while raising the temperature from 60 to 80 degrees increases ridership by 7,850 trips:
1 inch of rain has the same effect as decreasing the temperature by 24 degrees
1 inch of snow on the ground has the same effect as decreasing the temperature by 1.4 degrees.

In order to assess the model’s goodness of fit, we’ll look at some more graphs, starting with a scatterplot of actual vs. predicted values. Each dot represents a single day in the dataset, where the x-axis is the actual number of trips on that day, and the y-axis is the model-predicted number of trips:

model results scatterpot

The model’s root-mean-square error is 4,138, and residuals appear to be at least roughly normally distributed. Residuals appear to exhibit some heteroscedasticity, though, as the residuals have lower variance on dates with fewer trips.

The effect of the “post-expansion” fudge factor is evident in the top-right corner of the scatterplot, where it looks like there’s an asymptote around 36,000 predicted trips for dates before August 26, 2015. Ideally we’d formulate the model to avoid using a fudge factor—maybe by modeling trips at the individual station level, then aggregating up—but we’ll conveniently gloss over that.

We can also look at the time series of actual vs. predicted, aggregating to monthly totals in order to reduce noise:

model results monthly

I make no claim that it’s a perfect model—it uses imperfect data, has some smelly features and omissions, and all of the usual correlation/causation caveats apply—but it seems to do at least an okay job quantifying the impact of temperature, rain, and snow on Citi Bike ridership.

In Conclusion

As always, there are still plenty more things we could study in the dataset. Bad weather probably affects cycling speeds, so we could take that into account when measuring speeds and Google Maps time estimates.

Ben Wellington at I Quant NY did some demographic analysis by station, it might be interesting to see how that has evolved over time.

I wonder about modeling ridership at the individual station level, especially as stations are added in the future. Adding a new station is liable to affect ridership at existing stations—and it’s not even clear whether positively or negatively. A new station might cannibalize trips from other nearby stations, which wouldn’t increase total ridership by very much. But it’s also possible that a new station could have a synergistic effect with an existing station: imagine a scenario where a neighborhood with bad subway access gets a Citi Bike station, then an existing station located near the closest subway might see a surge in usage.

There are also probably plenty of analyses that could be done comparing Citi Bike data with the taxi and Uber data: what neighborhoods have the highest and lowest ratios of Citi Bike rides compared to taxi trips? And are there any commutes where it’s faster to take a Citi Bike than a taxi during rush hour traffic? Alas, these will have to wait for another time…

GitHub

There are scripts to download, process, and analyze the data in the nyc-citibike-data repository. A csv of the raw data for the weather analysis (daily trip totals plus weather data) is included in the repo, in case you don’t want to download all of the data.

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

An open-source exploration of the city’s neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data

Note: this post was originally written in November 2015, and was expanded with updates in September 2016 and March 2018. There is also a dashboard available here that updates monthly with the latest taxi, Uber, and Lyft aggregate stats.

The New York City Taxi & Limousine Commission has released a staggeringly detailed historical dataset covering over 1.1 billion individual taxi trips in the city from January 2009 through June 2015. Taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it’s a story of New York. How bad is the rush hour traffic from Midtown to JFK? Where does the Bridge and Tunnel crowd hang out on Saturday nights? What time do investment bankers get to work? How has Uber changed the landscape for taxis? And could Bruce Willis and Samuel L. Jackson have made it from 72nd and Broadway to Wall Street in less than 30 minutes? The dataset addresses all of these questions and many more.

I mapped the coordinates of every trip to local census tracts and neighborhoods, then set about in an attempt to extract stories and meaning from the data. This post covers a lot, but for those who want to pursue more analysis on their own: everything in this post—the data, software, and code—is freely available. Full instructions to download and analyze the data for yourself are available on GitHub.

Maps
The Data
Borough Trends, and the Rise of Uber
Airport Traffic
On the Realism of Die Hard 3
How Does Weather Affect Taxi and Uber Ridership?
NYC Late Night Taxi Index
The Bridge and Tunnel Crowd
Northside Williamsburg
Privacy Concerns
Investment Bankers
Parting Thoughts
2016 Update
2017 Update

Maps

I’m certainly not the first person to use the public taxi data to make maps, but I hadn’t previously seen a map that includes the entire dataset of pickups and drop offs since 2009 for both yellow and green taxis. You can click the maps to view high resolution versions:

These maps show every taxi pickup and drop off, respectively, in New York City from 2009–2015. The maps are made up of tiny dots, where brighter regions indicate more taxi activity. The green tinted regions represent activity by green boro taxis, which can only pick up passengers in upper Manhattan and the outer boroughs. Notice how pickups are more heavily concentrated in Manhattan, while drop offs extend further into the outer boroughs.

If you think these are pretty, I recommend checking out the high resolution images of pickups and drop offs.

NYC Taxi Data

The official TLC trip record dataset contains data for over 1.1 billion taxi trips from January 2009 through June 2015, covering both yellow and green taxis. Each individual trip record contains precise location coordinates for where the trip started and ended, timestamps for when the trip started and ended, plus a few other variables including fare amount, payment method, and distance traveled.

I used PostgreSQL to store the data and PostGIS to perform geographic calculations, including the heavy lifting of mapping latitude/longitude coordinates to NYC census tracts and neighborhoods. The full dataset takes up 267 GB on disk, before adding any indexes. For more detailed information on the database schema and geographic calculations, take a look at the GitHub repository.

Uber Data

Thanks to the folks at FiveThirtyEight, there is also some publicly available data covering nearly 19 million Uber rides in NYC from April–September 2014 and January–June 2015, which I’ve incorporated into the dataset. The Uber data is not as detailed as the taxi data, in particular Uber provides time and location for pickups only, not drop offs, but I wanted to provide a unified dataset including all available taxi and Uber data. Each trip in the dataset has a cab_type_id, which indicates whether the trip was in a yellow taxi, green taxi, or Uber car.

Borough Trends, and the Rise of Uber

The introduction of the green boro taxi program in August 2013 dramatically increased the amount of taxi activity in the outer boroughs. Here’s a graph of taxi pickups in Brooklyn, the most populous borough, split by cab type:

brooklyn pickups

From 2009–2013, a period during which migration from Manhattan to Brooklyn generally increased, yellow taxis nearly doubled the number of pickups they made in Brooklyn.

Once boro taxis appeared on the scene, though, the green taxis quickly overtook yellow taxis so that as of June 2015, green taxis accounted for 70% of Brooklyn’s 850,000 monthly taxi pickups, while yellow taxis have decreased Brooklyn pickups back to their 2009 rate. Yellow taxis still account for more drop offs in Brooklyn, since many people continue to take taxis from Manhattan to Brooklyn, but even in drop offs, the green taxis are closing the gap.

Let’s add Uber into the mix. I live in Brooklyn, and although I sometimes take taxis, an anecdotal review of my credit card statements suggests that I take about four times as many Ubers as I do taxis. It turns out I’m not alone: between June 2014 and June 2015, the number of Uber pickups in Brooklyn grew by 525%! As of June 2015, the most recent data available when I wrote this, Uber accounts for more than twice as many pickups in Brooklyn compared to yellow taxis, and is rapidly approaching the popularity of green taxis:

brooklyn uber pickups

Note that Uber data is only available from Apr 2014–Sep 2014, then from Jan 2015–Jun 2015, hence the gap in the graph

Manhattan, not surprisingly, accounts for by far the largest number of taxi pickups of any borough. In any given month, around 85% of all NYC taxi pickups occur in Manhattan, and most of those are made by yellow taxis. Even though green taxis are allowed to operate in upper Manhattan, they account for barely a fraction of yellow taxi activity:

manhattan pickups

Uber has grown dramatically in Manhattan as well, notching a 275% increase in pickups from June 2014 to June 2015, while taxi pickups declined by 9% over the same period. Uber made 1.4 million more Manhattan pickups in June 2015 than it did in June 2014, while taxis made 1.1 million fewer pickups. However, even though Uber picked up nearly 2 million Manhattan passengers in June 2015, Uber still accounts for less than 15% of total Manhattan pickups:

manhattan uber pickups

Queens still has more yellow taxi pickups than green taxi pickups, but that’s entirely because LaGuardia and JFK airports are both in Queens, and they are heavily served by yellow taxis. And although Uber has experienced nearly Brooklyn-like growth in Queens, it still lags behind yellow and green taxis, though again the yellow taxis are heavily influenced by airport pickups:

queens uber pickups

If we restrict to pickups at LaGuardia and JFK Airports, we can see that Uber has grown to over 100,000 monthly pickups, but yellow cabs still shuttle over 80% of car-hailing airport passengers back into the city:

airport pickups

The Bronx and Staten Island have significantly lower taxi volume, but you can see graphs for both on GitHub. The most noteworthy observations are that almost no yellow taxis venture to the Bronx, and Uber is already more popular than taxis on Staten Island.

How Long does it Take to Get to an NYC Airport?

Most of these vehicles [heading to JFK Airport] would undoubtedly be using the Van Wyck Expressway; Moses’s stated purpose in proposing it was to provide a direct route to the airport from mid-Manhattan. But the Van Wyck Expressway was designed to carry—under “optimum” conditions (good weather, no accidents or other delays)—2,630 vehicles per hour. Even if the only traffic using the Van Wyck was JFK traffic, the expressway’s capacity would not be sufficient to handle it.

[…] The air age was just beginning: air traffic was obviously going to boom to immense dimensions. If the Van Wyck expressway could not come anywhere near handling JFK’s traffic when that traffic was 10,000 persons per hour, what was going to happen when that traffic increased to 15,000 persons per hour? To 20,000?

—Robert Caro, The Power Broker: Robert Moses and the Fall of New York (1974)

A subject near and dear to all New Yorkers’ hearts: how far in advance do you have to hail a cab in order to make your flight at one of the three area airports? Of course, this depends on many factors: is there bad rush hour traffic? Is the UN in session? Will your cab driver know a “secret” shortcut to avoid the day’s inevitable bottleneck on the Van Wyck?

I took all weekday taxi trips to the airports and calculated the distribution of how long it took to travel from each neighborhood to the airports at each hour of the day. In most cases, the worst hour to travel to an airport is 4–5 PM. For example, the median taxi trip leaving Midtown headed for JFK Airport between 4 and 5 PM takes 64 minutes! 10% of trips during that hour take over 84 minutes—good luck making your flight in that case.

If you left Midtown heading for JFK between 10 and 11 AM, you’d face a median trip time of 38 minutes, with a 90% chance of getting there in less than 50 minutes. Google Maps estimates about an hour travel time on public transit from Bryant Park to JFK, so depending on the time of day and how close you are to a subway stop, your expected travel time might be better on public transit than in a cab, and you could save a bunch of money.

The stories are similar for traveling to LaGuardia and Newark airports, and from other neighborhoods. You can see the graphs for airport travel times from any neighborhood by selecting it in the dropdown below:

Travel time from Midtown, Manhattan to…

LaGuardia Airport

JFK Airport

Newark Airport

You can view airport graphs for other neighborhoods by selecting a neighborhood from the dropdown above.

Could Bruce Willis and Samuel L. Jackson have made it from the Upper West Side to Wall Street in 30 minutes?

Airports aren’t the only destinations that suffer from traffic congestion. In Die Hard: With a Vengeance, John McClane (Willis) and Zeus Carver (Jackson) have to make it from 72nd and Broadway to the Wall Street 2/3 subway station during morning rush hour in less than 30 minutes, or else a bomb will go off. They commandeer a taxi, drive it frantically through Central Park, tailgate an ambulance, and just barely make it in time (of course the bomb goes off anyway…). Thanks to the TLC’s publicly available data, we can finally address audience concerns about the realism of this sequence.

McClane and Carver leave the Upper West Side at 9:50 AM, so I took all taxi rides that:

Picked up in the Upper West Side census tracts between West 70th and West 74th streets
Dropped off in the downtown tract containing the Wall Street 2/3 subway stop
Picked up on a weekday morning between 9:20 and 10:20 AM

And made a histogram of travel times:

die hard 3

There are 580 such taxi trips in the dataset, with a mean travel time of 29.8 minutes, and a median of 29 minutes. That means that half of such trips actually made it within the allotted time of 30 minutes! Now, our heroes might need a few minutes to commandeer a cab and get down to the subway platform on foot, so if we allot 3 minutes for those tasks and 27 minutes for driving, then only 39% of trips make it in 27 minutes or less. Still, in the movie they make it seem like a herculean task with almost zero probability of success, when in reality it’s just about average. This seems to be the rare action movie sequence which is actually easier to recreate in real life than in the movies!

How Does Weather Affect Taxi and Uber Ridership?

Since 2009, the days with the fewest city-wide taxi trips all have obvious relationships to the weather. The days with the fewest taxi trips were:

Sunday, August 28, 2011, Hurricane Irene, 28,596 trips
Monday, December 27, 2010, North American blizzard, 69,650 trips
Monday, October 29, 2012, Hurricane Sandy, 111,605 trips

I downloaded daily Central Park weather data from the National Climatic Data Center, and joined it to the taxi data to see if we could learn anything else about the relationship between weather and taxi rides. There are lots of confounding variables, including seasonal trends, annual growth due to boro taxis, and whether weather events happen to fall on weekdays or weekends, but it would appear that snowfall has a significant negative impact on daily taxi ridership:

snowfall

On the other hand, rain alone does not seem to affect total daily ridership:

precipitation

Since Uber trip data is only available for a handful of months, it’s more difficult to measure the impact of weather on Uber ridership. Uber is well-known for its surge pricing during times of high demand, which often includes inclement weather. There were a handful of rainy and snowy days in the first half of 2015 when Uber data is available, so for each rain/snow day, I calculated the total number of trips made by taxis and Ubers, and compared that to each service’s daily average over the previous week. For example, Uber’s ratio of 69% on 1/26/15 means that there were 69% as many Uber trips made that day compared to Uber’s daily average from 1/19–1/25:

Date	Snowfall in inches	Taxi trips vs. prev week	Uber trips vs. prev week
1/26/15	5.5	55%	69%
1/27/15	4.3	33%	41%
2/2/15	5.0	91%	107%
3/1/15	4.8	85%	88%
3/5/15	7.5	83%	100%
3/20/15	4.5	105%	134%

Date	Precipitation in inches	Taxi trips vs. prev week	Uber trips vs. prev week
1/18/15	2.1	98%	112%
3/14/15	0.8	114%	130%
4/20/15	1.4	90%	105%
5/31/15	1.5	96%	116%
6/1/15	0.7	99%	106%
6/21/15	0.6	92%	94%
6/27/15	1.1	114%	147%

Although this data does not conclusively prove anything, on every single inclement weather day in 2015, in both rain and snow, Uber provided more trips relative to its previous week’s average than taxis did. Part of this is probably because the number of Uber cars is still growing, so all things held constant, we’d expect Uber to provide more trips on each successive day, while total taxi trips stay flat. But for Uber’s ratio to be higher every single day seems unlikely to be random chance, though again I have no justification to make any strong claims. Whether it’s surge pricing or something else, Uber’s capacity seems less negatively impacted by bad weather relative to taxi capacity.

NYC Late Night Taxi Index

Many real estate listings these days include information about the neighborhood: rankings of local schools, walkability scores, and types of local businesses. We can use the taxi data to draw some inferences about what parts of the city are popular for going out late at night by looking at the percentage of each census tract’s taxi pickups that occur between 10 PM and 5 AM—the time period I’ve deemed “late night.”

Some people want to live in a city that never sleeps, while others prefer their peace and quiet. According to the late night taxi index, if you’re looking for a neighborhood with vibrant nightlife, try Williamsburg, Greenpoint, or Bushwick in Brooklyn. The census tract with the highest late night taxi index is in East Williamsburg, where 76% of taxi pickups occur between 10 PM and 5 AM. If you insist on Manhattan, then your best bets are the Lower East Side or the Meatpacking District.

Conversely, if you want to avoid the nighttime commotion, head uptown to the Upper East or Upper West Side (if you’re not already there…). The stretch in the east 80s between 5th Avenue and Park Avenue has the lowest late night taxi index, with only 5% of all taxi pickups occurring during the nighttime hours.

Here’s a map of all census tracts that had at least 50,000 taxi pickups, where darker shading represents a higher score on the late night taxi index:

late night taxi map

BK nights: 76% of the taxi pickups that occur in one of East Williamsburg’s census tracts happen between 10 PM and 5 AM, the highest rate in the city. A paltry 5% of taxi pickups in some Upper East Side tracts occur in the late night hours

Whither the Bridge and Tunnel Crowd?

The “bridge and tunnel” moniker applies, on a literal level, to anyone who travels onto the island of Manhattan via a bridge or tunnel, most often from New Jersey, Long Island, or the outer boroughs. Typically it’s considered an insult, though, with the emerging popularity of the outer boroughs, well, let’s just say the Times is on it.

In order to measure B&T destinations from the taxi data, I isolated all trips originating near Penn Station on Saturday evenings between 6 PM and midnight. Penn Station serves as the point of disembarkation for New Jersey Transit and Long Island Rail Road, so although not everyone hailing a taxi around Penn Station on a Saturday evening just took the train into the city, it should be at least a decent proxy for B&T trends. Here’s the map of the neighborhoods where these rides dropped off:

bridge and tunnel

The most popular destinations for B&T trips are in Murray Hill, the Meatpacking District, Chelsea, and Midtown. We can even drill down to the individual trip level to see exactly where these trips wind up. Here’s a map of Murray Hill, the most popular B&T destination, where each dot represents a single Saturday evening taxi trip originating at Penn Station:

murray hill

As reported, repeatedly, in the NYT, the heart of Murray Hill nightlife lies along 3rd Avenue, in particular the stretch from 32nd to 35th streets. Taxi data shows the plurality of Saturday evening taxi trips from Penn Station drop off in this area, with additional clusters in the high 20s on 3rd Avenue, further east along 34th Street, and a spot on East 39th Street between 1st and 2nd avenues. With a bit more work we might be able to reverse geocode these coordinates to actual bar names, perhaps putting a more scientific spin on this classic of the genre from Complex.

Northside Williamsburg

According to taxi activity, the most ascendant census tract in the entire city since 2009 lies on Williamsburg’s north side, bounded by North 14th St to the north, Berry St to the east, North 7th St to the south, and the East River to the west:

northside williamsburg

The Northside neighborhood is known for its nightlife: a full 72% of pickups occur during the late night hours. It’s difficult to compare 2009–2015 taxi growth across census tracts and boroughs because of the introduction of the green boro taxi program, but the Northside tract had a larger increase in total taxi pickups over that time period than any other tract in the city, with the exception of the airports:

northside williamsburg

Even before the boro taxi program began in August 2013, Northside Williamsburg experienced a dramatic increase in taxi activity, growing from a mere 500 monthly pickups in June 2009, to 10,000 in June 2013, and 25,000 by June 2015. Let’s look at an animated map of taxi pickups to see if we can learn anything:

map

The cool thing about the animation is that it lets us pinpoint the exact locations of some of the more popular Northside businesses to open in the past few years, in particular along Wythe Avenue:

May 2012: Wythe Hotel, Wythe and N 11th
January 2013: Output nightclub, Wythe and N 12th
March 2014: Verboten nightclub, N 11th between Wythe and Kent

Meanwhile, I’m sure the developers of the future William Vale and Hoxton hotels hope that the Northside’s inexorable rise continues, but at least according to taxi data, pickups have remained stable since mid-2014, perhaps indicating that the neighborhood’s popularity has plateaued?

Privacy Concerns, East Hampton Edition

The first time the TLC released public taxi data in 2013, following a FOIL request by Chris Whong, it included supposedly anonymized taxi medallion numbers for every trip. In fact it was possible to decode each trip’s actual medallion number, as described by Vijay Pandurangan. This led to many discussions about data privacy, and the TLC removed all information about medallion numbers from the more recent data releases.

But the data still contains precise latitude and longitude coordinates, which can potentially be used to determine where people live, work, socialize, and so on. This is all fun and games when we’re looking at the hottest new techno club in Northside Williamsburg, but when it’s people’s homes it gets a bit weird. NYC is of course very dense, and if you take a rush hour taxi ride from one populus area to another, say Grand Central Terminal to the Upper East Side, it’s unlikely that there’s anything unique about your trip that would let someone figure out where you live or work.

But what if you’re going somewhere a bit off the beaten path for taxis? In that case, your trip might well be unique, and it might reveal information about you. For example, I don’t know who owns one of theses beautiful oceanfront homes on East Hampton’s exclusive Further Lane (exact address redacted to protect the innocent):

But I do know the exact Brooklyn Heights location and time from which someone (not necessarily the owner) hailed a cab, rode 106.6 miles, and paid a $400 fare with a credit card, including a $110.50 tip. If the TLC truly wanted to remove potentially personal information, they would have to remove latitude and longitude coordinates from the dataset entirely. There’s a tension that public data is supposed to let people know how well the taxi system serves different parts of the city, so maybe the TLC should provide census tracts instead of coordinates, or perhaps only coordinates within busy parts of Manhattan, but providing coordinates that uniquely identify a rider’s home feels excessive.

Investment Bankers

While we’re on the topic of the Hamptons: we’ve already covered the hipsters of Williamsburg and the B&Ts of Murray Hill, why not see what the taxi data can tell us about investment bankers, yet another of New York’s distinctive subcultures?

Goldman Sachs lends itself nicely to analysis because its headquarters at 200 West Street has a dedicated driveway, just east of the path marked “Hudson River Greenway” on this Google Map:

goldman sachs

We can isolate all taxi trips that dropped off in that driveway to get a sense of where Goldman Sachs employees—at least the ones who take taxis—come from in the mornings, and when they arrive. Here’s a histogram of weekday drop off times at 200 West Street:

goldman sachs drop offs

The cabs start dropping off around 5 AM, then peak hours are 7–9 AM, before tapering off in the afternoon. Presumably most of the post-morning drop offs are visitors as opposed to employees. If we restrict to drop offs before 10 AM, the median drop off time is 7:59 AM, and 25% of drop offs happen before 7:08 AM.

A few blocks to the north is Citigroup’s headquarters at 388 Greenwich St, and although the building doesn’t appear to have a dedicated driveway the way Goldman does, we can still isolate taxis that drop off directly in front of the building to see what time Citigroup’s workers arrive in the morning:

citigroup drop offs

Some of the evening drop offs near Citigroup are probably for the bars and restaurants across the street, but again the morning drop offs are probably mostly Citigroup employees. Citigroup’s morning arrival stats are comparable to Goldman’s: a median arrival of 7:51 AM, and 25% of drop offs happen before 7:03 AM.

The top neighborhoods for taxi pickups that drop off at Goldman Sachs or Citigroup on weekday mornings are:

West Village
Chelsea-Flatiron-Union Square
SoHo-Tribeca

So what’s the deal, do bankers not live above 14th St (or maybe 23rd St) anymore? Alas, there are still plenty of trips from the stodgier parts further uptown, and it’s certainly possible that people coming from uptown are more likely to take the subway, private cars, or other modes of transport, so the taxi data is by no means conclusive. But still, the cool kids have been living downtown for a while now, why should the bankers be any exception?

Parting Thoughts

As I mentioned in the introduction, this post covers a lot. And even then, I feel like it barely scratches the surface of the information available in the full dataset. For example, did you know that in January 2009, just over 20% of taxi fares were paid with a credit card, but as of June 2015, that number has grown to over 60% of all fares?

cash vs credit

And for more expensive taxi trips, riders now pay via credit card more than 75% of the time:

cash vs credit

There are endless analyses to be done, and more datasets that could be merged with the taxi data for further investigation. The Citi Bike program releases public ride data; I wonder if the introduction of a bike-share system had a material impact on taxi ridership? [Update: I did some analysis of the Citi Bike system, and also an analysis of when Citi Bikes are faster than taxis and vice versa] And maybe we could quantify fairweather fandom by measuring how taxi volume to Yankee Stadium and Citi Field fluctuates based on the Yankees’ and Mets’ records?

There are investors out there who use satellite imagery to make investment decisions, e.g. if there are lots of cars in a department store’s parking lots this holiday season, maybe it’s time to buy. You might be able to do something similar with the taxi data: is airline market share shifting, based on traffic through JetBlue’s terminal at JFK vs. Delta’s terminal at LaGuardia? Is demand for lumber at all correlated to how many people are loading up on IKEA furniture in Red Hook?

I’d imagine that people will continue to obtain Uber data via FOIL requests, so it will be interesting to see how that unfolds amidst increased tension with city government and constant media speculation about a possible IPO.

Lastly, I mentioned the “medium data revolution” in my previous post about Fannie Mae and Freddie Mac, and the same ethos applies here. Not too long ago, the idea of downloading, processing, and analyzing 267 GB of raw data containing 1.1 billion rows on a commodity laptop would have been almost laughably naive. Today, not only is it possible on a MacBook Air, but there are increasingly more open-source software tools available to aid in the process. I’m partial to PostgreSQL and R, but those are implementation details: increasingly, the limiting factor of data analysis is not computational horsepower, but human curiosity and creativity.

GitHub

If you’re interested in getting the data and doing your own analysis, or just want to read a bit about the more technical details, head over to the GitHub repository.

Update September 2016

The NYC Taxi & Limousine Commission has released an additional year of data, covering taxis, Uber, and other for-hire vehicle (FHV) trips through June 2016. The complete dataset now includes over 1.3 billion trips, and the GitHub repo has been updated to process everything, including the new FHV file formats.

In Brooklyn, Uber is now bigger than taxis

October 12, 2015 marked the first day that Uber made more pickups in Brooklyn than yellow and green taxis combined. As of June 2016, Uber makes 60% more pickups per day than taxis do, and the gap appears to be growing. Lyft has also surpassed yellow taxis in Brooklyn, but still makes fewer pickups than green boro taxis.

brooklyn

Taxis still rule Manhattan and the airports, for now

In Manhattan, taxis still make more than three times as many pickups per day than Ubers do. But taxi activity shrank by 10% from June 2015 to June 2016, while Uber grew by 63% over the same time period. That’s a 1.1 million trips per month loss for taxis, coupled with a 1.2 million trips per month increase for Uber.

manhattan

Uber has also increased its share of pickups at LaGuardia and JFK airports. Uber’s airport pickups doubled in the past year while taxi activity remained flat, and Uber now makes 40% as many pickups at NYC airports compared to taxis.

airports

Will Uber overtake taxis in New York City?

Uber’s growth rate in NYC is slowing, which is not terriby surprising since intuitively it should be harder for a company to grow as it serves a larger percentage of the population. That said, Uber’s NYC year-over-year growth was still +90% as of June 2016, down from +325% one year earlier.

Taxi losses accelerated slightly over the same time period: year-over-year pickups declined 10% as of June 2016, compared to a loss of 7% the year before.

If taxi trips average an 8% annual decline over the next two years, then Uber would have to average a 40% annual growth rate in order to equal taxi activity by June 2018.

If we consider ridesharing services as a group—specifically Uber, Lyft, Via, Juno, and Gett—then that aggregate cohort would have to average a 22% annual growth rate over the next two years, again assuming 8% annual taxi decline (note that Via, Juno, and Gett do not yet appear in the trip-level TLC data, but they do report aggregate trip counts).

nyc

There are enough unknowns—in particular, I wonder if ridesharing fares are unsustainably low due to intense competition—that it’s impossible to say if or when the lines will cross, but at least for now, the overall trend is unmistakable.

You can continue to see monthly live-updating TLC aggregate data here, and the open-source code to process and analyze everything is here.

Update March 2018

Ride-hailing’s NYC dominance, and the impact of #DeleteUber

It’s been 18 months since I last updated this post, and the dataset through December 2017 has grown to over 1.4 billion taxi trips and another 400 million for-hire vehicle trips, including ride-hailing apps Uber, Lyft, Juno, and Via.

The Taxi & Limousine Commission’s monthly aggregate reports have shown for some time that ride-hailing apps have surpassed taxis in total popularity, but the granular trip-level dataset paints a more complete picture, allowing us to explore geographic trends, and the fallout from the January 2017 protest at JFK airport and the ensuing #DeleteUber social media campaign.

The GitHub repository has been updated to process the latest data, including additional analysis scripts covering the contents of this update.

Ride-hailing apps are now 65% bigger than taxis in New York City

February 2017 marked the first month that ride-hailing services collectively made more trips than yellow and green taxis combined, and by December 2017, ride-hailing services made 65% more pickups than taxis did. The ride-hailing cohort now makes more pickups per month than taxis did in any month since the dataset began 2009.

ride-hailing vs. taxis

Uber alone is now bigger than yellow and green taxis combined, first achieving that milestone in November 2017.

ride-hailing vs. taxis

Over the past 4 years, ride-hailing apps have grown from 0 to 15 million trips per month, while taxi usage has only declined by around 5 million trips per month. The TLC dataset also contains some information about non-app FHVs, what you might call traditional “black cars”, whose usage has declined by just under 1 million trips per month since the end of 2015. It’s possible this net increase in taxi/FHV trips has been at least partially offset by a decline in private or other vehicle usage, but the TLC dataset doesn’t tell us anything about that.

Ride-hailing apps are 10 times bigger than taxis in the outer boroughs

Ride-hailing services have been more popular than taxis in the outer boroughs since the beginning of 2016, but it’s still impressive to see how dramatically the gap has widened. In the outer boroughs, Uber and Lyft are each bigger than yellow and green taxis combined.

outer boroughs ride-hailing vs. taxis

Taxis are losing their edge in Manhattan and at the airports

In fact there’s a very good chance that ride-hailing apps have already surpassed taxis in Manhattan as I write this in March 2018, but it’ll be a few more months before the data can confirm. A similar result holds at JFK and LaGuardia airports.

manhattan ride-hailing vs. taxis

If we restrict to Manhattan south of 60th Street, the proposed congestion pricing zone in the Fix NYC plan, then ride-hailing services are already more popular than taxis. This surprised me; I would have guessed that ride-hailing’s Manhattan market share would be higher above 60th Street than below it, but it turns out that the Upper East Side is one of the areas with the highest taxi market share.

The #DeleteUber campaign probably had a noticeable, if short-lived, effect

On January 28, 2017, the New York Taxi Workers Alliance called for a work stoppage at JFK airport from 6 PM to 7 PM as a protest against the Trump administration’s proposed travel ban on Muslim-majority countries. Uber later suspended surge pricing at JFK, which some people perceived as an attempt to undermine the taxi strike—a claim that Uber denied. Regardless of intentions, the #DeleteUber hashtag trended on social media, and was widely reported in many news outlets.

The week after the JFK taxi strike, Uber suffered it’s largest week-over-week market share decline since mid-2015, while rival Lyft enjoyed its largest weekly market share increase over the same period. But viewed against the longer-term trend of Uber’s declining market share, the bigger-than-normal decline the week after the JFK taxi strike doesn’t look all that significant, especially considering that 2 weeks after the strike, Uber rebounded with its largest weekly market share increase of 2017.

Uber’s share of all ride-hailing trips has generally declined since 2015 as more competitors entered the NYC market, even as its total number of trips has increased dramatically.

nyc ride-hail market share

Plenty of caveats apply; Uber’s dip and Lyft’s bump might have been due to factors other than political protests. The NYT reported that “[a]bout half a million people requested deleting their Uber accounts over the course of that week”, but we don’t know how many trips those people would typically account for, and we don’t know if they switched from Uber to other ride-hailing apps.

It’s also possible that Lyft started running aggressive pricing promotions in February 2017, and it was those promotions that drove Lyft’s market share increase. Similarly, Uber’s recovery bump 2 weeks after the taxi strike might have been motivated by returning users who were convinced by the company’s apology, or maybe Uber ran pricing promotions as a form of damage control. To be clear, I don’t know if any of the above things happened, but they all sound plausible. And again, don’t lose sight of the fact that even as Uber’s share of all ride-hailing trips has declined, its total number of trips has grown, as has the total number of ride-hailing trips across the city.

Lyft saw its biggest gains in “Gentrified Brooklyn”

I was curious how Uber vs. Lyft market share varied by neighborhood in the immediate wake of the #DeleteUber campaign, so I calculated Lyft’s change in market share from the month before the JFK taxi strike (1/1–1/28) to the week after the strike (1/29–2/4) for every neighborhood in the city.

The map shows that the neighborhoods where Lyft gained the most market share are mostly concentrated in what I’d call, for lack of a better term, “Gentrified Brooklyn.” In Gowanus, Greenpoint, and Prospect Heights, Lyft doubled its market share from around 15% before the strike to 30% after it, and maintained that elevated market share throughout the rest of 2017.

uber vs. lyft nyc

Lyft has developed a bit of a reputation as a more liberal-minded company than Uber, once even adopting the unconventional corporate tactic of calling itself “woke”. Sure enough, the neighborhoods of northern and northwestern Brooklyn—where Lyft gained the most—have among the more liberal reputations in the city. On the flip side, some sections of southern Brooklyn and Staten Island that supported Donald Trump in the 2016 general election are also where Lyft gained the least market share.

I gathered the 2016 presidential election results for every neighborhood in the city, then compared Lyft’s market share gain in each neighborhood to the neighborhood’s voting patterns. The data shows that, on average, Lyft gained more market share from Uber in neighborhoods that voted more heavily for Hillary Clinton.

lyft usage vs. clinton votes

The correlation is not terribly strong, and the relationship says nothing about causality; there could be many confounding factors that are correlated to both political and ride-hailing app preferences. In many cases, ride-hailers are not local voters, especially in commercial areas like Midtown, Manhattan. Maybe most damning to the analysis, if we extend beyond the time period surrounding the JFK taxi strike and consider Lyft’s market share increase by neighborhood for all of 2017 vs. 2016, then the correlation to voting patterns disappears almost entirely. Still, given everything I know, I would guess that liberal voters were in fact more likely to switch from Uber to Lyft in the immediate wake of the incident (and for what it’s worth, Lyft’s market share increase was more correlated with voting preference for Green Party candidate Jill Stein than it was for Hillary Clinton).

What about the taxi strike itself?

Perhaps lost in the commotion, but neither the taxi strike nor Uber’s surge pricing suspension looks to have had much impact on the number of pickups at JFK on the afternoon/evening of January 28, 2017.

jfk hourly taxi pickups

As a reminder, the code used for this update is available here on GitHub, along with some of the aggregated data.

Electability of 2016 Presidential Candidates as Implied by Betting Markets

It’s fairly commonplace these days for news outlets to reference prediction markets as part of the election cycle. We often hear about betting odds on who will win the primary or be the next president, but I haven’t seen many commentators use prediction markets to infer the electability of each candidate.

With that in mind, I took the betting odds for the 2016 US presidential election from Betfair and used them to calculate the perceived electability of each candidate. Electability is defined as a candidate’s conditional probability of winning the presidency, given that the candidate earns his or her party’s nomination.

Presidential betting market odds and electabilities

Candidate	Win Nomination	Win Presidency	Electability if Nominated

“Electability” refers to a candidate’s conditional probability of winning the presidency, given that the candidate wins his or her party’s nomination

Note: the following section was written September 15, 2015. Things have changed since then, invalidating some of what’s written below

I’m no political analyst, and the data above will continue to update throughout the election season, making anything I write here about it potentially immediately outdated, but according to the data at the time I wrote this on September 15, 2015, betting markets perceive Hillary Clinton as the most electable of the declared candidates, with a 57%–58% chance of winning the presidency if she receives the Democratic nomination. Betting markets also imply that the Democrats are the favorites overall, with about a 57% chance of winning the presidency, which is roughly the same as Clinton’s electability, so it appears that Clinton is considered averagely electable compared to the Democratic party as a whole.

On the Republican side, Jeb Bush has the best odds of winning the nomination, but his electability range of 47%–49% means he’s considered a slight underdog in the general election should he win the nomination. Still, that’s better than Marco Rubio (36%–40%) and Scott Walker (33%–42%), who each have lower electabilities, implying that they would be bigger underdogs if they were nominated. The big surprise to me is that Donald Trump has a fairly high electability range relative to the other Republicans, at 47%–56%. Maybe the implication is something like, “if there’s an unanticipated factor that enables the surprising result of Trump winning the nomination, then that same factor will work in his favor in the general election,” but then that logic should apply to other longshot candidates, which it seems not to, so perhaps other caveats apply.

Why are the probabilities given as ranges?

Usually when you read something in the news like “according to [bookmaker], candidate A has a 25% chance of winning the primary”, that’s not quite the complete story. The bookmaker might well have posted odds on A to win the primary at 3:1, which means you could bet $1 on A to win the primary, and if you’re correct then you’ll collect $4 from the bookmaker for a profit of $3. Such a bet has positive expected value if and only if you believe the candidate’s probability of winning the primary is greater than 25%. But traditional bookmakers typically don’t let you take the other side of their posted odds. In other words, you probably couldn’t bet $3 on A to lose the nomination, and receive a $1 profit if you’re correct.

Betting markets like Betfair, though, do allow you to bet in either direction, but not at the same odds. Maybe you can bet on candidate A to win the nomination at a 25% risk-neutral probability, but if you want to bet on A to lose the nomination, you might only be able to do so at a 20% risk-neutral probability, which means you could risk $4 for a potential $1 profit if A loses the nomination, or 1:4 odds. The difference between where you can buy and sell is known as the bid-offer spread, and it reflects, among other things, compensation for market-makers.

The probabilities in the earlier table are given as ranges because they reflect this bid-offer spread. If candidate A’s bid-offer is 20%–25%, and you think that A’s true probability is 30%, then betting on A at 25% seems like an attractive option, or if you think that A’s true probability is 15% then betting against A at 20% is also attractive. But if you think A’s true probability falls between 20% and 25%, then you probably don’t have any bets to make, though you might consider becoming a market-maker yourself by placing a bid or offer at an intermediate level and waiting for someone else to come along and take the opposite position.

A hypothetical example calculation of electability

Betfair offers betting markets on the outcome of the general election, and the outcomes of the Democratic and Republican primary elections. Although Betfair does not offer betting markets of the form “candidate A to win the presidency, if and only if A wins the primary”, bettors can place simultaneous bets on A’s primary and general election outcomes in a ratio such that the bettor will break even if A loses the primary, and make or lose money only in the scenario where A wins the primary.

Let’s continue the example with our hypothetical candidate A, who has a bid-offer 20%–25% in the primary, and let’s say a bid-offer 11%–12.5% in the general election. If we bet $25 on A to win the general election at a 12.5% probability, then our profit across scenarios looks like this:

Bet $25 on candidate A to win the general election at 12.5% probability (7:1 odds)

Scenario	Amount at Risk	Payout from general bet	Profit
A loses primary	$25	$0	-$25
A wins primary, loses general	$25	$0	-$25
A wins primary, wins general	$25	$200	$175

We want our profit to be $0 in the “loses primary” scenario, so we can add a hedging bet that will pay us a profit of $25 if A loses the primary. That bet is placed at a 20% probability, which means our odds ratio is 1:4, so we have to risk $100 in order to profit $25 in case A loses the primary. Now we have a total of $125 at risk: $25 on A to win the presidency, and $100 on A to lose the nomination. The scenarios look like this:

Bet $25 on candidate A to win the general election at 12.5% probability (7:1 odds) and $100 on A to lose the primary at 20% probability (1:4 odds)

Scenario	Amount at risk	Payout from primary bet	Payout from general bet	Profit
A loses primary	$125	$125	$0	$0
A wins primary, loses general	$125	$0	$0	-$125
A wins primary, wins general	$125	$0	$200	$75

We’ve constructed our bets so that if A loses the primary, then we neither make nor lose money, but if A wins the primary, then we need A’s probability of winning the election to be greater than 62.5% in order to make our bet positive expected value, since 0.625 * 75 + 0.375 * -125 = 0. As an exercise for the reader, you can go through similar logic to show that if you want to bet on A to lose the presidential election but have 0 profit in case A loses the primary, then you need A’s conditional probability of winning the general election to be lower than 44% in order to make the bet positive expected value. In this example then, A’s electability range is 44%–62.5%.

Caveats

This analysis does not take into account the total amount of money available to bet on each candidate. As of September 2015, Betfair has handled over $1 million of bets on the 2016 election, but the markets on some candidates are not as deep as others. If you actually tried to place bets in the fashion described above, you might find that there isn’t enough volume to fully hedge your exposure to primary results, or you might have to accept significantly worse odds in order to fill your bets.

It’s possible that someone might try to manipulate the odds by bidding up or selling down some combination of candidates. Given the amount of attention paid to prediction markets in the media, and the amount of money involved, it’s probably not a bad idea. In 2012 someone tried to do this to make it look like Mitt Romney was gaining momentum, but enough bettors stepped in to take the other sides of those bets and Romney’s odds fell back to where they started. Even though that attempt failed, people might try it again, and if/when they do, they might even succeed, in which case betting market data might only reflect what the manipulators want it to, as opposed to the wisdom of the crowds.

The electability calculation ignores the scenario where a candidate loses the primary but wins the general election. I don’t think this has ever happened on the national level, but it happened in Connecticut in 2006, and it probably has a non-zero probability of happening nationally. If it were to happen, and you had placed bets on the candidate to win the primary and lose the election, you might find that your supposedly safe “hedge” wasn’t so safe after all (on the other hand, you might get lucky and hit on both of your bets…). Some have speculated that Donald Trump in particular might run as an independent candidate if he doesn’t receive the Republican nomination, so whatever (probably small) probability the market assigns to the scenario of “Trump loses the Republican nomination but wins the presidency” would inflate his electability.

There are probably more caveats to list, for example I’ve failed to consider any trading fees or commissions incurred when placing bets. Additionally, though I have no proof, as mentioned earlier I’d guess that candidates who are longshots to win the primaries probably have higher electabilities due to the implicit assumption that if something so dramatic were to happen that caused them to win the primary, probably the same factor would help their odds in the general election.

Despite all of these caveats, I believe that the implied electability numbers do represent to some degree how bettors expect the candidates to perform in the general election, and I wonder if there should be betting markets set up that allow people to wager directly on these conditional probabilities, rather than having to place a series of bets to mimic the payout structure.

← Older Posts All Posts Newer Posts →

One Day in the Life of the Citi Bike Share System

The Data

Age, Gender, and the Accuracy of Google Maps Cycling Time Estimates

Anonymizing Data is Hard!

Magical Transports

Quantifying the Impact of the Weather on Citi Bike Activity

In Conclusion

GitHub

Table of Contents

Maps

NYC Taxi Data

Uber Data

Borough Trends, and the Rise of Uber

How Long does it Take to Get to an NYC Airport?

Travel time from Midtown, Manhattan to…

LaGuardia Airport

JFK Airport

Newark Airport

Could Bruce Willis and Samuel L. Jackson have made it from the Upper West Side to Wall Street in 30 minutes?

How Does Weather Affect Taxi and Uber Ridership?

NYC Late Night Taxi Index

Whither the Bridge and Tunnel Crowd?

Northside Williamsburg

Privacy Concerns, East Hampton Edition

Investment Bankers

Parting Thoughts

GitHub

Update September 2016

In Brooklyn, Uber is now bigger than taxis

Taxis still rule Manhattan and the airports, for now

Will Uber overtake taxis in New York City?

Update March 2018

Ride-hailing’s NYC dominance, and the impact of #DeleteUber

Ride-hailing apps are now 65% bigger than taxis in New York City

Ride-hailing apps are 10 times bigger than taxis in the outer boroughs

Taxis are losing their edge in Manhattan and at the airports

The #DeleteUber campaign probably had a noticeable, if short-lived, effect

Lyft saw its biggest gains in “Gentrified Brooklyn”

What about the taxi strike itself?

Presidential betting market odds and electabilities

Why are the probabilities given as ranges?

A hypothetical example calculation of electability

Caveats