How Airbnb uses Machine Learning to Detect Host Preferences
By Bar Ifrach
I first heard about Airbnb in 2012 from a friend. He offered his nice apartment on the site when he traveled to see his family during our vacations from grad school. His main goal was to fit as many booked nights as possible into the 1–2 weeks when he was away. My friend would accept or reject requests depending on whether or not the request would help him to maximize his occupancy.
About two years later, I joined Airbnb as a Data Scientist. I remembered my friend’s behavior and was curious to discover what affects hosts’ decisions to accept accommodation requests and how Airbnb could increase acceptances and matches on the platform.
What started as a small research project resulted in the development of a machine learning model that learns our hosts’ preferences for accommodation requests based on their past behavior. For each search query that a guest enters on Airbnb’s search engine, our model computes the likelihood that relevant hosts will want to accommodate the guest’s request.Then, we surface likely matches more prominently in the search results. In our A/B testing the model showed about a 3.75% increase in booking conversion, resulting in many more matches on Airbnb. In this blog post I outline the process that brought us to this model.
What affects hosts’ acceptance decisions?
I kicked off my research into hosts’ acceptances by checking if other hosts maximized their occupancy like my friend. Every accommodation request falls in a sequence or in a window of available days in the calendar, such as on April 5–10 in the calendar shown below. The gray days surrounding the window are either blocked by the host or already booked. If accepted and booked, a request may leave the host with a sub-window before the check-in date (check-in gap — April 5–7) and/or a sub-window after the check-out (check-out gap — April 10).
A host looking to have a high occupancy will try to avoid such gaps. Indeed, when I plotted hosts’ tendency to accept over the sum of the check-in gap and the check-out gap (3+1= 4 in the example above), as in the next plot, I found the effect that I expected to see: hosts were more likely to accept requests that fit well in their calendar and minimize gap days.
But do all hosts try to maximize occupancy and prefer stays with short gaps? Perhaps some hosts are not interested in maximizing their occupancy and would rather host occasionally. And maybe hosts in big markets, like my friend, are different from hosts in smaller markets.
Indeed, when I looked at listings from big and small markets separately, I found that they behaved quite differently. Hosts in big markets care a lot about their occupancy — a request with no gaps is almost 6% likelier to be accepted than one with 7 gap nights. For small markets I found the opposite effect; hosts prefer to have a small number of nights between requests. So, hosts in different markets have different preferences, but it seems likely that even within a market hosts may prefer different stays.
A similar story revealed itself when I looked at hosts’ tendency to accept based on other characteristics of the accommodation request. For example, on average Airbnb hosts prefer accommodation requests that are at least a week in advance over last minute requests. But perhaps some hosts prefer short notice?
The plot below looks at the dispersion of hosts’ preferences for last minute stays (less than 7 days) versus far in advance stays (more than 7 days). Indeed, the dispersion in preferences reveals that some hosts like last minute stays better than far in advance stays — those in the bottom right — even though on average hosts prefer longer notice. I found similar dispersion in hosts’ tendency to accept other trip characteristics like the number of guests, whether it is a weekend trip etc.
All these findings pointed to the same conclusion: if we could promote in our search results hosts who would be more likely to accept an accommodation request resulting from that search query, we would expect to see happier guests and hosts and more matches that turned into fun vacations (or productive business trips).
In other words, we could personalize our search results, but not in the way you might expect. Typically personalized search results promote results that would fit the unique preferences of the searcher — the guest. At a two-sided marketplace like Airbnb, we also wanted to personalize search by the preference of the hosts whose listings would appear in the search results.
How to model host preferences?
Encouraged by my findings, I joined forces with another data scientist and a software engineer to create a personalized search signal. We set out to associate hosts’ prior acceptance and decline decisions by the following characteristics of the trip: check-in date, check-out date and number of guests. By adding host preferences to our existing ranking model capturing guest preferences, we hoped to enable more and better matches.
At first glance, this seems like a perfect case for collaborative filtering — we have users (hosts) and items (trips) and we want to understand the preference for those items by combining historical ratings (accept/decline) with statistical learning from similar hosts. However, the application does not fully fit in the collaborative filtering framework for two reasons.
First, no two trips are ever identical because behind each accommodation request there is a different guest with a unique human interaction that influences the host’s acceptance decision. This results in accept/decline labels that are noisier than, for example, the ratings of a movie or a song like in many collaborative filtering applications.
Taking this point one step further, a host can receive multiple accommodation requests for the same trip with different guests at different points in time and give those requests conflicting votes. A host may accept last minute stays that start on a Tuesday 2 out of 4 times, and it remains unclear whether the host prefers such stays.
With these points in mind, we decided to massage the problem into something resembling collaborative filtering. We used the multiplicity of responses for the same trip to reduce the noise coming from the latent factors in the guest-host interaction. To do so, we considered hosts’ average response to a certain trip characteristic in isolation. Instead of looking at the combination of trip length, size of guest party, size of calendar gap and so on, we looked at each of these trip characteristics by itself.
With this coarser structure of preferences we were able to resolve some of the noise in our data as well as the potentially conflicting labels for the same trip. We used the mean acceptance rate for each trip characteristic as a proxy for preference. Still our data-set was relatively sparse. On average, for each trip characteristic we could not determine the preference for about 26% of hosts, because they never received an accommodation request that met those trip characteristics. As a method of imputation, we smoothed the preference using a weight function that, for each trip characteristic, averages the median preference of hosts in the region with the host’s preference. The weight on the median preference is 1 when the host has no data points and goes to 0 monotonically the more data points the host has.
Using these newly defined preferences we created predictions for host acceptances using a L-2 regularized logistic regression. Essentially, we combine the preferences for different trip characteristics into a single prediction for the probability of acceptance. The weight the preference of each trip characteristic has on the acceptance decision is the coefficient that comes out of the logistic regression. To improve the prediction, we include a few more geographic and host specific features in the logistic regression.
This flow chart summarizes the modeling technique.
We ran this model on segments of hosts on our cluster using a user-generated-function (UDF) on Hive. The UDF is written in Python; its inputs are accommodation requests, hosts’ response to them and a few other host features. Depending on the flag passed to it, the UDF either builds the preferences for the different trip characteristics or trains the logistic regression model using scikit-learn.
Our main off-line evaluation metric for the model was mean squared error (MSE), which is more appropriate in a setting when we care about the predicted probability more than about classification. In our off-line evaluation of the model we were able to get a 10% decrease in MSE over our previous model that captured host acceptance probability. This was a promising result. But, we still had to test the performance of the model live on our site.
Experimenting with the model
To test the online performance of the model, we launched an experiment that used the predicted probability of host acceptance as a significant weight in our ranking algorithm that also includes many other features that capture guests’ preferences. Every time a guest in the treatment group entered a search query, our model predicted the probability of acceptance for all relevant hosts and influenced the order in which listings were presented to the guest, ranking likelier matches higher.
We evaluated the experiment by looking at multiple metrics, but the most important one was the likelihood that a guest requesting accommodation would get a booking (booking conversion). We found a 3.75% lift in our booking conversion and a significant increase in the number of successful matches between guests and hosts.
After concluding the initial experiment, we made a few more optimizations that improved conversion by approximately another 1% and then launched the experiment to 100% of users. This was an exciting outcome for our first full-fledged personalization search signal and a sizable contributor to our success.
First, this project taught us that in a two sided marketplace personalization can be effective on the buyer as well as the seller side.
Second, the project taught us that sometimes you have to roll up your sleeves and build a machine learning model tailored for your own application. In this case, the application did not quite fit in the collaborative filtering and a multilevel model with host fixed-effect was too computationally demanding and not suited for a sparse data-set. While building our own model took more time, it was a fun learning experience.
Finally, this project would not have succeeded without the fantastic work of Spencer de Mars and Lukasz Dziurzynski.