Hacking-Airbnb’s-search-rank-algorithm
Foreword
Introduction
Methodology
Software used:
Getting the data
  Getting an initial data set
  Diving Deeper
  Put it all together
Results
  Data Validation
  Describing the results
  A word about Correlation vs Causation
  The factors most correlated to page rank
  Airbnb’s changing stance on Instant Book
  The correlation of all the factors tested
  The correlation of amenities to search rank
Conclusions
About the Author
Appendix A
Appendix B
Hacking Airbnb’s search rank algorithm
FOREWORD
1. My work usually involves clients’ proprietary information, which I can neither share nor
publish. Since this project involved sourcing data on my own and doing analysis for my own
benefit, I am happy to share it. Please feel free to share or use this information as you like.
2. Collecting this data required scraping publicly available data on a large scale using an
automated process. This is an ethical grey area, and steps were taken to ensure that I was not
DDoSing the servers I was scraping. More on this in Appendix B.
INTRODUCTION
In my spare time I manage a portfolio of rental properties in Cape Town and host them on
www.airbnb.com, and I would like to use data to assist with marketing them. This project aims to
uncover the factors that influence search results on the Airbnb site. Search engine optimisation (SEO)
has seen much focus with the rise of search engines, and this project asks the equivalent question for
Airbnb: what drives its search results? Hopefully we can identify what an Airbnb host can do to achieve
a better rank. As a host you’d like your property to appear as close to the top as possible, as this leads
to more bookings and more money.
The data used is scraped from the Airbnb site itself, specifically in Cape Town where the properties I
host are located. What follows is an analysis of the short-term rental market in Cape Town on
www.airbnb.com, specifically for rental properties that can accommodate 6 or more people (as this is
what the properties I manage offer).
METHODOLOGY
1. Use automated software to download/scrape data from www.airbnb.com
2. Organise the data into a database
3. Analyse what aspects of hosting correlate with listings ranking higher in search results
4. Identify what a host can do to rank higher in the search results
SOFTWARE USED:
Python 3.5.2
Scrapy 1.0.5 (Python Library)
import.io
Microsoft Excel & VBA
Qlikview – (SQL Load script)
GETTING THE DATA
Instead of manually going to every page to get the data we can make use of automated software to do
some web scraping. This will automatically download data from their site to later process into a
database. To get the initial data for the listings I set up an Extractor on import.io to scan the search page
from www.airbnb.com and scrape the following data (highlighted in the picture):
Figure 1: The data we want from the Airbnb search results page
Note that before scraping this information I had set the search criteria to only include listings that could
accommodate 6+ people. I also set the map to include pockets of greater Cape Town to limit the number
of search results. What import.io allows you to do is create your own API and then feed other URLs into
the API that will grab the same data for you from those pages.
Each search result page contains only 18 properties; however, the bottom of the page shows that there
are 17 pages of results. import.io allows you to add multiple URLs to the request when running the
Extractor. All we need to do is replicate the URL for page 1 and feed in the URLs for pages 2-17. I did
this by manipulating the string in Excel to create the URLs for all 17 pages of results.
Below is an example of a URL for page 1 (the page number appears as “page=1”):
https://www.airbnb.com/s/cape-town?guests=6&adults=6&children=0&infants=0&ss_id=ia2qhgz8&ss_preload=true&source=bb&page=1&allow_override%5B%5D=&ne_lat=-33.92949338148802&ne_lng=18.499001070258828&sw_lat=-33.983999649640914&sw_lng=18.43651632904789&zoom=14&search_by_map=true&s_tag=Bp9FFd3N
You’ll notice that the URL contains the filters we require for the search: 6 adults as well as the latitude
and longitude of the search area on the map.
To get the other 16 URLs I did some string manipulation in Excel that replaced the “1” with the
numbers 2-17:
Figure 2: The Excel formula used to insert the page numbers into the URLs
The formula:
=LEFT(B2,SEARCH("page=",B2)+4)&C2&RIGHT(B2,LEN(B2)-SEARCH("page=",B2)-5)
The formula above searches the URL for “page=”, takes everything up to and including it, appends the
page number (in column C) and then appends everything after the original single-digit page number. We
can then Autofill down 17 rows and we have our 17 URLs to feed into import.io.
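For readers who prefer code to spreadsheets, the same substitution can be sketched in Python. This is an illustrative alternative rather than the method used above; the function name build_page_urls and the shortened template URL are assumptions made for the example.

```python
import re


def build_page_urls(template_url, last_page):
    """Return one search URL per result page, from page 1 to last_page."""
    urls = []
    for page in range(1, last_page + 1):
        # Replace whatever number follows "page=" with the current page number.
        new_url = re.sub(r"([?&]page=)\d+", r"\g<1>" + str(page), template_url)
        urls.append(new_url)
    return urls


# Shortened, illustrative URL; the real one carries more query parameters.
template = "https://www.airbnb.com/s/cape-town?guests=6&page=1&zoom=14"
page_urls = build_page_urls(template, 17)
print(len(page_urls))
print(page_urls[-1])
```

Unlike the Excel formula, which assumes the template's page number is a single digit, the regular expression replaces a page number of any length.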
DIVING DEEPER
To get more in-depth data, a more intensive way of scraping was needed. Some data isn’t visibly
available on Airbnb’s site and is hidden away in the HTML and JSON of each listing page. This time I
used Python to run a web spider that scraped each listing’s data from the same points on the map. To
do this I set up a web scraping spider using the Python library Scrapy.
The code I used was adapted from an incredibly helpful resource by Luca Verginer,
http://www.verginer.eu/blog/web-scraping-airbnb/. The key data I extracted from each listing can be
seen in the code that parses the listing data:
Figure 3: The parse function that scrapes the data from the listings
The data parsed from the JSON array within the listing page provides much deeper insight into the
listing and more pieces of data to analyse.
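To give a feel for what parsing embedded JSON involves, here is a minimal standard-library sketch. It is not the spider code from the figure: the sample HTML, the element id "listing-data" and every field name are hypothetical, made up purely for the example, and the real page structure will differ.

```python
import json
import re

# Hypothetical page fragment: the element id and field names are invented
# for this example and do not describe Airbnb's real markup.
SAMPLE_HTML = '''
<html><body>
<script type="application/json" id="listing-data">
{"listing": {"id": 101, "name": "Sea view villa", "price": 2500,
 "instant_bookable": true, "amenity_ids": [1, 4, 8]}}
</script>
</body></html>
'''

SCRIPT_PATTERN = re.compile(
    r'<script[^>]*id="listing-data"[^>]*>(.*?)</script>', re.DOTALL
)


def parse_listing(html):
    """Pull the embedded JSON out of the page and keep the fields of interest."""
    match = SCRIPT_PATTERN.search(html)
    if match is None:
        return None  # the page does not have the expected layout
    listing = json.loads(match.group(1))["listing"]
    return {
        "id": listing["id"],
        "name": listing["name"],
        "price": listing["price"],
        "instant_bookable": listing["instant_bookable"],
        "amenity_ids": listing["amenity_ids"],
    }


item = parse_listing(SAMPLE_HTML)
print(item["name"], item["instant_bookable"])
```

In a Scrapy spider the same logic would sit inside the callback that receives each listing page, with the returned dictionary becoming one row of output.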
Again, we had to run the spider across multiple areas to get all the listings within the suburbs of greater
Cape Town. The code is adapted to add a suffix to the URL to only get listings for 6 guests, as well as that
area on the map (the GPS co-ordinates). This code loops over all the pages that hold the results of the
search:
Figure 4: Creating the web scraping spider that would collect the detailed information by listing
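The looping over map areas and result pages can be sketched roughly as follows. The query parameter names are the ones visible in the example search URL earlier in this report; the function name search_urls and the second bounding box are assumptions for illustration, and nothing here guarantees that the site still accepts these parameters.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airbnb.com/s/cape-town"


def search_urls(areas, last_page, guests=6):
    """Yield one search URL per (map area, result page) combination."""
    for ne_lat, ne_lng, sw_lat, sw_lng in areas:
        for page in range(1, last_page + 1):
            query = urlencode({
                "guests": guests,
                "page": page,
                "ne_lat": ne_lat,
                "ne_lng": ne_lng,
                "sw_lat": sw_lat,
                "sw_lng": sw_lng,
                "search_by_map": "true",
            })
            yield BASE_URL + "?" + query


# Bounding boxes as (ne_lat, ne_lng, sw_lat, sw_lng), kept as strings so the
# digits are passed through unchanged. The first box comes from the example
# URL; the second is a made-up neighbouring area.
areas = [
    ("-33.92949338148802", "18.499001070258828",
     "-33.983999649640914", "18.43651632904789"),
    ("-33.90", "18.45", "-33.93", "18.40"),
]
urls = list(search_urls(areas, 17))
print(len(urls))
```

Each generated URL would be handed to the spider as a starting point, so two areas with 17 pages each give 34 requests.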
Figure 5: VBA Code to concatenate sheets in Excel, making one master file with the basic listing data
We now have a master file with around 3000 listings in greater Cape Town that can sleep 6 or more
people. Since we moved the map to capture certain areas, the areas overlapped and the same listings
were included in multiple search results. Excel’s handy Remove Duplicates tool took care of this,
leaving us with around 2000 listings.
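The concatenate-then-deduplicate step was done with VBA and Excel, but the idea is easy to express in Python too. The sketch below is an assumption-laden illustration: the key name listing_id and the sample rows are hypothetical.

```python
def merge_and_dedupe(tables, key="listing_id"):
    """Combine several lists of row dicts, keeping the first row seen per key."""
    seen = set()
    merged = []
    for table in tables:
        for row in table:
            if row[key] in seen:
                continue  # the same listing appeared in an overlapping map area
            seen.add(row[key])
            merged.append(row)
    return merged


# Two made-up search areas that overlap on listing 102.
area_a = [
    {"listing_id": 101, "name": "Sea view villa"},
    {"listing_id": 102, "name": "City loft"},
]
area_b = [
    {"listing_id": 102, "name": "City loft"},
    {"listing_id": 103, "name": "Garden cottage"},
]
master = merge_and_dedupe([area_a, area_b])
print(len(master))
```

Keeping the first occurrence mirrors what Excel's duplicate removal does when rows are compared on a single column.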
The more detailed Scrapy data was collected into a set of CSV files that had the same fields but in
different orders. The data was loaded into Qlikview, which allowed us to use its SQL functionality to
UNION the tables together, building one big table of data and correctly ordering the fields
automatically. One of the fields, Amenities, is a list of all the amenity codes a property has. By
separating the list into separate fields in Excel and creating a CrossTable in Qlikview, we created a
further table with amenities by property as well as their descriptions. Unfortunately, the descriptions
of the 60 amenities came from trawling through the HTML code in a very manual way, but at least we
only had to do it once!
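Both steps, the union of differently ordered CSV files and the one-row-per-amenity table, can be sketched with Python's csv module. This is an illustration and not the Qlikview load script: the column names, the "|" separator and the amenity codes below are all hypothetical.

```python
import csv
import io

# Two made-up CSV extracts with the same fields in a different order.
CSV_A = "listing_id,price,amenities\n101,2500,1|4|8\n"
CSV_B = "amenities,listing_id,price\n4|9,102,1800\n"

# Hypothetical amenity codes and descriptions.
AMENITY_NAMES = {1: "TV", 4: "Wireless Internet", 8: "Kitchen", 9: "Free parking"}


def union_rows(csv_texts):
    """Read every CSV by header name, so column order does not matter."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


def amenities_long(rows):
    """Turn each listing's amenity list into one row per (listing, amenity)."""
    long_rows = []
    for row in rows:
        for code_text in row["amenities"].split("|"):
            code = int(code_text)
            long_rows.append({
                "listing_id": row["listing_id"],
                "amenity_code": code,
                "amenity": AMENITY_NAMES.get(code, "unknown"),
            })
    return long_rows


rows = union_rows([CSV_A, CSV_B])
amenity_rows = amenities_long(rows)
print(len(rows), len(amenity_rows))
```

The long table plays the same role as the CrossTable: one row per property and amenity, which makes it simple to count how many listings offer each amenity.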
The last 2 pieces of data were a list of Cape Town’s suburb names from Wikipedia, as well as a file that
contained a list of first names and whether they were male or female names. This came from a German
site but the data in the available zip file is all I needed to classify the genders of the Airbnb hosts.
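The gender classification amounts to a lookup on the host's first name. A minimal sketch, assuming a small hand-made dictionary in place of the downloaded name file (whose format is not reproduced here):

```python
# Illustrative stand-in for the downloaded first-name file.
NAME_GENDER = {"nick": "male", "sarah": "female", "alex": "unisex"}


def classify_host(host_name, lookup=NAME_GENDER):
    """Guess a host's gender from the first word of the host name."""
    tokens = host_name.strip().lower().split()
    if not tokens:
        return "unknown"
    return lookup.get(tokens[0], "unknown")


print(classify_host("Sarah & John"))
print(classify_host("Zaphod"))
```

Names that are missing from the list, or host names that are really business names, fall through to "unknown" and would need to be excluded from any male-to-female comparison.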
After some cleaning and organising we have now created a database with this structure:
RESULTS
DATA VALIDATION
The data we extracted was evenly distributed across the 17 result pages, as seen below. Since the
listing data was collected by moving the map position on the Airbnb search page, we are likely to have
some overlap in results where map areas cover the same properties.
The data was still evenly distributed after removing duplicate results. The drop-off on page 17 is
due to certain map areas having fewer properties and therefore fewer than 17 result pages:
DESCRIBING THE RESULTS
When looking at the results we are looking for correlation between search result page and the variable
being tested. We used the average of each variable per result page, and a good way to interpret the
table below is to say: “The average Page 1 listing has a guest satisfaction score of 83.7%”. We will
cover the results in more detail later in the report, but perhaps unsurprisingly the factor most
correlated with search rank is the Guest Satisfaction score, which is calculated once a guest completes
a review for a listing.
To interpret the results, we are looking for correlation (both positive and negative) with result page. As
seen in the table below, as page number increases, average guest satisfaction decreases.
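As a rough sketch of this calculation, the code below averages a variable per result page and then computes a Pearson correlation coefficient between page number and those averages. The numbers are invented for the example, and correlating the page averages (rather than individual listings) is an assumption about how the table was built.

```python
from collections import defaultdict
from math import sqrt


def average_by_page(rows, field):
    """Return {page: average of field for the listings on that page}."""
    values = defaultdict(list)
    for row in rows:
        values[row["page"]].append(row[field])
    return {page: sum(v) / len(v) for page, v in sorted(values.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up listings: satisfaction drops as the result page number rises.
listings = [
    {"page": 1, "satisfaction": 90}, {"page": 1, "satisfaction": 80},
    {"page": 2, "satisfaction": 80}, {"page": 2, "satisfaction": 70},
    {"page": 3, "satisfaction": 70}, {"page": 3, "satisfaction": 60},
]
averages = average_by_page(listings, "satisfaction")
r = pearson(list(averages.keys()), list(averages.values()))
print(averages)
print(round(r, 2))
```

A coefficient near -1 means the variable falls steadily as page number rises, near +1 means it rises, and near 0 means no linear relationship.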
THE FACTORS MOST CORRELATED TO PAGE RANK
The table below shows the top 5 factors that are most correlated to page rank. We can clearly see the
trends in the graphs:
Things to note:
• Guest satisfaction score (from guest reviews) is understandably the most correlated factor.
• Price: anecdotally, from my experience, Airbnb has been recommending lower and lower prices
as suggested prices. Airbnb wants to offer the best deal to its users, so lower prices mean a
better rank.
• Word count: as described above, this may be a factor that Airbnb values, or it may be that wordy
descriptions are a characteristic of more conscientious hosts who score well elsewhere too.
• Minimum stay length: perhaps shorter stays get more bookings and therefore score better in
other factors influencing rank, but it seems that the shorter a host’s minimum stay requirement
is, the higher the listing ranks.
• Days since calendar updated: the more active a host is in updating the calendar, the better the
rank of the property. Unsurprisingly, Airbnb rewards active hosts.
Below are factors 6-10 of those most correlated to page rank. Note that we are still looking at averages
for each factor, i.e. on average 62.4% of listings on page 1 are Instant Book listings.
Things to note:
• Price/Bed: since we have details on the number of beds we can work out the price per bed. Again,
Airbnb rewards cheaper listings.
• Name Length: this field can only be 50 characters long, but listings with more words (average 5)
seem to rank higher. Again, this may be due to other factors.
• Is Instant Book: see below for a more detailed analysis, but Instant Book listings perform better.
• Reviews: having more reviews is correlated with ranking higher.
• Times Saved to Wishlist: one listing on page 5 was removed from this set as an outlier. It had
been saved to wish lists over 22 000 times and skewed the results. (It must have made it onto
Airbnb’s featured page or some other site that gets major traffic.)
AIRBNB’S CHANGING STANCE ON INSTANT BOOK
Having a listing set to Instant Book (where a host allows potential guests to book without their approval)
is correlated with a higher search rank in this data set. This wasn’t always the case. About a year ago,
I did some similar research looking for correlation between Instant Book listings and search rank. The
chart below shows how the rank algorithm seems to have changed over the last year:
Figure 11: Airbnb rewarding Instant Book listings with higher ranks in 2017
Clearly there is no real correlation in the 2016 data (0.14 correlation coefficient) but in the 2017 data we
can see that listings that have Instant Book enabled tend to appear higher in the search results.
This may be part of Airbnb’s drive to compete with the hotel industry and their stance that hosts should
not discriminate against potential guests (by not accepting certain bookings as hosts without Instant
Book can do).
From Airbnb’s “Work to fight discrimination and Build Inclusion Report” – Sept 2016:
One Million Instant Book Listings
Instant Book allows certain listings to be booked immediately – without prior host approval
of a specific guest. To achieve these goals, Airbnb will accelerate the use of Instant Book
with a goal of one million listings bookable via Instant Book by January 2017.
More importantly, Instant Book reduces the potential for bias because hosts automatically
accept guests who meet these objective custom settings they have put in place. Airbnb has
already worked to increase the number of Instant Book listings, which has more than doubled
in the past year.
THE CORRELATION OF ALL THE FACTORS TESTED
To quickly see which factors are most correlated with search rank we can look at the statistical
correlation instead of interpreting the graphs as we did above. Below is a table which shows the
correlation coefficients of each factor I tested:
Note: This table shows the absolute correlation coefficient from 0 to 1, with 1 being most correlated. I
converted the negative coefficients of inversely correlated factors to positive values for easier
interpretation.
Appendix A has a more detailed description of what each of these factors means and how it was tested.
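Ranking factors by absolute correlation can be sketched as below. The factor names and page averages are made up for the example, and the helper names are assumptions; the point is only to show how taking the absolute value lets positively and negatively correlated factors share one ranking.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_factors(pages, factor_averages):
    """Sort factors by the absolute value of their correlation with page number."""
    scores = {
        name: abs(pearson(pages, values))
        for name, values in factor_averages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented per-page averages for three factors over result pages 1 to 4.
pages = [1, 2, 3, 4]
factor_averages = {
    "host_account_age": [3, 5, 2, 4],
    "guest_satisfaction": [90, 85, 80, 75],
    "price": [1000, 1500, 1400, 2000],
}
ranking = rank_factors(pages, factor_averages)
for name, score in ranking:
    print(name, round(score, 2))
```

Because the sign is discarded, the ranking alone does not say whether a factor rises or falls with page number; that still has to be read from the underlying averages.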
Things to note:
• Being a SuperHost doesn’t seem to make as big a difference as one would think. It ranked 13th
most correlated to page rank.
• Airbnb hosting businesses and hosts with multiple listings aren’t correlated with higher search
results, nor are listings that are Business Ready.
• The ratio of male-to-female hosts didn’t correlate with search results; this was admittedly a long
shot! However, there are almost double the number of female hosts in this data set.
• Smoking and pet friendly properties didn’t seem to negatively impact search results.
• Having the suburb name or the base word “view” in the title didn’t correlate with search rank.
• Age of a host’s account (how long they have used Airbnb) didn’t correlate with search rank.
THE CORRELATION OF AMENITIES TO SEARCH RANK
In the same way that we tested the factors that are correlated to search result rank we can also test the
correlation of amenities. Below is a table that ranks the correlation of amenities with search rank:
Note: This data set includes listings that can accommodate 6 or more people, likely houses not apartments.
Things to note:
• Offering breakfast is not correlated with higher search results. It may be a factor for listings
that sleep 1 or 2 but not for those accommodating 6 or more.
• Having a TV and having Cable/Satellite TV do not correlate with higher search ranks.
• The amenities required for Business Ready status are more strongly correlated with higher
search ranks.
• More listings have wireless internet (93%) as an amenity than Internet (57.6%). I can’t explain
this. Perhaps some hosts don’t know that Wi-Fi isn’t possible without internet itself?
CONCLUSIONS
Based on this research analysing the correlation between host variables and search page rank, there are
several easy things a host can do to get a better search rank:
The results of this research may be skewed because this data set only includes listings that accommodate
6 or more people. They may also differ from results for other cities and countries. Further research
could apply the same methodology to all listings regardless of how many people a listing can
accommodate, as well as to different cities and countries.
childnick@gmail.com
APPENDIX A
Rank Description Details
APPENDIX B
Scrapy has built-in settings that help ensure you are not guilty of DDoSing Airbnb’s servers while scraping the data:
http://doc.scrapy.org/en/latest/topics/autothrottle.html
My settings:
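As a starting point for anyone setting this up, a throttled configuration in a Scrapy project's settings.py could look like the sketch below. The setting names are standard Scrapy settings, but the values are illustrative assumptions and not a record of the settings used for this project.

```python
# settings.py (illustrative values only, chosen to be conservative)

# Wait a few seconds between requests and never hit the site in parallel.
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5   # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound on the delay when the server is slow
AUTOTHROTTLE_DEBUG = False     # set to True to log the throttling decisions
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the minimum delay, so the crawl never goes faster than one request every few seconds.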