Strata Scratch Problem Sets

Problem sets, exercises, and educational guides for SQL and python. These exercises are specifically tailored for business and marketing analytics students and novices.

Back to the Strata Scratch website
Login to Strata Scratch

AirBnB Search Details Exercise Solutions

In this exercise we will use the dataset datasets.airbnb_search_details.

This dataset contains searches and results for them for properties via AirBnB. The searches are limited to big US cities and neighbourhoods inside them. It is full of details like the type of property sought, the type of room, what extra amenities the property offers and many more. The data has no explicit time scale in the sense that searches are tagged by date and time but we know the registration date of hosts found, the date of the first review and the date of the last review. Our goal through these exercises is to understand the general patterns behind these searches, how we can use them for our own good (e.g. finding cheapest properties) and get a deeper understanding of how AirBnB operates.

The technical description of the dataset is in the following table.

Column Name Column Type Has NULL or missing values Short description
id INTEGER NO The primary key for this table.
log_price NUMERIC NO Logarithm of the apartment price.
property_type TEXT NO Type of the property sought, examples: ‘Villa’, ‘Timeshare’.
room_type TEXT NO Type of room, example: ‘Private room’
amenities TEXT NO Amenities of the apartment stored as TEXT but written in a mathematical set notation.
accommodates INTEGER NO Number of people the apartment/room can accommodate.
bathrooms INTEGER YES Number of bathrooms in the search.
bed_type TEXT NO Type of the bed, examples: ‘Airbed’, ‘Real Bed’
cancellation_policy TEXT NO Cancellation policy, examples: ‘flexible’, ‘moderate’
cleaning_fee TEXT NO The possible values are TRUE and FALSE which means this column tells us about the existence of a cleaning fee and should be of type BOOL.
city TEXT NO Name of the city either as acronym (SF, LA, NYC) or full name (Chicago, Boston)
description TEXT NO Textual description of the appartment. Can be of any number of characters.
first_review TEXT YES Date of the first review stored as TEXT. To use it we need to cast it to DATE first.
host_has_profile_pic TEXT NO A binary value stored as text where t denotes TRUE and f denotes FALSE.
host_identity_verified TEXT YES Similar to host_has_profile_pic
host_response_rate TEXT YES A number in the format 100%, 71%, 83%, etc. Sometimes missing.
host_since TEXT NO The date of the registration of the host. Is of type text but should be DATE.
instant_bookable TEXT NO A binary value stored as text where t denotes TRUE and f denotes FALSE.
last_review TEXT YES The date of last review.
latitude NUMERIC NO The geographical latitude of the apartment.
longitude NUMERIC NO The geographical longitude of the apartment.
name TEXT NO The name of the apartment.
neighborhood TEXT NO The name of the neighborhood where the apartment is located.
number_of_reviews INTEGER NO The number of reviews for this apartment.
review_scores_rating INTEGER YES The average rating of this apartment
thumbnail_url TEXT YES The URL for the thumbnail image. Not useful for us.
zipcode INTEGER YES Zip code for the municipality where the apartment is located.
bedrooms INTEGER YES The number of bedrooms.
beds INTEGER YES The total number of beds.

This is a humongous table but fear not, because we will use only small parts of it for each question. Usually the questions use no more than 3 columns.

To make life easier, always do a select for all the columns needed in the question, look at the returned values, ponder a bit and then try to write the solution.

Beginner exercise

  1. Find out the searches made by the people who search for apartments where they will be the sole person staying.

Columns: Use the beds and accomodates columns.

SELECT *
FROM datasets.airbnb_search_details
WHERE accommodates = 1 AND beds = 1
  1. Find 50 searches for apartments in New York City which are in the Harlem neighborhood.

Columns: city and neighborhood

Hint: New York City is present as NYC in this dataset.

SELECT *
FROM datasets.airbnb_search_details
WHERE city = 'NYC' AND neighbourhood = 'Harlem'
LIMIT 50
  1. Find all searches where the number of bedrooms is equal to the number of bathrooms.

Columns: bedrooms and bathrooms.

SELECT *
FROM datasets.airbnb_search_details
WHERE bedrooms = bathrooms
  1. Find Los Angeles neighborhoods.

Columns: city and neighbourhood

SELECT DISTINCT neighbourhood
FROM datasets.airbnb_search_details
WHERE city = 'LA'

Intermediate exercises

  1. Find all houses and villas which have internet access but no wireless internet access.

Columns: amenities and property_type

Hint: Use the ILIKE, IN and NOT operators.

SELECT *
FROM datasets.airbnb_search_details
WHERE
     amenities ILIKE '%Internet%' AND
     amenities NOT ILIKE '%Wireless Internet%' AND
     property_type IN ('House', 'Villa')
  1. Find all searches where the host_response_rate column is missing data.

Hint: Not all missing values are NULL. Some can be also empty strings.

SELECT *
FROM datasets.airbnb_search_details
WHERE
     host_response_rate <> ''
  1. Find all searches for San Francisco with flexible cancellation policies and which have a review score present. Present them ordered by the review score from highest to lowest.

Columns: city, cancellation_policy and review_scores_rating

Hint: San Francisco is stored as SF in this dataset.

SELECT *
FROM datasets.airbnb_search_details
WHERE
    city='SF' 
    AND cancellation_policy='flexible' 
    AND review_scores_rating IS NOT NULL
ORDER BY review_scores_rating DESC    
  1. Find the average number of bathrooms and bedrooms for each city and property type pair.

Columns: city, property_type, bathrooms, bedrooms

SELECT 
    city, property_type,
    AVG(bathrooms) AS avg_bathrooms,
    AVG(bedrooms) AS avg_bedrooms
FROM datasets.airbnb_search_details
GROUP BY city, property_type
  1. Find the total number of searches for Westlake neighborhood which require TV among the amenities.

Columns: neighbourhood, property_type, amenities

SELECT 
   COUNT(*)
FROM datasets.airbnb_search_details
WHERE 
    neighbourhood = 'Westlake' AND 
    property_type = 'House'    AND 
    amenities ILIKE '%TV%'
  1. It is time for your vacation. You need to chose which city to visit. You don’t have much money so you decide to stay in a shared room, but you want to share the room with as little people as possible. You devise a score which tells you the average number of persons accomodated by the shared room over the average number of beds available for each city and order the choices by that score.

Columns: city, accomodates and beds

SELECT 
   city,    
   AVG(accommodates) AS avg_accomodates,
   AVG(beds) AS avg_beds,
   AVG(accommodates) / AVG(beds) AS crowdness_ratio
FROM datasets.airbnb_search_details
WHERE 
    room_type='Shared room'
GROUP BY city
ORDER BY crowdness_ratio ASC
  1. Find the price of the cheapest property for every city.

Columns: city and log_price

SELECT 
   city,
   MIN(log_price) AS min_log_price
FROM datasets.airbnb_search_details
GROUP BY city
  1. Find all neighbourhoods present in this dataset.
SELECT DISTINCT
    neighbourhood
FROM datasets.airbnb_search_details
  1. Find the average number of beds in neighbourhoods in which no property has less than 3 beds.

Columns: neighbourhood and beds

Hint: Use a HAVING clause.

SELECT
    neighbourhood,
    AVG(beds) AS avg_beds
FROM datasets.airbnb_search_details
GROUP BY neighbourhood
HAVING MIN(beds) > 3
ORDER BY avg_beds DESC
  1. To better understand the effects of the number of reviews on the price you decide to bin the reviews into the following groups: NO, FEW, SOME, MANY, A LOT. The rules you use are:
    • 0 reviews is NO
    • 1 to 5 is FEW
    • 5 to 15 is SOME
    • 15 to 40 is MANY
    • more than 40 is A LOT

Hint: Use CASE and BETWEEN

Columns: number_of_reviews

SELECT
    CASE 
        WHEN number_of_reviews = 0 THEN 'NO'
        WHEN number_of_reviews BETWEEN 1 AND 5 THEN 'FEW'
        WHEN number_of_reviews BETWEEN 5 AND 15 THEN 'SOME'
        WHEN number_of_reviews BETWEEN 15 AND 40 THEN 'MANY'
        WHEN number_of_reviews > 40 THEN 'A LOT'
    END AS 
    reviews_qualificiation,
    log_price
FROM datasets.airbnb_search_details
  1. How many hosts are verified by AirBnB staff? How many aren’t?

Columns: host_identity_verified

Hint: Use GROUP BY

SELECT
    host_identity_verified,
    COUNT(*)
FROM datasets.airbnb_search_details
GROUP BY host_identity_verified
  1. Find the price of the most expensive beach properties for each city.

Columns: city and description

SELECT
    city,
    MAX(log_price)
FROM datasets.airbnb_search_details
WHERE description ILIKE '%beach%'
GROUP BY city
  1. Find all neighbourhoods where there are properties which have no cleaning fees and have parking space.

Hint: Use DISTINCT

Columns: neighbourhood, description, cleaning_fee

SELECT 
    DISTINCT
    neighbourhood
FROM datasets.airbnb_search_details
WHERE description ILIKE '%parking%' AND cleaning_fee = 'FALSE'

Advanced exercises

  1. Complete the statistical analysis started in question 10 of the Intermediate exercises. Find the min, avg and max log price per review qualification.

Hint: Use a subquery to keep your query code manageable. Remember that subqueries must be named.

Columns: number_of_reviews, log_price

SELECT
    tmp.reviews_qualificiation,
    MIN(log_price) AS min_log_price,
    AVG(log_price) AS average_log_price,
    MAX(log_price) AS max_log_price
FROM
(SELECT
    CASE 
        WHEN number_of_reviews = 0 THEN 'NO'
        WHEN number_of_reviews BETWEEN 1 AND 5 THEN 'FEW'
        WHEN number_of_reviews BETWEEN 5 AND 15 THEN 'SOME'
        WHEN number_of_reviews BETWEEN 15 AND 40 THEN 'MANY'
        WHEN number_of_reviews > 40 THEN 'A LOT'
    END AS 
    reviews_qualificiation,
    log_price
FROM datasets.airbnb_search_details) tmp
GROUP BY tmp.reviews_qualificiation
  1. Find the search for each city which has the highest number of amenities. Estimate the number of amenities as the number of characters in the amenities column.

Columns: city and amenities

Hint: Use the LENGTH function and an INNER join with the result of the subquery.

SELECT
    main.*
FROM
    datasets.airbnb_search_details AS main
INNER JOIN
    (SELECT 
        city,
        MAX(LENGTH(amenities)) AS max_amen
    FROM datasets.airbnb_search_details
    GROUP BY city) AS tmp
ON
    main.city = tmp.city AND LENGTH(main.amenities) = tmp.max_amen
  1. Convert the following two columns to their true types using type casts and if necessary string processing:
    • cleaning_fee from TEXT to BOOL
    • host_response_rate from TEXT to NUMERIC. Missing values should be NULL.

Find the average host_response_rate per zip code after type casting. This will tell you which municipality has the most talkative people.

Hint: String processing is necessary for host_response_rate.

SELECT
    zipcode,
    AVG(tmp.clean_host_response_rate) AS avg_host_response_rate
FROM
    (SELECT
        zipcode,
        cleaning_fee :: BOOL AS clean_cleaning_fee,
        (CASE 
            WHEN host_response_rate = '' 
            THEN NULL
            ELSE SUBSTR(host_response_rate, 0, POSITION ('%' IN host_response_rate))    
        END) :: NUMERIC AS clean_host_response_rate
    FROM 
        datasets.airbnb_search_details) tmp
GROUP BY zipcode
HAVING AVG(tmp.clean_host_response_rate) IS NOT NULL
ORDER BY avg_host_response_rate
  1. Fix the host_since column using string processing to be a valid DATE.

Hint: Use the string split_part function. The standard format is yyyy-mm-dd. You also need to take care of cases like 1/6/15 which should be 01/06/2018

SELECT
    id,
    (tmp.year_fixed || tmp.month_fixed || tmp.day_fixed) :: DATE AS clean_host_since
FROM
(SELECT
    id,
    (CASE 
        WHEN LENGTH(split_part(host_since, '/' :: TEXT, 1)) = 1 
        THEN '0' || split_part(host_since, '/' :: TEXT, 1) 
        ELSE split_part(host_since, '/' :: TEXT, 1) END) AS day_fixed,
        
    (CASE 
        WHEN LENGTH(split_part(host_since, '/' :: TEXT, 2)) = 1 
        THEN '0' || split_part(host_since, '/' :: TEXT, 2) 
        ELSE split_part(host_since, '/' :: TEXT, 2) END) AS month_fixed,    
    
    (CASE
        WHEN split_part(host_since, '/' :: TEXT, 3) :: INTEGER < 2000 
        THEN 2000 + split_part(host_since, '/' :: TEXT, 3) :: INTEGER
        ELSE split_part(host_since, '/' :: TEXT, 3) :: INTEGER
    END) AS year_fixed
FROM datasets.airbnb_search_details
WHERE host_since <> '' AND host_since IS NOT NULL) tmp
  1. What are the neighbourhoods where you can sleep on a real bed in a villa with a beach access while paying the minimal price possible. Try to combine the main table and the subquery using IN.
SELECT
    DISTINCT neighbourhood
FROM 
    datasets.airbnb_search_details
WHERE
    log_price IN (
        SELECT
            MIN(log_price)
        FROM 
            datasets.airbnb_search_details
        WHERE
            bed_type = 'Real Bed' AND 
            property_type = 'Villa' AND
            description ILIKE '%beach%' AND
            -- there is a bug row with log_price = 0
            log_price > 0
    )
  1. Estimate the growth of AirBnB year over year by looking at the count of hosts registered for each year. This is an estimate because we have searches here not hosts table but we assume that more searches means more hosts and less searches mean less hosts. We also don’t have all data and must filter away all NULL or empty host_since values.

Hint: Use the cleaned column for year from question 4. Use the LAG function to find percent growth as 100 * ((year_current / year_previous) - 1.0).

Finding the number of hosts registered per year:

SELECT
    year_fixed AS year,
    COUNT(*) :: NUMERIC AS total_people_registered
FROM
(SELECT
    id,
    (CASE
        WHEN split_part(host_since, '/' :: TEXT, 3) :: INTEGER < 2000 
        THEN 2000 + split_part(host_since, '/' :: TEXT, 3) :: INTEGER
        ELSE split_part(host_since, '/' :: TEXT, 3) :: INTEGER
    END) AS year_fixed
FROM datasets.airbnb_search_details
WHERE host_since <> '' AND host_since IS NOT NULL) tmp
GROUP BY year_fixed
ORDER BY year_fixed

Using LAG to estimate growth:

SELECT
    year,
    total_people_registered,
    COALESCE(LAG (total_people_registered, 1)
             OVER (ORDER BY year), 
             1) AS prev_total_people_registered,
    
    ROUND(100 * ((total_people_registered / 
            COALESCE(LAG (total_people_registered, 1) OVER (ORDER BY year), 1)) 
            - 1.0), 2) || '%' AS estimated_growth
FROM
    (SELECT
        year_fixed AS year,
        COUNT(*) :: NUMERIC AS total_people_registered
    FROM
        (SELECT
            id,
            (CASE
                WHEN split_part(host_since, '/' :: TEXT, 3) :: INTEGER < 2000 
                THEN 2000 + split_part(host_since, '/' :: TEXT, 3) :: INTEGER
                ELSE split_part(host_since, '/' :: TEXT, 3) :: INTEGER
            END) AS year_fixed
        FROM datasets.airbnb_search_details
        WHERE host_since <> '' AND host_since IS NOT NULL) tmp
    GROUP BY year_fixed
    ORDER BY year_fixed) tmp2
  1. Make a pivot table which shows the number of searches per city and room type. Rows are different cities while columns are different room types:

Hint: Find the room types using SELECT DISTINCT before making the pivot table.

SELECT
    tmp2.city,
    SUM(tmp2.apt_count) AS apt_count,
    SUM(tmp2.private_count) AS private_count,
    SUM(tmp2.shared_count) AS shared_count
FROM
(SELECT
    DISTINCT
    city,
    (CASE WHEN room_type = 'Entire home/apt' THEN cnt ELSE 0 END) AS apt_count,
    (CASE WHEN room_type = 'Private room' THEN cnt ELSE 0 END) AS private_count,
    (CASE WHEN room_type = 'Shared room' THEN cnt ELSE 0 END) AS shared_count
FROM
    (SELECT
        city,
        room_type,
        COUNT(*) AS cnt
    FROM datasets.airbnb_search_details
    GROUP BY city, room_type) tmp) tmp2
GROUP BY tmp2.city