In this exercise we will use the dataset
This dataset contains searches and results for them for properties via AirBnB. The searches are limited to big US cities and neighbourhoods inside them. It is full of details like the type of property sought, the type of room, what extra amenities the property offers and many more. The data has no explicit time scale in the sense that searches are tagged by date and time but we know the registration date of hosts found, the date of the first review and the date of the last review. Our goal through these exercises is to understand the general patterns behind these searches (e.g. average number of bedrooms per city), how we can use them for our own good (e.g. finding cheapest properties) get a deeper understanding of how AirBnB operates.
The technical description of the dataset is in the following table.
|Column Name||Column Type||Has NULL or missing values||Short description|
|id||INTEGER||NO||The primary key for this table.|
|log_price||NUMERIC||NO||Logarithm of the apartment price.|
|property_type||TEXT||NO||Type of the property sought, examples: ‘Villa’, ‘Timeshare’.|
|room_type||TEXT||NO||Type of room, example: ‘Private room’|
|amenities||TEXT||NO||Amenities of the apartment stored as TEXT but written in a mathematical set notation.|
|accommodates||INTEGER||NO||Number of people the apartment/room can accommodate.|
|bathrooms||INTEGER||YES||Number of bathrooms in the search.|
|bed_type||TEXT||NO||Type of the bed, examples: ‘Airbed’, ‘Real Bed’|
|cancellation_policy||TEXT||NO||Cancellation policy, examples: ‘flexible’, ‘moderate’|
|cleaning_fee||TEXT||NO||The possible values are TRUE and FALSE which means this column tells us about the existence of a cleaning fee and should be of type BOOL.|
|city||TEXT||NO||Name of the city either as acronym (SF, LA, NYC) or full name (Chicago, Boston)|
|description||TEXT||NO||Textual description of the appartment. Can be of any number of characters.|
|first_review||TEXT||YES||Date of the first review stored as TEXT. To use it we need to cast it to DATE first.|
|host_has_profile_pic||TEXT||NO||A binary value stored as text where t denotes TRUE and f denotes FALSE.|
|host_identity_verified||TEXT||YES||Similar to host_has_profile_pic|
|host_response_rate||TEXT||YES||A number in the format 100%, 71%, 83%, etc. Sometimes missing.|
|host_since||TEXT||NO||The date of the registration of the host. Is of type text but should be DATE.|
|instant_bookable||TEXT||NO||A binary value stored as text where t denotes TRUE and f denotes FALSE.|
|last_review||TEXT||YES||The date of last review.|
|latitude||NUMERIC||NO||The geographical latitude of the apartment.|
|longitude||NUMERIC||NO||The geographical longitude of the apartment.|
|name||TEXT||NO||The name of the apartment.|
|neighborhood||TEXT||NO||The name of the neighborhood where the apartment is located.|
|number_of_reviews||INTEGER||NO||The number of reviews for this apartment.|
|review_scores_rating||INTEGER||YES||The average rating of this apartment|
|thumbnail_url||TEXT||YES||The URL for the thumbnail image. Not useful for us.|
|zipcode||INTEGER||YES||Zip code for the municipality where the apartment is located.|
|bedrooms||INTEGER||YES||The number of bedrooms.|
|beds||INTEGER||YES||The total number of beds.|
This is a humongous table but fear not, because we will use only small parts of it for each question. Usually the questions use no more than 3 columns.
To make life easier, always do a select for all the columns needed in the question, look at the returned values, ponder a bit and then try to write the solution.
Columns: Use the
Hint: New York City is present as NYC in this dataset.
Hint: Use the ILIKE, IN and NOT operators.
host_response_ratecolumn is missing data.
Hint: Not all missing values are NULL. Some can be also empty strings.
Hint: San Francisco is stored as SF in this dataset.
Find all neighbourhoods present in this dataset.
Find the average number of beds in neighbourhoods in which no property has less than 3 beds.
Hint: Use a HAVING clause.
Hint: Use CASE and BETWEEN
Hint: Use GROUP BY
Hint: Use DISTINCT
Hint: Use a subquery to keep your query code manageable. Remember that subqueries must be named.
Hint: Use the LENGTH function and an INNER join with the result of the subquery.
cleaning_feefrom TEXT to BOOL
host_response_ratefrom TEXT to NUMERIC. Missing values should be NULL.
Find the average host_response_rate per zip code after type casting. This will tell you which municipality has the most talkative people.
Hint: String processing is necessary for
host_sincecolumn using string processing to be a valid DATE.
Hint: Use the string split_part function. The standard format is yyyy-mm-dd. You also need to take care of cases like 1/6/15 which should be 01/06/2018
What are the neighbourhoods where you can sleep on a real bed in a villa with a beach access while paying the minimal price possible. Try to combine the main table and the subquery using IN.
Estimate the growth of AirBnB year over year by looking at the count of hosts registered for each year. This is an estimate because we have searches here not hosts table but we assume that more searches means more hosts and less searches mean less hosts. We also don’t have all data and must filter away all NULL or empty
Hint: Use the cleaned column for year from question 4. Use the LAG function to find percent growth as 100 * ((year_current / year_previous) - 1.0).
Hint: Find the room types using SELECT DISTINCT before making the pivot table.