Problem sets, exercises, and educational guides for SQL and python. These exercises are specifically tailored for business and marketing analytics students and novices.
The dataset for this exercise is datasets.los_angeles_restaurant_health_inspections
This dataset lists all restaurant health inspections which happened in the city of Los Angeles. As is often the case in life the columns of this dataset are completely different compared to the similar data which we have for San Francisco. Should we want to compare these two cities we would need to do a lot of preprocessing before we can join these two tables. This happens very often, but luckily in this exercise we will stick only to this dataset. This dataset lists inspections with date and place information. We also know the score and grade each facility got, the corporation which owns the facility, the id of the employee who was working when the inspection happened. Additionally we know the inspection results in form of textual description.
The technical description of the dataset is in the following table.
Column Name | Column Type | Has NULL or missing values | Short description |
---|---|---|---|
serial_number | VARCHAR | NO | The id of the inspection. |
activity_date | VARCHAR | NO | The date when the inspection occurred |
facility_name | VARCHAR | NO | The name of the facility in which the inspection took place. |
score | INTEGER | NO | The score assigned to the facility. |
grade | VARCHAR | YES | A symbolic grade assigned to the facility. One of ‘A’, ‘B’, ‘C’ or ‘ ‘. |
service_code | INTEGER | NO | One of 1, 401. |
service_description | VARCHAR | NO | One of ROUTINE INSPECTION, OWNER INITIATED ROUTINE INSPECT. |
employee_id | VARCHAR | NO | The id of the employee which was working when the inspection happened. |
facility_address | VARCHAR | NO | The street address of the facility. |
facility_city | VARCHAR | NO | Always LOS ANGELES. |
facility_id | VARCHAR | NO | The id of the facility. |
facility_state | VARCHAR | NO | Always CA. |
facility_zip | VARCHAR | NO | The zip code of the facility. |
owner_id | VARCHAR | NO | The id of the owner. |
owner_name | VARCHAR | NO | The full name of the owner. Information like INC, LLC is included here. |
pe_description | VARCHAR | NO | The textual description of the issue found with the restaurant. |
program_element_pe | INTEGER | NO | The code for the issue. |
program_name | VARCHAR | NO | Usually same as facility_name |
program_status | VARCHAR | NO | One of ACTIVE, INACTIVE. |
record_id | VARCHAR | NO | Unknown. |
To make life easier, always do a select for all the columns needed in the question, look at the returned values, ponder a bit and then try to write the solution.
Columns: facility_name
, pe_description
, activity_date
and score
Columns: owner_name
Hint: Use ILIKE.
Columns: service_description
and pe_description
Find all facilities which have a zip code in 90049, 90034, 90035.
Find all inspections which are part of an inactive program.
Hint: You can do things like 'A' AS grade
in the SELECT
part. This can be used to disambiguate after the union occurs.
Hint: Use HAVING.
Find the average number of inspections per facility for all corporations. Present the results in a descending order by that average.
Check if record_id
is unique for every row.
Hint: It is unique if count = count distinct
Hint: To get correct results use the SUM - CASE technique.
Find the minimal score for all facilities on Hollywood boulevard. What is the facility with the highest minimum score?
Classify each owner as LLC, INC or other.
Find the rules used to determine the grades. Have the rule as a separate column with a textual description like ‘Score > X AND Score < Y => Grade = A’ where X and Y are the lower and upper bounds for a grade.
Hint: Find the minimum and maximum scores for each grade. Use | to form the rule string. |
Find all facilities which offer beverages. Assume they offer beverages if their name contains the words tea, coffee or juice. What is the most common issue for this type of venues?
Count the number of facilities per municipality along with the number of inspections. Does this data confirm the hypothesis that more facilities = higher number of inspections. Make a scatter plot of this data.
Find all bakeries and the most common grade they earned as a collective.
Find the number of inspections per day ordered by the date?
How many inspections of low risk happened in 2017?
Which month had the lowest number of inspections for fish markets over all years?
Under assumption that the scores are normally distributed, the mean per groups should be 95 for A. Find the actual mean for these scores using BETWEEN and verify this claim.
Hint: You can use an implicit join to solve this or a left join with a always TRUE condition in the ON clause.
Hint: To find the street/boulevard/road name use split_part. Sometimes the name is second or third word. Use UNION to combine all 3 queries into one result. You might have to use subqueries.
Hint: Even though it may not seem like so, this is easily solvable using pivot table techniques. The trick is to think of pivot table columns as variables which can be used in filters and aggregations.
Hint: This is very similar to the preceding question.
Hint: Use NTILE and pivot table techniques.
Hint: Use LAG to answer the last demand.
Hint: Use ROW_NUMBER()