Strata Scratch Problem Sets

Problem sets, exercises, and educational guides for SQL and python. These exercises are specifically tailored for business and marketing analytics students and novices.

Back to the Strata Scratch website
Login to Strata Scratch

SQL and Business Insights (with Solutions)

Instructions

Questions

1. Billboard Top 100

Dataset: billboard_top_100_year_end

Solution

To find out if a song is a top song or not we check the year_rank for that song. If it is equal to 1 then it was a top song. We also need to check if a song belongs to our time period of interest. We do that with the second filter. DATE_PART('year', CURRENT_DATE) will give us the current year (2018 as of writing).

SELECT song_name
FROM datasets.billboard_top_100_year_end
WHERE year_rank = 1 
AND DATE_PART('year', CURRENT_DATE) - year <= 30

Solution

The solution to the second question is in the inner query where we get all songs along with their artist which were among top 10 sometime. To answer the first question we need to count the number of songs which we do using GROUPBY artist and count(*) AS top10_songs_count. The results are also sorted in a descending order so we can quickly say that Elvis Presley, Mariah Carey and Elton John are the 3 most popular authors.

SELECT artist, count(*) AS top10_songs_count
FROM
    (SELECT artist, song_name
    FROM datasets.billboard_top_100_year_end
    WHERE year_rank <= 10) temporary
GROUP BY artist
ORDER BY top10_songs_count DESC

solution

We filter away all songs older than 10 years ( WHERE date_part('year', CURRENT_DATE) - year <= 10) and then group by artist after which we count the number of occurences and sort descending so the highest counts are presented first.

SELECT artist, COUNT(*) as count_10yrs
FROM datasets.billboard_top_100_year_end
WHERE date_part('year', CURRENT_DATE) - year <= 10
GROUP BY artist
ORDER BY count_10yrs DESC

solution

To answer this question we first need to devise a metric. A suggestion is to calculate the score of each artist as (100 - average_rank) * years_present where average_rank is the average yearly ranking of all their songs while years present tells us how many years they are on the billboard. We use 100 - average_rank because we want a lower value of average rank to produce higher scores. We then order the artists on this metric after first filtering everything older than 50 years.

SELECT 
    artist, 
    AVG(year_rank) AS average_rank, 
    COUNT(DISTINCT year) AS years_present,
    (100 - AVG(year_rank)) * COUNT(DISTINCT year) AS score
FROM datasets.billboard_top_100_year_end
WHERE date_part('year', CURRENT_DATE) - year <= 50
GROUP BY artist
ORDER BY score DESC

2. SF Crime

Dataset: sf_crime_incidents_2014_01

Solution:

In our solution, we want to determine first the number of categories which can be attained through the SELECT statement. To get the results within the year 2014, we added the conditional statement WHERE date>='2014-01-01' and date<='2014-12-31'. Knowing the top crime categories can be easily viewed by grouping and sorting the results in descending order.

SELECT
    category,
    count(category) as count
FROM datasets.sf_crime_incidents_2014_01
WHERE date>='2014-01-01' and date<='2014-12-31'
GROUP BY category
ORDER BY count DESC

Solution:

SELECT
    day_of_week,
    count(category) as count
FROM datasets.sf_crime_incidents_2014_01
GROUP BY day_of_week
ORDER BY count DESC

Based on the result, the day of the week with the highest crime is Friday.

Solution:

SELECT
    pd_district,
    count(category) as count
FROM datasets.sf_crime_incidents_2014_01
GROUP BY pd_district
ORDER BY count DESC

Based on the result, the Southern District has the most crime incidences.

Solution:

SELECT
    address,
    pd_district,
    count(category) as count
FROM datasets.sf_crime_incidents_2014_01
GROUP BY address, pd_district
ORDER BY count DESC

Based on the result, the most dangerous place in SF is at 800 Block of BRYANT ST.

3. Oscar Nominees

Dataset: oscar_nominees

Solution:

SELECT
    nominee,
    count(winner) as count
FROM datasets.oscar_nominees
WHERE winner = true
GROUP BY nominee
ORDER BY count DESC

Based on the result, the answers are Meryl Streep, Katharine Hepburn, Walter Brennan, Jack Nicholson and Ingrid Bergman.

Solution:

SELECT
    nominee,
    count(winner) as count
FROM datasets.oscar_nominees
WHERE winner = false
GROUP BY nominee
ORDER BY count DESC

The answer is Meryl Streep.

Solution:

SELECT
    nominee,
    
    SUM (CASE 
        WHEN winner = true 
        THEN 1.0 
        ELSE 0.0
        END) / COUNT(*) :: NUMERIC AS ratio
FROM 
    datasets.oscar_nominees
GROUP BY nominee
ORDER BY ratio DESC

The answers are the following: Judy Holliday, Burl Ives, Daniel Day Lewis, Brenda Fricker, Donald Crisp, Ginger Rogers, Marion Cotillard, Mo’Nique, Louise Fletcher

Solution:

SELECT
    movie,
    count(nominee)
FROM datasets.oscar_nominees
GROUP BY movie
ORDER BY count DESC

The answers are the following: The Godfather Part II, Network, On the Waterfront, Mrs. Miniver, All about Eve, Peyton Place, From Here to Eternity, Bonnie and Clyde

Solution:

SELECT
    nominee,
    count(winner) as count
FROM datasets.oscar_nominees
WHERE winner = true
GROUP BY nominee
ORDER BY count DESC

The answers are the following: Meryl Streep, Katharine Hepburn, Walter Brennan, Jack Nicholson, Ingrid Bergman

4. Video Game Charts

Dataset: global_weekly_charts_2013_2014

solution

The answers are FIFA Soccer 13 with 432 weeks, LEGO The Lord of the Rings with 378 weeks, LEGO Batman 2: DC Super Heroes with 339 weeks.

SELECT 
    game, 
    COUNT(*) AS week_count
FROM datasets.global_weekly_charts_2013_2014
WHERE week <= 100
GROUP BY game
ORDER BY week_count DESC

solution

The answers are:

SELECT game, platform, COUNT(*) as count_top10
FROM datasets.global_weekly_charts_2013_2014
WHERE week <= 10
GROUP BY game, platform
ORDER BY count_top10 DESC

solution

The answers are: Action, Sports, Shooter

    SELECT genre, SUM(total) AS total_sales
    FROM datasets.global_weekly_charts_2013_2014
    GROUP BY genre
    ORDER BY total_sales DESC

solution

Total sales were used as the chosen metric.

SELECT publisher, SUM(total) AS total_sales
FROM datasets.global_weekly_charts_2013_2014
GROUP BY publisher
ORDER BY total_sales DESC

5. NFL Combine

Datasets: nfl_combine

solution

The answers are:

Do note that this query will also return null 1469 which means there are 1469 players which do not belong to any college.

SELECT
    college,
    COUNT(*) AS player_count
FROM datasets.nfl_combine
GROUP BY college
ORDER BY player_count DESC

6. Airbnb

Datasets: datasets.airbnb_searches, datasets.airbnb_contacts

solution

For this question we will use only the dataset datasets.airbnb_searches. Unlike the majority of queries which return rows this one gives us back only a single number. This number 18605 is the number of unique id_user values. The heart of the query is COUNT (DISTINCT id_user) which counts rows after discarding repeat entries according to id_user criteria.

SELECT COUNT (DISTINCT id_user) AS total_people_searching
FROM datasets.airbnb_searches

solution

We will only need datasets.airbnb_searches. We perform groupby over number of nights and sum up the searches for every night to obtain our statistics. The results are:

number of nights number of searches
null 60926
2 58201
3 57177
4 39392
1 33786
5 17621

Notice that the majority of searches do not specify the number of nights.

SELECT n_nights, SUM(n_searches) AS total_searches
FROM datasets.airbnb_searches
GROUP BY n_nights
ORDER BY total_searches DESC

solution

We use datasets.airbnb_contacts for this task.

For check-ins

DATE_PART('dow', ds_checkin) tells us the day of the week from 0 to 6. We group on that day and count the number of rows per groups. We sort from maximal checkin_count downwards and take the first value which is 5 (Friday)

SELECT 
  DATE_PART('dow', ds_checkin) AS day_of_week,
  COUNT(*) AS checkin_count
FROM datasets.airbnb_contacts
GROUP BY day_of_week
ORDER BY checkin_count DESC
LIMIT 1

For check-outs

The procedure is very similiar we just need to use the ds_checkout column. The resulting day is 0 (Sunday)

SELECT 
  DATE_PART('dow', ds_checkout) AS day_of_week,
  COUNT(*) AS checkout_count
FROM datasets.airbnb_contacts
GROUP BY day_of_week
ORDER BY checkout_count DESC
LIMIT 1

The majority is searching for: No filters, ‘,Entire home/apt’, ‘Entire home/apt’, ‘,Private room’, ‘Private room’. We see that there is error in the data with addition of a coma before some entries so we should clean our data to get more reliable results.

SELECT
    filter_room_types,
    COUNT(*) as count_searches
FROM datasets.airbnb_searches
GROUP BY filter_room_types
ORDER BY count_searches DESC

To clean our data we can use LTRIM(filter_room_types, ',') which removes commas from start and now our query becomes:

SELECT
    LTRIM(filter_room_types, ',') AS cleaned_filter,
    COUNT(*) as count_searches
FROM datasets.airbnb_searches
GROUP BY cleaned_filter
ORDER BY count_searches DESC

solution The result is 0.463632877412757

The final query is

SELECT
    (SELECT COUNT(*) AS count_accepted
    FROM datasets.airbnb_contacts
    WHERE ts_accepted_at IS NOT NULL) /
    (SELECT COUNT(*) as count_rows
    FROM datasets.airbnb_contacts) :: FLOAT 
AS acceptance_rate

There is quite a lot going on in the query so let’s break it into pieces.

(SELECT COUNT(*) AS count_accepted
    FROM datasets.airbnb_contacts
    WHERE ts_accepted_at IS NOT NULL)

That will give us a single number which tells us how many of the bookings were accepted based on the idea that having ts_accepted_at being null is a sign of non-acceptance.

(SELECT COUNT(*) as count_rows
    FROM datasets.airbnb_contacts) :: FLOAT

The second piece is just counting how many rows are there in the table. We convert the integer to FLOAT using :: FLOAT notation so we can divide in the final query

SELECT piece1 / piece2

This is what the final query essentaly ends up to be.

Based on our data and the following query and it’s results.

SELECT n_messages, COUNT(*) 
FROM datasets.airbnb_contacts
WHERE ts_accepted_at IS NULL
GROUP BY n_messages
ORDER BY COUNT(*) DESC
n_messages count
2 1419
3 1033
4 677
5 333
1 332
6 157
7 77
8 48
9 31

We see that majority of non acceptances (around 98%) are linked to a small number of messages, which is less than 10. This is likely the result of people asking some important questions via the first message, getting an unsatisfactory answer in the second message and then not accepting. So the solution would be to encourage hosts to write that crucial information in the description so visitors don’t bother writing to a host they do not plan to visit.

SELECT
    (SELECT 
        SUM(cnt) 
     FROM 
        (SELECT COUNT(*) AS cnt
            FROM datasets.airbnb_contacts
            WHERE ts_accepted_at IS NULL AND n_messages <= 10
            GROUP BY n_messages
            ORDER BY count(*) DESC) temp
    ) :: FLOAT
    /
    (SELECT COUNT(*) FROM datasets.airbnb_contacts WHERE ts_accepted_at IS NULL) 
AS acceptance_rate

is the query that gave us that number 98%