2 min read

SSIS, MSSQL, Power BI

IMDb Exploratory Data Analysis

Power BI dashboard | GitHub Repository

1. Objective

IMDb (Internet Movie Database) is one of the biggest online databases regarding films, series, video games and others.

The objective of this project is to explore the available datasets, and to gain insights about IMDb

2. Results

IMDb's revenue comes from both advertisers and its paying members (IMDbPro). By exploring the data through SQL queries and visualizations, we can better understand the target audience for these two revenue streams.

Based on the data collected through queries, we can say that:

IMDbPro
  • The target audience for these members is the more than 4 million actors, nearly 1 million different producers, and hundreds of thousands of writers, directors, and other crew members in their database.
  • People who have worked on big productions like "Avengers: Endgame" or "Game of Thrones" would probably benefit the most from the network's capabilities of their paid membership.
IMDb
  • Regular IMDb users could benefit from having a more robust rating system. Indeed, as of now, titles with a high number of votes tend to have higher ratings, which makes the average title rating relatively high (6.62).

3. Overview

# Preprocessing

  • Using Microsoft Visual Studio 2019 and the SSIS toolbox, stored the available IMDb database in a local MSSQL database as raw data
  • Created new "working" tables through stored procedures in order to better fit the data types and other specifications
  • Tables, rows and columns with too much missing data or not enough meaningful information were discarded
  • Improved data storage by splitting columns and creating new tables to contain only atomic values to normalize the database by following normal forms.

# Data Exploration

  • Applied a number of queries to better understand the data using MSSQL
    • We can see what are the best titles in the available data, considering only titles with more than 50k votes:
      Best titles per type in the IMDb dataset From the available data in the inner join of tables “title_basics” and “title_ratings”, and considering only titles with more than 50K votes, the best titles per title type are:

      tconst titleType primaryTitle averageRating numVotes
      tt0111161 movie The Shawshank Redemption 9,3 2,523,946
      tt15097216 movie Jai Bhim 9,3 166,667
      tt0108598 short The Wrong Trousers 8,3 53,272
      tt2301451 tvEpisode Ozymandias 10 149,682
      tt5491994 tvMiniSeries Planet Earth II 9,5 107,823
      tt0067023 tvMovie Duel 7,6 68,641
      tt14392248 tvSeries Aspirants 9,6 290,200
      tt14650074 tvSeries Dhindora 9,6 111,317
      tt11337862 tvSpecial Friends: The Reunion 8,1 55,145
      tt1569923 video Batman: Under the Red Hood 8,1 59,376
      tt2140553 videoGame The Last of Us 9,7 56,588
      Best Series from the 90’s From the available data, the best series from the 90’s, by genre, are:

      tconst startYear genres primaryTitle averageRating numVotes
      tt0096657 1990 Family Mr. Bean 8,5 111625
      tt0098936 1990 Mystery Twin Peaks 8,8 188434
      tt0103359 1992 Action Batman: The Animated Series 9 95844
      tt0103359 1992 Adventure Batman: The Animated Series 9 95844
      tt0103359 1992 Animation Batman: The Animated Series 9 95844
      tt0108778 1994 Comedy Friends 8,8 925851
      tt0108778 1994 Romance Friends 8,8 925851
      tt0112178 1995 Sci-Fi Star Trek: Voyager 7,8 65338
      tt0118276 1997 Fantasy Buffy the Vampire Slayer 8,2 140086
      tt0118421 1997 Thriller Oz 8,7 95231
      tt0141842 1999 Crime The Sopranos 9,2 356000
      tt0141842 1999 Drama The Sopranos 9,2 356000
      Note: one title can have multiple genres, thus, it can appear more than once.
      Best films since the 2000s From the available data, the best movies since the 2000s, by genre, are:

      tconst startYear genres primaryTitle averageRating numVotes
      tt0245429 2001 Animation Spirited Away 8,6 711577
      tt0245429 2001 Family Spirited Away 8,6 711577
      tt0211915 2001 Romance Amélie 8,3 733344
      tt0253474 2002 Music The Pianist 8,5 786946
      tt0167260 2003 Adventure The Lord of the Rings: The Return of the King 8,9 1742198
      tt0365748 2004 Horror Shaun of the Dead 7,9 538596
      tt0367110 2004 Musical Swades 8,2 88076
      tt0338013 2004 Romance Eternal Sunshine of the Spotless Mind 8,3 963062
      tt0482571 2006 Mystery The Prestige 8,5 1268469
      tt0468569 2008 Action The Dark Knight 9 2474194
      tt1286537 2008 News Food, Inc. 7,8 50783
      tt1424432 2010 Documentary Senna 8,5 68839
      tt1375666 2010 Sci-Fi Inception 8,8 2218561
      tt1424432 2010 Sport Senna 8,5 68839
      tt1562872 2011 Musical Zindagi Na Milegi Dobara 8,2 74215
      tt1853728 2012 Western Django Unchained 8,4 1459482
      tt2582802 2014 Music Whiplash 8,5 783845
      tt5813916 2016 War The Mountain II 8,7 107342
      tt6751668 2019 Comedy Parasite 8,6 704270
      tt6751668 2019 Thriller Parasite 8,6 704270
      tt8503618 2020 History Hamilton 8,4 80437
      tt10554232 2020 War Dara of Jasenovac 8,7 80163
      tt10295212 2021 Biography Shershaah 8,7 111676
      tt15097216 2021 Crime Jai Bhim 9,3 166667
      tt15097216 2021 Drama Jai Bhim 9,3 166667
      tt10872600 2021 Fantasy Spider-Man: No Way Home 8,8 377723
  • Imported the data from the MSSQL Server to Power BI to further explore the data
  • Built a Power BI dashboard that lets users explore what is popular on IMDb (best genres and title types) using filter interactions, hierarchies and drills down on tv to get more detailed information, as well as see the evolution of genre development through time

Power BI Dashboard

Copyright © All rights reserved | This template is made with by Colorlib --- Colorlib