Netflix and Chill: How I Made My 2017 Iron Viz Feeder Entry

I'll be honest: finding inspiration for this challenge was not easy. Of course, I, like most of us, have my favourite movies and TV shows that I could binge watch for hours. But the challenge lies in telling a data story, not the story of the film or TV show. Those stories have been told brilliantly by their award-winning directors, actors, and actresses. I needed to find something that wouldn't tempt me to repeat those ideas, something with a data story that would stand on its own.

So I procrastinated. Appropriately. By binge watching on Netflix. I haven't always been a Netflix fan; I used to be quite a dedicated cord-watcher. Mostly (shamefully) because I have a strange obsession with awful reality TV shows and Dr. Phil, which you just can't get on Netflix! I started to think about my own personal history with Netflix, and then got curious about the history of Netflix itself. I started where any good researcher starts: Google.

The first thing I found out was that Netflix started over a $40 late fee that Reed Hastings (the founder of Netflix) had been hit with when he rented Apollo 13 from Blockbuster. That got me thinking about the history of Blockbuster as well. I still clearly remember the skeleton of a Blockbuster store that sat eerily empty on my bus route to university when I was living in Vancouver. What happened? Who gutted you, my friend?

My research eventually led me to this:
[Image: chart of Netflix vs. Blockbuster revenues over time]

I think this one image is the perfect example of a data story that hits hard. The sweeping nosedive into bankruptcy, and the David that came out on top of Goliath. Ouch. Painful, but there was my inspiration for this Iron Viz feeder.

Data Sources

Finding information and data sources for Netflix was no problem. I found datasets for the movies and TV shows featured, information about their subscribers, and even the original dataset used in their famous competition challenging data scientists to beat the accuracy of their recommendation system. Blockbuster, on the other hand, was much more challenging.

The issue with Blockbuster is that it went bankrupt around 2010, when data analytics was just starting to enter mainstream consciousness. Prior to that, in their prime, no one had bothered to collect and collate data about the company, or if they had, they certainly did not put it out on Kaggle or GitHub, both of which were born just as Blockbuster died out. So my only source was Blockbuster's financial statements.

Now, even the keenest accountant will probably tell you that for an average Jane like me, financial statements are not the most riveting literature. But I started all the way back in 1999 and powered on through. And I started to see a narrative come to light.

What I found was that Blockbuster's late fees were a huge source of strife for the company, even in its early days. The company was involved in several lawsuits related to its late fees, and despite customer dissatisfaction, it wasn't willing to let them go. This was understandable given that at one time it raked in almost $800 million from late fees alone! It wasn't until Netflix and other competitors entered the scene that Blockbuster started to rethink things and introduced its "no late fees" pitch. Possibly (probably) too little, too late.

Design

In terms of design and analytics, I tried to keep everything as minimal and simple as possible. This is financial statement data, but I don't believe it needs to be couched in accountant-speak to be effective. The most "complex" chart I used is probably the jump plot, but I felt it gave another perspective on Blockbuster's decline and Netflix's rise beyond the trend line. (Note: Thank you to Chris deMartini for outlining how to build this chart, and Robin Kennedy for helping me figure it all out!)

The only colours I used in the viz were a pale yellow-grey for the background (I dislike white, it’s quite jarring on computer screens), charcoal grey (I dislike black for the same reason), and red and blue (company branding colours) to represent Netflix and Blockbuster respectively. I tried to eliminate colour legends as much as possible and wherever I mentioned the names of the companies, I used the red and blue colours to indicate that these are the companies that my charts referenced.

I was mindful of users' interactions with my trend lines, so I included dots overlaid on the lines to make it easier for users to know where to point their cursor for tooltip information. I also used a calculation to switch between displaying figures in millions or billions beyond a certain threshold, so that users would always be able to see Netflix's KPIs, even before they made their first $1 billion. I included an annotation in my bubble charts because I knew it would be challenging to find the little pixel that proportionally represented $40 against Blockbuster's hourly revenue. I tried to make everything easy, simple, and smooth.
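That millions/billions switch is simple enough to sketch. The actual implementation was a Tableau calculated field, but here is the gist of the logic, written in R for illustration (the exact threshold and label format are my assumptions, not a copy of the original calc):

# Sketch of the millions/billions display switch (illustrative only;
# the real version was a Tableau calculated field with the same shape)
format_kpi <- function(value) {
  if (value >= 1e9) {
    paste0("$", round(value / 1e9, 1), "B")   # at or above $1B, show billions
  } else {
    paste0("$", round(value / 1e6, 0), "M")   # below that, show millions
  }
}

format_kpi(2.3e9)  # "$2.3B"
format_kpi(4.5e8)  # "$450M"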

And yes, I did use a pie chart, although it’s technically more of a DVD chart, but I felt like I’d throw in a bit of artistic liberty in the mix. I also felt that because I was only showing two proportions of the whole (DVD), it was an appropriate use of a pie/DVD and really emphasised how much Blockbuster’s late fees contributed to their revenue.


Putting the R in AlteRyx: A Personal Challenge

There are a lot of nuanced differences between data analytics and data science that can be difficult to pinpoint. In general, analytics tends to explore patterns in current data to find actionable insight, while data science uses those same patterns to make predictions that drive actionable insight. I've loved analytics, but I've been curious to see how machine learning and predictive analytics can enhance my data explorations. To that end, I recently completed Springboard's Data Science curriculum, which provides an introduction to data science, mostly using the R programming language.

My first introduction to data analytics tools was Alteryx, and in my experience, it can be challenging to switch from it to a programming language like R to conduct analysis. Alteryx is intuitive, there is no programming involved, and many of the most common manipulations, like transposing, selecting fields, and joining data, can be done with just a couple of clicks. The benefit of using R, however, is its wealth of pre-built packages for some pretty advanced predictive analytics. Oh, if only there were a way to combine the two!

Enter Alteryx and R integration, circa 2013.

Since version 8.5, Alteryx has provided several tools built on the R statistical programming language. These allow users to explore data with R's advanced predictive analytics packages while still incorporating the intuitive, visual workflows that make analytics easier and more efficient in Alteryx.

So I have decided to take on a personal challenge: replicating different R predictive exercises in Alteryx, not only to gain a stronger understanding of the logic behind these analyses, but also to demonstrate how they can be performed much more efficiently in Alteryx. I've spent a lot of time using Alteryx for data preparation and clean-up, but I feel its strength also lies in its forward-looking, predictive capabilities. Over the next few weeks, I will showcase workflows that were originally created in R and demonstrate how I managed to translate them with the tools provided by Alteryx. So stay tuned and watch this space!

Challenge 1: CART Models and Predicting Supreme Court Decisions

Putting the R in AlteRyx: CART Models

Challenge 1: CART Models and Predicting Supreme Court Decisions

This post is part of a series I am writing to translate R scripts I have seen or written into Alteryx workflows. Original post can be found here.

This script comes from MIT's Data Analytics course (which you can sign up for here). In the section introducing Classification and Regression Trees (CART), they use data on decisions made by US Supreme Court justices and try to predict whether a justice will reverse or uphold the decision of the lower court. Between 1991 and 2001, the same nine justices served on the Supreme Court, the longest such stretch in US history, which gives us a richer, more consistent data set to examine than a period with changing membership would. More specifically, this CART model looks at the decisions of Justice John Paul Stevens, and whether several factors can predict his decision to affirm or reverse the decision of the lower court. These factors include the subject of the case, whether the lower court's decision was liberal or conservative, and the type of petitioner involved in the case.

Although this analysis could be performed with logistic regression, the outcome would not be as easily interpretable. When we create a decision tree in R, if a variable has an effect on the outcome (in this case, Justice Stevens reversing the decision of the lower court), we can easily see where it sits in terms of its effect on the outcome and its relationship to other significant variables:

[Image: decision tree plot produced by R]
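For context, here is a minimal R sketch of the kind of script being translated, along the lines of the course material (the file name, column names, split ratio, and minbucket value are assumptions from memory, not a verbatim copy of the course code):

library(caTools)     # sample.split() for the train/test split
library(rpart)       # CART models
library(rpart.plot)  # prp() for plotting trees

stevens <- read.csv("stevens.csv")  # hypothetical file name for the case data

# Split the data into a training set and a test set (ratio is an assumption)
set.seed(123)
split <- sample.split(stevens$Reverse, SplitRatio = 0.7)
train <- subset(stevens, split == TRUE)
test  <- subset(stevens, split == FALSE)

# Build a classification tree predicting whether Stevens reverses the
# lower court's decision (predictor names are assumptions)
tree <- rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent +
                LowerCourt + Unconst,
              data = train, method = "class", minbucket = 25)

prp(tree)  # draws a decision tree like the one above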

Creating a Decision Tree model in Alteryx uses just two tools:

  1. The Create Samples tool to split our data into a training set to build our model on, and a test set to validate it
  2. The Decision Tree tool to actually build our model

[Image: the two-tool Alteryx workflow]

And with that, Alteryx spits out a summary report to show the details of how the model was run, and a visual report that includes the Decision Tree, the significance of each variable in determining the outcome, and a confusion matrix to summarise the accuracy:

[Images: the Alteryx summary report, decision tree plot, and confusion matrix]

Although the Model Comparison tool is not included in the default Alteryx package, it can be found in the Alteryx gallery.

[Image: the Model Comparison tool]

We can use this tool to evaluate the accuracy of our model when compared to a simple baseline that predicts the most frequent outcome in our test set.

Before looking at the report generated by this tool, we can check our simple baseline accuracy by using a Summarise tool to get a count of 0 and 1 responses. We then use another Summarise tool to get both the most frequent outcome and the total number of rows in our data set. Lastly, we can calculate the accuracy using a Formula tool and the calculation [Max_Count]/[Sum_Count]. This gives us an accuracy of 54.7%.

[Image: the baseline accuracy calculation in Alteryx]
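For comparison, the same baseline check is just a couple of lines in R (assuming the test set from the earlier sketch, with a binary Reverse outcome):

# Baseline model: always predict the most frequent outcome in the test set
outcome_counts <- table(test$Reverse)
max(outcome_counts) / sum(outcome_counts)  # ~0.547, matching the Alteryx result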

If we look at the report generated by the Model Comparison tool, we can see that our accuracy is about 67%, an improvement on our baseline. The report also indicates that the AUC is 73%, which tells us our model is good at differentiating between a reversal and an affirmation decision from Justice Stevens.

[Image: summary of accuracy and AUC from the report generated by the Model Comparison tool]
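The equivalent checks in R might look like the following, using the ROCR package (a sketch continuing from the earlier code; the default 0.5 cutoff behind type = "class" is an assumption):

library(ROCR)

# Accuracy of the tree on the test set
pred_class <- predict(tree, newdata = test, type = "class")
conf_mat   <- table(test$Reverse, pred_class)
sum(diag(conf_mat)) / sum(conf_mat)                 # ~0.67

# AUC: how well predicted probabilities separate reversals from affirmations
pred_probs <- predict(tree, newdata = test)[, 2]    # P(reverse)
rocr_pred  <- prediction(pred_probs, test$Reverse)
as.numeric(performance(rocr_pred, "auc")@y.values)  # ~0.73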

The benefit of using tools in Alteryx is that the code has already been written for us, but we still have the ability to change the default parameters in the Decision Tree tool, such as the complexity parameter and the independent variables. The Model Comparison tool can also be used to quickly compare the accuracy of several models generated in Alteryx, such as logistic regression and random forest. With just a handful of tools, we are able to create an interpretable model that predicts the decisions of a Supreme Court justice with an accuracy well above baseline. In addition, the report generated by the Model Comparison tool provides assessment plots, like an ROC curve, that can help in deciding what thresholds to use when building our models.
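For completeness, the ROC curve itself can also be drawn in R from the same ROCR objects (a sketch, continuing from above):

# ROC curve: true positive rate vs. false positive rate across cutoffs
roc_perf <- performance(rocr_pred, "tpr", "fpr")
plot(roc_perf, colorize = TRUE, print.cutoffs.at = seq(0, 1, by = 0.1))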

Things to Watch When Replacing Data Sources

When creating workbooks that will have future iterations (i.e. not one-time, static infographics), there may come a time when you have to either refresh the data in your dashboard or replace it with another data source.

In the ideal scenario, especially if you have your workbook on Tableau Server, your workbook would be connected to a live data source; you would just update that source (without changing its name or any field names) and your workbook would update automatically. No problems.

However, sometimes you will have to replace the original data source with a new one. If for whatever reason you cannot update or refresh a live data source connected to your workbook, there are some things you need to bear in mind.

The usual process to replace a data source is as follows: open your workbook, click the add data source icon, add the new data source, and then replace your original data source:

    1. Add the new data source
      [Image: adding a new data source]
    2. Right click on the original data source and select “Replace Data Source”
      [Image: the Replace Data Source menu option]
    3. Replace with the new data source
      [Image: the Replace Data Source dialog]

If the new data source has EXACTLY the same field names, you should generally be fine. However, if anything has changed, even if it's just removing a hyphen or capitalising a field name, you will break a few things.

For example, let's say you build a dashboard with an initial data source (in my case, Sample Superstore). Then you decide that you need to replace the data source, and for whatever reason (maybe a different person pulled the data this time, maybe the fields were renamed as part of a new policy, maybe you wore the wrong kind of socks that morning, whatever the case may be), some of the fields were renamed. For this example, I've renamed Category as category and Subcategory as subcat.

The first thing you will notice when you replace the data source is that the fields that were renamed now have a red exclamation mark (!) next to them. This is because Tableau thinks the fields are no longer in the data. To fix this, just right click on the field, select "Replace References", and point it to the new renamed field:

[Images: replacing field references]

This is where the break happens:

[Image: my dashboard before replacing the data source]

[Image: my dashboard after replacing the data source]

What has changed? Well, there are a few things:

  1. Colour: The most obvious change is the colour that I had initially used for my different categories. When you replace your data source with new field names, Tableau will revert to its default colour scheme
  2. Legend arrangement: In addition to the colour change, Tableau has also rearranged my legend so it is no longer a single row
  3. Sort: My sales by subcategory viz initially had a custom sort that put Technology at the top. Tableau has reverted to the default alphabetical sort
  4. Aliases: If you look at the Segment Profitability bars, you’ll notice that the bar that was initially called “Self-Employed” has reverted to its original non-alias, “Home Office”
  5. Grand totals: Although it didn't change in this instance, I have seen "Grand Total" fields disappear. In my own experience, I've typically seen it happen with Grand Total columns that sum up your rows, but be mindful of this as well if you have Grand Total rows that sum up your columns

When you replace your data sources, make sure you pay attention to the potential changes outlined above. Some other areas to pay attention to are sets, the format of quick filters on your dashboard, and groups. In light of all these loose ends, it’s best to avoid having to replace data sources entirely and just connect your workbook to a live data source that is updated via Tableau Server. It will save a lot of time on maintaining and updating your dashboards.

Later days

Amazing Apps for #Data16

#Data16 is coming up fast, and as many of us in the UK get ready for the 11-hour journey across the Atlantic, I've been checking out tons of apps to keep me prepared and entertained. I love apps; if I didn't put them in neat little folders on my phone, I would have home screens in the double digits. If you haven't already, make sure to download the official #Data16 app to see all the available sessions, register for hands-on workshops, and get live updates on what's going on in Austin. In addition to the official app, here are four apps in my toolkit for the great data saga of 2016.

1) PackPoint


Worried about packing too little? Too much? Too warm? Too cold? This is a great little app that generates a packing list for you based on the weather where you’re going and how long you’re staying. You can also choose from a list of activities so your list is fully customised. Now you can use all that extra suitcase space for more #data16 souvenirs!

2) Jetlag Rooster


I need my sleep. I am an absolute crank if I don't get enough hours, and I love my nap time. Jet lag is a terrible affliction for me. Enter Jetlag Rooster, an app that will create an optimised sleep schedule to minimise the effects of jet lag. You can choose whether to start adjusting a few days before you leave or once you arrive in your destination city. Use the website linked above, or download the app on iOS or the Google Play store.

3) Google Maps

It's a staple app on many phones, but what makes it useful for travelling is that you can download maps for offline use later. Mark all the places you need to keep track of in Austin, like your hotel, the Austin Convention Center, the nearest pub with the best local beer (especially if you're coach Kriebel :)), etc. Data roaming is not cheap, and if you need to figure out where you are without a data connection, the offline maps will be a lifesaver.

4) #Data16 Dashboards

Okay, I lied, this post isn't all about apps. Some amazing and very useful dashboards have been developed by folks in the Tableau community that are just as handy and accessible as any app:

[Image: dashboard of #Data16 attendees' arrivals and hotels]

See when people are arriving in town, where they're staying, and where fellow newbs are, so you can huddle in a corner with them (no, don't do this, huddle with everyone everywhere please!). Fill out the Google sheet here to add your data to the viz!

[Image: dashboard mapping conference sessions and walking times]

If you're able to get a data or wifi connection, I highly recommend checking this viz out rather than Google Maps. It's designed to provide you with a map and walking estimate to get from session to session during the conference. It might also help you narrow down your choices among hundreds of amazing presentations; kickassness ratio held equal, why not attend the presentation that's just a hop, skip, and a jump away?

[Image: dashboard of UK attendees' flights to Austin]

You know those awful flights where the person you're sitting next to just wants to gab away, spills their drink in your lap, and brings smoked salmon as a mid-flight snack? Yeah, that's me. BUT I like to think that if I'm sharing a row with a fellow TC attendee, at least the gabbing about data geekery won't be so bad? Chapman's dashboard shows all the flights UK folks are taking on their trek across the Atlantic – see if you've got some other data geeks on board!


Dealing with Cognitive Quirks

I have a confession to make. I suffer from a debilitating condition that affects hundreds, if not thousands, of data analysts on a daily basis. It is a serious condition and it leads to many an unpublished viz and countless hours of unnecessary calculations. We call this condition… Analysis paralysis.

Over the last few months, I've been feeling stagnant in my creativity. The problem is, I get a little too excited when I see data. My brain automatically conjures up a thousand ways I could investigate, analyse, and extrapolate. But there aren't enough hours in the day, and more importantly, not every one of these ideas deserves investigation. We know from cognitive psychology research that too many choices can lead to a lack of action (see Barry Schwartz's The Paradox of Choice). And this seems to have become my theme song lately – too many choices, not enough vizzing.

In addition, I crave my gold stars: If I create something, I want it to be perfect and I want my gold star of praise and bright shining acknowledgement. But when you’re already dealing with analysis paralysis, throwing a gold star fixation on top is a recipe for complete brain freeze, and not the fun kind that comes with chocolate chip ice cream.

I’ve dealt with this in a few ways, and if you’re dealing with one or both of these awful cognitive grips, maybe this will help you break loose:

1. Know Thyself

Everything I've written above has come from a lot of introspection and monitoring, as well as non-judgment. I'm not proud of my flaws, but it is foolish to pretend they aren't there. The first step is to watch yourself: know what makes you tick and what makes you stop in your tracks. My problem is too much inspiration; maybe yours is the opposite? Or maybe you don't care for other people's opinions, but no one likes what you're producing and you need to learn some foundational skills? I'm not saying bend to others' will, but being aware of what is stopping you from growing as an analyst is necessary for dealing with the problem.

2. Challenge yourself (based on what you know)

The second part of this is critical, and it's why I emphasise self-awareness so much: others' challenges might not be yours. What I mean is that challenges are only helpful if they help you grow, not if they completely burn you out. You have to push your muscle to stretch it, but you don't want to throw yourself out of commission. For example, I normally work on vizzes for HOURS. My challenge is not to spend more time working on vizzes, but to spend less. So I limit myself to one hour of work on a viz. That gives me enough time to get something satisfactory done while still pushing me to break a sweat. For some people, the challenge might be two hours, or even ten minutes. You know you: set appropriate challenges, make sure they're challenging, but don't overdo it by using someone else's metric.

3. Expect the expected

At the end of an hour of work, I am rarely 100% satisfied. My inner perfectionist is a whiny, nagging worm, and as they say, you are your own biggest critic. Expect criticism, not only from yourself but from people reading your work as well. You cannot satisfy everyone, sometimes not even yourself. The way I deal with it? Just put it out there. No excuses. Don't fear criticism; it can shape your growth in ways you couldn't have come to on your own.

An example:

This week's Makeover Monday (and any Makeover Monday, really) provided a great opportunity to put these goals into action. Part of the challenge this week was to create a visualisation based on two numbers (and only two numbers, see link here and try the challenge yourself!). When I sat down to flex my data brain, I set my clock for an hour and just let things unfold. I ended up creating a visualisation to give some context to an inconceivable number for US debt: $19.5 TRILLION. I added some measurements for things we typically think are astronomical in cost, but that are only a fraction of the current US debt. I had data, I dug deeper, I found a story that I thought deserved telling, and it was told because I challenged myself to tell it.

Unfortunately, I missed the mark on the original challenge and received some well-deserved criticism. So what now? Well, the benefit of my system is that because I committed to my own challenges based on my own needs (and succeeded!), I still get my gold stars. But gold stars on their own are meaningless unless they help me push my muscle further. So even though I failed the challenge this week, I have a new challenge set for myself next week! As cheesy as it is, remember:

[Image: "FAIL = First Attempt In Learning" meme]

And stay tuned for my next successful gold star 🙂


Designing Inclusive Dashboards

Featured image shows a stylised picture of a person standing. Their shadow shows the universal symbol for disabled people – a person in a wheelchair. Text underneath reads "not every disability is visible".

If you missed my talk on designing dashboards that are accessible for folks with disabilities, not to worry. Here is a quick summary of the key things to remember when designing accessible dashboards.

Note: My Tiny Tableau Talk is also available on YouTube! Check it out at: https://youtu.be/1ieSb_-hW7s

1) There are many benefits to designing inclusively – including financial benefits

The social benefits are (hopefully) obvious to most people: people's bodies are shaped differently, which creates barriers for folks who are not conventionally able-bodied, so let's strive to remove those barriers, or at least kick them down a notch. In addition, there are financial benefits to designing inclusively that are often overlooked.

For example, websites that are designed for maximum accessibility tend to be picked up by search engines more efficiently, because they have a simpler construction and emphasise text rather than dynamic media. Search engine optimisation can therefore expose your content to a wider audience, one that includes people with both conventional and unconventional needs.

In addition, accessible design is easier to maintain in the long run because it is less sensitive to shifts in technology. You can therefore significantly decrease maintenance costs, without needing to completely restructure your design to align with every new change in web design or UX technology.

2) There are many kinds of accessibility needs to consider when designing dashboards

Because data visualisation is, well, visual, designing dashboards that address the needs of blind and partially sighted folks seems obvious. But there are other needs to consider, such as:

  1. Motor/mobility needs – people with limited mobility may also use screen readers
  2. Auditory – If you include audio or video media in your dashboard, make sure to include a text resource
  3. Seizures – If you include video media, be careful with flashing images
  4. Cognitive/Intellectual – This can be a bit tricky because of the wide range of needs presented. However, there are still things we can do as dashboard designers such as keeping language clear and simple, and reducing visual clutter. For a more thorough explanation of possible choices that can be made to minimise cognitive/intellectual barriers, visit: http://ncdae.org/resources/articles/cognitive/ 

Although visual needs may be the most pertinent for data visualisation designers to address, it's important to bear in mind that other accessibility needs may come up.

3) Consider Layout

[Image: a person using a screen reader that translates text into Braille]

Blind and partially sighted folks do access internet resources, most often by using screen readers that will either read text out loud or translate it into Braille. In order for these tools to work efficiently, it's important to structure your layout in a way that makes it easy for these devices to pick up your content. For example, many dashboard designers will use images of text in their vizzes to get around font discrepancies between Tableau Public and their local computer. Avoid doing this – in order for screen readers to pick up your content, text must be written as text.

Partially sighted and elderly folks also find it easier to consume text online by using screen magnifiers to enlarge fonts, so layout is important to bear in mind here too. More specifically, make sure to use fixed containers in Tableau rather than floating containers and text. This ensures that everything stays in the same place if folks need to enlarge the text on their screen.

4) Use Descriptive Text and Tables

People often bristle at the suggestion of descriptive text because it sounds like a lot of extra work, but it's really not. You may have noticed that throughout this post I've given brief descriptions of the images included, so that a blind person using a screen reader would still have some context. Descriptive text doesn't need to be wordy; it just needs to summarise the main points of the information presented.

A good example of descriptive text and data can be seen with Penn State’s accessibility guide, which can be accessed here: http://accessibility.psu.edu/images/charts/

[Image: bar chart of final /r/ pronunciation by social class and gender, summarised in the text below as an example of how descriptive text can be used for data visualisations]

Summary of Trends
The numbers show that /r/ dropping becomes more common in lower classes (lower percentages of final /r/), but that women consistently preserve more /r/'s than men across social classes. That is, women are more likely than men to approach standard English across social classes.

Ultimately, when deciding what to write in your text descriptions, ask yourself: what do you want your audience to take away from the data you have presented? What is your message? All good data designers should know the answers to these questions, regardless of whose needs they are designing for.

5) Consider Colour

If you've ever wondered why orange and blue are the default colour options in Tableau, it's because approximately 8% of men are colour blind, most commonly to red and green. You therefore want to avoid the conventional red = bad and green = good encoding, because these colours would just show up as brown for someone who is red-green colour blind.

[Image: split into two sections. The first shows red and green apples the way a colour-sighted person would perceive them; the second shows the same apples the way a colour blind person would see them, in shades of yellow]
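If you want to sanity-check your own palette, the dichromat package in R can simulate how colours appear to colour blind viewers (a quick sketch; the hex codes are just an example red and green, not Tableau's defaults):

library(dichromat)  # simulates colour-vision deficiencies

palette <- c(good = "#2CA02C", bad = "#D62728")  # a typical green and red
dichromat(palette, type = "deutan")  # the same palette as seen with
                                     # red-green colour blindness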

If you’re really set on using red and green signifiers in your dashboard, consider adding other cues for colour blind folks to pick up on, such as size and shape.

[Image: also split into two sections. The first shows traffic lights the way a colour-sighted person would perceive them; the second shows the same traffic lights the way a colour blind person would see them, in shades of yellow, with the addition of a cross on the stop light to indicate that the driver must stop]

6) Make conscious font choices

In the UK, approximately 10% of the population (regardless of gender) is dyslexic. This means that when reading text, letters will often be perceived as flipped or mirrored. A "p" and a "b" might therefore be perceived as the same letter, which makes it difficult to read text efficiently. In his talk on the benefits of fonts designed for dyslexic people, Christian Boer gives a striking example of how dyslexic people might perceive text:

[Image: a screenshot from Christian Boer's TED talk, showing text with letters squished together and spelled incorrectly to depict how a dyslexic person might read text]

In order to resolve this, fonts like Dyslexie have been designed with unique identifiers for each letter, reducing the possibility of flipping or mirroring text. Some, such as OpenDyslexic, are open source.

If you are still hesitant to use these fonts, it is best to at least avoid fonts with serifs (so, NOT what I use in my blog design!). That means sticking to fonts like Arial, Calibri, and yes, even the frequently shunned Comic Sans 🙂