What is “Exploratory Data Analysis” anyway?

data moves & topics

By: Kristin Hunter-Thomson

 

 

Let’s take a step back and think about data literacy from a broader perspective that considers the different aspects of working with data. Most broadly, we often do two things with data:

  1. We explore, play, manipulate, visualize, tinker with the data to figure out what is there.

  2. We analyze, interpret, make meaning, communicate what we find from the data.

The benefits of playing around with your data are grounded in what John Tukey (1980) called Exploratory Data Analysis (EDA). Tukey proposed, and others have continued to build upon this assertion, that EDA is a critical first step in working with data. It involves getting familiar with the data by organizing and processing them to gain a sense of what the data do and do not include, so that you are ready to conduct Confirmatory, or Explanatory, Data Analysis (CDA). In CDA we work to articulate and/or show something specific from the data, aka make a claim from the data. In EDA we work to get the data and ourselves ready to do CDA. The question is not whether we should perform EDA or CDA; rather, we should remind ourselves that both are necessary and realistic aspects of working with data. Additionally, building EDA skills and actually doing EDA pay off in the time we save and the success we have in our CDA work.

SO, WHAT IS INVOLVED IN EDA?

EDA includes playing, exploring, manipulating, and tinkering around to gain familiarity with all of our data before we dive into interpretations and drawing conclusions from the data. What that specifically looks like will vary depending on what we are doing with the data and how we are working with the data (e.g., by hand, using technology). However, a few key components of EDA that are applicable to our use of data in the classroom include:

  • Gain an understanding of the context of the data. We interact with data based on our prior knowledge of the topic and our prior experiences working with data. Additionally, knowing what data we do and do not have from the start is extremely helpful in understanding our inference space (see Hunter-Thomson, 2020 for more details on inference space).

  • Evaluate and adjust the scales of data. We record data in ways that make the most sense for data collection, but that is not always the most helpful way of using the data to investigate our testable questions. Therefore, we often need to manipulate, transform, or derive new variables from our data. For example, we collect raw scores for how our students do on a test (e.g., 7/8, 13/15, 86/100), but if our supervisor is curious how many students scored 80% or higher, then we need to convert those raw scores into percentages (e.g., 87.5%, 86.67%, 86%).

  • Use quick-look graphical representations of the data. We can see many more numbers at once, and we can notice things we are or are not expecting more easily in a graph than in a data table. Quickly plotting our data in a graph, changing the axes around, adjusting the graph type we are using, and/or switching which variables are plotted where is key for getting a sense of what the data are before we work to make a claim from the data. If you are looking for platforms that make this easier, check out Tuva, CODAP, Tableau Public, plot.ly Chart Studio Cloud, or RStudio (listed in increasing order of complexity).

  • Find and think about outliers, where applicable. When looking at our data visually in a graph we can start to see which data points are clumping together and which are farther away. The first step is to ask questions about why those values could be so different from others. Then we work to figure out a justification, based on the why we uncover, as to whether or not they should be included or removed from our analysis and interpretation of the data.
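As a rough illustration of the scale and outlier components above, here is a minimal Python sketch. The three raw scores come from the example in the text; the list of class percentages is invented for illustration, and the 1.5 × IQR rule for flagging far-away values is an assumed rule of thumb, not something this article prescribes. A flagged value is only a prompt to ask why it is so different, not a decision to remove it.

```python
from statistics import quantiles

# Raw scores as (points earned, points possible); these are the three
# examples from the text.
raw_scores = [(7, 8), (13, 15), (86, 100)]

def to_percent(earned, possible):
    """Put a raw score on a common 0-100 scale, rounded to two decimals."""
    return round(100 * earned / possible, 2)

percentages = [to_percent(earned, possible) for earned, possible in raw_scores]
at_or_above_80 = sum(1 for pct in percentages if pct >= 80)

print(percentages)      # [87.5, 86.67, 86.0]
print(at_or_above_80)   # 3

# Hypothetical class percentages, invented for illustration only.
class_percentages = [72, 75, 78, 80, 81, 83, 85, 86, 88, 20]

def flag_for_a_closer_look(values):
    """Return values beyond 1.5 * IQR from the quartiles (an assumed rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

flagged = flag_for_a_closer_look(class_percentages)
print(flagged)          # [20]
```

Once the scores share one scale, the supervisor's question becomes a simple count. The flagged score of 20 is where the "why" questions start: was the student absent for part of the test, or was the score entered incorrectly?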

These four components of EDA enable us to gain a better sense of the data we have before we draw a conclusion from the data. If this feels like an overwhelming thing to consider including in your teaching with data in school, take solace in Behrens’s (1997) reminder that EDA is not a specific set of techniques or steps to accomplish, but rather an attitude towards working with data.

While EDA is a large part of working with data, the components of and skills for EDA are things that many of us are rarely exposed to in our K-12 experiences when working with data, outside of a dedicated research class in high school. Instead, in the interest of time, the data we most often work with are already processed for us or limited in scope to just what we personally have collected. However, data in the real world, aka the data our students will work with outside of the classroom, rarely seem to be “ready to use”. For example, the data may have missing values, inconsistent labels, typos, etc. Or the data may be organized in a way that makes it hard to graph them to answer our question. Either way, this can make it challenging and overwhelming to try to use the data. What we most often need to do with data in the real world is spend some time organizing and/or processing the data at the start so we can more easily visualize and make sense of the data in the long run. But for our students to do that naturally and confidently in the real world, we need to start giving them scaffolded opportunities to practice these skills starting in middle school. While it takes time away from something else, I know that we can break it down and build up slowly in ways that are accessible to our school-age students, so that they can master the skills and make it worth the time.
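To make that organizing and processing step concrete, here is a minimal Python sketch. Everything in it is hypothetical: the pond temperature records, the typo-correction map, and the list of markers treated as "missing" are assumptions made up for illustration. With real data, you would build those by looking through the data first.

```python
# Hypothetical "real world" records with inconsistent labels, a typo,
# and missing values.
records = [
    {"site": "Pond A", "temp_c": "18.5"},
    {"site": "pond a ", "temp_c": ""},
    {"site": "POND B", "temp_c": "19.1"},
    {"site": "Pnod B", "temp_c": "n/a"},
]

# Assumed corrections and missing-value markers for this made-up dataset.
LABEL_FIXES = {"pnod b": "pond b"}
MISSING_MARKERS = {"", "n/a", "na"}

def clean_label(label):
    """Trim extra spaces, lowercase, and apply any known typo fix."""
    normalized = " ".join(label.split()).lower()
    return LABEL_FIXES.get(normalized, normalized)

def clean_value(text):
    """Return a number, or None when the entry marks a missing value."""
    stripped = text.strip().lower()
    return None if stripped in MISSING_MARKERS else float(stripped)

cleaned = [
    {"site": clean_label(r["site"]), "temp_c": clean_value(r["temp_c"])}
    for r in records
]

sites = sorted({r["site"] for r in cleaned})
missing_count = sum(1 for r in cleaned if r["temp_c"] is None)

print(sites)          # ['pond a', 'pond b']
print(missing_count)  # 2
```

Four differently typed labels collapse into two sites, and the two missing temperatures are counted rather than silently dropped, so we know what the data do and do not include before graphing them.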

Note: Similar to “inference space” (Hunter-Thomson, 2020), I do not think that students need to learn the definition of “exploratory data analysis.” In fact, I worry that learning the definition may distract from learning the skills. However, I do think that students can learn the components of EDA, so that they can apply the skills when working with data.

NON-CLASSROOM DATA EXAMPLE

EDA is something that may seem new, but in reality most of us are familiar with it as part of the process of working with data. Let’s explore the example often demonstrated in detective TV shows or movies (bring on the CSI!).

In almost every detective storyline there is a question the team is trying to answer, aka who committed the crime. This is like a testable question when working with data in our classrooms.

The first thing the detective team does is start to collect evidence/data from multiple different sources to try to piece together the story of what happened, or answer their question of “who done it”. As the team gathers evidence, they look at the data, evaluate their reliability and/or relevance, and start to amass a dataset of evidence. If the new bits of evidence are not reliable or relevant, they are removed from further analysis. If they are reliable and relevant, they remain in the dataset. The team looks through the data to identify gaps in their understanding, or missing pieces of the puzzle. This prompts the need for more data collection. They visualize the data/evidence that they do have in multiple ways and from different angles to better understand what they do and do not have, as well as what the evidence/data may be saying.

This is EDA at its finest! They are exploring, playing, evaluating, processing, and organizing the data. They are getting to know their data in order to draw conclusions from them.

Once they have a dataset of reliable and relevant data, the detective team next explores the data to test assumptions and/or try out different potential hypotheses for what happened. This lets them start down the path of drawing a conclusion from their data. Eventually, by visualizing, reviewing, and analyzing the data, they come to an initial conclusion. In the shows/movies this usually prompts an arrest or the charging of an individual.

But in the actual process, the work with the evidence/data is not done at that point, right? It gets passed along to lawyers, judges, and juries to evaluate and decide whether the evidence/data support the conclusion. That is the major CDA step of working with the evidence/data in this analogy.

While this is in no way a perfect analogy (detective shows/movies notoriously underrepresent the timeframe of collecting and processing data and overrepresent the amount of positive/relevant data collected), I think it provides a good example of how we are exposed to the processes of EDA and CDA in ways that we often don’t realize. These processes are all around us; the trick is to think about how we can help students more explicitly understand what they are, how to do them, and why we engage in each process.

REFERENCES

Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2(2), 131.

Hunter-Thomson, K. (2020). Data literacy 101: What can we actually claim from our data? Science Scope, 43(6), 20-26.

Tukey, J. W. (1980). We need both exploratory and confirmatory. The American Statistician, 34(1), 23-25.