Exploratory Data Analysis (EDA)

Manual Summaries

With a new dataset, answer the following questions using its codebook if one is available, or by searching or asking if not.

  • What is its source?
  • How was it obtained?
  • Who collected it?
  • When was it collected?
  • What does each column/variable represent?
  • What does each row/observation represent?

Computational Summaries

After that, answer the following questions by computation:

  • How many columns/variables are there?
  • How many rows/observations are there?
  • What is the data type of each column/variable?
  • What are the top few rows/observations?
  • What are the bottom few rows/observations?
  • How many missing values are in each column/variable?
  • How many rows/observations contain one or more missing values?
  • What are the summary statistics of each column/variable?
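
The questions above can be answered with a few lines of code. The following is a minimal sketch in Python using only the standard library. The sample data, the column names, and the helper names (load_rows, infer_type, summarize) are all hypothetical and are not from any particular package. It assumes that an empty field means a missing value and it guesses each column's type from its values; a real dataset may encode missing values differently (for example as "NA" or -99).

```python
import csv
import io
import statistics

# Hypothetical sample data; empty fields are treated as missing values.
SAMPLE_CSV = """id,age,city
1,34,Boston
2,,Denver
3,29,
4,41,Austin
5,38,Boston
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, mapping empty strings to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in reader]


def infer_type(values):
    """Guess a column type from its non-missing values: int, float, or str."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in present:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


def summarize(rows, n=2):
    """Answer the computational-summary questions for a list of row dicts."""
    columns = list(rows[0].keys())
    summary = {
        "n_columns": len(columns),
        "n_rows": len(rows),
        "types": {},
        "head": rows[:n],    # top few rows
        "tail": rows[-n:],   # bottom few rows
        "missing_per_column": {},
        "rows_with_missing": sum(
            1 for r in rows if any(v is None for v in r.values())
        ),
        "stats": {},
    }
    for col in columns:
        values = [r[col] for r in rows]
        col_type = infer_type(values)
        summary["types"][col] = col_type
        summary["missing_per_column"][col] = sum(v is None for v in values)
        # Numeric summary statistics only make sense for numeric columns.
        if col_type in ("int", "float"):
            nums = [float(v) for v in values if v is not None]
            summary["stats"][col] = {
                "min": min(nums),
                "max": max(nums),
                "mean": statistics.mean(nums),
                "median": statistics.median(nums),
            }
    return summary


rows = load_rows(SAMPLE_CSV)
summary = summarize(rows)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Note that an identifier column such as id is inferred as numeric here, so it also gets summary statistics even though they are not meaningful. This is one reason to read the codebook before trusting computed summaries.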

In a blog post, Michael Clark demos eight R packages that can help explore data quickly and effectively. The packages are arsenal, DataExplorer, dataReporter (previously dataMaid), gtsummary, janitor, SmartEDA, summarytools, and visdat.