Exploratory Data Analysis (EDA)
Manual Summaries
With a new dataset, answer the following questions using its codebook if one is available or, if not, by searching or asking.
- What is its source?
- How was it obtained?
- Who collected it?
- When was it collected?
- What does each column/variable represent?
- What does each row/observation represent?
Computational Summaries
After that, answer the following questions through computations:
- How many columns/variables are there?
- How many rows/observations are there?
- What is the data type of each column/variable?
- What are the top few rows/observations?
- What are the bottom few rows/observations?
- How many missing values are in each column/variable?
- How many rows/observations contain at least one missing value?
- What are the summary statistics of each column/variable?
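
The computational questions above can be sketched in a few lines of code. The example below is a minimal, dependency-free illustration in Python, not a recommended workflow: the toy data, the column names (id, age, city), the helper names (is_missing, infer_type), and the rule that a blank field counts as missing are all assumptions made for the sketch.

```python
import csv
import io
import statistics

# Hypothetical toy data; blank fields stand in for missing values.
RAW = """id,age,city
1,34,Lyon
2,,Oslo
3,29,
4,41,Lima
5,38,Oslo
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
columns = list(rows[0].keys())


def is_missing(value):
    # Assumption: a blank (or absent) field means "missing" in this toy file.
    return value is None or value.strip() == ""


def infer_type(values):
    # Crude type guess based on the non-missing values of one column.
    present = [v for v in values if not is_missing(v)]
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in present:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


# How many rows/observations and columns/variables?
n_rows, n_cols = len(rows), len(columns)

# What is the data type of each column?
col_types = {c: infer_type([r[c] for r in rows]) for c in columns}

# What are the top and bottom few rows?
head, tail = rows[:2], rows[-2:]

# How many missing values per column, and how many rows have any?
missing_per_col = {c: sum(is_missing(r[c]) for r in rows) for c in columns}
rows_with_missing = sum(any(is_missing(r[c]) for c in columns) for r in rows)

# Summary statistics for one numeric column, skipping missing values.
ages = [int(r["age"]) for r in rows if not is_missing(r["age"])]
age_summary = {
    "min": min(ages),
    "median": statistics.median(ages),
    "mean": statistics.mean(ages),
    "max": max(ages),
}

print("shape:", (n_rows, n_cols))
print("types:", col_types)
print("missing per column:", missing_per_col)
print("rows with missing values:", rows_with_missing)
print("age summary:", age_summary)
```

For readers working in R, base functions such as dim(), str(), head(), tail(), colSums(is.na(x)), complete.cases(), and summary() answer the same questions on a data frame.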
In a blog post, Michael Clark demos eight R packages that can help explore data quickly and effectively. The demoed packages are arsenal, DataExplorer, dataReporter (previously dataMaid), gtsummary, janitor, SmartEDA, summarytools, and visdat.