EDA: The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often employing data wrangling and visualization methods. The six main practices of EDA are: discovering, structuring, cleaning, joining, validating, and presenting.
Discovering in EDA refers to gaining an initial understanding of the dataset, including:
Exploratory data analysis (EDA) is not like a cake recipe. It is not a step-by-step process you follow. Instead, the six practices of EDA are iterative and non-sequential.
Because datasets vary, the approach to exploring them will be different each time. That means you will need to use logic and experience throughout the EDA process to determine which of the six practices to use, how many times to apply each one, and when in the process to apply it.
Visual example
Imagine you are assigned a dataset that has only 200 rows and five columns of data about trees in a coniferous forest in Norway. You know that to complete your full analysis you’ll need more than 1,000 rows and at least two more columns. Even without much more detail than that, your entire EDA process might look something like this:
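To make the discovering step of this scenario concrete, here is a minimal Python sketch that compares the size of a dataset with what the analysis requires. Everything in it is an assumption for illustration: the column names, the placeholder values, and the helper functions `discover` and `gaps` are hypothetical, and the thresholds simply encode "more than 1,000 rows" and "at least two more columns" from the example above.

```python
# Hypothetical requirements taken from the tree example: more than 1,000 rows
# and at least two more columns than the five already present.
MIN_ROWS = 1001
MIN_COLUMNS = 7


def discover(rows):
    """Summarize the size of a dataset given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {"n_rows": len(rows), "n_columns": len(columns), "columns": columns}


def gaps(summary, min_rows=MIN_ROWS, min_columns=MIN_COLUMNS):
    """Return how many rows and columns are still missing (0 if none)."""
    return {
        "rows_needed": max(0, min_rows - summary["n_rows"]),
        "columns_needed": max(0, min_columns - summary["n_columns"]),
    }


# Synthetic placeholder data: 200 rows and five made-up columns.
trees = [
    {"tree_id": i, "species": "spruce", "height_m": 20.0, "age_years": 50, "plot": "A"}
    for i in range(200)
]

summary = discover(trees)
shortfall = gaps(summary)
print(summary["n_rows"], summary["n_columns"])  # 200 5
print(shortfall)  # {'rows_needed': 801, 'columns_needed': 2}
```

A nonzero shortfall like this one suggests that joining additional data is a likely next practice, with validating afterward to confirm the combined dataset is sound. The order is still a judgment call that depends on the data.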
