Once you've got some data, you're going to be eager to dig into it! Data exploration is fundamental to developing an understanding of the nuances of the data and how the policy problem you initially scoped can be specifically formulated as a machine learning problem. This process involves generating and plotting summary statistics, exploring trends over time and understanding rapid changes in distributions, as well as identifying missing data and outliers. Typically, data exploration should involve considerable input from domain experts as you develop an understanding of how the data relates to the underlying generative process, as well as its idiosyncrasies and limitations.
Our tool of choice for data analysis is a combination of Python and SQL. Start off with Intro to Git and Python, then move on to Data Exploration in Python. If you're combining data from multiple sources, you'll have to do record linkage to match entities across datasets. Depending on your particular project, you may need special methods and tools; at this time, we have resources for working with text data, spatial data and network data.
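To make the record linkage idea concrete, here's a toy sketch that matches entity names across two datasets by string similarity, using the standard library's `difflib`. The names are made up, and real projects need blocking, multiple fields, and carefully chosen thresholds; this only shows the basic pattern.

```python
import difflib

# Hypothetical entity names from two data sources.
left = ["ACME Corp", "Smith & Sons", "Northside Clinic"]
right = ["Acme Corporation", "Northside Clinic LLC", "Smith and Sons"]

def best_match(name, candidates):
    # Score each candidate by similarity ratio on lowercased strings
    # and return the (score, candidate) pair with the highest score.
    scored = [
        (difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    return max(scored)

for name in left:
    score, match = best_match(name, right)
    print(f"{name!r} -> {match!r} ({score:.2f})")
```

Dedicated record-linkage tooling adds blocking (to avoid comparing all pairs) and probabilistic matching on top of this kind of pairwise comparison.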
Data exploration supports several goals:
- Sanity checking the data you were given
- Understanding the problem and domain
- Formulating the problem
- Debugging
- Generating and selecting features
- Interpreting results
Start by orienting yourself in the data:
- Tables: identify the database tables that are most relevant to your problem.
- Entities: How can you identify the primary entities for your project? What are some basic, relevant characteristics about them? How have those changed over time?
- Fields: Most projects have a lot of different data fields, but don't feel like your initial exploration needs to understand everything.
- Label: Think ahead about how you might define your label. It's a bit early to know exactly what your label/outcome of interest will be, but it's worth having a hypothesis about it so you can do more targeted data exploration. What can you say about its distribution and how it's changed over time? Based on what you know of the context, what information would you expect to be important in modeling? Is it available? How well does it correlate with your label?
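A quick way to form a hypothesis about the label is to look at its prevalence overall and over time. A minimal sketch in pandas, using a hypothetical inspections dataset with a binary "violation" outcome (column names are illustrative, not from any real schema):

```python
import pandas as pd

# Hypothetical example data: one row per inspection, with a binary outcome.
df = pd.DataFrame({
    "inspection_date": pd.to_datetime(
        ["2019-03-01", "2019-07-15", "2020-02-10",
         "2020-06-30", "2021-01-05", "2021-09-20"]),
    "violation": [1, 0, 1, 1, 0, 0],
})

# Overall label prevalence.
print(df["violation"].mean())

# Label prevalence by year -- a quick check for drift over time.
by_year = df.groupby(df["inspection_date"].dt.year)["violation"].mean()
print(by_year)
```

If prevalence shifts sharply between years, that's worth taking back to domain experts: it may reflect a policy change, a data collection change, or a real change in the underlying process.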
Here are some things you want to do during data exploration:
- distributions of different variables - histograms, boxplots
- distinct values for each categorical variable
- correlations between variables - compute a correlation matrix and render it as a heatmap
- changes and trends over time - how do the data and the entities in it change over time? Plot distributions over time.
- missing values - are there lots of missing values? is there any pattern to where they occur?
- outliers - clustering can surface these, as can simply plotting distributions
- cross-tabs (if you're looking at multiple classes/labels) - describe how the positive and negative classes differ without doing any machine learning
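Several of the checks above are one-liners in pandas. A minimal sketch on a hypothetical dataset (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric feature, a categorical field,
# some missing values, and a binary label.
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 48, 52, np.nan, 39, 44],
    "region": ["N", "S", "N", "E", "E", "S", "N", "E"],
    "label":  [0, 0, 1, 1, 1, 0, 0, 1],
})

# Missing values: count per column (look for patterns, not just totals).
print(df.isna().sum())

# Distinct values of a categorical variable.
print(df["region"].value_counts())

# Correlation matrix of numeric columns (pass this to seaborn's
# heatmap for the visual version).
print(df[["age", "label"]].corr())

# Cross-tab: how the classes differ on a categorical field,
# before any machine learning.
print(pd.crosstab(df["region"], df["label"]))
```

For distributions, `df["age"].hist()` and `df.boxplot(column="age", by="label")` give the matching plots.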
Tools we typically use for this:
- SQL (directly and through Python - psycopg2)
- Python (matplotlib, seaborn, altair, ...)
- pandas (if you have to)
- Tableau (use an SSH tunnel)
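The connect/execute/fetch pattern for running SQL from Python looks the same across database drivers. A minimal sketch using the standard library's sqlite3 so it's self-contained; with a real Postgres database you'd swap in `psycopg2.connect(...)` and keep the rest (table and column names here are made up):

```python
import sqlite3

# In-memory database standing in for your project's Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (entity_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "arrest"), (1, "citation"), (2, "arrest")],
)

# Parameterized queries -- never interpolate values into SQL strings.
rows = conn.execute(
    "SELECT kind, COUNT(*) FROM events GROUP BY kind ORDER BY kind"
).fetchall()
print(rows)
```

`pandas.read_sql_query(query, conn)` gives you the same result as a DataFrame, which is convenient for plotting.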
- Write a SQL query that takes a "person id" (e.g., student, voter, facility, or other entity) and gives you everything you know about that person across all the tables you have.
- Add a date parameter to it so it gives you everything about that entity up to that date.
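The two-step exercise above might look like the following. This is a sketch with hypothetical tables in an in-memory SQLite database; the SQL itself would run unchanged through psycopg2 against Postgres (with `%s` placeholders instead of `?`):

```python
import sqlite3

# Hypothetical tables about one kind of entity (e.g., a facility).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inspections (entity_id INTEGER, event_date TEXT, result TEXT);
    CREATE TABLE complaints  (entity_id INTEGER, event_date TEXT, details TEXT);
    INSERT INTO inspections VALUES (7, '2019-04-01', 'pass'), (7, '2021-06-01', 'fail');
    INSERT INTO complaints  VALUES (7, '2020-02-15', 'noise'), (8, '2020-03-01', 'odor');
""")

# UNION ALL across tables, filtered to one entity and to events up to
# an as-of date -- both passed as query parameters.
query = """
    SELECT 'inspection' AS source, event_date, result AS info
      FROM inspections WHERE entity_id = ? AND event_date <= ?
    UNION ALL
    SELECT 'complaint', event_date, details
      FROM complaints  WHERE entity_id = ? AND event_date <= ?
    ORDER BY event_date
"""
as_of = "2020-12-31"
rows = conn.execute(query, (7, as_of, 7, as_of)).fetchall()
print(rows)
```

The date parameter matters later for modeling: it lets you reconstruct exactly what was knowable about an entity at prediction time, which is the basis for leakage-free feature generation.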