Go back to STAT545A page

Ideas for datasets

UNDER DEVELOPMENT!

As STAT 545A goes on, students will transition from analyzing and graphing the Gapminder data ad nauseam to something more exciting. Honoring the course's past "find your own dataset" final project, let's identify a few datasets to work with. We will collect ideas here and then designate certain datasets as official STAT 545A datasets based on instructor assessment of suitability and student interest.

Thoughts on what makes a good dataset

  • at least two quantitative variables (e.g., GDP per capita and life expectancy in the Gapminder data)
  • at least one categorical variable (e.g. country and continent in Gapminder data); bear in mind that it can be rather easy to generate categorical variables yourself (e.g. one could populate new factors for Gapminder by adding political system, dominant religion, etc)
  • reasonable number of observations ... minimum on the order of 100; really massive datasets aren't great either but there are many ways to deal with that (random or directed subsetting) so less problematic
  • the main storylines are NOT heavily map-based; we just don't get into that deeply enough in the course
  • not super-specialized (e.g. your thesis data), abstract (e.g. data you created in some simulation study), or boring (e.g. grain production for various counties during the 1970s)
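
The countable criteria above (two or more quantitative variables, at least one categorical variable, roughly 100 or more observations) can be screened mechanically. What follows is only a rough sketch, assuming the dataset has already been read into a list of dicts, one per row. The function names `screen_dataset` and `random_subset`, the `min_rows` default, and the toy rows are all invented for illustration; the "is it map-based, too specialized, or boring?" questions still need a human.

```python
import random
from numbers import Number


def screen_dataset(rows, min_rows=100):
    """Screen rows (a list of dicts) against the countable criteria.

    Hypothetical helper: a column counts as quantitative if every
    non-missing value is a number, and as categorical if it is
    non-numeric and at least one value repeats.
    """
    columns = list(rows[0].keys()) if rows else []
    quantitative, categorical = [], []
    for col in columns:
        values = [row[col] for row in rows if row.get(col) is not None]
        if not values:
            continue  # entirely missing column: counts as neither kind
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
            quantitative.append(col)
        elif len(set(values)) < len(values):
            categorical.append(col)
    return {
        "quantitative": quantitative,
        "categorical": categorical,
        "n_rows": len(rows),
        "suitable": (
            len(quantitative) >= 2
            and len(categorical) >= 1
            and len(rows) >= min_rows
        ),
    }


def random_subset(rows, k, seed=545):
    """Random subsetting for datasets that are too big; seeded so it repeats."""
    if k >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, k)


# Toy rows with Gapminder-like column names; the values are made up.
toy_rows = [
    {"continent": ["Asia", "Europe"][i % 2], "gdpPercap": 1000.0 + i, "lifeExp": 50 + i % 30}
    for i in range(120)
]
print(screen_dataset(toy_rows))
print(len(random_subset(toy_rows, 30)))
```

The rule used here for "categorical" is deliberately crude; a column of free-text comments with a single repeated value would pass it, so treat the result as a first filter rather than a verdict.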

US Childhood Morbidity and Mortality. People, especially the older generation, sometimes like to mock today's parents for bubble-wrapping their kids. But the fact is that accidental child death is actually less common now than it was when today's parents and grandparents were little. It would be interesting to look into this over time, for different causes of death. Also, would the actual risks of different causes of death surprise people, given what the media tries to freak us out about (motor vehicle accidents vs school shootings vs stranger abduction vs death from a preventable infectious disease)? This slim little PDF is interesting and might lead to some bigger, better datasets. Overview of Childhood Injury Morbidity and Mortality in the U.S. Fact Sheet.

The Consuming Instinct: What Juicy Burgers, Ferraris, Pornography, and Gift Giving Reveal About Human Nature by Gad Saad. UBC library has this book. JB had it checked out for a long while meaning to look for interesting studies and datasets but never did.

The effect of hotness on pay and productivity

Protect Children Not Guns 2013 report from the Children's Defense Fund. Might lead to interesting datasets. Report itself is certainly heavy on figures and tables. Would allow comparisons of gun violence across different countries.

Mother Jones article Science Confirms: Politics Wrecks Your Ability to Do Math. Probably too small and narrow. But maybe not. Would be nice to give those figures a makeover.

Reading the NYT story Why Are There Still So Few Women in Science? on 2013-10-03 pointed me to the papers below. Get the data used in these papers. Reanalyze, regraph.

  • "Cross-Cultural Analysis of Students with Exceptional Talent in Mathematical Problem Solving" by Titu Andreescu, Joseph A. Gallian, Jonathan M. Kane, and Janet E. Mertz in AMS Notices. http://www.ams.org/notices/200810/tx081001248p.pdf

  • "Debunking Myths about Gender and Mathematics Performance" by Jonathan M. Kane and Janet E. Mertz in AMS Notices. http://www.ams.org/notices/201201/rtx120100010p.pdf

  • Ellison, Glenn, and Ashley Swanson. “The Gender Gap in Secondary School Mathematics at High Achievement Levels: Evidence from the American Mathematics Competitions.” Journal of Economic Perspectives 24.2 (2010): 109–128. Web. 1 June 2012. http://hdl.handle.net/1721.1/71007

JB should really list previous student projects, at the very least for inspiration and maybe for actual links to datasets.

Submissions by 2013 STAT 545A students

This web page has a dataset similar to the Gapminder data, but with more columns. It also has a potentially interesting, though maybe not data-intensive enough, dataset about Olympic medals won by athletes. -- STAT 545A student

This work is licensed under the CC BY-NC 3.0 Creative Commons License.