Data Science For Process Analysts
Data Analysis for Process Analysts team
Sajad Ghashami
4/12/2021
What is data science?
Goal
- Inference:
Use the model to learn about the data generation process.
- Prediction:
Use the model to predict the outcomes for new data points.
Steps
Process Analyst position
Visualization
Chart Types
Example
Gapminder dataset
- Country: Name of country.
- Continent: Which of the five continents the country is part of. Note that “Americas” includes countries in both North and South America, and that Antarctica is excluded.
- Life Expectancy: Life expectancy in years.
- Population: Number of people living in the country.
- GDP per Capita: Gross domestic product per person (in US dollars).
Count continent frequency
Summary
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan: 12 | Africa: 624 | Min.: 1952 | Min.: 23.60 | Min.: 6.001e+04 | Min.: 241.2 |
| Albania: 12 | Americas: 300 | 1st Qu.: 1966 | 1st Qu.: 48.20 | 1st Qu.: 2.794e+06 | 1st Qu.: 1202.1 |
| Algeria: 12 | Asia: 396 | Median: 1980 | Median: 60.71 | Median: 7.024e+06 | Median: 3531.8 |
| Angola: 12 | Europe: 360 | Mean: 1980 | Mean: 59.47 | Mean: 2.960e+07 | Mean: 7215.3 |
| Argentina: 12 | Oceania: 24 | 3rd Qu.: 1993 | 3rd Qu.: 70.85 | 3rd Qu.: 1.959e+07 | 3rd Qu.: 9325.5 |
| Australia: 12 | | Max.: 2007 | Max.: 82.60 | Max.: 1.319e+09 | Max.: 113523.1 |
| (Other): 1632 | | | | | |
Freq Table
| continent | count | percent |
|---|---|---|
| Africa | 52 | 37% |
| Americas | 25 | 18% |
| Asia | 33 | 23% |
| Europe | 30 | 21% |
| Oceania | 2 | 1% |
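The percent column is each continent's count divided by the total number of countries (142). A minimal Python sketch of that calculation, assuming the counts are simply copied from the table above; `freq_table` is a hypothetical helper name and not part of the original analysis:

```python
# Country counts per continent, copied from the frequency table above.
counts = {"Africa": 52, "Americas": 25, "Asia": 33, "Europe": 30, "Oceania": 2}


def freq_table(counts):
    """Return (category, count, percent) rows; percent is rounded to a whole number."""
    total = sum(counts.values())
    return [(name, n, round(100 * n / total)) for name, n in counts.items()]


for name, n, pct in freq_table(counts):
    print(f"{name:<10}{n:>4}{pct:>5}%")
```

Because each percent is rounded separately, the column does not have to add up to exactly 100.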
Bar Chart
Pros & cons of bar chart
Pros
- Simple and Quick
Cons
- Not good for many categories
- One dimensional
- Can be misleading (a more advanced topic)
Population (in millions)
Pop Table
| continent | average population (millions) |
|---|---|
| Africa | 10 |
| Americas | 25 |
| Asia | 77 |
| Europe | 17 |
| Oceania | 9 |
Bar Chart (Mean)
Advantages and Problems of bar chart
Pros:
- Again, simple and quick
Cons:
- Outliers: China and India pull up the mean for Asia, but they cannot be seen in the chart
- You cannot see how the individual countries are distributed within each group
- One dimensional
- Can be misleading (a more advanced topic)
- It does not show population trends within the continents
Mean vs Median
Suppose the durations (in hours) for completing a task in a process are 1, 2, 3, 4, 5, 6, 7, 8 and 100. The mean is then about 15.1, but the median is 5.
Which one is the better answer?
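The two numbers can be checked with Python's standard `statistics` module. This is a small sketch using the durations listed above; dropping the 100-hour task is only an illustration of how strongly a single outlier moves the mean:

```python
from statistics import mean, median

# Task durations in hours, from the example above.
durations = [1, 2, 3, 4, 5, 6, 7, 8, 100]

print(round(mean(durations), 1))  # 136 / 9, about 15.1
print(median(durations))          # middle value of the sorted list: 5

# Same data without the single 100-hour task.
without_outlier = [d for d in durations if d != 100]

print(mean(without_outlier))      # 36 / 8 = 4.5
print(median(without_outlier))    # (4 + 5) / 2 = 4.5
```

One extreme value moves the mean from 4.5 to about 15.1, while the median only moves from 4.5 to 5.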
Handling outliers (with the median)
Boxplot description
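A boxplot draws a box from the first to the third quartile with a line at the median, and marks points far from the box as outliers. The sketch below computes these numbers for the task durations from the mean vs median example. It assumes the common 1.5 × IQR rule for flagging outliers and the "inclusive" quartile method; the original slides may have used a different convention, and `box_stats` is a hypothetical helper name.

```python
from statistics import quantiles

# Task durations in hours, from the mean vs median example.
durations = [1, 2, 3, 4, 5, 6, 7, 8, 100]


def box_stats(values):
    """Quartiles, IQR fences, whisker ends and outliers (assumed 1.5 * IQR rule)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),  # most extreme non-outlier points
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


print(box_stats(durations))
```

With these durations the box runs from 3 to 7, the median is 5, and only the 100-hour task falls outside the fences.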
Boxplot of population
Remove China and India as outliers.
Pros and cons of boxplot
Pros:
- Good to show how data is distributed
Cons:
- Bad to compare groups
Ridgeline plot
It still shows only one variable and cannot show the trend.