Data Science For Process Analysts

Data Analysis for Process Analysts team

What is data science?

Goal

  • Inference: Use the model to learn about the data generation process.
  • Prediction: Use the model to predict the outcomes for new data points.
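The two goals can be illustrated with the same fitted model. The sketch below is only an illustration: the data are hypothetical toy values (not from the Gapminder dataset), and `fit_line` is a helper name introduced here, implementing ordinary least squares for a straight line.

```python
# Hypothetical toy data, chosen so that y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# Inference: read the fitted parameters to learn how y depends on x.
print(f"Each extra unit of x adds about {slope:.1f} to y")

# Prediction: apply the fitted model to a new, unseen data point.
new_x = 6.0
predicted_y = slope * new_x + intercept
print(f"Predicted y at x = {new_x}: {predicted_y:.1f}")
```

Inference looks at the fitted parameters themselves, while prediction only cares about the output for new inputs.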

Steps

Process Analyst position

Visualization

Chart Types

Example

Gapminder dataset

  • Country: Name of country.
  • Continent: Which of the five continents the country is part of. Note that “Americas” includes countries in both North and South America, and that Antarctica is excluded.
  • Life Expectancy: Life expectancy in years.
  • Population: Number of people living in the country.
  • GDP per Capita: Gross domestic product per person (in US dollars).

Count continent frequency

Summary

country            continent      year           lifeExp         pop                 gdpPercap
Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60   Min.   :6.001e+04   Min.   :   241.2
Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20   1st Qu.:2.794e+06   1st Qu.:  1202.1
Algeria    :  12   Asia    :396   Median :1980   Median :60.71   Median :7.024e+06   Median :  3531.8
Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47   Mean   :2.960e+07   Mean   :  7215.3
Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85   3rd Qu.:1.959e+07   3rd Qu.:  9325.5
Australia  :  12                  Max.   :2007   Max.   :82.60   Max.   :1.319e+09   Max.   :113523.1
(Other)    :1632

Freq Table

continent   count   percent
Africa         52       37%
Americas       25       18%
Asia           33       23%
Europe         30       21%
Oceania         2        1%
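The frequency table can be rebuilt from the summary output. A minimal sketch, assuming every country contributes exactly 12 rows (as the country column of the summary suggests); the row counts per continent are copied from the summary, and the variable names are illustrative.

```python
# Rows per continent, copied from the continent column of the summary.
rows_per_continent = {
    "Africa": 624,
    "Americas": 300,
    "Asia": 396,
    "Europe": 360,
    "Oceania": 24,
}

# Assumption: each country appears in 12 rows (one per recorded year).
ROWS_PER_COUNTRY = 12

# Number of distinct countries in each continent.
countries = {
    continent: rows // ROWS_PER_COUNTRY
    for continent, rows in rows_per_continent.items()
}
total = sum(countries.values())

# Share of all countries, rounded to a whole percent.
percent = {
    continent: round(100 * count / total)
    for continent, count in countries.items()
}

for continent, count in countries.items():
    print(f"{continent:<10}{count:>5}{percent[continent]:>7}%")
```

Because the percentages are rounded independently, they need not sum to exactly 100.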

Bar Chart

Pros & cons of bar chart

Pros

  • Simple and quick

Cons

  • Not good for many categories
  • One dimensional
  • Can be completely misleading (a more advanced topic)

Average population per country (in millions)

Pop Table

continent   average.population (millions)
Africa      10
Americas    25
Asia        77
Europe      17
Oceania      9

Bar Chart(Mean)

Advantages and Problems of bar chart

Pros:

  • Again, simple and quick

Cons:

  • Outliers: China and India pull up the average for Asia, but they cannot be seen in the chart
  • You cannot see how individual countries are distributed once they are grouped
  • One dimensional
  • Can be completely misleading (a more advanced topic)
  • It does not show population trends within each continent

Mean vs Median

Suppose the durations for completing a task in a process (in hours) are 1, 2, 3, 4, 5, 6, 7, 8, and 100. The mean is about 15.1, but the median is 5.

Which one is a better answer?
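The two figures can be checked with Python's standard `statistics` module. This is a small sketch using the nine durations from the example; treating 100 as the single extreme value to drop for comparison is an assumption made for illustration.

```python
from statistics import mean, median

# Task durations in hours, taken from the example above.
durations_hours = [1, 2, 3, 4, 5, 6, 7, 8, 100]

mean_all = mean(durations_hours)      # 136 / 9
median_all = median(durations_hours)  # middle value of the sorted list

print(f"mean   = {mean_all:.1f}")   # mean   = 15.1
print(f"median = {median_all}")     # median = 5

# Drop the single extreme value to see how much each statistic moves.
typical = [d for d in durations_hours if d != 100]
mean_typical = mean(typical)
median_typical = median(typical)

print(f"mean without 100   = {mean_typical}")    # 4.5
print(f"median without 100 = {median_typical}")  # 4.5
```

Removing one value moves the mean from about 15.1 to 4.5, while the median only moves from 5 to 4.5, so the median describes a typical task far better here.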

Solve outlier (with median)

Boxplot description

Boxplot of population

Remove China and India as outliers.
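A boxplot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to the task durations from the mean-versus-median example rather than to the population data. The choice of the "inclusive" quartile method is an assumption; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

# Task durations in hours, reused from the mean-versus-median example.
durations_hours = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Quartiles: the edges of the box and the line inside it.
q1, q2, q3 = quantiles(durations_hours, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the conventional 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers.
outliers = [d for d in durations_hours if d < lower_fence or d > upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")      # Q1 = 3.0, median = 5.0, Q3 = 7.0
print(f"fences = [{lower_fence}, {upper_fence}]")  # fences = [-3.0, 13.0]
print(f"outliers = {outliers}")                    # outliers = [100]
```

The same rule applied to population per continent is what would single out very large countries as separate points.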

Pros and cons boxplot

Pros:

  • Good for showing how the data is distributed

Cons:

  • Bad for comparing groups

Ridgeline plot

It still shows only one variable and cannot show the trend.

Solve the trend (No good)

Solve the trend (Good)

Process Mining
