Data Science For Process Analysts

Data Analysis for Process Analysts team

What is data science?

Goal

  • Inference: use the model to learn about the data-generating process.
  • Prediction: use the model to predict the outcomes for new data points.
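To make the two goals concrete, here is a minimal sketch that fits a straight line by least squares and then uses it both ways. The data (hours of training versus tasks completed per day) and the names fit_line and predict are hypothetical and chosen only for illustration.

```python
# Hypothetical data: hours of training (x) and tasks completed per day (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)


def predict(x):
    """Apply the fitted line to a new data point."""
    return intercept + slope * x


# Inference: the slope describes how y changes with x in the observed data.
print(f"estimated slope: {slope:.2f}")

# Prediction: use the same model for an x value that was not observed.
print(f"predicted y at x = 6: {predict(6.0):.2f}")
```

The same fitted model serves both purposes: reading the slope is inference, calling predict on an unseen value is prediction.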

Steps


Process Analyst position

Visualization

Chart Types


Example

Gapminder dataset

  • Country: Name of country.
  • Continent: Which of the five continents the country is part of. Note that “Americas” includes countries in both North and South America and that Antarctica is excluded.
  • Life Expectancy: Life expectancy in years.
  • Population: Number of people living in the country.
  • GDP per Capita: Gross domestic product per person (in US dollars).

Count continent frequency

Summary

        country        continent        year         lifeExp           pop              gdpPercap
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60   Min.   :6.001e+04   Min.   :   241.2
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20   1st Qu.:2.794e+06   1st Qu.:  1202.1
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71   Median :7.024e+06   Median :  3531.8
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47   Mean   :2.960e+07   Mean   :  7215.3
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85   3rd Qu.:1.959e+07   3rd Qu.:  9325.5
 Australia  :  12                  Max.   :2007   Max.   :82.60   Max.   :1.319e+09   Max.   :113523.1
 (Other)    :1632

Freq Table

continent   count   percent
Africa         52       37%
Americas       25       18%
Asia           33       23%
Europe         30       21%
Oceania         2        1%
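The frequency table can be reproduced with a few lines of code. This is a minimal sketch that starts from the country counts shown in the table (not from the raw dataset) and derives the percentages; the variable names are illustrative.

```python
from collections import Counter

# Country counts per continent, taken from the frequency table above.
countries_per_continent = {
    "Africa": 52,
    "Americas": 25,
    "Asia": 33,
    "Europe": 30,
    "Oceania": 2,
}

# Expand to one continent label per country, as the raw data would have it.
labels = [c for c, n in countries_per_continent.items() for _ in range(n)]

# Count the labels and convert each count to a rounded percentage.
counts = Counter(labels)
total = sum(counts.values())
freq_table = {c: (n, round(100 * n / total)) for c, n in counts.items()}

for continent, (count, percent) in sorted(freq_table.items()):
    print(f"{continent:<10}{count:>6}{percent:>8}%")
```

Counting labels and dividing by the total is all a frequency table does, which is also what the bar chart of counts displays.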

Bar Chart

Pros and cons of bar chart

Pros

  • Simple and quick

Cons

  • Not good for many categories
  • One-dimensional
  • Can be misleading (a more advanced topic)

Population (in millions)

Pop Table

continent   average.population
Africa                      10
Americas                    25
Asia                        77
Europe                      17
Oceania                      9

Bar Chart (Mean)

Pros and cons of bar chart (mean)

Pros

  • Again, simple and quick

Cons

  • Outliers: China and India pull up the average for Asia, but they cannot be seen in the chart
  • You cannot see how the individual countries are distributed within each group
  • One-dimensional
  • Can be misleading (a more advanced topic)
  • It does not show the population trend in each continent
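The outlier problem in the list above can be illustrated with a small sketch. The populations below (in millions) are hypothetical values invented for illustration, not Gapminder data; group "B" contains one very large value.

```python
from statistics import mean, median

# Hypothetical populations in millions for two made-up groups of countries.
# Group "B" includes a single very large country.
populations = {
    "A": [5, 8, 10, 12, 15],
    "B": [4, 9, 11, 20, 1300],
}

# Compute (mean, median) per group.
summary = {
    group: (mean(values), median(values))
    for group, values in populations.items()
}

for group, (avg, mid) in summary.items():
    print(f"group {group}: mean = {avg:.1f}, median = {mid}")
```

A bar chart of the means would show group B far above group A, even though four of its five countries are of similar size to those in group A. The chart alone gives no hint that one value drives the result.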

Mean vs Median


Suppose the durations for completing a task in a process (in hours) are 1, 2, 3, 4, 5, 6, 7, 8, 100. The mean is about 15.1, but the median is 5.

Which one is the better answer?
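The numbers in this example can be checked directly. The sketch below uses the task durations from the text and also shows what happens when the single extreme value is dropped; the variable names are illustrative.

```python
from statistics import mean, median

# Task durations in hours, from the example above.
durations_hours = [1, 2, 3, 4, 5, 6, 7, 8, 100]

mean_hours = mean(durations_hours)
median_hours = median(durations_hours)
print(f"mean = {mean_hours:.1f} h, median = {median_hours} h")

# Drop the single extreme value (100 hours) and compare again.
without_extreme = durations_hours[:-1]
mean_without = mean(without_extreme)
median_without = median(without_extreme)
print(f"without 100: mean = {mean_without} h, median = {median_without} h")
```

Removing one value moves the mean from about 15.1 to 4.5 hours, while the median only moves from 5 to 4.5. That stability is why the median is usually the safer summary when a few extreme values are present.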

Handling outliers (with the median)

Boxplot description

Boxplot of population

Remove China and India as outliers.
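One common convention for deciding which points a boxplot draws as outliers is the 1.5 × IQR rule. The source does not say which rule was used for the population plot, so the following is only a sketch of that convention, applied to the task durations from the mean vs median example rather than to the population data.

```python
from statistics import quantiles

# Task durations in hours, reused from the mean vs median example.
durations_hours = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Quartiles; statistics.quantiles uses the "exclusive" method by default,
# so other tools may report slightly different quartile values.
q1, q2, q3 = quantiles(durations_hours, n=4)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule (an assumed convention, see above).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [d for d in durations_hours if d < lower_fence or d > upper_fence]
kept = [d for d in durations_hours if lower_fence <= d <= upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = [{lower_fence}, {upper_fence}], outliers = {outliers}")
```

Points outside the fences are the ones a boxplot typically draws as individual dots. Whether to remove them, as done here for China and India, is a judgment call that depends on the question being asked.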

Pros and cons of boxplot

Pros

  • Good for showing how the data is distributed

Cons

  • Bad for comparing groups

Ridgeline plot

It still shows only one variable and cannot show the trend.

Solve the trend (No good)

Solve the trend (Good)

Process Mining
