"Why kaggle datasets are not reliable"

A story about data analysis

Final project after week of DATA 101 course

Hello Everyone ! excited to write my first story haha.

Down below we have small dataset from kaggle. AI impact on jobs by 2030

Goal of this story was to demonstrate reliability risks using internal consistency check with distribution diagnosis.

First of all : There is no documented data source ( prediction model ) or sampling frame. No uncertainty estimates , hence external validity can't be established.

Evidence 1
Histogram

From the data set low risk jobs with >70% automation probability = 0. P.s hopefully it turns out to be wrong prediction.

Evidence 2
Distribution is too clean

Bar chart

typically real datasets show heavier skew , more outliers and noise ( since it's real data )

Evidence 3
Warning: This dataset contains synthetic salaries. Real disappointment may vary
Disclaimer
Box plot

Phd entries earning less than 60k usd per year =203 rows.

To conclude : Kaggle ≠ Ground Truth

This dataset looks realistic, but:

  • variables contradict each other

  • distributions lack real-world noise

  • labels are not derivable from underlying data