"Why Kaggle datasets are not reliable"

A story about data analysis

Final project after a week of the DATA 101 course

Hello everyone! Excited to write my first story, haha.

Below we have a small dataset from Kaggle: AI impact on jobs by 2030.

The goal of this story is to demonstrate reliability risks using internal consistency checks and distribution diagnostics.

First of all: there is no documented data source (prediction model) or sampling frame, and no uncertainty estimates, hence external validity can't be established.

Evidence 1

In the dataset, the number of low-risk jobs with an automation probability above 70% is 0. P.S. Hopefully it turns out to be a wrong prediction.
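A check like this takes only a few lines to reproduce. Below is a minimal sketch using plain Python, not the exact code behind this story. The column names `risk_category` and `automation_probability`, and the sample rows, are assumptions made up for illustration; swap in the real column names from the file.

```python
def count_low_risk_high_automation(rows, threshold=0.70):
    """Count rows labelled low risk whose automation probability is above the threshold."""
    return sum(
        1
        for row in rows
        # Column names are hypothetical; adjust them to the actual dataset.
        if row["risk_category"].strip().lower() == "low"
        and float(row["automation_probability"]) > threshold
    )


# Made-up rows, only to show the shape of the check (not real data).
sample_rows = [
    {"risk_category": "Low", "automation_probability": "0.12"},
    {"risk_category": "High", "automation_probability": "0.91"},
    {"risk_category": "Low", "automation_probability": "0.85"},
]

print(count_low_risk_high_automation(sample_rows))
```

On the real file the rows could come from `csv.DictReader`, which yields dictionaries of the same shape.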

Evidence 2

The distribution is too clean.

Real datasets typically show heavier skew, more outliers, and more noise (since it's real data).
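One rough way to put numbers on "too clean" is to measure skewness and the share of outliers in a numeric column. The sketch below is an illustration under my own assumptions, not the method used for the charts in this story: it uses the third standardized moment for skewness and the common 1.5 × IQR rule for outliers, and the helper names and sample values are made up.

```python
import statistics


def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # a constant column has no skew
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def iqr_outlier_share(values):
    """Fraction of values outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return sum(1 for v in values if v < low or v > high) / len(values)


# Made-up numbers: one evenly spread column and one with a long right tail.
tidy = [1, 2, 3, 4, 5]
long_tail = [1, 1, 1, 1, 2, 2, 3, 50]

print(skewness(tidy), iqr_outlier_share(tidy))
print(skewness(long_tail), iqr_outlier_share(long_tail))
```

A salary column with skewness near zero and almost no outliers is not proof of synthetic data, but it is a reason to look closer, because real salary data usually has a long right tail.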

Evidence 3

Warning: This dataset contains synthetic salaries. Real disappointment may vary.

Disclaimer

PhD entries earning less than 60k USD per year: 203 rows.

To conclude: Kaggle ≠ Ground Truth
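A count like this can be reproduced straight from the CSV. Here is a minimal sketch with Python's built-in `csv` module. The column names `education_level` and `average_salary` and the tiny inline sample are assumptions for illustration only, so the number it prints has nothing to do with the 203 rows above.

```python
import csv
import io


def count_low_paid_phds(csv_text, salary_cap=60_000):
    """Count PhD rows whose yearly salary is below salary_cap (in USD)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(
        1
        for row in reader
        # Column names are hypothetical; adjust them to the actual dataset.
        if row["education_level"].strip().lower() == "phd"
        and float(row["average_salary"]) < salary_cap
    )


# Made-up sample, not the Kaggle file.
sample_csv = """education_level,average_salary
PhD,45000
PhD,120000
Bachelor's,52000
"""

print(count_low_paid_phds(sample_csv))
```

For the real file, read its contents with `open(path, newline="").read()` and pass the text in.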

This dataset looks realistic, but:

variables contradict each other

distributions lack real-world noise

labels are not derivable from underlying data


Lapis is created by Kontinentalist, an award-winning data storytelling studio based in Singapore.
