About the role
Your manager is impressed with your progress but points out that the data is messy. Before we can analyze it effectively, we need to clean and structure the data properly.
Your task is to:
Handle missing values
Remove duplicate or inconsistent data
Standardize the data format
Let's get started!
Task 1: Identify Issues in the Data
Your manager provides you with an example dataset in which some records are incomplete or incorrect:
{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"}
    ]
}
Problems:
User ID 3 has an empty name.
User ID 4 has a duplicate friend entry.
User ID 5 has no friends and no liked pages (inactive user).
The pages list contains a duplicate page ID: 104 appears twice, with two different names ("Web Dev Hub" and "Web Development").
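The problems listed above can also be found programmatically rather than by eye. The following is a minimal sketch in Python, assuming the dataset has already been loaded into a dictionary named data; the function name find_issues is hypothetical and chosen only for illustration.

```python
from collections import Counter

# The example dataset from this task, loaded as a Python dictionary.
data = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}


def find_issues(data):
    """Return a list of human-readable descriptions of data problems."""
    issues = []
    for user in data["users"]:
        # A name that is empty or only whitespace counts as missing.
        if not user["name"].strip():
            issues.append(f"User ID {user['id']} has an empty name.")
        # A set drops repeats, so a shorter set means duplicates exist.
        if len(set(user["friends"])) != len(user["friends"]):
            issues.append(f"User ID {user['id']} has a duplicate friend entry.")
        # Inactive: no friends and no liked pages.
        if not user["friends"] and not user["liked_pages"]:
            issues.append(f"User ID {user['id']} is inactive.")

    # Count how often each page ID occurs in the pages list.
    page_counts = Counter(page["id"] for page in data["pages"])
    for page_id, count in page_counts.items():
        if count > 1:
            issues.append(f"Page ID {page_id} appears {count} times.")
    return issues


for issue in find_issues(data):
    print(issue)
```

This sketch only reports problems and does not modify the data, which makes it easy to review what will be affected before any cleaning is applied.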
Task 2: Clean the Data
We will:
Remove users with missing names.
Remove duplicate friend entries.
Remove inactive users (users with no friends and no liked pages).
Deduplicate pages based on IDs.
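The four cleaning steps above could be sketched in Python roughly as follows. This is one possible approach, not a definitive implementation: the function name clean_data is hypothetical, and the choice to keep the first page seen for each duplicated ID is an assumption, since the task does not say which copy should survive.

```python
import copy

# The example dataset from Task 1, loaded as a Python dictionary.
data = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}


def clean_data(data):
    """Return a cleaned copy of the dataset; the input is left unchanged."""
    cleaned_users = []
    for user in copy.deepcopy(data["users"]):
        # Step 1: skip users whose name is empty or only whitespace.
        if not user["name"].strip():
            continue
        # Step 2: drop duplicate friend IDs while keeping their order.
        user["friends"] = list(dict.fromkeys(user["friends"]))
        # Step 3: skip inactive users (no friends and no liked pages).
        if not user["friends"] and not user["liked_pages"]:
            continue
        cleaned_users.append(user)

    # Step 4: keep one page per ID.
    # Assumption: the first occurrence of each ID is the one to keep.
    unique_pages = {}
    for page in data["pages"]:
        unique_pages.setdefault(page["id"], dict(page))

    return {"users": cleaned_users, "pages": list(unique_pages.values())}


cleaned = clean_data(data)
print(cleaned)
```

With the example dataset, users 3 and 5 are removed, user 4's friends list becomes [2], and only one page with ID 104 remains. Note that this sketch does not touch references to removed users, so user 1 still lists 3 as a friend; whether such references should also be removed is not covered by the steps above.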