Lin Hsin Hsin Artificial intelligence Center
AI Data Laundromat Categorization
by Lin Hsin Hsin
🧼🛁🚿🧹 🧽🧻🧺
Missing Data
Identifying & addressing missing values
📍 Imputation -- eg mean, median, or regression-based
📍 Deletion
📍 Flagging
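The three options above can be sketched on a small list. This is a minimal illustration using only the Python standard library; the sample values and the use of None as the missing-value marker are assumptions made for the example, and regression-based imputation is not shown.

```python
from statistics import mean, median

# Hypothetical sample: None marks a missing reading.
ages = [25, None, 31, 40, None, 28]

# Values that are actually present.
observed = [a for a in ages if a is not None]

# Imputation: fill each gap with the mean (or median) of the observed values.
mean_imputed = [a if a is not None else mean(observed) for a in ages]
median_imputed = [a if a is not None else median(observed) for a in ages]

# Deletion: drop the missing entries altogether.
deleted = list(observed)

# Flagging: keep a parallel marker recording where the gaps were.
missing_flags = [a is None for a in ages]
```

Which option is appropriate depends on how much data is missing and why; deletion is simplest but discards information, while flagging preserves the fact that a value was absent.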
Duplicates
Detecting & eliminating duplicate records or entries to avoid redundancy & bias
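A minimal sketch of exact-duplicate removal, assuming each record is a hashable tuple and that the first occurrence is the one to keep. The sample records and the function name drop_duplicates are hypothetical; near-duplicates (eg differing only in spelling) would need fuzzy matching, which is not shown here.

```python
# Hypothetical records: (email, name).
records = [
    ("alice@example.com", "Alice"),
    ("bob@example.com", "Bob"),
    ("alice@example.com", "Alice"),  # exact repeat of the first record
]


def drop_duplicates(rows):
    """Return rows with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


unique_records = drop_duplicates(records)
```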
Outlier Detection
Identifying & handling data points that deviate significantly from the rest of the data, to prevent skewing an analysis
Note that these data points may result from errors or may represent rare but valid observations.
This is performed using statistical methods such as the interquartile range or adjusted boxplots
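The interquartile range approach can be sketched as follows. This assumes the common 1.5 x IQR fences and the default quartile method of Python's statistics.quantiles; other quartile definitions can give slightly different fences. It is the plain IQR rule, not the adjusted boxplot, and the sample data are hypothetical.

```python
from statistics import quantiles

# Hypothetical readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


outliers = iqr_outliers(readings)
```

A flagged value is only a candidate: whether to remove, cap, or keep it depends on whether it is an error or a rare but valid observation.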
Inconsistencies
Correcting inconsistencies in data
📍 Mismatched categories
📍 Typos
📍 Conflicting entries
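One possible way to handle the cases above, sketched with the standard library. The canonical category list, the 0.8 similarity cutoff, and the function names are assumptions for illustration; real data usually needs a domain-specific list and a human review of the suggested corrections.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical list of allowed category labels.
CANONICAL = ["Singapore", "Malaysia", "Indonesia"]


def fix_category(value, cutoff=0.8):
    """Map a mistyped or differently cased label to its closest canonical label."""
    candidate = value.strip().title()
    match = get_close_matches(candidate, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else value  # leave unknown labels untouched


def find_conflicts(rows):
    """Return keys that appear with more than one distinct value."""
    values_by_key = defaultdict(set)
    for key, value in rows:
        values_by_key[key].add(value)
    return {k: v for k, v in values_by_key.items() if len(v) > 1}
```

For example, fix_category("Singapor") resolves to "Singapore", while find_conflicts reports any record id that is listed under two different countries.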
Standardization
Converting data to a common format
📍 dates -- eg 01/01/2000 vs 01.01.2000
📍 units
📍 categories -- eg Yes, yes, or Y
📍 text case
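A minimal sketch of standardizing the date and category examples above. It assumes day-first dates, ISO 8601 (YYYY-MM-DD) as the common target format, and a small hypothetical set of yes/no spellings; unit conversion is left out.

```python
from datetime import datetime

# Assumed input formats, both day-first: 01/01/2000 and 01.01.2000.
DATE_FORMATS = ("%d/%m/%Y", "%d.%m.%Y")

# Hypothetical spellings accepted for each category.
YES_VALUES = {"yes", "y"}
NO_VALUES = {"no", "n"}


def standardize_date(text):
    """Parse a date in any known format and return it as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def standardize_yes_no(text):
    """Collapse Yes / yes / Y style variants to 'Yes' or 'No'; None if unknown."""
    key = text.strip().lower()  # text case is normalized before lookup
    if key in YES_VALUES:
        return "Yes"
    if key in NO_VALUES:
        return "No"
    return None
```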
Normalization
Scaling numerical data to a standard range (eg 0 -- 1) for fair comparison.
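Scaling to the 0 to 1 range is commonly done with min-max scaling, where each value x becomes (x - min) / (max - min). The sketch below assumes that technique; the function name and the choice to return zeros for constant input are illustrative.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so return zeros (a design choice).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 20, 30])  # -> [0.0, 0.5, 1.0]
```

Min-max scaling is sensitive to outliers, since a single extreme value sets the range for everything else.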
Data Type Conversion
Ensuring data is in the correct format (eg string to numeric, categorical to numerical)
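Both conversions mentioned above can be sketched briefly. The assumptions here are that commas in numeric strings are thousands separators and that categories are encoded as integer codes in alphabetical order; the function names are hypothetical.

```python
def to_number(text):
    """Convert a numeric string such as '1,250.5' to a float; None if not numeric."""
    try:
        return float(text.replace(",", ""))  # assumes ',' is a thousands separator
    except ValueError:
        return None


def encode_categories(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


encoded, mapping = encode_categories(["red", "green", "red"])
```

Integer codes imply an ordering that may not exist in the data, so for unordered categories a one-hot style encoding is often preferred.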
Text Cleaning
Processing text data, eg
📍 Removing stopwords
📍 Stemming
📍 Correcting spelling
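The three steps above can be chained into a small pipeline. This is a deliberately naive sketch: the stopword set, the suffix-stripping stemmer, and the vocabulary used for spelling correction are all hypothetical stand-ins for what a dedicated NLP library would provide.

```python
import re
from difflib import get_close_matches

# Hypothetical, deliberately tiny resources.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}
VOCABULARY = ["data", "clean", "record", "duplicate"]
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix if at least three letters remain (not a real stemmer)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def correct_spelling(word):
    """Replace a word with its closest vocabulary entry, if one is similar enough."""
    match = get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return match[0] if match else word


def clean_text(text):
    """Lowercase and tokenize, then remove stopwords, stem, and correct spelling."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [naive_stem(t) for t in tokens]
    return [correct_spelling(t) for t in tokens]
```

The order of the steps matters: here spelling is corrected after stemming, so the vocabulary holds stems rather than full word forms.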
Validation
Checking data against rules or constraints to ensure:
Accuracy
This evaluates whether the data correctly represents the real-world entities or events it is meant to describe. Accuracy is often verified by cross-checking with external sources or domain knowledge
Completeness
This assesses whether all required data fields are filled and whether the dataset contains sufficient information for analysis
Conformance
This checks that the data conforms to predefined rules or constraints, eg
📍 A phone number must follow a specific format
📍 A date must be within a valid range
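Completeness and rule-based checks like those above can be sketched as small predicate functions. The phone pattern (a country code followed by two groups of four digits), the default date range, and the field names are assumptions chosen for illustration; actual rules depend on the dataset. Accuracy is not covered, since it requires an external source to compare against.

```python
import re
from datetime import date

# Hypothetical format: "+65 6123 4567" style numbers.
PHONE_PATTERN = re.compile(r"\+\d{1,3} \d{4} \d{4}")


def is_valid_phone(text):
    """True if the whole string matches the assumed phone format."""
    return PHONE_PATTERN.fullmatch(text) is not None


def is_valid_date(value, earliest=date(1900, 1, 1), latest=date(2100, 12, 31)):
    """True if the date lies within the allowed range (inclusive)."""
    return earliest <= value <= latest


def is_complete(record, required_fields):
    """True if every required field is present and neither None nor empty."""
    return all(record.get(field) not in (None, "") for field in required_fields)
```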
Feature Engineering
Creating new features or modifying existing ones to improve data utility for analysis
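As a brief illustration, new features can be derived from fields a record already has. The record layout (order_date, total, quantity) and the derived feature names are hypothetical.

```python
from datetime import date


def add_features(record):
    """Return a copy of the record with derived features added."""
    enriched = dict(record)
    order_date = record["order_date"]
    # Date-derived features: Monday is 0 and Sunday is 6 in date.weekday().
    enriched["order_weekday"] = order_date.weekday()
    enriched["is_weekend"] = order_date.weekday() >= 5
    # Ratio feature combining two existing numeric fields.
    enriched["unit_price"] = record["total"] / record["quantity"]
    return enriched


order = {"order_date": date(2000, 1, 1), "total": 30.0, "quantity": 4}
features = add_features(order)
```

The original record is left unmodified, which makes it easier to compare raw and engineered data side by side.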