Lin Hsin Hsin Artificial Intelligence Center















AI Data Laundromat Categorization
by Lin Hsin Hsin



🧼🛁🚿🧹 🧽🧻🧺








Missing Data


Identifying & addressing missing values


📍 Imputation -- eg mean, median, or regression-based
📍 Deletion
📍 Flagging
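
The three options above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the `ages` column and its values are hypothetical, `None` is assumed to mark a missing value, and regression-based imputation is left out for brevity.

```python
from statistics import mean, median

# Hypothetical column; None marks a missing value (an assumption of this sketch)
ages = [34, None, 29, 41, None, 38]

# Values that are actually present
observed = [a for a in ages if a is not None]

# Imputation: fill each gap with the mean (or median) of the observed values
mean_imputed = [a if a is not None else mean(observed) for a in ages]
median_imputed = [a if a is not None else median(observed) for a in ages]

# Deletion: drop the missing entries entirely
deleted = observed

# Flagging: keep the gaps, but record where they occur
flags = [a is None for a in ages]
```

Deletion is the simplest choice but shrinks the dataset, while flagging keeps the information that a value was missing, which can itself be useful in later analysis.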




Duplicates



Detecting & eliminating duplicate records or entries to avoid redundancy & bias
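
A minimal sketch of exact-duplicate removal, assuming each record is a hashable tuple. The `records` data and the `drop_duplicates` helper are hypothetical names used only for illustration; near-duplicates (eg differing only in case or spacing) would need standardization first.

```python
# Hypothetical records: (email, name)
records = [
    ("alice@example.com", "Alice"),
    ("bob@example.com", "Bob"),
    ("alice@example.com", "Alice"),  # exact duplicate of the first record
]

def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

unique_records = drop_duplicates(records)
```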



Outlier Detection



Identifying & handling data points that deviate significantly from the rest of the data, to prevent skewing during an analysis. Note that these data points may have resulted from errors or may represent rare but valid observations. This is performed through statistical methods such as the interquartile range or adjusted boxplots
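
A minimal sketch of the interquartile range approach, using the common 1.5 × IQR fences. The `values` list is hypothetical, the 1.5 multiplier is a convention rather than something stated above, and quartiles are computed with the default method of `statistics.quantiles`. Adjusted boxplots, which correct the fences for skewed data, are not shown.

```python
from statistics import quantiles

# Hypothetical measurements with one suspicious value
values = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]

# First and third quartiles (the middle cut point is the median, unused here)
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles (an assumed threshold)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag rather than delete: an outlier may be an error or a rare but valid observation
outliers = [v for v in values if v < lower or v > upper]
```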



Inconsistencies



Correcting inconsistencies in data

📍 Mismatched categories
📍 Typos
📍 Conflicting entries
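
A minimal sketch of two of these checks, under stated assumptions: `VALID_CITIES` is a hypothetical list of allowed categories, the 0.8 similarity cutoff is an arbitrary choice, and a "conflict" is taken to mean the same ID appearing with different values.

```python
from difflib import get_close_matches

# Hypothetical list of allowed category labels
VALID_CITIES = ["Singapore", "Jakarta", "Bangkok"]

def fix_category(value, valid=VALID_CITIES, cutoff=0.8):
    """Map a possibly mistyped label to the closest valid one, or None if nothing is close."""
    match = get_close_matches(value.strip().title(), valid, n=1, cutoff=cutoff)
    return match[0] if match else None

# Hypothetical rows: (customer_id, city); C001 appears with two different cities
rows = [("C001", "Singapore"), ("C002", "Jakarta"), ("C001", "Bangkok")]

first_seen = {}
conflicts = set()
for customer_id, city in rows:
    if customer_id in first_seen and first_seen[customer_id] != city:
        conflicts.add(customer_id)  # same ID, different value
    first_seen.setdefault(customer_id, city)
```

Here `fix_category("Singapor")` resolves to `"Singapore"`, while an unrelated label such as `"Tokyo"` returns `None` so that it can be reviewed by hand instead of being silently remapped.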




Standardization



Converting data to a common format

📍 dates -- eg 01/01/2000 vs 01.01.2000
📍 units
📍 categories -- eg Yes, yes, or Y
📍 text case
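
A minimal sketch of each conversion. The helper names are hypothetical, dates are assumed to be day-first (01/01/2000 is ambiguous otherwise), and ISO 8601 is chosen here as the common date format.

```python
from datetime import datetime

# Accepted input formats, assumed day-first
DATE_FORMATS = ("%d/%m/%Y", "%d.%m.%Y")

def standardize_date(text):
    """Convert a date string in any accepted format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

# Categories: map every accepted spelling to one canonical label
YES_NO = {"yes": "Yes", "y": "Yes", "no": "No", "n": "No"}

def standardize_yes_no(text):
    return YES_NO[text.strip().lower()]

# Units: convert centimetres to metres
def cm_to_m(cm):
    return cm / 100

# Text case: trim and lowercase free text
def standardize_case(text):
    return text.strip().lower()
```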




Normalization



Scaling numerical data to a standard range (eg 0 -- 1) for fair comparison.
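
A minimal sketch of min-max scaling to the 0 -- 1 range, where each value v becomes (v - min) / (max - min). The function name is hypothetical, and returning zeros for a constant column is an assumption made to avoid division by zero.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so return zeros (an assumed convention)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30])
```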

Data Type Conversion



Ensuring data is in the correct format (eg string to numeric, categorical to numerical)
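
A minimal sketch of both conversions. The helper names are hypothetical, the numeric strings are assumed to use a comma as the thousands separator, and categories are encoded as integer codes in alphabetical order.

```python
def to_number(text):
    """Convert a numeric string such as '1,250.50' to a float."""
    cleaned = text.strip().replace(",", "")  # assumes ',' is a thousands separator
    return float(cleaned)

def encode_categories(values):
    """Map each distinct label to an integer code; return the codes and the mapping."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

encoded, mapping = encode_categories(["red", "green", "red", "blue"])
```

Integer codes imply an order between categories, which may not exist. Where that matters, one column per category (one-hot encoding) is a common alternative.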




Text Cleaning



Processing text data, eg

📍 Removing stopwords
📍 Stemming
📍 Correcting spelling
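
A minimal sketch of stopword removal and stemming. The stopword list is a tiny illustrative sample, and the suffix-stripping `stem` function is a deliberately naive stand-in for a real stemmer such as Porter's. Spelling correction is left out because it needs a reference vocabulary.

```python
import re

# Tiny illustrative stopword list; real lists are much longer
STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}

# Naive suffix list, checked in order; not a real stemming algorithm
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_text(text):
    """Lowercase, tokenize on letters, drop stopwords, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

cleaned = clean_text("The cats are jumping in the garden")
```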




Validation



Checking data against rules or constraints to ensure

Accuracy


This evaluates whether the data correctly represents the real-world entities or events it is meant to describe. Accuracy is often verified by cross-checking with external sources or domain knowledge


Completeness


This assesses whether all required data fields are filled and whether the dataset contains sufficient information for analysis
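
A minimal sketch of a completeness check, assuming records are dictionaries and that `None` or an empty string counts as unfilled. The `REQUIRED` field list and both helper names are hypothetical.

```python
# Hypothetical list of fields every record must have
REQUIRED = ("name", "email", "age")

def missing_fields(record, required=REQUIRED):
    """Return the required fields that are absent, None, or empty in a record."""
    return [f for f in required if record.get(f) in (None, "")]

def completeness(records, required=REQUIRED):
    """Fraction of required fields that are filled across all records."""
    total = len(records) * len(required)
    if total == 0:
        return 1.0
    filled = total - sum(len(missing_fields(r, required)) for r in records)
    return filled / total

people = [
    {"name": "Alice", "email": "alice@example.com", "age": 30},
    {"name": "Bob", "email": "", "age": None},
]
score = completeness(people)
```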


Conformity


This checks whether the data conforms to predefined rules or constraints, eg

📍 A phone number must follow a specific format
📍 A date must be within a valid range
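
A minimal sketch of both rules. The phone format (`+CC NNNN NNNN`) and the default date range (1900 to today) are hypothetical constraints chosen for illustration; real rules depend on the dataset.

```python
import re
from datetime import date

# Hypothetical format: country code, then two groups of four digits
PHONE_PATTERN = re.compile(r"\+\d{1,3} \d{4} \d{4}")

def valid_phone(text):
    """True if the whole string matches the expected phone format."""
    return PHONE_PATTERN.fullmatch(text) is not None

def valid_date(d, earliest=date(1900, 1, 1), latest=None):
    """True if the date lies within [earliest, latest]; latest defaults to today."""
    if latest is None:
        latest = date.today()
    return earliest <= d <= latest
```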




Feature Engineering



Creating new features or modifying existing ones to improve data utility for analysis
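
A minimal sketch of deriving new features from existing fields. The record layout, the field names, and the two derived features (body mass index and a weekend flag) are all hypothetical examples.

```python
from datetime import date

def add_features(record):
    """Return a copy of the record with two derived features added."""
    enriched = dict(record)
    # Combine two existing fields into one: BMI = weight / height squared
    enriched["bmi"] = record["weight_kg"] / record["height_m"] ** 2
    # Extract a component from a date: weekday() is 5 or 6 on Saturday or Sunday
    enriched["signup_on_weekend"] = record["signup_date"].weekday() >= 5
    return enriched

person = {"weight_kg": 70, "height_m": 1.75, "signup_date": date(2000, 1, 1)}
enriched_person = add_features(person)
```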