A robust XGBoost-based PII (Personally Identifiable Information) validation classifier for Chile, Uruguay, Colombia, and Brazil. This post-processing classifier validates entities detected by NER models to reduce false positives.
Stage 1 - Deterministic Rules: Hard validation filters
Stage 2 - ML Classification: XGBoost with context-aware features
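The two stages above can be sketched as a simple decision flow. This is an illustrative sketch only; the function names and signatures here are hypothetical, not the project's actual API (see `inference/pipeline.py` for that).

```python
# Hypothetical sketch of the two-stage flow; names are illustrative.
def validate_entity(text: str, entity_type: str,
                    rule_check, ml_score, threshold: float = 0.5) -> dict:
    """Stage 1: deterministic hard filter; Stage 2: ML score vs. threshold."""
    if not rule_check(text):                     # Stage 1: hard validation filter
        return {"is_pii": False, "confidence": 0.0,
                "reason": "failed deterministic rule"}
    score = ml_score(text)                       # Stage 2: XGBoost classifier
    return {"is_pii": score >= threshold,
            "confidence": score,
            "reason": "ml_classification"}

# Toy usage: a digits-only rule plus a dummy scorer
result = validate_entity("12345678", "ID",
                         rule_check=str.isdigit,
                         ml_score=lambda t: 0.9)
```

An entity rejected in Stage 1 never reaches the classifier, which is what keeps obvious false positives cheap to discard.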
Classifier_PII_LATAM/
├── configs/
│ ├── country_patterns.json # Country-specific patterns, names, phone formats
│ └── entity_thresholds.json # ML configuration and threshold settings
├── data_generation/
│ ├── cl/generators.py # Chilean data generator (Faker locale: es_CL)
│ ├── br/generators.py # Brazilian data generator (Faker locale: pt_BR)
│ ├── uy/generators.py # Uruguayan data generator (Faker locale: es_AR)
│ ├── co/generators.py # Colombian data generator (Faker locale: es_ES)
│ └── generate_dataset.py # Unified dataset generation pipeline
├── datasets/
│ ├── complete_dataset.csv # Full balanced dataset (42,000 samples)
│ ├── train.csv # Training set (29,400 samples, 70%)
│ ├── val.csv # Validation set (6,300 samples, 15%)
│ └── test.csv # Test set (6,300 samples, 15%)
├── feature_extraction/
│ └── extractors.py # Feature engineering (79 features total)
├── validation_rules/
│ └── deterministic.py # Checksum validators (RUT, CPF, CI, CC)
├── models/
│ ├── xgboost_pii_classifier.pkl # Trained XGBoost model
│ ├── label_encoders.pkl # Categorical feature encoders
│ ├── optimized_thresholds.json # Entity-specific prediction thresholds
│ └── feature_names.json # Complete feature list
├── inference/
│ └── pipeline.py # Two-stage validation inference pipeline
├── train_model.py # XGBoost training with hyperparameter tuning
├── example_usage.py # Usage examples
├── quick_start.py # Interactive setup script
├── requirements.txt # Python dependencies
└── README.md # This file
# Clone the repository
git clone https://github.com/andresveraf/Classifier_PII_LATAM.git
cd Classifier_PII_LATAM
# Install dependencies
pip install -r requirements.txt
python data_generation/generate_dataset.py
Output: 42,000 balanced samples across all entity types and countries
Faker('pt_BR') - Brazilian Portuguese
Faker('es_CL') - Chilean Spanish
Faker('es_AR') - Argentine Spanish (es_UY not available)
Faker('es_ES') - Castilian Spanish (es_CO not available)

python train_model.py
Training includes:
Expected results:
python example_usage.py
Or use the pipeline directly:
from inference.pipeline import PII_ValidationPipeline
# Initialize pipeline
pipeline = PII_ValidationPipeline()
# Validate a single entity
result = pipeline.validate(
text="15.783.037-6",
entity_type="ID",
country="CL"
)
print(f"Is PII: {result['is_pii']}")
print(f"Confidence: {result['confidence']:.2%}")
print(f"Reason: {result['reason']}")
Validation Set (6,300 samples):
Test Set (6,300 samples):
| Entity | F1 Score | Precision | Recall | Threshold |
|---|---|---|---|---|
| EMAIL | 1.000 | 1.000 | 1.000 | 0.997 |
| LOC | 1.000 | 1.000 | 1.000 | 0.988 |
| SEX | 1.000 | 1.000 | 1.000 | 0.999 |
| ID | 0.998 | 0.996 | 1.000 | 0.871 |
| DATE | 0.993 | 0.987 | 1.000 | 0.690 |
| PER | 0.987 | 0.987 | 0.987 | 0.359 |
| PHONE | 0.975 | 0.965 | 0.984 | 0.533 |
Supports country-specific ID formats with validation:
Chile - RUT (Rol Único Tributario)
Format: XX.XXX.XXX-K or XXXXXXXXK
Examples: 15.783.037-6 (valid), 15.783.037-5 (invalid)

Brazil - CPF (Cadastro de Pessoas Físicas)
Format: XXX.XXX.XXX-XX
Generation: fake.cpf() for realistic generation

Uruguay - CI (Cédula de Identidad)
Colombia - CC (Cédula de Ciudadanía)
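The RUT verifier digit uses the standard, publicly documented modulo-11 scheme. Below is a minimal sketch of that algorithm; the function names are illustrative and may differ from those in `validation_rules/deterministic.py`.

```python
# Chile RUT modulo-11 checksum (standard public algorithm).
def rut_check_digit(body: str) -> str:
    """Compute the verifier digit for the numeric body of a RUT."""
    total, factor = 0, 2
    for digit in reversed(body):          # weights cycle 2..7, right to left
        total += int(digit) * factor
        factor = 2 if factor == 7 else factor + 1
    remainder = 11 - (total % 11)
    return {11: "0", 10: "K"}.get(remainder, str(remainder))

def validate_chile_rut(rut: str) -> bool:
    """Strip separators and compare the stated digit with the computed one."""
    clean = rut.replace(".", "").replace("-", "").upper()
    body, dv = clean[:-1], clean[-1]
    return body.isdigit() and rut_check_digit(body) == dv

print(validate_chile_rut("15.783.037-6"))  # True
print(validate_chile_rut("15.783.037-5"))  # False
```

This is why the README's example pair differs only in the last digit: 15.783.037 yields verifier 6, so any other final digit fails the checksum.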
Country-specific phone formats and validation:
RFC 5322 compliant validation:
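Full RFC 5322 grammar is far more permissive than most practical validators. As a much-simplified illustration of the pattern-matching approach (this regex is an assumption, not the project's actual pattern):

```python
import re

# Simplified practical email pattern; the real RFC 5322 grammar allows
# many additional forms (quoted local parts, comments, etc.).
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_like_email(text: str) -> bool:
    """Return True if text matches the common user@domain.tld shape."""
    return bool(EMAIL_RE.match(text))

print(looks_like_email("user@example.com"))  # True
print(looks_like_email("not-an-email"))      # False
```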
Person name validation using word count and patterns:
Address and location validation:
Multiple date format support:
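Multi-format date handling typically tries each accepted pattern in turn. The format list below is an assumption (common Latin American conventions), not taken from the project's configuration:

```python
from datetime import datetime

# Illustrative format list; the project's actual supported formats may differ.
DATE_FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d"]

def parse_date(text: str):
    """Return a datetime if text matches any supported format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

print(parse_date("01/12/1990"))  # datetime for 1 Dec 1990
print(parse_date("not a date"))  # None
```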
Gender information validation:
The project uses Faker library for generating diverse, realistic synthetic data:
# Brazil - Brazilian Portuguese
Faker('pt_BR')
fake.cpf() # Valid Brazilian CPFs
fake.cnpj() # Company IDs
fake.phone_number() # Brazilian phone numbers
fake.name() # Brazilian names
fake.address() # Brazilian addresses
fake.email() # Email addresses
# Chile - Chilean Spanish
Faker('es_CL')
fake.name() # Chilean names
fake.address() # Chilean addresses
fake.email() # Email addresses
fake.city() # Chilean cities
# Uruguay - Argentine Spanish (closest available)
Faker('es_AR')
fake.name() # Argentine/Uruguayan names
fake.address() # Addresses
fake.email() # Email addresses
# Colombia - Castilian Spanish (closest available)
Faker('es_ES')
fake.name() # Spanish/Colombian names
fake.address() # Addresses
fake.email() # Email addresses
Complete Dataset: 42,000 samples
By Entity Type (6,000 each):
By Country (10,500 each):
Format Features (23):
Validation Features (3):
Statistical Features (5):
Dictionary Features (6):
Pattern Features (14):
Entity-Specific Features (28):
ID Numbers:
validate_chile_rut(text) # Modulo-11 checksum
validate_brazil_cpf(text) # Double-digit verification
validate_uruguay_ci(text) # Format and range check
validate_colombia_cc(text) # Basic format validation
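The CPF's double-digit verification listed above is also a standard public algorithm: each of the two check digits is a weighted sum over the preceding digits, reduced modulo 11. A minimal sketch (simplified; it ignores the all-same-digit edge case, e.g. 111.111.111-11, which real validators also reject):

```python
# Brazil CPF double check digit (standard public algorithm, simplified).
def cpf_digit(digits: str, start_weight: int) -> int:
    """Weighted sum of digits with descending weights, reduced mod 11."""
    total = sum(int(d) * w for d, w in
                zip(digits, range(start_weight, 1, -1)))
    remainder = (total * 10) % 11
    return 0 if remainder == 10 else remainder

def validate_brazil_cpf(cpf: str) -> bool:
    """Check length, then verify both check digits against the first nine."""
    clean = cpf.replace(".", "").replace("-", "")
    if len(clean) != 11 or not clean.isdigit():
        return False
    return (cpf_digit(clean[:9], 10) == int(clean[9])
            and cpf_digit(clean[:10], 11) == int(clean[10]))

print(validate_brazil_cpf("123.456.789-09"))  # True
```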
Phone Numbers:
Emails:
Dates:
Special Patterns:
configs/country_patterns.json contains country-specific patterns:
{
"chile": {
"id_pattern": "^\\d{1,2}\\.\\d{3}\\.\\d{3}-[0-9K]$",
"common_first_names": [...],
"common_surnames": [...],
"cities": [...],
"street_types": [...]
}
}
configs/entity_thresholds.json contains ML and training configuration:
{
"id": {"threshold": 0.871, "min_samples": 100},
"phone": {"threshold": 0.533, "min_samples": 100},
...
}
from inference.pipeline import PII_ValidationPipeline
pipeline = PII_ValidationPipeline()
# Validate Chilean RUT
result = pipeline.validate("15.783.037-6", "ID", "CL")
# Output: {
# 'is_pii': True,
# 'confidence': 0.99,
# 'reason': 'Valid RUT checksum + high ML confidence'
# }
# Validate Brazilian CPF
result = pipeline.validate("123.456.789-09", "ID", "BR")
# Validate Email
result = pipeline.validate("user@example.com", "EMAIL", "CL")
# Validate Phone Number
result = pipeline.validate("+56 9 8765 4321", "PHONE", "CL")
entities = [
{"text": "15.783.037-6", "entity_type": "ID", "country": "CL"},
{"text": "user@test.com", "entity_type": "EMAIL", "country": "CL"},
{"text": "+56 9 8765 4321", "entity_type": "PHONE", "country": "CL"},
{"text": "Juan García", "entity_type": "PER", "country": "CL"},
{"text": "01/12/1990", "entity_type": "DATE", "country": "CL"}
]
results = pipeline.validate_batch(entities)
for result in results:
print(f"{result['text']}: {result['is_pii']} ({result['confidence']:.2%})")
# Example: Integrate with spaCy NER output
import spacy
from inference.pipeline import PII_ValidationPipeline
nlp = spacy.load("es_core_news_sm")
pipeline = PII_ValidationPipeline()
text = "Mi RUT es 15.783.037-6 y mi email es juan@example.com"
doc = nlp(text)
for ent in doc.ents:
result = pipeline.validate(ent.text, ent.label_, "CL")
if result['is_pii']:
print(f"Found PII: {ent.text} (type: {ent.label_})")
Edit configs/entity_thresholds.json to change prediction thresholds:
{
"id": {"threshold": 0.85}, // Stricter (fewer false positives)
"phone": {"threshold": 0.50}, // Balanced
"email": {"threshold": 0.95}, // Very strict
"per": {"threshold": 0.30} // Lenient (fewer false negatives)
}
Threshold Strategy:
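At inference time, per-entity thresholds like these decide whether a classifier score counts as PII. A hypothetical sketch of how that lookup might work (the function name and fallback default are illustrative, not the project's API):

```python
# Per-entity threshold application; values mirror entity_thresholds.json.
THRESHOLDS = {"id": 0.871, "phone": 0.533, "email": 0.95, "per": 0.359}

def apply_threshold(entity_type: str, score: float,
                    default: float = 0.5) -> bool:
    """Flag as PII only when the model score clears the entity's bar."""
    return score >= THRESHOLDS.get(entity_type.lower(), default)

print(apply_threshold("ID", 0.90))   # True  (0.90 >= 0.871)
print(apply_threshold("PER", 0.30))  # False (0.30 < 0.359)
```

Raising a threshold trades recall for precision (fewer false positives); lowering it does the opposite.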
To retrain with new data:
# Generate new dataset
python data_generation/generate_dataset.py
# Train with new parameters
python train_model.py
This project is for educational and research purposes.
Contributions are welcome! Areas for improvement:
For questions or issues, please open a GitHub issue.