OCR document extraction for Chilean ID cards using docTR with FastAPI and Streamlit
Extract structured information from Chilean ID cards using Optical Character Recognition (OCR).
This project implements an end-to-end OCR system that automatically extracts and validates information from Chilean national identification cards (Carnet de Identidad). It combines docTR's pre-trained deep learning models with custom extraction and validation logic, such as RUT check-digit verification and date consistency checks.
06-ocr-document-extraction/
├── data/                          # Data directory
│   ├── raw/                       # Raw images
│   ├── processed/                 # Processed data
│   └── external/                  # External models/resources
├── docker/                        # Docker configuration
│   ├── Dockerfile.api             # API container
│   ├── Dockerfile.dashboard       # Dashboard container
│   └── docker-compose.yml         # Multi-container orchestration
├── docs/                          # Documentation
├── scripts/                       # Utility scripts
│   ├── run_api.sh                 # Run API server
│   └── run_dashboard.sh           # Run dashboard
├── src/                           # Source code
│   ├── api/                       # FastAPI application
│   │   └── app.py                 # API endpoints
│   ├── dashboard/                 # Streamlit dashboard
│   │   └── app.py                 # Dashboard UI
│   ├── data/                      # Data generation
│   │   └── document_generator.py  # Synthetic ID card generator
│   └── ocr/                       # OCR components
│       ├── predictor.py           # OCR prediction
│       ├── extractor.py           # Field extraction
│       └── validator.py           # Field validation
├── tests/                         # Tests
├── config.yaml                    # Configuration
├── requirements.txt               # Python dependencies
└── README.md                      # This file
Clone the repository
cd 06-ocr-document-extraction
Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install dependencies
pip install -r requirements.txt
Run the API:
chmod +x scripts/run_api.sh
./scripts/run_api.sh
Run the Dashboard:
chmod +x scripts/run_dashboard.sh
./scripts/run_dashboard.sh
cd docker
docker-compose up --build
This starts both the API (port 8000) and the dashboard (port 8501).
API:
uvicorn src.api.app:app --host 0.0.0.0 --port 8000 --reload
Dashboard:
streamlit run src/dashboard/app.py
Base URL: http://localhost:8000

| Method | Endpoint | Description |
|---|---|---|
| GET | / | API information |
| GET | /health | Health check |
| GET | /fields | List supported fields |
| GET | /stats | API statistics |
| POST | /extract | Extract from single image |
| POST | /batch | Extract from multiple images |
| POST | /validate | Validate extracted fields |

curl -X POST "http://localhost:8000/extract" \
  -F "file=@id_card.jpg" \
  -F "validate=true" \
  -F "return_ocr_text=false"
Response:
{
  "success": true,
  "extracted_fields": {
    "rut": {
      "value": "12.345.678-5",
      "confidence": 0.95,
      "bbox": [100, 200, 300, 250]
    },
    "name": {
      "value": "GONZALEZ ROJAS JUAN CARLOS",
      "confidence": 0.92,
      "bbox": [100, 260, 400, 290]
    }
  },
  "validation_results": {
    "is_valid": true,
    "errors": [],
    "warnings": []
  }
}
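
On the client side, a response like the one above can be reduced to the fields whose confidence clears a threshold. The following is a minimal sketch that assumes the response shape shown in the example; the helper name `confident_fields` and its `min_confidence` parameter are illustrative and not part of the project's API.

```python
import json


def confident_fields(response: dict, min_confidence: float = 0.5) -> dict:
    """Return {field_name: value} for fields at or above min_confidence.

    Hypothetical helper: assumes the response layout from the example above.
    """
    if not response.get("success"):
        return {}
    fields = response.get("extracted_fields", {})
    return {
        name: info["value"]
        for name, info in fields.items()
        if info.get("confidence", 0.0) >= min_confidence
    }


# Sample payload mirroring the documented response (bbox omitted for brevity).
sample = json.loads("""
{
  "success": true,
  "extracted_fields": {
    "rut": {"value": "12.345.678-5", "confidence": 0.95},
    "name": {"value": "GONZALEZ ROJAS JUAN CARLOS", "confidence": 0.92}
  },
  "validation_results": {"is_valid": true, "errors": [], "warnings": []}
}
""")

print(confident_fields(sample, min_confidence=0.93))
# {'rut': '12.345.678-5'}
```

Filtering after the call keeps the raw response available for logging while downstream code only sees fields it can trust.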
The Streamlit dashboard provides:
- Upload Tab: Upload and process ID card images
- Generate Tab: Create synthetic ID cards for testing
- About Tab: System information and documentation
Edit config.yaml to customize:
ocr:
  model_type: "db_resnet50"       # OCR model architecture
  pretrained: true                # Use pre-trained weights
  preserve_aspect_ratio: true     # Preserve image aspect ratio
  straighten_pages: false         # Don't rotate text lines
extraction:
  min_confidence: 0.5             # Minimum confidence threshold
  rut_pattern: "XX.XXX.XXX-X"     # RUT format pattern
  date_format: "DD/MM/YYYY"       # Date format
validation:
  validate_rut: true              # Validate RUT verification digit
  check_dates: true               # Check date consistency
  min_confidence_threshold: 0.7   # Minimum validation confidence
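
The `validate_rut` option refers to the RUT verification digit, which is conventionally computed with a modulo-11 checksum. The sketch below shows one way to implement that check for the `XX.XXX.XXX-X` format. The names `RUT_RE`, `rut_check_digit`, and `is_valid_rut` are illustrative assumptions and may not match what `src/ocr/validator.py` actually does.

```python
import re

# Matches the dotted format, e.g. "12.345.678-5" or "1.234.567-K".
RUT_RE = re.compile(r"(\d{1,2})\.(\d{3})\.(\d{3})-([\dkK])")


def rut_check_digit(number: int) -> str:
    """Compute the modulo-11 verification digit for a RUT body."""
    total = 0
    factor = 2
    while number > 0:
        # Weight digits right to left with the repeating series 2..7.
        total += (number % 10) * factor
        number //= 10
        factor = 2 if factor == 7 else factor + 1
    result = 11 - (total % 11)
    if result == 11:
        return "0"
    if result == 10:
        return "K"
    return str(result)


def is_valid_rut(rut: str) -> bool:
    """Check both the format and the verification digit of a RUT string."""
    match = RUT_RE.fullmatch(rut.strip())
    if match is None:
        return False
    body = int("".join(match.group(1, 2, 3)))
    return rut_check_digit(body) == match.group(4).upper()


print(is_valid_rut("12.345.678-5"))  # True
print(is_valid_rut("12.345.678-9"))  # False
```

For `12.345.678`, the weighted sum is 138, and 11 - (138 mod 11) = 5, which is why the sample RUT used throughout this README ends in `-5`.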
The system extracts the following fields:
| Field | Format | Required | Description |
|---|---|---|---|
| rut | XX.XXX.XXX-X | ✓ | Chilean National ID (RUT) |
| name | SURNAME1 SURNAME2 GIVEN NAMES | ✓ | Full name |
| nationality | Text | ✓ | Nationality (e.g., CHILENO) |
| date_of_birth | DD/MM/YYYY | ✓ | Date of birth |
| gender | M or F | ✓ | Gender |
| issue_date | DD/MM/YYYY | ✓ | Document issue date |
| expiry_date | DD/MM/YYYY | ✓ | Document expiry date |

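
The three date fields lend themselves to a simple consistency check: each must parse as `DD/MM/YYYY`, and the order should be birth, then issue, then expiry. The sketch below assumes a flat `{field: "DD/MM/YYYY"}` dictionary as input and returns `errors` and `warnings` lists. The function name `check_dates` and the exact rules are illustrative and not necessarily those of the project's validator.

```python
from datetime import date, datetime
from typing import Dict, List, Optional, Tuple

DATE_FORMAT = "%d/%m/%Y"  # strptime equivalent of DD/MM/YYYY
DATE_FIELDS = ("date_of_birth", "issue_date", "expiry_date")


def check_dates(
    fields: Dict[str, str], today: Optional[date] = None
) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for the date fields of one ID card."""
    today = today or date.today()
    errors: List[str] = []
    warnings: List[str] = []
    parsed: Dict[str, date] = {}

    for name in DATE_FIELDS:
        raw = fields.get(name)
        if raw is None:
            errors.append(f"{name}: missing")
            continue
        try:
            parsed[name] = datetime.strptime(raw, DATE_FORMAT).date()
        except ValueError:
            errors.append(f"{name}: not a valid DD/MM/YYYY date: {raw!r}")

    birth = parsed.get("date_of_birth")
    issue = parsed.get("issue_date")
    expiry = parsed.get("expiry_date")

    if birth and birth > today:
        errors.append("date_of_birth is in the future")
    if birth and issue and issue <= birth:
        errors.append("issue_date must be after date_of_birth")
    if issue and expiry and expiry <= issue:
        errors.append("expiry_date must be after issue_date")
    if expiry and expiry < today:
        # An expired card is still readable, so this is only a warning.
        warnings.append("document has expired")

    return errors, warnings


card = {
    "date_of_birth": "15/03/1985",
    "issue_date": "01/01/2020",
    "expiry_date": "01/01/2030",
}
print(check_dates(card, today=date(2024, 6, 1)))  # ([], [])
```

Passing `today` explicitly keeps the check deterministic in tests. Treating expiry as a warning rather than an error is a design choice of this sketch.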
Generate sample ID cards for testing:
from src.data.document_generator import IDCardGenerator

generator = IDCardGenerator()
image, ground_truth = generator.generate_id_card(
    rut="12.345.678-5",
    name="GONZALEZ ROJAS JUAN CARLOS",
    nationality="CHILENO",
    date_of_birth="15/03/1985",
    gender="M",
    issue_date="01/01/2020",
    expiry_date="01/01/2030"
)
image.save("sample_id_card.png")
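
Because the generator returns ground truth alongside the image, synthetic cards can be used to score extraction quality. The sketch below computes a field-level accuracy under the assumption that both the ground truth and the extracted values are flat `{field: string}` dictionaries. The real structure of `ground_truth` may differ, and `field_accuracy` is a hypothetical helper.

```python
def field_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth fields whose extracted value matches.

    Comparison ignores case and collapses repeated whitespace, since OCR
    output often differs from the source text in spacing.
    """
    if not ground_truth:
        return 0.0

    def normalize(value) -> str:
        return " ".join(str(value).upper().split())

    correct = sum(
        1
        for name, expected in ground_truth.items()
        if normalize(extracted.get(name, "")) == normalize(expected)
    )
    return correct / len(ground_truth)


# Illustrative values only; one of four fields is misread.
truth = {
    "rut": "12.345.678-5",
    "name": "GONZALEZ ROJAS JUAN CARLOS",
    "nationality": "CHILENO",
    "gender": "M",
}
predicted = {
    "rut": "12.345.678-5",
    "name": "gonzalez  rojas juan carlos",
    "nationality": "CHILEN0",
    "gender": "M",
}
print(field_accuracy(predicted, truth))  # 0.75
```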
cd docker
docker-compose up -d # Start the services in the background
docker-compose down # Stop and remove the containers
The system uses docTR's pre-trained models.
Models are automatically downloaded on first use and cached locally.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.
For questions or issues, please open an issue in the repository.