ocr-document-extraction

OCR document extraction for Chilean ID cards using docTR with FastAPI and Streamlit

📄 OCR Document Extraction

Extract structured information from Chilean ID cards using advanced Optical Character Recognition (OCR) technology.

🎯 Overview

This project implements an end-to-end OCR system that extracts and validates information from Chilean national identification cards (Carnet de Identidad). It combines pre-trained docTR deep learning models with custom business logic to produce validated extractions.

Key Features

  • Advanced OCR: Uses docTR (Document Text Recognition) with pre-trained models
  • Intelligent Extraction: Custom field extraction algorithms tailored for Chilean IDs
  • Validation System: RUT verification digit validation, date consistency checks, and business rules
  • REST API: FastAPI-based API for easy integration
  • Interactive Dashboard: Streamlit UI for testing and demonstration
  • Batch Processing: Process multiple documents simultaneously
  • Docker Support: Containerized deployment with Docker and Docker Compose
  • Sample Generation: Generate synthetic ID cards for testing

🏗️ Project Structure

06-ocr-document-extraction/
├── data/                      # Data directory
│   ├── raw/                   # Raw images
│   ├── processed/             # Processed data
│   └── external/              # External models/resources
├── docker/                    # Docker configuration
│   ├── Dockerfile.api         # API container
│   ├── Dockerfile.dashboard   # Dashboard container
│   └── docker-compose.yml     # Multi-container orchestration
├── docs/                      # Documentation
├── scripts/                   # Utility scripts
│   ├── run_api.sh             # Run API server
│   └── run_dashboard.sh       # Run dashboard
├── src/                       # Source code
│   ├── api/                   # FastAPI application
│   │   └── app.py             # API endpoints
│   ├── dashboard/             # Streamlit dashboard
│   │   └── app.py             # Dashboard UI
│   ├── data/                  # Data generation
│   │   └── document_generator.py  # Synthetic ID card generator
│   └── ocr/                   # OCR components
│       ├── predictor.py       # OCR prediction
│       ├── extractor.py       # Field extraction
│       └── validator.py       # Field validation
├── tests/                     # Tests
├── config.yaml                # Configuration
├── requirements.txt           # Python dependencies
└── README.md                  # This file

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • pip
  • (Optional) Docker and Docker Compose

Installation

  1. Clone the repository, then change into the project directory

    cd 06-ocr-document-extraction
    
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install dependencies

    pip install -r requirements.txt
    

Running the Application

Option 1: Using Shell Scripts

Run the API:

chmod +x scripts/run_api.sh
./scripts/run_api.sh

Run the Dashboard:

chmod +x scripts/run_dashboard.sh
./scripts/run_dashboard.sh

Option 2: Using Docker Compose

cd docker
docker-compose up --build

This starts both the API (port 8000) and the dashboard (port 8501).

Option 3: Direct Python Commands

API:

uvicorn src.api.app:app --host 0.0.0.0 --port 8000 --reload

Dashboard:

streamlit run src/dashboard/app.py

📡 API Endpoints

Base URL: http://localhost:8000

Endpoints

Method  Endpoint   Description
GET     /          API information
GET     /health    Health check
GET     /fields    List supported fields
GET     /stats     API statistics
POST    /extract   Extract from single image
POST    /batch     Extract from multiple images
POST    /validate  Validate extracted fields

Example: Extract from Image

curl -X POST "http://localhost:8000/extract" \
  -F "file=@id_card.jpg" \
  -F "validate=true" \
  -F "return_ocr_text=false"

Response:

{
  "success": true,
  "extracted_fields": {
    "rut": {
      "value": "12.345.678-5",
      "confidence": 0.95,
      "bbox": [100, 200, 300, 250]
    },
    "name": {
      "value": "GONZALEZ ROJAS JUAN CARLOS",
      "confidence": 0.92,
      "bbox": [100, 260, 400, 290]
    }
  },
  "validation_results": {
    "is_valid": true,
    "errors": [],
    "warnings": []
  }
}
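
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; it assumes the form field names shown in the curl example (`file`, `validate`, `return_ocr_text`), and the helper names `build_multipart` and `extract` are hypothetical, not part of this project.

```python
import json
import os
import urllib.request
import uuid

API_URL = "http://localhost:8000/extract"  # base URL from this README


def build_multipart(fields, file_name, file_bytes, file_type="image/jpeg"):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{key}"\r\n\r\n'
        )
        parts.append(header.encode() + str(value).encode() + b"\r\n")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract(image_path, url=API_URL):
    """POST an image to /extract and return the decoded JSON response."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart(
            {"validate": "true", "return_ocr_text": "false"},
            file_name=os.path.basename(image_path),
            file_bytes=fh.read(),
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the API running locally, `extract("id_card.jpg")["extracted_fields"]["rut"]["value"]` would return the extracted RUT, assuming the response shape shown above.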

🎨 Dashboard Features

The Streamlit dashboard provides:

  1. Upload Tab: Upload and process ID card images

    • Image preview
    • Real-time extraction
    • Validation results
    • Download results as JSON
  2. Generate Tab: Create synthetic ID cards for testing

    • Customizable fields
    • Ground truth comparison
    • Accuracy metrics
  3. About Tab: System information and documentation

🔧 Configuration

Edit config.yaml to customize:

ocr:
  model_type: "db_resnet50"  # OCR model architecture
  pretrained: true            # Use pre-trained weights
  preserve_aspect_ratio: true # Preserve image aspect ratio
  straighten_pages: false     # Don't rotate text lines

extraction:
  min_confidence: 0.5         # Minimum confidence threshold
  rut_pattern: "XX.XXX.XXX-X" # RUT format pattern
  date_format: "DD/MM/YYYY"   # Date format

validation:
  validate_rut: true          # Validate RUT verification digit
  check_dates: true           # Check date consistency
  min_confidence_threshold: 0.7  # Minimum validation confidence
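
The configuration defines two confidence thresholds, and this README does not spell out how they interact. One plausible reading, sketched below purely as an assumption, is that fields below `extraction.min_confidence` are discarded, while fields below `validation.min_confidence_threshold` are kept but flagged with a warning. The function name `triage_fields` is hypothetical.

```python
def triage_fields(extracted_fields, min_confidence=0.5, min_confidence_threshold=0.7):
    """Split fields into kept fields and low-confidence warnings.

    Assumed semantics (not confirmed by the project):
      - confidence < min_confidence            -> field is dropped
      - confidence < min_confidence_threshold  -> field is kept with a warning
    """
    kept = {}
    warnings = []
    for name, field in extracted_fields.items():
        confidence = field["confidence"]
        if confidence < min_confidence:
            continue  # too unreliable to report at all
        kept[name] = field
        if confidence < min_confidence_threshold:
            warnings.append(f"{name}: low confidence ({confidence:.2f})")
    return kept, warnings
```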

📊 Extracted Fields

The system extracts the following fields:

Field          Format                         Required  Description
rut            XX.XXX.XXX-X                   ✅        Chilean National ID (RUT)
name           SURNAME1 SURNAME2 GIVEN NAMES  ✅        Full name
nationality    Text                           ✅        Nationality (e.g., CHILENO)
date_of_birth  DD/MM/YYYY                     ✅        Date of birth
gender         M or F                         ✅        Gender
issue_date     DD/MM/YYYY                     ✅        Document issue date
expiry_date    DD/MM/YYYY                     ✅        Document expiry date
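
As an illustration of how the RUT and date formats in the table might be located in raw OCR text, here is a small regex sketch. It is not the project's extractor; the patterns, the `find_candidates` helper, and the sample text are assumptions made for this example.

```python
import re

# XX.XXX.XXX-X, where the final character is a digit or K
RUT_RE = re.compile(r"\b\d{1,2}\.\d{3}\.\d{3}-[\dkK]\b")
# DD/MM/YYYY (shape only; calendar validity is not checked here)
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def find_candidates(ocr_text):
    """Return the first RUT-shaped string and all date-shaped strings."""
    ruts = RUT_RE.findall(ocr_text)
    dates = DATE_RE.findall(ocr_text)
    return {"rut": ruts[0] if ruts else None, "dates": dates}


sample = "RUN 12.345.678-5 NACIMIENTO 15/03/1985 EMISION 01/01/2020 VENCIMIENTO 01/01/2030"
print(find_candidates(sample))
```

A regex only confirms the shape of a value; mapping each date to `date_of_birth`, `issue_date`, or `expiry_date` would still need positional or label information from the OCR output.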

✅ Validation Features

  • RUT Validation: Verifies the RUT check digit (dígito verificador, modulo 11)
  • Date Consistency: Checks that dates are logically consistent (for example, that the issue date precedes the expiry date)
  • Confidence Scoring: Provides confidence scores for each field
  • Business Rules: Custom validation rules based on Chilean ID specifications
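
The first two checks above can be sketched as follows. The check digit uses the standard modulo 11 algorithm for the RUT; the date rule (birth before issue before expiry) is an assumption about what "consistent" means here. The function names are hypothetical, and the project's `validator.py` may be implemented differently.

```python
from datetime import datetime


def rut_check_digit(body: int) -> str:
    """Compute the modulo 11 check digit for the numeric part of a RUT."""
    total, factor = 0, 2
    while body > 0:
        total += (body % 10) * factor  # weight digits right to left
        body //= 10
        factor = 2 if factor == 7 else factor + 1  # weights cycle 2..7
    result = 11 - (total % 11)
    if result == 11:
        return "0"
    if result == 10:
        return "K"
    return str(result)


def is_valid_rut(rut: str) -> bool:
    """Check a RUT such as '12.345.678-5' against its check digit."""
    cleaned = rut.replace(".", "").replace("-", "").upper()
    if len(cleaned) < 2 or not cleaned[:-1].isdigit():
        return False
    return rut_check_digit(int(cleaned[:-1])) == cleaned[-1]


def dates_consistent(date_of_birth, issue_date, expiry_date, fmt="%d/%m/%Y"):
    """Return True if all dates parse and birth < issue < expiry."""
    try:
        born, issued, expires = (
            datetime.strptime(value, fmt)
            for value in (date_of_birth, issue_date, expiry_date)
        )
    except ValueError:
        return False  # at least one date is malformed
    return born < issued < expires
```

For the sample values used elsewhere in this README, `is_valid_rut("12.345.678-5")` and `dates_consistent("15/03/1985", "01/01/2020", "01/01/2030")` both hold.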

🧪 Testing

Generate sample ID cards for testing:

from src.data.document_generator import IDCardGenerator

generator = IDCardGenerator()
image, ground_truth = generator.generate_id_card(
    rut="12.345.678-5",
    name="GONZALEZ ROJAS JUAN CARLOS",
    nationality="CHILENO",
    date_of_birth="15/03/1985",
    gender="M",
    issue_date="01/01/2020",
    expiry_date="01/01/2030"
)
image.save("sample_id_card.png")

🐳 Docker Deployment

Build and Run with Docker Compose

cd docker
docker-compose up -d

Access Services

  • API: http://localhost:8000
  • Dashboard: http://localhost:8501

Stop Services

docker-compose down

📚 Technology Stack

  • OCR Engine: docTR - Document Text Recognition
  • Backend: Python 3.10+
  • API Framework: FastAPI
  • Dashboard: Streamlit
  • Image Processing: PIL/Pillow, NumPy
  • Validation: Custom business logic
  • Deployment: Docker, Docker Compose

🔍 Model Details

The system uses docTR's pre-trained models:

  • Text Detection: DBResNet50 (DBNet, i.e. Differentiable Binarization, with a ResNet-50 backbone)
  • Text Recognition: CRNN (Convolutional Recurrent Neural Network)

Models are automatically downloaded on first use and cached locally.
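
As a rough sketch of how the `ocr` section of config.yaml might map onto docTR's `ocr_predictor`, consider the following. The mapping of `model_type` to `det_arch`, the choice of `crnn_vgg16_bn` as the CRNN variant, and the helper names are assumptions; the project's `predictor.py` may be wired differently. docTR is imported lazily so the module loads even where it is not installed.

```python
def predictor_kwargs(ocr_config):
    """Translate the config.yaml `ocr` section into ocr_predictor keyword arguments."""
    return {
        "det_arch": ocr_config.get("model_type", "db_resnet50"),
        "reco_arch": "crnn_vgg16_bn",  # assumed CRNN variant
        "pretrained": ocr_config.get("pretrained", True),
        "preserve_aspect_ratio": ocr_config.get("preserve_aspect_ratio", True),
        "straighten_pages": ocr_config.get("straighten_pages", False),
    }


def load_predictor(ocr_config):
    """Build a docTR predictor; weights are downloaded on first use."""
    from doctr.models import ocr_predictor  # requires python-doctr

    return ocr_predictor(**predictor_kwargs(ocr_config))


def read_text(image_path, predictor):
    """Run OCR on one image and return the recognized text as a string."""
    from doctr.io import DocumentFile

    document = DocumentFile.from_images(image_path)
    result = predictor(document)
    return result.render()
```

A typical call would be `read_text("sample_id_card.png", load_predictor(config["ocr"]))`, where `config` is the parsed config.yaml.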

📈 Performance

  • Processing Time: ~2-5 seconds per image (CPU)
  • Accuracy: 95%+ on clear, well-lit images
  • Supported Formats: JPG, JPEG, PNG
  • Max File Size: 10MB

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📝 License

This project is licensed under the MIT License.

📞 Support

For questions or issues, please open an issue in the repository.

🔮 Future Enhancements

📖 References
