
Amazon ML Challenge 2025 — 3rd (Second Runner-Up)

Team Name: 00_Team_Rocket
Team Members: Parth Rastogi, Angadjeet Singh, Abhishek Jha, Harsh Kumar


Achievement

Secured Second Runner-Up Position in Amazon ML Challenge 2025


1. Executive Summary

Our team developed a hybrid deep learning approach that combines DeBERTa-v3-large with engineered features through a cross-attention fusion mechanism.
This architecture leverages pretrained language understanding enriched with domain-specific product attributes, resulting in robust and generalizable price prediction performance.


2. Methodology Overview

2.1 Problem Analysis & Key Observations

Through extensive exploratory data analysis (EDA), we observed that product pricing is influenced by both semantic content and explicit attributes.
Key findings include:

  • Price correlations with quality indicators (organic, gourmet, gluten-free)
  • Importance of text structure (length, punctuation, capitalization ratio)
  • Relevance of packaging characteristics (bulk size, unit type)

2.2 Solution Strategy

Approach Type: Hybrid (Pretrained LM + Feature Engineering + Cross-Attention)
Core Innovation:
A two-stream architecture that fuses DeBERTa’s [CLS] embedding with engineered feature embeddings via cross-attention, capturing complex relationships between product descriptions and structured metadata.

Dataset Insight:
As the log(price) distribution approximated a Gaussian, training was performed on log(price + 1).
Final predictions were obtained via exp(pred) - 1.
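A minimal sketch of this target transform, assuming NumPy arrays (or pandas Series) of raw prices:

```python
import numpy as np

def encode_target(prices):
    """Map raw prices to the training target: log(price + 1)."""
    return np.log1p(prices)

def decode_predictions(log_preds):
    """Invert the transform at inference time: exp(pred) - 1, floored at 0."""
    return np.clip(np.expm1(log_preds), 0.0, None)
```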

We also evaluated CLIP-based image embeddings, but analysis via UMAP clustering revealed poor consistency of embedding clusters with price values. Hence, we relied solely on text-based modeling.
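For completeness, a hedged sketch of that check, assuming Hugging Face `transformers` for CLIP and `umap-learn` for the projection (the checkpoint name and UMAP hyperparameters are illustrative assumptions, not necessarily what we ran):

```python
import numpy as np
import torch
import umap
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # illustrative CLIP checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

@torch.no_grad()
def clip_image_embeddings(image_paths):
    """Encode product images into CLIP feature vectors."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    return model.get_image_features(**inputs).cpu().numpy()

def project_2d(features: np.ndarray) -> np.ndarray:
    """2-D UMAP projection; points were then colour-coded by price for inspection."""
    return umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(features)
```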

Loss Function:
We used Smooth L1 loss instead of MSE, as it yielded lower SMAPE and more stable training.
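A minimal sketch of the training criterion and the evaluation metric (the `beta` value is an assumption; SMAPE follows the standard symmetric-percentage definition):

```python
import torch
import torch.nn as nn

# Smooth L1 (Huber-style) loss applied to the log-price targets.
criterion = nn.SmoothL1Loss(beta=1.0)

def smape(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """Symmetric Mean Absolute Percentage Error, in percent."""
    denom = (y_true.abs() + y_pred.abs()).clamp(min=1e-8)
    return 100.0 * torch.mean(2.0 * (y_true - y_pred).abs() / denom)
```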


3. Model Architecture

3.1 Architecture Overview

| Stage | Model / Component | SMAPE (Validation) | SMAPE (Unstop) |
| --- | --- | --- | --- |
| First Approach | BERT Base | 49.1 | 48.1 |
| Pretraining | DeBERTa (Regression Task) | 44.2 | 43.2 |
| Final Hybrid Training | DeBERTa + Cross-Attention Fusion | 39.86 | 40.329 |

3.2 Model Components

🧩 Text Processing Pipeline

  • Tokenization: DeBERTa tokenizer
  • Preprocessing: text normalization, truncation, special-character handling
  • Model: DeBERTa-v3-large (hidden size 1024)
  • Pretraining task: price regression
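A minimal sketch of the tokenization step with Hugging Face `transformers` (the cleaning rules and max length are illustrative assumptions):

```python
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")

def clean_text(text: str) -> str:
    """Light normalization: strip control characters and collapse whitespace."""
    text = re.sub(r"[\x00-\x1f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def encode_batch(texts, max_length=256):
    """Tokenize a batch of product descriptions for DeBERTa."""
    return tokenizer(
        [clean_text(t) for t in texts],
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```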

⚙️ Feature Engineering Pipeline

  • Binary: is_organic, is_gourmet, is_gluten_free, is_bulk, special_chars
  • Categorical: unit_type
  • Numeric: value, num_words, num_sentences, uppercase_ratio
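A minimal sketch of deriving these features from raw product text (the regex patterns are illustrative assumptions; `value` and `unit_type` come from parsing quantity strings and are omitted here):

```python
import re

def extract_features(text: str) -> dict:
    """Derive the binary and numeric features listed above from raw product text."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = [c for c in text if c.isalpha()]
    return {
        "is_organic": int(bool(re.search(r"\borganic\b", text, re.I))),
        "is_gourmet": int(bool(re.search(r"\bgourmet\b", text, re.I))),
        "is_gluten_free": int(bool(re.search(r"gluten[\s-]?free", text, re.I))),
        "is_bulk": int(bool(re.search(r"\b(bulk|pack of \d+|case of \d+)\b", text, re.I))),
        "special_chars": int(bool(re.search(r"[^A-Za-z0-9\s.,]", text))),
        "num_words": len(words),
        "num_sentences": len(sentences),
        "uppercase_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
    }
```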

Fusion Strategy:
Concatenated feature embeddings + DeBERTa [CLS] → Cross-Attention → Linear Regression Head
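A hedged PyTorch sketch of the fusion module (head count, projection sizes, and the exact query/key arrangement are assumptions rather than our final configuration):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse the DeBERTa [CLS] embedding with engineered-feature embeddings."""

    def __init__(self, text_dim: int = 1024, feat_dim: int = 16, n_heads: int = 8):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, text_dim)          # lift features to text dim
        self.cross_attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.head = nn.Sequential(                              # linear regression head
            nn.LayerNorm(text_dim),
            nn.Linear(text_dim, 256),
            nn.GELU(),
            nn.Linear(256, 1),
        )

    def forward(self, cls_emb: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # cls_emb: (B, 1024) [CLS] vector; features: (B, feat_dim) engineered features.
        query = cls_emb.unsqueeze(1)                            # (B, 1, text_dim)
        key_value = self.feat_proj(features).unsqueeze(1)       # (B, 1, text_dim)
        fused, _ = self.cross_attn(query, key_value, key_value)
        return self.head(fused.squeeze(1)).squeeze(-1)          # (B,) log-price prediction
```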


❌ Other (Failed) Approaches

  • XGBoost / CatBoost on sparse feature datasets — SMAPE ≈ 56
  • Reinforcement Learning fine-tuning: used SMAPE as a reward on a frozen DeBERTa backbone; limited success due to differentiability constraints.

4. Model Performance

| Model | Validation SMAPE | Notes |
| --- | --- | --- |
| Pretrained DeBERTa Baseline | 43.4 | Text-only |
| Final Hybrid Model | 39.83 | 5% hold-out |
| Challenge Submission | 40.329 | On Unstop |

5. Conclusion

Our hybrid model successfully demonstrates that structured features can enhance the predictive power of pretrained language models for complex pricing regression tasks.
The cross-attention fusion effectively learns the interplay between semantic understanding and explicit product attributes, achieving scalable and interpretable performance gains for real-world e-commerce applications.


📁 Repository Structure

Amazon-ML-Challenge-2025-3rd/
│
├── Data/
│   ├── preprocessed_train.csv
│   ├── preprocessed_test.csv
│
├── Preprocess.py         # Data preprocessing
├── Pretraining.py        # DeBERTa pretraining on regression
├── Main_Training.py      # Final hybrid model training
├── Inference.py          # Inference and submission CSV generation
├── README.md

🔗 Final Model Checkpoint

You can access the final trained model (DeBERTa + Cross-Attention Fusion) used for the final submission here:
👉 Final_Model.pt (Google Drive)

