Mohamed Habib Khattat (@MuhamedHabib)

I turn raw signal into served decisions — and I ship the system around the model.

⭐ Hundreds of stars across 30+ repositories · 800+ followers. See the full breakdown in GitHub Analytics below.

🧭 Who I Am

A data scientist who ships. I don't stop at the notebook — I take models from exploration → evaluation → a system that serves them in production. My work lives at the intersection of applied machine learning and real-world, messy data: noisy scanned documents, multilingual text, fiscal records, regulated decisions.

I also architect enterprise AI & fiscal systems under my engineering identity → @MohamedKhattat. This account is the research & data-science lab.

🔬 Focus Areas

Domain	What I actually build
🖼️ Computer Vision / OCR	Document-AI pipelines on real Tunisian ID & fiscal docs — CNN classification, projection-profile deskew sweeps, glare/label removal, multi-engine Arabic OCR with confidence-scored fallback, JSON extraction.
🗣️ NLP / NLU	Named-entity recognition + fuzzy entity resolution, semantic invoice checkers, intent classification across EN / FR / Tunisian (Tounsi), n-gram neural text correction.
📈 Classical ML	Credit-default & tax-risk scoring — `RobustScaler` → `PCA` → RFE/RFECV feature selection → GridSearchCV cross-validated model selection (honest metrics, imbalance-aware).
🕸️ Semantic AI	Ontology-driven fraud detection — OWL + SHACL + SPARQL over knowledge graphs; KAG (knowledge-augmented generation).
🤖 Agentic AI	LLM pipelines with tool-use & MCP, governed by semantic rules — research toward AI that reasons inside a domain, not just predicts.
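
To make the "fuzzy entity resolution" entry concrete, here is a minimal sketch of the idea: normalise a noisy mention, score it against a list of canonical names, and refuse to match below a threshold. It uses only the standard library (`difflib`, `unicodedata`) as a stand-in; the function names, the threshold of 0.8, and the supplier list are all hypothetical, not taken from my repositories.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def resolve(mention: str, canonical: List[str],
            threshold: float = 0.8) -> Tuple[Optional[str], float]:
    """Return (best_match, score); best_match is None below the threshold."""
    target = normalize(mention)
    best, best_score = None, 0.0
    for candidate in canonical:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score < threshold:
        return None, best_score
    return best, best_score


# Hypothetical canonical names, for illustration only.
SUPPLIERS = ["Société El Amal", "Atlas Transport SARL", "Carthage Fournitures"]

match, score = resolve("societe el amel", SUPPLIERS)
print(match, round(score, 3))
```

Returning `None` below the threshold is the important design choice: on fiscal records a wrong match is usually worse than an unresolved one that goes to manual review.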

⭐ Featured Repositories

Featured work, by the numbers (live ⭐ counts)

Project	Stars	One line
default-payment ML		Credit-card default prediction — full, cross-validated ML pipeline with an honest read of the class imbalance.
DataScienceProject		CV+OCR pipeline reading Arabic fields from Tunisian CIN / carte grise → structured JSON → auto-filled contract.
JavaFX-Essentials		A hands-on JavaFX learning lab — FXML/MVC, custom TableView cells, JDBC CRUD.
ETL-Django		OCR→ETL pipeline + a 33-intent EN/FR/Tounsi chatbot + DRF APIs + analytics dashboards.
Recommander-system-Django		Content-based movie recommender over a 23k-title catalog.

🏆 Project Spotlight — Tunisian Document-AI

Reading Arabic fields off real, messy government documents — and turning them into a signed contract. Off-the-shelf OCR can't do this. I engineered the pipeline that can. 📂 DataScienceProject →

The problem. Tunisian national ID cards (CIN) and vehicle registrations (carte grise) are photographed in the wild — skewed, glare-streaked, low-resolution, with right-to-left Arabic script that mainstream OCR mangles. The goal: extract structured fields reliably enough to auto-fill a legal contract.

Why it's genuinely hard 🧩

Arabic OCR is far less mature than Latin-script OCR; reshaping and bidirectional handling are mandatory, not optional.
Real photos are rotated and warped; OCR accuracy collapses on un-deskewed input.
Glare and pre-printed labels create false text regions that poison naïve extraction.
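
The orientation point can be illustrated with a projection-profile sweep, one common way to estimate skew: rotate the foreground pixels by each candidate angle, histogram them by row, and keep the angle whose row profile is sharpest. The sketch below is pure Python on a synthetic page, not the repository's code; the scoring function (sum of squared row counts), the ±10° range, and the 0.5° step are assumptions chosen to keep it small and fast.

```python
import math


def profile_score(points, angle_deg):
    """Rotate foreground pixels by angle_deg and score the row histogram.

    Horizontal text lines concentrate pixels in few rows, so the sum of
    squared row counts peaks when the lines are level.
    """
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    rows = {}
    for x, y in points:
        row = round(x * sin_a + y * cos_a)  # y-coordinate after rotation
        rows[row] = rows.get(row, 0) + 1
    return sum(count * count for count in rows.values())


def estimate_correction(points, max_angle=10.0, step=0.5):
    """Sweep candidate angles and return the one with the highest score."""
    n_steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda angle: profile_score(points, angle))


def synthetic_page(skew_deg):
    """Five horizontal 'text lines' of pixels, rotated by skew_deg."""
    a = math.radians(skew_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    return [
        (x * cos_a - y * sin_a, x * sin_a + y * cos_a)
        for y in range(0, 50, 10)
        for x in range(100)
    ]


page = synthetic_page(skew_deg=5.0)
correction = estimate_correction(page)
print(f"rotate by {correction:+.1f} degrees to deskew")
```

On a real photo the `points` would come from a binarised image, and a finer step over a wider range trades runtime for precision.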

The architecture — stage by stage, each chosen deliberately 🏗️

#	Stage	What I built · why
1	Classify	A CNN (`100×100×3` → 3 Conv blocks `16/32/64` → Dense, 30 epochs, ~19-layer augmentation stack) routes each document by type before extraction.
2	Orient & deskew	A projection-profile score swept −90°→+90° at 0.1° steps (+ PCA + a Haar-cascade face anchor for CIN) — because OCR is acutely orientation-sensitive.
3	Clean	Glare + label removal so only real text survives; carte-grise field localization via a dual-range red-HSV mask + K-means dominant-colour check.
4	Read (Arabic)	A layered OCR fallback chain — OCR.space `ara` → ArabicOCR → EasyOCR `ar,en` — with regex field validators (date, serial) and `arabic-reshaper` + `python-bidi` for correct RTL rendering.
5	Serve	6 extracted fields → UTF-8 JSON → Pillow auto-fills the contract template.
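
Stages 4 and 5 of the table can be sketched as an ordered fallback chain: try each engine in turn, accept the first result that is both confident and passes the field's regex, then serialise to UTF-8 JSON. This is an illustration under stated assumptions, not the pipeline itself: the engines are stand-in functions so the sketch runs without any OCR dependency, and the field patterns, the `read_field` helper, and the 0.6 confidence floor are hypothetical.

```python
import json
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical field formats; real CIN / carte-grise layouts may differ.
VALIDATORS = {
    "date": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
    "serial": re.compile(r"^\d{8}$"),
}

# An engine takes image bytes and returns (text, confidence in [0, 1]).
Engine = Callable[[bytes], Tuple[str, float]]


def read_field(image: bytes, field: str, engines: List[Tuple[str, Engine]],
               min_confidence: float = 0.6) -> Optional[dict]:
    """Try engines in order; accept the first confident, well-formed result."""
    pattern = VALIDATORS[field]
    for name, engine in engines:
        try:
            text, confidence = engine(image)
        except Exception:
            continue  # a failing engine hands over to the next one
        text = text.strip()
        if confidence >= min_confidence and pattern.match(text):
            return {"value": text, "engine": name, "confidence": confidence}
    return None  # nothing passed validation: flag for manual review


# Stand-in engines that simulate the three typical outcomes.
def flaky_engine(image: bytes) -> Tuple[str, float]:
    raise ConnectionError("remote OCR unavailable")


def noisy_engine(image: bytes) -> Tuple[str, float]:
    return "12/O3/2021", 0.9  # letter O instead of zero: fails the regex


def local_engine(image: bytes) -> Tuple[str, float]:
    return " 12/03/2021 ", 0.75


chain = [("remote", flaky_engine), ("second", noisy_engine), ("local", local_engine)]
result = read_field(b"<cropped field image>", "date", chain)
print(json.dumps(result, ensure_ascii=False))
```

Validating with a regex before accepting a result matters more than the confidence score alone: the second stand-in engine is highly confident and still wrong.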

What it demonstrates. End-to-end computer-vision engineering on a genuinely under-served problem (Arabic document AI), with a deliberate, defensible design decision at every stage — this is a pipeline, not a notebook.

Honest status ✅ — no committed quantitative eval set yet; the rigorous next step is per-field CER / exact-match on a held-out labelled corpus. (I'd rather state that than fake a number.)
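
For reference, the evaluation named above is cheap to compute once labels exist. A minimal sketch of per-field CER (total edit distance divided by total reference characters) and exact-match rate follows; the `evaluate` helper and the sample records are made up purely to exercise the functions and are not results from the pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via the classic two-row dynamic programme."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def evaluate(references, predictions):
    """Per-field CER and exact-match rate over paired labelled documents."""
    report = {}
    for field in references[0].keys():
        edits = chars = exact = 0
        for ref_doc, pred_doc in zip(references, predictions):
            ref, hyp = ref_doc[field], pred_doc.get(field, "")
            edits += edit_distance(ref, hyp)
            chars += len(ref)
            exact += ref == hyp
        report[field] = {
            "cer": edits / chars if chars else 0.0,
            "exact_match": exact / len(references),
        }
    return report


# Made-up labels and predictions, not measurements.
refs = [{"serial": "12345678", "date": "12/03/2021"},
        {"serial": "87654321", "date": "01/01/2020"}]
preds = [{"serial": "12345678", "date": "12/08/2021"},
         {"serial": "8765432", "date": "01/01/2020"}]
report = evaluate(refs, preds)
print(report)
```

Reporting both metrics per field is deliberate: CER shows how close the OCR gets, while exact match is what decides whether a contract field can be auto-filled without review.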

Python · OpenCV · TensorFlow/Keras · EasyOCR · ArabicOCR · spaCy · NumPy

🛠️ Toolbox

Data Science & ML

Semantic & Agentic AI

Engineering & Delivery

📊 GitHub Analytics

🌍 GSoC 2026 — Accord Project (Linux Foundation): agentic workflow + LLM-based template-logic executor.

💡 The Case, Plainly

Most data scientists hand off a notebook. I hand off a working system. I bring the rigor of regulated, zero-failure engineering to data science — honest metrics, reproducible pipelines, and models that survive contact with real, messy data.
