MuhamedHabib/README.md
Mohamed Habib Khattat

I turn raw signal into served decisions, and I ship the system around the model.

⭐ Hundreds of stars across 30+ repositories · 800+ followers (see the full breakdown in GitHub Analytics below).


🧭 Who I Am

A data scientist who ships. I don't stop at the notebook: I take models from exploration → evaluation → a system that serves them in production. My work lives at the intersection of applied machine learning and real-world, messy data: noisy scanned documents, multilingual text, fiscal records, regulated decisions.

I also architect enterprise AI & fiscal systems under my engineering identity → @MohamedKhattat. This account is the research & data-science lab.


🔬 Focus Areas

| Domain | What I actually build |
| --- | --- |
| 🖼️ Computer Vision / OCR | Document-AI pipelines on real Tunisian ID & fiscal docs: CNN classification, projection-profile deskew sweeps, glare/label removal, multi-engine Arabic OCR with confidence-scored fallback, JSON extraction. |
| 🗣️ NLP / NLU | Named-entity recognition + fuzzy entity resolution, semantic invoice checkers, intent classification across EN / FR / Tunisian (Tounsi), n-gram neural text correction. |
| 📈 Classical ML | Credit-default & tax-risk scoring: RobustScaler → PCA → RFE/RFECV feature selection → GridSearchCV cross-validated model selection (honest metrics, imbalance-aware). |
| 🕸️ Semantic AI | Ontology-driven fraud detection: OWL + SHACL + SPARQL over knowledge graphs; KAG (knowledge-augmented generation). |
| 🤖 Agentic AI | LLM pipelines with tool-use & MCP, governed by semantic rules; research toward AI that reasons inside a domain, not just predicts. |
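To make the "fuzzy entity resolution" item in the NLP / NLU row concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the code from any repository listed here: the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, the function names, and the vendor list are all assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(name.lower().split())


def resolve_entity(mention: str, canonical_names: List[str],
                   threshold: float = 0.8) -> Optional[str]:
    """Return the canonical name most similar to `mention`, or None if nothing reaches `threshold`."""
    best_name, best_score = None, 0.0
    for candidate in canonical_names:
        score = SequenceMatcher(None, normalize(mention), normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


# Hypothetical vendor list, for illustration only.
VENDORS = ["Acme Industries", "Globex Corporation", "Initech"]

print(resolve_entity("ACME  Industreis", VENDORS))   # misspelled mention
print(resolve_entity("Unknown Supplier", VENDORS))   # no close match
```

Returning `None` below the threshold keeps unmatched mentions visible for review instead of silently attaching them to the nearest name.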

⭐ Featured Repositories

Featured work, by the numbers (live ⭐ counts)

| Project | Stars | One line |
| --- | --- | --- |
| default-payment ML | (live badge) | Credit-card default prediction: a full, cross-validated ML pipeline with an honest read of the class imbalance. |
| DataScienceProject | (live badge) | CV+OCR pipeline reading Arabic fields from Tunisian CIN / carte grise → structured JSON → auto-filled contract. |
| JavaFX-Essentials | (live badge) | A hands-on JavaFX learning lab: FXML/MVC, custom TableView cells, JDBC CRUD. |
| ETL-Django | (live badge) | OCR→ETL pipeline + a 33-intent EN/FR/Tounsi chatbot + DRF APIs + analytics dashboards. |
| Recommander-system-Django | (live badge) | Content-based movie recommender over a 23k-title catalog. |
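The "honest read of the class imbalance" in the default-payment row can be shown with a small sketch: on an imbalanced target, a model that never predicts the minority class still scores a high accuracy, which is why precision and recall are reported alongside it. The labels below are synthetic and the helper names are assumptions; none of the numbers come from the actual project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def report(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Synthetic labels (illustration only): 22 positives out of 100.
y_true = [1] * 22 + [0] * 78
always_negative = [0] * 100          # a "model" that never predicts a default

print(report(y_true, always_negative))
```

Here accuracy is 0.78 while recall on the positive class is 0.0, so accuracy alone would hide that the model finds no defaults at all.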

🏆 Project Spotlight: Tunisian Document-AI

Reading Arabic fields off real, messy government documents, and turning them into an auto-filled contract. Off-the-shelf OCR can't do this. I engineered the pipeline that can.   📂 DataScienceProject →

The problem. Tunisian national ID cards (CIN) and vehicle registrations (carte grise) are photographed in the wild: skewed, glare-streaked, low-resolution, with right-to-left Arabic script that mainstream OCR mangles. The goal: extract structured fields reliably enough to auto-fill a legal contract.

Why it's genuinely hard 🧩

  • Arabic OCR is far less mature than Latin OCR; reshaping and bidirectional handling are mandatory, not optional.
  • Real photos are rotated and warped; OCR accuracy collapses on un-deskewed input.
  • Glare and pre-printed labels create false text regions that poison naïve extraction.
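The rotation problem in the list above is commonly handled with a projection-profile search: rotate the ink pixels by each candidate angle and keep the angle where the ink concentrates most sharply into rows. The sketch below is a pure-Python illustration on synthetic coordinates. The scoring function (sum of squared row counts), the function names, and the narrow ±10° range with 0.5° steps are assumptions chosen to keep the example fast, not the project's actual implementation.

```python
import math


def profile_score(points, angle_deg):
    """Score how sharply the ink concentrates into rows after rotating by angle_deg.

    Assumption: the score is the sum of squared row counts.
    """
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    rows = {}
    for x, y in points:
        row = round(x * sin_t + y * cos_t)   # row index of the rotated pixel
        rows[row] = rows.get(row, 0) + 1
    return sum(count * count for count in rows.values())


def estimate_correction(points, max_angle=10.0, step=0.5):
    """Return the rotation (degrees) in [-max_angle, max_angle] with the best score."""
    n_steps = int(round(2 * max_angle / step))
    angles = [-max_angle + i * step for i in range(n_steps + 1)]
    return max(angles, key=lambda a: profile_score(points, a))


def synthetic_page(skew_deg, n_lines=3, line_length=100, line_gap=10):
    """Build ink coordinates for horizontal text lines, then skew them by skew_deg."""
    theta = math.radians(skew_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    points = []
    for line in range(n_lines):
        y = line * line_gap
        for x in range(line_length):
            points.append((x * cos_t - y * sin_t, x * sin_t + y * cos_t))
    return points


page = synthetic_page(skew_deg=5.0)
correction = estimate_correction(page)
print(f"rotate by {correction:+.1f} degrees to level the text")
```

A finer step or a wider range only changes the `angles` list; the cost grows linearly with the number of candidate angles, which is why a coarse pass followed by a fine pass is a common refinement.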

The architecture: stage by stage, each chosen deliberately 🏗️

| # | Stage | What I built · why |
| --- | --- | --- |
| 1 | Classify | A CNN (100×100×3 → 3 Conv blocks 16/32/64 → Dense, 30 epochs, ~19-layer augmentation stack) routes each document by type before extraction. |
| 2 | Orient & deskew | A projection-profile score swept −90°→+90° at 0.1° steps (+ PCA + a Haar-cascade face anchor for CIN), because OCR is acutely orientation-sensitive. |
| 3 | Clean | Glare + label removal so only real text survives; carte-grise field localization via a dual-range red-HSV mask + K-means dominant-colour check. |
| 4 | Read (Arabic) | A layered OCR fallback chain (OCR.space `ara` → ArabicOCR → EasyOCR `ar,en`) with regex field validators (date, serial) and arabic-reshaper + python-bidi for correct RTL rendering. |
| 5 | Serve | 6 extracted fields → UTF-8 JSON → Pillow auto-fills the contract template. |
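Stage 4's layered fallback with regex validation, and stage 5's UTF-8 JSON output, can be sketched as follows. The real OCR engines are replaced by stub functions so the example runs without network access or extra packages. The engine interface (a callable returning text plus a confidence), the 0.6 confidence floor, the `DD/MM/YYYY` date pattern, and all names are assumptions for illustration only.

```python
import json
import re
from typing import Callable, List, Optional, Tuple

# Each engine maps image bytes to (text, confidence in [0, 1]). Hypothetical interface.
Engine = Callable[[bytes], Tuple[str, float]]

# Assumed field format (DD/MM/YYYY); real documents may use another layout.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def read_field(image: bytes, engines: List[Tuple[str, Engine]],
               validator: re.Pattern, min_confidence: float = 0.6) -> Optional[dict]:
    """Try engines in order; accept the first confident result that passes validation."""
    for name, engine in engines:
        try:
            text, confidence = engine(image)
        except Exception:
            continue                      # engine unavailable: fall through to the next one
        text = text.strip()
        if confidence >= min_confidence and validator.fullmatch(text):
            return {"engine": name, "text": text, "confidence": confidence}
    return None                           # every engine failed or was rejected


# Stub engines standing in for real OCR back-ends.
def primary_stub(image: bytes) -> Tuple[str, float]:
    raise ConnectionError("service unreachable")


def secondary_stub(image: bytes) -> Tuple[str, float]:
    return "O1/O2/2O2O", 0.90             # confident, but letter O instead of zero


def tertiary_stub(image: bytes) -> Tuple[str, float]:
    return "01/02/2020", 0.72


ENGINES = [("primary", primary_stub), ("secondary", secondary_stub), ("tertiary", tertiary_stub)]
result = read_field(b"fake-image-bytes", ENGINES, DATE_RE)

# ensure_ascii=False keeps non-ASCII (e.g. Arabic) text readable in the JSON output.
print(json.dumps(result, ensure_ascii=False))
```

Validating before accepting means a confident but malformed reading (the second stub) does not short-circuit the chain; the next engine still gets a chance.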

What it demonstrates. End-to-end computer-vision engineering on a genuinely under-served problem (Arabic document AI), with a deliberate, defensible design decision at every stage. This is a pipeline, not a notebook.

Honest status ✅: no committed quantitative eval set yet; the rigorous next step is per-field CER / exact-match on a held-out labelled corpus. (I'd rather state that than fake a number.)
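As a sketch of what that next step could look like, character error rate (CER) is the edit distance between the OCR output and the reference, divided by the reference length, and exact-match is the share of fields read with no error at all. The code below aggregates both per field. The sample tuples are invented purely to exercise the functions; they are not evaluation results.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def evaluate(samples):
    """samples: iterable of (field_name, reference, hypothesis). Returns per-field metrics."""
    totals = {}
    for field, reference, hypothesis in samples:
        stats = totals.setdefault(field, {"edits": 0, "chars": 0, "exact": 0, "n": 0})
        stats["edits"] += levenshtein(reference, hypothesis)
        stats["chars"] += len(reference)
        stats["exact"] += int(reference == hypothesis)
        stats["n"] += 1
    return {
        field: {
            "cer": s["edits"] / s["chars"] if s["chars"] else 0.0,
            "exact_match": s["exact"] / s["n"],
        }
        for field, s in totals.items()
    }


# Made-up samples, for illustration only.
SAMPLES = [
    ("serial", "12345678", "12345678"),
    ("serial", "87654321", "87654821"),    # one substituted digit
    ("date", "01/02/2020", "01/02/2020"),
]
print(evaluate(SAMPLES))
```

Pooling edits and characters per field (rather than averaging per-sample CER) keeps short fields from dominating the score.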

Python · OpenCV · TensorFlow/Keras · EasyOCR · ArabicOCR · spaCy · NumPy


🛠️ Toolbox

Data Science & ML

Semantic & Agentic AI

Engineering & Delivery


📊 GitHub Analytics

[Live widgets: stats, streak, top languages, trophies, activity graph]

🌍 GSoC 2026, Accord Project (Linux Foundation): agentic workflow + LLM-based template-logic executor.


💡 The Case, Plainly

Most data scientists hand off a notebook. I hand off a working system. I bring the rigor of regulated, zero-failure engineering to data science: honest metrics, reproducible pipelines, and models that survive contact with real, messy data.


📬 Let's turn your data into decisions that ship.

Pinned

  1. default-payment-next-month-ML-system (Public)

    Jupyter Notebook · 25 stars · 1 fork

  2. JavaFX-Essentials (Public)

    Java · 19 stars

  3. ebooks (Public)

    Forked from bosesaikat/ebooks-1

    18 stars · 4 forks

  4. All-you-need-on-symfony-bundles-crud-reverse-engineering-api-consumption (Public)

    first

    CSS · 20 stars

  5. DataScienceProject (Public)

    Projet-PI-4DS2

    Jupyter Notebook · 17 stars

  6. pollsapi (Public)

    CircleCI-complete-Django-APIs-project

    Python · 15 stars · 2 forks