I turn raw signal into served decisions, and I ship the system around the model.
⭐ Hundreds of stars across 30+ repositories · 800+ followers. See the full breakdown in GitHub Analytics below.
A data scientist who ships. I don't stop at the notebook: I take models from exploration → evaluation → a system that serves them in production. My work lives at the intersection of applied machine learning and real-world, messy data: noisy scanned documents, multilingual text, fiscal records, regulated decisions.
I also architect enterprise AI & fiscal systems under my engineering identity, @MohamedKhattat. This account is the research & data-science lab.
| Domain | What I actually build |
|---|---|
| 🖼️ Computer Vision / OCR | Document-AI pipelines on real Tunisian ID & fiscal docs: CNN classification, projection-profile deskew sweeps, glare/label removal, multi-engine Arabic OCR with confidence-scored fallback, JSON extraction. |
| 🗣️ NLP / NLU | Named-entity recognition + fuzzy entity resolution, semantic invoice checkers, intent classification across EN / FR / Tunisian (Tounsi), n-gram neural text correction. |
| 📊 Classical ML | Credit-default & tax-risk scoring: RobustScaler → PCA → RFE/RFECV feature selection → GridSearchCV cross-validated model selection (honest metrics, imbalance-aware). |
| 🕸️ Semantic AI | Ontology-driven fraud detection: OWL + SHACL + SPARQL over knowledge graphs; KAG (knowledge-augmented generation). |
| 🤖 Agentic AI | LLM pipelines with tool-use & MCP, governed by semantic rules; research toward AI that reasons inside a domain, not just predicts. |
Featured work, by the numbers (live ⭐ counts)
| Project | Stars | One line |
|---|---|---|
| default-payment ML | | Credit-card default prediction: full, cross-validated ML pipeline with an honest read of the class imbalance. |
| DataScienceProject | | CV+OCR pipeline reading Arabic fields from Tunisian CIN / carte grise → structured JSON → auto-filled contract. |
| JavaFX-Essentials | | A hands-on JavaFX learning lab: FXML/MVC, custom TableView cells, JDBC CRUD. |
| ETL-Django | | OCR→ETL pipeline + a 33-intent EN/FR/Tounsi chatbot + DRF APIs + analytics dashboards. |
| Recommander-system-Django | | Content-based movie recommender over a 23k-title catalog. |
Reading Arabic fields off real, messy government documents, and turning them into a signed contract. Off-the-shelf OCR can't do this. I engineered the pipeline that can. DataScienceProject →
The problem. Tunisian national ID cards (CIN) and vehicle registrations (carte grise) are photographed in the wild: skewed, glare-streaked, low-resolution, with right-to-left Arabic script that mainstream OCR mangles. The goal: extract structured fields reliably enough to auto-fill a legal contract.
Why it's genuinely hard 🧩
- Arabic OCR is far less mature than Latin-script OCR; reshaping + bidirectional handling is mandatory, not optional.
- Real photos are rotated and warped; OCR accuracy collapses on un-deskewed input.
- Glare and pre-printed labels create false text regions that poison naïve extraction.
The architecture: stage by stage, each chosen deliberately 🏗️
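The orientation problem in the list above is commonly attacked with a projection-profile sweep. Below is a minimal sketch under stated assumptions, not the project's actual code: ink pixels are assumed to be already extracted as `(x, y)` coordinates, the score is the sum of squared row counts (one common choice; the project's scoring function is not stated), and `profile_score`, `estimate_skew` and `synthetic_lines` are hypothetical names.

```python
import math


def profile_score(points, angle_deg):
    """Sharpness of the horizontal projection profile after rotating by angle_deg.

    Assumed scoring rule: sum of squared row counts. It is largest when the
    ink pixels collapse into a few dense rows, i.e. when text lines are level.
    """
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    rows = {}
    for x, y in points:
        # Only the rotated y-coordinate matters for a horizontal profile.
        row = round(x * sin_t + y * cos_t)
        rows[row] = rows.get(row, 0) + 1
    return sum(count * count for count in rows.values())


def estimate_skew(points, lo=-90.0, hi=90.0, step=0.1):
    """Sweep candidate angles and return the correcting rotation (degrees)."""
    n_steps = round((hi - lo) / step)
    best_angle, best_score = lo, -1
    for i in range(n_steps + 1):
        angle = lo + i * step
        score = profile_score(points, angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_lines(skew_deg, n_lines=3, spacing=20, width=200):
    """Toy 'page': n_lines horizontal text lines, rotated by skew_deg."""
    theta = math.radians(skew_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    return [
        (x * cos_t - y * sin_t, x * sin_t + y * cos_t)
        for y in range(0, n_lines * spacing, spacing)
        for x in range(width)
    ]


if __name__ == "__main__":
    page = synthetic_lines(5.0)
    # A narrow, coarse sweep keeps the toy example fast.
    print(estimate_skew(page, lo=-10.0, hi=10.0, step=0.5))
```

The returned value is the rotation to apply, so a page skewed by +5° should come back close to -5°. A full sweep at 0.1° resolution costs one pass over the ink pixels per candidate angle, which is why real pipelines usually downsample first.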
| # | Stage | What I built · why |
|---|---|---|
| 1 | Classify | A CNN (100×100×3 → 3 Conv blocks 16/32/64 → Dense, 30 epochs, ~19-layer augmentation stack) routes each document by type before extraction. |
| 2 | Orient & deskew | A projection-profile score swept from −90° to +90° in 0.1° steps (+ PCA + a Haar-cascade face anchor for CIN), because OCR is acutely orientation-sensitive. |
| 3 | Clean | Glare + label removal so only real text survives; carte-grise field localization via a dual-range red-HSV mask + K-means dominant-colour check. |
| 4 | Read (Arabic) | A layered OCR fallback chain (OCR.space ara → ArabicOCR → EasyOCR ar,en) with regex field validators (date, serial) and arabic-reshaper + python-bidi for correct RTL rendering. |
| 5 | Serve | 6 extracted fields → UTF-8 JSON → Pillow auto-fills the contract template. |
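The layered fallback in stage 4 can be sketched as follows. This is an illustration under assumptions, not the project's code: each engine is modelled as a plain callable returning `(text, confidence)`, so no real OCR API is called; the regex patterns, the `min_conf` threshold and the names `read_field`, `VALIDATORS` and `Engine` are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# An engine takes image bytes and returns (text, confidence in [0, 1]).
# Real engines would be thin wrappers around each OCR backend (assumption).
Engine = Callable[[bytes], Tuple[str, float]]

# Hypothetical field formats: the real document formats are not given here.
VALIDATORS = {
    "date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "serial": re.compile(r"\d{8}"),
}


def read_field(
    image: bytes,
    field: str,
    engines: List[Tuple[str, Engine]],
    min_conf: float = 0.6,
) -> Optional[Dict[str, object]]:
    """Try engines in order; accept the first confident, well-formed answer."""
    pattern = VALIDATORS.get(field)
    for name, engine in engines:
        try:
            text, conf = engine(image)
        except Exception:
            # One failing backend (network, model load) should not stop the chain.
            continue
        text = text.strip()
        if conf < min_conf:
            continue  # too uncertain, fall through to the next engine
        if pattern is not None and not pattern.fullmatch(text):
            continue  # confident but malformed, fall through as well
        return {"field": field, "value": text, "engine": name, "confidence": conf}
    return None  # every engine failed: leave the field for manual review


if __name__ == "__main__":
    # Stub engines standing in for the three backends.
    chain = [
        ("primary", lambda img: ("12/3x/2020", 0.90)),    # malformed date
        ("secondary", lambda img: ("12/03/2020", 0.30)),  # low confidence
        ("tertiary", lambda img: ("12/03/2020", 0.80)),   # accepted
    ]
    print(read_field(b"", "date", chain))
```

Validating before accepting means a confident but malformed read still falls through to the next engine, which is the point of pairing the chain with field validators.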
What it demonstrates. End-to-end computer-vision engineering on a genuinely under-served problem (Arabic document AI), with a deliberate, defensible design decision at every stage; this is a pipeline, not a notebook.
Honest status: no committed quantitative eval set yet; the rigorous next step is per-field CER / exact-match on a held-out labelled corpus. (I'd rather state that than fake a number.)
Python · OpenCV · TensorFlow/Keras · EasyOCR · ArabicOCR · spaCy · NumPy
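One of those stage-level decisions, the dual-range red mask in the Clean stage, exists because red sits at both ends of the hue circle, so a single hue interval misses half of it. The sketch below shows the idea with the standard-library `colorsys` module instead of OpenCV; the band limits, the saturation and value floors, and the names `is_red` and `red_mask` are illustrative assumptions rather than the project's values.

```python
import colorsys


def is_red(r, g, b,
           low_band=(0.0, 10.0), high_band=(170.0, 180.0),
           min_sat=0.4, min_val=0.3):
    """True if an 8-bit RGB pixel falls in either red hue band.

    Hue is expressed on a 0-180 half-degree scale (the convention OpenCV
    uses for 8-bit images). All thresholds are assumed values.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue = h * 180.0
    in_low = low_band[0] <= hue <= low_band[1]     # reds just above hue 0
    in_high = high_band[0] <= hue <= high_band[1]  # reds just below the wrap
    # Saturation/value floors keep greys and near-black pixels out of the mask.
    return (in_low or in_high) and s >= min_sat and v >= min_val


def red_mask(pixels):
    """Boolean mask for an image given as rows of (r, g, b) tuples."""
    return [[is_red(*px) for px in row] for row in pixels]


if __name__ == "__main__":
    image = [
        [(220, 30, 40), (230, 60, 30)],    # two reds, one on each side of the wrap
        [(30, 60, 220), (128, 128, 128)],  # blue and grey
    ]
    print(red_mask(image))
```

With OpenCV the same idea is usually written as two `cv2.inRange` calls on an HSV image whose results are OR-ed together.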
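For reference, the per-field CER / exact-match evaluation named above could look like this minimal sketch. It assumes predictions and references are lists of `field -> string` dicts; `levenshtein`, `cer` and `per_field_report` are hypothetical helper names, and no results are implied.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(pred: str, ref: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not pred else 1.0
    return levenshtein(pred, ref) / len(ref)


def per_field_report(predictions, references):
    """Aggregate CER and exact-match rate per field over a labelled corpus."""
    totals = {}
    for pred, ref in zip(predictions, references):
        for field, truth in ref.items():
            guess = pred.get(field, "")  # a missing field counts as empty
            stats = totals.setdefault(
                field, {"edits": 0, "chars": 0, "exact": 0, "n": 0})
            stats["edits"] += levenshtein(guess, truth)
            stats["chars"] += len(truth)
            stats["exact"] += int(guess == truth)
            stats["n"] += 1
    return {
        field: {
            "cer": s["edits"] / s["chars"] if s["chars"] else 0.0,
            "exact_match": s["exact"] / s["n"],
        }
        for field, s in totals.items()
    }


if __name__ == "__main__":
    preds = [{"date": "12/03/2020", "serial": "12345678"}]
    refs = [{"date": "12/08/2020", "serial": "12345678"}]
    print(per_field_report(preds, refs))
```

Python strings compare by code point, so for Arabic text it is worth normalizing both sides (for example with `unicodedata.normalize`) before scoring; otherwise visually identical strings can count as errors.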
Data Science & ML
Semantic & Agentic AI
Engineering & Delivery
GSoC 2026, Accord Project (Linux Foundation): agentic workflow + LLM-based template-logic executor.
Most data scientists hand off a notebook. I hand off a working system. I bring the rigor of regulated, zero-failure engineering to data science: honest metrics, reproducible pipelines, and models that survive contact with real, messy data.




