QSAC — Quran Semantic Annotation Corpus
A fully tagged dataset of all 6,236 Quranic verses mapped to a 3-level semantic ontology (domains, categories, tags).

QSAC (Quran Semantic Annotation Corpus) is a multi-level semantic tagging dataset covering every verse (ayah) of the Quran.
Each of the 6,236 verses is annotated with 1–5 thematic tags drawn from a structured ontology containing:
- 18 domains
- 70 categories
- 338 fine-grained tags
This dataset enables concept-level search, semantic classification, RAG pipelines, and Islamic AI applications by providing machine-readable labels beyond simple keyword matching.
What This Dataset Includes
- Surah
- Ayah
- Arabic text
- English (Saheeh International)
- Pipe-delimited semantic tags
- Full ontology (Domains → Categories → Tags → Keyword sets)
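
As a minimal sketch of how these fields might be read, the snippet below parses one row and splits its pipe-delimited tag string into a list. The CSV layout, the column names (`surah`, `ayah`, `arabic`, `english`, `tags`), and the tag values are all assumptions made for illustration; check the actual files for the real names.

```python
import csv
import io

# Hypothetical sample row: the column names and tag values are placeholders,
# not taken from the actual QSAC files.
SAMPLE_CSV = """surah,ayah,arabic,english,tags
1,1,<arabic text>,<english text>,tag_one|tag_two|tag_three
"""

def parse_tags(raw: str) -> list[str]:
    """Split a pipe-delimited tag string, trimming whitespace and dropping empties."""
    return [t.strip() for t in raw.split("|") if t.strip()]

def load_verses(handle) -> list[dict]:
    """Read verse rows from a CSV file object and convert field types."""
    verses = []
    for row in csv.DictReader(handle):
        verses.append({
            "surah": int(row["surah"]),
            "ayah": int(row["ayah"]),
            "arabic": row["arabic"],
            "english": row["english"],
            "tags": parse_tags(row["tags"]),
        })
    return verses

verses = load_verses(io.StringIO(SAMPLE_CSV))
print(verses[0]["tags"])
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, for example `open(path, encoding="utf-8", newline="")`, so that the Arabic text is decoded correctly.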
Why QSAC Was Created
Traditional Quran datasets provide text but lack semantic structure.
QSAC fills this gap by providing:
- A hierarchical semantic ontology
- Tag definitions with primary & secondary keywords
- Consistent annotation of every verse
This allows developers and researchers to build:
- Quran semantic search engines
- Islamic chatbots grounded in citations
- RAG pipelines for Islamic knowledge
- Multi-label classifiers for Quranic themes
- Thematic study tools for education
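
One way to work with the hierarchy described above is to hold it as nested mappings and build a reverse index from each tag to its parents. The sketch below assumes a domain → category → tag → keywords layout with `primary` and `secondary` keyword lists; the structure, key names, and every entry are hypothetical placeholders and may not match how the ontology file is actually organised.

```python
# Hypothetical ontology fragment: every name and keyword here is a placeholder.
ONTOLOGY = {
    "domain_a": {
        "category_a1": {
            "tag_one": {"primary": ["kw1", "kw2"], "secondary": ["kw3"]},
            "tag_two": {"primary": ["kw4"], "secondary": []},
        },
    },
    "domain_b": {
        "category_b1": {
            "tag_three": {"primary": ["kw5"], "secondary": ["kw6"]},
        },
    },
}

def build_tag_index(ontology: dict) -> dict[str, tuple[str, str]]:
    """Map each tag name to its (domain, category) parents."""
    index = {}
    for domain, categories in ontology.items():
        for category, tags in categories.items():
            for tag in tags:
                index[tag] = (domain, category)
    return index

def domains_for(tags: list[str], index: dict) -> set[str]:
    """Roll a verse's tags up to the domain level, skipping unknown tags."""
    return {index[t][0] for t in tags if t in index}

TAG_INDEX = build_tag_index(ONTOLOGY)
print(TAG_INDEX["tag_three"])
print(domains_for(["tag_one", "tag_three"], TAG_INDEX))
```

Rolling tags up to categories or domains in this way lets a search or classification task run at a coarser level than the 338 individual tags.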
Key Statistics
| Metric | Value |
|---|---|
| Verses | 6,236 |
| Surahs | 114 |
| Domains | 18 |
| Categories | 70 |
| Tags | 338 |
| Total tag assignments | ~16,300 |
| Avg. tags per verse | 2.62 |
Example
Dataset Uses
How you can use QSAC:
- Build a semantic Quran search engine using vector embeddings
- Train multi-label classifiers to predict Quranic themes
- Build Islamic chatbots grounded in tagged verses
- Construct knowledge graphs using the ontology hierarchy
- Study thematic distribution of Quranic topics
- Integrate into RAG pipelines for contextual Islamic Q&A
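
For the multi-label classification use case, a common preparation step is to turn each verse's tag list into a multi-hot vector. The following is a rough sketch in plain Python, not a prescribed pipeline; the tag names are placeholders and the helper functions are hypothetical.

```python
def build_vocab(tag_lists: list[list[str]]) -> dict[str, int]:
    """Assign each distinct tag a column index, sorted for a stable ordering."""
    distinct = sorted({tag for verse in tag_lists for tag in verse})
    return {tag: i for i, tag in enumerate(distinct)}

def multi_hot(tags: list[str], vocab: dict[str, int]) -> list[int]:
    """Return a 0/1 vector with a 1 in the column of every tag present."""
    row = [0] * len(vocab)
    for tag in tags:
        row[vocab[tag]] = 1
    return row

# Placeholder tag lists standing in for three annotated verses.
verse_tags = [
    ["tag_one", "tag_two"],
    ["tag_two"],
    ["tag_three", "tag_one"],
]

vocab = build_vocab(verse_tags)
labels = [multi_hot(tags, vocab) for tags in verse_tags]
print(vocab)
print(labels)
```

Built over the full dataset, the vocabulary would have one column per tag (338 in total), and each verse's vector would contain between one and five ones.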
Disclaimer
This dataset was produced through a combination of LLM-assisted annotation and human review. Like any human endeavour, it is not free from error: a verse may carry an imprecise tag, the boundaries between tags in the ontology may overlap, or an edge case may have been annotated inconsistently.
If you spot a mistake — whether in a verse's tags, an ontology description, or a keyword — please raise a GitHub Issue or open a Pull Request. Every correction improves the dataset for everyone, and all contributions are genuinely appreciated.
