QSAC — Quran Semantic Annotation Corpus

A fully tagged dataset of all 6,236 Quranic verses mapped to a 3-level semantic ontology (domains, categories, tags).

Overview

QSAC (Quran Semantic Annotation Corpus) is a multi-level semantic tagging dataset covering every verse (ayah) of the Quran. Each of the 6,236 verses is annotated with 1–5 thematic tags drawn from a structured ontology containing:

18 domains
70 categories
338 fine-grained tags

These machine-readable labels support concept-level search, semantic classification, RAG pipelines, and other Islamic AI applications that need more than simple keyword matching.

What This Dataset Includes

Surah
Ayah
Arabic text
English translation (Saheeh International)
Pipe-delimited semantic tags

Full ontology
Domains → Categories → Tags → Keyword sets
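
The Domains → Categories → Tags → Keyword sets hierarchy can be pictured as nested records. The following is a minimal sketch only: the class names, field names, and placeholder entries (`domain_a`, `tag_x`, and so on) are assumptions for illustration and do not reflect the actual ontology file format or its real labels.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    name: str
    # Keyword sets attached to a tag (field names are assumed).
    primary_keywords: list[str] = field(default_factory=list)
    secondary_keywords: list[str] = field(default_factory=list)


@dataclass
class Category:
    name: str
    tags: list[Tag] = field(default_factory=list)


@dataclass
class Domain:
    name: str
    categories: list[Category] = field(default_factory=list)


def tag_paths(domains):
    """Map each tag name to its (domain, category) parents."""
    paths = {}
    for domain in domains:
        for category in domain.categories:
            for tag in category.tags:
                paths[tag.name] = (domain.name, category.name)
    return paths


# Placeholder ontology: these names are hypothetical, not real QSAC labels.
ontology = [
    Domain("domain_a", [
        Category("category_a1", [
            Tag("tag_x", ["keyword_1"], ["keyword_2"]),
            Tag("tag_y"),
        ]),
    ]),
]

paths = tag_paths(ontology)
print(paths["tag_x"])  # ('domain_a', 'category_a1')
```

A flat tag-to-parents lookup like `paths` is convenient when a verse record only stores tag names and the domain or category has to be recovered later.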

Why QSAC Was Created

Traditional Quran datasets provide text but lack semantic structure. QSAC fills this gap by providing:

A hierarchical semantic ontology
Tag definitions with primary & secondary keywords
Consistent annotation of every verse

This allows developers and researchers to build:

Quran semantic search engines
Islamic chatbots grounded in citations
RAG pipelines for Islamic knowledge
Multi-label classifiers for Quranic themes
Thematic study tools for education

Key Statistics

Metric	Value
Verses	6,236
Surahs	114
Domains	18
Categories	70
Tags	338
Total tag assignments	~16,300
Avg. tags per verse	2.62
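
The last two rows are related by a simple division: average tags per verse = total tag assignments / number of verses. The sketch below shows how such figures could be recomputed from a list of per-verse tag lists. The function name and the toy data are assumptions for illustration; only the figures 16,300 and 6,236 come from the table above.

```python
from collections import Counter


def tag_stats(tag_lists):
    """Summarise how many tags each verse carries."""
    total = sum(len(tags) for tags in tag_lists)
    return {
        "verses": len(tag_lists),
        "total_assignments": total,
        "avg_tags_per_verse": total / len(tag_lists),
        # How many verses carry exactly 1, 2, 3, ... tags.
        "distribution": dict(Counter(len(tags) for tags in tag_lists)),
    }


# Toy input with hypothetical tag names, not real QSAC data.
toy = [["tag_a", "tag_b"], ["tag_b"], ["tag_a", "tag_b", "tag_c"]]
stats = tag_stats(toy)
print(stats["avg_tags_per_verse"])  # 2.0

# Using the rounded total from the table above.
print(round(16300 / 6236, 2))  # 2.61
```

Because the total is given only approximately (~16,300), the quotient of about 2.61 is consistent with the reported average of 2.62.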

Example
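
As a rough illustration of working with the pipe-delimited tag field, the sketch below reads verse records and builds a tag-to-verse index for concept-level lookup. It assumes a CSV layout with columns named `surah`, `ayah`, `arabic`, `english`, and `tags`; the real file format and column names may differ. The sample rows are placeholders with elided text and hypothetical tag names.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows: text fields are elided and tag names are hypothetical.
SAMPLE = """surah,ayah,arabic,english,tags
1,1,...,...,tag_a|tag_b
1,2,...,...,tag_b
"""


def load_verses(handle):
    """Read verse rows and split the pipe-delimited tag field into a list."""
    rows = []
    for row in csv.DictReader(handle):
        row["tags"] = [t.strip() for t in row["tags"].split("|") if t.strip()]
        rows.append(row)
    return rows


def build_tag_index(rows):
    """Map each tag to the (surah, ayah) references that carry it."""
    index = defaultdict(list)
    for row in rows:
        ref = (int(row["surah"]), int(row["ayah"]))
        for tag in row["tags"]:
            index[tag].append(ref)
    return index


verses = load_verses(io.StringIO(SAMPLE))
index = build_tag_index(verses)
print(index["tag_b"])  # [(1, 1), (1, 2)]
```

To read from disk instead, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) to `load_verses`.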

Dataset Usage

How you can use QSAC:

Build a semantic Quran search engine using vector embeddings
Train multi-label classifiers to predict Quranic themes
Build Islamic chatbots grounded in tagged verses
Construct knowledge graphs using the ontology hierarchy
Study thematic distribution of Quranic topics
Integrate into RAG pipelines for contextual Islamic Q&A
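
For the multi-label classification use case, each verse's tag list is typically converted into a fixed-length 0/1 target vector. The following is a minimal standard-library sketch of that encoding. The helper names and toy tags are assumptions for illustration; in practice a library utility such as scikit-learn's `MultiLabelBinarizer` could fill the same role.

```python
def build_vocab(tag_lists):
    """Assign a stable column index to every distinct tag (sorted by name)."""
    distinct = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: i for i, tag in enumerate(distinct)}


def multi_hot(tags, vocab):
    """Encode one verse's tags as a 0/1 vector over the tag vocabulary."""
    vector = [0] * len(vocab)
    for tag in tags:
        vector[vocab[tag]] = 1
    return vector


# Toy input with hypothetical tag names, not real QSAC data.
toy = [["tag_a", "tag_b"], ["tag_c"]]
vocab = build_vocab(toy)
targets = [multi_hot(tags, vocab) for tags in toy]
print(targets)  # [[1, 1, 0], [0, 0, 1]]
```

With the full dataset, each target vector would have one column per tag in the ontology, and each row would contain between one and five ones.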

Disclaimer

This dataset was produced through a combination of LLM-assisted annotation and human review. Like any human endeavour, it is not free from error: a verse may carry an imprecise tag, tag boundaries in the ontology may overlap, or an edge case may have been annotated inconsistently.

If you spot a mistake — whether in a verse's tags, an ontology description, or a keyword — please raise a GitHub Issue or open a Pull Request. Every correction improves the dataset for everyone, and all contributions are genuinely appreciated.