QSAC — Quran Semantic Annotation Corpus

A fully tagged dataset of all 6,236 Quranic verses mapped to a 3-level semantic ontology (domains, categories, tags).

QSAC (Quran Semantic Annotation Corpus) is a multi-level semantic tagging dataset covering every verse (ayah) of the Quran. Each of the 6,236 verses is annotated with 1–5 thematic tags drawn from a structured ontology containing:
  • 18 domains
  • 70 categories
  • 338 fine-grained tags
This dataset enables concept-level search, semantic classification, RAG pipelines, and Islamic AI applications by providing machine-readable labels beyond simple keyword matching.

What This Dataset Includes

Verse annotations (one record per ayah):
  • Surah
  • Ayah
  • Arabic text
  • English translation (Saheeh International)
  • Pipe-delimited semantic tags

Ontology:
  • Full hierarchy: Domains → Categories → Tags → Keyword sets
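
As a rough illustration of how the per-verse fields listed above might be handled in code, here is a minimal Python sketch. The `VerseRecord` class, its field names, the integer types for surah and ayah, and the sample tag names are all assumptions made for this example; they are not taken from the QSAC files or ontology. Only the pipe-delimited tag format and the 1–5 tags-per-verse range come from the description above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VerseRecord:
    # Hypothetical field names; the real files may name these differently.
    surah: int
    ayah: int
    arabic: str
    english: str
    tags: Tuple[str, ...]


def parse_tags(raw: str) -> Tuple[str, ...]:
    """Split a pipe-delimited tag string, trimming whitespace and dropping blanks."""
    return tuple(part.strip() for part in raw.split("|") if part.strip())


def make_record(surah: int, ayah: int, arabic: str, english: str, raw_tags: str) -> VerseRecord:
    """Build a record and check the documented 1-5 tags-per-verse range."""
    tags = parse_tags(raw_tags)
    if not 1 <= len(tags) <= 5:
        raise ValueError(f"{surah}:{ayah} has {len(tags)} tags, expected 1-5")
    return VerseRecord(surah, ayah, arabic, english, tags)


# Placeholder text and made-up tag names, for illustration only.
record = make_record(1, 1, "<arabic text>", "<english text>", "example_tag_a|example_tag_b")
print(record.tags)
```

Keeping the tags as a tuple rather than the raw string makes later steps (counting, filtering, encoding) simpler, at the cost of one parsing pass when loading.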

Why QSAC Was Created

Traditional Quran datasets provide text but lack semantic structure. QSAC fills this gap by providing:
  • A hierarchical semantic ontology
  • Tag definitions with primary & secondary keywords
  • Consistent annotation of every verse
This allows developers and researchers to build:
  • Quran semantic search engines
  • Islamic chatbots grounded in citations
  • RAG pipelines for Islamic knowledge
  • Multi-label classifiers for Quranic themes
  • Thematic study tools for education
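
To make the hierarchy and keyword sets described above more concrete, the sketch below shows one possible in-memory shape for the ontology and a helper that resolves each tag to its parent category and domain. The nested-dictionary layout, the `primary` and `secondary` key names, and every domain, category, tag, and keyword shown are placeholders assumed for illustration; the actual ontology file may be organised differently.

```python
# Hypothetical layout: domain -> category -> tag -> keyword sets.
# All names below are placeholders, not entries from the QSAC ontology.
ONTOLOGY = {
    "Example Domain": {
        "Example Category": {
            "example_tag_a": {"primary": ["keyword1"], "secondary": ["keyword2"]},
            "example_tag_b": {"primary": ["keyword3"], "secondary": []},
        },
    },
    "Another Domain": {
        "Another Category": {
            "example_tag_c": {"primary": ["keyword4"], "secondary": ["keyword5"]},
        },
    },
}


def tag_paths(ontology):
    """Map each tag name to its (domain, category) parents."""
    paths = {}
    for domain, categories in ontology.items():
        for category, tags in categories.items():
            for tag in tags:
                paths[tag] = (domain, category)
    return paths


def ontology_size(ontology):
    """Return (domain count, category count, tag count) for a quick sanity check."""
    n_categories = sum(len(cats) for cats in ontology.values())
    n_tags = sum(len(tags) for cats in ontology.values() for tags in cats.values())
    return len(ontology), n_categories, n_tags


paths = tag_paths(ONTOLOGY)
print(paths["example_tag_c"])
print(ontology_size(ONTOLOGY))
```

Run against the full ontology, `ontology_size` would be expected to report 18 domains, 70 categories, and 338 tags if the file matches the figures stated in this document.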

Key Statistics

Metric                    Value
Verses                    6,236
Surahs                    114
Domains                   18
Categories                70
Tags                      338
Total tag assignments     ~16,300
Avg. tags per verse       2.62
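
The last two rows are consistent with each other: 6,236 verses × 2.62 tags per verse ≈ 16,338 assignments. The following is a small sketch of how such figures could be recomputed from the pipe-delimited tag column. The function name and the sample strings are assumptions made for this example, not part of the dataset.

```python
from collections import Counter


def tag_statistics(tag_strings):
    """Compute summary statistics from an iterable of pipe-delimited tag strings."""
    counts = Counter()
    verses = 0
    assignments = 0
    for raw in tag_strings:
        tags = [t.strip() for t in raw.split("|") if t.strip()]
        counts.update(tags)
        assignments += len(tags)
        verses += 1
    average = assignments / verses if verses else 0.0
    return {
        "verses": verses,
        "assignments": assignments,
        "avg_tags_per_verse": round(average, 2),
        "distinct_tags": len(counts),
    }


# Made-up tag strings standing in for the real tag column.
sample = ["tag_a|tag_b", "tag_a", "tag_b|tag_c|tag_d"]
print(tag_statistics(sample))
```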

Example


Dataset Uses

How you can use QSAC:
  • Build a semantic Quran search engine using vector embeddings
  • Train multi-label classifiers to predict Quranic themes
  • Build Islamic chatbots grounded in tagged verses
  • Construct knowledge graphs using the ontology hierarchy
  • Study thematic distribution of Quranic topics
  • Integrate into RAG pipelines for contextual Islamic Q&A
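
Two of the uses above, tag-based retrieval and multi-label classification, start from the same building blocks: an index from tags to verses, and a fixed-length label vector per verse. The sketch below shows both using only the standard library. The function names, the `(surah, ayah, raw_tags)` tuple layout, and the tag names are assumptions for illustration; a real pipeline would likely combine this with embeddings or a machine learning library.

```python
from collections import defaultdict


def build_tag_index(records):
    """Build an inverted index: tag -> set of (surah, ayah) references.

    `records` is an iterable of (surah, ayah, raw_tags) tuples, where
    raw_tags is a pipe-delimited string. This layout is assumed here.
    """
    index = defaultdict(set)
    for surah, ayah, raw_tags in records:
        for tag in raw_tags.split("|"):
            tag = tag.strip()
            if tag:
                index[tag].add((surah, ayah))
    return index


def verses_with_all(index, tags):
    """Return sorted verse references that carry every tag in `tags`."""
    sets = [index.get(tag, set()) for tag in tags]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def multi_hot(raw_tags, vocabulary):
    """Encode a verse's tags as a 0/1 vector ordered by `vocabulary`."""
    present = {t.strip() for t in raw_tags.split("|")}
    return [1 if tag in present else 0 for tag in vocabulary]


# Made-up records and tag names, for illustration only.
records = [(1, 1, "tag_a|tag_b"), (1, 2, "tag_b"), (2, 1, "tag_a|tag_b|tag_c")]
index = build_tag_index(records)
print(verses_with_all(index, ["tag_a", "tag_b"]))
print(multi_hot("tag_a|tag_c", ["tag_a", "tag_b", "tag_c"]))
```

In a RAG setting, the verse references returned by `verses_with_all` could serve as a pre-filter before vector search, so that retrieved passages are constrained to the requested themes.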

Disclaimer

This dataset was produced through a combination of LLM-assisted annotation and human review. Like any human endeavour, it is not free from error: a verse may carry an imprecise tag, tags in the ontology may overlap in scope, or an edge case may have been annotated inconsistently.
If you spot a mistake — whether in a verse's tags, an ontology description, or a keyword — please raise a GitHub Issue or open a Pull Request. Every correction improves the dataset for everyone, and all contributions are genuinely appreciated.