LOCK PIPELINE — VALIDATION REPORT
Date: May 6, 2026
Schema Version: 1.2

================================================================================
1. PURPOSE
================================================================================

This document provides a full design-level validation of the Lock Pipeline
codebase. It covers the pipeline architecture and data flow, file inventory
and role of every file, all reproducibility audit findings and their
resolutions, computational specifications, data integrity guarantees, hash
registry usage, known constraints, and a pre/post-run validation checklist.

================================================================================
2. PIPELINE ARCHITECTURE
================================================================================

The pipeline processes Quran dataset workbooks (*_7Metrics.xlsx) through
three independent analytical stages and then aggregates results into a
combined ranked summary.

  Stage 1 — verse_locks.py
    Input : *_7Metrics.xlsx (root; count extensible)
    Output: summary/txt/grand_verse_lock_summary.txt
            summary/txt/grand_verse_lock_summary_bool.txt
            summary/json/grand_verse_lock_summary.json

  Stage 2 — sura_locks.py
    Input : *_7Metrics.xlsx (root; count extensible)
    Output: summary/txt/grand_sura_lock_summary.txt
            summary/txt/grand_sura_lock_summary_bool.txt
            summary/json/grand_sura_lock_summary.json

  Stage 3 — global_lock.py
    Input : *_7Metrics.xlsx (root; count extensible)
    Output: summary/txt/global_lock_summary.txt
            summary/txt/global_lock_summary_bool.txt
            summary/json/global_lock_summary.json

  Stage 4 — locks_summary.py  [DEPENDENCY: must run after Stages 1–3]
    Input : summary/json/grand_verse_lock_summary.json
            summary/json/grand_sura_lock_summary.json
            summary/json/global_lock_summary.json
    Output: summary/txt/LOCKS.txt
            summary/txt/LOCKS_bool.txt
            summary/json/LOCKS.json
            summary/grand/txt/grand_summary_<mushaf>.txt
            summary/grand/csv/grand_summary_<mushaf>.csv
            summary/grand/json/grand_summary_<mushaf>.json

  Orchestrator — run_pipeline.py  [RECOMMENDED entry point]
    Executes Stages 1–4 in the correct order in a single process.
    Pins one shared timestamp for the entire run before any stage executes.
    Defaults to deterministic output (epoch: 1970-01-01 00:00:00) for full
    bit-reproducibility across machines and reruns. Environment variables:
      LOCK_PIPELINE_DETERMINISTIC=1  (force epoch; same as the default)
      LOCK_PIPELINE_DETERMINISTIC=0  (disable the deterministic default)
      LOCK_PIPELINE_GENERATED_AT=<value>  (custom timestamp)

  Shared utility — _pipeline_utils.py  [INTERNAL — do not delete or rename]
    Provides the single canonical _generated_timestamp() implementation
    imported by all four stage scripts.
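
  A minimal sketch of how such a shared helper might resolve the timestamp.
  The function name resolve_generated_timestamp, the precedence (an explicit
  LOCK_PIPELINE_GENERATED_AT wins over the deterministic flag), and the
  output format are assumptions for illustration, not the actual contents of
  _pipeline_utils.py.

```python
import os
from datetime import datetime

EPOCH = "1970-01-01 00:00:00"


def resolve_generated_timestamp(env=None):
    """Return the 'Generated:' value for a run (hypothetical helper)."""
    env = os.environ if env is None else env
    explicit = env.get("LOCK_PIPELINE_GENERATED_AT")
    if explicit:
        # Assumed precedence: a custom timestamp overrides everything else.
        return explicit
    if env.get("LOCK_PIPELINE_DETERMINISTIC", "1") != "0":
        # Deterministic default: the fixed epoch string.
        return EPOCH
    # Deterministic mode disabled: fall back to wall-clock time.
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
```

  Passing a plain dict as env makes the helper easy to exercise without
  touching the real process environment.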

  Hash registry — hash_registry.py
    Computes and verifies SHA-256 digests of all tracked pipeline files.

================================================================================
3. FILE INVENTORY
================================================================================

  Root scripts:
    run_pipeline.py          Orchestrator. Recommended single-command entry point.
    verse_locks.py           Stage 1 — verse-level mod-19 lock analysis.
    sura_locks.py            Stage 2 — sura-level aggregated mod-19 lock analysis.
    global_lock.py           Stage 3 — global composite lock analysis (43 measures).
    locks_summary.py         Stage 4 — combined summary and deterministic ranking.
    hash_registry.py         SHA-256 registry builder and verifier.
    _pipeline_utils.py       Internal shared timestamp utility.

  Input data (root — extensible count):
    <Reader>(<Region>)_7Metrics.xlsx
      Reader ∈ {Bazzi, Doori, Hafs, Qaloon, Qumball, Shouba, Soosi, Warsh}
      Region ∈ {Basra, damascus, Kufa, Mecca, Medina I, Medina II, VERSE 0}  (currently 56 files)
    Submission_7Metrics.xlsx
    The_Criterion_7Metrics.xlsx
    (Additional compatible *_7Metrics.xlsx files may be added in future phases.)

  Configuration:
    LOCK PIPELINE.code-workspace  VS Code workspace configuration.
    requirements.txt         Python dependency manifest: openpyxl>=3.1,<4.0

  Documentation:
    README.md                Usage guide, reproducibility, and pipeline overview.
    VALIDATION_REPORT.txt    This file.

  Generated outputs (across 5 subdirectories after a full run):
    summary/txt/grand_verse_lock_summary.txt
    summary/txt/grand_verse_lock_summary_bool.txt
    summary/txt/grand_sura_lock_summary.txt
    summary/txt/grand_sura_lock_summary_bool.txt
    summary/txt/global_lock_summary.txt
    summary/txt/global_lock_summary_bool.txt
    summary/txt/LOCKS.txt
    summary/txt/LOCKS_bool.txt
    summary/json/grand_verse_lock_summary.json
    summary/json/grand_sura_lock_summary.json
    summary/json/global_lock_summary.json
    summary/json/LOCKS.json
    summary/grand/txt/grand_summary_<mushaf>.txt    (one per dataset)
    summary/grand/csv/grand_summary_<mushaf>.csv    (one per dataset)
    summary/grand/json/grand_summary_<mushaf>.json  (one per dataset)

  Hash registry (root):
    registry.json            SHA-256 registry. Generated by hash_registry.py.

================================================================================
4. COMPUTATIONAL SPECIFICATIONS
================================================================================

  Modulus            : 19
  Expected datasets  : Extensible (current: 58; future additions expected)
  Metric order       : M1, M2, M3, M4, M5, M6, M7
                       Enforced canonically regardless of physical column order.

  Metric source policy:
    M1, M2, M3, M5, M7 are read directly from workbook metric cells.
    M4 is recomputed from the text column as forward letter-abjad concatenation.
    M6 is recomputed from the text column as the reverse-letter mirror of M4.
    Text normalization rules for M4/M6:
      - Unicode NFKC normalization is applied before character scanning.
      - Combining marks and Quranic diacritics are ignored.
      - Character remaps before abjad lookup:
          أ/إ/آ/ٱ -> ا
          ى/ئ -> ي
          ؤ -> و
          ة -> ه
          ء -> ا
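
  The normalization and remap rules above can be sketched as follows. This
  is an illustration only: the abjad table is the standard (Mashriqi)
  assignment and is assumed, not confirmed, to match the pipeline's table;
  the function names are hypothetical; and characters with no abjad value
  are assumed to be skipped. Letters are written as Unicode escapes to avoid
  bidirectional-text rendering problems.

```python
import unicodedata

# Letters in abjad order, grouped by value range (assumed standard table).
_LETTERS = (
    "\u0627\u0628\u062c\u062f\u0647\u0648\u0632\u062d\u0637"  # 1..9
    "\u064a\u0643\u0644\u0645\u0646\u0633\u0639\u0641\u0635"  # 10..90
    "\u0642\u0631\u0634\u062a\u062b\u062e\u0630\u0636\u0638"  # 100..900
    "\u063a"                                                  # 1000
)
_VALUES = [*range(1, 10), *range(10, 100, 10), *range(100, 1000, 100), 1000]
ABJAD = dict(zip(_LETTERS, _VALUES))

# Character remaps listed in this report (hamza forms, alif maqsura, etc.).
REMAP = {
    "\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627", "\u0671": "\u0627",
    "\u0649": "\u064a", "\u0626": "\u064a",
    "\u0624": "\u0648",
    "\u0629": "\u0647",
    "\u0621": "\u0627",
}


def abjad_values(text):
    """Return the abjad value of each letter, in reading order."""
    values = []
    for ch in unicodedata.normalize("NFKC", text):
        # Skip combining marks and Quranic diacritics.
        if unicodedata.combining(ch) or unicodedata.category(ch).startswith("M"):
            continue
        ch = REMAP.get(ch, ch)
        if ch in ABJAD:
            values.append(ABJAD[ch])
    return values


def m4_forward(text):
    """Forward concatenation of letter values, as a digit string."""
    return "".join(str(v) for v in abjad_values(text))


def m6_reverse(text):
    """Same values concatenated with the letters in reverse order (assumed)."""
    return "".join(str(v) for v in reversed(abjad_values(text)))
```

  For the three letters ba, sin, mim (values 2, 60, 40), this yields the
  forward string "26040" and the reverse string "40602", with or without
  diacritics on the input.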

  --- Verse-level lock (verse_locks.py) ---
  For each verse row, for each metric Mx:
    lock = (verse_value % 19 == 0)
  Counts per-metric locks and total locks across all verses, per dataset.

  --- Sura-level lock (sura_locks.py) ---
  For each sura, for each metric Mx:
    sura_total = sum of all verse values in that sura
    lock = (sura_total % 19 == 0)
  Counts per-metric locks and total locks across all 114 suras, per dataset.
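
  A small sketch of the verse-level and sura-level counting rules above.
  The row representation (a dict with "sura", "verse", and "M1".."M7" keys)
  and the function names are assumptions made for illustration; the real
  scripts read these values from workbooks.

```python
from collections import defaultdict

MODULUS = 19
METRIC_ORDER = tuple(f"M{i}" for i in range(1, 8))


def count_verse_locks(rows):
    """Count, per metric, the verses whose value is divisible by 19."""
    counts = {m: 0 for m in METRIC_ORDER}
    for row in rows:
        for m in METRIC_ORDER:
            if row[m] % MODULUS == 0:
                counts[m] += 1
    return counts


def count_sura_locks(rows):
    """Count, per metric, the suras whose summed value is divisible by 19."""
    totals = defaultdict(lambda: {m: 0 for m in METRIC_ORDER})
    for row in rows:
        for m in METRIC_ORDER:
            totals[row["sura"]][m] += row[m]
    counts = {m: 0 for m in METRIC_ORDER}
    for sura_totals in totals.values():
        for m in METRIC_ORDER:
            if sura_totals[m] % MODULUS == 0:
                counts[m] += 1
    return counts
```

  The total lock count for a dataset is then sum(counts.values()) for either
  level.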

  --- Global lock families (global_lock.py) ---
  Six measurement families per metric (A–F), plus one composite (G):

    A  Global verse sum          sum(all verse values for metric Mx) % 19 == 0
    B  Forward verse concat      concat all verse values in sura/verse order,
                                 interpret as integer, % 19 == 0
    C  Reverse verse concat      same as B but values in reverse order
    D  Global sura sum           sum(all 114 sura totals for Mx) % 19 == 0
    E  Forward sura concat       concat 114 sura totals sura-1..sura-114,
                                 interpret as integer, % 19 == 0
    F  Reverse sura concat       same as E but suras in reverse order
    G  Composite sura-sum chain  concat D-values for M1..M7 in canonical order,
                                 interpret as integer, % 19 == 0

  Total measurements per dataset: 43
    = 6 families (A–F) × 7 metrics + 1 composite family (G)
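
  The family definitions can be illustrated on a toy dataset. In this
  sketch the (sura, verse, value) tuple layout and the function names are
  assumptions, values are assumed non-negative, and int() is used for the
  concatenations only because the example strings are short.

```python
MODULUS = 19
METRIC_ORDER = tuple(f"M{i}" for i in range(1, 8))


def _concat_locks(values):
    # Concatenate decimal representations and test divisibility by 19.
    return int("".join(str(v) for v in values)) % MODULUS == 0


def families_for_metric(rows):
    """rows: iterable of (sura, verse, value) tuples for one metric."""
    ordered = sorted(rows)  # canonical (sura, verse) order
    values = [value for _, _, value in ordered]
    sura_totals = {}
    for sura, _, value in ordered:
        sura_totals[sura] = sura_totals.get(sura, 0) + value
    totals = [sura_totals[s] for s in sorted(sura_totals)]
    return {
        "A": sum(values) % MODULUS == 0,
        "B": _concat_locks(values),
        "C": _concat_locks(reversed(values)),
        "D": sum(totals) % MODULUS == 0,
        "E": _concat_locks(totals),
        "F": _concat_locks(reversed(totals)),
    }


def composite_g(global_sura_sums):
    """global_sura_sums: {metric: family-D sum}, joined in M1..M7 order."""
    return _concat_locks(global_sura_sums[m] for m in METRIC_ORDER)
```

  With verse values 1, 9, 3, 8 across two suras, only family B locks
  (1938 = 19 x 102). Families A and D always agree, which is the
  verse_sum == sura_sum invariant in miniature.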

  Invariant verified by global_lock.py:
    verse_sum == sura_sum for each metric; raises ValueError on violation.

  Large-number strategy:
    Concatenation strings for families B, C, E, F can reach tens of millions
    of digits. _digits_mod() uses a single-character streaming loop,
    rem = (rem * 10 + digit) % modulus, to avoid instantiating huge Python
    integer objects. sys.set_int_max_str_digits(0) is set at import time in
    global_lock.py as an additional safeguard on Python 3.11+.
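
  A sketch of the streaming reduction, using the recurrence
  rem = (rem * 10 + digit) % modulus. The function below is an illustrative
  reimplementation named digits_mod, not the pipeline's own _digits_mod().

```python
def digits_mod(digits, modulus=19):
    """Return int(digits) % modulus without ever building the big integer."""
    rem = 0
    for ch in digits:
        # Horner's rule on decimal digits; rem always stays below modulus.
        rem = (rem * 10 + int(ch)) % modulus
    return rem
```

  Because the remainder never exceeds the modulus, memory use is constant
  regardless of how many digits the string holds.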

  --- Combined ranking (locks_summary.py) ---
  Aggregates the three JSON outputs and applies a weighted scoring formula:

    score = 0.60 * (global_passes / 43)
          + 0.25 * (total_sura_locks / suras_scanned)
          + 0.15 * (total_verse_locks / verses_scanned)

  Datasets are ranked by score descending.
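
  The scoring formula translates directly into code. In this sketch the
  function names are hypothetical, and the tie-break by dataset name is an
  assumption: the report states only that ranking is by score descending.

```python
GLOBAL_MEASURES = 43


def lock_score(global_passes, total_sura_locks, suras_scanned,
               total_verse_locks, verses_scanned):
    """Weighted score: 60% global, 25% sura-level, 15% verse-level."""
    return (0.60 * (global_passes / GLOBAL_MEASURES)
            + 0.25 * (total_sura_locks / suras_scanned)
            + 0.15 * (total_verse_locks / verses_scanned))


def rank(scores):
    """scores: {dataset_name: score}. Highest score first; ties by name."""
    return sorted(scores, key=lambda name: (-scores[name], name))
```

  An explicit secondary sort key keeps the ranking stable when two datasets
  have equal scores.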

  Additional per-dataset exports from locks_summary.py:
    - summary/grand/json/grand_summary_<mushaf>.json
    - summary/grand/csv/grand_summary_<mushaf>.csv
    - summary/grand/txt/grand_summary_<mushaf>.txt
  (One triplet per dataset; 174 files for the current 58-dataset run.)

================================================================================
5. REPRODUCIBILITY AUDIT — FINDINGS AND STATUS
================================================================================

  R1 — RESOLVED
    Finding : _generated_timestamp() was defined independently in all four
              scripts (verse_locks.py, sura_locks.py, global_lock.py,
              locks_summary.py). An update to any one copy would silently
              make timestamp behaviour diverge across the pipeline.
    Fix     : The canonical definition was moved to _pipeline_utils.py.
              All four scripts import from this single shared module.
              A missing or renamed module now fails with a Python
              ImportError instead of diverging silently.

  R2 — RESOLVED
    Finding : The datetime.now() default made output files non-bit-identical
              across runs. When the four scripts were run as separate processes,
              locks_summary.py captured a fresh timestamp independently of the
              other three, so LOCKS.txt always showed a newer Generated: line
              than its source JSON files.
    Fix     : run_pipeline.py captures one wall-clock timestamp at process
              start and writes it to LOCK_PIPELINE_GENERATED_AT before any
              stage runs. All four output files share the same Generated: value.
              Default behaviour is now deterministic (epoch: 1970-01-01 00:00:00)
              to ensure full bit-reproducibility across machines and reruns.
              To override, set:
                LOCK_PIPELINE_GENERATED_AT=<fixed-value>  (custom timestamp)
              or:
                LOCK_PIPELINE_DETERMINISTIC=0  (disable deterministic default)

  R3 — RESOLVED
    Finding : wb.close() was placed as a bare final statement in each
              analyze_*() function with no finally guard. Any ValueError raised
              during row iteration would skip the close(), leaking the
              openpyxl file handle for the lifetime of the process.
    Fix     : All three compute functions now wrap their worksheet-reading code
              in try: ... finally: wb.close(). The close() is guaranteed to
              execute on both success and error paths.
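
  The try/finally pattern described in the fix can be sketched as follows.
  FakeWorkbook is a hypothetical stand-in so the example runs without
  openpyxl, and sum_first_column is an illustrative function, not one of the
  pipeline's analyze_*() functions.

```python
class FakeWorkbook:
    """Stand-in for a workbook object that must be closed after use."""

    def __init__(self, rows):
        self.rows = rows
        self.closed = False

    def close(self):
        self.closed = True


def sum_first_column(wb):
    """Sum column 1, closing the workbook on success and on error."""
    try:
        total = 0
        for row_number, row in enumerate(wb.rows, start=1):
            value = row[0]
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"row {row_number}: non-integer value {value!r}")
            total += value
        return total
    finally:
        # Runs whether the loop finished or raised.
        wb.close()
```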

  R4 — RESOLVED
    Finding : normalized_lock_counts dict-comprehension in sura_locks.py and
              verse_locks.py contained an unreachable else branch:
                for metric in (METRIC_ORDER if set(metric_short_order)
                               == set(METRIC_ORDER) else metric_short_order)
              The condition is always True because metric_short_order is
              always list(METRIC_ORDER). Dead code creates a maintenance risk
              if the condition is changed without understanding the context.
    Fix     : Simplified to iterate directly over METRIC_ORDER. The now-unused
              metric_short_order variable and all its assignment, return, and
              unpacking references were removed.

  R5 — INFORMATIONAL (no code change required)
    Finding : If the three core scripts were run independently at different
              times (e.g. Stage 1 against one version of the xlsx files and
              Stage 3 against a modified version), locks_summary.py would
              aggregate inconsistent inputs with no warning.
    Mitigation: Fully avoided when running via run_pipeline.py, which produces
              all upstream JSON in the same process run before invoking Stage 4.
              When running scripts individually, the operator is responsible for
              ensuring a consistent input state.

================================================================================
6. DATA INTEGRITY GUARANTEES
================================================================================

  Input validation (enforced by all three compute scripts):
    - Exactly the expected set of named datasets (currently 58) must be
      present. Missing or extra files raise ValueError listing the missing
      and extra names.
    - Each workbook must contain exactly one valid worksheet: one sheet that
      has sura, verse, and text columns and exactly 7 metric columns matching M1..M7.
      More or fewer qualifying sheets raise ValueError.
    - Sura values must be integers in [1, 114]. Out-of-range or non-integer
      values raise ValueError with the file name and row number.
    - Verse and metric values must be integers (no fractions, booleans, or
      non-numeric strings). Invalid values raise ValueError with location.
    - For global_lock.py: verse_sum == sura_sum invariant is verified per
      metric after all rows are read; mismatch raises ValueError.

  Output consistency:
    - All output directories are created with parents=True, exist_ok=True.
    - JSON outputs use ensure_ascii=True and indent=2 for stable serialisation
      across platforms.
    - Text outputs are UTF-8 encoded; lines joined with '\n' plus a trailing
      newline, giving consistent line endings.
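
  A sketch of writing outputs with these settings. The function name
  write_summary is hypothetical, and passing newline="\n" to open() is an
  assumption about how consistent line endings might be achieved on Windows;
  the report does not say how the scripts do it.

```python
import json
from pathlib import Path


def write_summary(base_dir, name, payload, lines):
    """Write <base_dir>/json/<name>.json and <base_dir>/txt/<name>.txt."""
    json_dir = Path(base_dir) / "json"
    txt_dir = Path(base_dir) / "txt"
    json_dir.mkdir(parents=True, exist_ok=True)
    txt_dir.mkdir(parents=True, exist_ok=True)
    json_path = json_dir / f"{name}.json"
    txt_path = txt_dir / f"{name}.txt"
    # newline="\n" prevents translation of "\n" into "\r\n" on Windows.
    with open(json_path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write(json.dumps(payload, ensure_ascii=True, indent=2))
    with open(txt_path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("\n".join(lines) + "\n")
    return json_path, txt_path
```

  With ensure_ascii=True, non-ASCII characters are written as \uXXXX
  escapes, so the JSON bytes do not depend on the platform's default
  encoding.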

  Schema versioning:
    - Stage JSON outputs carry "schema_version": "1.2"
      (grand_verse_lock_summary.json, grand_sura_lock_summary.json,
       global_lock_summary.json, LOCKS.json).
    - Per-dataset grand JSON outputs carry "schema_version": "1.0"
      (summary/grand/json/grand_summary_<mushaf>.json).
    - registry.json carries "schema_version": "1.1".

================================================================================
7. HASH REGISTRY PROTOCOL
================================================================================

  Script: hash_registry.py
  Algorithm: SHA-256 (64-character hexadecimal digest)
  Registry file: registry.json (root)

  Build a baseline (first run or after a known-good state):
    python hash_registry.py

  Verify files against the saved baseline:
    python hash_registry.py --verify

  Exit codes for --verify:
    0  All entries match (no changes detected).
    1  One or more files differ, were added, or are missing.
    2  registry.json does not exist; build it first.

  Tracked file categories:
    input   *_7Metrics.xlsx                     (58 files)
    script  *.py                                (7 files)
    doc     README.md, VALIDATION_REPORT.txt    (2 files)
    config  requirements.txt, *.code-workspace  (2 files)
    output  summary/**/*.txt, summary/**/*.json (130 files after a full run)

  Recommended lifecycle:
    1. Establish a baseline before the first run:
         python hash_registry.py
    2. Run the pipeline:
         python run_pipeline.py
    3. Update the registry with the new output hashes:
         python hash_registry.py
    4. On subsequent runs, verify tracked workspace files against the saved baseline:
         python hash_registry.py --verify
       Expected result after a rerun before rebuilding the baseline:
         inputs/scripts/docs/config typically UNCHANGED; outputs CHANGED
         only if the timestamp or the inputs differ between runs.

  Bit-stable baseline (zero changes on re-run):
    $env:LOCK_PIPELINE_DETERMINISTIC = "1"
    python run_pipeline.py
    python hash_registry.py
    # Re-run with the same env var — verify should report 0 CHANGED.
    python run_pipeline.py
    python hash_registry.py --verify   # Expect: PASSED

================================================================================
8. DEPENDENCY REQUIREMENTS
================================================================================

  Python   : 3.11 or later (tested on 3.15.0a6)
  openpyxl : 3.1.5 tested (requirements.txt allows openpyxl>=3.1,<4.0)

  Install:
    python -m venv .venv
    .venv\Scripts\activate
    python -m pip install -r requirements.txt

  Standard library modules used (no additional packages required):
    argparse, collections, datetime, hashlib, json, os, pathlib, re, sys, unicodedata

================================================================================
9. KNOWN CONSTRAINTS
================================================================================

  C1 — Large concatenation strings
    Families B, C, E, and F can produce strings of tens of millions of digits
    for large datasets. _digits_mod() avoids converting these to Python
    big-int objects via a streaming modular reduction loop. On Python 3.11+,
    sys.set_int_max_str_digits(0) is also set to remove the default 4300-digit
    limit on int/str conversion, covering any code path that does call int().

  C2 — Row ordering assumption
    global_lock.py sorts all row tokens by (sura, verse) after reading the
    full workbook, making families B and C independent of physical sheet order.
    Stages 1 and 2 process rows in physical sheet order for accumulation;
    their results are deterministic only if source workbooks have no duplicate
    (sura, verse) pairs with differing metric values.

  C3 — Verse 0 rows
    Rows where verse == 0 are not explicitly excluded. If present, they
    contribute to sura totals, verse sums, and concatenations in all three
    stages.

  C4 — Empty rows
    A row in which every cell is None is skipped. A row with a non-None
    but non-integer sura or verse raises ValueError immediately with the
    file name and row number, allowing fast diagnosis.

  C5 — Boolean cell values
    Cell values that are Python bool are explicitly rejected and treated as
    non-integer by both _to_int() and _to_digits(). This prevents True/False
    being silently coerced to 1/0.
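
  The explicit check matters because bool is a subclass of int in Python,
  so isinstance(True, int) is True. A minimal sketch of a strict converter
  follows; the name to_int, the where parameter, and the choice to accept
  integral floats such as 19.0 are assumptions, not a description of the
  pipeline's _to_int().

```python
def to_int(value, where="<unknown>"):
    """Return value as int, rejecting bools, fractions, and non-numbers."""
    # bool must be tested first: it would otherwise pass the int check.
    if isinstance(value, bool):
        raise ValueError(f"{where}: boolean {value!r} is not an integer")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise ValueError(f"{where}: {value!r} is not an integer")
```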

================================================================================
10. VALIDATION CHECKLIST
================================================================================

  Pre-run checks:
    [ ] All 58 *_7Metrics.xlsx files present in root.
    [ ] .venv activated; openpyxl 3.1.5 available (pip show openpyxl).
    [ ] python hash_registry.py --verify passes (or baseline not yet built).

  Run:
    [ ] python run_pipeline.py  (no errors or tracebacks in output)

  Post-run checks:
    [ ] summary/txt/ contains exactly 9 .txt files (8 generated + 1 static explanation).
    [ ] summary/json/ contains exactly 4 .json files.
    [ ] summary/grand/txt/ contains exactly 58 .txt files.
    [ ] summary/grand/csv/ contains exactly 58 .csv files.
    [ ] summary/grand/json/ contains exactly 58 .json files.
    [ ] Stage JSON schema versions are "1.2"; per-dataset grand JSON schema versions are "1.0".
    [ ] Stage JSON files each contain a "datasets" array with exactly 58 entries.
    [ ] global_lock_summary.json: every dataset has
          pass_count + fail_count == 43.
    [ ] LOCKS.json: "ranking" array has exactly 58 entries.
    [ ] The Generated: lines in the four primary .txt outputs are identical
        (guaranteed when run via run_pipeline.py).
    [ ] python hash_registry.py  (update registry with new output hashes).
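
  Some of the JSON checks above could be automated along these lines. The
  key names "datasets", "pass_count", "fail_count", and "ranking" come from
  the checklist; the assumption that each "datasets" entry is an object
  holding both counts, and the function name check_outputs, are
  illustrative only.

```python
def check_outputs(global_summary, locks, expected_datasets=58, measures=43):
    """Return a list of problem descriptions; an empty list means all passed."""
    problems = []
    datasets = global_summary.get("datasets", [])
    if len(datasets) != expected_datasets:
        problems.append(f"datasets: expected {expected_datasets}, got {len(datasets)}")
    for index, entry in enumerate(datasets):
        total = entry["pass_count"] + entry["fail_count"]
        if total != measures:
            problems.append(f"datasets[{index}]: pass+fail == {total}, expected {measures}")
    ranking = locks.get("ranking", [])
    if len(ranking) != expected_datasets:
        problems.append(f"ranking: expected {expected_datasets}, got {len(ranking)}")
    return problems
```

  The two arguments would be the parsed contents of
  summary/json/global_lock_summary.json and summary/json/LOCKS.json, for
  example as loaded with json.load().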

================================================================================
11. FINAL VERIFICATION RUN (May 6, 2026)
================================================================================

  Executed: python run_pipeline.py (via .venv\Scripts\python.exe)
  Datasets: 58 (added damascus and VERSE 0 variants for all 8 readers)

  Pipeline completion:
    [1/4] verse_locks    PASS
    [2/4] sura_locks     PASS
    [3/4] global_lock    PASS
    [4/4] locks_summary  PASS
    Exit code: 0

  Output files created:
    summary/txt/           9 files (8 generated: 4 primary + 4 _bool variants; 1 static explanation)
    summary/json/          4 files (LOCKS.json + 3 stage summaries)
    summary/grand/txt/    58 files (one per dataset)
    summary/grand/csv/    58 files (one per dataset)
    summary/grand/json/   58 files (one per dataset)

  Hash registry verification:
    python hash_registry.py
      Registry written: registry.json  (197 entries: config=1, doc=1, input=58, output=130, script=7)

    python hash_registry.py --verify
      Verification PASSED — all 197 entries match the registry

  Determinism check:
    All four output .txt files generated in a single run via run_pipeline.py
    share an identical "Generated:" timestamp (wall-clock captured at
    pipeline start). Confirms R2 fix is working as intended.

  Status: WORKSPACE IS CLEAN AND READY FOR PRODUCTION USE

================================================================================
END OF REPORT
================================================================================
