LOCK PIPELINE — VALIDATION REPORT
Date: May 6, 2026
Schema Version: 1.2

================================================================================
1. PURPOSE
================================================================================

This document provides a full design-level validation of the Lock Pipeline
codebase. It covers the pipeline architecture and data flow, file inventory and
role of every file, all reproducibility audit findings and their resolutions,
computational specifications, data integrity guarantees, hash registry usage,
known constraints, and a pre/post-run validation checklist.

================================================================================
2. PIPELINE ARCHITECTURE
================================================================================

The pipeline processes Quran dataset workbooks (*_7Metrics.xlsx) through three
independent analytical stages and then aggregates results into a combined
ranked summary.

Stage 1 — verse_locks.py
  Input : *_7Metrics.xlsx (root; count extensible)
  Output: summary/txt/grand_verse_lock_summary.txt
          summary/txt/grand_verse_lock_summary_bool.txt
          summary/json/grand_verse_lock_summary.json

Stage 2 — sura_locks.py
  Input : *_7Metrics.xlsx (root; count extensible)
  Output: summary/txt/grand_sura_lock_summary.txt
          summary/txt/grand_sura_lock_summary_bool.txt
          summary/json/grand_sura_lock_summary.json

Stage 3 — global_lock.py
  Input : *_7Metrics.xlsx (root; count extensible)
  Output: summary/txt/global_lock_summary.txt
          summary/txt/global_lock_summary_bool.txt
          summary/json/global_lock_summary.json

Stage 4 — locks_summary.py [DEPENDENCY: must run after Stages 1–3]
  Input : summary/json/grand_verse_lock_summary.json
          summary/json/grand_sura_lock_summary.json
          summary/json/global_lock_summary.json
  Output: summary/txt/LOCKS.txt
          summary/txt/LOCKS_bool.txt
          summary/json/LOCKS.json
          summary/grand/txt/grand_summary_.txt
          summary/grand/csv/grand_summary_.csv
          summary/grand/json/grand_summary_.json

Orchestrator —
run_pipeline.py [RECOMMENDED entry point]
  Executes Stages 1–4 in the correct order in a single process.
  Pins one shared timestamp for the entire run before any stage executes.
  Defaults to deterministic output (epoch: 1970-01-01 00:00:00) for full
  bit-reproducibility across machines and reruns.
  Override with:
    LOCK_PIPELINE_DETERMINISTIC=1   (force epoch)
    LOCK_PIPELINE_GENERATED_AT=     (custom timestamp)

Shared utility — _pipeline_utils.py [INTERNAL — do not delete or rename]
  Provides the single canonical _generated_timestamp() implementation imported
  by all four stage scripts.

Hash registry — hash_registry.py
  Computes and verifies SHA-256 digests of all tracked pipeline files.

================================================================================
3. FILE INVENTORY
================================================================================

Root scripts:
  run_pipeline.py      Orchestrator. Recommended single-command entry point.
  verse_locks.py       Stage 1 — verse-level mod-19 lock analysis.
  sura_locks.py        Stage 2 — sura-level aggregated mod-19 lock analysis.
  global_lock.py       Stage 3 — global composite lock analysis (43 measures).
  locks_summary.py     Stage 4 — combined summary and deterministic ranking.
  hash_registry.py     SHA-256 registry builder and verifier.
  _pipeline_utils.py   Internal shared timestamp utility.

Input data (root — extensible count):
  ()_7Metrics.xlsx
      Reader ∈ {Bazzi, Doori, Hafs, Qaloon, Qumball, Shouba, Soosi, Warsh}
      Region ∈ {Basra, damascus, Kufa, Mecca, Medina I, Medina II, VERSE 0}
      (currently 56 files)
  Submission_7Metrics.xlsx
  The_Criterion_7Metrics.xlsx
  (Additional compatible *_7Metrics.xlsx files may be added in future phases.)

Configuration:
  LOCK PIPELINE.code-workspace   VS Code workspace configuration.
  requirements.txt               Python dependency manifest: openpyxl>=3.1,<4.0

Documentation:
  README.md               Usage guide, reproducibility, and pipeline overview.
  VALIDATION_REPORT.txt   This file.

Generated outputs (across 5 subdirectories after a full run):
  summary/txt/grand_verse_lock_summary.txt
  summary/txt/grand_verse_lock_summary_bool.txt
  summary/txt/grand_sura_lock_summary.txt
  summary/txt/grand_sura_lock_summary_bool.txt
  summary/txt/global_lock_summary.txt
  summary/txt/global_lock_summary_bool.txt
  summary/txt/LOCKS.txt
  summary/txt/LOCKS_bool.txt
  summary/json/grand_verse_lock_summary.json
  summary/json/grand_sura_lock_summary.json
  summary/json/global_lock_summary.json
  summary/json/LOCKS.json
  summary/grand/txt/grand_summary_.txt     (one per dataset)
  summary/grand/csv/grand_summary_.csv     (one per dataset)
  summary/grand/json/grand_summary_.json   (one per dataset)

Hash registry (root):
  registry.json   SHA-256 registry. Generated by hash_registry.py.

================================================================================
4. COMPUTATIONAL SPECIFICATIONS
================================================================================

Modulus           : 19
Expected datasets : Extensible (current: 58; future additions expected)
Metric order      : M1, M2, M3, M4, M5, M6, M7
                    Enforced canonically regardless of physical column order.

Metric source policy:
  M1, M2, M3, M5, M7 are read directly from workbook metric cells.
  M4 is recomputed from the text column as forward letter-abjad concatenation.
  M6 is recomputed from the text column as the reverse-letter mirror of M4.

Text normalization rules for M4/M6:
  - Unicode NFKC normalization is applied before character scanning.
  - Combining marks and Quranic diacritics are ignored.
  - Character remaps before abjad lookup:
      أ/إ/آ/ٱ -> ا
      ى/ئ -> ي
      ؤ -> و
      ة -> ه
      ء -> ا

--- Verse-level lock (verse_locks.py) ---

For each verse row, for each metric Mx:
  lock = (verse_value % 19 == 0)

Counts per-metric locks and total locks across all verses, per dataset.
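
The verse-level rule above can be illustrated with a minimal sketch. The
function name count_verse_locks and the in-memory row shape (one dict per
verse, keyed by metric name) are assumptions made for illustration only; they
are not taken from verse_locks.py, which reads its rows from the workbooks.

```python
MODULUS = 19
METRIC_ORDER = ("M1", "M2", "M3", "M4", "M5", "M6", "M7")


def count_verse_locks(rows):
    """Count per-metric and total verse-level locks.

    `rows` is a hypothetical in-memory shape: an iterable of dicts that map
    each metric name to an integer value for one verse.
    """
    per_metric = {metric: 0 for metric in METRIC_ORDER}
    for row in rows:
        # Iterate in canonical metric order, independent of dict key order.
        for metric in METRIC_ORDER:
            # A value of 0 also satisfies value % 19 == 0 under the stated rule.
            if row[metric] % MODULUS == 0:
                per_metric[metric] += 1
    return per_metric, sum(per_metric.values())


# Made-up values, chosen only to exercise the rule.
rows = [
    {"M1": 19, "M2": 7, "M3": 38, "M4": 1, "M5": 0, "M6": 57, "M7": 20},
    {"M1": 5, "M2": 19, "M3": 76, "M4": 2, "M5": 3, "M6": 4, "M7": 95},
]
per_metric, total = count_verse_locks(rows)
print(per_metric, total)
```

The sura-level rule differs only in that the values are first summed per sura
before the same modulus test is applied.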

--- Sura-level lock (sura_locks.py) ---

For each sura, for each metric Mx:
  sura_total = sum of all verse values in that sura
  lock = (sura_total % 19 == 0)

Counts per-metric locks and total locks across all 114 suras, per dataset.

--- Global lock families (global_lock.py) ---

Six measurement families per metric (A–F), plus one composite (G):

  A  Global verse sum             sum(all verse values for metric Mx) % 19 == 0
  B  Forward verse concat         concat all verse values in sura/verse order,
                                  interpret as integer, % 19 == 0
  C  Reverse verse concat         same as B but values in reverse order
  D  Global sura sum              sum(all 114 sura totals for Mx) % 19 == 0
  E  Forward sura concat          concat 114 sura totals sura-1..sura-114,
                                  interpret as integer, % 19 == 0
  F  Reverse sura concat          same as E but suras in reverse order
  G  Composite (sura-sum chain)   concat D-values for M1..M7 in canonical order,
                                  interpret as integer, % 19 == 0

Total measurements per dataset: 43
  = 6 families (A–F) × 7 metrics + 1 composite family (G)

Invariant verified by global_lock.py:
  verse_sum == sura_sum for each metric; raises ValueError on violation.

Large-number strategy:
  Concatenation strings for families B, C, E, F can reach tens of millions of
  digits. _digits_mod() uses a single-character streaming loop
  (rem = (rem * 10 + digit) % modulus) to avoid instantiating huge Python
  integer objects. sys.set_int_max_str_digits(0) is set at import time in
  global_lock.py as an additional safeguard on Python 3.11+.

--- Combined ranking (locks_summary.py) ---

Aggregates the three JSON outputs and applies a weighted scoring formula:

  score = 0.60 * (global_passes / 43)
        + 0.25 * (total_sura_locks / suras_scanned)
        + 0.15 * (total_verse_locks / verses_scanned)

Datasets are ranked by score descending.

Additional per-dataset exports from locks_summary.py:
  - summary/grand/json/grand_summary_.json
  - summary/grand/csv/grand_summary_.csv
  - summary/grand/txt/grand_summary_.txt
  (One triplet per dataset; 174 files per ~58-dataset run.)
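
The streaming reduction described under "Large-number strategy" can be
sketched as follows. This is an illustration of the technique only, not the
pipeline's _digits_mod(): the names digits_mod and concat_mod, the ASCII-digit
check, and the assumption that all values are non-negative integers are
introduced here.

```python
def digits_mod(digits, modulus=19):
    """Return int(digits) % modulus without ever building the big integer."""
    rem = 0
    for ch in digits:
        if not "0" <= ch <= "9":
            raise ValueError(f"non-digit character {ch!r}")
        # Horner step: appending one decimal digit multiplies by 10, adds it.
        rem = (rem * 10 + (ord(ch) - ord("0"))) % modulus
    return rem


def concat_mod(values, modulus=19):
    """Residue of the decimal concatenation of non-negative integers.

    The concatenated string is never materialised; each value is streamed
    digit by digit through the same remainder.
    """
    rem = 0
    for value in values:
        for ch in str(value):
            if not "0" <= ch <= "9":
                raise ValueError(f"unsupported value {value!r}")
            rem = (rem * 10 + (ord(ch) - ord("0"))) % modulus
    return rem


# Made-up totals; the forward and reverse orders mirror families E and F.
totals = [19, 38, 57, 123]
forward = concat_mod(totals)
reverse = concat_mod(reversed(totals))
print(forward, reverse)
```

For inputs short enough to convert directly, the result can be cross-checked
against int(digits) % modulus, which is how the sketch is best validated.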

================================================================================
5. REPRODUCIBILITY AUDIT — FINDINGS AND STATUS
================================================================================

R1 — RESOLVED
  Finding : _generated_timestamp() was defined independently in all four
            scripts (verse_locks.py, sura_locks.py, global_lock.py,
            locks_summary.py). Any single-file update would silently diverge
            timestamp behaviour across the pipeline.
  Fix     : The canonical definition was moved to _pipeline_utils.py. All four
            scripts import from this single shared module. Divergence is now a
            Python ImportError, not a silent bug.

R2 — RESOLVED
  Finding : datetime.now() default makes output files non-bit-identical across
            runs. When the four scripts were run as separate processes,
            locks_summary.py captured a fresh timestamp independently of the
            other three, so LOCKS.txt always showed a newer Generated: line
            than its source JSON files.
  Fix     : run_pipeline.py captures one wall-clock timestamp at process start
            and writes it to LOCK_PIPELINE_GENERATED_AT before any stage runs.
            All four output files share the same Generated: value. Default
            behaviour is now deterministic (epoch: 1970-01-01 00:00:00) to
            ensure full bit-reproducibility across machines and reruns.
            To override, set:
              LOCK_PIPELINE_GENERATED_AT=     (custom timestamp)
            or:
              LOCK_PIPELINE_DETERMINISTIC=0   (disable deterministic default)

R3 — RESOLVED
  Finding : wb.close() was placed as a bare final statement in each
            analyze_*() function with no finally guard. Any ValueError raised
            during row iteration would skip the close(), leaking the openpyxl
            file handle for the lifetime of the process.
  Fix     : All three compute functions now wrap their worksheet-reading code
            in try: ... finally: wb.close(). The close() is guaranteed to
            execute on both success and error paths.
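
The try/finally guard described in R3 can be sketched as below. To keep the
example runnable without openpyxl, FakeWorkbook is a hypothetical stand-in
that only records whether close() was called, and analyze() is an illustrative
function, not one of the pipeline's analyze_*() functions.

```python
class FakeWorkbook:
    """Hypothetical stand-in for a workbook; tracks whether close() ran."""

    def __init__(self, rows):
        self.rows = rows
        self.closed = False

    def close(self):
        self.closed = True


def analyze(wb):
    """Sum the first cell of every row, closing the workbook on every path."""
    try:
        total = 0
        for index, row in enumerate(wb.rows, start=1):
            value = row[0]
            # bool is a subclass of int, so it has to be rejected explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"row {index}: non-integer value {value!r}")
            total += value
        return total
    finally:
        # Runs on normal return and when ValueError propagates.
        wb.close()


good = FakeWorkbook([(19,), (38,)])
good_total = analyze(good)

bad = FakeWorkbook([(19,), ("x",)])
try:
    analyze(bad)
    raised = False
except ValueError:
    raised = True

print(good_total, good.closed, raised, bad.closed)
```

Without the finally clause, the second call would leave bad.closed as False,
which is the handle leak the finding describes.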

R4 — RESOLVED
  Finding : normalized_lock_counts dict-comprehension in sura_locks.py and
            verse_locks.py contained an unreachable else branch:
              for metric in (METRIC_ORDER
                             if set(metric_short_order) == set(METRIC_ORDER)
                             else metric_short_order)
            The condition is always True because metric_short_order is always
            list(METRIC_ORDER). Dead code creates a maintenance risk if the
            condition is changed without understanding the context.
  Fix     : Simplified to iterate directly over METRIC_ORDER. The now-unused
            metric_short_order variable and all its assignment, return, and
            unpacking references were removed.

R5 — INFORMATIONAL (no code change required)
  Finding : If the three core scripts were run independently at different
            times (e.g. Stage 1 against one version of the xlsx files and
            Stage 3 against a modified version), locks_summary.py would
            aggregate inconsistent inputs with no warning.
  Mitigation: Fully avoided when running via run_pipeline.py, which produces
            all upstream JSON in the same process run before invoking Stage 4.
            When running scripts individually, the operator is responsible for
            ensuring a consistent input state.

================================================================================
6. DATA INTEGRITY GUARANTEES
================================================================================

Input validation (enforced by all three compute scripts):
  - Exactly 58 named datasets must be present. Missing or extra files raise
    ValueError listing the missing and extra names.
  - Each workbook must contain exactly one valid worksheet: one sheet that has
    sura, verse, and text columns and exactly 7 metric columns matching
    M1..M7. More or fewer qualifying sheets raise ValueError.
  - Sura values must be integers in [1, 114]. Out-of-range or non-integer
    values raise ValueError with the file name and row number.
  - Verse and metric values must be integers (no fractions, booleans, or
    non-numeric strings). Invalid values raise ValueError with location.
  - For global_lock.py: verse_sum == sura_sum invariant is verified per metric
    after all rows are read; mismatch raises ValueError.

Output consistency:
  - All output directories are created with parents=True, exist_ok=True.
  - JSON outputs use ensure_ascii=True and indent=2 for stable serialisation
    across platforms.
  - Text outputs are UTF-8 encoded; lines joined with '\n' plus a trailing
    newline, giving consistent line endings.

Schema versioning:
  - Stage JSON outputs carry "schema_version": "1.2"
    (grand_verse_lock_summary.json, grand_sura_lock_summary.json,
    global_lock_summary.json, LOCKS.json).
  - Per-dataset grand JSON outputs carry "schema_version": "1.0"
    (summary/grand/json/grand_summary_.json).
  - registry.json carries "schema_version": "1.1".

================================================================================
7. HASH REGISTRY PROTOCOL
================================================================================

Script        : hash_registry.py
Algorithm     : SHA-256 (64-character hexadecimal digest)
Registry file : registry.json (root)

Build a baseline (first run or after a known-good state):
  python hash_registry.py

Verify files against the saved baseline:
  python hash_registry.py --verify

Exit codes for --verify:
  0   All entries match (no changes detected).
  1   One or more files differ, were added, or are missing.
  2   registry.json does not exist; build it first.

Tracked file categories:
  input    *_7Metrics.xlsx                          (58 files)
  script   *.py                                     (7 files)
  doc      README.md, VALIDATION_REPORT.txt         (2 files)
  config   requirements.txt, *.code-workspace       (2 files)
  output   summary/**/*.txt, summary/**/*.json      (130 files after a full run)

Recommended lifecycle:
  1. Establish a baseline before the first run:
       python hash_registry.py
  2. Run the pipeline:
       python run_pipeline.py
  3. Update the registry with the new output hashes:
       python hash_registry.py
  4.
     On subsequent runs, verify tracked workspace files against the saved
     baseline:
       python hash_registry.py --verify
     Expected result after a rerun before rebuilding baseline:
     inputs/scripts/docs/config typically UNCHANGED, outputs CHANGED.

Bit-stable baseline (zero changes on re-run):
  $env:LOCK_PIPELINE_DETERMINISTIC = "1"
  python run_pipeline.py
  python hash_registry.py
  # Re-run with the same env var — verify should report 0 CHANGED.
  python run_pipeline.py
  python hash_registry.py --verify   # Expect: PASSED

================================================================================
8. DEPENDENCY REQUIREMENTS
================================================================================

Python   : 3.11 or later (tested on 3.15.0a6)
openpyxl : 3.1.5 (pinned in requirements.txt)

Install:
  python -m venv .venv
  .venv\Scripts\activate
  python -m pip install -r requirements.txt

Standard library modules used (no additional packages required):
  argparse, collections, datetime, hashlib, json, os, pathlib, re, sys,
  unicodedata

================================================================================
9. KNOWN CONSTRAINTS
================================================================================

C1 — Large concatenation strings
  Families B, C, E, and F can produce strings of tens of millions of digits
  for large datasets. _digits_mod() avoids converting these to Python big-int
  objects via a streaming modular reduction loop. On Python 3.11+,
  sys.set_int_max_str_digits(0) is also set to remove the default 4300-digit
  limit on int/str conversion, covering any code path that does call int().

C2 — Row ordering assumption
  global_lock.py sorts all row tokens by (sura, verse) after reading the full
  workbook, making families B and C independent of physical sheet order.
  Stages 1 and 2 process rows in physical sheet order for accumulation; their
  results are deterministic only if source workbooks have no duplicate
  (sura, verse) pairs with differing metric values.
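
The ordering guarantee in C2 can be illustrated with a small sketch. The
function concat_families and the (sura, verse, value) tuple shape are
assumptions for illustration, and "reverse" is read here as the same values in
reversed sequence. The sketch uses int() on short strings for brevity, whereas
the pipeline streams the digits.

```python
import random


def concat_families(rows, modulus=19):
    """Return (forward, reverse) concat residues for one metric.

    `rows` is a hypothetical iterable of (sura, verse, value) tuples in any
    physical order; sorting by (sura, verse) fixes the logical order first.
    """
    ordered = sorted(rows, key=lambda row: (row[0], row[1]))
    forward = "".join(str(value) for _, _, value in ordered)
    reverse = "".join(str(value) for _, _, value in reversed(ordered))
    return int(forward) % modulus, int(reverse) % modulus


# Made-up rows; the shuffled copy simulates a different physical sheet order.
rows = [(1, 1, 12), (1, 2, 7), (2, 1, 190), (2, 2, 3)]
shuffled = list(rows)
random.Random(0).shuffle(shuffled)

in_order = concat_families(rows)
out_of_order = concat_families(shuffled)
print(in_order == out_of_order)
```

Because the sort key is the (sura, verse) pair, two rows sharing the same pair
would keep their physical order relative to each other, which is why duplicate
pairs with differing values remain a caveat.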

C3 — Verse 0 rows
  Rows where verse == 0 are not explicitly excluded. If present, they
  contribute to sura totals, verse sums, and concatenations in all three
  stages.

C4 — Empty rows
  A row where the entire tuple is None is skipped. A row with a non-None but
  non-integer sura or verse raises ValueError immediately with the file name
  and row number, allowing fast diagnosis.

C5 — Boolean cell values
  Cell values that are Python bool are explicitly rejected and treated as
  non-integer by both _to_int() and _to_digits(). This prevents True/False
  being silently coerced to 1/0.

================================================================================
10. VALIDATION CHECKLIST
================================================================================

Pre-run checks:
  [ ] All 58 *_7Metrics.xlsx files present in root.
  [ ] .venv activated; openpyxl 3.1.5 available (pip show openpyxl).
  [ ] python hash_registry.py --verify passes (or baseline not yet built).

Run:
  [ ] python run_pipeline.py (no errors or tracebacks in output)

Post-run checks:
  [ ] summary/txt/ contains exactly 9 .txt files (8 generated + 1 static
      explanation).
  [ ] summary/json/ contains exactly 4 .json files.
  [ ] summary/grand/txt/ contains exactly 58 .txt files.
  [ ] summary/grand/csv/ contains exactly 58 .csv files.
  [ ] summary/grand/json/ contains exactly 58 .json files.
  [ ] Stage JSON schema versions are "1.2"; per-dataset grand JSON schema
      versions are "1.0".
  [ ] Stage JSON files each contain a "datasets" array with exactly 58
      entries.
  [ ] global_lock_summary.json: every dataset has
      pass_count + fail_count == 43.
  [ ] LOCKS.json: "ranking" array has exactly 58 entries.
  [ ] All four Generated: lines in the four .txt outputs are identical
      (guaranteed when run via run_pipeline.py).
  [ ] python hash_registry.py (update registry with new output hashes).

================================================================================
11.
FINAL VERIFICATION RUN (May 6, 2026)
================================================================================

Executed: python run_pipeline.py (via .venv\Scripts\python.exe)
Datasets: 58 (added damascus and VERSE 0 variants for all 8 readers)

Pipeline completion:
  [1/4] verse_locks     PASS
  [2/4] sura_locks      PASS
  [3/4] global_lock     PASS
  [4/4] locks_summary   PASS
  Exit code: 0

Output files created:
  summary/txt/          9 files (8 generated: 4 primary + 4 _bool variants;
                        1 static explanation)
  summary/json/         4 files (LOCKS.json + 3 stage summaries)
  summary/grand/txt/    58 files (one per dataset)
  summary/grand/csv/    58 files (one per dataset)
  summary/grand/json/   58 files (one per dataset)

Hash registry verification:
  python hash_registry.py
    Registry written: registry.json
    (197 entries: config=1, doc=1, input=58, output=130, script=7)
  python hash_registry.py --verify
    Verification PASSED — all 197 entries match the registry

Determinism check:
  All four output .txt files generated in a single run via run_pipeline.py
  share an identical "Generated:" timestamp (wall-clock captured at pipeline
  start). Confirms R2 fix is working as intended.

Status: WORKSPACE IS CLEAN AND READY FOR PRODUCTION USE

================================================================================
END OF REPORT
================================================================================