Metadata-Version: 2.3
Name: slvalidator
Version: 0.1.0
Summary: Validates and manages the Ziggiz semantic graph
Author: zoe.von.pentz_zgtm
Author-email: zoe.von.pentz_zgtm <zoe.von.pentz@ziggiz.ai>
Requires-Dist: sqlalchemy>=2.0.30
Requires-Dist: sqla-yaml-fixtures>=1.2.0
Requires-Dist: sqlglot>=28.6.0
Requires-Dist: pysigma>=0.11.0
Requires-Dist: pysigma-backend-sqlite>=0.2.0
Requires-Python: >=3.10.12
Description-Content-Type: text/markdown

# slvalidator

[![Tests](https://github.com/zigguratum-core/slvalidator/actions/workflows/tests.yaml/badge.svg)](https://github.com/zigguratum-core/slvalidator/actions/workflows/tests.yaml)

Validates and manages the Ziggiz semantic graph — the configuration layer
that maps raw datasources to semantic datasets through typed mappings.

## What It Does

slvalidator checks that your semantic graph is internally consistent before
you deploy changes to the platform. It catches:

- Missing parent references (orphaned datasets)
- Schema type errors (invalid Databricks column types)
- Missing meta fields (`_z_*` system columns)
- Duplicate mappings and naming conflicts
- Invalid security rules (Sigma/SQL syntax)
- Field references in rules that don't exist in target schemas

## Architecture

### The Two Hierarchies

```
Physical layer (how data arrives):
  DatasourceTemplate ──► DatasourceInstance
          │
          ▼
  DatasetTemplate ──► DatasetInstance
          │
          │  DatasetMapping
          ▼
Semantic layer (how the platform sees data):
  SemanticDatasource
          │
          ▼
  SemanticDataset ◄──► SemanticDataset
                 SemanticMapping
```

**Physical layer** describes how data physically arrives:

- `DatasourceTemplate` / `DatasourceInstance` — connector definitions and their deployments
- `DatasetTemplate` / `DatasetInstance` — table schemas and their materialized instances

**Semantic layer** provides the business/logical view:

- `SemanticDatasource` — a logical grouping of related datasets
- `SemanticDataset` — a typed schema representing a business entity

**Mappings** bridge the two:

- `DatasetMapping` — maps a `DatasetTemplate` column to a `SemanticDataset` column
- `SemanticMapping` — maps between `SemanticDataset` columns (dataset-to-dataset)

**Security layer**:

- `SecurityUseCase` — a detection use case grouping rules
- `SecurityUseCaseRule` — a Sigma or SQL rule targeting a `SemanticDataset`

### Core Components

| Module                    | Purpose                                                        |
| ------------------------- | -------------------------------------------------------------- |
| `graph.py`                | `SemanticGraph` class — main entry point for all operations    |
| `entities.py`             | SQLAlchemy models for all entity types                         |
| `changeset.py`            | `ChangeSet` dataclasses for create/update/delete operations    |
| `validators/`             | Graph validation rules (one file per entity type)              |
| `changeset_validation.py` | Changeset validation against current graph state               |
| `fixes.py`                | Auto-fix functions for fixable validation errors               |
| `yaml_io.py`              | YAML loading and export (from files, directories, fixtureslib) |
| `changeset_parser.py`     | JSON → ChangeSet parser                                        |
| `changeset_serializer.py` | ChangeSet → JSON serializer                                    |
| `db.py`                   | Database session management (`session_scope`)                  |
| `base.py`                 | SQLAlchemy declarative base with dynamic table prefixes        |
| `utils.py`                | Meta field definitions, SQL parsing, Sigma helpers             |
| `errors.py`               | `ValidationError` and `ChangeSetValidationError` types         |

### Database

- Default: in-memory SQLite with `StaticPool` for thread safety
- UUIDs stored as `CHAR(36)` strings
- Each `SemanticGraph` instance gets its own table prefix (`slv_{name}_`)
- `session_scope()` context manager handles transactions
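
The transaction and table-prefix pattern described above can be sketched with the standard library alone. This is not the package's implementation (which uses SQLAlchemy and `StaticPool`); the `table_prefix` helper, the `semantic_dataset` table name, and the sample UUID are assumptions made for illustration.

```python
import sqlite3
from contextlib import contextmanager


def table_prefix(graph_name: str) -> str:
    # Follows the documented slv_{name}_ pattern; any name normalization
    # the real package applies is not shown here.
    return f"slv_{graph_name}_"


@contextmanager
def session_scope(conn: sqlite3.Connection):
    """Commit when the block succeeds, roll back when it raises."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


# A single shared in-memory connection, similar in spirit to StaticPool.
conn = sqlite3.connect(":memory:", check_same_thread=False)
prefix = table_prefix("MyGraph")

with session_scope(conn) as session:
    # UUIDs are stored as 36-character strings.
    session.execute(f'CREATE TABLE "{prefix}semantic_dataset" (id CHAR(36) PRIMARY KEY)')
    session.execute(
        f'INSERT INTO "{prefix}semantic_dataset" VALUES (?)',
        ("3f2b8c1e-0000-4000-8000-000000000001",),
    )
```

Because each graph writes to tables under its own prefix, two graphs can share one in-memory database without their rows colliding.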

## Quick Start

### Validate a graph from YAML files

```python
from slvalidator.semanticgraph import SemanticGraph

graph = SemanticGraph(name="MyGraph")
graph.from_directory("path/to/yaml/files")

errors = graph.validate()
if errors:
    for error in errors:
        print(f"[{error.entity_type}] {error.entity_id}: {error.error_msg}")
        if error.fixable:
            print("  ^ This error can be auto-fixed")
```

### Apply auto-fixes

```python
# Fix all fixable errors (missing meta fields, passthrough mappings, etc.)
fixes_applied = graph.apply_fixes()
print(f"Applied {fixes_applied} fixes")

# Re-validate to confirm
remaining = graph.validate()
```

### Validate a changeset

```python
from slvalidator.semanticgraph.changeset import ChangeSet

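# json_input holds the changeset JSON (the same payload the CLI reads from stdin);
# graph is a SemanticGraph loaded as in the examples above.
changeset = ChangeSet()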
changeset.from_json(json_input)

# Static validation (structure, UUIDs, required fields)
static_errors = changeset.validate()

# Validate against current graph state (returns errors and expanded changeset)
errors, expanded = graph.validate_changeset(changeset)
```

### Diff two graphs

```python
from slvalidator.semanticgraph import SemanticGraph

a = SemanticGraph(name="before")
a.from_directory("path/to/before")

b = SemanticGraph(name="after")
b.from_directory("path/to/after")

diff = a.diff(b)  # ChangeSet object
print(diff.to_json())  # JSON changeset representing the differences
```
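
Conceptually, a diff between two graphs reduces to create, update, and delete operations keyed by entity UUID. The sketch below illustrates that idea on plain dictionaries; `SimpleChangeSet` and `diff_entities` are hypothetical names and do not reflect the package's `ChangeSet` structure or its JSON format.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleChangeSet:  # hypothetical stand-in, not slvalidator's ChangeSet
    create: dict = field(default_factory=dict)
    update: dict = field(default_factory=dict)
    delete: list = field(default_factory=list)


def diff_entities(before: dict, after: dict) -> SimpleChangeSet:
    """Compare two {uuid: attributes} maps and collect the differences."""
    changes = SimpleChangeSet()
    for uid, attrs in after.items():
        if uid not in before:
            changes.create[uid] = attrs   # only in "after"
        elif before[uid] != attrs:
            changes.update[uid] = attrs   # present in both, attributes differ
    changes.delete = [uid for uid in before if uid not in after]
    return changes


before = {"u1": {"display_name": "Users"}, "u2": {"display_name": "Hosts"}}
after = {"u1": {"display_name": "Accounts"}, "u3": {"display_name": "Alerts"}}
changes = diff_entities(before, after)
```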

### Export to YAML

```python
# Export all entities to a directory (one file per model)
graph.to_yaml("output_dir/")

# Export preserving source file structure
graph.to_yaml_sources("output_dir/")
```

## CLI Reference

```bash
# Validate a graph directory
slvalidator validate --graph-dir path/to/yaml/files

# Validate a changeset against a graph (reads changeset JSON from stdin)
echo '{"changeset": {...}}' | slvalidator validate-changeset --graph-dir path/to/yaml/files

# Diff two graph directories
slvalidator diff --dir-a path/to/before --dir-b path/to/after

# Apply a changeset and export (reads changeset JSON from stdin)
echo '{"changeset": {...}}' | slvalidator apply --graph-dir path/to/yaml/files --output-dir output/
```

Exit codes:

- `0` — validation passed (no errors)
- `1` — validation failed (errors found)
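
These exit codes make the CLI easy to gate on from a script. A minimal sketch, assuming the command is passed as an argument list; `validation_passed` is a hypothetical helper, and a stand-in command is used so the example runs without the package installed.

```python
import subprocess
import sys


def validation_passed(command: list) -> bool:
    """Run a validator command and report whether it exited with 0."""
    result = subprocess.run(command, check=False)
    return result.returncode == 0


# In a pipeline this would be the documented CLI call, for example:
#   validation_passed(["slvalidator", "validate", "--graph-dir", "path/to/yaml/files"])
# Stand-in commands that simply exit with 1 and 0:
failing = [sys.executable, "-c", "raise SystemExit(1)"]
passing = [sys.executable, "-c", "raise SystemExit(0)"]

if not validation_passed(failing):
    print("validation failed, blocking deploy")
```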

## Validation Rules

### Semantic Layer

#### SemanticDatasource

| Check               | Description                                               | Fixable |
| ------------------- | --------------------------------------------------------- | ------- |
| Unique display_name | No duplicate display names across datasources             | No      |
| Unique safe_name    | No duplicate safe names across datasources                | No      |
| safe_name matches   | safe_name must equal value computed from display_name     | Yes     |
| Has children        | Must have at least one SemanticDataset child              | No      |
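
The `safe_name matches` check and its fix can be pictured as follows. The normalization rule is an assumption: the README does not document how a safe name is computed, so `compute_safe_name` (lowercase, non-alphanumerics collapsed to underscores) is purely illustrative, as is the `NamedEntity` class.

```python
import re
from dataclasses import dataclass
from typing import Optional


def compute_safe_name(display_name: str) -> str:
    # ASSUMPTION: lowercase, runs of non-alphanumerics become "_".
    # The rule the package actually applies may differ.
    return re.sub(r"[^a-z0-9]+", "_", display_name.lower()).strip("_")


@dataclass
class NamedEntity:  # hypothetical stand-in for any named entity
    display_name: str
    safe_name: str


def check_safe_name(entity: NamedEntity) -> Optional[str]:
    """Return an error message when the declared safe_name is stale."""
    expected = compute_safe_name(entity.display_name)
    if entity.safe_name != expected:
        return f"safe_name {entity.safe_name!r} should be {expected!r}"
    return None


def fix_safe_name(entity: NamedEntity) -> None:
    """The fixable part: overwrite safe_name with the computed value."""
    entity.safe_name = compute_safe_name(entity.display_name)


entity = NamedEntity(display_name="AWS CloudTrail", safe_name="aws-cloudtrail")
error_before = check_safe_name(entity)
fix_safe_name(entity)
error_after = check_safe_name(entity)
```

The check is fixable because the correct value is fully determined by `display_name`; uniqueness violations are not, since resolving them requires a naming decision.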

#### SemanticDataset

| Check                | Description                                               | Fixable |
| -------------------- | --------------------------------------------------------- | ------- |
| Parent exists        | Parent SemanticDatasource must exist                      | No      |
| Schema not empty     | Schema must have at least one field                       | No      |
| Unique display_name  | No duplicate display names within parent                  | No      |
| Unique safe_name     | No duplicate safe names within parent                     | No      |
| safe_name matches    | safe_name must equal value computed from display_name     | Yes     |
| Valid schema types   | Schema field types must be valid Databricks types         | No      |
| Required meta fields | Schema must contain `_z_*` meta fields with correct types | Yes     |
| No \_record_source   | `_record_source` must not be in schema (added by backend) | Yes     |
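
Three of the non-fixable checks above (parent exists, schema not empty, unique display name within parent) can be sketched over plain dataclasses. The `Dataset` class and `validate_datasets` function are hypothetical simplifications, not the package's entity model or validator API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Dataset:  # hypothetical, simplified SemanticDataset
    id: str
    parent_id: str
    display_name: str
    schema: dict = field(default_factory=dict)  # field name -> type


def validate_datasets(datasets: list, datasource_ids: set) -> list:
    errors = []
    # Uniqueness is scoped to the parent, so the key includes parent_id.
    name_counts = Counter((d.parent_id, d.display_name) for d in datasets)
    for d in datasets:
        if d.parent_id not in datasource_ids:
            errors.append(f"{d.id}: parent {d.parent_id} does not exist")
        if not d.schema:
            errors.append(f"{d.id}: schema is empty")
        if name_counts[(d.parent_id, d.display_name)] > 1:
            errors.append(f"{d.id}: duplicate display_name {d.display_name!r} within parent")
    return errors


datasets = [
    Dataset("d1", "s1", "Users", {"id": "STRING"}),
    Dataset("d2", "s1", "Users", {"id": "STRING"}),
    Dataset("d3", "missing", "Hosts"),
]
errors = validate_datasets(datasets, datasource_ids={"s1"})
```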

#### SemanticMapping

| Check                        | Description                                                   | Fixable |
| ---------------------------- | ------------------------------------------------------------- | ------- |
| Source dataset exists        | Source SemanticDataset must exist                             | No      |
| Destination dataset exists   | Destination SemanticDataset must exist                        | No      |
| Destination column in schema | `dst_col_name` must exist in destination schema               | No      |
| Source columns in schema     | All `src_cols` must exist in source schema                    | No      |
| Source columns in expression | All `src_cols` must appear in `src_col_expr` or `filter_expr` | No      |
| No duplicate mappings        | No two mappings with same src_id, dst_id, dst_col_name        | No      |
| Passthrough mappings exist   | `_z_*` fields must have passthrough mappings                  | Yes     |
| Passthrough mapping format   | `_z_*` mappings must be direct passthroughs                   | Yes     |
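
Two of the mapping checks lend themselves to a short illustration: every entry in `src_cols` must be used by an expression, and no two mappings may share the same `(src_id, dst_id, dst_col_name)` key. The `Mapping` dataclass below is a hypothetical simplification, and the whole-word regex stands in for real SQL parsing.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mapping:  # hypothetical, simplified mapping row
    src_id: str
    dst_id: str
    dst_col_name: str
    src_cols: list
    src_col_expr: str
    filter_expr: Optional[str] = None


def unused_src_cols(m: Mapping) -> list:
    """src_cols that appear in neither src_col_expr nor filter_expr."""
    text = f"{m.src_col_expr} {m.filter_expr or ''}"
    # Whole-word matching is a simplification; a SQL parser is more robust.
    return [c for c in m.src_cols if not re.search(rf"\b{re.escape(c)}\b", text)]


def duplicate_keys(mappings: list) -> list:
    """Keys (src_id, dst_id, dst_col_name) used by more than one mapping."""
    counts = Counter((m.src_id, m.dst_id, m.dst_col_name) for m in mappings)
    return [key for key, n in counts.items() if n > 1]


m1 = Mapping("a", "b", "full_name", ["first", "last"], "concat(first, ' ', last)")
m2 = Mapping("a", "b", "full_name", ["first", "email"], "upper(first)")
m3 = Mapping("a", "b", "first_name", ["first", "email"], "upper(first)", "email IS NOT NULL")
```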

### Template/Instance Layer

#### DatasourceTemplate

| Check               | Description                                               | Fixable |
| ------------------- | --------------------------------------------------------- | ------- |
| Unique display_name | No duplicate display names across templates               | No      |
| Unique safe_name    | No duplicate safe names across templates                  | No      |
| safe_name matches   | safe_name must equal value computed from display_name     | Yes     |
| Has children        | Must have at least one DatasetTemplate child              | No      |

#### DatasetTemplate

| Check               | Description                                               | Fixable |
| ------------------- | --------------------------------------------------------- | ------- |
| Parent exists       | Parent DatasourceTemplate must exist                      | No      |
| Unique display_name | No duplicate display names within parent                  | No      |
| Unique safe_name    | No duplicate safe names within parent                     | No      |
| safe_name matches   | safe_name must equal value computed from display_name     | Yes     |
| Has schema rows     | Must have at least one DatasetTemplateSchema row          | No      |

#### DatasetTemplateSchema

| Check                | Description                                        | Fixable |
| -------------------- | -------------------------------------------------- | ------- |
| Parent exists        | Parent DatasetTemplate must exist                  | No      |
| Unique field names   | No duplicate field_name within parent template     | No      |
| Valid field types    | field_type must be a valid Databricks type         | No      |
| Required meta fields | Must contain `_z_*` meta fields with correct types | Yes     |

#### DatasetMapping

| Check                        | Description                                                    | Fixable |
| ---------------------------- | -------------------------------------------------------------- | ------- |
| Source dataset exists        | Source DatasetTemplate must exist                              | No      |
| Destination dataset exists   | Destination SemanticDataset must exist                         | No      |
| Destination column in schema | `dst_col_name` must exist in destination schema                | No      |
| Source columns in schema     | All `src_cols` must exist in source template schema            | No      |
| Source columns in expression | All `src_cols` must appear in `src_col_expr` or `filter_expr`  | No      |
| No duplicate mappings        | No two mappings with same src_id, dst_id, dst_col_name         | No      |
| Passthrough mappings exist   | `_z_*` fields (except `_z_raw`) must have passthrough mappings | Yes     |
| Passthrough mapping format   | `_z_*` mappings must be direct passthroughs                    | Yes     |

### Security Layer

#### SecurityUseCaseRule

| Check                    | Description                                                        | Fixable |
| ------------------------ | ------------------------------------------------------------------ | ------- |
| Parent exists            | Parent SecurityUseCase must exist                                  | No      |
| Sigma syntax valid       | Sigma rules must parse without errors                              | No      |
| Logsource category       | Sigma rules must define `logsource.category`                       | No      |
| Logsource product        | Sigma rules must define `logsource.product`                        | No      |
| Target datasource exists | Category must match a SemanticDatasource `safe_name`               | No      |
| Target dataset exists    | Product must match a SemanticDataset `safe_name` within datasource | No      |
| Fields in schema         | All fields referenced in rule SQL must exist in target schema      | No      |
| SQL parseable            | SQL rules must be parseable                                        | No      |
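
The target-resolution checks can be read as a lookup chain: `logsource.category` selects a SemanticDatasource by `safe_name`, `logsource.product` selects a SemanticDataset within it, and the rule's fields are then compared with that dataset's schema. The sketch below works on an already-parsed rule dictionary; the nested `GRAPH` index, the `okta`/`system_log` names, and `check_rule_target` are assumptions for illustration, and Sigma parsing itself is not shown.

```python
# Hypothetical index: datasource safe_name -> dataset safe_name -> schema
GRAPH = {
    "okta": {"system_log": {"actor": "STRING", "event_type": "STRING"}},
}

RULE = {"logsource": {"category": "okta", "product": "system_log"}}


def check_rule_target(rule: dict, fields: list, graph: dict) -> list:
    """Resolve the rule's target dataset, then check its field references."""
    logsource = rule.get("logsource") or {}
    category = logsource.get("category")
    product = logsource.get("product")
    if not category:
        return ["logsource.category is missing"]
    if not product:
        return ["logsource.product is missing"]
    if category not in graph:
        return [f"no SemanticDatasource with safe_name {category!r}"]
    if product not in graph[category]:
        return [f"no SemanticDataset {product!r} in datasource {category!r}"]
    schema = graph[category][product]
    return [f"field {name!r} not in target schema" for name in fields if name not in schema]


problems = check_rule_target(RULE, ["actor", "src_ip"], GRAPH)
```

Each step depends on the previous one, so the sketch returns early: a field check is only meaningful once the target dataset has been resolved.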

## Auto-Fixes

The `apply_fixes()` method automatically resolves errors marked `fixable=True`:

| Entity Type           | Fix                           | Description                                                                                                                           |
| --------------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| All named entities    | Override mismatched safe_name | Overwrites safe_name with value computed from display_name (SemanticDatasource, SemanticDataset, DatasourceTemplate, DatasetTemplate) |
| SemanticDataset       | Add missing meta fields       | Adds missing `_z_*` fields to schema with correct types                                                                               |
| SemanticDataset       | Fix meta field types          | Corrects type mismatches in `_z_*` fields                                                                                             |
| SemanticDataset       | Remove \_record_source        | Removes `_record_source` from schema                                                                                                  |
| SemanticMapping       | Create passthrough mappings   | Creates missing `_z_*` passthrough mappings                                                                                           |
| SemanticMapping       | Fix non-passthrough mappings  | Converts `_z_*` mappings to direct passthroughs                                                                                       |
| DatasetTemplateSchema | Add missing meta fields       | Adds missing `_z_*` fields with correct types                                                                                         |
| DatasetTemplateSchema | Fix meta field types          | Corrects type mismatches in `_z_*` fields                                                                                             |
| DatasetMapping        | Create passthrough mappings   | Creates missing `_z_*` passthrough mappings (excludes `_z_raw`)                                                                       |
| DatasetMapping        | Fix non-passthrough mappings  | Converts `_z_*` mappings to direct passthroughs                                                                                       |

### Meta Fields (`_z_*`)

| Field                 | Type      | Scope          | Description                               |
| --------------------- | --------- | -------------- | ----------------------------------------- |
| `_z_source_system`    | STRING    | All            | Source system identifier                  |
| `_z_source_timestamp` | TIMESTAMP | All            | Timestamp from source                     |
| `_z_ingest_timestamp` | TIMESTAMP | All            | When data was ingested                    |
| `_z_hash`             | STRING    | All            | Hash of the record                        |
| `_z_valid_from`       | TIMESTAMP | All            | Start of validity period                  |
| `_z_valid_to`         | TIMESTAMP | All            | End of validity period                    |
| `_z_current`          | BOOLEAN   | All            | Whether record is current                 |
| `_z_deleted`          | BOOLEAN   | All            | Whether record is deleted                 |
| `_z_run_id`           | STRING    | All            | Processing run identifier                 |
| `_z_enrichments`      | STRING    | All            | Enrichment metadata                       |
| `_z_raw`              | STRING    | Templates only | Raw source data (not in SemanticDatasets) |
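
The table above translates directly into a lookup that a meta-field fix could use. This is a minimal sketch, assuming a schema is represented as a `{field_name: type}` dictionary with upper-case type names; `fix_meta_fields` is a hypothetical function, not the package's fix implementation.

```python
# Field names and types are taken from the table above.
META_FIELDS = {
    "_z_source_system": "STRING",
    "_z_source_timestamp": "TIMESTAMP",
    "_z_ingest_timestamp": "TIMESTAMP",
    "_z_hash": "STRING",
    "_z_valid_from": "TIMESTAMP",
    "_z_valid_to": "TIMESTAMP",
    "_z_current": "BOOLEAN",
    "_z_deleted": "BOOLEAN",
    "_z_run_id": "STRING",
    "_z_enrichments": "STRING",
}
TEMPLATE_ONLY_FIELDS = {"_z_raw": "STRING"}


def required_meta_fields(is_template: bool) -> dict:
    """_z_raw is required for templates only, not for SemanticDatasets."""
    if is_template:
        return {**META_FIELDS, **TEMPLATE_ONLY_FIELDS}
    return dict(META_FIELDS)


def fix_meta_fields(schema: dict, is_template: bool = False) -> int:
    """Add missing _z_* fields and correct mistyped ones; return the change count."""
    changes = 0
    for name, expected_type in required_meta_fields(is_template).items():
        if schema.get(name) != expected_type:
            schema[name] = expected_type
            changes += 1
    return changes


schema = {"user_id": "STRING", "_z_current": "STRING"}  # one mistyped meta field
changes = fix_meta_fields(schema)
```

Running the fix a second time on the same schema changes nothing, which is the property that makes re-validating after `apply_fixes()` meaningful.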

## Requirements

- Python 3.10.12+
- [uv](https://github.com/astral-sh/uv) for dependency management
- [Task](https://taskfile.dev/) for running development commands

## Development

```bash
# Setup development environment
task setup-dev

# Run tests (via Task, or directly with pytest)
task chore:tests
uv run pytest -v tests

# Run a single test file
uv run pytest -v tests/test_semanticgraph.py

# Run a single test
uv run pytest -v tests/test_file.py::TestClass::test_name

# Lint and format (via Task, or directly with ruff)
task chore:format
uvx ruff check --fix && uvx ruff format

# Run pre-commit hooks
task chore:pre-commit

# Clean caches
task chore:clean
```

## Project Structure

```
src/slvalidator/
├── __init__.py                         # Package root
├── __main__.py                         # python -m slvalidator entry point
├── cli.py                              # CLI entry point (validate, validate-changeset, diff, apply)
├── scripts/                            # Utility scripts (e.g., init_db_test.py)
├── semanticgraph/
│   ├── __init__.py                     # Package exports (SemanticGraph, ValidationError)
│   ├── graph.py                        # SemanticGraph class (main entry point)
│   ├── entities.py                     # SQLAlchemy entity models
│   ├── base.py                         # SQLAlchemy base with dynamic table prefixes
│   ├── db.py                           # Database session management
│   ├── errors.py                       # ValidationError and ChangeSetValidationError
│   ├── utils.py                        # Meta fields, SQL parsing, Sigma helpers
│   ├── changeset.py                    # ChangeSet dataclasses and validation
│   ├── changeset_parser.py             # JSON → ChangeSet parsing
│   ├── changeset_serializer.py         # ChangeSet → JSON serialization
│   ├── changeset_validation.py         # Changeset validation against graph state
│   ├── yaml_io.py                      # YAML loading and export
│   ├── fixes.py                        # Auto-fix functions
│   └── validators/                     # Graph validation rules
│       ├── __init__.py                 # Validator registry (run_all)
│       ├── _helpers.py                 # Shared validation helpers
│       ├── uuids.py                    # UUID uniqueness checks
│       ├── semantic_datasources.py     # SemanticDatasource validation
│       ├── semantic_datasets.py        # SemanticDataset validation
│       ├── semantic_mappings.py        # SemanticMapping validation
│       ├── datasource_templates.py     # DatasourceTemplate validation
│       ├── dataset_templates.py        # DatasetTemplate validation
│       ├── dataset_template_schemas.py # DatasetTemplateSchema validation
│       ├── dataset_mappings.py         # DatasetMapping validation
│       └── security_rules.py           # SecurityUseCaseRule validation
└── models/
tests/
├── __init__.py
├── conftest.py
├── test_semanticgraph.py
├── test_changeset_self_validation.py
├── test_cli.py
├── test_utils.py
├── export/                             # YAML fixture files for tests
└── test_validation/
    ├── conftest.py
    └── test_*.py
```

Additional entity types (`ConnectorTemplate`, `Policy`, `HAEnvironment`, etc.) exist
for specific integration features — see `entities.py` for the full list.

## License

Proprietary
