YAML Schema

This page describes the schemas for the YAML files used in Cyclonetix: tasks, DAGs, contexts, and parameter sets, along with the JSON format for evaluation results.

Task Definition Schema

Tasks are defined in YAML files with the following schema:

# Unique identifier for the task (required)
id: "task_id"

# Human-readable name (required)
name: "Task Name"

# Optional description
description: "Detailed description of the task's purpose"

# Command to execute (required)
command: "python script.py --input ${INPUT_FILE} --output ${OUTPUT_FILE}"

# Dependencies on other tasks (optional)
dependencies:
  - "dependency_task_1"
  - "dependency_task_2:parameter_set_name"  # With specific parameter set

# Default parameters (optional)
parameters:
  PARAM1: "default_value"
  PARAM2: "another_value"
  # Complex parameter with validation
  COMPLEX_PARAM:
    default: "value1"
    description: "Parameter description"
    required: true
    options: ["value1", "value2", "value3"]
    validation: "^[a-z0-9_]+$"  # regex: lowercase letters, digits, underscores only

# Queue assignment (optional, defaults to "default")
queue: "gpu_tasks"

# Whether this is an evaluation point (optional, defaults to false)
evaluation_point: false

# Timeout in seconds (optional)
timeout_seconds: 3600

# Retry configuration (optional)
retries:
  count: 3
  delay_seconds: 300
  exponential_backoff: true
  max_delay_seconds: 3600
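  # Assuming the delay doubles on each attempt (a common exponential backoff
  # scheme), the waits above would be 300s, 600s, then 1200s, always capped
  # at max_delay_seconds.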

# Resource requirements (optional)
resources:
  cpu: 2           # CPU cores
  memory_mb: 4096  # 4 GiB
  gpu: 1
  disk_mb: 10240   # 10 GiB

# Task tags for organization (optional)
tags:
  - "etl"
  - "data_processing"

# Git configuration for code execution (optional)
git:
  repository: "https://github.com/example/repo.git"
  ref: "main"  # Branch, tag, or commit
  path: "scripts/process.py"
  environment: "python-data-science"

Example Task

id: "train_model"
name: "Train Machine Learning Model"
description: "Trains a machine learning model using the prepared dataset"
command: |
  python train.py \
    --data-path ${DATA_PATH} \
    --model-type ${MODEL_TYPE} \
    --epochs ${EPOCHS} \
    --batch-size ${BATCH_SIZE} \
    --output-path ${MODEL_OUTPUT}
dependencies:
  - "prepare_data"
  - "feature_engineering"
parameters:
  DATA_PATH: "/data/processed"
  MODEL_TYPE: "random_forest"
  EPOCHS: "100"
  BATCH_SIZE: "32"
  MODEL_OUTPUT: "/models/latest"
queue: "gpu_tasks"
timeout_seconds: 7200
resources:
  cpu: 4
  memory_mb: 16384
  gpu: 1
tags:
  - "ml"
  - "training"

DAG Definition Schema

DAGs are defined in YAML files with the following schema:

# Unique identifier for the DAG (required)
id: "dag_id"

# Human-readable name (required)
name: "DAG Name"

# Optional description
description: "Detailed description of the DAG's purpose"

# List of tasks to include (required)
tasks:
  # Simple task reference
  - id: "task_1"

  # Task with parameter overrides
  - id: "task_2"
    parameters:
      PARAM1: "override_value"
      PARAM2: "another_override"

  # Task with a different queue
  - id: "task_3"
    queue: "high_priority"

  # Task marked as an evaluation point (see the example DAG below)
  - id: "task_4"
    evaluation_point: true

# Default context for all tasks (optional)
context: "production"

# Schedule for periodic execution (optional)
schedule:
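  # Cron fields: minute hour day-of-month month day-of-week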
  cron: "0 5 * * *"        # Cron expression (run at 5:00 AM daily)
  timezone: "UTC"          # Timezone for the cron expression
  start_date: "2023-01-01" # Optional start date
  end_date: "2023-12-31"   # Optional end date
  catchup: false           # Whether to run for missed intervals
  depends_on_past: false   # Whether each run depends on the previous run

# Maximum concurrency (optional)
max_active_runs: 1

# DAG-level timeout (optional)
timeout_seconds: 86400  # 24 hours

# DAG tags for organization (optional)
tags:
  - "production"
  - "daily"

Example DAG

id: "etl_pipeline"
name: "ETL Pipeline"
description: "Extract, transform, and load data pipeline"
tasks:
  - id: "data_extraction"
    parameters:
      SOURCE: "production_db"
      LIMIT: "10000"

  - id: "data_cleaning"
    parameters:
      CLEAN_MODE: "strict"

  - id: "data_transformation"
    parameters:
      TRANSFORM_TYPE: "normalize"

  - id: "data_validation"
    evaluation_point: true

  - id: "data_loading"
    parameters:
      DESTINATION: "data_warehouse"
      MODE: "append"

context: "production"
schedule:
  cron: "0 2 * * *"  # Daily at 2:00 AM
  timezone: "UTC"
  catchup: false
max_active_runs: 1
tags:
  - "etl"
  - "production"
  - "daily"

Context Definition Schema

Contexts are defined in YAML files with the following schema:

# Unique identifier for the context (required)
id: "context_id"

# Optional parent context to extend
extends: "parent_context_id"

# Variables defined in this context (required)
variables:
  ENV: "production"
  LOG_LEVEL: "info"
  DATA_PATH: "/data/production"
  API_URL: "https://api.example.com"
  DB_CONNECTION: "postgresql://user:pass@host:5432/db"
  # Dynamic values
  RUN_DATE: "${DATE}"
  RUN_ID: "${RUN_ID}"
  RANDOM_ID: "${RANDOM:16}"

# Optional description
description: "Production environment context"

Example Context

id: "production"
description: "Production environment context"
variables:
  # Environment information
  ENV: "production"
  LOG_LEVEL: "info"

  # Paths and locations
  BASE_PATH: "/data/production"
  LOG_PATH: "/var/log/cyclonetix"

  # Service connections
  API_URL: "https://api.example.com"
  DB_HOST: "db.example.com"
  DB_PORT: "5432"
  DB_NAME: "production_db"
  DB_USER: "app_user"

  # Runtime configuration
  MAX_THREADS: "16"
  TIMEOUT_SECONDS: "3600"
  BATCH_SIZE: "1000"
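
Contexts can compose via extends. A minimal sketch of a staging context that inherits from the production context above and overrides a few variables (assuming child values take precedence over the parent's; this page doesn't spell out the merge rules):

id: "staging"
description: "Staging environment context"
extends: "production"
variables:
  ENV: "staging"
  LOG_LEVEL: "debug"
  BASE_PATH: "/data/staging"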

Parameter Set Definition Schema

Parameter sets are defined in YAML files with the following schema:

# Unique identifier for the parameter set (required)
id: "parameter_set_id"

# Parameters defined in this set (required)
parameters:
  PARAM1: "value1"
  PARAM2: "value2"
  NESTED_PARAM:
    SUBPARAM1: "subvalue1"
    SUBPARAM2: "subvalue2"

# Optional description
description: "Parameter set description"

# Optional tags for organization
tags:
  - "tag1"
  - "tag2"

Example Parameter Set

id: "large_training"
description: "Parameters for large-scale model training"
parameters:
  # Model configuration
  MODEL_TYPE: "deep_learning"
  ARCHITECTURE: "transformer"

  # Training parameters
  EPOCHS: "100"
  BATCH_SIZE: "128"
  LEARNING_RATE: "0.001"
  OPTIMIZER: "adam"

  # Resource allocation
  NUM_GPUS: "4"
  DISTRIBUTED: "true"

  # Logging and checkpoints
  LOG_FREQUENCY: "10"
  CHECKPOINT_FREQUENCY: "20"
tags:
  - "ml"
  - "deep_learning"
  - "large_scale"

Evaluation Result Schema

Evaluation tasks output results in the following JSON format:

{
  "next_tasks": ["task_id_1", "task_id_2"],
  "parameters": {
    "task_id_1": {
      "PARAM1": "value1",
      "PARAM2": "value2"
    },
    "task_id_2": {
      "PARAM1": "value1"
    }
  },
  "context_updates": {
    "VARIABLE1": "new_value",
    "VARIABLE2": "new_value"
  },
  "metadata": {
    "reason": "Model accuracy below threshold",
    "metrics": {
      "accuracy": 0.82,
      "precision": 0.79
    }
  }
}
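
An evaluation result comes from an ordinary task marked with evaluation_point: true, whose command produces JSON in this format. A minimal sketch (the script is illustrative, and whether the result is read from stdout or a file is not specified on this page):

id: "data_validation"
name: "Validate Data Quality"
command: "python validate.py --threshold 0.9"
evaluation_point: true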

Schema Validation

Cyclonetix validates all YAML files against these schemas at load time. If a file doesn’t conform to its schema, an error is reported.

You can validate your YAML files without loading them into Cyclonetix using the command:

cyclonetix validate-yaml <path-to-yaml-file>
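
For example, to check a task definition (the tasks/ directory here is illustrative):

cyclonetix validate-yaml tasks/train_model.yaml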

File Naming Conventions

While Cyclonetix doesn’t enforce specific file naming conventions, the following are recommended (an example layout follows the list):

  • Tasks: <task_id>.yaml or grouped in subdirectories
  • DAGs: <dag_id>.yaml
  • Contexts: <context_id>.yaml
  • Parameter Sets: <parameter_set_id>.yaml
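
A repository following these conventions might be laid out as below; the directory names are illustrative, not required:

tasks/
  prepare_data.yaml
  train_model.yaml
dags/
  etl_pipeline.yaml
contexts/
  production.yaml
parameter_sets/
  large_training.yaml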

Next Steps