# Git Integration
Cyclonetix provides robust Git integration features, allowing you to execute code directly from Git repositories and manage workflow definitions as code. This approach ensures versioning, repeatability, and easier collaboration.
## Git-Based Task Execution

### Basic Git Integration
Instead of uploading scripts or defining inline commands, you can reference code in Git repositories:
id: "train_model"
name: "Train ML Model"
command: "git-exec https://github.com/example/ml-repo.git scripts/train.py --epochs ${EPOCHS}"
parameters:
EPOCHS: "100"
The `git-exec` prefix tells Cyclonetix to:

- Clone the repository
- Check out the default branch (or a specified branch/tag — see the sketch below)
- Execute the specified command within the repository
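
For example, the checkout can be pinned to a particular branch or tag with the `@<ref>` suffix used in the full example at the end of this page; the branch and tag names here are illustrative:

```yaml
id: "train_model_dev"
name: "Train ML Model (develop branch)"
# The @develop suffix pins the clone to the develop branch; @v1.2.3 would pin to a tag.
# Branch and tag names are illustrative.
command: "git-exec https://github.com/example/ml-repo.git@develop scripts/train.py --epochs ${EPOCHS}"
parameters:
  EPOCHS: "100"
```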
### Authentication
For private repositories, configure authentication:
```yaml
git:
  auth:
    # SSH key-based auth
    ssh_key_path: "/path/to/private/key"

    # Or HTTPS with credentials
    credentials:
      - domain: "github.com"
        username: "git-user"
        password: "${GITHUB_TOKEN}"  # Use a personal access token
      - domain: "gitlab.example.com"
        username: "git-user"
        password: "${GITLAB_TOKEN}"
```
### Caching Repositories
To avoid repeated cloning, enable repository caching:
```yaml
git:
  cache:
    enabled: true
    path: "/tmp/cyclonetix-git-cache"
    max_size_mb: 1000
    ttl_minutes: 60
```
## Git-Based Workflow Definitions

### Git-Synced Task and DAG Definitions
Keep your task and DAG definitions in a Git repository:
```yaml
git_sync:
  enabled: true
  repositories:
    - url: "https://github.com/example/workflows.git"
      branch: "main"
      paths:
        - source: "tasks/"
          destination: "${CYCLO_TASK_DIRECTORY}"
        - source: "dags/"
          destination: "${CYCLO_DAG_DIRECTORY}"
      poll_interval_seconds: 60
      credentials:
        username: "git-user"
        password: "${GIT_TOKEN}"
```
This configuration:

1. Periodically pulls from the repository
2. Syncs task and DAG definitions to their respective directories
3. Automatically updates workflows when changes are pushed
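
As an illustration, a file committed under `tasks/` in the synced repository can use the same task format shown earlier; after the next poll it lands in `${CYCLO_TASK_DIRECTORY}` automatically (the file name is illustrative):

```yaml
# tasks/train_model.yaml (illustrative file name)
id: "train_model"
name: "Train ML Model"
command: "git-exec https://github.com/example/ml-repo.git scripts/train.py --epochs ${EPOCHS}"
parameters:
  EPOCHS: "100"
```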
### Versioned Workflows
Use Git references to run specific versions of workflows:
```bash
# Run a DAG from a specific branch
./cyclonetix schedule-dag my_workflow --git-ref develop

# Run a DAG from a specific tag
./cyclonetix schedule-dag my_workflow --git-ref v1.2.3
```
### GitOps Workflow
Implement a GitOps approach to workflow management:
- Store all workflow definitions in Git
- Use pull requests for workflow changes
- Automatically deploy changes when merged
- Track workflow history through Git history
Example GitOps setup:
```yaml
git_sync:
  enabled: true
  repositories:
    - url: "https://github.com/example/workflows.git"
      branch: "main"
      paths:
        - source: "/"
          destination: "${CYCLO_CONFIG_DIRECTORY}"
  webhook:
    enabled: true
    endpoint: "/api/git-sync/webhook"
    secret: "${WEBHOOK_SECRET}"
```
Configure your Git provider to send webhooks when changes are pushed, triggering immediate sync.
## Advanced Git Features

### Multi-Repository Workflows
Reference code from multiple repositories in a single workflow:
```yaml
# Task 1 from repo A
- id: "data_prep"
  command: "git-exec https://github.com/example/data-tools.git scripts/prepare.py"

# Task 2 from repo B
- id: "model_training"
  command: "git-exec https://github.com/example/ml-models.git scripts/train.py"
  dependencies:
    - "data_prep"
```
### Git Submodules
Work with repositories that use Git submodules:
command: "git-exec https://github.com/example/main-repo.git --recursive scripts/run.sh"
The `--recursive` flag tells Cyclonetix to initialize and update submodules.
### Environment Setup
Set up specific environments for Git-based execution:
```yaml
git:
  environments:
    - name: "python-data-science"
      setup_commands:
        - "python -m venv .venv"
        - "source .venv/bin/activate"
        - "pip install -r requirements.txt"
      pre_execute: "source .venv/bin/activate"
    - name: "nodejs"
      setup_commands:
        - "npm install"
      pre_execute: ""
```
Reference the environment in your task:
id: "run_analysis"
command: "git-exec https://github.com/example/data-analysis.git --env python-data-science scripts/analyze.py"
### Sparse Checkout

For large repositories, use sparse checkout to download only the files you need:
command: "git-exec https://github.com/example/large-repo.git --sparse=scripts,configs scripts/run.py"
This checks out only the `scripts` and `configs` directories, saving time and space.
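
Sparse checkout can be combined with the other options shown on this page, for example a pinned tag; combining the flags is assumed rather than documented elsewhere here, and the tag, directories, and script path are illustrative:

```yaml
id: "run_configured_job"
# @v1.4.0 pins the checkout to a tag; --sparse limits it to the listed directories.
# Combining these flags is an assumption; all names are illustrative.
command: "git-exec https://github.com/example/large-repo.git@v1.4.0 --sparse=scripts,configs scripts/run.py"
```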
## Security Considerations

### Repository Validation
Limit which repositories can be used:
```yaml
git:
  security:
    allowed_domains:
      - "github.com"
      - "gitlab.example.com"
    allowed_repos:
      - "github.com/trusted-org/*"
      - "gitlab.example.com/internal/*"
    disallowed_repos:
      - "github.com/untrusted-org/risky-repo"
```
### Code Scanning
Integrate with scanning tools to check code before execution:
```yaml
git:
  security:
    scan:
      enabled: true
      command: "trivy fs --security-checks vuln,config,secret {repo_path}"
      fail_on_findings: true
```
### Credential Management
Securely manage Git credentials:
- Use environment variables for tokens
- Use SSH agent forwarding when possible
- Use short-lived credentials
- Consider integration with vault systems
## Best Practices

- **Use Pinned References**: Always pin to specific commits or tags in production (see the sketch after this list)
- **Keep Repositories Clean**: Structure repositories for clarity and minimal dependencies
- **Use Feature Branches**: Develop new workflows in feature branches
- **Test Before Merging**: Validate workflows in a test environment
- **Monitor Git Performance**: Large repositories can slow down task startup
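
As a sketch of the first point, pin production tasks to an immutable tag. Whether a full commit SHA works in the same `@<ref>` position is an assumption here; the tag form matches the example below:

```yaml
id: "score_customers"
name: "Score Customers (pinned)"
# Pinned to an immutable release tag for reproducible production runs.
# Using a full commit SHA after @ is assumed, not shown elsewhere on this page.
command: "git-exec https://github.com/example/ml-models.git@v1.2.3 scripts/score.py"
```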
## Example Workflow
Here’s a complete example of a Git-integrated workflow:
```yaml
# Task definitions
- id: "fetch_data"
  name: "Fetch Latest Data"
  command: "git-exec https://github.com/example/data-tools.git@v2.1.0 --env python-data scripts/fetch.py --source ${DATA_SOURCE}"
  parameters:
    DATA_SOURCE: "production_api"

- id: "process_data"
  name: "Process Raw Data"
  command: "git-exec https://github.com/example/data-tools.git@v2.1.0 --env python-data scripts/process.py --input ${INPUT_PATH} --output ${OUTPUT_PATH}"
  dependencies:
    - "fetch_data"
  parameters:
    INPUT_PATH: "/data/raw"
    OUTPUT_PATH: "/data/processed"

- id: "train_model"
  name: "Train ML Model"
  command: "git-exec https://github.com/example/ml-models.git@stable --env ml-training scripts/train.py --data ${DATA_PATH} --epochs ${EPOCHS} --model-type ${MODEL_TYPE}"
  dependencies:
    - "process_data"
  parameters:
    DATA_PATH: "/data/processed"
    EPOCHS: "50"
    MODEL_TYPE: "xgboost"
  queue: "gpu_tasks"

- id: "evaluate_model"
  name: "Evaluate Model Performance"
  command: "git-exec https://github.com/example/ml-models.git@stable --env ml-training scripts/evaluate.py --model ${MODEL_PATH} --test-data ${TEST_DATA}"
  dependencies:
    - "train_model"
  parameters:
    MODEL_PATH: "/models/latest"
    TEST_DATA: "/data/test"
  evaluation_point: true
```
## Next Steps
- Explore Contexts & Parameters for managing environment variables
- Learn about Evaluation Points for dynamic workflows
- Check the Developer Guide for extending Git integration