CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

Headscale is an open-source implementation of the Tailscale control server written in Go. It provides self-hosted coordination for Tailscale networks (tailnets), managing node registration, IP allocation, policy enforcement, and DERP routing.

Development Commands

Quick Setup

# Recommended: Use Nix for dependency management
nix develop

# Full development workflow
make dev  # runs fmt + lint + test + build

Essential Commands

# Build headscale binary
make build

# Run tests
make test
go test ./...                    # All unit tests
go test -race ./...              # With race detection

# Run specific integration test
go run ./cmd/hi run "TestName" --postgres

# Code formatting and linting
make fmt         # Format all code (Go, docs, proto)
make lint        # Lint all code (Go, proto)
make fmt-go      # Format Go code only
make lint-go     # Lint Go code only

# Protocol buffer generation (after modifying proto/)
make generate

# Clean build artifacts  
make clean

Integration Testing

# Use the hi (Headscale Integration) test runner
go run ./cmd/hi doctor                    # Check system requirements
go run ./cmd/hi run "TestPattern"         # Run specific test
go run ./cmd/hi run "TestPattern" --postgres  # With PostgreSQL backend

# Test artifacts are saved to control_logs/ with logs and debug data

Project Structure & Architecture

Top-Level Organization

headscale/
├── cmd/                    # Command-line applications
│   ├── headscale/         # Main headscale server binary
│   └── hi/               # Headscale Integration test runner
├── hscontrol/            # Core control plane logic
├── integration/          # End-to-end Docker-based tests
├── proto/               # Protocol buffer definitions
├── gen/                 # Generated code (protobuf)
├── docs/                # Documentation
└── packaging/           # Distribution packaging

Core Packages (hscontrol/)

Main Server (hscontrol/)

  • app.go: Application setup, dependency injection, server lifecycle
  • handlers.go: HTTP/gRPC API endpoints for management operations
  • grpcv1.go: gRPC service implementation for headscale API
  • poll.go: Critical - Handles Tailscale MapRequest/MapResponse protocol
  • noise.go: Noise protocol implementation for secure client communication
  • auth.go: Authentication flows (web, OIDC, command-line)
  • oidc.go: OpenID Connect integration for user authentication

State Management (hscontrol/state/)

  • state.go: Central coordinator for all subsystems (database, policy, IP allocation, DERP)
  • node_store.go: Performance-critical - In-memory cache with copy-on-write semantics
  • Thread-safe operations with deadlock detection
  • Coordinates between database persistence and real-time operations
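The copy-on-write idea, in a minimal hedged sketch (illustrative only; the real node_store.go API is richer and includes deadlock detection):

package nodestore

import (
    "sync"
    "sync/atomic"
)

type Node struct {
    ID       uint64
    Hostname string
}

// Store publishes immutable snapshot maps; readers never take a lock.
type Store struct {
    writeMu sync.Mutex                      // serializes writers
    current atomic.Pointer[map[uint64]Node] // swapped wholesale on write
}

// Snapshot returns the current view; callers must not mutate it.
func (s *Store) Snapshot() map[uint64]Node {
    if m := s.current.Load(); m != nil {
        return *m
    }
    return nil
}

// Put copies the current map, applies the change, and atomically
// publishes the new snapshot (copy-on-write).
func (s *Store) Put(n Node) {
    s.writeMu.Lock()
    defer s.writeMu.Unlock()
    old := s.Snapshot()
    next := make(map[uint64]Node, len(old)+1)
    for k, v := range old {
        next[k] = v
    }
    next[n.ID] = n
    s.current.Store(&next)
}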

Database Layer (hscontrol/db/)

  • db.go: Database abstraction, GORM setup, migration management
  • node.go: Node lifecycle, registration, expiration, IP assignment
  • users.go: User management, namespace isolation
  • api_key.go: API authentication tokens
  • preauth_keys.go: Pre-authentication keys for automated node registration
  • ip.go: IP address allocation and management
  • policy.go: Policy storage and retrieval
  • Schema migrations in schema.sql with extensive test data coverage

Policy Engine (hscontrol/policy/)

  • policy.go: Core ACL evaluation logic, HuJSON parsing
  • v2/: Next-generation policy system with improved filtering
  • matcher/: ACL rule matching and evaluation engine
  • Determines peer visibility, route approval, and network access rules
  • Supports both file-based and database-stored policies
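The heart of rule matching can be sketched briefly (hedged: the real matcher/ package operates on parsed HuJSON policies and resolves users, groups, and tags, not just raw prefixes):

package matcher

import "net/netip"

// Rule is a radically simplified ACL rule over IP prefixes.
type Rule struct {
    Sources      []netip.Prefix
    Destinations []netip.Prefix
}

// Allows reports whether traffic from src to dst matches the rule.
func (r Rule) Allows(src, dst netip.Addr) bool {
    srcOK := false
    for _, p := range r.Sources {
        if p.Contains(src) {
            srcOK = true
            break
        }
    }
    if !srcOK {
        return false
    }
    for _, p := range r.Destinations {
        if p.Contains(dst) {
            return true
        }
    }
    return false
}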

Network Management (hscontrol/)

  • derp/: DERP (Designated Encrypted Relay for Packets) server implementation
    • NAT traversal when direct connections fail
    • Fallback relay for firewall-restricted environments
  • mapper/: Converts internal Headscale state to Tailscale's wire protocol format
    • tail.go: Tailscale-specific data structure generation
  • routes/: Subnet route management and primary route selection
  • dns/: DNS record management and MagicDNS implementation
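Conceptually the mapper projects server-side state into a per-client view; a sketch with stand-in types (the real code emits tailcfg structures from Tailscale's libraries):

package mapper

// Stand-in types; the real mapper builds tailcfg.MapResponse values.
type Node struct {
    ID      uint64
    Visible bool // outcome of policy evaluation for the requesting client
}

type MapResponse struct {
    Self  Node
    Peers []Node
}

// Build assembles what one client is allowed to see: itself plus
// policy-visible peers.
func Build(self Node, all []Node) MapResponse {
    resp := MapResponse{Self: self}
    for _, n := range all {
        if n.ID != self.ID && n.Visible {
            resp.Peers = append(resp.Peers, n)
        }
    }
    return resp
}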

Utilities & Support (hscontrol/)

  • types/: Core data structures, configuration, validation
  • util/: Helper functions for networking, DNS, key management
  • templates/: Client configuration templates (Apple, Windows, etc.)
  • notifier/: Event notification system for real-time updates
  • metrics.go: Prometheus metrics collection
  • capver/: Tailscale capability version management

Key Subsystem Interactions

Node Registration Flow

  1. Client Connection: noise.go handles secure protocol handshake
  2. Authentication: auth.go validates credentials (web/OIDC/preauth)
  3. State Creation: state.go coordinates IP allocation via db/ip.go
  4. Storage: db/node.go persists node, NodeStore caches in memory
  5. Network Setup: mapper/ generates initial Tailscale network map
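Strung together, the flow reads roughly like the sketch below (every name in it is hypothetical pseudostructure, not headscale's actual API):

package registration

type RegisterRequest struct{ AuthKey string }
type Identity struct{ User string }
type Node struct {
    User string
    IP   string
}

func authenticate(req RegisterRequest) (Identity, error) { // auth.go
    return Identity{User: "alice"}, nil
}

func allocateIP(id Identity) (string, error) { // state.go -> db/ip.go
    return "100.64.0.1", nil
}

func persistAndCache(id Identity, ip string) Node { // db/node.go + NodeStore
    return Node{User: id.User, IP: ip}
}

// Register follows the numbered steps above in order.
func Register(req RegisterRequest) (Node, error) {
    id, err := authenticate(req)
    if err != nil {
        return Node{}, err
    }
    ip, err := allocateIP(id)
    if err != nil {
        return Node{}, err
    }
    return persistAndCache(id, ip), nil
}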

Ongoing Operations

  1. Poll Requests: poll.go receives periodic client updates
  2. State Updates: NodeStore maintains real-time node information
  3. Policy Application: policy/ evaluates ACL rules for peer relationships
  4. Map Distribution: mapper/ sends network topology to all affected clients

Route Management

  1. Advertisement: Clients announce routes via poll.go Hostinfo updates
  2. Storage: db/ persists routes, NodeStore caches for performance
  3. Approval: policy/ auto-approves routes based on ACL rules
  4. Distribution: routes/ selects primary routes, mapper/ distributes to peers
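A simplified take on step 4's primary selection (hedged sketch; the real routes/ package is more careful about failover and route stability):

package routes

import "net/netip"

// Advertisement pairs a node with a subnet route it announces.
type Advertisement struct {
    NodeID uint64
    Prefix netip.Prefix
    Online bool
}

// SelectPrimaries picks one online advertiser per prefix; here the
// first online node simply wins.
func SelectPrimaries(ads []Advertisement) map[netip.Prefix]uint64 {
    primaries := make(map[netip.Prefix]uint64)
    for _, ad := range ads {
        if !ad.Online {
            continue
        }
        if _, taken := primaries[ad.Prefix]; !taken {
            primaries[ad.Prefix] = ad.NodeID
        }
    }
    return primaries
}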

Command-Line Tools (cmd/)

Main Server (cmd/headscale/)

  • headscale.go: CLI parsing, configuration loading, server startup
  • Supports daemon mode, CLI operations (user/node management), database operations

Integration Test Runner (cmd/hi/)

  • main.go: Test execution framework with Docker orchestration
  • run.go: Individual test execution with artifact collection
  • doctor.go: System requirements validation
  • docker.go: Container lifecycle management
  • Essential for validating changes against real Tailscale clients

Generated & External Code

Protocol Buffers (proto/, gen/)

  • Defines gRPC API for headscale management operations
  • Client libraries can be generated from these definitions
  • Run make generate after modifying .proto files

Integration Testing (integration/)

  • scenario.go: Docker test environment setup
  • tailscale.go: Tailscale client container management
  • Individual test files for specific functionality areas
  • Real end-to-end validation with network isolation

Critical Performance Paths

High-Frequency Operations

  1. MapRequest Processing (poll.go): Every 15-60 seconds per client
  2. NodeStore Reads (node_store.go): Every operation requiring node data
  3. Policy Evaluation (policy/): On every peer relationship calculation
  4. Route Lookups (routes/): During network map generation

Database Write Patterns

  • Frequent: Node heartbeats, endpoint updates, route changes
  • Moderate: User operations, policy updates, API key management
  • Rare: Schema migrations, bulk operations

Configuration & Deployment

Configuration (hscontrol/types/config.go)

  • Database connection settings (SQLite/PostgreSQL)
  • Network configuration (IP ranges, DNS settings)
  • Policy mode (file vs database)
  • DERP relay configuration
  • OIDC provider settings
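For orientation, the shape of that configuration might look like this (field names here are illustrative, not the actual types.Config definition):

package types

// Illustrative only; the real types.Config carries many more fields.
type Config struct {
    ServerURL  string // public URL clients use to reach the server
    ListenAddr string // HTTP/gRPC bind address
    Database   DatabaseConfig
    PolicyMode string // "file" or "database"
    DERP       DERPConfig
    OIDC       OIDCConfig
}

type DatabaseConfig struct {
    Type string // "sqlite" or "postgres"
    DSN  string
}

type DERPConfig struct {
    ServerEnabled bool
    URLs          []string // external DERP map sources
}

type OIDCConfig struct {
    Issuer       string
    ClientID     string
    ClientSecret string
}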

Key Dependencies

  • GORM: Database ORM with migration support
  • Tailscale Libraries: Core networking and protocol code
  • Zerolog: Structured logging throughout the application
  • Buf: Protocol buffer toolchain for code generation
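As a small illustration of the zerolog style used throughout the codebase:

package main

import "github.com/rs/zerolog/log"

func main() {
    // Structured fields instead of format strings.
    log.Info().
        Str("node", "ts-client-xyz789").
        Uint64("node_id", 1).
        Msg("node registered")
}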

Development Workflow Integration

The architecture supports incremental development:

  • Unit Tests: Focus on individual packages (*_test.go files)
  • Integration Tests: Validate cross-component interactions
  • Database Tests: Extensive migration and data integrity validation
  • Policy Tests: ACL rule evaluation and edge cases
  • Performance Tests: NodeStore and high-frequency operation validation

Integration Test System

Overview

Integration tests use Docker containers running real Tailscale clients against a Headscale server. Tests validate end-to-end functionality including routing, ACLs, node lifecycle, and network coordination.

Running Integration Tests

System Requirements

# Check if your system is ready
go run ./cmd/hi doctor

This verifies Docker, Go, required images, and disk space.

Test Execution Patterns

# Run a single test (recommended for development)
go run ./cmd/hi run "TestSubnetRouterMultiNetwork"

# Run with PostgreSQL backend (for database-heavy tests)
go run ./cmd/hi run "TestExpireNode" --postgres

# Run multiple tests with pattern matching
go run ./cmd/hi run "TestSubnet*"

# Run all integration tests (CI/full validation)
go test ./integration -timeout 30m

Test Categories & Timing

  • Fast tests (< 2 min): Basic functionality, CLI operations
  • Medium tests (2-5 min): Route management, ACL validation
  • Slow tests (5+ min): Node expiration, HA failover
  • Long-running tests (10+ min): TestNodeOnlineStatus (12 min duration)

Test Infrastructure

Docker Setup

  • Headscale server container with configurable database backend
  • Multiple Tailscale client containers with different versions
  • Isolated networks per test scenario
  • Automatic cleanup after test completion

Test Artifacts

All test runs save artifacts to control_logs/TIMESTAMP-ID/:

control_logs/20250713-213106-iajsux/
├── hs-testname-abc123.stderr.log     # Headscale server logs
├── hs-testname-abc123.stdout.log
├── hs-testname-abc123.db             # Database snapshot
├── hs-testname-abc123_metrics.txt    # Prometheus metrics
├── hs-testname-abc123-mapresponses/  # Protocol debug data
├── ts-client-xyz789.stderr.log       # Tailscale client logs
├── ts-client-xyz789.stdout.log
└── ts-client-xyz789_status.json      # Client status dump

Test Development Guidelines

Timing Considerations

Integration tests involve real network operations and Docker container lifecycle:

// ❌ Wrong: Immediate assertions after async operations
client.Execute([]string{"tailscale", "set", "--advertise-routes=10.0.0.0/24"})
nodes, _ := headscale.ListNodes()
require.Len(t, nodes[0].GetAvailableRoutes(), 1) // May fail due to timing

// ✅ Correct: Wait for async operations to complete
client.Execute([]string{"tailscale", "set", "--advertise-routes=10.0.0.0/24"})
require.EventuallyWithT(t, func(c *assert.CollectT) {
    nodes, err := headscale.ListNodes()
    assert.NoError(c, err)
    assert.Len(c, nodes[0].GetAvailableRoutes(), 1)
}, 10*time.Second, 100*time.Millisecond, "route should be advertised")

Common Test Patterns

  • Route Advertisement: Use EventuallyWithT for route propagation
  • Node State Changes: Wait for NodeStore synchronization
  • ACL Policy Changes: Allow time for policy recalculation
  • Network Connectivity: Use ping tests with retries (see the sketch below)
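A hedged sketch of the connectivity pattern, assuming an Execute helper on the test client like the ones in the snippets above (the exact return values are an assumption, not a verified signature):

// Sketch only: client.Execute's return values and peerIP are
// assumptions for illustration, not the verified integration API.
require.EventuallyWithT(t, func(c *assert.CollectT) {
    out, _, err := client.Execute([]string{"tailscale", "ping", peerIP})
    assert.NoError(c, err)
    assert.Contains(c, out, "pong") // tailscale ping prints "pong from ..."
}, 30*time.Second, 500*time.Millisecond, "peer should be reachable")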

Test Data Management

// Node identification: Don't assume array ordering
expectedRoutes := map[string]string{"1": "10.33.0.0/16"}
for _, node := range nodes {
    nodeIDStr := fmt.Sprintf("%d", node.GetId())
    if route, shouldHaveRoute := expectedRoutes[nodeIDStr]; shouldHaveRoute {
        // Test the node that should have the route
    }
}

Troubleshooting Integration Tests

Common Failure Patterns

  1. Timing Issues: Test assertions run before async operations complete

    • Solution: Use EventuallyWithT with appropriate timeouts
    • Timeout Guidelines: 3-5s for route operations, 10s for complex scenarios
  2. Infrastructure Problems: Disk space, Docker issues, network conflicts

    • Check: go run ./cmd/hi doctor for system health
    • Clean: Remove old test containers and networks (example commands follow this list)
  3. NodeStore Synchronization: Tests expecting immediate data availability

    • Key Points: Route advertisements must propagate through poll requests
    • Fix: Wait for NodeStore updates after Hostinfo changes
  4. Database Backend Differences: SQLite vs PostgreSQL behavior differences

    • Use: --postgres flag for database-intensive tests
    • Note: Some timing characteristics differ between backends
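For the cleanup step in item 2, something along these lines usually suffices (the name filters assume the hs-/ts- container naming shown in the artifact listing above; adjust if your naming differs):

# Remove leftover test containers matching the hs-/ts- naming pattern
docker ps -a --filter "name=hs-" --filter "name=ts-" -q | xargs -r docker rm -f
# Reclaim unused networks and disk space
docker network prune -f
docker system prune -f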

Debugging Failed Tests

  1. Check test artifacts in control_logs/ for detailed logs
  2. Examine MapResponse JSON files for protocol-level debugging
  3. Review Headscale stderr logs for server-side error messages
  4. Check Tailscale client status for network-level issues

Resource Management

  • Tests require significant disk space (each run ~100MB of logs)
  • Docker containers are cleaned up automatically on success
  • Failed tests may leave containers running - clean manually if needed
  • Use docker system prune periodically to reclaim space

Best Practices for Test Modifications

  1. Always test locally before committing integration test changes
  2. Use appropriate timeouts - too short causes flaky tests, too long slows CI
  3. Clean up properly - ensure tests don't leave persistent state
  4. Handle both success and failure paths in test scenarios
  5. Document timing requirements for complex test scenarios

NodeStore Implementation Details

Key Insight from Recent Work: The NodeStore is a critical performance optimization that caches node data in memory while ensuring consistency with the database. When working with route advertisements or node state changes:

  1. Timing Considerations: Route advertisements need time to propagate from clients to server. Use require.EventuallyWithT() patterns in tests instead of immediate assertions.

  2. Synchronization Points: NodeStore updates happen at specific points like poll.go:420 after Hostinfo changes. Ensure these are maintained when modifying the polling logic.

  3. Peer Visibility: The NodeStore's peersFunc determines which nodes are visible to each other. Policy-based filtering is separate from monitoring visibility - expired nodes should remain visible for debugging but marked as expired.
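A hedged sketch of the visibility rule in point 3 (illustrative; the real peersFunc lives in node_store.go and is policy-driven):

package nodestore

import "time"

type Peer struct {
    ID      uint64
    Expiry  time.Time
    Allowed bool // policy-based visibility for the requesting node
    Expired bool
}

// visiblePeers drops only policy-invisible peers; expired peers stay
// in the list but are flagged, so they remain debuggable.
func visiblePeers(all []Peer, now time.Time) []Peer {
    var out []Peer
    for _, p := range all {
        if !p.Allowed {
            continue
        }
        p.Expired = !p.Expiry.IsZero() && p.Expiry.Before(now)
        out = append(out, p)
    }
    return out
}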

Testing Guidelines

Integration Test Patterns

// Use EventuallyWithT for async operations
require.EventuallyWithT(t, func(c *assert.CollectT) {
    nodes, err := headscale.ListNodes()
    assert.NoError(c, err)
    // Check expected state
}, 10*time.Second, 100*time.Millisecond, "description")

// Check node routes by actual node properties, not array position
var routeNode *v1.Node
for _, node := range nodes {
    if nodeIDStr := fmt.Sprintf("%d", node.GetId()); expectedRoutes[nodeIDStr] != "" {
        routeNode = node
        break
    }
}

Running Problematic Tests

  • Some tests require significant time (e.g., TestNodeOnlineStatus runs for 12 minutes)
  • Infrastructure issues like disk space can cause test failures unrelated to code changes
  • Use --postgres flag when testing database-heavy scenarios

Important Notes

  • Dependencies: Use nix develop for consistent toolchain (Go, buf, protobuf tools, linting)
  • Protocol Buffers: Changes to proto/ require make generate and should be committed separately
  • Code Style: Enforced via golangci-lint with golines (width 88) and gofumpt formatting
  • Database: Supports both SQLite (development) and PostgreSQL (production/testing)
  • Integration Tests: Require Docker and can consume significant disk space
  • Performance: NodeStore optimizations are critical for scale - be careful with changes to state management

Debugging Integration Tests

Test artifacts are preserved in control_logs/TIMESTAMP-ID/ including:

  • Headscale server logs (stderr/stdout)
  • Tailscale client logs and status
  • Database dumps and network captures
  • MapResponse JSON files for protocol debugging

When tests fail, check these artifacts first before assuming code issues.