codeanalyzer-python

The Python analysis backend provides PythonAnalysis in CLDK. It runs Jedi for semantic analysis, optional CodeQL for call-graph augmentation, and Tree-sitter for syntax parsing. The output is a canonical PyApplication schema that ships with the backend and is re-exported by the CLDK Python SDK.

Overview

codeanalyzer-python is a standalone static analysis library (published to PyPI as codeanalyzer-python) that the SDK manages in a virtualenv. It converts a Python project into a queryable symbol table and call graph.

The backend produces:

Symbol table: All modules, classes, methods, functions, imports, parameters, and docstrings in typed PyModule objects.
Call graph: Inter- and intra-procedural call edges (PyCallEdge). Jedi provides the baseline; optional CodeQL augmentation merges the callees resolved by both engines.
Class hierarchies: Base classes, inheritance chains, and method overrides.
Entrypoints: Framework-detected entry points (Flask routes, Celery tasks, Django views, gRPC servicers, etc.) linked to their callables.

Architecture

flowchart LR
    A["Input: project_path"] --> B["Virtualenv<br/>+ deps"]
    B --> C["Jedi: symbol table<br/>+ Jedi call edges"]
    B --> D["CodeQL<br/>(optional)"]
    D --> E["Merge edges"]
    C --> E
    E --> F["Symbol table<br/>Call graph"]
    F --> G["PyApplication<br/>+ PyModule, PyClass,<br/>PyCallable, PyCallEdge"]
    G --> H["JSON / Msgpack"]
    G --> I["CLDK SDK<br/>cldk.models.python"]

Key modules

codeanalyzer.core:Codeanalyzer: Orchestrates the analysis pipeline, manages virtualenv setup, caching, and invokes semantic passes.
codeanalyzer.syntactic_analysis:SymbolTableBuilder: Parses Python source via Tree-sitter and Jedi to extract modules, classes, methods, and call sites.
codeanalyzer.semantic_analysis.call_graph: Builds inter-procedural call graphs using Jedi’s resolution; merges CodeQL edges when enabled.
codeanalyzer.semantic_analysis.codeql: Optional CodeQL integration for resolving dynamic calls, third-party dispatch, and RPC targets.
codeanalyzer.schema:py_schema: Defines all Pydantic models: PyModule, PyClass, PyCallable, PyCallEdge, PyApplication, and others.

Schema: the Py* models

All models are defined in /codeanalyzer/schema/py_schema.py and re-exported in the CLDK SDK at cldk.models.python:

Core application model

PyApplication: The root output of every analysis run.

class PyApplication(BaseModel):
    symbol_table: Dict[str, PyModule]  # file_path → PyModule
    call_graph: List[PyCallEdge] = []  # edges with source → target signature
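
As a rough illustration of this shape, serialized output can be read back as plain dictionaries. The payload below is hand-written and hypothetical (the file paths and signatures are invented); it relies only on the two top-level keys shown in PyApplication.

```python
import json

# Hypothetical, hand-written payload that mirrors the PyApplication shape.
SAMPLE = """
{
  "symbol_table": {
    "my_pkg/app.py": {"file_path": "my_pkg/app.py", "module_name": "my_pkg.app"},
    "my_pkg/db.py": {"file_path": "my_pkg/db.py", "module_name": "my_pkg.db"}
  },
  "call_graph": [
    {"source": "my_pkg.app.main", "target": "my_pkg.db.connect",
     "type": "CALL_DEP", "weight": 1, "provenance": ["jedi"]}
  ]
}
"""

app = json.loads(SAMPLE)

# symbol_table is keyed by file path; each value describes one module.
module_names = sorted(m["module_name"] for m in app["symbol_table"].values())

# call_graph is a flat list of edges between callable signatures.
edges = [(e["source"], e["target"]) for e in app["call_graph"]]

print(module_names)
print(edges)
```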

Symbol table

PyModule: Represents one .py file.

class PyModule(BaseModel):
    file_path: str
    module_name: str
    imports: List[PyImport] = []
    comments: List[PyComment] = []
    classes: Dict[str, PyClass] = {}         # class_name → PyClass
    functions: Dict[str, PyCallable] = {}    # function_name → PyCallable
    variables: List[PyVariableDeclaration] = []
    content_hash: Optional[str] = None       # for cache invalidation
    last_modified: Optional[float] = None
    file_size: Optional[int] = None

PyClass: A class definition.

class PyClass(BaseModel):
    name: str
    signature: str  # e.g., "my_pkg.module.ClassName"
    comments: List[PyComment] = []
    code: str | None = None
    base_classes: List[str] = []  # parent signatures
    methods: Dict[str, PyCallable] = {}           # method_name → PyCallable
    attributes: Dict[str, PyClassAttribute] = {}  # attr_name → attribute
    inner_classes: Dict[str, "PyClass"] = {}
    start_line: int
    end_line: int

PyCallable: A function or method.

class PyCallable(BaseModel):
    name: str
    path: str
    signature: str  # e.g., "my_pkg.module.ClassName.method_name"
    comments: List[PyComment] = []
    decorators: List[str] = []
    parameters: List[PyCallableParameter] = []
    return_type: Optional[str] = None
    code: str | None = None  # source code of the callable
    start_line: int
    end_line: int
    code_start_line: int
    accessed_symbols: List[PySymbol] = []      # local variable / import refs
    call_sites: List[PyCallsite] = []          # calls made within this callable
    inner_callables: Dict[str, "PyCallable"] = {}
    inner_classes: Dict[str, "PyClass"] = {}
    local_variables: List[PyVariableDeclaration] = []
    cyclomatic_complexity: int = 0
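
Because callables can nest (inner_callables, inner_classes), a recursive walk is one way to flatten them. The sketch below works on plain dictionaries shaped like the models above rather than on the Pydantic classes; iter_callables and the sample module are hypothetical names used only for illustration.

```python
from typing import Any, Dict, Iterator


def iter_callables(container: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every callable nested under a module, class, or callable dict."""
    # Modules keep callables under "functions", classes under "methods",
    # and callables nest further callables under "inner_callables".
    for key in ("functions", "methods", "inner_callables"):
        for callable_ in container.get(key, {}).values():
            yield callable_
            yield from iter_callables(callable_)
    # Modules use "classes"; classes and callables use "inner_classes".
    for key in ("classes", "inner_classes"):
        for class_ in container.get(key, {}).values():
            yield from iter_callables(class_)


# Hypothetical module dict with one nested function and one method.
module = {
    "functions": {
        "main": {
            "signature": "my_pkg.app.main",
            "cyclomatic_complexity": 4,
            "inner_callables": {
                "helper": {
                    "signature": "my_pkg.app.main.helper",
                    "cyclomatic_complexity": 1,
                }
            },
        }
    },
    "classes": {
        "Handler": {
            "methods": {
                "process": {
                    "signature": "my_pkg.app.Handler.process",
                    "cyclomatic_complexity": 9,
                }
            }
        }
    },
}

# Rank callables from most to least complex.
ranked = sorted(
    iter_callables(module),
    key=lambda c: c["cyclomatic_complexity"],
    reverse=True,
)
most_complex = [c["signature"] for c in ranked]
print(most_complex)
```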

PyCallsite: A single call inside a callable.

class PyCallsite(BaseModel):
    method_name: str
    receiver_expr: Optional[str] = None          # "obj" in obj.method()
    receiver_type: Optional[str] = None
    argument_types: List[str] = []
    return_type: Optional[str] = None
    callee_signature: Optional[str] = None       # resolved target (if found)
    is_constructor_call: bool = False
    start_line: int
    start_column: int
    end_line: int
    end_column: int
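
Since callee_signature is optional, call sites that the resolver could not bind are easy to list. The following is a minimal sketch over a dictionary shaped like PyCallable; unresolved_calls and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def unresolved_calls(callable_: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Return (method_name, start_line) for call sites with no resolved callee."""
    return [
        (site["method_name"], site["start_line"])
        for site in callable_.get("call_sites", [])
        if site.get("callee_signature") is None
    ]


# Hypothetical callable with one resolved and one unresolved call site.
handler = {
    "signature": "my_pkg.app.Handler.process",
    "call_sites": [
        {
            "method_name": "connect",
            "receiver_expr": "db",
            "callee_signature": "my_pkg.db.connect",
            "start_line": 12,
        },
        {
            "method_name": "dispatch",
            "receiver_expr": "plugin",
            "callee_signature": None,
            "start_line": 15,
        },
    ],
}

missing = unresolved_calls(handler)
print(missing)
```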

Call graph

PyCallEdge: A directed edge in the call graph.

class PyCallEdge(BaseModel):
    source: str  # caller PyCallable.signature
    target: str  # callee PyCallable.signature
    type: Literal["CALL_DEP"] = "CALL_DEP"
    weight: int = 1
    provenance: List[Literal["jedi", "codeql", "joern"]] = []
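
The provenance list suggests how edges from several engines can be combined. The sketch below is not the backend's actual merge logic: it assumes edges are deduplicated by (source, target), that provenance lists are unioned in first-seen order, and that the larger weight wins. merge_edges is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List, Tuple

Edge = Dict[str, Any]


def merge_edges(*edge_lists: Iterable[Edge]) -> List[Edge]:
    """Merge edge lists, deduplicating by (source, target)."""
    merged: Dict[Tuple[str, str], Edge] = {}
    for edges in edge_lists:
        for edge in edges:
            key = (edge["source"], edge["target"])
            if key not in merged:
                # First sighting: copy the edge so inputs are not mutated.
                merged[key] = {
                    "source": edge["source"],
                    "target": edge["target"],
                    "type": "CALL_DEP",
                    "weight": edge.get("weight", 1),
                    "provenance": list(edge.get("provenance", [])),
                }
                continue
            existing = merged[key]
            # Assumption: keep the larger weight when both engines agree.
            existing["weight"] = max(existing["weight"], edge.get("weight", 1))
            # Union provenance while preserving first-seen order.
            for engine in edge.get("provenance", []):
                if engine not in existing["provenance"]:
                    existing["provenance"].append(engine)
    return list(merged.values())


jedi_edges = [
    {"source": "a", "target": "b", "provenance": ["jedi"]},
    {"source": "a", "target": "c", "provenance": ["jedi"]},
]
codeql_edges = [
    {"source": "a", "target": "b", "provenance": ["codeql"]},
    {"source": "a", "target": "d", "provenance": ["codeql"]},
]

merged_edges = merge_edges(jedi_edges, codeql_edges)
for edge in merged_edges:
    print(edge["source"], "->", edge["target"], edge["provenance"])
```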

Supporting models

PyImport: An import statement.

class PyImport(BaseModel):
    module: str      # "os", "flask", "my_pkg.utils"
    name: str        # "path", "Flask", "helper"
    alias: Optional[str] = None
    start_line: int
    end_line: int
    start_column: int
    end_column: int

PyComment: A comment or docstring.

class PyComment(BaseModel):
    content: str
    start_line: int
    end_line: int
    start_column: int
    end_column: int
    is_docstring: bool = False

All Py* models support:

Pydantic v1 and v2 compatibility via cldk.models.python.
Builder pattern: PyModule.builder().module_name("x").classes({...}).build().
Serialization: to_msgpack_bytes(), from_msgpack_bytes(), model_dump_json().

CLI interface

The backend ships a command-line tool canpy, installed by pip install codeanalyzer-python:

canpy --input /path/to/my_pkg [OPTIONS]

codeanalyzer is a deprecated alias kept for backwards compatibility: it prints a deprecation warning to stderr and delegates to canpy. Prefer canpy.

Options

Option	Short	Type	Default	Description
`--input`	`-i`	PATH	Required	Project root directory to analyze
`--output`	`-o`	PATH	None	Save `analysis.json` or `analysis.msgpack` to this directory (stdout if None)
`--format`	`-f`	`json | msgpack`	json	Output serialization format
`--emit`		`json | neo4j | schema`	json	Output target: `json` (analysis.json), `neo4j` (`graph.cypher`, or a live Bolt push with `--neo4j-uri`), or `schema` (Neo4j schema contract; needs no input)
`--app-name`		str	input dir name	Application name for the graph `:PyApplication` anchor
`--neo4j-uri`		str	None (`NEO4J_URI`)	Push the graph to a live Neo4j over Bolt (incremental); omit to write `graph.cypher`
`--neo4j-user`		str	neo4j (`NEO4J_USERNAME`)	Neo4j username
`--neo4j-password`		str	neo4j (`NEO4J_PASSWORD`)	Neo4j password (prefer the env var)
`--neo4j-database`		str	server default (`NEO4J_DATABASE`)	Neo4j database name
`--codeql` / `--no-codeql`		bool	false	Enable CodeQL-based call-graph augmentation (experimental)
`--ray` / `--no-ray`		bool	false	Enable Ray for distributed analysis
`--eager` / `--lazy`		bool	lazy	Force rebuild cache (eager) or reuse cached results (lazy)
`--cache-dir`	`-c`	PATH	`.codeanalyzer` in input dir	Where to store virtualenv, CodeQL DB, analysis cache
`--keep-cache` / `--clear-cache`		bool	keep	Retain cache after analysis (default) or remove it
`--skip-tests` / `--include-tests`		bool	skip	Exclude or include `test_*.py` / `*_test.py` files
`--file-name`		PATH	None	Analyze only a single file (relative to input dir)
`-v`		count	0	Verbosity: `-v`, `-vv`, `-vvv` for debug/trace

Examples

Basic symbol table analysis:

canpy -i ./my_pkg

Outputs analysis.json to stdout (symbol table + Jedi call graph).

With CodeQL augmentation:

canpy -i ./my_pkg --codeql

Merges Jedi edges with CodeQL-resolved edges. CodeQL integration is experimental and increases analysis time.

With Ray distributed analysis:

canpy -i ./my_pkg --ray

Enables Ray for parallel processing across available cores.

Save to file in msgpack format:

canpy -i ./my_pkg -o ./results --format msgpack

Saves a compressed analysis.msgpack that is roughly 30-50% of the size of the equivalent JSON output.

Custom cache, eager rebuild:

canpy -i ./my_pkg --cache-dir /tmp/analysis-cache --eager

Rebuilds the virtualenv and the analysis cache from scratch and stores them in /tmp/analysis-cache/.codeanalyzer.

Single file:

canpy -i ./my_pkg --file-name src/handlers.py

Analyzes only src/handlers.py.

How the SDK consumes it

A call to CLDK.python(project_path="my_pkg") in the Python SDK proceeds as follows:

Virtualenv provisioning: CLDK detects an existing codeanalyzer-python installation or installs it into a managed virtualenv in the cache directory (default: <project_dir>/.codeanalyzer/venv).
CLI invocation: The SDK constructs a canpy command with options (--codeql, --eager, --cache-dir, etc.) and runs it as a subprocess. Stdout is parsed as analysis.json.
Schema re-export: cldk.models.python re-exports PyApplication, PyModule, PyClass, PyCallable, and other Py* types directly from codeanalyzer.schema.py_schema, ensuring a single source of truth.
In-memory analysis object: The parsed PyApplication is passed to PythonAnalysis, which wraps it with convenience methods:
- get_symbol_table() → Dict[str, PyModule]
- get_classes() → Dict[str, PyClass]
- get_call_graph() → networkx.DiGraph
- get_callers(target_class_name, target_method_declaration) → Dict
- get_callees(source_class_name, source_method_declaration) → Dict
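
The CLI invocation step can be pictured as assembling an argument list from the flags in the options table. This is a sketch under stated assumptions, not the SDK's real code: build_canpy_command is a hypothetical helper, and only flags documented above are used.

```python
from typing import List, Optional


def build_canpy_command(
    project_path: str,
    use_codeql: bool = False,
    eager: bool = False,
    cache_dir: Optional[str] = None,
) -> List[str]:
    """Assemble a canpy argument list from a few documented options."""
    command = ["canpy", "--input", project_path]
    # Boolean options are paired flags in the options table.
    command.append("--codeql" if use_codeql else "--no-codeql")
    command.append("--eager" if eager else "--lazy")
    if cache_dir is not None:
        command += ["--cache-dir", cache_dir]
    return command


command = build_canpy_command("my_pkg", use_codeql=True, cache_dir="/tmp/analysis-cache")
print(" ".join(command))
```

With canpy installed, such a list could be passed to subprocess.run(command, capture_output=True, text=True, check=True) and the captured stdout handed to json.loads.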

Direct SDK usage:

import networkx as nx

from cldk import CLDK
from cldk.analysis import AnalysisLevel
from cldk.analysis.commons.backend_config import PyCodeAnalyzerConfig

analysis = CLDK.python(
    project_path="my_pkg",
    analysis_level=AnalysisLevel.call_graph,
    backend=PyCodeAnalyzerConfig(use_codeql=True),  # optional; merges CodeQL edges
)

# Query the symbol table
modules = analysis.get_symbol_table()
classes = analysis.get_classes()

# Compute reachability
call_graph = analysis.get_call_graph()
is_reachable = nx.has_path(call_graph, "my_pkg.main", "my_pkg.unsafe_sink")

# Find callers
callers = analysis.get_callers("my_pkg.MyClass", "process")

CLI directly:

canpy -i ./my_pkg --codeql --output ./results --format json
jq '.symbol_table | keys' results/analysis.json

Choosing a backend

The backend is selected by the type of the backend= config passed to CLDK.python(...):

In-memory codeanalyzer (default): omit backend=, or pass backend=PyCodeAnalyzerConfig(...). The Python-only call-graph knobs use_codeql=... and use_ray=... live on this config, as does cache_dir=....
Read-only Neo4j: pass backend=Neo4jConnectionConfig(...) to query a graph populated out of band (no local analysis is run).

from cldk import CLDK
from cldk.analysis.commons.backend_config import (
    PyCodeAnalyzerConfig,
    Neo4jConnectionConfig,
)

# In-memory backend with Ray + custom cache directory
analysis = CLDK.python(
    project_path="my_pkg",
    backend=PyCodeAnalyzerConfig(
        use_codeql=True,
        use_ray=True,
        cache_dir="/tmp/analysis-cache",
    ),
)

# Read-only Neo4j backend
analysis = CLDK.python(
    project_path="my_pkg",
    backend=Neo4jConnectionConfig(
        uri="bolt://localhost:7687",
        username="neo4j",
        password="neo4j",
        database=None,
        application_name="my_pkg",
    ),
)

Neo4jConnectionConfig is importable from cldk.analysis.commons.backend_config (and also from cldk.analysis.python.neo4j).

CLDK.python(...) keeps the project_path, analysis_level, target_files, and eager keyword arguments. The old CLDK(language="python").analysis(...) form still works but is deprecated; prefer CLDK.python(...). The from cldk import CLDK import is unchanged.

Caching and virtualenv management

Cache location: A single language-keyed cache_dir (default: <project_dir>/.codeanalyzer); Python artifacts live under <cache_dir>/python/. Set it via backend=PyCodeAnalyzerConfig(cache_dir=...).
Virtualenv: Auto-created under the cache directory. The backend installs dependencies from requirements.txt, pyproject.toml, setup.py, Pipfile, etc.
Analysis cache: Indexed by file content hash; unchanged files reuse cached results.
CodeQL database: Stored under the cache directory if CodeQL is enabled; downloaded on first use.

To force a clean rebuild, pass eager=True to CLDK.python(...) (or --eager on the CLI), or delete the cache directory.
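
Content-hash invalidation can be sketched in a few lines. The example assumes SHA-256 over the raw file bytes; the hash function the backend actually uses is not stated here, and content_hash and needs_reanalysis are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def content_hash(path: Path) -> str:
    """Hash the file bytes (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reanalysis(path: Path, cached_hash: Optional[str]) -> bool:
    """A file is stale when it has no cached hash or its content changed."""
    return cached_hash is None or content_hash(path) != cached_hash


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "handlers.py"
    source.write_text("def handle():\n    return 1\n")

    cached = content_hash(source)  # stored alongside the analysis result
    stale_before_edit = needs_reanalysis(source, cached)

    source.write_text("def handle():\n    return 2\n")
    stale_after_edit = needs_reanalysis(source, cached)

print(stale_before_edit, stale_after_edit)
```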

Design principles

One schema: The same PyApplication schema is used across the CLI, SDK, and consuming code. All Py* types are Pydantic models with JSON/msgpack serialization.
Semantic over syntactic: Jedi resolves symbols and types, so queries operate on the resolved program rather than raw tokens.
Optional CodeQL: Jedi alone resolves approximately 80-90% of call edges. CodeQL adds edges for dynamic and RPC calls at the cost of additional analysis time, so enable it only when those edges are required.
Queryable interface: Reachability is a networkx query, and callers and callees are exposed through API methods over the analyzed project.
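
To show what a reachability query amounts to, here is a standard-library breadth-first search over (source, target) pairs. It is a stand-in for the networkx query, not a replacement for it; has_path here is a local, hypothetical function and the signatures are invented.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def has_path(edges: Iterable[Tuple[str, str]], start: str, goal: str) -> bool:
    """Return True if goal is reachable from start by following call edges."""
    successors: Dict[str, List[str]] = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)

    seen: Set[str] = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical call edges between callable signatures.
call_edges = [
    ("my_pkg.main", "my_pkg.parse"),
    ("my_pkg.parse", "my_pkg.unsafe_sink"),
    ("my_pkg.main", "my_pkg.log"),
]

reachable = has_path(call_edges, "my_pkg.main", "my_pkg.unsafe_sink")
not_reachable = has_path(call_edges, "my_pkg.log", "my_pkg.main")
print(reachable, not_reachable)
```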

Next steps

Python API reference: Query PythonAnalysis and the typed models.

cocoa: A Code Context Agent plugin using call graphs for reachability and impact.

Contributing guide: Extend the Python backend or add support for new languages.