codeanalyzer-python

The Python analysis backend is the engine that powers PythonAnalysis in CLDK. It runs Jedi for semantic code understanding, optional CodeQL for enhanced call-graph resolution, and Tree-sitter for fast syntactic parsing, all producing a single canonical PyApplication schema that ships with the backend and is re-exported by the CLDK Python SDK.

codeanalyzer-python is a standalone static analysis library (published to PyPI as codeanalyzer-python) that the SDK auto-manages in a virtualenv. Rather than crawling files with token-heavy LLM calls, agents query the analyzed program directly: call graphs become networkx graph lookups, reachability becomes a path query, and callers/callees are deterministic.

The backend produces:

  • Symbol table: All modules, classes, methods, functions, imports, parameters, and docstrings in typed PyModule objects.
  • Call graph: Inter- and intra-procedural call edges (PyCallEdge). Jedi provides the baseline; when CodeQL is enabled, the callees resolved by both engines are merged into one edge set.
  • Class hierarchies: Base classes, inheritance chains, and method overrides.
  • Entrypoints: Framework-detected entry points (Flask routes, Celery tasks, Django views, gRPC servicers, etc.) linked to their callables.
flowchart LR
    A["Input: project_path"] --> B["Virtualenv<br/>+ deps"]
    B --> C["Jedi: symbol table<br/>+ Jedi call edges"]
    B --> D["CodeQL<br/>(optional)"]
    D --> E["Merge edges"]
    C --> E
    E --> F["Symbol table<br/>Call graph"]
    F --> G["PyApplication<br/>+ PyModule, PyClass,<br/>PyCallable, PyCallEdge"]
    G --> H["JSON / Msgpack"]
    G --> I["CLDK SDK<br/>cldk.models.python"]
  • codeanalyzer.core:Codeanalyzer: Orchestrates the analysis pipeline, manages virtualenv setup, caching, and invokes semantic passes.
  • codeanalyzer.syntactic_analysis:SymbolTableBuilder: Parses Python source via Tree-sitter and Jedi to extract modules, classes, methods, and call sites.
  • codeanalyzer.semantic_analysis.call_graph: Builds inter-procedural call graphs using Jedi’s resolution; merges CodeQL edges when enabled.
  • codeanalyzer.semantic_analysis.codeql: Optional CodeQL integration for resolving dynamic calls, third-party dispatch, and RPC targets.
  • codeanalyzer.schema:py_schema: Defines all Pydantic models: PyModule, PyClass, PyCallable, PyCallEdge, PyApplication, and others.

All models are defined in /codeanalyzer/schema/py_schema.py and re-exported in the CLDK SDK at cldk.models.python:

PyApplication: The root output of every analysis run.

class PyApplication(BaseModel):
    symbol_table: Dict[str, PyModule]  # file_path → PyModule
    call_graph: List[PyCallEdge] = []  # edges with source → target signature

PyModule: Represents one .py file.

class PyModule(BaseModel):
    file_path: str
    module_name: str
    imports: List[PyImport] = []
    comments: List[PyComment] = []
    classes: Dict[str, PyClass] = {}  # class_name → PyClass
    functions: Dict[str, PyCallable] = {}  # function_name → PyCallable
    variables: List[PyVariableDeclaration] = []
    content_hash: Optional[str] = None  # for cache invalidation
    last_modified: Optional[float] = None
    file_size: Optional[int] = None
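
As a minimal sketch of how a consumer might walk this structure without the SDK, the snippet below parses a hand-written JSON fragment shaped like the PyApplication and PyModule fields listed above and collects every class and top-level function signature. The sample payload and the helper name `list_signatures` are illustrative assumptions; real analysis output carries many more fields.

```python
import json
from typing import Dict, List

# Hand-written sample shaped like the PyApplication / PyModule fields above.
# It is illustrative only and omits most fields of a real analysis.
SAMPLE = """
{
  "symbol_table": {
    "my_pkg/handlers.py": {
      "file_path": "my_pkg/handlers.py",
      "module_name": "my_pkg.handlers",
      "classes": {
        "Handler": {"name": "Handler", "signature": "my_pkg.handlers.Handler"}
      },
      "functions": {
        "main": {"name": "main", "signature": "my_pkg.handlers.main"}
      }
    }
  },
  "call_graph": []
}
"""


def list_signatures(app: Dict) -> List[str]:
    """Collect class and top-level function signatures from every module."""
    found = []
    for module in app.get("symbol_table", {}).values():
        for cls in module.get("classes", {}).values():
            found.append(cls["signature"])
        for func in module.get("functions", {}).values():
            found.append(func["signature"])
    return sorted(found)


signatures = list_signatures(json.loads(SAMPLE))
print(signatures)
```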

PyClass: A class definition.

class PyClass(BaseModel):
    name: str
    signature: str  # e.g., "my_pkg.module.ClassName"
    comments: List[PyComment] = []
    code: str | None = None
    base_classes: List[str] = []  # parent signatures
    methods: Dict[str, PyCallable] = {}  # method_name → PyCallable
    attributes: Dict[str, PyClassAttribute] = {}  # attr_name → attribute
    inner_classes: Dict[str, "PyClass"] = {}
    start_line: int
    end_line: int

PyCallable: A function or method.

class PyCallable(BaseModel):
    name: str
    path: str
    signature: str  # e.g., "my_pkg.module.ClassName.method_name"
    comments: List[PyComment] = []
    decorators: List[str] = []
    parameters: List[PyCallableParameter] = []
    return_type: Optional[str] = None
    code: str | None = None  # source code of the callable
    start_line: int
    end_line: int
    code_start_line: int
    accessed_symbols: List[PySymbol] = []  # local variable / import refs
    call_sites: List[PyCallsite] = []  # calls made inside this callable
    inner_callables: Dict[str, "PyCallable"] = {}
    inner_classes: Dict[str, "PyClass"] = {}
    local_variables: List[PyVariableDeclaration] = []
    cyclomatic_complexity: int = 0
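
Because callables can nest (inner_callables and inner_classes), a flat loop over a module's functions misses closures and locally defined classes. The following is a rough sketch of a depth-first walk over plain dicts shaped like the fields above; `iter_callables`, the sample data, and the complexity threshold are all hypothetical, and inner classes nested inside other inner classes are not followed here.

```python
from typing import Dict, Iterator


def iter_callables(callable_: Dict) -> Iterator[Dict]:
    """Yield a callable, then everything nested inside it, depth first."""
    yield callable_
    for inner in callable_.get("inner_callables", {}).values():
        yield from iter_callables(inner)
    for inner_cls in callable_.get("inner_classes", {}).values():
        for method in inner_cls.get("methods", {}).values():
            yield from iter_callables(method)


# Hypothetical callable with one closure and one locally defined class.
outer = {
    "signature": "pkg.mod.outer",
    "cyclomatic_complexity": 4,
    "inner_callables": {
        "helper": {"signature": "pkg.mod.outer.helper", "cyclomatic_complexity": 1}
    },
    "inner_classes": {
        "Local": {
            "methods": {
                "run": {"signature": "pkg.mod.outer.Local.run", "cyclomatic_complexity": 7}
            }
        }
    },
}

# Flag callables above an arbitrary complexity threshold of 3.
complex_ones = [
    c["signature"] for c in iter_callables(outer) if c["cyclomatic_complexity"] > 3
]
print(complex_ones)
```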

PyCallsite: A single call inside a callable.

class PyCallsite(BaseModel):
    method_name: str
    receiver_expr: Optional[str] = None  # "obj" in obj.method()
    receiver_type: Optional[str] = None
    argument_types: List[str] = []
    return_type: Optional[str] = None
    callee_signature: Optional[str] = None  # resolved target (if found)
    is_constructor_call: bool = False
    start_line: int
    start_column: int
    end_line: int
    end_column: int
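
Since callee_signature is optional, one practical check is to list the call sites the analysis could not resolve. The sketch below does this over plain dicts that carry only the fields it reads; the sample call sites and the helper `unresolved_calls` are made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical call sites, reduced to the fields this sketch reads.
call_sites: List[Dict] = [
    {"method_name": "Config", "callee_signature": "pkg.config.Config.__init__", "start_line": 10},
    {"method_name": "load", "callee_signature": "pkg.config.Config.load", "start_line": 12},
    {"method_name": "run", "callee_signature": None, "start_line": 15},
]


def unresolved_calls(sites: List[Dict]) -> List[Tuple[str, int]]:
    """Return (method_name, start_line) for call sites with no resolved target."""
    return [
        (site["method_name"], site["start_line"])
        for site in sites
        if site.get("callee_signature") is None
    ]


print(unresolved_calls(call_sites))
```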

PyCallEdge: A directed edge in the call graph.

class PyCallEdge(BaseModel):
    source: str  # caller PyCallable.signature
    target: str  # callee PyCallable.signature
    type: Literal["CALL_DEP"] = "CALL_DEP"
    weight: int = 1
    provenance: List[Literal["jedi", "codeql", "joern"]] = []
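
The provenance list suggests that an edge found by more than one engine is stored once, tagged with every engine that reported it. The sketch below shows one plausible way to merge two edge sets along those lines. It is not the backend's actual merge logic: the `Edge` dataclass is a plain stand-in for PyCallEdge, and keeping the larger weight on a duplicate is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Edge:
    """Plain stand-in for PyCallEdge (source, target, weight, provenance)."""
    source: str
    target: str
    weight: int = 1
    provenance: List[str] = field(default_factory=list)


def merge_edges(*edge_sets: Iterable[Edge]) -> List[Edge]:
    """Collapse duplicates by (source, target); union provenance in first-seen order."""
    merged: Dict[Tuple[str, str], Edge] = {}
    for edges in edge_sets:
        for edge in edges:
            key = (edge.source, edge.target)
            if key not in merged:
                merged[key] = Edge(edge.source, edge.target, edge.weight, list(edge.provenance))
                continue
            kept = merged[key]
            kept.weight = max(kept.weight, edge.weight)  # assumption: keep the larger weight
            for engine in edge.provenance:
                if engine not in kept.provenance:
                    kept.provenance.append(engine)
    return list(merged.values())


jedi = [Edge("pkg.main", "pkg.load", provenance=["jedi"])]
codeql = [
    Edge("pkg.main", "pkg.load", provenance=["codeql"]),
    Edge("pkg.main", "pkg.plugin.run", provenance=["codeql"]),
]
merged = merge_edges(jedi, codeql)
for e in merged:
    print(e.source, "->", e.target, e.provenance)
```

Filtering on provenance afterwards lets a consumer separate edges confirmed by both engines from those only one engine reported.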

PyImport: An import statement.

class PyImport(BaseModel):
    module: str  # "os", "flask", "my_pkg.utils"
    name: str  # "path", "Flask", "helper"
    alias: Optional[str] = None
    start_line: int
    end_line: int
    start_column: int
    end_column: int

PyComment: A comment or docstring.

class PyComment(BaseModel):
    content: str
    start_line: int
    end_line: int
    start_column: int
    end_column: int
    is_docstring: bool = False

All Py* models support:

  • Pydantic v1 and v2 compatibility via cldk.models.python.
  • Builder pattern: PyModule.builder().module_name("x").classes({...}).build().
  • Serialization: to_msgpack_bytes(), from_msgpack_bytes(), model_dump_json().

The backend ships a command-line tool codeanalyzer, installed by pip install codeanalyzer-python:

codeanalyzer --input /path/to/my_pkg [OPTIONS]
| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --input | -i | PATH | Required | Project root directory to analyze |
| --output | -o | PATH | None | Save analysis.json or analysis.msgpack to this directory (stdout if None) |
| --format | -f | json or msgpack | json | Output serialization format |
| --codeql / --no-codeql | | bool | false | Enable CodeQL-based call-graph augmentation (experimental) |
| --ray / --no-ray | | bool | false | Enable Ray for distributed analysis |
| --eager / --lazy | | bool | lazy | Force rebuild cache (eager) or reuse cached results (lazy) |
| --cache-dir | -c | PATH | .codeanalyzer in input dir | Where to store virtualenv, CodeQL DB, analysis cache |
| --keep-cache / --clear-cache | | bool | keep | Retain cache after analysis (default) or remove it |
| --skip-tests / --include-tests | | bool | skip | Exclude or include test_*.py / *_test.py files |
| --file-name | | PATH | None | Analyze only a single file (relative to input dir) |
| -v | | count | 0 | Verbosity: -v, -vv, -vvv for debug/trace |
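
To make the --skip-tests patterns concrete, here is a small sketch of a filter that drops files named test_*.py or *_test.py. It only illustrates the two patterns from the table. Whether the backend matches on the file name alone, or also skips directories such as tests/, is not stated here, so treat the matching rule as an assumption.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List

TEST_PATTERNS = ("test_*.py", "*_test.py")  # the two patterns named in the option table


def drop_test_files(paths: Iterable[str]) -> List[str]:
    """Keep paths whose file name matches none of the test patterns."""
    kept = []
    for path in paths:
        name = PurePosixPath(path).name  # assumption: match on the file name only
        if not any(fnmatchcase(name, pattern) for pattern in TEST_PATTERNS):
            kept.append(path)
    return kept


candidates = ["src/app.py", "tests/test_app.py", "src/app_test.py", "src/contest.py"]
print(drop_test_files(candidates))
```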

Basic symbol table analysis:

codeanalyzer -i ./my_pkg

Prints the analysis (symbol table + Jedi call graph) as JSON to stdout, since no --output directory is given.

With CodeQL augmentation:

codeanalyzer -i ./my_pkg --codeql

Merges Jedi edges with CodeQL-resolved edges; note that CodeQL integration is experimental and may take longer.

With Ray distributed analysis:

codeanalyzer -i ./my_pkg --ray

Enables Ray for parallel processing across available cores.

Save to file in msgpack format:

codeanalyzer -i ./my_pkg -o ./results --format msgpack

Saves analysis.msgpack to ./results; the binary encoding is 30–50% of the JSON size.

Custom cache, eager rebuild:

codeanalyzer -i ./my_pkg --cache-dir /tmp/analysis-cache --eager

Rebuilds the virtualenv and analysis cache from scratch and stores them in /tmp/analysis-cache/.codeanalyzer.

Single file:

codeanalyzer -i ./my_pkg --file-name src/handlers.py

Analyzes only src/handlers.py.
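
When driving the CLI from a script, the flags shown above can be assembled as an argument list. The sketch below is one way to do that; `build_command` and its parameter names are hypothetical, and it only emits flags that appear in the option table. It prints the command rather than running it.

```python
import shlex
from typing import List, Optional


def build_command(
    input_dir: str,
    output_dir: Optional[str] = None,
    fmt: str = "json",
    codeql: bool = False,
    eager: bool = False,
    cache_dir: Optional[str] = None,
) -> List[str]:
    """Assemble a codeanalyzer argv list from the documented flags."""
    argv = ["codeanalyzer", "--input", input_dir, "--format", fmt]
    if output_dir is not None:
        argv += ["--output", output_dir]
    argv.append("--codeql" if codeql else "--no-codeql")
    argv.append("--eager" if eager else "--lazy")
    if cache_dir is not None:
        argv += ["--cache-dir", cache_dir]
    return argv


argv = build_command("./my_pkg", output_dir="./results", fmt="msgpack", eager=True)
print(shlex.join(argv))
# To execute it, pass argv to subprocess.run(argv, check=True)
# with codeanalyzer available on PATH.
```

Passing a list instead of a shell string avoids quoting problems when the project path contains spaces.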

When you call CLDK(language="python").analysis(project_path="my_pkg") in the Python SDK:

  1. Virtualenv provisioning: CLDK detects or installs codeanalyzer-python into a managed virtualenv in the cache directory (default: <project_dir>/.codeanalyzer/venv).

  2. CLI invocation: The SDK constructs a codeanalyzer command with options (--codeql, --eager, --cache-dir, etc.) and runs it as a subprocess. The JSON written to stdout is parsed into a PyApplication.

  3. Schema re-export: cldk.models.python re-exports PyApplication, PyModule, PyClass, PyCallable, and other Py* types directly from codeanalyzer.schema.py_schema, ensuring a single source of truth.

  4. In-memory facade: The parsed PyApplication is passed to PythonAnalysis, which wraps it with convenience methods:

    • get_symbol_table() → Dict[str, PyModule]
    • get_classes() → Dict[str, PyClass]
    • get_call_graph() → networkx.DiGraph
    • get_callers(target_class, target_method) → List[Tuple[str, PyCallable]]
    • get_callees(source_callable) → List[PyCallable]
import networkx as nx

from cldk import CLDK
from cldk.analysis import AnalysisLevel

analysis = CLDK(language="python").analysis(
    project_path="my_pkg",
    analysis_level=AnalysisLevel.call_graph,
    use_codeql=True,  # optional; merges CodeQL edges
)

# Query the symbol table
modules = analysis.get_symbol_table()
classes = analysis.get_classes()

# Compute reachability
# (nx.has_path raises nx.NodeNotFound if either signature is not in the graph)
call_graph = analysis.get_call_graph()
is_reachable = nx.has_path(call_graph, "my_pkg.main", "my_pkg.unsafe_sink")

# Find callers
callers = analysis.get_callers("my_pkg.MyClass", "process")
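
To show what the reachability and caller queries amount to, the sketch below reproduces them over a bare list of (source, target) signature pairs using only the standard library. It illustrates the idea and is not how PythonAnalysis or networkx implement it; the edge list and the helper names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Adjacency = Dict[str, Set[str]]


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Tuple[Adjacency, Adjacency]:
    """Index edges twice: caller → callees and callee → callers."""
    callees, callers = defaultdict(set), defaultdict(set)
    for source, target in edges:
        callees[source].add(target)
        callers[target].add(source)
    return callees, callers


def is_reachable(callees: Adjacency, start: str, goal: str) -> bool:
    """Breadth-first search along call edges from start towards goal."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in callees.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical edges, as (PyCallEdge.source, PyCallEdge.target) pairs.
edges = [
    ("my_pkg.cli", "my_pkg.main"),
    ("my_pkg.main", "my_pkg.parse"),
    ("my_pkg.parse", "my_pkg.unsafe_sink"),
    ("my_pkg.report", "my_pkg.fmt"),
]
callees, callers = build_adjacency(edges)
print(is_reachable(callees, "my_pkg.main", "my_pkg.unsafe_sink"))
print(sorted(callers["my_pkg.main"]))
```
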
  • Cache location: Stored in cache_dir/.codeanalyzer/ (default: <project_dir>/.codeanalyzer/).
  • Virtualenv: Auto-created at .codeanalyzer/<project_name>/virtualenv/. The backend installs dependencies from requirements.txt, pyproject.toml, setup.py, Pipfile, etc.
  • Analysis cache: Indexed by file content hash; unchanged files reuse cached results.
  • CodeQL database: Stored in .codeanalyzer/codeql/ if --codeql is enabled; downloaded on first use.

To force a clean rebuild, pass --eager (or eager_analysis=True in the SDK), or delete the cache directory.
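
The content-hash check behind the analysis cache can be pictured with the sketch below: a file is re-analyzed only when its current hash differs from the stored content_hash. The choice of SHA-256 and the helper names are assumptions made for illustration; the backend's actual hashing scheme is not specified here.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def content_hash(path: Path) -> str:
    """Hash the file bytes (SHA-256 is an assumption, not the documented algorithm)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reanalysis(path: Path, cached_hash: Optional[str]) -> bool:
    """True when there is no cached hash or the file content has changed."""
    return cached_hash is None or content_hash(path) != cached_hash


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "handlers.py"
    source.write_text("def handle():\n    return 1\n")
    cached = content_hash(source)  # what a cache entry would store
    unchanged = needs_reanalysis(source, cached)
    source.write_text("def handle():\n    return 2\n")
    changed = needs_reanalysis(source, cached)

print(unchanged, changed)
```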

  • One facade, one schema: The same PyApplication schema is used everywhere: CLI, SDK, and agent code. All Py* types are Pydantic models with JSON/msgpack serialization.
  • Semantic over syntactic: Jedi resolves symbols and types; you query the program, not parse tokens.
  • Optional CodeQL: Jedi alone covers 80–90% of call edges. CodeQL augments dynamic/RPC calls but costs extra analysis time. Enable it when you need comprehensive coverage.
  • Agents query instead of crawl: Reachability is a networkx query, callers are a dict lookup, and every answer comes from the analyzed program, so no tokens are spent on approximation.