Core

The CLDK class is the top-level entry point. Construct it with a language, then ask it for an analysis object over your project. You never instantiate JavaAnalysis or PythonAnalysis directly; CLDK hands you the correct one.

Two steps, always the same shape:

from cldk import CLDK
from cldk.analysis import AnalysisLevel
analysis = CLDK(language="java").analysis(project_path="commons-cli")
# -> JavaAnalysis, ready to query

CLDK(language=...) accepts "java" and "python" today (Go, TypeScript, Rust, and C are on the way). The object it returns exposes the primary method used in most workflows:

  • .analysis(...): returns the language-specific analysis object (JavaAnalysis or PythonAnalysis) backed by the appropriate static analysis engine. This is where the symbol table and call graph are produced.

flowchart LR
    C["CLDK(language)"] --> A[".analysis(project_path)"]
    A --> J[JavaAnalysis]
    A --> P[PythonAnalysis]
    J --> M[Typed models]
    P --> M

Analysis levels. The depth of .analysis() is governed by analysis_level. The default, AnalysisLevel.symbol_table, populates classes, methods, and fields. Call-graph computation incurs additional cost: get_call_graph, get_callers, and get_callees require AnalysisLevel.call_graph. Set it up front when call relationships are needed.
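A minimal sketch of choosing the level up front. The helper `required_level` is illustrative and not part of CLDK; only the three method names and the two level names come from the text above.

```python
# Hypothetical helper (not part of CLDK): pick the analysis level from the
# queries a script intends to run.
CALL_GRAPH_METHODS = {"get_call_graph", "get_callers", "get_callees"}


def required_level(planned_methods):
    """Return the name of the cheapest level that covers the planned queries."""
    if CALL_GRAPH_METHODS & set(planned_methods):
        return "call_graph"
    return "symbol_table"


print(required_level(["get_classes"]))                 # symbol_table
print(required_level(["get_classes", "get_callers"]))  # call_graph
```

The returned name can be resolved with `getattr(AnalysisLevel, name)` before it is passed as `analysis_level`.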

| Argument | Applies to | What it does |
| --- | --- | --- |
| project_path | all | Path to the project directory to analyze. |
| analysis_level | all | AnalysisLevel.symbol_table (default) or AnalysisLevel.call_graph. The latter is required for call graphs, callers, and callees. |
| analysis_backend_path | Java only | Path to a codeanalyzer-*.jar. Omit to auto-download. |
| cache_dir | Python only | Directory for the codeanalyzer-python cache (virtualenv, CodeQL DB, analysis cache). Defaults to <project_path>/.codeanalyzer. |
| use_codeql | Python only | When True (default), augments Jedi call-graph resolution with CodeQL for more complete edges. Set False for faster, Jedi-only analysis. |

See the generated reference below for the full signature, including eager, target_files, analysis_json_path, and use_ray.
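The "Applies to" column can be encoded as a small filter so that one call site serves both languages. This is a sketch under stated assumptions: `analysis_kwargs` and `default_cache_dir` are hypothetical helpers, not CLDK functions, and silently dropping an inapplicable option is a design choice of the sketch.

```python
from pathlib import Path

# Which language each optional argument applies to, per the table above.
JAVA_ONLY = {"analysis_backend_path"}
PYTHON_ONLY = {"cache_dir", "use_codeql"}


def analysis_kwargs(language, project_path, **options):
    """Hypothetical helper: keep only the options valid for `language`."""
    kwargs = {"project_path": str(project_path)}
    for name, value in options.items():
        if name in JAVA_ONLY and language != "java":
            continue  # Java-only option requested for another language
        if name in PYTHON_ONLY and language != "python":
            continue  # Python-only option requested for another language
        kwargs[name] = value
    return kwargs


def default_cache_dir(project_path):
    """Documented default for cache_dir: <project_path>/.codeanalyzer."""
    return Path(project_path) / ".codeanalyzer"


print(analysis_kwargs("python", "my_pkg", use_codeql=False,
                      analysis_backend_path="/opt/jars"))
# {'project_path': 'my_pkg', 'use_codeql': False}
print(default_cache_dir("my_pkg").as_posix())  # my_pkg/.codeanalyzer
```

The resulting dictionary can then be unpacked into the call, as in `CLDK(language="python").analysis(**kwargs)`.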

The recurring sample project is Apache Commons CLI, unpacked at commons-cli.

from cldk import CLDK
from cldk.analysis import AnalysisLevel
analysis = CLDK(language="java").analysis(
    project_path="commons-cli",
    analysis_level=AnalysisLevel.call_graph,  # needed for call-graph methods
)
print(type(analysis).__name__) # JavaAnalysis
print(len(analysis.get_classes())) # 23

The first run may download the CodeAnalyzer backend JAR; later runs reuse the cache. From here, every method lives on analysis; see the Java API reference for the full surface.

from cldk import CLDK
analysis = CLDK(language="python").analysis(project_path="my_pkg")
print(type(analysis).__name__) # PythonAnalysis
classes = analysis.get_classes() # Dict[str, PyClass]

The two-step shape and the method names are the same; only the language changes. Methods are documented on the Python API reference.
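Because both facades expose get_classes, a language-agnostic helper is short. The sketch below assumes nothing beyond the calls shown above; `count_classes` and its `factory` parameter are hypothetical, and the parameter exists only so the function can be exercised without a backend installed.

```python
def count_classes(language, project_path, factory=None):
    """Hypothetical helper: run the two-step shape and count the classes."""
    if factory is None:
        # Imported lazily so the sketch loads even where cldk is not installed.
        from cldk import CLDK
        factory = CLDK
    analysis = factory(language=language).analysis(project_path=project_path)
    return len(analysis.get_classes())


# Intended use (requires cldk and the project on disk):
#   count_classes("java", "commons-cli")
#   count_classes("python", "my_pkg")
```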

For an introduction, see What is CLDK? and the Quickstart. For task-oriented snippets, see Common tasks and the cookbook; the concepts page explains analysis levels and call graphs in detail, and the cheat sheet provides a one-page summary.

The full generated reference follows.

Core CLDK module.

This module provides the top-level entry point for the Code Language Development Kit (CLDK), a unified framework for performing static analysis across multiple programming languages. The primary interface is the CLDK class, which serves as a factory for creating language-specific analysis objects, tree-sitter parsers, and sanitization utilities.

CLDK supports the following languages:

  • Java: Full static analysis via CodeAnalyzer backend, including symbol tables, call graphs, and code metrics.
  • Python: Static analysis via codeanalyzer-python backend with optional CodeQL-augmented call graph resolution.
  • C: Basic analysis via libclang for parsing and extracting code structure.

Typical usage involves instantiating CLDK with a target language, then calling analysis to obtain a language-specific analysis facade.

Note: This module requires language-specific backends to be available:

  • Java: codeanalyzer-*.jar (auto-downloaded or specified via path)
  • Python: codeanalyzer-python (auto-installed in virtualenv)
  • C: libclang (must be installed on the system)

class CLDK

Core class for the Code Language Development Kit (CLDK).

The CLDK class serves as the primary entry point and factory for all code analysis operations. It provides a unified interface for initializing language-specific analysis facades, tree-sitter parsers, and code sanitization utilities.

This class follows the factory pattern, where the language parameter determines which concrete analysis implementation is returned by the analysis, treesitter_parser, and tree_sitter_utils methods.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| language | str | The target programming language for analysis. Supported values are "java", "python", and "c" (case-sensitive). |

Raises:

  • NotImplementedError: Raised by factory methods when the specified language is not yet supported.

See Also

  • JavaAnalysis: Java-specific analysis facade.
  • PythonAnalysis: Python-specific analysis facade.
  • CAnalysis: C-specific analysis facade.

| Name | Type | Description |
| --- | --- | --- |
| language | str | |

analysis(project_path: str | Path | None = None, source_code: str | None = None, eager: bool = False, analysis_level: str = AnalysisLevel.symbol_table, target_files: List[str] | None = None, analysis_backend_path: str | None = None, analysis_json_path: str | Path = None, cache_dir: str | Path | None = None, use_codeql: bool = True, use_ray: bool = False) -> JavaAnalysis | PythonAnalysis | CAnalysis

Initialize and return a language-specific analysis facade.

This factory method creates an appropriate analysis object based on the language specified during CLDK initialization. The analysis facade provides methods for extracting code structure, call graphs, symbol tables, and other static analysis artifacts.

The method supports two modes of operation:

  1. Project mode: Analyze an entire project directory by providing project_path. This is the recommended mode for comprehensive analysis.
  2. Source code mode (Java only): Analyze a single source code string by providing source_code. Useful for quick analysis of code snippets.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| project_path | str \| Path \| None | Absolute or relative path to the project directory to analyze. The directory should contain source files in the target language. Mutually exclusive with source_code. |
| source_code | str \| None | Raw source code string to analyze (Java only). Useful for analyzing code snippets without a project structure. Mutually exclusive with project_path. Not supported for Python or C languages. |
| eager | bool | If True, forces regeneration of all analysis caches and databases, ignoring any previously cached results. Defaults to False for incremental analysis performance. |
| analysis_level | str | The depth of analysis to perform. Controls which analysis artifacts are generated. See AnalysisLevel for available options. Defaults to AnalysisLevel.symbol_table. |
| target_files | List[str] \| None | Optional list of specific file paths (relative to project_path) to analyze. When provided, only these files are included in the analysis, improving performance for large projects. Defaults to None (analyze all files). |
| analysis_backend_path | str \| None | Java only. Path to the directory containing the codeanalyzer-*.jar backend executable. If not provided, the JAR is automatically downloaded. Not valid for Python analysis; use cache_dir instead. |
| analysis_json_path | str \| Path | Path where the analysis database (typically analysis.json) should be persisted. Useful for caching analysis results between sessions. If not provided, a default location within the project is used. |
| cache_dir | str \| Path \| None | Python only. Directory path for the codeanalyzer-python backend's cache, including its virtualenv, CodeQL database, and analysis_cache.json. When omitted, defaults to <project_path>/.codeanalyzer. Ignored for Java and C. |
| use_codeql | bool | Python only. If True (default), augments Jedi-based call graph resolution with CodeQL analysis for more complete call edges. Set to False for faster analysis using only Jedi. Ignored for Java and C. |
| use_ray | bool | Python only. If True, enables Ray-based parallel processing for analysis. Recommended for very large projects where sequential Jedi/CodeQL analysis would be slow. Requires Ray to be installed. Defaults to False. Ignored for Java and C. |
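Since target_files is documented as relative to project_path, callers holding absolute paths need a conversion step. A minimal sketch, assuming forward-slash relative paths are acceptable to the backend; `to_target_files` is a hypothetical helper, not a CLDK function.

```python
from pathlib import Path


def to_target_files(project_path, paths):
    """Hypothetical helper: express each path relative to project_path."""
    root = Path(project_path).resolve()
    relative = []
    for p in paths:
        p = Path(p)
        absolute = p if p.is_absolute() else root / p
        # relative_to raises ValueError for files outside the project root.
        relative.append(absolute.resolve().relative_to(root).as_posix())
    return relative


root = Path("commons-cli").resolve()
print(to_target_files("commons-cli", [root / "src" / "Option.java"]))
# ['src/Option.java']
```

The result can be passed as `target_files=` alongside the same `project_path`.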

Returns:

  • JavaAnalysis | PythonAnalysis | CAnalysis: A language-specific analysis facade instance:
      - JavaAnalysis for Java projects
      - PythonAnalysis for Python projects
      - CAnalysis for C projects

Raises:

  • CldkInitializationException: Raised in the following cases:
      - Neither project_path nor source_code is provided.
      - Both project_path and source_code are provided.
      - source_code is provided for Python analysis (not supported).
      - analysis_backend_path is provided for Python analysis (use cache_dir instead).
  • NotImplementedError: If the language specified during CLDK initialization is not supported.
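The documented error cases can be checked before any backend is started. The sketch below is a hypothetical pre-flight check, not CLDK's own validation; it raises ValueError as a stand-in because the import path of CldkInitializationException is not shown here.

```python
def check_analysis_args(language, project_path=None, source_code=None,
                        analysis_backend_path=None):
    """Hypothetical pre-flight check mirroring the documented error cases."""
    if project_path is None and source_code is None:
        raise ValueError("provide project_path or source_code")
    if project_path is not None and source_code is not None:
        raise ValueError("project_path and source_code are mutually exclusive")
    if language == "python" and source_code is not None:
        raise ValueError("source_code is not supported for Python")
    if language == "python" and analysis_backend_path is not None:
        raise ValueError("analysis_backend_path is Java only; use cache_dir")


check_analysis_args("java", source_code="class A {}")  # passes silently
try:
    check_analysis_args("python", source_code="x = 1")
except ValueError as err:
    print(err)  # source_code is not supported for Python
```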

Note: The analysis process may download or build backend tools on first run, which can take additional time. Subsequent runs use cached backends for faster startup.

See Also

  • AnalysisLevel: Available analysis depth options.
  • JavaAnalysis: Java analysis methods.
  • PythonAnalysis: Python analysis methods.

treesitter_parser() -> TreesitterJava

Return a Tree-sitter parser for the selected language.

Creates and returns a language-specific Tree-sitter parser instance that can be used for syntactic analysis, AST traversal, and code querying operations. Tree-sitter provides incremental parsing with excellent performance characteristics for real-time code analysis.

The returned parser provides methods for:

  • Parsing source code into an AST
  • Running Tree-sitter queries to extract code patterns
  • Extracting syntactic elements (methods, classes, imports, etc.)
  • Performing lexical analysis

Returns:

  • TreesitterJava: A Tree-sitter parser wrapper for Java source code. The parser provides methods such as is_parsable, get_raw_ast, get_all_imports, and various code extraction utilities.

Raises:

  • NotImplementedError: If the language specified during CLDK initialization does not have a Tree-sitter parser implementation. Currently, only Java is supported.

Note: The Tree-sitter parser operates at the syntactic level only and does not perform semantic analysis. For semantic information like resolved types or call graphs, use analysis instead.
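A minimal sketch of a syntactic check built on the parser. It assumes that is_parsable and get_all_imports each take the source string as their only argument, which this page does not confirm; `summarize_java` and its `parser` parameter are hypothetical and exist so the logic can be tried with a stand-in object.

```python
def summarize_java(source, parser=None):
    """Hypothetical helper: report parsability and imports of Java source."""
    if parser is None:
        # Imported lazily so the sketch loads even where cldk is not installed.
        from cldk import CLDK
        parser = CLDK(language="java").treesitter_parser()
    # Assumption: both methods accept the source text as a single argument.
    if not parser.is_parsable(source):
        return {"parsable": False, "imports": []}
    return {"parsable": True, "imports": sorted(parser.get_all_imports(source))}
```

Checking is_parsable first avoids querying imports on source that Tree-sitter could not parse cleanly.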

See Also

  • TreesitterJava: Java Tree-sitter parser implementation.

tree_sitter_utils(source_code: str) -> TreesitterSanitizer

Return Tree-sitter-based code sanitization utilities for the selected language.

Creates and returns a utility class that provides code transformation and sanitization operations using Tree-sitter for parsing. These utilities are particularly useful for preparing code for LLM consumption, test generation, and code analysis tasks.

The sanitization utilities provide operations such as:

  • Removing unused imports from source code
  • Keeping only focal methods and their callees for context reduction
  • Extracting and manipulating test assertions
  • Identifying and removing dead code

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| source_code | str | The source code string to initialize the utilities with. This code will be parsed and made available for transformation operations. Must be valid syntax for the target language. |

Returns:

  • TreesitterSanitizer: A utility wrapper that provides sanitization and transformation methods for Java source code, including:
      - keep_only_focal_method_and_its_callees
      - remove_unused_imports

Raises:

  • NotImplementedError: If the language specified during CLDK initialization does not have sanitization utilities implemented. Currently, only Java is supported.

Note: The sanitization utilities modify code at the syntactic level using Tree-sitter patterns. For complex refactoring that requires semantic understanding, consider using the full analysis capabilities via analysis.

See Also

  • TreesitterSanitizer: Java sanitization utility implementation.
  • treesitter_parser: For raw Tree-sitter parsing without sanitization utilities.