Multi-lingual, multi-project analysis at scale

In the default workflow, CLDK runs the analysis in-process: you point CLDK.java(...) at a project, the backend parses it, and the typed models live in memory for the lifetime of that object. That is the right model for a single project on a single machine.

It does not fit a fleet. When you have hundreds of repositories across several languages, and agents that need to answer structural questions about any of them at any time, re-analyzing on every request is wasteful, and holding every project in memory is impossible.

CLDK supports a second model for exactly this case. Analysis and querying are split into two phases that scale independently:

  1. Emit — each codeanalyzer-* backend projects its analysis into a Neo4j property graph instead of a JSON file. This is the expensive, batchable step; run it once per project (and incrementally thereafter), wherever you have compute.
  2. Poll — the CLDK SDK connects to that graph as a read-only Cypher client. No source is parsed at query time; the analysis object answers from the graph. This is the cheap, horizontally-scalable step that your agents run.

Because every language’s graph shares one database, agents get multi-lingual, multi-project program analysis behind the same analysis API they already use.

flowchart LR
    subgraph Emit["Emit · batch jobs (write once)"]
        J["codeanalyzer-java<br/>--emit neo4j"]
        P["canpy<br/>--emit neo4j"]
        T["cants<br/>--emit neo4j"]
    end
    subgraph DB["Shared Neo4j graph"]
        N[("J* · Py* · TS*<br/>one DB, many apps")]
    end
    subgraph Poll["Poll · agents (read many)"]
        A1["CLDK.java(backend=Neo4j…)"]
        A2["CLDK.python(backend=Neo4j…)"]
        A3["CLDK.typescript(backend=Neo4j…)"]
    end
    J --> N
    P --> N
    T --> N
    N --> A1
    N --> A2
    N --> A3

Each backend writes a namespaced label set so multiple languages share a single database without collisions. Java labels are J* / J_*, Python Py* / PY_*, TypeScript TS* / TS_*:

| Language | App anchor | Module node | Symbol node (merge label) | Call edge |
|---|---|---|---|---|
| Java | :JApplication | :JCompilationUnit | :JSymbol · :JType / :JCallable | :J_CALLS |
| Python | :PyApplication | :PyModule | :PySymbol · :PyClass / :PyCallable / :PyExternal | :PY_CALLS |
| TypeScript | :TSApplication | :TSModule | :TSSymbol · :TSClass / :TSExternal | :TS_CALLS |

Within a language, multiple projects coexist too. Every node is scoped to an application anchor identified by its --app-name, so one database can hold payments-service, web-frontend, and billing-core side by side. On the read side, application_name selects which one a query sees.
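As a minimal sketch of how the namespacing and the application anchor combine in a query, the helper below builds an app-scoped Cypher statement. The label names and the `name` property come from this page; the `LABELS` dict and `app_anchor_query` are hypothetical names introduced for illustration and are not part of CLDK.

```python
# Labels copied from the table on this page; the dict layout is illustrative.
LABELS = {
    "java": {"app": "JApplication", "module": "JCompilationUnit", "calls": "J_CALLS"},
    "python": {"app": "PyApplication", "module": "PyModule", "calls": "PY_CALLS"},
    "typescript": {"app": "TSApplication", "module": "TSModule", "calls": "TS_CALLS"},
}


def app_anchor_query(language, app_name):
    """Return (cypher, params) matching one application's anchor node.

    The label is interpolated from the fixed table above; the application
    name is passed as a query parameter rather than spliced into the text.
    """
    label = LABELS[language]["app"]
    cypher = f"MATCH (a:{label} {{name: $app}}) RETURN a.name AS app"
    return cypher, {"app": app_name}


if __name__ == "__main__":
    print(app_anchor_query("python", "web-frontend"))
```

Keeping the language prefix in a lookup table means a typo in a language name fails with a `KeyError` before any query reaches the database.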

External (phantom) nodes. The Python and TypeScript graphs add :PyExternal / :TSExternal nodes for call targets outside the analyzed app (third-party libraries, the standard library, and Node built-ins), so call edges never dangle. These nodes carry no _module and are shared rather than owned by one app, which is why a query can see that an app calls os.path.join or node:crypto.createHash even though those are not part of its source.

Every backend takes the same --emit neo4j flag. With --neo4j-uri set it pushes to a live database over Bolt; without it, it writes a self-contained graph.cypher snapshot you can load with cypher-shell. The two modes differ in more than destination: the live Bolt push is content-hash incremental (it rewrites only changed modules and prunes orphans on a full run), while the graph.cypher snapshot wipes and rebuilds that application’s subgraph in full every time it is applied. Connection settings can also be supplied through the NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, and NEO4J_DATABASE environment variables; an explicit flag wins over the environment.

Java → Neo4j (live Bolt push)

    # -a 2 includes call edges (:J_CALLS); -a 1 is symbol table only.
    codeanalyzer -i ./payments-service -a 2 --emit neo4j \
        --app-name payments-service \
        --neo4j-uri bolt://localhost:7687 \
        --neo4j-user neo4j --neo4j-password "$NEO4J_PASSWORD"

Java → graph.cypher snapshot (no live DB)

    codeanalyzer -i ./payments-service -a 2 --emit neo4j -o ./out
    cypher-shell -u neo4j -p "$NEO4J_PASSWORD" < ./out/graph.cypher
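For a fleet, the live-push command is usually wrapped in a small batch driver. The sketch below assumes a hypothetical mapping of application names to checkout directories and uses only the flags shown on this page. It leaves credentials to the NEO4J_USERNAME / NEO4J_PASSWORD environment variables, relying on the statement above that the backend reads them. `build_emit_command` and `emit_fleet` are illustrative names, not part of CLDK.

```python
import subprocess


def build_emit_command(project_dir, app_name, neo4j_uri):
    """Build the live Bolt push command for one Java project.

    Credentials are deliberately not passed as flags; this sketch assumes
    they are provided via NEO4J_USERNAME / NEO4J_PASSWORD in the environment.
    """
    return [
        "codeanalyzer",
        "-i", project_dir,
        "-a", "2",  # include call edges
        "--emit", "neo4j",
        "--app-name", app_name,
        "--neo4j-uri", neo4j_uri,
    ]


def emit_fleet(projects, neo4j_uri, run=subprocess.run):
    """Run one emit job per project and return {app_name: exit code}.

    `projects` maps app name -> project directory. `run` is injectable so
    the loop can be exercised without the analyzer installed.
    """
    results = {}
    for app_name, project_dir in projects.items():
        completed = run(build_emit_command(project_dir, app_name, neo4j_uri), check=False)
        results[app_name] = completed.returncode
    return results
```

A non-zero exit code for one application does not stop the loop, so a scheduler can retry only the failed entries.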

Re-running a job is safe and cheap, which is what makes a scheduled fleet practical:

  • Idempotent — both writers create the schema constraints and indexes first, then upsert with MERGE (never blind CREATE). Re-emitting the same project produces the same graph.
  • Incremental — a live Bolt push diffs each module against the graph by content hash and rewrites only what changed. On a full run, modules whose source file disappeared are pruned. The Python writer scopes this prune to the application’s anchor, so other apps in a shared database are untouched; the Java and TypeScript writers currently prune any compilation unit or module absent from the current run, regardless of app. On a shared instance, give each Java or TypeScript application its own Neo4j database (or run its full job against a dedicated one) so a full run for app B never deletes app A.
  • Index-backed — constraints exist before any MERGE, so every upsert is an index seek rather than a scan, even as the database grows.
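The incremental behaviour described above can be pictured as a three-way split of modules by content hash. The sketch below is a conceptual model only, not the backends' actual implementation: the choice of SHA-256, the function names, and the path-to-hash dictionaries are all assumptions made for illustration.

```python
import hashlib


def content_hash(source_text):
    """Hash a module's source text (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()


def plan_incremental_update(current_sources, stored_hashes):
    """Decide what a full incremental run would touch.

    current_sources: path -> source text found in this run.
    stored_hashes:   path -> content hash already recorded in the graph.
    Returns paths to rewrite, paths to prune, and paths to skip.
    """
    current = {path: content_hash(text) for path, text in current_sources.items()}
    write = sorted(p for p, h in current.items() if stored_hashes.get(p) != h)
    skip = sorted(p for p, h in current.items() if stored_hashes.get(p) == h)
    prune = sorted(p for p in stored_hashes if p not in current)
    return {"write": write, "prune": prune, "skip": skip}
```

In this model, the prune list is only safe when `stored_hashes` is restricted to the application being re-emitted, which mirrors the per-app scoping caveat for the Java and TypeScript writers.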

Each backend can emit its graph schema as a machine-readable, version-stamped contract with --emit schema (no project needed). The schema_version is also stamped on every graph’s :*Application node, so a consumer can check compatibility before querying. Read it back per application with MATCH (a:JApplication {name: $app}) RETURN a.schema_version (use :PyApplication / :TSApplication for the other languages). Each backend versions its schema independently, so pin each consumer to the contract version it was generated against rather than assuming one shared version.

Export the schema contract

    codeanalyzer --emit schema -o ./out   # -> ./out/schema.neo4j.json
    canpy --emit schema -o ./out          # -> ./out/schema.json
    cants --emit schema -o ./out          # -> ./out/schema.json
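A consumer can run the compatibility check as a guard before issuing any other query. The sketch below assumes an exact-match policy between the stamped schema_version and a pinned value, because this page does not define compatibility rules. `run_query` is a hypothetical callable taking `(cypher, params)` and returning a list of dict rows, for example a thin wrapper around a Neo4j driver session.

```python
APP_LABELS = {
    "java": "JApplication",
    "python": "PyApplication",
    "typescript": "TSApplication",
}


def schema_version_query(language):
    """Cypher that reads the schema_version stamped on one application anchor."""
    label = APP_LABELS[language]
    return f"MATCH (a:{label} {{name: $app}}) RETURN a.schema_version AS schema_version"


def check_schema_version(run_query, language, app_name, pinned_version):
    """Raise unless the graph's schema_version equals the pinned one.

    Exact equality is an assumption of this sketch; a real consumer may
    prefer a looser rule if the backend documents one.
    """
    rows = run_query(schema_version_query(language), {"app": app_name})
    if not rows:
        raise LookupError(f"no {language} application named {app_name!r}")
    found = rows[0]["schema_version"]
    if found != pinned_version:
        raise RuntimeError(
            f"{app_name}: graph schema {found!r} != pinned {pinned_version!r}"
        )
    return found
```

Because each backend versions its schema independently, the pinned value would be kept per language rather than as a single constant.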

On the read side, nothing about the analysis API changes. You pass a Neo4jConnectionConfig as backend= and the SDK selects the read-only Cypher backend by config type. project_path is optional in this mode (no source is read); application_name selects which project in the database the queries see.

Query a Java graph

    from cldk import CLDK
    from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

    analysis = CLDK.java(
        backend=Neo4jConnectionConfig(
            uri="bolt://neo4j:7687",
            username="reader",  # read-only credentials are enough
            password="",
            application_name="payments-service",
        ),
    )
    classes = analysis.get_classes()  # -> dict[str, JType], straight from the graph
    cg = analysis.get_call_graph()  # -> networkx.DiGraph

The Neo4j backends (JNeo4jBackend, PyNeo4jBackend, TSNeo4jBackend) expose the same method surface as the in-process backends, so existing query code is a drop-in: only the backend= argument changes.

  1. Install the optional driver: pip install cldk[neo4j] (it pulls neo4j>=5.14). The driver is an extra, not a core dependency.
  2. Point Neo4jConnectionConfig.uri at your Bolt endpoint and set application_name to the project you want to query.
  3. Call the usual get_classes, get_call_graph, get_callers, get_callees, and related methods. The graph answers; no source is parsed.
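The steps above can be wrapped in a small factory so an agent opens any application in the shared database with one call. This is a sketch under assumptions: `neo4j_config_kwargs` and `open_analysis` are hypothetical helpers, the fallback values are illustrative, and reading the NEO4J_* variables on the read side is something this code does explicitly rather than a documented SDK behaviour. Only the `Neo4jConnectionConfig` fields shown earlier on this page are used.

```python
import os


def neo4j_config_kwargs(application_name, env=None):
    """Collect Neo4jConnectionConfig keyword arguments from the environment.

    The defaults below are placeholders for local experiments only.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
        "application_name": application_name,
    }


def open_analysis(language, application_name):
    """Return a graph-backed analysis object for one application.

    Imports are local so the pure helper above works without cldk installed.
    """
    from cldk import CLDK
    from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

    factories = {
        "java": CLDK.java,
        "python": CLDK.python,
        "typescript": CLDK.typescript,
    }
    config = Neo4jConnectionConfig(**neo4j_config_kwargs(application_name))
    return factories[language](backend=config)
```

With this in place, `open_analysis("java", "payments-service").get_classes()` would follow the same path as the explicit example, with only the application name varying per request.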

The analysis API covers the common structural questions, but the graph is a full Neo4j property graph: anything the API does not model, an agent can reach with raw Cypher over the same Bolt endpoint. The label namespacing is the query contract — scope by the :*Application anchor and the language prefix.

Discover what a shared database holds. The SDK expects you to know the application_name up front (the Neo4j backends raise if it is missing, and offer no enumeration method), so taking inventory is itself a Cypher query:

    MATCH (a)
    WHERE a:JApplication OR a:PyApplication OR a:TSApplication
    RETURN labels(a)[0] AS language, a.name AS app, a.schema_version
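From Python, the same inventory can be taken with the Neo4j driver that the cldk[neo4j] extra installs. The sketch below is one possible shape: `fetch_inventory` and `group_inventory` are hypothetical names, and the query adds an `AS schema_version` alias so the returned rows have readable keys.

```python
INVENTORY_QUERY = """
MATCH (a)
WHERE a:JApplication OR a:PyApplication OR a:TSApplication
RETURN labels(a)[0] AS language, a.name AS app, a.schema_version AS schema_version
"""


def group_inventory(rows):
    """Group inventory rows into {language label: sorted list of app names}."""
    inventory = {}
    for row in rows:
        inventory.setdefault(row["language"], []).append(row["app"])
    return {language: sorted(apps) for language, apps in inventory.items()}


def fetch_inventory(uri, username, password):
    """Run the inventory query over Bolt and group the result.

    The driver import is local because it is an optional extra.
    """
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(username, password)) as driver:
        with driver.session() as session:
            # Materialize rows while the session is still open.
            rows = [record.data() for record in session.run(INVENTORY_QUERY)]
    return group_inventory(rows)
```

An agent could call `fetch_inventory` once at start-up and use the result to choose a valid application_name for each analysis object.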

Full-text code search. Every backend creates a Neo4j full-text index over each callable’s code and docstring, named per language (j_code_fts, py_code_fts, code_fts). The CLDK API does not wrap it, so reach it over Bolt:

    CALL db.index.fulltext.queryNodes('py_code_fts', 'jwt OR authenticate') YIELD node, score
    RETURN node.signature, score ORDER BY score DESC LIMIT 20
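When search terms come from an agent or a user, characters such as `:` or `/` have special meaning in the Lucene query syntax that Neo4j full-text indexes use, so they need escaping first. The sketch below builds a parameterized version of the query above. `lucene_escape` and `fts_query` are hypothetical helpers, and mapping `code_fts` to TypeScript is inferred from the order of the index names listed above, so treat it as an assumption.

```python
import re

# Index names as listed on this page; the TypeScript mapping is inferred.
FTS_INDEX = {"java": "j_code_fts", "python": "py_code_fts", "typescript": "code_fts"}

# Lucene query-syntax special characters, plus the && and || operators.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def lucene_escape(term):
    """Backslash-escape Lucene special characters in a single search term."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


def fts_query(language, terms, limit=20):
    """Return (cypher, params) for an OR search over a language's code index."""
    lucene = " OR ".join(lucene_escape(term) for term in terms)
    cypher = (
        "CALL db.index.fulltext.queryNodes($index, $q) YIELD node, score "
        "RETURN node.signature AS signature, score "
        "ORDER BY score DESC LIMIT $limit"
    )
    return cypher, {"index": FTS_INDEX[language], "q": lucene, "limit": limit}
```

Escaping does not quote whitespace, so a multi-word term is still treated by Lucene as several terms; wrap it in double quotes yourself if you need a phrase match.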

Java carries more in its graph. The Java emitter projects CRUD operations and a system dependency graph that the other languages do not, and JNeo4jBackend surfaces them with no re-parsing: get_all_crud_operations() (plus get_all_read_operations() / …_create… / …_update… / …_delete…), get_system_dependency_graph(), and get_all_entry_point_methods(). See the Java API reference.