Multi-lingual, multi-project analysis at scale
In the default workflow, CLDK runs the analysis in-process: you point CLDK.java(...) at a project, the backend parses it, and the typed models live in memory for the lifetime of that object. That is the right model for a single project on a single machine.
It does not fit a fleet. When you have hundreds of repositories across several languages, and agents that need to answer structural questions about any of them at any time, re-analyzing on every request is wasteful, and holding every project in memory is impossible.
CLDK supports a second model for exactly this case. Analysis and querying are split into two phases that scale independently:
- Emit — each codeanalyzer-* backend projects its analysis into a Neo4j property graph instead of a JSON file. This is the expensive, batchable step; run it once per project (and incrementally thereafter), wherever you have compute.
- Poll — the CLDK SDK connects to that graph as a read-only Cypher client. No source is parsed at query time; the analysis object answers from the graph. This is the cheap, horizontally scalable step that your agents run.
Because every language’s graph shares one database, agents get multi-lingual, multi-project program analysis behind the same analysis API they already use.
flowchart LR
subgraph Emit["Emit · batch jobs (write once)"]
J["codeanalyzer-java<br/>--emit neo4j"]
P["canpy<br/>--emit neo4j"]
T["cants<br/>--emit neo4j"]
end
subgraph DB["Shared Neo4j graph"]
N[("J* · Py* · TS*<br/>one DB, many apps")]
end
subgraph Poll["Poll · agents (read many)"]
A1["CLDK.java(backend=Neo4j…)"]
A2["CLDK.python(backend=Neo4j…)"]
A3["CLDK.typescript(backend=Neo4j…)"]
end
J --> N
P --> N
T --> N
N --> A1
N --> A2
N --> A3
One graph, every language, every project
Each backend writes a namespaced label set so multiple languages share a single database without collisions. Java labels are J* / J_*, Python Py* / PY_*, TypeScript TS* / TS_*:
| Language | App anchor | Module node | Symbol node (merge label) | Call edge |
|---|---|---|---|---|
| Java | :JApplication | :JCompilationUnit | :JSymbol → :JType / :JCallable | :J_CALLS |
| Python | :PyApplication | :PyModule | :PySymbol → :PyClass / :PyCallable / :PyExternal | :PY_CALLS |
| TypeScript | :TSApplication | :TSModule | :TSSymbol → :TSClass … :TSExternal | :TS_CALLS |
Within a language, multiple projects coexist too. Every node is scoped to an application anchor identified by its --app-name, so one database can hold payments-service, web-frontend, and billing-core side by side. On the read side, application_name selects which one a query sees.
External (phantom) nodes. The Python and TypeScript graphs add :PyExternal / :TSExternal nodes for call targets outside the analyzed app (third-party libraries, the standard library, and Node built-ins), so call edges never dangle. They carry no _module and are shared rather than owned by one app, which is why a query can see that an app calls os.path.join or node:crypto.createHash even though those are not part of its source.
Phase 1 — Emit a graph
Every backend takes the same --emit neo4j flag. With --neo4j-uri set it pushes to a live database over Bolt; without it, it writes a self-contained graph.cypher snapshot you can load with cypher-shell. The two modes differ in more than destination: the live Bolt push is content-hash incremental (it rewrites only changed modules and prunes orphans on a full run), while the graph.cypher snapshot wipes and rebuilds that application’s subgraph in full every time it is applied. Connection settings also read the NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, and NEO4J_DATABASE environment variables; an explicit flag wins over the environment.
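The precedence rule in the last sentence (flag first, environment second) can be sketched in a few lines of Python. This is only an illustration of the rule, not the backends' own code: `resolve_neo4j_settings` and the `flags` dictionary keys are hypothetical names, and treating `None` as "flag not given" is an assumption.

```python
import os
from typing import Mapping, Optional

# Environment variable names as listed in the text; the dict keys are hypothetical.
ENV_NAMES = {
    "uri": "NEO4J_URI",
    "username": "NEO4J_USERNAME",
    "password": "NEO4J_PASSWORD",
    "database": "NEO4J_DATABASE",
}


def resolve_neo4j_settings(
    flags: Mapping[str, Optional[str]],
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Merge explicit flag values with environment values; an explicit flag wins."""
    resolved = {}
    for key, env_name in ENV_NAMES.items():
        flag_value = flags.get(key)
        # Assumption: None means the flag was not passed, so fall back to the environment.
        resolved[key] = flag_value if flag_value is not None else env.get(env_name)
    return resolved
```

Passing `env` explicitly keeps the function easy to exercise in a scheduler or a unit test without touching the real process environment.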
Java, live Bolt push:

    # -a 2 includes call edges (:J_CALLS); -a 1 is symbol table only.
    codeanalyzer -i ./payments-service -a 2 --emit neo4j \
      --app-name payments-service \
      --neo4j-uri bolt://localhost:7687 \
      --neo4j-user neo4j --neo4j-password "$NEO4J_PASSWORD"

Java, graph.cypher snapshot:

    codeanalyzer -i ./payments-service -a 2 --emit neo4j -o ./out
    cypher-shell -u neo4j -p "$NEO4J_PASSWORD" < ./out/graph.cypher

Python, live Bolt push:

    canpy -i ./billing-core --emit neo4j \
      --app-name billing-core \
      --neo4j-uri bolt://localhost:7687 \
      --neo4j-user neo4j --neo4j-password "$NEO4J_PASSWORD"

Python, graph.cypher snapshot:

    canpy -i ./billing-core --emit neo4j -o ./out   # -> ./out/graph.cypher
    cypher-shell -u neo4j -p "$NEO4J_PASSWORD" < ./out/graph.cypher

TypeScript, live Bolt push:

    cants -i ./web-frontend -a 2 --emit neo4j \
      --app-name web-frontend \
      --neo4j-uri bolt://localhost:7687 \
      --neo4j-user neo4j --neo4j-password "$NEO4J_PASSWORD"

TypeScript, graph.cypher snapshot:

    cants -i ./web-frontend -a 2 --emit neo4j -o ./out   # -> ./out/graph.cypher
    cypher-shell -u neo4j -p "$NEO4J_PASSWORD" < ./out/graph.cypher

Writes are idempotent and incremental
Re-running a job is safe and cheap, which is what makes a scheduled fleet practical:
- Idempotent — both writers create the schema constraints and indexes first, then upsert with MERGE (never blind CREATE). Re-emitting the same project produces the same graph.
- Incremental — a live Bolt push diffs each module against the graph by content hash and rewrites only what changed. On a full run, modules whose source file disappeared are pruned. The Python writer scopes this prune to the application’s anchor, so other apps in a shared database are untouched; the Java and TypeScript writers currently prune any compilation unit or module absent from the current run, regardless of app. On a shared instance, give each Java or TypeScript application its own Neo4j database (or run its full job against a dedicated one) so a full run for app B never deletes app A.
- Index-backed — constraints exist before any MERGE, so every upsert is an index seek rather than a scan, even as the database grows.
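To make the incremental and pruning behaviour concrete, here is a small in-memory model of one full run. It is a sketch under stated assumptions, not the writers' implementation: the graph is modelled as a dictionary from (app, module path) to a stored hash, SHA-256 is an assumed hash function, and `plan_update` is a hypothetical name. The `scope_prune_to_app` switch models the difference described above between an app-scoped prune and a prune that ignores the app.

```python
import hashlib
from typing import Dict, Set, Tuple

ModuleKey = Tuple[str, str]  # (app_name, module_path)


def content_hash(source: str) -> str:
    # Assumption: SHA-256 over UTF-8 source text; the real writers may hash differently.
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def plan_update(
    stored: Dict[ModuleKey, str],
    current_sources: Dict[str, str],
    app: str,
    scope_prune_to_app: bool,
) -> Tuple[Set[str], Set[ModuleKey]]:
    """Return (module paths to rewrite, stored modules to prune) for a full run of `app`."""
    current_hashes = {path: content_hash(src) for path, src in current_sources.items()}

    # Rewrite only modules that are new or whose hash differs from the stored one.
    to_rewrite = {
        path
        for path, digest in current_hashes.items()
        if stored.get((app, path)) != digest
    }

    # Prune stored modules that the current run did not see.
    to_prune: Set[ModuleKey] = set()
    for stored_app, path in stored:
        absent = path not in current_hashes
        if scope_prune_to_app:
            # App-scoped prune: only this application's orphans are removed.
            orphaned = stored_app == app and absent
        else:
            # Unscoped prune: anything absent from this run goes, whichever app owns it.
            orphaned = absent
        if orphaned:
            to_prune.add((stored_app, path))
    return to_rewrite, to_prune
```

With `scope_prune_to_app=False`, a full run for one application also marks every other application's modules for pruning in this model, which is the reason for the one-database-per-application advice for Java and TypeScript.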
A versioned schema contract
Each backend can emit its graph schema as a machine-readable, version-stamped contract with --emit schema (no project needed). The schema_version is also stamped on every graph’s :*Application node, so a consumer can check compatibility before querying. Read it back per application with MATCH (a:JApplication {name: $app}) RETURN a.schema_version (use :PyApplication / :TSApplication for the others). Each backend versions its schema independently, so pin each backend’s contract to the version you generated against rather than assuming one shared version.
    codeanalyzer --emit schema -o ./out   # -> ./out/schema.neo4j.json
    canpy --emit schema -o ./out          # -> ./out/schema.json
    cants --emit schema -o ./out          # -> ./out/schema.json

Phase 2 — Poll the graph
On the read side, nothing about the analysis API changes. You pass a Neo4jConnectionConfig as backend= and the SDK selects the read-only Cypher backend by config type. project_path is optional in this mode (no source is read); application_name selects which project in the database the queries see.
Java:

    from cldk import CLDK
    from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

    analysis = CLDK.java(
        backend=Neo4jConnectionConfig(
            uri="bolt://neo4j:7687",
            username="reader",  # read-only credentials are enough
            password="…",
            application_name="payments-service",
        ),
    )

    classes = analysis.get_classes()  # -> dict[str, JType], straight from the graph
    cg = analysis.get_call_graph()    # -> networkx.DiGraph

Python:

    from cldk import CLDK
    from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

    analysis = CLDK.python(
        backend=Neo4jConnectionConfig(
            uri="bolt://neo4j:7687",
            username="reader",
            password="…",
            application_name="billing-core",
        ),
    )

    callers = analysis.get_callers("billing_core.invoice.Invoice", "finalize")

TypeScript:

    from cldk import CLDK
    from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

    analysis = CLDK.typescript(
        backend=Neo4jConnectionConfig(
            uri="bolt://neo4j:7687",
            username="reader",
            password="…",
            application_name="web-frontend",
        ),
    )

    cg = analysis.get_call_graph()  # -> networkx.DiGraph

The Neo4j backends (JNeo4jBackend, PyNeo4jBackend, TSNeo4jBackend) expose the same method surface as the in-process backends, so existing query code is a drop-in: only the backend= argument changes.
- Install the optional driver: pip install cldk[neo4j] (it pulls neo4j>=5.14). The driver is an extra, not a core dependency.
- Point Neo4jConnectionConfig.uri at your Bolt endpoint and set application_name to the project you want to query.
- Call the usual get_classes, get_call_graph, get_callers, get_callees, and related methods. The graph answers; no source is parsed.
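For an agent that moves between several applications, one possible pattern is to keep a single analysis object per (language, application) pair and reuse it. The sketch below is not part of CLDK: `AnalysisPool` and `cldk_factory` are hypothetical helpers, and it assumes that a Neo4j-backed analysis object can safely be reused across queries. The constructor calls inside `cldk_factory` are the ones shown in this section.

```python
from typing import Any, Callable, Dict, Tuple

Key = Tuple[str, str]  # (language, application_name)


def cldk_factory(uri: str, username: str, password: str) -> Callable[[str, str], Any]:
    """Return a function that builds a Neo4j-backed analysis object on demand."""

    def make(language: str, application_name: str) -> Any:
        # Imported lazily so the pool itself does not require cldk to be installed.
        from cldk import CLDK
        from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

        constructor = getattr(CLDK, language)  # CLDK.java / CLDK.python / CLDK.typescript
        return constructor(
            backend=Neo4jConnectionConfig(
                uri=uri,
                username=username,
                password=password,
                application_name=application_name,
            )
        )

    return make


class AnalysisPool:
    """Hypothetical helper: caches one analysis object per (language, application)."""

    LANGUAGES = ("java", "python", "typescript")

    def __init__(self, factory: Callable[[str, str], Any]) -> None:
        self._factory = factory
        self._cache: Dict[Key, Any] = {}

    def get(self, language: str, application_name: str) -> Any:
        if language not in self.LANGUAGES:
            raise ValueError(f"unsupported language: {language!r}")
        key = (language, application_name)
        if key not in self._cache:
            # First request for this pair: build the analysis object once.
            self._cache[key] = self._factory(language, application_name)
        return self._cache[key]


# Example wiring (commented out because it needs cldk and a reachable database):
# pool = AnalysisPool(cldk_factory("bolt://neo4j:7687", "reader", os.environ["NEO4J_PASSWORD"]))
# cg = pool.get("java", "payments-service").get_call_graph()
```

Injecting the factory keeps the cache logic independent of the driver, so it can be exercised with a stub in tests.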
Query the graph directly with Cypher
The analysis API covers the common structural questions, but the graph is a full Neo4j property graph: anything the API does not model, an agent can reach with raw Cypher over the same Bolt endpoint. The label namespacing is the query contract — scope by the :*Application anchor and the language prefix.
Discover what a shared database holds. The SDK expects you to know the application_name up front (the Neo4j backends raise if it is missing, and offer no enumeration method), so taking inventory is itself a Cypher query:
    MATCH (a)
    WHERE a:JApplication OR a:PyApplication OR a:TSApplication
    RETURN labels(a)[0] AS language, a.name AS app, a.schema_version

Full-text code search. Every backend creates a Neo4j full-text index over each callable’s code and docstring, named per language (j_code_fts, py_code_fts, code_fts). The CLDK API does not wrap it, so reach it over Bolt:
    CALL db.index.fulltext.queryNodes('py_code_fts', 'jwt OR authenticate') YIELD node, score
    RETURN node.signature, score ORDER BY score DESC LIMIT 20

Java carries more in its graph. The Java emitter projects CRUD operations and a system dependency graph that the other languages do not, and JNeo4jBackend surfaces them with no re-parsing: get_all_crud_operations() (plus get_all_read_operations() / …_create… / …_update… / …_delete…), get_system_dependency_graph(), and get_all_entry_point_methods(). See the Java API reference.
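As a rough illustration of raw Cypher from Python, the inventory query from earlier in this section can be run with the official neo4j driver and its rows checked against pinned schema versions. The helper names (`fetch_inventory`, `incompatible_apps`) are hypothetical, and strict version equality is an assumed compatibility rule; a backend's contract may allow something looser.

```python
from typing import Dict, List, Mapping, Tuple

# The inventory query from this section, unchanged.
INVENTORY_QUERY = (
    "MATCH (a) "
    "WHERE a:JApplication OR a:PyApplication OR a:TSApplication "
    "RETURN labels(a)[0] AS language, a.name AS app, a.schema_version"
)


def fetch_inventory(uri: str, username: str, password: str) -> List[dict]:
    """Run the inventory query over Bolt and return one dict per application."""
    import neo4j  # optional dependency, installed by the cldk[neo4j] extra

    with neo4j.GraphDatabase.driver(uri, auth=(username, password)) as driver:
        result = driver.execute_query(
            INVENTORY_QUERY, routing_=neo4j.RoutingControl.READ
        )
        return [record.data() for record in result.records]


def incompatible_apps(
    rows: List[Mapping[str, str]],
    pinned: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (anchor label, app, found version) for apps that do not match the pin."""
    mismatches = []
    for row in rows:
        expected = pinned.get(row["language"])
        # The RETURN clause does not alias this column, so its key is "a.schema_version".
        found = row["a.schema_version"]
        # Assumption: an unpinned language or any difference counts as incompatible.
        if expected is None or found != expected:
            mismatches.append((row["language"], row["app"], found))
    return mismatches
```

An agent could call `incompatible_apps(fetch_inventory(...), pinned)` once at startup and skip or flag any application it returns before issuing structural queries.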