POST /compose/batch_match/{db_name}
Performs cross-collection semantic matching. For each row in a source collection, finds the best matching rows in a target collection based on slot embedding similarity. Enables cross-collection joins via the HDC intersection layer.
Request
Content-Type: application/json
URL Parameters:
| Parameter | Description |
|---|---|
| `db_name` | Name of the database containing both collections |
Body:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `source_collection` | string | ✅ | - | Namespace of the source collection |
| `source_slot` | string | ✅ | - | Field in the source collection to match on |
| `target_collection` | string | ✅ | - | Namespace of the target collection |
| `target_slot` | string | ✅ | - | Field in the target collection to match against |
| `top_k_per_source` | int | ❌ | 2 | Number of top target matches per source row |
| `similarity_threshold` | float | ❌ | 0.5 | Minimum similarity score to consider a match |
| `min_margin` | float | ❌ | 0.05 | Minimum score gap between top-1 and top-2 required for a confident match |
| `source_row_ids` | list of int | ❌ | null | Specific source rows to match. If omitted, all rows are used |
| `target_row_ids` | list of int | ❌ | null | Specific target rows to match against. If omitted, all rows are used |
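As a rough illustration of the body parameters, the sketch below assembles a request payload in Python. The helper name `build_batch_match_body` is hypothetical and not part of the API. It fills in the documented defaults and leaves out `source_row_ids` and `target_row_ids` when they are not supplied, on the assumption that omitting a field is equivalent to sending `null`.

```python
from typing import Any, Dict, List, Optional


def build_batch_match_body(
    source_collection: str,
    source_slot: str,
    target_collection: str,
    target_slot: str,
    top_k_per_source: int = 2,
    similarity_threshold: float = 0.5,
    min_margin: float = 0.05,
    source_row_ids: Optional[List[int]] = None,
    target_row_ids: Optional[List[int]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build the JSON body for batch_match."""
    body: Dict[str, Any] = {
        # Required fields.
        "source_collection": source_collection,
        "source_slot": source_slot,
        "target_collection": target_collection,
        "target_slot": target_slot,
        # Optional fields, defaulting to the documented values.
        "top_k_per_source": top_k_per_source,
        "similarity_threshold": similarity_threshold,
        "min_margin": min_margin,
    }
    # Assumption: an omitted row-ID list behaves like null (all rows are used).
    if source_row_ids is not None:
        body["source_row_ids"] = list(source_row_ids)
    if target_row_ids is not None:
        body["target_row_ids"] = list(target_row_ids)
    return body


body = build_batch_match_body(
    "expertise", "subject", "employees", "name", source_row_ids=[0, 1, 2]
)
```

The resulting dictionary can be passed as the `json=` argument of an HTTP client call.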
Behavior
Similarity matrix — Neural embeddings are computed for all unique source and target slot values. A full similarity matrix is built using batch vector similarity, then the top-k matches per source row are extracted.
Margin-based NULL detection — If the gap between the best and second-best match for a source row is below min_margin, the match is flagged as ambiguous. This avoids returning overconfident low-quality joins.
Cross-field matching — Source and target slots can have different names and schemas. Matching is purely based on embedding similarity, enabling joins across heterogeneous collections.
Timing — The response includes a timing object with millisecond breakdowns for embedding and matching phases.
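The matching steps described in this section can be sketched in plain Python. This is a minimal illustration, not the server's implementation: it assumes cosine similarity over embedding vectors that have already been computed, and the order in which it applies `min_margin` and `similarity_threshold` is a guess chosen to agree with the sample response on this page (where a low-score, low-margin row is reported as "ambiguous"). The function names and toy vectors are hypothetical.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_rows(
    source_vecs: List[Sequence[float]],
    target_vecs: List[Sequence[float]],
    top_k: int = 2,
    similarity_threshold: float = 0.5,
    min_margin: float = 0.05,
) -> List[Dict[str, Any]]:
    results = []
    for source_id, s in enumerate(source_vecs):
        # One row of the similarity matrix: this source against every target.
        scores = [(cosine(s, t), target_id) for target_id, t in enumerate(target_vecs)]
        # Keep the top-k candidates, highest score first.
        top = sorted(scores, key=lambda pair: pair[0], reverse=True)[:top_k]
        best_score, best_target = top[0]
        # Margin is the gap between the best and second-best candidate.
        # Assumption: with a single candidate, the margin is the score itself.
        margin = best_score - top[1][0] if len(top) > 1 else best_score
        if margin < min_margin:
            status, target_id = "ambiguous", None
        elif best_score < similarity_threshold:
            status, target_id = "unmatched", None
        else:
            status, target_id = "matched", best_target
        results.append({
            "source_id": source_id,
            "target_id": target_id,
            "score": best_score,
            "margin": margin,
            "status": status,
        })
    return results


# Toy 2-D "embeddings" chosen to produce one row of each status.
targets = [[1.0, 0.0], [0.0, 1.0]]
sources = [[0.9, 0.1], [1.0, 1.0], [-1.0, -0.2]]
results = match_rows(sources, targets)
print([r["status"] for r in results])  # ['matched', 'ambiguous', 'unmatched']
```

The second source vector is equally close to both targets, so its margin is zero and it is flagged rather than joined to an arbitrary row.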
Responses
200 OK
{
  "status": "success",
  "results": [
    {
      "source_id": 0,
      "target_id": 14,
      "score": 0.912,
      "margin": 0.143,
      "status": "matched"
    },
    {
      "source_id": 1,
      "target_id": null,
      "score": 0.431,
      "margin": 0.012,
      "status": "ambiguous"
    }
  ],
  "source_count": 10,
  "target_count": 50,
  "timing": {
    "embed_ms": 42.1,
    "match_ms": 8.3,
    "total_ms": 51.7
  }
}
Each result contains:
| Field | Description |
|---|---|
| `source_id` | Row ID from the source collection |
| `target_id` | Matched row ID from the target collection, or `null` if unmatched |
| `score` | Similarity score of the best match |
| `margin` | Score gap between top-1 and top-2 matches |
| `status` | `"matched"`, `"ambiguous"`, or `"unmatched"` |
Error Responses
| Status | Condition |
|---|---|
| 404 | Source or target collection not found |
| 500 | Unexpected internal error |
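One possible way to surface the two documented failures on the client side is sketched below. The exception classes and the `check_response` helper are hypothetical. The function relies only on a `status_code` attribute, which `requests.Response` objects provide, and it treats any other non-200 status as unexpected because this page does not document further codes.

```python
class CollectionNotFound(Exception):
    """Hypothetical error for a 404: source or target collection is missing."""


class BatchMatchServerError(Exception):
    """Hypothetical error for a 500: unexpected internal error."""


def check_response(response) -> None:
    """Raise a descriptive exception for the documented error statuses."""
    code = response.status_code
    if code == 200:
        return
    if code == 404:
        raise CollectionNotFound("Source or target collection not found")
    if code == 500:
        raise BatchMatchServerError("Unexpected internal error")
    # Anything else is outside what this page documents.
    raise RuntimeError(f"Unexpected status code: {code}")
```

Calling `check_response(response)` before `response.json()` lets callers catch a missing collection separately from a server fault.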
Notes
- Both collections must exist within the same `db_name`.
- Raise `similarity_threshold` to reduce false positives; raise `min_margin` to require more decisive matches.
- `source_row_ids` and `target_row_ids` can be used to scope the match to a subset of rows, which significantly reduces compute for large collections.
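To illustrate the scoping note, this sketch splits a large set of source row IDs into fixed-size chunks and builds one request body per chunk, so that each call computes similarities for only a subset of source rows. The helper name `chunked_bodies` and the chunk size are assumptions for illustration. Whether several scoped calls are cheaper overall than one large call depends on the server and is not stated on this page.

```python
from typing import Any, Dict, Iterator, List


def chunked_bodies(
    base_body: Dict[str, Any],
    source_row_ids: List[int],
    chunk_size: int = 100,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield one request body per chunk of source rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(source_row_ids), chunk_size):
        body = dict(base_body)  # shallow copy so the base body is not mutated
        body["source_row_ids"] = source_row_ids[start:start + chunk_size]
        yield body


base = {
    "source_collection": "expertise",
    "source_slot": "subject",
    "target_collection": "employees",
    "target_slot": "name",
}
bodies = list(chunked_bodies(base, list(range(250)), chunk_size=100))
print([len(b["source_row_ids"]) for b in bodies])  # [100, 100, 50]
```

Because each source row is scored independently against the target rows, splitting by source row should not change which target a given row matches.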
Example
import requests

SERVER_URL = "http://18.220.128.24:8000"
API_KEY = "yourapitoken"


def batch_match(db_name: str, source_collection: str, source_slot: str,
                target_collection: str, target_slot: str) -> dict:
    """Run a cross-collection match and return the parsed JSON response."""
    response = requests.post(
        f"{SERVER_URL}/compose/batch_match/{db_name}",
        headers={"X-API-Key": API_KEY},
        json={
            "source_collection": source_collection,
            "source_slot": source_slot,
            "target_collection": target_collection,
            "target_slot": target_slot,
            # Optional parameters, shown here with their default values.
            "top_k_per_source": 2,
            "similarity_threshold": 0.5,
            "min_margin": 0.05,
        },
    )
    # Raises requests.HTTPError on 4xx/5xx responses, such as a 404 for a missing collection.
    response.raise_for_status()
    return response.json()


# Match each "subject" value in "expertise" against the "name" values in "employees".
result = batch_match(
    db_name="my_db",
    source_collection="expertise",
    source_slot="subject",
    target_collection="employees",
    target_slot="name",
)
print(result)
Expected output (shown as JSON; `print` displays the equivalent Python dict, with `None` in place of `null`):
{
  "status": "success",
  "results": [
    { "source_id": 0, "target_id": 14, "score": 0.912, "margin": 0.143, "status": "matched" },
    { "source_id": 1, "target_id": null, "score": 0.431, "margin": 0.012, "status": "ambiguous" }
  ],
  "source_count": 10,
  "target_count": 50,
  "timing": { "embed_ms": 42.1, "match_ms": 8.3, "total_ms": 51.7 }
}
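A short sketch of how a client might consume the `results` array, grouping rows by `status` so that only confident matches feed a downstream join. The helper name `split_by_status` is hypothetical, and the sample payload reuses the values from the response shown on this page.

```python
from collections import defaultdict
from typing import Any, Dict, List


def split_by_status(response: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Hypothetical helper: group result rows by their status field."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in response.get("results", []):
        groups[row["status"]].append(row)
    return dict(groups)


sample = {
    "status": "success",
    "results": [
        {"source_id": 0, "target_id": 14, "score": 0.912, "margin": 0.143, "status": "matched"},
        {"source_id": 1, "target_id": None, "score": 0.431, "margin": 0.012, "status": "ambiguous"},
    ],
}

groups = split_by_status(sample)
# (source_id, target_id) pairs that are safe to join on.
pairs = [(r["source_id"], r["target_id"]) for r in groups.get("matched", [])]
print(pairs)  # [(0, 14)]
```

Rows in the "ambiguous" group can then be reviewed manually or retried with different `similarity_threshold` and `min_margin` values.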