Merge pull request #3459 from vespa-engine/arnej/improve-documentid-doc
explain more about Document ID
arnej27959 authored Nov 5, 2024
2 parents 1a8bc9e + 5fe038e commit 6dc4286
Showing 1 changed file with 39 additions and 7 deletions.
46 changes: 39 additions & 7 deletions en/documents.html
@@ -50,8 +50,8 @@ <h2 id="document-ids">Document IDs</h2>

<h3 id="id-scheme">id scheme</h3>
<p>
Vespa has defined one scheme, the <em>id scheme</em>:
<code>id:&lt;namespace&gt;:&lt;document-type&gt;:&lt;key/value-pairs&gt;:&lt;user-specified&gt;</code>
Vespa currently has only one defined scheme, the <em>id scheme</em>:
<code>id:&lt;namespace&gt;:&lt;document-type&gt;:&lt;key/value-pair&gt;:&lt;user-specified&gt;</code>
</p>
{% include note.html content='
An example mapping from ID to the URL used in <a href="document-v1-api-guide.html">/document/v1/</a> is from
@@ -70,22 +70,22 @@ <h3 id="id-scheme">id scheme</h3>
</tr>
</thead>
<tbody>
<tr><th>namespace</th><td>Yes</td><td>See <a href="#namespace">below</a>.</td></tr>
<tr><th>namespace</th><td>Yes</td><td>Not used by Vespa, see <a href="#namespace">below</a>.</td></tr>
<tr><th>document-type</th><td>Yes</td><td>Document type as defined in
<a href="reference/services-content.html#document">services.xml</a> and the
<a href="reference/schema-reference.html">schema</a>.</td></tr>
<tr><th style="white-space:nowrap">key/value-pairs</th><td>Optional</td><td>
<tr><th style="white-space:nowrap">key/value-pair</th><td>Optional</td><td>
Modifiers to the id scheme, used to configure document distribution to
<a href="content/buckets.html#document-to-bucket-distribution">buckets</a>.
With no modifiers, the id scheme distributes all documents uniformly.
The key/value-pairs field contains a comma-separated list of lexicographically sorted key/value pairs.
The key/value-pair field contains one of two possible key/value pairs;
<strong>n</strong> and <strong>g</strong> are mutually exclusive:
<table class="table">
<thead></thead><tbody>
<tr><th>n=<em>&lt;number&gt;</em></th><td>
Number in the range [0,2^63-1]</td></tr>
Number in the range [0,2^63-1] - only for testing of abnormal bucket distributions</td></tr>
<tr style="white-space: nowrap"><th>g=<em>&lt;groupname&gt;</em></th><td>
Just like n=, the string is hashed to a number</td></tr>
The <em>groupname</em> string is hashed and used to select the storage location</td></tr>
</tbody>
</table>
{% include important.html content='This is only useful for document types with
@@ -99,8 +99,40 @@ <h3 id="id-scheme">id scheme</h3>
</tbody>
</table>

<h3 id="docid-in-results">Document IDs in search results</h3>
<p>
The full Document ID (as a string) will often contain redundant
information and be quite long; a typical value may look like
"id:mynamespace:mydoctype::user-specified-identifier" where only the
last part is useful outside Vespa. The Document ID is therefore not
stored in memory, and it is <strong>not always present</strong> in
<a href="reference/default-result-format.html#id">search results</a>.
It is therefore recommended to put your own unique identifier
(usually the "user-specified-identifier" above) in a document field,
typically named "myid" or "shortid" or similar:
<pre>
field shortid type string {
indexing: attribute | summary
}
</pre>
This enables using a
<a href="document-summaries.html">document-summary</a> with only
in-memory fields while still getting the identifier you actually
care about. If the "user-specified-identifier" is just a simple
number you could even use "type int" for this field for minimal
memory overhead.
</p>

<h3 id="namespace">Namespace</h3>
<p>
The namespace in document ids is useful when you have multiple
document collections that you want to be sure never end up with the
same document id. It has no function in Vespa beyond this, and can
just be set to any short constant value such as "doc".
Consider also letting synthetic documents used for
testing use namespace "test" so it's easy to detect and remove
them if they are present outside the test by mistake.
</p>
<p>
Example - if feeding
</p>
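
As a reading aid for the id scheme described in this change, here is a minimal Python sketch that splits a document ID string into its five parts. The names `DocumentId` and `parse_document_id` are hypothetical and not part of any Vespa client library. The sketch assumes that only the user-specified part may contain colons, and it only checks what the table above states: at most one modifier, either `n=<number>` in the range [0, 2^63-1] or `g=<groupname>`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentId:
    """Hypothetical container for the parts of an id-scheme document ID."""
    namespace: str
    document_type: str
    modifier: Optional[str]  # "n=<number>", "g=<groupname>", or None
    user_specified: str


def parse_document_id(doc_id: str) -> DocumentId:
    """Split 'id:<namespace>:<document-type>:<key/value-pair>:<user-specified>'."""
    # Assumption: only the user-specified part may contain ':',
    # so split on the first four separators only.
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not an id-scheme document ID: {doc_id!r}")
    _, namespace, document_type, modifier, user_specified = parts

    if modifier:
        key, sep, value = modifier.partition("=")
        # n and g are mutually exclusive, so exactly one key/value pair is allowed.
        if sep != "=" or key not in ("n", "g") or not value:
            raise ValueError(f"unsupported modifier: {modifier!r}")
        if key == "n":
            is_number = value.isascii() and value.isdigit()
            if not is_number or int(value) > 2**63 - 1:
                raise ValueError(f"n must be in [0, 2^63-1]: {value!r}")

    return DocumentId(namespace, document_type, modifier or None, user_specified)
```

For example, `parse_document_id("id:mynamespace:mydoctype::user-specified-identifier").user_specified` yields the short identifier that the new "Document IDs in search results" section recommends storing in a separate field.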
