Merge pull request #3459 from vespa-engine/arnej/improve-documentid-doc

explain more about Document ID
vespa-engine · Nov 5, 2024 · 6dc4286
2 parents 1a8bc9e + 5fe038e
commit 6dc4286
Showing 1 changed file with 39 additions and 7 deletions.
diff --git a/en/documents.html b/en/documents.html
@@ -50,8 +50,8 @@ <h2 id="document-ids">Document IDs</h2>
 
 <h3 id="id-scheme">id scheme</h3>
 <p>
-  Vespa has defined one scheme, the <em>id scheme</em>:
-  <code>id:&lt;namespace&gt;:&lt;document-type&gt;:&lt;key/value-pairs&gt;:&lt;user-specified&gt;</code>
+  Vespa currently has only one defined scheme, the <em>id scheme</em>:
+  <code>id:&lt;namespace&gt;:&lt;document-type&gt;:&lt;key/value-pair&gt;:&lt;user-specified&gt;</code>
 </p>
 {% include note.html content='
 An example mapping from ID to the URL used in <a href="document-v1-api-guide.html">/document/v1/</a> is from
@@ -70,22 +70,22 @@ <h3 id="id-scheme">id scheme</h3>
   </tr>
   </thead>
   <tbody>
-  <tr><th>namespace</th><td>Yes</td><td>See <a href="#namespace">below</a>.</td></tr>
+  <tr><th>namespace</th><td>Yes</td><td>Not used by Vespa; see <a href="#namespace">below</a>.</td></tr>
   <tr><th>document-type</th><td>Yes</td><td>Document type as defined in
     <a href="reference/services-content.html#document">services.xml</a> and the
     <a href="reference/schema-reference.html">schema</a>.</td></tr>
-  <tr><th style="white-space:nowrap">key/value-pairs</th><td>Optional</td><td>
+  <tr><th style="white-space:nowrap">key/value-pair</th><td>Optional</td><td>
     Modifiers to the id scheme, used to configure document distribution to
     <a href="content/buckets.html#document-to-bucket-distribution">buckets</a>.
     With no modifiers, the id scheme distributes all documents uniformly.
-    The key/value-pairs field contains a comma-separated list of lexicographically sorted key/value pairs.
+    The key/value-pair field contains one of two possible key/value pairs;
     <strong>n</strong> and <strong>g</strong> are mutually exclusive:
     <table class="table">
       <thead></thead><tbody>
         <tr><th>n=<em>&lt;number&gt;</em></th><td>
-          Number in the range [0,2^63-1]</td></tr>
+          Number in the range [0,2^63-1]; intended only for testing abnormal bucket distributions</td></tr>
         <tr style="white-space: nowrap"><th>g=<em>&lt;groupname&gt;</em></th><td>
-          Just like n=, the string is hashed to a number</td></tr>
+          The <em>groupname</em> string is hashed and used to select the storage location</td></tr>
       </tbody>
     </table>
     {% include important.html content='This is only useful for document types with
@@ -99,8 +99,40 @@ <h3 id="id-scheme">id scheme</h3>
 </tbody>
 </table>
 
+<h3 id="docid-in-results">Document IDs in search results</h3>
+<p>
+  The full Document ID (as a string) will often contain redundant
+  information and be quite long; a typical value may look like
+  "id:mynamespace:mydoctype::user-specified-identifier" where only the
+  last part is useful outside Vespa.  The Document ID is therefore not
+  stored in memory, and it is <strong>not always present</strong> in
+  <a href="reference/default-result-format.html#id">search results</a>.
+  It is therefore recommended to put your own unique identifier
+  (usually the "user-specified-identifier" above) in a document field,
+  named for example "myid" or "shortid":
+<pre>
+field shortid type string {
+    indexing: attribute | summary
+}
+</pre>
+  This enables using a
+  <a href="document-summaries.html">document-summary</a> with only
+  in-memory fields while still getting the identifier you actually
+  care about.  If the "user-specified-identifier" is a plain
+  number, you can even use "type int" for this field to minimize
+  memory overhead.
+</p>
 
 <h3 id="namespace">Namespace</h3>
+<p>
+  The namespace in document ids is useful when you have multiple
+  document collections and want to be sure they never end up with the
+  same document id. It has no function in Vespa beyond this, and can
+  be set to any short constant value, for example "doc".
+  Consider also using the namespace "test" for synthetic documents
+  used in testing, so they are easy to detect and remove
+  if they end up outside the test by mistake.
+</p>
 <p>
   Example - if feeding
 </p>
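
To illustrate the id scheme described in this change, here is a minimal Python sketch that splits an ID string into its parts. It is not Vespa code: the names `DocumentId` and `parse_document_id` are hypothetical, and the sketch assumes that the user-specified part may itself contain colons, so only the first four colons are treated as separators.

```python
from typing import NamedTuple, Optional, Tuple


class DocumentId(NamedTuple):
    """Hypothetical container for the parts of an id-scheme string."""
    namespace: str
    document_type: str
    modifier: Optional[Tuple[str, str]]  # ("n", "<number>") or ("g", "<groupname>")
    user_specified: str


def parse_document_id(doc_id: str) -> DocumentId:
    """Split id:<namespace>:<document-type>:<key/value-pair>:<user-specified>."""
    # Assumption: everything after the fourth colon is the user-specified part.
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not an id-scheme document id: {doc_id!r}")
    _, namespace, document_type, key_value, user_specified = parts

    modifier = None
    if key_value:
        key, sep, value = key_value.partition("=")
        # n and g are mutually exclusive, so at most one pair is accepted.
        if sep != "=" or key not in ("n", "g") or not value:
            raise ValueError(f"unsupported key/value pair: {key_value!r}")
        if key == "n" and not (value.isdigit() and int(value) <= 2**63 - 1):
            raise ValueError(f"n must be in the range [0,2^63-1]: {value!r}")
        modifier = (key, value)

    return DocumentId(namespace, document_type, modifier, user_specified)


if __name__ == "__main__":
    print(parse_document_id("id:mynamespace:mydoctype::user-specified-identifier"))
```

The last field of the result corresponds to the value the new section recommends storing in a dedicated field such as "shortid".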