Commit

Initial commit
alexpovel committed Mar 26, 2024
0 parents commit 5505066
Showing 22 changed files with 10,467 additions and 0 deletions.
20 changes: 20 additions & 0 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
name: Main

on:
push:

jobs:
main:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4

- uses: actions/setup-node@v4
with:
node-version-file: .nvmrc
cache: npm

- run: npm ci
- run: npm exec tsc
- run: npm run lint
174 changes: 174 additions & 0 deletions .gitignore
@@ -0,0 +1,174 @@
*.pem

# Logs

logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*

# Diagnostic reports (https://nodejs.org/api/report.html)

report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

# Runtime data

pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover

lib-cov

# Coverage directory used by tools like istanbul

coverage
*.lcov

# nyc test coverage

.nyc_output

# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)

.grunt

# Bower dependency directory (https://bower.io/)

bower_components

# node-waf configuration

.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)

build/Release

# Dependency directories

node_modules/
jspm_packages/

# Snowpack dependency directory (https://snowpack.dev/)

web_modules/

# TypeScript cache

*.tsbuildinfo

# Optional npm cache directory

.npm

# Optional eslint cache

.eslintcache

# Optional stylelint cache

.stylelintcache

# Microbundle cache

.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/

# Optional REPL history

.node_repl_history

# Output of 'npm pack'

*.tgz

# Yarn Integrity file

.yarn-integrity

# dotenv environment variable files

.env
.env.development.local
.env.test.local
.env.production.local
.env.local

# parcel-bundler cache (https://parceljs.org/)

.cache
.parcel-cache

# Next.js build output

.next
out

# Nuxt.js build / generate output

.nuxt
dist

# Gatsby files

.cache/

# Comment the public line back in if your project uses Gatsby and not Next.js

# https://nextjs.org/blog/next-9-1#public-directory-support

# public

# vuepress build output

.vuepress/dist

# vuepress v2.x temp and cache directory

.temp
.cache

# Docusaurus cache and generated files

.docusaurus

# Serverless directories

.serverless/

# FuseBox cache

.fusebox/

# DynamoDB Local files

.dynamodb/

# TernJS port file

.tern-port

# Stores VSCode versions used for testing VSCode extensions

.vscode-test

# yarn v2

.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*

# wrangler project

.dev.vars
.wrangler/
1 change: 1 addition & 0 deletions .nvmrc
@@ -0,0 +1 @@
v20.11.1
3 changes: 3 additions & 0 deletions .prettierrc
@@ -0,0 +1,3 @@
{
"useTabs": false
}
137 changes: 137 additions & 0 deletions ARCHITECTURE.md
@@ -0,0 +1,137 @@
# Architecture

## Working principle

The sequence diagram below shows a detailed sample software lifecycle (including all
implementation details), where:

1. a repository maintainer initially installs the application, triggering an initial
backfill
2. a user later posts a new issue, triggering a comment from issuedigger
3. another user (in this example, the maintainer) comments in that same issue thread.
The new comment will contribute to that issue's embedding
4. to see if new comments on an issue shifted similarities enough to bring forth new
results, one can comment `@issuedigger dig`
5. the app can be uninstalled anytime (either by removing it from one's account
   entirely, or by revoking access to specific repositories), triggering a wipe of all
   stored data. This is a destructive action, and later reinstallation (possible
   anytime) might not restore all data (due to the aforementioned GitHub API request
   limits)

```mermaid
sequenceDiagram
autonumber
actor M as Maintainer
actor U as User
participant ID as issuedigger (GitHub App)
participant GH as GitHub Repo
participant CFW as Cloudflare Worker
participant CFQ as Cloudflare Queue
participant CFWAI as Cloudflare Workers AI
participant CFV as Cloudflare Vectorize
participant CFDO as Cloudflare Durable Object
participant CFKV as Cloudflare KV
M->>ID: Visit page and install
ID->>GH: Is granted access to
ID->>CFW: Installation webhook fires
CFW->>CFQ: Submit work
Note over CFQ, CFW: Same Worker is both<br/>producer and consumer.<br/>Every webhook goes through<br/>the Queue.<br/>(Only shown once for simplicity)
CFQ->>CFW: Dispatch work
CFW->>GH: Fetch past items
Note over CFW, GH: via REST API
GH->>CFW: Items
loop Per item
CFW->>CFDO: Acquire "lock" for issue
Note over CFW, CFDO: Serializing work per issue<br/>avoids "last writer wins"
CFW->>CFW: Split body into paragraphs
loop Per paragraph
CFW->>CFWAI: Generate embedding
alt Happy path
CFWAI->>CFW: Embedding
else Paragraph too long
CFW->>CFWAI: Generate summary
CFWAI->>CFW: Summary
CFW->>CFWAI: Generate embedding
CFWAI->>CFW: Embedding
end
end
CFW->>CFW: Compute mean of all paragraph vectors
CFW->>CFV: Store vector under issue number
CFW->>CFKV: Store vector ID (for bookkeeping only)
alt Exists
Note over CFW, CFV: For example, because item is comment
CFW->>CFW: Average with existing
CFW->>CFV: Store
end
CFW->>CFDO: Release issue lock
end
U->>GH: Opens new issue
ID->>CFW: "New issue" webhook fires
CFW->>CFWAI: Get embedding (see above for details)
CFWAI->>CFW: Embedding
CFW->>CFV: Query for similar embeddings
CFV->>CFW: Similar embeddings
CFW->>GH: Post comment<br/>about similar issues
CFW->>CFV: Store current embedding (see above for details)
M->>GH: Post comment
Note over M, GH: For example, suggesting solution
ID->>CFW: "New issue comment" webhook fires
CFW->>CFW: Index and store, averaging w/ existing vector<br/>(see above for details)
M->>M: I wonder if<br/>similarities changed now
M->>GH: Post `@issuedigger dig`
ID->>CFW: "New issue comment" webhook fires
Note over CFW: Indexing and storing skipped<br/>for app commands
CFW->>GH: Post comment<br/>about similar issues
Note over M: Had enough of this nonsense
M->>ID: Uninstall
ID->>CFW: "Uninstall" webhook fires
CFW->>CFKV: Query stored vector IDs for repo
CFKV->>CFW: Vector IDs associated with repo
loop Per ID
CFW->>CFV: Delete
CFW->>CFKV: Delete
end
```
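The embedding steps in the diagram (split an item body into paragraphs, embed each, fall back to summarization for overly long paragraphs, then average) can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: `embed` and `summarize` are hypothetical stand-ins for the Workers AI bindings, and a character budget only crudely approximates the model's 512-token input limit.

```typescript
// Sketch of the per-item embedding flow from the diagram:
// split into paragraphs, embed each (summarizing first if too long),
// then take the component-wise mean of all paragraph vectors.

type Embedder = (text: string) => Promise<number[]>;
type Summarizer = (text: string) => Promise<string>;

// Rough character-based proxy for the model's 512-token input limit.
const MAX_INPUT_CHARS = 2000;

async function embedBody(
  body: string,
  embed: Embedder,
  summarize: Summarizer,
): Promise<number[]> {
  const paragraphs = body
    .split(/\n{2,}/) // blank-line-separated paragraphs (an assumption)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const vectors: number[][] = [];
  for (const paragraph of paragraphs) {
    // Happy path: embed directly. Too long: summarize, then embed the summary.
    const input =
      paragraph.length <= MAX_INPUT_CHARS
        ? paragraph
        : await summarize(paragraph);
    vectors.push(await embed(input));
  }
  if (vectors.length === 0) return [];

  // Component-wise mean of all paragraph vectors.
  const mean = new Array<number>(vectors[0].length).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < mean.length; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}
```

In the real Worker, `embed` would call the bge-large-en-v1.5 binding and `summarize` a Workers AI summarization model; both names and the paragraph-splitting regex are assumptions of this sketch.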

### Design Notes

- [Durable Objects](https://developers.cloudflare.com/durable-objects/) are used as
mutexes, in an attempt to serialize work on individual issues.

  When two items of the *same issue thread* are processed concurrently (e.g. during
  backfilling, or when two comments are submitted simultaneously), we would otherwise
  run into last-writer-wins races and lose data. Introducing a per-issue critical
  section serializes that work and avoids this.
- [KV Storage](https://developers.cloudflare.com/kv/) is *only* needed for bookkeeping:
when offboarding an installation, all related vectors need to be removed, but
Vectorize can only be [queried by exact
IDs](https://developers.cloudflare.com/vectorize/reference/client-api/#get-vectors-by-id).
KV with its [prefix
querying](https://developers.cloudflare.com/kv/api/list-keys/#list-method) helps
retrieve those exact IDs after the fact.
- Generation of embeddings is pretty [grug-brained](https://grugbrain.dev/). Splitting
into paragraphs before processing might lose important context. For example,

```text
Her shoes are red.␊
They taste like strawberry.
```

  makes no sense if taken (embedded) as one unit; the resulting vector might be
  "semantically malformed". issuedigger instead embeds the sentences separately and
  averages the results. The resulting mean vector is likely quite different from the
  single embedding, leading to different results.

Paragraphs are embedded separately chiefly due to **limitations in the [used
model](https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5/)**,
which maxes out at 512 input tokens (whatever that means in characters 🤷‍♀️). If
possible, embedding issue (comment) bodies in one go would be wildly preferable.

If individual paragraphs are *still* overly long, a
[summarization](https://developers.cloudflare.com/workers-ai/models/#summarization) is
applied.

  The models used, and how issuedigger handles overly long input, are likely the
  bottleneck to its usefulness. Available models are lightweight and very fast at
  inference, at the cost of power in other areas; issuedigger works around those
  limitations in simplistic, potentially even wrong, ways!
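The "average with existing" step from the sequence diagram (folding a new comment's vector into an issue's already-stored vector) can be read as an incremental mean. A sketch under stated assumptions: whether issuedigger weights by the number of contributing items is not specified above, and `StoredVector`/`mergeVector` are hypothetical names.

```typescript
// Count-weighted running mean: fold a new item's vector into an issue's
// stored vector without re-fetching and re-embedding all previous items.
// Whether issuedigger weights by contribution count is an assumption here.

interface StoredVector {
  values: number[];
  count: number; // how many item vectors have contributed so far
}

function mergeVector(
  existing: StoredVector | null,
  incoming: number[],
): StoredVector {
  if (existing === null) {
    // First item for this issue: store as-is.
    return { values: [...incoming], count: 1 };
  }
  const count = existing.count + 1;
  const values = existing.values.map(
    // Incremental mean update: new = old + (x - old) / n
    (old, i) => old + (incoming[i] - old) / count,
  );
  return { values, count };
}
```

A plain pairwise average (weighting the newest comment as much as all prior items combined) would drift toward recent comments; the count-weighted variant keeps every item's contribution equal.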
19 changes: 19 additions & 0 deletions LICENSE
@@ -0,0 +1,19 @@
Copyright (c) 2024 Alex Povel

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
14 changes: 14 additions & 0 deletions Makefile
@@ -0,0 +1,14 @@
SHELL := /bin/bash

# https://github.com/gr2m/universal-github-app-jwt?tab=readme-ov-file#about-private-key-formats
pkcs8.pem: pkcs1.pem
openssl pkcs8 -topk8 -inform PEM -outform PEM -nocrypt -in $< -out $@

# https://www.reddit.com/r/commandline/comments/tfyrae/comment/i18uk63/
pretty-screenshot.png: screenshot.png
tmpfile=$$(mktemp) && \
width=$$(identify -format "%w" $<) && \
height=$$(identify -format "%h" $<) && \
echo $$width $$height && \
convert -size "$$width"x"$$height" xc:none -draw "roundrectangle 0,0,"$$width","$$height",20,20" png:- | convert $< -matte - -compose DstIn -composite $$tmpfile && \
convert $$tmpfile \( +clone -background black -shadow 100x30+0+0 \) +swap -bordercolor none -border 15 -background none -layers merge +repage $@
