This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
commit 5505066
Showing 22 changed files with 10,467 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
name: Main | ||
|
||
on: | ||
  push: | ||
|
||
jobs: | ||
  main: | ||
    runs-on: ubuntu-latest | ||
|
||
    steps: | ||
      - uses: actions/checkout@v4 | ||
|
||
      - uses: actions/setup-node@v4 | ||
        with: | ||
          node-version-file: .nvmrc | ||
          cache: npm | ||
|
||
      - run: npm ci | ||
      - run: npm exec tsc | ||
      - run: npm run lint |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
*.pem | ||
|
||
# Logs | ||
|
||
logs | ||
*.log | ||
npm-debug.log* | ||
yarn-debug.log* | ||
yarn-error.log* | ||
lerna-debug.log* | ||
.pnpm-debug.log* | ||
|
||
# Diagnostic reports (https://nodejs.org/api/report.html) | ||
|
||
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json | ||
|
||
# Runtime data | ||
|
||
pids | ||
*.pid | ||
*.seed | ||
*.pid.lock | ||
|
||
# Directory for instrumented libs generated by jscoverage/JSCover | ||
|
||
lib-cov | ||
|
||
# Coverage directory used by tools like istanbul | ||
|
||
coverage | ||
*.lcov | ||
|
||
# nyc test coverage | ||
|
||
.nyc_output | ||
|
||
# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files) | ||
|
||
.grunt | ||
|
||
# Bower dependency directory (https://bower.io/) | ||
|
||
bower_components | ||
|
||
# node-waf configuration | ||
|
||
.lock-wscript | ||
|
||
# Compiled binary addons (https://nodejs.org/api/addons.html) | ||
|
||
build/Release | ||
|
||
# Dependency directories | ||
|
||
node_modules/ | ||
jspm_packages/ | ||
|
||
# Snowpack dependency directory (https://snowpack.dev/) | ||
|
||
web_modules/ | ||
|
||
# TypeScript cache | ||
|
||
*.tsbuildinfo | ||
|
||
# Optional npm cache directory | ||
|
||
.npm | ||
|
||
# Optional eslint cache | ||
|
||
.eslintcache | ||
|
||
# Optional stylelint cache | ||
|
||
.stylelintcache | ||
|
||
# Microbundle cache | ||
|
||
.rpt2_cache/ | ||
.rts2_cache_cjs/ | ||
.rts2_cache_es/ | ||
.rts2_cache_umd/ | ||
|
||
# Optional REPL history | ||
|
||
.node_repl_history | ||
|
||
# Output of 'npm pack' | ||
|
||
*.tgz | ||
|
||
# Yarn Integrity file | ||
|
||
.yarn-integrity | ||
|
||
# dotenv environment variable files | ||
|
||
.env | ||
.env.development.local | ||
.env.test.local | ||
.env.production.local | ||
.env.local | ||
|
||
# parcel-bundler cache (https://parceljs.org/) | ||
|
||
.cache | ||
.parcel-cache | ||
|
||
# Next.js build output | ||
|
||
.next | ||
out | ||
|
||
# Nuxt.js build / generate output | ||
|
||
.nuxt | ||
dist | ||
|
||
# Gatsby files | ||
|
||
.cache/ | ||
|
||
# Uncomment the public line below if your project uses Gatsby and not Next.js | ||
|
||
# https://nextjs.org/blog/next-9-1#public-directory-support | ||
|
||
# public | ||
|
||
# vuepress build output | ||
|
||
.vuepress/dist | ||
|
||
# vuepress v2.x temp and cache directory | ||
|
||
.temp | ||
.cache | ||
|
||
# Docusaurus cache and generated files | ||
|
||
.docusaurus | ||
|
||
# Serverless directories | ||
|
||
.serverless/ | ||
|
||
# FuseBox cache | ||
|
||
.fusebox/ | ||
|
||
# DynamoDB Local files | ||
|
||
.dynamodb/ | ||
|
||
# TernJS port file | ||
|
||
.tern-port | ||
|
||
# Stores VSCode versions used for testing VSCode extensions | ||
|
||
.vscode-test | ||
|
||
# yarn v2 | ||
|
||
.yarn/cache | ||
.yarn/unplugged | ||
.yarn/build-state.yml | ||
.yarn/install-state.gz | ||
.pnp.* | ||
|
||
# wrangler project | ||
|
||
.dev.vars | ||
.wrangler/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
v20.11.1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
{ | ||
"useTabs": false | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,137 @@ | ||
# Architecture | ||
|
||
## Working principle | ||
|
||
The sequence diagram below shows a detailed sample software lifecycle (including all | ||
implementation details), where: | ||
|
||
1. a repository maintainer initially installs the application, triggering an initial | ||
backfill | ||
2. a user later posts a new issue, triggering a comment from issuedigger | ||
3. another user (in this example, the maintainer) comments in that same issue thread. | ||
The new comment will contribute to that issue's embedding | ||
4. to see if new comments on an issue shifted similarities enough to bring forth new | ||
results, one can comment `@issuedigger dig` | ||
5. the app can be uninstalled anytime (either by removing it from one's account | ||
entirely, or revoking access to specific repositories), triggering a wipe of all | ||
stored data. This is a destructive action, and later reinstallation (possible | ||
anytime) might not restore all data (due to aforementioned GitHub API request limits) | ||
|
||
```mermaid | ||
sequenceDiagram | ||
autonumber | ||
actor M as Maintainer | ||
actor U as User | ||
participant ID as issuedigger (GitHub App) | ||
participant GH as GitHub Repo | ||
participant CFW as Cloudflare Worker | ||
participant CFQ as Cloudflare Queue | ||
participant CFWAI as Cloudflare Workers AI | ||
participant CFV as Cloudflare Vectorize | ||
participant CFDO as Cloudflare Durable Object | ||
participant CFKV as Cloudflare KV | ||
M->>ID: Visit page and install | ||
ID->>GH: Is granted access to | ||
ID->>CFW: Installation webhook fires | ||
CFW->>CFQ: Submit work | ||
Note over CFQ, CFW: Same Worker is both<br/>producer and consumer.<br/>Every webhook goes through<br/>the Queue.<br/>(Only shown once for simplicity) | ||
CFQ->>CFW: Dispatch work | ||
CFW->>GH: Fetch past items | ||
Note over CFW, GH: via REST API | ||
GH->>CFW: Items | ||
loop Per item | ||
CFW->>CFDO: Acquire "lock" for issue | ||
Note over CFW, CFDO: Serializing work per issue<br/>avoids "last writer wins" | ||
CFW->>CFW: Split body into paragraphs | ||
loop Per paragraph | ||
CFW->>CFWAI: Generate embedding | ||
alt Happy path | ||
CFWAI->>CFW: Embedding | ||
else Paragraph too long | ||
CFW->>CFWAI: Generate summary | ||
CFWAI->>CFW: Summary | ||
CFW->>CFWAI: Generate embedding | ||
CFWAI->>CFW: Embedding | ||
end | ||
end | ||
CFW->>CFW: Compute mean of all paragraph vectors | ||
CFW->>CFV: Store vector under issue number | ||
CFW->>CFKV: Store vector ID (for bookkeeping only) | ||
alt Exists | ||
Note over CFW, CFV: For example, because item is comment | ||
CFW->>CFW: Average with existing | ||
CFW->>CFV: Store | ||
end | ||
CFW->>CFDO: Release issue lock | ||
end | ||
U->>GH: Opens new issue | ||
ID->>CFW: "New issue" webhook fires | ||
CFW->>CFWAI: Get embedding (see above for details) | ||
CFWAI->>CFW: Embedding | ||
CFW->>CFV: Query for similar embeddings | ||
CFV->>CFW: Similar embeddings | ||
CFW->>GH: Post comment<br/>about similar issues | ||
CFW->>CFV: Store current embedding (see above for details) | ||
M->>GH: Post comment | ||
Note over M, GH: For example, suggesting solution | ||
ID->>CFW: "New issue comment" webhook fires | ||
CFW->>CFW: Index and store, averaging w/ existing vector<br/>(see above for details) | ||
M->>M: I wonder if<br/>similarities changed now | ||
M->>GH: Post `@issuedigger dig` | ||
ID->>CFW: "New issue comment" webhook fires | ||
Note over CFW: Indexing and storing skipped<br/>for app commands | ||
CFW->>GH: Post comment<br/>about similar issues | ||
Note over M: Had enough of this nonsense | ||
M->>ID: Uninstall | ||
ID->>CFW: "Uninstall" webhook fires | ||
CFW->>CFKV: Query stored vector IDs for repo | ||
CFKV->>CFW: Vector IDs associated with repo | ||
loop Per ID | ||
CFW->>CFV: Delete | ||
CFW->>CFKV: Delete | ||
end | ||
``` | ||
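The two averaging steps in the diagram ("Compute mean of all paragraph vectors" and "Average with existing") can be sketched as follows. This is a minimal TypeScript illustration, not issuedigger's actual code: the names `Vector`, `meanVector` and `averageWithExisting` are hypothetical, and the sketch assumes a plain, unweighted element-wise mean.

```typescript
// Hypothetical helper names; the real implementation may differ.
type Vector = number[];

/** Element-wise mean of equally sized vectors (e.g. one per paragraph). */
function meanVector(vectors: Vector[]): Vector {
  if (vectors.length === 0) {
    throw new Error("need at least one vector");
  }
  const dim = vectors[0].length;
  const sum = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    if (v.length !== dim) {
      throw new Error("dimension mismatch");
    }
    for (let i = 0; i < dim; i++) {
      sum[i] += v[i];
    }
  }
  return sum.map((x) => x / vectors.length);
}

/**
 * Merge a new item's vector (e.g. a comment) into the vector already stored
 * for the issue. Assumption: a plain pairwise mean; if nothing is stored yet,
 * the incoming vector is used as-is.
 */
function averageWithExisting(
  existing: Vector | undefined,
  incoming: Vector,
): Vector {
  return existing === undefined ? incoming : meanVector([existing, incoming]);
}
```

Under this assumption, each pairwise merge halves the weight of everything stored so far, so later comments would influence an issue's vector more than earlier ones.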
|
||
### Design Notes | ||
|
||
- [Durable Objects](https://developers.cloudflare.com/durable-objects/) are used as | ||
mutexes, in an attempt to serialize work on individual issues. | ||
|
||
When two items of the *same issue thread* are processed concurrently (e.g. during | ||
backfilling, or if two comments are submitted simultaneously), both would read the same | ||
stored vector and the later write would overwrite the earlier one (last writer wins), | ||
losing data. A per-issue critical section serializes this work and avoids that. | ||
- [KV Storage](https://developers.cloudflare.com/kv/) is *only* needed for bookkeeping: | ||
when offboarding an installation, all related vectors need to be removed, but | ||
Vectorize can only be [queried by exact | ||
IDs](https://developers.cloudflare.com/vectorize/reference/client-api/#get-vectors-by-id). | ||
KV with its [prefix | ||
querying](https://developers.cloudflare.com/kv/api/list-keys/#list-method) helps | ||
retrieve those exact IDs after the fact. | ||
- Generation of embeddings is pretty [grug-brained](https://grugbrain.dev/). Splitting | ||
into paragraphs before processing might lose important context. For example, | ||
|
||
```text | ||
Her shoes are red.␊ | ||
␊ | ||
They taste like strawberry. | ||
``` | ||
|
||
makes no sense if taken (embedded) as one unit. The resulting vector might be | ||
"semantically malformed". issuedigger instead embeds these separately, and averages | ||
the results. The resulting mean vector is likely quite different from the single | ||
embedding, leading to different results. | ||
|
||
Paragraphs are embedded separately chiefly due to **limitations of the [model | ||
used](https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5/)**, | ||
which maxes out at 512 input tokens (whatever that means in characters 🤷‍♀️). If | ||
possible, embedding issue (comment) bodies in one go would be wildly preferable. | ||
|
||
If individual paragraphs are *still* overly long, they are | ||
[summarized](https://developers.cloudflare.com/workers-ai/models/#summarization) | ||
before being embedded. | ||
|
||
The models used, and how issuedigger handles overly long input, are likely the | ||
bottleneck to its usefulness. The available models are lightweight and offer very fast | ||
inference, at the cost of power in other areas. issuedigger works around these | ||
limitations in simplistic, potentially even wrong ways! |
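The paragraph handling described in the design notes (split, embed each paragraph, summarize first if a paragraph is too long, then average) might look roughly like the sketch below. Everything here is illustrative: `splitParagraphs`, `embedBody`, the `Models` interface and the `maxChars` threshold are hypothetical names, the model calls are injected rather than tied to any specific Workers AI API, and a character count is only a crude stand-in for the model's 512-token limit.

```typescript
type Vector = number[];

/** Hypothetical abstraction over the embedding and summarization models. */
interface Models {
  embed(text: string): Promise<Vector>;
  summarize(text: string): Promise<string>;
}

/** Split a body into non-empty paragraphs, separated by blank lines. */
function splitParagraphs(body: string): string[] {
  return body
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

/**
 * Embed each paragraph separately and return the element-wise mean.
 * Paragraphs longer than `maxChars` (an assumed proxy for the token limit)
 * are summarized first, and the summary is embedded instead.
 */
async function embedBody(
  body: string,
  models: Models,
  maxChars: number,
): Promise<Vector> {
  const vectors: Vector[] = [];
  for (const paragraph of splitParagraphs(body)) {
    const input =
      paragraph.length > maxChars
        ? await models.summarize(paragraph)
        : paragraph;
    vectors.push(await models.embed(input));
  }
  if (vectors.length === 0) {
    throw new Error("body contains no paragraphs");
  }
  const dim = vectors[0].length;
  const mean = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) {
      mean[i] += v[i] / vectors.length;
    }
  }
  return mean;
}
```

Injecting the model calls keeps the sketch testable without network access; in a Worker, `embed` and `summarize` would wrap whatever model bindings the deployment actually uses.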
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
Copyright (c) 2024 Alex Povel | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
SHELL := /bin/bash | ||
|
||
# https://github.com/gr2m/universal-github-app-jwt?tab=readme-ov-file#about-private-key-formats | ||
pkcs8.pem: pkcs1.pem | ||
	openssl pkcs8 -topk8 -inform PEM -outform PEM -nocrypt -in $< -out $@ | ||
|
||
# https://www.reddit.com/r/commandline/comments/tfyrae/comment/i18uk63/ | ||
pretty-screenshot.png: screenshot.png | ||
	tmpfile=$$(mktemp) && \ | ||
	width=$$(identify -format "%w" $<) && \ | ||
	height=$$(identify -format "%h" $<) && \ | ||
	echo $$width $$height && \ | ||
	convert -size "$$width"x"$$height" xc:none -draw "roundrectangle 0,0,"$$width","$$height",20,20" png:- | convert $< -matte - -compose DstIn -composite $$tmpfile && \ | ||
	convert $$tmpfile \( +clone -background black -shadow 100x30+0+0 \) +swap -bordercolor none -border 15 -background none -layers merge +repage $@ |