Commit

Initial commit
alexpovel committed Mar 26, 2024
0 parents commit 5505066
Showing 22 changed files with 10,467 additions and 0 deletions.
20 changes: 20 additions & 0 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
name: Main

on:
push:

jobs:
main:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4

- uses: actions/setup-node@v4
with:
node-version-file: .nvmrc
cache: npm

- run: npm ci
- run: npm exec tsc
- run: npm run lint
174 changes: 174 additions & 0 deletions .gitignore
@@ -0,0 +1,174 @@
*.pem

# Logs

logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*

# Diagnostic reports (https://nodejs.org/api/report.html)

report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

# Runtime data

pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover

lib-cov

# Coverage directory used by tools like istanbul

coverage
*.lcov

# nyc test coverage

.nyc_output

# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)

.grunt

# Bower dependency directory (https://bower.io/)

bower_components

# node-waf configuration

.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)

build/Release

# Dependency directories

node_modules/
jspm_packages/

# Snowpack dependency directory (https://snowpack.dev/)

web_modules/

# TypeScript cache

*.tsbuildinfo

# Optional npm cache directory

.npm

# Optional eslint cache

.eslintcache

# Optional stylelint cache

.stylelintcache

# Microbundle cache

.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/

# Optional REPL history

.node_repl_history

# Output of 'npm pack'

*.tgz

# Yarn Integrity file

.yarn-integrity

# dotenv environment variable files

.env
.env.development.local
.env.test.local
.env.production.local
.env.local

# parcel-bundler cache (https://parceljs.org/)

.cache
.parcel-cache

# Next.js build output

.next
out

# Nuxt.js build / generate output

.nuxt
dist

# Gatsby files

.cache/

# Comment the public line back in if your project uses Gatsby and not Next.js

# https://nextjs.org/blog/next-9-1#public-directory-support

# public

# vuepress build output

.vuepress/dist

# vuepress v2.x temp and cache directory

.temp
.cache

# Docusaurus cache and generated files

.docusaurus

# Serverless directories

.serverless/

# FuseBox cache

.fusebox/

# DynamoDB Local files

.dynamodb/

# TernJS port file

.tern-port

# Stores VSCode versions used for testing VSCode extensions

.vscode-test

# yarn v2

.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*

# wrangler project

.dev.vars
.wrangler/
1 change: 1 addition & 0 deletions .nvmrc
@@ -0,0 +1 @@
v20.11.1
3 changes: 3 additions & 0 deletions .prettierrc
@@ -0,0 +1,3 @@
{
"useTabs": false
}
137 changes: 137 additions & 0 deletions ARCHITECTURE.md
@@ -0,0 +1,137 @@
# Architecture

## Working principle

The sequence diagram below shows a detailed sample software lifecycle (including all
implementation details), where:

1. a repository maintainer initially installs the application, triggering an initial
backfill
2. a user later posts a new issue, triggering a comment from issuedigger
3. another user (in this example, the maintainer) comments in that same issue thread.
The new comment will contribute to that issue's embedding
4. to see if new comments on an issue shifted similarities enough to bring forth new
results, one can comment `@issuedigger dig`
5. the app can be uninstalled anytime (either by removing it from one's account
   entirely, or by revoking access to specific repositories), triggering a wipe of all
   stored data. This is a destructive action, and later reinstallation (possible
   anytime) might not restore all data (due to the aforementioned GitHub API request
   limits)

```mermaid
sequenceDiagram
autonumber
actor M as Maintainer
actor U as User
participant ID as issuedigger (GitHub App)
participant GH as GitHub Repo
participant CFW as Cloudflare Worker
participant CFQ as Cloudflare Queue
participant CFWAI as Cloudflare Workers AI
participant CFV as Cloudflare Vectorize
participant CFDO as Cloudflare Durable Object
participant CFKV as Cloudflare KV
M->>ID: Visit page and install
ID->>GH: Is granted access to
ID->>CFW: Installation webhook fires
CFW->>CFQ: Submit work
Note over CFQ, CFW: Same Worker is both<br/>producer and consumer.<br/>Every webhook goes through<br/>the Queue.<br/>(Only shown once for simplicity)
CFQ->>CFW: Dispatch work
CFW->>GH: Fetch past items
Note over CFW, GH: via REST API
GH->>CFW: Items
loop Per item
CFW->>CFDO: Acquire "lock" for issue
Note over CFW, CFDO: Serializing work per issue<br/>avoids "last writer wins"
CFW->>CFW: Split body into paragraphs
loop Per paragraph
CFW->>CFWAI: Generate embedding
alt Happy path
CFWAI->>CFW: Embedding
else Paragraph too long
CFW->>CFWAI: Generate summary
CFWAI->>CFW: Summary
CFW->>CFWAI: Generate embedding
CFWAI->>CFW: Embedding
end
end
CFW->>CFW: Compute mean of all paragraph vectors
CFW->>CFV: Store vector under issue number
CFW->>CFKV: Store vector ID (for bookkeeping only)
alt Exists
Note over CFW, CFV: For example, because item is comment
CFW->>CFW: Average with existing
CFW->>CFV: Store
end
CFW->>CFDO: Release issue lock
end
U->>GH: Opens new issue
ID->>CFW: "New issue" webhook fires
CFW->>CFWAI: Get embedding (see above for details)
CFWAI->>CFW: Embedding
CFW->>CFV: Query for similar embeddings
CFV->>CFW: Similar embeddings
CFW->>GH: Post comment<br/>about similar issues
CFW->>CFV: Store current embedding (see above for details)
M->>GH: Post comment
Note over M, GH: For example, suggesting solution
ID->>CFW: "New issue comment" webhook fires
CFW->>CFW: Index and store, averaging w/ existing vector<br/>(see above for details)
M->>M: I wonder if<br/>similarities changed now
M->>GH: Post `@issuedigger dig`
ID->>CFW: "New issue comment" webhook fires
Note over CFW: Indexing and storing skipped<br/>for app commands
CFW->>GH: Post comment<br/>about similar issues
Note over M: Had enough of this nonsense
M->>ID: Uninstall
ID->>CFW: "Uninstall" webhook fires
CFW->>CFKV: Query stored vector IDs for repo
CFKV->>CFW: Vector IDs associated with repo
loop Per ID
CFW->>CFV: Delete
CFW->>CFKV: Delete
end
```
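The embedding steps in the diagram (split an item body into paragraphs, embed each, fall back to summarization for overly long paragraphs, then average) can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: `embed` and `summarize` are hypothetical stand-ins for the Workers AI bindings, and a character budget only crudely approximates the model's 512-token input limit.

```typescript
// Sketch of the per-item embedding flow from the diagram:
// split into paragraphs, embed each (summarizing first if too long),
// then take the component-wise mean of all paragraph vectors.

type Embedder = (text: string) => Promise<number[]>;
type Summarizer = (text: string) => Promise<string>;

// Rough character-based proxy for the model's 512-token input limit.
const MAX_INPUT_CHARS = 2000;

async function embedBody(
  body: string,
  embed: Embedder,
  summarize: Summarizer,
): Promise<number[]> {
  const paragraphs = body
    .split(/\n{2,}/) // blank-line-separated paragraphs (an assumption)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const vectors: number[][] = [];
  for (const paragraph of paragraphs) {
    // Happy path: embed directly. Too long: summarize, then embed the summary.
    const input =
      paragraph.length <= MAX_INPUT_CHARS
        ? paragraph
        : await summarize(paragraph);
    vectors.push(await embed(input));
  }
  if (vectors.length === 0) return [];

  // Component-wise mean of all paragraph vectors.
  const mean = new Array<number>(vectors[0].length).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < mean.length; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}
```

In the real Worker, `embed` would call the bge-large-en-v1.5 binding and `summarize` a Workers AI summarization model; both names and the paragraph-splitting regex are assumptions of this sketch.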

### Design Notes

- [Durable Objects](https://developers.cloudflare.com/durable-objects/) are used as
mutexes, in an attempt to serialize work on individual issues.

  When two items of the *same issue thread* are processed concurrently (e.g. during
  backfilling, or when two comments are submitted simultaneously), we would otherwise
  run into last-writer-wins races and lose data. Introducing a per-issue critical
  section serializes that work and avoids this.
- [KV Storage](https://developers.cloudflare.com/kv/) is *only* needed for bookkeeping:
when offboarding an installation, all related vectors need to be removed, but
Vectorize can only be [queried by exact
IDs](https://developers.cloudflare.com/vectorize/reference/client-api/#get-vectors-by-id).
KV with its [prefix
querying](https://developers.cloudflare.com/kv/api/list-keys/#list-method) helps
retrieve those exact IDs after the fact.
- Generation of embeddings is pretty [grug-brained](https://grugbrain.dev/). Splitting
into paragraphs before processing might lose important context. For example,

```text
Her shoes are red.␊
They taste like strawberry.
```

  makes no sense if taken (embedded) as one unit; the resulting vector might be
  "semantically malformed". issuedigger instead embeds the sentences separately and
  averages the results. The resulting mean vector is likely quite different from the
  single embedding, leading to different results.

Paragraphs are embedded separately chiefly due to **limitations in the [used
model](https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5/)**,
which maxes out at 512 input tokens (whatever that means in characters 🤷‍♀️). If
possible, embedding issue (comment) bodies in one go would be wildly preferable.

If individual paragraphs are *still* overly long, a
[summarization](https://developers.cloudflare.com/workers-ai/models/#summarization) is
applied.

  The models used, and how issuedigger handles overly long input, are likely the
  bottleneck to its usefulness. Available models are lightweight and very fast at
  inference, at the cost of power in other areas; issuedigger works around those
  limitations in simplistic, potentially even wrong, ways!
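The "average with existing" step from the sequence diagram (folding a new comment's vector into an issue's already-stored vector) can be read as an incremental mean. A sketch under stated assumptions: whether issuedigger weights by the number of contributing items is not specified above, and `StoredVector`/`mergeVector` are hypothetical names.

```typescript
// Count-weighted running mean: fold a new item's vector into an issue's
// stored vector without re-fetching and re-embedding all previous items.
// Whether issuedigger weights by contribution count is an assumption here.

interface StoredVector {
  values: number[];
  count: number; // how many item vectors have contributed so far
}

function mergeVector(
  existing: StoredVector | null,
  incoming: number[],
): StoredVector {
  if (existing === null) {
    // First item for this issue: store as-is.
    return { values: [...incoming], count: 1 };
  }
  const count = existing.count + 1;
  const values = existing.values.map(
    // Incremental mean update: new = old + (x - old) / n
    (old, i) => old + (incoming[i] - old) / count,
  );
  return { values, count };
}
```

A plain pairwise average (weighting the newest comment as much as all prior items combined) would drift toward recent comments; the count-weighted variant keeps every item's contribution equal.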
19 changes: 19 additions & 0 deletions LICENSE
@@ -0,0 +1,19 @@
Copyright (c) 2024 Alex Povel

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
14 changes: 14 additions & 0 deletions Makefile
@@ -0,0 +1,14 @@
SHELL := /bin/bash

# https://github.com/gr2m/universal-github-app-jwt?tab=readme-ov-file#about-private-key-formats
pkcs8.pem: pkcs1.pem
openssl pkcs8 -topk8 -inform PEM -outform PEM -nocrypt -in $< -out $@

# https://www.reddit.com/r/commandline/comments/tfyrae/comment/i18uk63/
pretty-screenshot.png: screenshot.png
tmpfile=$$(mktemp) && \
width=$$(identify -format "%w" $<) && \
height=$$(identify -format "%h" $<) && \
echo $$width $$height && \
convert -size "$$width"x"$$height" xc:none -draw "roundrectangle 0,0,"$$width","$$height",20,20" png:- | convert $< -matte - -compose DstIn -composite $$tmpfile && \
convert $$tmpfile \( +clone -background black -shadow 100x30+0+0 \) +swap -bordercolor none -border 15 -background none -layers merge +repage $@
