Add faceted search / custom filters / heterogenous monorepo indexation #62

Open
roscopecoltran opened this issue Aug 16, 2018 · 6 comments

Comments

@roscopecoltran

roscopecoltran commented Aug 16, 2018

Hi guys,

Hope you are all well!

I was wondering what the best way would be to add faceting to the search results displayed by the zoekt web server, such as filtering by language or by custom user-defined filters like the examples below.

I want to extend zoekt to filter a large heterogeneous code monorepo (essentially all my local repositories, more than 500 repos). I am struggling to decide whether I should build a blevesearch index after zoekt has indexed the code, or whether it is possible to add pre/post-processing plugins to the zoekt indexing step.

I found this cool tokenizer package from a Stack Overflow employee,
https://github.com/clipperhouse/jargon, which recognizes canonical and synonymous dev/tech terms. I would like to chain it in, as a plugin running in parallel, to post-process the topics extracted while indexing the code.

Question:
What would be the best approach to build a quick PoC with these external filtering bots/plugins?

Cheers,
Rosco

Examples:

[
    {
        "language":"conan",
        "type":"BuildSystem",
        "fileNames":["conanfile.txt", "conanfile.py", "conanenv.txt"]
    },
    {
        "language":"scons",
        "type":"BuildSystem",
        "fileNames":["sconstruct"]
    },
    {
        "language":"premake",
        "type":"BuildSystem",
        "fileNames":["premake4.lua", "premake5.lua"]
    },
    {
        "language":"gulp",
        "type":"BuildSystem",
        "fileNames":["gulp.js"]
    },
    {
        "language":"zeus",
        "type":"BuildSystem",
        "fileNames":["zeusfile.yml"]
    },
    {
        "language":"bam",
        "type":"BuildSystem",
        "fileNames":["bam.lua"]
    },
    {
        "language":"meson",
        "type":"BuildSystem",
        "fileNames":["meson.build"]
    },
    {
        "language":"hunter",
        "type":"BuildSystem",
        "fileNames":["huntergate.cmake", "hunter.cmake"]
    },
    {
        "language":"cget",
        "type":"BuildSystem",
        "fileNames":["requirements.txt"]
    },
    {
        "language":"conda",
        "type":"BuildSystem",
        "fileNames":["meta.yaml"]
    },
    {
        "language":"shake",
        "type":"BuildSystem",
        "fileNames":["build.hs"]
    },
    {
        "language":"gemfile",
        "type":"BuildSystem",
        "fileNames":["gemfile"]
    },
    {
        "language":"npm",
        "type":"BuildSystem",
        "fileNames":["package.json"]
    },
    {
        "language":"webpack",
        "type":"BuildSystem",
        "fileNames":["webpack.config.js"]
    },
    {
        "language":"bower",
        "type":"BuildSystem",
        "fileNames":["bower.json"]
    },
    {
        "language":"maven",
        "type":"BuildSystem",
        "fileNames":["pom.xml"]
    },
    {
        "language":"cmake",
        "type":"BuildSystem",
        "fileNames":["cmakelists.txt"],
        "fileSuffixes": [".cmake"]
    },
    {
        "language":"makefile",
        "type":"BuildSystem",
        "fileNames":["makefile"],
        "fileSuffixes": [".make", ".mkfile", ".mak", ".mk"]
    },
    {
        "language":"qmake",
        "type":"BuildSystem",
        "fileSuffixes": [".pro", ".pri"]
    },
    {
        "language":"visual studio",
        "type":"BuildSystem",
        "fileSuffixes": [".sln", ".vcxproj", ".vcproj", ".props"]
    },
    {
        "language":"xcode",
        "type":"BuildSystem",
        "fileSuffixes": [".xcconfig", ".pbxproj", ".xcworkspacedata"]
    },
    {
        "language":"automake",
        "type":"BuildSystem",
        "fileSuffixes": [".am"]
    },
    {
        "language":"ninja",
        "type":"BuildSystem",
        "fileSuffixes": [".ninja"]
    },
    {
        "language":"vcpkg",
        "type":"BuildSystem",
        "fileSuffixes": [".vcpkg"]
    },
    {
        "language":"boost.jam",
        "type":"BuildSystem",
        "fileSuffixes": [".jam"]
    },
    {
        "language":"gradle",
        "type":"BuildSystem",
        "fileSuffixes": [".gradle"]
    },
    {
        "language":"bazel",
        "type":"BuildSystem",
        "fileSuffixes": [".bzl"]
    },
    {
        "language":"gyp",
        "type":"BuildSystem",
        "fileSuffixes": [".gyp", ".gypi"]
    },
    {
        "language":"eslint",
        "type":"EnvConfig",
        "filePrefixes": [".eslintrc."]
    },
    {
        "language":"travis",
        "type":"EnvConfig",
        "fileNames":[".travis.yml"]
    },
    {
        "language":"appveyor",
        "type":"EnvConfig",
        "fileNames":["appveyor.yml"]
    },
    {
        "language":"gitlab",
        "type":"EnvConfig",
        "fileNames":[".gitlab-ci.yml"]
    },
    {
        "language":"circleci",
        "type":"EnvConfig",
        "fileNames":["circle.yml"]
    },
    {
        "language":"clangformat",
        "type":"EnvConfig",
        "fileNames":[".clang-format"]
    },
    {
        "language":"clang_complete",
        "type":"EnvConfig",
        "fileNames":[".clang_complete"]
    },
    {
        "language":"editorconfig",
        "type":"EnvConfig",
        "fileNames":[".editorconfig"]
    },
    {
        "language":"gdbinit",
        "type":"EnvConfig",
        "fileNames":[".gdbinit"]
    },
    {
        "language":"yard",
        "type":"EnvConfig",
        "fileNames":[".yardopts"]
    },
    {
        "language":"codecov.io",
        "type":"EnvConfig",
        "fileNames":[".codecov.yml"]
    },
    {
        "language":"pylint",
        "type":"EnvConfig",
        "fileNames":[".pylintrc"]
    },
    {
        "language":"flake8",
        "type":"EnvConfig",
        "fileNames":[".flake8"]
    },
    {
        "language":"emacs.dir-locals",
        "type":"EnvConfig",
        "fileNames":[".dir-locals.el"]
    },
    {
        "language":"doxygen",
        "type":"EnvConfig",
        "fileNames":["doxygen.config"]
    },
    {
        "language":"apache-2.0",
        "type":"License",
        "fileNames":["apache-2.0.txt"]
    },
    {
        "language":"agpl-3.0",
        "type":"License",
        "fileNames":["gnu-agpl-3.0.txt"]
    },

    {
        "language":"flatbuffers",
        "type":"Generator",
        "fileSuffixes": [".fbs"]
    },
    {
        "language":"cap'n proto",
        "type":"Generator",
        "fileSuffixes": [".capnp"]
    },
    {
        "language":"lex",
        "type":"Generator",
        "fileSuffixes": [".l", ".lex", ".ll"]
    },
    {
        "language":"yacc",
        "type":"Generator",
        "fileSuffixes": [".y", ".yacc", ".yxx"]
    },
    {
        "language":"m4",
        "type":"Generator",
        "fileSuffixes": [".m4"]
    }
]
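
A rule list like the one above could drive a small file classifier. The following is only a sketch under assumptions: the `Rule` struct, `LoadRules`, and `Classify` are hypothetical names that are not part of zoekt, the first matching rule wins, and matching is done case-insensitively on the base name because all rule entries are lowercase.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"strings"
)

// Rule mirrors one entry of the JSON list above (hypothetical type).
type Rule struct {
	Language     string   `json:"language"`
	Type         string   `json:"type"`
	FileNames    []string `json:"fileNames"`
	FileSuffixes []string `json:"fileSuffixes"`
	FilePrefixes []string `json:"filePrefixes"`
}

// sampleRules is a three-entry excerpt of the list above.
const sampleRules = `[
  {"language":"cmake","type":"BuildSystem","fileNames":["cmakelists.txt"],"fileSuffixes":[".cmake"]},
  {"language":"eslint","type":"EnvConfig","filePrefixes":[".eslintrc."]},
  {"language":"yacc","type":"Generator","fileSuffixes":[".y",".yacc",".yxx"]}
]`

// LoadRules parses a JSON rule list.
func LoadRules(data string) ([]Rule, error) {
	var rules []Rule
	err := json.Unmarshal([]byte(data), &rules)
	return rules, err
}

// Classify returns the first rule that matches the lowercased base name
// of filePath, checking exact names, then suffixes, then prefixes.
func Classify(rules []Rule, filePath string) (Rule, bool) {
	base := strings.ToLower(path.Base(filePath))
	for _, r := range rules {
		for _, n := range r.FileNames {
			if base == n {
				return r, true
			}
		}
		for _, s := range r.FileSuffixes {
			if strings.HasSuffix(base, s) {
				return r, true
			}
		}
		for _, p := range r.FilePrefixes {
			if strings.HasPrefix(base, p) {
				return r, true
			}
		}
	}
	return Rule{}, false
}

func main() {
	rules, err := LoadRules(sampleRules)
	if err != nil {
		panic(err)
	}
	paths := []string{"src/CMakeLists.txt", "web/.eslintrc.json", "parser/gram.y", "main.go"}
	for _, p := range paths {
		if r, ok := Classify(rules, p); ok {
			fmt.Printf("%s: %s (%s)\n", p, r.Language, r.Type)
		} else {
			fmt.Printf("%s: no match\n", p)
		}
	}
}
```

Because the first match wins, the order of the rules matters when a file name could satisfy more than one entry.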

Or, as a second example, a mapping from brands and keywords to their synonyms:

{
    "brands": {
        "google":["google","angular","googlecloudplatform","googlechrome", "golang", "gwtproject", "zxing", "v8"],
        "twitter":["twbs", "twitter", "bower", "flightjs"],
        "facebook": ["facebook", "facebookarchive","boltsframework"],
        "github":["atom", "github"],
        "microsoft": ["microsoft", "dotnet", "aspnet", "exceptionless", "mono", "winjs"]
    },
    "keywords":{
        "node": ["node", "nodejs"],
        "jquery": ["jquery", "jq", "/^jq[\\-]?/"],
        "grunt": ["grunt", "gruntjs"],
        "angular": ["angular", "angularjs", "ng", "/^ng(?!inx)[\\-]?/"],
        "ember": ["emberjs", "ember"],
        "meteor": ["meteor", "meteorjs"],
        "gulp": ["gulp"],
        "express": ["express", "expressjs"],
        "d3": ["d3"],
        "polymer": ["polymer"],
        "ionic": ["ionic"],
        "seajs": ["seajs"],
        "yeoman": ["yeoman"],
        "browserify": ["browserify"],
        "requirejs": ["requirejs"],
        "underscore": ["underscore", "underscorejs"],
        "modernizr": ["modernizr"],
        "phantom": ["phantom", "phantomjs"],
        "metalsmith": ["metalsmith"],

        "bootstrap": ["bootstrap"],

        "django": ["django"],
        "bottle": ["bottlepy", "bottle"],
        "web2py": ["web2py"],
        "webpy": ["webpy"],
        "flask": ["flask"],
        "ipython": ["ipython"],
        "fabric": ["fabric"],
        "celery": ["celery"],

        "language/python": ["python", "/^py/"],
        "language/ruby": ["ruby"],
        "language/clojure": ["clojure"],
        "language/lisp": ["lisp"],
        "language/rust": ["rust"],
        "language/erlang": ["erlang"],
        "language/go": ["golang", "go"],
        "language/javascript": ["javascript", "js"],
        "language/coffeescript": ["coffeescript"],
        "language/php": ["php"],
        "language/perl": ["perl"],
        "language/swift": ["swift"],
        "language/css": ["css", "stylesheet"],

        "ios": ["ios"],
        "osx": ["osx"],
        "unix": ["unix"],
        "android": ["android"],
        "linux": ["linux"],
        "windows": ["windows"],

        "deprecated": ["deprecated"],
        "pdf": ["pdf"],
        "polyfill": ["polyfill"],
        "framework": ["framework"],
        "dropbox": ["dropbox"],
        "webkit": ["webkit"],
        "sql": ["sql"],
        "svg": ["svg"],
        "boilerplate": ["boilerplate", "seed"],
        "rails": ["rails", "rails3"],
        "vim": ["vim", "vi"],
        "git": ["git"],
        "backbone": ["backbone"],
        "docker": ["docker"],
        "emacs": ["emacs"],
        "redis": ["redis"],
        "chrome": ["chrome"],
        "sublime": ["sublime"],
        "vagrant": ["vagrant"],
        "wordpress": ["wordpress", "/^wp\\-/"],
        "youtube": ["youtube"],
        "apache": ["apache"],
        "jekyll": ["jekyll"],
        "puppet": ["puppet"],
        "sass": ["sass", "scss"],
        "nginx": ["nginx"],
        "markdown": ["markdown"],
        "elasticsearch": ["elasticsearch"],
        "chef": ["chef"],
        "mongodb": ["mongodb", "mongo"],
        "cordova": ["cordova"],
        "phonegap": ["phonegap"],
        "ansible": ["ansible"],
        "openshift": ["openshift"],
        "mysql": ["mysql"],
        "couchbase": ["couchbase"],
        "firebase": ["firebase"],
        "homebrew": ["homebrew"],
        "openstack": ["openstack"],
        "maven": ["maven"],
        "hadoop": ["hadoop"],
        "spark": ["spark"],
        "jasmine": ["jasmine"],
        "hubot": ["hubot"],
        "jruby": ["jruby"],
        "couchdb": ["couchdb"],
        "travis": ["travis"],
        "bash": ["bash"],
        "coreos": ["coreos"],
        "mustache": ["mustache"],
        "zsh": ["zsh"],
        "jenkins": ["jenkins"],
        "cassandra": ["cassandra"],
        "statsd": ["statsd"],
        "eclipse": ["eclipse"],
        "knockout": ["knockout"],
        "graphite": ["graphite"],
        "textmate": ["textmate"],
        "jed": ["jed"],
        "memcached": ["memcached"],
        "mesos": ["mesos"],
        "rabbitmq": ["rabbitmq"],
        "firefox": ["firefox", "ff"],
        "postgres": ["postgres", "postgresql"],
        "selenium": ["selenium"],
        "gems": ["gems", "rubygems"],
        "zeromq": ["zeromq", "zmq", "0mq"],
        "tmux": ["tmux"],
        "cyanogenmod": ["cyanogenmod"],
        "tornado": ["tornado"],
        "octopress": ["octopress"],
        "dokku": ["dokku"],
        "karma": ["karma"],
        "bitcoin": ["bitcoin"],
        "handlebars": ["handlebars"],
        "qt": ["qt"],
        "minecraft": ["minecraft"],
        "unity": ["unity"],
        "cocos2d": ["cocos2d"],
        "openssl": ["openssl"],
        "amqp": ["amqp"],
        "logstash": ["logstash"],
        "sqlite": ["sqlite"],
        "v8": ["v8"],
        "fuse": ["fuse"],
        "cocoa": ["cocoa"],
        "curl": ["curl"],
        "ffmpeg": ["ffmpeg"],
        "hhvm": ["hhvm"],
        "rake": ["rake"],
        "drupal": ["drupal"],
        "gevent": ["gevent"],
        "nagios": ["nagios"],
        "chromium": ["chromium"],
        "jenkinsci": ["jenkinsci"],
        "etcd": ["etcd"],
        "kubernetes": ["kubernetes"],
        "react": ["react", "reactjs"]
    }
}
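
One way the "keywords" map above might be applied is to resolve a raw term (a repository topic, say) to its canonical keyword. This is a sketch only: `Matcher`, `Compile`, and `Canonical` are hypothetical names, and entries wrapped in slashes are assumed to be regular expressions. Go's `regexp` package uses RE2 syntax, which has no lookahead, so a pattern containing `(?!inx)` fails to compile; the sketch reports such patterns instead of aborting.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// Matcher is the compiled form of one keyword's synonym list (hypothetical).
type Matcher struct {
	literals map[string]bool
	patterns []*regexp.Regexp
}

// sampleKeywords is an excerpt of the "keywords" object above, after JSON
// decoding (so "\\-" has become "\-").
var sampleKeywords = map[string][]string{
	"jquery":    {"jquery", "jq", `/^jq[\-]?/`},
	"wordpress": {"wordpress", `/^wp\-/`},
	"angular":   {"angular", "angularjs", "ng", `/^ng(?!inx)[\-]?/`},
	"postgres":  {"postgres", "postgresql"},
}

// Compile builds one Matcher per canonical keyword. Synonyms of the form
// /.../ are compiled as regular expressions; those that Go's RE2 engine
// rejects are returned in skipped rather than causing a failure.
func Compile(keywords map[string][]string) (map[string]*Matcher, []string) {
	out := make(map[string]*Matcher)
	var skipped []string
	for canon, syns := range keywords {
		m := &Matcher{literals: map[string]bool{}}
		for _, s := range syns {
			if len(s) > 2 && strings.HasPrefix(s, "/") && strings.HasSuffix(s, "/") {
				re, err := regexp.Compile(s[1 : len(s)-1])
				if err != nil {
					skipped = append(skipped, s)
					continue
				}
				m.patterns = append(m.patterns, re)
				continue
			}
			m.literals[s] = true
		}
		out[canon] = m
	}
	sort.Strings(skipped)
	return out, skipped
}

// Canonical returns, in sorted order, every canonical keyword whose
// literals or patterns match the lowercased term.
func Canonical(ms map[string]*Matcher, term string) []string {
	term = strings.ToLower(term)
	var hits []string
	for canon, m := range ms {
		if m.literals[term] {
			hits = append(hits, canon)
			continue
		}
		for _, re := range m.patterns {
			if re.MatchString(term) {
				hits = append(hits, canon)
				break
			}
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	ms, skipped := Compile(sampleKeywords)
	fmt.Println("skipped patterns:", skipped)
	for _, t := range []string{"jq-plugin", "wp-cli", "PostgreSQL", "nginx"} {
		fmt.Printf("%s -> %v\n", t, Canonical(ms, t))
	}
}
```

To keep the "ng but not nginx" rule under RE2, the exclusion would have to be expressed differently, for example as a separate deny list checked before the pattern.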

refs.

@hanwen
Contributor

hanwen commented Aug 16, 2018

What do you want to search for exactly? Per-repository data or per-file data? Let's call them tags.

Do you want full-text string search of the tags, regex search, or only exact matches?

Languages are already supported, see https://cs.bazel.build/search?q=lang%3Apython

@roscopecoltran
Author

Hi,

Thanks for the reply!

Both, in fact. I would like to index my $GOPATH/src directory, since I clone all kinds of repos into it (Node.js, Java, Python, ...). This keeps my repositories organized by repo URI, so I would like filters on the left side that let me narrow results by VCS provider, owner, and project name.

To put it simply, I want to bring the zoekt web UI closer to searchcode-server (https://github.com/boyter/searchcode-server).

Example of the left-hand filter blocks for matched files:

Filter by VCS provider:

  • All
  • github.com
  • gitlab.com
  • bitbucket.com
  • android.googlesource.com
    ...

Filter by namespace (org/user):

  • All
  • roscopecoltran
  • html5rocks
  • tensorflow
  • apache
    ...

Filter by languages:

  • All
  • Golang
  • Python
  • Java
  • C++
    ...

Filter by filetypes:

  • All
  • Source
  • Build System
  • Env Config
  • Generator
    ...

Filter by topics:

  • All
  • Elk stack
  • Google
  • vuejs
  • react
    ...
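
The provider and namespace filters above could be derived from the directory layout under $GOPATH/src alone. A minimal sketch, assuming every repository path has the shape provider/namespace/project; `RepoFacets`, `ParseRepoPath`, and `Count` are hypothetical helpers, not zoekt APIs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// RepoFacets holds the filter values derived from one repository path.
type RepoFacets struct {
	Provider  string // e.g. github.com
	Namespace string // org or user
	Project   string
}

// ParseRepoPath splits a path relative to $GOPATH/src, such as
// "github.com/google/zoekt", into its facets. Paths with fewer than
// three segments are rejected; extra segments are folded into Project.
func ParseRepoPath(rel string) (RepoFacets, bool) {
	parts := strings.Split(strings.Trim(rel, "/"), "/")
	if len(parts) < 3 {
		return RepoFacets{}, false
	}
	return RepoFacets{
		Provider:  parts[0],
		Namespace: parts[1],
		Project:   strings.Join(parts[2:], "/"),
	}, true
}

// Count tallies how many repositories fall under each provider and
// namespace, which is what a facet sidebar would display.
func Count(repos []string) (providers, namespaces map[string]int) {
	providers = map[string]int{}
	namespaces = map[string]int{}
	for _, r := range repos {
		if f, ok := ParseRepoPath(r); ok {
			providers[f.Provider]++
			namespaces[f.Namespace]++
		}
	}
	return providers, namespaces
}

func main() {
	repos := []string{
		"github.com/google/zoekt",
		"github.com/tensorflow/tensorflow",
		"github.com/google/go-github",
		"gitlab.com/roscopecoltran/demo",
	}
	providers, namespaces := Count(repos)

	// Print providers in a stable order.
	names := make([]string, 0, len(providers))
	for p := range providers {
		names = append(names, p)
	}
	sort.Strings(names)
	for _, p := range names {
		fmt.Printf("%s: %d\n", p, providers[p])
	}
	fmt.Println("google repos:", namespaces["google"])
}
```

Language, file type, and topic facets need per-file or external data and are not covered by this path-based sketch.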

I hope that makes my idea clearer. Thanks in advance for your time and reply.

Cheers,
Rosco

@hanwen
Contributor

hanwen commented Aug 16, 2018

"just wanted to upgrade zoekt to a webui closer to searchcode-server "

I don't know enough about building web UIs to pull that off, but I'm happy to review changes.

I could add something to the individual results to add restrictions (this repo, this directory, this language, this branch). Would that help?

@roscopecoltran
Author

roscopecoltran commented Aug 16, 2018

Yes, that would help. I can do the web UI; that's not a problem.

Summary:
It would be awesome to be able to CRUD metadata/tags, either global to a repository or specific to a file, on an already existing index or while creating a new one.

E.g., I could use the go-github package to fetch the topics defined for a repository and enrich the search-result restrictions with those owner-defined topics (ref. https://github.com/google/go-github/blob/master/github/repos.go#L58). Then I would do the disambiguation with the jargon package.

This pipeline could be queued and triggered separately, but the most important thing is to have methods in zoekt to manage this extra data, for a repo or a file, in an already existing zoekt index file. I guess it would be costly to rebuild the index each time if you index more than 1000 repos...

If you do not mind, let me draft an example/PoC in my fork of zoekt ^^. I will send you a link in 1 to 2 hours... :-)

Thanks for your patience

@hanwen
Contributor

hanwen commented Aug 16, 2018

Please send me a change through Gerrit, as described here:

https://github.com/google/zoekt/blob/master/CONTRIBUTING

@hanwen
Contributor

hanwen commented Aug 16, 2018

For per-repository data, things are simple. There is already a pipeline for inserting metadata; see line 196 of zoekt/api.go at commit 8e284ca:

RawConfig map[string]string

It only needs a query operator on top of it. You also have to find a way to ingest this data from a given (git) repository. Currently, only git-config settings are imported as repo metadata.
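
To illustrate what such an operator might boil down to, here is a sketch of a key/value filter over repository metadata. It does not use the real zoekt types: `Repository` is a local stand-in that copies only the RawConfig field quoted above, `MatchesConfig` and `FilterByConfig` are hypothetical, and the "topic" key is an invented example.

```go
package main

import "fmt"

// Repository is a local stand-in, not the zoekt type; it carries only
// the RawConfig field quoted above plus a name for display.
type Repository struct {
	Name      string
	RawConfig map[string]string
}

// MatchesConfig reports whether the repository's metadata has key set
// to exactly value. A missing key never matches.
func MatchesConfig(r Repository, key, value string) bool {
	v, ok := r.RawConfig[key]
	return ok && v == value
}

// FilterByConfig keeps the repositories whose metadata has key=value,
// preserving the input order.
func FilterByConfig(repos []Repository, key, value string) []Repository {
	var out []Repository
	for _, r := range repos {
		if MatchesConfig(r, key, value) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	repos := []Repository{
		// "topic" is a hypothetical metadata key used for illustration.
		{Name: "github.com/google/zoekt", RawConfig: map[string]string{"topic": "search"}},
		{Name: "github.com/google/go-github", RawConfig: map[string]string{"topic": "api"}},
		{Name: "example.com/no/metadata"},
	}
	for _, r := range FilterByConfig(repos, "topic", "search") {
		fmt.Println(r.Name)
	}
}
```

This covers exact matches only; substring or regex matching on tag values, as raised earlier in the thread, would need a different comparison in `MatchesConfig`.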
