A machine-learning-powered bot for detecting and commenting on unmarked code on the 4programmers.net forum.
Eleia tracks new posts on the 4programmers.net forum and analyzes every post by removing all HTML tags (including <code>), splitting the post into paragraphs, and classifying every paragraph as either "code" or "text".
If any paragraph is classified as "code", Eleia can post a comment on that post with a text such as "Hey! Improve your post by adding ``` tags!".
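The analysis steps described above can be sketched roughly as follows. This is an illustration in Python, not Eleia's actual code: the function names are hypothetical, `toy_code_score` is a crude stand-in for the trained model, and splitting paragraphs on blank lines is an assumption. The sketch also assumes that `<code>` elements are dropped together with their contents so that already-marked code is not flagged; the description above only says that the tags are removed.

```python
import html
import re

# Assumption: marked code (<code>...</code>) is removed together with its contents.
CODE_ELEMENT_RE = re.compile(r"<code\b[^>]*>.*?</code>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(post_html):
    """Remove <code> elements, then every remaining tag, then unescape entities."""
    without_code = CODE_ELEMENT_RE.sub("", post_html)
    return html.unescape(TAG_RE.sub("", without_code))


def split_paragraphs(text):
    """Split on blank lines (an assumption) and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def toy_code_score(paragraph):
    """Hypothetical stand-in for the model: share of lines ending in ';', '{' or '}'."""
    lines = [line for line in paragraph.splitlines() if line.strip()]
    hits = sum(1 for line in lines if line.rstrip().endswith((";", "{", "}")))
    return hits / len(lines) if lines else 0.0


def max_code_probability(post_html, classify=toy_code_score):
    """Return the highest "code" score among the paragraphs of a post."""
    paragraphs = split_paragraphs(strip_markup(post_html))
    return max((classify(p) for p in paragraphs), default=0.0)
```

A post would then be a candidate for a nag comment when the returned score is high enough; a real classifier would replace `toy_code_score`.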
By default, Eleia then goes to sleep and wakes up after a specified amount of time to repeat the same operations. Posts that have already been classified are not analyzed again, but this state is kept in memory only, so it is lost when the application restarts.
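The polling behaviour can be pictured with the following Python sketch. It is only an illustration under assumptions: `fetch_new_post_ids` and `analyze` are hypothetical callables supplied by the caller, and the `cycles` parameter exists purely so that the loop can be stopped in a demonstration.

```python
import time


def run(fetch_new_post_ids, analyze, minutes_between_updates=60,
        cycles=None, sleep=time.sleep):
    """Poll for posts, analyzing each post id at most once per process lifetime."""
    analyzed = set()  # in-memory only, so it starts empty after every restart
    completed = 0
    while True:
        for post_id in fetch_new_post_ids():
            if post_id in analyzed:
                continue  # already classified during this run of the process
            analyze(post_id)
            analyzed.add(post_id)
        completed += 1
        if cycles is not None and completed >= cycles:
            return analyzed
        sleep(minutes_between_updates * 60)
```

For example, feeding it the batches `[1, 2]` and `[2, 3]` with `cycles=2` analyzes posts 1, 2 and 3 exactly once each.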
Since version 0.6, Eleia supports a set of command-line parameters, each passed in the form --parameter <value> or -p <value>.
-u or --username -> username to log in to Coyote with. Default is null, and it may be null only if -c is not true.
-p or --password -> password to log in with,
-t or --timeBetweenUpdates -> time (in minutes) to wait before getting another set of posts, if -s was not used. If given 0, the bot will run only once and then finish, like -r. Default is 60.
-n or --nagMessage -> message to be posted as a comment on the post; you can use {0}, which will be replaced with the probability calculated by the model,
-d or --useDebug4p -> true or false. If true, dev.4programmers.info's endpoints will be used instead of 4programmers.net. Default is true.
-c or --postComments -> true or false. If true, a comment with the nag message will be posted to a post after a problem is found with it. Default is false.
-r or --runOnce -> true or false. If true, the application will run only once; if not, it will go to sleep for -t minutes before performing another analysis. Default is false.
-s or --runOnSet -> starts a single run (like -r) but on a provided list of post ids, separated by commas, instead of getting new posts. For example, -s 1,2,3,4 analyzes the posts with ids 1, 2, 3 and 4. Implies -r.
--blacklist -> comma-separated list of forum ids from which posts will be ignored,
--ignoreAlreadyAnalyzed -> ignores the "already analyzed" database, analyzes every post from non-blacklisted forums, and does not create the "already analyzed" database on exit,
--help -> displays the help screen and exits,
--version -> displays the version information and exits.
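How -r, -s and -t interact can be summarized in a short Python sketch. The names `RunPlan` and `resolve_run_plan` are hypothetical and the code only mirrors the rules stated in the list above; it is not taken from Eleia's sources.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunPlan:
    run_once: bool
    post_ids: Optional[List[int]]  # None means "fetch new posts from the forum"
    sleep_minutes: int


def resolve_run_plan(run_once=False, run_on_set=None, time_between_updates=60):
    """Combine -r, -s and -t into a single description of what the bot will do."""
    post_ids = None
    if run_on_set:
        # -s takes a comma-separated list of post ids, e.g. "1,2,3,4"
        post_ids = [int(part) for part in run_on_set.split(",") if part.strip()]
        run_once = True  # -s implies -r
    if time_between_updates == 0:
        run_once = True  # -t 0 behaves like -r
    return RunPlan(run_once, post_ids, time_between_updates)
```

With these rules, `resolve_run_plan(run_on_set="221,14122")` describes a single run over posts 221 and 14122, while `resolve_run_plan(time_between_updates=15)` describes a bot that keeps polling every 15 minutes.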
# will run only one time, analyze the posts, post comments if threshold is met
# and exit - on dev version of 4programmers.info
eleia -u someuser -p someSECRETpassw0rd --useDebug4p true --postComments true --runOnce true
# will run continuously, analyzing posts and sleeping for 15 minutes between getting
# each new set to analyze; won't post comments, will use the real 4programmers.net
eleia -d false -c false -t 15
# will analyze only posts of id 14122 and 221
eleia --runOnSet 221,14122
Apart from CLI parameters, the application can be configured using environment variables or a configuration file to provide the username, password and similar settings. This was designed mainly with Docker images in mind.
Warning: configuration takes priority over the CLI! For example, if you set the username both in the config file and with the -u parameter, the value from the config file will be used.
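The precedence rule can be illustrated with a minimal Python sketch. `effective_settings` is a hypothetical helper, and the sketch only models the rule stated in the warning (configuration wins over CLI); it makes no claim about how Eleia merges the two internally.

```python
def effective_settings(cli, config):
    """Merge CLI values with configuration values; configuration wins on conflicts."""
    merged = dict(cli)
    # Only configuration keys that are actually set override the CLI values.
    merged.update({key: value for key, value in config.items() if value is not None})
    return merged
```

For instance, merging a CLI username of `cli-user` with a config-file username of `config-user` yields `config-user`, while CLI-only settings such as `timeBetweenUpdates` are kept.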
Possible configuration options are:
username -> (string) username to log in to Coyote with,
password -> (string) password to authenticate with Coyote,
useDebug4p -> (bool) whether to use dev.4programmers.info instead of 4programmers.net,
postComments -> (bool) whether to post comments when unformatted code is found,
timeBetweenUpdates -> (int) how long to sleep before getting a new batch of posts,
threshold -> (float) threshold of the "code" classification that triggers posting a comment (by default: 0.99),
nagMessage -> (string) nag message posted as a comment; {0} will be replaced with the probability of unformatted code,
blacklist -> (string) comma-separated list of forum ids from which posts will be ignored.
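As a rough illustration of how threshold, nagMessage and blacklist fit together, consider the Python sketch below. The helper names are hypothetical, the "greater than or equal" comparison against the threshold is an assumption, and Python's `str.format` is used here only because it happens to understand the same `{0}` placeholder; the exact number formatting in Eleia may differ.

```python
def parse_blacklist(raw):
    """Turn a comma-separated string of forum ids into a set of integers."""
    return {int(part) for part in raw.split(",") if part.strip()}


def build_comment(probability, forum_id, nag_message, threshold=0.99, blacklist=""):
    """Return the comment text to post, or None when no comment should be posted."""
    if forum_id in parse_blacklist(blacklist):
        return None  # posts from blacklisted forums are ignored
    if probability < threshold:
        return None  # assumption: the threshold is met when probability >= threshold
    return nag_message.format(probability)  # fills in the {0} placeholder
```

A message without `{0}` is returned unchanged, so the placeholder is optional in this sketch.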
All these options may be set in an appsettings.json file in the current directory.
Apart from that, you may set the verbosity level via the Logging:LogLevel:Default configuration option (possible values: Debug, Information, Warning and Error).
Example configuration file:
{
  "username": "some user",
  "password": "verySECRETpassw0rd!",
  "useDebug4p": true,
  "postComments": false,
  "timeBetweenUpdates": 30,
  "threshold": 0.95,
  "nagMessage": "Hey! Something is wrong with code in your post!",
  "Logging": {
    "IncludeScopes": false,
    "LogLevel": {
      "Default": "Debug"
    }
  }
}
If you prefer to use environment variables, they must start with the ELEIA_ prefix, e.g.:
export ELEIA_username=someuser2
export ELEIA_useDebug4p=false
# if you want to set verbosity level, you have to use __ as section separator
export ELEIA_Logging__LogLevel__Default=Error
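To make the naming scheme concrete, the Python sketch below shows how such variables map onto the same nested structure as the configuration file. `settings_from_env` is a hypothetical helper written for illustration; the application itself relies on its framework's configuration loading rather than code like this, and values stay plain strings here.

```python
def settings_from_env(environ, prefix="ELEIA_"):
    """Build a nested settings dict from prefixed environment variables."""
    settings = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variable such as PATH
        path = name[len(prefix):].split("__")  # "__" separates sections
        node = settings
        for section in path[:-1]:
            node = node.setdefault(section, {})
        node[path[-1]] = value
    return settings
```

Passing `os.environ` after the exports shown above would produce a `Logging` section containing `LogLevel`, which in turn contains `Default` set to `Error`, next to the top-level `username` and `useDebug4p` entries.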
Eleia uses the ML.NET framework to perform machine-learning-based binary classification, and AutoML (mlnet auto-train) was used to generate the model.
The exact command was:
mlnet auto-train -T binary-classification -d trainingdata2.tsv -o ML --label-column-name code -x 180
trainingdata2.tsv was the file the training was performed on.
trainingdata2.log is the log of the auto-training; SdcaLogisticRegressionBinary was selected with an accuracy of 0.9572.
I also ran training sessions longer than 180 seconds, but the same algorithm was still selected as the best, so the longer time had no visible impact.
You are welcome to contribute to Eleia!
Please start by creating a new issue so we can discuss what you are trying to achieve. Then fork this repo to your own profile, fix bugs or add new things, and send me a pull request.
Every PR must be built and tested on Azure Pipelines; this is done automatically.
Published versions are built on Azure Pipelines for every new tag. I use SemVer for versioning releases.