Add code in its RC stage
n3ssuno committed Aug 24, 2022
0 parents commit 8f311b1
Showing 9 changed files with 2,545 additions and 0 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
data/*
files/*
!data/test_data.jsonl

.vscode/
bak/
test_scripts/
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 TU/e and EPFL

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
63 changes: 63 additions & 0 deletions README.md
@@ -0,0 +1,63 @@
# IRIS Virtual Patent Marking Pages Classifier
A tool that helps a person classify a list of potential Virtual Patent Marking (VPM) pages into several categories. Part of the IRIS project.

The classifier is written in Python, using the PyQt5 library.

It creates a GUI browser that shows the detected pages one at a time.

You can interact with the browser with the mouse, and you can also use the numeric keypad to select one of the categories.

Once you have chosen the right category for a page, the software moves to the next page.
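
As a rough illustration of how a key press maps to a category, the sketch below reuses the labels defined in ``classify.py``. The names ``CATEGORIES`` and ``parse_label`` are hypothetical and are not part of the tool; they only show how each label combines a description with a true/false patent-product link.

```python
# Illustrative only: CATEGORIES and parse_label are hypothetical names.
# The label strings are the ones used in classify.py.
CATEGORIES = {
    1: 'VPM page | True patent-product link',
    2: 'Brochure or description of the product | True patent-product link',
    3: 'Hybrid document | True patent-product link',
    4: 'List of patents or metadata of a patent | False patent-product link',
    5: 'A scientific publication | False patent-product link',
    6: 'News about the patent | False patent-product link',
    7: 'CV/resume | False patent-product link',
    8: 'Something else in a website to keep | False patent-product link',
    9: 'Something else in a website to exclude | False patent-product link',
    0: 'The document is unreachable | False patent-product link',
}


def parse_label(label):
    """Split a label into (description, is_true_link)."""
    description, link = label.split(' | ')
    return description, link.startswith('True')


if __name__ == '__main__':
    for key, label in CATEGORIES.items():
        print(key, parse_label(label))
```

Keys 1 to 3 mark a true patent-product link; all the other keys mark a false one.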

## Set up the classifier
The recommended procedure is to
1. Install [Git](https://git-scm.com/)
2. Clone this repository with ``git clone https://gitlab.tue.nl/iris/iris-vpm-pages-classifier.git``
3. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
4. Create an environment with
* ``conda create -n iris-vpm-pages-classifier python=3.9``
* ``conda activate iris-vpm-pages-classifier``
* ``pip install -r requirements.txt``
* ``pip install git+https://gitlab.tue.nl/iris/iris-utils.git``
5. If you need to use the pre-classifier, you must also install a headless browser with the following command<br>
``playwright install chromium``<br>
Note: the code has been tested with Chromium v857950, but the latest version of the browser will be installed

### GUI classifier on WSL2
1. Install ``qt5-default`` on the WSL2 distro
2. Install X410 on Windows (the free alternatives did not work for me) and select ``Allow Public Access`` from its menu
3. Add the following line to the ``~/.bashrc`` file of the WSL2 distro (before the block of code added by Conda)<br>
``export DISPLAY=$(awk '/nameserver / {print $2; exit}' /etc/resolv.conf 2>/dev/null):0.0``<br>
Do not add ``export LIBGL_ALWAYS_INDIRECT=1``, even though many online guides advise it.
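
To make clear what the ``awk`` one-liner computes, here is a minimal Python sketch of the same idea: take the address from the first ``nameserver`` line of ``/etc/resolv.conf`` and append ``:0.0``. The function names are hypothetical, and the match is slightly stricter than the ``awk`` pattern (the line must start with the word ``nameserver``).

```python
# Illustrative only: first_nameserver and display_value are hypothetical
# helpers, not part of this repository.
def first_nameserver(resolv_conf_text):
    """Return the address on the first 'nameserver' line, or None."""
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == 'nameserver':
            return fields[1]
    return None


def display_value(resolv_conf_text):
    """Build the DISPLAY value '<host>:0.0', or None if no host is found."""
    host = first_nameserver(resolv_conf_text)
    return f'{host}:0.0' if host else None


if __name__ == '__main__':
    with open('/etc/resolv.conf') as f_in:
        print(display_value(f_in.read()))
```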

## Pre-processing
Before you start to classify the pages by hand, you must run ``pre-classify.py`` to automatically classify some pages.
This script will create a file with five main categories: cases that are (a) very likely true positives; (b) very likely false positives; (c) maybe positive; (d) maybe negative; (e) unknown.

The first two cases are classified automatically. For the next two (maybe positive and maybe negative), a hint is provided and the person must decide whether the page is actually a VPM page. The last case is left to the person, without any hint.
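
The handling of the five categories can be summarized with a small lookup table. This is only a sketch: the category strings and the names ``Handling``, ``HANDLING`` and ``needs_human`` are assumptions made for illustration, and the values actually written by ``pre-classify.py`` may differ.

```python
from enum import Enum


# Illustrative only: these names and category strings are hypothetical.
class Handling(Enum):
    AUTOMATIC = 'classified automatically'
    HINT = 'shown to the person with a hint'
    MANUAL = 'shown to the person without a hint'


HANDLING = {
    'very likely true positive': Handling.AUTOMATIC,
    'very likely false positive': Handling.AUTOMATIC,
    'maybe positive': Handling.HINT,
    'maybe negative': Handling.HINT,
    'unknown': Handling.MANUAL,
}


def needs_human(category):
    """Return True if a person has to look at pages in this category."""
    return HANDLING[category] is not Handling.AUTOMATIC


if __name__ == '__main__':
    for category, handling in HANDLING.items():
        print(f'{category}: {handling.value}')
```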

The pre-classifier depends on software that is easy to install on GNU/Linux but hard to set up on MS Windows. The advice is, therefore, to use a GNU/Linux machine (the instructions that follow are for Debian GNU/Linux) or WSL2 (running the GUI classifier from WSL2 is not trivial but possible; follow the instructions in the "GUI classifier on WSL2" section above).
1. Install [Tesseract](https://tesseract-ocr.github.io/) with<br>
``sudo apt install tesseract-ocr``
2. Install [Poppler](https://poppler.freedesktop.org/)<br>
``sudo apt install poppler-utils``

To run the automatic classifier, please run<br>
``python pre-classify.py -I data/scraping_results.jsonl data/websites_to_exclude.txt -o data/pre_classified.jsonl``

## Populate the database
Once the data have been analyzed by the pre-classifier, you must use its output to populate a database that will be used by the classifier. To do so, please run<br>
``python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json``

If you want to split the data into sub-databases, so that each person has their own data to classify, you can run<br>
``python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json -O N``<br>
where ``N`` is the number of files that you want to generate.
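
How ``write-database.py`` distributes the records among the ``N`` files is not shown here. As one possibility, a round-robin split gives files whose sizes differ by at most one record; the function name below is hypothetical.

```python
# Illustrative only: split_round_robin is a hypothetical helper and may not
# match what write-database.py does.
def split_round_robin(records, n):
    """Distribute records over n lists, one record at a time in turn."""
    if n < 1:
        raise ValueError('n must be at least 1')
    return [records[i::n] for i in range(n)]


if __name__ == '__main__':
    for part in split_round_robin(list(range(7)), 3):
        print(part)
```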

Note: you cannot overwrite the database once it has been created; you can only update it, unless you use the specific commands of [Flata](https://github.com/harryho/flata). If you want to start over, you must delete the written files and re-run the script.

## Run the classifier
1. Each time, remember to activate the conda environment created during setup with ``conda activate iris-vpm-pages-classifier``
2. Run ``python classify.py -i data/database.json``

## Acknowledgements
The authors thank the EuroTech Universities Alliance for sponsoring this work. Carlo Bottai was supported by the European Union's Marie Skłodowska-Curie programme for the project Insights on the "Real Impact" of Science (H2020 MSCA-COFUND-2016 Action, Grant Agreement No 754462).
247 changes: 247 additions & 0 deletions classify.py
@@ -0,0 +1,247 @@
#!/usr/bin/env python

"""
Tool to help a person classify the scraped VPM pages
into several categories.
It creates a GUI browser that shows the detected pages one at a time.
You can interact with the browser with the mouse and you can also use the
numeric keypad to select one of the categories. Once you have chosen
the right category for a page, the software moves to the next one.
Author: Carlo Bottai
Copyright (c) 2020 - TU/e and EPFL
License: See the LICENSE file.
Date: 2020-10-16
"""

from PyQt5.QtCore import *
from PyQt5.QtWidgets import *
from PyQt5.QtGui import *
from PyQt5.QtWebEngineWidgets import *
import qtawesome as qta
import sys
import webbrowser
from flata import Flata, Query, JSONStorage
import requests
from iris_utils.parse_args import parse_io


USER_AGENT = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) '
              'Gecko/2009021910 Firefox/3.0.7')


class MainWindow(QMainWindow):
    def __init__(self, *args, **kwargs):
        super(MainWindow, self).__init__(*args, **kwargs)

        args = parse_io()
        self.f_in = args.input

        self.read_data()

        self.view = QWebEngineView()
        self.view.settings() \
            .setAttribute(QWebEngineSettings.PluginsEnabled, True)
        self.setCentralWidget(self.view)

        self.status = QStatusBar()
        self.setStatusBar(self.status)

        navtb = QToolBar('Navigation')
        self.addToolBar(navtb)

        back_btn = QAction(qta.icon('fa5s.arrow-left'), 'Back', self)
        back_btn.triggered.connect(lambda: self.view.back())
        navtb.addAction(back_btn)

        next_btn = QAction(qta.icon('fa5s.arrow-right'), 'Forward', self)
        next_btn.triggered.connect(lambda: self.view.forward())
        navtb.addAction(next_btn)

        navtb.addSeparator()

        self.urlbar = QLineEdit()
        self.urlbar.returnPressed.connect(self.go_to_url)
        navtb.addWidget(self.urlbar)

        navtb.addSeparator()

        reload_btn = QAction(qta.icon('fa5s.redo'), 'Reload', self)
        reload_btn.triggered.connect(lambda: self.view.reload())
        navtb.addAction(reload_btn)

        stop_btn = QAction(qta.icon('fa5s.stop'), 'Stop', self)
        stop_btn.triggered.connect(lambda: self.view.stop())
        navtb.addAction(stop_btn)

        open_btn = QAction(
            qta.icon('fa5s.external-link-square-alt'), 'Open', self)
        open_btn.triggered.connect(lambda: \
            webbrowser.open_new_tab(self.urlbar.text()))
        navtb.addAction(open_btn)

        labtb = QToolBar('Labeling')
        self.addToolBar(Qt.RightToolBarArea, labtb)

        for name, idx in [
                ('VPM page | True patent-product link', 1),
                ('Brochure or description of the product | True patent-product link', 2),
                ('Hybrid document | True patent-product link', 3),
                ('List of patents or metadata of a patent | False patent-product link', 4),
                ('A scientific publication | False patent-product link', 5),
                ('News about the patent | False patent-product link', 6),
                ('CV/resume | False patent-product link', 7),
                ('Something else in a website to keep | False patent-product link', 8),
                ('Something else in a website to exclude | False patent-product link', 9),
                ('The document is unreachable | False patent-product link', 0)]:
            label = QAction(f'{name} ({idx})', self)
            label.setShortcut(str(idx))
            label.triggered.connect(lambda checked, lbl=name: self.label_page(lbl))
            labtb.addAction(label)

        #labtb.addSeparator()

        urls_len_lbl = f'{self.data_to_classify_len} URLs left to classify'
        self.status.showMessage(urls_len_lbl)

        self.open_next_page()

        self.show()

        self.setWindowTitle('VPM pages handmade classifier')

    def read_data(self):
        DB = Flata(self.f_in, storage=JSONStorage)
        self.database = DB.table('iris_vpm_pages_classifier')

        to_classify = \
            (Query().vpm_page_classification==None) & \
            (Query().vpm_page!=None)
        self.data_to_classify = iter(self.database.search(to_classify))
        self.data_to_classify_len = self.database.count(to_classify)

    def go_to_url(self, url=None):
        if url is None:
            url = self.urlbar.text()
        else:
            self.urlbar.setText(url)
            self.urlbar.setCursorPosition(0)

        # Best-effort HEAD request to warn the user when the document
        # will probably be downloaded instead of displayed
        try:
            response = requests.head(
                url,
                headers={'User-Agent': USER_AGENT},
                verify=False,
                allow_redirects=True,
                timeout=10)
            headers = response.headers
            content_type = headers['Content-Type']
            if 'Content-Disposition' in headers:
                content_disposition = headers['Content-Disposition']
            else:
                content_disposition = ''
            if not (content_type.startswith('text/html') or \
                    content_type.startswith('application/pdf') or \
                    content_type.startswith('text/plain')) or \
                    content_disposition.startswith('attachment'):
                self.msgBox = QMessageBox.about(
                    self,
                    'Additional information (DOWNLOAD)',
                    ('The next document may need to be downloaded.\n'
                     'If you do not see the page changing, try to '
                     'open the page in a browser by clicking on '
                     'the appropriate button'))
        except Exception:
            # If the check fails for any reason, load the URL anyway
            pass

        url = QUrl(url)

        if url.scheme() == '':
            url.setScheme('https')

        self.view.setUrl(url)

    def open_next_page(self):
        try:
            # Skip the pages that have already been classified
            self.current_data = next(self.data_to_classify)
            while self.current_data['vpm_page_classification']:
                self.current_data = next(self.data_to_classify)

            INFO_MSG = {
                'COPYRIGHT':
                    ('The information about the patent(s) has been '
                     'detected close to the copyright information '
                     'at the bottom of the document.\n'
                     'Please, confirm whether or not there is a link '
                     'between a patent and a product in this document'),
                'NOCORPUS':
                    ('No information about any of the patents has been '
                     'detected in the document.\nPlease, confirm whether '
                     'or not there is a link between a patent and a '
                     'product in this document'),
                'NOCORPUS+IMG':
                    ('The only information about the patent(s) '
                     'has been detected in one of the pictures '
                     'of the document.\nPlease, confirm whether '
                     'or not there is a link between a patent '
                     'and a product in this document'),
                'NOCORPUS+PATNUMINURL':
                    ('The only information about the patent(s) '
                     'has been detected in the URL '
                     'of the document.\nPlease, confirm whether '
                     'or not there is a link between a patent '
                     'and a product in this document')}
            vpm_page_automatic_classification = self.current_data[
                'vpm_page_automatic_classification']
            vpm_page_automatic_classification_info = \
                vpm_page_automatic_classification \
                    .split(' | ')[1]
            if vpm_page_automatic_classification_info in INFO_MSG:
                vpm_page_automatic_classification_msg = INFO_MSG[
                    vpm_page_automatic_classification_info]
                self.msgBox = QMessageBox.about(
                    self,
                    f'Additional information ({vpm_page_automatic_classification_info})',
                    vpm_page_automatic_classification_msg)

            print('\n+++++++++++++++++++++++++++')
            print(f"Patent assignee: {self.current_data['patent_assignee']}")
            try:
                print(f"Award recipient: {self.current_data['award_recipient']}")
            except Exception:
                pass
            print(f"Patents: {self.current_data['patent_id']}")
            print('+++++++++++++++++++++++++++\n')

            url = self.current_data['vpm_page']
            self.go_to_url(url)

        except Exception:
            # Reached when the iterator is exhausted (StopIteration);
            # note that any other error also ends the session here
            print('\n+++++++++++++++++++++++++++')
            print('No other pages left. Well done!')
            print('+++++++++++++++++++++++++++\n')
            self.close()

    def label_page(self, label):
        updated_info = self.database.update(
            {'vpm_page_classification': label},
            Query().vpm_page==self.current_data['vpm_page'])
        updated_ids = updated_info[0]

        # Reduce the number of pages left by the number of records
        # just updated and show this information in the status bar
        self.data_to_classify_len -= len(updated_ids)
        urls_len_lbl = f'{self.data_to_classify_len} URLs left to classify'
        self.status.showMessage(urls_len_lbl)

        self.open_next_page()
        self.update()

if __name__ == "__main__":
    app = QApplication(sys.argv)
    app.setApplicationName('VPM pages handmade classifier')
    window = MainWindow()
    app.exec_()

Empty file added data/test_data.jsonl
Empty file.