0 parents
commit 8f311b1
Showing 9 changed files with 2,545 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
data/* | ||
files/* | ||
!data/test_data.jsonl | ||
|
||
.vscode/ | ||
bak/ | ||
test_scripts/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2020 TU/e and EPFL | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# IRIS Virtual Patent Marking Pages Classifier | ||
A tool that helps a human classify a list of potential VPM pages into several possible categories. Part of the IRIS project. | ||
|
||
The classifier is written in Python, using the PyQt5 library. | ||
|
||
It opens a GUI browser that shows the detected pages one at a time. | ||
|
||
You can interact with the browser with the mouse, and you can also use the numeric keypad to select one of the categories. | ||
|
||
Once you have chosen the right category for a page the software moves to the next page. | ||
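
The key-to-category mapping can be sketched as a plain dictionary. The label strings below are copied from ``classify.py`` in this commit; the names ``LABELS`` and ``parse_label`` are hypothetical helpers introduced only for illustration, and the idea of splitting a label at `` | `` is an assumption about how the two parts might be used downstream.

```python
# Digit key -> label, as listed in classify.py.
LABELS = {
    1: 'VPM page | True patent-product link',
    2: 'Brochure or description of the product | True patent-product link',
    3: 'Hybrid document | True patent-product link',
    4: 'List of patents or metadata of a patent | False patent-product link',
    5: 'A scientific publication | False patent-product link',
    6: 'News about the patent | False patent-product link',
    7: 'CV/resume | False patent-product link',
    8: 'Something else in a website to keep | False patent-product link',
    9: 'Something else in a website to exclude | False patent-product link',
    0: 'The document is unreachable | False patent-product link',
}


def parse_label(label):
    """Split a label into (page type, is there a true patent-product link)."""
    page_type, link = label.split(' | ')
    return page_type, link.startswith('True')


if __name__ == '__main__':
    for key in sorted(LABELS):
        print(key, parse_label(LABELS[key]))
```

Keys 1 to 3 mark a true patent-product link; the remaining keys mark a false one.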
|
||
## Set up the classifier | ||
The recommended steps are: | ||
1. Install [Git](https://git-scm.com/) | ||
2. Clone this repository with ``git clone https://gitlab.tue.nl/iris/iris-vpm-pages-classifier.git`` | ||
3. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html) | ||
4. Create an environment with | ||
* ``conda create -n iris-vpm-pages-classifier python=3.9`` | ||
* ``conda activate iris-vpm-pages-classifier`` | ||
* ``pip install -r requirements.txt`` | ||
* ``pip install git+https://gitlab.tue.nl/iris/iris-utils.git`` | ||
5. If you need to use the pre-classifier, you must also install a headless browser with the following command<br> | ||
``playwright install chromium``<br> | ||
Note: the code has been tested with Chromium v857950, but the latest version of the browser will be installed | ||
|
||
### GUI classifier on WSL2 | ||
1. Install ``qt5-default`` on the WSL2 distro | ||
2. Install X410 on Windows (the free alternatives did not work for me) and select ``Allow Public Access`` from its menu | ||
3. Add the following line to the ``~/.bashrc`` file of the WSL2 distro (before the block of code about Conda)<br> | ||
``export DISPLAY=$(awk '/nameserver / {print $2; exit}' /etc/resolv.conf 2>/dev/null):0.0``<br> | ||
Do not add ``export LIBGL_ALWAYS_INDIRECT=1``, contrary to the advice in many online guides. | ||
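
To make the ``DISPLAY`` line less opaque, here is a minimal Python sketch of what the ``awk`` command computes: it takes the address on the first ``nameserver`` line of ``/etc/resolv.conf`` and appends ``:0.0``. The function names are hypothetical, and the sketch is slightly stricter than the ``awk`` pattern because it only accepts lines that start with ``nameserver``.

```python
def first_nameserver(resolv_conf_text):
    """Return the address on the first 'nameserver' line, or None."""
    for line in resolv_conf_text.splitlines():
        if line.startswith('nameserver '):
            fields = line.split()
            # fields[0] is 'nameserver', fields[1] is the address
            return fields[1] if len(fields) >= 2 else None
    return None


def display_value(resolv_conf_text):
    """Build the DISPLAY value (host:0.0), or None if no nameserver is found."""
    host = first_nameserver(resolv_conf_text)
    return f'{host}:0.0' if host else None


if __name__ == '__main__':
    # Illustrative content only; a real file is read from /etc/resolv.conf
    sample = 'nameserver 172.22.0.1\nnameserver 8.8.8.8\n'
    print(display_value(sample))
```

On WSL2 the first nameserver is normally the address of the Windows host, which is where the X server is listening.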
|
||
## Pre-processing | ||
Before you start to classify the pages by hand, you must run ``pre-classify.py`` to automatically classify some pages. | ||
This script will create a file with five main categories: cases that are (a) very likely true positives; (b) very likely false positives; (c) maybe positive; (d) maybe negative; (e) unknown. | ||
|
||
The first two cases are automatically classified. For the next two (maybe positive and maybe negative), a hint is provided and the person must decide whether the page is actually a VPM page. The last case is left to the person, without any hint. | ||
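
The routing described above can be summarized in a small sketch. The enum and function names are hypothetical and are not taken from ``pre-classify.py``; only the five categories and the way each is handled come from the text.

```python
from enum import Enum


class PreClass(Enum):
    """The five pre-classification outcomes, (a) to (e)."""
    LIKELY_TRUE_POSITIVE = 'a'
    LIKELY_FALSE_POSITIVE = 'b'
    MAYBE_POSITIVE = 'c'
    MAYBE_NEGATIVE = 'd'
    UNKNOWN = 'e'


def handling(category):
    """Return how a page in the given category is processed."""
    if category in (PreClass.LIKELY_TRUE_POSITIVE,
                    PreClass.LIKELY_FALSE_POSITIVE):
        return 'automatic'
    if category in (PreClass.MAYBE_POSITIVE, PreClass.MAYBE_NEGATIVE):
        return 'manual, with hint'
    return 'manual, without hint'


if __name__ == '__main__':
    for category in PreClass:
        print(category.name, '->', handling(category))
```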
|
||
To use it you need some software that is easy to install on GNU/Linux but hard to get on MS-Windows. The advice is, therefore, to use a GNU/Linux machine (the instructions that follow are for Debian GNU/Linux) or WSL2 (running the GUI classifier from WSL2 is not trivial but possible; follow the instructions in the "GUI classifier on WSL2" section above). | ||
1. Install [Tesseract](https://tesseract-ocr.github.io/) with<br> | ||
``sudo apt install tesseract-ocr`` | ||
2. Install [Poppler](https://poppler.freedesktop.org/)<br> | ||
``sudo apt install poppler-utils`` | ||
|
||
To run the automatic classifier, please run<br> | ||
``python pre-classify.py -I data/scraping_results.jsonl data/websites_to_exclude.txt -o data/pre_classified.jsonl`` | ||
|
||
## Populate the database | ||
Once the data have been analyzed by the pre-classifier, you must use its output to populate a database that will be used by the classifier. To do so, please run<br> | ||
``python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json`` | ||
|
||
If you want to split the data into sub-databases, so that several people can each have their own data to classify, you can run<br> | ||
``python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json -O N``<br> | ||
where ``N`` is the number of files that you want to generate. | ||
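
How ``write-database.py`` distributes the records among the ``N`` files is not shown here. As an illustration only, a round-robin split, which keeps the sub-databases within one record of each other in size, could look like the sketch below; ``split_records`` is a hypothetical helper, not a function of the script.

```python
def split_records(records, n):
    """Distribute records over n lists in round-robin order."""
    if n < 1:
        raise ValueError('n must be at least 1')
    # Slice i takes records i, i + n, i + 2n, ...
    return [records[i::n] for i in range(n)]


if __name__ == '__main__':
    parts = split_records(list(range(7)), 3)
    for index, part in enumerate(parts):
        print(index, part)
```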
|
||
Note: you cannot overwrite the database once created (you can only update it, unless you use the specific commands of [Flata](https://github.com/harryho/flata)). If you want to do so, you must delete the written files and re-run the script. | ||
|
||
## Run the classifier | ||
1. Remember, each time, to activate the conda environment created in the setup phase with ``conda activate iris-vpm-pages-classifier`` | ||
2. Run ``python classify.py -i data/database.json`` | ||
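
When it starts, ``classify.py`` selects the records whose ``vpm_page_classification`` is still empty and whose ``vpm_page`` is set. The same filter can be sketched in plain Python, without Flata; the field names come from ``classify.py``, while the function name and the sample records are made up for illustration, and the sketch assumes that every record carries both fields.

```python
def pages_to_classify(records):
    """Keep the records that have a page URL but no manual label yet."""
    return [
        record for record in records
        if record['vpm_page_classification'] is None
        and record['vpm_page'] is not None
    ]


if __name__ == '__main__':
    # Made-up records, only to show the shape of the data
    records = [
        {'vpm_page': 'https://example.com/patents',
         'vpm_page_classification': None},
        {'vpm_page': 'https://example.org/vpm',
         'vpm_page_classification': 'VPM page | True patent-product link'},
        {'vpm_page': None, 'vpm_page_classification': None},
    ]
    todo = pages_to_classify(records)
    print(f'{len(todo)} URLs left to classify')
```

Because labelled pages are excluded by this filter, closing the classifier and running it again resumes from the pages that are still unlabelled.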
|
||
## Acknowledgements | ||
The authors thank the EuroTech Universities Alliance for sponsoring this work. Carlo Bottai was supported by the European Union's Marie Skłodowska-Curie programme for the project Insights on the "Real Impact" of Science (H2020 MSCA-COFUND-2016 Action, Grant Agreement No 754462). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,247 @@ | ||
#!/usr/bin/env python | ||
|
||
""" | ||
Tool to help a human being to classify the scraped VPM pages | ||
into several categories | ||
It creates a GUI browser that shows sequentially one of the detected pages. | ||
You can interact with the browser with the mouse and you can also use the | ||
numerical pad of the keyboard to select one of the categories. Once you | ||
have chosen the right category for a page the software moves to the next. | ||
Author: Carlo Bottai | ||
Copyright (c) 2020 - TU/e and EPFL | ||
License: See the LICENSE file. | ||
Date: 2020-10-16 | ||
""" | ||
|
||
from PyQt5.QtCore import * | ||
from PyQt5.QtWidgets import * | ||
from PyQt5.QtGui import * | ||
from PyQt5.QtWebEngineWidgets import * | ||
import qtawesome as qta | ||
import sys | ||
import webbrowser | ||
from flata import Flata, Query, JSONStorage | ||
import requests | ||
from iris_utils.parse_args import parse_io | ||
|
||
|
||
USER_AGENT = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) ' | ||
              'Gecko/2009021910 Firefox/3.0.7') | ||
|
||
|
||
class MainWindow(QMainWindow): | ||
    def __init__(self, *args, **kwargs): | ||
        super(MainWindow, self).__init__(*args, **kwargs) | ||
|
||
        args = parse_io() | ||
        self.f_in = args.input | ||
|
||
        self.read_data() | ||
|
||
        self.view = QWebEngineView() | ||
        self.view.settings() \ | ||
            .setAttribute(QWebEngineSettings.PluginsEnabled, True) | ||
        self.setCentralWidget(self.view) | ||
|
||
        self.status = QStatusBar() | ||
        self.setStatusBar(self.status) | ||
|
||
        navtb = QToolBar('Navigation') | ||
        self.addToolBar(navtb) | ||
|
||
        back_btn = QAction(qta.icon('fa5s.arrow-left'), 'Back', self) | ||
        back_btn.triggered.connect(lambda: self.view.back()) | ||
        navtb.addAction(back_btn) | ||
|
||
        next_btn = QAction(qta.icon('fa5s.arrow-right'), 'Forward', self) | ||
        next_btn.triggered.connect(lambda: self.view.forward()) | ||
        navtb.addAction(next_btn) | ||
|
||
        navtb.addSeparator() | ||
|
||
        self.urlbar = QLineEdit() | ||
        self.urlbar.returnPressed.connect(self.go_to_url) | ||
        navtb.addWidget(self.urlbar) | ||
|
||
        navtb.addSeparator() | ||
|
||
        reload_btn = QAction(qta.icon('fa5s.redo'), 'Reload', self) | ||
        reload_btn.triggered.connect(lambda: self.view.reload()) | ||
        navtb.addAction(reload_btn) | ||
|
||
        stop_btn = QAction(qta.icon('fa5s.stop'), 'Stop', self) | ||
        stop_btn.triggered.connect(lambda: self.view.stop()) | ||
        navtb.addAction(stop_btn) | ||
|
||
        open_btn = QAction( | ||
            qta.icon('fa5s.external-link-square-alt'), 'Open', self) | ||
        open_btn.triggered.connect(lambda: \ | ||
            webbrowser.open_new_tab(self.urlbar.text())) | ||
        navtb.addAction(open_btn) | ||
|
||
        labtb = QToolBar('Labeling') | ||
        self.addToolBar(Qt.RightToolBarArea, labtb) | ||
|
||
        for name, idx in [ | ||
            ('VPM page | True patent-product link', 1), | ||
            ('Brochure or description of the product | True patent-product link', 2), | ||
            ('Hybrid document | True patent-product link', 3), | ||
            ('List of patents or metadata of a patent | False patent-product link', 4), | ||
            ('A scientific publication | False patent-product link', 5), | ||
            ('News about the patent | False patent-product link', 6), | ||
            ('CV/resume | False patent-product link', 7), | ||
            ('Something else in a website to keep | False patent-product link', 8), | ||
            ('Something else in a website to exclude | False patent-product link', 9), | ||
            ('The document is unreachable | False patent-product link', 0)]: | ||
            label = QAction(f'{name} ({idx})', self) | ||
            label.setShortcut(str(idx)) | ||
            label.triggered.connect(lambda checked, lbl=name: self.label_page(lbl)) | ||
            labtb.addAction(label) | ||
|
||
        #labtb.addSeparator() | ||
|
||
        urls_len_lbl = f'{self.data_to_classify_len} URLs left to classify' | ||
        self.status.showMessage(urls_len_lbl) | ||
|
||
        self.open_next_page() | ||
|
||
        self.show() | ||
|
||
        self.setWindowTitle('VPM pages handmade classifier') | ||
|
||
    def read_data(self): | ||
        DB = Flata(self.f_in, storage=JSONStorage) | ||
        self.database = DB.table('iris_vpm_pages_classifier') | ||
|
||
        to_classify = \ | ||
            (Query().vpm_page_classification==None) & \ | ||
            (Query().vpm_page!=None) | ||
        self.data_to_classify = iter(self.database.search(to_classify)) | ||
        self.data_to_classify_len = self.database.count(to_classify) | ||
|
||
    def go_to_url(self, url=None): | ||
        if url is None: | ||
            url = self.urlbar.text() | ||
        else: | ||
            self.urlbar.setText(url) | ||
            self.urlbar.setCursorPosition(0) | ||
|
||
        try: | ||
            response = requests.head( | ||
                url, | ||
                headers={'User-Agent': USER_AGENT}, | ||
                verify=False, | ||
                allow_redirects=True, | ||
                timeout=10) | ||
            headers = response.headers | ||
            content_type = headers['Content-Type'] | ||
            if 'Content-Disposition' in headers: | ||
                content_disposition = headers['Content-Disposition'] | ||
            else: | ||
                content_disposition = '' | ||
            if not (content_type.startswith('text/html') or \ | ||
                    content_type.startswith('application/pdf') or \ | ||
                    content_type.startswith('text/plain')) or \ | ||
                    content_disposition.startswith('attachment'): | ||
                self.msgBox = QMessageBox.about( | ||
                    self, | ||
                    'Additional information (DOWNLOAD)', | ||
                    ('It is possible that it is needed to download the next ' | ||
                     'document.\nIf you do not see the page changing, try to ' | ||
                     'open the page in a browser by clicking on ' | ||
                     'the appropriate button')) | ||
        except: | ||
            pass | ||
|
||
        url = QUrl(url) | ||
|
||
        if url.scheme() == '': | ||
            url.setScheme('https') | ||
|
||
        self.view.setUrl(url) | ||
|
||
    def open_next_page(self): | ||
        try: | ||
            self.current_data = next(self.data_to_classify) | ||
            while self.current_data['vpm_page_classification']: | ||
                self.current_data = next(self.data_to_classify) | ||
|
||
            INFO_MSG = { | ||
                'COPYRIGHT': | ||
                    ('The information about the patent(s) has been ' | ||
                     'detected close to the copyright information ' | ||
                     'at the bottom of the document.\n' | ||
                     'Please, confirm whether or not there is a link ' | ||
                     'between a patent and a product in this document'), | ||
                'NOCORPUS': | ||
                    ('No information about any of the patents has been ' | ||
                     'detected in the document.\nPlease, confirm whether ' | ||
                     'or not there is a link between a patent and a ' | ||
                     'product in this document'), | ||
                'NOCORPUS+IMG': | ||
                    ('The only information about the patent(s) ' | ||
                     'has been detected in one of the pictures ' | ||
                     'of the document.\nPlease, confirm whether ' | ||
                     'or not there is a link between a patent ' | ||
                     'and a product in this document'), | ||
                'NOCORPUS+PATNUMINURL': | ||
                    ('The only information about the patent(s) ' | ||
                     'has been detected in the URL ' | ||
                     'of the document.\nPlease, confirm whether ' | ||
                     'or not there is a link between a patent ' | ||
                     'and a product in this document')} | ||
            vpm_page_automatic_classification = self.current_data[ | ||
                'vpm_page_automatic_classification'] | ||
            vpm_page_automatic_classification_info = \ | ||
                vpm_page_automatic_classification \ | ||
                .split(' | ')[1] | ||
            if vpm_page_automatic_classification_info in INFO_MSG.keys(): | ||
                vpm_page_automatic_classification_msg = INFO_MSG[ | ||
                    vpm_page_automatic_classification_info] | ||
                self.msgBox = QMessageBox.about( | ||
                    self, | ||
                    f'Additional information ({vpm_page_automatic_classification_info})', | ||
                    vpm_page_automatic_classification_msg) | ||
|
||
            print('\n+++++++++++++++++++++++++++') | ||
            print(f"Patent assignee: {self.current_data['patent_assignee']}") | ||
            try: | ||
                print(f"Award recipient: {self.current_data['award_recipient']}") | ||
            except Exception: | ||
                pass | ||
            print(f"Patents: {self.current_data['patent_id']}") | ||
            print('+++++++++++++++++++++++++++\n') | ||
|
||
            url = self.current_data['vpm_page'] | ||
            self.go_to_url(url) | ||
|
||
        except: | ||
            print('\n+++++++++++++++++++++++++++') | ||
            print('No other pages left. Well done!') | ||
            print('+++++++++++++++++++++++++++\n') | ||
            self.close() | ||
|
||
    def label_page(self, label): | ||
        updated_info = self.database.update( | ||
            {'vpm_page_classification': label}, | ||
            Query().vpm_page==self.current_data['vpm_page']) | ||
        updated_ids = updated_info[0] | ||
|
||
        # Reduce the number of pages left by one | ||
        # and show this information in the status bar | ||
        self.data_to_classify_len -= len(updated_ids) | ||
        urls_len_lbl = f'{self.data_to_classify_len} URLs left to classify' | ||
        self.status.showMessage(urls_len_lbl) | ||
|
||
        self.open_next_page() | ||
        self.update() | ||
|
||
if __name__ == "__main__": | ||
    app = QApplication(sys.argv) | ||
    app.setApplicationName('VPM pages handmade classifier') | ||
    window = MainWindow() | ||
    app.exec_() | ||
|