Add code in its RC stage

n3ssuno · Aug 24, 2022 · 8f311b1 · 8f311b1
commit 8f311b1

Showing 9 changed files with 2,545 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,7 @@
+data/*
+files/*
+!data/test_data.jsonl
+
+.vscode/
+bak/
+test_scripts/
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2020 TU/e and EPFL
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,63 @@
+# IRIS Virtual Patent Marking Pages Classifier
+A tool that helps a person classify a list of potential virtual patent marking (VPM) pages into several categories. Part of the IRIS project.
+
+The classifier is written in Python, using the PyQt5 library.
+
+It opens a GUI browser that shows the detected pages one at a time.
+
+You can interact with the browser with the mouse, and you can also use the numeric keypad to select one of the categories.
+
+Once you have chosen the right category for a page, the software moves to the next page.
+
+## Set up the classifier
+The recommended setup is as follows:
+1. Install [Git](https://git-scm.com/)
+2. Clone this repository with ``git clone https://gitlab.tue.nl/iris/iris-vpm-pages-classifier.git``
+3. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
+4. Create an environment with
+    * ``conda create -n iris-vpm-pages-classifier python=3.9``
+	* ``conda activate iris-vpm-pages-classifier``
+	* ``pip install -r requirements.txt``
+	* ``pip install git+https://gitlab.tue.nl/iris/iris-utils.git``
+5. If you need to use the pre-classifier, you must also install a headless browser with the following command<br>
+	``playwright install chromium``<br>
+	Note: the code has been tested with Chromium v857950, but the latest version of the browser will be installed
+
+### GUI classifier on WSL2
+1. Install ``qt5-default`` on the WSL2 distro
+2. Install X410 on Windows (the free alternatives did not work for me) and select ``Allow Public Access`` from its menu
+3. Add the following line to the ``~/.bashrc`` file of the WSL2 distro (before the block of code that Conda adds)<br>
+``export DISPLAY=$(awk '/nameserver / {print $2; exit}' /etc/resolv.conf 2>/dev/null):0.0``<br>
+Do not add ``export LIBGL_ALWAYS_INDIRECT=1``, even though many online guides advise it.
+
+## Pre-processing
+Before you start to classify the pages by hand, you must run ``pre-classify.py`` to automatically classify some pages.
+This script will create a file with five main categories: cases that are (a) very likely true positives; (b) very likely false positives; (c) maybe positive; (d) maybe negative; (e) unknown.
+
+The first two cases (a and b) are classified automatically. For the next two (c and d), a hint is provided and you must decide whether the page is actually a VPM page. The last case (e) is left to you, without any hint.
+
+The pre-classifier depends on software that is easy to install on GNU/Linux but hard to set up on MS Windows. The advice is, therefore, to use a GNU/Linux machine (the instructions that follow are for Debian GNU/Linux) or WSL2 (running the GUI classifier from WSL2 is not trivial but possible; follow the instructions in the "GUI classifier on WSL2" section above).
+1. Install [Tesseract](https://tesseract-ocr.github.io/) with<br>
+``sudo apt install tesseract-ocr``
+2. Install [Poppler](https://poppler.freedesktop.org/)<br>
+``sudo apt install poppler-utils``
+
+To run the automatic classifier, please run<br>
+``python pre-classify.py -I data/scraping_results.jsonl data/websites_to_exclude.txt -o data/pre_classified.jsonl``
+
+## Populate the database
+Once the data have been analyzed by the pre-classifier, you must use its output to populate a database that will be used by the classifier. To do so, please run<br>
+``python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json``
+
+If you want to split the data into sub-databases, so that several people can each have their own data to classify, you can run<br>
+``python write-database.py -I data/scraping_results.jsonl data/pre_classified.jsonl -o data/database.json -O N``<br>
+where ``N`` is the number of files that you want to generate.
+
+Note: once created, the database cannot be overwritten, only updated (unless you use the specific commands of [Flata](https://github.com/harryho/flata)). To rebuild it from scratch, delete the written files and re-run the script.
+
+## Run the classifier
+1. Each time, remember to activate the conda environment created in the setup phase with ``conda activate iris-vpm-pages-classifier``
+2. Run ``python classify.py -i data/database.json``
+
+## Acknowledgements
+The authors thank the EuroTech Universities Alliance for sponsoring this work. Carlo Bottai was supported by the European Union's Marie Skłodowska-Curie programme for the project Insights on the "Real Impact" of Science (H2020 MSCA-COFUND-2016 Action, Grant Agreement No 754462).
diff --git a/classify.py b/classify.py
@@ -0,0 +1,247 @@
+#!/usr/bin/env python
+
+"""
+Tool that helps a person classify the scraped VPM pages
+ into several categories
+
+It opens a GUI browser that shows the detected pages one at a time.
+ You can interact with the browser with the mouse, and you can also use the
+ numeric keypad to select one of the categories. Once you have chosen the
+ right category for a page, the software moves to the next one.
+
+Author: Carlo Bottai
+Copyright (c) 2020 - TU/e and EPFL
+License: See the LICENSE file.
+Date: 2020-10-16
+
+"""
+
+from PyQt5.QtCore import Qt, QUrl
+from PyQt5.QtWidgets import (QAction, QApplication, QLineEdit, QMainWindow,
+                             QMessageBox, QStatusBar, QToolBar)
+from PyQt5.QtWebEngineWidgets import QWebEngineSettings, QWebEngineView
+import qtawesome as qta
+import sys
+import webbrowser
+from flata import Flata, Query, JSONStorage
+import requests
+from iris_utils.parse_args import parse_io
+
+
+USER_AGENT = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) '
+              'Gecko/2009021910 Firefox/3.0.7')
+
+
+class MainWindow(QMainWindow):
+    def __init__(self, *args, **kwargs):
+        super(MainWindow, self).__init__(*args, **kwargs)
+
+        args = parse_io()
+        self.f_in = args.input
+
+        self.read_data()
+
+        self.view = QWebEngineView()
+        self.view.settings() \
+            .setAttribute(QWebEngineSettings.PluginsEnabled, True)
+        self.setCentralWidget(self.view)
+
+        self.status = QStatusBar()
+        self.setStatusBar(self.status)
+
+        navtb = QToolBar('Navigation')
+        self.addToolBar(navtb)
+
+        back_btn = QAction(qta.icon('fa5s.arrow-left'), 'Back', self)
+        back_btn.triggered.connect(lambda: self.view.back())
+        navtb.addAction(back_btn)
+
+        next_btn = QAction(qta.icon('fa5s.arrow-right'), 'Forward', self)
+        next_btn.triggered.connect(lambda: self.view.forward())
+        navtb.addAction(next_btn)
+
+        navtb.addSeparator()
+
+        self.urlbar = QLineEdit()
+        self.urlbar.returnPressed.connect(self.go_to_url)
+        navtb.addWidget(self.urlbar)
+
+        navtb.addSeparator()
+
+        reload_btn = QAction(qta.icon('fa5s.redo'), 'Reload', self)
+        reload_btn.triggered.connect(lambda: self.view.reload())
+        navtb.addAction(reload_btn)
+
+        stop_btn = QAction(qta.icon('fa5s.stop'), 'Stop', self)
+        stop_btn.triggered.connect(lambda: self.view.stop())
+        navtb.addAction(stop_btn)
+
+        open_btn = QAction(
+            qta.icon('fa5s.external-link-square-alt'), 'Open', self)
+        open_btn.triggered.connect(lambda: \
+            webbrowser.open_new_tab(self.urlbar.text()))
+        navtb.addAction(open_btn)
+
+        labtb = QToolBar('Labeling')
+        self.addToolBar(Qt.RightToolBarArea, labtb)
+
+        for name, idx in [
+                ('VPM page | True patent-product link', 1),
+                ('Brochure or description of the product | True patent-product link', 2),
+                ('Hybrid document | True patent-product link', 3),
+                ('List of patents or metadata of a patent | False patent-product link', 4),
+                ('A scientific publication | False patent-product link', 5),
+                ('News about the patent | False patent-product link', 6),
+                ('CV/resume | False patent-product link', 7),
+                ('Something else in a website to keep | False patent-product link', 8),
+                ('Something else in a website to exclude | False patent-product link', 9),
+                ('The document is unreachable | False patent-product link', 0)]:
+            label = QAction(f'{name} ({idx})', self)
+            label.setShortcut(str(idx))
+            label.triggered.connect(lambda checked, lbl=name: self.label_page(lbl))
+            labtb.addAction(label)
+
+        #labtb.addSeparator()
+
+        urls_len_lbl = f'{self.data_to_classify_len} URLs left to classify'
+        self.status.showMessage(urls_len_lbl)
+
+        self.open_next_page()
+
+        self.show()
+
+        self.setWindowTitle('VPM pages handmade classifier')
+
+    def read_data(self):
+        DB = Flata(self.f_in, storage=JSONStorage)
+        self.database = DB.table('iris_vpm_pages_classifier')
+
+        to_classify = \
+            (Query().vpm_page_classification==None) & \
+            (Query().vpm_page!=None)
+        self.data_to_classify = iter(self.database.search(to_classify))
+        self.data_to_classify_len = self.database.count(to_classify)
+
+    def go_to_url(self, url=None):
+        if url is None:
+            url = self.urlbar.text()
+        else:
+            self.urlbar.setText(url)
+            self.urlbar.setCursorPosition(0)
+
+        try:
+            response = requests.head(
+                url, 
+                headers={'User-Agent': USER_AGENT}, 
+                verify=False, 
+                allow_redirects=True,
+                timeout=10)
+            headers = response.headers
+            content_type = headers['Content-Type']
+            if 'Content-Disposition' in headers:
+                content_disposition = headers['Content-Disposition']
+            else:
+                content_disposition = ''
+            if not (content_type.startswith('text/html') or \
+                    content_type.startswith('application/pdf') or \
+                    content_type.startswith('text/plain')) or \
+               content_disposition.startswith('attachment'):
+                QMessageBox.about(
+                    self,
+                    'Additional information (DOWNLOAD)',
+                    ('The next document may need to be downloaded.\n'
+                     'If you do not see the page changing, try to '
+                     'open the page in a browser by clicking on '
+                     'the appropriate button'))
+        except Exception:
+            pass  # best-effort pre-check: ignore network and header errors
+
+        url = QUrl(url)
+
+        if url.scheme() == '':
+            url.setScheme('https')
+
+        self.view.setUrl(url)
+
+    def open_next_page(self):
+        try:
+            self.current_data = next(self.data_to_classify)
+            while self.current_data['vpm_page_classification']:
+                self.current_data = next(self.data_to_classify)
+
+            INFO_MSG = {
+                'COPYRIGHT': 
+                    ('The information about the patent(s) has been '
+                     'detected close to the copyright information '
+                     'at the bottom of the document.\n'
+                     'Please, confirm whether or not there is a link '
+                     'between a patent and a product in this document'),
+                'NOCORPUS': 
+                    ('No information about any of the patents has been '
+                     'detected in the document.\nPlease, confirm whether '
+                     'or not there is a link between a patent and a '
+                     'product in this document'),
+                'NOCORPUS+IMG': 
+                    ('The only information about the patent(s) '
+                     'has been detected in one of the pictures '
+                     'of the document.\nPlease, confirm whether '
+                     'or not there is a link between a patent '
+                     'and a product in this document'),
+                'NOCORPUS+PATNUMINURL': 
+                    ('The only information about the patent(s) '
+                     'has been detected in the URL '
+                     'of the document.\nPlease, confirm whether '
+                     'or not there is a link between a patent '
+                     'and a product in this document')}
+            vpm_page_automatic_classification = self.current_data[
+                'vpm_page_automatic_classification']
+            vpm_page_automatic_classification_info = \
+                vpm_page_automatic_classification \
+                    .split(' | ')[1]
+            if vpm_page_automatic_classification_info in INFO_MSG:
+                vpm_page_automatic_classification_msg = INFO_MSG[
+                    vpm_page_automatic_classification_info]
+                QMessageBox.about(
+                    self,
+                    f'Additional information ({vpm_page_automatic_classification_info})',
+                    vpm_page_automatic_classification_msg)
+
+            print('\n+++++++++++++++++++++++++++')
+            print(f"Patent assignee: {self.current_data['patent_assignee']}")
+            try:
+                print(f"Award recipient: {self.current_data['award_recipient']}")
+            except Exception:
+                pass
+            print(f"Patents: {self.current_data['patent_id']}")
+            print('+++++++++++++++++++++++++++\n')
+
+            url = self.current_data['vpm_page']
+            self.go_to_url(url)
+
+        except StopIteration:
+            print('\n+++++++++++++++++++++++++++')
+            print('No other pages left. Well done!')
+            print('+++++++++++++++++++++++++++\n')
+            self.close()
+
+    def label_page(self, label):
+        updated_info = self.database.update(
+            {'vpm_page_classification': label}, 
+            Query().vpm_page==self.current_data['vpm_page'])
+        updated_ids = updated_info[0]
+
+        # Reduce the number of pages left by the number of updated records
+        #   and show this information in the status bar
+        self.data_to_classify_len -= len(updated_ids)
+        urls_len_lbl = f'{self.data_to_classify_len} URLs left to classify'
+        self.status.showMessage(urls_len_lbl)
+
+        self.open_next_page()
+        self.update()
+
+if __name__ == "__main__":
+    app = QApplication(sys.argv)
+    app.setApplicationName('VPM pages handmade classifier')
+    window = MainWindow()
+    app.exec_()
+
diff --git a/data/test_data.jsonl b/data/test_data.jsonl