Captcha forms #13

Open
lopuhin opened this issue May 12, 2016 · 4 comments

Comments

@lopuhin
Contributor

lopuhin commented May 12, 2016

Right now Formasaurus does not seem to support captcha forms: forms with a single text input and a captcha image, designed to block crawlers. It has no such form type, and when applied to such forms (I have tried only two so far) it does not detect the captcha field correctly.
Are such forms in scope for the library? What is a reasonable number of such forms to include in the training dataset?

@kmike
Contributor

kmike commented May 12, 2016

Hey,

Yeah, I also noticed that; there are no such forms in the training data - I was using a sample from Alexa top 1M, and captchas were on registration, password recovery, or comment pages. I don't know how many examples are needed, but even a few examples could likely help. There are ~90 captcha examples in the dataset; some of them are from duplicate websites.

I'm not sure it makes sense to add a "captcha form" form type, but it depends on the number of examples available. We can add this form type, but convert it to 'other' here.

Dedicated captcha forms are likely to come from site protection services; it should be fine to overfit on them - even if the model just remembers the exact form, it will help, because there are not many such services.
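
A minimal sketch of the "annotate as captcha, but convert to 'other'" idea. The mapping and function names below are hypothetical and only illustrate the label remapping; they are not Formasaurus' actual configuration or API.

```python
# Hypothetical alias table: annotation labels that should be merged into
# another class before the form type classifier is trained.
FORM_TYPE_ALIASES = {"captcha": "other"}


def collapse_form_types(labels, aliases=FORM_TYPE_ALIASES):
    """Return a copy of `labels` with aliased form types replaced.

    Labels that are not in `aliases` are passed through unchanged.
    """
    return [aliases.get(label, label) for label in labels]


# Assumed example annotations, not taken from the real dataset.
annotated = ["login", "captcha", "search", "captcha"]
training_labels = collapse_form_types(annotated)
```

Keeping the dedicated label in the annotations means the merge can be dropped later, once enough examples are available, without re-annotating anything.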

@kmike
Contributor

kmike commented May 18, 2016

Another idea: there are two ways to train the field type classifier, and both of them are implemented (https://github.com/TeamHG-Memex/Formasaurus/blob/b2026d89f15620002586b78af998a19615844e0e/formasaurus/fieldtype_model.py).

The field type classifier uses form type classification results to increase accuracy. This means we need form types for training. We can either pass the true form type classes, or predict them using the form type classifier (trained on the other part of the dataset, using cross_val_predict). The first approach is faster, and it gave better results. The second option lets the field type classifier take into account the errors the form classification model makes.

Captcha forms may be a pathological case here - the form types are detected incorrectly, but the field type detector relies too much on the form type. So it could make sense to try use_precise_form_types=False; it may help a bit.
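
The difference between the two options can be sketched in plain Python. This is an illustration only: the toy classifier, the hand-rolled fold split (a stand-in for scikit-learn's cross_val_predict), the form "signatures" and the function names are all assumptions, and only the use_precise_form_types flag name comes from the comment above.

```python
from collections import Counter


def fit_toy_classifier(signatures, labels):
    """Toy stand-in for a form type model: majority label per signature."""
    by_signature = {}
    for sig, label in zip(signatures, labels):
        by_signature.setdefault(sig, Counter())[label] += 1
    # Unseen signatures fall back to the most common training label.
    fallback = Counter(labels).most_common(1)[0][0]

    def predict(sig):
        counts = by_signature.get(sig)
        return counts.most_common(1)[0][0] if counts else fallback

    return predict


def out_of_fold_predictions(signatures, labels, n_folds=3):
    """Predict each form's type with a model that never saw that form."""
    predicted = [None] * len(labels)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(labels)) if i % n_folds != fold]
        test_idx = [i for i in range(len(labels)) if i % n_folds == fold]
        predict = fit_toy_classifier(
            [signatures[i] for i in train_idx],
            [labels[i] for i in train_idx],
        )
        for i in test_idx:
            predicted[i] = predict(signatures[i])
    return predicted


def form_types_for_field_training(signatures, labels,
                                  use_precise_form_types=True):
    """Choose the form types fed to the field type classifier in training."""
    if use_precise_form_types:
        return list(labels)  # option 1: true annotated form types
    return out_of_fold_predictions(signatures, labels)  # option 2


# Assumed toy data: one captcha form among login and search forms.
signatures = ["user+pass", "user+pass", "query", "query",
              "text+image", "user+pass", "query"]
labels = ["login", "login", "search", "search",
          "captcha", "login", "search"]

precise = form_types_for_field_training(signatures, labels, True)
noisy = form_types_for_field_training(signatures, labels, False)
```

With the second option, a rare form such as the single captcha example here is labelled by a model that never saw it, so the field type classifier is trained against the same kind of form type mistakes it will meet at prediction time.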

@lopuhin
Contributor Author

lopuhin commented May 18, 2016

Aha, it's an interesting idea!

I've collected about 6 examples of captcha forms where the captcha is the only field. I also have more login forms with captchas, and want to add at least one that is classified incorrectly (it's almost ready), but I can easily add the others too. There are maybe 10-20 of them. Are they worth adding, or will they make the dataset less balanced?

@kmike
Contributor

kmike commented May 18, 2016

+1 to add these examples
