Captcha forms #13
Comments
Hey, yeah, I also noticed that; there are no such forms in the training data. I was using a sample from the Alexa top 1M, and captchas were on registration, password recovery, or comment pages. I don't know how many examples are needed, but likely a few examples could help. There are ~90 captcha examples in the dataset; some of them are from duplicate websites. I'm not sure it makes sense to add a "captcha form" form type, but it depends on the number of examples available. We can add this form type, but convert it to 'other' here. Dedicated captcha forms are likely to be from site protection services; it should be fine to overfit on them - even if the model just remembers the exact form, it will help, because there are not many such services.
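To illustrate the "annotate as captcha, train as 'other'" idea from this comment, here is a minimal sketch. The label strings and the `collapse_labels` helper are hypothetical names made up for the example; they are not Formasaurus identifiers, and the real annotation codes may differ.

```python
from collections import Counter

# Hypothetical label names, chosen for illustration only.
CAPTCHA = "captcha"
OTHER = "other"


def collapse_labels(form_types, mapping=None):
    """Return a new list where annotation-only labels are remapped.

    By default only the captcha label is folded into 'other';
    every other label is passed through unchanged.
    """
    if mapping is None:
        mapping = {CAPTCHA: OTHER}
    return [mapping.get(label, label) for label in form_types]


# Labels as they might be stored in the annotated dataset.
annotated = ["login", "captcha", "search", "captcha", "other"]

# Labels as the form type classifier would see them during training.
training_labels = collapse_labels(annotated)
print(Counter(training_labels))
```

Keeping the separate label in the annotations means the decision can be revisited later, once more dedicated captcha forms are collected, without re-annotating anything.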
Another idea: there are two ways to train the field type classifier, and both of them are implemented (https://github.com/TeamHG-Memex/Formasaurus/blob/b2026d89f15620002586b78af998a19615844e0e/formasaurus/fieldtype_model.py). The field type classifier uses form type classification results to increase accuracy, which means we need form types for training. We can either pass the true form type classes, or predict them using the form type classifier (trained on the other part of the dataset, using cross_val_predict). The first approach is faster, and it provided better results. The second option allows the field type classifier to take into account the errors the form type classification model makes. Captcha forms may be a pathological case here - forms are detected incorrectly, but the field type detector relies too much on the form type. So it could make sense to try using
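The difference between the two training options in this comment can be sketched with a toy example. This is not the Formasaurus code: the "classifier" is a lookup table, the features and labels are made up, and `out_of_fold_predictions` is a hypothetical pure-Python stand-in for what scikit-learn's `cross_val_predict` does (each sample is predicted by a model that did not see it during training).

```python
from collections import Counter, defaultdict


def fit_form_type_model(samples, labels):
    """Toy form type classifier: map each feature value to its most
    common label, falling back to the overall majority label."""
    by_feature = defaultdict(Counter)
    for x, y in zip(samples, labels):
        by_feature[x][y] += 1
    default = Counter(labels).most_common(1)[0][0]
    table = {x: c.most_common(1)[0][0] for x, c in by_feature.items()}
    return lambda x: table.get(x, default)


def out_of_fold_predictions(samples, labels, n_folds=3):
    """Predict every sample with a model trained on the other folds."""
    predicted = [None] * len(samples)
    for fold in range(n_folds):
        train = [i for i in range(len(samples)) if i % n_folds != fold]
        test = [i for i in range(len(samples)) if i % n_folds == fold]
        model = fit_form_type_model(
            [samples[i] for i in train], [labels[i] for i in train]
        )
        for i in test:
            predicted[i] = model(samples[i])
    return predicted


# Made-up data: one crude feature per form, plus the annotated form type.
forms = ["has_password", "has_password", "has_query", "has_query",
         "has_image", "has_password", "has_query"]
true_types = ["login", "login", "search", "search",
              "captcha", "login", "search"]

# Option 1: the field type model is trained with the true form types.
option1_inputs = list(zip(forms, true_types))

# Option 2: it is trained with out-of-fold predicted form types, so it
# also sees the mistakes the form type model makes on unseen forms.
predicted_types = out_of_fold_predictions(forms, true_types)
option2_inputs = list(zip(forms, predicted_types))

for form, true, pred in zip(forms, true_types, predicted_types):
    print(form, true, pred)
```

In this toy data the single captcha form is never in its own training fold, so its predicted form type is wrong. With option 2 the field type model gets to train on that wrong form type, which is the kind of error it would otherwise only meet at prediction time.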
Aha, it's an interesting idea! I've collected about 6 examples of captcha forms where the captcha is the only field. I also have more login forms with captchas, and want to add at least one that is classified incorrectly (it's almost ready), but I can easily add the others too. There are maybe 10-20 of them. Are they worth adding, or will they make the dataset less balanced?
+1 to adding these examples
Right now Formasaurus does not seem to support captcha forms: forms with a single text input and a captcha image that are designed to block crawlers. It does not have such a form type, and when applied to such forms (I have tried only two so far) it does not detect the captcha field correctly.
Are such forms in scope for the library? What is a reasonable number of such forms to include in the training dataset?