Problem
PayPal needed a document classifier that required no manual intervention for training, testing, or inference. The goal was a fully autonomous, multilingual, generic document classification system with a quick turnaround time.
The existing systems did not meet the requirements of PayPal's Privacy Team because they required manual intervention for testing and training, which slowed turnaround and increased costs.
Solution
We created a fully autonomous, multilingual, generic document classification system inspired by hashing and similarity matching, concepts borrowed from blockchains and recommender systems. The team consisted of myself as the Machine Learning Engineer, who built the entire pipeline end to end; a Data Engineer, who collected samples from open-source and internal databases for testing and training; and a Test Engineer, who performed vulnerability and performance tests.
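To illustrate how hashing and similarity matching can classify documents without a training step, here is a minimal sketch. It is not the production pipeline: the shingle size, the threshold, the Jaccard measure, and the names `shingle_hashes`, `jaccard`, and `classify` are all assumptions chosen for the example.

```python
import hashlib


def shingle_hashes(text: str, k: int = 3) -> set[int]:
    """Hash every k-word shingle of the text into a 64-bit integer.

    Hypothetical helper: k=3 is an illustrative choice.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {
        int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    }


def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity between two hash sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(text: str, references: dict[str, list[str]], threshold: float = 0.2):
    """Return (label, score) of the most similar labeled example.

    The label is None when no example reaches the (assumed) threshold.
    """
    query = shingle_hashes(text)
    best_label, best_score = None, 0.0
    for label, examples in references.items():
        for example in examples:
            score = jaccard(query, shingle_hashes(example))
            if score > best_score:
                best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Made-up reference texts, for illustration only.
references = {
    "invoice": ["invoice number total amount due payment terms net thirty days"],
    "passport": ["passport number nationality date of birth place of issue"],
}
label, score = classify(
    "invoice number total amount due payment terms net sixty days", references
)
```

Because new document types only need new labeled examples rather than retraining, an approach along these lines fits the goal of avoiding manual intervention.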
The system included multilingual OCR support through integrations with several OCR engines, and dynamic model/method allocation for both text and image-based documents. Classification accuracy was approximately 98% for text documents and approximately 87% for image-based documents.
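Dynamic model/method allocation can be pictured as a dispatcher that inspects each incoming document and routes it to a suitable pipeline. The sketch below is a simplified assumption: it routes on file extension only, and the suffix sets and handler names (`classify_text_document`, `classify_image_document`, `select_pipeline`) are hypothetical placeholders rather than the actual components.

```python
from pathlib import Path
from typing import Callable

# Assumed suffix sets; a real system would likely inspect content as well.
TEXT_SUFFIXES = {".txt", ".md", ".csv", ".html"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def classify_text_document(path: Path) -> str:
    """Placeholder for the text pipeline (e.g. direct similarity matching)."""
    return "text-pipeline"


def classify_image_document(path: Path) -> str:
    """Placeholder for the image pipeline (OCR first, then classification)."""
    return "ocr-pipeline"


def select_pipeline(path: Path) -> Callable[[Path], str]:
    """Pick a handler based on the document's file extension."""
    suffix = path.suffix.lower()
    if suffix in TEXT_SUFFIXES:
        return classify_text_document
    if suffix in IMAGE_SUFFIXES:
        return classify_image_document
    raise ValueError(f"Unsupported document type: {suffix or '(none)'}")


scan = Path("scan_0001.PNG")
result = select_pipeline(scan)(scan)
```

Keeping the routing decision in one function means a new document type or OCR engine can be added without touching the existing handlers.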

Impact
Reduced manual intervention required for document classification
Faster turnaround time for document classification
Improved accuracy rates for both text-based and image-based documents
"This project had an incredible impact on our document classification process. The automation and increased accuracy rates have saved us time and money."
Team
Machine Learning Engineer (myself) - Built the entire pipeline end to end
Implemented hashing and similarity-matching concepts from Blockchains and Recommender Systems
Built dynamic model/method allocation for text as well as image-based documents
Data Engineer - Collected samples from open source and internal databases for testing and training
Gathered multilingual data to support OCR integration
Provided the data the Machine Learning Engineer needed to build the system
Industry
IT System Custom Software Development
Skills
DevOps
Machine Learning