Founded in 2014 in Atlanta Georgia, HealthWitz is a web-based health information aggregator that gives the general public access to high quality information about health, wellness, diseases, symptoms, and treatments. Created by a doctor, HealthWitz will provide you with the information you need to make well-informed decisions that impact your health.
This project has three main moving parts and each of these has involved several challenges:
- The web crawler, written in Java and based on crawler4j, scours the internet in search of healthcare related articles and retrieves them into our database. The main challenges here were building a pipeline that can process large volumes of data. For each article, we have to check the language, the relevance to the medical field, we then need to detect keywords and key phrases, we match for duplicates, near-duplicates and similar articles and finally we index the selected article into the search engine.
- The recommendation system was built on top of Solr. The main challenge here was to find a balance between article relevance relative to the user’s profile, article quality and article novelty. A second challenge of this system was raised by the issues of improving natural language processing and profile generation speed. We are currently running NLP tasks in batches in order to reduce overhead and are using a blend of search weights for Solr in order to generate recommendations in a timely fashion for our users.
- The website itself where the main challenge has been keeping the number of API / database requests and round-trip-times to a minimum.