How to get data for AI applications

June 29, 2017, 7:51 p.m. By: Pranjal Kumar


AI has certainly been on the rise, and a number of developers are trying to pursue a career in this field. AI is a challenging field and demands a lot, but one of the biggest challenges every AI engineer faces is the availability of relevant data. You need accurate and precise data for your AI software to work smoothly and accurately. Various sample data sets are available, but they are often not that accurate and do not reflect real-world figures.

For example, the sample datasets from AWS help developers understand Amazon’s machine learning API, but they are not a substitute for data that matches your own problem. So, if you are planning to build your own AI application, data can be a big hurdle in your way.

The first challenge is where to get the data from. Here are some methods by which you can get it:

1. Scrape the data

This is one of the most common ways of building an application that needs data. Remember, though, that you might be violating a website's terms of service by crawling it with scripts, so check them first. With that caveat, you can build a crawler to help you gather information. The better the crawler, the better the data sources you can crawl. If you would rather not collect the data yourself, there are options for outsourcing the work, such as Amazon’s Mechanical Turk, a model dubbed training data as a service (TDaaS).
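As a rough illustration of a polite crawler, here is a minimal sketch that uses only the Python standard library. It pulls the links out of a page and keeps only those that the site's robots.txt allows. The robots.txt text, the page content, the example.com URLs and the "my-crawler" user agent are all made up for illustration and are inlined so the sketch runs without a network connection. Note that robots.txt is only one signal: passing this check does not mean a site's terms of service permit scraping.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the HTML text as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical robots.txt and page content, inlined so the sketch runs offline.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
PAGE = '<a href="/articles/1">One</a> <a href="/private/x">Secret</a>'

links = extract_links(PAGE, "https://example.com/")
crawlable = [u for u in links if allowed_by_robots(ROBOTS, "my-crawler", u)]
print(crawlable)
```

In a real crawler you would download robots.txt and the pages themselves, and you would probably also add a delay between requests so you do not overload the site.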

2. Augment your data with those that are similar

You can use data that is similar to the data you need. This might work at a small initial scale to test your AI, but you should avoid this method in production.

3. Tap the burgeoning TDaaS space

A few start-ups help other companies get data. These TDaaS companies offer training data as a service: they give start-ups access to a labor force that is trained in gathering, cleaning and labeling data, all of which are crucial for any AI development. Companies that provide training data include CrowdFlower. They provide all sorts of data, such as text, images, videos and much more.

4. Look for open source data repositories

There are a few open source communities that offer data for training purposes. You should look in these data repositories before trying to build a scraper.

5. Use surveys and crowdsourcing

You can crowdsource your data. Data collected this way can be more accurate and reliable. You should still check what you collect from these sources, based on how quickly each person filled in their responses and how much the responses vary.
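One possible way to run such a check is sketched below. It flags a response when it was completed much faster than the typical response, or when its answers show no variance at all (the same option picked every time). The record layout (id, seconds, answers), the function name and both thresholds are assumptions made for this example, not part of any survey tool, and the sample responses are invented.

```python
from statistics import median, pvariance


def flag_suspect_responses(responses, min_speed_ratio=0.3, min_variance=0.0):
    """Return the ids of responses that look rushed or straight-lined.

    A response is flagged if it took less than min_speed_ratio times the
    median completion time, or if the variance of its answers is at or
    below min_variance. Both thresholds are illustrative defaults.
    """
    typical_seconds = median(r["seconds"] for r in responses)
    suspects = []
    for r in responses:
        too_fast = r["seconds"] < min_speed_ratio * typical_seconds
        too_flat = pvariance(r["answers"]) <= min_variance
        if too_fast or too_flat:
            suspects.append(r["id"])
    return suspects


# Invented sample data: answers are ratings on a 1-5 scale.
responses = [
    {"id": "w1", "seconds": 120, "answers": [4, 2, 5, 3]},
    {"id": "w2", "seconds": 15, "answers": [3, 4, 2, 5]},   # much faster than the rest
    {"id": "w3", "seconds": 110, "answers": [3, 3, 3, 3]},  # same answer every time
    {"id": "w4", "seconds": 95, "answers": [1, 4, 4, 2]},
]
suspects = flag_suspect_responses(responses)
print(suspects)
```

Flagged responses are candidates for manual review rather than automatic rejection, since a fast or uniform response is not always a bad one.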

6. Partner with companies that are rich in data

You can also enter into a partnership with a firm that has a rich data collection. This might prove very useful, as you are likely to get more accurate data.