Presage Technologies applies machine learning models to a wide variety of tasks. Because we use these models so broadly, we want to spend as little time as possible getting each training pipeline set up. For this project, we needed a streamlined pipeline that would let our team generate large amounts of training data quickly.
Because our employees work remotely, the first thing we did was set up an AWS EC2 instance where multiple people could access the pipeline. We utilized the following AWS infrastructure:
Process: Set up an instance of the above type with a key pair, and add an SSH inbound rule for each user's IP address to the security group.
The training data was in video format, and the object we were interested in was only present in certain sections. We needed a way to cut the relevant sections of video and slice them into individual images for annotation. The videos already came with annotations, so we knew where the relevant sections were. Given this information, we chose FFmpeg, as it can copy a section of a video based on timestamp parameters, which we already had as part of the annotated data.
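As an illustration of how a start time and duration map onto an FFmpeg call, here is a minimal sketch. The function name, file names, and exact flag choices are assumptions for this example, not the commands we actually ran. It uses `-ss` to seek, `-t` to limit the duration, and `-c copy` to copy the streams without re-encoding.

```python
from typing import List


def build_clip_command(src: str, dst: str, start_s: float, duration_s: float) -> List[str]:
    """Return an ffmpeg argument list that copies one section of a video.

    Hypothetical helper: src/dst paths and the flag layout are assumptions.
    """
    return [
        "ffmpeg",
        "-y",                        # overwrite dst if it already exists
        "-ss", f"{start_s:.3f}",     # seek to the section start, in seconds
        "-i", src,                   # input video
        "-t", f"{duration_s:.3f}",   # keep this many seconds after the seek point
        "-c", "copy",                # stream copy: no decoding or re-encoding
        dst,
    ]


if __name__ == "__main__":
    # Example: a section starting at 12.5 s and lasting about 30 s.
    print(" ".join(build_clip_command("input.mp4", "clip_000.mp4", 12.5, 30.0)))
```

Because stream copy does not re-encode, the cut generally snaps to a nearby keyframe rather than the exact timestamp, which is usually acceptable when the duration is only approximate anyway.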
In addition to the start of each section, we had an approximate duration. We took the start time from our ground truth annotations, then programmatically generated a list of FFmpeg commands. These commands were then executed in the terminal. With some threading added for improved performance, our 10 GB+ videos were cut down and ready to be sliced into training images in very little time. To accomplish the slicing, we used OpenCV for video capture and for writing the images.
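The two steps described here, running the generated commands in parallel and slicing each clip into images, could look roughly like the sketch below. The helper names, the worker count, and the JPEG naming scheme are assumptions made for illustration; the OpenCV part requires the `opencv-python` package.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Sequence


def run_commands(commands: Sequence[List[str]], max_workers: int = 4) -> List[int]:
    """Run each command as a subprocess and return the exit codes in input order."""
    def run_one(cmd: List[str]) -> int:
        # Each thread just waits on its own subprocess, so threads are sufficient here.
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


def slice_clip(clip_path: str, out_dir: str, every_n: int = 1) -> int:
    """Write every `every_n`-th frame of a clip as a JPEG; return the number written."""
    import cv2  # imported here so run_commands can be used without OpenCV installed

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    capture = cv2.VideoCapture(clip_path)
    written = 0
    index = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of the clip, or the file could not be read
                break
            if index % every_n == 0:
                name = f"{Path(clip_path).stem}_{index:06d}.jpg"
                if cv2.imwrite(str(Path(out_dir) / name), frame):
                    written += 1
            index += 1
    finally:
        capture.release()
    return written
```

A non-zero entry in the list returned by `run_commands` marks a command that failed, so those clips can be retried or inspected before slicing.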
Once we had the images, it was time to annotate, but annotating 5000+ images with multiple employees working remotely while managing label quality is no small task. After some research, we landed on Label Studio. Label Studio is a lightweight, open-source annotation platform that allows us to register multiple users and track their annotations, attach cloud storage, and export annotations in a variety of formats.
For our cloud storage, we worked with AWS S3. Transfers within AWS are very fast, so by generating our training data on the EC2 instance and moving it directly to S3, where we could point Label Studio for annotation, we were able to work with large datasets in reasonable amounts of time.
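Moving the generated images from the instance to S3 might be sketched as follows. This assumes a boto3 S3 client is passed in (for example, one created with `boto3.client("s3")`); the function name, bucket name, and key prefix are hypothetical.

```python
from pathlib import Path
from typing import List


def upload_frames(client, frame_dir: str, bucket: str, prefix: str) -> List[str]:
    """Upload every .jpg in frame_dir under s3://bucket/prefix/ and return the keys.

    `client` is assumed to be a boto3 S3 client; bucket and prefix are placeholders.
    """
    keys = []
    for path in sorted(Path(frame_dir).glob("*.jpg")):
        key = f"{prefix.rstrip('/')}/{path.name}"
        client.upload_file(str(path), bucket, key)  # arguments: Filename, Bucket, Key
        keys.append(key)
    return keys
```

The returned keys give a simple record of what was uploaded, which can be compared against what the annotation tool sees once the bucket is attached as storage.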
It is essential to develop performant solutions, as we are working to solve worthy problems at Presage Technologies. We frequently need to sort through, annotate and train on large amounts of data. With this approach, we have achieved a training pipeline that can be quickly reapplied where we need it.
For more information, contact Presage Technologies. We would love to hear about your intended use case.