1. The idea
Self-supervised AI is super hot right now. Actors in this space (e.g., me) are highly interested in obtaining large datasets of natural images for mad science experiments and other nefarious purposes. I thought about it and realized that scraping the web for images and text is kind of hard, but it's a lot easier to get as much audiovisual data as you want from sources like PirateBay and YouTube.
So, I created a Python project called vid_sampler (GitHub) to explore the possibilities of datamining images from a collection of movies. Right now, this project can:
- Recursively discover video files in a folder.
- Compute basic metadata for each discovered video (width, height, frame count, duration).
- Sample frames from a uniform distribution over all frames of the discovered videos (see the sketch after this list).
- Just for fun, combine randomly sampled frames into a GeoTIFF collage.
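To make the sampling step concrete, here is a minimal sketch of the idea, not vid_sampler's actual code: the extension whitelist, the helper names, and the seek-via-CAP_PROP_POS_FRAMES strategy are my own assumptions. The key point is that weighting each video by its frame count makes the draw uniform over frames rather than over files.

```python
import random
from pathlib import Path

import cv2  # OpenCV

# Assumed extension whitelist; the real project detects videos differently (see below).
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".webm", ".mov"}

def discover_videos(root):
    """Recursively find files that look like videos by extension."""
    return [p for p in Path(root).rglob("*") if p.suffix.lower() in VIDEO_EXTENSIONS]

def frame_counts(paths):
    """Read each video's frame count with OpenCV, skipping unreadable files."""
    counts = []
    for p in paths:
        cap = cv2.VideoCapture(str(p))
        n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
        if n > 0:
            counts.append((p, n))
    return counts

def sample_uniform_frame(counts):
    """Pick one frame uniformly over the union of all frames of all videos."""
    total = sum(n for _, n in counts)
    k = random.randrange(total)           # global frame index in [0, total)
    for path, n in counts:
        if k < n:
            cap = cv2.VideoCapture(str(path))
            cap.set(cv2.CAP_PROP_POS_FRAMES, k)  # seek to the chosen frame
            ok, frame = cap.read()
            cap.release()
            return path, k, (frame if ok else None)
        k -= n
```

One practical caveat: seeking with CAP_PROP_POS_FRAMES is slow or inexact for some codecs, which is part of why format-agnostic sampling gets tricky, as the next section discusses.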
2. Mo' formats, mo' problems
There are interesting challenges involved in writing format-agnostic code that processes video files. For one, there is currently no single program that can tell whether a given file is a video or not. Part of the reason is that the answer depends on how you define "video" and how hard you want to try to salvage ill-formed videos in obtuse formats. For the purposes of vid_sampler, a realistic working definition of "video file" is something like "a file that VLC can open and that has a video channel."
ffmpeg is capable of converting anything that resembles a video into a common transfer format. However, that would be slow and/or require a lot of storage; an ideal solution leaves the original data untouched. To solve the "video or not a video?" problem, vid_sampler currently uses a combination of the following (a sketch of how such checks can fit together follows the list):
- mediainfo (a.k.a. pymediainfo)
- OpenCV (a.k.a. cv2)
- ffmpeg
- libmagic (a.k.a. pymagic)
- Some custom logic to handle things that look like videos but aren't.
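As an illustration of how such a combination might look, here is a hedged sketch of my own, not the project's actual detection logic. It assumes the python-magic binding for libmagic and omits the ffmpeg and custom-logic layers entirely.

```python
from pymediainfo import MediaInfo
import cv2
import magic  # assuming the python-magic binding for libmagic

def looks_like_video(path):
    """Heuristic 'video or not a video?' check over a file path (string).

    A file passes if libmagic says it is not plain text, mediainfo reports at
    least one video track, and OpenCV can actually decode a frame from it.
    """
    # 1. Cheap libmagic sanity check on the MIME type.
    mime = magic.from_file(path, mime=True)
    if mime.startswith("text/"):
        return False

    # 2. Ask mediainfo whether the container declares a video track.
    info = MediaInfo.parse(path)
    if not any(t.track_type == "Video" for t in info.tracks):
        return False

    # 3. Finally, make sure OpenCV can decode at least one frame.
    cap = cv2.VideoCapture(path)
    ok, _ = cap.read()
    cap.release()
    return ok
```

Each layer catches a different failure mode: libmagic is a cheap first filter, mediainfo understands what the container claims to contain, and OpenCV proves the stream is actually decodable.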