With the wealth of content now available to users, the online video industry has reached a point where even knowing the exact name of the title you’re looking for can return ambiguous results. A title like “Her” can match 15 TV shows and movies in English alone.
So how do we resolve such ambiguities? How can we automatically infer whether two items with the same name “Her” are the same or not? How do we figure out what the user is looking for so we can get them to the content they’re after quickly?
One effective approach is to take advantage of additional context, when available. The most useful of these contextual indicators is the year of release: if we know the item “Her” was released in 2013, there are only two matches, and the movie we’re looking for is the first one suggested. Additional contextual information can refine our results further; for example, if the cast is provided as part of the item metadata, then any overlap is an almost definitive indicator that the items are the same.
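The filtering described above can be sketched in a few lines. This is a minimal illustration, not Piksel’s implementation; the record fields ("title", "year", "cast") and the `disambiguate` helper are hypothetical:

```python
def disambiguate(query, catalog):
    """Narrow same-title candidates: first by release year, then by cast overlap."""
    candidates = [item for item in catalog if item["title"] == query["title"]]
    if "year" in query:
        candidates = [c for c in candidates if c["year"] == query["year"]]
    if "cast" in query and len(candidates) > 1:
        # Any overlap in cast is treated as a near-definitive identity signal.
        candidates = [c for c in candidates
                      if set(c["cast"]) & set(query["cast"])]
    return candidates

catalog = [
    {"title": "Her", "year": 2013, "cast": ["Joaquin Phoenix", "Scarlett Johansson"]},
    {"title": "Her", "year": 2013, "cast": ["Another Actor"]},
    {"title": "Her", "year": 2016, "cast": ["Someone Else"]},
]
query = {"title": "Her", "year": 2013, "cast": ["Joaquin Phoenix"]}
print(disambiguate(query, catalog))  # the 2013 film with the matching cast
```

Each contextual field narrows the candidate set only when it is present, so the same function degrades gracefully when metadata is sparse.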
Oftentimes, however, these “structured” metadata items are not part of the content descriptors. Rather, we more frequently encounter “fuzzy” descriptors, such as synopses and cover art. In such cases, fingerprinting techniques are used to compare how similar two items are to each other, and items whose similarity exceeds a minimum threshold are deemed the same. Specifically, a language model is trained using the skip-gram model on all known synopses in our training set. The trained model converts any input synopsis into a document vector, which is then compared against existing vectors to find the closest matches. We use this instead of simple string matching or a token-counting/bag-of-words model because text varies widely between providers; applying machine learning lets us better capture the semantic representation of the text and account for those variations.
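Once synopses are embedded as document vectors, the matching step reduces to a thresholded nearest-neighbor lookup. The sketch below assumes the vectors have already been produced by a trained model; the 4-dimensional toy vectors, item IDs, and the 0.8 threshold are illustrative choices, not values from the production system:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_match(query_vec, known, threshold=0.8):
    """Return the id of the most similar known item, or None if nothing
    clears the minimum similarity threshold."""
    best_id, best_sim = None, threshold
    for item_id, vec in known.items():
        sim = cosine(query_vec, vec)
        if sim >= best_sim:
            best_id, best_sim = item_id, sim
    return best_id

# Toy "document vectors"; a real skip-gram model would emit
# vectors with hundreds of dimensions.
known = {"her-2013": [0.9, 0.1, 0.3, 0.0], "her-2016": [0.1, 0.8, 0.0, 0.4]}
print(closest_match([0.85, 0.15, 0.25, 0.05], known))  # → her-2013
```

Returning `None` when no candidate clears the threshold is what prevents a genuinely new item from being force-matched to an unrelated existing one.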
Similarly, for cover art, a convolutional neural network is trained on all known cover art for TV shows and movies in our database to extract the visual features they share. This trained network then converts input images into vector representations that are compared to known visual vectors to find the closest matches. We avoid traditional image-retrieval techniques because cover art undergoes a large number of variations, mainly cropping and other post-processing, as well as language and region localization across the globe. By doing feature learning via neural networks, higher-level features are learned and encoded in the image vectors, improving matching robustness.
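As with the synopses, once the network has reduced each cover image to a feature vector, matching a variant back to its canonical artwork is a nearest-neighbor search. This is a sketch under the assumption that CNN features have already been extracted; the 3-dimensional vectors and poster IDs are made up for illustration:

```python
import math

def nearest_cover(query_vec, known_vecs):
    """Return the id of the known cover whose feature vector is closest
    to the query, by Euclidean distance."""
    return min(known_vecs, key=lambda k: math.dist(query_vec, known_vecs[k]))

known = {
    "her-2013-poster": [0.2, 0.9, 0.4],
    "her-2016-poster": [0.7, 0.1, 0.8],
}
# A cropped or localized variant should still land near its canonical art,
# because the learned features abstract away low-level pixel differences.
print(nearest_cover([0.25, 0.85, 0.45], known))  # → her-2013-poster
```

This robustness to cropping and localization is exactly what pixel-level comparison lacks: two crops of the same poster can differ in most of their pixels yet sit close together in feature space.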
At the end of this process, our metadata normalization service can take media items with highly varied descriptors and representations and automatically, robustly infer the canonical media item within our knowledge base. Once that canonical item is found, the input item can be de-duplicated, organized, and enriched with everything known about it, whether that’s the item’s Wikipedia page, Rotten Tomatoes score, Box Office Mojo gross, or Metacritic reviews, among many others.
What all this means is that not only can we get users to the content they’re looking for more quickly and easily, but we can also enrich the quality of the metadata associated with an asset, making content discovery a richer and more efficient process.
To learn more about the details of the points raised in this blog, download our whitepaper by clicking the image below.
If you would like to learn more about how we can help you match and enrich your video asset metadata, book a meeting with us at this year’s IBC.
Gerald Chao is Piksel’s Vice President of Engineering and is focused on delivering valuable, robust, and scalable technologies to solve challenges in today’s complex and rapidly evolving media and entertainment landscape.