This came up when we had to process millions of domains and decide, for each one, whether it was child safe or not. With millions of domains, checking all the content, every tag and everything else on each page was simply not feasible. So we worked from a basic assumption: checking a few parameters such as page titles, tags and meta tags can, to some extent, tell us whether a site is child safe. Then another problem came up: if a domain isn't in English, how can we determine whether it is in fact a pornographic site? We decided to use Python to scrape each domain, grab its titles and tags, and translate them with Microsoft's Translator API, since it provides 2 million characters a month compared to Google's 1 million characters for $10.
We hadn't realized Microsoft's translation was this accurate, and it is practically free: 2 million characters a month is a good offer for developers like us.
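
To make the idea concrete, here is a minimal sketch of the title and meta tag check, using only Python's standard library. It is not our production code: the names `TitleMetaParser`, `extract_signals` and `looks_unsafe` are made up for illustration, `translate` is a hypothetical callable standing in for whatever Translator API client you use, and `blocklist` is an assumed set of English keywords. Fetching the page is left out, so the functions take the HTML as a string.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collects the <title> text and the content of selected <meta> tags."""

    # Assumption: these two meta names carry enough signal for a first pass.
    META_NAMES = {"description", "keywords"}

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in self.META_NAMES and attr.get("content"):
                self.meta[name] = attr["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so collect all of them.
        if self._in_title:
            self.title_parts.append(data)


def extract_signals(html):
    """Return a dict with the page title and any matching meta tag contents."""
    parser = TitleMetaParser()
    parser.feed(html)
    parser.close()
    signals = {"title": "".join(parser.title_parts).strip()}
    signals.update(parser.meta)
    return signals


def looks_unsafe(html, translate, blocklist):
    """Translate the extracted text and check it against a keyword blocklist.

    `translate` is a hypothetical callable (str -> str) wrapping the
    translation service; `blocklist` is an assumed set of lowercase keywords.
    """
    signals = extract_signals(html)
    text = translate(" ".join(v for v in signals.values() if v)).lower()
    return any(word in text for word in blocklist)
```

Sending only the title and a couple of meta tags to the translator keeps the character count per domain small, which matters when the monthly quota is counted in characters.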