This came up when we had to process millions of domains and decide, for each one, whether it is child safe or not. Checking all the content, every tag, and everything else across millions of domains was not feasible. So we made a basic assumption: checking parameters like page titles, meta tags, and so on can tell us, to some extent, whether a site is child safe. Then another problem came up: if the domain isn’t in English, how can we determine whether it is in fact an unsafe, pornographic site? So we decided to use Python to scrape a domain, grab its title and tags, and translate them using Microsoft’s Translator API, since it provides 2 million characters a month, compared to Google’s 1 million characters for $10.
I didn’t know Microsoft’s translation was this accurate, and it is practically free, since 2 million characters a month is a good offer for developers like us.
I’ve used Python to do all the nitty-gritty work, and MongoDB is where all my scraped domains reside. I’ve left out the scraping part here since there are plenty of other tutorials on the web.
I’ve collected a list of basic bad words to match against.
Then I search for the domain in my database. If it is found, I grab its “title”, “metadescription”, and “metatags”, combine them into one string, and remove all the stopwords from it. Next we need to find the language of that string, so we use an amazing open source tool called langid, available on GitHub.
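Here is a rough sketch of this step, not my exact code. It assumes the record has already been fetched from MongoDB (for example with pymongo’s `collection.find_one(...)`) and that the three fields are stored as plain strings. The `STOPWORDS` set is a tiny hand-picked sample, and the helper names `build_text` and `detect_language` are made up for illustration.

```python
import re

# Tiny illustrative stopword set (assumption); a real list would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "is", "on"}

def build_text(doc):
    """Combine title, metadescription and metatags, then drop stopwords."""
    fields = ("title", "metadescription", "metatags")
    combined = " ".join(doc.get(field) or "" for field in fields)
    # Lowercase and split on non-word characters so commas etc. are discarded.
    tokens = re.findall(r"\w+", combined.lower())
    return " ".join(token for token in tokens if token not in STOPWORDS)

def detect_language(text):
    """Return the language code guessed by langid, for example 'en'."""
    import langid  # third-party: pip install langid
    lang, _score = langid.classify(text)  # classify returns (language, score)
    return lang

# Example record shaped like a stored domain (values are made up).
doc = {
    "title": "The Home of Free Games",
    "metadescription": "Play games for kids",
    "metatags": "games, kids, fun",
}
text = build_text(doc)
```

The language code from `detect_language(text)` is what decides whether the string needs translating at all.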
Now we check whether the language is English. If not, we have to translate the string using the Microsoft Translator API. For that there is an awesome wrapper available on PyPI named microsofttranslator 0.7. All we have to do is authenticate with our Client ID and Client Secret and translate the string to the required language.
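A minimal sketch of the translation step follows. It assumes the wrapper exposes `Translator(client_id, client_secret)` with a `translate(text, to_lang)` method, as documented for version 0.7, and that the language code comes from langid. The helpers `make_translator` and `to_english` are hypothetical names, and the credentials are placeholders.

```python
def make_translator(client_id, client_secret):
    """Build a translate(text, to_lang) callable backed by the wrapper."""
    # third-party: pip install microsofttranslator
    from microsofttranslator import Translator
    return Translator(client_id, client_secret).translate

def to_english(text, lang, translate):
    """Return text unchanged if it is already English, else translate it."""
    if lang == "en" or not text:
        return text  # nothing to translate, so no characters are spent
    return translate(text, "en")

# Usage with real credentials (placeholders shown, so left commented out):
# translate = make_translator("<Your Client ID>", "<Your Client Secret>")
# english_text = to_english(text, lang, translate)
```

Passing the translate function in as an argument keeps the English check separate from the API call, so strings that are already English never count against the monthly quota.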
NOTE: I’ve sent the whole string for translation. You can also send only one token at a time to get its English translation and save on the character limit.
And now we test it against the list of bad keywords we prepared, using Python’s powerful built-in set type.
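Roughly, the matching can be done with a set intersection. The keywords below are placeholders standing in for the real list, and `find_bad_words` is a name I made up for this sketch.

```python
# Placeholder keywords (assumption); the real list is collected by hand.
BAD_WORDS = {"porn", "xxx", "adult"}

def find_bad_words(text, bad_words=BAD_WORDS):
    """Return the set of bad words that appear as whole tokens in text."""
    tokens = set(text.lower().split())
    # Set intersection keeps only the tokens that are also bad words.
    return tokens & set(bad_words)

matches = find_bad_words("Free XXX adult videos online")
```

Because sets hold each word only once, this counts distinct bad words rather than how many times each one is repeated.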
Finally, we count the matches found in the previous step and set a threshold value. If the count crosses the threshold, we mark the site as child unsafe; if not, it is a safe site.
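The decision itself is a one-liner. In this sketch the threshold of 2 is a hypothetical value to be tuned on your own data, and I’m assuming that reaching the threshold already counts as crossing it.

```python
THRESHOLD = 2  # hypothetical value; tune it against domains you have labelled

def is_child_unsafe(matches, threshold=THRESHOLD):
    """Flag a site as unsafe when enough distinct bad words were matched."""
    # Assumption: a count equal to the threshold already counts as unsafe.
    return len(matches) >= threshold

unsafe = is_child_unsafe({"xxx", "adult"})
```

A lower threshold catches more unsafe sites but also flags more safe ones that happen to use a single listed word.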
Well, this is just a basic test we have done. If there are any changes you think we should adopt, please feel free to comment.