I recently discovered the Ads.txt specification and the IAB Tech Lab Python crawler. Being more interested in Clojure, I decided last weekend to write a crawler for Ads.txt files in Clojure. The first pass is available at the following repo on the release/0.0.1 branch.
As of this writing, version 0.0.1 accepts a list of target domains, crawls each site, parses the content and writes the output in comma-delimited format to STDOUT. The Python crawler from IAB saves its output to a SQLite database; I decided that would be a feature for a later release. With my current version it is simple enough to pipe the output to another program or redirect it to a file for subsequent processing.
If you find this post after the release/0.0.1 branch has been superseded, feel free to investigate any progress I've made.
This file is concerned with reading the target-domains file passed in via the -t option. The goal is to read the file, remove blank and commented-out lines, and then reduce each remaining URL or domain to its core domain name. Having each entry reduced to just its domain name makes later processing simpler.
See the read-domain-file function for an example of how the sequence of lines read from the file is filtered and then cleaned. The cleaning consists of lower-casing each line, trimming whitespace and then removing the http:// or https:// prefix and any preceding www. if present.
(defn read-domain-file [fname]
  ;; read file and return list of non-commented lines
  ;; - remove commented lines
  ;; - trim leading and trailing whitespace
  ;; - remove http[s]:// prefixes
  ;; - remove www. prefixes
  ;; - lower case
  (with-open [r (clojure.java.io/reader fname)]
    ;; doall forces the lazy seq before with-open closes the reader
    (doall
     (->> (line-seq r)
          (filter ignore-line)
          (map #(-> %
                    (clojure.string/lower-case)
                    (clojure.string/trim)
                    (hostname)
                    (strip-www)))))))
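The helpers called above (ignore-line, hostname and strip-www) are not shown in this post. As a rough sketch of what they might look like, here is the same pipeline applied to an in-memory list of lines. The helper bodies and the clean-domains name are assumptions for illustration, not the project's actual code; in particular, ignore-line is assumed to return true for lines that should be kept, since it is passed to filter.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers: the names match the calls in read-domain-file,
;; but the bodies are assumptions, not the project's code.
(defn ignore-line
  "Despite the name, returns true for lines worth keeping (non-blank,
  not starting with #), so it can be passed straight to filter."
  [line]
  (let [trimmed (str/trim line)]
    (not (or (str/blank? trimmed)
             (str/starts-with? trimmed "#")))))

(defn hostname
  "Drops an http:// or https:// prefix plus any port, path, query or fragment."
  [s]
  (-> s
      (str/replace #"^https?://" "")
      (str/replace #"[/:?#].*$" "")))

(defn strip-www
  "Removes a single leading www. label."
  [s]
  (str/replace s #"^www\." ""))

(defn clean-domains
  "Applies the filter-then-clean pipeline to an in-memory seq of lines."
  [lines]
  (->> lines
       (filter ignore-line)
       (map #(-> % str/lower-case str/trim hostname strip-www))))

(clean-domains ["# news sites"
                ""
                "  https://www.Example.com/index.html "
                "www.example.org"])
;; => ("example.com" "example.org")
```

Working on a plain sequence of strings rather than a file makes the cleaning steps easy to try at the REPL.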
The process file takes the list of domains returned from read-domain-file and formats each one into a URL from which to retrieve the ads.txt file. Each URL is fetched and its contents processed. If the response does not have a text/plain content type, it is ignored. Not all sites support the ads.txt specification, and those that don't typically return an HTML page with an error message.
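As a minimal sketch of those two steps, the URL construction and the content-type check could look something like the following. The function names ads-txt-url and text-plain? are hypothetical, the use of plain http is an assumption, and the actual HTTP request is left out so the example stays free of external libraries.

```clojure
(require '[clojure.string :as str])

(defn ads-txt-url
  "Builds the ads.txt location for a bare domain. Plain http is an
  assumption here; a crawler may prefer https or try both."
  [domain]
  (str "http://" domain "/ads.txt"))

(defn text-plain?
  "True when a Content-Type header value names text/plain, ignoring
  case and parameters such as charset. A missing header counts as false."
  [content-type]
  (boolean
   (when content-type
     (-> content-type
         (str/split #";")
         first
         str/trim
         str/lower-case
         (= "text/plain")))))

(ads-txt-url "example.com")               ;; => "http://example.com/ads.txt"
(text-plain? "text/plain; charset=utf-8") ;; => true
(text-plain? "text/html")                 ;; => false
```

Comparing only the media type before the first semicolon means a response that adds a charset parameter is still accepted.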
The returned data can contain blank and commented lines, which need to be ignored. The remaining lines are then parsed. Anything after a # is split off as a comment, and the rest of the line is comma delimited, as you can see in how parse-line handles it.
(defn parse-line [line]
  ;; Examples
  ;; google.com, pub-1037373295371110, DIRECT #video, banner, native, app
  ;; pubmatic.com, 120658, Direct, 5d62403b186f2ace
  (let [[data comment] (clojure.string/split line #"#")]
    (let [[exchange-domain account-id account-type tag-id] (clojure.string/split (clean data) #",")]
      {:exchange-domain (clean exchange-domain)
       :account-id (clean account-id)
       :account-type (clean account-type)
       :tag-id (clean tag-id)
       :comment (clean comment)
       :data data})))
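To see what parse-line produces for the two example lines, here is a sketch that can be pasted into a REPL. The clean helper is not shown in this post, so the version below is an assumption: it trims whitespace and passes nil through, which lets missing fields (such as an absent tag id or comment) come back as nil.

```clojure
(require '[clojure.string :as str])

;; Hypothetical: this version of clean trims whitespace and passes nil
;; through so that missing fields stay nil.
(defn clean [s]
  (when s (str/trim s)))

;; Same logic as parse-line above, repeated so the block runs on its own.
(defn parse-line [line]
  (let [[data comment] (str/split line #"#")
        [exchange-domain account-id account-type tag-id] (str/split (clean data) #",")]
    {:exchange-domain (clean exchange-domain)
     :account-id      (clean account-id)
     :account-type    (clean account-type)
     :tag-id          (clean tag-id)
     :comment         (clean comment)
     :data            data}))

(parse-line "pubmatic.com, 120658, Direct, 5d62403b186f2ace")
;; => {:exchange-domain "pubmatic.com"
;;     :account-id "120658"
;;     :account-type "Direct"
;;     :tag-id "5d62403b186f2ace"
;;     :comment nil
;;     :data "pubmatic.com, 120658, Direct, 5d62403b186f2ace"}

(select-keys
 (parse-line "google.com, pub-1037373295371110, DIRECT #video, banner, native, app")
 [:account-type :tag-id :comment])
;; => {:account-type "DIRECT", :tag-id nil, :comment "video, banner, native, app"}
```

Note that the account type keeps whatever casing the publisher used (Direct versus DIRECT), so a consumer of this output may want to normalize it before comparing values.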