Continuing work on the Ads.txt crawler has led to the idea of storing the crawler results in a database and making them available from a web site. This post introduces the first pass at such a site, with the source code available in the following repository:
As a quick review, the Ads.txt standard lets publishers host a simple text file naming the ad networks that are authorized to sell the publisher's inventory. There is a reference Python crawler for such files, and I've built a crawler in Clojure as an alternative. See this link for a series of posts about the Ads.txt specification and the development of the crawler. The crawler project is here.
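To make the file format concrete, here is a minimal shell sketch of what an Ads.txt file looks like and how its records might be pulled out. It is not how the Clojure crawler or the reference Python crawler works. The sample domains and account IDs are made up, and the function names `sample_ads_txt` and `parse_ads_txt` are hypothetical. It assumes the usual record layout of ad system domain, publisher account ID, relationship (DIRECT or RESELLER), and an optional certification authority ID, with `#` starting a comment.

```shell
#!/bin/sh

# Hypothetical sample file; every domain and ID below is invented.
sample_ads_txt() {
  cat <<'EOF'
# ads.txt for publisher.example
adnetwork-one.example, 12345, DIRECT, d75815a79
AdNetwork-Two.example, XF436, reseller
contact=adops@publisher.example
EOF
}

# Print each data record as "domain|account|relationship".
# Comments, blank lines, and variable lines such as contact= are
# skipped because they have fewer than three comma-separated fields.
parse_ads_txt() {
  sed -e 's/#.*$//' | awk -F',' '
    NF >= 3 {
      # Trim surrounding whitespace from the three required fields.
      for (i = 1; i <= 3; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      print tolower($1) "|" $2 "|" toupper($3)
    }'
}

sample_ads_txt | parse_ads_txt
```

Against a real site you could replace `sample_ads_txt` with something like `curl -s https://publisher.example/ads.txt`, where the host is again a placeholder. A real crawler also has to deal with redirects, malformed lines, and other edge cases that this sketch ignores.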
The crawler developed previously has a command-line interface but can easily be used as a library from another Clojure application. To facilitate this I've pushed a version of the crawler to Clojars.org. Its listing is:
For a batteries-included framework for building a Clojure web application, I highly recommend Luminus. Start with the tutorial and get that working first. Then I'd suggest repeating the exercise with the database you are ultimately going to use. For my example I built the Guestbook with a local Postgres instance. Then, since I knew I was going to deploy to Heroku, I moved the app to Heroku as another exercise.
If you aren't familiar with Heroku, it is another recommendation. Once you are familiar with their overall system, you can host test and development versions without any cost. This includes backing them with Postgres.
Here is the short list of commands to set up a site such as the Ads.txt site. The details are for you to research.
$ heroku apps:create ads-txt
$ heroku addons:create heroku-postgresql:hobby-dev
$ git push heroku master
$ heroku run lein run migrate
I added the ability to add domains to the system manually, but to get things started it was important to allow bulk loading using the crawler library. Here are some sample commands with references to lists of domains that may be useful.
$ heroku run lein run -t https://gist.githubusercontent.com/bradlucas/f5060dfc602cfeedb140a23bb4d77403/raw/f47ae3cfe938b5035c4396d1069a1a5c2e8324c2/top-100-programmatic-domains.txt --app ads-txt
$ heroku run lein run -t https://gist.githubusercontent.com/bradlucas/5b80fae610b9a5547ff10d1c8a706d35/raw/3a6262da02d17908517cbbb5499404aa58f856a6/the-moz-top-500.txt --app ads-txt
$ heroku run lein run -t https://gist.githubusercontent.com/bradlucas/df0a5dc3e7d4a92eaecf2ac28bc0f17a/raw/d53ae31b3aaba88019ef2dff644a6bfeb8c9a088/adstxt_domains_2017-09-11.txt --app ads-txt
The site described above lets you request that domains be crawled for their Ads.txt records and stores the results. The site is currently running on Heroku at https://ads-txt.herokuapp.com/.