Together with @gnugirl, I’ve done some work toward answering the age-old question: what percentage of images used online are used without giving proper credit to their creators?
From a research point of view, this question is both interesting and difficult. One of the difficulties is creating a list of images and their web pages that represents a uniform (or at least near-uniform) sample of the web. Baykan et al. (2009) highlight that “no approach has been shown to sample the web pages in an unbiased way.”
In our initial work, we’ve used a modification of the Bharat-Broder technique, in which random words from a lexicon are concatenated into a query, the query is run against a search engine, and a URL is then selected at random from the result set returned.
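As a rough illustration, one draw of that procedure can be sketched as follows. The lexicon and the `search` function below are toy stand-ins for a real word list and a real search engine API, not part of our actual pipeline:

```python
import random

# Toy lexicon and placeholder search function; in the real procedure
# these are a full dictionary and an actual search engine query.
LEXICON = ["apple", "river", "window", "garden", "stone", "cloud", "paper"]

def search(query, max_results=10):
    """Stand-in for a search engine call; returns a list of URLs."""
    return [f"http://example.com/{query.replace(' ', '-')}/{i}"
            for i in range(max_results)]

def bharat_broder_sample(num_words=3):
    """One draw: concatenate random lexicon words into a query, run it
    against the (placeholder) search engine, and pick one URL uniformly
    at random from the result set."""
    query = " ".join(random.sample(LEXICON, num_words))
    results = search(query)
    return random.choice(results) if results else None
```

The biases listed below all enter through `search`: which pages a real engine returns, and in what order, is outside the sampler’s control.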
There are obvious risks of bias in this, including but not limited to:
- Query bias towards large, content-rich pages
- Search engine bias
- Ranking bias, depending on the search engine used
Bar-Yossef and Gurevich (2006) have built upon this, introducing stochastic simulation techniques that, applied to the biased samples Bharat-Broder generates, produce near-uniform samples.
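The intuition behind their correction can be sketched as rejection sampling: a page that matches many lexicon queries is over-represented in Bharat-Broder draws, so it is accepted with probability inversely proportional to its query cardinality. The numbers below are purely illustrative:

```python
import random

def accept(query_cardinality, min_cardinality=1):
    """Accept a sampled page with probability inversely proportional to
    its query cardinality, so pages matched by many queries (and thus
    over-sampled) are kept less often. min_cardinality keeps the
    acceptance probability in [0, 1]."""
    return random.random() < min_cardinality / query_cardinality

# Illustrative: a page matched by 50 queries is accepted far less often
# than one matched by 2 queries, evening out the sampling bias.
random.seed(0)
kept_rare = sum(accept(2) for _ in range(10_000))
kept_common = sum(accept(50) for _ in range(10_000))
```

Estimating each page’s query cardinality is the hard part in practice, which is where their stochastic simulation machinery comes in.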
We further complicate the situation by introducing researcher bias: each image must be evaluated in its context by a human to determine whether it has been correctly credited. There is no standard way of crediting an image; it depends on the implementation and style of the user, which makes automatic checking difficult without introducing additional bias.
Our first set is based entirely on blogs from Blogger, which is not a representative sample of the web at large but provides indications and a test ground for our further work. To generate the set, we used an English lexicon (introducing a bias towards English blogs) from which we randomly picked 2–5 words, which were then searched for in Google’s image search. From the result set returned, a random image and its context were selected from the first ten results.
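Under the same caveats as before — a toy word list and a placeholder standing in for Google’s image search — one draw of our set-generation step looks roughly like this:

```python
import random

# Illustrative word list; the real one is a full English lexicon.
LEXICON = ["harbour", "lantern", "meadow", "quill", "saddle",
           "thistle", "velvet", "willow"]

def image_search(query, limit=10):
    """Placeholder for an image search call: returns a list of
    (image_url, page_url) pairs for the first `limit` results."""
    first = query.split()[0]
    return [(f"http://img.example/{first}/{i}.jpg",
             f"http://blog.example/{first}/{i}")
            for i in range(limit)]

def draw_sample():
    # Randomly pick 2-5 lexicon words and join them into one query.
    query = " ".join(random.sample(LEXICON, random.randint(2, 5)))
    # Take a random image (and its page context) from the first ten results.
    return random.choice(image_search(query, limit=10))

image_url, context_url = draw_sample()
```

Everything after this point — locating the image on the page and judging the credit — is manual.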
Each image-and-context pair was studied individually, with the researcher locating the image within the context and looking at the surrounding information to identify credits. In addition, we did a reverse image search on the image (again, using Google Images) to ascertain whether it could be deemed obvious that the image was retrieved from another source and was not an original work of the blog owner.
We excluded from the results any work determined to be the original work of the blog owner (beyond the scope of our research), as well as results where we could not find any indication that the image was actually being used on the page returned (likely due to dynamically generated content that had changed since Google indexed the page).
The results were then sorted into three categories:
- Credit is given
- No credit is given
- Credit is given, but based on a reverse image search, it’s obviously incorrect or falsified.
In our initial sample, which includes a small set of pages, the distribution is as follows:
- 33% - Credit is given
- 65% - No credit is given
- 2% - Credit is given, but based on a reverse image search, it’s obviously incorrect or falsified.
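Tallying judgements into the distribution is straightforward once the manual evaluation is done. The labels below are toy data, not our actual sample:

```python
from collections import Counter

# Toy judgement list standing in for the hand-coded results of the
# manual evaluation; the real labels come from the per-page review.
judgements = ["credit"] * 5 + ["no_credit"] * 9 + ["falsified"] * 1

counts = Counter(judgements)
total = sum(counts.values())
# Percentage distribution, rounded to whole percent.
distribution = {label: round(100 * n / total) for label, n in counts.items()}
```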
The next step will obviously be to further reduce the bias in our sample set, to increase the size of the sample set, which unsurprisingly affects bias (Brajnik et al., 2007), and to run this analysis on the web at large.
Bar-Yossef, Z., & Gurevich, M. (2006) Random sampling from a search engine’s index. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland). ACM Press, New York, NY, 367-376.
Baykan, E., Henzinger, M., Keller, S.F., De Casteleberg, S., & Kinzler, W. (2009) A comparison of techniques for sampling web pages. In Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS 2009), 13-30.
Bharat, K., & Broder, A. (1998) A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International Conference on World Wide Web (Brisbane, Australia). Elsevier Press, 379-388.
Brajnik, G., Mulas, A., & Pitton, C. (2007) Effects of sampling methods on web accessibility evaluations. In Proceedings of the 9th international ACM SIGACCESS conference on Computers and accessibility. ACM Press, New York, NY. 59-66.