Hidden in the deepest corners of my hard drive, I stumbled upon this image from my old GNU-Friends project. It was part blog, part community, part aggregation of news, and part interviews. I wanted to interview the people who’d done significant good for the free software community in its earlier years. People like Aharon Robbins (maintainer of GNU Awk) and Chet Ramey (author of GNU Bash) were high on the list, complemented by a range of people like Lawrence Lessig and Guido van Rossum, recipients of the FSF’s Free Software Award. You can see the list of interviews over on the FFKP web site (they were all done in 2002, so read them with that in mind).

If you ever thought that copyright was simple, you might think differently after looking at this visualisation from Roberto Garcia at the Universitat de Lleida. It’s a striking display of how difficult copyright can be, and why licenses such as the Creative Commons licenses must deal not only with copyright as such, but also with a number of related rights. This is also something to be wary of for anyone wanting to implement Rights Expression Languages (REL) in their tools.

Researching attribution

Together with @gnugirl, I’ve done some work at figuring out the age-old question: what percentage of images used online are used without giving proper credit to their creators?

From a research point of view, this question is both interesting and difficult. One of the difficulties is creating a list of images and their web pages that represents a uniform (or at least near-uniform) sample of the web. Baykan et al. (2009) highlight that “no approach has been shown to sample the web pages in an unbiased way.”

In our initial work, we’ve used a modification of the Bharat-Broder technique, in which random words from a lexicon are concatenated into a query, the query is issued to a search engine, and a URL is then selected at random from the result set returned.

There are obvious risks for bias in this, including but not limited to:

  • Query bias towards large, content-rich pages
  • Search engine bias
  • Ranking bias, depending on the search engine used

Bar-Yossef and Gurevich (2006) built upon this and introduced methods that combine the biased samples generated by Bharat-Broder with stochastic simulation techniques to produce near-uniform samples.

We further complicate the situation by introducing researcher bias, since each image needs to be evaluated in its context by a human to determine whether the image has been correctly credited or not. There are no standard ways of crediting an image; this depends on the implementation and style of the user, which makes automatic checking difficult without introducing additional bias.

Our first set is based entirely on blogs from Blogger, which is not a representative sample of the web at large but provides an indication and a testing ground for our further work. To generate the set, we used an English lexicon (introducing bias towards English blogs) from which we randomly picked 2-5 words, which were then searched for in Google’s image search. From the result set returned, a random image and its context were selected from the first ten results, as sketched below.
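
To make the sampling step concrete, here is a minimal sketch in Python. The image_search function is a hypothetical stand-in for whatever search engine API is used (no particular API is assumed to exist); the rest mirrors the procedure described above.

    import random

    # Hypothetical stand-in for a search engine's image search API: given a
    # query string, return a list of (image URL, page URL) result pairs.
    def image_search(query):
        raise NotImplementedError("plug in a real image search API here")

    def sample_image(lexicon):
        """Draw one (biased) sample: pick 2-5 random lexicon words, search
        for them, and select a random image and its context from the first
        ten results, per the Bharat-Broder approach described earlier."""
        n_words = random.randint(2, 5)
        query = " ".join(random.sample(lexicon, n_words))
        results = image_search(query)[:10]
        return random.choice(results) if results else None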

Each set of context and image was studied individually, with the researcher locating the image within the context and looking at the surrounding information to identify credits. In addition, we did a reverse image search on the image (again, using Google images) to ascertain if it could be deemed obvious that the image was retrieved from another source and not an original work of the blog owner.

We excluded from the results any work which was determined to be the original work of the blog owner (beyond the scope of our research), as well as results where we could not find any indication of the image in question actually being used on the page returned (likely due to dynamically generated content that changed after Google indexed the page).

The results were then sorted into three categories:

  • Credit is given
  • No credit is given
  • Credit is given, but based on a reverse image search, it’s obviously incorrect or falsified.

In our initial sample, which includes a small set of pages, the distribution is as follows:

  • 33% - Credit is given
  • 65% - No credit is given
  • 2% - Credit is given, but based on a reverse image search, it’s obviously incorrect or falsified.

The next steps will obviously be to further reduce the bias in our sample set, to increase its size, which unsurprisingly affects bias (Brajnik et al., 2007), and to run this analysis on the web at large.

If you’re interested in our further work, you can subscribe to our newsletter or look at our project page on Indiegogo, where we try to raise awareness of the need for attribution.

References

Bar-Yossef, Z., & Gurevich, M. (2006). Random sampling from a search engine’s index. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland). ACM Press, New York, NY, 367-376.

Baykan, E., Henzinger, M., Keller, S. F., De Castelberg, S., & Kinzler, M. (2009). A comparison of techniques for sampling web pages. In Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS 2009), 13-30.

Bharat, K., & Broder, A. (1998). A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the 7th International Conference on World Wide Web (Brisbane, Australia). Elsevier, 379-388.

Brajnik, G., Mulas, A., & Pitton, C. (2007). Effects of sampling methods on web accessibility evaluations. In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility. ACM Press, New York, NY, 59-66.

What is the airspeed velocity of an unladen swallow?

Asked by
Anonymous

Thank you for this most interesting question! Aside from the obvious remarks that one could make in jest, Jonathan Corum has dug deep into the matter. I propose that you check it out!

I just received the following anonymous, and very pertinent, comment:

Regarding this post: /post/50262865487/javascript-in-the-2010s (this form doesn’t allow full URLs). I found it amusing that I had to enable javascript to even see the comments, much less reply. That may answer your question.

On the subject of JavaScript.

Unique identifiers

We all have names or aliases that help identify us in our surroundings. Most of the time, a name or alias is not unique, and it doesn’t matter whether it is. In some cases, though, you might want an easy way for others to identify you even if you change affiliation, or even name. ORCID is an example of identifiers used in the research community; a researcher registers with ORCID and is assigned a unique identifier which is then included in research publications. Anyone can then take that identifier and look up the person in ORCID’s registry.

People are not the only ones needing identifiers: our computers have unique identifiers to help identify them on a network, our cars have vehicle identification numbers, books have ISBNs, and so on. Creating unique identifiers for assets, people, and organisations is a key component of metadata for digital works; unique identifiers provide a way to identify a given work and its creator, and can then be used to look up information about the work in a registry, for example.

The way it works today is roughly that:

  1. You register your work in a registry (or, in the case of ISBNs, apply to your national library or a similar institution to get an ISBN)
  2. You receive a unique identifier
  3. You put that identifier on your work, labeling it as a particular identifier (i.e., “the ISBN of this book is X”)
  4. People use that identifier in catalogues, databases, web sites, etc.

The identifiers received from a registry are guaranteed to be unique within that registry. That’s one of the reasons you can’t invent identifiers at random: they wouldn’t be guaranteed to be unique.

But how strict do we need to be? Is it enough if there’s only a 0.5% chance of someone picking the same identifier? What about 0.005%? If there were a way to generate a unique identifier without communicating with any other device, everyone could generate as many as they needed for their works or themselves, and only when they wanted to would they have to register them in a registry.

A UUID is one way of generating a practically unique identifier. It’s a 128-bit self-generated identifier; even if about 70 trillion such identifiers were generated worldwide, the probability of a collision would be about 0.00000004%. A UUID can be generated without any communication with another device or service, meaning that a UUID could be generated in a camera, a phone, or any other recording device at the time of recording.
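
As a back-of-the-envelope check of that figure, here is a short Python sketch using the standard birthday-problem approximation. A version-4 UUID has 122 random bits, and uuid.uuid4() from the standard library generates one locally, with no network communication.

    import math
    import uuid

    RANDOM_BITS = 122  # a version-4 UUID has 122 random bits out of 128

    def collision_probability(n):
        """Birthday-problem approximation: P(collision) ~ 1 - exp(-n^2 / 2N),
        where N = 2**RANDOM_BITS is the number of possible identifiers."""
        return 1 - math.exp(-n**2 / (2 * 2**RANDOM_BITS))

    # With roughly 70 trillion UUIDs generated worldwide, the probability
    # of any collision is on the order of the figure quoted above.
    print(f"{collision_probability(70_000_000_000_000):.8%}")  # 0.00000005%

    # Generating a UUID needs no registry and no communication with anyone.
    print(uuid.uuid4())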

If we agree that a UUID is unique enough that it’s unlikely that two people will randomly generate the same identifier, we could simplify the process of generating identifiers significantly:

  1. You generate a UUID as an identifier and put this (in some cases automatically) into the work
  2. Optionally, if you want your identifier to be trusted, register it in a registry.

What are your thoughts about using UUIDs as unique identifiers?

How often are images incorrectly credited?

I’m interested in researching how often images are correctly credited when used online. I have some ideas of how to go about researching this using random samples of images from various social media platforms, as well as from the internet generally. What I hope to gain from this research is a hint at how often images are used without attribution, how often images are used with obviously fraudulent attribution, and of course, how often images are used with correct attribution.

If you’re interested in this type of research and would like to contribute your thoughts, please get in touch!

Boston à l’heure bleue by Manu_H / CC BY 2.0 (http://creativecommons.org/licenses/by/2.0/deed.en)

At the end of the month, I’ll be off to Boston for a gathering of the Shuttleworth Foundation. By then, our fundraising campaign, Please credit my work, will be a few weeks old, though hopefully still going strong.

Who should I make sure to meet while in Boston? Perhaps someone who might be interested in our work on persistently associating attribution and licensing metadata in digital works?

JavaScript in the 2010s

I recently found myself in a situation where I needed to write a small web application. Nothing terribly complicated: something that would allow you to log in to the application and send changes to, or retrieve lists from, a remote server. Using a full-fledged web development framework seemed excessive, yet hand-coding everything seemed just as unappealing.

The best alternative I’ve been able to find to hand-coding everything, without needing a full-fledged web development framework, is Google’s AngularJS. It seems to fit exactly what I need. But it has one caveat: it’s written in JavaScript and supports only browsers with JavaScript enabled.

Is it possible to build web applications today that can be used only by people who have JavaScript enabled? My gut feeling is “no: you must ensure that it also works without JavaScript”, but friends and colleagues claim that requiring JavaScript is nothing strange in the 2010s, and that people inclined to disable JavaScript might not be satisfied anyway unless they get a command-line tool that does the same as the web application. A bit excessive, perhaps, but how far can we stretch our use of JavaScript today? Is it reasonable to assume that we can leave the 2000s behind us and focus on building tools for the 2010s, or should we work to ensure backwards compatibility? And if so, for how long?

Watermarking

Here’s an image of yours truly, taken by Creative Commons’ own David Kindler. It’s an example of a watermark in an image, in this case announcing that the picture is from the Global Summit 2011, but it could equally be used to say that this picture was taken by David Kindler, as a way of ensuring that the picture is correctly attributed when re-used, as I’ve done in this post.

CC Global Summit-Jonas Oberg.jpg by DTKindler Photo / CC BY 2.0 (http://creativecommons.org/licenses/by/2.0/deed.en)
The problem with this kind of watermarking is partly that it takes away something from the image: it’s an invasive procedure that modifies the content of the image. Even if we were to resize the canvas of the picture so that the watermark falls outside of the actual picture, or perhaps figure out a way to make minimal changes to individual pixels so as to engrain the information within the picture, it would only partially work.

If I were to rotate the picture 90 degrees for publishing, the attribution would end up on the side. Since this is licensed under a Creative Commons Attribution license, I might also choose to use just part of the image, ignoring the watermark altogether. And if I resize the picture, the attribution might get so small that it simply can’t be read.

It’s a solution, but not an ideal one. That’s why we’re working with the concept of metadata: information which is recorded as part of the image file, but not in its visual representation. EXIF is the most common form of metadata for images; it allows a camera to record what aperture was used when shooting the image, whether the flash fired, sometimes where the image was taken, and a lot of other information. Data which is useful to have, but which should not be part of the image itself.

Adding information about licensing and creator to this metadata would allow us to create tools that read the metadata (or a link to it) and understand what to do with it, such as automatically crediting the creator when we make use of the image. And if we tried to use the image in a way that the author doesn’t want, our software could give us a helpful hint that we might want to look into our use of the image. It shouldn’t prevent us, but it should give us notice.
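
As a small illustration of the idea, here is a Python sketch using the Pillow library (my choice for the example; any EXIF-capable library would do, and a reasonably recent Pillow is assumed). It reads an image’s EXIF data and adds creator and licence information to it. Tags 315 (Artist) and 33432 (Copyright) are standard EXIF tags; EXIF has no dedicated licence field, so putting the licence into Copyright is a workaround rather than a standard.

    from PIL import Image  # Pillow, assumed installed: pip install Pillow

    def add_credit(src, dst, artist, licence):
        """Embed creator and licence information in an image's EXIF
        metadata. The pixels are untouched; only the metadata changes."""
        img = Image.open(src)
        exif = img.getexif()
        exif[315] = artist     # EXIF tag 315: Artist
        exif[33432] = licence  # EXIF tag 33432: Copyright
        img.save(dst, exif=exif)

    # Hypothetical usage, mirroring the photo credit in this post:
    add_credit("summit.jpg", "summit-credited.jpg",
               "DTKindler Photo",
               "CC BY 2.0 <http://creativecommons.org/licenses/by/2.0/>")

A reader tool could then do the reverse: read tag 315 and construct an attribution line automatically, surviving rotation, cropping, and resizing in a way a visual watermark cannot (though metadata can, of course, be stripped).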