Friday, February 20, 2009

Similarities between High Energy Physics and Multimedia

There are several similarities between high energy physics collaborations (and perhaps data intensive science collaborations in general) and multimedia (specifically Swiss Public Broadcasting RTS). High energy physics is not the same as physics, which is a much broader term. High energy physics generally revolves around big experiments (e.g. at CERN or Fermilab) that are run by large collaborations of universities and institutes.
Similarity does not mean that the two domains are the same. There are definitely differences between the domains. But even where the domains differ, the differences are sometimes orthogonal to each other and represent each other's extremes. The similarities (and differences) described here focus on the (sub)domain of IT (software, design, infrastructure, etc.) within physics and multimedia, and are not an exhaustive list. I do not claim that journalists and physicists share similarities. Perhaps they do (other than that both groups publish), but that is not the focus of this article.
Both physics and multimedia deal with (near) real-time data. Within (high energy) physics there is a machine (detector) that generates large amounts of data from collisions of particles. Within multimedia, several television and radio channels generate an ongoing stream of data (digital audio and video).
For both physics and multimedia this so-called "raw" data is processed before it is delivered to its end users (physicists and website visitors). This process is highly automated for physics but does require some manual "fine-tuning" for multimedia (although a lot is automated for multimedia too).
Within (high energy) physics large amounts of data are generated and distributed to physicists all over the world, which requires excellent (network) connectivity and highly available (preferably distributed) storage elements. Similarly, within multimedia high volumes of data are distributed to (in the case of RTS) French speaking regions around the world. The main difference between these data volumes is that a physics user generally consumes large amounts of data (terabytes to petabytes) compared to an average website visitor (megabytes to gigabytes). But multimedia has strength in numbers (like the Archimedean property): the user base for physics is relatively small (between 2,500 and 10,000 users), while the user base for multimedia (website visitors) is relatively large (between 6 and 7 million for Swiss Public Broadcasting). In 2009 Swiss Public Broadcasting delivered approximately 2.5-3 PB of data to its users. This number is likely to grow in the coming years with the introduction of HD programs on the web. To deliver this multimedia content effectively, Swiss Public Broadcasting has one data center (similar to the single T0 for CERN based physics experiments) but relies on a so-called Content Delivery Network (CDN), for example Akamai, Stream the World, or EdgeCast Networks, to cache its data. For people familiar with the storage infrastructure of the physics experiments at CERN: the hierarchical T0-T1-T2-T3 structure could be seen as a very specific content delivery network for physics data.
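The "strength in numbers" point can be checked with a little arithmetic. The sketch below simply divides the yearly volume quoted above by the size of the user base to get an average per visitor. The function name is made up for illustration, and decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) are an assumption, since the figures above do not say which convention they use.

```python
# Rough per-visitor average for the RTS figures quoted in the text.
# Assumption: decimal units (1 PB = 10**15 bytes, 1 MB = 10**6 bytes).
BYTES_PER_PB = 10**15
BYTES_PER_MB = 10**6


def average_mb_per_user(total_pb: float, users: int) -> float:
    """Average yearly data volume per user, in megabytes."""
    return total_pb * BYTES_PER_PB / users / BYTES_PER_MB


# Lowest average: smallest volume spread over the largest audience.
low = average_mb_per_user(2.5, 7_000_000)
# Highest average: largest volume spread over the smallest audience.
high = average_mb_per_user(3.0, 6_000_000)

print(f"Average per visitor per year: {low:.0f} to {high:.0f} MB")
```

Under these assumptions the average lands at a few hundred megabytes per visitor per year, which fits the "megabytes to gigabytes" range mentioned above and is several orders of magnitude below what a single physics user consumes.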
Another similarity is the compactification process. High energy physics needs computing resources (distributed over many computing centers), not only to analyze data, but also to reduce the size of the generated data (e.g. to something called n-tuples in high energy physics jargon). Similarly, multimedia needs (relatively) large amounts of resources to transcode the generated data into many different (usually more compact) formats like mp3, mp4, etc.
On an organizational level there is also a similarity (which is perhaps specific to Swiss Public Broadcasting, and not to multimedia in general). Physics collaborations consist of a group of globally distributed and autonomous universities and institutes, each contributing to the goal of making discoveries. Similarly, Swiss Public Broadcasting is not a single entity but consists of a set of autonomous distributed business units divided between television, radio and the four language regions, each contributing to the goal of providing quality television and radio programs in the public interest.
One can even argue that there is a similarity on the level of rights management (sometimes called access control). Within multimedia, certain types of broadcasts have so-called "right restrictions", meaning for example that a broadcast cannot be made available for viewing outside the country where the broadcaster is located (for example soccer or tennis matches), or that it can only be made available on the website for a short amount of time (for example television series). Within high energy physics most of the data is typically freely available to the whole collaboration. But sometimes you want to protect some of your data from other collaborations (there is always friendly competition), or restrict it from public viewing (to give the scientific collaboration that produced the data the chance to analyze it first). Similarly, within the collaboration itself different research groups will first share and verify discoveries among themselves before sharing them with the rest of the collaboration. The difference is that rights management (or access control) is formalized within multimedia (lawyers), but is usually more informal within the science community.
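To make the two kinds of restriction concrete, here is a minimal sketch of how a geographic restriction and a time window could be checked in code. Everything in it (the `Broadcast` class, its fields, the `can_view` function, the example dates) is hypothetical and only illustrates the idea; it is not how RTS or any physics collaboration actually implements access control.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Broadcast:
    """Hypothetical broadcast record with two optional right restrictions."""
    title: str
    # Country codes allowed to view; None means no geographic restriction.
    allowed_countries: Optional[FrozenSet[str]] = None
    # Last moment the item may be shown; None means no time limit.
    available_until: Optional[datetime] = None


def can_view(broadcast: Broadcast, viewer_country: str, now: datetime) -> bool:
    """Return True only if every restriction on the broadcast is satisfied."""
    if (broadcast.allowed_countries is not None
            and viewer_country not in broadcast.allowed_countries):
        return False  # geographic restriction, e.g. a sports match
    if broadcast.available_until is not None and now > broadcast.available_until:
        return False  # availability window has closed, e.g. a TV series
    return True


# Illustrative data only.
match = Broadcast("Tennis match", allowed_countries=frozenset({"CH"}))
series = Broadcast("TV series episode", available_until=datetime(2009, 3, 1))

print(can_view(match, "CH", datetime(2009, 2, 20)))   # viewer inside the country
print(can_view(match, "FR", datetime(2009, 2, 20)))   # viewer outside the country
print(can_view(series, "FR", datetime(2009, 4, 1)))   # after the window closed
```

The same shape would fit the physics case if the country set were replaced by a set of collaborations or research groups, and the time window ran the other way, as an embargo date after which the data becomes public.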
Besides these similarities there are also several differences.
Within high energy physics the users (the physicists) usually also participate in the development of the IT infrastructure. This is somewhat different in the multimedia domain. The users (journalists and website visitors) generally do not participate in the development process, which leads to another difference: within multimedia, (commercial) licenses are generally acquired for tools and software, or open source software is used where possible. Within the physics community, open source is used, but a lot of software is written by the physics community itself. The main reason is perhaps the specific requirements the physics community has, and the small market it represents. Multimedia, in contrast, is a multi-million (if not billion) market and hence numerous software packages are available.
Another difference is "user generated content". Within multimedia, users increasingly supply news stories (e.g. video), which can technically be seen as data within the multimedia context. The concept of "user generated content" can generally not be applied to the (high energy) physics environment. The original (raw) data from the experiment is the only data used. Of course physicists do create so-called "derived" data from this raw data, but they generally do not create their own raw data (although physicists do generate large amounts of simulation data).
Perhaps the biggest difference is that of scale and complexity. The scale of high energy physics experiments (in terms of data size, and computing and storage resources) is much larger than that of Swiss Public Broadcasting, but I would say that the data structure within multimedia is generally more complex (perhaps better to say, richer or more elaborate) than that of high energy physics. This complexity is partly due to the integration of metadata resources between many different (legacy) systems, which I think is a general challenge for many different companies.
Perhaps there is an order of magnitude difference between Swiss Public Broadcasting and high energy physics experiments, but it should also be noted that Swiss Public Broadcasting is a small entity in multimedia land. What about CNN and high energy physics? If Swiss Public Broadcasting has a target audience of approximately 6-7 million, it is perhaps not unreasonable to assume CNN has a target audience of 600-700 million. So if Swiss Public Broadcasting ships 2.5-3 PB per year, CNN might be delivering 250-300 PB of content per year to its audience.
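The CNN estimate is a simple linear extrapolation, sketched below. It assumes that the delivered volume grows in proportion to the audience, in other words that a CNN visitor consumes as much as an RTS visitor. That is a guess, not a measurement, and the function name is made up for illustration.

```python
def extrapolate_volume_pb(volume_pb: float,
                          source_audience: int,
                          target_audience: int) -> float:
    """Scale a yearly volume linearly with audience size.

    Assumption: per-user consumption is identical for both audiences.
    """
    return volume_pb * target_audience / source_audience


# RTS figures from the text, paired low with low and high with high.
low = extrapolate_volume_pb(2.5, 6_000_000, 600_000_000)
high = extrapolate_volume_pb(3.0, 7_000_000, 700_000_000)

print(f"Estimated CNN delivery: {low:.0f} to {high:.0f} PB per year")
```

Because both audience bounds are scaled by the same factor of 100, the volume range scales by 100 as well. A different assumption about per-user consumption would change the result proportionally.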
Typically data delivery in multimedia is more fragmented: more consumers but less data per consumer. Data needs at the per-user level are, and will be for a long time to come, much higher within high energy physics or any other data intensive science domain than for any type of consumer of multimedia content. These data needs will not only drive new discoveries but will also encourage and stimulate innovation in computing and networking that can benefit other areas such as multimedia. Similarly, multimedia has enormous experience with production scale content delivery networks and rights (access control) issues, which might serve as an inspiration for data intensive science collaborations when managing their data (on a technical level, not a legal level, I would think).
