How to collect, structure and publish data?

This workshop gives an overview of how to collect, structure and publish data.

You will learn about different types of data, data models and publishing methods.

The workshop is for everyone who collects data and needs to structure and share it.

Data, Information and Knowledge

Data is the lowest level of abstraction, information is the next level, and finally, knowledge is the highest level among all three.

Data on its own carries no meaning. For data to become information, it must be interpreted and take on a meaning. For example, the height of Mt. Everest is generally considered “data”, a book on Mt. Everest’s geological characteristics may be considered “information”, and a report containing practical information on the best way to reach Mt. Everest’s peak may be considered “knowledge”.

To create information and knowledge from data, it is necessary to know about

the different manifestations of data
various structuring processes
ways to collect and enrich the data
ways to publish the enriched data

Data

In computer science, data is information in a form suitable for use with a computer. Data can be divided into human readable data like text and binary data like images, audio and video.

Text

A written text is the representation of a spoken language by means of a writing system. A writing system can be a pencil, a piece of paper and a defined set of characters (alphabet). In computing, it’s nearly the same: the keyboard takes the place of the pencil and lets you choose characters from a character set. The characters of nearly all existing languages are defined in the Unicode standard; the Unicode Transformation Formats (UTF), such as UTF-8 and UTF-16, define how these characters are stored as bytes.

Unicode provides a unique number for every character,

no matter what the platform,
no matter what the program,
no matter what the language.

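Today (April 2012) UTF-8 is the most widely used encoding on the internet, and every email client and browser is able to display it. UTF-8 and UTF-16 are two encodings of the same Unicode character set, which has room for more than one million characters (1,114,112 code points); they differ only in how each character is stored as bytes. UTF-16 is used internally by operating systems like Windows and Apple OS X, while Linux mostly uses UTF-8.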
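The difference between a character’s Unicode number and its stored bytes can be tried out in a few lines. This is a minimal Python sketch; the helper name describe and the three sample characters are chosen only for this illustration.

```python
def describe(ch):
    """Return the Unicode number and the byte length in UTF-8 and UTF-16."""
    code_point = ord(ch)                       # the unique Unicode number
    utf8_length = len(ch.encode("utf-8"))      # 1 to 4 bytes per character
    utf16_length = len(ch.encode("utf-16-be")) # 2 or 4 bytes per character
    return code_point, utf8_length, utf16_length

# Sample characters: Latin A, e with acute accent, euro sign.
for ch in ["A", "\u00e9", "\u20ac"]:
    code_point, utf8_length, utf16_length = describe(ch)
    print(f"U+{code_point:04X}: {utf8_length} byte(s) in UTF-8, {utf16_length} byte(s) in UTF-16")
```

The Unicode number stays the same in every encoding; only the number of bytes changes.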

Human Readable Plain Text

Plain text is used as content of an ordinary file readable as textual material without much processing. On a website like Facebook it is used in text area fields (Figure 1).

Figure 1: Plain text in a text area field of a website

Formatted Text

Because plain text consists only of characters and so-called white space, there is a need to give people the possibility to format text. Formats can be bold, italic, strikethrough, underlined, superscript, subscript, different sizes of headlines and countless other possibilities.

To format text, you usually mark the part of the text that you want to format and then click a button or press a keyboard shortcut (e.g. Ctrl-B for bold). You need a word processor with special features to be able to format text.

‘Binary’ Word Processor

A well-known word processor is Microsoft Word. Microsoft first released this application, which offers the possibility to format and structure text, in 1983, and it is now part of the Microsoft Office package (Figure 2).

The advantage is and was the ease of use.

The disadvantages are that you have to buy a license to use it, that you have to install it on your PC, and that the .doc format is not human readable if Microsoft Word is not installed on your device. Later versions of Microsoft Word introduced an XML-based format (.docx) that is easier for machines to read, but it is not yet as widely used. The .doc files are still very common.

Figure 2: Microsoft Word

Microsoft Word is used for creating and editing documents on a PC.

‘Human Readable’ Word Processor

The alternatives to Microsoft Word are OpenOffice.org and LibreOffice. Both are Free and Open Source Software (FOSS), which means that you don’t have to pay for them and that you are free to change the software (if there is a need and you are able to do so).

The OpenOffice.org files (e.g. .odt) are compressed archives of text files.

Note: To see the human readable content without using OpenOffice.org, rename the file example.odt to example.zip. Then extract the file (usually with a right mouse click), and you’ll see a folder with many files inside (Figure 3).

Figure 3: Content of an example.odt file
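The same inspection can be scripted. The following Python sketch is only an illustration: since no real example.odt is available here, it first builds a tiny stand-in archive in memory. The file names mimetype and content.xml follow the OpenDocument layout, but the XML content is a made-up placeholder and not a valid document.

```python
import io
import zipfile

# Build a tiny stand-in for example.odt in memory (placeholder content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    archive.writestr("content.xml", "<text>Hello workshop</text>")

# Read it like any ZIP archive. For a real file you would pass its
# path instead, e.g. zipfile.ZipFile("example.odt").
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    content = archive.read("content.xml").decode("utf-8")

print(names)    # the files packed inside the archive
print(content)  # the human readable text of one of them
```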

OpenOffice.org and LibreOffice are used for creating and editing documents on a PC.

Formatting Text using HTML

On the internet, the markup language HTML (Hypertext Markup Language) is used to format texts.

If you want part of your text in bold, italic, strikethrough, underlined, superscript or subscript, you have to mark it with HTML tags. There is always an opening “tag” in front of the text you want to format and a closing tag behind that text. The browser will display (render) it in a nice way:

HTML: <strong>bold</strong> = result in a browser: bold
HTML: <em>italic</em> = result in a browser: italic
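To make the idea of a tag in front of and behind the text concrete, the following Python sketch walks through a small HTML fragment with the html.parser module from the standard library and records every tag and text piece it meets. The class name TagLogger and the sample sentence are made up for this illustration; a real browser does far more than this.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record opening tags, closing tags and the text between them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

logger = TagLogger()
logger.feed("This is <strong>bold</strong> and <em>italic</em>.")
logger.close()  # flush any text that is still buffered

for event in logger.events:
    print(event)
```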

It is of course not very comfortable to work like this. As a solution, so-called What You See Is What You Get (WYSIWYG) editors are used. These editors are usually FOSS, like TinyMCE (Figure 4).

Figure 4: WYSIWYG editor on a website

These editors sometimes let you see the underlying HTML code by clicking a button called Source or HTML (Figure 5).

Figure 5: WYSIWYG editor in Source mode with HTML display

Word processors do not offer such a switch between WYSIWYG mode and HTML mode, and most users don’t want it because it confuses them.

Formatting text in an HTML WYSIWYG editor works nearly the same way as in a word processor. The main difference is that you do not save the content as a file on your PC. The content is saved in the database of the server where the website is hosted.

The advantages are that you do not need to buy and install anything on your PC and that everything is human readable, even when you access the website on your mobile device.

WYSIWYG editors are used on websites to create content in a particular field.

A hybrid word processor

Services like Google Docs allow you to combine the advantages of both worlds. The service is web/browser based. You write in a WYSIWYG editor that allows you to create and edit documents online while collaborating in real time with other users. You can store the documents on Google’s servers in the cloud or download them in various binary and human readable formats.

Web based word processors are used like binary word processors to create and store documents.

Image

An image is something that you draw on a piece of paper, take with a camera or capture as a screenshot from your PC or mobile device. If the image is on paper, it is possible to scan it. In computing, an image is a big lump of bits and bytes that is usually stored in a file. The information is binary, and it is not possible to display an image without a special application like an image viewer or a web browser.

Most images consist of dots, so-called pixels.

A pixel is a little dot on a screen with a related color. If you come closer to your screen you’ll see something like this:

oooooooo
oooooooo
oooooooo
oooooooo
oooooooo
oooooooo
oooooooo
oooooooo

The more pixels an image consists of, the higher its resolution. That means it can be displayed in more detail and more sharply. The more colors are available, the better a photo looks.

But how to display and store it?

The data of images is not human readable.

Like word processors for text, there are image editors to create and edit images. Examples are Adobe Photoshop and the FOSS alternative GIMP. Images are usually stored in files.

In most operating systems, image file viewers are included in the file explorer. Sometimes, these viewers are able to edit the image, e.g. change the file size, the contrast and reduce red eyes on photos.

The formats .jpg, .png and .gif are widely used on the internet. Let’s have a closer look.

What does image resolution mean?

Resolution refers to the number of pixels in an image. For example, an image that is 2048 pixels wide and 1536 pixels high (2048 x 1536) contains 3,145,728 pixels (or 3.1 Megapixels). You could call it a 2048 x 1536 or a 3.1 Megapixel image.
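The arithmetic behind the megapixel figure is a single multiplication. A minimal Python sketch, where the function name megapixels is a hypothetical helper for this example:

```python
def megapixels(width, height):
    """Return the total number of pixels and the same value in megapixels."""
    total_pixels = width * height
    return total_pixels, total_pixels / 1_000_000

total, mp = megapixels(2048, 1536)
print(total)         # total number of pixels
print(round(mp, 1))  # megapixels, rounded to one decimal place
```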

How is the image resolution related to the resolution of your computer monitor?

Your computer screen is able to display different resolutions. Usually, you can configure your resolutions in your operating system.

The larger the screen, the higher its resolution is likely to be set.

If your monitor is set to 1024 x 768 and you open up an image that is 640 x 480, it will only fill up a part of your screen. If you open an image that is 2048 x 1536 (3.1 megapixels), then you will find yourself moving the slider bar around to see all the different parts of the image. It just won’t fit.

What does image quality mean?

In addition to image size, the quality of the image can also be manipulated. By using lossy compression, you can keep the pixel dimensions of the image the same and reduce the amount of disk space required to store it, but you will be sacrificing image quality.

Joint Photographic Experts Group (JPEG)

The .jpg format was created by the Joint Photographic Experts Group. If you take a photo with a digital camera, it is usually stored in the compressed .jpg format (Figure 6).

Figure 6: Typical .jpg photo

The .jpg format is used for publishing photos on websites.

Graphics Interchange Format (GIF)

The Graphics Interchange Format (GIF) is an image format that was introduced by CompuServe in 1987 and has since come into widespread usage on the World Wide Web due to its wide support and portability. Controversy over the licensing agreement between the patent holder, Unisys, and CompuServe in 1994 spurred the development of the Portable Network Graphics (PNG) standard; since then, all the relevant patents have expired.

GIFs are suitable for sharp-edged line art (such as logos) with a limited number of colors. This takes advantage of the format’s lossless compression, which favors flat areas of uniform color with well defined edges.
GIFs can be used for small animations and low-resolution film clips.
In view of the general limitation on the GIF image palette to 256 colors, it is not usually used as a format for digital photography.
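Why lossless compression favors flat areas of uniform color can be shown with a small experiment. The Python sketch below uses zlib from the standard library as a stand-in: zlib implements DEFLATE, the method PNG builds on, whereas GIF uses a different lossless method (LZW), so the exact numbers would differ for a real GIF. The 100 x 100 "images" are just made-up byte sequences with one byte per pixel.

```python
import random
import zlib

random.seed(0)  # fixed seed so every run uses the same "noise"
pixel_count = 100 * 100

# One uniform grey area versus photo-like random noise.
flat = bytes([200]) * pixel_count
noisy = bytes(random.randrange(256) for _ in range(pixel_count))

flat_size = len(zlib.compress(flat))
noisy_size = len(zlib.compress(noisy))

# Lossless means that decompressing gives back exactly the original bytes.
lossless = zlib.decompress(zlib.compress(noisy)) == noisy

print(flat_size, noisy_size, lossless)
```

The uniform area shrinks to a tiny fraction of its original size, while the noisy data hardly shrinks at all.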

Portable Network Graphics (PNG)

Portable Network Graphics (PNG) is an image format that employs lossless data compression. PNG was created to improve upon and replace GIF (Graphics Interchange Format) as an image-file format not requiring a patent license. The initials PNG can also be interpreted as a recursive initialism for “PNG’s Not GIF”.

PNG was designed for transferring images on the Internet, not for professional-quality print graphics.

Comparison to Graphics Interchange Format (GIF)

On small images, GIF can achieve greater compression than PNG.
On most images, except for the above cases, GIF will be bigger than PNG.
PNG gives a much wider range of transparency options than GIF (Figure 7).
Whereas GIF is limited to 256 colors, PNG gives a much wider range of color depths (millions of colors), allowing for greater color precision, smoother fades, etc.
GIF intrinsically supports animated images. PNG supports animation only via unofficial extensions.
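The color counts in this comparison follow from the number of bits stored per pixel. A short Python sketch, assuming the common cases of 8 bits per pixel for a GIF palette and 24 bits per pixel (8 each for red, green and blue) for a truecolor PNG; the function name color_count is a hypothetical helper:

```python
def color_count(bits_per_pixel):
    """Number of different colors that fit into the given number of bits."""
    return 2 ** bits_per_pixel

gif_colors = color_count(8)   # palette of a GIF image
png_colors = color_count(24)  # truecolor PNG with 8 bits per channel

print(gif_colors, png_colors)
```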

Comparison to JPEG

JPEG format can produce a smaller file than PNG for photographic (and photo-like) images, since JPEG uses a lossy encoding method specifically designed for photographic image data, which is typically dominated by soft, low-contrast transitions, and an amount of noise or similar irregular structures. Using PNG instead of a high-quality JPEG for such images would result in a large increase in file size with negligible gain in quality. By contrast, when storing images that contain text, line art, or graphics – images with sharp transitions and large areas of solid color – the PNG format can compress image data more than JPEG can, and without the noticeable visual artifacts which JPEG produces around high-contrast areas. Where an image contains both sharp transitions and photographic parts a choice must be made between the two effects. JPEG does not support transparency.
The PNG specification does not include a standard for embedded Exif image data from sources such as digital cameras.

Figure 7: transparent .png file

Audio

Sound recording and reproduction is an electrical or mechanical inscription and re-creation of sound waves, such as spoken voice, singing, instrumental music, or sound effects.

Digital recording stores audio as a series of binary numbers representing samples of the amplitude of the audio signal at equal time intervals, at a sample rate high enough to convey all sounds capable of being heard. A digital audio signal must be reconverted to analog form during playback before it is applied to a loudspeaker or earphones.
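Sampling at equal time intervals can be sketched in a few lines of Python. The values used here are illustrative assumptions and not taken from the text: a 440 Hz test tone, a sample rate of 44,100 samples per second and 16 bits per sample. The function name sample_sine is hypothetical.

```python
import math

def sample_sine(frequency_hz, sample_rate_hz, duration_ms, bits=16):
    """Sample a sine tone at equal time intervals and store each sample as an integer."""
    max_amplitude = 2 ** (bits - 1) - 1  # largest value of a signed sample
    sample_count = sample_rate_hz * duration_ms // 1000
    return [
        round(max_amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz))
        for n in range(sample_count)
    ]

samples = sample_sine(frequency_hz=440, sample_rate_hz=44100, duration_ms=10)
print(len(samples))  # number of samples in 10 milliseconds
print(samples[:5])   # the first few amplitude values
```

Each number in the list is one measurement of the amplitude; played back in order at the same rate, the numbers describe the original wave again.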

MP3 (MPEG-1 or MPEG-2 Audio Layer III) is a patented digital audio encoding format using a form of lossy data compression. It is a common audio format for consumer audio storage, as well as a de facto standard of digital audio compression for the transfer and playback of music on digital audio players.

The data is stored in files with the extension .mp3.

These files can be edited with digital audio editors like Audacity.

Video

A video is a mixture of everything mentioned above. It is the technology of electronically capturing, recording, processing, storing, transmitting, and reconstructing a sequence of still images representing scenes in motion.

There are many different digital encoding formats. The most common is MPEG-4 Part 14 or MP4. It is most commonly used to store digital video and digital audio streams, but can also be used to store other data such as subtitles and still images. It allows streaming over the Internet. The only official filename extension for MPEG-4 Part 14 files is .mp4.

Structuring Process and Databases

To structure all this data you need to understand the relations between the different types of data. Data about a house can be:

a written history
photos
ground plan
an interview with the owner
a video about the house

Objects

Usually, the analogy with an object is a good way to start. A “real” girl or boy is an object.

It has properties, like size, name, hair color, etc. It is possible to define a kind of abstract description of an object called “girl” or “boy”. This abstract description can be called class, content type, structure, or abstract description (Figure 8).

Figure 8: abstract and real objects

The important part here is that each boy or girl can be described in an abstract way, and the “real” boy or girl has values for all the abstract properties.

The same works for objects like a house or a car. It even works for a news article.
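In a programming language, the abstract description and the “real” objects could look like the following Python sketch. The class name Child, the property names and all values are invented for this illustration.

```python
from dataclasses import dataclass

@dataclass
class Child:
    """Abstract description: the properties that every child object has."""
    name: str
    height_cm: int
    hair_color: str

# "Real" objects: each one has its own values for the abstract properties.
anna = Child(name="Anna", height_cm=120, hair_color="brown")
ben = Child(name="Ben", height_cm=131, hair_color="blond")

print(anna)
print(ben)
```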

Relations

All of these objects have relations to each other.

A “boy” lives in a “house”,
the house is built in a city,
other houses are in the city too,
the city is an object in a region/country too.

Databases and data models

Any type of data can represent an object. To store the data technically, usually a database is used. A database consists of structures, data and ways to add, edit, select and delete data.

The structures are related to one another.

Common relational database systems use tables. A table is a kind of object structure. Each property is represented by a field; HId is the identification number of a house (HouseId) (Table 1).

Field	Type	Length
HId	numerical	10
Title	text	128
City	text	128
Square meters	numerical	6

Table 1: Structure of the house table

Each row in the table represents one “real” house (Table 2).

HId	Title	City	Square meters
1	Village house	Fitou	200
2	Passive house	Freiburg	120
3	Modern style house	Istanbul	400

Table 2: House objects as rows in the houses table
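As a rough sketch of how such a table could be created and filled, the following Python example uses the sqlite3 module from the standard library with a temporary in-memory database. The column name square_meters and the choice of SQLite are assumptions for this example; SQLite also does not enforce the field lengths given above.

```python
import sqlite3

# Temporary database that lives only in memory.
connection = sqlite3.connect(":memory:")

# The structure of the table: one column per field.
connection.execute(
    """
    CREATE TABLE house (
        HId           INTEGER PRIMARY KEY,
        Title         VARCHAR(128),
        City          VARCHAR(128),
        square_meters INTEGER
    )
    """
)

# The data: one row per "real" house.
rows = [
    (1, "Village house", "Fitou", 200),
    (2, "Passive house", "Freiburg", 120),
    (3, "Modern style house", "Istanbul", 400),
]
connection.executemany("INSERT INTO house VALUES (?, ?, ?, ?)", rows)

stored = connection.execute("SELECT * FROM house ORDER BY HId").fetchall()
print(stored)
```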

Normalization

Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships. Our last example with one table was a simple representation.

Let’s normalize our example by creating a City table too, so that we have two tables: a house table and a city table. The city table could have fields for all the “City properties” like country, inhabitants, size, etc.

The house table could then have a field that refers to an existing city. In this case the city field would no longer contain the name of the city. It would contain the CId (City Id).

Tables 3 to 6 show an example.

Field	Type	Length
CId	numerical	10
Name	text	128
Country	text	128
Inhabitants	numerical	10

Table 3: Structure of the City table

Field	Type	Length
HId	numerical	10
Title	text	128
CId	numerical	10
Square meters	numerical	6

Table 4: Structure of the House table

Each row in the City table represents one “real” City.

CId	Name	Country	Inhabitants
1	Fitou	France	1000
2	Freiburg	Germany	50000
3	Istanbul	Turkey	15000000

Table 5: City objects as rows in the City table

Each row in the House table represents one “real” house (Table 6).

HId	Title	CId	Square meters
1	Village house	1	200
2	Passive house	2	120
3	Modern style house	3	400

Table 6: House objects as rows in the Houses table

User interface

The database itself has no user interface. It is just a place to store data. One possible user interface for adding data to a database is a web-based tool like phpMyAdmin.

Retrieving Data

To retrieve data it is necessary to describe what you want to have. An example could be: “Give me the data of all houses located in Istanbul”.

Because there is no user interface, it is a bit complicated to tell your database what you want to achieve :) A common way is to use the Structured Query Language (SQL).

“Give me the data of all houses located in Istanbul” would look like this in SQL:

SELECT house.title, house.square_meters, city.name
FROM house JOIN city
ON house.CId = city.CId
WHERE city.name = 'Istanbul';

When we send this SQL statement to our database, it returns the desired data.
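A minimal sketch of sending such a query to a database, using Python’s sqlite3 module and a temporary in-memory database. The lowercase column names and the use of SQLite are assumptions for this example, and the “Modern style house” is assigned to Istanbul (CId 3), as in the single-table example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # temporary database in memory

# Two related tables: each house refers to its city through the CId field.
connection.executescript(
    """
    CREATE TABLE city (
        CId INTEGER PRIMARY KEY, name TEXT, country TEXT, inhabitants INTEGER
    );
    CREATE TABLE house (
        HId INTEGER PRIMARY KEY, title TEXT,
        CId INTEGER REFERENCES city(CId), square_meters INTEGER
    );
    INSERT INTO city VALUES (1, 'Fitou', 'France', 1000);
    INSERT INTO city VALUES (2, 'Freiburg', 'Germany', 50000);
    INSERT INTO city VALUES (3, 'Istanbul', 'Turkey', 15000000);
    INSERT INTO house VALUES (1, 'Village house', 1, 200);
    INSERT INTO house VALUES (2, 'Passive house', 2, 120);
    INSERT INTO house VALUES (3, 'Modern style house', 3, 400);
    """
)

# "Give me the data of all houses located in a given city."
query = """
    SELECT house.title, house.square_meters, city.name
    FROM house JOIN city ON house.CId = city.CId
    WHERE city.name = ?
"""
result = connection.execute(query, ("Istanbul",)).fetchall()
print(result)
```

The question mark is a placeholder that is filled with the city name when the query runs, which avoids pasting user input directly into the SQL text.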

Collecting Process

When collecting data, it is necessary to decide in advance how you will use it afterwards and how you want to make it available and publish it. In the age of the internet, data is usually stored in databases. Therefore, define the data structure, collection methods and target data before you start collecting data.

The structure of the defined tables is a blueprint for creating forms. These forms can be available on paper or on a website.

Collection

The data collection can be a tricky process. Depending on the data that has to be collected there are different ways to collect it (Text, Audio, Photos, Videos).

Paper form

Only text and sketches can be collected using paper-based forms. Photos (which are nowadays digital), sound and video cannot be collected in a paper-based process. Data collected with a paper-based form needs to be entered into the database in a separate step.

Web based form

Data collected using web-based forms is automatically added to the database after the form is submitted. Web-based forms can contain several validation methods to avoid inconsistent data. It is also possible to collect audio, video, and spoken and written text using a mobile device (Figure 9).

Figure 9: Collecting data and author text using a mobile device
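What a validation method in a web-based form might check can be sketched as follows. This Python example is only an assumption of how such a check could look for the house form: the function name validate_house_form, the field names and the rules (a title of at most 128 characters, a positive whole number of square meters) are made up to match the example tables.

```python
def validate_house_form(form):
    """Return a list of error messages; an empty list means the data is consistent."""
    errors = []

    title = form.get("title", "").strip()
    if not title:
        errors.append("Title is required.")
    elif len(title) > 128:
        errors.append("Title must be at most 128 characters long.")

    # Form fields arrive as text, so the number has to be converted first.
    try:
        square_meters = int(form.get("square_meters", ""))
    except ValueError:
        square_meters = 0
    if square_meters <= 0:
        errors.append("Square meters must be a positive whole number.")

    return errors

print(validate_house_form({"title": "Village house", "square_meters": "200"}))
print(validate_house_form({"title": "", "square_meters": "abc"}))
```

Only when the list of errors is empty would the form data be written to the database.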

Present Findings

To present the findings one could use

an interactive map

Figure 10: Interactive Google Map

a table

Figure 11: Table

a grid

Figure 12: Grid

statistics of data

Figure 13: Statistics

Publishing Process

The publishing process of data was and still is discussed by scientists around the world. For the last 500 years, books printed on paper were the main method of publishing data. The whole concept of reading and publishing is still based on the idea of printing content on paper. But this is changing rapidly!

With the advent of computers and the internet, scientists and researchers were looking for ways to use and share documents and data. About 20 years ago, the world wide web was “invented”, and even today most of what is available on the web consists of HTML files that include other files such as .css, .mp3, .mp4, .jpg, .png, .gif and more.

The World Wide Web (WWW)

The world wide web has been in common use for about 15 years. It consists of web browsers, web servers, markup languages and the availability of internet access.

Hypertext Markup Language (HTML)

HTML was specified by Tim Berners-Lee in 1990. The idea was to combine different types of files and text in an HTML document and to link to other HTML documents in a world wide web.

HTML elements still form the building blocks of all websites today. HTML allows images and objects to be embedded and can be used to create interactive forms. It provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes and other items.

HTML is usually stored in .html or .htm files.

Example of an HTML tag for a headline: <h1>Headline</h1>

Cascading Style Sheets (CSS)

To improve web presentation capabilities a language called CSS was published in 1996.

CSS is stored in .css files.

Example of a CSS statement that colors a headline: h1 { color:red;}

Web server

A web server is a service running on a computer that stores and delivers .html files and all the other files. The most common web server software is called Apache.

Web browser

A web browser is the software that runs on your device. Most devices (computers) today are connected to the world wide web 24/7. By typing a URL like http://google.com into your browser, you are asking one of Google’s servers to deliver an .html page to your device. The browser reads the files and renders them as the web page you know. Currently, the most used web browsers are Internet Explorer, Chrome, Firefox and Safari (Figure 14).

Figure 14: Wikimedia usage share of web browser

Content Management Systems

The described workflow of storing data in static HTML pages made publishing and sharing documents possible in general. At that time, it was enormous progress compared to paper-based books. It changed and still changes the world!

But the pages were not interactive!

For that reason, scripting languages like PHP (originally “Personal Home Page”) were invented in the mid-1990s. With PHP it became possible to generate HTML pages on the fly, based on data that came from database queries and files from different resources. Database servers already existed at that time.
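Generating a page on the fly means that the HTML is built by a program at the moment it is requested. The sketch below illustrates the idea in Python rather than PHP; the function name render_house_page and the sample rows, which stand in for the result of a database query, are assumptions for this example.

```python
from html import escape

def render_house_page(houses):
    """Build an HTML page from data rows, as a script would do for each request."""
    items = "\n".join(
        f"<li>{escape(title)} ({square_meters} square meters)</li>"
        for title, square_meters in houses
    )
    return f"<h1>Houses</h1>\n<ul>\n{items}\n</ul>"

# Stand-in for rows returned by a database query.
rows = [("Village house", 200), ("Passive house", 120)]
page = render_house_page(rows)
print(page)
```

If a row in the database changes, the next request produces a different page without anyone editing an HTML file. The escape call keeps characters like < in the data from being read as HTML tags.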

Most web servers, database servers and PHP itself were and are Free and Open Source Software, and so are the Content Management Systems that are based on these foundations.

The most common CMS’s today are WordPress, Joomla! and Drupal. The idea behind them is to give people an easy way to publish content by adding and editing data in a web browser.

Today, text-based content is widely created with the help of CMS’s. Nearly all newspaper websites use CMS’s to create and edit their content.

But still today only 30% of all websites in the world use CMS’s; 70% of all websites are still made in the “old fashioned” static way of writing HTML code that was invented 20 years ago.

Web Applications

The “next generation” CMS’s are called Web Applications. They are as powerful as applications that have to be installed on your device like Microsoft Word or OpenOffice.org.

As an example, have a look at Google Docs, a web-based office system.

Platforms like Flickr offer browser-based image editing, YouTube offers browser-based video recording and editing, and more and more services try to lower the barriers to collecting, recording, editing and storing data. The data is still stored in databases, but another way of storing data is becoming more and more common.

The cloud!

The marketing buzzword “Cloud” is a mixture of the “good old” hard disk and a kind of database service. Cloud can mean Software as a Service (SaaS), Infrastructure as a Service (IaaS) or Platform as a Service (PaaS). The most important thing to know about “the cloud” is that it is much easier to use than a hard disk and that you don’t have to deal with physical “things”.

If your device is connected to the internet 24/7, all of your data is available anytime and anywhere through a cloud system.

App Ecosystems

A parallel development was introduced in 2007 with Apple’s iPhone. Small applications (apps) that are not browser based became available in so-called app stores. The apps are able to use the camera, the microphone, the GPS, and all the other features of the device for collecting and publishing data. The store concept allowed developers to earn money by writing apps.

eBooks

In the app ecosystems, ebooks are playing a more and more important role. They consist of HTML/CSS files and are packed in a format called EPUB. It is possible to sell them in app stores and they look like a book printed on paper. Inside they are HTML based like every website.

In 2012 the biggest ecosystems are Amazon Kindle and Apple iBookstore.

Paper based books

Traditional books will still play a role in the future, but that role will decrease. They cannot be linked to other resources and are not interactive. They are expensive and complicated to store and deliver.

The current role of CMS’s

Content Management Systems are still used to combine all these data. They retrieve data from file systems or databases in the cloud or elsewhere and provide the data to web browsers or apps.

Conclusion

Since the advent of the world wide web, the publishing process of content has changed massively. In former times, you needed a publishing house to create and distribute your book. Today it is possible to create, write and publish paperback books, content on interactive websites and content for mobile phones totally on your own.

And thus the importance of knowing how to use the world wide web for publishing data has increased constantly.