Monday, 7 October 2019, 09:00 - Wednesday, 9 October 2019, 13:00
Sofia University “St. Kl. Ohridski”
Workshop objectives, content and target audience
The objective of the two-and-a-half-day workshop was to foster collaboration between social sciences and humanities researchers in Central and Eastern Europe and the research communities in these fields represented in CLARIN (the Common Language Resources and Technology Infrastructure, involving 25 countries) and in the EU-funded PARTHENOS infrastructure project (16 partners in 9 countries).
The target audience was researchers and lecturers in the social sciences and humanities in a broad sense (literary studies, history, political science, communication science, media studies, etc.) who use language data in their research and/or teaching, from countries in Central and Eastern Europe that are not participating in the PARTHENOS project, with a special focus on Albania, Belarus, Bosnia and Herzegovina, Bulgaria, Croatia, Hungary, Moldova, Montenegro, North Macedonia, Romania, Serbia, Slovakia, Slovenia and Ukraine.
For the workshop, the following three topics were selected, which by their very nature lend themselves to collaborative, cross-border and cross-discipline research as well as to education:
- Working with Parliamentary Records
- Challenges in Literary History
- Oral History: working with interview data
The workshop was organised by Steven Krauwer and Darja Fišer. The 25 participants were researchers and lecturers who use language data in their research and/or teaching (in literary studies, history, political science, communication science, media studies, etc.), from Central and Eastern European countries not participating in the PARTHENOS project (Slovenia, Croatia, Bosnia and Herzegovina, Serbia, North Macedonia, Montenegro, Albania, Bulgaria, Hungary, Romania, Moldova, Ukraine, Belarus, Slovakia). The workshop also served to promote the PARTHENOS training materials, the CLARIN Resource Families, and CLARIN in general.
Many scholars working with spoken narratives (oral historians, sociolinguists, anthropologists, etc.) may not yet be fully aware of the possibilities that human language technology (HLT) and digital research infrastructures have to offer. Scholars should regard digital research infrastructures not only as convenient ways of gaining and providing access to data (as they mostly do now), but also as means of data preparation and analysis.
The Sofia workshop seeks to remedy this limited understanding among scholars by demonstrating parts of the CLARIN and CLARIAH-NL research infrastructures, namely the CLARIN Oral History Portal and the CLARIAH Media Suite, both developed in the past few years (2015-2018).
DAY 1 (7 Oct)
9:00 - 12:00: Introduction (Darja Fišer & Steven Krauwer)
13:00 - 16:00: Topic 1 (lecture + hands-on session)

DAY 2 (8 Oct)
9:00 - 12:00: Topic 2 (lecture + hands-on session)
13:00 - 16:00: Topic 3 (lecture + hands-on session)

DAY 3 (9 Oct)
9:00 - 12:00: Presentations & Discussion (groups present their results from the hands-on sessions; the discussion addresses how to bring CLARIN closer to its users, both researchers and students)
TOPIC 1: Compiling parliamentary corpora (Tomaž Erjavec, Andrej Pančur)
The session introduces corpora of parliamentary proceedings, with a focus on building and especially encoding such corpora. We give the motivation for research on parliamentary proceedings, mention the formats in which they are typically available, sketch the tool chain needed for their download, clean-up, and structural and linguistic annotation, and discuss existing and emerging encoding schemes for their mark-up. Here we concentrate on the Text Encoding Initiative (TEI) Guidelines: we first introduce the Guidelines and then demonstrate the mark-up of parliamentary corpora on several existing cases, discussing issues such as annotating sessions, speeches and interruptions, metadata on speakers and sittings, using typologies, and including linguistic mark-up.
The session also includes a hands-on part. Before the workshop we will enquire about the expectations and technical skills of the participants, but the default scenario is that each participant brings a short excerpt from a parliamentary debate as a Word file, first roughly annotates it, automatically converts it to TEI, and then completes the annotation in TEI using the Oxygen XML editor (which can be used free of charge for one month). With this, the participants get some hands-on experience with the TEI structure of parliamentary corpora. We will also demonstrate the use of such corpora on some pre-existing ones with noSketch Engine.
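To give a flavour of the kind of mark-up discussed above, the sketch below builds a tiny TEI-style fragment for a short parliamentary exchange. The element names (<u>, <note>, <div type="debateSection">) follow common TEI practice for transcribed speech, but the speaker IDs and the text are invented, and the exact encoding scheme used in the session may differ.

```python
import xml.etree.ElementTree as ET

# A minimal TEI-style fragment for a parliamentary exchange.
# Speaker IDs and wording are invented for illustration only.
body = ET.Element("body")
debate = ET.SubElement(body, "div", {"type": "debateSection"})

# An attributed utterance: <u who="..."> with a pointer to a speaker record.
u1 = ET.SubElement(debate, "u", {"who": "#SpeakerA"})
u1.text = "I move that the bill be read a second time."

# Interruptions (applause, heckling) are often recorded as notes
# rather than as attributed utterances.
note = ET.SubElement(debate, "note", {"type": "comment"})
note.text = "(Applause.)"

u2 = ET.SubElement(debate, "u", {"who": "#SpeakerB"})
u2.text = "On a point of order, the motion is out of scope."

xml_str = ET.tostring(body, encoding="unicode")
print(xml_str)
```

In a full corpus such a fragment would sit inside a TEI document with a header carrying the metadata on speakers and sittings mentioned above.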
TOPIC 2: Challenges in Corpus Building for the Romanian ELTeC Collection
Lecture (by Dr. Roxana Patras)
In its first part, the lecture outlines the challenges of corpus building for the Romanian ELTeC collection. Some of these originate in the indisputable linguistic and cultural specificity of Romanian texts, while others stem from post-communist policies concerning the digitisation and open-access treatment of cultural heritage:
a. scarcity of digitised resources from the period 1850-1920, and thus difficult extraction and checking of metadata;
b. analysis and automation tools that are still unadjusted to the diachronic particularities of Romanian;
c. eligibility and composition principles, most of them derived from Western literary traditions (book length, number of editions, 30% canonical works, 10-30% female authors, sampling according to various time slots, etc.), that are largely inapplicable to Eastern European literary phenomena and institutions;
d. in the case of Romanian, the slow process of language standardisation, which raises difficult issues of clean-up and normalisation. For instance, novels published between 1850 and 1865 are printed in a mixed Cyrillic-Latin alphabet that regular OCR engines cannot read, while novels published after 1865, albeit in the Latin alphabet, still contain special glyphs that result in bad OCR output.
In the second part of the lecture, I introduce a few practical solutions that I have tested and that have proven effective in addressing these issues: customised digitisation (focused on novel subgenres); customised datasets and DOI assignment on Zenodo; support repositories for different text formats on GitHub and Zenodo; and HTR models for specific prints (such as those using the Romanian Transition Alphabet).
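The clean-up and normalisation step for post-1865 prints can be sketched as a simple character-mapping pass over the OCR output. The glyph table below is a hypothetical, deliberately incomplete illustration, not the actual transliteration rules used for the Romanian collection.

```python
# Toy normalisation pass for 19th-century Romanian print.
# The glyph mapping is a hypothetical example, not a
# linguistically complete transliteration table.
GLYPH_MAP = str.maketrans({
    "ĭ": "i",  # breve-marked short i in pre-reform spelling
    "ŭ": "u",  # breve-marked short u
    "ê": "â",  # circumflex variant mapped to the modern letter
})

def normalise(text: str) -> str:
    """Map obsolete glyphs to modern Latin equivalents."""
    return text.translate(GLYPH_MAP)

print(normalise("omŭ"))  # -> "omu"
```

Real pipelines would combine such a table with word-level rules, since some obsolete glyphs change or disappear depending on their position in the word.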
Tutorial: Corpus Design Principles in the COST Action 'Distant Reading for European Literary History' (Dr. Carolin Odebrecht)
The tutorial introduces sampling and balancing criteria as well as encoding principles for the multilingual European Literary Text Collection (ELTeC). We will look at the ELTeC-TEI encoding principles and we will use text examples to work with the encoding schemas for metadata and markup. The tutorial will also present the Action's goals and our working environments.
TOPIC 3: A multidisciplinary approach to the use of technology in research: the case of interview data
Lecture (Louise Corti)
In the first part of this session, the lecture introduces different scholarly approaches to working with interview data as a primary or secondary data source. We set out some of the distinct traditions and differences in analytic practices and tool use across the disciplines. The wide CLARIN family of digital methods and tools in use by linguists and speech technologists, such as automated speech recognition, annotation, text analysis and emotion recognition tools, is open to wider exploitation, for example by digital humanities scholars, historians and social scientists. We show how these tools can be used to support different phases of the research process, from data preparation to analysis and presentation. Connecting up tools to help meet the needs of a researcher’s analytic journey can also be beneficial. In this respect, we describe the CLARIN Oral History ‘Transcription Chain’ (TChain), a tool that supports transcription, alignment and editing of audio and text in multiple languages.
Hands-on (Christoph Draxler)
The second part of the session offers a hands-on workshop, giving participants the opportunity to work with the TChain: using a dedicated portal, they convert audio-visual material into a suitable format, run automatic speech recognition (ASR), correct the ASR results, and download them.