Tycho Brahe Parsed Corpus of Historical Portuguese

Dynamic Version System:

1. Source text and Edition types

The kind of edition applied to a text depends on the nature of the source text.

1.1 Source text: Preserved orthography
       (original printings and diplomatic transcriptions):
       Complete Edition

Source texts with preserved orthography need to be modernized so that they can be efficiently processed by the Tagger.

This modernized version is what we refer to as the "Complete Edition"; it follows the guidelines detailed in the [Manual (in Portuguese)].

1.2 Source text: Edited orthography
        (intermediate editions, used in Phase I):
        Technical Edition

Some source texts had already been modernized by third parties. Even in these cases, some items still need to be modified so that the Tagger can process them efficiently.

The result of these targeted modifications is what we call the "Technical Edition"; it follows the guidelines of the Manual cited above.

1.3 Special cases

In some documents transcribed from source texts with preserved orthography, only the technical edition was applied. These texts were included in the Corpus during Phase I (1998-2003), when texts were not modernized.

For these texts, complete modernization will be applied progressively, prioritizing those not yet processed by the Tagger.

To view the list of texts and their edition level, please go to the Catalog.

2. Available Versions

The following versions are available:

2.1 Source text transcription

Access to the original transcription of the source text (the original print orthography; or the modernized orthography edited by a prior editor).

2.2 Edited text

Access to the modifications made by the Corpus team (complete modernizations or technical modifications, as appropriate).

Two types of files are available in this case:

  • HTML files for reading
  • TXT (unformatted text) files, for use with processing tools and data search

2.3 Editions lexicon

A list of editions (modernizations and modifications) applied to the text.
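
As a rough illustration of what such a lexicon holds, the sketch below counts each pair of original and edited forms that differ. It assumes the tokens are already aligned one to one; the function name `editions_lexicon` and the sample word pairs are hypothetical and are not taken from the Corpus or its tools.

```python
from collections import Counter


def editions_lexicon(aligned_tokens):
    """Count each (original, edited) pair whose two forms differ.

    `aligned_tokens` is an iterable of (original, edited) tuples.
    This alignment is an assumption made for the example only.
    """
    return Counter(
        (orig, edited) for orig, edited in aligned_tokens if orig != edited
    )


# Illustrative sample, not real corpus data.
tokens = [
    ("Era", "Era"),
    ("hum", "um"),
    ("homem", "homem"),
    ("muyto", "muito"),
    ("hum", "um"),
]

lexicon = editions_lexicon(tokens)
for (orig, edited), count in sorted(lexicon.items()):
    print(f"{orig} -> {edited} ({count})")
```

Unchanged tokens are left out, so the result lists only the forms that an edition actually touched.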


3. General information

The corpus texts are stored in XML format. Each text is linked to an XSL stylesheet that lays out its metadata (its profile) in a readable way.

An XML file may contain, along with the original content of the text, the modifications and modernizations applied to it, as well as its textual structure (paragraphs, pages, etc.).
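
To make the idea concrete, here is a minimal sketch of how one file can carry both the original and the edited form of a word, and how either reading can be recovered from it. The element names (`text`, `p`, `ed`, `o`, `e`) and the sample sentence are hypothetical; the actual Corpus markup is not described in this document and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, for illustration only:
# <ed> wraps one edited word, <o> holds the original form, <e> the edited form.
SAMPLE = (
    "<text>"
    "<p>Era <ed><o>hum</o><e>um</e></ed> homem "
    "<ed><o>muyto</o><e>muito</e></ed> rico.</p>"
    "</text>"
)


def render(paragraph, version):
    """Return the paragraph text, keeping <o> or <e> inside each <ed>."""
    keep = "o" if version == "original" else "e"
    parts = [paragraph.text or ""]
    for child in paragraph:
        if child.tag == "ed":
            chosen = child.find(keep)
            if chosen is not None and chosen.text:
                parts.append(chosen.text)
        # Text that follows the child element belongs to the paragraph.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE).find("p")
print(render(paragraph, "original"))
print(render(paragraph, "edited"))
```

Storing both forms side by side is what allows different versions to be produced from a single source file rather than from separate copies.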

All of this information can be accessed by generating different versions of the text with the 'getversion.pl' CGI script. This script generates a specific version of the text based on one of the following XSL stylesheets:

  • origversion.xsl: HTML version of the original source text
  • edversion.xsl: HTML version of the edited text
  • plain.xsl: TXT (unformatted text) version of the edited text
  • varietylex.xsl: HTML version of the editions lexicon
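
The choice among these stylesheets can be pictured as a simple lookup. The sketch below is not the logic of 'getversion.pl', whose parameters are not documented here: the version keys (`original`, `edited`, `plain`, `lexicon`) and the function `select_stylesheet` are assumptions made for illustration. Only the stylesheet file names and output types come from the list above.

```python
# Hypothetical version keys mapped to (stylesheet, output type).
STYLESHEETS = {
    "original": ("origversion.xsl", "html"),
    "edited": ("edversion.xsl", "html"),
    "plain": ("plain.xsl", "txt"),
    "lexicon": ("varietylex.xsl", "html"),
}


def select_stylesheet(version):
    """Return the (stylesheet, output type) pair for a requested version."""
    try:
        return STYLESHEETS[version]
    except KeyError:
        expected = ", ".join(sorted(STYLESHEETS))
        raise ValueError(
            f"unknown version {version!r}; expected one of: {expected}"
        ) from None


stylesheet, output_type = select_stylesheet("plain")
print(stylesheet, output_type)
```

An XSLT processor would then apply the selected stylesheet to the XML file of the text to produce the requested version.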

These versions are generated on request and are not stored on the server, except for the TXT versions, which are kept on the server.

If strange characters appear on screen, please check that your browser encoding is set to UTF-8.