Linguistic resources are required for writing grammars in symbolic approaches and for training machine learning modules. The word corpus means "body" in Latin, but in linguistics it refers to a collection of texts used as a data source. Such a collection of linguistic data, whether written texts or transcriptions of recorded speech, can serve as the starting point for linguistic description. The Linguistic Data Consortium (LDC) maintains a large catalog of written and spoken corpora covering a wide range of languages. The European Language Resources Association (ELRA) collects, distributes, and validates spoken, written, and terminological linguistic resources, as well as software tools.
A corpus is a sizable, organized collection of machine-readable texts produced in natural communicative settings. The plural of corpus is corpora. Corpora can be compiled from a variety of sources, including electronic text, transcripts of spoken language, and printed text digitized through optical character recognition.
Sampling is another crucial component of corpus design, because it is closely tied to the representativeness and balance of the corpus. A corpus can rarely contain every text of a language or language variety, so sampling is unavoidable in corpus building.
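As a rough illustration of how sampling relates to balance, the following Python sketch draws the same number of texts from each genre of a toy collection. The genre labels, the function name stratified_sample, and the per-genre quota are assumptions made for this example; they are not a standard corpus-building procedure, and real projects define their own sampling frames.

```python
import random
from collections import defaultdict


def stratified_sample(documents, per_genre, seed=0):
    """Draw up to `per_genre` texts from each genre.

    `documents` is a list of (genre, text) pairs. The function name and
    the equal-quota scheme are hypothetical choices for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the texts by genre (the strata).
    by_genre = defaultdict(list)
    for genre, text in documents:
        by_genre[genre].append(text)

    sample = []
    for genre in sorted(by_genre):
        texts = by_genre[genre]
        # A genre with fewer texts than the quota contributes all of them.
        k = min(per_genre, len(texts))
        for text in rng.sample(texts, k):
            sample.append((genre, text))
    return sample


# Toy population in which news texts are heavily over-represented.
population = (
    [("news", f"news text {i}") for i in range(50)]
    + [("fiction", f"fiction text {i}") for i in range(10)]
    + [("speech", f"speech transcript {i}") for i in range(5)]
)

balanced = stratified_sample(population, per_genre=5)
```

Sampling purely at random from this population would mostly return news texts. Fixing a quota per genre trades proportional representation for balance, and which of the two matters more depends on the intended use of the corpus.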
The size of a corpus depends on its intended use and on the following practical factors:
- The types of questions users are expected to ask.
- The methods users will apply to study the data.
- The availability of the data source.