A Linked Open Data Buddhist Text Archive

Buddhist thought and culture have been expressed in a remarkably large number of languages and in a wide variety of sources, spanning an immense temporal and geographical range. The earliest works were written in an Indic language closely related to Sanskrit, but the first actual Buddhist canon was compiled in Pali in Sri Lanka in the early centuries of the first millennium CE. While Sanskrit versions of early writings were never compiled into a canon as such, Pali, Sanskrit, and Prakrit texts began to be translated into Chinese in the first century CE. By the sixth century, the Chinese had compiled their first version of the canon. Chinese Buddhists also wrote many other important works on Buddhist ritual, narrative, literature, biography, monastic law, and philosophy outside of the canon itself. Later, the Tibetans began translating Buddhist scriptures into their own language, and the first Tibetan canon was systematized in the late thirteenth century. Beyond the canon, Tibetans wrote tens of thousands of important extra-canonical works. Beginning in the thirteenth century, the Tibetan canon was translated into Mongolian. In Southeast Asia, where the Pali canon is used, we find many extra-canonical works of Buddhist narrative, poetry, ritual, philosophy, and monastic law, written in the vernacular languages of Sinhala, Burmese, Thai, Cambodian, and Lao. Canonical and important extra-canonical literature is also found in Western and Central Asia as well as in Indonesia. The same is true for East Asian countries such as Korea, Japan, and Vietnam. Finally, many works have been written in or translated into English and other Western languages.

The Buddhist Digital Resource Center (BDRC) has developed a preservation ecosystem to digitally preserve source texts and document Buddhist cultural heritage. Preserving Buddhist texts requires the ability to document the complex and multi-faceted elements of textual history. Relationships between texts in different languages, encoded in regional scripts and spanning a broad historical range, require scholarly analysis and validation. Using the power of the semantic web, cultural heritage and digital asset metadata is modeled as linked open data, governed by an RDF ontology and expressed as JSON-LD documents. Source documents are scanned into a rapidly growing archive of 12 million page images, with open APIs that provide page-level access. A full-text resource generated from transcripts and optical character recognition, and based on a multi-layer text architecture, provides a deep search environment. In this presentation I will explore the architecture and capabilities of BUDA, BDRC's Buddhist Digital Archives platform, including deep search, faceted browse, SPARQL querying, multi-layer texts and web annotations, and strategies for multi-language scholarly metadata creation and management.
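To make the linked-data modeling concrete, the following is a minimal sketch of how relationships between texts in different languages might be expressed as a JSON-LD document and then traversed. It is an illustration under stated assumptions and does not reproduce BDRC's ontology or APIs: the `ex:` vocabulary, the `translationOf` property, the work identifiers, and the helper functions are all hypothetical.

```python
import json

# Hypothetical JSON-LD context. The "ex:" vocabulary is illustrative only
# and is NOT the archive's actual ontology.
CONTEXT = {
    "ex": "http://example.org/ontology/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    # Language map: labels are keyed by language tag.
    "label": {"@id": "rdfs:label", "@container": "@language"},
    "language": "ex:language",
    # Values of translationOf are links to other nodes, not plain strings.
    "translationOf": {"@id": "ex:translationOf", "@type": "@id"},
}


def make_work(work_id, labels, language, translation_of=None):
    """Build one JSON-LD node describing a text (hypothetical shape)."""
    node = {
        "@id": f"ex:{work_id}",
        "@type": "ex:Work",
        "label": labels,
        "language": language,
    }
    if translation_of is not None:
        node["translationOf"] = f"ex:{translation_of}"
    return node


def translations_of(graph, work_id):
    """Return ids of all nodes that translate work_id, directly or indirectly."""
    found = []
    frontier = [f"ex:{work_id}"]
    while frontier:
        current = frontier.pop()
        for node in graph:
            if node.get("translationOf") == current and node["@id"] not in found:
                found.append(node["@id"])
                frontier.append(node["@id"])
    return sorted(found)


def build_document():
    """Assemble a three-node example: a source text and a chain of translations."""
    graph = [
        make_work("W1", {"en": "Example sutra (Sanskrit source)"}, "sa"),
        make_work("W2", {"en": "Example sutra (Tibetan translation)"}, "bo", "W1"),
        make_work("W3", {"en": "Example sutra (Mongolian translation)"}, "mn", "W2"),
    ]
    return {"@context": CONTEXT, "@graph": graph}


if __name__ == "__main__":
    doc = build_document()
    print(json.dumps(doc, indent=2, ensure_ascii=False))
    print(translations_of(doc["@graph"], "W1"))
```

In this sketch the translation chain is followed in application code to keep the example self-contained. In a triple store, the same question would more naturally be asked as a SPARQL query over the corresponding property.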