Updates on projects to improve Welsh language technology and AI.
Overview
Over the past few years, we have done a lot of work to improve technology to help more people to use Welsh in everyday life.
Our current priorities for Welsh language technology are:
- improving technology to increase the daily use of Cymraeg
- making sure everyone can access Welsh language technology
- improving Welsh language artificial intelligence (AI) and speech and language technologies (by sharing data and other means)
The Written Statement on Welsh language technology explains what we are doing to achieve this.
We’ll publish regular updates on this page.
Updates on Welsh language technology
Bangor University
- Gathered over 440,000 segments of equivalent Welsh and English open text from web pages. This is valuable training data and useful TMX memories for translators.
- Trained an openly licensed AI model, based on Meta's Llama 3.
- Used translation memories to improve its performance in Welsh, both in general and in translation tasks.
- Improved the accuracy of its speech-to-text model, named Trawsgrifiwr.
- Created new artificial Welsh voices using PiperTTS software. The voices now perform better with short words and long sentences.
- Added a new northern male Welsh voice to the official Piper TTS library. We want to offer Welsh and bilingual voices for screen readers that blind and visually impaired people use. You can use Sonata to install it within the popular NVDA screen reader. Here's a video showing how.
- Created a new voiceover website, named Trosleisio, to improve accessibility. You can use synthetic voices to create voice-overs for Welsh language videos.
- Updated its AI assistant, Macsen, to include web searching using your own voice. You can also prompt it to spell Welsh words such as 'llaeth' correctly. You can find videos showing it being used on YouTube.
- Created an experimental voice keyboard app on Android so you can use speech recognition as well as typing. You can download a video to see it being used.
- Created new Microsoft SpeechT5 voices. You can use them to clone your voice. This demo shows you how.
- Published an update to the Tagged training sentences. This continues to focus on more oral or informal sentences from the transcribed texts. They tagged parts of speech of more than 200 additional sentences and added them to the data.
- Continued to crawl across the web in search of Welsh language data. They have a list of web addresses that declare their content to be under an open license.
- Added new text data to its "CC0" corpus and expanded its scope to include data under other permissive licences.
- Annotated the names of 400 sentence entities. You can use them with Knowledge Linking.
- Improved automatic speech recognition within this version of the Whisper speech-to-text model.
Welsh audio training data
Cymen was awarded an Arfor grant to produce and publish new Welsh language voice data under a permissive licence. They've transcribed and verified about 40 hours of speech.
Bangor University transcribed more audio data in the past year. There are now 52 hours of data in that particular sub-collection (August 2025).
As of August 2025, there’s a total of 249 hours of Welsh audio training data, including:
- 124 hours Common Voice
- 45 hours Bangor Transcription Bank
- 40 hours Cymen Arfor
- 40 hours Paldaruo (since 2018)
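As a quick check of the figures above, the four collections can be added up to confirm the 249-hour total. The sketch below is illustrative only: the dictionary and function names are our own, and the hour values are simply copied from the list.

```python
# Hours of Welsh audio training data per collection, copied from the list above
# (August 2025). The variable and function names are illustrative assumptions.
hours_by_collection = {
    "Common Voice": 124,
    "Bangor Transcription Bank": 45,
    "Cymen Arfor": 40,
    "Paldaruo": 40,
}

# Add up every collection to get the overall total.
total_hours = sum(hours_by_collection.values())


def share_of_total(name):
    """Return one collection's share of the total, as a percentage."""
    return 100 * hours_by_collection[name] / total_hours


for name, hours in hours_by_collection.items():
    print(f"{name}: {hours} hours ({share_of_total(name):.1f}%)")
print(f"Total: {total_hours} hours")
```

Run as written, the script lists each collection with its share and a total that matches the 249 hours stated above.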
Cardiff University
- Published SENTimental: a tool to collect annotations to create training data and test Welsh sentiment analysis. Built with Lancaster University.
Welsh Government
- The Cymraeg 2050 team published a blog article Technology to help people use more Cymraeg.
- The Cabinet Secretary for Finance and the Welsh Language appeared before the Senedd's Culture, Communications, Welsh Language, Sport, and International Relations Committee on 16 July 2025. He took part in the Committee's 'Cymraeg for All?' Inquiry. Technology and the Welsh language were part of the discussion. Here's the evidence paper we submitted to the Committee, and here is a video of the session.
- The Welsh Language Infrastructure team has published the Get Welsh words website to help users to get Welsh translations.
- We updated and expanded our Helo Blod list of Welsh language technology tools and resources. It now includes a dedicated section for parents and carers.
Updates on Welsh placenames and mapping
The National Library of Wales
- Used Bangor University’s Welsh named entity recognition tools and other AI tools to develop a prototype pipeline. It recognises and extracts Welsh placenames from OCR text and aligns them with Wikidata entities in the Library’s own SNARC database.
- Published an article summarising how their work with placenames has helped to shape the Welsh Language Commissioner's new placenames website.
- Supported the development of the Welsh Language Commissioner's new Standard Place Names website.
- Shared a list of over 10,000 Welsh names on Wikidata so Mapio Cymru can add them to the Welsh OpenStreetMap.
- Held an event with the Welsh Language Commissioner and Parc Eryri in May 2025 to collect audio clips and images of places in the park.
- Aligned new data about standard names from Parc Eryri to Wikidata and shared it with the Welsh Language Commissioner.
- Created new Welsh labels for 11,264 building names. The data is publicly available via Wikidata and SNARC.
- Hosted events on Wikipedia and open data themes. The first was an editing event with Aberystwyth University on 16 October 2024. They then went on to host the History Hackathon 2025. Here's an article in Welsh about some of the highlights.
Mapio Cymru
- Published guidance on how to add Welsh language street names to the OpenStreetMap Welsh map.
