Abstract | Cilj ovog rada je istražiti modele strojnog učenja razvijene za super-rezoluciju govora i osmisliti način njihove primjene na zvučni signal u realnom vremenu, to jest sa što manjim kašnjenjem. Izrađena su dva ogledna primjera aplikacija koji to postižu. Istražene su značajke pet modela, od kojih su tri uzastopne inačice gdje zadnja ima najbolje performanse. Prvi model, onaj od kojeg druga dva potječu, temelji se na modelu U-mreže s rezidualnim vezama. Takva se arhitektura primjenjuje ne samo kod njegovih nasljednika, već i kod nepovezanih modela. Iduća inačica uvodi TFiLM sloj temeljen na LSTM mreži s ciljem da se u skrivenom stanju modela sadrži kontekst proteklog dijela sekvence podataka. To se znanje primjenjuje na daljnju generaciju podataka. Završna inačica TFiLM slojeve mijenja AFiLM slojevima. AFiLM slojevi koriste mehanizam transformatora koji je u novije vrijeme stekao veliku popularnost kod generativnih zadataka. Ova inačica postiže bolje rezultate, a uz to se i brže izvršava jer je pogodnija za paralelnu obradu podataka. Preostali modeli su NU-Wave, temeljen na difuzijskom probabilističkom modelu za otklanjanje šuma, metodi čija je vrijednost dokazana kod super-rezolucije slika, i NVSR, temeljen na super-rezoluciji u dva koraka: iz mel-spektrograma niske rezolucije u mel-spektrogram više rezolucije i iz mel-spektrograma više rezolucije u valni oblik visoke, to jest ciljne rezolucije. Iako se NVSR pokazao kao model s uvjerljivo najboljim performansama i fleksibilnošću, za potrebe ovog rada zbog pristupačnije implementacije odabrana je U-mreža s AFiLM slojevima. Za sami tok medijskih podataka odabran je WebRTC protokol. Taj je protokol dizajniran za performanse, a to postiže izravnom komunikacijom između klijenata. Da bi uspostavio vezu između klijenata, oslanja se na ICE protokol koji pronalazi najbolji put kroz mrežu od jednog klijenta do drugog.
Implementirane su dvije aplikacije, monolitna i višeslojna, koje prikazuju primjenu odabranog super-rezolucijskog modela na WebRTC tok podataka. Monolitna aplikacija koristi WebRTC API ugrađen u preglednik, uz eksperimentalnu tehnologiju dostupnu samo u pregledniku Google Chrome te programski okvir TensorFlowJS za pokretanje modela. Višeslojna aplikacija koristi preglednički WebRTC samo za dohvaćanje korisničkih medija, dok se ostatak logike izvršava u Python aplikaciji koristeći okvir AIORTC za komunikaciju i TensorFlow za izvršavanje modela. Preporučuje se višeslojni pristup zbog njegove stabilnosti. |
Abstract (english) | The aim of this thesis is to explore machine learning models developed for speech super-resolution and to devise a way of applying them to an audio signal in real time, that is, with as little latency as possible. Two example apps that achieve this have been created. The features of five models are explored, three of which are consecutive versions, the last having the best performance. The first model, the one from which the other two originate, is based on a U-net with residual connections. Such an architecture is applied not only in its successors but in unrelated models as well. The next version introduces the LSTM-based TFiLM layer with the aim of capturing the context of the past part of the data sequence. The context is kept in the hidden state and applied to the further generation of data. The final version replaces the TFiLM layers with AFiLM layers. AFiLM layers are based on the transformer mechanism, which has recently gained great popularity in generative tasks. This version achieves better results and executes faster because it is more suitable for parallel data processing. The remaining models are NU-Wave, based on a denoising diffusion probabilistic model, a method whose value has been proven in image super-resolution, and NVSR, based on a two-step super-resolution process: from a low-resolution mel-spectrogram to a higher-resolution mel-spectrogram, and from the higher-resolution mel-spectrogram to a high-resolution waveform, that is, a waveform of the target resolution. Although NVSR proved to be the model with decidedly the best performance and flexibility, a U-net with AFiLM layers was selected for the purposes of this thesis due to its more accessible implementation. The WebRTC protocol was selected for streaming the media data. This protocol was designed with performance in mind, which it achieves by having the clients communicate directly.
To establish a connection between clients, it relies on the ICE protocol, which finds the best path through the network from one client to the other. Two apps that showcase the application of the selected super-resolution model to a WebRTC data stream have been developed: a monolithic app and a multi-layered app. The monolithic app uses the browser's built-in WebRTC API, along with experimental technology available exclusively in Google Chrome, as well as the TensorFlowJS framework to run the model. The multi-layered app uses the browser WebRTC API only to retrieve user media, while the rest of the logic is executed in a Python app using the AIORTC and TensorFlow frameworks for communication and for running the model, respectively. Due to its stability, the multi-layered approach is recommended. |
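For readers unfamiliar with the TFiLM/AFiLM layers mentioned above, the core feature-wise linear modulation (FiLM) idea they share can be sketched in a few lines. This is a toy illustration, not the thesis implementation: the function name and values are hypothetical, and in TFiLM/AFiLM the per-block scales and shifts are produced by an LSTM or a self-attention network rather than given as fixed lists.

```python
def film_modulate(features, gammas, betas, block_size):
    """Feature-wise linear modulation applied block-wise over a sequence.

    Each temporal block of `block_size` samples is scaled by its block's
    gamma and shifted by its block's beta (the affine FiLM transform).
    """
    out = []
    for i, x in enumerate(features):
        b = i // block_size  # index of the temporal block this sample falls in
        out.append(gammas[b] * x + betas[b])
    return out

# Toy sequence of 4 samples split into 2 blocks of 2:
# block 0 is scaled by 2.0; block 1 is scaled by 0.5 and shifted by 1.0.
print(film_modulate([1.0, 2.0, 3.0, 4.0],
                    gammas=[2.0, 0.5], betas=[0.0, 1.0],
                    block_size=2))
# → [2.0, 4.0, 2.5, 3.0]
```

The difference between the two layer variants lies only in how `gammas` and `betas` are computed per block: TFiLM derives them from an LSTM's hidden state, which forces sequential processing, while AFiLM derives them with self-attention, which is why it parallelizes better, as noted above.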