Abstract | Cilj ovog rada je istražiti modele strojnog učenja razvijene za super-rezoluciju govora i osmisliti način njihove primjene na zvučni signal u realnom vremenu, to jest sa što manjim kašnjenjem. Izrađena su dva ogledna primjera aplikacija koji to postižu. Istražene su značajke pet modela, od kojih su tri uzastopne inačice gdje zadnja ima najbolje performanse. Prvi model, onaj od kojeg druga dva potječu, temelji se na modelu U-mreže s rezidualnim vezama. Takva se arhitektura primjenjuje ne samo kod njegovih nasljednika, već i kod nepovezanih modela. Iduća inačica uvodi TFiLM sloj temeljen na LSTM mreži s ciljem da se u skrivenom stanju modela sadrži kontekst proteklog dijela sekvence podataka. To se znanje primjenjuje na daljnju generaciju podataka. Završna inačica TFiLM slojeve mijenja AFiLM slojevima. AFiLM slojevi koriste mehanizam transformatora koji je u novije vrijeme stekao veliku popularnost kod generativnih zadataka. Ova inačica postiže bolje rezultate, a uz to se i brže izvršava jer je pogodnija za paralelnu obradu podataka. Preostali modeli su NU-Wave, temeljen na difuzijskom probabilističkom modelu za otklanjanje šuma, metodi čija je vrijednost dokazana kod super-rezolucije slika, i NVSR, temeljen na super-rezoluciji u dva koraka: iz mel-spektrograma niske rezolucije u mel-spektrogram više rezolucije i iz mel-spektrograma više rezolucije u valni oblik visoke, to jest ciljne rezolucije. Iako se NVSR pokazao kao model s uvjerljivo najboljim performansama i fleksibilnošću, za potrebe ovog rada zbog pristupačnije implementacije odabrana je U-mreža s AFiLM slojevima. Za sami tok medijskih podataka odabran je WebRTC protokol. Taj je protokol dizajniran za performanse, a to postiže izravnom komunikacijom između klijenata. Da bi uspostavio vezu između klijenata, oslanja se na ICE protokol koji pronalazi najbolji put kroz mrežu od jednog klijenta do drugog.
Implementirane su dvije aplikacije, monolitna i višeslojna, koje prikazuju primjenu odabranog super-rezolucijskog modela na WebRTC tok podataka. Monolitna aplikacija koristi WebRTC API ugrađen u preglednik, uz eksperimentalnu tehnologiju dostupnu samo u pregledniku Google Chrome te programski okvir TensorFlowJS za pokretanje modela. Višeslojna aplikacija koristi preglednički WebRTC samo za dohvaćanje korisničkih medija, dok se ostatak logike izvršava u Python aplikaciji koristeći okvir AIORTC za komunikaciju i TensorFlow za izvršavanje modela. Preporučuje se višeslojni pristup zbog njegove stabilnosti. |
Abstract (english) | The aim of this thesis is to explore machine learning models developed for speech super-resolution and to devise a way of applying them to an audio signal in real time, that is, with as little latency as possible. Two example apps that achieve this have been created. The features of five models are explored, three of which are consecutive versions, the last having the best performance. The first model, the one from which the other two originate, is based on a U-net with residual connections. Such an architecture is applied not only in its successors but in unrelated models as well. The next version introduces the LSTM-based TFiLM layer with the aim of capturing the context of the past part of the data sequence. The context is kept in the hidden state and applied to the further generation of data. The final version replaces the TFiLM layers with AFiLM layers. AFiLM layers are based on the transformer mechanism, which has recently gained great popularity in generative tasks. This version achieves better results and executes faster because it is more suitable for parallel data processing. The remaining models are NU-Wave, based on a denoising diffusion probabilistic model, a method whose value has been proven in image super-resolution, and NVSR, based on a two-step super-resolution process: from a low-resolution mel-spectrogram to a higher-resolution mel-spectrogram, and from the higher-resolution mel-spectrogram to a high-resolution waveform, that is, a waveform of the target resolution. Although NVSR proved to be the model with decidedly the best performance and flexibility, a U-net with AFiLM layers was selected for the purposes of this thesis due to its more accessible implementation. The WebRTC protocol was selected for streaming the media data. This protocol was designed with performance in mind, which it achieves by having the clients communicate directly.
To establish a connection between clients, it relies on the ICE protocol, which finds the best path through the network from one client to the other. Two apps that showcase the application of the selected super-resolution model to a WebRTC data stream have been developed: a monolithic app and a multi-layered app. The monolithic app uses the browser's built-in WebRTC API, along with experimental technology available exclusively in Google Chrome, as well as the TensorFlowJS framework to run the model. The multi-layered app uses the browser WebRTC API only to retrieve user media, while the rest of the logic is executed in a Python app using the AIORTC and TensorFlow frameworks for communication and for running the model, respectively. Due to its stability, the multi-layered approach is recommended. |
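For readers unfamiliar with the TFiLM/AFiLM layers mentioned above, the core feature-wise linear modulation (FiLM) idea they share can be sketched in a few lines. This is a toy illustration, not the thesis implementation: the function name and values are hypothetical, and in TFiLM/AFiLM the per-block scales and shifts are produced by an LSTM or a self-attention network rather than given as fixed lists.

```python
def film_modulate(features, gammas, betas, block_size):
    """Feature-wise linear modulation applied block-wise over a sequence.

    Each temporal block of `block_size` samples is scaled by its block's
    gamma and shifted by its block's beta (the affine FiLM transform).
    """
    out = []
    for i, x in enumerate(features):
        b = i // block_size  # index of the temporal block this sample falls in
        out.append(gammas[b] * x + betas[b])
    return out

# Toy sequence of 4 samples split into 2 blocks of 2:
# block 0 is scaled by 2.0; block 1 is scaled by 0.5 and shifted by 1.0.
print(film_modulate([1.0, 2.0, 3.0, 4.0],
                    gammas=[2.0, 0.5], betas=[0.0, 1.0],
                    block_size=2))
# → [2.0, 4.0, 2.5, 3.0]
```

The difference between the two layer variants lies only in how `gammas` and `betas` are computed per block: TFiLM derives them from an LSTM's hidden state, which forces sequential processing, while AFiLM derives them with self-attention, which is why it parallelizes better, as noted above.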