Poster, Autoencoder-based cleaning of non-categorical data in probabilistic databases

Autoencoder-based cleaning of non-categorical data in probabilistic databases

This report investigates the use of autoencoders to remove noise from non-categorical data in probabilistic databases. Previous research has shown that this is possible for categorical data, but continuous and discrete distributions require a new solution. The chosen approach approximates each distribution by discrete sampling. After training the autoencoder, we measured the difference between the "cleaned" data and the original data using the Jensen-Shannon divergence. We concluded that semi-supervised learning was the most effective solution. It is highly effective at low sampling densities, removing 99.54% of the noise in a probabilistic database. At higher sampling densities its performance is lower, with an 86.99% reduction in noise.
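The two building blocks named in the abstract, discrete sampling of a distribution and the Jensen-Shannon divergence, can be illustrated with a minimal Python sketch. This is not the implementation from the paper. The function names, the Gaussian test density, the sampling range, the sampling density of 32 points, and the `noise_reduction` ratio are all assumptions made for illustration, and the "cleaned" vector is a hand-made stand-in for an autoencoder's output.

```python
import numpy as np

def discretize(pdf, lo, hi, n_samples):
    """Approximate a continuous density by n_samples normalized point masses."""
    xs = np.linspace(lo, hi, n_samples)
    weights = pdf(xs)
    return weights / weights.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithm, so it lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gaussian(mean, std):
    """Unnormalized Gaussian density (assumed test distribution)."""
    return lambda x: np.exp(-0.5 * ((x - mean) / std) ** 2)

def noise_reduction(original, noisy, cleaned):
    """Hypothetical metric: share of the JS divergence removed by cleaning."""
    return 1.0 - js_divergence(original, cleaned) / js_divergence(original, noisy)

# Sampling density of 32 points is an arbitrary choice for this sketch.
original = discretize(gaussian(0.0, 1.0), -5.0, 5.0, 32)

# Add non-negative noise and renormalize so the result is still a distribution.
rng = np.random.default_rng(0)
noisy = original + rng.uniform(0.0, 0.05, size=original.shape)
noisy = noisy / noisy.sum()

# Stand-in for the autoencoder output: a vector pulled most of the way
# back toward the original. A trained model would produce this instead.
cleaned = 0.9 * original + 0.1 * noisy

print(f"JS divergence before cleaning: {js_divergence(original, noisy):.4f}")
print(f"JS divergence after cleaning:  {js_divergence(original, cleaned):.4f}")
print(f"Noise reduction: {noise_reduction(original, noisy, cleaned):.2%}")
```

The base-2 logarithm keeps the divergence between 0 (identical distributions) and 1 (disjoint support), which makes a relative reduction straightforward to read. Whether the reported percentages were computed with this exact ratio is not stated in the abstract.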

  • CS & BIT: Research Project

    The Research Project serves as preparation for the master’s thesis. It gives master’s students who completed their bachelor’s degree elsewhere the experience that UT bachelor’s students gained during their bachelor project.

  • Research Paper

    View the full research paper for this project.

F. P. J. Nijweide
