Algorithms Big Data Data Science Machine-Learning python
Working with datasets comprising millions or billions of samples is an increasingly common task, one that is typically tackled with distributed computing. Nodes in high-performance computing clusters have enough RAM to run intensive and well-tested data analysis workflows. More often than not, however, this is preceded by the scientific process of cleaning, filtering, grouping, and other transformations of the data, through continuous visualizations and correlation analysis. In today’s work environments, many data scientists prefer to do this on their laptops or workstations, as to more effectively use their time and not to rely on spotty internet connection to access their remote data and computation resources. Modern laptops have sufficiently fast I/O SSD storage, but upgrading RAM is expensive or impossible.
Applying the combined benefits of computational graphs, which are common in neural network libraries, with delayed (a.k.a lazy) evaluations to a DataFrame library enables efficient memory and CPU usage. Together with memory-mapped storage (Apache Arrow, hdf5) and out-of-core algorithms, we can process considerably larger data sets with fewer resources. As an added bonus, the computational graphs ‘remember’ all operations applied to a DataFrame, meaning that data processing pipelines can be generated automatically.
Type: Talk (45 mins); Python level: Intermediate; Domain level: Beginner
Jovan is a senior data scientists & researcher at XebiaLabs, where he creates predictive models related to DevOps pipelines. Working mostly with Python in the Jupyter ecosystem, he has considerable experience in clustering analysis and predictive modeling. Jovan has a PhD in Astrophysics, is a co-founder of vaex.io, and is interested in novel machine learning technologies and applications.