About us

Machine Learning Research is Allegro’s R&D lab created to develop and apply state-of-the-art machine learning methods, helping Allegro grow and innovate with artificial intelligence. Beyond bringing AI to production, we are committed to advancing the understanding of machine learning through open collaboration with the scientific community.

Areas

Machine Translation

We are developing an in-house Machine Translation engine built specifically for e-commerce, aiming to provide better value than off-the-shelf solutions. Our focus is on accurately translating industry-specific terms and jargon while building a scalable and cost-efficient system. We employ state-of-the-art machine learning methods, combining human evaluation with automatic quality estimation models to continually improve translation quality. Our goal is to make our platform accessible to non-Polish speakers globally and to contribute to the machine translation community.
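
To make the automatic side of that evaluation loop concrete, below is a minimal sketch that scores candidate translations against references with the sacrebleu library. The sentences are invented examples, and this is not our internal quality-estimation model.

    # Illustrative corpus-level MT scoring with sacrebleu; not Allegro's
    # internal quality-estimation pipeline.
    import sacrebleu

    hypotheses = ["Add the item to your cart.", "Free delivery from 40 zloty."]
    # One reference stream: references[0][i] pairs with hypotheses[i].
    references = [["Add this item to your cart.",
                   "Free delivery on orders over 40 zloty."]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    chrf = sacrebleu.corpus_chrf(hypotheses, references)
    print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")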

Language Modeling

We employ state-of-the-art deep learning models and a range of NLP algorithms to solve diverse problems that require semantic understanding of the specialized language used within the unique environment of an e-commerce platform. We utilize and develop Large Language Models (LLMs), with the goal of providing the company with general-purpose Foundation Models that can be tailored to specific downstream tasks. We use our models daily in applications such as Semantic Search, Question Answering, Conversational AI, Generative AI, and Named Entity Recognition.
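
As a flavor of the Semantic Search application, the sketch below embeds a query and candidate product titles with our public HerBERT checkpoint and ranks titles by cosine similarity. Mean pooling is an assumed, simplistic sentence-embedding choice; production retrieval uses dedicated models.

    # Illustrative semantic search with a pretrained encoder; not a
    # production retrieval system. Mean pooling is an assumed choice.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
    model = AutoModel.from_pretrained("allegro/herbert-base-cased")

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens

    query = embed(["czerwona sukienka na lato"])  # "red summer dress"
    titles = embed(["Sukienka letnia czerwona rozmiar M", "Kurtka zimowa męska"])
    scores = torch.nn.functional.cosine_similarity(query, titles)
    print(scores)  # the summer dress should score higher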

Learning to Rank

In Learning to Rank, our goal is to develop machine learning models for search. Our main focus is on ranking solutions across all phases of the search pipeline, serving millions of searches a day. Our main area of expertise is currently neural text-based search and relevance. We’re also interested in topics such as reranking, feature interaction architectures, and personalization.
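
To make the ranking objective concrete, here is a textbook ListNet-style listwise loss in PyTorch. It is a generic illustration, not the implementation from our allRank library (listed under Open-Source below).

    # Generic ListNet-style listwise loss; illustration only, not allRank code.
    import torch
    import torch.nn.functional as F

    def listnet_loss(scores, relevance):
        # scores, relevance: [batch, list_size]. Cross-entropy between the
        # softmax of true relevance labels and of predicted scores.
        true_dist = F.softmax(relevance, dim=1)
        log_pred = F.log_softmax(scores, dim=1)
        return -(true_dist * log_pred).sum(dim=1).mean()

    # Toy example: one query with three documents and graded labels.
    print(listnet_loss(torch.tensor([[2.0, 0.5, 1.0]]),
                       torch.tensor([[3.0, 0.0, 1.0]])))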

Computer Vision

At MLR Computer Vision, our primary objective is to improve the user experience with machine learning algorithms for image processing. We concentrate on image representation learning for Visual Search and on robust image classification models. Our current research focuses on integrating multiple modalities into our models, enabling them to process not only images but also diverse sources of information such as product titles, descriptions, and attributes. Such multimodal models hold significant potential in areas including semantic search and product catalog quality.
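
A schematic of the late-fusion idea described above: image and text embeddings are projected into a shared space, concatenated, and classified jointly. The architecture and all dimensions are invented for illustration; this is not one of our production models.

    # Hypothetical late-fusion multimodal classifier; dimensions are invented.
    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        def __init__(self, img_dim=2048, txt_dim=768, num_classes=100):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, 512)   # project image features
            self.txt_proj = nn.Linear(txt_dim, 512)   # project text features
            self.head = nn.Sequential(nn.ReLU(), nn.Linear(1024, num_classes))

        def forward(self, img_feats, txt_feats):
            fused = torch.cat([self.img_proj(img_feats),
                               self.txt_proj(txt_feats)], dim=-1)
            return self.head(fused)

    model = LateFusionClassifier()
    logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # 4 products
    print(logits.shape)  # torch.Size([4, 100])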

Recommendations

Our team's primary objective is to fulfill users' needs by providing a diverse range of products that match their interests. We strive to inspire users and connect them with relevant offers through recommender systems. Our algorithms are grounded in the collective behavior of our user base, but we also incorporate content features of the items and enrich recommendations with exploratory algorithms that not only utilize historical data but actively engage with the world. Our major challenge is to develop innovative algorithms that deliver high-quality recommendations while handling Allegro's significant daily traffic, which requires us to operate at scale without compromising the user experience.
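
The exploratory flavor mentioned above can be illustrated with a classic epsilon-greedy policy: mostly exploit the historically best items, occasionally explore a random one. This is a generic sketch, not our production algorithm.

    # Generic epsilon-greedy exploration over item scores; illustrative only.
    import random

    def recommend(item_scores, k=5, epsilon=0.1, seed=0):
        rng = random.Random(seed)
        items = sorted(item_scores, key=item_scores.get, reverse=True)
        picks = []
        while items and len(picks) < k:
            # With probability epsilon explore a random remaining item,
            # otherwise exploit the highest-scoring remaining one.
            idx = rng.randrange(len(items)) if rng.random() < epsilon else 0
            picks.append(items.pop(idx))
        return picks

    # Toy catalog with hypothetical historical scores (e.g., CTR estimates).
    print(recommend({f"item_{i}": i / 10 for i in range(20)}))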

MLOps

The MLOps team optimizes, scales, and deploys advanced machine learning models. We blend artificial intelligence, software engineering, and DevOps expertise to unlock the full potential of research engineers and data scientists from other teams. We orchestrate the entire machine learning lifecycle, from data preprocessing and annotation to model deployment, on Google Cloud and Kubernetes infrastructure. We operate at massive scale, with several terabytes of data processed daily and thousands of predictions served per second.
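
For flavor, here is a minimal, hypothetical prediction endpoint of the kind such a lifecycle ends in. The real serving stack on Google Cloud and Kubernetes is considerably more involved.

    # Hypothetical minimal model-serving endpoint (FastAPI); the production
    # stack behind thousands of predictions per second is far richer.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class PredictRequest(BaseModel):
        features: list[float]

    @app.post("/predict")
    def predict(req: PredictRequest):
        # Placeholder "model": a real deployment would load a trained
        # artifact from a model registry at startup.
        return {"score": sum(req.features)}

    # Run with: uvicorn main:app, then POST JSON to /predict.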

Talks

Retrieval at Scale

Aleksandra Osowska-Kurczab & Jacek Szczerbiński

Sponsored talk by Allegro at the ML in PL Conference 2022

Watch

Blog

Trust no one, not even your training data! Machine learning from noisy data

Label noise is ever-present in machine learning practice. Allegro datasets are no exception. We compared 7 methods for training classifiers robust to label…

Alicja Rączkowska
Read post

Turn-Based Offline Reinforcement Learning

This blog post is the result of a research collaboration between the Allegro Machine Learning Research team and the Institute of Mathematics of the Polish Academy of…

Riccardo Belluzzo
Read post

Open-Source

AlleNoise

A large-scale text classification benchmark dataset with real-world label noise, meant to spark the development of new robust classification methods.

Try it!

allms

A versatile and powerful library designed to streamline the process of querying large language models; a generic sketch of the async-with-retries pattern follows the feature list below.

  • Simple and user-friendly interface
  • Asynchronous querying
  • Automatic retrying mechanism
  • Error handling and management
  • Output parsing
Try it!
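
The asynchronous-querying-with-retries pattern that the library packages up can be sketched generically with asyncio. This is deliberately not allms code (we make no claims about its API here); it only illustrates the features above.

    # Generic asyncio sketch of concurrent querying with retries and
    # backoff; NOT the allms API, just the underlying idea.
    import asyncio
    import random

    async def flaky_llm_call(prompt: str) -> str:
        # Stand-in for a remote LLM request that sometimes fails.
        if random.random() < 0.3:
            raise ConnectionError("transient failure")
        return f"answer to: {prompt}"

    async def query_with_retry(prompt: str, retries: int = 3) -> str:
        for attempt in range(retries):
            try:
                return await flaky_llm_call(prompt)
            except ConnectionError:
                await asyncio.sleep(0.1 * 2 ** attempt)  # exponential backoff
        return "<failed>"

    async def main():
        prompts = [f"prompt {i}" for i in range(5)]
        # Query all prompts concurrently.
        print(await asyncio.gather(*(query_with_retry(p) for p in prompts)))

    asyncio.run(main())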

allRank

A framework for training neural Learning-to-Rank (LTR) models, featuring implementations of:

  • common pointwise, pairwise, and listwise loss functions,
  • fully connected and Transformer-like scoring functions,
  • commonly used evaluation metrics like Normalized Discounted Cumulative Gain (NDCG, sketched below) and Mean Reciprocal Rank (MRR),
  • click models for experiments on simulated click-through data
Try it!
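
For reference, NDCG, one of the metrics listed above, reduces to a few lines. This standalone sketch is independent of allRank's own implementation:

    # Standalone NDCG@k; independent of allRank's implementation.
    import math

    def dcg(relevances, k):
        # Gain 2^rel - 1, discounted by log2(rank + 1), over the top k.
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(relevances[:k]))

    def ndcg(relevances, k):
        ideal = dcg(sorted(relevances, reverse=True), k)
        return dcg(relevances, k) / ideal if ideal > 0 else 0.0

    # Graded relevance of results in the order the model ranked them:
    print(round(ndcg([3, 0, 1, 2], k=4), 3))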

KLEJ Benchmark

The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych, i.e. Comprehensive List of Language Evaluations) is a set of nine evaluation tasks for Polish language understanding. Key benchmark features:

  • It contains a diverse set of tasks from different domains and with different objectives,
  • Most tasks are created from existing datasets, but we also release a new sentiment analysis dataset from the e-commerce domain.
Try it!

HerBERT

HerBERT is a BERT-based language model trained on six different corpora for Polish language understanding. It achieves state-of-the-art results on multiple downstream tasks, including the KLEJ Benchmark and Part-of-Speech tagging. We release both Base and Large variants of the model as part of the transformers library for anyone to use; a short usage sketch follows below.

Try it!
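
Loading the released checkpoints takes a couple of lines with the transformers library; the sketch below runs the Base variant in a fill-mask pipeline (the Polish example sentence, "Warsaw is the ___ of Poland", is ours):

    # Querying the released Base checkpoint via the transformers library.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="allegro/herbert-base-cased")
    sentence = f"Warszawa to {fill_mask.tokenizer.mask_token} Polski."
    for pred in fill_mask(sentence):
        print(pred["token_str"], round(pred["score"], 3))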

Publications

2024

AlleNoise - large-scale text classification benchmark dataset with real-world label noise

Authors: Alicja Rączkowska, Aleksandra Osowska-Kurczab, Jacek Szczerbiński, Kalina Jasinska-Kobus, Klaudia Nazarko

Accepted at: arXiv (preprint)

Read

2023

Improving Domain-Specific Retrieval by NLI Fine-Tuning

Authors: Roman Dušek, Aleksander Wawer, Christopher Galias, Lidia Wojciechowska

Accepted at: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, FedCSIS 2023

Read

2023

Going beyond research datasets: Novel intent discovery in the industry setting

Authors: Aleksandra Chrabrowa, Tsimur Hadeliya, Dariusz Kajtoch, Robert Mroczkowski, Piotr Rybak

Accepted at: Findings of the Association for Computational Linguistics: EACL 2023

Read

2022

Evaluation of Transfer Learning for Polish with a Text-to-Text Model

Authors: Aleksandra Chrabrowa, Łukasz Dragan, Karol Grzegorczyk, Dariusz Kajtoch, Mikołaj Koszowski, Robert Mroczkowski, Piotr Rybak

Accepted at: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)

Read

2021

Allegro.eu Submission to WMT21 News Translation Task

Authors: Mikołaj Koszowski, Karol Grzegorczyk, Tsimur Hadeliya

Accepted at: Proceedings of the Sixth Conference on Machine Translation

Read

2021

HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish

Authors: Robert Mroczkowski, Piotr Rybak, Alina Wróblewska, Ireneusz Gawlik

Accepted at: BSNLP 2021, long paper

Read

2020

KLEJ: Comprehensive Benchmark for Polish Language Understanding

Authors: Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik

Accepted at: ACL 2020, long paper

Read

2020

Context-Aware Learning to Rank with Self-Attention

Authors: Przemysław Pobrotyn, Tomasz Bartczak, Mikołaj Synowiec, Radosław Białobrzeski, Jarosław Bojar

Accepted at: SIGIR eCommerce Workshop 2020, contributed talk

Read

2020

NeuralNDCG: Direct Optimisation of a Ranking Metric via Differentiable Relaxation of Sorting

Authors: Przemysław Pobrotyn, Radosław Białobrzeski

Accepted at: The 2021 SIGIR Workshop on eCommerce (SIGIR eCom ’21)

Read

2020

BERT-based similarity learning for product matching

Authors: Janusz Tracz, Piotr Wójcik, Kalina Jasinska-Kobus, Riccardo Belluzzo, Robert Mroczkowski, Ireneusz Gawlik

Accepted at: EComNLP 2020 COLING Workshop on Natural Language Processing in E-Commerce

Read

Job offers

Senior Data Analyst - Allegro Pay (Risk Management)

Warsaw

Apply

Senior Systems Administrator (HR Tech)

Prague

Apply

Junior Software Engineer (Machine Learning)

Warsaw

Apply

Senior Research Engineer (Computer Vision)

Warsaw

Apply

Front-end Engineer - Allegro Pay

Warsaw

Apply
See more job offers