Machine Learning Research is Allegro’s R&D lab created to develop and apply state-of-the-art machine learning methods, helping Allegro grow and innovate with artificial intelligence. Beyond bringing AI to production, we are committed to advancing the understanding of machine learning through open collaboration with the scientific community.
We focus on using NLP models to understand and automate communication at Allegro, e.g. automatically answering questions sent to our customer support. Our main research directions relate to pretraining and evaluating large language models, semi-supervised clustering, and human-in-the-loop NLP.
Learning to Rank
In Learning to Rank, our goal is to develop ranking models that find the optimal ordering of items in a given search results list, based on users’ past interactions. Such models constitute the final stage of Allegro’s search engine, serving millions of searches a day.
Some of the research problems we tackle are:
Incorporating multimodal data (textual, visual, tabular) into an end-to-end ranking model
Personalizing the search engine
Developing novel ranking architectures and loss functions
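To make the loss-function direction concrete, here is a minimal NumPy sketch of a RankNet-style pairwise ranking loss: for each pair of results where one item is more relevant than the other, the model is penalised when the less relevant item receives the higher score. This is an illustrative toy, not Allegro’s production objective; the scores and relevance labels are made up.

```python
import numpy as np

def ranknet_pairwise_loss(scores, relevance):
    """RankNet-style pairwise loss (illustrative toy implementation).

    For every ordered pair (i, j) with relevance[i] > relevance[j],
    add -log sigmoid(scores[i] - scores[j]), i.e. the log-loss of
    ranking i above j under a logistic model of the score difference.
    """
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if relevance[i] > relevance[j]:
                diff = scores[i] - scores[j]
                loss += np.log1p(np.exp(-diff))  # -log sigmoid(diff)
                pairs += 1
    return loss / max(pairs, 1)

# Toy example: three search results with graded relevance labels
scores = np.array([2.0, 0.5, 1.0])   # model scores
relevance = np.array([2, 0, 1])      # e.g. click-derived labels
print(round(ranknet_pairwise_loss(scores, relevance), 4))
```

Scores that agree with the relevance ordering yield a lower loss than scores that invert it, which is exactly the signal used to train the ranker.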
In Visual Search, we create machine learning models which enable us to create image embeddings suitable for similarity search. The main challenge is to make these embeddings sensitive to relevant visual traits of products like category, style, colour, pattern etc. while maintaining insensitivity to irrelevant information such as background, presence of a model, different camera angles etc.
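Embeddings with the desired sensitivity are commonly trained with a triplet margin objective: pull an anchor image towards a visually similar product (positive) and push it away from a dissimilar one (negative). The sketch below, with hypothetical 4-dimensional embeddings, illustrates the idea; it is not our production model.

```python
import numpy as np

def normalise(v):
    """L2-normalise an embedding vector."""
    return v / np.linalg.norm(v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: loss is zero once the anchor is closer to
    the positive than to the negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical embeddings of product photos
a = normalise(np.array([1.0, 0.0, 0.0, 0.1]))  # red dress, white background
p = normalise(np.array([0.9, 0.1, 0.0, 0.1]))  # same dress, worn by a model
n = normalise(np.array([0.0, 1.0, 0.2, 0.0]))  # blue sofa
print(triplet_loss(a, p, n))
```

Training on many such triplets teaches the encoder to ignore nuisance factors (background, camera angle, presence of a model) while preserving the traits that matter for similarity search.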
We employ a diverse set of machine learning techniques to improve the product-based experience on Allegro. Problems we solve include product matching, i.e. inferring which product is being sold in a merchant-created offer, and the automatic integration of product definitions from external product catalogs. Examples of our research directions include sampling methods in similarity learning and extreme classification methods.
The main purpose of our team is to address users’ needs by showing them a broad range of products they may be interested in, thus serving as an inspiration and connecting them with useful, contextual offers.
We ground our algorithms in the past collective behaviour of our user base, and we are also working towards incorporating content features of items into our models. Our main challenges include building novel algorithms that give our users good recommendations while operating at scale; both are significant endeavours, considering the sheer amount of traffic Allegro serves daily.
We focus our research on:
Building item representations that can serve as a basis for retrieval,
Improving ways to detect user intents in a clear (and useful) way,
Following current trends in recommender systems.
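The first research focus, retrieval from item representations, can be sketched as a nearest-neighbour lookup over precomputed item embeddings: score every item by cosine similarity to a user vector and return the top-k. The item names and vectors below are invented for illustration.

```python
import numpy as np

def recommend(user_vec, item_matrix, item_ids, k=2):
    """Retrieve the top-k items by cosine similarity between a user
    representation and a matrix of item embeddings (one row per item).
    A minimal sketch of embedding-based candidate retrieval."""
    item_norms = item_matrix / np.linalg.norm(item_matrix, axis=1, keepdims=True)
    user_norm = user_vec / np.linalg.norm(user_vec)
    sims = item_norms @ user_norm          # cosine similarity per item
    top = np.argsort(-sims)[:k]            # indices of k best matches
    return [item_ids[i] for i in top]

item_ids = ["bike", "helmet", "kettle"]
item_matrix = np.array([
    [0.9, 0.1, 0.0],   # bike
    [0.8, 0.3, 0.0],   # helmet
    [0.0, 0.1, 0.9],   # kettle
])
user_vec = np.array([1.0, 0.2, 0.0])       # user interested in cycling
print(recommend(user_vec, item_matrix, item_ids))
```

At Allegro’s scale the exhaustive dot product would be replaced by an approximate nearest-neighbour index, but the retrieval contract stays the same.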
We aim to enhance various Allegro projects with exploratory algorithms, which are capable not only of exploiting historical data but also of exploring via interactions with the world. Currently, we are working on the optimization of Search Engine Marketing (SEM) and Content Optimization projects. Our main research directions include contextual bandits, A/B testing alternatives with causal impact discovery, and offline RL.
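The explore/exploit trade-off behind these directions can be illustrated with the simplest bandit policy, epsilon-greedy: usually pick the arm with the best observed mean reward, but with probability epsilon try a random arm. This toy simulation of two hypothetical ad variants is an illustration of the idea, not our SEM optimiser (which uses contextual bandits).

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy multi-armed bandit (illustrative toy)."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.counts = [0] * n_arms     # pulls per arm
        self.values = [0.0] * n_arms   # observed mean reward per arm

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental update of the mean reward for this arm
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Simulate two ad variants with (hypothetical) click-through rates
bandit = EpsilonGreedyBandit(n_arms=2, epsilon=0.1, seed=42)
true_ctr = [0.05, 0.10]
for _ in range(5000):
    arm = bandit.select()
    clicked = bandit.rng.random() < true_ctr[arm]
    bandit.update(arm, 1.0 if clicked else 0.0)
print(bandit.counts)  # the better variant should dominate the pulls
```

A contextual bandit extends this by conditioning the arm choice on features of the current query or page, and offline RL evaluates such policies from logged interactions without live experimentation.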
HerBERT is a BERT-based language model trained on six different corpora for Polish language understanding. It achieves state-of-the-art results on multiple downstream tasks, including the KLEJ Benchmark and part-of-speech tagging. We release both Base and Large variants of the model as part of the transformers library for anyone to use.