Compact, ultra-fast SoTA reranker enhancing retrieval pipelines and web & terminal applications.
Swiftrank now uses the FlashRank int8 models from the Hugging Face Hub.
Streamlined, Lightweight, Ultra-Fast, State-of-the-Art Reranker, Engineered for Both Retrieval Pipelines and Terminal Applications.
Swiftrank is a rewrite of FlashRank with additional features, greater flexibility, and further optimizations.

- Light Weight
- Ultra Fast
- Based on SoTA cross-encoders and other models:
  - ms-marco-TinyBERT-L-2-v2 (default)
  - ms-marco-MiniLM-L-12-v2
  - rank-T5-flan (best non-cross-encoder reranker)
  - ms-marco-MultiBERT-L-12 (multilingual, supports 100+ languages)
  - ce-esci-MiniLM-L12-v2 (fine-tuned on the Amazon ESCI dataset; interesting because most models are fine-tuned on MS MARCO Bing queries)
- Versatile Configuration: Ranker and Tokenizer instances are passed to create the pipeline.
- Terminal Integration: pipe contexts into the swiftrank CLI tool and get reranked output.
- API Integration: run swiftrank as an API service for seamless integration into your workflow.

Install with pip:

pip install swiftrank
Usage: swiftrank COMMAND
Rerank contexts provided on stdin.

╭─ Commands ──────────────────────────────────────────╮
│ process    STDIN processor. [ json | jsonl | yaml ] │
│ serve      Startup a swiftrank server               │
│ --help,-h  Display this message and exit.           │
│ --version  Display application version.             │
╰─────────────────────────────────────────────────────╯
╭─ Parameters ───────────────────────────────────────────────────╮
│ *  --query      -q  query for reranking evaluation. [required] │
│    --threshold  -t  filter contexts using threshold.           │
│    --first      -f  get most relevant context.                 │
╰─────────────────────────────────────────────────────────────────╯
Print the most relevant context
cat files/contexts | swiftrank -q "Jujutsu Kaisen: Season 2" -f
Jujutsu Kaisen 2nd Season
Filtering using a threshold
Piping the output to fzf provides a selection menu:
cat files/contexts | swiftrank -q "Jujutsu Kaisen: Season 2" -t 0.98 | fzf
Jujutsu Kaisen 2nd Season
Jujutsu Kaisen 2nd Season Recaps
Use a different model by setting the SWIFTRANK_MODEL environment variable.

Bash:
export SWIFTRANK_MODEL="ms-marco-MiniLM-L-12-v2"

PowerShell:
$env:SWIFTRANK_MODEL = "ms-marco-MiniLM-L-12-v2"
cat files/contexts | swiftrank -q "Jujutsu Kaisen: Season 2"
Jujutsu Kaisen 2nd Season
Jujutsu Kaisen 2nd Season Recaps
Jujutsu Kaisen
Jujutsu Kaisen 0 Movie
Jujutsu Kaisen Official PV
Shingeki no Kyojin Season 2
Shingeki no Kyojin Season 3 Part 2
Shingeki no Kyojin Season 3
Shingeki no Kyojin: The Final Season
Kimi ni Todoke 2nd Season
Note: The schema syntax closely resembles that of jq, but it employs a custom parser to avoid the hassle of installing jq. A plain-Python sketch of what these schemas do follows the examples below.
Usage: swiftrank process [OPTIONS]
STDIN processor. [ json | jsonl | yaml ]

╭─ Parameters ────────────────────────────────────────────╮
│ --pre   -r  schema for pre-processing input.            │
│ --ctx   -c  schema for extracting context.              │
│ --post  -p  schema for extracting field after reranking. │
╰──────────────────────────────────────────────────────────╯
json
cat files/contexts.json | swiftrank -q "Jujutsu Kaisen: Season 2" process -r ".categories[].items" -c '.name' -t 0.9
Jujutsu Kaisen 2nd Season
Jujutsu Kaisen 2nd Season Recaps
Jujutsu Kaisen
Jujutsu Kaisen Official PV
Jujutsu Kaisen 0 Movie
Provide one field for reranking and retrieve a different field as the output using the --post/-p option:
cat files/contexts.json | swiftrank -q "Jujutsu Kaisen: Season 2" process -r ".categories[].items" -c '.name' -p '.url' -f
https://myanimelist.net/anime/51009/Jujutsu_Kaisen_2nd_Season
yaml
cat files/contexts.yaml | swiftrank -q "Monogatari Series: Season 2" process -r ".categories[].items" -c '.name' -f
Monogatari Series: Second Season
Provide one field for reranking and receive a different field as the output using the --post/-p option:
cat files/contexts.yaml | swiftrank -q "Monogatari Series: Season 2" process -r ".categories[].items" -c '.name' -p '.payload.status' -f
Finished Airing
JSON lines and YAML lines don't require the --pre/-r option, as they're loaded into an array object by default.
jsonlines
cat files/contexts.jsonl | swiftrank -q "Monogatari Series: Season 2" process -c '.name' -p '.payload.aired' -f
Jul 7, 2013 to Dec 29, 2013
yamllines
cat files/contextlines.yaml | swiftrank -q "Monogatari Series: Season 2" process -c '.name' -f
Monogatari Series: Second Season
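To make the schema flags concrete, the json example above (-r ".categories[].items" -c '.name' -p '.url') corresponds roughly to the plain-Python sketch below. The structure of files/contexts.json is inferred from those schemas (a top-level categories list whose entries carry an items list of objects with name and url fields), so treat this as an illustration of the flags, not a description of the actual file or parser.

```python
import json

with open("files/contexts.json") as f:
    data = json.load(f)

# --pre / -r ".categories[].items": select the list of objects to rerank
items = [item for category in data["categories"] for item in category["items"]]

# --ctx / -c '.name': the field whose text is scored against the query
contexts = [item["name"] for item in items]

# --post / -p '.url': the field you want back for each item; here it is just
# extracted for every item, while the CLI applies it to the reranked results
outputs = [item["url"] for item in items]
```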
Usage: swiftrank serve [OPTIONS]
Startup a swiftrank server

╭─ Parameters ──────────────────────────────╮
│ --host  Host name. [default: 0.0.0.0]     │
│ --port  Port number. [default: 12345]     │
╰────────────────────────────────────────────╯
swiftrank serve
[GET] /models - List Models
[POST] /rerank - Rerank Endpoint
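Once the server is up (defaults: host 0.0.0.0, port 12345), you can call it over HTTP. The snippet below is a minimal sketch using the requests library; the /rerank payload shape (query plus contexts) is an assumption for illustration, so verify it against the server's actual request/response schema (for example via its OpenAPI docs, if exposed) before relying on it.

```python
import requests

BASE_URL = "http://localhost:12345"  # default port used by `swiftrank serve`

# List the models the server exposes.
print(requests.get(f"{BASE_URL}/models").json())

# Rerank a few contexts. NOTE: the body below ("query"/"contexts") is an
# assumed shape for illustration; check the actual endpoint definition.
payload = {
    "query": "Tricks to accelerate LLM inference",
    "contexts": [
        "vLLM is a fast and easy-to-use library for LLM inference and serving.",
        "Introduce *lookahead decoding*: a parallel decoding algo to accelerate LLM inference.",
    ],
}
response = requests.post(f"{BASE_URL}/rerank", json=payload)
response.raise_for_status()
print(response.json())
```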
Build a ReRankPipeline instance
Ranker and Tokenizer instance:

from swiftrank import Ranker, Tokenizer, ReRankPipeline
ranker = Ranker(model_id="ms-marco-TinyBERT-L-2-v2")
tokenizer = Tokenizer(model_id="ms-marco-TinyBERT-L-2-v2")
reranker = ReRankPipeline(ranker=ranker, tokenizer=tokenizer)
Or build directly from a model id:

from swiftrank import ReRankPipeline
reranker = ReRankPipeline.from_model_id("ms-marco-TinyBERT-L-2-v2")
Evaluate the pipeline
contexts = [
"Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels"
]
for ctx in reranker.invoke(
    query="Tricks to accelerate LLM inference", contexts=contexts
):
    print(ctx)
Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.
There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.
vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels
LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper
Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.
Get score-mapped contexts as output
for ctx_w_score in reranker.invoke_with_score(
    query="Tricks to accelerate LLM inference", contexts=contexts
):
    print(ctx_w_score)
(0.9977508, 'Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.')
(0.9415497, "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.")
(0.47455463, 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels')
(0.43783104, 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper')
(0.043041725, 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.')
Want to filter contexts? Use the threshold parameter.
for ctx in reranker.invoke(
    query="Tricks to accelerate LLM inference", contexts=contexts, threshold=0.8
):
    print(ctx)
Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.
There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.
Have dictionaries or class instances as contexts? Use the key parameter.
dictionary object
contexts = [
{"id": 1, "content": "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step."},
{"id": 2, "content": "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper"},
{"id": 3, "content": "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run."},
{"id": 4, "content": "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup."},
{"id": 5, "content": "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels"}
]
for ctx in reranker.invoke(
    query="Tricks to accelerate LLM inference", contexts=contexts, key=lambda x: x['content']
):
    print(ctx)
{'id': 1, 'content': 'Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'}
{'id': 3, 'content': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run."}
{'id': 5, 'content': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'}
{'id': 2, 'content': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'}
{'id': 4, 'content': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.'}
class instance or pydantic.BaseModel object
from langchain_core.documents import Document
contexts = [
Document(page_content="Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step."),
Document(page_content="LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper"),
Document(page_content="There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run."),
Document(page_content="Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup."),
Document(page_content="vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels")
]
for ctx in reranker.invoke(
    query="Tricks to accelerate LLM inference", contexts=contexts, key=lambda x: x.page_content
):
    print(ctx)
page_content='Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'
page_content="There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I've found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run."
page_content='vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'
page_content='LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'
page_content='Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.'
This project is derived from FlashRank, which is licensed under the Apache License 2.0. We extend our gratitude to the original authors and contributors for their work. The original repository provided a foundational framework for the development of our project, and we have built upon it with additional features and improvements.
@software{Damodaran_FlashRank_Lightest_and_2023,
author = {Damodaran, Prithiviraj},
doi = {10.5281/zenodo.10426927},
month = dec,
title = {{FlashRank, Lightest and Fastest 2nd Stage Reranker for search pipelines.}},
url = {https://github.com/PrithivirajDamodaran/FlashRank},
version = {1.0.0},
year = {2023}
}