deepset-ai/FARM
:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
DPR Improvements
- DPR - improve loading of datasets #733 @voidful
- DPR - enable saving and loading of other model types, e.g., RoBERTa models #765 @Timoeller @julian-risch
- DPR - fix conversion of BiAdaptiveModel #753 @bogdankostic
torch 1.8.1 and transformers 4.6.1
- Bump transformers version to 4.6.1 #787 @Timoeller @julian-risch
- Bump torch version to 1.8.1 #767 @Timoeller @julian-risch
Multi-task Learning
- Implement Multi-task Learning and added example #778 @johann-petrak
List of Evaluation Metrics
- Allow list of metrics and add tests and pythondoc #777 @johann-petrak
Misc
- Reduce number of logging messages by Processor about returning problematic ids #772 @johann-petrak
- Add `farm.__version__` tag #761 @johann-petrak
- Add value of doc_stride, max_seq_len, max_query_length in error message #784 @ftesser
- Convert QACandidates with empty or whitespace answers to no_answers on doc level #756 @julian-risch
- String comparison: replace "is" with "==" #774 @johann-petrak
- Fix reference before assignment in DataSilo #738 @bogdankostic
- Changing QA_input format in tutorial #735 @julian-risch
- Fix TextPairClassificationProcessor example by adding metric #780 @julian-risch
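The string comparison fix (#774) matters because `is` tests object identity while `==` tests value equality. The following is a minimal sketch in plain Python, not taken from the FARM code base; the variable names are illustrative, and whether two equal strings happen to share identity is an implementation detail of the interpreter.

```python
# Build the same text in two different ways so the objects are distinct.
expected = "question_answering"
built_at_runtime = "_".join(["question", "answering"])

# Value comparison: what string checks should normally use.
values_match = expected == built_at_runtime

# Identity comparison: only true if both names point to the same object,
# which the interpreter does not guarantee for strings created at runtime.
same_object = expected is built_at_runtime

print(values_match)
```

Because the result of `same_object` can vary between interpreters and versions, code that branches on it can behave inconsistently, which is why `==` is the safer choice.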
A patch release focusing on bug fixes for Dense Passage Retrieval (DPR).
DPR
- Fix saving and loading of DPR models and Processors #746
- Fix DPR tokenization statistics #738
- Fix cosine similarity in DPR training #741
Misc
- Fix tuple input for TextPairClassification inference #723
QA Confidence Scores
- In response to several requests from the community, we now provide more meaningful confidence scores for the predictions of extractive QA models. #690 #705 @julian-risch @timoeller @lalitpagaria
- An [example](https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_confidence.py) shows how to calibrate and use the confidence scores.
Misc
- Refactor text pair handling and add text pair regression #713 @timoeller
- Refactor Textsimilarity processor #711 @timoeller
- Refactor Regression and inference processors #702 @timoeller
- Fix NER probabilities #700 @brandenchan
- Calculate squad evaluation metrics overall and separately for text answers and no answers #698 @julian-risch
- Re-enable test_dpr_modules also for windows #697 @ftesser
- Use Path instead of String in ONNXAdaptiveModel #694 @skiran252
- Big thanks to all contributors!
This is just a small patch to change the return types of offsets in our QAInferencer (see #693). It is needed to fix REST API related issues where int64 values cannot be serialized to JSON.
Patch release
- This is just a quick patch release that fixes a bug in the input validation for Question Answering
- [closed] Fix/missing truncation bug [#679](https://github.com/deepset-ai/FARM/pull/679)
Additional feature for QA
- Still, another interesting feature slipped in: We can now filter QA predictions to not contain duplicate answers.
- [closed] Added filter_range parameter that allows filtering out answers with similar start/end indices [#680](https://github.com/deepset-ai/FARM/pull/680)
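To illustrate the idea of filtering answers with similar start/end indices, here is a minimal sketch in plain Python. It is one possible interpretation, not FARM's implementation: the `Candidate` type, the `filter_similar` function and the exact duplicate rule (start or end within `filter_range` of an already kept answer) are assumptions made for this example.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    start: int    # start token index of the answer span
    end: int      # end token index of the answer span
    score: float  # model score, higher is better


def filter_similar(candidates: List[Candidate], filter_range: int) -> List[Candidate]:
    """Keep the best-scoring candidate among spans with nearby start or end indices."""
    kept: List[Candidate] = []
    # Visit candidates from best to worst so the best of each group survives.
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        is_duplicate = any(
            abs(cand.start - k.start) <= filter_range
            or abs(cand.end - k.end) <= filter_range
            for k in kept
        )
        if not is_duplicate:
            kept.append(cand)
    return kept


candidates = [
    Candidate(start=10, end=14, score=0.9),
    Candidate(start=11, end=14, score=0.8),  # nearly the same span as the first
    Candidate(start=40, end=45, score=0.5),
]
filtered = filter_similar(candidates, filter_range=1)
```

In this sketch the second candidate is dropped because its span almost coincides with the higher-scoring first one, while the distant third span is kept.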
Additional test
- [part: tokenizer][task: QA] Add integration test for QA processing [#683](https://github.com/deepset-ai/FARM/pull/683)
Misc
- [closed] Remove "qas" inference input wherever possible [#681](https://github.com/deepset-ai/FARM/pull/681)
- [closed] Added parameter names to convert_from_transformers call in question_answering_crossvalidation.py [#672](https://github.com/deepset-ai/FARM/pull/672)
Question Answering Preprocessing
- We especially focussed on making QA processing more sequential and divided the code into meaningful snippets #649
- The code snippets are (see related [method](https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/processor.py#L1883)):
  - convert the input into FARM specific QA format
  - tokenize the questions and texts
  - split texts into passages to fit the sequence length constraint of Language Models
  - [optionally] convert labels (disabled during inference)
  - convert question, text, labels and additional information to PyTorch tensors
- The Processor.dataset_from_dicts method by default returns an additional parameter `problematic_sample_ids` that keeps track of which input sample caused problems during preprocessing.
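The passage-splitting step listed above can be sketched as a sliding window over tokens. This is a simplified illustration and not the FARM Processor code: the function name `split_into_passages` is hypothetical, `doc_stride` is assumed to be the distance between the starts of consecutive windows, and the space that the question and special tokens take up in the real sequence length budget is ignored.

```python
from typing import List


def split_into_passages(tokens: List[str], max_len: int, doc_stride: int) -> List[List[str]]:
    """Split tokens into windows of at most max_len, starting every doc_stride tokens."""
    if not 0 < doc_stride <= max_len:
        # A stride larger than the window would skip tokens entirely.
        raise ValueError("doc_stride must be positive and not larger than max_len")
    passages = []
    start = 0
    while True:
        passages.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the text
        start += doc_stride
    return passages


tokens = "Berlin is the capital and largest city of Germany".split()
passages = split_into_passages(tokens, max_len=5, doc_stride=3)
for passage in passages:
    print(passage)
```

Consecutive windows overlap by `max_len - doc_stride` tokens, so an answer that straddles a window boundary still appears whole in at least one passage as long as it is shorter than the overlap.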
Add Dense Passage Retriever (DPR) incl. Training & Inference (#513, #601, #606)
- Happy to introduce a completely new task type to FARM: Text similarity with two separate transformer encoders
- Why?
- What?
- How?
- We introduce a new class `BiAdaptiveModel` that has two language models plus a prediction head.
- In the case of DPR, this will be one question encoder model and one passage encoder model.
- See the new example script [dpr_encoder.py](https://github.com/deepset-ai/FARM/blob/master/examples/dpr_encoder.py) for training / fine-tuning a DPR model.
- We also have a tight integration in Haystack, where you can use it as a Retriever for open-domain Question Answering.
Refactor conversion from / to Transformers #576
- We simplified conversion between FARM <-> Transformers. You can now run:
```python
model = Converter.convert_from_transformers("deepset/roberta-base-squad2", device="cpu")
transformer_models = Converter.convert_to_transformers(your_adaptive_model)
```
Upgrade to Transformers 3.3.1 #579
- Thanks to @lalitpagaria, we'll support RAG also in Haystack soon (see https://github.com/deepset-ai/haystack/pull/484)
- ----------------------------
Question Answering
- Improve Speed: Vectorize Question Answering Prediction Head [#603](https://github.com/deepset-ai/FARM/pull/603)
- Fix removal of yes no answers [#540](https://github.com/deepset-ai/FARM/pull/540)
- Fix QA bug that rejected spans at beginning of passage [#564](https://github.com/deepset-ai/FARM/pull/564)
- Added warning about Natural Questions inference. [#565](https://github.com/deepset-ai/FARM/pull/565)
- Remove loss index from QA PH [#589](https://github.com/deepset-ai/FARM/pull/589)
Other
- Catch empty datasets in Inferencer [#605](https://github.com/deepset-ai/FARM/pull/605)
- Add option to set evaluation batch size [#607](https://github.com/deepset-ai/FARM/pull/607)
- Infer model type from config [#600](https://github.com/deepset-ai/FARM/pull/600)
- Fix random behavior when loading ELECTRA models [#599](https://github.com/deepset-ai/FARM/pull/599)
- Fix import for Python3.6 [#581](https://github.com/deepset-ai/FARM/pull/581)
- Fixed conversion of BertForMaskedLM to transformers [#555](https://github.com/deepset-ai/FARM/pull/555)
- Load correct config for DistilBert model [#562](https://github.com/deepset-ai/FARM/pull/562)
- Add passages per second calculation to benchmarks [#560](https://github.com/deepset-ai/FARM/pull/560)
Minor patch: Relax PyTorch version requirements
- Installing FARM in environments where torch's GPU version was already installed via pip (e.g. torch 1.6.0+cu101) caused version conflicts. This is especially annoying in Google Colab environments.
- Change: Allow all torch 1.6.x versions, incl. 1.6.0+cu101 etc.
- -------------------------------
- Further changes:
- Nested cross validation by @PhilipMay [#508](https://github.com/deepset-ai/FARM/pull/508)
Experimental Support for fast Rust Tokenizers (#482)
- Usage:
```python
from farm.modeling.tokenization import Tokenizer

tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-german-cased",
                           do_lower_case=False,
                           use_fast=True)
```
Upgrade to transformers 3.1.0 (#464)
- ----------------------------
Question Answering
- Add asserts on doc_stride and max_seq_len to prevent issues with sliding window [#538](https://github.com/deepset-ai/FARM/pull/538)
- Fix Natural Questions inference processing [#521](https://github.com/deepset-ai/FARM/pull/521)
Other
- Fix logging of error msg for FastTokenizer + QA [#541](https://github.com/deepset-ai/FARM/pull/541)
- Fix truncation warnings in tokenizer [#528](https://github.com/deepset-ai/FARM/pull/528)
- Evaluate model on best model when doing early stopping [#524](https://github.com/deepset-ai/FARM/pull/524)
- Bump transformers version to 3.1.0 [#515](https://github.com/deepset-ai/FARM/pull/515)
- Add warmup run to component benchmark [#504](https://github.com/deepset-ai/FARM/pull/504)
- Add optional s3 auth via params [#511](https://github.com/deepset-ai/FARM/pull/511)
- Add option to use fast HF tokenizer. [#482](https://github.com/deepset-ai/FARM/pull/482)
- CodeBERT support for embeddings [#488](https://github.com/deepset-ai/FARM/pull/488)
Support for PyTorch 1.6 (#502)
- We now support 1.6 and 1.5.1
- ----------------------------
Question Answering
- Pass max_answers param to processor [#503](https://github.com/deepset-ai/FARM/pull/503)
- Deprecate QA input dicts with [context, qas] as keys [#472](https://github.com/deepset-ai/FARM/pull/472)
- Squad processor verbose feature [#470](https://github.com/deepset-ai/FARM/pull/470)
- Propagate QA ground truth in Inferencer [#469](https://github.com/deepset-ai/FARM/pull/469)
- Ensure QAInferencer always has task_type "question_answering" [#460](https://github.com/deepset-ai/FARM/pull/460)
Other
- Download models from (private) S3 [#500](https://github.com/deepset-ai/FARM/pull/500)
- fix _initialize_data_loaders in data_silo [#476](https://github.com/deepset-ai/FARM/pull/476)
- Remove torch version wildcard in requirements [#489](https://github.com/deepset-ai/FARM/pull/489)
- Make num processes parameter consistent across inferencer and data silo [#480](https://github.com/deepset-ai/FARM/pull/480)
- Remove rest_api_schema argument in inference_from_dicts() [#474](https://github.com/deepset-ai/FARM/pull/474)
- farm.data_handler.utils: Add encoding to open write in split_file method [#466](https://github.com/deepset-ai/FARM/pull/466)
- Fix and document Inferencer usage and pool handling [#429](https://github.com/deepset-ai/FARM/pull/429)
- Remove assertions or replace with logging error [#468](https://github.com/deepset-ai/FARM/pull/468)
Main changes
- Upgrading to PyTorch 1.5.1 and transformers 3.0.2
- Important bug fix for language model training from scratch
- Bug fixes and big refactorings for Question Answering, incl. a specialized QAInferencer with dedicated In- and Output objects to simplify usage and code completion:
```
from farm.infer import QAInferencer
from farm.data_handler.inputs import QAInput, Question
nlp = QAInferencer.load(
    "deepset/roberta-base-squad2",
    ...
```
Question Answering
- Add meta attribute to QACandidate for Haystack [#455](https://github.com/deepset-ai/FARM/pull/455)
- Fix start and end offset checks in QA [#450](https://github.com/deepset-ai/FARM/pull/450)
- Fix offset_end character for QA [#449](https://github.com/deepset-ai/FARM/pull/449)
- Dedicated Input Objects for QA [#445](https://github.com/deepset-ai/FARM/pull/445)
- Question Answering improvements: cleaner code, more typed objects, better compatibility between SQuAD and Natural Questions [#411](https://github.com/deepset-ai/FARM/pull/411), [#438](https://github.com/deepset-ai/FARM/pull/438), [#419](https://github.com/deepset-ai/FARM/pull/419)
Other
- Upgrade pytorch and python versions [#447](https://github.com/deepset-ai/FARM/pull/447)
- Upgrade transformers version [#448](https://github.com/deepset-ai/FARM/pull/448)
- Fix randomisation of train file for training from scratch [#427](https://github.com/deepset-ai/FARM/pull/427)
- Fix loading of saved models with class weights [#431](https://github.com/deepset-ai/FARM/pull/431)
- Remove raising exception errors in processor [#451](https://github.com/deepset-ai/FARM/pull/451)
- Fix bug in benchmark tests with if statement [#430](https://github.com/deepset-ai/FARM/pull/430)
- Remove hardcoded seeds from trainer [#424](https://github.com/deepset-ai/FARM/pull/424)
- Conditional num_training_steps setting [#437](https://github.com/deepset-ai/FARM/pull/437)
Speed optimization training from scratch
- Adding multiple optimizations and bug fixes to improve training from scratch, incl.:
  - Enable usage of DistributedDataParallel
  - Enable Automatic Mixed Precision Training
  - Fix bugs in StreamingDataSilo
  - Fix bugs in Checkpointing (important for training via spot / on-demand instances)
- This helped to reduce training time in our benchmark from 616 hours to 160 hours (roughly 3.85x faster)
- See [#305](https://github.com/deepset-ai/FARM/pull/305) for details
- ---------------------------------
ELECTRA Model
- You can load it as usual via
```
from farm.modeling.language_model import LanguageModel
LanguageModel.load("google/electra-base-discriminator")
```
- See HF's [model hub](https://huggingface.co/models?search=electra) for more model variants
Natural Questions Style QA
- Example:
```
QA_input = [
    {
        "qas": ["Is Berlin the capital of Germany?"],
        "context": "Berlin (/bɜːrˈlɪn/) is the capital and largest city of Germany by both area and population."
    }
]
```
New speed benchmarking
- -------------------------------------------------------------------
A few more changes ...
- Modeling
- Add support for Camembert-like models [#396](https://github.com/deepset-ai/FARM/pull/396)
- Speed up in BERTLMHead by doing argmax on logits on GPU [#377](https://github.com/deepset-ai/FARM/pull/377)
- Fix bug in BERT-style pretraining [#369](https://github.com/deepset-ai/FARM/pull/369)
- Remove additional XLM-R tokens [#360](https://github.com/deepset-ai/FARM/pull/360)
- ELECTRA: use gelu for pooled output of ELECTRA model [#364](https://github.com/deepset-ai/FARM/pull/364)
- Data handling
- Option to specify text col name in `TextClassificationProcessor` and `RegressionProcessor` [#387](https://github.com/deepset-ai/FARM/pull/387)
:1234: Changed Multiprocessing in Inferencer
- The Inferencer now has a fixed pool of processes instead of creating a new one for every inference call.
- This speeds up processing a bit and solves some problems when using it in combination with frameworks like gunicorn/FastAPI etc. ([#329](https://github.com/deepset-ai/FARM/pull/329))
- Old:
```
...
inferencer.inference_from_dicts(dicts, num_processes=8)
```
- New:
:fast_forward: Streaming Inferencer
- *Input:* Generator yielding dicts with your text
- *Output:* Generator yielding your predictions
```
dicts = sample_dicts_generator()  # it can be a list of dicts or a generator object
results = inferencer.inference_from_dicts(dicts, streaming=True, multiprocessing_chunksize=20)
for prediction in results:  # results is a generator object that yields predictions
    print(prediction)
```
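The generator-in, generator-out pattern can be illustrated without FARM. The sketch below is an assumption-laden stand-in: `chunked`, `fake_predict` and `streaming_inference` are hypothetical helpers written for this example, and the "model" only measures text length. It shows how inputs can be consumed in chunks while predictions are yielded one at a time.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def chunked(items: Iterable[Dict], chunksize: int) -> Iterator[List[Dict]]:
    """Yield lists of up to chunksize items without materializing the whole input."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


def fake_predict(chunk: List[Dict]) -> List[Dict]:
    """Hypothetical stand-in for a model call; it only measures text length."""
    return [{"text": d["text"], "length": len(d["text"])} for d in chunk]


def streaming_inference(dicts: Iterable[Dict], chunksize: int) -> Iterator[Dict]:
    """Consume dicts lazily and yield one prediction at a time."""
    for chunk in chunked(dicts, chunksize):
        yield from fake_predict(chunk)


def sample_dicts_generator() -> Iterator[Dict]:
    for text in ["first text", "second", "third one"]:
        yield {"text": text}


results = streaming_inference(sample_dicts_generator(), chunksize=2)
predictions = list(results)
```

Only one chunk is held in memory at a time, which is the property that makes this pattern suitable for inputs that are too large to load at once.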
:older_woman: :older_man: "Classic" baseline models for benchmarking + S3E Pooling
- See the [example script](https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification_word_embedding_LM.py)
- See the [example script](https://github.com/deepset-ai/FARM/blob/master/examples/embeddings_extraction_s3e_pooling.py)
- -------------------------------------------------------------------
A few more changes ...
- Modeling
- Cross-validation for Question-Answering [#335](https://github.com/deepset-ai/FARM/pull/335)
- Add option to use max_seq_len tokens for LM Adaptation/Training-from-scratch instead of real sentences [#314](https://github.com/deepset-ai/FARM/pull/314)
- Add English GloVe models [#339](https://github.com/deepset-ai/FARM/pull/339)
- Implicitly connect heads with processor + check for connection [#337](https://github.com/deepset-ai/FARM/pull/337)
- Evaluation & Inference
- Registration of custom evaluation reports [#331](https://github.com/deepset-ai/FARM/pull/331)
- Standalone Evaluation with pretrained models [#330](https://github.com/deepset-ai/FARM/pull/330)
:fast_forward: Scalable preprocessing: StreamingDataSilo
- Allows you to load data lazily from disk and preprocess a batch on-the-fly when needed during training.
- `stream_data_silo = StreamingDataSilo(processor=processor, batch_size=batch_size)`
- => Allows large datasets that don't fit in memory (e.g. for training from scratch)
- => Training starts immediately; no initial preprocessing time is needed.
ONNX support:
```
model = AdaptiveModel(...)
model.convert_to_onnx(Path("./onnx_model"))
inferencer = Inferencer.load(model_name_or_path=Path("./onnx_model"))
```
- => [See example](https://github.com/deepset-ai/FARM/blob/master/examples/onnx_question_answering.py)
| Batch Size | PyTorch | ONNX | ONNX V100 optimizations | Speedup |
|------------|---------|------|-------------------------|---------|
Embedding extraction:
- Extracting embeddings from a model at inference time is now more similar to other inference modes.
- *Old*
```
model = Inferencer.load(lang_model, task_type="embeddings", gpu=use_gpu, batch_size=batch_size)
result = model.extract_vectors(dicts=basic_texts, extraction_strategy="cls_token", extraction_layer=-1)
```
- *New*
```
...
```
:left_right_arrow: New tasks: TextPairClassification & Passage ranking
- Examples:
- [MSMARCO passage ranking](https://github.com/deepset-ai/FARM/blob/master/examples/passage_ranking.py)
- [ASNQ text pair classification](https://github.com/deepset-ai/FARM/blob/master/examples/text_pair_classification.py)
- -------------------------------------------------------------------
Faster & simpler Inference
- Make extract_vectors more compatible to other inference types [#292](https://github.com/deepset-ai/FARM/pull/292)
- Add test for onnx qa inference. Fix bug in loading PHs for ONNX. [#297](https://github.com/deepset-ai/FARM/pull/297)
- Add ONNX Inference for Question Answering [#288](https://github.com/deepset-ai/FARM/pull/288)
- Improve inferencer for better multiprocessing with QA / haystack [#278](https://github.com/deepset-ai/FARM/pull/278)
- Scalable QA aggregation [#268](https://github.com/deepset-ai/FARM/pull/268)
- Allow for multiple queries in QA inference when using rest_api format [#246](https://github.com/deepset-ai/FARM/pull/246)
- Decouple n_best in QA predictions [#269](https://github.com/deepset-ai/FARM/pull/269)
- Correct keyword argument for max_processes when used by calc_chunksize() [#255](https://github.com/deepset-ai/FARM/pull/255)
Streaming Data Silo / Training from scratch
- StreamingDataSilo for loading & preprocessing batches lazily during training [#239](https://github.com/deepset-ai/FARM/pull/239)
- Fix dict chunking in StreamingDataSilo for LMFinetuning [#284](https://github.com/deepset-ai/FARM/pull/284)
- Add example for training with AWS SageMaker [#283](https://github.com/deepset-ai/FARM/pull/283)
- Fix deletion of old training checkpoints [#282](https://github.com/deepset-ai/FARM/pull/282)
- Fix epoch number for saving a training checkpoint [#281](https://github.com/deepset-ai/FARM/pull/281)
- Fix Train Step calculations for Checkpointing [#279](https://github.com/deepset-ai/FARM/pull/279)
- Implement __len__() for StreamingDataSilo [#274](https://github.com/deepset-ai/FARM/pull/274)
- Refactor StreamingDataSilo to support multiple train epochs [#266](https://github.com/deepset-ai/FARM/pull/266)
Modeling
- Add support for text pair classification (ASNQ) and ranking (MSMarco) [#237](https://github.com/deepset-ai/FARM/pull/237)
- Add conversion of lm_finetuned to HF transformers [#290](https://github.com/deepset-ai/FARM/pull/290)
- Added `next_sentence_head` in `examples/lm_finetuning.py`. [#273](https://github.com/deepset-ai/FARM/pull/273)
- Quickfix loading pred head [#256](https://github.com/deepset-ai/FARM/pull/256)
- Made use of 'language' **kwargs if present in LanguageModel.load. [#262](https://github.com/deepset-ai/FARM/pull/262)
- Add the option to define the language model class manually [#264](https://github.com/deepset-ai/FARM/pull/264)
- Fix XLMR Bug When Calculating Start of Second Sequence [#240](https://github.com/deepset-ai/FARM/pull/240)
Examples / Tutorials / Experiments
- Add data handling for GermEval14, add checks for correct data files [#259](https://github.com/deepset-ai/FARM/pull/259)
- Fix separator in CoNLL_de experiment config [#254](https://github.com/deepset-ai/FARM/pull/254)
- Use correct German conll03 data + conversion [#248](https://github.com/deepset-ai/FARM/pull/248)
- Bugfix parameter loading through experiment configs [#252](https://github.com/deepset-ai/FARM/pull/252)
- Add early stopping to experiment [#253](https://github.com/deepset-ai/FARM/pull/253)
- Fix Tutorial: Add missing param in initialize_optimizer [#245](https://github.com/deepset-ai/FARM/pull/245)
Other
- Add Azure test pipeline [#270](https://github.com/deepset-ai/FARM/pull/270)
- Fix progress bar in datasilo [#267](https://github.com/deepset-ai/FARM/pull/267)
- Turn off prints and logging during testing [#260](https://github.com/deepset-ai/FARM/pull/260)
- Pin Werkzeug version in requirements.txt [#250](https://github.com/deepset-ai/FARM/pull/250)
- Add ConnectionError handling for MLFlow logger [#236](https://github.com/deepset-ai/FARM/pull/236)
- Clearer message when DataSilo calculates Sequence Lengths [#293](https://github.com/deepset-ai/FARM/pull/293)
- Add metric to text_pair_classification example [#294](https://github.com/deepset-ai/FARM/pull/294)
- Add preprocessed CORD-19 dataset [#295](https://github.com/deepset-ai/FARM/pull/295)
:man_farmer: :arrows_counterclockwise: :hugs: Full compatibility with Transformers' models
- 1. Convert models from/to transformers
```
model = AdaptiveModel.convert_from_transformers("deepset/bert-base-cased-squad2", device="cpu", task_type="question_answering")
transformer_model = model.convert_to_transformers()
```
- 2. Load models from their new [model hub](https://huggingface.co/models):
```
LanguageModel.load("TurkuNLP/bert-base-finnish-cased-v1")
```
:rocket: Better & Faster Training
- Thanks to @BramVanroy and @johann-petrak we got some really hot new features here:
- Automatic Mixed Precision (AMP) Training: Speed up your training by ~ 35%! Model params are usually stored with FP32 precision. Some model layers don't need that precision and can be reduced to FP16, which speeds up training and reduces the memory footprint. AMP is a smart way of figuring out for which params we can reduce precision without sacrificing performance ([Read more](https://nvlabs.github.io/iccv2019-mixed-precision-tutorial/files/dusan_stosic_intro_to_mixed_precision_training.pdf)).
- Test it by installing [apex](https://github.com/NVIDIA/apex) and setting "use_amp" to "O1" in one of the FARM example scripts.
- More flexible Optimizers & Schedulers: Choose whatever optimizer you like from PyTorch, apex or Transformers. Take your preferred learning rate schedule from Transformers or PyTorch ([Read more](https://github.com/deepset-ai/FARM#1-optimizers--learning-rate-schedules))
- Cross-validation: Get more reliable eval metrics on small datasets (see [example](https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification_crossvalidation.py))
- Early Stopping: With early stopping, the run stops once a chosen metric is not improving any further and you take the best model up to this point. This helps prevent overfitting on small datasets and reduces training time if your model doesn't improve any further (see [example](https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification_with_earlystopping.py)).
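The early stopping behaviour described above (stop once the chosen metric no longer improves and keep the best model so far) can be sketched in a few lines. This is an illustrative stand-in, not FARM's early stopping class: the function name, the `patience` parameter and the assumption that a higher metric is better are all choices made for this example.

```python
from typing import List, Tuple


def early_stopping(metric_per_eval: List[float], patience: int) -> Tuple[int, float]:
    """Return (index, value) of the best evaluation seen before stopping.

    Stops after `patience` consecutive evaluations without improvement.
    Higher metric values are treated as better.
    """
    best_index, best_value = -1, float("-inf")
    evals_without_improvement = 0
    for index, value in enumerate(metric_per_eval):
        if value > best_value:
            best_index, best_value = index, value
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                break  # stop training; later evaluations are never run
    return best_index, best_value


dev_f1 = [0.61, 0.70, 0.74, 0.73, 0.72, 0.90]
best_index, best_value = early_stopping(dev_f1, patience=2)
```

With `patience=2` the run stops after the fifth evaluation, so the model from the third evaluation is kept and the sixth score is never reached. That is the trade-off of a small patience value: less training time, at the risk of stopping before a late improvement.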
:fast_forward: Caching & Checkpointing
- Save time if you run similar pipelines (e.g. only experimenting with model params): Store your preprocessed dataset & load it next time from cache:
```
data_silo = DataSilo(processor=processor, batch_size=batch_size, caching=True)
```
- Start & stop training by saving checkpoints of the trainer:
```
trainer = Trainer.create_or_load_checkpoint(
    ...
```
:computer: Windows support
- FARM now also runs on Windows. This implies one breaking change:
- We now use pathlib and therefore expect all directory paths to be of type `Path` instead of `str` [#172](https://github.com/deepset-ai/FARM/pull/172)
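As a short illustration of the switch from `str` to `Path`, the sketch below uses only the standard library. The directory and file names are made up for this example and are not prescribed by FARM.

```python
from pathlib import Path

# Build paths with the / operator instead of string concatenation, so the
# same code works with Windows and POSIX separators.
save_dir = Path("saved_models") / "bert-english-doc-tutorial"
config_file = save_dir / "processor_config.json"

# Existing string paths can be wrapped before being passed on.
data_dir = Path("data/germeval18")

print(config_file.name)  # processor_config.json
print(save_dir.parent)   # saved_models
```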
- -------------------------------------------------------------------
Modelling
- [enhancement] ALBERT support [#169](https://github.com/deepset-ai/FARM/pull/169)
- [enhancement] DistilBERT support [#187](https://github.com/deepset-ai/FARM/pull/187)
- [enhancement] XLM-Roberta support [#181](https://github.com/deepset-ai/FARM/pull/181)
- [enhancement] Automatically infer layer dims of prediction head [#195](https://github.com/deepset-ai/FARM/pull/195)
- [bug] Implement next_sent_pred flag [#198](https://github.com/deepset-ai/FARM/pull/198)
QA
- [enhancement] Encoding of QA IDs [#171](https://github.com/deepset-ai/FARM/pull/171)
- [enhancement] Remove repeat QA preds from overlapping passages [#186](https://github.com/deepset-ai/FARM/pull/186)
- [enhancement] More options to control predictions of Question Answering Head [#183](https://github.com/deepset-ai/FARM/pull/183)
- [bug] Fix QA example [#203](https://github.com/deepset-ai/FARM/pull/203)
Training
- [enhancement] Use AMP instead of naive fp16. More optimizers. More LR Schedules. [#133](https://github.com/deepset-ai/FARM/pull/133)
- [bug] Fix for use AMP instead of naive fp16 (#133) [#180](https://github.com/deepset-ai/FARM/pull/180)
- [enhancement] Add early stopping and custom metrics [#165](https://github.com/deepset-ai/FARM/pull/165)
- [enhancement] Add checkpointing for training [#188](https://github.com/deepset-ai/FARM/pull/188)
- [enhancement] Add train loss to tqdm. add desc for data preproc. log only 2 samples [#175](https://github.com/deepset-ai/FARM/pull/175)
- [enhancement] Allow custom functions to aggregate loss of prediction heads [#220](https://github.com/deepset-ai/FARM/pull/220)
Eval
- [bug] Fixed micro f1 score [#179](https://github.com/deepset-ai/FARM/pull/179)
- [enhancement] Rename classification_report to report [#173](https://github.com/deepset-ai/FARM/pull/173)
Data Handling
- [enhancement] Add caching of datasets in DataSilo [#177](https://github.com/deepset-ai/FARM/pull/177)
- [enhancement] Add option to limit number of processes in datasilo [#174](https://github.com/deepset-ai/FARM/pull/174)
- [enhancement] Add max_multiprocessing_chunksize as a param for DataSilo [#168](https://github.com/deepset-ai/FARM/pull/168)
- [enhancement] Issue59 - Add cross-validation for small datasets [#167](https://github.com/deepset-ai/FARM/pull/167)
- [enhancement] Add max_samples argument to TextClassificationProcessor [#204](https://github.com/deepset-ai/FARM/pull/204)
- [bug] Fix bug with added tokens [#197](https://github.com/deepset-ai/FARM/pull/197)
Other
- [other] Disable multiprocessing in lm_finetuning tests to reduce memory footprint [#176](https://github.com/deepset-ai/FARM/pull/176)
- [bug] Fix device arg in examples [#184](https://github.com/deepset-ai/FARM/pull/184)
- [other] Add error message to train/dev split fn [#190](https://github.com/deepset-ai/FARM/pull/190)
- [enhancement] Add more seeds [#192](https://github.com/deepset-ai/FARM/pull/192)
- :man_farmer: :woman_farmer: Thanks to all contributors for making FARMer's life better!
- @brandenchan, @tanaysoni, @Timoeller, @tholor, @maknotavailable, @johann-petrak, @BramVanroy
:paintbrush: Fundamental Re-design of Question Answering
- We put substantial effort into re-designing QA in FARM with two goals in mind: making it the simplest & fastest pipeline out there.
- Results:
- :bulb: Simplicity: The pipeline is cleaner, more modular and easier to extend.
- :rocket: Speed: Preprocessing of SQuAD 2.0 got down to 42s on an AWS p3.8xlarge (vs. ~ 20min in transformers and early versions of FARM). This will not only speed up training cycles and reduce GPU costs, but also has a big impact at inference time, where most time is actually spent on preprocessing.
- See this [blog post](https://medium.com/deepset-ai/modern-question-answering-systems-explained-4d0913744097) for more details and to learn about the key steps in a QA pipeline.
:briefcase: Support of proxy servers
- Example:
```
proxies = {"https": "http://user:pass@10.10.10.10:8000"}
language_model = LanguageModel.load(pretrained_model_name_or_path="bert-base-cased",
                                    language="english",
                                    proxies=proxies
                                    )
...
```
Modelling
- [enhancement] QA redesign [#151](https://github.com/deepset-ai/FARM/pull/151)
- [enhancement] Add backwards compatibility for loading prediction head [#159](https://github.com/deepset-ai/FARM/pull/159)
- [enhancement] Raise an Exception when an invalid path is supplied for loading a saved model [#137](https://github.com/deepset-ai/FARM/pull/137)
- [bug] fix context in QA formatted preds [#163](https://github.com/deepset-ai/FARM/pull/163)
- [bug] Fix loading custom vocab in transformers style for LM finetuning [#155](https://github.com/deepset-ai/FARM/pull/155)
Data Handling
- [enhancement] Allow to load dataset from dicts in DataSilo [#127](https://github.com/deepset-ai/FARM/pull/127)
- [enhancement] Option to supply proxy server [#136](https://github.com/deepset-ai/FARM/pull/136)
- [bug] Fix tokenizer for multiple whitespaces [#156](https://github.com/deepset-ai/FARM/pull/156)
Inference
- [enhancement] Change context in QA formatted preds to not split words [#138](https://github.com/deepset-ai/FARM/pull/138)
Other
- [enhancement] Add test for output format of QA Inferencer [#149](https://github.com/deepset-ai/FARM/pull/149)
- [bug] Fix classification report for multilabel [#150](https://github.com/deepset-ai/FARM/pull/150)
- [bug] Fix inference in doc_classification_cola example [#147](https://github.com/deepset-ai/FARM/pull/147)
- Thanks to all contributors for making FARMer's life better!
- @johann-petrak, @brandenchan, @tanaysoni, @Timoeller, @tholor, @cregouby
Aggregation over multiple passages
- When asking questions on long documents, the underlying Language Model needs to cut the document into multiple passages and answer the question on each of them. The per-passage outputs then need to be aggregated.
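A minimal sketch of such an aggregation step, assuming a simple rule: keep the best score per distinct answer text and rank the answers by that score. The flat `(passage_id, answer, score)` format, the function name and the use of an empty string for "no answer" are hypothetical and only serve this illustration; FARM's actual aggregation logic may differ.

```python
from typing import Dict, List, Tuple

# (passage_id, answer_text, score); a hypothetical flat format for illustration
PassagePrediction = Tuple[int, str, float]


def aggregate_passages(predictions: List[PassagePrediction], n_best: int) -> List[Tuple[str, float]]:
    """Merge per-passage candidates into document-level answers, keeping the best score per answer."""
    best_score: Dict[str, float] = {}
    for _passage_id, answer, score in predictions:
        if score > best_score.get(answer, float("-inf")):
            best_score[answer] = score
    # Rank distinct answers by their best score, highest first.
    ranked = sorted(best_score.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n_best]


predictions = [
    (0, "Berlin", 0.62),
    (1, "Berlin", 0.91),   # same answer found again in an overlapping passage
    (1, "Germany", 0.40),
    (2, "", 0.15),         # empty string used here to mean "no answer"
]
top_answers = aggregate_passages(predictions, n_best=2)
```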
Improved QA Inferencer
- The QA Inferencer:
  - projects model predictions back to character space
  - can be used in the FARM demos UI
  - writes predictions in SQuAD style format, so you can compare the model accuracy with other frameworks
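Projecting predictions back to character space means turning a predicted token span into start and end positions in the original text. The sketch below shows the idea with a plain whitespace tokenizer; both helper functions are hypothetical, and a real pipeline would use the offsets produced by the model's subword tokenizer instead.

```python
from typing import List, Tuple


def tokenize_with_offsets(text: str) -> Tuple[List[str], List[int]]:
    """Whitespace-tokenize text and record the character offset where each token starts."""
    tokens, offsets = [], []
    position = 0
    for token in text.split():
        # Search from the end of the previous token so repeated words map correctly.
        position = text.index(token, position)
        tokens.append(token)
        offsets.append(position)
        position += len(token)
    return tokens, offsets


def token_span_to_char_span(tokens: List[str], offsets: List[int],
                            start_token: int, end_token: int) -> Tuple[int, int]:
    """Map an inclusive token span to a (start, end) character span with exclusive end."""
    char_start = offsets[start_token]
    char_end = offsets[end_token] + len(tokens[end_token])
    return char_start, char_end


text = "Berlin is the capital of Germany"
tokens, offsets = tokenize_with_offsets(text)
char_start, char_end = token_span_to_char_span(tokens, offsets, start_token=2, end_token=3)
answer = text[char_start:char_end]
print(answer)
```

Slicing the original text with the character span returns the answer exactly as it appears in the document, including its original spacing, which is what a user interface or a SQuAD style prediction file needs.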
- --------------------------------
Modelling
- [closed] Refactor squad qa [#131](https://github.com/deepset-ai/FARM/pull/131)
- [enhancement][part: model] Fix passing kwargs to LM loading (e.g. proxy) [#132](https://github.com/deepset-ai/FARM/pull/132)
Adding Roberta & XLNet
- Welcome RoBERTa and XLNet on the FARM :tada:!
- For now, we support Roberta/XLNet on (Multilabel) Text Classification, Text Regression and NER. QA will follow soon.
- :warning: Breaking Change - Loading of Language models has changed:
- `Bert.load("bert-base-cased")` -> `LanguageModel.load("bert-base-cased")`
Migrating to tokenizers from the [transformers repo](https://github.com/huggingface/transformers).
- Pros:
- It's quite easy to add a tokenizer for any of the models implemented in transformers.
- We would rather support the development there than build something in parallel
- The additional metadata during tokenization (offsets, start_of_word) is still created via tokenize_with_metadata
- We can use encode_plus to add model specific special tokens (CLS, SEP ...)
- Cons:
- We had to deprecate our attribute "never_split_chars" that allowed adjusting the BasicTokenizer of BERT.
- Custom vocab is now realized by increasing vocab_size instead of replacing unused tokens
Modelling:
- [enhancement] Add Roberta, XLNet and redesign Tokenizer [#125](https://github.com/deepset-ai/FARM/pull/125)
- [bug] fix loading of old tokenizer style [#129](https://github.com/deepset-ai/FARM/pull/129)
Data Handling:
- [bug] Fix name of squad labels in experiment config [#121](https://github.com/deepset-ai/FARM/pull/121)
- [bug] change arg in squadprocessor from labels to label_list [#123](https://github.com/deepset-ai/FARM/pull/123)
๐ฆ Inference:
- [enhancement] Add option to disable multiprocessing in Inferencer(#117) [#128](https://github.com/deepset-ai/FARM/pull/128)
- [bug] Fix logging verbosity in Inferencer (#117) [#122](https://github.com/deepset-ai/FARM/pull/122)
๐ฆ Other
- [enhancement] Tutorial update [#116](https://github.com/deepset-ai/FARM/pull/116)
- [enhancement] Update docs for api/ui docker [#118](https://github.com/deepset-ai/FARM/pull/118)
๐ฆ Parallelization of Data Preprocessing :rocket:
- With this new approach we can still easily inspect & debug all important transformations for a chunk, but only keep the resulting dataset in memory once a process has finished with a chunk.
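A minimal sketch of the chunkwise idea, assuming a toy featurization step: the input dicts are cut into chunks, each chunk is transformed on its own, and only the resulting features are collected. `chunked`, `featurize_chunk`, `preprocess` and their parameters are hypothetical names, not FARM's interface.

```python
from itertools import islice
from multiprocessing import Pool
from typing import Dict, Iterable, Iterator, List


def chunked(items: Iterable[Dict], chunk_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most chunk_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def featurize_chunk(chunk: List[Dict]) -> List[Dict]:
    """Toy transformation of one chunk; easy to call directly when debugging."""
    return [
        {"tokens": d["text"].lower().split(), "label": d["label"]} for d in chunk
    ]


def preprocess(
    dicts: Iterable[Dict], chunk_size: int = 2, num_processes: int = 1
) -> List[Dict]:
    """Featurize chunk by chunk, keeping only the finished features in memory."""
    dataset: List[Dict] = []
    chunks = chunked(dicts, chunk_size)
    if num_processes <= 1:
        # Serial path: same code as the workers run, convenient for inspection
        for features in map(featurize_chunk, chunks):
            dataset.extend(features)
    else:
        with Pool(processes=num_processes) as pool:
            # imap hands out chunks to workers and returns results in order
            for features in pool.imap(featurize_chunk, chunks):
                dataset.extend(features)
    return dataset
```

When `num_processes` is greater than 1, the call should come from a script guarded by `if __name__ == "__main__":`, since `multiprocessing` may re-import the main module in worker processes on some platforms.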
📦 Multilabel classification
- See an example [here](https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification_multilabel.py)
📦 Concept of Tasks
- Example:
- 1. Add the tasks to the Processor:

```
# Requires FARM to be installed
from farm.data_handler.processor import TextClassificationProcessor

processor = TextClassificationProcessor(...)  # replace ... with your tokenizer, data_dir, etc.
news_categories = ["Sports", "Tech", "Politics", "Business", "Society"]
publisher = ["cnn", "nytimes", "wsj"]
# One task per label column, each with its own label list and metric
processor.add_task(name="category", label_list=news_categories, metric="acc", label_column_name="category_label")
processor.add_task(name="publisher", label_list=publisher, metric="acc", label_column_name="publisher_label")
```
📦 Update to transformers 2.0
📦 Modelling:
- [enhancement] Add Multilabel Classification (#89)
- [enhancement] Add PredictionHead for Regression task (#50)
- [enhancement] Introduce concept of "tasks" to support multitask training using multiple heads of the same type (e.g. for multiple text classification tasks) (#75)
- [enhancement] Update dependency to transformers 2.0 (#106)
- [bug] TypeError: classification_report() got an unexpected keyword argument 'target_names' [#93](https://github.com/deepset-ai/FARM/issues/93)
- [bug] Fix issue with class weights (#82)
📦 Data Handling:
- [enhancement] Chunkwise multiprocessing to reduce memory footprint in preprocessing large datasets (#88)
- [bug] Threading Error upon building Data Silo [#90](https://github.com/deepset-ai/FARM/issues/90)
- [bug] Multiprocessing causes data preprocessing to crash [#110](https://github.com/deepset-ai/FARM/issues/110)
- (https://github.com/deepset-ai/FARM/issues/102)
- [bug] Multiprocessing Error with PyTorch Version 1.2.0 [#97](https://github.com/deepset-ai/FARM/issues/97)
- [bug] Windows fixes (#109)
📦 Inference:
- [enhancement] excessive uncalled-for warnings when using the inferencer [#104](https://github.com/deepset-ai/FARM/issues/104)
- [enhancement] Get probability distribution over all classes in Inference mode (#102)
- [enhancement] Add InferenceProcessor (#72)
- [bug] Fix classification report bug with binary doc classification
📦 Other:
- [enhancement] Add more tests (#108)
- [enhancement] do logging within run_experiment() (#37)
- [enhancement] Improved logging (#82, #87, #105)
- [bug] fix custom vocab for bert-base-cased (#108)
- Thanks to all contributors: @tripl3a, @busyxin, @AhmedIdr, @jinnerbichler, @Timoeller, @tanaysoni, @brandenchan, @tholor
- 👩‍🌾 Happy FARMing!
Changes
- By adding multiprocessing to the data preprocessing, we reduced the execution time for many tasks from hours to minutes. Since the functionality is mostly hidden in the parent class, users don't have to implement anything themselves. However, this required changing the interface of the processor slightly: `_dict_to_samples` and `_sample_to_features` must now be class methods (`@classmethod`), and all objects accessed by them must be class attributes.
- Multi-GPU support is now also available for the "building blocks mode"
- Instead of having one individual processor per dataset, we have implemented a more generic `TextClassificationProcessor` that you can instantiate easily for various predefined tasks (GNAD, GermEval ...) or your own dataset in CSV/TSV format
- [bug] Accuracy metric in LM finetuning always zero [#30](https://github.com/deepset-ai/FARM/issues/30)
- [enhancement] Multi-GPU only enabled in experiment mode [#57](https://github.com/deepset-ai/FARM/issues/57)
- [bug] Wrong number of total steps for linear warmup schedule [#46](https://github.com/deepset-ai/FARM/issues/46)
- [enhancement] Unify redundant `Processor`; add new `NERProcessor` and `TextClassificationProcessor`
- [enhancement] Add parallel dataprocessing [#45](https://github.com/deepset-ai/FARM/pull/45)
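The processor interface change described above (class methods that only touch class attributes) might look roughly like the sketch below. `ToyProcessor`, its attributes, the method signatures and `dataset_from_dicts` are hypothetical and much simpler than FARM's real processors. One plausible reading of the requirement is that worker processes can then call these methods through the class, without needing a processor instance.

```python
from typing import Dict, List


class ToyProcessor:
    # Class attributes: everything the class methods below need to read
    label_list = ["negative", "positive"]
    max_seq_len = 4

    @classmethod
    def _dict_to_samples(cls, dictionary: Dict) -> List[Dict]:
        """Turn one raw input dict into one or more samples."""
        tokens = dictionary["text"].split()[: cls.max_seq_len]
        return [{"tokens": tokens, "label": dictionary["label"]}]

    @classmethod
    def _sample_to_features(cls, sample: Dict) -> Dict:
        """Turn one sample into model-ready features."""
        return {
            "n_tokens": len(sample["tokens"]),
            "label_id": cls.label_list.index(sample["label"]),
        }

    @classmethod
    def dataset_from_dicts(cls, dicts: List[Dict]) -> List[Dict]:
        """Run both steps for a chunk of dicts; needs no instance state."""
        features = []
        for dictionary in dicts:
            for sample in cls._dict_to_samples(dictionary):
                features.append(cls._sample_to_features(sample))
        return features


features = ToyProcessor.dataset_from_dicts(
    [{"text": "a b c d e f", "label": "positive"}]
)
```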
First release of the FARM package. Contributor list: @brandenchan, @tanaysoni, @tholor, @Timoeller, @tripl3a
