raynardj/ner-chemical-bionlp-bc5cdr-pubmed

Token Classification

raynardj committed on Nov 16, 2021

Commit 30dd3ed · 1 Parent(s): eebcf34

Update README.md

Files changed (1)

README.md +5 -56

@@ -31,63 +31,12 @@ All the labels, the possible token classes.
 Note that we removed the 'B-', 'I-' etc. prefixes from the data labels. 🗡
 ## This is the template we suggest for using the model
 ```python
-from transformers import pipeline
-PRETRAINED = "raynardj/ner-chemical-bionlp-bc5cdr-pubmed"
-ner = pipeline(task="ner",model=PRETRAINED, tokenizer=PRETRAINED)
-ner("Your text", aggregation_strategy="first")
-```
-And here is to make your output more consecutive ⭐️
-```python
-import pandas as pd
-from transformers import AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
-def clean_output(outputs):
-    results = []
-    current = []
-    last_idx = 0
-    # make to sub group by position
-    for output in outputs:
-        if output["index"]-1==last_idx:
-            current.append(output)
-        else:
-            results.append(current)
-            current = [output, ]
-        last_idx = output["index"]
-    if len(current)>0:
-        results.append(current)
-    # from tokens to string
-    strings = []
-    for c in results:
-        tokens = []
-        starts = []
-        ends = []
-        for o in c:
-            tokens.append(o['word'])
-            starts.append(o['start'])
-            ends.append(o['end'])
-        new_str = tokenizer.convert_tokens_to_string(tokens)
-        if new_str!='':
-            strings.append(dict(
-                word=new_str,
-                start = min(starts),
-                end = max(ends),
-                entity = c[0]['entity']
-            ))
-    return strings
-def entity_table(pipeline, **pipeline_kw):
-    if "aggregation_strategy" not in pipeline_kw:
-        pipeline_kw["aggregation_strategy"] = "first"
-    def create_table(text):
-        return pd.DataFrame(
-            clean_output(
-                pipeline(text, **pipeline_kw)
-            )
-        )
-    return create_table
-# will return a dataframe
-entity_table(ner)(YOUR_VERY_CONTENTFUL_TEXT)
 ```
 > check our NER model on

 Note that we removed the 'B-', 'I-' etc. prefixes from the data labels. 🗡
 ## This is the template we suggest for using the model
+Of course I'm well aware of the `aggregation_strategy` argument offered by hf. However, during training I discarded the loss for every trailing subword: only the label of the 1st subword token of each word is not -100. After a lot of searching I couldn't find a way to reproduce that behavior with the default pipeline, hence I wrote an inference class myself.
 ```python
+!pip install forgebox
+from forgebox.hf.train import NERInference
+ner = NERInference.from_pretrained("raynardj/ner-chemical-bionlp-bc5cdr-pubmed")
+a_df = ner.predict(["text1", "text2"])
 ```
 > check our NER model on
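
The added README text says that only the first subword of each word carries a real label during training, while every other token is set to -100. The following is a minimal sketch of that idea in plain Python, not the author's training code or the `forgebox` implementation. The function names, the toy `word_ids` list and the label ids are all hypothetical. The `word_ids` format is assumed to match what Hugging Face fast tokenizers return from `word_ids()` (one word index per token, `None` for special tokens).

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's CrossEntropyLoss,
# so tokens labelled with it do not contribute to the loss.
IGNORE_INDEX = -100


def align_labels_first_subword(
    word_ids: List[Optional[int]], word_labels: List[int]
) -> List[int]:
    """Label only the first subword of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # special tokens such as [CLS] / [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # first subword of a new word keeps the word's label
            aligned.append(word_labels[word_id])
        else:
            # trailing subword of the same word: excluded from the loss
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


def first_subword_predictions(
    word_ids: List[Optional[int]], token_predictions: List[int]
) -> List[int]:
    """At inference, read one prediction per word from its first subword."""
    per_word = []
    previous = None
    for word_id, prediction in zip(word_ids, token_predictions):
        if word_id is not None and word_id != previous:
            per_word.append(prediction)
        previous = word_id
    return per_word


# Hypothetical tokenization of "aspirin reduces fever":
# [CLS] asp ##irin reduces fe ##ver [SEP]
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [1, 0, 2]  # made-up label ids, one per word

aligned = align_labels_first_subword(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, 0, 2, -100, -100]
```

Because the model was never trained on trailing subwords, their predictions are unconstrained. That is why the second helper reads only the first subword of each word, rather than merging predictions across all subwords.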