Tools to write a pandas DataFrame to a brat folder and to load brat data into a pandas DataFrame.
Ali BELLAMINE - contact@alibellamine.me
Latest version: 1.0 - 28/10/2020
pandasToBrat is a library to manage brat configuration and brat data from a Python interface.
Clone the current repository :
git clone https://gogs.alibellamine.me/alibell/pandasToBrat
Install dependencies with pip.
pip install -r requirements.txt
Then install the library :
pip install -e .
Instantiate the brat library with the folder path :
from pandasToBrat import pandasToBrat
bratData = pandasToBrat(FOLDER_PATH)
Configuration parameters are stored in a dictionary:
{
"entities":ENTITIES_CONFIGURATION_DATA,
"relations":RELATIONS_CONFIGURATION_DATA
}
The entities configuration data is a dictionary formatted as:
{
LABEL_NAME:{
LABEL_NAME_CHILD1:True,
LABEL_NAME_CHILD2:True,
LABEL_NAME_CHILD3:{
LABEL_NAME_CHILD3_CHILD1:True
}
}
}
Each entry is an entity. An entity is either set to True, meaning it has no children, or it has one or more children, in which case it contains a dictionary of those children.
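As an illustration of the nested format, here is a minimal sketch of an entities dictionary together with a small helper that walks it. The label names (`Person`, `Location`, and so on) and the `list_labels` helper are hypothetical and not part of pandasToBrat; they only show how `True` leaves and child dictionaries combine.

```python
# Hypothetical label names, chosen only for illustration.
entities = {
    "Person": True,              # True: the entity has no children
    "Location": {                # dictionary: the entity has children
        "City": True,
        "Country": {
            "Region": True,      # children can themselves have children
        },
    },
}


def list_labels(config, parent=None):
    """Return (label, parent) pairs for every entity in the nested dictionary."""
    pairs = []
    for label, children in config.items():
        pairs.append((label, parent))
        if isinstance(children, dict):
            # A dictionary value means the entity has children: recurse into it.
            pairs.extend(list_labels(children, parent=label))
    return pairs


labels = list_labels(entities)
```

Here `labels` lists every entity with its parent, with `None` as the parent of top-level entities.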
The relations configuration data is a dictionary formatted as:
{
RELATION_NAME:{
"args":[ENTITIES_NAME,...]
}
}
Each entry of the dictionary is a relation. The key is the relation name, and the value is a sub-dictionary containing an args entry. The args entry is a list of the entities that the relation links.
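The sketch below shows a relations dictionary in this format and a sanity check that every name in `args` is a defined entity. The entity and relation names and the two helper functions are hypothetical; pandasToBrat may or may not perform such a check itself.

```python
# Hypothetical entities and relations, chosen only for illustration.
entities = {"Person": True, "Organization": True, "Location": {"City": True}}
relations = {
    "WorksFor": {"args": ["Person", "Organization"]},
    "LocatedIn": {"args": ["Organization", "City"]},
}


def entity_names(config):
    """Collect every entity label from the nested entities dictionary."""
    names = set()
    for label, children in config.items():
        names.add(label)
        if isinstance(children, dict):
            names |= entity_names(children)
    return names


def undefined_args(relations, entities):
    """Map each relation name to the args that are not defined entities."""
    known = entity_names(entities)
    problems = {}
    for name, spec in relations.items():
        missing = [arg for arg in spec["args"] if arg not in known]
        if missing:
            problems[name] = missing
    return problems


problems = undefined_args(relations, entities)
```

An empty `problems` dictionary means every relation only refers to entities that exist in the entities configuration.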
You can read the current parameters using the dedicated method :
bratData.read_conf()
You can write parameters using the dedicated method :
bratData.write_conf(entities = ENTITIES_CONFIGURATION, relations = RELATIONS_CONFIGURATION)
ENTITIES_CONFIGURATION is a dictionary formatted as described in the "Entities configuration data" chapter.
RELATIONS_CONFIGURATION is a dictionary formatted as described in the "Relations configuration data" chapter.
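A write followed by a read could be wrapped as in the sketch below. It relies only on the `write_conf` and `read_conf` calls shown above; the `configure` helper and the label names are hypothetical, and the assumption that `read_conf` returns the configuration dictionary is inferred from this README rather than checked against the library.

```python
# Hypothetical configuration, chosen only for illustration.
entities = {"Person": True, "Organization": True}
relations = {"WorksFor": {"args": ["Person", "Organization"]}}


def configure(brat_data, entities=entities, relations=relations):
    """Write the configuration to the brat folder, then read it back."""
    brat_data.write_conf(entities=entities, relations=relations)
    # Assumption: read_conf returns the parameters that are now stored.
    return brat_data.read_conf()


# Usage, assuming pandasToBrat is installed and FOLDER_PATH is a brat folder:
# from pandasToBrat import pandasToBrat
# bratData = pandasToBrat(FOLDER_PATH)
# current_conf = configure(bratData)
```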
Text is stored in a Pandas Dataframe with two columns :
#### Read and write text
bratData.read_text()
bratData.write_text(text_id=TEXT_ID_SERIES, text = TEXT_SERIES, empty = EMPTY_PARAMETER, overWriteAnnotations = OVERWRITE_ANNOTATIONS_PARAMETERS)
The required parameters are text_id and text. Both are pandas Series of the same length: text_id contains the unique id of each document and text contains the corresponding document text.
The empty parameter is used to empty the current folder. If set to True, all text and annotation data is removed from the brat folder. The configuration is not erased.
The overWriteAnnotations parameter is used to overwrite the current annotation (.ann) file with an empty one. It is useful if you want to remove the existing annotations when you are modifying a text file.
This way, you can :
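The `write_text` call described above can be sketched as follows. The document ids, the texts, and the `push_text` helper are hypothetical; the keyword arguments are the ones named in this README, and the uniqueness and length checks are an extra precaution that the library itself may not require.

```python
import pandas as pd

# Hypothetical documents, chosen only for illustration.
documents = pd.DataFrame({
    "text_id": ["doc_1", "doc_2"],
    "text": ["Alice lives in Paris.", "Bob works for Acme."],
})


def push_text(brat_data, text_id, text, reset=False):
    """Send two aligned Series (ids and texts) to the brat folder."""
    if len(text_id) != len(text):
        raise ValueError("text_id and text must have the same length")
    if not text_id.is_unique:
        raise ValueError("each document needs a unique text_id")
    brat_data.write_text(
        text_id=text_id,
        text=text,
        empty=reset,                 # True would empty the folder first
        overWriteAnnotations=False,  # keep the existing .ann files
    )


# Usage, assuming bratData was created with pandasToBrat(FOLDER_PATH):
# push_text(bratData, documents["text_id"], documents["text"])
```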
Annotation data is stored in a dictionary:
{
"annotations":ANNOTATIONS_ANNOTATIONS,
"relations":RELATIONS_ANNOTATIONS
}
Annotations are words labeled with entities.
It is formatted as a Pandas DataFrame, containing the following columns :
Relations are links between annotations.
It is formatted as a Pandas DataFrame, containing the following columns :
bratData.read_annotation()
bratData.write_annotations(df, text_id, word, label, start, end, overwrite=OVERWRITE_OPTION)
The first parameter is the DataFrame containing the annotations. It should be formatted as described in the "Annotations format" subpart.
text_id, word, label, start and end are the names of the DataFrame columns that contain the corresponding data.
The overwrite option can be set to True to overwrite the existing annotations; otherwise, the DataFrame's data is added to the existing annotations.
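As a sketch of how such a DataFrame might be prepared, the example below computes character offsets for a few mentions and passes the column names to `write_annotations`. The texts, mentions, column names, and the `push_annotations` helper are hypothetical. It also assumes that `end` is the offset just after the last character of the word, as in brat's standoff format; check this against the library before relying on it.

```python
import pandas as pd

# Hypothetical text and mentions, chosen only for illustration.
texts = {"doc_1": "Alice lives in Paris."}
mentions = [("doc_1", "Alice", "Person"), ("doc_1", "Paris", "Location")]

rows = []
for text_id, word, label in mentions:
    start = texts[text_id].find(word)  # first occurrence only
    if start == -1:
        continue  # the word is not in the text: skip it
    rows.append({
        "text_id": text_id,
        "word": word,
        "label": label,
        "start": start,
        "end": start + len(word),  # assumption: end offset is exclusive
    })

annotations = pd.DataFrame(rows)


def push_annotations(brat_data, df, overwrite=False):
    """Pass the DataFrame and the names of its columns to write_annotations."""
    brat_data.write_annotations(
        df, "text_id", "word", "label", "start", "end", overwrite=overwrite
    )


# Usage, assuming bratData was created with pandasToBrat(FOLDER_PATH):
# push_annotations(bratData, annotations)
```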
bratData.write_relations(df, relation, overwrite=OVERWRITE_OPTION)
The first parameter is the DataFrame containing the relations. It should be formatted as described in the "Relations format" subpart.
text_id and relation are the names of the DataFrame columns that contain the corresponding data. The other columns should contain the type_id of the related entities, as output by the read_annotation method.
The overwrite option can be set to True to overwrite the existing annotations; otherwise, the DataFrame's data is added to the existing annotations.