Introductory exercise on Using ASReview Datatools
Introduction
The goal of this exercise is to learn how to use ASReview Datatools for processing a dataset.
When performing a systematic review, it is vital to properly prepare your dataset. A dataset needs to contain at least a ‘title’ and an ‘abstract’ column. Another column that comes in handy is, for example, ‘doi’. After you have created such a dataset, some preprocessing or, after screening the literature, postprocessing may be needed. For example, the dataset may contain many duplicates.
In this exercise, you will learn how to describe, compose and convert datasets. If writing scripts is new to you, don’t worry: you will be guided through the process in the command-line interface.
Enjoy!
Step 1: Create a folder structure
We recommend creating a designated folder to run this exercise in, to keep your project organised. Create a folder and give it a name, for example “Data_preprocessing_PTSD”. Within this folder, create two sub-folders called ‘data’ and ‘output’. Also use Notepad (on Windows) or TextEdit (on macOS) to create an empty text file called jobs.txt in the main folder. This is where you will save the scripts you run.
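If you prefer to create the folder structure from a script rather than by hand, the following is a minimal Python sketch. The function name create_project is hypothetical and not part of ASReview Datatools; it only uses the folder and file names mentioned above.

```python
from pathlib import Path


def create_project(root):
    """Create the main folder with 'data' and 'output' sub-folders and an empty jobs.txt."""
    root = Path(root)
    for sub_folder in ("data", "output"):
        # parents=True also creates the main folder if it does not exist yet
        (root / sub_folder).mkdir(parents=True, exist_ok=True)
    # touch() creates the file if it is missing and leaves existing content alone
    (root / "jobs.txt").touch(exist_ok=True)
    return root
```

Calling create_project("Data_preprocessing_PTSD") from the location where you want the project to live produces the same layout as the manual steps.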
Step 2: Choose a dataset
Before we get started, we need a dataset that is already labeled. You can use a dataset from the SYNERGY dataset via the synergy-dataset Python package.
- Open the command prompt by typing ‘cmd’ (on Windows) or ‘terminal’ (on macOS) in your computer’s search bar.
- Navigate to the folder you created in Step 1.
You can navigate to your folder with:
cd [file_path]
On macOS, you can also: 1. Right-click the folder you created. 2. Click “New Terminal at Folder”. The dataset will then automatically be put into the designated data folder.
- Install the synergy-dataset Python package with:
pip install synergy-dataset
- Build the dataset.
To download and build the SYNERGY dataset, run the following command in the command line:
python -m synergy_dataset get -o data
You can also choose to use your own dataset(s). For the purpose of this exercise, make sure that the dataset contains at least a ‘title’, an ‘abstract’ and a ‘doi’ column.
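If you bring your own CSV file, a quick check of the header can confirm that the three columns are present. The sketch below is one way to do this with the Python standard library. The function name missing_columns is hypothetical, and comparing column names case-insensitively is a choice made for this sketch, not a documented rule of ASReview.

```python
import csv

REQUIRED_COLUMNS = {"title", "abstract", "doi"}


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return a sorted list of required column names absent from the CSV header."""
    # utf-8-sig strips a byte-order mark that some spreadsheet exports add
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    # Assumption of this sketch: column names are compared case-insensitively
    present = {name.strip().lower() for name in header}
    return sorted(set(required) - present)
```

An empty list means all three columns were found; otherwise the list names the columns you still need to add.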
Step 3: Write and run the script
For the next steps, you first need to install ASReview Datatools. You can do so with:
pip install asreview-datatools
Note: if you installed ASReview Datatools before, make sure you have the latest version. You can upgrade with:
pip install --upgrade asreview-datatools
Before you write your script, here is a general explanation of how to get started with ASReview Datatools. The parts between square brackets need to be filled in by you.
asreview data [name_of_tool]
where [name_of_tool] is the name of the tool you want to use (describe, convert, dedup, vstack, or compose), followed by positional arguments and optional arguments.
Each tool has its own help description which is available with:
asreview data [name_of_tool] -h
Tool 1: Describe
With the describe tool you get information about the properties of your dataset. For example, it shows you how many total, relevant, irrelevant and unlabeled records your dataset contains.
You can describe the content of a dataset with:
asreview data describe data/[name_of_your_data.csv]
Note: ‘data/’ is in front of the name of your dataset, since the datasets are stored in the ‘data’ folder you created in Step 1 and not in the directory you are currently working from.
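To get a feeling for the kind of counts the describe tool reports, here is a rough Python sketch that tallies labels in a CSV file. It is an illustration only and not how Datatools is implemented. The function name count_labels is hypothetical, and the label column name label_included and the encoding 1 = relevant, 0 = irrelevant, empty = unlabeled are assumptions; adjust them to your own dataset.

```python
import csv


def count_labels(csv_path, label_column="label_included"):
    """Count total, relevant, irrelevant and unlabeled records in a CSV file."""
    counts = {"total": 0, "relevant": 0, "irrelevant": 0, "unlabeled": 0}
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            counts["total"] += 1
            # A missing column or an empty cell is treated as unlabeled
            value = (row.get(label_column) or "").strip()
            if value in ("1", "1.0"):
                counts["relevant"] += 1
            elif value in ("0", "0.0"):
                counts["irrelevant"] += 1
            else:
                counts["unlabeled"] += 1
    return counts
```

Comparing these numbers with the output of the describe command is a simple sanity check that you are looking at the file you think you are.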
Tool 2: Compose
Next, you are going to compose a dataset out of two datasets: a labeled and an unlabeled dataset. The command starts similarly to that of the Describe tool (asreview data compose), but now the output path must always be specified (output/[name_of_your_output_file.csv]) and you need to assign a corresponding property to each dataset you want to merge. Assign the argument -l to the labeled dataset and -u to the unlabeled dataset.
Furthermore, in case of conflicting labels, you need to determine which label to keep (-c). If you want to keep one of the labels, you can do so with the keep_one option. By default, the hierarchy for determining which label to keep is: relevant, irrelevant, unlabeled.
See the Datatools README for other possible arguments.
You can compose a dataset with:
asreview data compose output/[name_of_your_output_file.csv] -l data/[name_of_your_labeled_data.csv] -u data/[name_of_your_unlabeled_data.csv] -c keep_one
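The default hierarchy for conflicting labels (relevant, irrelevant, unlabeled) can be illustrated with a small Python sketch. This is not the Datatools implementation; the function name resolve_conflict and the use of plain strings for labels are hypothetical and only serve to show the ordering idea.

```python
# Default order described in this exercise: earlier entries win
HIERARCHY = ("relevant", "irrelevant", "unlabeled")


def resolve_conflict(labels, hierarchy=HIERARCHY):
    """Return the label that ranks highest in the hierarchy among the given labels."""
    for label in hierarchy:
        if label in labels:
            return label
    raise ValueError(f"no known label in {labels!r}")
```

For example, a record that is unlabeled in one dataset and relevant in the other would end up as relevant under this ordering.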
Tool 3: Convert
Now you are going to convert the format of your dataset. You are going to convert a RIS dataset into a CSV dataset.
Find a RIS file on your laptop, put it in the data folder you created in Step 1 and try the following command:
asreview data convert data/[name_of_your_data].ris output/[name_of_your_output_file].csv
If you don’t have a RIS file, you can practice the command by converting a CSV file from the SYNERGY dataset into an Excel file:
asreview data convert data/[name_of_your_data].csv output/[name_of_your_output_file].xlsx
What else?
We introduced some of the datatools, but there are more tools available in the ASReview Datatools package.
If you like the functionality of Datatools, don’t forget to give it a star on GitHub!