You can find out a lot more about Open Refine at http://openrefine.org and check out some great introductory videos. There is also an Open Refine Google Plus community https://plus.google.com/communities/117280693504889048168 where you can find a lot of help and a lot of folks from the life sciences are members. As with other programs of this type, Open Refine libraries are available too, where you can find a script you need and copy it into your Refine instance to run it on your dataset.
This demo is based on one found at [enipedia] (http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial). See the original page for more in-depth instruction. C.B.Davis, A. Chmieliauskas, G.P.J. Dijkema, I. Nikolic (2014), Enipedia, http://enipedia.tudelft.nl, Energy & Industry group, Faculty of Technology, Policy and Management, TU Delft, Delft, The Netherlands.
The example used shows how we can use Wikipedia data to see if there is a relationship between the number of students at a university and the size of the university’s endowment.
The data contains quite a few issues, and this tutorial shows how to do things like:
Download the university data and put it in directory “data” within your working directory.
Start the program. (Double-click on the google-refine.exe file. Java services will start on your machine, and Refine will open in your Firefox browser).
Note the file types Open Refine handles: TSV, CSF, *SV, Excel (.xls .xlsx), JSON, XML, RDF as XML, Google Data documents. Support for other formats can be added with Google Refine extensions.
Once Refine is open, you’ll be asked if you want to Create, Open, or Import a Project.
The window shows a preview of the data (showing the 5, 10, 15, etc… first rows, you can click to choose). On the upper left of the spreadsheet, the total number of rows is shown- many more than we could handle by scrolling around in Excel.
Canada B1P 6L2.
Somehow, the postal code must have gotten into this column. Click “edit” to open the window and edit all 576 entries at once.This is one of the most magical bits of Refine, the moment you realize what you’ve been missing. Refine has several clustering algorithms built in. Experiment with them, and follow the link inside Refine, to learn more about these algorithms and how they work. https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth * In the country Text Facet we created in the step above, click the Cluster button. * In the resulting pop-up window, you can change the algorithm method, and keying function. Try different combinations to see the difference. * For example, with this dataset, the nearest neighbor method with the PPM keying function shows the power of clustering the best. * After corrections are made in this window, you can either Merge and Close the Cluster pop-up, or Merge and Re-cluster.
numstudents
columnnumstudents
column, click Edit cells -> TransformThis allows us to type an expression in one of the languages listed in the pull-down menu. We will use the Google Refine Expression Language (GREL). Some examples of this are shown below: * Enter value.replace("+", "")
to replace the character +
with a blank. Click OK. If you find multiple things that need to be replaced, you don’t have to keep clicking Edit cells -> Transform for every single issue. You can chain these commands together to fix several issues at once. * Enter value.replace("~", "").replace(",","").replace(" -", "")
By now, many cell values are looking more numeric, but are still interpreted as text. Fix that with a “common transform.” * Under number of students
Click Edit Cells > Common Transform > To Number.
If you see entries with strange symbols like “Lumi%C3%A8re University Lyon 2” you can fix this too. * In the university
column, Transform with this command: value.unescape('url')
It’s possible to have multiple facets in use at once. When you do this, each additional facet makes a sub-selection of the data selected by the previous facet. If you find that the number of rows you have selected and are working with is smaller than expected, then check to see if you still have facets in use which are not needed any more.
Keeping the numStudents
numeric facet open, Use the check boxes to display only the non-numeric entries. Add another text facet to see the listing of all these non-numeric entries and edit those. Remove this facet.
endowment
. Select only the non-numeric values, as was done for the number of students.Create a text facet of endowment
for the remaining rows.
Already we see issues like “US$1.3 billion” and “US $186 million”
Assuming that everything is in $ (a somewhat bogus assumption), we can clean up the data similarly to how we did it before. * Click on the endowment column -> Edit cells -> Transform value.replace("US $","").replace("US$", "")
Click on the endowment column again, and create a custom text facet to locate all the rows with the word “million” in them: Facet -> Custom text facet value.contains("million")
Under this facet, select the rows that are listed as “true.”
Then Edit cells -> Transform. It’s not advisable to just replace “million” by “000000” since you have some values like “$13.8 million”, which would be converted to “$13.8000000”. It’s better to first remove “million” from the text, convert the remaining text to a number, and then multiply this by 1000000: toNumber(value.replace(" million", ""))*1000000
The term “billion” is in the values as well, so remove previous facet for endowment, and create a new one for billion, and repeat process described above.
After most of this has been cleaned up, select the non-numeric values, and delete them, just as was done for the numStudents.
For this exercise, we are only interested in seeing how the number of students relate to the other variables. So we remove all the rows that do not have numeric values for the number of students. * Use a numeric facet again on numStudents to select only the non-numeric and blank values. Then do All -> Edit rows -> Remove all matching rows