Tablesaw is like an open-source Java power tool for data manipulation with hooks for interactive visualization, analytics, and machine learning. Come learn all about it!
Data science is one of the hottest areas in computing today. Most people learn data science using either Python or R. Both are excellent languages for crunching and analyzing data.
But many Java developers feel left behind. There are great Java libraries for machine learning, especially for jobs that require distributed computing, but there’s no simple path for Java developers to learn and apply data science. By minimizing the number of things you need to learn, the open-source Tablesaw provides a gateway.
Think of Tablesaw as a Java power tool for data manipulation with hooks for interactive visualization, analytics, and machine learning. Used interactively or embedded in an application, its focus is to make data science as easy in Java as in R or Python. If you’ve done some data science, you may think of it as a data frame.
Tablesaw is easy to learn, but it’s not a toy. Tables can be large — up to two billion rows. Performance is brisk — on my laptop, I can retrieve 500 records from a table of half of a billion rows in two milliseconds. It is open-sourced under a business-friendly Apache 2 license.
What Makes Tablesaw Beginner-Friendly?
- It builds on what you know: For Java developers who want to do data science, it’s a huge advantage to not have to also learn a new language.
- It’s easy to get started: Simply add Tablesaw as a Maven dependency for your project and you’re up and running. We’ll walk through an example below to show you how.
- It’s not distributed: Unlike many machine learning libraries, Tablesaw is not a distributed system. This removes enormous complexity and makes machine learning accessible to those without deep engineering experience or support.
- The code is clear: There’s a fluent API so you’ll understand your code the next time you read it.
- It provides fast feedback: Tablesaw is designed to be used interactively for exploratory analysis.
Introductory Example
Here, I’ll show you some of Tablesaw’s basic data manipulation features. Future posts will address visualization, machine learning, the Kotlin API and REPL, and the Tablesaw architecture. The code for this example can be found here.
Up and Running
To begin, create a Java project and add the Tablesaw core library as a Maven dependency. The current dependency is:
<!-- https://mvnrepository.com/artifact/tech.tablesaw/tablesaw-core -->
<dependency>
<groupId>tech.tablesaw</groupId>
<artifactId>tablesaw-core</artifactId>
<version>0.8.0</version>
</dependency>
Next, create a class with a main method like so:
public class Foo {
    public static void main(String[] args) {
        // rest of code goes here
    }
}
The rest of our code will go in this method. Now add a table. Tablesaw can load data from relational databases, but we will create our table from a flat file:
Table table1 = Table.createFromCsv("data/BushApproval.csv");
Table objects can provide a lot of information:
- table1.name(); returns BushApproval.csv since it named the table after the file.
- table1.shape(); returns 323 rows X 3 cols.
- table1.structure(); returns a table of column metadata:
Index  Column Name  Column Type
0      date         LOCAL_DATE
1      approval     SHORT_INT
2      who          CATEGORY
Note that we’ve inferred the column types from the data.
- table1.first(3); returns a new table containing only the first three rows.
BushApproval.csv
date approval who
2004-02-04 53 fox
2004-01-21 53 fox
2004-01-07 58 fox
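To make the idea of inferring column types concrete, here is a rough plain-Java sketch. It is not Tablesaw's actual inference code: the class name, the enum, and the rule of trying date first, then short integer, then falling back to category are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.List;

class ColumnTypeSketch {
    // Hypothetical labels, borrowed from the structure() output above.
    enum Type { LOCAL_DATE, SHORT_INT, CATEGORY }

    // Picks the first type, in a fixed order, that every value parses as.
    // (An empty column trivially matches the first type tried.)
    static Type infer(List<String> values) {
        if (values.stream().allMatch(ColumnTypeSketch::isDate)) return Type.LOCAL_DATE;
        if (values.stream().allMatch(ColumnTypeSketch::isShort)) return Type.SHORT_INT;
        return Type.CATEGORY; // anything else is treated as a text category
    }

    // True if the text is an ISO date such as 2004-02-04.
    static boolean isDate(String s) {
        try {
            LocalDate.parse(s);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // True if the text fits in a Java short.
    static boolean isShort(String s) {
        try {
            Short.parseShort(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(infer(List.of("2004-02-04", "2004-01-21"))); // LOCAL_DATE
        System.out.println(infer(List.of("53", "58")));                 // SHORT_INT
        System.out.println(infer(List.of("fox", "gallup")));            // CATEGORY
    }
}
```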
Inevitably, we want to work with the data, and for that, we need columns. Each has a data type, and usually, you’ll want it by that type and not as a generic column because typed columns have more power. For example, to get the approval column, you can use:
ShortColumn approval = table1.shortColumn("approval");
Each column sub-type supports numerous operations. As a rule, operations on a column are applied to every element without explicit loops. Some call these “vector operations.” For example, operations like count(), min(), and contains() produce a single value for a column of data: approval.min();.
Other operations return a new column. The method dayOfYear() applied to a DateColumn returns a short integer column with each element the day of the year from 1 to 366.
Some column-returning operations take a scalar value as an input: cd.plusDays(4);.
This adds four days to every element. Others take a second column as an argument. These process the two columns in order, applying each integer value from the argument to the corresponding element in the receiver.
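If it helps to see the three flavors of column operation side by side, here is a small plain-Java sketch using lists as stand-in columns. It only illustrates the idea; the class and method bodies are my own and say nothing about how Tablesaw implements its columns.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ColumnOpsSketch {
    // Reduces a whole column to a single value, in the spirit of approval.min().
    static int min(List<Integer> column) {
        return column.stream().mapToInt(Integer::intValue).min().getAsInt();
    }

    // Maps every element to a new value, in the spirit of dayOfYear().
    static List<Integer> dayOfYear(List<LocalDate> dates) {
        return dates.stream().map(LocalDate::getDayOfYear).collect(Collectors.toList());
    }

    // Applies one scalar to every element, in the spirit of plusDays(4).
    static List<LocalDate> plusDays(List<LocalDate> dates, long days) {
        return dates.stream().map(d -> d.plusDays(days)).collect(Collectors.toList());
    }

    // Walks two columns in step: days.get(i) is added to dates.get(i).
    static List<LocalDate> plusDays(List<LocalDate> dates, List<Integer> days) {
        return IntStream.range(0, dates.size())
                .mapToObj(i -> dates.get(i).plusDays(days.get(i)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The three dates and approval values shown by first(3) above.
        List<LocalDate> dates = List.of(
                LocalDate.parse("2004-02-04"),
                LocalDate.parse("2004-01-21"),
                LocalDate.parse("2004-01-07"));
        System.out.println(min(List.of(53, 53, 58)));            // 53
        System.out.println(dayOfYear(dates));                    // [35, 21, 7]
        System.out.println(plusDays(dates, 4));                  // [2004-02-08, 2004-01-25, 2004-01-11]
        System.out.println(plusDays(dates, List.of(1, 2, 3)));   // [2004-02-05, 2004-01-23, 2004-01-10]
    }
}
```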
Boolean operations like isMonday() don’t return a boolean column directly, but a Selection instead. Selections can be used to filter tables by the values in their columns, so we’ll see them again:
Selection selection = table1.dateColumn("date").isMonday();
You can, of course, get a boolean column if you want it. You simply pass the Selection and the original column length to a BooleanColumn constructor, along with a name for the new column:
BooleanColumn mondays = new BooleanColumn("mondays", selection, table1.rowCount());
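Why does the constructor need the column length? One way to picture it, assuming a selection is just a set of matching row indexes, is the sketch below. I use java.util.BitSet as a stand-in; Tablesaw's Selection may well be built differently, and the names here are made up.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class SelectionSketch {
    // Records the indexes of the rows whose date falls on a Monday.
    static BitSet isMonday(List<LocalDate> dates) {
        BitSet selected = new BitSet(dates.size());
        for (int row = 0; row < dates.size(); row++) {
            if (dates.get(row).getDayOfWeek() == DayOfWeek.MONDAY) {
                selected.set(row);
            }
        }
        return selected;
    }

    // Expands the selection into one boolean per row. The length must be passed in
    // because the selection knows which rows matched, not how many rows exist.
    static boolean[] toBooleans(BitSet selection, int length) {
        boolean[] result = new boolean[length];
        for (int row = selection.nextSetBit(0);
             row >= 0 && row < length;
             row = selection.nextSetBit(row + 1)) {
            result[row] = true;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up dates: a Monday, a Wednesday, and another Monday.
        List<LocalDate> dates = List.of(
                LocalDate.parse("2004-01-05"),
                LocalDate.parse("2004-01-07"),
                LocalDate.parse("2004-01-12"));
        BitSet mondays = isMonday(dates);
        System.out.println(mondays);                                            // {0, 2}
        System.out.println(Arrays.toString(toBooleans(mondays, dates.size()))); // [true, false, true]
    }
}
```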
There are hundreds of methods available for column manipulation, but let’s turn now to tables. Operations exist for creating, describing, modifying, sorting, querying, and summarizing tables. Here we’ll cover sorting, querying, and summarizing.
Queries
For queries, we need a helper class called QueryHelper; it’s best to statically import its methods wherever you use them. The table’s selectWhere() method does the actual querying.
Usually, you will pass selectWhere() a Filter, which can easily be created inline:
Table highApproval = table1.selectWhere(column("approval").isGreaterThan(80));
The segment column("approval").isGreaterThan(80) creates the filter.
Remember Selection objects from columns? You can also use those as arguments to selectWhere(), allowing you to use column-specific logic to query a table.
Table Q3 = table1.selectWhere(table1.dateColumn("date").isInQ3());
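Conceptually, a query like this boils down to testing each row and keeping the ones that pass. Here is a minimal plain-Java sketch of that idea, with lists standing in for columns. The helper names (selectWhere, take) and the sample values are mine, not Tablesaw's.

```java
import java.util.List;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class SelectWhereSketch {
    // Returns the indexes of the rows that the filter accepts.
    static List<Integer> selectWhere(int rowCount, IntPredicate filter) {
        return IntStream.range(0, rowCount).filter(filter).boxed().collect(Collectors.toList());
    }

    // Copies the selected rows of one column into a new list.
    static <T> List<T> take(List<T> column, List<Integer> rows) {
        return rows.stream().map(column::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data, one list per column.
        List<Integer> approval = List.of(53, 88, 58, 90);
        List<String> who = List.of("fox", "gallup", "fox", "zogby");

        // Keep the rows where approval is greater than 80.
        List<Integer> rows = selectWhere(approval.size(), row -> approval.get(row) > 80);
        System.out.println(take(who, rows));      // [gallup, zogby]
        System.out.println(take(approval, rows)); // [88, 90]
    }
}
```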
Sorting
There are a number of ways to sort a table, but the easiest is sortOn(). This code gets it done:
table1.sortOn("who", "approval");
“who” and “approval” are column names, and the sort is ascending. To sort in descending order, use sortDescendingOn().
To sort in mixed order, you can prepend a minus sign to a column name to indicate a descending sort on that column. For example, table1.sortOn("who", "-approval"); sorts on “who” in ascending order, and on “approval” in descending order.
Finally, you can write your own sort logic as an IntComparator, giving you full control over the ordering.
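The minus-sign convention is easy to picture as a chain of comparators. The sketch below is one way to do it in plain Java, assuming a tiny Row class with just the two columns from the example. It is an illustration only, not Tablesaw's sorting code, and it uses java.util.Comparator rather than IntComparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class SortOnSketch {
    // A tiny stand-in for a table row with the two columns used in the text.
    static final class Row {
        final String who;
        final int approval;

        Row(String who, int approval) {
            this.who = who;
            this.approval = approval;
        }

        @Override
        public String toString() {
            return who + " " + approval;
        }
    }

    // One ascending comparator per column name.
    static final Map<String, Comparator<Row>> COLUMNS = Map.of(
            "who", Comparator.comparing((Row r) -> r.who),
            "approval", Comparator.comparingInt((Row r) -> r.approval));

    // Chains the comparators in key order; a leading '-' reverses that one column.
    static Comparator<Row> sortOn(String... keys) {
        Comparator<Row> combined = (a, b) -> 0; // with no keys, every row ties
        for (String key : keys) {
            boolean descending = key.startsWith("-");
            String name = descending ? key.substring(1) : key;
            Comparator<Row> column = COLUMNS.get(name);
            if (column == null) {
                throw new IllegalArgumentException("Unknown column: " + name);
            }
            combined = combined.thenComparing(descending ? column.reversed() : column);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Made-up rows.
        List<Row> rows = new ArrayList<>(List.of(
                new Row("gallup", 49), new Row("fox", 53),
                new Row("gallup", 60), new Row("fox", 58)));
        rows.sort(sortOn("who", "-approval"));
        System.out.println(rows); // [fox 58, fox 53, gallup 60, gallup 49]
    }
}
```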
Summarizing
Now, we’ll cover summarization techniques like pivot tables (cross tabs). If you want to simply calculate group statistics for a table, the summarize() method works nicely. There are a large number of statistics available, including range, as shown below.
Table summary = table1.summarize("approval", range).by("who");
BushApproval.csv summary
who       Range [approval]
fox       42.0
gallup    41.0
newsweek  40.0
time.cnn  37.0
upenn     10.0
zogby     37.0
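For readers who want to see what a grouped range amounts to, here is a short plain-Java sketch: group the values by key, then take max minus min within each group. The class name and sample values are made up, and this is not how Tablesaw's summarize() is implemented.

```java
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RangeByGroupSketch {
    // Groups values by the key in the same row and returns max - min per group.
    static Map<String, Integer> rangeBy(List<String> keys, List<Integer> values) {
        Map<String, IntSummaryStatistics> stats = new TreeMap<>();
        for (int row = 0; row < keys.size(); row++) {
            stats.computeIfAbsent(keys.get(row), k -> new IntSummaryStatistics())
                 .accept(values.get(row));
        }
        Map<String, Integer> ranges = new TreeMap<>();
        stats.forEach((key, s) -> ranges.put(key, s.getMax() - s.getMin()));
        return ranges;
    }

    public static void main(String[] args) {
        // Made-up sample data, one list per column.
        List<String> who = List.of("fox", "fox", "gallup", "gallup", "gallup");
        List<Integer> approval = List.of(53, 58, 49, 60, 90);
        System.out.println(rangeBy(who, approval)); // {fox=5, gallup=41}
    }
}
```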
Cross tabs are useful for producing counts or frequencies of the number of observations in a combination of categories. First, let’s get two categorical columns:
CategoryColumn who = table1.categoryColumn("who");
CategoryColumn month = table1.dateColumn("date").month();
table1.addColumn(month);
Now, we can calculate the raw counts for each combination:
Table xtab = CrossTab.xTabCount(table1, month, who);
Crosstab Counts: date month x who
           fox  gallup  newsweek  time.cnn  upenn  zogby  total
APRIL        6      10         3         1      0      3     23
AUGUST       3       8         2         1      0      2     16
DECEMBER     4       9         4         3      2      5     27
FEBRUARY     7       9         4         4      1      4     29
JANUARY      7      13         6         3      5      8     42
JULY         6       9         4         3      0      4     26
JUNE         6      11         1         1      0      4     23
MARCH        5      12         4         3      0      6     30
MAY          4       9         5         3      0      1     22
NOVEMBER     4       9         6         3      1      1     24
OCTOBER      7      10         8         2      1      3     31
SEPTEMBER    5      10         8         3      0      4     30
Total       64     119        55        30     10     45    323
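Under the hood, a count crosstab is just a tally of how often each pair of categories shows up together. Here is a bare-bones plain-Java sketch of that tally using nested maps. It borrows the xTabCount name for readability but is my own illustration, not the library's code, and it leaves out the totals row and column.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CrossTabSketch {
    // Counts how often each (row label, column label) pair occurs across the rows.
    static Map<String, Map<String, Integer>> xTabCount(List<String> rowLabels, List<String> colLabels) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (int i = 0; i < rowLabels.size(); i++) {
            counts.computeIfAbsent(rowLabels.get(i), k -> new TreeMap<>())
                  .merge(colLabels.get(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data: one month and one pollster per observation.
        List<String> month = List.of("JANUARY", "JANUARY", "FEBRUARY", "JANUARY");
        List<String> who = List.of("fox", "fox", "fox", "gallup");
        System.out.println(xTabCount(month, who)); // {FEBRUARY={fox=1}, JANUARY={fox=2, gallup=1}}
    }
}
```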
If you prefer to see the relative frequency for each combination, pass your crosstab table to the tablePercents() method:
CrossTab.tablePercents(xtab);
Crosstab Table Proportions:
           fox           gallup        newsweek      time.cnn      upenn         zogby         total
APRIL      0.01857585    0.030959751   0.009287925   0.0030959751  0.0           0.009287925   0.071207434
AUGUST     0.009287925   0.024767801   0.0061919503  0.0030959751  0.0           0.0061919503  0.049535602
DECEMBER   0.012383901   0.027863776   0.012383901   0.009287925   0.0061919503  0.015479876   0.083591335
FEBRUARY   0.021671826   0.027863776   0.012383901   0.012383901   0.0030959751  0.012383901   0.08978328
JANUARY    0.021671826   0.04024768    0.01857585    0.009287925   0.015479876   0.024767801   0.13003096
JULY       0.01857585    0.027863776   0.012383901   0.009287925   0.0           0.012383901   0.08049536
JUNE       0.01857585    0.03405573    0.0030959751  0.0030959751  0.0           0.012383901   0.071207434
MARCH      0.015479876   0.0371517     0.012383901   0.009287925   0.0           0.01857585    0.09287926
MAY        0.012383901   0.027863776   0.015479876   0.009287925   0.0           0.0030959751  0.06811146
NOVEMBER   0.012383901   0.027863776   0.01857585    0.009287925   0.0030959751  0.0030959751  0.0743034
OCTOBER    0.021671826   0.030959751   0.024767801   0.0061919503  0.0030959751  0.009287925   0.095975235
SEPTEMBER  0.015479876   0.030959751   0.024767801   0.009287925   0.0           0.012383901   0.09287926
Total      0.19814241    0.36842105    0.17027864    0.09287926    0.030959751   0.13931888    1.0
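Each proportion above is simply a cell count divided by the grand total of 323 observations; for example, APRIL/fox is 6 / 323, or about 0.0186. A minimal plain-Java sketch of that calculation follows. It reuses the tablePercents name for readability, but the implementation and the sample counts are my own.

```java
import java.util.Arrays;

class TablePercentsSketch {
    // Divides every cell by the grand total of all cells.
    // (If the grand total is zero, the results are NaN.)
    static double[][] tablePercents(int[][] counts) {
        long total = 0;
        for (int[] row : counts) {
            for (int cell : row) {
                total += cell;
            }
        }
        double[][] result = new double[counts.length][];
        for (int r = 0; r < counts.length; r++) {
            result[r] = new double[counts[r].length];
            for (int c = 0; c < counts[r].length; c++) {
                result[r][c] = (double) counts[r][c] / total;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up 2 x 2 table of counts with a grand total of 4.
        int[][] counts = { {1, 1}, {2, 0} };
        System.out.println(Arrays.deepToString(tablePercents(counts))); // [[0.25, 0.25], [0.5, 0.0]]
    }
}
```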
What’s Next?
I hope this has encouraged some of you to give Tablesaw a try. As I mentioned, future posts will cover visualization, machine learning, and more. You can find the code on GitHub at https://github.com/jtablesaw/tablesaw.
Source: https://dzone.com/articles/learn-data-science-with-java-and-tablesaw