#processing @Microsoft #office #Excel files with @TheASF POI (part II)

...
     Apache POI's abstract OPCPackage class represents a container that can store multiple data objects, and it is central to processing Excel (*.xlsx) files.  To open such a file from an InputStream, we only need the class's static open method.  We can then read the file through the XSSFWorkbook class, a high-level representation of a SpreadsheetML workbook.  From an XSSFWorkbook, we can get any existing XSSFSheet, break that sheet down into rows, and analyze the cell data within each row.  In general, given certain assumptions about the format of the Excel document, we can extract data from a cell as text and feed it into any number of business processes.
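     To make the "container" idea a bit more concrete: an .xlsx file is a ZIP archive laid out according to the Open Packaging Conventions, and each entry in the archive is one part of the package.  The sketch below uses only java.util.zip, not POI.  The class name OpcPartsSketch, its two methods, and the placeholder entry contents are assumptions made for illustration; the sample archive only mimics the part names of a workbook and is not a valid Excel file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; it is not part of Apache POI.
class OpcPartsSketch {

    // Builds a tiny in-memory ZIP whose entry names mimic the parts of an .xlsx package.
    // The entry bodies are placeholders, so this is NOT a valid workbook.
    static byte[] buildSamplePackage() {
        String[] partNames = {
            "[Content_Types].xml",
            "xl/workbook.xml",
            "xl/worksheets/sheet1.xml"
        };
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : partNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        return bytes.toByteArray();
    }

    // Lists the part (entry) names in the order they are stored in the archive.
    static List<String> listPartNames(byte[] packageBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : listPartNames(buildSamplePackage())) {
            System.out.println(name);
        }
    }
}
```

     OPCPackage spares us this kind of low-level bookkeeping, which is why the rest of the post works with it and XSSFWorkbook rather than with raw ZIP entries.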

     The Java function excerpt below assumes the Excel (*.xlsx) file arrives as an InputStream.  It opens the package, builds the workbook, and returns an iterator over the rows of the first sheet.

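    @Override
    public Iterator<Row> apply(InputStream inputStream) {

        // OPCPackage is Closeable, so try-with-resources releases it for us
        try (OPCPackage pkg = OPCPackage.open(inputStream)) {

            Optional<XSSFWorkbook> workbook = Optional.empty();

            try {
                // build the workbook from the opened package
                workbook = Optional.of(new XSSFWorkbook(pkg));
            } catch (IOException ioe) {
                ...
            }

            // nothing to iterate over if the workbook could not be built
            if (!workbook.isPresent()) {
                return Iterators.emptyIterator();
            }

            // first sheet
            XSSFSheet sheet = workbook.get().getSheetAt(0);

            // return rows
            return sheet.rowIterator();

        } catch (InvalidFormatException ife) {
            ...
        } catch (IOException ioe) {
            ...
        }
        return Iterators.emptyIterator();
    }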
...

References:
  1. http://poi.apache.org/
