Water Cost Index - Calculation Agent

Populating the Normalized Production Cost Statement from unstructured public data involves multiple challenges, including analyzing unstructured text data, identifying cost variables and subsidies, accounting for filing discrepancies, and addressing missing information. Using the IBM InfoSphere BigInsights platform, researchers at IBM Research - Almaden have developed a Calculation Agent that leverages advanced analytics capabilities (unstructured text analytics, entity integration, and statistical analysis), as outlined in the architecture below.



Computing the Rickards Real Cost Water Index™ requires addressing the following challenges:

Analyzing Unstructured Public Data: A key source of financial information is the audited financial reports that individual agencies publish periodically for the benefit of their shareholders, financiers, or the general public. These reports are usually text documents (in PDF or HTML format), often more than 100 pages long, with the financial information spread across various sections. Because they are written for human consumption, processing them programmatically raises multiple challenges, such as accurately identifying the concepts of interest (for example, financial tables or footnotes that mention cost variables) and handling the various types of text documents.

Isolating cost variables and identifying both direct and indirect cost subsidies: The cost variables that contribute to the true cost of production are reported in various parts of the financial statements depending on whether they are "explicitly reported costs" or "hidden costs". For instance, operating expenses are typically reported as expenditure in the Income Statement, while government grants may be reported as revenue in the Income Statement. A further breakdown of individual costs, such as operating expenses, may appear in the textual notes associated with the financial statements. It is therefore important to be able to extract the individual cost variables from various parts of the reports and to combine them from multiple filings over time to create a complete temporal view for each producer.

Accounting for filing discrepancies: Over time, agencies occasionally change their reporting formats or the way in which they break down specific financial details. A key requirement is the ability to identify these discrepancies while combining data from multiple filings, and to resolve them either programmatically or through intelligent alerting of a data steward.
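As an illustration of how a change in reporting format might be reconciled, the following minimal Python sketch maps the labels found in a filing onto canonical cost variables and sets aside any label it does not recognize for review by a data steward. The label table, the variable names, and the function normalize_line_items are hypothetical and are not taken from the Calculation Agent itself.

```python
# Hypothetical mapping from labels seen in filings to canonical cost variables.
# A real deployment would need a much larger, curated table.
CANONICAL_LABELS = {
    "operating expenses": "operating_expenses",
    "operations and maintenance": "operating_expenses",
    "government grants": "government_grants",
    "capital grants": "government_grants",
}


def normalize_line_items(line_items):
    """Map reported labels to canonical variables.

    Returns (resolved, needs_review): resolved sums the values per canonical
    variable; needs_review lists labels that could not be mapped and should
    be shown to a data steward.
    """
    resolved = {}
    needs_review = []
    for label, value in line_items.items():
        # Lowercase and collapse whitespace so formatting changes do not matter.
        key = " ".join(label.lower().split())
        canonical = CANONICAL_LABELS.get(key)
        if canonical is None:
            needs_review.append(label)
        else:
            resolved[canonical] = resolved.get(canonical, 0.0) + value
    return resolved, needs_review
```

Returning the unmapped labels instead of dropping them keeps the programmatic path and the data-steward path separate, which mirrors the two resolution options described above.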

Addressing missing information: Agencies report their financial information periodically, usually quarterly or annually, and even this information is typically available only after a lag of a few months. Therefore, the last available financial report could be over a year old in many cases. Additionally, data reported in financial statements may be incomplete (e.g., the cost of raw water may not be reported, and must be estimated). In order to have a complete and up-to-date Water Cost Index for all regions, it is imperative to address this missing data problem by estimating the missing values using advanced statistical techniques.



Calculation Agent Architecture

PDF Document Processing creates a workflow to convert financial filings from PDF format to HTML using state-of-the-art PDF document processing capabilities.

Text Analytics is used to extract values of cost variables from the unstructured text in financial filings for each producer.
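To make the extraction step concrete, here is a deliberately simple Python sketch that pulls the amount following a line-item label out of plain text using a regular expression. It is a stand-in only: the text analytics used by the Calculation Agent are not described here, and the pattern, the function extract_amount, and the treatment of parenthesized amounts as negative values are all assumptions made for illustration.

```python
import re

# Assumed amount format: optional "$", optional parentheses, digits with
# thousands separators, optional decimals, e.g. "$ 1,234,567" or "(12,500)".
AMOUNT_PATTERN = r"\$?\s*(\(?)(\d[\d,]*(?:\.\d+)?)(\)?)"


def extract_amount(text, label):
    """Return the first amount that follows `label` in `text`, or None."""
    pattern = re.compile(
        re.escape(label) + r"[:.\s]*" + AMOUNT_PATTERN, re.IGNORECASE
    )
    match = pattern.search(text)
    if match is None:
        return None
    open_paren, digits, close_paren = match.groups()
    value = float(digits.replace(",", ""))
    # Assumption: an amount wrapped in parentheses is a negative value.
    if open_paren and close_paren:
        value = -value
    return value
```

For example, calling extract_amount on a page containing the line "Operating expenses   $ 1,234,567" with the label "Operating expenses" yields 1234567.0. Real filings spread values across tables and footnotes, so a pattern like this would cover only the simplest cases.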

Entity Integration combines extracted cost variable values from multiple filings over time to create a complete temporal view for each producer. This step also validates cost variables to identify potentially incorrect values and supports data curation to correct errors.
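The integration and validation step can be pictured with the following Python sketch, which merges values from several filings into one view per reporting period and flags values that change sharply from the previous period. The record layout (filed, period, values), the rule that a later filing overrides an earlier one, and the 50% jump threshold are assumptions chosen for illustration, not details of the actual system.

```python
def build_temporal_view(filings, max_jump=0.5):
    """Combine filings into {period: {variable: value}} and flag suspect values.

    Each filing is a dict with hypothetical keys:
      'filed'  - ISO filing date string, e.g. '2012-03-01'
      'period' - reporting period label, e.g. '2011'
      'values' - dict of cost variable name -> reported value
    """
    view = {}
    # Process in filing order so later filings (restatements) override earlier ones.
    for filing in sorted(filings, key=lambda f: f["filed"]):
        view.setdefault(filing["period"], {}).update(filing["values"])

    # Flag any value that differs from the previous period by more than max_jump.
    flags = []
    periods = sorted(view)
    for prev, curr in zip(periods, periods[1:]):
        for variable, value in view[curr].items():
            previous = view[prev].get(variable)
            if previous and abs(value - previous) / abs(previous) > max_jump:
                flags.append((curr, variable))
    return view, flags
```

The flagged (period, variable) pairs are candidates for data curation rather than confirmed errors; a large jump may also reflect a genuine change in costs.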

Statistical Analysis addresses missing data by using statistical techniques to create estimated values, enabling (a) index computation up to the present for all agencies, even if the last publicly available financial report is several months old, (b) cost variable estimation for all prior periods, even if the data from published financial reports was incomplete, and (c) index computation at finer time granularities.
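The statistical techniques themselves are not specified here. As a rough stand-in, the Python sketch below fills gaps between reported years by linear interpolation and extends the series past the last reported year with a least-squares linear trend, which corresponds loosely to cases (b) and (a). The function fill_and_extend and its annual data layout are hypothetical.

```python
def fill_and_extend(series, through_year):
    """Estimate missing annual values for one cost variable.

    series: dict of year (int) -> reported value, or None if not reported.
    Returns a dict covering every year from the first reported year
    through `through_year`.
    """
    known = sorted((y, v) for y, v in series.items() if v is not None)
    if len(known) < 2:
        raise ValueError("need at least two reported values")

    # Ordinary least-squares line through the reported points.
    n = len(known)
    mean_x = sum(y for y, _ in known) / n
    mean_y = sum(v for _, v in known) / n
    slope = sum((y - mean_x) * (v - mean_y) for y, v in known) / sum(
        (y - mean_x) ** 2 for y, _ in known
    )
    intercept = mean_y - slope * mean_x

    first_year, last_year = known[0][0], known[-1][0]
    result = {}
    for year in range(first_year, through_year + 1):
        if series.get(year) is not None:
            result[year] = series[year]  # keep reported values unchanged
        elif year < last_year:
            # Gap between reported years: interpolate between nearest neighbors.
            y0, v0 = max((k for k in known if k[0] < year), key=lambda k: k[0])
            y1, v1 = min((k for k in known if k[0] > year), key=lambda k: k[0])
            result[year] = v0 + (v1 - v0) * (year - y0) / (y1 - y0)
        else:
            # After the last report: extrapolate along the fitted trend.
            result[year] = intercept + slope * year
    return result
```

With reported values of 100.0 for 2009, 120.0 for 2011, and 130.0 for 2012, the sketch estimates 110.0 for the missing year 2010 and approximately 140.0 for 2013. A production system would likely need richer models, for example ones that borrow information from comparable agencies.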

© IBM, Waterfund 2013