
Overall aim

This work package will be deemed successful if it results in strong, well-justified and internationally applicable recommendations on appropriate tools, methods and environments for processing and analysing different types of Big Data. It should also deliver a report on the feasibility of establishing a shared approach for using Big Data sources that are multi-national, or for which similar sources are available in different countries.

The value of the work package, in the context of the overall goals of the project on the Role of Big Data in the Modernisation of Statistical Production and in relation to the overarching strategy of the HLG, derives from its international nature.  While individual statistical organizations can experiment with the production of official statistics from Big Data sources (and many are currently doing so or have already done so), and can share their findings and methods with other organizations, this work package will allow the same experimentation to take place in a more open and collaborative setting.  The work package will draw on the international nature and/or international ownership and management of many Big Data sources, and will capitalise on the collective bargaining power of the statistical community acting as one in relation to such large transnational entities.  It will contribute to the overall value of the project by providing a common methodology from the outset, precluding the need for post-hoc efforts to harmonise methodology in the future.


Specific objectives

Successful completion of the work package will entail evaluating the feasibility of the following propositions and, insofar as the propositions are found to be feasible, demonstrating and documenting in broad yet practical terms how the actions could be achieved in statistical organizations.

  1. 'Big Data' sources can be obtained (in a stable and reliably replicable way), installed and manipulated with relative ease and efficiency on the chosen platform, within technological and financial constraints that realistically reflect the situation of national statistical offices.
  2. The chosen sources can be processed to produce statistics (either mainstream or novel) which conform to appropriate quality criteria (both existing and new) used to assess official statistics, and which are reliable and comparable across countries.
  3. The resulting statistics correspond in a systematic and predictable way to existing mainstream products, such as price statistics, household budget indicators, etc.
  4. The chosen platforms, tools, methods and datasets can be used in similar ways to produce analogous statistics in different countries.
  5. The different participating countries can share tools, methods, datasets and results efficiently, operating on the principles established in the Common Statistical Production Architecture.

While the first objective is to examine these propositions (the 'proof of concept'), a second objective is to then use these findings to produce a general model for achieving the goal of producing statistics from Big Data, and to communicate this effectively to statistical organizations.  Hence, all processes, findings, lessons learned and results will be recorded and will feed into work package 3 for dissemination and training activities.  In particular, experiences and best practices for obtaining data will be detailed for the benefit of other organizations.


Basis for the Recommendations

The recommendations given in this annex were arrived at through the deliberations of a team of experts representing eight countries or international organizations, in consultation with the broader expert task team on Big Data.

The task team considered a wide range of alternative possibilities for tools, datasets and statistics, and assessed them against various criteria, including the following:

 Tools

  • Whether or not the tools are open source
  • Ease of use for statistical office staff
  • Possibilities for interoperability and integration with other tools
  • Ease of integration into existing statistical production architectures
  • Cost
  • Availability of documentation
  • Availability of online tutorials and support 
  • Training requirements, including whether or not a vendor-specific language has to be learned
  • The existence of an active and knowledgeable user community.   

 Statistics

  • At least one statistic that corresponds closely and in a predictable way to a mainstream statistic produced by most statistical organizations
  • One or more short-term indicators of specific variables, or cross-sectional statistics, which permit the study of detailed relationships between variables
  • One or more statistics that represent a new, non-traditional output (i.e. something that has not generally been measured by producers of official statistics, be it a novel social phenomenon or an existing one where the need to measure it has only recently arisen)

 Datasets

  • Ease of locating and obtaining data from providers
  • Cost of obtaining data (if any)
  • Stability (or expected stability) over time
  • Availability of data that can be used by several countries, or data whose format is at least broadly homogeneous across countries
  • The existence of ID variables that enable the merging of Big Data sets with traditional statistical data sources

Recommendations and Resource Requirements

The task team recommends that this work package proceed according to the recommendations detailed below, one entry per aspect.

Processing environment

Recommendation: The HortonWorks Hadoop distribution is to be installed on a cluster provided by a volunteering statistical organization.

Further information: http://hortonworks.com/

Processing tools/software

Recommendation: The Pentaho Business Analytics Suite Enterprise Edition will be deployed under a free trial license obtained for the purpose of the project (for an initial period of six months, with the possibility of renewal for up to one year).

The Pentaho Business Analytics Suite Enterprise Edition provides a unified, interactive and visual environment for data integration, data analysis, data mining, visualization and other capabilities. Pentaho's Data Integration component is fully compatible with Hortonworks Hadoop and allows 'drag and drop' development of MapReduce jobs.

Additional tools such as R and RHadoop will be installed alongside the Pentaho Suite; a minimal sketch of an R-based MapReduce job is given below.

Further information:
  • http://www.pentahobigdata.com/
  • http://www.r-project.org/
  • https://github.com/RevolutionAnalytics/RHadoop/wiki

Datasets and statistics to be produced with them (or feasibility of production to be demonstrated)

Recommendation: One or more datasets from each of the categories below are to be installed in the sandbox and experimented with for the creation of appropriate corresponding statistics:

  • Transactional sources (from banks, telecommunications providers or retail outlets), to enable the recreation of standard official statistics in the easiest possible way, minimising as far as possible potential hindrances to access
  • Sensor data sources
  • Social network sources, image- or video-based sources, and other less-explored sources, to enable the creation of 'new' statistics

Human resource requirements

Recommendation: A task team will need to be identified at the outset of the project, composed of experts whose time is volunteered in kind by their respective organizations for the duration of the work package. The project manager's first task will be to identify the number of members required, the requisite skills, and the amount of time to be committed by task team members to enable the work to progress.
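To make the recommended R-on-Hadoop workflow more concrete, the following is a minimal sketch of a word-count MapReduce job written with the rmr2 package from RHadoop. It assumes rmr2 has been installed and configured against the Hortonworks cluster; the HDFS input path is purely illustrative.

    library(rmr2)

    # Map step: split each line of text into words and emit (word, 1) pairs.
    wc.map <- function(., lines) {
      keyval(unlist(strsplit(lines, "\\s+")), 1)
    }

    # Reduce step: sum the counts emitted for each word.
    wc.reduce <- function(word, counts) {
      keyval(word, sum(counts))
    }

    # Run the job over a plain-text file stored in HDFS (path is illustrative).
    result <- mapreduce(input = "/sandbox/sample_text",
                        input.format = "text",
                        map = wc.map,
                        reduce = wc.reduce)

    counts <- from.dfs(result)  # a list with $key (words) and $val (counts)

The same job could equally be assembled visually in Pentaho Data Integration; the R version is shown here simply because it is compact enough to reproduce in full.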


Expected timeline

Assemble task team to lead sandbox work (January 2014)

All those interested in participating will be encouraged to do so, but a task team will be required to steer the work, ensuring that objectives are pursued and processes are documented.

Obtain and install necessary hardware, software etc. (January-March 2014)

  • Set up the Pentaho suite:
    • configure Pentaho for the Hadoop distribution and version
    • test the configuration
  • Set up R with RHadoop:
    • test the configuration (a minimal smoke test is sketched below).
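As a first sanity check of the R and RHadoop configuration, a simple round-trip test along the following lines could be run. This is a sketch of one plausible test, not a prescribed procedure: it uses rmr2's local backend so that it can be tried before the cluster is available, and switching the backend to "hadoop" then exercises the real installation.

    library(rmr2)

    # Run locally first; change to backend = "hadoop" to test the cluster itself.
    rmr.options(backend = "local")

    # Round trip: write a small vector to the (local or distributed) file system...
    ints <- to.dfs(1:1000)

    # ...run a trivial MapReduce job over it (square every value)...
    result <- mapreduce(input = ints,
                        map = function(k, v) keyval(v, v^2))

    # ...and read the results back, checking that the job ran end to end.
    out <- from.dfs(result)
    stopifnot(length(out$val) == 1000)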

Training of the task team to ensure familiarity with the technical tools, and to start collaboration between team members (April-June 2014)

  • Use of online documents, tutorials, demonstration videos etc.
  • Potential running of a training session (conditional upon hosting and/or financial support from a participating organization), which could be held alongside another Big Data event to save costs for participants.

Obtain requisite datasets and undertake analyses in sandbox (July-October 2014)

  • Obtain and install datasets (a minimum of one from each category outlined in the preceding section). Note: the process of obtaining datasets that are not freely available (whether paid for or not) should begin at the outset of the project, so that they are available by this stage of the work.
  • For each dataset:
    • study the availability of variables
    • analyse the representativeness of the statistical figures
    • study other statistical figures available
    • produce some statistics (an illustrative sketch follows this list)
    • document all processes and results on an ongoing basis.
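As an illustration of the 'produce some statistics' step, the sketch below computes a simple Jevons-style (unweighted geometric mean) price index from a transactional retail extract. The file name and the column names (product, price, month) are hypothetical, and the unweighted formula is used only for simplicity; a production index would require proper weighting, classification and quality adjustment.

    # Hypothetical transactional extract: one row per retail transaction,
    # with illustrative columns product, price and month (e.g. "2014-01").
    transactions <- read.csv("transactions_sample.csv", stringsAsFactors = FALSE)

    # Average unit price per product and month.
    avg.price <- aggregate(price ~ product + month, data = transactions, FUN = mean)

    # Price relatives against the earliest month in the extract.
    base <- subset(avg.price, month == min(avg.price$month))
    rel  <- merge(avg.price, base[, c("product", "price")],
                  by = "product", suffixes = c("", ".base"))
    rel$relative <- 100 * rel$price / rel$price.base

    # Unweighted geometric mean of the relatives per month (Jevons formula).
    index <- aggregate(relative ~ month, data = rel,
                       FUN = function(r) exp(mean(log(r))))
    print(index)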
Produce a general model for achieving the goal of producing statistics from Big Data, and communicate it effectively to statistical organizations (November-December 2014)

  • Document findings
  • Incorporate documented results into dissemination materials and activities




3 Comments

  1. Thank you, Fiona, you have produced a very good document. Here are some comments that you could add, perhaps improving the English wording.

    1. Add as a specific objective: “The chosen source can be processed to produce reliable and comparable statistics across countries” (perhaps you did not include it because it depends on the availability of an appropriate data source).

    2. Basis for recommendations: Statistics: perhaps another perspective on the statistics could be added: short-term indicators of specific variables, or cross-sectional statistics from which detailed relationships between variables can be studied.

    3. Basis for recommendations: Datasets: add “Homogeneity of format (broadly speaking) across countries”.

    4. Recommendations and resource requirements: datasets: add “Video sources (e.g. using image recognition, an indicator of tourism in Rome could be constructed from a video camera at the entrance of the Colosseum)”.

  2. I would underline the need for shared training: I think this is a prerequisite for good group work.

    It is very difficult to work in common environments without shared knowledge of the topics, or without personal acquaintance.

    I would stress this concept.

  3. Carlo, this sounds very reasonable.