
API Definitions News

These are the news items I've curated in my monitoring of the API space that have some relevance to the API definition conversation, and that I wanted to include in my research. I'm using all of these links to better understand how the space is defining not just its APIs, but its schemas, and the other moving parts of its API operations.

APIs Used To Give Us Access To Resources That Were Out Of Our Reach

I remember when almost all the APIs out there gave us developers access to things we couldn’t ever possibly get on our own. Some of it was about the network effect with the early Amazon and eBay marketplaces, or Flickr and Delicious, and then Twitter and Facebook. Then what really brought it home was going beyond the network effect, and delivering resources that were completely out of our reach like maps of the world around us, (seemingly) infinitely scalable compute and storage, SMS, and credit card payments. In the early days it really seemed like APIs were all about giving us access to something that was out of our reach as startups, or individuals.

While this still does exist, it seems like many APIs have flipped the table and it is all about giving them access to our personal and business data in ways that used to be out of their reach. Machine learning APIs are using parlour tricks to get access to our internal systems and databases. Voice enablement, entertainment, and cameras are gaining access to our homes, what we watch and listen to, and are able to look into the dark corners of our personal lives. Tinder, Facebook, and other platforms know our deep dark secrets, our personal thoughts, and have access to our email and intimate conversations. The API promise seems to have changed along the way, and stopped being about giving us access, and is now about giving them access.

I know it has always been about money, but the early vision of APIs seemed more honest. It seemed more about selling a product or service that people needed, and was more straight up. Now it just seems like APIs are invasive, being used to infiltrate our professional and business worlds through our mobile phones. It feels like people just want access to us, purely so they can mine us and make more money. You just don’t see many Flickrs, Google Maps, or Amazon EC2s anymore. The new features in the mobile devices we carry around, and the ones we install in our homes, don’t really benefit us in new and amazing ways. They seem to offer just enough to get us to adopt them, and install them in our lives, so they can get access to yet another data point. Maybe it is just because everything has been done, or maybe it is because it has all been taken over by the money people, looking for the next big thing (for them).

Oh no! Kin is ranting again. No, I’m not. I’m actually feeling pretty grounded in my writing lately, I’m just finding it takes a lot more work to find interesting APIs. I have to sift through many more emails from folks telling me about their exploitative API before I come across something interesting. I go through 30 vulnerability posts in my feeds before I come across one creative story about something a platform is doing. There are 55 posts about ICOs before I find an interesting investment in a startup doing something that matters. I’m willing to admit that I’m a grumpy API Evangelist most of the time, but I feel really happy and content, and I am enjoying my research overall. I just feel like the space has lost its way with this big data thing, and is using APIs more for infiltration and extraction than for delivering something that actually gives developers access to something meaningful. I just think we can do better. Something has to give, or this won’t continue to be sustainable much longer.


Looking At The 37 Apache Data Projects

I’m spending time investing in my data research, as well as my database API research. I’ll have guides, with accompanying stories, coming out over the next couple of weeks, but I want to take a moment to publish some of the raw research that I think paints an interesting picture about where things are headed.

When studying what is going on with data and APIs, you can’t do any search without stumbling across an Apache project doing something or other with data. I found 37 separate projects at Apache that were data related, and wanted to publish them as a single list I could learn from.

  • Airavata - Apache Airavata is a micro-service architecture based software framework for executing and managing computational jobs and workflows on distributed computing resources including local clusters, supercomputers, national grids, academic and commercial clouds. Airavata is dominantly used to build Web-based science gateways and assist to compose, manage, execute, and monitor large scale applications (wrapped as Web services) and workflows composed of these services.
  • Ambari - Apache Ambari makes Hadoop cluster provisioning, managing, and monitoring dead simple.
  • Apex - Apache Apex is a unified platform for big data stream and batch processing. Use cases include ingestion, ETL, real-time analytics, alerts and real-time actions. Apex is a Hadoop-native YARN implementation and uses HDFS by default. It simplifies development and productization of Hadoop applications by reducing time to market. Key features include Enterprise Grade Operability with Fault Tolerance, State Management, Event Processing Guarantees, No Data Loss, In-memory Performance & Scalability and Native Window Support.
  • Avro - Apache Avro is a data serialization system.
  • Beam - Apache Beam is a unified programming model for both batch and streaming data processing, enabling efficient execution across diverse distributed execution engines and providing extensibility points for connecting to different technologies and user communities.
  • Bigtop - Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem. The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects. In short we strive to be for Hadoop what Debian is to Linux.
  • BookKeeper - BookKeeper is a reliable replicated log service. It can be used to turn any standalone service into a highly available replicated service. BookKeeper is highly available (no single point of failure), and scales horizontally as more storage nodes are added.
  • Calcite - Calcite is a framework for writing data management systems. It converts queries, represented in relational algebra, into an efficient executable form using pluggable query transformation rules. There is an optional SQL parser and JDBC driver. Calcite does not store data or have a preferred execution engine. Data formats, execution algorithms, planning rules, operator types, metadata, and cost model are added at runtime as plugins.
  • CouchDB - Apache CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. Apache CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of Apache CouchDB. And you can distribute your data, or your apps, efficiently using Apache CouchDB’s incremental replication. Apache CouchDB supports master-master setups with automatic conflict detection.
  • Crunch - The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
  • DataFu - Apache DataFu consists of two libraries: Apache DataFu Pig is a collection of useful user-defined functions for data analysis in Apache Pig. Apache DataFu Hourglass is a library for incrementally processing data using Apache Hadoop MapReduce. This library was inspired by the prevalence of sliding window computations over daily tracking data. Computations such as these typically happen at regular intervals (e.g. daily, weekly), and therefore the sliding nature of the computations means that much of the work is unnecessarily repeated. DataFu’s Hourglass was created to make these computations more efficient, yielding sometimes 50-95% reductions in computational resources.
  • Drill - Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google’s Dremel.
  • Edgent - Apache Edgent is a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the continuous streams of data coming from equipment, vehicles, systems, appliances, devices and sensors of all kinds (for example, Raspberry Pis or smart phones). Working in conjunction with centralized analytic systems, Apache Edgent provides efficient and timely analytics across the whole IoT ecosystem: from the center to the edge.
  • Falcon - Apache Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.
  • Flink - Flink is an open source system for expressive, declarative, fast, and efficient data analysis. It combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases.
  • Flume - Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store
  • Giraph - Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections.
  • Hama - Apache Hama is an efficient and scalable general-purpose BSP computing engine which can be used to speed up a large variety of compute-intensive analytics applications.
  • Helix - Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix automates reassignment of resources in the face of node failure and recovery, cluster expansion, and reconfiguration.
  • Ignite - Apache Ignite In-Memory Data Fabric is designed to deliver uncompromised performance for a wide set of in-memory computing use cases, from high performance computing, to the industry’s most advanced data grid, in-memory SQL, in-memory file system, streaming, and more.
  • Kafka - A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
  • Knox - The Apache Knox Gateway is a REST API Gateway for interacting with Hadoop clusters. The Knox Gateway provides a single access point for all REST interactions with Hadoop clusters. In this capacity, the Knox Gateway is able to provide valuable functionality to aid in the control, integration, monitoring and automation of critical administrative and analytical needs of the enterprise.
  • Lens - Lens provides a Unified Analytics interface. Lens aims to cut the Data Analytics silos by providing a single view of data across multiple tiered data stores, and an optimal execution environment for the analytical query. It seamlessly integrates Hadoop with traditional data warehouses to appear like one.
  • MetaModel - With MetaModel you get a uniform connector and query API to many very different datastore types, including: Relational (JDBC) databases, CSV files, Excel spreadsheets, XML files, JSON files, Fixed width files, MongoDB, Apache CouchDB, Apache HBase, Apache Cassandra, ElasticSearch, OpenOffice.org databases, Salesforce.com, SugarCRM and even collections of plain old Java objects (POJOs). MetaModel isn’t a data mapping framework. Instead we emphasize abstraction of metadata and ability to add data sources at runtime, making MetaModel great for generic data processing applications, less so for applications modeled around a particular domain.
  • Oozie - Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
  • ORC - ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query.
  • Parquet - Apache Parquet is a general-purpose columnar storage format, built for Hadoop, usable with any choice of data processing framework, data model, or programming language.
  • Phoenix - Apache Phoenix enables OLTP and operational analytics for Apache Hadoop by providing a relational database layer leveraging Apache HBase as its backing store. It includes integration with Apache Spark, Pig, Flume, Map Reduce, and other products in the Hadoop ecosystem. It is accessed as a JDBC driver and enables querying, updating, and managing HBase tables through standard SQL.
  • REEF - Apache REEF (Retainable Evaluator Execution Framework) is a development framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource reuse for data caching, and state management abstractions that greatly ease the development of elastic data processing workflows on cloud platforms that support a Resource Manager service.
  • Samza - Apache Samza provides a system for processing stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream processing task, and executes it as a Samza job. Samza then routes messages between stream processing tasks and the publish-subscribe systems that the messages are addressed to.
  • Spark - Apache Spark is a fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics.
  • Sqoop - Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Storm - Apache Storm is a distributed real-time computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing real-time computation.
  • Tajo - The main goal of Apache Tajo project is to build an advanced open source data warehouse system in Hadoop for processing web-scale data sets. Basically, Tajo provides SQL standard as a query language. Tajo is designed for both interactive and batch queries on data sets stored on HDFS and other data sources. Without hurting query response times, Tajo provides fault-tolerance and dynamic load balancing which are necessary for long-running queries. Tajo employs a cost-based and progressive query optimization techniques for optimizing running queries in order to avoid the worst query plans.
  • Tez - Apache Tez is an effort to develop a generic application framework which can be used to process arbitrarily complex directed-acyclic graphs (DAGs) of data-processing tasks and also a reusable set of data-processing primitives which can be used by other projects.
  • VXQuery - Apache VXQuery will be a standards compliant XML Query processor implemented in Java. The focus is on the evaluation of queries on large amounts of XML data. Specifically the goal is to evaluate queries on large collections of relatively small XML documents. To achieve this queries will be evaluated on a cluster of shared nothing machines.
  • Zeppelin - Zeppelin is a modern web-based tool for the data scientists to collaborate over large-scale data exploration and visualization projects.

There is a serious amount of overlap between these projects. Not all of these projects have web APIs, while some of them are all about delivering a gateway or aggregate API across projects. There is a lot to process here, but I think listing them out provides an easier way to understand the big data explosion of projects over at Apache.

It is tough to understand what each of these does without actually playing with them, but that is something I just don’t have the time to do, so next up I’ll be doing independent searches for these project names, and finding stories from across the space regarding what folks are doing with these data solutions. That should give me enough to go on when putting them into specific buckets, and finding their place in my data and database API research.


Data Streaming In The API Landscape

I was taking a fresh look at my real time API research as part of some data streaming and event sourcing conversations I was having last week. My research areas are never perfect, but I’d say that real time is still the best umbrella to think about some of the shifts we are seeing on the landscape recently. They are nothing new, but there has been renewed energy, new and interesting conversations going on, as well as some growing trends that I cannot ignore. To support my research, I took a day this week to dive in, have a conversation with my buddy Alex over at TheNewStack.io, and talk with Tyler Jewell, the new CEO of WSO2, about what is happening.

The way I approach my research is to always step back and look at what is happening already in the space, and I wanted to take another look at some of the real time API service providers I was already keeping an eye on:

  • Pubnub - APIs for developers building secure realtime Mobile, Web, and IoT Apps.
  • StreamData - Transform any API into a real-time data stream without a single line of server code.
  • Fanout.io - Fanout’s reverse proxy helps you push data to connected devices instantly.
  • Firebase - Store and sync data with our NoSQL cloud database. Data is synced across all clients in real time, and remains available when your app goes offline.
  • Pusher - Leaders in real time technologies. We empower all developers to create live features for web and mobile apps with our simple hosted API.

I’ve been tracking on what these providers have been doing for a while. They’ve all been pushing the boundaries of streaming and real time APIs for some time. Another open source solution that I think is worth noting, which I believe some of the above services have leveraged, is Netty.io.

  • Netty - Netty is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients.

I also wanted to make sure to include Google’s approach to a technology that has been around a while:

  • Google Cloud Pub/Sub - Google Cloud Pub/Sub is a fully-managed real-time messaging service that allows you to send and receive messages between independent applications.
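To make that a little more concrete, here is a minimal sketch of what publishing a message to Google Cloud Pub/Sub looks like from Python, assuming the google-cloud-pubsub client library is installed and credentials are already configured; the project and topic names below are hypothetical placeholders, not anything from a real project.

```python
# A minimal Google Cloud Pub/Sub publish sketch; project and topic names are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "example-topic")  # hypothetical names

# Messages are raw bytes; extra keyword arguments become message attributes.
future = publisher.publish(topic_path, data=b"sensor reading", source="device-42")
print(future.result())  # blocks until the service returns a message ID
```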

Next, I wanted to refresh my understanding of all the Apache projects that speak to this realm. I’m always trying to keep a handle on what they each actually offer, and how they overlap. So, seeing them side by side like this helps me think about how they fit into the big picture.

  • Apache Kafka - Kafka™ is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies (a minimal producer sketch follows this list).
  • Apache Flink - Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
  • Apache Spark - Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Apache Storm - Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
  • Apache Apollo - ActiveMQ Apollo is a faster, more reliable, easier to maintain messaging broker built from the foundations of the original ActiveMQ.
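To ground the Kafka item above, here is a minimal producer and consumer sketch using the kafka-python client library, assuming a broker is reachable on localhost; the topic name is made up for the example.

```python
# A minimal Kafka producer/consumer sketch using kafka-python; the topic name is hypothetical.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("api-events", b'{"event": "signup", "user": "example"}')  # payloads are bytes
producer.flush()  # make sure the record leaves the client buffer

consumer = KafkaConsumer("api-events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)  # stop iterating after 5s of silence
for record in consumer:
    print(record.offset, record.value)  # each record carries its partition offset
```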

One thing I think is worth noting with all of these is the absence of the web when you read through their APIs. Apollo had some significant RESTful approaches, and you find gateways and plugins for some of the others, but when you consider how these technologies fit into the wider API picture, I’d say they aren’t about embracing the web.

On that note, I think it is worth mentioning what is going on over at Google, with their gRPC effort, which provides “bi-directional streaming and fully integrated pluggable authentication with http/2 based transport”:

  • gRPC - A high performance, open-source universal RPC framework

Also, I think most notably, they are continuing the tradition of APIs embracing the web by building on top of HTTP/2. For me, this is always important, and trumps just being open source in my book. The more of the web an open source technology, or a company’s service, utilizes, the more comfortable I’m going to feel telling my readers they should be baking it into their operations.

After these services and tooling, I don’t want to forget about the good ol’ fashioned protocols available out there that help us do things in real time. I’m tracking on 12 real time protocols that I see in use across the companies, organizations, institutions, and government agencies I monitor:

  • Simple (or Streaming) Text Orientated Messaging Protocol (STOMP) - STOMP is the Simple (or Streaming) Text Orientated Messaging Protocol. STOMP provides an interoperable wire format so that STOMP clients can communicate with any STOMP message broker to provide easy and widespread messaging interoperability among many languages, platforms and brokers.
  • Advanced Message Queuing Protocol (AMQP) - The Advanced Message Queuing Protocol (AMQP) is an open standard for passing business messages between applications or organizations. It connects systems, feeds business processes with the information they need and reliably transmits onward the instructions that achieve their goals.
  • MQTT - MQTT is a machine-to-machine (M2M)/Internet of Things connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium.
  • OpenWire - OpenWire is our cross language Wire Protocol to allow native access to ActiveMQ from a number of different languages and platforms. The Java OpenWire transport is the default transport in ActiveMQ 4.x or later.
  • Websockets - WebSocket is a protocol providing full-duplex communication channels over a single TCP connection. The WebSocket protocol was standardized by the IETF as RFC 6455 in 2011, and the WebSocket API in Web IDL is being standardized by the W3C.
  • Extensible Messaging and Presence Protocol (XMPP) - XMPP is the Extensible Messaging and Presence Protocol, a set of open technologies for instant messaging, presence, multi-party chat, voice and video calls, collaboration, lightweight middleware, content syndication, and generalized routing of XML data.
  • SockJS - SockJS is a browser JavaScript library that provides a WebSocket-like object. SockJS gives you a coherent, cross-browser, Javascript API which creates a low latency, full duplex, cross-domain communication channel between the browser and the web server.
  • PubSubHubbub - PubSubHubbub is an open protocol for distributed publish/subscribe communication on the Internet. Initially designed to extend the Atom (and RSS) protocols for data feeds, the protocol can be applied to any data type (e.g. HTML, text, pictures, audio, video) as long as it is accessible via HTTP. Its main purpose is to provide real-time notifications of changes, which improves upon the typical situation where a client periodically polls the feed server at some arbitrary interval. In this way, PubSubHubbub provides pushed HTTP notifications without requiring clients to spend resources on polling for changes.
  • Real Time Streaming Protocol (RTSP) - The Real Time Streaming Protocol (RTSP) is a network control protocol designed for use in entertainment and communications systems to control streaming media servers. The protocol is used for establishing and controlling media sessions between end points. Clients of media servers issue VCR-style commands, such as play and pause, to facilitate real-time control of playback of media files from the server.
  • Server-Sent Events - Server-sent events (SSE) is a technology where a browser receives automatic updates from a server over an HTTP connection. The Server-Sent Events EventSource API is standardized as part of HTML5 by the W3C (a minimal consumer sketch follows this list).
  • HTTP Live Streaming (HLS) - HTTP Live Streaming (also known as HLS) is an HTTP-based media streaming communications protocol implemented by Apple Inc. as part of its QuickTime, Safari, OS X, and iOS software.
  • HTTP Long Polling - With HTTP long polling, the client polls the server requesting new information. The server holds the request open until new data is available. Once available, the server responds and sends the new information. When the client receives the new information, it immediately sends another request, and the operation is repeated. This effectively emulates a server push feature.
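Of the protocols above, Server-Sent Events is one of the simplest to consume, since it is just a long-lived HTTP response. Here is a minimal consumer sketch using the requests library; the streaming URL is a hypothetical placeholder.

```python
# A minimal Server-Sent Events consumer sketch; the endpoint URL is hypothetical.
import requests

with requests.get("https://api.example.com/stream", stream=True) as response:
    for line in response.iter_lines(decode_unicode=True):
        # SSE frames are plain text; payload lines are prefixed with "data:"
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())
```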

These protocols are used by the majority of the service providers and tooling I list above, but in my research I’m always trying to focus on not just the services and tooling, but the actual open standards that they support.

I also have to mention what I consider the entry level aspect of real time. It is something that many API providers support, but it is also the 101 level approach that some companies, organizations, institutions, and agencies need to be exposed to before they get overwhelmed with the other approaches.

  • Webhooks - A webhook in web development is a method of augmenting or altering the behavior of a web page, or web application, with custom callbacks. These callbacks may be maintained, modified, and managed by third-party users and developers who may not necessarily be affiliated with the originating website or application.
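As a concrete 101-level illustration, here is a minimal sketch of a webhook receiver using Flask; the route and payload fields are assumptions for the example, not any particular provider's contract.

```python
# A minimal webhook receiver sketch using Flask; route and payload fields are hypothetical.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/orders", methods=["POST"])
def receive_order_event():
    event = request.get_json(force=True)  # providers typically POST a JSON payload
    print("received event:", event.get("type"))
    return "", 204  # acknowledge quickly, and do any heavy lifting asynchronously

if __name__ == "__main__":
    app.run(port=8080)
```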

That is the real time API landscape. Sure, there are other services and tooling, but this is the cream on top. I’m also struggling with the overlap with event sourcing, evented architecture, messaging, and other layers of the API space that are being used to move bits and bytes around today. Technologists aren’t always the best at using precise words, or keeping things simple and easy to understand, let alone articulate. One of the concerns I have with streaming API approaches is that they are often over the heads, and beyond the needs, of some API providers, and many API consumers. They have their place within certain use cases, and large organizations that have the resources, but I spend a lot of time worrying about the little guy.

I think a good example of web API vs streaming API can be found in the Twitter API community. Many folks just need simple, intuitive, RESTful endpoints to get access to data and content, while a much smaller slice of the pie will have the technology, skills, and compute capacity to do things at scale. Regardless, I see technologies like Apache Kafka being turned into plug and play, infrastructure as a service approaches, allowing anyone to quickly deploy to Heroku, and just put them to work via a SaaS model. So, of course, I will still be paying attention, and trying to make sense out of all of this. I don’t know where any of it will be going, but I will keep tuning in, and telling stories about how real time and streaming API technology is being used, or not being used.


Admit It You Do Not Respect Your API Consumers And End Users

Just admit it, you couldn’t care less about your API consumers. You are just playing this whole API game because you read somewhere that this is what everyone should be doing now. You figured you could get some good press out of doing an API, get some free work from developers, and look like you are one of the cool kids for a while. You do the song and dance well, you have developed and deployed an API. It will look like the other APIs out there, but when it comes to supporting developers, or actually investing in the community, you really aren’t that interested in rolling up your sleeves and making a difference. You just don’t really care that much, as long as it looks like you are playing the API game.

Honestly, you’d do any trend that comes along, but this one has so many perks you couldn’t ignore it. Not only do you get to be API cool, you did all the right things, launched on Product Hunt, and you have a presence at all the right tech events. Developers are lining up to build applications, and are willing to work for free. Most of the apps that get built are worthless, but the SDKs you provide act as a vacuum for data. You’ve managed to double your budget by selling the data you acquire to your partners, and other data brokers. You could give away your API for free, and still make a killing, but hell, you have to keep charging just so you look legit, and don’t raise any alarm bells.

It is hard to respect developers who line up and work for free like this. And the users, they are so damn clueless regarding what is going on, they’ll hand over their address book and location in real-time without ever thinking twice. This is just too easy. APIs are such a great racket. You really don’t have to do anything but blog every once in a while, show up at events and drink beer, and make sure the API doesn’t break. What a sweet gig huh? No, not really, you are just a pretty sad excuse for a person, and it will catch up with you somewhere. You really represent everything wrong with technology right now, and are contributing to the world being a worse place than it already is–nice job!

Note: If my writing is a little dark this week, here is a little explainer–don’t worry, things will be back to normal at API Evangelist soon.


The Reliability Of Government Data Over Externally Managed Data Sets

When I worked at the Department of Veterans Affairs I was approached by a number of folks, external to the federal government, who wanted to help clean up, work with, and improve public data sets as part of open data efforts in the federal government. As I was working on specific datasets about veteran facilities, organizations, programs, services, and other datasets that could make a potential impact on veterans’ lives, I would often suggest publishing CSVs to Github, and soliciting the help of the public to validate and manage data out in the open. That suggestion was almost always shut down when I brought the topic up with anyone in leadership.

The common stance regarding the public participating in acquiring, managing, and cleaning up data using Github was–NO! The federal government was the authority when it came to providing data. It would own the entire process, and would be the only gatekeeper for accessing it. A couple of the datasets that came up were the information for suicide assistance and substance abuse clinic support, where I had local folks on the ground at clinics and veteran support groups wanting to help. I was told there would be no way I could get approval to help crowdsource the evolution of these data sets, and that all data would be stored, maintained, and made available via VA servers.

As I waded through a significant number of links that returned 404s, as part of my talk about the state of APIs in the federal government last week, I was reminded once again of the reliability of federal government datasets. I’m finding that a significant number of APIs, datasets, and supporting documentation have gone missing. This has me looking for any existing examples of how the federal government can better publish, share, syndicate, and manage data in an interoperable way, through efforts like the National Information Exchange Model (NIEM), which “is a common vocabulary that enables efficient information exchange across diverse public and private organizations. NIEM can save time and money by providing consistent, reusable data terms and definitions, and repeatable processes.”

Another aspect of this conversation I’ll be exploring further is the role Github plays in all this. There are 130+ federal agency Github users / organizations on the platform, and I’d like to see how this usage might contribute to federal agencies being more engaged, and managing the uptime, availability, and reliability of data, code, APIs, and other resources coming out of the federal government. I am looking for any positive examples of federal agencies leveraging external cloud services, and private sector partnership opportunities, to make data, content, and other resources more available and reliable for public consumption. Let me know any other angles you’d like to see highlighted as part of my federal government data and API research.


The Hack Education Gates Foundation Grant Data Has An API

I have been helping my partner in crime Audrey Watters (@audreywatters) adopt my approach to managing data project(s) using Google Sheets and Github, as part of her work on ed-tech funding. She is going through many of the leading companies, and foundations behind the funding of technology used across the education sector, and doing the hard work of connecting the dots behind how technology gets funded in this critical layer of our society.

I want Audrey (and others) to be self-sufficient when it comes to managing their data projects, which is why I’ve engineered it to use common services (Google Sheets, Github), with any code and supporting elements as self-contained as possible–something Github excels at when it comes to managing data, content, and code. While Audrey is going to town creating spreadsheets and repos, I wanted to highlight a single area of her research into the grants that the Gates Foundation has been handing out. She has worked hard to normalize many years (1998-2017) of PDF and HTML data into a single Google Sheet, which she has then published as individual YAML files that live on Github–making her work forkable and reusable by anyone.

Once it is published, Audrey is the first person to fork the YAML and put it to work in her storytelling around ed-tech funding, but each of her project repos also comes with an API for her research by default. Well, ok, it isn’t a full-blown searchable API, but in addition to being able to get data in YAML format, she has a JSON API for each year of the Gates Foundation grants (ie. 2016). This increases the surface area when it comes to collaborating and building on top of her work, which can be forked using Github, or accessed via the machine readable YAML and JSON files.

While she is busy creating new Google Sheets and repos for other companies, I wanted to add one more tool to her toolbox, an APIs.json index for her project APIs. I added an apis.yaml index of all her APIs, which I also published to the root of her project as an apis.json version. Now, in addition to publishing YAML files for all the data driving her research, and enabling it all to have a JSON API, there is a single index available to quickly browse the machine readable feeds for all her ed-tech funding research. Did I mention that all of this runs on Google Sheets and Github, both of which are free to use if you keep your Github data project publicly available? That makes it a pretty dead simple way for ANYONE to manage open data projects, and tell data-driven stories on a budget.
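To show what consuming this looks like, here is a minimal sketch of walking an APIs.json style index and pulling one of the machine readable files it points to; the URLs and file names below are hypothetical stand-ins for her actual project layout.

```python
# A minimal sketch of reading an APIs.json index and one year of grant data as JSON.
# The URLs here are hypothetical placeholders, not the real project paths.
import requests

index = requests.get("https://example.github.io/gates-foundation-grants/apis.json").json()
for api in index.get("apis", []):
    print(api.get("name"), api.get("baseURL"))

# Pull one year of grant data published as JSON alongside the YAML.
grants_2016 = requests.get("https://example.github.io/gates-foundation-grants/data/2016.json").json()
print(len(grants_2016), "grant records for 2016")
```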

If you want to see the scope of what she is up to, head over to her Hack Education Data Github organization. You can follow the storytelling side of all of this in her work at Hack Education. What is scary about all of this is that she is only getting started. In August, we are moving to New York City where she is beginning her Spencer Fellowship in Education Reporting at Columbia, where she will be focused 100% on this research. I’m looking forward to seeing what she does with this type of data management using Google Sheets and Github, and working to support her where I can, but more importantly learning from how she takes the tools I’ve given her and evolves them to support her unique brand of data-driven storytelling in the education space.


First Handful Of Lessons Using My Google Sheet Github Approach

With my recent shift to using Google Sheets as the data backend for my research, and my continued usage of Github as my data project publishing platform, I started pushing out some new API related lessons. I wanted to begin formalizing my schema and process for this new approach to delivering lessons with some simple topics, so I got to work taking my 101 and history of APIs work, and converting it into multi-step lessons.

Some of my initial 101 API lessons are:

I will keep working on those 101 lessons. Editing, polishing, expanding, and as I found out with this revision–removing some elements of APIs that are fading away. While my 101 stories are always working to reach as wide an audience as possible, my wider research is always based in two sides of the API coin, with information about providing APIs, while also keeping my API consumer hat on, and thinking about the needs of developers and integrators.

Now that I have the 101 lessons under way I wanted to focus on my API life cycle research, and work on creating a set of high level lessons for each of the 80+ stops I track along a modern API life cycle. So I got to work on the lesson for API definitions, which I think is the most important stop along any API life cycle–one that actually crosses with every other line.

  • Definitions (Website) (Github Repo) (Google Sheet: https://docs.google.com/spreadsheets/d/13WXRAA30QMzKXRu-dH8gr-UrAQlLLDAD9pBAmwUPIS4/edit#gid=0)

After kicking off a lesson for my API life cycle that speaks to API providers, I wanted to shift gears and look at things from the API consumer side, and kick off a lesson for what I consider to be one of the more important APIs today–Twitter.

Like my life cycle research, I will continue creating lessons for each area of my API Stack research, where I am studying the approaches of specific API platforms, and the industries they are serving. Next I will be doing Facebook, Instagram, Reddit, and other APIs that are having a significant impact on our world. I’m looking to create lessons for all the top APIs that have big brand recognition, and leverage them to help onboard a new wave of API curious folks.

My API industry research all lives as separate data driven Github repositories, using Google Sheets as the central data store. I edit all the stories published across these sites using Prose.io, but the data behind all my research lives in a series of spreadsheets. This model has been extended to my API lessons, and I’ll be shifting my storytelling to leverage more of a structured approach in the future. To help onboard folks with the concept I’ve also created a lesson about how you create data-driven projects like this:

  • Google Sheets To Github Website (Website) (Github Repo) (Google Sheet) - Walking through how you can use Google Sheets, and a Github Pages site to manage data driven websites.
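For anyone curious what the plumbing behind a project like this might look like, here is a minimal sketch of pulling a public Google Sheet as CSV and writing it out as YAML for a Jekyll site on Github Pages; the sheet ID and output path are placeholders, and the actual lesson walks through the full workflow.

```python
# A minimal Google Sheets to Github sketch: pull a public sheet as CSV and
# write it out as YAML in a Jekyll _data folder. Sheet ID and path are placeholders.
import csv
import io
import requests
import yaml  # PyYAML

SHEET_ID = "YOUR_GOOGLE_SHEET_ID"  # placeholder
csv_url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

rows = [dict(row) for row in csv.DictReader(io.StringIO(requests.get(csv_url).text))]

# Jekyll exposes anything in _data to Liquid templates, which is how the site renders it.
with open("_data/sheet.yaml", "w") as handle:
    yaml.safe_dump(rows, handle, default_flow_style=False)
```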

All of these lessons are works in progress. It is why they run on Github, so that I can incrementally evolve them. An essential part of this is getting feedback from folks on what they’d like to learn. I’m happy to open up and collaborate around any of these lessons using Google Sheets or Github–you just let me know which one is more your jam. I am collaborating with my partner in crime Audrey Watters (@audreywatters) using this format, and I’m finding it to be a great way to not just manage my world, but also create and manage new worlds with other people.

While each of the lessons uses the same schema, structure, and process, I’m reserving the right to publish the lessons in different ways, experimenting with different variations in the layout. You’ll notice the Twitter and Google Sheets to Github Website lessons have a Github issue associated with each step, as I’m looking to stimulate conversations about what makes good (or bad) curriculum when it comes to learning about APIs and the platforms I’m building on. When it comes to my API lifecycle and stack work I am a little more opinionated, and not looking for as much feedback at such a granular level, but because each lesson does live on Github, folks are still welcome to edit, and share their thoughts.

I have hundreds of lessons that I want to develop. The backlog is overwhelming. Now that I have the schema, base process, and first few stories published, I can just add to my daily workload, publishing new stories and evolving existing ones as I have time. If there are any lessons you’d like to see, either at the 101, provider, or consumer level, let me know–feel free to hit me up through any channel. I’m going to be doing these lessons for my clients, either publishing them privately or publicly to Github repositories, and developing API life cycle curriculum in this way. I am also going to develop a paid version of the lessons, which will sit alongside my API industry guides as simple, yet rich walkthroughs of specific API industry concepts–for a small fee, to support what I do. Ok, lots of work ahead, but I’m super stoked to have these first few lessons out the door, even if there is a lot of polishing still to be done.


Challenges When Aggregating Data Published Across Many Years

My partner in crime is working on a large data aggregation project regarding ed-tech funding. She is publishing data to Google Sheets, and I’m helping her develop Jekyll templates she can fork and expand using Github when it comes to publishing and telling stories around this data across her network of sites. Like API Evangelist, Hack Education runs as a network of Github repositories, with a common template across them–we call the overlap between API Evangelist and Hack Education, Contrafabulists.

One of the smaller projects she is working on as part of her ed-tech funding research involves pulling the grants made by the Gates Foundation since the 1990s. It is similar to my story a couple of weeks ago about my friend David Kernohan, who was wanting to pull data from multiple sources and aggregate it into a single, workable project. Audrey is looking to pull data from a single source, but because the data spans almost 20 years, it ends up being a lot like aggregating data from across multiple sources.

Some of the challenges she is facing trying to gather the data, and aggregate it as a common dataset, are:

  • PDF - The enemy of any open data advocate is the PDF, and a portion of her research data is only available in PDF format, which translates into a good deal of manual work.
  • Search - Other portions of the data are available via the web, but obfuscated behind search forms, requiring many different searches to occur, with paginated results to navigate.
  • Scraping - The lack of APIs, CSV, XML, and other machine readable results raises the bar when it comes to aggregating and normalizing data across many years, making scraping a consideration, but because of PDFs, and obfuscated HTML pages behind a search, even scraping will have significant costs.
  • Format - Even once you’ve aggregated data from across the many sources, there is a challenge with it being in different formats. Some years are broken down by topic, while others are geographically based. All of this requires a significant amount of overhead to normalize and bring into focus.
  • Manual - Ultimately Audrey has a lot of work ahead of her, manually pulling PDFs and performing searches, then copying and pasting data locally. Then she’ll have to roll up her sleeves to normalize all the data she has aggregated into a single, coherent vision of where the foundation has put its money.
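As a small illustration of the normalization challenge, here is a sketch of mapping differently structured yearly exports onto one common schema with pandas; the file names and column mappings are invented for the example, not taken from the actual Gates Foundation data.

```python
# A sketch of normalizing yearly exports that label the same fields differently.
# File names and column mappings are invented for illustration.
import pandas as pd

column_maps = {
    "grants_2005.csv": {"Grantee": "recipient", "Amount ($)": "amount", "Topic": "category"},
    "grants_2015.csv": {"Organization": "recipient", "Awarded": "amount", "Region": "category"},
}

frames = []
for filename, mapping in column_maps.items():
    df = pd.read_csv(filename)
    df = df.rename(columns=mapping)[["recipient", "amount", "category"]]
    df["source_file"] = filename  # keep provenance so odd rows can be traced back
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("grants_normalized.csv", index=False)
```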

Data research takes time, and is tedious, mind numbing work. I encounter many projects like hers where I have to make a decision between scraping or manually aggregating and normalizing data–each project will have its own pros and cons. I wish I could help, but it sounds like it will end up being a significant amount of manual labor to establish a coherent set of data in Google Sheets. Once she is done, though, she has all the tools in place to publish it as YAML to Github, and get to work telling stories around the data across her work using Jekyll and Liquid. I’m also helping her make sure she has a JSON representation of each of her data projects, allowing others to build on top of her hard work.

I wish all companies, organizations, institutions, and agencies would think about how they publish their data publicly. It’s easy to think that data stewards have ill intentions when they publish data in a variety of formats like this, but more likely it is just a change of stewardship when it comes to managing and publishing the data. Different folks will have different visions of what sharing data on the web needs to look like, and different tools available to them, and without a clear strategy you’ll end up with a mosaic of published data over the years. That is why I’m telling her story. I am hoping to influence one or two data stewards, or would-be data stewards, when it comes to the importance of pausing for a moment and thinking through a strategy for standardizing how data gets stored and published online.


If you think there is a link I should have listed here feel free to tweet it at me, or submit as a Github issue. Even though I do this full time, I'm still a one person show, and I miss quite a bit, and depend on my network to help me know what is going on.