API Definitions News

These are the news items I've curated in my monitoring of the API space that have some relevance to the API definition conversation and I wanted to include in my research. I'm using all of these links to better understand how the space is defining not just their APIs, but their schema, and other moving parts of their API operations.

Synthetic Healthcare Records For Your API Using Synthea

I have been working on several fronts to help with API efforts at the Department of Veterans Affairs (VA) this year, and one of them is helping quantify the deployment of a lab API environment for the platform. The VA doesn’t want it called it a sandbox, so they are calling it a lab, but the idea is to provide an environment where developers can work with APIs, see data just like they would in a live environment, but not actually have access to live patient data before they can prove their applications are reviewed and meet requirements.

One of the projects being used to help deliver data within this environment is called Synthea. Providing the virtualized data will be made available through VA labs API–here is the description of what they do from their website:

Synthea is an open-source, synthetic patient generator that models the medical history of synthetic patients. Our mission is to provide high-quality, synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions, enabling research with Health IT data that is otherwise legally or practically unavailable.

Synthea data contains a complete medical history, including medications, allergies, medical encounters, and social determinants of health, providing data can be used without concern for legal or privacy restrictions by developers to support a variety of data standards, including HL7 FHIR, C-CDA and CSV. Perfect work loading up into sandbox and lab API environments, allowing to developers to safely play around with building healthcare applications, without actually touching production patient data.

I’ve been looking for solutions like this for other industries. Synthea even has a patient data generator available on Github, which is something I’d love to see for every industry. Sandbox and labs environment should be default for any API, especially APIs operating within heavily regulated industries. I think Synthea provides a pretty compelling model for the virtualization of API data, and I will be referencing it as part of my work in hopes of incentivizing someone to fork it and use to provide something we can use as part of any API implementation.

General Data Protection Regulation (GDPR) Forcing Us To Ask Questions About Our Data

I’ve been learning more about the EU General Data Protection Regulation (GDPR) recently, and have been having conversation about compliance with companies in the EU, as well as the US. In short, GDPR requires anyone working with personal data to be up front about the data they collect, making sure what they do with that data is observable to end-users, and takes a privacy and security by design approach when it comes to working with all personal data. While the regulations seems heavy handed and unrealistic to many, it really reflects a healthy view of what personal data is, and what a sustainable digital future will look like.

The biggest challenge with becoming GDPR compliant is the data mess most companies operate in. Most companies collect huge amounts of data, believing it is essential to the value they bring to the table, without no real understanding of everything that is being collected, and any logical reasons behind why it is gathered, stored, and kept around. A “gather it all”, big data mentality has dominated the last decade of doing business online. Database groups within organizations hold a lot of power and control because of the data they possess. There is a lot of money to be made when it comes to data access, aggregation, and brokering. It won’t be easy to unwind and change the data-driven culture that has emerged and flourished in the Internet age.

I regularly work with companies who do not have coherent maps of all the data they possess. If you asked them for details on what they track about any given customer, very few will be able to give you a consistent answer. Doing web APIs has forced many organizations to think more deeply about what data they posses, and how they can make it more discoverable, accessible, and usable across systems, web, mobile, and device applications. Even with this opportunity, most large organizations are still struggling with what data they have, where it is stored, and how to access it in a consistent, and meaningful way. Database culture within most organizations is just a mess, which contributes to why so many are freaking out about GDPR.

I’m guessing many companies are worried about complying with GDPR, as well as being able to even respond to any sort of regulatory policing event that may occur. This fear is going to force data stewards to begin thinking about the data the have on hand. I’ve already had conversations with some banks who are working on PSD2 compliant APIs, who are working in tandem on GDPR compliance efforts. Both are making them think deeply about what data they collect, where it is stored, and whether or not it has any value. Something I’m hoping will force some companies to stop collecting some of the data all together, because it just won’t be worth justifying its existence in the current cyber(in)secure, and increasingly accountable regulatory environment.

Doing APIs and becoming GDPR compliant go hand in hand. To do APIs you need to map out the data landscape across your organization, something that will contribute to GDPR. To respond to GDPR events, you will need APIs that provide access to end-users data, and leverage API authentication protocols like OAuth to ensure partnerships, and 3rd party access to end-users data are accountable. I’m optimistic that GDPR will continue to push forward healthy, transparent, and observable conversations around our personal data. One that focuses on, and includes the end-users who’s data we are collecting, storing, and often time selling. I’m hopeful that the stakes become higher, regarding the penalty for breaches, and shady brokering of personal data, and that GDPR becomes the normal mode of doing business online in the EU, and beyond.

Facebook, Cambridge Analytica, And Knowing What API Consumers Are Doing With Our Data

I’m processing the recent announcement by Facebook to shut off the access of Cambridge Analytica to it’s valuable social data. The story emphasizes the importance of real time awareness and response to API consumers at the API management level, as well as the difficulty in ensuring that API consumers are doing what they should be with the data and content being made available via APIs. Access to platforms using APIs is more art than science, but there are some proven ways to help mitigate serious abuses, and identify the bad actors early on, and prevent their operation within the community.

While I applaud Facebook’s response, I’m guessing they could have taken more action earlier on. Their response is more about damage control to their reputation, after the fact, than it is about preventing the problem from happening. Facebook most likely had plenty of warning signs regarding what Aleksandr Kogan, Strategic Communication Laboratories (SCL), including their political data analytics firm, Cambridge Analytica, were up to. If they weren’t than that is a problem in itself, and Facebook should be investing in more policing of their API consumers activity, as they claim they are doing in their release.

If Aleksandr Kogan has that many OAuth tokens for Facebook users, then Facebook should be up in his business, better understanding what he is doing, where his money comes from, and who is partners are. I’m guessing Facebook probably had more knowledge, but because it drove traffic, generated ad revenue, and was in alignment with their business model, it wasn’t a problem. They were willing to look the other way with the data sharing that was occurring, until it became a wider problem for the election, our democracy, and in the press. Facebook should have more awareness, oversight, and enforcement at the API management layer of their platform.

This situation I think highlights another problem of doing APIs, and ensuring API consumers are behaving appropriately with the data, content, and algorithms they are accessing. It can be tough to police what a developer does with data once they’ve pulled from an API. Where they store it, and who they share it with. You just can’t trust that all developers will have the platform, and it’s end-users best interest in mind. Once the data has left the nest, you really don’t have much control over what happens with it. There are ways you can identify unhealthy patterns of consumption via the API management layer, but Aleksandr Kogan’s quizzes probably would appear as a normal application pattern, with no clear signs of the relationships, and data sharing going on behind the scenes.

While I sympathize with Facebook’s struggle to police what people do with their data, I also know they haven’t invested in API management as much as they should have, and they are more than willing to overlook bad behavior when it supports their bottom line. The culture of the tech space supports and incentivizes this type of bad behavior from platforms, as well as consumers like Cambridge Analytica. This is something that regulations like GDPR out of the EU is looking to correct, but the culture in the United States is all about exploitation at this level, that is until it becomes front page news, then of course you act concerned, and begin acting accordingly. The app, big data, and API economy runs on the generating, consuming, buying, and selling of people’s data, and this type of practice isn’t going to go away anytime soon.

As Facebook states, they are taking measures to reign in bad actors in their developer community by being more strict in their application review process. I agree, a healthy application review process is an important aspect of API management. However, this does not address the regular review of applications usage at the API management level, assessing their consumption as they accumulate access tokens, to more user’s data, and go viral. I’d like to have more visibility into how Facebook will be regularly reviewing, assessing, and auditing applications. I’d even go so far as requiring more observability into ALL applications that are using the Facebook API, providing a community directory that will encourage transparency around what people are building. I know that sounds crazy from a platform perspective, but it isn’t, and would actually force Facebook to know their customers.

If platforms truly want to address this problem they will embrace more observability around what is happening in their API communities. They would allow certified and verified researchers and auditors to get at application level consumption data available at the API management layer. I’m sorry y’all, self-regulation isn’t going to cut it here. We need independent 3rd party access at the platform API level to better understand what is happening, otherwise we’ll only see platform action after problems occur, and when major news stories are published. This is the beauty / ugliness of APIs. The cats out of the bag, and platforms need them to innovate, deliver resources to web, mobile, and device applications, as well as remain competitive. APIs also provide the opportunity to peek behind the curtain, and better understand what is happening, and profile the good and the bad actors within each ecosystem–let’s take advantage of the good here, to help regulate the bad.

Facebook Cambridge Analytica And Knowing What Api Consumers Are Doing With Our Data

404: Not Found

Google Releases a Protocol Buffer Implementation of the Fast Healthcare Interoperability Resources (FHIR) Standard

Google is renewing its interest in the healthcare space by releasing a protocol buffer implementation of the fast healthcare interoperability resources (FHIR) standard. Protocol buffers are “Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler”. Its the core of the next generation of APIs at Google, often using HTTP/2 as a transport, while also living side by side with RESTful APIs, which use OpenAPI as the definition, in parallel to what protocol buffers deliver.

It’s a smart move by Google. Providing a gateway for healthcare data to find its way to their data platform products like Google Cloud BigQuery, and their machine learning solutions built on Tensorflow. They want to empower healthcare providers with powerful solutions that help onboard their data, and be able to connect the dots, and make sense of it at scale. However, I wouldn’t stop with protocol buffers. I would also make sure they also invest in API infrastructure on the RESTful side of the equation, developing OpenAPI specs alongside the protocol buffers, and providing translation between, and tooling for both realms.

While I am a big supporter of gRPC, and protocol buffers, I’m skeptical of the complexity it brings, in exchange for higher performance. Part of making sense of health care data will require not just technical folks being able to make sense of what is going on, but also business folks, and protocol buffers, and gRPC solutions will be out of reach of these users. Web APIs, combined with YAML OpenAPI has begun to make the API contracts involved in all of this much more accessible to business users, putting everything within their reach. In our technological push forward, let’s not forget the simplicity of web APIs, and exclude business healthcare users as IT has done historically.

I’m happy to see more FHIR-compliant APIs emerging on the landscape. PSD2 for banking, and FHIR for healthcare are the two best examples we have of industry specific API standards. So it is important that the definitions proliferate, and the services and tooling emerge and evolve. I’m hoping we see even more movement on this front in 2018, but I have to say I’m skeptical of Google’s role, as they’ve come and gone within this arena before, and are exceptional at making sure all roads lead to their solutions, without many roads leading back to enrich the data owners, and stewards. If we can keep API definitions, simple, accessible, and usable by everyone, not just developers and IT folks, we can help empower healthcare data practitioners, and not limit, or restrict them, when it is most important.

Your Obsessive Focus On The API Resource Is Hindering Meaningful Events From Happening

I’ve been profiling a number of market data APIs as part of my work with to identify valuable sources of data that could be streamed using their service. A significant portion of the APIs I come across are making it difficult for me to get at the data they have because of their views around the value of the data, intellectual property, and maintaining control over it in an API-driven world. These APIs don’t end up on the list of APIs I’m including in the profiling work, the gallery / directory, and don’t get included in any of the stories I’m telling, as a result of this tight control.

The side effect of this is I end up getting repeated sales emails and phone calls asking if I am still interested in their data. If there was just one or two of these, I’d jump on phones and explain, but because I’m dealing with 50+ of them, I just don’t have the bandwidth, and I have to move on. The thing is, I’m personally not interested in their data. I’m interested in other people being interested in their data, and being an enabler to helping them to get at it. However, since I can’t actually profile the APIs, create OpenAPI definitions for the request and response structure for inclusion in the API gallery / directory I’m building, I really don’t need their APIs in my work.

I know these platforms are protective of their data because it is valuable. They should be. However modern API management allows for them to open up the sampling of everything they have to offer, without giving away the farm. This allows enablers, analysts, and storytellers like me to test drive things, profile what they have to offer, and include within our applications. Then my users find what they want, head over to the source of the data, sign up for API keys, talk to their sales staff about what data sets they are interested in. I’m just a middle man, a broker, someone who is looking to enable engagements with their data. I’m only interested in the data, because I understand it is valuable to others, not because I am personally interested in doing anything with it.

This reality is common amongst data brokers who live in the pre-API era. They don’t understand API management, and they don’t understand how innovation using APIs work. They still rely on a pretty closed, tight-gripped approach to selling data. API enablers like me don’t have the time to mess around in these worlds. There are too many APIs out there to waste our time. I’m looking to profile the best quality market data source, that are frictionless to get up and running with. I’m not looking for free data, I understand it costs to get at. I just want to send leads their way, but I need to be able to profile what you have to offer in detailed, in a machine readable way. In the end, it doesn’t matter, because these providers won’t be around for long. Other, more API-savvy data providers will emerge, and run them out of business–it is the circle of API life.

Five APIs to Guide You on Your Way to the Data Dark Side

I was integrating with the Clearbit API, doing some enrichment of the API providers I track on, and I found their API stack pretty interesting. I’m just using the enrichment API, which allows me to pass it a URL, and it gives me back a bunch of intelligence on the organization behind. I’ve added a bookmarklet to my browser, which allows me to push it, and the enriched data goes directly into my CRM system. Delivering what it the title says it does–enrichment.

Next up, I’m going to be using the Clearbit Discovery API to find some potentially new companies who are doing APIs in specific industries. As I head over the to the docs for the API, I notice the other three APIs, and I feel like they reflect the five stages of transition to the data intelligence dark side.

  • Enrichment API - The Enrichment API lets you look up person and company data based on an email or domain. For example, you could retrieve a person’s name, location and social handles from an email. Or you could lookup a company’s location, headcount or logo based on their domain name.
  • Discovery API - The Discovery API lets you search for companies via specific criteria. For example, you could search for all companies with a specific funding, that use a certain technology, or that are similar to your existing customers.
  • Prospector API - The Prospector API lets you fetch contacts and emails associated with a company, employment role, seniority, and job title.
  • Risk API - The Risk API takes an email and IP and calculates an associated risk score. This is especially useful for figuring out whether incoming signups to your service are spam or legitimate, or whether a payment has a high chargeback risk.
  • Reveal API - Reveal API takes an IP address, and returns the company associated with that IP. This is especially useful for de-anonymizing traffic on your website, analytics, and customizing landing pages for specific company verticals.

Your journey to the dark side begins innocently enough. You just want to know more about a handful of companies, and the data provided is a real time saver! Then you begin discovering new things, finding some amazing new companies, products, services, and insights. You are addicted. You begin prospecting full time, and actively working to find your latest fix. Then you begin to get paranoid, worried you can’t trust anyone. I mean, if everyone is behaving like you, then you have to be on your guard. That visitor to your website might be your competitor, or worse! Who is it? I need to know everyone who comes to my site. Then in the darkest depths of your binges you are using the reveal API and surveilling all your users. You’ve crossed to the dark side. Your journey is complete.

Remember kids, this is all a very slippery slope. With great power comes great responsibility. One day you are a scrappy little startup, and the next your the fucking NSA. In all seriousness. I think their data intelligence stack is interesting. I do use the enrichment API, and will be using the discovery API. However, we do have to ask ourselves, do we want to be surveilling all our users and visitors. Do we want to be surveilled on every site we visit, and on every application we use? At some point we have to make sure and check how far towards the dark side we’ve gone, and ask ourselves, is this all really worth it?

P.S. This story reminds me I totally flaked on delivering a white paper to Clearbit on the topic of risk. Last year was difficult for me, and I got swamped….sorry guys. Maybe I’ll pick up the topic and send something your way. It is an interesting one, and I hope to have time at some point.

How Do We Keep Teams From Being Defensive With Resources When Doing APIs?

I was talking with the IRS about their internal API strategy before Christmas, reviewing the teams current proposal before they pitched in across teams. One of the topics that came up, which I thought was interesting, was about how to prevent some teams from taking up a defensive stances around their resources when you are trying to level the playing field across groups using APIs and microservices. They had expressed concern that some groups just didn’t see APIs as a benefit, and in some cases perceived them as a threat to their current position within the agency.

This is something I see at almost EVERY SINGLE organization I work with. Most technical groups who have established control over some valuable data, content, or other digital resource, have entrenched themselves, and become resistant to change. Often times these teams have a financial incentive to remain entrenched, and see API efforts as a threat to their budget and long term viability. This type of politics within large companies, organizations, institutions, and government agencies is the biggest threat to change than technology ever is.

So, what can you do about it. Well, the most obvious thing is you can get leadership on your team, and get them to mandate change. Often times this will involve personnel change, and can get pretty ugly in the end. Alternately, I recommend trying to build bridges, by understanding the team in question, and find ways you can do API things that might benefit them. Maybe more revenue and budget opportunities. Reuse of code through open source, or reusable code and applications that might benefit their operations. I recommend mapping out the groups structure and needs, and put together a robust plan regarding how you can make inroads, build relationships, and potentially change behavior, instead of taking an adversarial tone.

Another way forward is to ignore them. Focus on other teams. Find success. Demonstrate what APIs can do, and make the more entrenched team come to you. Of course, this depends on the type of resources they have. Depending on the situation, you may or may not be able to ignore them completely. Leading by example is the best way to take down entrenched groups. Get them to come out of their entrenched positions, and lower their walls a little bit, rather than trying to breach them. You are better off focusing doing APIs and investing in moving forward, rather than battling with groups who don’t see the benefits. I guarantee they can last longer than you probably think, and have developed some pretty crafty ways of staying in control over the years.

Anytime I encounter entrenched team stories within organizations I get sad for anyone who has to deal with these situations. I’ve had some pretty big battles over my career, which ended up in me leaving good jobs, so I don’t take them lightly. However, it makes me smile a little to hear one out of the IRS, especially internally. I know plenty of human beings who work at the IRS, but with their hard-ass reputation they have from the outside, you can’t help but smile just a bit thinking them facing the same challenges that the rest of us do. ;-) I think this is one of the most important lessons of microservices, and APIs, is that we can’t let teams ever get this big and entrenched again. Once we decoupled, let’s keep things in small enough teams, that this type of power can’t aggregate and be to big to evolve again.

The Data Behind The Washington Post Story On Police Shootings in 2017

I was getting ready to write my usual, “wish there was actual data behind this story about a database” story, while reading the Fatal Force story in the Washington Post, and then I saw the link! Fatal Force, 987 people have been shot and killed by police in 2017. Read about our methodology. Download the data. I am so very happy to see this. An actual prominent link to a machine readable version of the data, published on Github–this should be the default for ALL data journalism in this era.

I see story after story reference the data behind, without providing any links to the data. As a database professional this practice drives me insane. Every single story that provides data driven visualizations, statistics, analysis, tables, or any other derivative from data journalism, should provide a link to the Github repository which contains at least CSV representations of the data, if not JSON. This is the minimum for ALL data journalism going forward. If you do not meet this bar, your work should be in question. Other analysts, researchers, and journalists should be able to come in behind your work and audit, verify, validate, and even build upon and augment your work, for it to be considered relevant in this time period.

Github is free. Google Sheets is free. There is no excuse for you not to be publishing the data behind your work in a machine readable format. It makes me happy to see the Washington Post using Github like this, especially when they do not have an active API or developer program. I’m going to spend some time looking through the other repositories in their Github organization, and also begin tracking on which news agencies are actively using Github. Hopefully, in the near future, I can stop ranting about reputable news outlets not sharing their data behind stories in machine readable formats, because the rest of the industry will help police this, and only the real data-driven journalists will be left. #ShowYourWork

Breaking Down The Value Of Real Time Apis

404: Not Found

Streaming Data From The Google Sheet JSON API And

I am playing with as I learn how to use my new partner’s service. proxies any API, and uses Server-Sent Event (SSE) to push updates using JSON Patch. I am playing with making a variety of APIs real time using their service, and in my style, I wanted to share the story of what I’m working on, here on the blog. I was making updates to some data in a Google Sheet that I use to drive some data across a couple of my websites, and thought…can I make this spreadsheet streaming using Yes. Yes, I can.

To test out my theory I went and created a basic Google Sheet with two columns, one for product name, and one for price. Simulating a potential product pricing list that maybe I’d want to stream across multiple website, or possibly within client and partner portals. Then I published the Google Sheet to the web, making the data publicly available, so I didn’t have to deal with any sort of authentication–something you will only want to do with publicly available data. I’ll play around with an authenticated edition at some point in the future, showing more secure examples.

Once I made the sheet public I grabbed the unique key for the sheet, which you can find in the URL, and placed into this URL:[sheet key]/od6/public/basic?alt=json. The Google Sheet key takes a little bit to identify in the URL, but it is the long GUID in the URL, which is the longest part of the URL when editing the sheet. Once you put the key in the URL, you can take the URL and paste in the browser–giving you a JSON representation of your sheet, instead of HTML, basically giving you a public API for your Google Sheet. The JSON for Google Sheets is a little verbose and complicated, but once you study a bit it doesn’t take long for it to come into focus, showing eaching of the columns and rows.

Next, I created a account, verified my email, logged in and created a new app. Something that took me about 2 minutes. I take the new URL for my Google Sheet and publish as the target URL in my account. The UI then generates a curl statement for calling the API through the proxy. Before it will work, you will have to replace the second question mark with an ampersand (&), as assumes you do not have any parameters in the URL. Once replaced, you can open up your command line, paste in the command and run. Using Server-Sent Event (SSE) you’ll see the script running, checking for changes. When you make any changes to your Google Sheet, you will see a JSON Patch response returned with any changes in real time. Providing a real-time stream of your Google Sheet which can be displayed in any application.

Next, I’m going to make a simple JavaScript web page that will take the results and render to the page, showing how to navigate the Google Sheets API response structure, as well as the JSON Patch using the JavaScript SDK. All together this took me about 5 minutes to make happen, from creating the Google Sheet, to firing up a new account, and executing the curl command. Sure, you’d still have to make it display anywhere, but it was quicker than I expected to make a Google Sheet real-time. I’ll spend a little more time thinking about the possibilities for using Google Sheets in this way, and publishing some UI examples to Github, providing a forkable use case that anyone can follow when making it all work for them.

Disclosure: is an API Evangelist partner, and sponsors this site.

Being Able To See Your Database In XML, JSON, and CSV

This is a sponsored post by my friends over at SlashDB. The topic is chosen by me, but the work is funded by SlasDB, making sure I keep doing what I do here at API Evangelist. Thank you SlashDB for your support, and helping me educate my readers about what is going on in the API space.

I remember making the migration from XML to JSON. It was hard for me to understand that difference between the formats, and that you accomplish pretty much the same things in JSON that you could in XML. I’ve been seeing similarities in my migration to YAML from JSON. The parallels in each of these formats isn’t 100%, but this story is more about our perception of data formats, than it is about the technical details. CSV has long been a tool in my toolbox, but it was until this recent migration from JSON to YAML that I really started seeing the importance of CSV when it comes to helping onboard business users with the API possibilities.

In my experience API design plays a significant role in helping us understand our data. Half of this equation is understanding our schema, amd what the dimensions, field names, and data tpes of the data we are moving around using APIs. As I was working through some stories on how my friends over at SlashDB are turning databases into APIs, I saw that they were translating database, tables, and field names into API design, and that they also help you handle content negotiation between JSON, XML, CSV. Which I interpret as an excellent opportunity for learning more about the data we have in our databases, and getting to know the design aspects of the data schema.

In an earlier post about what SlashDB does I mentioned that many API designers cringe at translating database directly into a web API. While I agree that people should be investing into API design to get to know their data resources, the more time I spend with SlashDB’s approach to deploying APIs from a variety of databases, the more I see the potential for teaching API design skills along the way. I know many API developers who understand API design, but do not understand content negotiation between XML, JSON, and CSV. I see an opportunity for helping publish web APIs from a database, while having a conversation about what the API design should be, and also getting to know the underlying schema, then being able to actively negotiate between the different formats–all using an existing service.

While I want everyone to be as advanced as they possibly can with their API implementations, I also understand the reality on the ground at many organizations. I’m looking for any possible way to just get people doing APIs, and begin their journey, and I am not going to be to heavy handed when it comes to people being up to speed on modern API design concepts. The API journey is the perfect way to learn, and going from database to API, and kicking of the journey is more important than expecting everyone to be skilled from day one. This is why I’m partnering with companies like SlashDB, to help highlight tools that can help organizations take their existing legacy databases and translate them into web APIs, even if those APIs are just auto-translations of their database schema.

Being able to see your database as XML, JSON, and CSV is an important API literacy exercise for companies, organizations, institutions, and government agencies who are looking to make their data resources available to partners using the web. It is another important step in understanding what we have, and the naming and dimensions of what we are making available. I think the XML to JSON holds one particular set of lessons, but then CSV possesses a set of lessons all its own, helping keep the bar low for the average business user when it comes to making data available over the web. I’m feeling like there are a number of important lessons for companies looking to make their databases available via web APIs over at SlashDB, with automated XML, JSON, and CSV translation being just a noteable one.

Being Able To See Your Database In XML, JSON, and CSV

This is a sponsored post by my friends over at SlashDB. The topic is chosen by me, but the work is funded by SlasDB, making sure I keep doing what I do here at API Evangelist. Thank you SlashDB for your support, and helping me educate my readers about what is going on in the API space.

I remember making the migration from XML to JSON. It was hard for me to understand that difference between the formats, and that you accomplish pretty much the same things in JSON that you could in XML. I’ve been seeing similarities in my migration to YAML from JSON. The parallels in each of these formats isn’t 100%, but this story is more about our perception of data formats, than it is about the technical details. CSV has long been a tool in my toolbox, but it was until this recent migration from JSON to YAML that I really started seeing the importance of CSV when it comes to helping onboard business users with the API possibilities.

In my experience API design plays a significant role in helping us understand our data. Half of this equation is understanding our schema, and what the dimensions, field names, and data types of the data we are moving around using APIs. As I was working through some stories on how my friends over at SlashDB are turning databases into APIs, I saw that they were translating database, tables, and field names into API design, and that they also help you handle content negotiation between JSON, XML, CSV. Which I interpret as an excellent opportunity for learning more about the data we have in our databases, and getting to know the design aspects of the data schema.

In an earlier post about what SlashDB does I mentioned that many API designers cringe at translating database directly into a web API. While I agree that people should be investing into API design to get to know their data resources, the more time I spend with SlashDB’s approach to deploying APIs from a variety of databases, the more I see the potential for teaching API design skills along the way. I know many API developers who understand API design, but do not understand content negotiation between XML, JSON, and CSV. I see an opportunity for helping publish web APIs from a database, while having a conversation about what the API design should be, and also getting to know the underlying schema, then being able to actively negotiate between the different formats–all using an existing service.

While I want everyone to be as advanced as they possibly can with their API implementations, I also understand the reality on the ground at many organizations. I’m looking for any possible way to just get people doing APIs, and begin their journey, and I am not going to be to heavy handed when it comes to people being up to speed on modern API design concepts. The API journey is the perfect way to learn, and going from database to API, and kicking of the journey is more important than expecting everyone to be skilled from day one. This is why I’m partnering with companies like SlashDB, to help highlight tools that can help organizations take their existing legacy databases and translate them into web APIs, even if those APIs are just auto-translations of their database schema.

Being able to see your database as XML, JSON, and CSV is an important API literacy exercise for companies, organizations, institutions, and government agencies who are looking to make their data resources available to partners using the web. It is another important step in understanding what we have, and the naming and dimensions of what we are making available. I think the XML to JSON holds one particular set of lessons, but then CSV possesses a set of lessons all its own, helping keep the bar low for the average business user when it comes to making data available over the web. I’m feeling like there are a number of important lessons for companies looking to make their databases available via web APIs over at SlashDB, with automated XML, JSON, and CSV translation being just a notable one.

How Do You Ask Questions Of Data Using APIs?

I’m preparing to publish a bunch of transit related data as APIs, for us across a number of applications from visualizations to conversation interfaces like bots and voice-enablement. As I’m learning about the data, publishing it as unsophisticated CRUD APIs, I’m thinking deeply about how I would enable others to ask questions of this data using web APIs. I’m thinking about the hard work of deriving visual meaning from specific questions, all the way to how would you respond to an Alexa query regarding transit data in less than a second. Going well beyond what CRUD gives us when we publish our APIs and taking things to the next level.

Knowing the technology sector, the first response I’ll get is machine learning! You take all your data, and you train up some machine learning models, put some natural language process to work, and voila, you have your answer to how you provide answers. I think this is a sensible approach to many data sets, and for organizations who have the machine learning skills and resources at their disposal. There are also a growing number of SaaS solutions for helping put machine learning work to answer complex questions that might be asked of large databases. Machine learning is definitely part of the equation for me, but I’m not convinced it is the answer in all situations, and it might not always yield the correct answers we are always looking for.

After machine learning, and first on my list of solutions to this challenge is API design. How can I enable a domain expert to pull out the meaningful questions that will be asked of data, and expose as simple API paths, allowing consumers to easily get at the answers to questions. I’m a big fan of this approach because I feel like the chance we will get right answers to questions will be greater, and the APIs will help consumers understand what questions they might want to be asking, even when they are not domain experts. This approach might be more labor intensive than the magic of machine learning, but I feel like it will produce much higher quality results, and better serve the objectives I have for making data available for querying. Plus, this is a lower impact solution, allowing more people to implement, who might not have the machine learning skills or resources at their disposal. API design using low-cost web technology, makes for very accessible solutions.

Whether you go the machine learning or artisanal domain expert API design route, there has to be a feedback loop in place to help improve the questions being asked, as well as the answers being given. If there is no feedback loop, the process will never be improved. This is what APIs excel at when you do them properly. The savvy API platform providers have established feedback loops for API consumers, and their users to correct answers when they are wrong, learn how to ask new types of questions, and improve upon the entire question and answer life cycle. I don’t care whether you are going the machine learning route, or the API design route, you have to have a feedback loop in place to make this work as expected. Otherwise it is a closed loop system, and unlikely to give the answers people are looking for.

For now, I’m leaning heavily on the API design route to allow for my consumers to ask questions of the data I’m publishing as APIs. I’m convinced of my ability to ask some sensible questions of the data, and expose as simple URLs that anyone can query, and then evolve forward and improve upon as time passes. I just don’t have the time and resources to invest in the machine learning route at this point. As the leading machine learning platforms evolve, or as I generate more revenue to be able to invest in these solutions I may change my tune. However, for now I’ll just keep publishing data as simple web APIs, and crafting meaningful paths that allow people to ask questions of some of the data I’m coming across locked up in zip files, spreadsheets, and databases.

Generating Operational Revenue From Public Data Access Using API Management

This is part of some research I'm doing with We share a common interest around the accessibility of public data, so we thought it would be a good way for us to partner, and to underwrite some of my work, while also getting the occasional lead from you, my reader. Thanks for supporting my work, and thanks for support them readers!

A concept I have been championing over the years involves helping government agencies and other non-profit organizations generate revenue from public data. It is a quickly charged topic whenever brought up, as many open data and internet activists feel public data should remain freely accessible. Something I don’t entirely disagree with, but this is a conversation, that when approached right can actually help achieve the vision of open data, while also generating much needed revenue to ensure the data remains available, and even has the opportunity to improve in quality and impact over time.

Leveraging API Management I’d like to argue that APIs, and specifically API management has been well established in the private sector, and increasingly in the public sector, for making valuable data and content available online in a secure and measurable way. Companies like Amazon, Google, and even Twitter are using APIs to make data freely available, but through API management are limiting how much any single consumer can access, and even charging per API call to generate revenue from 3rd party developers and partners. This proven technique for making data and content accessible online using low-cost web technology, requiring all consumers to sign up for a unique set of keys, then rate limiting access, and establishing different levels of access tiers to identify and organize different types of consumers, can and should be applied in government agencies and non-profit organizations to make data accessible, while also asserting more control over how it is used.

Commercial Use of Public Data While this concept can apply to almost any type of data, for the purposes of this example, I am going to focus on 211 data, or the organizations, locations, and services offered by municipalities and non-profit organizations to hep increase access and awareness of health and human services. With 211 data it is obvious that you want this information to be freely available, and accessible by those who need it. However, there are plenty of commercial interests who are interested in this same data, and are using it to sell advertising against, or enrich other datasets, and products or services. There is not reason why cash strapped cities, and non-profit organizations carry the load to maintain, and serve up data for free, when the consumers are using it for commercial purposes. We do not freely give away physical public resources to commercial interests (well, ok, sometimes), without expecting something in return, why would we behave differently with our virtual public resources?

It Costs Money To Serve Public Data Providing access to public data online costs money. It takes money to run the database, servers, bandwidth, and websites and applicatiosn being used to serve up data. It takes money to clean the data, validate phone numbers, email addresses, and ensure the data is of a certain quality and brings value to end-users. Yes this data should be made freely available to those who need it. However, the non-profit organizations and government agencies who are stewards of the data shouldn’t be carrying the financial burden of this data remaining freely available to commercial entities who are looking to enrich their products and services, or simply generate advertising revenue from public data. As modern API providers have learned there are always a variety of API consumers, and I’m recommending that public data stewards begin leverage APIs, and API management to better understand who is accessing their data, and begin to put them into separate buckets, and understand who should be sharing the financial burden of providing public data.

Public Data Should Be Free To The Public If it is public data, it should be freely available to the public. One the web, and through the API. The average citizen should be able to come use human service websites to find services, as well as us the API to help them in their efforts to help others find services. As soon as any application of the public data moves into the commercial realm, and the storage, server, and bandwidth costs increase, they shouldn’t be able to offload the risk and costs to the platform, and be forced to help carry load when it comes to covering platform costs. API management is a great way to measure each application consumption, and then meter and quantify their role and impact, and either allow them to remain freely accessing information, or be forced to pay a fee for API access and consumption.

Ensuring Commercial Usage Helps Carry The Load Commercial API usage will have a distinctly different usage fingerprint than the average citizen, or smaller non-commercial application. API consumers can be asked to declare they application upon signing up for API access, as well as be identified throughout their consumption and traffic patterns. API management excels at metering and analyzing API traffic to understand where it is being applied, either on the web or in mobile, as well as in system to system, and other machine learning or big data analysis scenarios. Public data stewards should be in the business of requiring ALL API consumers sign up for a key which they include with each call, allowing the platform to identify and measure consumption in real-time, and on recurring basis.

API Plans & Access Tiers For Public Data Modern approaches to API management lean on the concept of plans or access tiers to segment out consumers of valuable resources. You see this present in software as a service (SaaS) offerings who often have starter, professional, and enterprise levels of access. Lower levels of the access plan might be free, or low cost, but as you ascend up the ladder, and engage with platforms at different levels, you pay different monthly, as well as usage costs. While also enjoying different levels of access, and loosened rate limits, depending on the plan you operate within. API plans allows platforms to target different types of consumers with different types of resources, and revenue levels. Something that should be adopted by public data stewards, helping establish common access levels that reflect their objectives, as well as is in alignment with a variety of API consumers.

Quantifying, Invoicing, And Understanding Consumption The private sector focuses on API management as a revenue generator. Each API call is identified and measured, grouping each API consumers usage by plan, and attaching a value to their access. It is common to charge API consumers for each API call they make, but there are a number of other ways to meter and charge for consumption. There is also the possibility of paying for usage on some APIs, where specific behavior is being encouraged. API calls, both reading and writing, can be operated like a credit system, accumulating credits, as well as the spending of credits, or translation of credits into currency, and back again. API management allows for the value generated, and extracted from public data resources is measured, quantified, and invoiced for even if money is never actually transacted. API management is often used to show the exchange of value between internal groups, partners, as well as with 3rd party public developers as we see commonly across the Internet today.

Sponsoring, Grants, And Continued Investment in Public Data Turning the open data conversation around using APIs, will open up direct revenue opportunities for agencies and organizations from charging for volume and commercial levels of access. It will also open up the discussion around other types of investment that can be made. Revenue generated from commercial use can go back into the platform itself, as well as funding different applications of the data–further benefitting the overall ecosystem. Platform partners can also be leveraged to join at specific sponsorship tiers where they aren’t necessarily metered for usage, but putting money on the table to fund access, research, and innovative uses of public data–going well beyond just “making money from public data”, as many open data advocates point out.

Alternative Types of API Consumers Discovering new applications, data sources, and partners is increasingly why companies, organizations, institutions, and government agencies are doing APIs in 2017. API portals are becoming external R&D labs for research, innovation, and development on top of digital resources being made available via APIs. Think of social science research that occurs on Twitter or Facebook, or entrepreneurs developing new machine learning tools for healthcare, or finance. Once data is available, identified as quality source of data, it will often be picked up by commercial interests building interesting things, but also university researchers, other government agencies, and potentially data journalists and scientists. This type of consumption can contribute directly to new revenue opportunities for organization around their valuable public data, but it can also provide more insight, tooling, and other contributions to a cities or organizations overall operations.

Helping Public Data Stewards Do What They Do Best I’m not proposing that all public data should be generating revenue using API management. I’m proposing that there is a lot of value in these public data assets being available, and a lot of this value is being extracted by commercial entities who might not be as invested in public data stewards long term viability. In an age where many businesses of all shapes and sizes are realizing the value of data, we should be helping our government agencies, and the not for profit organizations that serve the public good realize this as well. We should be helping them properly manage their digital data assets using APIs, and develop an awareness of who is consuming these resources, then develop partnerships, and new revenue opportunities along the way. I’m not proposing this happens behind closed doors, and I’m interested in things following an open API approach to providing observable, transparent access to public resources.

I want to see public data stewards be successful in what they do. The availability, quality, and access of public data across many business sectors is important to how the economy and our society works (or doesn’t). I’m suggesting that we leverage APIs, and API management to work better for everyone involved, not just generate more money. I’m looking to help government agencies, and non-profit organizations who work with public data understand the potential of APIs when it comes to access to public data. I’m also looking to help them understand modern API management practices so they can get better at identifying public data consumers, understanding how they are putting their valuable data to work, and develop ways in which they can partner, and invest together in the road map of public data resources. This isn’t a new concept, it is just one that the public sector needs to become more aware of, and begin to establish more models for how this can work across government and the public sector.

Generating Operational Revenue From Public Data Access Using Api Management

404: Not Found

The Tractor Beam Of The Database In An API World

<img src="" align=="right" width="40%" style="padding: 15px;" />

I’m an old database person. I’ve been working with databases since my first job in 1987. Cobol. FoxPro. SQL Server. MySQL. I have had a production database in my charge accessible via the web since 1998. I understand how databases are the center of gravity when it comes to data. Something that hasn’t changed in an API driven world. This is something that will make microservices in a containerized landscape much harder than some developers will want to admit. The tractor beam of the database will not give up control to data so easily, either because of technical limitations, business constraints, or political gravity.

Databases are all about the storage and access to data. APIs are about access to data. Storage, and the control that surrounds it is what creates the tractor beam. Most of the reasons for control over the storage of data are not looking to do harm. Security. Privacy. Value. Quality. Availability. There are many reasons stewards of data want to control who can access data, and what they can do with it. However, once control over data is established, I find it often morphs and evolves in many ways, that can eventually become harmful to meaningful and beneficial access to data. Which is usually the goal behind doing APIs, but is often seen as a threat to the mission of data stewards, and results in a tractor beam that API related projects will find themselves caught up in, and difficult to ever break free of.

The most obvious representation of this tractor beam is that all data retrieved via an API usually comes from a central database. Also, all data generated or posted via an API, also ends up within a database. The central database always has an appetite for more data, whether scaled horizontally or vertically. Next, it is always difficult to break off subsets of data into separate API-driven project, or prevent newly established ones from being pulled in, and made part of existing database operations. Whether due to technical, business, or political reasons, many projects born outside this tractor beam will eventually be pulled into the orbit of legacy data operations. Keeping projects decoupled will always be difficult when your central databases has so much pull when it comes to how data is stored and accessed. This isn’t just a technical decoupling, this is a cultural one, that will be much more difficult to break from.

Honestly, if your database is over 2-3 years old, and enjoys any amount of complexity, budget scope, and dependency across your organization, I doubt you’ll ever be able to decouple it. I see folks creating these new data lakes, which act as reservoirs for any and all types of data gathered and generated across operations. These lakes provide valuable opportunities for API innovators to potentially develop new and interesting ways of putting data to work, if they possess an API layer. However, I still think the massive data warehouse and database will look to consume and integrated anything structured and meaningful that evolves on the shores. Industrial grade data operations will just industrialize any smaller utilities that emerge along the fringes of large organizations. Power structures have long developed around central data stores, and no amount of decoupling, decentralizing, or blockchaining will change this any time soon. You can see this with the cloud, which was meant to disrupt this, when it just moved it from your data center to the someone else’s, and allowed it to grow at a faster rate.

I feel like us API folks have been granted ODBC and JDBC leases for our API plantations, but rarely will we ever decouple ourselves from the mother ship. No matter what the technology whispers in our ears about what is possible, the business value, and political control over established databases will always dictate what is possible and what is not possible. I feel like this is one reason all the big database platforms have waited so long to provide native API features, and why next generation data streaming solutions rarely have simple, intuitive API layers. I think we will continue to see the tractor beam of database culture continue to be aggressive, as well as passive aggressive to anything API, trumping access possibilities brought to the table by APIs, with outdated power and control beliefs rooted in how we store and control our data. These folks rarely understand they can be just as controlling and greedy with APIs, but they seem to be unable to get over the promises of access APIs afford, and refuse to play along at all, when it comes to turning down the volume on the tractor beam so anything can flourish.

Using Apis To Enrich The Data You Have In Spreadsheets

404: Not Found

Provide An Open Source Threat Information Database And API Then Sell Premium Data Subscriptions

I was doing some API security research and stumbled across vFeed, a “Correlated Vulnerability and Threat Intelligence Database Wrapper”, providing a JSON API of vulnerabilities from the vFeed database. The approach is a Python API, and not a web API, but I think provides an interesting blueprint for open source APIs. What I found interesting (somewhat) from the vFeed approach was the fact they provide an open source API, and database, but if you want a production version of the database with all the threat intelligence you have to pay for it.

I would say their technical and business approach needs a significant amount of work, but I think there is a workable version of it in there. First, I would create a Python, PHP, Node.js, Java, Go, Ruby version of the API, making sure it is a web API. Next, remove the production restriction on the database, allowing anyone to deploy a working edition, just minus all the threat data. There is a lot of value in there being an open source set of threat intelligence sharing databases and API. Then after that, get smarter about having a variety different free and paid data subscriptions, not just a single database–leverage the API presence.

You could also get smarter about how the database and API enables companies to share their threat data, plugging it into a larger network, making some of it free, and some of it paid–with revenue share all around. There should be a suite of open source threat information sharing databases and APIs, and a federated network of API implementations. Complete with a wealth of open data for folks to tap into and learn from, but also with some revenue generating opportunities throughout the long tail, helping companies fund aspects of their API security operations. Budget shortfalls are a big contributor to security incidents, and some revenue generating activity would be positive.

So, not a perfect model, but enough food for thought to warrant a half-assed blog post like this. Smells like an opportunity for someone out there. Threat information sharing is just one dimension of my API security research where I’m looking to evolve the narrative around how APIs can contribute to security in general. However, there is also an opportunity for enabling the sharing of API related security information, using APIs. Maybe also generating of revenue along the way, helping feed the development of tooling like this, maybe funding individual implementations and threat information nodes, or possibly even fund more storytelling around the concept of API security as well. ;-)

Explore, Download, API, And Share Data

I’m regularly looking through API providers, service providers, and open data platforms looking for interesting ways in which folks are exposing APIs. I have written about Kentik exposing the API call behind each dashboard visualization for their networking solution, as well as CloudFlare providing an API link for each DNS tool available via their platform. All demonstrating healthy way we can show how APIs are right behind everything we do, and today’s example of how to provide API access is out of New York Open Data, providing access to 311 service requests made available via the Socrata platform.

The page I’m showcasing provides access 311 service requests from 2010 to present, with all the columns and meta data for the dataset, complete with a handy navigation toolbar that lets you view data in Carto or, download the full dataset, access via API, or simply share via Twitter, Facebook, or email. It is a pretty simple example of offering up multiple paths for data consumers to get what they want from a dataset. Not everyone is going to want the API. Depending on who you are you might go straight for the download, or opt to access via one of the visualization and charting tools. Depending on who you are targeting with your data, the list of tools might vary, but the NYC OpenData example via Socrata provides a nice example to build upon. With the most important message being do not provide only the options you would choose–get to know your consumers, and deliver solutions they will also need.

It provides a different approach to making APIs behind available to users than the Kentik or CloudFlare approaches do, but it adds to the number of examples I have to show people how APIs and API enabled integration can be exposed through the UI, helping educate the massess about what is possible. I could see standardized buttons, drop downs, and other embeddable tooling emerge for helping deliver solutions like this for providers. Something like we are seeing with the serverless webhooks out Auth0 Extensions. Some sort of API-enabled goodness that triggers something, and can be easily embedded directly into any existing web or mobile application, or possibly a browser toolbar–opening up API enabled solutions to the average user.

One of the reasons I keep showcasing examples like this is that I want to keep pushing back on the notion that APIs are just for developers. Simple, useful, and relevant APIs are not beyond what the average web application user can grasp. They should be present behind every action, visualization, and dataset made available online. When you provide useful integration and interoperability examples that make sense to the average user, and give them easy to engage buttons, drop downs, and workflows for implementing, more folks will experience the API potential in their world. The reasons us developers and IT folk keep things complex, and outside the realm of the normal folk is more about us, our power plays, as well as our inability to simplify things so that they are accessible beyond those in the club.

Big Data Is Not About Access Using Web APIs

I’m neck deep in research around data and APIs right now, and after looking at 37 of the Apache data projects it is pretty clear that web APIs are not a priority in this world. There are some of the projects that have web APIs, and there a couple projects that look to bridge several of the projects with an aggregate or gateway API, but you can tell that the engineers behind the majority of these open source projects are not concerned with access at this level. Many engineers will counter this point by saying that web APIs can’t handle the volume, and it shows that the concept isn’t applicable in all scenarios. I’m not saying web APIs should be used for the core functionality at scale, I’m saying that web APIs should be present to provide access to the result state of the core features for each of these platform, whatever that is, which something that web APIs excel at.

From my vantage point the lack of web APIs isn’t a technical one, it is a business and political motivation. When it comes to big data the objectives are always about access, and it definitely isn’t about the wide audience access that comes when you use HTTP, and the web for API access. The objective is to aggregate, move around, and work with as much data as you possibly can amongst a core group of knowledgable developers. Then you distribute awareness, access, and usage to designated parties via distilled analysis, visualizations, or in some cases to other systems where the result can be accessed and put to use. Wide access to this data is not the primary objective, paying forward much of the power and control we currently see around database to API efforts. Big data isn’t about democratization. Big Data is about aggregating as much as you can and selling the distilled down wisdom from analysis, or derived as part of machine learning efforts.

I am not saying there is some grand conspiracy here. It just isn’t the objective of big data folks. They have their marching orders, and the technology they develop reflect these marching orders. It reflects the influence money and investment has on the technology. The ideology that drives how the tech is engineered, and the algorithms handle specific inputs, and provide intended outputs. Big data is often sold as data liberation, democratization, and access to your data, building on much of what APIs have done in recent years. However, in the last couple of years the investment model has shifted, the clients who are purchasing and implementing big data have evolved, and they aren’t your API access type of people. They don’t see wide access to data as a priority. You are either in the club, and know how to use the Apache X technology, or you are sanctioned one of the dashboard analysis visualization machine learning wisdom drips from the big data. Reaching a wide audience is not necessary.

For me, this isn’t some amazing revelation. It is just watching power do what power does in the technology space. Us engineers like to think we have control over where technology goes, yet we are just cogs in the larger business wheel. We program the technology to do exactly what we are paid to do. We don’t craft liberating technology, or the best performing technology. We assume engineer roles, with paychecks, and bosses who tell us what we should be building. This is how web APIs will fail. This is how web APIs will be rendered yesterdays technology. Not because they fail technically, it is because the ideology of the hedge funds, enterprise groups, and surveillance capitalism organizations that are selling to law enforcement and the government will stop funding data systems that require wide access. The engineers will go along with it because it will be real time, evented, complex, and satisfying to engineer in our isolated development environments (IDE). I’ve been doing data since the 1980s, and in my experience this is how data works. Data is widely seen as power, and all the technical elements, and many of the human elements involved often magically align themselves in service of this power, whether they realize they are doing it or not.

APIs Used To Give Us Access To Resources That Were Out Of Our Reach

I remember when almost all the APIs out there gave us developers access to things we couldn’t ever possibly get on our own. Some of it was about the network effect with the early Amazon and eBay marketplaces, or Flickr and Delicious, and then Twitter and Facebook. Then what really brought it home was going beyond the network effect, and delivering resources that were completely out of our reach like maps of the world around us, (seemingly) infinitely scalable compute and storage, SMS, and credit card payments. In the early days it really seemed like APIs were all about giving us access to something that was out of our reach as startups, or individuals.

While this still does exist, it seems like many APIs have flipped the table and it is all about giving them access to our personal and business data in ways that used to be out of their reach. Machine learning APIs are using parlour tricks to get access to our internal systems and databases. Voice enablement, entertainment, and cameras are gaining access to our homes, what we watch and listen to, and are able to look into the dark corners of our personal lives. Tinder, Facebook, and other platforms know our deep dark secrets, our personal thoughts, and have access to our email and intimate conversations. The API promise seems to have changed along the way, and stopped being about giving us access, and is now about giving them access.

I know it has always been about money, but the early vision of APIs seemed more honest. It seemed more about selling a product or service that people needed, and was more straight up. Now it just seems like APIs are invasive. Being used to infiltrate our professional and business worlds through our mobile phones. It feels like people just want access to us, purely so they can mine us and make more money. You just don’t see many Flickrs, Google Maps, or Amazon EC2s anymore. The new features in mobile devices we carry around, and the ones we install in our home don’t really benefit us in new and amazing ways. They seem to offer just enough to get us to adopt them, and install in our life, so they can get access to yet another data point. Maybe it is just because everything has been done, or maybe it is because it has all been taken over by the money people, looking for the next big thing (for them).

Oh no! Kin is ranting again. No, I’m not. I’m actually feeling pretty grounded in my writing lately, I’m just finding it takes a lot more work to find interesting APIs. I have to sift through many more emails from folks telling me about their exploitative API, before I come across something interesting. I go through 30 vulnerabilities posts in my feeds, before I come across one creative story about something platform is doing. There are 55 posts about ICOs, before I find an interesting investment in a startup doing something that matters. I’m willing to admit that I’m a grumpy API Evangelist most of the time, but I feel really happy, content, and enjoying my research overall. I just feel like the space has lost its way with this big data thing, and are using APIs to become more about infiltrating and extraction, that it is about delivering something that actually gives developers access to something meaningful. I just think we can do better. Something has to give, or this won’t continue to be sustainable much longer.

Looking At The 37 Apache Data Projects

I’m spending time investing in my data, as well as my database API research. I’ll have guides, with accompanying stories coming out over the next couple weeks, but I want to take a moment to publish some of the raw research that I think paints an interesting picture about where things are headed.

When studying what is going on with data and APIs you can’t do any search without stumbling across an Apache project doing something or other with data. I found 37 separate projects at Apache that were data related, and wanted to publish as a single list I could learn from.

  • Airvata** - Apache Airavata is a micro-service architecture based software framework for executing and managing computational jobs and workflows on distributed computing resources including local clusters, supercomputers, national grids, academic and commercial clouds. Airavata is dominantly used to build Web-based science gateways and assist to compose, manage, execute, and monitor large scale applications (wrapped as Web services) and workflows composed of these services.
  • Ambari - Apache Ambari makes Hadoop cluster provisioning, managing, and monitoring dead simple.
  • Apex - Apache Apex is a unified platform for big data stream and batch processing. Use cases include ingestion, ETL, real-time analytics, alerts and real-time actions. Apex is a Hadoop-native YARN implementation and uses HDFS by default. It simplifies development and productization of Hadoop applications by reducing time to market. Key features include Enterprise Grade Operability with Fault Tolerance, State Management, Event Processing Guarantees, No Data Loss, In-memory Performance & Scalability and Native Window Support.
  • Avro - Apache Avro is a data serialization system.
  • Beam - Apache Beam is a unified programming model for both batch and streaming data processing, enabling efficient execution across diverse distributed execution engines and providing extensibility points for connecting to different technologies and user communities.
  • Bigtop - Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem. The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects. In short we strive to be for Hadoop what Debian is to Linux.
  • BookKeeper - BookKeeper is a reliable replicated log service. It can be used to turn any standalone service into a highly available replicated service. BookKeeper is highly available (no single point of failure), and scales horizontally as more storage nodes are added.
  • Calcite - Calcite is a framework for writing data management systems. It converts queries, represented in relational algebra, into an efficient executable form using pluggable query transformation rules. There is an optional SQL parser and JDBC driver. Calcite does not store data or have a preferred execution engine. Data formats, execution algorithms, planning rules, operator types, metadata, and cost model are added at runtime as plugins.
  • CouchDB - Apache CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. Apache CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of Apache CouchDB. And you can distribute your data, or your apps, efficiently using Apache CouchDB’s incremental replication. Apache CouchDB supports master-master setups with automatic conflict detection.
  • Crunch - The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
  • DataFu - Apache DataFu consists of two libraries: Apache DataFu Pig is a collection of useful user-defined functions for data analysis in Apache Pig. Apache DataFu Hourglass is a library for incrementally processing data using Apache Hadoop MapReduce. This library was inspired by the prevalence of sliding window computations over daily tracking data. Computations such as these typically happen at regular intervals (e.g. daily, weekly), and therefore the sliding nature of the computations means that much of the work is unnecessarily repeated. DataFu’s Hourglass was created to make these computations more efficient, yielding sometimes 50-95% reductions in computational resources.
  • Drill - Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google’s Dremel.
  • Edgent - Apache Edgent is a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the continuous streams of data coming from equipment, vehicles, systems, appliances, devices and sensors of all kinds (for example, Raspberry Pis or smart phones). Working in conjunction with centralized analytic systems, Apache Edgent provides efficient and timely analytics across the whole IoT ecosystem: from the center to the edge.
  • Falcon - Apache Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.
  • Flink - Flink is an open source system for expressive, declarative, fast, and efficient data analysis. It combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases.
  • Flume - Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store
  • Giraph - Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections.
  • Hama - The Apache Hama is an efficient and scalable general-purpose BSP computing engine which can be used to speed up a large variety of compute-intensive analytics applications.
  • Helix - Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix automates reassignment of resources in the face of node failure and recovery, cluster expansion, and reconfiguration.
  • Ignite - Apache Ignite In-Memory Data Fabric is designed to deliver uncompromised performance for a wide set of in-memory computing use cases from high performance computing, to the industry most advanced data grid, in-memory SQL, in-memory file system, streaming, and more.
  • Kafka - A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
  • Knox - The Apache Knox Gateway is a REST API Gateway for interacting with Hadoop clusters. The Knox Gateway provides a single access point for all REST interactions with Hadoop clusters. In this capacity, the Knox Gateway is able to provide valuable functionality to aid in the control, integration, monitoring and automation of critical administrative and analytical needs of the enterprise.
  • Lens - Lens provides an Unified Analytics interface. Lens aims to cut the Data Analytics silos by providing a single view of data across multiple tiered data stores and optimal execution environment for the analytical query. It seamlessly integrates Hadoop with traditional data warehouses to appear like one.
  • MetaModel - With MetaModel you get a uniform connector and query API to many very different datastore types, including: Relational (JDBC) databases, CSV files, Excel spreadsheets, XML files, JSON files, Fixed width files, MongoDB, Apache CouchDB, Apache HBase, Apache Cassandra, ElasticSearch, databases,, SugarCRM and even collections of plain old Java objects (POJOs). MetaModel isn’t a data mapping framework. Instead we emphasize abstraction of metadata and ability to add data sources at runtime, making MetaModel great for generic data processing applications, less so for applications modeled around a particular domain.
  • Oozie - Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
  • ORC - ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query.
  • Parquet - Apache Parquet is a general-purpose columnar storage format, built for Hadoop, usable with any choice of data processing framework, data model, or programming language.
  • Phoenix - Apache Phoenix enables OLTP and operational analytics for Apache Hadoop by providing a relational database layer leveraging Apache HBase as its backing store. It includes integration with Apache Spark, Pig, Flume, Map Reduce, and other products in the Hadoop ecosystem. It is accessed as a JDBC driver and enables querying, updating, and managing HBase tables through standard SQL.
  • REEF - Apache REEF (Retainable Evaluator Execution Framework) is a development framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource reuse for data caching, and state management abstractions that greatly ease the development of elastic data processing workflows on cloud platforms that support a Resource Manager service.
  • Samza - Apache Samza provides a system for processing stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream processing task, and executes it as a Samza job. Samza then routes messages between stream processing tasks and the publish-subscribe systems that the messages are addressed to.
  • Spark - Apache Spark is a fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics.
  • Sqoop - Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Storm - Apache Storm is a distributed real-time computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing real-time computation.
  • Tajo - The main goal of Apache Tajo project is to build an advanced open source data warehouse system in Hadoop for processing web-scale data sets. Basically, Tajo provides SQL standard as a query language. Tajo is designed for both interactive and batch queries on data sets stored on HDFS and other data sources. Without hurting query response times, Tajo provides fault-tolerance and dynamic load balancing which are necessary for long-running queries. Tajo employs a cost-based and progressive query optimization techniques for optimizing running queries in order to avoid the worst query plans.
  • Tez - Apache Tez is an effort to develop a generic application framework which can be used to process arbitrarily complex directed-acyclic graphs (DAGs) of data-processing tasks and also a reusable set of data-processing primitives which can be used by other projects.
  • VXQuery - Apache VXQuery will be a standards compliant XML Query processor implemented in Java. The focus is on the evaluation of queries on large amounts of XML data. Specifically the goal is to evaluate queries on large collections of relatively small XML documents. To achieve this queries will be evaluated on a cluster of shared nothing machines.
  • Zeppelin - Zeppelin is a modern web-based tool for the data scientists to collaborate over large-scale data exploration and visualization projects.

There is a serious amount of overlap between these projects. Not all of these projects have web APIs, while some of them are all about delivering a gateway or aggregate API across projects. There is a lot to process here, but I think listing them out provides an easier way to understand the big data explosion of projects over at Apache.

It is tough to understand what each of these do without actually playing with them, but that is something I just don’t have the time to do, so next up I’ll be doing independent searches for these project names, and finding stories from across the space regarding what folks are doing with these data solutions. That should give me enough to go on when putting them into specific buckets, and finding their place in my data, and database API research.

Data Streaming In The API Landscape

I was taking a fresh look at my real time API research as part of some data streaming, and event sourcing conversations I was having last week. My research areas are never perfect, but I’d say that real time is still the best umbrella to think about some of the shifts we are seeing on the landscape recently. They are nothing new, but there has been renewed energy, new and interesting conversation going on, as well as some growing trends that I cannot ignore. To support my research, I took a day this week to dive in, have a conversation with my buddy Alex over at the, and the new CEO of WSO2 Tyler Jewell around what is happening.

The way I approach my research is to always step back and look at what is happening already in the space, and I wanted to take another look at some of the real time API service providers I was already keeping eye on in the space:

  • Pubnub - APIs for developers building secure realtime Mobile, Web, and IoT Apps.
  • StreamData - Transform any API into a real-time data stream without a single line of server code.
  • - Fanout’s reverse proxy helps you push data to connected devices instantly.
  • Firebase - Store and sync data with our NoSQL cloud database. Data is synced across all clients in real time, and remains available when your app goes offline.
  • Pusher - Leaders in real time technologies. We empower all developers to create live features for web and mobile apps with our simple hosted API.

I’ve been tracking on what these providers have been doing for a while. They’ve all been pushing to boundaries of what is streaming, and real time APIs for some time. Another open source solution that I think is worth noting, which I believe some of the above services have leverages is

  • Netty - Netty is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients.

I also wanted to make sure and include Google’s approach to a technology that has been around a while:

  • Google Cloud Pub/Sub - Google Cloud Pub/Sub is a fully-managed real-time messaging service that allows you to send and receive messages between independent applications.

Next, I wanted to refresh my understanding of all the Apache projects that speak to this realm. I’m always trying to keep a handle on what they each actually offer, and how they overlap. So, seeing them side by side like this helps me think about how they fit into the big picture.

  • Apache Kafka - Kafka™ is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
  • Apache Flink - Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
  • Apache Spark - Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Apache Storm Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
  • Apache Apollo - ActiveMQ Apollo is a faster, more reliable, easier to maintain messaging broker built from the foundations of the original ActiveMQ.

One thing I think is worth noting with all of these is the absence of the web when you read through their APIs. Apollo had some significant RESTful approaches, and you find gateways and plugins for some of the others, but when you consider how these technologies fit into the wider API picture, I’d say they aren’t about embracing the web.

On that note, I think it is worth mentioning what is going on over at Google, with their gRPC effort, which provides “bi-directional streaming and fully integrated pluggable authentication with http/2 based transport”:

  • gRPC - A high performance, open-source universal RPC framework

Also, I think most notably, they are continuing the tradition of APIs embracing the web, and built on top of HTTP/2. For me, this is always important, and trumps just being open source in my book. The more web an open source technology, and a company’s service utilize, the more comfortable I’m going to feel telling my readers they should be baking this into their operations.

After these services and tooling, I don’t want to forget about the good ol fashioned protocols available out there, that help use doing things in real time. I’m tracking on 12 real time protocols that I see in use across the companies, organizations, institutions, and government agencies I’m tracking on:

  • Simple (or Streaming) Text Orientated Messaging Protocol (STOMP) - STOMP is the Simple (or Streaming) Text Orientated Messaging Protocol. STOMP provides an interoperable wire format so that STOMP clients can communicate with any STOMP message broker to provide easy and widespread messaging interoperability among many languages, platforms and brokers.
  • Advanced Message Queuing Protocol (AMQP) - The Advanced Message Queuing Protocol (AMQP) is an open standard for passing business messages between applications or organizations. It connects systems, feeds business processes with the information they need and reliably transmits onward the instructions that achieve their goals.
  • MQTT - MQTT is a machine-to-machine (M2M)/Internet of Things connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium.
  • OpenWire - OpenWire is our cross language Wire Protocol to allow native access to ActiveMQ from a number of different languages and platforms. The Java OpenWire transport is the default transport in ActiveMQ 4.x or later.
  • Websockets - WebSocket is a protocol providing full-duplex communication channels over a single TCP connection. The WebSocket protocol was standardized by the IETF as RFC 6455 in 2011, and the WebSocket API in Web IDL is being standardized by the W3C.
  • Extensible Messaging and Presence Protocol (XMPP) - XMPP is the Extensible Messaging and Presence Protocol, a set of open technologies for instant messaging, presence, multi-party chat, voice and video calls, collaboration, lightweight middleware, content syndication, and generalized routing of XML data.
  • SockJS - SockJS is a browser JavaScript library that provides a WebSocket-like object. SockJS gives you a coherent, cross-browser, Javascript API which creates a low latency, full duplex, cross-domain communication channel between the browser and the web server.
  • PubSubHubbub - PubSubHubbub is an open protocol for distributed publish/subscribe communication on the Internet. Initially designed to extend the Atom (and RSS) protocols for data feeds, the protocol can be applied to any data type (e.g. HTML, text, pictures, audio, video) as long as it is accessible via HTTP. Its main purpose is to provide real-time notifications of changes, which improves upon the typical situation where a client periodically polls the feed server at some arbitrary interval. In this way, PubSubHubbub provides pushed HTTP notifications without requiring clients to spend resources on polling for changes.
  • Real Time Streaming Protocol (RTSP) - The Real Time Streaming Protocol (RTSP) is a network control protocol designed for use in entertainment and communications systems to control streaming media servers. The protocol is used for establishing and controlling media sessions between end points. Clients of media servers issue VCR-style commands, such as play and pause, to facilitate real-time control of playback of media files from the server.
  • Server-Sent Events - Server-sent events (SSE) is a technology where a browser receives automatic updates from a server via HTTP connection. The Server-Sent Events EventSource API is standardized as part of HTML5 by the W3C.
  • HTTP Live Streaming (HLS) - HTTP Live Streaming (also known as HLS) is an HTTP-based media streaming communications protocol implemented by Apple Inc. as part of its QuickTime, Safari, OS X, and iOS software.
  • HTTP Long Polling - HTTP long polling, where the client polls the server requesting new information. The server holds the request open until new data is available. Once available, the server responds and sends the new information. When the client receives the new information, it immediately sends another request, and the operation is repeated. This effectively emulates a server push feature.

These protocols are used by the majority of the service providers and tooling I list above, but in my research I’m always trying to focus on not just the services and tooling, but the actual open standards that they support.

I have to also mention the entry level aspect of real time in my opinion. Something, that many API providers support, but also is the 101 level approach that some companies, organizations, institutions, and agencies need to be exposed to before they get overwhelmed with other approaches.

  • Webhooks - A webhook in web development is a method of augmenting or altering the behavior of a web page, or web application, with custom callbacks. These callbacks may be maintained, modified, and managed by third-party users and developers who may not necessarily be affiliated with the originating website or application.

That is the real time API landscape. Sure, there are other services, and tooling, but this is the cream on top. I’m also struggling with the overlap with event sourcing, evented architecture, messaging, and other layers of the API space that are being used to move bits and bytes around today. Technologists aren’t always the best at using precise words, or keeping things simple, and easy to understand, let alone articulate. This is one of the concerns I have with streaming API approaches, is that they are often over the heads, and beyond the needs of some API providers, and may API consumers. They have their place within certain use cases, and large organizations that have the resources, but I spend a lot of time worrying about the little guy.

I think a good example of web API vs streaming API can be found in the Twitter API community. Many folks just need simple, intuitive, RESTful endpoints to get access to data, and content. While a much smaller slice of the pie will have the technology, skills, and compute capacity to do things at scale. Regardless, I see technologies like Apache Kafka being turned into plug and play, infrastructure as a service approaches, allowing anyone to quickly deploy to Heroku, and just put to work via a SaaS model. So, of course, I will still be paying attention, and trying to make sense out of all of this. I don’t know where any one it will be going, but I will keep tuning in, and telling stories about how real time, and streaming API technology is being used, or not being used.

Admit It You Do Not Respect Your API Consumers And End Users

Just admit it, you could care less about your API consumers. You are just playing this whole API game because you read somewhere that this is what everyone should be doing now. You figured you can get some good press out of doing an API, get some free work from developers, and look like you are one of the cool kids for a while. You do the song and dance well, you have developed and deployed an API. It will look like the other APIs out there, but when it comes to supporting developers, or actually investing in the community, you really aren’t that interested in rolling up your sleeves and making a difference. You just don’t really care that much, as long as it looks like you are playing the API game.

Honestly, you’d do any trend that comes along, but this one has so many perks you couldn’t ignore it. Not only do you get to be API cool, you did all the right things, launched on Product Hunt, and you have a presence at all the right tech events. Developers are lining up to build applications, and are willing to work for free. Most of the apps that get built are worthless, but the SDKs you provide act as a vacuum for data. You’ve managed to double your budget by selling the data you acquire to your partners, and other data brokers. You could give away your API for free, and still make a killing, but hell, you have to keep charging just so you look legit, and don’t raise any alarm bells.

It is hard to respect developers who line up and work for free like this. And the users, they are so damn clueless regarding what is going on, they’ll hand over their address book and location in real-time without ever thinking twice. This is just to easy. APIs are such a great racket. You really don’t have to do anything but blog everyone once in a while, show up at events and drink beer, and make sure the API doesn’t break. What a sweet gig huh? No, not really, you are just a pretty sad excuse of a person, and it will catch up with you somewhere. You really represent everything wrong with technology right now, and are contributing to the world being a worse place than it already is–nice job!

Note: If my writing is a little dark this week, here is a little explainer–don’t worry, things will back to normal at API Evangelist soon.

If you think there is a link I should have listed here feel free to tweet it at me, or submit as a Github issue. Even though I do this full time, I'm still a one person show, and I miss quite a bit, and depend on my network to help me know what is going on.